
Running the active learning core
================================

The active learning core works by trying to predict the relations using information provided by the user.
This means you'll have to label some of the examples, and based on those, the core will infer the rest.
The core will also ask you to label the most informative examples (those which best help
to figure out the other cases).
To start using it you'll need to define a relation, run the core, label some evidence and re-run the core loop.
You can label more evidence and re-run the core as many times as you like to improve performance.

Creating a relation
-------------------

To create a relation, first `open up the web server <tutorial.html#open-the-web-interface>`__ if you haven't already, and use a
web browser to navigate to `http://127.0.0.1:8000 <http://127.0.0.1:8000>`_.
There you'll find instructions on how to create a relation.

Running the core
----------------

After creating a relation, you can start the core to look for instances of that relation.
You can run this core in two modes: **high precision** or **high recall**.

`Precision and recall <http://en.wikipedia.org/wiki/Precision_and_recall>`_ can be traded with one another up to a certain point, i.e., it is possible to trade some
recall to get better precision and vice versa.
To better visualize this trade-off, let's look at an example:
a precision of 99% means that 1 out of every 100 predicted relations will be wrong and the rest will be correct;
a recall of 30% means that only 30 out of 100 existing relations will be detected by the algorithm, and the rest
will be wrongly discarded as "no relation present".

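To make the trade-off concrete, here is a minimal sketch (plain Python, independent of IEPY) that computes both metrics from raw prediction counts; the function name is just for illustration:

.. code-block:: python

    def precision_recall(true_pos, false_pos, false_neg):
        """Compute precision and recall from raw prediction counts."""
        # Precision: fraction of predicted relations that are correct.
        precision = true_pos / (true_pos + false_pos)
        # Recall: fraction of actually-existing relations that were found.
        recall = true_pos / (true_pos + false_neg)
        return precision, recall

    # 99 correct predictions, 1 wrong prediction, 231 relations missed:
    # precision is 99% while recall is only 30%.
    print(precision_recall(99, 1, 231))
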
Run the active learning core by doing:

.. code-block:: bash

    python bin/iepy_runner.py <relation_name> <output>

Add ``--tune-for=high-prec`` or ``--tune-for=high-recall`` before the relation name to switch
between modes. The default is **high precision**.

This will run until it needs you to label some of the evidence. At this point, what you
need to do is go to the web interface that you ran in the previous step, and label
some evidence there.
When you consider that you have labeled enough, go back to the prompt that the iepy runner presented
and continue the execution by typing **run**.
That will cycle again and repeat the process.

When you want to finish, ask the core to **STOP** at that same prompt.
It'll save a CSV with the automatic classifications for all the evidence in the database.

Also, note that you can only predict a relation for text that has been inserted into the database.
Each row of the CSV output file has the primary key of an object in the database representing the evidence that
was classified as "relation present" or "relation not present". An evidence object in the database is an
information-rich object containing the entities and circumstances surrounding the prediction, and it
is too complex to put in a single CSV file.
In order to access the entities and other details you'll need to write a script
that talks with the database (see ``iepy/data/models.py``), as in the sketch below.

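For instance, a sketch of such a script, reading the CSV and fetching each evidence object, could look like the following. The bootstrap call, the ``EvidenceCandidate`` model name and the CSV column layout are assumptions that may not match your IEPY version, so verify them against ``iepy/data/models.py`` and your actual output file:

.. code-block:: python

    import csv

    import iepy
    iepy.setup()  # assumption: some versions take the instance path as argument

    # Assumption: the evidence model name; check iepy/data/models.py.
    from iepy.data.models import EvidenceCandidate

    with open("output.csv") as csv_file:
        for pk, label in csv.reader(csv_file):
            # Assumption: each row is (primary key, predicted label).
            evidence = EvidenceCandidate.objects.get(pk=pk)
            # The evidence object links to the entity occurrences and
            # the text segment surrounding the prediction.
            print(label, evidence)
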
Fine tuning
-----------

If you want to modify the internal behavior, you can change the settings file. In your instance
folder you'll find a file called ``extractor_config.json``. There you have all the configuration
for the internal classifier, such as:

Classifier
..........

This sets the classifier algorithm to be used. You can choose from:

* sgd: `Stochastic Gradient Descent <http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html>`_
* knn: `Nearest Neighbors <http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier>`_
* svc `(default)`: `C-Support Vector Classification <http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>`_
* randomforest: `Random Forest <http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html>`_
* adaboost: `AdaBoost <http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html>`_

Features
........

Features to be used in the classifier. You can use a subset of:

* number_of_tokens
* symbols_in_between
* in_same_sentence
* verbs_count
* verbs_count_in_between
* total_number_of_entities
* other_entities_in_between
* entity_distance
* entity_order
* bag_of_wordpos_bigrams_in_between
* bag_of_wordpos_in_between
* bag_of_word_bigrams_in_between
* bag_of_pos_in_between
* bag_of_words_in_between
* bag_of_wordpos_bigrams
* bag_of_wordpos
* bag_of_word_bigrams
* bag_of_pos
* bag_of_words

Each of these can be used as `sparse` by adding it to the
`sparse_features` section, or as `dense` by adding it to the `dense_features` section.
The features in the sparse section will go through a stage of linear dimensionality reduction,
while the dense features, by default, will be used with a non-linear classifier.

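As an illustration, a configuration that selects the classifier and splits the features might look like the excerpt below; the exact key names are assumptions here, so take the ``extractor_config.json`` generated in your instance as the authoritative reference:

.. code-block:: json

    {
        "classifier": "svc",
        "sparse_features": [
            "bag_of_words_in_between",
            "bag_of_pos_in_between"
        ],
        "dense_features": [
            "number_of_tokens",
            "entity_distance",
            "entity_order"
        ]
    }
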
Viewing predictions on the web user interface
---------------------------------------------

If you prefer to review the predictions using the web interface, it is possible to run the
active learning core in a way that stores the results in the database, so that they are accessible
through the web.
To do so, you'll have to run the core like this:

.. code-block:: bash

    python bin/iepy_runner.py --db-store <relation_name>

We do not have a specialized interface to review predictions, but you can still view them
by using the :doc:`interface to create a reference corpus <corpus_labeling>`.
This way, you'll get the labels under a new **judge** called ``iepy-run``, along with a date.

.. image:: labels_by_iepy.png

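If you'd rather inspect those machine-generated labels from a script instead of the web interface, something along these lines might work; the ``EvidenceLabel`` model and its ``judge`` field are assumptions, so check ``iepy/data/models.py`` for the real names:

.. code-block:: python

    import iepy
    iepy.setup()  # assumption: some versions take the instance path as argument

    # Assumption: the label model and its "judge" field; see iepy/data/models.py.
    from iepy.data.models import EvidenceLabel

    # Labels produced by the active learning core show up under the
    # "iepy-run" judge, as described above.
    for label in EvidenceLabel.objects.filter(judge="iepy-run"):
        print(label)
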
Saving predictor for later use
------------------------------

Since training can be a slow process, you might want to save your trained predictor and
re-use it several times without the need to train again.
You can save it by doing:

.. code-block:: bash

    python bin/iepy_runner.py --store-extractor=myextractor.pickle <relation_name> <output>

And re-use it like this:

.. code-block:: bash

    python bin/iepy_runner.py --trained-extractor=myextractor.pickle <relation_name> <output>