How to Hack
===========

There are several places where you can incorporate your own ideas and needs into IEPY.
Here you'll see how to modify different parts of the iepy core.

Altering how the corpus is created
----------------------------------

As mentioned in the `preprocess <preprocess.html#how-to-customize>`_ section, you can customize how the corpus is created.

Using your own classifier
-------------------------

You can change the definition of the *extraction classifier* that is used when running
iepy in *active learning* mode.

As the simplest example, first define your own custom classifier, like this:
.. code-block:: python

    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer


    class MyOwnRelationClassifier:
        def __init__(self, **config):
            vectorizer = CountVectorizer(
                preprocessor=lambda evidence: evidence.segment.text)
            classifier = SGDClassifier()
            self.pipeline = make_pipeline(vectorizer, classifier)

        def fit(self, X, y):
            self.pipeline.fit(X, y)
            return self

        def predict(self, X):
            return self.pipeline.predict(X)

        def decision_function(self, X):
            return self.pipeline.decision_function(X)
and later, in the ``iepy_runner.py`` of your IEPY instance, provide it as a
configuration parameter when creating the **ActiveLearningCore**, like this:

.. code-block:: python

    iextractor = ActiveLearningCore(
        relation, labeled_evidences,
        tradeoff=tuning_mode,
        extractor_config={},
        extractor=MyOwnRelationClassifier
    )

Implementing your own features
------------------------------

Your classifier can use features that are already built into iepy, or you can create your
own. You can even use a rule (as defined in the :doc:`rules core <rules_tutorial>`) as a feature.

Start by creating a new file in your instance. You can call it whatever you want, but for this
example let's call it ``custom_features.py``. There you'll define your features:
.. code-block:: python

    # custom_features.py
    from featureforge.feature import output_schema


    @output_schema(int, lambda x: x >= 0)
    def tokens_count(evidence):
        return len(evidence.segment.tokens)
.. note::

    Your features can use some of `Feature Forge's <http://feature-forge.readthedocs.org/en/latest/>`__
    capabilities.
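Stripped of the Feature Forge decorator, a feature is just a callable that takes an evidence object. A minimal stand-in (the mock below only imitates the ``evidence.segment.tokens`` attribute; it is not a real iepy object) shows how ``tokens_count`` behaves:

```python
from types import SimpleNamespace


def tokens_count(evidence):
    # Same logic as the feature above, minus the output-schema validation.
    return len(evidence.segment.tokens)


# Hypothetical mock mimicking the evidence.segment.tokens attribute.
evidence = SimpleNamespace(
    segment=SimpleNamespace(tokens=["John", "was", "born", "in", "1980"]))
print(tokens_count(evidence))  # → 5
```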
Once you've defined your feature, you can use it in the classifier by adding it to the configuration
file. Your instance should already have one with all the default values, called ``extractor_config.json``.
There you'll find two sets of features where you can add it: dense or sparse. Depending on the values
returned by your feature, you'll choose one over the other.

To include it, add a line with the Python path to your feature function. If you're not familiar with
the format, follow this pattern:

::

    {project_name}.{features_file}.{feature_function}
In our example, our instance is called ``born_date``, so in the config this would be:

.. code-block:: json

    "dense_features": [
        ...
        "born_date.custom_features.tokens_count",
        ...
    ],

Remember that if you want to use that configuration file, you have to run with the option ``--extractor-config``.
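Dotted paths like these are conventionally resolved with ``importlib``. The following is a generic sketch of that mechanism (it is not iepy's actual loader, and ``resolve_dotted_path`` is a name made up for this example); a standard-library path stands in for a config entry such as ``born_date.custom_features.tokens_count``:

```python
import importlib


def resolve_dotted_path(path):
    # Split "package.module.attribute" into a module path and an attribute name.
    module_path, _, attr_name = path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Demonstrate with a standard-library function.
dumps = resolve_dotted_path("json.dumps")
print(dumps({"a": 1}))  # → {"a": 1}
```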

Using rules as features
-----------------------

In the same way, and without changing the rule at all, you can
add a rule as a feature by declaring it in your config.

Suppose your instance is called ``born_date`` and your rule is called ``born_date_in_parenthesis``;
then you'll do:

.. code-block:: json

    "dense_features": [
        ...
        "born_date.rules.born_date_in_parenthesis",
        ...
    ],

This will run your rule as a feature that returns 1 if the rule matched and 0 if it didn't.

Using all rules as one feature
..............................

Suppose you have a bunch of rules defined in your rules file, and instead of using each rule as a
different feature you want a single feature that runs all the rules to test whether the evidence
matches. You can write a custom feature that does so. Let's look at an example snippet:
.. code-block:: python

    # custom_features.py
    import refo

    from iepy.extraction.rules import compile_rule, generate_tokens_to_match, load_rules

    rules = load_rules()


    def rules_match(evidence):
        tokens_to_match = generate_tokens_to_match(evidence)
        for rule in rules:
            regex = compile_rule(rule, evidence.relation)

            if refo.match(regex, tokens_to_match):
                if rule.answer:  # positive rule
                    return 1
                else:  # negative rule
                    return -1
        # no rule matched
        return 0
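The same three-valued logic can be illustrated with a self-contained sketch that swaps refo and iepy's rule machinery for plain ``re`` patterns; the ``Rule`` tuple below is a hypothetical stand-in for an iepy rule, keeping only the parts the logic needs (a pattern and a positive/negative ``answer``):

```python
import re
from collections import namedtuple

# Hypothetical stand-in for an iepy rule: a regex plus a positive/negative answer.
Rule = namedtuple("Rule", ["pattern", "answer"])

rules = [
    Rule(r"born in \d{4}", answer=True),   # positive rule
    Rule(r"died in \d{4}", answer=False),  # negative rule
]


def rules_match(text):
    # Mirror the feature above: the first matching rule decides; no match yields 0.
    for rule in rules:
        if re.search(rule.pattern, text):
            return 1 if rule.answer else -1
    return 0


print(rules_match("John was born in 1980"))  # → 1
print(rules_match("John died in 2001"))      # → -1
print(rules_match("John lives in Paris"))    # → 0
```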
This defines a feature called ``rules_match`` that tries every rule on an evidence
until a match occurs, and returns one of three different values, depending on the type
of match.

To use it, add this single feature to your config like this:
.. code-block:: json

    "dense_features": [
        ...
        "born_date.custom_features.rules_match",
        ...
    ],

Documents Metadata
------------------

While building your application, you might want to store some extra information about your documents.
To avoid loading this data every time when predicting, this information lives in a separate model
called **IEDocumentMetadata**, accessible through the **metadata** attribute.

IEDocumentMetadata has 3 fields:

* title: stores the document's title
* url: stores the source URL if the document came from a web page
* items: a dictionary that you can use to store anything you want

By default, the **csv importer** uses the document's metadata to save the filepath of the csv file in the *items* field.
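In iepy the metadata is a database-backed model, but its shape can be sketched with a plain dataclass. This is illustrative only, not iepy code, and the ``"input_filename"`` key below is a hypothetical example of what an importer might store:

```python
from dataclasses import dataclass, field


@dataclass
class IEDocumentMetadataSketch:
    # Mirrors the three fields described above.
    title: str = ""
    url: str = ""
    items: dict = field(default_factory=dict)


# E.g. an importer could record the source file path under items
# ("input_filename" is a made-up key for illustration).
meta = IEDocumentMetadataSketch(title="Some document",
                                items={"input_filename": "docs.csv"})
print(meta.items["input_filename"])  # → docs.csv
```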