How to Hack
===========

There are several places where you can incorporate your own ideas and needs into IEPY.
Here you'll see how to modify different parts of the iepy core.

Altering how the corpus is created
----------------------------------

The `preprocess <preprocess.html#how-to-customize>`_ section already mentions that you can customize how the corpus is created.

Using your own classifier
-------------------------

You can change the definition of the *extraction classifier* that is used when running
iepy in *active learning* mode.

The simplest way to do this is shown in the following example.
First, define your own custom classifier, like this:

.. code-block:: python

    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer


    class MyOwnRelationClassifier:
        def __init__(self, **config):
            # Vectorize the text of each evidence's segment and feed it to a
            # linear classifier trained with stochastic gradient descent.
            vectorizer = CountVectorizer(
                preprocessor=lambda evidence: evidence.segment.text)
            classifier = SGDClassifier()
            self.pipeline = make_pipeline(vectorizer, classifier)

        def fit(self, X, y):
            self.pipeline.fit(X, y)
            return self

        def predict(self, X):
            return self.pipeline.predict(X)

        def decision_function(self, X):
            return self.pipeline.decision_function(X)

Later, in the ``iepy_runner.py`` of your IEPY instance, provide it as a configuration
parameter when creating the **ActiveLearningCore**, like this:

.. code-block:: python

    iextractor = ActiveLearningCore(
        relation, labeled_evidences,
        tradeoff=tuning_mode,
        extractor_config={},
        extractor=MyOwnRelationClassifier
    )

Implementing your own features
------------------------------

Your classifier can use features that are already built into iepy, or you can create your
own. You can even use a rule (as defined in the :doc:`rules core <rules_tutorial>`) as a feature.

Start by creating a new file in your instance. You can call it whatever you want, but for this
example let's call it ``custom_features.py``. There you'll define your features:

.. code-block:: python

    # custom_features.py
    from featureforge.feature import output_schema


    @output_schema(int, lambda x: x >= 0)
    def tokens_count(evidence):
        return len(evidence.segment.tokens)

.. note::

    Your features can use some of `Feature Forge's <http://feature-forge.readthedocs.org/en/latest/>`__
    capabilities.

Once you've defined your feature, you can use it in the classifier by adding it to the configuration
file. Your instance should already have one with all the default values, called ``extractor_config.json``.
There you'll find two sets of features where you can add it: dense or sparse. Depending on the values
returned by your feature, you'll choose one over the other.
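
For example, here's a minimal sketch of a feature that would belong in the sparse set. It assumes,
as with iepy's built-in bag-of-words style features, that sparse features may return a collection of
strings and that ``evidence.segment.tokens`` is a list of token strings:

.. code-block:: python

    # custom_features.py

    # Sketch of a feature for the "sparse" set: instead of a number, it
    # returns a collection of strings (the lowercased tokens of the
    # evidence's segment).
    def bag_of_lowercased_tokens(evidence):
        return set(token.lower() for token in evidence.segment.tokens)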

To include it, you have to add a line with the Python path to your feature function. If you're not
familiar with the format, follow this pattern:

::

    {project_name}.{features_file}.{feature_function}

In our example, our instance is called ``born_date``, so in the config this would be:

.. code-block:: json

    "dense_features": [
        ...
        "born_date.custom_features.tokens_count",
        ...
    ],

Remember that if you want to use that configuration file, you have to pass the option ``--extractor-config``.
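
For example, the invocation could look roughly like the following; the exact positional arguments
and the runner's location depend on your instance, so check your ``iepy_runner.py``'s help:

::

    python iepy_runner.py --extractor-config=extractor_config.json <relation_name> <output>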

Using rules as features
-----------------------

In the same way, and without making any change to the rule itself, you can add a rule as a feature
by declaring it in your config. Suppose your instance is called ``born_date`` and your rule is called
``born_date_in_parenthesis``; then you'd do:

.. code-block:: json

    "dense_features": [
        ...
        "born_date.rules.born_date_in_parenthesis",
        ...
    ],

This will run your rule as a feature that returns 0 if it didn't match and 1 if it matched.

Using all rules as one feature
..............................

Suppose you have a bunch of rules defined in your rules file, and instead of using each rule as a
different feature you want a single feature that runs all the rules to test whether the evidence
matches. You can write a custom feature that does so. Let's look at an example snippet:

.. code-block:: python

    # custom_features.py
    import refo

    from iepy.extraction.rules import compile_rule, generate_tokens_to_match, load_rules

    rules = load_rules()


    def rules_match(evidence):
        tokens_to_match = generate_tokens_to_match(evidence)
        for rule in rules:
            regex = compile_rule(rule, evidence.relation)

            if refo.match(regex, tokens_to_match):
                if rule.answer:  # positive rule
                    return 1
                else:  # negative rule
                    return -1

        # no rule matched
        return 0

This will define a feature called ``rules_match`` that tries every rule for an evidence
until a match occurs, and returns one of three different values, depending on the type
of match.

To use it, add this single feature to your config:

.. code-block:: json

    "dense_features": [
        ...
        "born_date.custom_features.rules_match",
        ...
    ],

Documents Metadata
------------------

While building your application, you might want to store some extra information about your documents.
To avoid loading this data every time you predict, this information lives in a separate model called
**IEDocumentMetadata**, which is accessible through the document's **metadata** attribute.

IEDocumentMetadata has 3 fields:

* title: stores the document's title
* url: stores the source URL if the document came from a web page
* items: a dictionary that you can use to store anything you want

By default, the **csv importer** uses the document's metadata to save the file path of the csv file
in the *items* field.
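
For example, a minimal sketch of attaching extra information to a document could look like the
following; the lookup by ``human_identifier`` is an assumption, so adapt it to however you fetch
documents in your instance:

.. code-block:: python

    from iepy.data.models import IEDocument

    # Sketch only: fetch a document (the lookup field used here is an
    # assumption) and attach extra information to its metadata.
    doc = IEDocument.objects.get(human_identifier="some-doc-id")
    doc.metadata.title = "Some document title"
    doc.metadata.url = "http://example.com/source-page"
    doc.metadata.items["import_batch"] = "2015-01-crawl"  # free-form extra data
    doc.metadata.save()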