|
- Running the rule-based core
- ===========================
- This section guides you through the steps for using the rule-based system
- to detect relations in your documents.
- How they work
- -------------
- In the rule-based system, you define a set of regular-expression-like rules
- that are tested against the segments of the documents. Roughly speaking,
- if a rule matches, the relation is present.
- This approach gives high precision because you control exactly what is matched.
- Anatomy of a rule
- -----------------
- .. note::
-     If you don't know how to define a Python function,
-     `check this out <https://docs.python.org/3/tutorial/controlflow.html#defining-functions>`_.
- A rule is basically a *decorated Python function*.
- We will see later where it needs to be added; for now, let's concentrate on how it is written.
- .. code-block:: python
-     @rule(True)
-     def born_date_and_death_in_parenthesis(Subject, Object):
-         """ Example: Carl Bridgewater (January 2, 1965 - September 19, 1978) was shot dead """
-         anything = Star(Any())
-         return Subject + Pos("-LRB-") + Object + Token("-") + anything + Pos("-RRB-") + anything
- First, you mark the function as a rule with the **@rule decorator**.
- As you can see in the first line, it is placed on top of the function.
- The decorator takes a parameter that says whether the rule is *positive* or *negative*. A positive
- rule that matches labels the relation as present, and a negative one labels it as not present.
- You choose between the two by passing ``True`` or ``False`` to the rule decorator.
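To make the role of the decorator concrete, here is a minimal sketch of how a decorator of this shape could work. It is not IEPY's implementation: the ``rule`` function below is a stand-in, and the attribute names ``is_rule``, ``answer`` and ``priority`` are assumptions chosen for illustration.

```python
# Illustrative stand-in, NOT iepy.extraction.rules.rule.
def rule(answer, priority=0):
    """Mark a function as a rule and record whether it is positive or negative."""
    if not isinstance(answer, bool):
        raise ValueError("answer must be True (positive) or False (negative)")

    def decorator(func):
        # Attribute names are assumptions made for this sketch.
        func.is_rule = True
        func.answer = answer
        func.priority = priority
        return func

    return decorator


@rule(True)
def positive_example(Subject, Object):
    """A positive rule: a match labels the relation as present."""
    return None  # a real rule returns a ReFO pattern here


@rule(False)
def negative_example(Subject, Object):
    """A negative rule: a match labels the relation as not present."""
    return None  # a real rule returns a ReFO pattern here
```

The point of the sketch is that the decorator only attaches metadata; the function itself is left unchanged and still has to build and return the pattern.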
- Next comes the function definition. The function takes two parameters: the **Subject** and the **Object**.
- These are patterns that will be part of the regex that the function has to return.
- Last comes the body of the function, which has to build the regular expression and
- return it. This is not an ordinary regular expression; it
- uses `ReFO <https://github.com/machinalis/refo>`_.
- In ReFO you work with objects that each perform some kind of check on the text segment.
- For our example, we've chosen to look for the *Was Born* relation. In particular, we look for the date of birth of a
- person when it is written like this:
- ::
-     Carl Bridgewater (January 2, 1965 - September 19, 1978)
- To match cases like this, we specify the regex as a sum of predicates, which checks that every
- part matches in sequence.
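The idea of a sequence of predicates, each checking one token, can be pictured without ReFO. The sketch below is a deliberately simplified toy: the token dictionaries and the helpers ``token``, ``pos``, ``any_token`` and ``matches`` are hypothetical, and it has no repetition operators such as ``Star``, which ReFO provides.

```python
# Toy tokens: each one carries its text and a part-of-speech tag.
tokens = [
    {"text": "(", "pos": "-LRB-"},
    {"text": "January", "pos": "NNP"},
    {"text": "-", "pos": ":"},
    {"text": ")", "pos": "-RRB-"},
]


def token(text):
    """Predicate: the token text is exactly `text`."""
    return lambda tok: tok["text"] == text


def pos(tag):
    """Predicate: the token's part-of-speech tag is exactly `tag`."""
    return lambda tok: tok["pos"] == tag


def any_token():
    """Predicate: accepts every token."""
    return lambda tok: True


def matches(predicates, toks):
    """True when there is one predicate per token and each accepts its token."""
    return len(predicates) == len(toks) and all(
        pred(tok) for pred, tok in zip(predicates, toks)
    )


good_pattern = [pos("-LRB-"), any_token(), token("-"), pos("-RRB-")]
bad_pattern = [pos("-LRB-"), token("-"), any_token(), pos("-RRB-")]
```

Here ``matches(good_pattern, tokens)`` succeeds because every position passes its check, while ``bad_pattern`` fails at the second token.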
- Rule's building blocks
- ----------------------
- Besides all the ReFO predicates, IEPY comes with several of its own that you will find useful for creating your rules:
- * **Subject**: matches the evidence's left part.
- * **Object**: matches the evidence's right part.
- * **Token**: matches if the token is literally the one specified.
- * **Lemma**: matches if the lemma is literally the one specified.
- * **Pos**: matches the *part of speech* of the token examined.
- * **Kind**: matches if the token belongs to an entity occurrence with a given kind.
- Setting priority
- ----------------
- Using the **rule decorator**, you can mark a rule as more important than another, so that IEPY
- tries to match it first.
- IEPY runs the rules in decreasing order of priority number, and the default priority is 0.
- For example, to set a priority of 1:
- .. code-block:: python
-     @rule(True, priority=1)
-     def rule_name(Subject, Object):
-         ...
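The ordering described in this section can be pictured with plain Python. The sketch below is only an illustration: the ``priority`` attribute and the helpers ``make_rule`` and ``sort_rules`` are hypothetical names, not part of IEPY's API, and how IEPY orders rules with equal priority is not covered here.

```python
def make_rule(name, priority=0):
    """Build a dummy rule function that carries a priority attribute."""
    def func(Subject, Object):
        return None  # a real rule returns a ReFO pattern here
    func.__name__ = name
    func.priority = priority
    return func


def sort_rules(rules):
    """Return the rules in decreasing order of priority.

    Python's sort is stable, so in this sketch rules with equal
    priority keep the order in which they were given.
    """
    return sorted(rules, key=lambda r: r.priority, reverse=True)


rules = [
    make_rule("default_a"),
    make_rule("urgent", priority=1),
    make_rule("default_b"),
]
ordered_names = [r.__name__ for r in sort_rules(rules)]
print(ordered_names)  # ['urgent', 'default_a', 'default_b']
```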
- Negative rules
- --------------
- If you notice that your rules are matching things they shouldn't, you can write a rule
- that catches those cases before a positive rule takes them.
- You do this by marking the rule as a *negative rule* in the decorator. It is also
- recommended to give it a higher priority so that it is checked before the other rules.
- Example:
- .. code-block:: python
-     @rule(False, priority=1)
-     def incorrect_labeling_of_place_as_person(Subject, Object):
-         """
-         Ex: Sophie Christiane of Wolfstein (24 October 1667 - 23 August 1737)
-         Wolfstein is a *place*, not a *person*
-         """
-         anything = Star(Any())
-         person = Plus(Pos("NNP") + Question(Token(",")))
-         return anything + person + Token("of") + Subject + anything
- Note that the parameters of the rule decorator are **False** and **priority=1**.
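The interaction between a higher-priority negative rule and a positive rule can be sketched as follows. This is a toy under stated assumptions: the rules are plain tuples, the matchers are ordinary ``re`` patterns over raw text rather than ReFO patterns over tokens, and the names ``TOY_RULES`` and ``label`` are hypothetical. It only illustrates the "first match decides" behaviour.

```python
import re

# Each toy rule is (name, answer, priority, pattern).
TOY_RULES = [
    ("dates_in_parenthesis", True, 0, re.compile(r"\(.+ - .+\)")),
    ("place_after_of", False, 1, re.compile(r"\bof [A-Z]\w+ \(")),
]


def label(text, rules=TOY_RULES):
    """Return (answer, rule_name) for the first rule that matches.

    Rules are tried in decreasing priority. If no rule matches,
    (None, None) is returned.
    """
    by_priority = sorted(rules, key=lambda r: r[2], reverse=True)
    for name, answer, _priority, pattern in by_priority:
        if pattern.search(text):
            return answer, name
    return None, None


plain = "Carl Bridgewater (January 2, 1965 - September 19, 1978) was shot dead"
tricky = "Sophie Christiane of Wolfstein (24 October 1667 - 23 August 1737)"
```

With these inputs, ``label(plain)`` is decided by the positive rule, while ``label(tricky)`` is caught by the negative rule first, even though the positive pattern would also have matched it.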
- Where do I place the rules
- --------------------------
- In your project's instance folder, there should be a *rules.py* file. All rules should be placed
- there, along with a **RELATION** variable that sets which relation is going to be used.
- This is the file that is loaded when you run the *iepy_rules_runner*.
- Example
- -------
- This is a portion of the example provided with IEPY; you can view the `complete
- file here <https://github.com/machinalis/iepy/blob/develop/examples/birthdate/was_born_rules_sample.py>`__.
- .. code-block:: python
-     from refo import Question, Star, Any, Plus
-     from iepy.extraction.rules import rule, Token, Pos
-     RELATION = "was born"
-     @rule(True)
-     def was_born_explicit_mention(Subject, Object):
-         """
-         Ex: Shamsher M. Chowdhury was born in 1950.
-         """
-         anything = Star(Any())
-         return anything + Subject + Token("was born") + Pos("IN") + Object + anything
-     @rule(True)
-     def is_born_in(Subject, Object):
-         """
-         Ex: Xu is born in 1902 or 1903 in a family of farmers in Hubei ..
-         """
-         anything = Star(Any())
-         return Subject + Token("is born in") + Object + anything
-     @rule(True)
-     def just_born(Subject, Object):
-         """
-         Ex: Lyle Eugene Hollister, born 6 July 1923 in Sioux Falls, South Dakota, enlisted in the Navy....
-         """
-         anything = Star(Any())
-         return Subject + Token(", born") + Object + anything
- Verifying your rules
- --------------------
- While you build your rules, you might want to check whether they are matching or
- not. Moreover, if you have tagged data in your corpus, you can find out how well they perform.
- The rules verifier is located in your instance's ``bin`` directory and is called ``rules_verifier.py``.
- You can run the verifier with every rule or with a single rule, on all of the segments or on a sample of them.
- To see the verifier's parameters and how to use them, run:
- .. code-block:: bash
-     $ python bin/rules_verifier.py --help
- If you have labeled data in your corpus, the run will calculate how the rules scored in terms of precision, recall and
- other metrics. Keep in mind that this is not exactly what you'll get when you run the rules core. Even
- if you run the verifier with all the rules and all the data, the numbers will be a little different, because
- the verifier runs every evidence against every rule, whereas the core stops at the first match. This is just a warning so you
- don't get too excited or too depressed about these results.
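For readers unfamiliar with the metrics, the sketch below computes precision and recall from their standard definitions. It does not reproduce the verifier's internals, which are not shown here; the function name ``precision_recall`` and the sample labels are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) from parallel lists of boolean labels.

    precision = true positives / all predicted positives
    recall    = true positives / all gold positives
    A metric is reported as 0.0 when its denominator is zero.
    """
    pairs = list(zip(predicted, gold))
    true_pos = sum(1 for p, g in pairs if p and g)
    false_pos = sum(1 for p, g in pairs if p and not g)
    false_neg = sum(1 for p, g in pairs if not p and g)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up labels for five evidences: gold is the human tag,
# predicted is what the rules said.
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
precision, recall = precision_recall(predicted, gold)
```

In this made-up sample there are two true positives, one false positive and one false negative, so both precision and recall come out to two thirds.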
|