About the Pre-Process
=====================

The pre-process adds the metadata that IEPY needs to detect the relations, which includes:

* Text tokenization and sentence splitting.
* Text lemmatization.
* Part-Of-Speech (POS) tagging.
* Named Entity Recognition (NER).
* Gazettes resolution.
* Syntactic parsing.
* TextSegments creation (IEPY's internal text unit).

We're currently running all these steps (except the last one) using the `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`_ tools.
They run in a single all-in-one pass, but every step can be :ref:`modified to use a custom version <customize>` that adjusts to your needs.
About the Tokenization and Sentence splitting
---------------------------------------------

The text of each document is split into tokens and sentences, and that information is stored
on the document itself, preserving (and also storing) for each token its offset (in characters)
into the original document text.

The tokenizer used by default is the one that the `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`_ provides.
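As a rough illustration (the variable names below are ours, not IEPY's exact API), the stored
offsets let you map every token back to its position in the raw text:

.. code-block:: python

    # Illustrative sketch: a document keeps parallel lists of tokens and
    # character offsets pointing into the original text.
    doc_text = "IEPY extracts relations."
    tokens = ["IEPY", "extracts", "relations", "."]
    offsets = [0, 5, 14, 23]

    for token, offset in zip(tokens, offsets):
        # Each offset marks where the token starts in the raw text.
        assert doc_text[offset:offset + len(token)] == token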
.. note::

    While using the Stanford tokenizer, you can customize some of the tokenization options.
    First read here: `tokenizer options <http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html>`_

    On your instance *settings.py* file, add options as keys on the CORENLP_TKN_OPTS dict.
    You can use any of the "known options" as a key; as the value,
    use True or False for booleans, or just a string when the option requires text.

    Example:

    .. code-block:: python

        CORENLP_TKN_OPTS = {
            'latexQuotes': False
        }
Lemmatization
-------------

.. note::

    Lemmatization was added in version 0.9.2; all instances created before that
    need to run the preprocess script again. This will run only the lemmatization step.

The text runs through a lemmatization step in which each token gets a lemma: a canonical form of the word that
can be used in the classifier features or in the rules core.
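To make the idea concrete, here is a small illustrative sketch (the variable names are ours,
not IEPY's API) of how lemmas line up with tokens:

.. code-block:: python

    # Illustrative only: parallel token/lemma lists as a lemmatizer would produce them.
    tokens = ["The", "droids", "were", "running"]
    lemmas = ["the", "droid", "be", "run"]

    # A rule or classifier feature can then match on the canonical form:
    assert "run" in lemmas  # matches "running", "ran", "runs", ...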
Part of speech tagging (POS)
----------------------------

Each token is augmented with metadata about its part of speech, such as noun, verb,
adjective, and other grammatical tags.
Along with the token itself, this may be used by the NER to detect an entity occurrence.
This information is also stored on the document itself, together with the tokens.

The POS tagger used by default is the one that the `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`_ provides.
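For instance (an illustrative sketch, not IEPY's exact storage format), the tags are
Penn Treebank style labels paired with each token:

.. code-block:: python

    # Illustrative only: tokens paired with Penn Treebank POS tags.
    tagged = [("we", "PRP"), ("have", "VBP"), ("cookies", "NNS")]

    # e.g. keep only the nouns
    nouns = [token for token, tag in tagged if tag.startswith("NN")]
    assert nouns == ["cookies"]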
Named Entity Recognition (NER)
------------------------------

To find a relation between entities, one must first recognize these entities in the text.

As a result of NER, each document is augmented with information about all the found
Named Entities (together with which tokens are involved in each occurrence).

An automatic NER is used to find occurrences of an entity in the text.
The default pre-process uses the Stanford NER; check the Stanford CoreNLP `documentation <http://nlp.stanford.edu/software/corenlp.shtml>`_
to find out which entity kinds are supported. They include:

* Location
* Person
* Organization
* Date
* Number
* Time
* Money
* Percent

Other remarkable features of this NER (that are incorporated into the default pre-process) are:

- pronoun resolution
- simple co-reference resolution

This step can be customized to find entities of kinds defined by you, or anything else you may need.
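As a conceptual sketch only (the structure below is illustrative; IEPY's actual models differ),
an entity occurrence ties an entity kind to the span of tokens where it was found:

.. code-block:: python

    from collections import namedtuple

    # Illustrative structure: which tokens of the document form the occurrence.
    EntityOccurrence = namedtuple(
        "EntityOccurrence", "kind token_offset token_offset_end alias")

    # "Luke Skywalker" found at token positions 3-4 of some sentence.
    occurrence = EntityOccurrence(
        kind="Person", token_offset=3, token_offset_end=5, alias="Luke Skywalker")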
Gazettes resolution
-------------------

In case you want to add named entity recognition by matching literals, IEPY provides a system of gazettes:
a mapping of literals to entity kinds that is run on top of the basic Stanford NER.
With this, you'll be able to recognize entities beyond the ones found by the Stanford NER, or even correct
those that are incorrectly tagged.

:doc:`Learn more about it here. <gazettes>`
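Conceptually (this is only an illustration of the idea; the actual gazette format is described
in the page linked above), a gazette is a literal-to-kind mapping:

.. code-block:: python

    # Illustrative only: literals mapped to the entity kind they should resolve to.
    gazette = {
        "Death Star": "LOCATION",
        "Millennium Falcon": "VEHICLE",
    }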
Syntactic parsing
-----------------

.. note::

    Syntactic parsing was added in version 0.9.3; all instances created before that
    need to run the preprocess script again. This will run only the syntactic parsing step.

The sentences are parsed to work out the syntactic structure. Each sentence gets a structure tree
that is stored in `Penn Treebank notation <http://en.wikipedia.org/wiki/Treebank>`__. IEPY presents
this to the user as an `NLTK Tree object <http://www.nltk.org/howto/tree.html>`__.

By default the sentences are processed with the `Stanford Parser <http://nlp.stanford.edu/software/lex-parser.shtml>`__
provided within the `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`__.
For example, the syntactic parsing of the sentence ``Join the dark side, we have cookies`` would be:

::

    (ROOT
      (S
        (S
          (VP (VBN Join)
            (NP (DT the) (JJ dark) (NN side))))
        (, ,)
        (NP (PRP we))
        (VP (VBP have)
          (NP (NNS cookies)))))
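Since NLTK's ``Tree.fromstring`` understands this bracketed notation, a tree like the one above
can be inspected directly; a minimal example (using the same tree as above):

.. code-block:: python

    from nltk.tree import Tree

    parse = """(ROOT (S (S (VP (VBN Join) (NP (DT the) (JJ dark) (NN side))))
                (, ,) (NP (PRP we)) (VP (VBP have) (NP (NNS cookies)))))"""

    tree = Tree.fromstring(parse)
    print(tree.leaves())
    # ['Join', 'the', 'dark', 'side', ',', 'we', 'have', 'cookies']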
About the Text Segmentation
---------------------------

IEPY works at a **text segment** (or simply **segment**) level, meaning that it will
try to find whether a relation is present within a segment of text. The
pre-process is responsible for splitting the documents into segments.

The default pre-process uses a segmenter that creates segments with the following criterion:

* for each sentence in the document, a segment is created if the sentence contains at least 2 Entity Occurrences.
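In other words (a conceptual sketch only, with assumed attribute names, not IEPY's actual segmenter code):

.. code-block:: python

    # Conceptual sketch of the default criterion: keep only the sentences
    # that contain at least two entity occurrences.
    def sentences_to_segment(sentences):
        for sentence in sentences:
            if len(sentence.entity_occurrences) >= 2:
                yield sentence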
.. _customize:

How to customize
----------------

On your own IEPY instance there's a file called ``preprocess.py``, located in the ``bin`` folder.
There you'll find that the default is to simply run the Stanford preprocess and then the segmenter.
This can be changed to run a sequence of steps defined by you.
For example, take this pseudo-code to guide you:
.. code-block:: python

    pipeline = PreProcessPipeline([
        CustomTokenizer(),
        CustomSentencer(),
        CustomLemmatizer(),
        CustomPOSTagger(),
        CustomNER(),
        CustomSegmenter(),
    ], docs)
    pipeline.process_everything()
.. note::

    The steps can be functions or callable objects. We recommend objects, because you'll generally
    want to load things up in the ``__init__`` method to avoid loading them over and over again.

Each of those steps will be called with each of the documents: a step is run
over all the documents before the next step starts with them.
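For instance, a custom step could look roughly like this (a hedged sketch: the class name and the
attributes used on ``document`` are made-up placeholders, not IEPY's exact model fields):

.. code-block:: python

    class UppercaseNER:
        """Toy example of a callable pre-process step: marks tokens that
        appear in a known-names set."""

        def __init__(self):
            # Do expensive one-time setup here (load models, dictionaries, ...).
            self.known_names = {"Luke", "Leia", "Han"}

        def __call__(self, document):
            # Called once per document by the pipeline; store the result on it.
            document.found_names = [
                token for token in document.tokens if token in self.known_names
            ]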
Running in multiple cores
-------------------------

Preprocessing might take a lot of time. To handle this, you can run the preprocess on several cores of the
same machine, or even run it on different machines, to speed up the processing.

To run it on the same machine using multiple cores, all you need to do is run:

.. code-block:: bash

    $ python bin/preprocess.py --multiple-cores=all

This will use all the available cores. You can also specify a number if you want to
use fewer than that, like this:

.. code-block:: bash

    $ python bin/preprocess.py --multiple-cores=2
Running in multiple machines
----------------------------

Running the preprocess on different machines is a bit tricky. Here's what you'll need:

* An IEPY instance with a database that allows remote access (such as postgres).
* One IEPY instance on each extra machine, with the database settings pointing to the main one.

Then you'll need to decide how many parts you want to split the document set into,
and run each part on a different machine. For example, you could split the documents into 4 parts and run 2 processes
on one machine and 2 on another one. To do this you'll run:

On one of the machines, in two different consoles, run:

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=1

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=2

And on the other machine:

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=3

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=4