About the Pre-Process
=====================

The preprocessing adds the metadata that iepy needs to detect the relations, which includes:

* Text tokenization and sentence splitting.
* Text lemmatization.
* Part-Of-Speech (POS) tagging.
* Named Entity Recognition (NER).
* Gazettes resolution.
* Syntactic parsing.
* TextSegments creation (internal IEPY text unit).

We're currently running all these steps (except the last one) using the `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`_ tools.
They run as a single all-in-one pass, but every step can be :ref:`modified to use a custom version <customize>` that adjusts to your needs.
About the Tokenization and Sentence splitting
---------------------------------------------

The text of each Document is split into tokens and sentences, and that information is stored
on the document itself, preserving (and also storing) for each token its offset (in chars)
into the original document text.

The tokenizer used by default is the one that `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`_ provides.
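For illustration only, the following sketch (plain Python, not IEPY's actual API or storage format) shows the kind of token/offset alignment that gets stored:

.. code-block:: python

    # Illustrative sketch: each token keeps the character offset at which it
    # starts in the original document text.
    text = "Join the dark side."
    tokens = ["Join", "the", "dark", "side", "."]
    offsets = [0, 5, 9, 14, 18]   # char offset of each token in `text`

    # Every token can be recovered from the original text via its offset.
    assert all(text[off:off + len(tok)] == tok
               for tok, off in zip(tokens, offsets))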
.. note::

    While using the Stanford tokenizer, you can customize some of the tokenization options.
    First read here: `tokenizer options <http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html>`_

    On your instance's *settings.py* file, add options as keys on the CORENLP_TKN_OPTS dict.
    You can use as key any of the "known options", and as value,
    use True or False for booleans, or just strings when the option requires a text.

    Example:

    .. code-block:: python

        CORENLP_TKN_OPTS = {
            'latexQuotes': False
        }
Lemmatization
-------------

.. note::

    Lemmatization was added in version 0.9.2; all instances created before that
    need to run the preprocess script again. This will run only the lemmatization step.

The text runs through a lemmatization step where each token gets a lemma. This is a canonical form of the word that
can be used in the classifier features or the rules core.
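As a rough illustration (plain Python, not the CoreNLP output format), the step produces one lemma per token, aligned by position:

.. code-block:: python

    # Illustrative sketch of the token -> lemma alignment produced by this step.
    tokens = ["We", "have", "cookies"]
    lemmas = ["we", "have", "cookie"]   # canonical form, one per token

    # A rule or classifier feature can then match on lemmas instead of surface
    # forms, e.g. "cookie" matches both "cookie" and "cookies".
    token_lemma_pairs = list(zip(tokens, lemmas))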
Part of speech tagging (POS)
----------------------------

Each token is augmented with metadata about its part of speech, such as noun, verb,
adjective and other grammatical tags.
Along with the token itself, this may be used by the NER to detect an entity occurrence.
This information is also stored on the Document itself, together with the tokens.
The POS tagger used by default is the one that `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`_ provides.
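As an illustration (using the Penn Treebank tagset that Stanford CoreNLP emits for English), the stored metadata is essentially one tag per token:

.. code-block:: python

    # Illustrative sketch of per-token POS metadata.
    tokens = ["We", "have", "cookies"]
    pos_tags = ["PRP", "VBP", "NNS"]   # pronoun, present-tense verb, plural noun

    # Token and tag sequences are aligned by position.
    tagged = list(zip(tokens, pos_tags))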
Named Entity Recognition (NER)
------------------------------

To find a relation between entities one must first recognize these entities in the text.

As a result of NER, each document is augmented with information about all the found
Named Entities (together with which tokens are involved in each occurrence).

An automatic NER is used to find occurrences of an entity in the text.
The default pre-process uses the Stanford NER; check the Stanford CoreNLP's `documentation <http://nlp.stanford.edu/software/corenlp.shtml>`_
to find out which entity kinds are supported, but the list includes:

* Location
* Person
* Organization
* Date
* Number
* Time
* Money
* Percent

Other remarkable features of this NER (that are incorporated into the default pre-process) are:

- pronoun resolution
- simple co-reference resolution

This step can be customized to find entities of kinds defined by you, or anything else you may need.
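As a sketch of what this step contributes (field names here are made up for the example, not IEPY's actual schema), each occurrence records the entity kind plus the tokens it covers:

.. code-block:: python

    # Illustrative sketch: entity occurrences found in one sentence,
    # expressed as an entity kind plus the token span it covers.
    tokens = ["John", "Smith", "works", "at", "Acme", "Corp", "."]
    entity_occurrences = [
        {"kind": "PERSON",       "token_span": (0, 2)},   # "John Smith"
        {"kind": "ORGANIZATION", "token_span": (4, 6)},   # "Acme Corp"
    ]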
Gazettes resolution
-------------------

In case you want to add named entity recognition by matching literals, iepy provides a system of gazettes.
This is a mapping of literals to entity kinds that is run on top of the basic Stanford NER.
With this, you'll be able to recognize entities beyond the ones found by the Stanford NER, or even correct
those that are incorrectly tagged.
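Conceptually (the real gazette format and loading procedure are described in the gazettes documentation), a gazette is just a list of literals mapped to the entity kind they should be tagged as:

.. code-block:: python

    # Conceptual sketch only; see the gazettes documentation for the real format.
    gazette = [
        ("Death Star", "LOCATION"),
        ("Darth Vader", "PERSON"),
    ]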
:doc:`Learn more about it here. <gazettes>`
Syntactic parsing
-----------------

.. note::

    Syntactic parsing was added in version 0.9.3; all instances created before that
    need to run the preprocess script again. This will run only the syntactic parsing step.

The sentences are parsed to work out the syntactic structure. Each sentence gets a structure tree
that is stored in `Penn Treebank notation <http://en.wikipedia.org/wiki/Treebank>`__. IEPY presents
this to the user as a `NLTK Tree object <http://www.nltk.org/howto/tree.html>`__.

By default the sentences are processed with the `Stanford Parser <http://nlp.stanford.edu/software/lex-parser.shtml>`__
provided within the `Stanford CoreNLP <http://nlp.stanford.edu/software/corenlp.shtml>`__.

For example, the syntactic parsing of the sentence ``Join the dark side, we have cookies`` would be:

::

    (ROOT
      (S
        (S
          (VP (VBN Join)
            (NP (DT the) (JJ dark) (NN side))))
        (, ,)
        (NP (PRP we))
        (VP (VBP have)
          (NP (NNS cookies)))))
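Since the trees are stored in Penn Treebank notation and exposed as NLTK ``Tree`` objects, you can inspect them with NLTK directly. A minimal sketch (the parse string here is typed by hand, not produced by IEPY):

.. code-block:: python

    from nltk.tree import Tree

    # Load a Penn Treebank string into an NLTK Tree and inspect it.
    parse = "(ROOT (S (NP (PRP we)) (VP (VBP have) (NP (NNS cookies)))))"
    tree = Tree.fromstring(parse)

    print(tree.leaves())    # ['we', 'have', 'cookies']
    tree.pretty_print()     # ASCII drawing of the tree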
About the Text Segmentation
---------------------------

IEPY works on a **text segment** (or simply **segment**) level, meaning that it will
try to find out if a relation is present within a segment of text. The
pre-process is responsible for splitting the documents into segments.

The default pre-process uses a segmenter that creates segments with the following criterion (sketched below):

* one segment for each sentence of the document that contains at least 2 Entity Occurrences
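A minimal sketch of that criterion (pseudo-code, not IEPY's actual segmenter):

.. code-block:: python

    # Sketch of the default segmentation criterion: keep every sentence
    # that contains at least two entity occurrences.
    def select_segments(sentences):
        for sentence in sentences:
            if len(sentence.entity_occurrences) >= 2:   # hypothetical attribute
                yield sentence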
.. _customize:

How to customize
----------------

On your own IEPY instances, there's a file called ``preprocess.py`` located in the ``bin`` folder.
There you'll find that the default is to simply run the Stanford preprocess, and later the segmenter.
This can be changed to run a sequence of steps defined by you.

For example, take this pseudo-code to guide you:
.. code-block:: python

    pipeline = PreProcessPipeline([
        CustomTokenizer(),
        CustomSentencer(),
        CustomLemmatizer(),
        CustomPOSTagger(),
        CustomNER(),
        CustomSegmenter(),
    ], docs)

    pipeline.process_everything()
.. note::

    The steps can be functions or callable objects. We recommend objects because generally you'll
    want to do some loading of resources on the ``__init__`` method to avoid loading everything over and over again.

Each one of those steps will be called with every document: a step is run over all the documents,
and only after it finishes is the next step called, again with each one of the documents.
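A minimal sketch of such a step written as a callable object (all names here are illustrative, not IEPY's API):

.. code-block:: python

    class CustomLemmatizer:
        """Illustrative custom pre-process step."""

        def __init__(self):
            # Load heavy resources once, instead of once per document.
            self.model = load_lemmatizer_model()   # hypothetical loader

        def __call__(self, document):
            # Called once per document by the pipeline.
            lemmas = [self.model.lemmatize(token) for token in document.tokens]
            store_lemmas(document, lemmas)          # hypothetical storage helper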
Running in multiple cores
-------------------------

Preprocessing might take a lot of time. To handle this you can run the preprocessing on several cores of the
same machine, or even run it on different machines, to speed up the processing.

To run it on the same machine using multiple cores, all you need to do is run:

.. code-block:: bash

    $ python bin/preprocess.py --multiple-cores=all

This will use all the available cores. You can also specify a number if you want to
use fewer than that, like this:

.. code-block:: bash

    $ python bin/preprocess.py --multiple-cores=2
Running in multiple machines
----------------------------

Running the preprocess on different machines is a bit tricky; here's what you'll need:

* An iepy instance with a database that allows remote access (such as postgres)
* One iepy instance on each extra machine, with the database setting pointing to the main one.

Then you'll need to decide how many parts you want to split the document set into,
and run each part on a different machine. For example, you could split the documents in 4, and run 2 processes
on one machine and 2 on another one.

To do this, on one of the machines, in two different consoles, run:

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=1

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=2

And on the other machine:

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=3

.. code-block:: bash

    $ python bin/preprocess.py --split-in=4 --run-part=4