From 0 to IEPY
==============

In this tutorial we will guide you through the steps to create your first
Information Extraction application with IEPY.
Be sure you have a working :doc:`installation <installation>`.

IEPY internally uses `Django <https://www.djangoproject.com/>`_ to define the database models
and to provide a web interface. You'll see some components of Django around the project, such as the
configuration file (with the database definition) and the ``manage.py`` utility. If you're familiar
with Django, you will move faster in some of the steps.

0 - Creating an instance of IEPY
--------------------------------

To work with IEPY, you'll have to create an *instance*.
This is where the configuration, database and some binary files are stored.
To create a new instance, run:

.. code-block:: bash

    iepy --create <project_name>

Where *<project_name>* is a name of your choice.
This command will ask you a few things, such as the database name, its username and its password.
When that's done, you'll have an instance in a folder with the name that you chose.

Read more about the instantiation process :doc:`here <instantiation>`.

1 - Loading the database
------------------------

Data is loaded into the database by importing it from a *csv* file. You can use the script **csv_to_iepy**
provided in your application folder to do it.

.. code-block:: bash

    python bin/csv_to_iepy.py data.csv

This will load **data.csv** into the database, from which the data will subsequently be accessed.

Learn more about the required CSV file format `here <instantiation.html#csv-importer>`_.

.. note::

    You can also provide a *gzipped csv file*.

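
If your documents come from another program, a gzipped csv file can be produced with Python's
standard library. The following is only a minimal sketch: the column names ``document_id`` and
``document_text`` and the file name ``data.csv.gz`` are assumptions made for illustration, so check
the CSV importer documentation linked above for the format that IEPY actually requires.

.. code-block:: python

    import csv
    import gzip


    def write_gzipped_csv(path, rows, header=("document_id", "document_text")):
        """Write rows to a gzip-compressed CSV file.

        The default header is a placeholder (an assumption), not a
        confirmed IEPY requirement.
        """
        # newline="" lets the csv module control line endings itself.
        with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)


    def read_gzipped_csv(path):
        """Read a gzip-compressed CSV file back as a list of rows."""
        with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
            return list(csv.reader(handle))


    if __name__ == "__main__":
        documents = [
            ("doc-1", "First example document."),
            ("doc-2", "Second example document, with a comma."),
        ]
        write_gzipped_csv("data.csv.gz", documents)

The resulting file would then presumably be passed to ``csv_to_iepy.py`` in place of the plain csv file.
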
2 - Pre-processing the data
---------------------------

Once you have your database with the documents you want to analyze, you have to
run them through the pre-processing pipeline to generate all the information needed by IEPY's core.

The pre-processing pipeline runs a series of steps such as
text tokenization, sentence splitting, lemmatization, part-of-speech tagging,
and named entity recognition.

:doc:`Read more about the pre-processing pipeline here. <preprocess>`

Your IEPY application comes with code to run all the pre-processing steps.
You can run it with:

.. code-block:: bash

    python bin/preprocess.py

This *will* take a while, especially if you have a lot of data.

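
To get an intuition for what the first two steps named above produce, here is a deliberately naive
sketch of sentence splitting and tokenization using regular expressions. It is not IEPY's
implementation (the real pipeline is described in the pre-processing documentation), and the function
names ``split_sentences`` and ``tokenize`` are made up for this illustration.

.. code-block:: python

    import re


    def split_sentences(text):
        """Split text into sentences with a naive rule.

        A sentence is assumed to end at '.', '!' or '?' followed by whitespace.
        """
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


    def tokenize(sentence):
        """Split a sentence into word tokens and single punctuation marks."""
        return re.findall(r"\w+|[^\w\s]", sentence)


    if __name__ == "__main__":
        text = "IEPY extracts relations. It uses Django!"
        for sentence in split_sentences(text):
            print(tokenize(sentence))

Later steps such as lemmatization, part-of-speech tagging and named entity recognition build on
tokens and sentences like these, which is part of why pre-processing a large corpus takes time.
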
3 - Open the web interface
--------------------------

To help you control IEPY, you have a web user interface.
Here you can manage your database objects and label the information
that the active learning core will need.

To access the web UI, you must run the web server. Don't worry, you have everything
that you need in your instance folder, and it's as simple as running:

.. code-block:: bash

    python bin/manage.py runserver

Leave that process running, and open a browser at `http://127.0.0.1:8000 <http://127.0.0.1:8000>`_ to view
the user interface home page.

Now it's time for you to *create a relation definition*. Use the web interface to create the relation that you
are going to use.

IEPY
----

Now, you're ready to run either the :doc:`active learning core <active_learning_tutorial>`
or the :doc:`rule based core <rules_tutorial>`.

Constructing a reference corpus
-------------------------------

To test information extraction performance, IEPY provides a tool for labeling the entire corpus "by hand"
and then checking the performance by experimenting with that data.

If you would like to create a labeled corpus to test the performance or for other purposes, take a look at
the :doc:`corpus labeling tool <corpus_labeling>`.