NAME

Treex::Tutorial::Install - Installation Guide for the Treex NLP framework

VERSION

version 2.20150928

SYNOPSIS

This synopsis is just an overview of the six steps, which are described below in more detail.

1. Prepare your local Perl environment, so Perl modules will be installed to ~/perl5.

We assume you have no admin rights and no previous local Perl environment. If env | grep PERL prints something or the directory ~/.cpan exists, there is a risk that your previously installed local Perl environment will conflict with the new one.

 wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib
 eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
 echo '## Treex installation ##'                        >> ~/.bashrc
 echo 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`'   >> ~/.bashrc
 grep bashrc ~/.bash_profile || echo 'source ~/.bashrc' >> ~/.bash_profile

2. Install Treex::Core and its dependencies from CPAN

 # First, try to install XML::LibXML
 cpanm XML::LibXML
 # If it fails and the build.log contains "looking for -lxml2... no",
 # you are probably missing libxml2 header files (or the whole libxml2).
 # On Ubuntu/Debian you can install it with
 #  sudo apt-get install libxml2-dev zlib1g-dev
 # A few more possibly problematic modules
 cpanm -n PerlIO::Util
 cpanm Moose
 moose-outdated | cpanm
 # and finally the Treex::Core and its dependencies
 cpanm Treex::Core # this may take about 10 minutes
 treex -h          # just to check if it was installed correctly

3. Install Treex modules for processing English

 cpanm Treex::EN
 cpanm Lingua::Interset URI::Find Cache::LRU

4. Install TrEd tree viewer and editor (optional)

See the TrEd home page for details. To install the Perl Tk module, you need several header files. On Ubuntu/Debian you can install them with sudo apt-get install libx11-dev libxft-dev libfontconfig1-dev libpng12-dev patch.

 # Get a script which automatically downloads and builds everything else
 wget http://ufal.mff.cuni.cz/tred/install_tred.bash
 bash install_tred.bash --tred-dir ~/tred
 # Instruct Treex where to find TrEd and its dependencies
 echo "tred_dir: $HOME/tred" >> ~/.treex/config.yaml
 echo "source ~/tred/bin/init_tred_environment" >> ~/.bashrc
 source ~/tred/bin/init_tred_environment
 ttred # run TrEd with Treex extension

5. Download the newest version of the whole Treex from GIT repository (optional)

 git clone https://github.com/ufal/treex.git ~/treex
 # Add the following lines to your ~/.bashrc
 export PATH="$HOME/treex/bin:$PATH"
 export PERL5LIB="$HOME/treex/lib:$PERL5LIB"
 export TMT_ROOT=$HOME/.treex

6. Install MorphoDiTa tagger and NameTag NER (optional)

 cpanm Ufal::MorphoDiTa Ufal::NameTag

Prerequisites

In this tutorial, we expect a Linux OS with the Bash shell and Perl 5.10 or higher. Basic development tools, such as make, patch, and a C compiler (gcc), are also required. You can easily use a different shell (e.g. csh); just modify the shell commands accordingly. It is also possible to install Treex on MacOS and Windows+StrawberryPerl, but this is less tested so far. If you have a Perl version older than 5.10 (or if you just want to try the newest Perl), you can install your own Perl using perlbrew, which is really simple.
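The prerequisites above can be checked before you start. The following is only a minimal sketch and not part of Treex; the helper name missing_tools is made up for this example.

```shell
# Hypothetical helper (not part of Treex): print every command from the
# argument list that cannot be found in $PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools this tutorial expects.
missing=$(missing_tools perl make patch gcc)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
elif perl -e 'exit($] >= 5.010 ? 0 : 1)'; then
    # $] holds the Perl version as a number, e.g. 5.010001 for Perl 5.10.1
    echo "Perl is 5.10 or newer"
else
    echo "Perl is older than 5.10; consider perlbrew"
fi
```

If the script reports a missing tool, install it with your system package manager before continuing.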

Note that if you have Windows and only want to browse *.treex files, you can install TrEd and select the EasyTreex extension (in the menu Setup - Manage Extensions - Get New Extension). However, to complete this tutorial you need to install Treex (and set up TrEd) as described below, so EasyTreex is superfluous.

1. Prepare local Perl environment

In order to install Treex, you must be able to install Perl modules from CPAN. This step is not specific to Treex; it is a basic Perl skill. There are several ways to achieve this, but I consider the one described here the easiest. There are two things you should be aware of:

  • If env | grep PERL prints something or directory ~/.cpan exists (ls -l ~/.cpan), it is probable that you have already configured a local Perl environment. In such a case it is important to

    either reuse the environment and skip this step

    If you used local::lib or perlbrew to set the environment, it should be configured properly and you can continue with step 2 (if you want to use cpanm, install it by cpan App::cpanminus). If you used another method, such as modifying $PERL5LIB in your ~/.bashrc or setting PREFIX or INSTALL_BASE options in cpan configuration, there is a possibility that your previously installed local Perl environment is configured only partially and the procedure described here may fail. If you decide to reuse your previous local Perl environment, the modules will be installed to whatever path you had chosen (instead of ~/perl5) and you should skip this step 1 (otherwise the installation fails with "WHOA THERE! It looks like you've got ..." in ~/.cpanm/build.log).

    or deactivate completely the environment before doing this step.

    If you do not need/want to use your previous local Perl environment, you should delete (rename) the ~/.cpan directory and edit your shell profile (~/.bashrc, ~/.profile etc), so no Perl-related variables (such as PERL5LIB, PERL_MB_OPT, PERL_MM_OPT) are exported. After running a new shell (new ssh session), env | grep PERL should print nothing.

  • If you have root access and really want to install Treex and its dependencies for all users to system paths (/usr/lib etc.), just skip this step (you can install cpanm by sudo cpan App::cpanminus). However, in the course of this tutorial you will be advised to modify some of the modules (Treex::Block::Tutorial::*), so it may be a good compromise to install only the dependencies to system paths using sudo cpanm --installdeps Treex::Core, but otherwise follow this local Perl setup.
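The two checks from the first item (env | grep PERL and the existence of ~/.cpan) can be combined into one function. This is just a sketch; has_local_perl_env is a hypothetical name, and the optional argument (the home directory to inspect) is added only so that the function can be tried on any directory.

```shell
# Hypothetical helper: succeed (exit status 0) if there are signs of a
# previously configured local Perl environment.
# Optional argument: the home directory to inspect (default: $HOME).
has_local_perl_env() {
    local home_dir="${1:-$HOME}"
    # Same test as "env | grep PERL": any PERL5LIB, PERL_MB_OPT, ... set?
    if env | grep -q PERL; then
        return 0
    fi
    # A .cpan directory left behind by the standard cpan client
    [ -d "$home_dir/.cpan" ]
}

if has_local_perl_env; then
    echo "A local Perl environment seems to exist: reuse it or deactivate it first."
else
    echo "No previous local Perl environment detected."
fi
```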

Download and install locally two useful tools (Perl modules) – cpanm and local::lib:

 wget -O- http://cpanmin.us | perl - -l ~/perl5 App::cpanminus local::lib

App::cpanminus provides the cpanm script, which is a fast, dependency-free, zero-configuration substitute for the standard cpan. local::lib takes care of setting all the environment variables needed to install modules without administrative privileges.

Instead of wget -O-, you can use curl -L or simply download cpanm from http://cpanmin.us, save it as cpanm and run perl cpanm -l ~/perl5 App::cpanminus local::lib. Instead of ~/perl5, you can use any path you like, but ~/perl5 is a common standard used in this tutorial.

In the following steps, you can use cpan instead of cpanm. The advantage is that you can start an interactive cpan shell which provides more features (I recommend first installing Bundle::CPAN and Term::ReadLine::Perl, so you can browse the history using the up/down keys). The disadvantage is that you cannot use it for installing local::lib locally before local::lib is installed :-). Also, you will need to go through a configuration dialogue when cpan is executed for the first time.

 eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`
 echo '## Treex installation ##'                        >> ~/.bashrc
 echo 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`'   >> ~/.bashrc
 grep bashrc ~/.bash_profile || echo 'source ~/.bashrc' >> ~/.bash_profile

The first line sets up the environment variables $PERL5LIB, $PATH, $PERL_MB_OPT etc. for the current shell session. It enables you to use the modules installed in ~/perl5 (without specifying this path using perl -I) and it also ensures that new modules will be installed (using cpanm or cpan) to ~/perl5 (not to the system paths). The second line only adds a marker comment to ~/.bashrc. The third line ensures that this setting will also be applied in other (non-login) shell sessions. The fourth line ensures that this setting will also be applied in "login" shell sessions (e.g. when you log in via ssh). If you prefer to use ~/.profile instead of ~/.bash_profile, adapt the fourth line accordingly.
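Note that running the echo commands twice appends the same lines to ~/.bashrc twice. If you expect to repeat the setup, an idempotent variant may be handy. The following is a sketch only; append_once is a hypothetical helper, and the example writes to a scratch file instead of your real ~/.bashrc.

```shell
# Hypothetical helper: append a line to a file only if the file does not
# already contain exactly that line.
append_once() {
    local line="$1" file="$2"
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example on a scratch file; for the real setup the file would be ~/.bashrc.
rc_file=$(mktemp)
append_once '## Treex installation ##' "$rc_file"
append_once 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' "$rc_file"
append_once 'eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib`' "$rc_file"   # no effect
cat "$rc_file"
```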

2. Install Treex::Core from CPAN

Treex is divided into several CPAN distributions. Treex::Core contains the main ("core") functionality and almost all other Treex modules depend on it. Treex::Core itself has many dependencies, most notably Moose and Treex::PML (which have many dependencies of their own, and so on), so the installation takes several minutes. One of the most frequent installation problems is that the Perl module XML::LibXML, which is a binding for the libxml2 library, needs not only the library but also its header files (*.h). So let's first check whether you can install XML::LibXML:

 cpanm XML::LibXML

If it fails and ~/.cpanm/build.log contains "Cannot write to /usr/lib/ ... XML/SAX.pm line 191", try to run it again and it should show that it was actually installed. If it fails and ~/.cpanm/build.log contains "looking for -lxml2... no", you are probably missing the header files or the whole library. On Ubuntu/Debian you can install it with:

 sudo apt-get install libxml2-dev zlib1g-dev

If you know a simple way to do this without admin privileges, let me know. You can check for the packages with LANG=C dpkg-query -s libxml2-dev zlib1g-dev 2>&1 | grep Package. On other systems (e.g. RPM based), try to find similarly named packages (libxml2-devel), or look at http://xmlsoft.org.
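On systems without dpkg-query, the presence of the libxml2 development files can often be detected with xml2-config or pkg-config. This is a sketch under the assumption that at least one of these tools is installed (neither is guaranteed to be present); the function name has_libxml2_dev is made up for this example.

```shell
# Hypothetical helper: succeed if the libxml2 development files seem to be
# installed. Falls back from xml2-config to pkg-config; fails if neither
# tool is available.
has_libxml2_dev() {
    if command -v xml2-config >/dev/null 2>&1; then
        xml2-config --version >/dev/null 2>&1
    elif command -v pkg-config >/dev/null 2>&1; then
        pkg-config --exists libxml-2.0
    else
        return 1
    fi
}

if has_libxml2_dev; then
    echo "libxml2 development files found"
else
    echo "libxml2 development files not found (or cannot be detected)"
fi
```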

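There are a few other possibly problematic modules. PerlIO::Util has known (and reported) issues, so it is installed here with -n (short for --notest), which skips its tests. After Moose is installed, moose-outdated lists the installed modules that conflict with the new Moose, and piping this list to cpanm upgrades them: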

 cpanm -n PerlIO::Util
 cpanm Moose
 moose-outdated | cpanm

Now, the installation of Treex::Core should be smooth (but it takes more than 8 minutes if no dependencies were installed before):

 cpanm Treex::Core

Rarely, you may encounter problems with installing some modules. In that case, you should find the first module where something went wrong. You can read the documentation of the module, check its bug tracker, try to install it manually etc. If you cannot diagnose and fix the failure, you may try to install it with the --prompt, --force or --notest options, but this may cause trouble later on.

 treex -h

treex is the main Treex script. treex -h should just print the usage information and exit. Its actual usage will be described later on in this tutorial (Treex::Tutorial::FirstSteps); running the command serves here only as a check that treex was installed and can be found in the $PATH. The installation created a configuration file ~/.treex/config.yaml which will be described in Treex::Tutorial::Config.

3. Install Treex::EN from CPAN

Treex Core itself has no modules for any particular NLP task. There is a separate distribution, Treex-Unilang, for such modules that are language independent. In this tutorial, we will mainly work with English, so you need to install the distribution Treex-EN, which contains only modules specific to English. It depends on Treex-Unilang, so both distributions can be installed by:

 cpanm Treex::EN
 cpanm Lingua::Interset URI::Find Cache::LRU

4. Install TrEd

TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Although TrEd visualization of the linguistic trees produced by Treex can be very helpful, it is not required, i.e. Treex is fully functional even without installing TrEd.

To install the Perl Tk module, you may need to install some header files, and TrEd also needs the patch tool. On Ubuntu/Debian you can install these prerequisites using:

 sudo apt-get install libx11-dev libxft-dev libfontconfig1-dev libpng12-dev patch

Now, download a small installation script:

 wget http://ufal.mff.cuni.cz/tred/install_tred.bash

You can type bash install_tred.bash -h to see the installation options. To automatically download and build the latest TrEd and its dependencies to ~/tred, use:

 bash install_tred.bash --tred-dir ~/tred

You can run ~/tred/bin/start_tred to check the GUI. When a dialog box "Manage extensions" appears, you can ignore it (click on "Later").

Treex Core contains an extension for TrEd, which enables it to open *.treex, *.treex.gz and *.streex files and use the Treex stylesheet. Treex Core also contains a simple wrapper script ttred which runs TrEd with this extension enabled (pre-installed). We must instruct Treex where to find TrEd:

 echo "tred_dir: $HOME/tred" >> ~/.treex/config.yaml

TrEd installed some of its dependencies to ~/tred/dependencies, but we want to make them permanently available for Treex (and all Perl modules):

 echo "source ~/tred/bin/init_tred_environment" >> ~/.bashrc
 source ~/tred/bin/init_tred_environment

Finally, you can run TrEd with the Treex extension enabled:

 ttred

5. Download Treex from GIT repository

Some Treex modules are not mature enough to be released on CPAN. You may also want to test the newest Treex version or commit your own code to the repository. So let's create your local clone of Treex in ~/treex.

 git clone https://github.com/ufal/treex.git ~/treex

You need to include the path to the downloaded modules in your $PERL5LIB. Add the following lines to the end of your ~/.bashrc:

 export PATH="$HOME/treex/bin:$PATH"
 export PERL5LIB="$HOME/treex/lib:$HOME/treex/oldlib:$PERL5LIB"
 export TMT_ROOT=$HOME/.treex

It is important that these lines follow eval `perl -I ~/perl5/lib/perl5 -Mlocal::lib` in your ~/.bashrc, so that a GIT module is preferred over a CPAN module of the same name. To apply the setting for the current bash session, type the three export commands or start a new session. You can check it with:

 echo $PERL5LIB  # ~/treex/lib should precede ~/perl5/...
 treex -v        # should print "Treex version: DEV from..."

Now you can use Perl modules that were not installed from CPAN (but were downloaded from GIT). Some of the modules may have dependencies that you have not installed. When you load such a module (e.g. by running treex), it will fail with an error message like "Can't locate Acme/Time/Baby.pm in @INC (@INC contains:...". You can install the missing dependencies (Acme::Time::Baby in this imaginary example) simply with:

 cpanm Acme::Time::Baby
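Translating the file name from the error message into a module name (Acme/Time/Baby.pm becomes Acme::Time::Baby) can be automated. The following is a sketch, not a Treex tool; missing_modules is a hypothetical helper that reads error output on standard input.

```shell
# Hypothetical helper: read Perl error output on stdin and print the module
# name for every "Can't locate Foo/Bar.pm in @INC" message, one per line.
missing_modules() {
    sed -n "s|.*Can't locate \([A-Za-z0-9_/]*\)\.pm in @INC.*|\1|p" \
        | sed 's|/|::|g' \
        | sort -u
}

# Example with the imaginary error message from the text:
echo "Can't locate Acme/Time/Baby.pm in @INC (@INC contains: /usr/lib/perl5)" \
    | missing_modules
# prints: Acme::Time::Baby
```

In practice you would feed it real output, e.g. treex -h 2>&1 | missing_modules, and pass the printed names to cpanm.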

If you happen to need any of the modules CzechMorpho, Morce::Czech and Morce::English, you must build and install them manually. These modules were not released on CPAN, and because they are XS-based (they involve compiling C code), simply downloading them is not enough.

 svn --username public export $SVN_TRUNK/libs/packaged /tmp/packaged
 cd /tmp/packaged/Morce-English
 perl Build.PL
 ./Build
 ./Build test
 ./Build install --prefix $HOME/perl5/lib/perl5

In the same way, you can install CzechMorpho and Morce-Czech (in this order because the latter depends on the former).

6. Install MorphoDiTa tagger and NameTag NER (optional)

MorphoDiTa is an open-source tool for morphological analysis of natural language texts. Currently there is a Perl module, Ufal::MorphoDiTa, available on CPAN providing bindings to the MorphoDiTa library. This module is necessary for running Treex::Tool::Tagger::MorphoDiTa and consequently Treex::Block::W2A::EN::TagMorphoDiTa.

To compile the module, a C++11 compiler is needed: either g++ 4.7 or newer, or clang 3.2 or newer. You can check whether the required compiler is installed on your computer:

 g++ --version
 # Or alternatively ...
 clang --version
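Comparing the reported version with the required one can be scripted as well. This is a sketch assuming GNU sort (for the -V version sort); version_ge is a hypothetical helper, and only the g++ check is shown.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v g++ >/dev/null 2>&1; then
    # -dumpversion prints only the version number; newer GCC releases may
    # print just the major version (e.g. "9").
    gxx_version=$(g++ -dumpversion)
    if version_ge "$gxx_version" 4.7; then
        echo "g++ $gxx_version is new enough"
    else
        echo "g++ $gxx_version is older than 4.7"
    fi
else
    echo "g++ not found"
fi
```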

If no suitable compiler is installed, install one. On Ubuntu/Debian etc. use this command:

 sudo apt-get install g++ 

If the installed compiler version is too old, upgrade it. On Ubuntu/Debian etc. use these commands:

 sudo apt-get update
 sudo apt-get install --only-upgrade g++

Finally, you can install the module:

 cpanm Ufal::MorphoDiTa

Another useful tool is Ufal::NameTag, a tool for named entity recognition. It should have similar prerequisites to Ufal::MorphoDiTa, so if you followed the previous steps, just install the module.

 cpanm Ufal::NameTag

Uninstall

Although there is no standardized way to uninstall Perl modules, in most cases it is enough to delete the respective files and directories. If you followed this installation guide, had nothing in ~/perl5 before, and want to remove everything that was installed, you can delete the directories ~/perl5, ~/treex, ~/.treex, ~/.tred and ~/.cpanm. You can also delete the added lines from ~/.bashrc (starting with ## Treex installation ##) and ~/.bash_profile.
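Because deleting directories cannot be undone, it may be safer to print the commands first and run them only after a review. The following sketch does exactly that; print_uninstall_commands is a hypothetical helper, and it does not quote paths, so it assumes a home directory without spaces.

```shell
# Hypothetical helper: print (but do not run) the commands that would remove
# the directories listed above, skipping those that do not exist.
# Optional argument: the home directory (default: $HOME).
print_uninstall_commands() {
    local home_dir="${1:-$HOME}" dir
    for dir in perl5 treex .treex .tred .cpanm; do
        if [ -d "$home_dir/$dir" ]; then
            printf 'rm -rf -- %s\n' "$home_dir/$dir"
        fi
    done
}

# Nothing is deleted here; the commands are only printed.
print_uninstall_commands
```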

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

Dušan Variš <varis@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2012 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.