NAME

Lingua::YALI::Examples - Usage examples.

VERSION

version 0.016

Introduction

Basic information about the YALI package can be found in Lingua::YALI.

This documentation introduces the most important commands for using the YALI package.

Preparation

The examples in this documentation use texts from Wikipedia, so we start by downloading 20 random articles each for Czech, English, and French.

    # download data
    for i in `seq 1 20`; do
        id=`printf "%02d" $i`;
        echo "Processing document $id";
        lynx --dump 'http://en.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > eng.$id.txt;
        lynx --dump 'http://cs.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > ces.$id.txt;
        lynx --dump 'http://fr.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > fra.$id.txt;
    done;

    # create list of files for training
    ls ces.* | head -n15 > list.ces.train;
    ls eng.* | head -n15 > list.eng.train;
    ls fra.* | head -n15 > list.fra.train;

    # create list of files for testing
    ls ces.* | tail -n5 > list.ces.test;
    ls eng.* | tail -n5 > list.eng.test;
    ls fra.* | tail -n5 > list.fra.test;
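
The split above relies on ls listing the zero-padded file names in sorted order, so that head -n15 and tail -n5 select disjoint sets of files. The following is a minimal sketch that reproduces the split on empty placeholder files and checks that no file lands in both lists. The scratch directory and the placeholder files are assumptions made for illustration; they are not part of YALI.

```shell
# work in a throwaway directory so real data is not touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# create 20 empty placeholder files named like the downloaded articles
for i in $(seq 1 20); do
    id=$(printf "%02d" "$i")
    : > "ces.$id.txt"
done

# same split as above: first 15 files for training, last 5 for testing
ls ces.*.txt | head -n15 > list.ces.train
ls ces.*.txt | tail -n5 > list.ces.test

# count file names that appear in both lists; 0 means the split is disjoint
overlap=$(sort list.ces.train list.ces.test | uniq -d | wc -l | tr -d ' ')
echo "files in both lists: $overlap"
```

The same check can be repeated for the eng and fra lists by changing the prefix.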

Scripts

This section shows how to use the scripts yali-builder, yali-identifier, and yali-language-identifier.

Language Identification with Pretrained Models

The script yali-language-identifier is distributed with pretrained language models for 122 languages.

    # check out possible options
    yali-language-identifier --help

    # language identification for Czech files
    # option --filelist
    yali-language-identifier -l="eng ces fra" --filelist=list.ces.test

    # language identification for English files with different output format
    # option -f (--format)
    yali-language-identifier -l="eng ces fra" --filelist=list.eng.test -f=all_p

    # language identification for French files read from STDIN
    # --filelist=- reads the list of files from STDIN
    cat list.fra.test | yali-language-identifier -l="eng ces fra" --filelist=- -f=tabbed

    # identify only a single file
    # option -i (--input)
    yali-language-identifier -l="eng ces fra" -i=ces.20.txt -f=all

    # single file read from STDIN
    # -i=- reads the file content from STDIN
    cat eng.20.txt | yali-language-identifier -l="eng ces fra" -i=- -f=all_p

    # single file read from STDIN
    # when neither --filelist nor --input is given, -i=- is assumed
    cat fra.20.txt | yali-language-identifier -l="eng ces fra" -f=all_p

Building Your Own Models

If your texts come from a specific domain, it is worth training your own models on texts from that domain to achieve higher accuracy.

The options --filelist and --input have the same meaning as in "Language Identification with Pretrained Models".

    # check out possible options
    yali-builder --help

    # create Czech bigram model with only the 5 most frequent bigrams stored
    # option -n (--ngram) for specifying n-gram size to 2
    # option -c (--count) for storing only 5 most frequent bigrams
    # option -o (--output) for specifying output file name
    yali-builder --filelist=list.ces.train -n=2 -c=5 -o model.2.5.ces.gz

    # create English bigram model with all bigrams stored
    # option -c is omitted, which means that all bigrams are stored
    cat list.eng.train | yali-builder --filelist=- -n=2 -o model.2.5.eng.gz

    # create French bigram model with only the 5 most frequent bigrams stored
    # option -i=- means that the training text itself is read from STDIN
    cat list.fra.train | xargs cat | yali-builder -i=- -n=2 -c=5 -o model.2.5.fra.gz

    # create list with models
    # list of models in format class1[TAB]path-to-model is required for identification
    echo -e "ces\tmodel.2.5.ces.gz" > list.models.2
    echo -e "eng\tmodel.2.5.eng.gz" >> list.models.2
    echo -e "fra\tmodel.2.5.fra.gz" >> list.models.2
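
The behaviour of echo -e differs between shells, so the tab may not be expanded everywhere. As a hedged alternative, the sketch below writes the same list with printf and then checks that every line has exactly two tab-separated fields. The awk check is only an illustration and is not part of YALI.

```shell
# write the class[TAB]path list with printf, which expands \t in every POSIX shell
for lang in ces eng fra; do
    printf '%s\t%s\n' "$lang" "model.2.5.$lang.gz"
done > list.models.2

# count lines that do not have exactly two tab-separated fields
bad=$(awk -F '\t' 'NF != 2 { n++ } END { print n + 0 }' list.models.2)
echo "malformed lines: $bad"
```

A non-zero count points to a line where the tab was written literally or a field is missing.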

Language Identification with Your Own Models

Only two changes are required to the commands presented in the section "Language Identification with Pretrained Models".

  • Change yali-language-identifier to yali-identifier.

  • Change -l="eng ces fra" to -c=list.models.2.

    # language identification for Czech files
    yali-identifier -c=list.models.2 --filelist=list.ces.test

    # language identification for English files with different output format
    yali-identifier -c=list.models.2 --filelist=list.eng.test -f=all_p

    # language identification for French files read from STDIN
    cat list.fra.test | yali-identifier -c=list.models.2 --filelist=- -f=tabbed

    # single file
    yali-identifier -c=list.models.2 -i=ces.20.txt -f=all

    # single file read from STDIN
    cat eng.20.txt | yali-identifier -c=list.models.2 -i=- -f=all_p

    # single file read from STDIN
    cat fra.20.txt | yali-identifier -c=list.models.2 -f=all_p

Modules

This section shows how to use the modules Lingua::YALI::LanguageIdentifier, Lingua::YALI::Builder, and Lingua::YALI::Identifier.

Language Identification with Pretrained Models

This example shows how to detect languages with Lingua::YALI::LanguageIdentifier.

    use Lingua::YALI::LanguageIdentifier;

    # create identifier and register languages
    my $identifier = Lingua::YALI::LanguageIdentifier->new();
    $identifier->add_language("ces", "eng", "fra");

    # identify string
    my $result_s = $identifier->identify_string("CPAN, the Comprehensive Perl Archive Network, is an archive of modules written in Perl.");

    print "The most probable language is " . $result_s->[0]->[0] . ".\n";
    # prints out The most probable language is eng.

    # identify file
    my $result_f = $identifier->identify_file("ces.01.txt");

    print "The most probable language is " . $result_f->[0]->[0] . ".\n";
    # hopefully prints out The most probable language is ces.

    # identify file handle
    open(my $fh, "<:bytes", "fra.01.txt") or die "Cannot open fra.01.txt: $!";
    my $result_h = $identifier->identify_handle($fh);

    print "The most probable language is " . $result_h->[0]->[0] . ".\n";
    # hopefully prints out The most probable language is fra.

Training Your Own Models

This example shows how to train language models with Lingua::YALI::Builder.

    use Lingua::YALI::Builder;
    use File::Glob;
    use Carp;

    # process each list of training files; the language code is the second part of the list name
    for my $file (File::Glob::bsd_glob("list.*.train")) {
        my @p = split(/\./, $file);
        my $lang = $p[1];

        print STDERR "Building model for $lang\n";

        # create builder for 2-grams
        my $builder = Lingua::YALI::Builder->new(ngrams=>[2]);

        open(my $fh_train, "<", $file) or croak($file . "\n" . $!);
        while ( my $f = <$fh_train> ) {
            chomp $f;

            # train on file
            $builder->train_file($f);
        }

        # store trained model
        $builder->store("model.".$lang.".gz", 2);
        print STDERR "\tDONE\n";
    }

Using Your Own Models

This example shows how to use trained language models with Lingua::YALI::Identifier.

    use Lingua::YALI::Identifier;
    use File::Glob;
    use Carp;

    # load models
    my $identifier = Lingua::YALI::Identifier->new();
    $identifier->add_class("ces", "model.ces.gz");
    $identifier->add_class("eng", "model.eng.gz");
    $identifier->add_class("fra", "model.fra.gz");

    # identify string
    my $result_s = $identifier->identify_string("CPAN, the Comprehensive Perl Archive Network, is an archive of modules written in Perl.");
    print "The most probable language is " . $result_s->[0]->[0] . ".\n";
    # prints out The most probable language is eng.

    # identify all testing files
    for my $file (File::Glob::bsd_glob("list.*.test")) {
        open(my $fh_test, "<", $file) or croak($file . "\n" . $!);
        while ( my $f = <$fh_test> ) {

            chomp $f;
            # identify file
            my $result_f = $identifier->identify_file($f);
            print $f . "\t" . $result_f->[0]->[0] . "\n";
        }
    }
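
The loop above prints one line per test file: the file name, a tab, and the most probable language. Because the test files are named after their language (for example ces.16.txt), that output can be scored directly. The sketch below assumes the output was saved to a file called predictions.tsv; both the file name and the three sample lines are hypothetical and only illustrate the format.

```shell
# hypothetical sample of the script's output: file name, TAB, predicted language
printf 'ces.16.txt\tces\neng.16.txt\teng\nfra.16.txt\teng\n' > predictions.tsv

# the expected language is the part of the file name before the first dot
summary=$(awk -F '\t' '
    { lang = $1; sub(/\..*$/, "", lang); total++; if (lang == $2) correct++ }
    END { printf "%d of %d files identified correctly", correct, total }
' predictions.tsv)
echo "$summary"
```

With real data, redirect the output of the Perl script to predictions.tsv instead of creating the sample lines.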

AUTHOR

Martin Majlis <martin@majlis.cz>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2012 by Martin Majlis.

This is free software, licensed under:

  The (three-clause) BSD License