Roger A Hall

NAME

Text::Mining - Perl Tools for Text Mining Research

VERSION

This document describes Text::Mining version 0.0.8

SYNOPSIS

To run the shell:

    use Text::Mining;

    my $tm = Text::Mining->new();
    $tm->shell();

To use the objects:

    use Text::Mining;

    my $tm = Text::Mining->new();
    my $corpus = $tm->get_corpus({ corpus_name => 'Test' });
    my $document = $corpus->add_document({ file_path => 'data/file42.txt' });
    my $parser   = Text::Mining::Parser->new({ parser    => 'Text', 
                                               algorithm => 'Base' });

  

DESCRIPTION

Text::Mining manages multiple corpuses with unlimited documents and annotations and calculates representations of the documents using a variety of algorithms.

The primary design considerations are token provenance in the face of ever-changing protocols of analysis and pipeline automation for corpus recalculations.

INTERFACE

The command line interface is self-describing via the "help" command. Copy the "kodos" script from the package's "scripts" directory to a directory in your PATH, then check that it is executable and adjust the permissions if necessary. To start the shell, enter "kodos" at the prompt.

METHODS

  • shell

     $tm->shell();

    Uses Term::Shell plus a few enhancements to provide a live environment for developing flexible and repeatable text mining protocols and for managing multi-release projects encompassing multiple corpuses.

  • version

     print $tm->version(), "\n";

    Reports the version of Text::Mining.

  • create_corpus

    Creates a new corpus in the database. The call signature is not documented here.

  • get_corpus

     # Look up a corpus by id ...
     my $corpus = $tm->get_corpus({ corpus_id => 1 });

     # ... or by name.
     my $corpus = $tm->get_corpus({ corpus_name => 'Test' });

    Retrieves a corpus object from the database.

  • delete_corpus

     $corpus->delete();

    Deletes a corpus from the database, along with all of its related documents.

  • get_root_dir

     print $tm->get_root_dir(), "\n";

    Reports the root directory from the configuration file.

  • get_root_url

     print $tm->get_root_url(), "\n";

    Reports the root URL of the webserver from the configuration file.

  • get_data_dir

     print $tm->get_data_dir(), "\n";

    Reports the main data directory from the configuration file.

  • get_submitted_document

     print $tm->get_submitted_document(), "\n";

    Reports the

  • count_submitted_waiting

     print $tm->count_submitted_waiting(), "\n";

    Reports the number of documents waiting to be included for a given corpus.

  • count_submitted_complete

     print $tm->count_submitted_complete(), "\n";

    Reports the number of documents ...

  • get_all_corpuses

     my $corpuses = $tm->get_all_corpuses();

    Returns the corpuses as a DBI table.

  • get_corpus_id

     print $corpus->get_corpus_id(), "\n";

    Reports the corpus_id of the current_corpus.
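
Several of the methods above read from the configuration file or the database, so a script that calls them may fail on a machine where Text::Mining is not yet set up. The following is a minimal sketch, not part of the distribution, that wraps a few of the documented calls in eval so such a failure is reported rather than fatal. The helper name describe_installation is hypothetical, and the sketch assumes that new() takes no arguments, as in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper: returns report lines, or one explanatory line when
# Text::Mining (or its configuration or database) is unavailable.
sub describe_installation {
    my @report;
    my $ok = eval {
        require Text::Mining;
        my $tm = Text::Mining->new();    # assumed to need no arguments
        push @report, 'version:  ' . $tm->version();
        push @report, 'root dir: ' . $tm->get_root_dir();
        push @report, 'data dir: ' . $tm->get_data_dir();
        1;
    };
    return $ok ? @report : ("Text::Mining unavailable: $@");
}

print "$_\n" for describe_installation();
```

On a working installation this prints one line per setting; otherwise it prints the error raised while loading or constructing the object.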

CONFIGURATION AND ENVIRONMENT

Text::Mining requires a set of configuration files stored at "~/.corpus":

  • shellrc

    Currently holds pwd and current_corpus. Loaded when you start the shell. These settings are saved in real time with _updated_config().

  • shell_history

    Holds the last 1,000 commands. Reloaded when you start the shell. Saved in postcmd().
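
The shell_history behaviour described above (keep the newest 1,000 commands, save after each command, reload at startup) can be sketched as follows. This is an illustration only, not the module's own code: it assumes the file holds one command per line, which this document does not confirm, and the names load_history and save_command are hypothetical. A temporary directory stands in for "~/.corpus".

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $MAX_HISTORY = 1_000;    # limit stated in this document

# Load saved commands, one per line (assumed file format).
sub load_history {
    my ($path) = @_;
    return [] unless -e $path;
    open my $fh, '<', $path or die "Cannot read $path: $!";
    chomp( my @commands = <$fh> );
    close $fh;
    return \@commands;
}

# Append one command, drop the oldest entries beyond the limit, and save.
sub save_command {
    my ( $path, $history, $command ) = @_;
    push @{$history}, $command;
    splice @{$history}, 0, @{$history} - $MAX_HISTORY
        if @{$history} > $MAX_HISTORY;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} "$_\n" for @{$history};
    close $fh or die "Cannot close $path: $!";
    return scalar @{$history};
}

# Demo against a temporary directory standing in for ~/.corpus.
my $dir  = tempdir( CLEANUP => 1 );
my $path = File::Spec->catfile( $dir, 'shell_history' );

my $history = load_history($path);
save_command( $path, $history, "command $_" ) for 1 .. 1_005;

my $reloaded = load_history($path);
printf "%d commands kept, oldest is '%s'\n",
    scalar @{$reloaded}, $reloaded->[0];
```

Rewriting the whole file after every command is simple and keeps the file consistent with memory; a real shell might instead append and trim only at exit.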

DEPENDENCIES

 Test::More
 version
 Class::Std
 Class::Std::Utils
 YAML
 Carp
 LWP::Simple
 Time::HiRes
 DBIx::MySperqlOO
 File::Spec
 

INCOMPATIBILITIES

None reported.

BUGS AND LIMITATIONS

No bugs have been reported.

Please report any bugs or feature requests to bug-text-mining@rt.cpan.org, or through the web interface at http://rt.cpan.org.

AUTHORS

Roger A Hall <rogerhall@cpan.org>

Michael Bauer <mbkodos@gmail.com>

LICENSE AND COPYRIGHT

Copyright (c) 2009, the Authors. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENSE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
