NAME
    "Text::Corpus::NewYorkTimes" - Interface to New York Times corpus.
SYNOPSIS
      use Text::Corpus::NewYorkTimes;
      use Data::Dump qw(dump);
      use Log::Log4perl qw(:easy);
      Log::Log4perl->easy_init ($INFO);
      my $corpus = Text::Corpus::NewYorkTimes->new (fileList => $fileList, corpusDirectory => $corpusDirectory);
      dump $corpus->getTotalDocuments;
DESCRIPTION
    "Text::Corpus::NewYorkTimes" provides an interface for accessing the
    documents in the New York Times corpus from the Linguistic Data
    Consortium. The categories, description, title, etc. of a specified
    document are accessed using Text::Corpus::NewYorkTimes::Document. All
    errors and warnings are logged using Log::Log4perl, which should be
    initialized before use.
CONSTRUCTOR
  "new"
    The method "new" creates an instance of the "Text::Corpus::NewYorkTimes"
    class with the following parameters:
    "corpusDirectory"
         corpusDirectory => '...'
        "corpusDirectory" is the path to the topmost directory of the
        corpus; it is usually the path to the directory named "nyt_corpus".
        It is needed to locate all the documents in the corpus. If it is not
        defined, then the environment variable
        "TEXT_CORPUS_NEWYORKTIMES_CORPUSDIRECTORY" is used if it is defined;
        if neither is defined, then all the paths in the file specified by
        "fileList" are assumed to be full path names. "corpusDirectory" and
        "fileList" can both be defined, in which case the path names in
        "fileList" are taken relative to "corpusDirectory".
    "fileList"
         fileList => '...'
        "fileList" is an optional parameter that can be used to save time
        when creating the list of documents in the corpus; each line in the
        file must be the path to an XML document in the corpus. If
        "fileList" is not defined, then the environment variable
        "TEXT_CORPUS_NEWYORKTIMES_FILELIST" is used if it is defined;
        otherwise all the XML documents in the corpus are located by
        searching the directory specified by "corpusDirectory". If the file
        defined by "fileList" or "TEXT_CORPUS_NEWYORKTIMES_FILELIST" does
        not exist, it will be created and the path to each XML document in
        the corpus, relative to "corpusDirectory", will be written to it.
        This is done to speed up subsequent calls to "new".
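
    The lookup order for "corpusDirectory" and "fileList" described above
    can be sketched roughly as follows. This is only an illustration of the
    documented behavior, not the module's actual code: the helpers
    firstDefined, findXmlDocuments, loadOrCreateFileList and
    corpusDocumentPaths are hypothetical names, and matching documents by a
    ".xml" suffix is an assumption.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;

# Hypothetical helper: return the first defined value, mirroring the
# "parameter first, then environment variable" precedence.
sub firstDefined
{
  for my $candidate (@_) { return $candidate if defined $candidate; }
  return;
}

# Hypothetical helper: list every *.xml file under $corpusDirectory as
# sorted paths relative to $corpusDirectory (the suffix test is an assumption).
sub findXmlDocuments
{
  my ($corpusDirectory) = @_;
  my @relativePaths;
  find ({
    no_chdir => 1,
    wanted => sub
    {
      return unless -f $File::Find::name && $File::Find::name =~ /\.xml$/i;
      push @relativePaths, File::Spec->abs2rel ($File::Find::name, $corpusDirectory);
    },
  }, $corpusDirectory);
  return [sort @relativePaths];
}

# Hypothetical helper: read the paths from $fileList when the file exists;
# otherwise scan the corpus and write the list for later runs.
sub loadOrCreateFileList
{
  my ($fileList, $corpusDirectory) = @_;
  if (-e $fileList)
  {
    open (my $in, '<', $fileList) or die "cannot read $fileList: $!";
    chomp (my @paths = <$in>);
    close $in;
    return \@paths;
  }
  my $paths = findXmlDocuments ($corpusDirectory);
  open (my $out, '>', $fileList) or die "cannot write $fileList: $!";
  print {$out} "$_\n" for @$paths;
  close $out or die "cannot close $fileList: $!";
  return $paths;
}

# Hypothetical entry point combining both parameters and their
# environment-variable fallbacks.
sub corpusDocumentPaths
{
  my (%parameters) = @_;
  my $corpusDirectory = firstDefined ($parameters{corpusDirectory},
    $ENV{TEXT_CORPUS_NEWYORKTIMES_CORPUSDIRECTORY});
  my $fileList = firstDefined ($parameters{fileList},
    $ENV{TEXT_CORPUS_NEWYORKTIMES_FILELIST});
  die "need corpusDirectory or an existing fileList\n"
    unless defined $corpusDirectory || (defined $fileList && -e $fileList);
  return findXmlDocuments ($corpusDirectory) unless defined $fileList;
  return loadOrCreateFileList ($fileList, $corpusDirectory);
}
```

    In this sketch the first call with a missing file list scans the
    directory and writes the list; later calls only read the file, which is
    the time saving the "fileList" parameter is meant to provide.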
METHODS
  "getDocument"
     getDocument (index => $documentIndex)
     getDocument (uri => $uri)
    "getDocument" returns a Text::Corpus::NewYorkTimes::Document object for
    the document with index $documentIndex or URI $uri. The document indices
    range from zero to "getTotalDocuments() - 1". If an error occurs,
    "getDocument" logs it using Log::Log4perl and returns "undef".
    For example:
      use Text::Corpus::NewYorkTimes;
      use Data::Dump qw(dump);
      use Log::Log4perl qw(:easy);
      Log::Log4perl->easy_init ($INFO);
      my $corpus = Text::Corpus::NewYorkTimes->new (fileList => $fileList, corpusDirectory => $corpusDirectory);
      my $document = $corpus->getDocument (index => 0);
      dump $document->getBody;
      dump $document->getCategories;
      dump $document->getContent;
      dump $document->getDate;
      dump $document->getTitle;
      dump $document->getUri;
  "getTotalDocuments"
     getTotalDocuments ()
    "getTotalDocuments" returns the total number of documents in the corpus.
    The index to the documents in the corpus ranges from zero to
    "getTotalDocuments() - 1".
  "test"
     test ()
    "test" runs tests to ensure the documents in the corpus are accessible
    and can be parsed. It returns true if all tests pass; otherwise a
    description of the test that failed is logged using Log::Log4perl and
    false is returned.
    For example:
      use Text::Corpus::NewYorkTimes;
      use Data::Dump qw(dump);
      use Log::Log4perl qw(:easy);
      Log::Log4perl->easy_init ($INFO);
      my $corpus = Text::Corpus::NewYorkTimes->new (fileList => $fileList, corpusDirectory => $corpusDirectory);
      dump $corpus->test;
EXAMPLES
    The example below will print out all the information for each document
    in the corpus.
      use Text::Corpus::NewYorkTimes;
      use Data::Dump qw(dump);
      use Log::Log4perl qw(:easy);
      Log::Log4perl->easy_init ($INFO);
      my $corpus = Text::Corpus::NewYorkTimes->new (fileList => $fileList, corpusDirectory => $corpusDirectory);
      my $totalDocuments = $corpus->getTotalDocuments;
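      for (my $i = 0; $i < $totalDocuments; $i++)
      {
        # The eval traps any exception raised while reading one document so
        # that the loop can continue with the next one.
        eval
          {
            my $document = $corpus->getDocument(index => $i);
            # Leave the eval (not the loop) if the document could not be
            # loaded; getDocument has already logged the error.
            return 1 unless defined $document;
            my %documentInfo;
            $documentInfo{title} = $document->getTitle();
            $documentInfo{body} = $document->getBody();
            $documentInfo{content} = $document->getContent();
            $documentInfo{categories} = $document->getCategories();
            $documentInfo{description} = $document->getDescription();
            $documentInfo{date} = $document->getDate();
            $documentInfo{uri} = $document->getUri();
            dump \%documentInfo;
            1;
          } or WARN ("document $i could not be processed: $@");
      }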
INSTALLATION
    To install the module, set the environment variable
    "TEXT_CORPUS_NEWYORKTIMES_CORPUSDIRECTORY" to the path of the New York
    Times corpus and run the following commands:
      perl Makefile.PL
      make
      make test
      make install
    If you are on Windows, you should use 'nmake' rather than 'make'.
    The module will install if "TEXT_CORPUS_NEWYORKTIMES_CORPUSDIRECTORY" is
    not defined, but less testing will be performed. After the New York
    Times corpus is installed, the module can be tested by running:
      use Text::Corpus::NewYorkTimes;
      use Data::Dump qw(dump);
      use Log::Log4perl qw(:easy);
      Log::Log4perl->easy_init ($INFO);
      my $corpus = Text::Corpus::NewYorkTimes->new (corpusDirectory => $corpusDirectory);
      dump $corpus->test;
AUTHOR
     Jeff Kubina <jeff.kubina@gmail.com>
COPYRIGHT
    Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is
    free software; you can redistribute it and/or modify it under the same
    terms as Perl itself.
    The full text of the license can be found in the LICENSE file included
    with this module.
KEYWORDS
    nyt, new york times, english corpus, information processing
SEE ALSO
    Log::Log4perl, Text::Corpus::NewYorkTimes::Document