NAME

Text::Corpus::NewYorkTimes::Document - Parse a New York Times article for research.

SYNOPSIS

use Text::Corpus::NewYorkTimes;
use Data::Dump qw(dump);
use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init ($INFO);
my $corpus = Text::Corpus::NewYorkTimes->new (fileList => $fileList, corpusDirectory => $corpusDirectory);
my $document = $corpus->getDocument (index => 0);
dump $document->getBody;
dump $document->getCategories;
dump $document->getContent;
dump $document->getDate;
dump $document->getDescription;
dump $document->getTitle;
dump $document->getUri;

DESCRIPTION

Text::Corpus::NewYorkTimes::Document provides methods for accessing specific portions of news articles from the New York Times corpus.

CONSTRUCTOR

`new`

The constructor new creates an instance of the Text::Corpus::NewYorkTimes::Document class with the following parameters:

filename

filename => '...'

filename is the path name to the XML document that is to be parsed.

dtdname

dtdname => '...'

dtdname is the path name to the document type definition (DTD) file provided with the corpus; it is usually something like .../nyt_corpus/dtd/nitf-3-3.dtd. If dtdname is not defined, an attempt is made to locate the file using the path provided by filename.

METHODS

`getBody`

getBody ()

getBody returns an array reference of strings, each of which is a sentence from the body of the article.

`getCategories`

getCategories (type => 'all')

The method getCategories returns the categories of the given type assigned to the document, where type must be 'all', 'controlled', or 'uncontrolled'. The 'uncontrolled' categories are those assigned to the document by an editor without machine assistance; the 'controlled' categories are those assigned with machine assistance. The type 'all' returns the union of the 'controlled' and 'uncontrolled' categories. The default is 'all'.
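The union returned for type 'all' can be pictured with the following minimal sketch. It is not the module's implementation: the helper unionOfCategories and the category strings are hypothetical, and keeping first-seen order while dropping duplicates is an assumption made only for illustration.

```perl
use strict;
use warnings;

# Hypothetical category lists; real values come from the corpus.
my @controlled   = ('Politics and Government', 'Elections');
my @uncontrolled = ('Elections', 'Campaign Finance');

# Hypothetical helper: takes array references and returns the union of
# their elements, keeping first-seen order and dropping duplicates.
sub unionOfCategories {
  my %seen;
  return grep { !$seen{$_}++ } map { @{$_} } @_;
}

my @all = unionOfCategories (\@controlled, \@uncontrolled);
print join ("\n", @all), "\n";
```

A category that appears in both lists, such as 'Elections' here, is listed once in the result.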

`getContent`

getContent ()

The method getContent returns the content of the document as an array reference in which each item is a sentence, with the first sentence being the headline or title of the article. If the lead sentence equals the headline of the article, then the headline is not prefixed to the list.
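The headline rule can be sketched as follows. This is an illustration only, not the module's code: assembleContent is a hypothetical helper, the sample sentences are made up, and exact string equality is assumed as the meaning of "equals".

```perl
use strict;
use warnings;

# Hypothetical helper: prefix the title to the body sentences unless the
# lead sentence already equals the title (exact string match assumed).
sub assembleContent {
  my ($title, $body) = @_;
  my @content = @{$body};
  unshift @content, $title
    unless @content && $content[0] eq $title;
  return \@content;
}

# The title differs from the lead sentence, so it is prefixed.
my $prefixed = assembleContent ('Council Approves Budget',
  ['The city council voted on Tuesday.', 'The vote was unanimous.']);
print scalar (@{$prefixed}), " items\n";

# The lead sentence repeats the title, so nothing is prefixed.
my $unchanged = assembleContent ('Council Approves Budget',
  ['Council Approves Budget', 'The vote was unanimous.']);
print scalar (@{$unchanged}), " items\n";
```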

`getDate`

getDate (format => '%g')

getDate returns the date and time of the article in the format specified by format, which uses the print directives of Date::Manip::Date. The default is to return the date and time in RFC 2822 format.

`getDescription`

getDescription ()

getDescription returns an array reference of strings, each of which is a sentence describing the article's content.

`getTitle`

getTitle ()

getTitle returns an array reference of strings, usually containing a single string, that holds the title of the article.

`getUri`

getUri (type => 'file')

getUri returns the URI of the document where type must be 'file' or 'url'. If type is 'file', the file path of the document is returned; otherwise the URL of the document is returned. The default is 'file'.
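The non-default options documented above can be combined as in the following sketch. The helper describeDocument is hypothetical and not part of the module; it assumes only the documented behavior that getTitle returns an array reference and that getUri and getDate accept the type and format parameters shown. The format '%Y-%m-%d' uses standard Date::Manip::Date print directives.

```perl
use strict;
use warnings;

# Hypothetical helper: collects a few fields from a document object
# using the options documented for each method.
sub describeDocument {
  my ($document) = @_;
  return {
    url   => $document->getUri (type => 'url'),
    date  => $document->getDate (format => '%Y-%m-%d'),
    title => join (' ', @{ $document->getTitle }),
  };
}

# Usage, given a $corpus built as in the SYNOPSIS:
#   my $info = describeDocument ($corpus->getDocument (index => 0));
#   print "$info->{date}  $info->{title}\n  $info->{url}\n";
```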

INSTALLATION

For installation instructions see Text::Corpus::NewYorkTimes.

AUTHOR

Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

nyt, new york times, english corpus, information processing

