Text::Corpus::CNN::Document - Parse CNN article for research.
  use Cwd;
  use File::Spec;
  use Text::Corpus::CNN;
  use Data::Dump qw(dump);
  use Log::Log4perl qw(:easy);
  Log::Log4perl->easy_init ($INFO);
  my $corpusDirectory = File::Spec->catfile (getcwd(), 'corpus_cnn');
  my $corpus = Text::Corpus::CNN->new (corpusDirectory => $corpusDirectory);
  $corpus->update (verbose => 1);
  my $document = $corpus->getDocument (index => 0);
  dump $document->getBody;
  dump $document->getCategories;
  dump $document->getContent;
  dump $document->getDate;
  dump $document->getDescription;
  dump $document->getHighlights;
  dump $document->getTitle;
  dump $document->getUri;
Text::Corpus::CNN::Document provides methods for accessing specific portions of CNN news articles for personal research and testing of information processing methods.
Read the CNN Interactive Service Agreement and make sure you abide by it when using this module.
new
The constructor new creates an instance of the Text::Corpus::CNN::Document class with the following parameters:
htmlContent
htmlContent => '...'
htmlContent must be a string containing the HTML of the document to be parsed. The string should already be decoded into a Perl internal (character) string rather than passed as raw bytes.
uri
uri => '...'
uri must be a string containing the URL of the document provided by htmlContent; it is also returned as the document's unique identifier with getUri.
encoding
encoding => '...'
encoding is the encoding that the HTML content of the document uses. It is returned by getEncoding.
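A minimal sketch of preparing the three constructor parameters, assuming the page arrived as raw UTF-8 bytes. The HTML and URL below are made-up placeholders, and the guarded require is only there so the snippet also runs where the module is not installed; how the parser treats such a tiny page is not covered by this documentation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw bytes, as they might arrive from an HTTP response (UTF-8 here).
my $rawBytes = "<html><head><title>Caf\xc3\xa9 opens</title></head>"
  . "<body><p>Example body.</p></body></html>";
my $encoding = 'UTF-8';

# Decode the bytes so that htmlContent is a Perl internal (character) string.
my $htmlContent = decode ($encoding, $rawBytes);

my %parameters = (
  htmlContent => $htmlContent,
  uri         => 'http://www.cnn.com/2009/example/index.html', # placeholder URL
  encoding    => $encoding,
);

# Construct the document only if the module is available; both steps are
# wrapped in eval because parsing a placeholder page may not succeed.
my $document;
if (eval { require Text::Corpus::CNN::Document; 1 }) {
  $document = eval { Text::Corpus::CNN::Document->new (%parameters) };
  my $uri = defined $document ? eval { $document->getUri } : undef;
  print "$uri\n" if defined $uri;
}
```

Keeping the original encoding name alongside the decoded string means it can still be reported later through getEncoding.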
getBody
getBody ()
getBody returns an array reference of strings, one per sentence, that make up the body of the document.
getCategories
getCategories ()
getCategories returns an array reference of strings of categories assigned to the document. They are the phrases and words extracted from the /html/head/meta[@name="KEYWORDS"] field in the HTML of the document, from the 'RELATED TOPICS' section of the document, and from the URL of the document.
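To illustrate the KEYWORDS lookup described above, here is a rough sketch using HTML::TreeBuilder::XPath on a made-up page. It is not the module's actual implementation: the sample HTML and the comma-splitting rule are assumptions, and the 'RELATED TOPICS' and URL sources are not covered.

```perl
use strict;
use warnings;

# Made-up page with a KEYWORDS meta element.
my $html = <<'END_HTML';
<html><head>
<meta name="KEYWORDS" content="space, nasa, shuttle launch">
</head><body><p>Example.</p></body></html>
END_HTML

my @categories;
if (eval { require HTML::TreeBuilder::XPath; 1 }) {
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse ($html);
  $tree->eof;

  # Fetch the content attribute of the KEYWORDS meta element.
  my @values = $tree->findvalues ('/html/head/meta[@name="KEYWORDS"]/@content');

  # Split on commas and trim whitespace (assumed rule, for illustration only).
  for my $value (@values) {
    for my $phrase (split /,/, $value) {
      $phrase =~ s/^\s+|\s+$//g;
      push @categories, $phrase if length $phrase;
    }
  }
  $tree->delete;
}
print join (' | ', @categories), "\n";
```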
getContent
getContent ()
getContent returns an array reference of strings, one per sentence, that form the content of the document, that is, its title followed by its body.
getDate
getDate (format => '%g')
getDate returns the date and time of the article in the format specified by format, which uses the print directives of Date::Manip::Date. The default is to return the date and time in RFC2822 format.
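As an illustration of the print directives that format accepts, the sketch below formats a made-up timestamp directly with Date::Manip::Date. The hash keys and the timestamp are hypothetical; only the directives themselves come from Date::Manip::Date.

```perl
use strict;
use warnings;

# A few Date::Manip::Date printf formats that could be passed as format.
my %formats = (
  rfc      => '%g',                # the directive shown in the getDate signature
  isoDate  => '%Y-%m-%d',          # year-month-day
  isoStamp => '%Y-%m-%d %H:%M:%S', # date plus 24-hour time
);

my %formatted;
if (eval { require Date::Manip::Date; 1 }) {
  my $date = Date::Manip::Date->new;

  # parse returns a true value on error, so only format on success.
  my $error = $date->parse ('2009-06-15 12:30:45'); # made-up timestamp
  unless ($error) {
    $formatted{$_} = $date->printf ($formats{$_}) for keys %formats;
  }
}
print "$_: $formatted{$_}\n" for sort keys %formatted;
```

With a document object in hand, the corresponding call would be of the form $document->getDate (format => '%Y-%m-%d').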
getDescription
getDescription ()
getDescription returns an array reference of strings of sentences, usually one, that describes the document content. It is from the /html/head/meta[@name="description"] field in the HTML of the document.
getEncoding
getEncoding ()
getEncoding returns the original encoding used by the HTML of the document.
getHighlights
getHighlights ()
getHighlights returns an array reference of the highlights of the document.
getHtml
getHtml ()
getHtml returns the HTML of the document as a string.
getTitle
getTitle ()
getTitle returns an array reference of strings, usually one, of the title of the document.
getUri
getUri ()
getUri returns the URL of the document.
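Most of the accessors above return array references, so callers typically dereference the result before use. The following sketch counts the entries returned by a few accessors and guards each call with eval. Local::StubDocument and summarize are hypothetical names introduced only so the example runs without downloading a corpus; a real Text::Corpus::CNN::Document object could be passed to summarize instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in that mimics the documented accessors.
{
  package Local::StubDocument;
  sub new           { my ($class, %fields) = @_; return bless {%fields}, $class }
  sub getTitle      { $_[0]{title} }
  sub getBody       { $_[0]{body} }
  sub getHighlights { $_[0]{highlights} }
  sub getUri        { $_[0]{uri} }
}

# Returns a hash reference with the URL and the number of strings
# returned by getTitle, getBody, and getHighlights.
sub summarize {
  my ($document) = @_;
  my %summary = (uri => $document->getUri);
  for my $part (qw(Title Body Highlights)) {
    my $method = "get$part";
    # Guard each call so one failing accessor does not abort the summary.
    my $list = eval { $document->$method };
    $summary{lc $part} = ref ($list) eq 'ARRAY' ? scalar @$list : 0;
  }
  return \%summary;
}

my $stub = Local::StubDocument->new (
  uri        => 'http://www.cnn.com/2009/example/index.html', # placeholder URL
  title      => ['Example title.'],
  body       => ['First sentence.', 'Second sentence.'],
  highlights => [],
);
my $summary = summarize ($stub);
print "$_: $summary->{$_}\n" for sort keys %$summary;
```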
For installation instructions see Text::Corpus::CNN.
This module uses XPath expressions to extract links and text; these may become invalid as the format of the various pages changes, which can cause many bugs.
Please email bug reports or feature requests to text-corpus-cnn@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Text-Corpus-CNN. The author will be notified and you can be automatically notified of progress on the bug fix or feature request.
Jeff Kubina <jeff.kubina@gmail.com>
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
cnn, cable news network, english corpus, information processing
Date::Manip::Date, HTML::TreeBuilder::XPath, Lingua::EN::Sentence, Log::Log4perl,