Jeff Kubina
and 1 contributors

NAME

create_summary_corpus.pl - Script to create corpus for summary testing.

SYNOPSIS

  create_summary_corpus.pl [-d corpusDirectory -l languageCode -p maxProcesses -r -h -t n]

DESCRIPTION

The script create_summary_corpus.pl makes a corpus for summarization testing using the featured articles of various Wikipedias.

All errors and warnings are logged using Log::Log4perl to the file corpusDirectory/languageCode/log/log.txt.

OPTIONS

-d corpusDirectory

The option -d sets the directory in which the corpus of documents is stored; the directory is created if it does not exist. The default is the current working directory.

A language subdirectory is created at corpusDirectory/languageCode that contains the directories log, html, unparsable, text, and xml. The directory log contains the file log.txt, to which all errors, warnings, and informational messages are logged using Log::Log4perl. The directory html contains copies of the HTML versions of the featured article pages fetched using LWP. The directory text contains two files for each article: one ends with _body.txt and contains the body text of the article; the other ends with _summary.txt and contains the summary. The directory unparsable contains the HTML files that could not be parsed into body and summary sections. The XML files are UTF-8 encoded; the text and HTML files are saved as UTF-8 octets.
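The layout described above can be sketched in Perl as follows. This is an illustration only, not code taken from the script: the function name corpus_layout and the article base name in the usage lines are hypothetical. Only the directory names, the file name log.txt, and the _body.txt and _summary.txt suffixes come from the description above.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of the script): returns the directories
# described above for one language, plus the log file and the two text
# files for one article base name.
sub corpus_layout {
    my ($corpus_dir, $language_code, $article) = @_;
    my $base = File::Spec->catdir($corpus_dir, $language_code);

    # Subdirectories named in the description of -d.
    my %layout = map { $_ => File::Spec->catdir($base, $_) }
        qw(log html unparsable text xml);

    $layout{log_file} = File::Spec->catfile($layout{log}, 'log.txt');
    $layout{body}     = File::Spec->catfile($layout{text}, $article . '_body.txt');
    $layout{summary}  = File::Spec->catfile($layout{text}, $article . '_summary.txt');
    return \%layout;
}

# Usage: print every path for a made-up article name.
my $layout = corpus_layout('corpus', 'en', 'example_article');
print "$_ => $layout->{$_}\n" for sort keys %$layout;
```

Building the paths with File::Spec rather than string concatenation keeps the sketch portable across operating systems.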

-l languageCode

The option -l sets the language code of the Wikipedia from which the corpus of featured articles is to be created. The supported language codes are af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese. If the language code is all, then the corpus for each supported language is created (which takes a long time). The default is en.

-p maxProcesses

 maxProcesses => 1

The option -p sets the maximum number of processes that can run simultaneously to parse the files. Parsing the files for the summary and body sections may be computationally intensive, so the module Forks::Super is used for parallelization. The default is one.
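The idea of capping the number of simultaneous parsing processes can be sketched with Perl's core fork and waitpid. Note that this is a deliberate substitute for illustration: the script itself uses Forks::Super, whose interface is not shown here. The function run_limited and the stand-in jobs are hypothetical, and the sketch assumes a platform where fork is available.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: runs each job (a code reference) in a child
# process, keeping at most $max_processes children alive at once.
# Returns the number of jobs whose child exited with status 0.
sub run_limited {
    my ($max_processes, @jobs) = @_;
    $max_processes = 1 if $max_processes < 1;
    my ($running, $succeeded) = (0, 0);

    my $reap_one = sub {
        my $pid = waitpid(-1, 0);    # block until some child exits
        if ($pid <= 0) {             # no children left to wait for
            $running = 0;
            return;
        }
        $running--;
        $succeeded++ if $? == 0;
    };

    for my $job (@jobs) {
        # Wait for a free slot before starting another child.
        $reap_one->() while $running >= $max_processes;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, then leave without running END blocks
            # or flushing output buffers inherited from the parent.
            my $ok = eval { $job->(); 1 };
            POSIX::_exit($ok ? 0 : 1);
        }
        $running++;                  # parent: count the new child
    }
    $reap_one->() while $running > 0;
    return $succeeded;
}

# Hypothetical stand-ins for "parse one HTML file" tasks; one fails.
my @jobs = map { my $n = $_; sub { die "unparsable\n" if $n == 3 } } 1 .. 5;
my $ok = run_limited(2, @jobs);
print "$ok of ", scalar(@jobs), " jobs succeeded\n";
```

With a limit of one, the jobs run strictly one after another, which matches the default described above.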

-r

The option -r recreates only the text and XML files from the HTML files that have already been fetched; no new files are downloaded.

-h

Prints this documentation.

-t 0

The option -t initiates testing mode, in which only the specified number of pages are fetched and parsed. The default is zero, which indicates no testing: all possible pages are fetched and parsed.
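The options and defaults described in this section can be summarized in a short sketch using the core module Getopt::Long. How the script actually parses its arguments is not documented here, so the function parse_options and the hash keys are assumptions made for illustration. Only the option letters and their defaults come from the text above.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parses a list of command-line arguments into a
# hash, applying the defaults stated in the option descriptions.
sub parse_options {
    my @args = @_;
    my %opt = (
        d => getcwd(),   # corpus directory: current working directory
        l => 'en',       # language code
        p => 1,          # maximum simultaneous parsing processes
        t => 0,          # 0 = no testing, fetch and parse all pages
        r => 0,          # 1 = rebuild from already fetched HTML only
        h => 0,          # 1 = print the documentation
    );
    GetOptionsFromArray(
        \@args,
        'd=s' => \$opt{d},
        'l=s' => \$opt{l},
        'p=i' => \$opt{p},
        't=i' => \$opt{t},
        'r'   => \$opt{r},
        'h'   => \$opt{h},
    ) or die "invalid options\n";
    return \%opt;
}

# Usage: German Wikipedia, four parsing processes, ten test pages.
my $opt = parse_options(qw(-l de -p 4 -t 10));
printf "language=%s processes=%d test pages=%d\n", @{$opt}{qw(l p t)};
```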

BUGS

This script creates corpora by parsing Wikipedia pages. The XPath expressions used to extract links and text will become invalid as the format of the various pages changes, causing some corpora not to be created.

Please email bug reports or feature requests to bug-text-corpus-summaries-wikipedia@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Corpus-Summaries-Wikipedia. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.

AUTHOR

 Jeff Kubina<jeff.kubina@gmail.com>

COPYRIGHT

Copyright (c) 2010 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

corpus, information processing, summaries, summarization, wikipedia

SEE ALSO

Forks::Super, Log::Log4perl, Text::Corpus::Summaries::Wikipedia

Links to the featured article page for the supported language codes: af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese.

Copies of the data sets generated in May 2010 and February 2013 can be downloaded here.