NAME

MediaWiki::DumpFile::SimplePages - Fast and easy access to the pages and titles from a MediaWiki XML dump file.

SYNOPSIS

  use MediaWiki::DumpFile::SimplePages;

  # Pass either the path to a dump file ...
  my $p = MediaWiki::DumpFile::SimplePages->new($filename);
  # ... or a reference to an already open file handle:
  # my $p = MediaWiki::DumpFile::SimplePages->new(\*FILEHANDLE);

  while(my ($title, $article) = $p->next) {
        print "Title: $title\n";
        print "$article\n";
  }

ABOUT

This object parses the contents of a page dump file but only supports page titles and text. In exchange for that narrow scope it is extremely fast; see HISTORY for how it compares with other parsers.

FUNCTIONS

new

This is the constructor for this package. It is called with a single parameter: the location of a MediaWiki XML page dump file or a reference to an already open file handle.

next

This method returns a two-item list: the first item is the page title and the second item is the page text. When there are no more pages left, it returns an empty list.
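
The loop in the SYNOPSIS depends on a general Perl idiom rather than on anything specific to this module: a list assignment evaluated in boolean context yields the number of items on its right-hand side, so the empty list ends the loop, while a page whose title or text happens to be a false value such as "0" or "" does not. The following is a minimal sketch of that behaviour. It uses a hypothetical stand-in iterator (make_iterator and its sample pages are invented for illustration and are not part of this module) so that it runs without a dump file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a SimplePages object (an assumption for this
# sketch): returns a closure that yields (title, text) pairs and then an
# empty list, which is how next is documented to behave.
sub make_iterator {
    my @pages = @_;
    return sub {
        return unless @pages;      # bare return gives an empty list here
        my $page = shift @pages;
        return @$page;             # two-item list: title, text
    };
}

my $next = make_iterator(
    [ 'Perl', 'Perl is a programming language.' ],
    [ '0',    '' ],                # false-looking title, empty text
    [ 'XML',  'Extensible Markup Language.' ],
);

my @titles;
# The list assignment counts the items returned, so the loop stops only
# when the iterator returns an empty list.
while (my ($title, $text) = $next->()) {
    push @titles, $title;
    printf "Title: %s (%d characters of text)\n", $title, length $text;
}
```

With the real module, the call `$next->()` in this sketch corresponds to `$p->next`.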

HISTORY

This software started life as a benchmark for comparing various XML parsers for Perl. When I discovered just how fast this implementation was, I realized that 80% of the people who access a MediaWiki dump file are going to be accessing the article titles and text of the English Wikipedia, which means the XML parsing needs to be really fast. This package is twice as fast as the fastest SAX parser and five times faster than Parse::MediaWikiDump (as of Dec 2, 2009).

LIMITATIONS

This software is fairly fragile and is really a hack. If something goes wrong, it might not even be able to detect the problem. If the XML format changes, the behavior is completely undefined.

AUTHOR

"Tyler Riddle", <"triddle at gmail.com">

BUGS

Please see MediaWiki::DumpFile for information on how to report bugs in this software.

COPYRIGHT & LICENSE

Copyright 2009 "Tyler Riddle".

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.