NAME

RDF::Scutter - Perl extension for harvesting distributed RDF resources

SYNOPSIS

  use RDF::Scutter;
  use RDF::Redland;

  # Seed URLs to start from, plus a contact address (mandatory)
  my $scutter = RDF::Scutter->new(
      scutterplan => ['http://www.kjetil.kjernsmo.net/foaf.rdf',
                      'http://my.opera.com/kjetilk/xml/foaf/'],
      from        => 'scutterer@example.invalid');

  # Store statements in a Berkeley DB under /tmp/ and stop after 30 URLs
  my $storage = RDF::Redland::Storage->new("hashes", "rdfscutter", "new='yes',hash-type='bdb',dir='/tmp/'");
  my $model = $scutter->scutter($storage, 30);
  my $serializer = RDF::Redland::Serializer->new("ntriples");
  print $serializer->serialize_model_to_string(undef, $model);

DESCRIPTION

As the name implies, this is an RDF Scutter. A scutter is a web robot that follows seeAlso-links, retrieves the content it finds at those URLs, and adds the RDF statements it finds there to its own store of RDF statements.
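
The traversal described above can be pictured with a small, self-contained sketch. It is not the code this module uses: the %see_also table is a hypothetical stand-in for the web, the example.invalid URLs are made up, and crawl() is a name chosen only for illustration. It shows the general idea of following links from a starting plan without visiting any URL twice, up to a fixed limit.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the web: each URL maps to the seeAlso
# links found in the RDF document at that URL.
my %see_also = (
    'http://example.invalid/a.rdf' => ['http://example.invalid/b.rdf',
                                       'http://example.invalid/c.rdf'],
    'http://example.invalid/b.rdf' => ['http://example.invalid/a.rdf'],
    'http://example.invalid/c.rdf' => ['http://example.invalid/d.rdf'],
    'http://example.invalid/d.rdf' => [],
);

# Visit URLs breadth-first from the plan, never handling one twice,
# and stop once $max_urls documents have been visited.
sub crawl {
    my ($plan, $max_urls) = @_;
    my @queue = @$plan;
    my (%seen, @visited);
    while (@queue and @visited < $max_urls) {
        my $url = shift @queue;
        next if $seen{$url}++;
        push @visited, $url;    # a real scutter would fetch and parse here
        push @queue, @{ $see_also{$url} || [] };
    }
    return @visited;
}

my @visited = crawl(['http://example.invalid/a.rdf'], 3);
print "$_\n" for @visited;
```

With a limit of 3 the sketch stops before reaching d.rdf. A real scutter would replace the table lookup with an HTTP request and an RDF parse.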

This module is an alpha release of such a Scutter. It builds an RDF::Redland::Model, and can add statements to any RDF::Redland::Storage (file, memory, Berkeley DB, MySQL, etc).

This class inherits from LWP::RobotUA, which in turn is an LWP::UserAgent, so all methods of both classes are available.

Because of this inheritance, it is a robot that behaves nicely by default: it checks robots.txt and sleeps between connections to make sure it doesn't overload remote servers.
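
As a rough illustration of the inherited politeness settings, the sketch below uses LWP::RobotUA directly, so it runs without Redland installed. Since the Scutter is a subclass, the same method calls should be available on a scutter object, but that has not been verified here. The agent name and the values chosen are placeholders.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# The agent name and address are placeholders for illustration.
my $ua = LWP::RobotUA->new(agent => 'example-robot/0.1',
                           from  => 'scutterer@example.invalid');

# delay() is measured in minutes; the LWP::RobotUA default is 1.
$ua->delay(0.5);      # wait 30 seconds between requests to one host
$ua->use_sleep(1);    # sleep until the delay has passed (the default)
$ua->timeout(20);     # inherited from LWP::UserAgent, in seconds

printf "delay: %s min, timeout: %s s\n", $ua->delay, $ua->timeout;
```

Lowering the delay makes the robot faster but less considerate, so it is best left alone unless you control the servers being visited.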

CAUTION

This is an alpha release. I haven't tested what it can do if left unsupervised, and you might want to be careful about finding out... The example in the Synopsis is a complete scutter, but one that will retrieve only 30 URLs before returning. You could test it by entering your own URLs (optional) and a valid email address (mandatory). It'll count and report what it is doing.

METHODS

new(scutterplan => ARRAYREF, from => EMAILADDRESS [, any LWP::RobotUA parameters])

This is the constructor of the Scutter. You will have to initialise it with a scutterplan argument, which is an ARRAYREF containing URLs pointing to RDF resources. The Scutter will start its traversal of the web there. You must also set a valid email address in the from argument, so that if your scutter runs amok, your victims will know whom to blame.

Finally, you may supply any arguments that LWP::RobotUA and LWP::UserAgent accept.

scutter(RDF::Redland::Storage [, MAXURLS])

This method will launch the Scutter. As its first argument, it takes an RDF::Redland::Storage object. This allows you to store your model in any way Redland supports, which is very flexible; see the RDF::Redland::Storage documentation for details. Optionally, it takes an integer as its second argument, giving the maximum number of URLs to retrieve. This provides some protection against a runaway robot.

It returns an RDF::Redland::Model containing all statements retrieved from all visited resources.

BUGS/TODO

There are no known real bugs at the time of this writing, keeping in mind it is an alpha. If you find any, please use the CPAN Request Tracker to report them.

The code that loops to retrieve the URLs is not very elegant, and will undergo revision in later releases.

Although it uses LWP::Debug for debugging, the author finds it hard to settle on the right amount of output from the module. Subsequent releases are likely to be quieter than the present release.

For an initial release, heeding robots.txt is actually pretty groundbreaking. However, a good robot should also make use of HTTP caching; the relevant headers are ETag, Last-Modified and Expires. This will be a focus of upcoming development.
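
To make the caching idea concrete, here is a minimal sketch of a conditional GET. It is not something the module does at present. The conditional_request() helper and the layout of the cached hash are assumptions made for this example, and the URL and header values are made up. It only builds the request; nothing is sent.

```perl
use strict;
use warnings;
use HTTP::Request;

# Build a conditional GET from validators remembered from an earlier
# response. The %$cached layout is an assumption made for this sketch.
sub conditional_request {
    my ($url, $cached) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->header('If-None-Match'     => $cached->{etag})
        if defined $cached->{etag};
    $req->header('If-Modified-Since' => $cached->{last_modified})
        if defined $cached->{last_modified};
    return $req;
}

my $req = conditional_request(
    'http://example.invalid/foaf.rdf',
    { etag          => '"abc123"',
      last_modified => 'Sat, 01 Jan 2005 00:00:00 GMT' });
print $req->as_string;
```

A server that honours these headers answers 304 Not Modified, without a body, when the resource is unchanged, so the robot could skip parsing it again.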

It is not clear how long it would run, or how it would perform, if set to retrieve as much as it could. Currently, it is a serial robot, but there exist Perl modules for building parallel robots. If a serial robot turns out to be too limited, this will require attention.

SEE ALSO

RDF::Redland, LWP.

SUBVERSION REPOSITORY

This code is maintained in a Subversion repository. You may check out the trunk using e.g.

  svn checkout http://svn.kjernsmo.net/RDF-Scutter/trunk/ RDF-Scutter

AUTHOR

Kjetil Kjernsmo, <kjetilk@cpan.org>

ACKNOWLEDGEMENTS

Many thanks to Dave Beckett for writing the Redland framework and for helping when the author was confused, and to Dan Brickley for interesting discussions. Also thanks to the LWP authors for their excellent library.

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Kjetil Kjernsmo

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.