Benjamin Habegger
and 1 contributors


WebSource - a general data wrapping tool particularly well suited for online data (but what data is not online in some way today ;) )


WebSource provides a general, normalized framework for accessing data made available via the Web. Access to a subpart of the Web is obtained by defining a task. This task is built by composing query-building, extraction, fetching and filtering subtasks.


  $source = WebSource->new(wsd => $description);
  @results = $source->query($query);
  $result = $source->set_query($query);
  while($result = $source->next_result()) {
    # process $result here
  }


WebSource was originally a generic wrapper around a Web source. Given an XML description of a source, it lets you query the source and retrieve its results. The format of the query and of the results remains source dependent, however.

It is now configurable enough to carry out complex tasks on the Web, such as fetching, extracting and filtering data. Each complex task is described by an XML task description file (a WebSource description). The task is decomposed into simple subtasks of different flavors.

Existing subtask flavors are:

  - extract
      input:  an XML::LibXML::Document
      output: an XML::LibXML::Node
      Applies an XPath to the document and returns the set of nodes.
  - fetch
      input:  a URL (or an XML::LibXML::Node containing a URL)
      output: an XML::LibXML::Document
  - format
      input:  an XML::Document
      output: a string
  - filter
      input:  anything
      output: anything (but not all)
  - external
      This type of subtask uses an external Perl module as a task, which allows highly configurable tasks to be defined.
      input:  depends on the external module
      output: depends on the external module
  - meta-tag
      input:  anything
      output: anything (with updated meta-data)
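The way these flavors chain together can be illustrated with a small standalone sketch. It does not use the WebSource classes at all: fetch_stub, extract, format_node and keep are hypothetical helper names, the URL is a placeholder, and the fetch step returns an inlined document so the example runs offline. Only XML::LibXML is assumed to be installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for a "fetch" subtask. A real one would download $url;
# here the document is inlined so that the sketch runs offline.
sub fetch_stub {
    my ($url) = @_;
    my $xml = '<items><item>apple</item><item>pear</item></items>';
    return XML::LibXML->load_xml(string => $xml);
}

# "extract": apply an XPath to the document and return the matching nodes.
sub extract {
    my ($doc, $xpath) = @_;
    my @nodes = $doc->findnodes($xpath);
    return @nodes;
}

# "format": turn a node into a string.
sub format_node {
    my ($node) = @_;
    return $node->textContent;
}

# "filter": let only some of the results through (here, strings starting with "p").
sub keep {
    my ($string) = @_;
    return $string =~ /^p/;
}

# Compose the subtasks: fetch, then extract, then format, then filter.
my $page    = fetch_stub('http://example.org/items');   # placeholder URL
my @results = grep { keep($_) }
              map  { format_node($_) }
              extract($page, '//item');

print "$_\n" for @results;
```

In WebSource itself the composition is declared in the XML task description rather than written by hand; the sketch only mirrors the input and output types listed for each flavor.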


$source = WebSource->new(wsd => $wsd);

Creates a new WebSource object working with the given WebSource description.

The following named parameters can be given:


wsd
    Use a generic engine with the given source description file.

max_results
    Do not output more than max_results results.


Pass the initial data to the first subtask


Build a query %hash for the given parameters and push it in


Set the maximum number of results to output to $count


Returns the next result for the task


Returns a hash of the initial task's parameters


Returns the spec of the options translated for Getopt::Mixed


Sets source specific option $opt to value $val


Handles nodes of type <ws:import href="" /> by inserting nodes from the wsd file referenced by href (the imported document) into the current wsd document (the target document). A node from the imported document is inserted into the target document only if the target document does not already contain a node with the same name.
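The "insert only if absent" rule can be sketched with plain XML::LibXML. This is not the module's own code: merge_import is a hypothetical name, the element names (wsd, name, fetch, extract) are made up for the example, and the sketch assumes that only the direct children of the two root elements are compared, which the description above does not state.

```perl
use strict;
use warnings;
use XML::LibXML;

# Copy the top-level elements of $imported into $target, skipping any
# element whose name already exists among the target's top-level elements.
# Assumption: only direct children of the root elements are considered.
sub merge_import {
    my ($target, $imported) = @_;
    my $root = $target->documentElement;

    # Record the element names already present in the target document.
    my %seen;
    for my $child ($root->childNodes) {
        next unless $child->nodeType == XML_ELEMENT_NODE;
        $seen{ $child->nodeName } = 1;
    }

    for my $node ($imported->documentElement->childNodes) {
        next unless $node->nodeType == XML_ELEMENT_NODE;
        next if $seen{ $node->nodeName };
        # importNode makes a deep copy owned by the target document.
        $root->appendChild($target->importNode($node));
        $seen{ $node->nodeName } = 1;
    }
    return $target;
}

my $target   = XML::LibXML->load_xml(string => '<wsd><name>local</name><fetch/></wsd>');
my $imported = XML::LibXML->load_xml(string => '<wsd><name>shared</name><extract/></wsd>');

merge_import($target, $imported);
print $target->documentElement->toString, "\n";
```

Here the target keeps its own name element, and only the extract element is pulled in from the imported document.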


Handles nodes of type <ws:attribute name="aname" value="oname" /> by adding an attribute named aname, with the value of the option named oname, to the parent node. The ws:attribute node is then removed.
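A rough illustration of this rewriting step, again with plain XML::LibXML rather than the module's own code. The namespace URI urn:example:ws is a placeholder (the real WebSource namespace is not given here), apply_attributes and the task and fetch element names are hypothetical, and leaving the parent unchanged when the option is undefined is an assumption.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Placeholder namespace URI, chosen for this example only.
my $WS_NS = 'urn:example:ws';

# Replace each <ws:attribute name="aname" value="oname"/> by an attribute
# aname on its parent, taking the value from the option named oname.
sub apply_attributes {
    my ($doc, $options) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(ws => $WS_NS);

    for my $node ($xpc->findnodes('//ws:attribute')) {
        my $parent = $node->parentNode;
        my $aname  = $node->getAttribute('name');
        my $oname  = $node->getAttribute('value');
        # Assumption: if the option is not set, no attribute is added.
        $parent->setAttribute($aname, $options->{$oname})
            if defined $options->{$oname};
        # The ws:attribute node is removed in either case.
        $parent->removeChild($node);
    }
    return $doc;
}

my $wsd_doc = XML::LibXML->load_xml(string =>
    qq{<task xmlns:ws="$WS_NS"><fetch><ws:attribute name="lang" value="language"/></fetch></task>});

apply_attributes($wsd_doc, { language => 'fr' });
print $wsd_doc->toString;
```

After the call, the fetch element carries lang="fr" and no longer contains the ws:attribute child.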


ws-query, WebSource::Extract, WebSource::Fetch, WebSource::Filter, etc.
