NAME

WARC::Collection - Interface to a group of WARC files

SYNOPSIS

  use WARC::Collection;

  $collection = assemble WARC::Collection ($index_1, $index_2, ...);
  $collection = assemble WARC::Collection from => ($index_1, ...);

  $yes_or_no = $collection->searchable( $key );

  $record = $collection->search(url => $url, time => $when);
  @records = $collection->search(url => $url, time => $when);

DESCRIPTION

The WARC::Collection class is the primary means by which user code is expected to use the WARC library. This class uses indexes to efficiently search for records in one or more WARC files.

Search Keys

The search method accepts a list of parameters as key => value pairs. Each pair narrows the search, sorts the results, or both; the list below marks these cases with "[N ]", "[ S]", and "[NS]", respectively.

Supplying an array reference as a value indicates a search where any of the values in the array are acceptable. This does not affect sorting.
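
The "any of" rule can be modeled in plain Perl. The following is only a sketch of the semantics, not the library's implementation; the helper value_matches and the sample URLs are hypothetical names chosen for illustration, and the code runs without WARC installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WARC::Collection): true if $actual
# equals $wanted, or equals any element when $wanted is an array reference.
sub value_matches {
    my ($wanted, $actual) = @_;
    my @acceptable = ref $wanted eq 'ARRAY' ? @$wanted : ($wanted);
    return scalar grep { $_ eq $actual } @acceptable;
}

my @urls = (
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/c',
);

# A scalar value accepts exactly one URL.
my @one = grep { value_matches('http://example.com/a', $_) } @urls;

# An array reference accepts any of the listed URLs.
my @any = grep {
    value_matches(['http://example.com/a', 'http://example.com/c'], $_)
} @urls;

print scalar(@one), " match for the scalar, ",
      scalar(@any), " for the array reference\n";
```

With the library itself, the corresponding call would presumably take the form $collection->search(url => [$url_a, $url_b]).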

The search keys documented here are also used for searching indexes directly, since WARC::Collection is a wrapper around one or more indexes. Index support modules do not sort their results; only WARC::Collection sorts the returned entries, so sort-only keys (marked "[ S]" below) are ignored by the index support modules.

The keys supported are:

[N ] url

An exact match for a URL.

[NS] url_prefix

A prefix match for a URL. Prefers records with shorter URLs.

[ S] time

Prefer records collected nearer to the requested time.

[N ] record_id

An exact match for a (presumably unique) WARC-Record-ID.

[N ] segment_origin_id

An exact match for the WARC-Record-ID that identifies a logical record stored using WARC record segmentation. Searching on this key returns only the continuation records.
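
The difference between narrowing and sorting keys can be sketched in plain Perl. This is an illustrative model only: the hash records, their fields, and the sample values are assumptions (real results are WARC::Record objects), and each key is treated separately because the text above does not specify how several sorting keys combine.

```perl
use strict;
use warnings;

# Hypothetical records standing in for indexed WARC records.
my @records = (
    { url => 'http://example.com/',     time => 1_000 },
    { url => 'http://example.com/',     time => 6_000 },
    { url => 'http://example.com/page', time => 2_900 },
    { url => 'http://example.com/',     time => 3_200 },
);

my ($url, $when) = ('http://example.com/', 3_000);

# [N ] url: keep only exact URL matches.
my @narrowed = grep { $_->{url} eq $url } @records;

# [ S] time: order by distance from the requested time, nearest first.
my @by_time = sort {
    abs($a->{time} - $when) <=> abs($b->{time} - $when)
} @narrowed;

# [NS] url_prefix: keep URLs starting with the prefix, shorter URLs first.
my $prefix = 'http://example.com/';
my @by_prefix = sort { length($a->{url}) <=> length($b->{url}) }
                grep { index($_->{url}, $prefix) == 0 } @records;

print "by time: ", join(', ', map { $_->{time} } @by_time), "\n";
print "longest prefix match: $by_prefix[-1]{url}\n";
```

Note that the record collected at time 2_900 is closest to the requested time but never appears in the time-sorted list, because the url key removed it before sorting.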

Methods

$collection = assemble WARC::Collection ($index_1, $index_2, ...);
$collection = assemble WARC::Collection from => ($index_1, ...);

Assemble a collection of WARC files from one or more indexes, each specified either as an object derived from WARC::Index or as a filename.

While multiple indexes can be used in a collection, note that searching a collection requires individually searching every index in the collection.

$yes_or_no = $collection->searchable( $key )

Return true if at least one index in the collection can search on the requested key, and false otherwise.
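
The rule that a key is searchable when any index supports it can be modeled as follows. This is a sketch under stated assumptions: the hash-based index stand-ins, their key sets, and the function any_index_searchable are hypothetical and do not reflect how WARC::Index objects are actually queried.

```perl
use strict;
use warnings;
use List::Util qw(any);

# Hypothetical stand-ins for indexes: each lists the keys it can search.
my @indexes = (
    { name => 'index_1', keys => { url => 1, url_prefix => 1 } },
    { name => 'index_2', keys => { record_id => 1 } },
);

# Model of the documented rule: a key is searchable if at least one
# index in the collection supports it.
sub any_index_searchable {
    my ($key) = @_;
    return any { $_->{keys}{$key} } @indexes;
}

for my $key (qw(url record_id segment_origin_id)) {
    printf "%-18s %s\n", $key,
        any_index_searchable($key) ? 'searchable' : 'not searchable';
}
```

In user code, checking $collection->searchable($key) before calling search lets a program detect a key that no index in the collection can handle.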

$record = $collection->search( ... )
@records = $collection->search( ... )

Search the indexes for records matching the parameters and return the best match in scalar context or a list of all matches in list context. The returned values are WARC::Record objects.
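
Because the return value depends on calling context, it may help to see how Perl selects between the two forms. The sketch below uses a stand-in function, mock_search, built on Perl's wantarray; it is not the library's code, and the assumption that the match list is ordered best match first is mine.

```perl
use strict;
use warnings;

# Stand-in for a context-sensitive search method. The list is assumed
# to be ordered best match first; the strings replace WARC::Record objects.
sub mock_search {
    my @matches = ('best', 'second', 'third');
    return wantarray ? @matches : $matches[0];
}

my $record  = mock_search();        # scalar context: best match only
my @records = mock_search();        # list context: every match
my $count   = () = mock_search();   # list context, then counted

print "$record / @records / $count\n";
```

A call placed in an argument list or a hash constructor is evaluated in list context; wrapping the call in scalar(...) requests only the best match there.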

See "Search Keys" for more information about the parameters.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

WARC

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.