NAME

WARC::Record::Logical::Heuristics - heuristics for locating record segments

SYNOPSIS

  use WARC::Record::Logical::Heuristics;

DESCRIPTION

This is an internal module that provides functions for locating record segments when the needed information is not available from an index.

These functions mostly assume that the IIPC WARC guidelines have been followed; otherwise, there is simply no efficient solution.

Implementations vary, however. Some use only an incrementing serial number and a constant timestamp taken from the initiation of the crawl job, whereas the guidelines and specification envision a timestamp reflecting the first write to each specific file rather than the start of the crawl. Constant timestamps are checked first, since that search is simpler.
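The constant-timestamp case can be illustrated with a minimal sketch. It assumes file names follow the "Prefix-Timestamp-Serial-Crawlhost.warc.gz" pattern recommended in the WARC specification, with a 14-digit timestamp. The helper names parse_warc_name and next_name_constant_timestamp are hypothetical and are not part of this module; the real heuristics may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper: split "Prefix-Timestamp-Serial-Crawlhost.warc[.gz]"
# into its parts.  Returns an empty list if the name does not fit.
sub parse_warc_name {
    my ($name) = @_;
    return unless $name =~ m/\A(.+)-(\d{14})-(\d+)-(.+?)(\.warc(?:\.gz)?)\z/;
    return (prefix => $1, timestamp => $2, serial => $3,
            host   => $4, suffix    => $5);
}

# Hypothetical helper: guess the name of the next file in the series,
# assuming the timestamp stays constant and only the serial increments.
sub next_name_constant_timestamp {
    my ($name) = @_;
    my %part = parse_warc_name($name);
    return unless %part;

    # Preserve the zero padding of the original serial number.
    my $next = sprintf('%0*d', length($part{serial}), $part{serial} + 1);

    return join('-', $part{prefix}, $part{timestamp}, $next, $part{host})
         . $part{suffix};
}

print next_name_constant_timestamp(
    'crawl-20190101000000-00042-host.example.org.warc.gz'), "\n";
# prints "crawl-20190101000000-00043-host.example.org.warc.gz"
```

When the timestamp instead reflects the first write to each file, the next name cannot be computed this way and a directory scan is needed, which is why the constant case is the simpler search.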

$WARC::Record::Logical::Heuristics::Patience

This variable sets a threshold used to limit the reach of an unproductive search. This module tracks the "effort" expended (I/O performed) during a search and abandons the search if the threshold is exceeded. Each result found dynamically (and temporarily) raises this threshold for the remainder of that search, so the variable effectively sets how far the search will go between results before giving up and concluding that there are no more results.

The search will reach farther if the WARC files are either not compressed or use the "sl" GZIP extension documented in WARC::Builder. Decompressing record data to find the next record takes considerable effort for larger records, but is not counted for very small records that the system is likely to already have cached after the header has been read.
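The interaction between effort and patience can be modeled with a small, self-contained sketch. This is not the module's implementation: the function budgeted_search, the representation of a search as a list of [ cost, found ] steps, and the exact rule for raising the threshold are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical model of an effort-limited search.  Each step is an array
# reference [ cost, found ], where cost is the effort charged for the step
# and found is true if the step produced a result.
sub budgeted_search {
    my ($patience, @steps) = @_;

    my $limit  = $patience;    # current threshold
    my $effort = 0;            # total effort expended so far
    my @found;                 # indexes of the steps that produced results

    for my $i (0 .. $#steps) {
        my ($cost, $hit) = @{ $steps[$i] };
        $effort += $cost;
        last if $effort > $limit;    # too long without a result: give up

        if ($hit) {
            push @found, $i;
            # A result extends the reach: allow another full measure of
            # patience starting from the current position.
            $limit = $effort + $patience;
        }
    }
    return @found;
}

my @found = budgeted_search(10,
    [4, 0], [4, 1], [6, 0], [3, 1], [8, 0], [5, 1]);
print "@found\n";    # prints "1 3"
```

In this model the last step is never reached: after the result at step 3 the threshold is 27, and the accumulated effort of 30 exceeds it. Raising the patience value to 13 or more would let the search find that final result.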

%WARC::Record::Logical::Heuristics::Effort

This internal hash indicates how costly certain operations should be considered. The keys and their meanings are subject to change at whim, but this is available for quick tuning if needed. Generally, the better solution is to index your data rather than spend time tuning heuristics.

( $first_segment, @clues ) = find_first_segment( $record )

Attempt to locate the first segment of the logical record suggested by the given record without using indexes. Croaks if given a record that does not appear to have been written using WARC segmentation. Returns a WARC::Record object for the first record and a list of other objects that may be useful for locating continuation records. Returns undef in the first slot if no clear first segment was found, but can still return other records encountered during the search even if the search was ultimately unsuccessful.

( @segments ) = find_continuation( $first_segment, @clues )

Attempt to locate the continuation segments of a logical record without using indexes. Uses the clues returned from find_first_segment to aid in the search and returns a list of continuation records found that appear to be part of the same logical record as the given first segment.
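The two functions are meant to be used together, which might look like the following sketch. The wrapper collect_segments is hypothetical. The sketch calls the functions by their fully-qualified names because this document does not say whether they are exported, and it assumes the caller already holds a WARC::Record object for some segment of the logical record.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns the first segment followed by whatever
# continuation segments were found, or an empty list on failure.
sub collect_segments {
    my ($record) = @_;

    # Loaded at run time, so this file compiles even where WARC is absent.
    require WARC::Record::Logical::Heuristics;

    # find_first_segment croaks if the record does not appear to be
    # segmented, so trap that and treat it as "nothing found".
    my ($first, @clues) = eval {
        WARC::Record::Logical::Heuristics::find_first_segment($record);
    };
    return unless defined $first;

    # Pass the clues along to help locate the continuation segments.
    my @rest = WARC::Record::Logical::Heuristics::find_continuation(
        $first, @clues);

    return ($first, @rest);
}
```

Since this is an internal module, ordinary code should not need to call these functions directly; the sketch only shows how the return values of one feed into the other.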

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

WARC, WARC::Collection, WARC::Record

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.