WARC - Web ARChive support for Perl
    use WARC;

    $collection = assemble WARC::Collection (@indexes);
    $record = $collection->search(url => $url, time => $when);

    $volume = mount WARC::Volume ($filename);
    $record = $volume->first_record;
    $next_record = $record->next;
    $record = $volume->record_at($offset);

    # $record is a WARC::Record object
The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.
A WARC::Collection object represents a set of indexed WARC files.
A WARC::Volume object represents a single WARC file.
Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.
Support class for WARC records that span multiple segments.
Planned support for tied filehandles reading WARC payloads.
A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set, and their order is preserved.
The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
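To make the order-preserving key-value format concrete, here is a minimal sketch of parsing "application/warc-fields" text into an ordered list of name/value pairs. The helper name parse_warc_fields and the sample field values are assumptions made for illustration; this is not the WARC::Fields implementation or its interface.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the WARC distribution): parse
# "application/warc-fields" text into an ordered list of [name, value]
# pairs. An array is used instead of a hash so that field order and
# repeated field names are both preserved.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';    # a blank line ends the field block
        if (@fields && $line =~ /^[ \t]+(.*)$/) {
            # Folded continuation line: append to the previous value.
            $fields[-1][1] .= ' ' . $1;
        }
        elsif ($line =~ /^([^:\s]+):[ \t]*(.*?)[ \t]*$/) {
            push @fields, [$1, $2];
        }
        else {
            die "malformed field line: $line\n";
        }
    }
    return \@fields;
}

# Sample input with made-up values, including one folded line.
my $fields = parse_warc_fields(
      "software: ExampleCrawler/1.0\r\n"
    . "format: WARC File Format 1.0\r\n"
    . "description: a folded\r\n"
    . "  value\r\n"
);
print "$_->[0] => $_->[1]\n" for @$fields;
```

Keeping the pairs in an array rather than a hash is the point of the sketch: a plain hash would lose both the original ordering and any repeated field names.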
WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.
Access module for the common CDX WARC index format.
Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM is the planned backend because it is included with Perl, and its 1008-byte record limit should be only a minor problem that can be worked around by storing URL prefixes and splitting records.
Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.
Simple in-memory index module for small-scale applications that need index support but want to avoid requiring additional files beyond the WARC volume itself. This reads an entire WARC volume to build and attach an index.
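As a rough illustration of what "reading an entire WARC volume to build an index" involves, the sketch below scans an uncompressed WARC stream and records the byte offset of each record under its WARC-Target-URI. The function name build_offset_index is hypothetical, and the sketch assumes an uncompressed, well-formed volume; it is not the module's actual code and it ignores compressed volumes and folded header lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's actual code): scan an uncompressed
# WARC file handle and return a hash reference mapping each WARC-Target-URI
# to the list of byte offsets at which matching records start.
sub build_offset_index {
    my ($fh) = @_;
    binmode $fh;
    local $/ = "\r\n";    # WARC header lines end in CRLF
    my %index;
    while (1) {
        my $offset  = tell $fh;
        my $version = <$fh>;
        last unless defined $version;    # clean end of file
        chomp $version;
        die "bad version line at offset $offset\n"
            unless $version =~ m{^WARC/\d+\.\d+$};

        my %header;
        while (defined(my $line = <$fh>)) {
            chomp $line;
            last if $line eq '';         # blank line ends the header block
            next unless $line =~ /^([^:\s]+):[ \t]*(.*)$/;
            $header{lc $1} = $2;
        }

        my $length = $header{'content-length'};
        die "missing Content-Length at offset $offset\n"
            unless defined $length;

        # Skip the record block plus the two CRLFs that close every record.
        seek $fh, $length + 4, 1 or die "seek failed: $!\n";

        push @{ $index{ $header{'warc-target-uri'} } }, $offset
            if defined $header{'warc-target-uri'};
    }
    return \%index;
}

# Demonstration on an in-memory volume holding two copies of one record.
# Only the headers this sketch reads are included; real records carry more.
my $body   = "hello";
my $record = "WARC/1.0\r\n"
    . "WARC-Type: resource\r\n"
    . "WARC-Target-URI: http://example.com/\r\n"
    . "Content-Length: " . length($body) . "\r\n"
    . "\r\n" . $body . "\r\n\r\n";
my $data = $record x 2;

open my $fh, '<', \$data or die "open: $!";
my $index = build_offset_index($fh);
print "http://example.com/ found at offsets: ",
    join(', ', @{ $index->{'http://example.com/'} }), "\n";
```

The recorded offsets are the kind of value that could later be handed to a call such as the record_at method shown in the synopsis, which is why the whole volume must be read once before lookups are possible.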
The WARC::Builder class provides a means to write new WARC files.
WARC::Index::Builder is the base class for the index-building tools.
The WARC::Index::File::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after-the-fact by scanning an existing WARC file.
The index-building support that WARC::Index provides uses one of these classes for the actual work.
Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
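For readers who need to decode such values in the meantime, the following minimal sketch shows RFC 2047 "encoded-words" being decoded with the MIME-Header encoding from Perl's core Encode module. The field name and value are made up for illustration, and nothing here is taken from the WARC modules; whether the eventual implementation will use Encode is not stated.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A made-up field line whose value contains an RFC 2047 encoded-word
# (charset UTF-8, base64 "B" encoding).
my $raw = 'Title: =?UTF-8?B?Q2Fmw6k=?= archive';

# The "MIME-Header" encoding is provided by Encode::MIME::Header, which
# ships with the core Encode distribution. It returns a Perl character
# string with the encoded-word replaced by its decoded text.
my $decoded = decode('MIME-Header', $raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```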
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.
The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.
Formats planned for eventual inclusion include MAFF described at http://maf.mozdev.org/maff-specification.html and the MHTML format defined in RFC 2557.
Jacob Bachmeyer, <firstname.lastname@example.org>
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.