WARC - Web ARChive support for Perl
    use WARC;

    $collection = assemble WARC::Collection (@indexes);
    $record = $collection->search(url => $url, time => $when);

    $volume = mount WARC::Volume ($filename);
    $record = $volume->first_record;
    $next_record = $record->next;

    $record = $volume->record_at($offset);

    # $record is a WARC::Record object
The WARC module is a convenience module for loading basic WARC support. After loading this module, the
WARC::Volume and WARC::Collection classes are available.
A WARC::Collection object represents a set of indexed WARC files.
A WARC::Volume object represents a single WARC file.
Each record in a WARC volume is analogous to an
HTTP::Message, with headers specific to the WARC format.
Planned support for tied filehandles reading WARC payloads.
Planned support class for handling WARC segmentation.
A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message.
The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set, and their order is preserved.
The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
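To make the key-value format concrete, the following is a minimal sketch of reading "application/warc-fields" text while preserving field order. It does not use WARC::Fields and makes no claim about that class's interface. The helper name parse_warc_fields and the sample field values are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of WARC::Fields: parse
# "application/warc-fields" text into an ordered list of
# [ name, value ] pairs.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';                     # a blank line ends the block
        if ($line =~ /^[ \t]+(.*)$/ and @fields) {
            $fields[-1][1] .= " $1";             # folded continuation line
        }
        elsif ($line =~ /^([^:\s]+):[ \t]*(.*)$/) {
            push @fields, [ $1, $2 ];            # order of appearance is kept
        }
        else {
            die "malformed field line: $line\n";
        }
    }
    return @fields;
}

# Sample block with made-up values, as might appear in a "warcinfo" record.
my $block = "software: ExampleCrawler/1.0\r\n"
          . "format: WARC File Format 1.0\r\n"
          . "description: first line\r\n"
          . "  continued here\r\n";

printf "%s => %s\n", @$_ for parse_warc_fields($block);
```

An array of pairs is used instead of a hash so that both field order and repeated field names survive, which is the property that distinguishes this format from the handling in HTTP::Headers.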
WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.
Access module for the common CDX WARC index format.
Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM is the planned backend because it is included with Perl, and its 1008-byte record limit should be only a minor problem if URL prefixes are stored and records are split.
Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.
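The record-splitting idea mentioned for the SDBM-based format can be sketched as follows. This is only an illustration of one possible approach, not the planned module's design. The key layout ("$key\0$n"), the chunk size, and the helper names store_split and fetch_split are all assumptions. The sketch also assumes keys are short; long URLs would additionally need the prefix scheme mentioned above.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

# Assumed chunk size: small enough that a short key plus one chunk
# stays under the 1008-byte SDBM limit.
use constant CHUNK => 512;

# Hypothetical helper: store $value under $key as several entries.
# "$key" holds the chunk count and "$key\0$n" holds chunk number $n.
sub store_split {
    my ($db, $key, $value) = @_;
    my @chunks = unpack '(a' . CHUNK . ')*', $value;
    $db->{$key} = scalar @chunks;
    $db->{"$key\0$_"} = $chunks[$_] for 0 .. $#chunks;
    return scalar @chunks;
}

# Hypothetical helper: reassemble a value written by store_split.
sub fetch_split {
    my ($db, $key) = @_;
    my $count = $db->{$key};
    return undef unless defined $count;
    return join '', map { $db->{"$key\0$_"} } 0 .. $count - 1;
}

my $dir = tempdir(CLEANUP => 1);
tie my %db, 'SDBM_File', "$dir/index", O_RDWR | O_CREAT, 0666
    or die "cannot tie SDBM file: $!";

# A value far larger than a single SDBM record can hold.
my $long = 'x' x 3000;
store_split(\%db, 'example.com/', $long);
print length(fetch_split(\%db, 'example.com/')), "\n";

untie %db;
```

Because the helpers take a plain hash reference, they behave the same with an ordinary in-memory hash, which makes the splitting logic easy to check without touching the disk.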
The WARC::Builder class provides a means to write new WARC files.
WARC::Index::Builder is the base class for the index-building tools.
The WARC::Index::File::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after the fact by scanning an existing WARC file.
WARC::Index uses one of these classes for the actual work.
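As a rough illustration of what building an index after the fact by scanning involves, the sketch below walks an uncompressed WARC stream and collects the byte offset of each record. It is not the implementation used by the builder classes and does not use their interfaces. The function name scan_warc is hypothetical, folded header lines are ignored, and the sample records are abbreviated (they omit fields such as WARC-Record-ID and WARC-Date that real records carry).

```perl
use strict;
use warnings;

# Hypothetical helper: scan an uncompressed WARC stream and return one
# [ offset, WARC-Type, WARC-Target-URI ] entry per record.
sub scan_warc {
    my ($fh) = @_;
    binmode $fh;
    local $/ = "\r\n";                           # header lines end in CRLF
    my @entries;
    until (eof($fh)) {
        my $offset  = tell($fh);
        my $version = <$fh>;
        $version =~ m{^WARC/\d+\.\d+\r\n$}
            or die "no WARC version line at offset $offset\n";
        my %header;
        while (defined(my $line = <$fh>)) {
            last if $line eq "\r\n";             # blank line ends the header
            $header{lc $1} = $2
                if $line =~ /^([^:\s]+):[ \t]*(.*?)\r\n$/s;
        }
        my $length = $header{'content-length'};
        defined $length && $length =~ /^\d+$/
            or die "record at offset $offset lacks a valid Content-Length\n";
        # Skip the record block and the two CRLFs that close the record.
        seek($fh, $length + 4, 1) or die "seek failed: $!\n";
        push @entries,
            [ $offset, $header{'warc-type'}, $header{'warc-target-uri'} ];
    }
    return @entries;
}

# Two abbreviated sample records held in memory.
my $info = "software: example\r\n";
my $warc = join '',
    "WARC/1.0\r\n",
    "WARC-Type: warcinfo\r\n",
    "Content-Type: application/warc-fields\r\n",
    "Content-Length: " . length($info) . "\r\n",
    "\r\n", $info, "\r\n\r\n",
    "WARC/1.0\r\n",
    "WARC-Type: resource\r\n",
    "WARC-Target-URI: http://example.com/\r\n",
    "Content-Length: 5\r\n",
    "\r\n", "hello", "\r\n\r\n";

open my $fh, '<', \$warc or die "cannot open in-memory file: $!";
for my $entry (scan_warc($fh)) {
    my ($offset, $type, $uri) = @$entry;
    printf "%6d %-9s %s\n", $offset, $type, defined $uri ? $uri : '-';
}
```

The offsets collected this way are the kind of value that the record_at method shown in the synopsis accepts.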
Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the
WARC::Collection interface to find the next segment in a different WARC file. The
WARC::Volume interface is only usable for access within one WARC file.
The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as
WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.
Formats planned for eventual inclusion include MAFF described at http://maf.mozdev.org/maff-specification.html and the MHTML format defined in RFC 2557.
Jacob Bachmeyer, <email@example.com>
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.