WARC - Web ARChive support for Perl
    use WARC;

    $collection = assemble WARC::Collection (@indexes);
    $record = $collection->search(url => $url, time => $when);

    $volume = mount WARC::Volume ($filename);
    $record = $volume->first_record;
    $next_record = $record->next;
    $record = $volume->record_at($offset);

    # $record is a WARC::Record object
The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.
A WARC::Collection object represents a set of indexed WARC files.
A WARC::Volume object represents a single WARC file.
Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.
Planned support for tied filehandles reading WARC payloads.
Planned support class for handling WARC segmentation.
A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved.
The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
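The order-preserving behavior described above can be illustrated with a minimal plain-Perl sketch. This is not the WARC::Fields interface; the function name parse_warc_fields and the sample field values are hypothetical, and joining folded continuation lines with a single space is an assumption made for the example.

```perl
use strict;
use warnings;

# Parse "Name: value" lines into an ordered list of [name, value] pairs.
# An array of pairs (rather than a hash) keeps the original field order.
# Plain-Perl illustration only; not the WARC::Fields interface.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';                     # a blank line ends the block
        if ($line =~ /^[ \t]+(.*)$/ and @fields) {
            $fields[-1][1] .= ' ' . $1;          # folded continuation line
        } elsif ($line =~ /^([^:\s]+):[ \t]*(.*?)[ \t]*$/) {
            push @fields, [$1, $2];
        } else {
            die "malformed field line: $line\n";
        }
    }
    return \@fields;
}

# Hypothetical "warcinfo"-style content.
my $fields = parse_warc_fields(
    "software: ExampleCrawler/1.0\r\n"
  . "format: WARC File Format 1.0\r\n"
  . "description: a test crawl\r\n"
  . " of example pages\r\n");

print "$_->[0] => $_->[1]\n" for @$fields;
```

Storing pairs in an array is what allows repeated field names and the original ordering to survive a round trip, which a plain hash would not.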
WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.
Access module for the common CDX WARC index format.
Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM is the planned backend because it is included with Perl, and its 1008-byte record limit should be only a minor problem if URL prefixes are stored and records are split.
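As a rough illustration of how URL prefixes and record splitting could keep entries under the SDBM pair limit, here is a sketch using only core modules. The key layout, the size constants, and the helper names store_entry and fetch_entry are all hypothetical; the planned index format is not specified here and may differ entirely.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

# Hypothetical sizes, chosen so that key plus value stays well under the
# 1008-byte SDBM pair limit.
use constant MAX_PREFIX => 128;
use constant MAX_CHUNK  => 256;

# Store $value under a prefix of $url, split into numbered chunks.
# The chunk count is kept under a separate "<prefix>\0n" key.
sub store_entry {
    my ($db, $url, $value) = @_;
    my $prefix = substr $url, 0, MAX_PREFIX;
    my @chunks = unpack '(a' . MAX_CHUNK . ')*', $value;
    $db->{"$prefix\0n"} = scalar @chunks;
    $db->{"$prefix\0$_"} = $chunks[$_] for 0 .. $#chunks;
    return scalar @chunks;
}

# Reassemble the chunks stored for $url; returns undef if nothing is stored.
sub fetch_entry {
    my ($db, $url) = @_;
    my $prefix = substr $url, 0, MAX_PREFIX;
    my $count = $db->{"$prefix\0n"};
    return unless defined $count;
    return join '', map { $db->{"$prefix\0$_"} } 0 .. $count - 1;
}

my $dir = tempdir(CLEANUP => 1);
tie my %index, 'SDBM_File', "$dir/fastindex", O_RDWR | O_CREAT, 0666
    or die "cannot open SDBM file: $!";

# Made-up index data, long enough to need several chunks.
my $data = join "\n", map { "hypothetical index line $_ " . ('x' x 80) } 1 .. 10;
my $chunks = store_entry(\%index, 'http://example.com/page', $data);
print "stored in $chunks chunks\n";
print "round trip ok\n"
    if fetch_entry(\%index, 'http://example.com/page') eq $data;
untie %index;
```

Note that truncating URLs to a prefix means distinct URLs can share a key; a real index would need to store enough information in the value to tell them apart.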
Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.
The WARC::Builder class provides a means to write new WARC files.
WARC::Index::Builder is the base class for the index-building tools.
The WARC::Index::File::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after-the-fact by scanning an existing WARC file.
The build constructor that WARC::Index provides uses one of these classes for the actual work.
Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
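Until that support exists, a caller could decode encoded-words in a header value separately. The sketch below uses the "MIME-Header" encoding from the core Encode module; the helper name decode_field_value and the sample values are hypothetical, and nothing here is part of the WARC modules.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode any RFC 2047 encoded-words in a header field value using the
# MIME-Header encoding that ships with the core Encode module.
# Text without encoded-words passes through unchanged.
sub decode_field_value {
    my ($raw) = @_;
    return decode('MIME-Header', $raw);
}

binmode STDOUT, ':encoding(UTF-8)';

# The same word in "Q" (quoted-printable style) and "B" (base64) forms.
print decode_field_value('=?UTF-8?Q?caf=C3=A9?='), "\n";
print decode_field_value('=?UTF-8?B?Y2Fmw6k=?='), "\n";
```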
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.
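To show why a collection-level view is needed, the following sketch reassembles a segmented record from plain hashes that stand in for records found in different files. The header names WARC-Segment-Number, WARC-Segment-Origin-ID, and WARC-Segment-Total-Length come from the WARC specification; the record layout, the function name reassemble, and the sample data are hypothetical and do not reflect the planned interface.

```perl
use strict;
use warnings;

# Each hypothetical record is a plain hash holding the file it came from,
# its WARC headers, and its content block.  Real code would locate these
# records through an index spanning several WARC files.
sub reassemble {
    my ($origin_id, @records) = @_;

    # Select the first segment (matched by its own record ID) and all
    # continuation segments (matched by their origin ID), in segment order.
    my @segments = sort {
        $a->{headers}{'WARC-Segment-Number'}
            <=> $b->{headers}{'WARC-Segment-Number'}
    } grep {
        my $h = $_->{headers};
        ($h->{'WARC-Record-ID'} // '') eq $origin_id
            or ($h->{'WARC-Segment-Origin-ID'} // '') eq $origin_id
    } @records;
    die "no segments found\n" unless @segments;

    # Segment numbers must run 1..N with no gaps.
    for my $i (0 .. $#segments) {
        die "missing segment " . ($i + 1) . "\n"
            unless $segments[$i]{headers}{'WARC-Segment-Number'} == $i + 1;
    }

    my $block = join '', map { $_->{block} } @segments;
    my $total = $segments[-1]{headers}{'WARC-Segment-Total-Length'};
    die "length mismatch\n" if defined $total and $total != length $block;
    return $block;
}

# Two segments of one logical record, stored in different files.
my @records = (
    { file => 'part2.warc', block => 'world',
      headers => { 'WARC-Record-ID'            => '<urn:example:2>',
                   'WARC-Segment-Origin-ID'    => '<urn:example:1>',
                   'WARC-Segment-Number'       => 2,
                   'WARC-Segment-Total-Length' => 11 } },
    { file => 'part1.warc', block => 'hello ',
      headers => { 'WARC-Record-ID'      => '<urn:example:1>',
                   'WARC-Segment-Number' => 1 } },
);

print reassemble('<urn:example:1>', @records), "\n";
```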
The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.
Formats planned for eventual inclusion include MAFF, described at http://maf.mozdev.org/maff-specification.html, and the MHTML format defined in RFC 2557.
Jacob Bachmeyer, <jcb@cpan.org>
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.