WARC - Web ARChive support for Perl
    use WARC;

    $collection = assemble WARC::Collection (@indexes);
    $record = $collection->search(url => $url, time => $when);

    $volume = mount WARC::Volume ($filename);
    $record = $volume->first_record;
    $next_record = $record->next;
    $record = $volume->record_at($offset);

    # $record is a WARC::Record object
The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.
A WARC::Collection object represents a set of indexed WARC files.
A WARC::Volume object represents a single WARC file.
Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.
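As an illustration of how a volume and its records relate, the following sketch walks every record in a volume using only the first_record and next methods shown in the synopsis. It assumes that next returns undef after the last record, which the text above does not state. count_records is a hypothetical helper, not part of the distribution, and example.warc.gz is a made-up file name.

```perl
use strict;
use warnings;

# Hypothetical helper: count the records in a mounted volume.
# Assumes $volume provides first_record and each record provides next,
# as in the synopsis, and that next returns undef at the end of the
# volume (an assumption, not confirmed by the documentation above).
sub count_records {
    my ($volume) = @_;
    my $count = 0;
    for (my $record = $volume->first_record;
         defined $record;
         $record = $record->next) {
        $count++;
    }
    return $count;
}

# Usage, assuming the WARC distribution is installed and the
# hypothetical file example.warc.gz exists:
#   use WARC;
#   my $volume = mount WARC::Volume ('example.warc.gz');
#   print count_records($volume), " records\n";
```

Because the helper relies only on method names, it works with any object that offers the same two methods.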
Support class for WARC records that span multiple segments.
Planned support for tied filehandles reading WARC payloads.
A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved.
The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
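To make the key-value format concrete, here is a minimal sketch of a parser for "Name: value" lines that keeps fields in their original order. It is not the WARC::Fields implementation. parse_warc_fields is a hypothetical helper, the sample field values are invented, and the handling of continuation lines (a line starting with whitespace extends the previous value) is a simplifying assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not the WARC::Fields implementation: parse
# "Name: value" lines into an ordered list of [name, value] pairs.
# Assumes CRLF or LF line endings and that a line starting with
# whitespace continues the previous value.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';                        # blank line ends the block
        if ($line =~ /^[ \t]+(.*)$/ && @fields) {   # continuation line
            $fields[-1][1] .= ' ' . $1;
        } elsif ($line =~ /^([^:\s]+):[ \t]*(.*?)[ \t]*$/) {
            push @fields, [$1, $2];                 # new field, order preserved
        } else {
            die "malformed field line: $line\n";
        }
    }
    return \@fields;
}

# Invented sample input in the style of a "warcinfo" record body.
my $fields = parse_warc_fields(
    "software: ExampleCrawler/1.0\r\n" .
    "format: WARC File Format 1.0\r\n"
);
printf "%s = %s\n", @$_ for @$fields;
```

An array of pairs is used rather than a hash so that field order and repeated field names survive parsing.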
WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.
Access module for the common CDX WARC index format.
Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM is the planned backend because it is included with Perl; its 1008-byte record limit should be only a minor problem, to be worked around by storing URL prefixes and splitting records.
Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.
Simple in-memory index module for small-scale applications that need index support but want to avoid requiring additional files beyond the WARC volume itself. This reads an entire WARC volume to build and attach an index.
The WARC::Builder class provides a means to write new WARC files.
WARC::Index::Builder is the base class for the index-building tools.
The WARC::Index::File::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after-the-fact by scanning an existing WARC file.
The build constructor that WARC::Index provides uses one of these classes for the actual work.
Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
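Until the module handles encoded-words itself, a caller could decode such a header value with Perl's core Encode module. The following is a sketch of that workaround, not part of the WARC distribution, and the sample value is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A made-up header value carrying "café" as an RFC 2047 encoded-word
# (UTF-8 charset, Q encoding).
my $raw = '=?UTF-8?Q?caf=C3=A9?=';

# The core Encode module provides a "MIME-Header" encoding that
# decodes encoded-words into a Perl character string.
my $decoded = decode('MIME-Header', $raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```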
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.
The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.
Formats planned for eventual inclusion include MAFF, described at http://maf.mozdev.org/maff-specification.html, and the MHTML format defined in RFC 2557.
Jacob Bachmeyer, <jcb@cpan.org>
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.