NAME

WARC - Web ARChive support for Perl

SYNOPSIS

  use WARC;

  $collection = assemble WARC::Collection (@indexes);

  $record = $collection->search(url => $url, time => $when);

  $volume = mount WARC::Volume ($filename);

  $record = $volume->first_record;
  $next_record = $record->next;

  $record = $volume->record_at($offset);

  # $record is a WARC::Record object

DESCRIPTION

The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.

Overview of the WARC reader support modules

WARC::Collection

A WARC::Collection object represents a set of indexed WARC files.

WARC::Volume

A WARC::Volume object represents a single WARC file.

WARC::Record

Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.

WARC::Record::Logical

Support class for WARC records that span multiple segments.

WARC::Record::Payload

Planned support for tied filehandles reading WARC payloads.

WARC::Fields

A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers are drawn from a different set, and their order is preserved.

The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
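To make the order-preserving key-value format concrete, here is a minimal sketch in plain Perl that reads such text into an ordered list of name/value pairs. It is an illustration only and is not the WARC::Fields implementation; the helper name parse_warc_fields and the sample field values are hypothetical, and the handling of folded continuation lines is a simplifying assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the WARC distribution): parse
# "application/warc-fields" text into an ordered list of [name, value] pairs.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';                      # a blank line ends the block
        if ($line =~ /^[ \t]+(.*)$/ && @fields) {
            # Continuation line: append to the previous field's value.
            $fields[-1][1] .= ' ' . $1;
        }
        elsif ($line =~ /^([^:\s]+):[ \t]*(.*?)[ \t]*$/) {
            push @fields, [$1, $2];               # order of appearance is kept
        }
        else {
            die "malformed field line: $line\n";
        }
    }
    return \@fields;
}

# Sample input with made-up values, including one folded line.
my $fields = parse_warc_fields(
    "software: ExampleCrawler/1.0\r\n"
  . "format: WARC File Format 1.0\r\n"
  . "description: a long value\r\n"
  . "  continued on the next line\r\n"
);

printf "%s => %s\n", @$_ for @$fields;
```

An array of pairs is used rather than a hash because a hash would lose both the original ordering and any repeated field names.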

WARC::Index

WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.

WARC::Index::Entry

WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.

WARC::Index::File::CDX

Access module for the common CDX WARC index format.
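For readers unfamiliar with CDX, the following sketch shows the general shape of such an index in plain Perl. It assumes the commonly seen layout in which the first line is a legend: a delimiter character, the word CDX, and then one letter per column. The helper name parse_cdx and the sample entry are hypothetical, and this is not how WARC::Index::File::CDX is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: turn CDX text into a list of hashes keyed by the
# single-letter column names taken from the legend line.
sub parse_cdx {
    my ($text) = @_;
    my @lines = grep { length } split /\r?\n/, $text;
    my $header = shift @lines;
    die "empty CDX input\n" unless defined $header;

    # The first character of the legend is the field delimiter.
    my $delim = substr($header, 0, 1);
    my ($magic, @letters) = split /\Q$delim\E/, substr($header, 1);
    die "missing CDX legend line\n"
        unless defined $magic && $magic eq 'CDX';

    my @entries;
    for my $line (@lines) {
        my %entry;
        @entry{@letters} = split /\Q$delim\E/, $line;
        push @entries, \%entry;
    }
    return \@entries;
}

# Made-up sample. In the usual legend, "a" is the original URL, "b" the
# capture date, "V" the offset within the archive file, and "g" the file name.
my $entries = parse_cdx(
    " CDX N b a m s k r M S V g\n"
  . "org,example)/ 20190101000000 http://example.org/ text/html 200 "
  . "SHA1EXAMPLE - - 1234 5678 example.warc.gz\n"
);

print "$_->{a} captured $_->{b}: $_->{g} at offset $_->{V}\n" for @$entries;
```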

WARC::Index::File::SDBM

Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM is the planned backend because it is included with Perl; its 1008-byte record limit should be only a minor problem, which can be worked around by storing URL prefixes and splitting records.
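As a rough illustration of the record-splitting idea, the sketch below stores a value longer than the SDBM limit as several numbered chunks and reassembles it on read. The chunk key scheme (the key, a NUL byte, then a chunk number) and the helper names store_split and fetch_split are assumptions made for this example; the on-disk layout of the planned module is not specified here.

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# SDBM limits the combined length of a key and its value to 1008 bytes.
use constant PAIR_LIMIT => 1008;

# Hypothetical helper: store $value as chunks under "$key\0<n>".
sub store_split {
    my ($db, $key, $value) = @_;
    my ($n, $offset) = (0, 0);
    do {
        my $chunk_key = "$key\0$n";
        my $room = PAIR_LIMIT - length($chunk_key);
        die "key too long to store\n" if $room < 1;
        $db->{$chunk_key} = substr($value, $offset, $room);
        $offset += $room;
        $n++;
    } while ($offset < length($value));
    return $n;    # number of chunks written
}

# Hypothetical helper: concatenate chunks until a number is missing.
sub fetch_split {
    my ($db, $key) = @_;
    my ($n, $value) = (0, '');
    while (defined(my $chunk = $db->{"$key\0$n"})) {
        $value .= $chunk;
        $n++;
    }
    return $n ? $value : undef;
}

my $dir = tempdir(CLEANUP => 1);
tie(my %db, 'SDBM_File', "$dir/index", O_RDWR | O_CREAT, 0666)
    or die "cannot open SDBM file: $!\n";

my $long   = 'x' x 3000;    # too large for a single SDBM record
my $chunks = store_split(\%db, 'org,example)/', $long);
my $back   = fetch_split(\%db, 'org,example)/');
print "stored in $chunks chunks, round trip ",
    ($back eq $long ? 'ok' : 'failed'), "\n";

untie %db;
```

This sketch does not delete stale chunks when a key is rewritten with a shorter value; a real index format would need to track the chunk count or remove leftovers.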

WARC::Index::File::SQLite

Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.

WARC::Index::Volatile

Simple in-memory index module for small-scale applications that need index support but want to avoid requiring additional files beyond the WARC volume itself. This module reads an entire WARC volume to build and attach an index.

Overview of the WARC writer support modules

WARC::Builder

The WARC::Builder class provides a means to write new WARC files.

WARC::Index::Builder

WARC::Index::Builder is the base class for the index-building tools.

WARC::Index::File::CDX::Builder
WARC::Index::File::SDBM::Builder
WARC::Index::File::SQLite::Builder

The WARC::Index::File::*::Builder classes provide tools for building indexes, either incrementally while writing the corresponding WARC file or after the fact by scanning an existing WARC file.

The build constructor that WARC::Index provides uses one of these classes for the actual work.

CAVEATS

Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.

Support for WARC record segmentation is planned but not yet implemented.

Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.

The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.

Formats planned for eventual inclusion include MAFF, described at http://maf.mozdev.org/maff-specification.html, and the MHTML format defined in RFC 2557.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

Information about the WARC format at http://bibnum.bnf.fr/WARC/.

An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.

# TODO: add relevant RFCs.

The POD pages for the modules mentioned in the overview lists.

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.