NAME

WARC - Web ARChive support for Perl

SYNOPSIS

  use WARC;

  $collection = assemble WARC::Collection (@indexes);

  $record = $collection->search(url => $url, time => $when);

  $volume = mount WARC::Volume ($filename);

  $record = $volume->first_record;
  $next_record = $record->next;

  $record = $volume->record_at($offset);

  # $record is a WARC::Record object

DESCRIPTION

The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.

Overview of the WARC reader support modules

WARC::Collection

A WARC::Collection object represents a set of indexed WARC files.

WARC::Volume

A WARC::Volume object represents a single WARC file.

WARC::Record

Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.

WARC::Record::Payload

Planned support for tied filehandles reading WARC payloads.

WARC::Record::Segment

Planned support class for handling WARC segmentation.

WARC::Fields

A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set, and their order is preserved.

The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
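
To illustrate the key-value format and why order preservation matters, here is a minimal sketch in plain Perl. It does not use or reproduce the WARC::Fields interface; the helper name parse_warc_fields and the sample field values are hypothetical, and the sketch assumes simple "name: value" lines with whitespace-led continuation lines.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of the WARC distribution: parse a block of
# "application/warc-fields" text into an ordered list of [name, value] pairs.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';    # a blank line ends the block
        if ($line =~ /^[ \t]+(.*)$/ && @fields) {
            # Continuation line: append to the previous field's value.
            $fields[-1][1] .= ' ' . $1;
        }
        elsif ($line =~ /^([^:\s]+):[ \t]*(.*)$/) {
            push @fields, [$1, $2];
        }
        else {
            die "malformed field line: $line\n";
        }
    }
    return \@fields;
}

# Invented sample data in the style of a "warcinfo" record body.
my $text = join "\r\n",
    'software: ExampleCrawler/1.0',
    'format: WARC File Format 1.0',
    'description: a small',
    '  test crawl',
    '';

my $fields = parse_warc_fields($text);
print "$_->[0] => $_->[1]\n" for @$fields;
```

An array of pairs is used here instead of a hash because a plain Perl hash would lose both the original order and any repeated field names.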

WARC::Index

WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.

WARC::Index::Entry

WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.

WARC::Index::File::CDX

Access module for the common CDX WARC index format.

WARC::Index::File::SDBM

Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM was chosen because it is included with Perl, and its 1008-byte record limit should be only a minor problem that can be worked around by storing URL prefixes and splitting records.
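
As a rough illustration of the idea, the following sketch stores URL/timestamp keys in an SDBM file using only core modules. The key and value layout, the add_entry helper, and the file names are all hypothetical and say nothing about the format this module will eventually use. The size check assumes the limit applies to key and value combined, and to byte lengths.

```perl
use strict;
use warnings;
use Fcntl;                       # provides O_RDWR and O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
tie my %index, 'SDBM_File', "$dir/index", O_RDWR | O_CREAT, 0666
    or die "cannot tie SDBM file: $!";

# Hypothetical layout: key is "URL timestamp", value is "file offset".
sub add_entry {
    my ($url, $time, $file, $offset) = @_;
    my ($key, $value) = ("$url $time", "$file $offset");
    # Refuse oversized pairs up front (ASCII assumed, so length == bytes).
    die "entry too large for SDBM\n"
        if length($key) + length($value) > 1008;
    $index{$key} = $value;
}

add_entry('http://example.com/', '20190101000000', 'crawl-0001.warc.gz', 1024);

my ($file, $offset) = split / /, $index{'http://example.com/ 20190101000000'};
print "$file at offset $offset\n";

untie %index;
```

A real index would need a fallback for long URLs, such as the prefix storage and record splitting mentioned above, instead of simply rejecting the entry.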

WARC::Index::File::SQLite

Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.

Overview of the WARC writer support modules

WARC::Builder

The WARC::Builder class provides a means to write new WARC files.

WARC::Index::Builder

WARC::Index::Builder is the base class for the index-building tools.

WARC::Index::File::CDX::Builder
WARC::Index::File::SDBM::Builder
WARC::Index::File::SQLite::Builder

The WARC::Index::File::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after-the-fact by scanning an existing WARC file.

The build constructor that WARC::Index provides uses one of these classes for the actual work.

CAVEATS

Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
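
Until that support exists, callers who meet an encoded-word in a header value could decode it themselves. The sketch below uses the "MIME-Header" encoding from the core Encode module; it is not part of the WARC distribution, and the header line shown is an invented example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Invented example: a header value carrying an RFC 2047 encoded-word
# (UTF-8 charset, Base64 transfer encoding).
my $raw = 'WARC-Filename: =?UTF-8?B?Y2Fmw6kud2FyYw==?=';

my ($name, $value) = split /:\s*/, $raw, 2;

# Encode's "MIME-Header" encoding turns encoded-words into Perl characters.
my $decoded = decode('MIME-Header', $value);

# Re-encode as UTF-8 bytes for output.
print "$name: ", encode('UTF-8', $decoded), "\n";
```

Values that contain no encoded-words pass through this decoding unchanged, so it can be applied to every header value without first testing for the "=?" marker.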

Support for WARC record segmentation is planned but not yet implemented.

Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.

The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.

Formats planned for eventual inclusion include MAFF, described at http://maf.mozdev.org/maff-specification.html, and the MHTML format defined in RFC 2557.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

Information about the WARC format at http://bibnum.bnf.fr/WARC/.

An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.

The POD pages for the modules mentioned in the overview lists.

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.