WARC - Web ARChive support for Perl


  use WARC;

  $collection = assemble WARC::Collection (@indexes);

  $record = $collection->search(url => $url, time => $when);

  $volume = mount WARC::Volume ($filename);

  $record = $volume->first_record;
  $next_record = $record->next;

  $record = $volume->record_at($offset);

  # $record is a WARC::Record object


The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.

Overview of the WARC reader support modules


A WARC::Collection object represents a set of indexed WARC files.


A WARC::Volume object represents a single WARC file.


Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.


Support class for WARC records that span multiple segments.


Planned support for tied filehandles reading WARC payloads.


A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved.

The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
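The order-preserving behavior described above can be illustrated with a minimal sketch in plain Perl. This is not the WARC::Fields implementation; the helper names parse_warc_fields and field_values are hypothetical, and the sketch assumes a simple "Name: value" line format with continuation lines that begin with whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WARC::Fields: parse a block of
# "Name: value" lines into an ordered list of [name, value] pairs.
# Repeated names are kept, and the original order is preserved.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';    # a blank line ends the field block
        if (@fields && $line =~ /^[ \t]+(.*)$/) {
            # Continuation line: fold it into the previous value.
            $fields[-1][1] .= ' ' . $1;
        }
        elsif ($line =~ /^([^:\s]+):[ \t]*(.*?)[ \t]*$/) {
            push @fields, [$1, $2];
        }
        else {
            die "malformed field line: $line\n";
        }
    }
    return \@fields;
}

# Hypothetical helper: return every value stored under a name, in
# order of appearance, matching the name case-insensitively.
sub field_values {
    my ($fields, $name) = @_;
    return map { $_->[1] } grep { lc($_->[0]) eq lc($name) } @$fields;
}
```

An array of pairs is used instead of a hash because a hash would lose both the ordering and any repeated field names.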


WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.


WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.


Access module for the common CDX WARC index format.


Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM was chosen because it is included with Perl, and its 1008 byte record limit should be only a minor problem that can be worked around by storing URL prefixes and splitting records.
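One way that record splitting might work is sketched below. This is only an illustration under stated assumptions, not the design of the planned module: the helper names store_split and fetch_split, the NUL-separated chunk keys, and the 1000 byte budget are all hypothetical. Keys and values are assumed to be byte strings. Only Fcntl and SDBM_File, both shipped with Perl, are loaded.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;

# Assumed budget: stay safely below the 1008 byte key+value limit.
use constant PAIR_BUDGET => 1000;

# Hypothetical helper: store $value under $key in the tied hash $db,
# split across numbered chunk keys so that no pair exceeds the budget.
sub store_split {
    my ($db, $key, $value) = @_;
    # Reserve 8 bytes for the chunk suffix (a NUL plus the chunk number).
    my $room = PAIR_BUDGET - length($key) - 8;
    die "key too long to index\n" if $room < 1;
    my @chunks = unpack "(a$room)*", $value;
    $db->{$key} = scalar @chunks;    # head record holds the chunk count
    $db->{"$key\0$_"} = $chunks[$_] for 0 .. $#chunks;
    return scalar @chunks;
}

# Hypothetical helper: reassemble a value written by store_split.
sub fetch_split {
    my ($db, $key) = @_;
    my $count = $db->{$key};
    return unless defined $count;
    return join '', map { $db->{"$key\0$_"} } 0 .. $count - 1;
}
```

A hash tied with tie %db, 'SDBM_File', $path, O_RDWR|O_CREAT, 0666 can be passed by reference as $db. A key that is itself close to the limit cannot be stored this way, which is presumably where storing URL prefixes would come in.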


Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.


Simple in-memory index module for small-scale applications that need index support but want to avoid requiring additional files beyond the WARC volume itself. This reads an entire WARC volume to build and attach an index.
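The idea of reading a whole volume to build an index can be sketched as follows. This is an illustration, not the module's implementation: build_memory_index is a hypothetical helper, and it assumes only that each record object has a next method that returns the following record, or undef at the end of the volume. How the index key is extracted from a record is left to a caller-supplied code reference.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a chain of record objects and group them
# by a caller-supplied key. $first is the first record of a volume;
# $key_of is a code reference that returns the index key for a record
# (for example its target URL), or undef to leave the record out.
sub build_memory_index {
    my ($first, $key_of) = @_;
    my %index;
    for (my $record = $first; defined $record; $record = $record->next) {
        my $key = $key_of->($record);
        next unless defined $key;    # skip records with no usable key
        push @{ $index{$key} }, $record;
    }
    return \%index;
}
```

With the interface shown in the synopsis, $first would be the value returned by $volume->first_record. The cost is one full pass over the volume, which is why this approach suits only small-scale use.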

Overview of the WARC writer support modules


The WARC::Builder class provides a means to write new WARC files.


WARC::Index::Builder is the base class for the index-building tools.


The WARC::Index::File::*::Builder classes provide tools for building indexes, either incrementally while writing the corresponding WARC file or after the fact by scanning an existing WARC file.

The build constructor that WARC::Index provides uses one of these classes for the actual work.


Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
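Until that support exists, a caller could decode such values itself. A minimal sketch follows, using the "MIME-Header" encoding provided by the core Encode module; the helper name decode_field_value is hypothetical and is not part of the WARC modules.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode any RFC 2047 encoded-words found in a
# raw header value and return the result as a Perl character string.
# Text that contains no encoded-words passes through unchanged.
sub decode_field_value {
    my ($raw) = @_;
    return decode('MIME-Header', $raw);
}
```

The returned string holds characters rather than bytes, so it must be encoded again before being written to a byte-oriented filehandle.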

Support for WARC record segmentation is planned but not yet implemented.

Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.

The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.

Formats planned for eventual inclusion include MAFF and the MHTML format defined in RFC 2557.


Jacob Bachmeyer


Information about the WARC format at

An overview of the WARC format at

# TODO: add relevant RFCs.

The POD pages for the modules mentioned in the overview lists.


Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.