The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

WARC::Builder - Web ARChive construction support for Perl

SYNOPSIS

  use WARC::Builder;

  $warcinfo_data = new WARC::Fields (software => 'MyWebCrawler/1.2.3 ...',
                                     format => 'WARC File Format 1.0',
                                     # other fields omitted ...
                                     );

  $warcinfo = new WARC::Record (type => 'warcinfo',
                                content => $warcinfo_data);

  # for a small-scale crawl
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename => $warcfilename);

  # for a large-scale crawl
  $index1 = build WARC::Index::File::CDX (into => $indexprefix.'.cdx');
  $index2 = build WARC::Index::File::SDBM (into => $indexprefix.'.sdbm');
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename_template =>
                                $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz',
                              index => [$index1, $index2]);

  # for each collected object
  $build->append(@records);     # or ...
  $build->append($record1, $record2, ... );

DESCRIPTION

The WARC::Builder class is the high-level interface for writing WARC archives. It is a very simple interface, because, at this level, WARC is a very simple format: a simple sequence of WARC records, which WARC::Builder accepts as WARC::Record objects to append to the in-progress WARC file.

WARC file size limits are handled automatically if configured.

Methods

$build = new WARC::Builder (key => value, ...)

Construct a WARC::Builder object. The following keys are supported:

index => [$index]
index => [$index1, $index2, ...]

If set, must be an array reference of a list of index builder objects. Each newly-added WARC::Record will be presented to all index builder objects in this list.

filename => $warcfilename

If set, create a single WARC file with the given file name. The file name must match m/\.warc(?:\.gz)?$/. The presence of a final ".gz" indicates that the WARC file should be written with per-record gzip compression.

This option is mutually exclusive with the filename_template option.

Using this option inhibits starting a new WARC file and causes the max_file_size option to be ignored. A warning is emitted in this case.

filename_template => $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz'

Establish an sprintf format string to construct file names. The file name produced by the template string must match m/\.warc(?:\.gz)?$/. The presence of a final ".gz" indicates that the WARC file should be written with per-record gzip compression.

The filename_template option gives the format string, while filename_template_vars gives an array reference of named parameters to be used with the format.

If constructing file names in accordance with the IIPC WARC implementation guidelines, this string should be of the form 'PREFIX-%s-%05d-HOSTNAME.warc.gz' where PREFIX is any chosen prefix to name the crawl and HOSTNAME is the name or other identifier for the machine writing the file.

This option is mutually exclusive with the filename option.

filename_template_vars => [qw/timestamp serial/]

Provide the list of parameters to the sprintf call used to produce a WARC filename from the filename_template option.

The available variables are:

serial

A number, incremented each time adding a record causes a new WARC file to be started.

timestamp

A 14-digit timestamp in the YYYYmmddHHMMSS format recommended in the IIPC WARC implementation guidelines. The timestamp is always in UTC. The time used is the time at which the WARC::Builder object was constructed and is constant between WARC files. This should be substituted as a string.

Default [qw/timestamp serial/] in accordance with IIPC guidelines.

first_serial => $count

The initial value of the serial filename variable for this object. Default 0.

max_file_size => $size

Maximum size of a WARC file. A new WARC file is started if appending a record would cause the current file to exceed this length.

The limit can be specified as an exact number of bytes, or a number followed by a size suffix m/[KMG]i?/. The "K", "M", and "G" suffixes indicate base-10 multiples (10**(3*n)), while the "Ki", "Mi", and "Gi" suffixes indicate base-2 multiples (2**(10*n)) widely used in computing.

Default "1G" == 1_000_000_000.

warcinfo => $warcinfo_record

A WARC::Record object of type "warcinfo" that will be written at the start of each WARC file. This record will be cloned and written with a distinct "WARC-Record-ID" as the first record in each WARC file, including the first. As a consequence, it does not require a "WARC-Record-ID" header and any "WARC-Record-ID" given is silently ignored.

Each clone of this record will also have the "WARC-Filename" header added.

Each clone of this record will also have the "WARC-Date" header set to the time at which the WARC::Builder object was constructed.

warcversion => 'WARC/1.0'

Set the version of the WARC format to be written. This string is the first line of each WARC record. It must begin with the prefix 'WARC/' and should be the version from the WARC specification that the crawler follows.

Default "WARC/1.0".

$build->append( $record1, ... )

Add any number of WARC::Record objects to the growing WARC file. If WARC file size limits are configured, and a record would cause the current WARC file to exceed the configured size limits, a new WARC file is opened automatically.

All records passed to a single append call are added to the same WARC file. If a new WARC file is to be started, it will be started before any records are written.

All records passed to a single append call are considered "concurrent" and all subsequent records will have a "WARC-Concurrent-To" header added referencing the first record, if they do not already have a "WARC-Concurrent-To" header. This is a convenience feature for simpler crawlers and is inhibited if any record already has a "WARC-Concurrent-To" header when append is called.

If a WARC::Record passed to this method lacks a "WARC-Record-ID" header, a warning will be emitted using carp(), a UUID will be generated, and a record ID of the form "urn:uuid:UUID" will be assigned. If the record object is read-only, this method will croak() instead.

If a WARC::Record passed to this method lacks any of the "WARC-Date", "WARC-Type", or "Content-Length" headers, this method will croak().

Extension subfield 'sl' in gzip header

The WARC::Builder class can write a "skip length" extension subfield in the GZIP extra header like that produced by GNU Wget. The Wget sources claim that this header is defined by a "WARC standard" but the publically-available document that Wget cites with "conformsTo" in the warcinfo records it writes seems to have no mention of the concept.

The WARC::Record class understands this header and can use it to improve the performance of the next method if it is present.

This extension does not seem to be documented anywhere and the definition used in this library was extracted from the source code of GNU Wget.

subfield ID: ASCII 'sl'

Encode as: pack "a2": (hex) [73 6c]

subfield length: 8 bytes

Encode as: pack "v": 16-bit little-endian integer; (hex) [08 00]

compressed size in bytes

Encode as: pack "V": 32-bit little-endian integer

uncompressed size in bytes

Encode as: pack "V": 32-bit little-endian integer

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

WARC, WARC::Record

International Internet Preservation Consortium (IIPC) WARC implementaion guidelines. https://netpreserve.org/resources/WARC_Guidelines_v1.pdf

RFC1952 "GZIP file format specification version 4.3". P. Deutsch. May 1996.

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.