NAME

WARC::Index::File::CDX - CDX index support for WARC library

SYNOPSIS

  use WARC::Index::File::CDX;

  $index = attach WARC::Index::File::CDX ($cdx_file);

DESCRIPTION

A WARC::Index::File::CDX object represents a CDX index file and provides access to the entries within as WARC::Index::Entry objects, which provide access to the indexed WARC records.

The CDX format is a sequential index format and every search involves a scan over the entire CDX file. This is still useful because CDX files are considerably smaller than the WARC volumes that they index.

The CDX Format

The CDX index format appears to be a simplification of Alexa's DAT format originating at the Internet Archive. The Internet Archive Wayback Machine was originally loaded using the pages collected by the crawler for the Alexa search engine and used Alexa's internal data formats. (This was a very different "Alexa" from the service offered by Amazon as of this writing.) The official list of CDX field codes is described as meaningful for both CDX and DAT files, but some of the codes may only be relevant for the latter.

The CDX index format is a simple line-oriented format similar to an Apache server log, including the use of "-" as a placeholder for an unknown or irrelevant value. CDX indexes often store only response records. The simplicity of the format is fortunate, because its documentation is sparse.

A CDX file begins with the single character that will be used as field delimiter throughout the file. This is normally a space, ASCII 2/0 SP, but is not actually required to be so. The field delimiter is followed by the magic string "CDX", a field delimiter, and the list of field codes, using the delimiter to separate the elements, and continuing until the first newline, ASCII 0/10 LF. CDX files always use Unix-style line endings, consisting of a single ASCII 0/10 LF character.

In practice, the CDX field delimiter probably must be chosen from 7-bit ASCII, and some implementations incorrectly assume that the delimiter is always ASCII horizontal whitespace.
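
As an illustration of the header layout described above, the following sketch splits a header line into its delimiter and field codes, then maps the fields of one data line onto those codes. The helper names parse_cdx_header and parse_cdx_line and the sample lines are hypothetical; they are not part of this library and do not show how it reads an index internally.

```perl
use strict;
use warnings;

# Hypothetical helper: split a CDX header line into delimiter and field codes.
sub parse_cdx_header {
    my ($line) = @_;
    $line =~ s/\n\z//;                   # CDX lines end with a single LF
    my $delim = substr $line, 0, 1;      # the first character is the delimiter
    my ($magic, @codes) = split /\Q$delim\E/, substr($line, 1);
    die "not a CDX header\n" unless defined $magic && $magic eq 'CDX';
    return ($delim, @codes);
}

# Hypothetical helper: map the values of one data line onto the field codes.
# The "-" placeholder is turned into undef.
sub parse_cdx_line {
    my ($delim, $codes, $line) = @_;
    $line =~ s/\n\z//;
    my @values = split /\Q$delim\E/, $line, -1;
    my %entry;
    @entry{@$codes} = map { $_ eq '-' ? undef : $_ } @values;
    return \%entry;
}

# Made-up sample data, for illustration only.
my ($delim, @codes) = parse_cdx_header(" CDX a b m g V\n");
my $entry = parse_cdx_line($delim, \@codes,
    "http://example.com/ 20200101000000 - sample.warc.gz 0\n");
```

Quoting the delimiter with \Q...\E keeps the sketch from assuming that the delimiter is whitespace, in line with the note above.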

This library supports the following CDX field codes, with each item in the list showing the level of support, the field code letter, the description of the field, a fat comma, and the key(s) derived in this implementation from the field value. The level of support is indicated with "[RW]" for fields both read and written, "[ W]" for fields supported when building an index but not used when reading an entry, and "[R ]" for fields used when reading an entry but not produced by this implementation.

[RW] a "original url" => url

The URL that was used in the request that produced this response.

The url_prefix search key matches a prefix of this value.

This value is copied from the WARC-Target-URI header and is written as "-" if the record does not have that header.

[RW] b "date" => time

A timestamp for this record, stored as the 14 digits from the text form of a WARC::Date without the associated marker characters, i.e. YYYYmmddHHMMSS instead of YYYY-mm-ddTHH:MM:SSZ.
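
The relationship between the two forms can be sketched with plain string handling. The helper names below are hypothetical, and the sketch deliberately avoids the WARC::Date interface, which is documented separately.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a 14-digit CDX timestamp to the WARC text form.
sub cdx_to_warc_date {
    my ($stamp) = @_;
    my @parts = $stamp =~ /\A(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})\z/
        or die "not a 14-digit CDX timestamp: $stamp\n";
    return sprintf '%s-%s-%sT%s:%s:%sZ', @parts;
}

# Hypothetical helper: strip the marker characters from the WARC text form.
sub warc_to_cdx_date {
    my ($date) = @_;
    (my $stamp = $date) =~ tr/0-9//cd;   # delete everything except digits
    return $stamp;
}
```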

[RW] g "file name"

The name of the WARC volume containing this record. This value is assumed to be relative to the directory containing the CDX file and to always be a Unix-style filename, regardless of the local file name conventions.

This does not appear as a searchable field key, but is used by the $entry->volume method on a CDX index entry to return a WARC::Volume object.

[ W] k "new style checksum"

Typically the base32-encoded SHA1 digest of the response payload. This value is copied from the WARC-Payload-Digest header of a record if available, otherwise "-" is written.
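
For reference, a base32-encoded SHA1 digest can be computed with the core Digest::SHA module, roughly as sketched below. The helper name base32_sha1 is hypothetical. This library copies the value from the record header rather than computing it, and whether that header value carries an algorithm label in front of the digest is not covered by this sketch.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

# Hypothetical helper: base32 (RFC 4648 alphabet) of the SHA1 of a payload.
# A SHA1 digest is 160 bits, which is exactly 32 groups of 5 bits,
# so no padding characters are needed.
sub base32_sha1 {
    my ($payload) = @_;
    my @alphabet = ('A' .. 'Z', '2' .. '7');
    my $bits = unpack 'B*', sha1($payload);
    return join '', map { $alphabet[oct "0b$_"] } $bits =~ /(.{5})/g;
}
```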

[ W] m "mime type of original document"

The MIME type reported in the Content-Type header of the response. Written as "-" if the record does not contain an HTTP response with an entity body.

[ W] r "redirect"

The contents of the Location header of the response, URL-escaped. Some implementations do not properly set this field. Written as "-" if not present or if the record does not contain an HTTP response.

[ W] s "response code"

The HTTP status code used in the response. Written as "-" if the record does not contain an HTTP response.

[  ] M "meta tags (AIF)"

This field seems to be very common in CDX files, but no values have been observed and this field appears to be completely undocumented. Ignored on read, but written as "-" if included when building an index.

Support for this field is a stub at this time.

[ W] N "massaged url"

The URL used in the request, as for the a field, but with the hostname translated to SURT form and the scheme component removed so that the value starts with the TLD.

This is based on observed data in samples from a single source and may change without warning in future versions if better semantics are found.
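
One possible reading of the description above is sketched below: the host name labels are lower-cased, reversed, and joined with commas, the scheme is dropped, and a ")" separates the host from the rest of the URL. This is an assumption based on the common SURT layout and is not a statement of what this library writes; the helper name massage_url is hypothetical, and user information in the URL is not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a URL so that it starts with the TLD.
# Assumption: labels are joined with "," and the host ends with ")".
sub massage_url {
    my ($url) = @_;
    my ($host, $port, $rest) =
        $url =~ m{\A[A-Za-z][A-Za-z0-9+.-]*://([^/?#:]+)(:\d+)?(.*)\z}s
        or return;                       # not a URL with a host part
    my $surt_host = join ',', reverse split /[.]/, lc $host;
    $port = '' unless defined $port;
    $rest = '/' if $rest eq '';
    return "$surt_host$port)$rest";
}

my $massaged = massage_url('http://www.example.com/path?x=1');
# $massaged is now "com,example,www)/path?x=1"
```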

[ W] S "compressed record size"

The number of octets in the compressed record in the WARC file.

This field is supported when writing an index, but is not used when reading an entry, due to loose coupling between the index and record readers.

[RW] V "compressed arc file offset"
[RW] v "uncompressed arc file offset"

The offset of the record within a WARC file. This is the value that can be passed to $volume->record_at to retrieve the record.

This does not appear as a searchable field key, but is used by the $entry->record method on a CDX index entry to return a WARC::Record object for the record.

When reading an entry, the uncompressed offset (v) is used if the file name in the entry matches m/[.]w?arc$/; otherwise the archive is assumed to be compressed and the compressed offset (V) is used.

When writing an index, the V field is written as "-" if the volume is not compressed. The v field holds the offset the record would have if the volume were not compressed, when that information is available, and is written as "-" if the volume is compressed and the uncompressed record sizes are not known.
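
The choice between the two offset fields when reading can be sketched as follows. The helper name entry_offset and the hash layout keyed by field code are hypothetical; only the pattern is taken from the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the offset field for an entry, given a hash
# keyed by CDX field code.  A name ending in ".warc" or ".arc" is treated
# as uncompressed; anything else is assumed to be compressed.
sub entry_offset {
    my ($entry) = @_;
    my $field = $entry->{g} =~ m/[.]w?arc$/ ? 'v' : 'V';
    return $entry->{$field};
}
```

What happens when the selected field is absent or holds "-" is not shown here.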

There is an additional field that GNU Wget writes:

[RW] u "record-id" => record_id

The WARC-Record-ID of the record.

The g and v/V fields are required for this implementation and attaching a CDX file that does not have those fields will croak.
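
A check of this kind might look like the sketch below. It assumes that "v/V" means at least one of the two offset fields must be present; the helper name and the error messages are hypothetical and do not reproduce the messages this library produces.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: croak unless the field codes from a CDX header
# include "g" and at least one of "v" or "V".
sub check_required_fields {
    my (@codes) = @_;
    my %have;
    $have{$_} = 1 for @codes;
    croak "CDX file lacks required field g" unless $have{g};
    croak "CDX file lacks an offset field (v or V)"
        unless $have{v} || $have{V};
    return 1;
}
```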

The documentation at the Internet Archive also lists the upper-case letters ABCDFGHIJKLPQRUXYZ, the lower-case letters cdefhijlmnoptxyz, and the # symbol. The # symbol is labeled as indicating a comment and is almost certainly a leftover from the older Alexa DAT format.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

WARC, WARC::Index

The CDX file format definition at the Internet Archive: http://archive.org/web/researcher/cdx_file_format.php

The list of CDX field codes at the Internet Archive: http://archive.org/web/researcher/cdx_legend.php

IIPC 2006 CDX format specification: https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2006/

IIPC 2015 CDX format specification: https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/

SURT form at the Internet Archive: http://crawler.archive.org/articles/user_manual/glossary.html#surt

COPYRIGHT AND LICENSE

Copyright (C) 2019, 2020 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.