NAME

WARC::Fields - WARC record headers and application/warc-fields

SYNOPSIS

  require WARC::Fields;

  $f = new WARC::Fields;
  $f = $record->fields;                 # get WARC record headers
  $g = $f->clone;                       # make writable copy

  $g->set_readonly;                     # make read-only

  $f->field('WARC-Type' => 'metadata'); # set
  $value = $f->field('WARC-Type');      # get

  $fields_text = $f->as_string;         # get WARC header lines for display
  $fields_block = $f->as_block;         # format for WARC file

  tie @field_names, ref $f, $f;         # bind ordered list of field names

  tie %fields, ref $f, $f;              # bind hash of field names => values

  $entry = $f->[$num];                  # tie an anonymous array and access it
  $value = $f->{$name};                 # likewise with an anonymous tied hash

  $name = "$entry";                     # tied array returns objects
  $value = $entry->value;               # one specific value
  $offset = $entry->offset;             # N of M with same name

  foreach (keys %{$f}) { ... }          # iterate over names, in order

DESCRIPTION

The WARC::Fields class encapsulates information in the "application/warc-fields" format used for WARC record headers. This is a simple key-value format closely analogous to HTTP headers; however, the differences are significant enough that the HTTP::Headers class cannot be reliably reused for WARC fields.

Instances of this class are usually created as member variables of the WARC::Record class, but can also be returned as the content of WARC records with Content-Type "application/warc-fields".

Instances of WARC::Fields retrieved from WARC files are read-only and will croak() if any attempt is made to change their contents.

This class strives to faithfully represent the contents of a WARC file, while providing a simple interface to answer simple questions.

Multiple Values

Most WARC headers may only appear once and with a single value in valid WARC records, with the notable exception of the WARC-Concurrent-To header. WARC::Fields neither attempts to enforce nor relies upon this constraint. Headers that appear multiple times are considered to have multiple values. When iterating a tied hash, all values of a recurring header are collected and returned with the first occurrence of its key.

Multiple values are returned from the field method and tied hash interface as array references, and are set by passing in an array reference. Existing rows are reused where possible when updating a field with multiple values. If the new array reference contains fewer items (including the special case of replacing multiple values with a single value) excess rows are deleted. If the new array reference requires additional rows to be inserted, they are inserted immediately after the last existing row for a field, with the same name case as that row.
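
The row update rules above can be modeled with a small standalone sketch. It does not use WARC::Fields itself: the set_values helper and the [name, value] row layout are hypothetical stand-ins for the internal table, and appending rows for a field that does not exist yet is an assumption.

```perl
use strict;
use warnings;

# Hypothetical model: the table is an array of [name, value] rows.
sub set_values {
    my ($rows, $name, @new) = @_;

    # Indexes of the existing rows for this field (names compare case-insensitively).
    my @at = grep { lc($rows->[$_][0]) eq lc($name) } 0 .. $#$rows;

    # Assumption: a field that is not present yet is appended at the end.
    unless (@at) {
        push @$rows, map { [ $name, $_ ] } @new;
        return;
    }

    my $last      = $at[-1];
    my $last_name = $rows->[$last][0];

    # Reuse existing rows, in order, for as many values as possible.
    my $reuse = @new < @at ? scalar(@new) : scalar(@at);
    $rows->[ $at[$_] ][1] = $new[$_] for 0 .. $reuse - 1;

    if (@new > @at) {
        # Insert the remaining values right after the last existing row,
        # copying the name case of that row.
        splice @$rows, $last + 1, 0,
            map { [ $last_name, $_ ] } @new[ $reuse .. $#new ];
    }
    else {
        # Delete excess rows, highest index first so indexes stay valid.
        splice @$rows, $_, 1 for reverse @at[ $reuse .. $#at ];
    }
    return;
}

my @rows = (
    [ 'WARC-Type',          'metadata' ],
    [ 'WARC-Concurrent-To', '<urn:uuid:a>' ],
    [ 'Content-Length',     '0' ],
);
set_values(\@rows, 'warc-concurrent-to', '<urn:uuid:x>', '<urn:uuid:y>');
print "$_->[0]: $_->[1]\n" for @rows;
```

In this model, passing no values at all removes every row for the field, which mirrors setting a field to an empty array reference.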

Precise control of the layout is available using the tied array interface, but the ordering of the header rows is not constrained in the WARC specification.

Field Name Mangling

As with HTTP::Headers, the '_' character is converted to '-' in field names unless the first character of the name is ':', which cannot itself appear in a field name. Unlike HTTP::Headers, the leading ':' is stripped off immediately and the name stored otherwise exactly as given. The field method and tied hash interface allow this convenience feature. The field names exposed via the tied array interface are reported exactly as they appear in the WARC file.

Strictly, "X-Crazy-Header" and "X_Crazy_Header" are two different headers that the above convenience mechanism conflates. The solution is simple: if (and only if) a header field already exists with the exact name given, it is used. Otherwise, s/_/-/g is applied and the name is rechecked for another exact match. If no match is found, case is folded and a third check is performed. If a match is found, the existing header is updated; otherwise, a new header is created with character case as given.

The WARC specification states that field names are case-insensitive; accordingly, "X-Crazy-Header" and "X-CRAZY-HeAdEr" are considered the same header for the field method and tied hash interface. They will, however, appear exactly as given in the tied array interface.
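
The three-step lookup described in this section can be sketched as a standalone function. This is an illustration only: resolve_name is a hypothetical helper, the list of existing names stands in for the internal table, and applying the case-insensitive check to names quoted with ':' is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which stored name a requested name refers to.
# $existing is an array reference of the names already present.
sub resolve_name {
    my ($requested, $existing) = @_;

    # A leading ':' quotes the name: strip it and skip the '_' conversion.
    my $quoted = ($requested =~ s/^://);

    # 1. Exact match on the name as given.
    for my $name (@$existing) {
        return $name if $name eq $requested;
    }

    # 2. Exact match after converting '_' to '-'.
    my $candidate = $requested;
    $candidate =~ tr/_/-/ unless $quoted;
    for my $name (@$existing) {
        return $name if $name eq $candidate;
    }

    # 3. Case-insensitive match.
    for my $name (@$existing) {
        return $name if lc($name) eq lc($candidate);
    }

    # No match: a new field would be created under this name.
    return $candidate;
}

my @names = ('WARC-Type', 'X_Crazy_Header', 'Content-Length');
print resolve_name($_, \@names), "\n"
    for 'X_Crazy_Header', 'WARC_Type', 'content_length', ':X_New';
```

Checking for an exact match first is what keeps an existing "X_Crazy_Header" reachable even though the underscore conversion would otherwise redirect the request to "X-Crazy-Header".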

Methods

$f = WARC::Fields->new

Construct a new WARC::Fields object. Initial contents can be passed as key-value pairs to this constructor and will be added in the given order.

Repeating a key or supplying an array reference as a value assigns multiple values to a key. To reduce the risk of confusion, only quoting with a leading ':' overrides the convenience feature of applying s/_/-/g when constructing a WARC::Fields object. The exact match rules used when setting values on an existing object do not apply here.

Field names given when constructing a WARC::Fields object are otherwise stored exactly as given, with case preserved, even when other names that fold to the same string have been given earlier in the argument list.

$f->clone

Copy a WARC::Fields object. A copy of a read-only object is writable.

$f->field( $name )
$f->field( $name => $value )
$f->field( $n1 => $v1, $n2 => $v2, ... )

Get or set the value of one or more fields. The field name is not case sensitive, but WARC::Fields will preserve its case if a new entry is created.

Setting a field to undef effectively deletes that field, although it remains visible in the tied array interface and will retain its position if a new value is assigned. Setting a field to an empty array reference removes that field entirely.

$f = WARC::Fields->parse( $text )
$f = WARC::Fields->parse( from => $fh )
$f = parse WARC::Fields from => $fh

Construct a new WARC::Fields object, reading initial contents from the provided text string or filehandle.

The parse method throws an exception if it encounters input that it does not understand.

If the parse method encounters a field name with a leading ':', which implies an empty name and is not allowed, the leading ':' is silently dropped from the line and parsing retried. If the line is not valid after this change, the parse method throws an exception. This feature is in keeping with the general principle of "be liberal in what you accept" and is a preemptive workaround for a predicted bug in other implementations.
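
The retry behavior can be illustrated with a simplified single-line parser. This is a sketch under stated assumptions, not the parser used by WARC::Fields: parse_field_line is a hypothetical helper, the name pattern is a simplification of the token grammar, line endings are assumed to be removed already, and continuation lines are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: split one header line into (name, value) or die.
sub parse_field_line {
    my ($line) = @_;

    # Simplified grammar: a name without whitespace or ':', a colon,
    # then the value with surrounding blanks trimmed.
    my $pattern = qr/^([^\s:]+):[ \t]*(.*?)[ \t]*$/;

    return ($1, $2) if $line =~ $pattern;

    # Liberal retry: drop one leading ':' and try the same pattern again.
    if ($line =~ /^:/) {
        my $retry = substr $line, 1;
        return ($1, $2) if $retry =~ $pattern;
    }

    die "cannot parse field line: $line\n";
}

my ($name, $value) = parse_field_line(':WARC-Type: metadata');
print "$name => $value\n";
```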

$f->as_block
$f->as_string

Return the contents as a formatted WARC header or application/warc-fields block. The as_block method uses network line endings and UTF-8 as specified for the WARC format, while the as_string method uses the local line endings and does not perform encoding.
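
The difference between the two output forms can be sketched with core Perl alone. The format_rows helper and the [name, value] row layout below are hypothetical; whether the real methods append a terminating blank line is not covered here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical formatter: one "Name: value" line per row, with the given line ending.
sub format_rows {
    my ($rows, $eol) = @_;
    return join '', map { "$_->[0]: $_->[1]$eol" } @$rows;
}

my @rows = ( [ 'WARC-Type', 'metadata' ], [ 'X-Note', "caf\x{e9}" ] );

# Display form: character string with the local newline.
my $text = format_rows(\@rows, "\n");

# File form: CRLF line endings, then encoded to UTF-8 octets.
my $block = encode('UTF-8', format_rows(\@rows, "\r\n"));

printf "%d characters for display, %d octets for the file\n",
    length $text, length $block;
```

The display form stays a character string so that the caller can choose an output encoding, while the file form is already a byte string ready to be written in binary mode.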

$f->set_readonly

Mark a WARC::Fields object read-only. All methods that modify the object will croak() if called on a read-only object.

Tied Array Access

The order of fields can be fully controlled by tying an array to a WARC::Fields object and manipulating the array using ordinary Perl operations. The splice and sort functions are likely to be useful for reordering array elements if desired.

WARC::Fields will croak() if an attempt is made to set a field name with a leading ':' using the tied array interface.

The tied array interface accepts simple string values but returns objects with additional information. The returned object has an overloaded string conversion that yields the name for that entry but additionally has value and offset methods.

An entry object is bound to a slot in its parent WARC::Fields object, but will be copied if it is assigned to another slot in the same or another WARC::Fields object.

Due to complex aliasing rules necessary for array slice assignment to work for permuting rows in the table, entry objects must be short-lived. Storing the object read from a tied array and attempting to use it after modifying its parent WARC::Fields object produces unspecified results.

$entry = $array[$n]
$entry = $f->[$n]

The tied array FETCH method returns an "entry object" instead of the name itself.

$name = "$entry"
$name = $entry->name
$name = "$f->[$n]"
$name = $f->[$n]->name

The name method on an entry object returns the field name. String conversion is overloaded to call this method.

$value = $entry->value
$value = $array[$n]->value
$value = $f->[$n]->value
$entry->value( $new_value )
$array[$n]->value( $new_value )
$f->[$n]->value( $new_value )

The value method on an entry object returns the field value for this particular entry. Only a single scalar is returned, even if multiple entries share the same name.

If given an argument, the value method replaces the value for this particular entry. The argument will be coerced to a string.

$offset = $entry->offset
$offset = $array[$n]->offset
$offset = $f->[$n]->offset

The offset method on an entry object returns the position of this entry amongst multiple entries with the same field name. These positions are numbered from zero and are identical to the positions in the array reference returned for this entry's field name from the field method or the tied hash interface.

Tied Hash Access

The contents of a WARC::Fields object can be easily examined by tying a hash to the object. Reading or setting a hash key is equivalent to the field method, but the tied hash will iterate keys and values in the order in which each key first appears in the internal table.
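
The iteration order can be modeled without the class itself. In the following sketch, grouped_in_order is a hypothetical helper over [name, value] rows; it reports each name once, at its first occurrence, with all values collected in row order.

```perl
use strict;
use warnings;

# Hypothetical model: rows are [name, value] pairs in file order.
sub grouped_in_order {
    my ($rows) = @_;
    my (@order, %values);
    for my $row (@$rows) {
        my ($name, $value) = @$row;
        my $key = lc $name;                      # names compare case-insensitively
        push @order, $name unless exists $values{$key};
        push @{ $values{$key} }, $value;         # collect every value for this name
    }
    return map { [ $_, $values{ lc($_) } ] } @order;
}

my @rows = (
    [ 'WARC-Type',          'metadata' ],
    [ 'WARC-Concurrent-To', '<urn:uuid:a>' ],
    [ 'Content-Length',     '0' ],
    [ 'warc-concurrent-to', '<urn:uuid:b>' ],
);
for my $group (grouped_in_order(\@rows)) {
    my ($name, $values) = @$group;
    print "$name: @$values\n";
}
```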

Like the tied array interface, the tied hash interface returns magical objects that internally refer back to the parent WARC::Fields object. These objects remain valid if the underlying WARC::Fields object is changed, but further use may produce surprising and unspecified results.

The use of magical objects enables the values in a tied hash to always be arrays, even for keys that do not exist (the array will have zero elements) or that have only one value (the array will have a string conversion that produces that one value). This allows a tied hash to support autovivification of an array value just as Perl's own hashes do.

Overloaded Dereference Operators

The WARC::Fields class provides overloaded dereference operators for array and hash dereferencing. The overloaded operators provide an anonymous tied array or hash as needed, allowing the object itself to be used as a reference to its tied array and hash interfaces. There is a caveat, however, so read on.

Reference Count Trickery with Overloaded Dereference Operators

To avoid problems, the underlying tied object is a reference to the parent object. For ordinary use of tie, this is a strong reference. However, the anonymous tied array and hash are cached in the object to avoid having to tie a new object every time the dereference operators are used.

To prevent memory leaks due to circular references, the overloaded dereference operators tie a weak reference to the parent object. The tied aggregate always holds a strong reference to its object, but when the dereference operators are used, that inner object is a weak reference to the actual WARC::Fields object.

The caveat is thus: do not attempt to save a reference to the array or hash produced by dereferencing a WARC::Fields object. The parent WARC::Fields object must remain in scope for as long as any anonymous tied aggregates exist.
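
The arrangement described above can be sketched with core Perl. Demo::Fields and Demo::Fields::TiedArray are hypothetical stand-ins, not the real implementation: the parent caches a tied array, and the tie object points back to the parent through a reference weakened with Scalar::Util, so no reference cycle forms.

```perl
use strict;
use warnings;

package Demo::Fields;    # hypothetical stand-in, not the real class
use overload '@{}' => \&_as_array, fallback => 1;

our $destroyed = 0;      # counts destroyed objects, for demonstration only

sub new {
    my ($class, @names) = @_;
    return bless { names => [@names], cache => undef }, $class;
}

# Overloaded array dereference: tie once, then reuse the cached array.
sub _as_array {
    my ($self) = @_;
    unless ($self->{cache}) {
        tie my @array, 'Demo::Fields::TiedArray', $self;
        $self->{cache} = \@array;    # the parent holds the tied array strongly
    }
    return $self->{cache};
}

sub DESTROY { $destroyed++ }

package Demo::Fields::TiedArray;
use Scalar::Util qw(weaken);

sub TIEARRAY {
    my ($class, $parent) = @_;
    my $self = bless { parent => $parent }, $class;
    weaken($self->{parent});         # back-reference must not keep the parent alive
    return $self;
}

# The weak reference becomes undef once the parent has been freed.
sub _parent   { $_[0]{parent} // die "parent object is gone\n" }
sub FETCH     { $_[0]->_parent->{names}[ $_[1] ] }
sub FETCHSIZE { scalar @{ $_[0]->_parent->{names} } }

package main;

{
    my $f = Demo::Fields->new(qw(WARC-Type WARC-Date));
    print "second name: $f->[1]\n";
}    # the parent can be freed here despite the cached tied array
print "destroyed so far: $Demo::Fields::destroyed\n";
```

In this sketch, a reference to the tied array that outlives its parent fails on the next access, which is the situation the caveat warns about.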

CAVEATS

Do not save references to the anonymous tied aggregates returned by dereferencing a WARC::Fields object.

Do not save references to the entries read from tied aggregates unless the WARC::Fields object is read-only. Modifications may or may not be reflected in previously constructed entry objects and hash value arrays, and the exact behavior may change without warning or notice.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

WARC, HTTP::Headers, Scalar::Util for weaken

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.