NAME

WARC::Record - one record from a WARC file

SYNOPSIS

  use WARC;             # or ...
  use WARC::Volume;     # or ...
  use WARC::Collection;

  # WARC::Record objects are returned from ->record_at and ->search methods

  # Construct a record, as when preparing a WARC file
  $warcinfo = WARC::Record->new(type => 'warcinfo');

  # Accessors

  $value = $record->field($name);

  $version = $record->protocol; # analogous to HTTP::Message::protocol
  $volume = $record->volume;
  $offset = $record->offset;
  $record = $record->next;

  $fields = $record->fields;

  # Supply a data block for an in-memory record
  $warcinfo->block(WARC::Fields->new( ... ));

DESCRIPTION

WARC::Record objects come in two flavors with a common interface. Records read from WARC files are read-only and have meaningful return values from the methods listed in "Methods on records from WARC files". Records constructed in memory can be updated, and for them those same methods all return undef.

Common Methods

$record->fields

Get the internal WARC::Fields object that contains WARC record headers.

$record->field( $name )

Get the value of the WARC header named $name from the internal WARC::Fields object.

$record <=> $other_record
$record->compareTo( $other_record )

Compare two WARC::Record objects according to a simple total order: ordering by starting offset for two records in the same file, and by filename of the containing WARC::Volume objects for records in different files. Constructed WARC::Record objects are assumed to come from a volume named "" (the empty string) for this purpose, and are ordered in an arbitrary but stable manner amongst themselves. Distinct constructed WARC::Record objects never compare as equal.

Perl constructs a == operator using this method, so WARC record objects will compare as equal iff they refer to the same physical record.
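
The ordering rule can be illustrated without the library. The following is a minimal sketch, not the module's implementation: plain hashes stand in for records, the keys "volume" and "offset" and the helper name compare_records are hypothetical, and filenames are assumed to compare as strings.

```perl
use strict;
use warnings;

# Plain hashes stand in for records: 'volume' is the filename of the
# containing volume and 'offset' the starting offset (hypothetical keys).
sub compare_records {
    my ($x, $y) = @_;
    # Order by filename first, then by offset within the same file.
    return $x->{volume} cmp $y->{volume}
        || $x->{offset} <=> $y->{offset};
}

my @records = (
    { volume => 'b.warc', offset => 0   },
    { volume => 'a.warc', offset => 512 },
    { volume => 'a.warc', offset => 0   },
);

my @sorted = sort { compare_records($a, $b) } @records;
print join(' ', map { sprintf '%s:%d', $_->{volume}, $_->{offset} } @sorted), "\n";
```

Under the same string-comparison assumption, a volume named "" sorts before any real filename, so constructed records would come first. With actual WARC::Record objects the overloaded <=> operator should make a helper like this unnecessary.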

Convenience getters

$record->type

Alias for $record->field('WARC-Type').

$record->id

Alias for $record->field('WARC-Record-ID').

$record->content_length

Alias for $record->field('Content-Length').

$record->date

Return the 'WARC-Date' field as a WARC::Date object.

Methods on records from WARC files

These methods all return undef if called on a WARC::Record object that does not represent a record in a WARC file.
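
Because of this, the return value of one of these methods can serve as a test for which flavor of record is at hand. A minimal sketch, assuming that checking ->volume alone is sufficient; the helper name is_from_file is hypothetical and not part of WARC::Record:

```perl
use strict;
use warnings;

# True if $record was read from a WARC file, judged by whether ->volume
# returns a defined value; records constructed in memory return undef.
sub is_from_file {
    my ($record) = @_;
    return defined($record->volume) ? 1 : 0;
}
```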

$record->protocol

Return the format and version tag for this record. For WARC 1.0, this method returns 'WARC/1.0'.

$record->volume

Return the WARC::Volume object representing the file in which this record is located.

$record->offset

Return the file offset at which this record can be found.

$record->logical

Return the logical record object for this record. Logical records reassemble WARC continuation segments. Records recorded without using WARC segmentation are their own logical records. Reassembled logical records are also their own logical records.

$record->segments

Return a list of segments for this record. A record recorded without using WARC segmentation, including a segment of a larger logical record, is considered its own only segment. A constructed record is considered to have no segments at all.

This method exists on all records to allow $record->logical->segments to work.
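
To make the idea of reassembly concrete, here is a sketch that models segments as plain hashes and joins them in WARC-Segment-Number order, then checks the result against WARC-Segment-Total-Length. The hash layout ("fields" and "block" keys) and the helper name reassemble are assumptions for illustration only; the library's own logical records do this work internally.

```perl
use strict;
use warnings;

# Each hash models one segment: 'fields' holds header values and 'block'
# the segment's data as a byte string (hypothetical layout).
sub reassemble {
    my (@segments) = @_;
    return '' unless @segments;

    my @ordered = sort {
        $a->{fields}{'WARC-Segment-Number'} <=> $b->{fields}{'WARC-Segment-Number'}
    } @segments;

    my $block = join '', map { $_->{block} } @ordered;

    # The last continuation record carries the reassembled length.
    my $expected = $ordered[-1]{fields}{'WARC-Segment-Total-Length'};
    die "segment lengths do not add up\n"
        if defined($expected) && $expected != length($block);
    return $block;
}

my $block = reassemble(
    { fields => { 'WARC-Segment-Number'       => 2,
                  'WARC-Segment-Total-Length' => 11 }, block => ' world' },
    { fields => { 'WARC-Segment-Number'       => 1 },  block => 'hello'  },
);
print "$block\n";
```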

$record->next

Return the next WARC::Record in the WARC file that contains this record, or undef if called on the last record in a file.
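
This makes a simple loop sufficient to walk a file from any starting record. The sketch below tallies records by type; it relies only on the ->type and ->next methods documented here, and the helper name count_types is hypothetical.

```perl
use strict;
use warnings;

# Tally records by WARC-Type, starting at $record and following ->next
# until it returns undef at the end of the file.
sub count_types {
    my ($record) = @_;
    my %count;
    while (defined $record) {
        $count{ $record->type }++;
        $record = $record->next;
    }
    return \%count;
}
```

Testing with defined rather than plain truth avoids depending on how a record object behaves in boolean context.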

$record->open_block

Return a tied filehandle that reads the WARC record block.

The WARC record block is the content of a WARC record, analogous to the entity body in an HTTP::Message.
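
A handle like this can be consumed in fixed-size chunks rather than all at once. The following sketch assumes the tied handle supports read(), which is the case if it implements READ; the helper name slurp_handle is hypothetical, and an in-memory handle stands in for the result of $record->open_block so the example runs on its own.

```perl
use strict;
use warnings;

# Read everything from a filehandle in fixed-size chunks.
sub slurp_handle {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 65536;
    my $data = '';
    while (1) {
        my $got = read($fh, my $buffer, $chunk_size);
        die "read failed: $!\n" unless defined $got;
        last if $got == 0;          # end of block
        $data .= $buffer;
    }
    return $data;
}

# Demonstration: an in-memory handle stands in for $record->open_block.
my $sample = 'example block contents';
open(my $fh, '<', \$sample) or die "open failed: $!";
my $contents = slurp_handle($fh, 8);
print length($contents), " octets read\n";
```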

$record->open_continued

Return a tied filehandle that reads the logical WARC record block.

For records that do not use WARC segmentation, this is effectively an alias for $record->open_block. For records that span multiple segments, this is an alias for $record->logical->open_block.

$record->replay
$record->replay( as => $type )

Return a protocol-specific object representing the record contents.

This method returns undef if the library does not recognize the protocol message stored in the record and croaks if a requested conversion is not supported.

A record with Content-Type "application/http" with an appropriate "msgtype" parameter produces an HTTP::Request or HTTP::Response object. The returned object may be a subclass to support deferred loading of entity bodies.

A request to replay a record "as => http" attempts to convert whatever is stored in the record to an HTTP exchange, analogous to the "everything is HTTP" interface that LWP provides.
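
Since replay signals its two failure modes differently (undef for unrecognized contents, an exception for an unsupported conversion), a caller may want to tell them apart. A minimal sketch; the helper name try_replay and its status strings are hypothetical:

```perl
use strict;
use warnings;

# Call ->replay and report the outcome as (message, status), where status
# is 'ok', 'unrecognized' (replay returned undef), or 'unsupported'
# (replay croaked because the requested conversion is not available).
sub try_replay {
    my ($record, @options) = @_;
    my $message;
    my $ok = eval { $message = $record->replay(@options); 1 };
    return (undef, 'unsupported')  unless $ok;
    return (undef, 'unrecognized') unless defined $message;
    return ($message, 'ok');
}
```

A call might then look like my ($http, $status) = try_replay($record, as => 'http').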

$record->open_payload

Return a tied filehandle that reads the WARC record payload.

The WARC record payload is defined as the decoded content of the protocol response or other resource stored in the record. This method returns undef if called on a WARC record that has no payload or that has content that the library does not recognize.

Methods on fresh WARC records

$record = WARC::Record->new( key => value, ... )

Construct a fresh WARC record, suitable for use with WARC::Builder.

$record->block
$record->block( $new_value )

Get or set the block contents of an in-memory record. When called on a WARC record from a volume, this method returns undef if used as a getter and croaks if an attempt is made to set the contents.

Quick Reference to Record Types and Field Names

The WARC specification defines eight standard record types and nineteen standard named fields, at length across several pages. This section is a brief summary with emphasis on the applicability of the standard named fields to the standard record types.

Record Types

  [I       ] warcinfo
  [ M      ] metadata
  [  S     ] resource
  [   Q    ] request
  [    P   ] response
  [     V  ] revisit
  [      R ] conversion
  [       T] continuation

Field Names

  [IMSQPVRT] WARC-Type                     type-name
  [IMSQPVRT] WARC-Date                     datestamp
  [IMSQPVRT] WARC-Record-ID                URI-for-record-ID
  [IMSQPVRT] Content-Type                  MIME-type
  [IMSQPVRT] Content-Length                octet-count
  [ MSQPVRT] WARC-Warcinfo-ID              record-ID
  [ MSQPV  ] WARC-Concurrent-To            record-ID
  [ M   VR ] WARC-Refers-To                record-ID
  [IMSQPVRT] WARC-Block-Digest             digest
  [IMSQPVRT] WARC-Payload-Digest           digest
  [  SQPVRT] WARC-Identified-Payload-Type  MIME type
  [ MSQPVRT] WARC-Target-URI               URI
  [ MSQPV  ] WARC-IP-Address               address
  [IMSQPVRT] WARC-Truncated                reason
  [IMSQPVRT] WARC-Segment-Number           ordinal
  [       T] WARC-Segment-Origin-ID        record-ID
  [       T] WARC-Segment-Total-Length     octet-count
  [I       ] WARC-Filename                 original-WARC-filename
  [     V  ] WARC-Profile                  namespace-URI
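
The key-letter notation lends itself to a mechanical lookup. The sketch below transcribes a few rows of the Field Names table, with each key letter in its own column, and answers whether a field applies to a record type. The names @TYPES, %APPLIES and field_applies are hypothetical and not part of the library.

```perl
use strict;
use warnings;

# Record types in key-letter column order: I M S Q P V R T.
my @TYPES = qw(warcinfo metadata resource request response
               revisit conversion continuation);

# A few rows transcribed from the table; a space means "not applicable".
my %APPLIES = (
    'WARC-Warcinfo-ID'       => ' MSQPVRT',
    'WARC-Concurrent-To'     => ' MSQPV  ',
    'WARC-Refers-To'         => ' M   VR ',
    'WARC-Filename'          => 'I       ',
    'WARC-Profile'           => '     V  ',
    'WARC-Segment-Origin-ID' => '       T',
);

# True if $field is marked as applicable to records of type $type.
# Unknown fields and unknown types yield false.
sub field_applies {
    my ($field, $type) = @_;
    my $row = $APPLIES{$field};
    return 0 unless defined $row;
    for my $i (0 .. $#TYPES) {
        next unless $TYPES[$i] eq $type;
        return substr($row, $i, 1) ne ' ' ? 1 : 0;
    }
    return 0;
}

print field_applies('WARC-Refers-To', 'revisit') ? "yes\n" : "no\n";
```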

Required ("shall") Fields

All records require: (listed once instead of in every set)
    WARC-Type
    WARC-Date
    WARC-Record-ID
    Content-Length
Any record written using WARC segmentation requires:
    WARC-Segment-Number
      This always has the value "1" if present, except in "continuation"
      records, where it provides the segment ordering.
Last continuation record in a segmented record requires:
    WARC-Segment-Total-Length
      This is the "Content-Length" of the reassembled record.
Type "resource" requires:
    WARC-Target-URI
Type "request" requires:
    WARC-Target-URI
Type "response" requires:
    WARC-Target-URI
Type "revisit" requires:
    WARC-Target-URI
    WARC-Profile
Type "conversion" requires:
    WARC-Target-URI
Type "continuation" requires:
    WARC-Target-URI
    WARC-Segment-Number
    WARC-Segment-Origin-ID

Optional ("may") Fields

All records
    Content-Type
      Default is "application/octet-stream" or the result of analysis.
      This default should not be relied upon and this header should be
      used.  May be safely omitted if Content-Length is zero.
    WARC-Block-Digest
    WARC-Truncated
Any record that has a "well-defined payload"
    WARC-Payload-Digest
    WARC-Identified-Payload-Type
Any record not of type "warcinfo"
    WARC-Warcinfo-ID
Type "warcinfo"
    WARC-Filename
Type "metadata"
    WARC-Concurrent-To
    WARC-Refers-To
    WARC-Target-URI
    WARC-IP-Address
Type "resource"
    WARC-Concurrent-To
    WARC-IP-Address
Type "request"
    WARC-Concurrent-To
    WARC-IP-Address
Type "response"
    WARC-Concurrent-To
    WARC-IP-Address
Type "revisit"
    WARC-Concurrent-To
    WARC-Refers-To
    WARC-IP-Address
Type "conversion"
    WARC-Refers-To
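
The per-type rules under Required ("shall") Fields can be checked mechanically. This is a minimal sketch over a plain hash of header values, not a feature of the library: the names @ALWAYS, %BY_TYPE and missing_required are hypothetical, the sample header values are illustrative, and the segmentation-dependent requirements are left out for brevity.

```perl
use strict;
use warnings;

# Fields every record requires, then the extra fields required per type.
my @ALWAYS = qw(WARC-Type WARC-Date WARC-Record-ID Content-Length);
my %BY_TYPE = (
    resource     => [qw(WARC-Target-URI)],
    request      => [qw(WARC-Target-URI)],
    response     => [qw(WARC-Target-URI)],
    revisit      => [qw(WARC-Target-URI WARC-Profile)],
    conversion   => [qw(WARC-Target-URI)],
    continuation => [qw(WARC-Target-URI WARC-Segment-Number
                        WARC-Segment-Origin-ID)],
);

# Return the required field names absent from a hash of header values.
sub missing_required {
    my ($headers) = @_;
    my $type = $headers->{'WARC-Type'};
    my @required = (@ALWAYS,
                    defined($type) ? @{ $BY_TYPE{$type} || [] } : ());
    return grep { !defined $headers->{$_} } @required;
}

my @missing = missing_required({
    'WARC-Type'      => 'revisit',
    'WARC-Date'      => '2019-01-01T00:00:00Z',
    'WARC-Record-ID' => '<urn:uuid:00000000-0000-0000-0000-000000000000>',
    'Content-Length' => 0,
});
print "missing: @missing\n";
```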

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

WARC, HTTP::Message

"Extension subfield 'sl' in gzip header" in WARC::Builder

COPYRIGHT AND LICENSE

Copyright (C) 2019 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.