
NAME

Store::Digest::HTTP - Map HTTP methods and URI space to Store::Digest

VERSION

Version 0.01

SYNOPSIS

    use Store::Digest::HTTP;

    my $sd = Store::Digest::HTTP->new(store => $store);

    # $request is a HTTP::Request, Plack::Request, Catalyst::Request
    # or Apache2::RequestRec. $response is a Plack::Response.

    my $response = $sd->respond($request);

DESCRIPTION

This module provides a reference implementation of an HTTP interface to Store::Digest, a content-addressable storage system based on the named information (ni:) URIs of RFC 6920 and their HTTP expansions. It is intended to provide a generic, content-based storage mechanism for opaque data objects, either uploaded by users or produced as the results of computations. The goal of this system is to act as a holding tank for both permanent storage and temporary caching, with its preservation/expiration policy handled out of scope.

This module is designed to provide only a robust set of essential functionality, with the expectation that it will be used as a foundation for far more elaborate systems. Indeed, this module is conceived primarily as an internal Web service accessible only to trusted clients, even though in use it may prove valuable as a public resource.

SECURITY

This module has no concept of access control, authentication or authorization. Those concepts have been intentionally left out of scope. There are more than enough existing mechanisms available to protect, for instance, writing to and deleting from the store. Preventing unauthorized reads is a little bit trickier.

The locations of the indexes can obviously be protected from unauthorized reading through straightforward authentication rules. The contents of the store, however, will require an authorization system which is considerably more sophisticated.

Scanning/Trawling

With the default SHA-256 digest algorithm, this (or any other) implementation will keel over long before the distance between hash values becomes short enough for a brute-force scan to be feasible. That won't stop people from trying. Likewise, by default, Store::Digest computes (and this module exposes) shorter digests like MD5, for the express purpose of matching objects to hashes when that's all you've got. If you don't want this behaviour, you can use external access control mechanisms to wall off entire digest algorithms, or consider disabling the computation of those algorithms altogether (since in that case they're only costing you resources).

A lingering danger pertaining to the feasibility of scanning, though this is untested, is the possibility that some algorithm's output clusters, statistically, around certain values. That would drastically reduce the effort required to score arbitrary hits, though the hits would still be arbitrary.

For all other intents and purposes, the likelihood that an attacker could correctly guess the location of a sensitive piece of data, especially without setting off alarm bells, is infinitesimal.

Go Fish attacks

If an attacker has a particular data object, he or she can ask the system whether it has that object as well, simply by generating a digest and crafting a GET request for it. This scenario is inconsequential in almost every case, except the rare one wherein you need to be able to repudiate having some piece of knowledge, at which point it could be severely damaging.

Locking down individual objects

The objects in the store should be seen as representations: images of information. It is entirely conceivable, if not expressly anticipated, that two abstract resources, one public and one confidential, could have identical literal representations, with identical cryptographic signatures. This would amount to one object being stored, presumably with two (or more) references to it inscribed in some higher-level system. The difference between what is confidential, and what is public, is in the context. As such, access control to concrete representations should be mediated by access control to abstract resources, in some other part of the system.

RESOURCE TYPES

All resources respond to OPTIONS requests, which list the available methods. A request using a method a given resource does not support will result in a 405 Method Not Allowed response.

Store contents: opaque data objects

These resources are identified by their full digest value. By default, that means these URI paths:

    /.well-known/[dn]i/{algorithm}/{digest}
    /{algorithm}/{digest}

...where {algorithm} is an active digest algorithm in the store, and {digest} is a complete, base64url or hexadecimal-encoded cryptographic digest. If the digest is hexadecimal, the request will be redirected (301 for GET/HEAD, 307 for the rest) to its base64url equivalent.
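
For illustration, here is a minimal sketch of how a client might derive the URI path for an object, using the Digest::SHA and MIME::Base64 modules (the content shown is the RFC 6920 example, "Hello World!"):

    use Digest::SHA  qw(sha256);
    use MIME::Base64 qw(encode_base64url);

    # "Hello World!" hashes to the example digest from RFC 6920
    my $data   = 'Hello World!';
    my $digest = encode_base64url(sha256($data));
    # f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk

    my $uri = "/.well-known/ni/sha-256/$digest";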

GET/HEAD

When successful, this method returns the content of the identified object. If the object has been deleted from the store, the response will be 410 Gone. If it was never there in the first place, 404 Not Found. If the Accept-* headers explicitly reject any of the properties of the object, the response will properly be 406 Not Acceptable.

Since these resources have exactly one representation, which by definition cannot be modified, the conditional If-* headers behave accordingly. The ETag of the object is equivalent to its ni: URI (in double quotes, as per RFC 2616).

If the request includes a Range header, the appropriate range will be returned via 206 Partial Content. Note however that at this time, multiple or non-byte ranges are not implemented, and such requests will be met with a 501 Not Implemented error.
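
For example, a conditional GET might look like the following sketch, which assumes the $sd handler from the SYNOPSIS and uses the RFC 6920 example digest for "Hello World!":

    use HTTP::Request;

    my $ni  = 'ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk';
    my $req = HTTP::Request->new(
        GET => '/.well-known/ni/sha-256/f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk',
        [ 'If-None-Match' => qq{"$ni"} ],
    );

    # 304 Not Modified if the object is present, since the ETag
    # of a store object never changes
    my $response = $sd->respond($req);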

PUT

A store object responds to PUT requests, primarily for the purpose of symmetry, but it is also applicable to verifying arbitrary data objects against supplied digests. That is, the URI of the PUT request must match the actual digest of the object's contents in the given algorithm. A mismatch between digest and content is interpreted as an attempt to PUT the object in question in the wrong place, and is treated as 403 Forbidden.

If, however, the digest matches, the response will be either 204 No Content or 201 Created, depending on whether or not the object was already in the store. A PUT request with a Range header makes no sense in this context and is therefore not implemented, and will appropriately respond with 501 Not Implemented.

Any Date header supplied with the request will become the mtime of the stored object, and will be reflected in the Last-Modified header in subsequent requests.
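
Here is a sketch of a well-formed PUT, again using the RFC 6920 example content (the Date value shown is arbitrary):

    use HTTP::Request;

    my $req = HTTP::Request->new(
        PUT => '/.well-known/ni/sha-256/f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk',
        [ 'Content-Type' => 'text/plain',
          'Date'         => 'Tue, 01 May 2012 00:00:00 GMT' ], # becomes the mtime
        'Hello World!',
    );

    # 201 Created if the object is new, 204 No Content if it was
    # already stored; 403 Forbidden had the content not matched
    my $response = $sd->respond($req);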

DELETE

Note: This module has no concept of access control.

This request, as expected, unquestioningly deletes a store object, provided one is present at the requested URI. If it is, the response is 204 No Content. If not, the response is either 404 Not Found or 410 Gone, depending on whether or not there ever was an object at that location.

PROPFIND

A handler for the PROPFIND request method is supplied to provide direct access to the metadata of the objects in the store. Downstream WebDAV applications can therefore use this module as a storage back-end while only needing to interface at the level of HTTP and/or WebDAV.

PROPPATCH

Note: This module has no concept of access control.

The PROPPATCH method is supplied, first for parity with the PROPFIND method, but also so that automated agents, such as syntax validators, can directly update the objects' metadata with their findings.

Here are the DAV properties which are currently editable:

creationdate

This property sets the mtime of the stored object, not the ctime. The ctime of a Store::Digest::Object is the time it was added to the store, not the modification time of the object supplied when it was uploaded. Furthermore, per RFC 4918, the getlastmodified property SHOULD be considered protected. As such, the meanings of the creationdate and getlastmodified properties are inverted from their intuitive values.

(XXX: is this dumb? Will I regret it?)

getcontentlanguage

This property permits the data object to be annotated with one or more RFC 5646 (formerly RFC 3066) language tags.

getcontenttype

This property permits automated agents to update the content type, and when applicable, the character set of the object. This is useful for providing an interface for storing the results of an asynchronous verification of the store's contents through a trusted mechanism, instead of relying on the claim of whoever uploaded the object that these values match their contents.
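
For instance, an automated validator might record its findings with a PROPPATCH request along these lines (a sketch: the digest is the RFC 6920 example value, and the property values are arbitrary):

    use HTTP::Request;

    my $body = <<'XML';
    <?xml version="1.0" encoding="utf-8"?>
    <D:propertyupdate xmlns:D="DAV:">
      <D:set>
        <D:prop>
          <D:getcontenttype>text/html;charset=utf-8</D:getcontenttype>
          <D:getcontentlanguage>en</D:getcontentlanguage>
        </D:prop>
      </D:set>
    </D:propertyupdate>
    XML

    my $req = HTTP::Request->new(
        PROPPATCH => '/.well-known/ni/sha-256/f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk',
        [ 'Content-Type' => 'application/xml' ],
        $body,
    );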

Individual metadata

This is a read-only hypertext resource intended primarily as the response content to a POST of a new storage object, such that the caller can retrieve the digest value and other useful metadata. It also doubles as a user interface for successive manual uploads, both as interstitial feedback and as a control surface.

GET/HEAD

    .../{algorithm}/{digest}?meta=true # not sure which of these yet
    .../{algorithm}/{digest};meta      # ... can't decide

Depending on the Accept header, this resource will either return RDFa-embedded (X)HTML, RDF/XML or Turtle (or JSON-LD, or whatever). The HTML version includes a rudimentary interface to the multipart/form-data POST target.

Partial matches

Partial matches are read-only resources that return a list of links to stored objects. The purpose is to provide an interface for retrieving an object from the store when only the first few characters of its digest are known. These resources are mapped under the following URI paths by default:

    /.well-known/[dn]i/{algorithm}/{partial-digest}
    /.well-known/[dn]i/{partial-digest}
    /{algorithm}/{partial-digest}
    /{partial-digest}

...where {algorithm} is an active digest algorithm in the store, and {partial-digest} is an incomplete, base64url or hexadecimal-encoded cryptographic digest, that is, one that is shorter than the appropriate length for the given algorithm. If the path is given with no algorithm, the length of the digest content doesn't matter, and all algorithms will be searched.
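
For example, either of the following paths (using the first six characters of the RFC 6920 example digest) would look up all objects whose SHA-256 digests begin with f4OxZX; the second form would search every algorithm in the store:

    /sha-256/f4OxZX
    /f4OxZX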

GET/HEAD

A GET request will return a simple web page containing a list of links to the matching objects. If exactly one object matches, the response will be 302 Found (in case additional objects match in the future). If no objects match, the response will be 404 Not Found. If multiple objects match, the list will be returned with a 300 Multiple Choices status, to reinforce the transient nature of the resource.

TODO: find or make an appropriate collection vocab, then implement RDFa, RDF/XML, N3/Turtle, and JSON-LD variants.

PROPFIND

TODO: A PROPFIND response, if it even makes sense to implement, will almost certainly be contingent on whatever vocab I decide on.

Resource collections

These collections exist for diagnostic purposes, so that during development we may examine the contents of the store without any apparatus besides a web browser. By default, the collections are bound to the following URI paths:

    /.well-known/[dn]i/{algorithm}/
    /{algorithm}/

The only significance of the {algorithm} in the URI path is as a residual sorting parameter, to be used only after the contents of the store have been sorted by all other specified parameters. Otherwise the results are the same for all digest algorithms. The default sorting behaviour is to ascend lexically, first by type, then modification time (then tiebreak by whatever other means remain).

GET/HEAD

These resources are bona fide collections and will reflect the convention by redirecting via 301 Moved Permanently to a path with a trailing slash /. (Maybe?)

This is gonna have to respond to filtering, sort order and pagination.

(optional application/atom+xml variant?)

Here are the available parameters:

tz (ISO 8601 time zone)

Resolve date parameters against this time zone rather than the default (UTC).

    tz=-0800

(XXX: use Olson rather than ISO-8601 so we don't have to screw around with daylight savings? whynotboth.gif?)

boundary

Absolute offset of bounding record, starting with 1. One value present sets the upper bound; two values define an absolute range:

    boundary=100              # 1-100
    boundary=1&boundary=100   # same thing
    boundary=101&boundary=200 # 101-200

sort (Filter parameter name)

One or more instances of this parameter, in the order given, override the default sorting criteria, which are:

    sort=type&sort=mtime

reverse (Boolean)

Flag for specifying a reverse sort order:

    reverse=true

complement (Filter parameter name)

Use the complement of the specified filter criteria:

    type=text/html&complement=type # everything but text/html

Here are the sorting/filtering criteria:

size

The number of bytes. One value sets a lower bound; two values define a range:

    size=1048576     # at least a megabyte
    size=0&size=1024 # no more than a kilobyte

type

The Content-Type of the object. Enumerable:

    type=text/html&type=text/plain&type=application/xml

charset

The character set of the object. Enumerable:

    charset=utf-8&charset=iso-8859-1&charset=windows-1252

encoding

The Content-Encoding of the object. Enumerable:

    encoding=gzip&encoding=bzip2&encoding=identity

ctime

The creation time, as in the time the object was added to the store. One for lower bound, two for range:

    ctime=2012-01-01 # everything added since January 1, 2012
    ctime=2012-01-01&ctime=2012-12-31 # only the year of 2012

Applying complement to this parameter turns the one-instance form into an upper bound, and inverts the range to mean everything outside it. This parameter takes ISO 8601 datetime strings or subsets thereof, or epoch seconds.

mtime

Same syntax as ctime, except concerns the modification time supplied by the user when the object was inserted into the store.

ptime

Same as above, except concerns the latest time at which only the metadata of the object was modified.

dtime

Same as above, except concerns the latest time the object was deleted. As should be expected, if this parameter is used, objects which are currently present in the store will be omitted. Only the traces of deleted objects will be shown.
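
Combining these parameters, a sketch of a query for the first hundred HTML or plain-text objects of at least a megabyte, most recently modified first, might look like this:

    /sha-256/?type=text/html&type=text/plain&size=1048576&sort=mtime&reverse=true&boundary=100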

PROPFIND

TODO: Again, PROPFIND responses, not sure how to define 'em at this time.

Summary and usage statistics

This resource acts as the "home page" of this module. Here we can observe the contents of Store::Digest::Stats, such as the number of objects stored, global modification times, storage consumption, reclaimed space, etc. We can also choose our preferred time zone and digest algorithm for browsing the store's contents, as well as upload a new file.

GET/HEAD

Depending on the Accept header, this handler returns a simple web page or set of RDF triples.

PROPFIND

TODO: Define RDF vocab before PROPFIND.

POST target, raw

This is a URI that handles only POST requests, enabling a thin (e.g., API) HTTP client to upload a data object without the effort or apparatus needed to compute its digest. The request headers of interest are naturally Content-Type and Date. The path of this URI is set in the constructor, and defaults to:

    /0c17e171-8cb1-4c60-9c58-f218075ae9a9

POST

This handler accepts the request content and attempts to store it. If unsuccessful, it will return either 507 Insufficient Storage or 500 Internal Server Error. If successful, the response will redirect via 303 See Other to the appropriate "Individual metadata" resource.

This resource is intended to be used in a pipeline with other web service code. POSTed request entities to this location will be inserted into the store as-is. Do not POST to this location from a Web form unless that's what you want to have happen. Use the other target instead.

The contents of the following request headers are stored along with the content of the request body:

  • Content-Type

  • Content-Language

  • Content-Encoding

  • Date
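
Here is a sketch of a raw POST, assuming the default target path and the $sd handler from the SYNOPSIS:

    use HTTP::Request;

    my $req = HTTP::Request->new(
        POST => '/0c17e171-8cb1-4c60-9c58-f218075ae9a9',
        [ 'Content-Type'     => 'text/plain',
          'Content-Language' => 'en',
          'Date'             => 'Tue, 01 May 2012 00:00:00 GMT' ],
        'Hello World!',
    );

    # 303 See Other; the Location header points at the object's
    # individual metadata resource
    my $response = $sd->respond($req);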

POST target, multipart/form-data

This resource behaves identically to the one above, except that it takes its data from multipart/form-data fields rather than headers. This resource is designed as part of a rudimentary interface for adding objects to the store. It is intended for use during development and explicitly not for production, beyond the most basic requirements. Its default URI path, also configurable in the constructor, is:

    /12d851b7-5f71-405c-bb44-bd97b318093a

POST

This handler expects a POST request with multipart/form-data content only; any other content type will result in a 409 Conflict. The same response will occur if the request body does not contain a file part. Malformed request content will be met with a 400 Bad Request. The handler will process only the first file part found in the request body; it will ignore the field name. If there are Content-Type, Date, etc. headers in the MIME subpart, those will be stored. The file's name, if supplied, is ignored, since mapping names to content is deliberately out of scope for Store::Digest.
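
A sketch of such a request, built with HTTP::Request::Common (the field name and filename shown are arbitrary, since both are ignored):

    use HTTP::Request::Common qw(POST);

    my $req = POST '/12d851b7-5f71-405c-bb44-bd97b318093a',
        Content_Type => 'form-data',
        Content      => [
            file => [ undef, 'ignored.txt',
                      'Content-Type' => 'text/plain',
                      Content        => 'Hello World!' ],
        ];

    my $response = $sd->respond($req);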

METHODS

new

    my $sdh = Store::Digest::HTTP->new(store => $store);

store

This is a reference to a Store::Digest object.

base

This is the base URI path, which defaults to /.well-known/ni/.

post_raw

This overrides the location of the raw POST target, which defaults to /0c17e171-8cb1-4c60-9c58-f218075ae9a9.

post_form

This overrides the location of the form-interpreted POST target, which defaults to /12d851b7-5f71-405c-bb44-bd97b318093a.

If the

param_map

Any of the URI query parameters used in this module can be remapped to different literals using a HASH reference like so:

    # in case 'mtime' collides with some other parameter elsewhere
    { modified => 'mtime' }
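
Putting the constructor parameters together, with every value except the param_map set to its documented default:

    my $sdh = Store::Digest::HTTP->new(
        store     => $store,  # a Store::Digest instance
        base      => '/.well-known/ni/',
        post_raw  => '/0c17e171-8cb1-4c60-9c58-f218075ae9a9',
        post_form => '/12d851b7-5f71-405c-bb44-bd97b318093a',
        param_map => { modified => 'mtime' },
    );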

respond

    my $response = $sdh->respond($request);
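
As the SYNOPSIS notes, this method takes an HTTP::Request, Plack::Request, Catalyst::Request or Apache2::RequestRec, and returns a Plack::Response. A minimal PSGI wrapper might therefore look like this sketch:

    use Plack::Request;

    # a hypothetical PSGI app delegating everything to the handler
    my $app = sub {
        my $env = shift;
        my $res = $sdh->respond(Plack::Request->new($env));
        return $res->finalize;  # Plack::Response -> PSGI triple
    };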

TO DO

I think diff coding/instance manipulation (RFC 3229 and RFC 3284) would be pretty cool. It might be better handled by some other module, though.

AUTHOR

Dorian Taylor, <dorian at cpan.org>

LICENSE AND COPYRIGHT

Copyright 2012 Dorian Taylor.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.