NAME

HTTP::Promise::Parser - Fast HTTP Request & Response Parser

SYNOPSIS

    use HTTP::Promise::Parser;
    my $p = HTTP::Promise::Parser->new ||
        die( HTTP::Promise::Parser->error, "\n" );
    # Parse from a file path
    my $ent = $p->parse( '/some/where/http_request.txt' ) ||
        die( $p->error );
    # or from a filehandle
    $ent = $p->parse( $file_handle ) ||
        die( $p->error );
    # or from a string
    $ent = $p->parse( $string ) ||
        die( $p->error );

VERSION

    v0.1.0

DESCRIPTION

This is an HTTP request and response parser that uses XS modules whenever possible for speed, and is mindful of memory consumption.

As rfc7230 states in its section 3:

"The normal procedure for parsing an HTTP message is to read the start-line into a structure, read each header field into a hash table by field name until the empty line, and then use the parsed data to determine if a message body is expected. If a message body has been indicated, then it is read as a stream until an amount of octets equal to the message body length is read or the connection is closed."

Thus, the HTTP::Promise approach is to read the data, whether an HTTP request or response (a.k.a. an HTTP message), from a filehandle, possibly chunked. It first reads and parses the message headers, then stores the HTTP message in memory if it is under a specified threshold, or in a file otherwise. If the size is unknown, the data is first read into memory and automatically switched to a file when it reaches the threshold.
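The memory-then-file strategy described above can be sketched in a few lines of core Perl. This is only an illustration of the idea, not the module's implementation: the function name store_body, its arguments and the returned hash keys are all hypothetical.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper, not part of HTTP::Promise::Parser.
# Reads $fh in chunks and keeps the data in memory until more than
# $threshold bytes have been read, then moves everything to a temporary file.
sub store_body
{
    my( $fh, $threshold, $chunk_size ) = @_;
    $chunk_size ||= 2048;
    my $buffer = '';
    my $tmp;    # File::Temp object, created only when needed
    my $chunk;
    while( read( $fh, $chunk, $chunk_size ) )
    {
        if( $tmp )
        {
            print {$tmp} $chunk;
        }
        else
        {
            $buffer .= $chunk;
            if( length( $buffer ) > $threshold )
            {
                # Threshold exceeded: switch from memory to a file
                $tmp = File::Temp->new;
                binmode( $tmp, ':raw' );
                print {$tmp} $buffer;
                $buffer = '';
            }
        }
    }
    if( $tmp )
    {
        $tmp->flush;
        return( { where => 'file', file => $tmp } );
    }
    return( { where => 'memory', data => $buffer } );
}
```

The temporary file is removed when the File::Temp object goes out of scope, so a caller wanting to keep the data would need to copy or rename it first.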

Once the overall message body is stored, if it is a multipart type, this class reads each of its parts into memory or into a separate file, depending on its size, until there are no more parts. It does so using the stream reader, which reads in chunks of bytes rather than line by line. If the message body is a single part, it is saved to memory or to a file depending on its size. Each part saved to a file uses a file extension related to its MIME type. Each part is then accessible as an HTTP body object via the "parts" in HTTP::Promise::Entity method.

Note, however, that when dealing with multipart, this only recognises multipart/form-data. Anything else is treated as data.
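To make the structure of a multipart/form-data body concrete, here is a rough sketch that splits such a body into its parts. It works on a whole string for brevity, whereas the module reads from a stream in chunks. The function name sketch_split_multipart and the returned keys are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; it expects the whole body as a string.
sub sketch_split_multipart
{
    my( $body, $boundary ) = @_;
    # Discard the closing delimiter "--boundary--" and anything after it
    $body =~ s/\r?\n--\Q$boundary\E--.*\z//s;
    # Each "--boundary" delimiter line starts a new part
    my @chunks = split( /(?:\A|\r?\n)--\Q$boundary\E[ \t]*\r?\n/, $body );
    shift( @chunks );   # drop the preamble, which is usually empty
    my @parts;
    foreach my $chunk ( @chunks )
    {
        # Part headers and part content are separated by an empty line
        my( $head, $content ) = split( /\r?\n\r?\n/, $chunk, 2 );
        my( $name ) = $head =~ /\bname="([^"]*)"/;
        push( @parts, { name => $name, headers => $head, content => $content // '' } );
    }
    return( \@parts );
}
```

With a boundary of B and two fields a=1 and b=2, this returns an array reference of two hash references, one per part.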

The overall HTTP message is available as an HTTP::Promise::Entity object and returned.

If an error occurs, this module does not die, at least not voluntarily, but instead sets an error and returns undef, so always make sure to check the returned value from method calls.

CONSTRUCTOR

new

This instantiates a new HTTP::Promise::Parser object.

It takes the following options:

  • decode_body

    Boolean. If enabled, the entity body is automatically decoded upon parsing. Default is true.

  • decode_headers

    Boolean. If enabled, this will decode headers, which is used for decoding the filename value in Content-Disposition. Default is false.

  • ignore_filename

    Boolean. Whether the filename provided in a Content-Disposition header field should be ignored or not. This defaults to false, but this option is not actually used: the filename specified in a Content-Disposition header field is never used. So, this is a no-op and should be removed.

  • max_body_in_memory_size

    Integer. This is the threshold beyond which an entity body that is initially loaded into memory will be switched to a file on the local filesystem. This applies when it is set to a true value and the body size exceeds the amount specified.

    By default, this has the value set by the class variable $MAX_BODY_IN_MEMORY_SIZE, which is 102400 bytes or 100K.

  • max_headers_size

    Integer. This is the threshold size in bytes beyond which HTTP headers will trigger an error. This defaults to the class variable $MAX_HEADERS_SIZE, which itself is set by default to 8192 bytes or 8K.

  • max_read_buffer

    Integer. This is the read buffer size. This is used for HTTP::Promise::IO and defaults to 2048 bytes (2K).

  • output_dir

    Filepath of the directory to be used to save entity body, when applicable.

  • tmp_dir

    Set the directory to use when creating temporary files.

  • tmp_to_core

    Boolean. When true, this will set the temporary file to an in-memory space.

METHODS

decode_body

Boolean. If enabled, the entity body is automatically decoded upon parsing. Default is true.

decode_headers

Boolean. If enabled, this will decode headers, which is used for decoding the filename value in Content-Disposition. Default is false.

ignore_filename

Boolean. Whether the filename provided in a Content-Disposition header field should be ignored or not. This defaults to false, but this option is not actually used: the filename specified in a Content-Disposition header field is never used. So, this is a no-op and should be removed.

looks_like_request

Provided with a string or a scalar reference, this returns a hash reference containing details of the request line attributes if it is indeed a request, or an empty string if it is not a request.

It sets an error and returns undef upon error.

The following attributes are available:

http_version

The HTTP protocol version used. For example, in HTTP/1.1, this would be 1.1, and in HTTP/2, this would be 2.

http_vers_major

The HTTP protocol major version used. For example, in HTTP/1.0, this would be 1, and in HTTP/2, this would be 2.

http_vers_minor

The HTTP protocol minor version used. For example, in HTTP/1.0, this would be 0, and in HTTP/2, this would be undef.

method

The HTTP request method used. For example in GET / HTTP/1.1, this would be GET. This uses the rfc7231 semantics, which means any token even non-standard ones would match.

protocol

The HTTP protocol used, e.g. HTTP/1.0, HTTP/1.1, HTTP/2, etc...

uri

The request URI. For example in GET / HTTP/1.1, this would be /

    my $ref = $p->looks_like_request( \$str );
    # or
    # my $ref = $p->looks_like_request( $str );
    die( $p->error ) if( !defined( $ref ) );
    if( $ref )
    {
        say "Request method $ref->{method}, uri $ref->{uri}, protocol $ref->{protocol}, version major $ref->{http_vers_major}, version minor $ref->{http_vers_minor}";
    }
    else
    {
        say "This is not an HTTP request.";
    }
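For readers who want to see what such a check involves, the following is a rough pure-Perl approximation of matching a request line and returning the attributes listed above. The function name sketch_looks_like_request and its regular expression are hypothetical and the module's own pattern may well differ.

```perl
use strict;
use warnings;

# Token characters allowed in a method name, as per rfc7230
my $TOKEN = qr/[!#\$%&'*+\-.^_`|~0-9A-Za-z]+/;

# Hypothetical stand-in for illustration only, not the module's code.
# Returns a hash reference for a request line, or an empty string otherwise.
sub sketch_looks_like_request
{
    my $str = shift( @_ );
    $str = $$str if( ref( $str ) eq 'SCALAR' );
    return( '' ) unless( defined( $str ) );
    # method SP request-target SP HTTP/major[.minor]
    if( $str =~ m{\A($TOKEN) (\S+) (HTTP/(\d+)(?:\.(\d+))?)(?:\r?\n|\z)} )
    {
        return({
            method          => $1,
            uri             => $2,
            protocol        => $3,
            http_vers_major => $4,
            http_vers_minor => $5,
            http_version    => ( defined( $5 ) ? "$4.$5" : $4 ),
        });
    }
    return( '' );
}
```

A status line such as HTTP/1.1 200 OK does not match here, because / is not a valid token character for the method.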

looks_like_response

Provided with a string or a scalar reference, this returns a hash reference containing details of the response line attributes if it is indeed a response, or an empty string if it is not a response.

It sets an error and returns undef upon error.

The following attributes are available:

code

The 3-digits HTTP response code. For example in HTTP/1.1 200 OK, this would be 200.

http_version

The HTTP protocol version used. For example, in HTTP/1.1, this would be 1.1, and in HTTP/2, this would be 2.

http_vers_major

The HTTP protocol major version used. For example, in HTTP/1.0, this would be 1, and in HTTP/2, this would be 2.

http_vers_minor

The HTTP protocol minor version used. For example, in HTTP/1.0, this would be 0, and in HTTP/2, this would be undef.

protocol

The HTTP protocol used, e.g. HTTP/1.0, HTTP/1.1, HTTP/2, etc...

status

The response status text. For example in HTTP/1.1 200 OK, this would be OK.

    my $ref = $p->looks_like_response( \$str );
    # or
    # my $ref = $p->looks_like_response( $str );
    die( $p->error ) if( !defined( $ref ) );
    if( $ref )
    {
        say "Response code $ref->{code}, status $ref->{status}, protocol $ref->{protocol}, version major $ref->{http_vers_major}, version minor $ref->{http_vers_minor}";
    }
    else
    {
        say "This is not an HTTP response.";
    }

looks_like_what

Provided with a string or a scalar reference, this returns a hash reference containing details of the HTTP message first line attributes if it is indeed an HTTP message.

The attributes available depend on the type of HTTP message determined and are described in detail in "looks_like_request" and "looks_like_response". In addition to those, it also returns the attribute type, which is a string representing the type of HTTP message this is, i.e. either request or response.

If this does not match either an HTTP request or HTTP response, it returns an empty string.

    my $ref = $p->looks_like_what( \$str );
    die( $p->error ) if( !defined( $ref ) );
    say "This is a ", ( $ref ? $ref->{type} : 'unknown' ), " HTTP message.";

    my $ref = $p->looks_like_what( \$str );
    die( $p->error ) if( !defined( $ref ) );
    if( !$ref )
    {
        say "This is unknown.";
    }
    else
    {
        say "This is an HTTP $ref->{type} with protocol version $ref->{http_version}";
    }
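The classification itself only needs the first line of the message. As a minimal sketch, assuming simplified patterns that are not the module's own, a hypothetical sketch_looks_like_what could tell a status line from a request line like this:

```perl
use strict;
use warnings;

# Hypothetical classifier for illustration; it only inspects the first line.
sub sketch_looks_like_what
{
    my $str = shift( @_ );
    $str = $$str if( ref( $str ) eq 'SCALAR' );
    return( '' ) unless( defined( $str ) );
    # Status line: HTTP-version SP 3-digit code, then an optional reason phrase
    if( $str =~ m!\A(HTTP/(\d+(?:\.\d+)?)) (\d{3})(?: ([^\r\n]*))?(?:\r?\n|\z)! )
    {
        return({
            type         => 'response',
            protocol     => $1,
            http_version => $2,
            code         => $3,
            status       => $4 // '',
        });
    }
    # Request line: method SP request-target SP HTTP-version
    elsif( $str =~ m!\A(\S+) (\S+) (HTTP/(\d+(?:\.\d+)?))(?:\r?\n|\z)! )
    {
        return({
            type         => 'request',
            method       => $1,
            uri          => $2,
            protocol     => $3,
            http_version => $4,
        });
    }
    # Neither a request nor a response
    return( '' );
}
```

The status line is tested first, since it is the only one of the two that starts with the protocol.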

max_body_in_memory_size

Integer. This is the threshold beyond which an entity body that is initially loaded into memory will be switched to a file on the local filesystem. This applies when it is set to a true value and the body size exceeds the amount specified.

By default, this has the value set by the class variable $MAX_BODY_IN_MEMORY_SIZE, which is 102400 bytes or 100K.

max_headers_size

Integer. This is the threshold size in bytes beyond which HTTP headers will trigger an error. This defaults to the class variable $MAX_HEADERS_SIZE, which itself is set by default to 8192 bytes or 8K.

max_read_buffer

Integer. This is the read buffer size. This is used for HTTP::Promise::IO and defaults to 2048 bytes (2K).

new_tmpfile

Creates a new temporary file. If tmp_to_core is set to true, this creates a new file using a scalar object; otherwise, it creates a new temporary file under the directory set with the object parameter tmp_dir. The filehandle binmode is set to raw.

It returns a filehandle upon success. Upon error, it sets an error and returns undef.

output_dir

The filepath to the output directory. This is used when saving entity bodies on the filesystem.

parse

This takes a scalar reference of data, a glob or a file path, and parses the HTTP request or response by calling "parse_fh", passing it whatever options it received.

It returns an entity object upon success. Upon error, it sets an error and returns undef.

parse_data

This takes a string or a scalar reference and returns an entity object upon success. Upon error, it sets an error and returns undef.

parse_fh

This takes a filehandle and parses the HTTP request or response. It returns an entity object upon success. Upon error, it sets an error and returns undef.

It also takes a hash or hash reference of the following options:

  • reader

    An HTTP::Promise::IO. If this is not provided, a new one will be created. Note that data will be read using this reader.

  • request

    Boolean. Set this to true to indicate the data is an HTTP request. If neither request nor response is provided, the parser will attempt guessing it.

  • response

    Boolean. Set this to true to indicate the data is an HTTP response. If neither request nor response is provided, the parser will attempt guessing it.

parse_headers

This takes a string or a scalar reference, including a scalar object such as Module::Generic::Scalar, and an optional hash or hash reference of parameters, and parses the headers found in the given string, if any at all.

It returns a hash reference with the same property names and values returned by "parse_headers_xs".

This method uses pure perl.

Supported options are:

  • convert_dash

    Boolean. If true, this will convert - in header fields to _. Default is false.

  • no_headers_ok

    Boolean. If set to true, this will not trigger an error if there are no headers.
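To illustrate what a convert_dash-style option does, here is a small pure-Perl sketch that parses header field lines into a hash. It is not the module's parser: the function name sketch_parse_header_fields is hypothetical, it expects only field lines (no request or status line), and lower-casing the field names is a simplification chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper showing the effect of a convert_dash-style option.
# Returns a hash reference of field name => array reference of values.
sub sketch_parse_header_fields
{
    my( $text, %opts ) = @_;
    my %headers;
    foreach my $line ( split( /\r?\n/, $text ) )
    {
        # An empty line ends the header block
        last if( $line eq '' );
        # Skip anything that does not look like "Name: value"
        next unless( $line =~ /\A([^:\s]+):[ \t]*(.*?)[ \t]*\z/ );
        # Field names are case-insensitive; lower-cased here for simplicity
        my( $name, $value ) = ( lc( $1 ), $2 );
        # With convert_dash, "content-type" becomes "content_type"
        $name =~ tr/-/_/ if( $opts{convert_dash} );
        push( @{$headers{ $name }}, $value );
    }
    return( \%headers );
}
```

Converting dashes to underscores makes the field names usable as bareword hash keys, e.g. $h->{content_type} rather than $h->{'content-type'}.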

parse_headers_xs

    my $def = $p->parse_headers_xs( $http_request_or_response );
    my $def = $p->parse_headers_xs( $http_request_or_response, $options_hash_ref );

This takes a string or a scalar reference, including a scalar object such as Module::Generic::Scalar, and an optional hash or hash reference of parameters, and parses the headers found in the given string, if any at all.

It returns a dictionary as a hash reference upon success. Upon error, it sets an error with an HTTP error code and returns undef.

Supported options are:

  • convert_dash

    Boolean. If true, this will convert - in header fields to _. Default is false.

  • request

    Boolean. If true, this will parse the string assuming it is a request header.

  • response

    Boolean. If true, this will parse the string assuming it is a response header.

The properties returned in the dictionary depend on whether request or response were enabled.

For request:

  • headers

    An HTTP::Promise::Headers object.

  • length

    The length in bytes of the headers parsed.

  • method

    The HTTP method such as GET, or HEAD, POST, etc.

  • protocol

    String, such as HTTP/1.1 or HTTP/2

  • uri

    String, the request URI, such as /

  • version

    This is a version object and contains a value such as 1.1, so you can do something like:

        if( $def->{version} >= version->parse( '1.1' ) )
        {
            # Do something
        }

For response:

  • code

    The HTTP status code, such as 200

  • headers

    An HTTP::Promise::Headers object.

  • length

    The length in bytes of the headers parsed. This is useful so you can then remove it from the string you provided:

        my $resp = <<EOT;
        HTTP/1.1 200 OK
        Content-Type: text/plain

        Hello world!
        EOT
        my $def = $p->parse_headers_xs( \$resp, response => 1 ) || die( $p->error );
        # Remove the parsed headers from the string
        substr( $resp, 0, $def->{length} ) = '';
        # Remove the empty line separating headers from body, if still present
        $resp =~ s/^\r?\n//;
        # $resp now contains the body, i.e.: "Hello world!\n"

  • status

    String, the HTTP status, i.e. something like OK

  • protocol

    String, such as HTTP/1.1

  • version

    This is a version object and contains a value such as 1.1, so you can do something like:

        if( $def->{version} >= version->parse( '1.1' ) )
        {
            # Do something
        }

If not enough data was provided to parse the headers, this will return an error object with code set to 425 (Too early).

If the headers are incomplete and the cumulative size exceeds the value set with "max_headers_size", this returns an error object with code set to 413 (Request entity too large).

If there are other issues with the headers, this sets the error code to 400 (Bad request), and for any other error, this returns an error object without code.
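The two size-related cases can be pictured with a tiny sketch. It assumes, for illustration only, that a header block counts as complete once an empty line is found. The function name sketch_header_status is hypothetical, and returning 0 for "complete" is a convention of this example, not of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: decides what to report for a partially received message.
# Returns 0 when the header block is complete, 413 when it is incomplete and
# already too large, and 425 when it is incomplete but more data may arrive.
sub sketch_header_status
{
    my( $buffer, $max_headers_size ) = @_;
    # A complete header block ends with an empty line
    return( 0 ) if( $buffer =~ /\r?\n\r?\n/ );
    # Incomplete and already larger than allowed
    return( 413 ) if( length( $buffer ) > $max_headers_size );
    # Incomplete, but within limits: the caller should read more data
    return( 425 );
}
```

A caller reading from a socket would typically keep appending data to its buffer while it gets 425, and give up on 413.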

parse_multi_part

This takes a hash or hash reference of options and parses an HTTP multipart portion of the HTTP request or response.

It returns an entity object upon success and upon error it sets an error object and returns undef.

Supported options are:

parse_open

Provided with a filepath, this opens it in read mode, parses it and returns an entity object.

If there is an error, this returns undef and you can retrieve the error by calling "error" in Module::Generic which is inherited by this module.

parse_request

This takes a string or a scalar reference, including a scalar object such as Module::Generic::Scalar, and an optional hash or hash reference of parameters, and parses the request found in the given string, including the headers and the body.

It returns a dictionary as a hash reference upon success. Upon error, it sets an error with an HTTP error code and returns undef.

The properties returned are the same as the ones returned for a request by "parse_headers_xs", with the addition of the content property, which contains the body data of the request.

This works well for simple requests, i.e. not multipart ones. Otherwise, the entire body, whatever it is, will be stored in content.

parse_request_headers

This is an alias and is equivalent to calling "parse_headers_xs" and setting the request option.

parse_request_line

This takes a string or a scalar reference, including a scalar object such as Module::Generic::Scalar, and parses the request line, returning a hash reference containing 4 properties: method, path, protocol, version.

parse_request_pp

This is the same as "parse_request", except it uses the pure perl method "parse_headers" to parse the headers instead of the XS one.

parse_response

This takes a string or a scalar reference, including a scalar object such as Module::Generic::Scalar, and an optional hash or hash reference of parameters, and parses the response found in the given string, including the headers and the body.

It returns a dictionary as a hash reference upon success. Upon error, it sets an error with an HTTP error code and returns undef.

The properties returned are the same as the ones returned for a response by "parse_headers_xs", with the addition of the content property, which contains the body data of the response.

parse_response_headers

This is an alias and is equivalent to calling "parse_headers_xs" and setting the response option.

parse_response_line

This takes a string or a scalar reference, including a scalar object such as Module::Generic::Scalar, and parses the response line, returning a hash reference containing 4 properties: code, status, protocol, version.

parse_response_pp

This is the same as "parse_response", except it uses the pure perl method "parse_headers" to parse the headers instead of the XS one.

parse_singleton

Provided with a hash or hash reference of options, this parses a simple entity body.

It returns an entity object upon success and upon error it sets an error object and returns undef.

Supported options are:

  • entity

    The HTTP::Promise::Entity object to which this multipart belongs.

  • read_until

    A string or a regular expression that indicates the string up to which to read data from the filehandle.

  • reader

    The HTTP::Promise::IO reader used for reading the data chunks from the filehandle.

parse_version

This takes an HTTP version string, such as HTTP/1.1 or HTTP/2, and returns its major and minor version numbers as a 2-element list in list context, or just the version object in scalar context.
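The context-dependent return value can be sketched with wantarray and the core version module. This is an approximation for illustration, under the assumption that the minor number is undef for a string such as HTTP/2. The function name sketch_parse_version is hypothetical.

```perl
use strict;
use warnings;
use version ();

# Hypothetical helper mimicking the behaviour described above.
# List context: ( major, minor ); scalar context: a version object.
sub sketch_parse_version
{
    my $str = shift( @_ );
    # Returns undef (or an empty list) if this is not an HTTP version string
    return unless( defined( $str ) && $str =~ m{\AHTTP/(\d+)(?:\.(\d+))?\z} );
    my( $major, $minor ) = ( $1, $2 );
    return( $major, $minor ) if( wantarray() );
    return( version->parse( defined( $minor ) ? "$major.$minor" : $major ) );
}
```

In scalar context the result can be compared directly, for instance with sketch_parse_version( 'HTTP/1.1' ) >= version->parse( '1.0' ).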

tmp_dir

Sets or gets the temporary directory to use when creating temporary files.

When set, this returns a file object.

tmp_to_core

Boolean. When set to true, this will store data in memory rather than in a file on the filesystem.

AUTHOR

Jacques Deguest <jack@deguest.jp>

SEE ALSO

rfc6266 on Content-Disposition, rfc7230 on Message Syntax and Routing, rfc7231 on Semantics and Content, rfc7232 on Conditional Requests, rfc7233 on Range Requests, rfc7234 on Caching, rfc7235 on Authentication, rfc7578 on multipart/form-data, rfc7540 on HTTP/2.0

Mozilla documentation on HTTP protocol

Mozilla documentation on HTTP messages

Mozilla documentation

HTTP::Promise, HTTP::Promise::Request, HTTP::Promise::Response, HTTP::Promise::Message, HTTP::Promise::Entity, HTTP::Promise::Headers, HTTP::Promise::Body, HTTP::Promise::Body::Form, HTTP::Promise::Body::Form::Data, HTTP::Promise::Body::Form::Field, HTTP::Promise::Status, HTTP::Promise::MIME, HTTP::Promise::Parser, HTTP::Promise::IO, HTTP::Promise::Stream, HTTP::Promise::Exception

COPYRIGHT & LICENSE

Copyright(c) 2022 DEGUEST Pte. Ltd.

All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.