NAME

Regexp::Common::URI -- provide regexes for URIs.

SYNOPSIS

    use Regexp::Common qw /URI/;

    while (<>) {
        /$RE{URI}{HTTP}/       and  print "Contains an HTTP URI.\n";
    }

DESCRIPTION

Regexes are available for the following URI types:

$RE{URI}{HTTP}{-scheme}

Provides a regex for an HTTP URI as defined by RFC 2396 (generic syntax) and RFC 2616 (HTTP).

If -scheme=P is specified, the pattern P is used as the scheme. By default, P is qr/http/. The patterns https and https? are reasonable alternatives.
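
As a minimal sketch of the -scheme option, the following accepts both http and https URIs by passing the pattern https? as a string. The example.com addresses are placeholders, and the anchoring with ^ and $ is a choice made for this illustration, not something the module requires.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

# Override the default scheme pattern so that http and https both match.
my $re = $RE{URI}{HTTP}{-scheme => 'https?'};

# Illustrative inputs only; the host is a placeholder.
my %matched;
for my $uri (qw(http://example.com/ https://example.com/ ftp://example.com/)) {
    $matched{$uri} = $uri =~ /^$re$/ ? 1 : 0;
    print "$uri: ", ($matched{$uri} ? "matches" : "does not match"), "\n";
}
```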

The syntax for an HTTP URI is:

    "http:" "//" host [ ":" port ] [ "/" path [ "?" query ]]

Under {-keep}, the following are returned:

$1

The entire URI.

$2

The scheme.

$3

The host (name or address).

$4

The port (if any).

$5

The absolute path, including the query and leading slash.

$6

The absolute path, including the query, without the leading slash.

$7

The absolute path, without the query or leading slash.

$8

The query, without the question mark.
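
The capture groups listed above can be inspected with a short script. This is a sketch under stated assumptions: the URI is a made-up example, and the hash keys (abs_path, rel_path and so on) are names chosen here for readability, not names used by the module.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

# Illustrative URI; www.example.com is a placeholder host.
my $uri = 'http://www.example.com:8080/docs/index.html?lang=en';
my $re  = $RE{URI}{HTTP}{-keep};

my %part;
if ($uri =~ /^$re$/) {
    # Map $1 .. $8 onto descriptive (hypothetical) names.
    @part{qw(uri scheme host port abs_path rel_path path query)}
        = ($1, $2, $3, $4, $5, $6, $7, $8);
    print "$_ = $part{$_}\n" for sort keys %part;
}
```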

$RE{URI}{FTP}{-type}{-password}

Returns a regex for FTP URIs. Note: FTP URIs are not formally defined. RFC 1738 defines FTP URLs, but parts of that RFC have been obsoleted by RFC 2396. However, the changes that RFC 2396 makes to RFC 1738 cannot be applied straightforwardly to FTP URIs.

There are two main problems:

Passwords.

RFC 1738 allowed an optional username and an optional password (separated by a colon) in the FTP URL. Hence, colons were not allowed in either the username or the password. RFC 2396 strongly recommends that passwords not be used in URIs. It does allow for userinfo instead. This userinfo part may contain colons, and may therefore contain more than one. The regex returned follows the RFC 2396 specification, unless the {-password} option is given; then the regex allows for an optional username and password, separated by a colon.

The ;type specifier.

RFC 1738 does not allow semi-colons in FTP path names, because a semi-colon is a reserved character for FTP URIs. The semi-colon is used to separate the path from the optional type specifier. However, in RFC 2396, paths consist of slash separated segments, and each segment is a semi-colon separated group of parameters. Straightforward application of RFC 2396 would mean that a trailing type specifier couldn't be distinguished from the last segment of the path having two parameters, the last one starting with type=. Therefore we have opted to disallow a semi-colon in the path part of an FTP URI.

Furthermore, RFC 1738 allows three values for the type specifier, A, I and D (either upper case or lower case). However, the internet draft about FTP URIs [DRAFT-URL-FTP] (which expired in May 1997) notes the lack of consistent implementation of the D parameter and drops D from the set of possible values. We follow this practice; however, RFC 1738 behaviour can be achieved by using the "-type=[ADIadi]" parameter.
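
The effect of the -type parameter can be sketched as follows. The URI is an invented example; the expectation is that the default pattern rejects a type of d when the match is anchored, while the RFC 1738 style pattern accepts it.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

# Default pattern: only the type values A and I (either case).
my $default = $RE{URI}{FTP};
# RFC 1738 behaviour: also allow D.
my $rfc1738 = $RE{URI}{FTP}{-type => '[ADIadi]'};

# Illustrative URI; ftp.example.org is a placeholder host.
my $uri = 'ftp://ftp.example.org/pub/dir;type=d';

my $default_ok = $uri =~ /^$default$/ ? 1 : 0;
my $rfc1738_ok = $uri =~ /^$rfc1738$/ ? 1 : 0;

print "default pattern:  ", ($default_ok ? "matches" : "does not match"), "\n";
print "RFC 1738 pattern: ", ($rfc1738_ok ? "matches" : "does not match"), "\n";
```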

FTP URIs have the following syntax:

    "ftp:" "//" [ userinfo "@" ] host [ ":" port ]
                [ "/" path [ ";type=" value ]]

When using {-password}, we have the syntax:

    "ftp:" "//" [ user [ ":" password ] "@" ] host [ ":" port ]
                [ "/" path [ ";type=" value ]]
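
The difference between the two syntaxes shows up when the part before the "@" contains more than one colon. The sketch below uses invented URIs and anchors each match; it assumes nothing beyond the two patterns described above.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

my $userinfo = $RE{URI}{FTP};             # RFC 2396 userinfo
my $password = $RE{URI}{FTP}{-password};  # user [ ":" password ]

# Illustrative URIs; ftp.example.org is a placeholder host.
my @uris = ('ftp://user:secret@ftp.example.org/pub/README',
            'ftp://a:b:c@ftp.example.org/pub/README');

my %result;
for my $uri (@uris) {
    $result{$uri} = { userinfo => ($uri =~ /^$userinfo$/ ? 1 : 0),
                      password => ($uri =~ /^$password$/ ? 1 : 0) };
    print "$uri: userinfo=$result{$uri}{userinfo} ",
          "password=$result{$uri}{password}\n";
}
```

With a single colon both patterns are expected to match; with two colons only the userinfo form should.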

Under {-keep}, the following are returned:

$1

The complete URI.

$2

The scheme.

$3

The userinfo, or if {-password} is used, the username.

$4

If {-password} is used, the password, else undef.

$5

The hostname or IP address.

$6

The port number.

$7

The full path and type specification, including the leading slash.

$8

The full path and type specification, without the leading slash.

$9

The full path, without the type specification or the leading slash.

$10

The value of the type specification.
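
A short sketch of the captures listed above, using {-password} together with {-keep}. The URI, the credentials and the hash keys are all invented for illustration.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

# Illustrative URI; host and credentials are placeholders.
my $uri = 'ftp://anonymous:secret@ftp.example.org:21/pub/README;type=a';
my $re  = $RE{URI}{FTP}{-password}{-keep};

my %part;
if ($uri =~ /^$re$/) {
    # Map $1 .. $10 onto descriptive (hypothetical) names.
    @part{qw(uri scheme user password host port
             abs_path rel_path path type)}
        = ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10);
    print "$_ = $part{$_}\n" for sort keys %part;
}
```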

$RE{URI}{tel}

Returns a pattern that matches tel URIs, as defined by RFC 2806. Under {-keep}, the following are returned:

$1

The complete URI.

$2

The scheme.

$3

The phone number, including any possible add-ons like ISDN subaddress, a post dial part, area specifier, service provider, etc.
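
As a minimal sketch, the following matches a tel URI of the form used as an example in RFC 2806 and prints the three captures. The variable names are chosen for this illustration.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

# A global phone number with visual separators.
my $uri = 'tel:+358-555-1234567';
my $re  = $RE{URI}{tel}{-keep};

my ($whole, $scheme, $number);
if ($uri =~ /^$re$/) {
    ($whole, $scheme, $number) = ($1, $2, $3);
    print "URI:    $whole\n";
    print "scheme: $scheme\n";
    print "number: $number\n";
}
```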

$RE{URI}{tel}{nofuture}

As above (including what's returned by {-keep}), with the exception that future extensions are not allowed. Without those future extensions, it becomes much easier to check whether a URI uses the correct syntax for post dial, service provider, phone context, etc.; otherwise the regex could always classify a malformed part as a future extension.
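
The difference can be sketched with a URI carrying a parameter that is none of the known ones. The parameter name foo is invented for this example; under RFC 2806 it can only be read as a future extension, so the nofuture pattern is expected to reject it when the match is anchored.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

my $lenient = $RE{URI}{tel};
my $strict  = $RE{URI}{tel}{nofuture};

# ";foo=bar" is a made-up parameter, not defined by RFC 2806.
my $uri = 'tel:+358-555-1234567;foo=bar';

my $lenient_ok = $uri =~ /^$lenient$/ ? 1 : 0;
my $strict_ok  = $uri =~ /^$strict$/  ? 1 : 0;

print "tel:           ", ($lenient_ok ? "matches" : "does not match"), "\n";
print "tel, nofuture: ", ($strict_ok  ? "matches" : "does not match"), "\n";
```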

$RE{URI}{fax} and $RE{URI}{fax}{nofuture}

Similar to $RE{URI}{tel} and $RE{URI}{tel}{nofuture}, except that they return patterns matching fax URIs, as defined in RFC 2806. {-keep} returns the same fragments as for tel URIs.
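
A fax URI can be checked in the same way. This sketch uses a number of the form given as an example in RFC 2806; the variable names are chosen for illustration.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

my $uri = 'fax:+358.555.1234567';
my $re  = $RE{URI}{fax}{-keep};

my ($fax_scheme, $fax_number);
if ($uri =~ /^$re$/) {
    # Same fragments as for tel URIs: $2 is the scheme, $3 the number.
    ($fax_scheme, $fax_number) = ($2, $3);
    print "scheme: $fax_scheme\n";
    print "number: $fax_number\n";
}
```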

$RE{URI}{tv}

Returns a pattern that recognizes TV URIs as per an Internet draft [DRAFT-URI-TV].
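
A minimal sketch of its use, assuming a TV URI of the form "tv:" followed by a DNS-style broadcaster name as described in the draft. The name wqed.org is used purely as an illustration.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

my $re = $RE{URI}{tv};

# Illustrative inputs: one TV URI and one URI with a different scheme.
my %is_tv;
for my $uri ('tv:wqed.org', 'http://wqed.org/') {
    $is_tv{$uri} = $uri =~ /^$re$/ ? 1 : 0;
    print "$uri: ", ($is_tv{$uri} ? "TV URI" : "not a TV URI"), "\n";
}
```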

REFERENCES

[DRAFT-URI-TV]

Zigmond, D. and Vickers, M.: Uniform Resource Identifiers for Television Broadcasts. December 2000.

[DRAFT-URL-FTP]

Casey, James: A FTP URL Format. November 1996.

[RFC 1035]

Mockapetris, P.: DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION. November 1987.

[RFC 1738]

Berners-Lee, Tim, Masinter, L., McCahill, M.: Uniform Resource Locators (URL). December 1994.

[RFC 2396]

Berners-Lee, Tim, Fielding, R., and Masinter, L.: Uniform Resource Identifiers (URI): Generic Syntax. August 1998.

[RFC 2616]

Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, Tim: Hypertext Transfer Protocol -- HTTP/1.1. June 1999.

[RFC 2806]

Vaha-Sipila, A.: URLs for Telephone Calls. April 2000.

HISTORY

 $Log: URI.pm,v $
 Revision 1.9  2003/01/01 23:00:54  abigail
 TV URIs

 Revision 1.8  2002/08/27 16:56:27  abigail
 Support for fax URIs.

 Revision 1.7  2002/08/06 14:44:07  abigail
 Local phone numbers can have future extensions as well.

 Revision 1.6  2002/08/06 13:18:03  abigail
 Cosmetic changes

 Revision 1.5  2002/08/06 13:16:27  abigail
 Added $RE{URI}{tel}{nofuture}

 Revision 1.4  2002/08/06 00:03:30  abigail
 Added $RE{URI}{tel}

 Revision 1.3  2002/08/04 22:51:35  abigail
 Added FTP URIs.

 Revision 1.2  2002/07/25 22:37:44  abigail
 Added 'use strict'.
 Added 'no_defaults' to 'use Regex::Common' to prevent loading of all
 defaults.

 Revision 1.1  2002/07/25 19:56:07  abigail
 Modularizing Regexp::Common.

SEE ALSO

Regexp::Common for a general description of how to use this interface.

AUTHOR

Damian Conway (damian@conway.org)

MAINTENANCE

This package is maintained by Abigail (regexp-common@abigail.nl).

BUGS AND IRRITATIONS

Bound to be plenty.

For a start, there are many common regexes missing. Send them in to regexp-common@abigail.nl.

COPYRIGHT

     Copyright (c) 2001 - 2002, Damian Conway. All Rights Reserved.
       This module is free software. It may be used, redistributed
      and/or modified under the terms of the Perl Artistic License
            (see http://www.perl.com/perl/misc/Artistic.html)