Regexp::Common::URI -- provide regexes for URIs.
    use Regexp::Common qw /URI/;

    while (<>) {
        /$RE{URI}{HTTP}/ and print "Contains an HTTP URI.\n";
    }
Regexes are available for the following URI types:
$RE{URI}{HTTP}
Provides a regex for an HTTP URI as defined by RFC 2396 (generic syntax) and RFC 2616 (HTTP).
If -scheme=P is specified, the pattern P is used as the scheme. By default, P is qr/http/; https and https? are reasonable alternatives.
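As a minimal sketch of the -scheme option, the following compares the default pattern with one that also accepts https. The URI and the variable names are made up for illustration, and the patterns are anchored with ^ and $ because the regexes themselves are not anchored.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

# Fetch the patterns first, then interpolate them into anchored regexes.
my $default_re = $RE{URI}{HTTP};                          # scheme: http
my $either_re  = $RE{URI}{HTTP}{-scheme => 'https?'};     # scheme: http or https

my $uri = 'https://www.example.com/index.html';           # hypothetical URI

my $default_matches = $uri =~ /^$default_re$/ ? 1 : 0;
my $either_matches  = $uri =~ /^$either_re$/  ? 1 : 0;

print "default scheme:  ", ($default_matches ? "match" : "no match"), "\n";
print "-scheme=https?:  ", ($either_matches  ? "match" : "no match"), "\n";
```

With the default scheme the https URI is rejected; with -scheme => 'https?' it is accepted.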
The syntax for an HTTP URI is:
"http:" "//" host [ ":" port ] [ "/" path [ "?" query ]]
Under {-keep}, the following are returned:
$1  The entire URI.
$2  The scheme.
$3  The host (name or address).
$4  The port (if any).
$5  The absolute path, including the query and leading slash.
$6  The absolute path, including the query, without the leading slash.
$7  The absolute path, without the query or leading slash.
$8  The query, without the question mark.
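The fragments listed above can be inspected with a short script. This is a sketch only: the URI and the variable names are assumptions chosen for illustration, and the captures are taken in the order in which the list above gives them.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

my $re  = $RE{URI}{HTTP}{-keep};
my $uri = 'http://www.example.com:8080/path/to/page?x=1';   # hypothetical URI

# In list context a successful match returns the captured groups in order.
my ($whole, $scheme, $host, $port, $abs_path, $path_query, $path, $query)
    = $uri =~ /^$re$/;

my @labels = ('URI', 'scheme', 'host', 'port',
              'abs path', 'path+query', 'path', 'query');
my @values = ($whole, $scheme, $host, $port,
              $abs_path, $path_query, $path, $query);

for my $i (0 .. $#labels) {
    # Optional parts (such as the port) are undef when absent.
    printf "%-11s %s\n", $labels[$i],
        defined $values[$i] ? $values[$i] : '(undef)';
}
```

Without {-keep} the pattern still matches, but no fragments are captured.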
$RE{URI}{FTP}
Returns a regex for FTP URIs. Note: FTP URIs are not formally defined. RFC 1738 defines FTP URLs, but parts of that RFC have been obsoleted by RFC 2396. However, the differences between the two RFCs are such that RFC 2396 cannot be applied straightforwardly to FTP URIs.
There are two main problems:
RFC 1738 allowed an optional username and an optional password (separated by a colon) in the FTP URL. Hence, colons were not allowed in either the username or the password. RFC 2396 strongly recommends that passwords not be used in URIs; it allows for a userinfo part instead. This userinfo part may itself contain colons, possibly more than one. The regex returned follows the RFC 2396 specification unless the {-password} option is given; in that case the regex allows for an optional username and password, separated by a colon.
RFC 1738 does not allow semi-colons in FTP path names, because a semi-colon is a reserved character for FTP URIs. The semi-colon is used to separate the path from the option type specifier. However, in RFC 2396, paths consist of slash separated segments, and each segment is a semi-colon separated group of parameters. Straightforward application of RFC 2396 would mean that a trailing type specifier couldn't be distinguished from the last segment of the path having two parameters, the last one starting with type=. Therefore we have opted to disallow a semi-colon in the path part of an FTP URI.
Furthermore, RFC 1738 allows three values for the type specifier, A, I and D (either upper case or lower case). However, the internet draft about FTP URIs [DRAFT-FTP-URL] (which expired in May 1997) notes the lack of consistent implementation of the D parameter and drops D from the set of possible values. We follow this practice; however, RFC 1738 behaviour can be achieved by using the "-type=[ADIadi]" parameter.
FTP URIs have the following syntax:
"ftp:" "//" [ userinfo "@" ] host [ ":" port ] [ "/" path [ ";type=" value ]]
When using {-password}, we have the syntax:
"ftp:" "//" [ user [ ":" password ] "@" ] host [ ":" port ] [ "/" path [ ";type=" value ]]
Under {-keep}, the following are returned:
The complete URI.
The userinfo, or if {-password} is used, the username.
If {-password} is used, the password, else undef.
The hostname or IP address.
The port number.
The full path and type specification, including the leading slash.
The full path and type specification, without the leading slash.
The full path, without the type specification or the leading slash.
The value of the type specification.
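The -type and {-password} options described above can be tried out as follows. This is a hedged sketch: the URIs and variable names are invented for illustration, and the patterns are anchored so that a rejected type specifier cannot be skipped by a partial match.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

my $default_re  = $RE{URI}{FTP};                         # type is A or I only
my $rfc1738_re  = $RE{URI}{FTP}{-type => '[ADIadi]'};    # also accept type D
my $password_re = $RE{URI}{FTP}{-password};              # user[:password]@host

# Hypothetical URIs, for illustration only.
my $typed_a = 'ftp://ftp.example.com/pub/file.txt;type=a';
my $typed_d = 'ftp://ftp.example.com/pub/listing;type=d';
my $login   = 'ftp://anonymous:guest@ftp.example.com/pub/file.txt';

my %result = (
    'type=a, default'      => ($typed_a =~ /^$default_re$/  ? 1 : 0),
    'type=d, default'      => ($typed_d =~ /^$default_re$/  ? 1 : 0),
    'type=d, -type=ADIadi' => ($typed_d =~ /^$rfc1738_re$/  ? 1 : 0),
    'login, -password'     => ($login   =~ /^$password_re$/ ? 1 : 0),
);

for my $case (sort keys %result) {
    printf "%-22s %s\n", $case, $result{$case} ? 'match' : 'no match';
}
```

Only the type=d URI under the default pattern should be rejected; the other three cases should match.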
$RE{URI}{tel}
Returns a pattern that matches tel URIs, as defined by RFC 2806. Under {-keep}, the following are returned:
The phone number, including any possible add-ons like ISDN subaddress, a post dial part, area specifier, service provider, etc.
$RE{URI}{tel}{nofuture}
As above (including what's returned by {-keep}), with the exception that future extensions are not allowed. Without those future extensions, it becomes much easier to check whether a URI uses the correct syntax for post dial, service provider, phone context, etc. Otherwise, the regex could always classify a malformed part as a future extension.
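The difference between the two patterns can be seen with a parameter that RFC 2806 does not define. The sketch below assumes a made-up ;foo=bar parameter standing in for a future extension; the phone number is the example number used in RFC 2806, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Regexp::Common qw /URI/;

my $tel_re    = $RE{URI}{tel};
my $strict_re = $RE{URI}{tel}{nofuture};

my $plain   = 'tel:+358-555-1234567';
my $postd   = 'tel:+358-555-1234567;postd=pp22';   # post dial part
my $unknown = 'tel:+358-555-1234567;foo=bar';      # hypothetical extension

my %accepted;
for my $uri ($plain, $postd, $unknown) {
    $accepted{$uri} = {
        tel      => ($uri =~ /^$tel_re$/    ? 1 : 0),
        nofuture => ($uri =~ /^$strict_re$/ ? 1 : 0),
    };
    printf "%-34s tel=%d nofuture=%d\n",
        $uri, $accepted{$uri}{tel}, $accepted{$uri}{nofuture};
}
```

The unknown parameter is accepted by $RE{URI}{tel} as a future extension, but rejected by the {nofuture} variant.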
$RE{URI}{fax}
$RE{URI}{fax}{nofuture}
Similar to $RE{URI}{tel} and $RE{URI}{tel}{nofuture}, except that it will return patterns matching fax URIs, as defined in RFC 2806. {-keep} will return the same fragments as for tel URIs.
$RE{URI}{tv}
Returns a pattern that recognizes TV URIs as per an Internet draft [DRAFT-URI-TV].
[DRAFT-URI-TV] Zigmond, D. and Vickers, M.: Uniform Resource Identifiers for Television Broadcasts. December 2000.
[DRAFT-FTP-URL] Casey, James: A FTP URL Format. November 1996.
[RFC 1035] Mockapetris, P.: DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION. November 1987.
[RFC 1738] Berners-Lee, Tim, Masinter, L., McCahill, M.: Uniform Resource Locators (URL). December 1994.
[RFC 2396] Berners-Lee, Tim, Fielding, R., and Masinter, L.: Uniform Resource Identifiers (URI): Generic Syntax. August 1998.
[RFC 2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, Tim: Hypertext Transfer Protocol -- HTTP/1.1. June 1999.
[RFC 2806] Vaha-Sipila, A.: URLs for Telephone Calls. April 2000.
$Log: URI.pm,v $
Revision 1.9  2003/01/01 23:00:54  abigail
    TV URIs
Revision 1.8  2002/08/27 16:56:27  abigail
    Support for fax URIs.
Revision 1.7  2002/08/06 14:44:07  abigail
    Local phone numbers can have future extensions as well.
Revision 1.6  2002/08/06 13:18:03  abigail
    Cosmetic changes
Revision 1.5  2002/08/06 13:16:27  abigail
    Added $RE{URI}{tel}{nofuture}
Revision 1.4  2002/08/06 00:03:30  abigail
    Added $RE{URI}{tel}
Revision 1.3  2002/08/04 22:51:35  abigail
    Added FTP URIs.
Revision 1.2  2002/07/25 22:37:44  abigail
    Added 'use strict'. Added 'no_defaults' to 'use Regex::Common' to prevent loading of all defaults.
Revision 1.1  2002/07/25 19:56:07  abigail
    Modularizing Regexp::Common.
See Regexp::Common for a general description of how to use this interface.
Damian Conway (damian@conway.org)
This package is maintained by Abigail (regexp-common@abigail.nl).
Bugs and irritations: bound to be plenty.
For a start, there are many common regexes missing. Send them in to regexp-common@abigail.nl.
Copyright (c) 2001 - 2002, Damian Conway. All Rights Reserved. This module is free software. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html)