NAME

Apache::Gateway - A Bloated Gateway Module

SYNOPSIS

Example Apache configuration:

  <Location /CPAN/>
  SetHandler perl-script
  PerlHandler Apache::Gateway
  PerlSetVar GatewayConfig /etc/apache/gateway/CPAN
  PerlSetupEnv Off
  </Location>

Example GatewayConfig file:

  GatewayRoot /CPAN/
  
  <LocationMatch ".*">
  Site    ftp://ftp.perl.org/pub/perl/CPAN/
  MuxSite ftp://ftp.cdrom.com/pub/perl/CPAN/
  MuxSite ftp://ftp.digital.com/pub/plan/perl/CPAN/
  Site    ftp://ftp.orst.edu/pub/packages/CPAN/
  Site    ftp://ftp.funet.fi/pub/languages/perl/CPAN/
  </LocationMatch>
  
  ClockBroken ftp://ftp.cdrom.com       EET     PST8PDT
  ClockBroken ftp://ftp.digital.com     EET     PST8PDT
  ClockBroken ftp://ftp.orst.edu        EET     PST8PDT
  ClockBroken ftp://ftp.perl.org        CST     CST6CDT

See the examples directory for commented examples.

DESCRIPTION

The Apache::Gateway module implements a gateway with assorted optional features.

FEATURES

Standard Gateway Features

Apache::Gateway services requests using LWP and, hence, can gateway to any protocol that LWP understands. It also makes foreign URIs appear to be local URIs.

Apache::Gateway does not include a cache, but it can be used in combination with a proxy cache to cache what the gateway retrieves. For example, Apache can provide caching for the gateway by setting up a proxy cache virtual host and a gateway virtual host and then using the proxy to access the gateway.

Automatic Failover with Mirrored Instances

Multiple mirrors can provide an instance. Requests which fail will automatically be retried with the next mirror. This capability is very useful when some mirrors are busy or erratic.

Multiplexing

Like the CPAN multiplexer, Apache::Gateway can multiplex requests amongst several mirrors.

Pattern-dependent Gatewaying

The origin server to contact can vary depending upon the URL. This capability is particularly useful when dealing with partial mirrors. A common situation is that some files may be available at all mirrors, but less commonly used files will only be available at a few mirrors.

FTP Directory Gatewaying

(Need to think of a better name for this feature.) Remote FTP directory listings could be modified to refer to the gateway. This feature was somewhat similar to the ProxyPassReverse directive.

This feature was especially complicated and problematic. It has now been removed.

Timestamp Correction

Apache::Gateway can try to correct incorrect timestamps generated by popular mirroring software. In particular, it can try to compensate for the way the Perl mirror program sets timestamps.

CONFIGURATION

Most configuration is done in the GatewayConfig files. The regular Apache configuration files only need to include the handler directives and set the GatewayConfig filename. Environment variables are not used, so PerlSetupEnv can be Off.

GatewayConfig directives purposely look like Apache config directives so that the syntax will be familiar. However, GatewayConfig directives are not Apache config directives. They cannot be used in Apache config files (and vice versa)!

GatewayRoot path

Sets the root of the gatewayed area on the local server. Generally matches the Location setting in the Apache config files. Defaults to "/".

GatewayTimeout timeout

Passes timeout (in seconds) to LWP::UserAgent.

LocationMatch regexp

Begins a LocationMatch section. Works similarly to the Apache LocationMatch directive except that the pattern is a Perl regexp. Note: there are no Location or other style sections, only LocationMatch.

LocationMatch sections are tried in order until a regexp is matched.
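The first-match rule can be illustrated with a minimal sketch. The @sections data structure, the sites_for_path helper, and the "authors/" pattern are hypothetical and are not part of Apache::Gateway. The sketch also assumes the regexp is applied to the request path relative to GatewayRoot, which the text above does not state.

```perl
use strict;
use warnings;

# Hypothetical in-memory form of a GatewayConfig file: an ordered list of
# sections, each holding a Perl regexp and the sites listed inside it.
my @sections = (
    { pattern => qr{^authors/},
      sites   => ['ftp://ftp.funet.fi/pub/languages/perl/CPAN/'] },
    { pattern => qr{.*},
      sites   => ['ftp://ftp.perl.org/pub/perl/CPAN/',
                  'ftp://ftp.orst.edu/pub/packages/CPAN/'] },
);

# Return the sites of the first section whose regexp matches $path.
# Later sections are never consulted once one has matched.
sub sites_for_path {
    my ($path, @candidates) = @_;
    for my $section (@candidates) {
        return @{ $section->{sites} } if $path =~ $section->{pattern};
    }
    return;    # no section matched
}

print "$_\n" for sites_for_path('authors/00whois.html', @sections);
```

Because sections are tried in order, a catch-all pattern such as ".*" belongs last; placed first, it would hide every section after it.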

Site URI

Sets an upstream server to contact for this URI. In case of failure, requests are automatically retried with successive sites in the order they appear. Failures can include anything from the upstream server being down or flaky to a file not being present because the upstream mirror is out of sync with its primary site.
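The retry behaviour can be sketched without any network access. This is only an illustration of "try each site in order until one succeeds": first_success and the simulated $fake_fetch are hypothetical names, and the real module issues its requests through LWP rather than through a code reference.

```perl
use strict;
use warnings;

# Try each site in order and return the first one that succeeds.
# $fetch is a caller-supplied code ref standing in for the real upstream
# request; it returns a defined value on success and undef on failure.
sub first_success {
    my ($fetch, @sites) = @_;
    for my $site (@sites) {
        my $result = $fetch->($site);
        return ($site, $result) if defined $result;
    }
    return;    # every site failed
}

# Simulated upstream: pretend the first mirror is down.
my %down = ('ftp://ftp.perl.org/pub/perl/CPAN/' => 1);
my $fake_fetch = sub {
    my ($site) = @_;
    return $down{$site} ? undef : "contents from $site";
};

my ($served_by, $body) = first_success(
    $fake_fetch,
    'ftp://ftp.perl.org/pub/perl/CPAN/',
    'ftp://ftp.orst.edu/pub/packages/CPAN/',
);
print "served by $served_by\n";
```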

MuxSite URI

Sets an upstream server to contact for this URI. Adjacent MuxSite servers are tried in round-robin order.

For example, here again is the default portion of the sample GatewayConfig file above:

  <LocationMatch ".*">
  Site    ftp://ftp.perl.org/pub/perl/CPAN/
  MuxSite ftp://ftp.cdrom.com/pub/perl/CPAN/
  MuxSite ftp://ftp.digital.com/pub/plan/perl/CPAN/
  Site    ftp://ftp.orst.edu/pub/packages/CPAN/
  Site    ftp://ftp.funet.fi/pub/languages/perl/CPAN/
  </LocationMatch>

With the Site and MuxSite directives here, the first request will be forwarded to ftp.perl.org. If it fails, the request will be retried with cdrom, digital, orst, and funet, in that order. The next request handled by the same process will again be tried with ftp.perl.org first. If it fails, retries go to digital, cdrom, orst, and finally funet, because the two adjacent MuxSite entries are rotated.
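The ordering described above can be reproduced with a short sketch. The @entries structure and the try_order function are hypothetical, and rotating each run of adjacent MuxSite entries by a per-process request counter is an assumption made here to match the documented behaviour; the module's actual bookkeeping may differ.

```perl
use strict;
use warnings;

# Hypothetical representation of the section above: [directive, host] pairs.
my @entries = (
    [ Site    => 'ftp.perl.org'    ],
    [ MuxSite => 'ftp.cdrom.com'   ],
    [ MuxSite => 'ftp.digital.com' ],
    [ Site    => 'ftp.orst.edu'    ],
    [ Site    => 'ftp.funet.fi'    ],
);

# Return the order in which hosts are tried for the $n-th request
# (counting from 0) handled by one process. Each run of adjacent MuxSite
# entries is rotated by $n; Site entries keep their position.
sub try_order {
    my ($n, @list) = @_;
    my (@order, @run);
    my $flush = sub {
        return unless @run;
        my $shift = $n % scalar(@run);
        push @order, @run[$shift .. $#run], @run[0 .. $shift - 1];
        @run = ();
    };
    for my $entry (@list) {
        my ($directive, $host) = @$entry;
        if ($directive eq 'MuxSite') {
            push @run, $host;       # collect the current multiplexed run
        }
        else {
            $flush->();             # a Site entry ends any pending run
            push @order, $host;
        }
    }
    $flush->();
    return @order;
}

print join(' ', try_order($_, @entries)), "\n" for 0 .. 1;
# prints:
# ftp.perl.org ftp.cdrom.com ftp.digital.com ftp.orst.edu ftp.funet.fi
# ftp.perl.org ftp.digital.com ftp.cdrom.com ftp.orst.edu ftp.funet.fi
```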

A good general strategy for packages with multiple mirrors might be to specify one or two nearby sites to try first. Then specify some multiplexed sites slightly further away in case the nearby sites fail. Finally, fall back to the primary site if all else fails.

ClockBroken server-URL upstream^2-TZ upstream-TZ

When caching is employed and requests can be gatewayed to multiple mirrors, timestamp correctness becomes more important. Unfortunately, timestamps on mirrored files are usually wrong. For example, the popular Perl mirror program is generally configured to match timestamps using the local timezone both locally and on the server it is mirroring. This strategy is only guaranteed to work if both servers are in the same timezone.

Example: ClockBroken ftp://ftp.cdrom.com EET PST8PDT

cdrom gets files from funet, which seems always to use the EET timezone (two hours ahead of GMT) for purposes of mirroring. cdrom, however, uses the PST8PDT timezone, so that 00:00 on funet differs from 00:00 on cdrom by 9 or 10 hours, depending upon whether or not Daylight Savings Time is in effect on cdrom. The example ClockBroken line corrects for this disparity.
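The 9 or 10 hour figure follows from simple offset arithmetic, shown here as a worked example. The offset table and the offset_difference function are written by hand for illustration; this is not how the module itself obtains timezone offsets.

```perl
use strict;
use warnings;

# Fixed UTC offsets in hours, written out by hand for this example.
my %utc_offset_hours = (
    EET => 2,     # funet, per the text above
    PST => -8,    # cdrom during standard time
    PDT => -7,    # cdrom during daylight time
);

# Seconds by which wall-clock time in $origin is ahead of $mirror.
sub offset_difference {
    my ($origin, $mirror) = @_;
    return ($utc_offset_hours{$origin} - $utc_offset_hours{$mirror}) * 3600;
}

printf "EET vs PST: %d hours\n", offset_difference('EET', 'PST') / 3600;
printf "EET vs PDT: %d hours\n", offset_difference('EET', 'PDT') / 3600;
# prints:
# EET vs PST: 10 hours
# EET vs PDT: 9 hours
```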

Note: timezones are those understood by Time::Zone.

FUNCTIONS

The following internal functions are documented (mostly useful for hackers):

$gw = Apache::Gateway->new( [$ua] )

Construct a new Apache::Gateway object describing a gateway. If an LWP::UserAgent is not provided, a new one will be created. Note: the user agent is modified for each request; it is not constant and is probably not shareable.

$gw->user_agent( [$ua] )

Get/set the user agent.

$gw->request( [$r] )

Get/set the Apache request currently being gatewayed. To send the request, see the send_request method.

$gw->location_config( [$config] )

Get/set the configuration information for this gateway location. Can be overridden to provide dynamic per-location information.

clear_headers_for_redirect($r)

Clear request headers in $r in preparation for a redirect.

canonicalized_server_URL($scheme, $hostname, $port)

Return semicanonicalized server URL (without trailing slash).

server_name_from_URL($r, $url)

Return the (somewhat canonicalized) "server name" portion of the URL. The "server name" is defined as the leading scheme://authority portion of the URL.
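As an illustration of what "the leading scheme://authority portion" means, here is a minimal standalone sketch. The function leading_server_name is hypothetical and takes only a URL, unlike the module's own function. Lower-casing the scheme and host is an assumed form of canonicalization, and the sketch assumes the authority carries no user:password part.

```perl
use strict;
use warnings;

# Return the leading scheme://authority part of an absolute URL, with the
# scheme and authority lower-cased and no trailing slash. Returns an empty
# list (undef in scalar context) when the string does not start that way.
sub leading_server_name {
    my ($url) = @_;
    return unless $url =~ m{^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)};
    return lc($1) . '://' . lc($2);
}

print leading_server_name('FTP://FTP.Funet.FI/pub/languages/perl/CPAN/'), "\n";
# prints: ftp://ftp.funet.fi
```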

server_name($r)

Return the (somewhat canonicalized) "server name" portion of the URL of this server. The "server name" is defined as the leading scheme://authority portion of the URL. Currently assumes server access is via HTTP.

diff_TZ($origin_TZ, $mirror_TZ)

Get the usual time difference (in seconds) between the two time zones. Will yield the wrong results in the midst of a change to/from daylight savings time. Specifically, as used in this module, this function will return the wrong results when applied to files retrieved by the mirror during the two hours of the year when one server is in Daylight Savings Time and the other is not.

$gw->update_via_header_field($response)

Update Via header in HTTP::Response with information about this hop. Hop information combines protocol information from the message with server information from the Apache server. The server name returned is hardcoded as 'apache'.

Eventually, options should be provided to control hostname suppression and comment customization.
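For readers unfamiliar with the Via field, the sketch below shows how one hop can be appended to an existing value using the general syntax from the HTTP specification (received protocol, received-by host, optional comment). The function add_via_hop and the host gateway.example.com are hypothetical; this is not the module's implementation.

```perl
use strict;
use warnings;

# Append one hop to an existing Via field value, or start a new value.
# $protocol is e.g. "HTTP/1.1" or "FTP/1.0"; following the HTTP
# specification, the protocol name is omitted when it is HTTP.
sub add_via_hop {
    my ($existing, $protocol, $received_by, $comment) = @_;
    (my $received_protocol = $protocol) =~ s{^HTTP/}{}i;
    my $hop = "$received_protocol $received_by";
    $hop .= " ($comment)" if defined($comment) && length($comment);
    return (defined($existing) && length($existing)) ? "$existing, $hop" : $hop;
}

print add_via_hop('1.0 fred', 'HTTP/1.1', 'gateway.example.com', 'apache'), "\n";
# prints: 1.0 fred, 1.1 gateway.example.com (apache)
```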

copy_header_to_Apache_request($r, $headers)

Copy the headers from an HTTP::Headers object to an Apache::Request. The hope is that the Apache request object will later print out the headers in "Good Practice" order (there appears to be no way of controlling this).

The only tricky item is the Content-Type header, which needs special handling.

redirect($allow_abort)

Try a redirect. We do this via LWP::UserAgent because internal_redirect_handler does not provide hooks for detecting and recovering from errors.

$gw->site( [$site] )

Get/set the site tried. Can be used to determine which upstream server actually fields a request.

$gw->try_URI($allow_abort)

Try the site $gw->site. Ideally, we could use Apache::internal_redirect_handler to try the redirects. However, it provides no hook for detecting an error and aborting output. That's not mod_perl's fault--Apache source would need to be modified to support such a hook.

try_sites($allow_last_site_abort, @site)

Try sites in order until one succeeds. $allow_last_site_abort indicates if the last site can/should be aborted after examining the head for its error code. All other sites always allow premature abortion.

Abortion is needed because only one request can be allowed to run to completion and produce a message body.

$gw->site_list

Get the list of sites to try for this request. Can be overridden to customize the list of sites to try.

By default, this method looks through the LocationMatch sections in the GatewayConfig file in order and returns the sites in the first section matched.

$gw->send_request( [$r] )

Send the Apache request to the upstream server. Optionally sets it first.

CAVEATS

Apache::Gateway is a big, complicated module that loads many other modules. As such, it pushes mod_perl to its limits, especially when used with DSO/APXS.

The current version of LWP (5.35) only supports If-Modified-Since for file and ftp URLs. Thus, gatewaying to ftp servers will actually be better than gatewaying to http servers for cached responses.

BUGS

A ProxyRemote-like capability is needed for origin servers which must be accessed through a proxy.

A ProxyPassReverse analogue might be useful, too.

Apache::Gateway assumes it is being accessed using HTTP. It ought to handle cases where the gateway is accessed using HTTPS (SSL).

There is no way to tell LWP to use a proxy.

The Server response header field should contain information about the origin server, not this server. Unfortunately, Apache overrides any existing origin server information in this field.

AUTHOR

Charles C. Fu, perl@web-i18n.net

SEE ALSO

perl(1), Apache(3pm), LWP(3pm).