Ed J


a "listpage" is returned by the initial get_fill_submit and is parsed into:
  { items => \@items, pageno => $pageno, num_pages => $num_pages,
    nextlink => $nextlink, }
an "item" is
  +{ id => $id, url => $url, }
the item url points to a "page" which is parsed into



  ($text, $pageurl, $listre)



  ($text, $scrapespec, $scrapepostpro)



  ($cjar, $html, $real_url, $vars, $varnamechange)


Parses out redirects done with Refresh header.
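
As a rough illustration of what parsing a Refresh redirect involves, the sketch below pulls the target URL out of a header value such as "0; URL=...". The function name refresh_target and the exact pattern are assumptions made for this example; they are not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: return the redirect target named in a Refresh
# header value, or undef if the header carries no URL.
sub refresh_target {
    my ($header) = @_;
    return undef unless defined $header;
    # Typical form: "<seconds>; url=<target>", optionally quoted.
    my ($url) = $header =~ /^\s*\d+\s*;\s*url\s*=\s*["']?([^"'\s]+)/i;
    return $url;
}

my $target = refresh_target('0; URL=http://example.invalid/next');
print defined $target ? "redirect to $target\n" : "no redirect\n";
```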

Gets web content, iterating through redirects while capturing cookies.


Mail::POP3::Folder::webscrape - class that makes a website look like a POP3 mailbox


  use Mail::POP3;
  my $m = Mail::POP3::Folder::webscrape->new(
    $starturl, # where the first form is found
    $userfieldnames, # listref same order as values supplied in USER
    $otherfields, # hash fieldname => value
    $listre, # field => RE; fields: pageno, num_pages, nextlink, itemurls
    $itemre, # hash extractfield => RE to get it from "page"
    $itempostpro, # extractfield => sub returns pairs of field/value
    $itemurl2id, # sub taking URL, returns unique, persistent item ID
    $itemformat, # takes item hash, returns email message
    $messagesize, # size in octets that each message is padded to
  );


This class makes a website look like a POP3 mailbox in accordance with the requirements of a Mail::POP3 server. It is entirely API-compatible with Mail::POP3::Folder::mbox.

Every virtual e-mail is padded to at least the number of octets given in the last parameter to new (2000 is recommended). Messages longer than this should be truncated, but the class does not currently do so.



The username is interpreted as a ":"-separated string that is also "URL-encoded", such that spaces are encoded as "+" characters. The decoded values are assigned, in order, to the variables named in the $userfieldnames parameter.
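
A minimal sketch of how such a username could be decoded follows. The helper decode_username and the field names city and radius are hypothetical, and the handling of %XX escapes is an assumption: the text above only states that spaces arrive as "+".

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: turn a POP3 username into
# a hash of the CGI variables named in $userfieldnames.
sub decode_username {
    my ($username, $userfieldnames) = @_;
    my @values = split /:/, $username, -1;    # -1 keeps trailing empty values
    for my $v (@values) {
        $v =~ tr/+/ /;                               # "+" stands for a space
        $v =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # assumption: %XX escapes
    }
    my %vars;
    @vars{@$userfieldnames} = @values;        # same order as the field names
    return \%vars;
}

my $vars = decode_username('new+york:2', [ qw(city radius) ]);
print "$_=$vars->{$_}\n" for sort keys %$vars;
```

A POP3 client would then log in with something like "USER new+york:2", and the two values would be submitted as the form fields city and radius.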


The password is ignored.


$starturl is the URL of the webpage that contains the initial search form.


$userfieldnames is a reference to a list of the names of the CGI variables whose values the POP3 user supplies in the username.


$otherfields is a reference to a hash mapping CGI field names to values.


$listre is a reference to a hash mapping each field name to a regular expression that finds the relevant value on each search result page. The value is expected to be captured in $1. These fields must be defined: pageno, num_pages, nextlink, itemurls. The last may (obviously) match more than once.
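
To make the shape of this hash concrete, here is a sketch that applies one to a made-up result page. The HTML and all four patterns are invented for illustration, and the matching loop only shows the idea; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical search result page.
my $html = <<'HTML';
<p>Page 2 of 5</p>
<a class="item" href="/item?id=17">First</a>
<a class="item" href="/item?id=18">Second</a>
<a class="next" href="/search?page=3">Next</a>
HTML

# Each pattern captures the wanted value in $1.
my $listre = {
    pageno    => qr/Page (\d+) of \d+/,
    num_pages => qr/Page \d+ of (\d+)/,
    nextlink  => qr/<a class="next" href="([^"]+)"/,
    itemurls  => qr/<a class="item" href="([^"]+)"/,
};

# Single-valued fields: keep the capture from the first match.
my %listpage;
for my $field (qw(pageno num_pages nextlink)) {
    ($listpage{$field}) = $html =~ $listre->{$field};
}

# itemurls may match many times, so collect every capture with /g.
my $itemre_for_urls = $listre->{itemurls};
my @itemurls = $html =~ /$itemre_for_urls/g;

print "page $listpage{pageno} of $listpage{num_pages}\n";
print "item: $_\n" for @itemurls;
```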


$itemre is a reference to a hash mapping each field name to a regular expression that finds the relevant value on each item's page (the page linked to by an itemurl found with $listre), in the same way as $listre. Any number of fields may be sought, and a hash of field name to found value is passed to the item-formatting function below.


$itempostpro is a reference to a hash mapping each field name to a reference to a function. The function is called with the field name and value, and returns a list of one or more field name / value pairs. A typical use is to remove HTML from a result.
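
The following sketch shows two post-processing functions of this kind: one strips HTML, the other splits one extracted value into two fields. The field names (description, price, currency) and the loop that applies the functions are assumptions made for the example.

```perl
use strict;
use warnings;

my $itempostpro = {
    # Remove markup and tidy whitespace; returns a single pair.
    description => sub {
        my ($field, $value) = @_;
        $value =~ s/<[^>]*>//g;       # crude tag removal, enough for a sketch
        $value =~ s/\s+/ /g;
        $value =~ s/^\s+|\s+$//g;
        return ($field => $value);
    },
    # Split "$1200" into two fields; returns two pairs.
    price => sub {
        my ($field, $value) = @_;
        my ($currency, $amount) = $value =~ /^(\D*)([\d.]+)/;
        return (price => $amount, currency => $currency);
    },
};

# Hypothetical values as the item regular expressions might extract them.
my %raw = (description => '<b>Nice</b>  flat', price => '$1200', title => 'Flat');

my %item;
for my $field (sort keys %raw) {
    my $code  = $itempostpro->{$field};
    my @pairs = $code ? $code->($field, $raw{$field}) : ($field => $raw{$field});
    %item = (%item, @pairs);          # fields without a function pass through
}

print "$_: $item{$_}\n" for sort keys %item;
```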


$itemurl2id is a reference to a function that is called with each itemurl and returns a unique, persistent identifier for that item, compatible with an RFC 1939 message ID.
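
One possible way to write such a function is to hash the URL, as sketched below. Using Digest::MD5 is a choice made for this example, not something the module prescribes. RFC 1939 requires a unique-id to be 1 to 70 characters in the range 0x21 to 0x7E, which a 32-character hex digest satisfies.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Sketch: the same URL always yields the same 32-character hex string,
# so the ID persists across POP3 sessions as long as the URL is stable.
my $itemurl2id = sub {
    my ($url) = @_;
    return md5_hex($url);
};

my $id = $itemurl2id->('http://example.invalid/item?id=17');
print "UIDL id: $id\n";
```

If the site's item URLs contain a stable identifier of their own, extracting that with a regular expression would be an equally reasonable design.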


$itemformat is a reference to a function that is called for each item with two parameters: a reference to a hash of field name / value pairs (as extracted by the "item RE" above) and the unique message ID (as generated above). It returns the text of an email message describing that item.
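
A formatting function along these lines might look like the sketch below. The item fields title and description, the example.invalid addresses, and the choice of "\n" line endings are all assumptions for illustration; the text above does not say which line endings the class expects.

```perl
use strict;
use warnings;

my $itemformat = sub {
    my ($item, $id) = @_;
    my @headers = (
        'From: webscrape@example.invalid',
        "Subject: $item->{title}",
        "Message-ID: <$id\@example.invalid>",
    );
    # A blank line separates the headers from the body.
    return join("\n", @headers, '', $item->{description}, '');
};

my $message = $itemformat->(
    { title => 'Two-bed flat', description => 'Close to the station.' },
    'abc123',
);
print $message;
```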


The last parameter is the size of each message, in the style of Procrustes. It lets the class return an accurate(ish) result for the POP3 command STAT knowing only the number of hits, without having downloaded and formatted every single item to see how large each one is; such an extra step would probably trigger timeouts.
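
The arithmetic behind this can be sketched as follows. Both helpers are hypothetical, padding with spaces is an assumption, and the messages are assumed to be byte strings so that length counts octets.

```perl
use strict;
use warnings;

# Pad a message up to $size octets; longer messages are left alone,
# matching the "does not truncate" behaviour described earlier.
sub pad_message {
    my ($text, $size) = @_;
    my $missing = $size - length $text;
    return $missing > 0 ? $text . (' ' x $missing) : $text;
}

# With a fixed size per message, STAT needs only the number of hits.
sub stat_estimate {
    my ($num_items, $size) = @_;
    return ($num_items, $num_items * $size);
}

my $padded = pad_message("Subject: hello\n\nbody\n", 2000);
my ($count, $octets) = stat_estimate(37, 2000);
print "+OK $count $octets\n";
```

The estimate is only exact while every formatted item fits within the chosen size, which is why the result is described as accurate(ish).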

A script webscrape is supplied in the scripts subdirectory of the distribution that can be used to test and develop a working configuration for this class.


None extra are defined.


RFC 1939, Mail::POP3::Folder::mbox.