NAME

HoneyClient::Agent::Driver::Browser - Perl extension to drive a web browser, running inside a HoneyClient VM.

VERSION

This documentation refers to HoneyClient::Agent::Driver::Browser version 0.98.

SYNOPSIS

  use HoneyClient::Agent::Driver::Browser;

  # Library used exclusively for debugging complex objects.
  use Data::Dumper;

  # Create a new Browser object, initialized with a collection
  # of URLs to visit.
  my $browser = HoneyClient::Agent::Driver::Browser->new(
      links_to_visit => {
          'http://www.google.com'  => 1,
          'http://www.cnn.com'     => 1,
      },
  );

  # If you want to see what type of "state information" is physically
  # inside $browser, try this command at any time.
  print Dumper($browser);

  # Continue to "drive" the driver, until it is finished.
  while (!$browser->isFinished()) {

      # Before we drive the application to a new set of resources,
      # find out where we will be going within the application, first.
      print "About to contact the following resources:\n";
      print Dumper($browser->next());

      # Now, drive browser for one iteration.
      $browser->drive();

      # Get the driver's progress.
      print "Status:\n";
      print Dumper($browser->status());

  }

  # At this stage, the driver has exhausted its collection of links
  # to visit.  Let's say we want to add the URL "http://www.mitre.org"
  # to the driver's list.
  $browser->{links_to_visit}->{'http://www.mitre.org'} = 1;

  # Now, drive the browser for one iteration.
  $browser->drive();

  # Or, we can specify the URL as an argument.
  $browser->drive(url => "http://www.mitre.org");

DESCRIPTION

This library allows the Agent module to drive an instance of any browser, running inside the HoneyClient VM. The purpose of this module is to programmatically navigate the browser to different websites, in order to become purposefully infected with new malware.

This module is object-oriented in design, retaining all state information within itself for easy access. A specific browser implementation, such as 'IE' or 'FF', must inherit from this package.

Fundamentally, the Browser driver is initialized with a set of absolute URLs for the browser to drive to. Upon visiting each URL, the driver collects any new links found and will attempt to drive the browser to each valid URL upon subsequent iterations of work.

For each top-level URL given, the driver will attempt to process all corresponding links that are hosted on the same server, in order to simulate a complete 'spider' of each server.

URLs are added to and removed from hashtables, as keys. For each URL, a calculated "priority" (a positive integer) is assigned as the value. When the Browser is ready to go to a new link, it will always go to the link that has the highest priority. If two URLs have the same priority, then the Browser will choose between them at random.
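
The selection rule described above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the helper name pick_next_link and the sample URLs and priorities are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper (not part of HoneyClient): given a hashref of
# URL => priority, return one URL that has the highest priority.
# Ties are broken at random.  Returns undef for an empty hashtable.
sub pick_next_link {
    my ($links) = @_;
    return undef unless %{$links};

    my $best       = max(values %{$links});
    my @candidates = grep { $links->{$_} == $best } keys %{$links};

    return $candidates[ int(rand(@candidates)) ];
}

# Sample data; the URLs and priorities are made up for illustration.
my %links = (
    'http://www.example.com/a' => 3,
    'http://www.example.com/b' => 7,
    'http://www.example.com/c' => 7,
);

# Prints either .../b or .../c, never .../a.
print pick_next_link(\%links), "\n";
```

Because the two highest-priority entries tie, repeated calls may return either of them.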

Furthermore, the Browser driver will try to visit, in order, all links that share a common server before moving on to other, external links. However, the Browser may end up revisiting old sites, if new links were found that the Browser has not visited yet.

As the Browser driver navigates the browser to each link, it maintains a set of hashtables that record when valid links were visited (see links_visited); when invalid links were found (see links_ignored); and when the browser attempted to visit a link but the operation timed out (see links_timed_out). By maintaining this internal history, the driver will never navigate the browser to the same link twice.
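
The history check described above might look like the following sketch. The helper name already_seen is hypothetical, the timestamp strings are placeholders rather than the real _getTimestamp() format, and treating all three tables identically is a simplifying assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HoneyClient): report whether a URL
# already appears in any of the three history hashtables.
sub already_seen {
    my ($state, $url) = @_;
    for my $table (qw(links_visited links_ignored links_timed_out)) {
        return 1 if exists $state->{$table}->{$url};
    }
    return 0;
}

# Sample state; the timestamp strings are placeholders, not the real
# _getTimestamp() format.
my $state = {
    links_visited   => { 'http://www.example.com/'     => 'timestamp-1' },
    links_ignored   => { 'javascript:doNetDetect()'    => 'timestamp-2' },
    links_timed_out => { 'http://www.example.com/slow' => 'timestamp-3' },
};

print already_seen($state, 'http://www.example.com/')    ? "skip\n" : "visit\n";
print already_seen($state, 'http://www.example.com/new') ? "skip\n" : "visit\n";
```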

Lastly, it is highly recommended that for each driver $object, one should call $object->isFinished() prior to making a subsequent call to $object->drive(), in order to verify that the driver has not exhausted its set of links to visit. Otherwise, if $object->drive() is called with an empty set of links to visit, the corresponding operation will croak.

DEFAULT PARAMETER LIST

When a Browser $object is instantiated using the new() function, the following parameters are supplied with default values. Each value can be overridden by passing a new (key => value) pair to the new() function, as an argument.

Furthermore, as each parameter is initialized, each can be individually retrieved and set at any time, using the following syntax:

  my $value = $object->{key}; # Gets key's value.
  $object->{key} = $value;    # Sets key's value.

links_to_visit

    This parameter is a hashtable of fully qualified URLs for the browser to visit. Specifically, each 'key' corresponds to an absolute URL and the 'value' is always 1.

links_visited

    This parameter is a hashtable of fully qualified URLs that the browser has already visited. Specifically, each 'key' corresponds to an absolute URL and the 'value' is a string representing the date and time of when the link was visited.

    Note: See internal documentation of _getTimestamp() for the corresponding date/time format of each value.

links_ignored

    This parameter is a hashtable of fully qualified URLs that the browser has found during its link traversal process, but the browser could not access the link.

    Links could be added to this list if access requires any type of authentication, or if the link points to a resource that is neither HTTP nor HTTPS (e.g., "javascript:doNetDetect()").

    Specifically, each 'key' corresponds to an absolute URL and the 'value' is a string representing the date and time of when the link was visited.

    Note: See internal documentation of _getTimestamp() for the corresponding date/time format of each value.

relative_links_to_visit

    This parameter is a hashtable of fully qualified URLs, such that each URL shares a common hostname. This is an internal hashtable used by the Browser driver that should be initially empty. As the Browser driver removes URLs from the links_to_visit hashtable and drives the browser to each one, any relative links found are added into this hashtable; any external links found are added back into the links_to_visit hashtable.

    When driving to the next link, this hashtable is exhausted prior to the main links_to_visit hashtable. This allows a browser to navigate to all links hosted on the same server, prior to contacting a different server.

    Specifically, each 'key' corresponds to an absolute URL and the 'value' is always 1.
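
    The routing of newly found links between the two hashtables can be sketched as below. Both helpers (host_of and route_link) are hypothetical, and the hostname extraction is a deliberately simple regular expression that ignores cases such as embedded credentials; the module's real logic may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the lowercase hostname from an absolute
# HTTP or HTTPS URL, or return undef if the URL does not match.
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc($1) : undef;
}

# Hypothetical helper: file a newly found absolute URL into the matching
# hashtable, based on whether it shares the current page's hostname.
sub route_link {
    my ($state, $current_url, $found_url) = @_;
    my $current_host = host_of($current_url);
    my $found_host   = host_of($found_url);
    return unless defined $current_host && defined $found_host;

    my $table = ($found_host eq $current_host)
              ? 'relative_links_to_visit'
              : 'links_to_visit';
    $state->{$table}->{$found_url} = 1;
    return $table;
}

my $state = { links_to_visit => {}, relative_links_to_visit => {} };
route_link($state, 'http://www.example.com/', 'http://www.example.com/about');
route_link($state, 'http://www.example.com/', 'http://www.example.org/');

print join(', ', sort keys %{ $state->{relative_links_to_visit} }), "\n";
print join(', ', sort keys %{ $state->{links_to_visit} }), "\n";
```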

next_link_to_visit

    This parameter is a scalar that contains the next URL to visit. It is updated dynamically, any time $object->getNextLink() is called.

    When the browser is ready to drive to the next link, next_link_to_visit is checked first. If that value is undef, then the relative_links_to_visit hashtable is checked next. If that hashtable is empty, then the links_to_visit hashtable is checked last.
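
    The lookup order can be illustrated with a short sketch. The helper name peek_next_link is hypothetical, and taking the first sorted key from each hashtable is a placeholder for whatever selection the driver really applies.

```perl
use strict;
use warnings;

# Hypothetical helper: return the URL that would be chosen next, following
# the documented order.  Picking the first sorted key is a placeholder,
# not the driver's real selection rule.
sub peek_next_link {
    my ($state) = @_;

    return $state->{next_link_to_visit}
        if defined $state->{next_link_to_visit};

    for my $table (qw(relative_links_to_visit links_to_visit)) {
        my @urls = sort keys %{ $state->{$table} || {} };
        return $urls[0] if @urls;
    }
    return undef;
}

my $state = {
    next_link_to_visit      => undef,
    relative_links_to_visit => {},
    links_to_visit          => { 'http://www.example.com/' => 1 },
};

# Falls through to links_to_visit, since the first two sources are empty.
print peek_next_link($state), "\n";
```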

links_timed_out

    This parameter is a hashtable of fully qualified URLs that the browser has found during its link traversal process, but the browser could not access the corresponding resource due to the operation timing out.

    Specifically, each 'key' corresponds to an absolute URL and the 'value' is a string representing the date and time of when access to the resource was attempted.

    Note: See internal documentation of _getTimestamp() for the corresponding date/time format of each value.

    If this parameter is set to 1, then the browser will also never attempt to revisit any links that caused the browser to time out.

process_name

    A string containing the process name of the browser application, as it appears in the Task Manager.

    An integer, representing the maximum number of relative links that the browser should visit, before moving on to another website. If negative, then the browser will exhaust all possible relative links found, before moving on. This functionality is best effort; it's possible for the browser to visit new links on previously visited websites.

positive_words

    An array of positive words, where a link's probability of being visited (its score) will increase, if the link contains any of these words.

negative_words

    An array of negative words, where a link's probability of being visited (its score) will decrease, if the link contains any of these words.
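
    How such word lists could influence a link's score is sketched below. The scoring rule here (start at 1, add 1 per positive word found, subtract 1 per negative word found) is purely an assumption for illustration, as are the helper name score_link and the sample word lists; the module's real formula is not documented in this section.

```perl
use strict;
use warnings;

# Hypothetical scoring rule (an assumption, not the module's formula):
# start at 1, add 1 for each positive word found in the URL, and
# subtract 1 for each negative word found.  Matching is case-insensitive.
sub score_link {
    my ($url, $positive_words, $negative_words) = @_;
    my $score = 1;
    $score += grep { index(lc($url), lc($_)) >= 0 } @{$positive_words};
    $score -= grep { index(lc($url), lc($_)) >= 0 } @{$negative_words};
    return $score;
}

# Sample word lists, made up for illustration.
my @positive_words = qw(news download);
my @negative_words = qw(logout archive);

print score_link('http://www.example.com/news/today', \@positive_words, \@negative_words), "\n";    # 2
print score_link('http://www.example.com/logout',     \@positive_words, \@negative_words), "\n";    # 0
```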

parse_active_content

    If set to 1, then the code will attempt to parse and extract links within active content (e.g., Flash animations). Otherwise, the code will ignore all active content.

METHODS IMPLEMENTED

The following functions have been implemented by the Browser driver. Many of these methods are implementations of the parent Driver interface.

As such, the following code descriptions pertain to this particular Driver implementation. For further information about the generic Driver interface, see the HoneyClient::Agent::Driver documentation.

HoneyClient::Agent::Driver::Browser->new($param => $value, ...)

    Creates a new Browser driver object, which contains a hashtable containing any of the supplied "param => value" arguments.

    Inputs: $param is an optional parameter variable. $value is $param's corresponding value.

    Note: If any $param(s) are supplied, then an equal number of corresponding $value(s) must also be specified.

    Output: The instantiated Browser driver $object, fully initialized.

$object->drive(url => $url)

    Drives an instance of the browser for one iteration, navigating to the next URL and updating the driver's corresponding internal hashtables accordingly.

    For a description of which hashtable is consulted upon each iteration of drive(), see the next_link_to_visit documentation, in the "DEFAULT PARAMETER LIST" section.

    Once a drive() iteration has completed, the corresponding browser process is terminated. Thus, each call to drive() invokes a new instance of the browser.

    Inputs: $url is an optional argument, specifying the next immediate URL the browser must drive to.

    Output: The updated Browser driver $object, containing state information from driving the browser for one iteration.

    Warning: This method will croak if the Browser driver object is unable to navigate to a new link, because its list of links to visit is empty and no new URL was supplied.

$object->getNextLink()

    Returns the next URL that the browser will navigate to, upon the next subsequent call to the $object's drive() method.

    Output: The next URL that the browser will be driven to. The returned data may be undef, if the Browser driver is finished and there are no links left to navigate to.

    Note: This function is deprecated. $object->next() should be used instead.

$object->next()

    Returns the next set of server hostnames and/or IP addresses that the browser will contact, upon the next subsequent call to the $object's drive() method.

    Specifically, the returned data is a reference to a hashtable, containing detailed information about which resources, hostnames, IPs, protocols, and ports that the browser will contact upon the next drive() iteration.

    Here is an example of such returned data:

      $hashref = {
    
          # The set of servers that the driver will contact upon
          # the next drive() operation.
          targets => {
              # The application will contact 'site.com' using
              # TCP ports 80 and 81.
              'site.com' => {
                  'tcp' => [ 80, 81 ],
              },
    
              # The application will contact '192.168.1.1' using
              # UDP ports 53 and 123.
              '192.168.1.1' => {
                  'udp' => [ 53, 123 ],
              },
    
              # Or, more generically:
              'hostname_or_IP' => {
                  'protocol_type' => [ portnumbers_as_list ],
              },
          },
    
          # The set of resources that the driver will operate upon
          # the next drive() operation.
          resources => {
              'http://www.mitre.org/' => 1,
          },
      };

    Note: For this implementation of the Driver interface, unless getNextLink() returns undef, the returned hashtable from this method will always contain only one hostname or IP address. Within this single entry, the protocol type is always guaranteed to be TCP, specifying a single port.

    Output: The aforementioned $hashref containing the next set of resources that the back-end application will attempt to contact upon the next drive() iteration. Returns undef values for both 'targets' and 'resources' keys, if getNextLink() also returns undef.
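
    A caller might walk the returned structure as in the following sketch. The helper name list_targets is hypothetical, and the sample hashref stands in for the real return value of $object->next().

```perl
use strict;
use warnings;

# Hypothetical helper: flatten the 'targets' portion of a next()-style
# hashref into sorted "host protocol/port" strings.
sub list_targets {
    my ($next) = @_;
    my $targets = $next->{targets} || {};
    my @lines;
    for my $host (sort keys %{$targets}) {
        for my $protocol (sort keys %{ $targets->{$host} }) {
            for my $port (@{ $targets->{$host}->{$protocol} }) {
                push @lines, "$host $protocol/$port";
            }
        }
    }
    return @lines;
}

# Sample data shaped like the example above; in real use this would be
# the return value of $object->next().
my $next = {
    targets   => { 'www.mitre.org' => { 'tcp' => [ 80 ] } },
    resources => { 'http://www.mitre.org/' => 1 },
};

print "$_\n" for list_targets($next);    # www.mitre.org tcp/80
```

    The "|| {}" fallback means the helper returns an empty list when 'targets' is undef, which is the documented result once the driver is finished.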

_scoreLinks()

    The _scoreLinks helper function takes a scalar which is the base URL for the web page, a scalar which holds the content of the page (HTML), and a hash which contains the good and bad words.

    This function will calculate the "popularity" scores of the links. The function returns a hash which is keyed on the absolute URL and contains the value of the score.

    Output: The populated %scored_links hash if the page is not empty. An empty hash otherwise.

    For example, if your raw HTML content is $content, and the base url is $base you would use the following call to this function.

      if ($content) {
          # Call the link scoring function.
          %scored_links = $self->_scoreLinks($base, $content);
      }

$object->isFinished()

    Indicates if the Browser driver $object has driven the browser process to all possible links it has found within its hashtables and is unable to navigate the browser further without additional, external input.

    Output: True if the Browser driver $object is finished, false otherwise.

    Note: Additional links can be fed to this Browser driver at any time, by simply adding new hashtable entries to the links_to_visit hashtable within the $object.

    For example, if you wanted to add the URL "http://www.mitre.org" to the Browser driver $object, simply use the following code:

      $object->{links_to_visit}->{'http://www.mitre.org'} = 1;

$object->status()

    Returns the current status of the Browser driver $object, as its state exists, between subsequent calls to $object->drive().

    Specifically, the data returned is a reference to a hashtable, containing specific statistical information about the status of the Browser driver's progress, between iterations of driving the browser process.

    The following is an example hashtable, containing all the (key => value) pairs that would exist in the output.

      $hashref = {
          'relative_links_remaining' =>       10, # Number of URLs left to
                                                  # process, at a given site.
          'links_remaining'          =>       56, # Number of URLs left to
                                                  # process, for all sites.
          'links_processed'          =>       44, # Number of URLs processed.
          'links_total'              =>      100, # Total number of URLs given.
          'percent_complete'         => '44.00%', # Percent complete,
                                                  #  (processed / total).
      };

    Output: A corresponding $hashref, containing statistical information about the Browser driver's progress, as previously mentioned.

BUGS & ASSUMPTIONS

In a nutshell, this object is nothing more than a blessed anonymous reference to a hashtable, where (key => value) pairs are defined in the "DEFAULT PARAMETER LIST", as well as fed via the new() function during object initialization. As such, this package does not perform any rigorous data validation prior to accepting any new or overriding (key => value) pairs.

However, additional links can be fed to any Browser driver at any time, by simply adding new hashtable entries to the links_to_visit hashtable within the $object.

For example, if you wanted to add the URL "http://www.mitre.org" to the Browser driver $object, simply use the following code:

  $object->{links_to_visit}->{'http://www.mitre.org'} = 1;

In general, the Browser driver does not know how many links it will ultimately end up browsing to, until it conducts an exhaustive spider of all initial URLs supplied. As such, expect the output of $object->status() to change significantly, upon each $object->drive() iteration.

For example, if at one given point, the status of percent_complete is 30% and then this value drops to 15% upon another iteration, then this means that the total number of links to drive to has greatly increased.
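
The arithmetic behind such a drop can be sketched as follows. The helper name percent_complete is hypothetical; it simply formats processed / total in the style of the 'percent_complete' example value, and the link counts are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: format processed/total in the same style as the
# 'percent_complete' example value.  Returns '0.00%' when total is zero.
sub percent_complete {
    my ($processed, $total) = @_;
    return '0.00%' unless $total;
    return sprintf('%.2f%%', 100 * $processed / $total);
}

# 30 links processed out of 100 known links.
print percent_complete(30, 100), "\n";    # 30.00%

# The same 30 links processed, after the known total has grown to 200.
print percent_complete(30, 200), "\n";    # 15.00%
```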

Lastly, we assume the driven browser has been preconfigured to not cache any data. This ensures the browser will render the most recent version of the content hosted at each URL.

SEE ALSO

HoneyClient::Agent::Driver

HoneyClient::Agent::Driver::Browser::IE

HoneyClient::Agent::Driver::Browser::FF

http://www.honeyclient.org/trac

REPORTING BUGS

http://www.honeyclient.org/trac/newticket

AUTHORS

Kathy Wang, <knwang@mitre.org>

Thanh Truong, <ttruong@mitre.org>

Darien Kindlund, <kindlund@mitre.org>

Brad Stephenson, <stephenson@mitre.org>

COPYRIGHT & LICENSE

Copyright (C) 2007 The MITRE Corporation. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, using version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.