NAME

NNexus::Index::Template - Foundation Template for NNexus Domain Indexers

SYNOPSIS

  package NNexus::Index::Mydomain;
  use base qw(NNexus::Index::Template);
  
  # Instantiate the PULL API methods
  sub domain_root { 'http://mydomain.com' }
  sub candidate_links { ... }
  sub index_page { ... }
  sub depth_limit { 10; }
  
  1;

  # You can now invoke an indexing run from your own code:
  $indexer = NNexus::Index::Mydomain->new('start'=>'default');
  $first_payload = $indexer->index_step;
  while (my $concept_payload = $indexer->index_step ) {
    # Do something with the indexed concepts...
  }

DESCRIPTION

This class contains the generic NNexus indexing logic, and offers the PULL API for concrete domain indexers. There are three categories of methods:

  • External API - public methods, to be used to set up and drive the indexing process

  • Shared methods - defining the generic crawl process and logic, shared by all domain indexers

  • PULL API - per-page data-mining methods, to be overloaded by concrete domain indexers

EXTERNAL API

$indexer = NNexus::Index::Mydomain->new(start=>'default',dom=>$dom);

The most reliable way to instantiate a domain indexer. The 'Mydomain' string is conventionally the shorthand name by which a site is known, e.g. Planetmath, Wikipedia, Dlmf or Mathworld.

As a handy convention, all plug-in indexer names $domain should satisfy $domain eq ucfirst(lc($domain)).
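
That convention can be checked mechanically. A small sketch (is_valid_domain_name is a hypothetical helper, not part of the module):

```perl
use strict;
use warnings;

# True if the shorthand satisfies $domain eq ucfirst(lc($domain)),
# i.e. a capitalized first letter followed by lowercase.
sub is_valid_domain_name {
  my ($domain) = @_;
  return $domain eq ucfirst(lc($domain));
}

print is_valid_domain_name('Planetmath') ? "ok\n" : "bad\n";  # ok
print is_valid_domain_name('MathWorld')  ? "ok\n" : "bad\n";  # bad: internal capital
```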

$payload = $indexer->index_step;

While the index_step method is the main externally-facing interface method, it is also the most important shared method between all domain indexers, as it automates the crawling and PULL processes.

The index_step method is the core of the indexing logic behind NNexus. It provides:

  • Automatic crawling under the specified start domain root.

  • Fine-tuning of crawl targets. start can be either the default for the domain or any specific URL.

  • Indexing as iteration. Each NNexus indexer object contains an iterator, which can be stepped through. The traversal is left-to-right and depth-first.

  • The indexing is bound by depth (if requested) and keeps a cache of visited pages, avoiding loops.

  • As good crawling manners, an automatic one-second sleep is triggered whenever a page is fetched.

The only option accepted by the method is a boolean switch, skip, which when enabled skips the next job in the queue.
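
A sketch of the intended driving loop, assuming the constructor and index_step API above (NNexus::Index::Mydomain is the placeholder subclass from the SYNOPSIS; the printed fields follow the concept hash described under index_page):

```perl
use strict;
use warnings;
use NNexus::Index::Mydomain;  # placeholder concrete indexer

my $indexer = NNexus::Index::Mydomain->new(start => 'default');
while (my $payload = $indexer->index_step) {
  # Each payload is the array reference of concept hashes mined by index_page
  for my $concept (@$payload) {
    printf "indexed '%s' at %s\n", $concept->{concept}, $concept->{url};
  }
}
# A queued job can also be dropped without mining it:
# my $ignored = $indexer->index_step(skip => 1);
```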

SHARED METHODS

$url = $self->current_url

Getter, provides the current URL of the page being indexed. Dually acts as a setter when an argument is provided, mainly set from the index_step method.

$dom = $self->current_dom

Getter, provides the current Mojo::DOM of the page being indexed. Dually acts as a setter when an argument is provided, mainly set from the index_step method.

$categories = $self->current_categories

Getter, provides the current categories of the page being indexed. Dually acts as a setter when an argument is provided, mainly set from the index_step method.

The categories are a reference to an array of strings, ideally of MSC classes.

The main use of this method is for sites set up similarly to Wikipedia, where a sub-categorization scheme is being traversed and the current categories need to be remembered whenever a new leaf concept page is entered. See NNexus::Index::Wikipedia for examples.

PULL API

All PULL API methods are intended to be overridden in each concrete domain indexer, with occasional exceptions, where the default behaviour (if any) suffices.

sub domain_root { 'http://mydomain.org' }

Sets the default start URL for an indexing process of the entire domain.

sub candidate_links {...}

Using the information provided by the shared methods, datamine zero or more candidate links for further indexing. The expected return value is a reference to an array of absolute URL strings.
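
For illustration, a minimal candidate_links override might filter the current page's anchors down to same-host absolute URLs. This sketch assumes Mojolicious is installed (for Mojo::URL and the Mojo::DOM object returned by current_dom); the a[href] selector is a guess about the site's markup:

```perl
package NNexus::Index::Mydomain;
use base qw(NNexus::Index::Template);
use Mojo::URL;

sub candidate_links {
  my ($self) = @_;
  my $base = Mojo::URL->new($self->current_url);
  my @links;
  for my $anchor ($self->current_dom->find('a[href]')->each) {
    # Resolve relative hrefs against the URL of the page being indexed
    my $url = Mojo::URL->new($anchor->attr('href'))->to_abs($base);
    # Stay inside the indexed domain
    push @links, $url->to_string if $url->host && $url->host eq $base->host;
  }
  return \@links;
}

1;
```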

sub index_page {...}

Using the information provided by the shared methods, datamine zero or more candidate concepts for NNexus indexing. The expected return value is a reference to an array of hash-references, each hash-reference being a concept hash datastructure, specified as:

  { concept => 'concept name',
    url => 'origin url',
    synonyms => [ qw(list of synonyms) ],
    categories => [ qw(list of categories) ],
    # ... TODO: More?
  }
  
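A hypothetical index_page override, mining one concept per page into the hash datastructure above; the h1.title and .synonym selectors, and the 'XX-XX' fallback category, are assumptions for illustration:

```perl
package NNexus::Index::Mydomain;
use base qw(NNexus::Index::Template);

sub index_page {
  my ($self) = @_;
  my $dom   = $self->current_dom;
  my $title = $dom->at('h1.title') or return [];  # no concept on this page
  return [{
    concept    => $title->text,
    url        => $self->current_url,
    synonyms   => [ $dom->find('.synonym')->map('text')->each ],
    categories => $self->current_categories || ['XX-XX'],
  }];
}

1;
```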
sub candidate_categories {...}

Propose candidate categories for the current page, using the shared methods. Useful in cases where the category information of a concept is not recorded in the same page, but has to be inferred instead, as is the case for Wikipedia's traversal process. See NNexus::Index::Wikipedia for an example of overriding candidate_categories.
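
A minimal override in the spirit of the Wikipedia indexer could simply fall back to the categories remembered while descending the category tree (a sketch, assuming current_categories has been populated by the traversal):

```perl
sub candidate_categories {
  my ($self) = @_;
  # Leaf pages are assumed to carry no category metadata of their own,
  # so reuse what the crawl recorded on the way down.
  return $self->current_categories;
}
```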

sub depth_limit { $depth; }

An integer constant specifying a depth limit for the crawling process, with respect to the start URL provided. Useful for heavily interlinked sites, such as Wikipedia, in which the topicality of the indexed subcategories decreases as depth increases.

sub request_interval { $seconds; }

Any constant amount of $seconds admissible by Time::HiRes. To be used for putting the current process to sleep, in order to avoid overloading the indexed server, as well as to avoid getting banned.
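
For example, a domain indexer could slow the crawl to one fetch per 2.5 seconds (the value is illustrative):

```perl
# Pause between fetches, in seconds; fractional values are fine with Time::HiRes
sub request_interval { 2.5 }
```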

AUTHOR

Deyan Ginev <d.ginev@jacobs-university.de>

COPYRIGHT

 Research software, produced as part of work done by
 the KWARC group at Jacobs University Bremen.
 Released under the MIT license (MIT)