Name

PGXN::API::Indexer - PGXN API distribution indexer

Synopsis

  use PGXN::API::Indexer;
  my $indexer = PGXN::API::Indexer->new(verbose => $verbosity);
  $indexer->add_distribution({ meta => $dist_meta, zip => $zip });

Description

This module does the heavy lifting of indexing a PGXN distribution for the API server. Simply hand off the metadata loaded from a distribution META.json file and an Archive::Zip object loaded with the distribution download file and it will:

  1. Copy the distribution files from the local mirror to the API document root.

  2. Merge the distribution metadata between the metadata files matching the meta and dist URI templates. The templates are themselves loaded from the /index.json file from the mirror root. The two metadata documents each get additional data useful for API calls and become identical, as well.

  3. Search for any and all README files and documentation files and parse them into HTML. The format of all documentation files may be any recognized by Text::Markup. The parsed HTML is then cleaned up and a table of contents added before being saved to its new home as a partial HTML document. See the pgTAP documentation on the PGXN API server for a nice example of the resulting format (generated from a Markdown document in the pgTAP distribution) and the same document used on the PGXN site for how it can be used.

  4. Merge the user, tag, and extension metadata for the distribution, adding extra data points useful for the API.

  5. Add all documentation, as well as distribution, extension, user, and tag metadata, to full text indexes. These may be queried via the API server (provided by pgxn_api_server) or locally with PGXN::API::Searcher.

The result is a robust API with much more information than is provided by the spare metadata JSON files on a normal PGXN mirror. The interface offered via pgxn_api_server is then a superset of that offered by a normal mirror. It's a PGXN mirror + more!

Class Interface

Constructor

new

  my $indexer = PGXN::API::Indexer->new(verbose => $verbosity);

Constructs and returns a new PGXN::API::Indexer object. There is only one parameter, verbose, an incremental integer specifying the level of verbosity to use while indexing. Defaults to 0, which is as quiet as possible.

Instance Interface

Instance Methods

update_root_json

Updates the index.json file at the root of the document root, copying the mirror's /index.json to the API's /index.json and adding three additional templates:

source

URI for browsing the source of a distribution. Its value is

  /src/{dist}/{dist}-{version}/

search

The URI for search. Its value is

  /search/{in}/

doc

The URI for a documentation file. Its format is copied from the "meta" template, with the trailing META.json replaced with `{+doc}.html`. `{+doc}` is the path to a documentation file (without a file extension) and may include slashes.
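
As a rough illustration of how these templates resolve to paths, the sketch below expands them with a small hypothetical helper, expand_template(), which is not part of this module. The value of the "meta" template, the distribution name, and the version are assumptions made for the example; the real values come from the mirror's /index.json.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PGXN::API::Indexer): replace each
# {name} or {+name} variable in a URI template with a supplied value.
sub expand_template {
    my ($tmpl, %vars) = @_;
    $tmpl =~ s!\{\+?(\w+)\}!defined $vars{$1} ? $vars{$1} : ''!ge;
    return $tmpl;
}

# Assumed value of the mirror's "meta" template.
my $meta_tmpl = '/dist/{dist}/{version}/META.json';

# Derive the "doc" template by swapping out the trailing META.json.
(my $doc_tmpl = $meta_tmpl) =~ s!META[.]json\z!{+doc}.html!;

# Invented variable values for the example.
my %vars = (dist => 'pair', version => '0.1.2', doc => 'doc/pair');

my $source_uri = expand_template('/src/{dist}/{dist}-{version}/', %vars);
my $doc_uri    = expand_template($doc_tmpl, %vars);

print "$source_uri\n$doc_uri\n";
```

The helper does no percent-encoding, so it only approximates real URI template expansion; it is enough to show why `{+doc}` may carry slashes through to the resulting path.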

copy_from_mirror

  $indexer->copy_from_mirror($path);

Copies a file from the mirror to the document root. The path argument must be specified using Unix semantics (that is, using slashes for directory separators). Used by PGXN::API::Sync to sync metadata files and stats.
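
The idea of accepting a Unix-style path and copying between two roots can be sketched with core modules alone. Everything here is an assumption made for illustration: the two roots are temporary directories, copy_unix_path() is a hypothetical function, and the module's real implementation may differ.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-ins for the mirror root and the API document root.
my $mirror_root = tempdir(CLEANUP => 1);
my $doc_root    = tempdir(CLEANUP => 1);

# Sketch of the idea only; not the module's actual implementation.
sub copy_unix_path {
    my ($path) = @_;
    # Split on slashes so the path is rebuilt with native separators.
    my @parts = grep { length } split m{/}, $path;
    my $src   = File::Spec->catfile($mirror_root, @parts);
    my $dest  = File::Spec->catfile($doc_root, @parts);
    make_path(dirname($dest));
    copy($src, $dest) or die "Cannot copy $src to $dest: $!";
    return $dest;
}

# Create a sample file on the "mirror", then copy it across.
my $sample = File::Spec->catfile($mirror_root, 'stats', 'tag.json');
make_path(dirname($sample));
open my $fh, '>', $sample or die "Cannot write $sample: $!";
print {$fh} '{"count":0}';
close $fh or die "Cannot close $sample: $!";

my $copied = copy_unix_path('stats/tag.json');
```

Splitting on slashes and rejoining with File::Spec keeps callers free of platform-specific separators, which is the point of requiring Unix semantics for the argument.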

parse_from_mirror

  $indexer->parse_from_mirror($path, $format);

Uses Text::Markup to parse a file at $path on the mirror, sanitizes it and generates a table of contents, and saves it to the document root with its suffix changed to .html. Pass an optional format argument to force Text::Markup to parse the document in that format.

add_distribution

  $indexer->add_distribution({ meta => $meta, zip => $zip });

Adds a distribution to the index. This is the main method called to do all the work of indexing a distribution. The two required parameters are:

meta

The metadata file loaded from a distribution META.json file.

zip

An Archive::Zip object loaded up with the distribution download file.

copy_files

  $indexer->copy_files($params);

Copies a distribution download and README files from the mirror to the API document root. The supported parameters are the same as those for add_distribution(), by which this method is called internally.

merge_distmeta

  $indexer->merge_distmeta($params);

Merges the distribution metadata between the meta file and the dist file. These are the names of URI templates in the /index.json file. The supported parameters are the same as those for add_distribution(), by which this method is called internally.

Once the merge is complete, the two files will be identical, although the dist file will only be updated if the new distribution's release status is "stable" (or if there are no stable distributions). In addition to the data they provided via the mirror server, they will also have the following new keys:

special_files

An array of the names of special files in the distribution. These include any files which match the following regular expressions:

  qr{Change(?:s|Log)(?:[.][^.]+)?}i
  qr{README(?:[.][^.]+)?}i
  qr{LICENSE(?:[.][^.]+)?}i
  qr{META[.]json}
  qr{Makefile}
  qr{MANIFEST}

docs

A hash (dictionary) listing the documentation files found in the distribution. These include a README file and any files found under the doc or docs directory. The keys are paths to each document (without the file name extension) and the values are document titles.

provides/$extension/doc

Each extension listed under provides will get a new key, doc, if there is a document in the docs hash with the same base name as the extension. This is on the assumption that an included extension will have for its documentation a file with the same name (minus the file name extension) as the extension itself. The value will be the path to the document, the same as the key for the same document in the docs hash.
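
The base-name matching described here might look something like the following sketch. The docs and provides hashes are invented sample data, and the exact, case-sensitive comparison is an assumption rather than a statement about how the module compares names.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical docs hash: keys are paths without file name extensions,
# values are document titles.
my %docs = (
    'README'       => 'pair 0.1.2',
    'doc/pair'     => 'pair 0.1.2',
    'doc/tutorial' => 'Getting Started',
);

# Hypothetical provides section from a distribution's metadata.
my %provides = (
    pair   => { file => 'sql/pair.sql',   version => '0.1.2' },
    triple => { file => 'sql/triple.sql', version => '0.1.2' },
);

# Index the doc paths by base name, then link matching extensions.
my %path_for = map { (basename($_) => $_) } sort keys %docs;
for my $ext (keys %provides) {
    $provides{$ext}{doc} = $path_for{$ext} if exists $path_for{$ext};
}
```

In this sample, pair gains a doc key pointing at doc/pair, while triple is left alone because no document shares its name.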

Getting all of this documentation information is handled via a call to parse_docs(), which of course also parses any docs it finds.

And finally, this method updates all other "dist" files for previous versions of the distribution with the latest releases information, so that they all have a complete list of all releases of the distribution.
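
To get a feel for which files end up under the special_files key, the patterns listed earlier in this section can be tried against a few file names. This is a sketch under two assumptions: the file names are invented, and the patterns are anchored to the whole name, which the list above does not state.

```perl
use strict;
use warnings;

# The patterns from the special_files list. Anchoring them to the whole
# file name (below) is an assumption made for this sketch.
my @special = (
    qr{Change(?:s|Log)(?:[.][^.]+)?}i,
    qr{README(?:[.][^.]+)?}i,
    qr{LICENSE(?:[.][^.]+)?}i,
    qr{META[.]json},
    qr{Makefile},
    qr{MANIFEST},
);

# Hypothetical helper: true if the name matches any pattern in full.
sub is_special {
    my ($name) = @_;
    return scalar grep { $name =~ /\A$_\z/ } @special;
}

# Hypothetical top-level file names from a distribution archive.
my @files = qw(Changes README.md license.txt META.json Makefile pair.control);
my @special_files = grep { is_special($_) } @files;
```

Note that the first three patterns are case-insensitive and allow an optional extension, so license.txt qualifies, while the last three must match exactly.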

find_docs

  my @docs = $indexer->find_docs($params);

Finds all the likely documentation files in the zip archive. A file is considered to contain documentation if one of the following is true:

  • It is identified under the doc key in the provides hash of the metadata and exists in the zip archive.

  • It has an extension recognized by Text::Markup and is not excluded by the no_index key in the metadata.

The files in the returned list are relative to an unzipped archive root -- that is, they do not include the top-level directory prefix.

Used internally by parse_docs() to determine what files to parse.
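
The two rules above can be sketched as a plain function over a list of archive member names. Several things here are assumptions: find_doc_candidates() is a hypothetical function, the extension list is a short stand-in for whatever Text::Markup recognizes, and no_index is taken to hold "file" and "directory" arrays.

```perl
use strict;
use warnings;

# Assumed short list of markup extensions; the set recognized by
# Text::Markup is larger.
my $markup_ext = qr{[.](?:md|mkdn|markdown|pod|textile|rst)\z}i;

# Hypothetical function; a sketch of the rules, not the real method.
sub find_doc_candidates {
    my ($meta, @members) = @_;
    my $no_index  = $meta->{no_index} || {};
    my %skip_file = map { ($_ => 1) } @{ $no_index->{file} || [] };
    my @skip_dirs = @{ $no_index->{directory} || [] };
    my %provided  = map { ($_->{doc} => 1) }
                    grep { defined $_->{doc} }
                    values %{ $meta->{provides} || {} };

    my @docs;
    for my $member (@members) {
        # Drop the top-level directory prefix, e.g. "pair-0.1.2/".
        (my $path = $member) =~ s{\A[^/]+/}{};
        next unless length $path;
        # Rule 1: named under a doc key in provides and present.
        if ($provided{$path}) {
            push @docs, $path;
            next;
        }
        # Rule 2: recognized extension and not excluded by no_index.
        next unless $path =~ $markup_ext;
        next if $skip_file{$path};
        next if grep { $path =~ m{\A\Q$_\E/} } @skip_dirs;
        push @docs, $path;
    }
    return @docs;
}

# Hypothetical metadata and archive member names.
my $meta = {
    provides => { pair => { file => 'sql/pair.sql', doc => 'doc/pair.md' } },
    no_index => { directory => ['t'], file => ['TODO.md'] },
};
my @found = find_doc_candidates($meta, qw(
    pair-0.1.2/README.md
    pair-0.1.2/doc/pair.md
    pair-0.1.2/sql/pair.sql
    pair-0.1.2/t/notes.md
    pair-0.1.2/TODO.md
));
```

The returned paths carry no top-level directory prefix, matching the description of the return value above.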

parse_docs

  $indexer->parse_docs($params);

Searches the distribution download file for a README and for documentation files in a doc or docs directory, parses them into HTML (using Text::Markup), and then runs them through XML::LibXML to remove all unsafe HTML, to generate a table of contents, and to save them as partial HTML files. Their contents are also added to the "doc" full text index. Files matching the rules under the no_index key in the metadata (if any) will be ignored.

Returns a hash reference with information about the documentation, with the keys being paths to the documentation (without file name extensions) and the values being the titles of the documents. The supported parameters are the same as those for add_distribution(); this method is called internally by merge_distmeta().

update_extensions

  $indexer->update_extensions($params);

Iterates over the list of extensions under the provides key in the metadata and updates their respective metadata files (as specified by the "extension" URI template) with additional information. The supported parameters are the same as those for add_distribution(), by which this method is called internally.

The additional metadata added to the extension files is:

$release_status/doc

The path to the documentation (without the file name extension) for the extension for the given release status.

$release_status/abstract

The abstract for the latest release of the given release status.

versions/$version/date

The date of a given release.

The contents of the extension, including its name, abstract, distribution, distribution version, and doc path, are added to the "extension" full text index.
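
Reading the slash-separated key paths above as nested hash keys, the added data might be laid out as in this sketch. That reading is an assumption, as are all of the sample values; the actual extension files may nest these keys differently.

```perl
use strict;
use warnings;

# Hypothetical distribution metadata; names and values are invented.
my $dist_meta = {
    name           => 'pair',
    version        => '0.1.2',
    release_status => 'stable',
    date           => '2011-04-20T23:47:22Z',
    provides       => {
        pair => {
            version  => '0.1.2',
            abstract => 'A key/value pair data type',
            doc      => 'doc/pair',
        },
    },
};

# Existing extension metadata, as it might look before the update.
my $ext_meta = { extension => 'pair', versions => {} };

my $ext      = $ext_meta->{extension};
my $status   = $dist_meta->{release_status};
my $provided = $dist_meta->{provides}{$ext};

# $release_status/abstract and $release_status/doc
$ext_meta->{$status}{abstract} = $provided->{abstract};
$ext_meta->{$status}{doc}      = $provided->{doc} if defined $provided->{doc};

# versions/$version/date
$ext_meta->{versions}{ $provided->{version} }{date} = $dist_meta->{date};
```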

update_tags

  $indexer->update_tags($params);

Iterates over the list of tags under the tags key in the metadata and updates their respective metadata files (as specified by the "tag" URI template). The supported parameters are the same as those for add_distribution(), by which this method is called internally.

The data added to each tag metadata file is the list of releases copied from the distribution metadata. A tag metadata file thus ends up with a complete list of all distribution releases associated with the tag. The tag is then added to the "tag" full text index.
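
How a tag file comes to list releases from every distribution that uses the tag can be sketched as below. The sample distributions and the shape of the releases data are assumptions for illustration, not the module's documented format.

```perl
use strict;
use warnings;

# Hypothetical distribution metadata; the shape of "releases" is assumed.
my @dists = (
    {
        name     => 'pair',
        tags     => ['key value', 'ordered pair'],
        releases => { stable => [ { version => '0.1.2' } ] },
    },
    {
        name     => 'tuple',
        tags     => ['key value'],
        releases => { testing => [ { version => '0.3.0' } ] },
    },
);

# One metadata hash per tag, keyed by tag name.
my %tag_meta;
for my $dist (@dists) {
    for my $tag (@{ $dist->{tags} }) {
        $tag_meta{$tag}{tag} = $tag;
        # Copy the distribution's release list under its name.
        $tag_meta{$tag}{releases}{ $dist->{name} } = $dist->{releases};
    }
}
```

After both distributions are processed, the "key value" tag lists releases for pair and tuple, while "ordered pair" lists only pair.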

update_user

  $indexer->update_user($params);

Updates the metadata for the user specified under the user key in the distribution metadata. The updated file is specified by the "user" URI template. The supported parameters are the same as those for add_distribution(), by which this method is called internally.

The data added to each user metadata file is the list of releases copied from the distribution metadata. A user metadata file thus ends up with a complete list of all distribution releases made by the user. The user is then added to the "user" full text index, where the name, nickname, email address, URI, and other metadata are indexed.

merge_user

  $indexer->merge_user($nickname);

Pass in the nickname of a user and the JSON file for that user on the mirror will be merged with the document index copy. If no document index copy exists, one will be created with an empty hash under the releases key. Called by PGXN::API::Sync for each user file seen during the sync.
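
The merge behavior described above can be approximated with JSON::PP on in-memory strings. This is a sketch: merge_user_docs() is a hypothetical function, the sample documents are invented, and letting mirror fields overwrite same-named fields is an assumption about precedence.

```perl
use strict;
use warnings;
use JSON::PP;

# Sketch only: merge a mirror user document into the existing copy.
# Both arguments are JSON strings; the second may be undef when no copy
# exists yet.
sub merge_user_docs {
    my ($mirror_json, $copy_json) = @_;
    my $mirror = decode_json($mirror_json);
    my $copy   = defined $copy_json
        ? decode_json($copy_json)
        : { releases => {} };
    # Assumed precedence: mirror fields overwrite same-named fields.
    my %merged = (%{$copy}, %{$mirror});
    # Releases always come from the existing copy (or start out empty).
    $merged{releases} = $copy->{releases} || {};
    return \%merged;
}

# Hypothetical user documents.
my $mirror_json = '{"nickname":"theory","name":"David E. Wheeler"}';
my $fresh  = merge_user_docs($mirror_json, undef);
my $merged = merge_user_docs(
    $mirror_json,
    '{"nickname":"theory","releases":{"pair":{"stable":[{"version":"0.1.2"}]}}}',
);
```

The first call shows the "no existing copy" case, which yields an empty hash under releases; the second keeps the release data already recorded.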

finalize

  $indexer->finalize;

Method to call when a sync completes. At the moment, all it does is call update_user_lists() and commit any remaining index data to the full text index.

update_user_lists

  $indexer->update_user_lists;

Updates the user list files for any users seen in the distribution metadata processed by merge_distmeta().

doc_root_file_for

  my $doc_root_file = $indexer->doc_root_file_for($tmpl_name, $meta);

Returns the full path to a file in the API document root for the specified URI template, and using the specified distribution metadata to populate the variable values in the template. Used internally to figure out what files to write to.

mirror_file_for

  my $mirror_file = $indexer->mirror_file_for($tmpl_name, $meta);

Returns the full path to a file in the local PGXN mirror directory for the specified URI template, and using the specified distribution metadata to populate the variable values in the template. Used internally to figure out what files to read from.

indexer_for

  my $ksi = $indexer->indexer_for($index_name);

Returns a Lucy::Index::Indexer object for updating the named full text index. Used internally for updating the appropriate full text index when a distribution has been fully updated.

Instance Accessors

verbose

  my $verbose = $indexer->verbose;
  $indexer->verbose($verbose);

Get or set an incremental verbosity. The higher the integer specified, the more verbose the indexing.

to_index

  push @{ $indexer->to_index->{ $index } } => $data;

Stores a hash reference of array references of data to be added to full text indexes. As a distribution is merged and updated, data for adding to the full text index is added to this hash. Once the updating and merging has completed successfully, the data is read from this attribute and written to the appropriate full text indexes.
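
The queue-then-write pattern described here can be illustrated with a plain hash standing in for the attribute. The record fields are hypothetical, and where a real indexer would hand each record to the matching full text index, this sketch only counts them.

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by to_index(): one array
# reference of pending records per full text index.
my $to_index = { map { ($_ => []) } qw(doc dist extension tag user) };

# Queue records while a distribution is processed (hypothetical fields).
push @{ $to_index->{dist} } => { key => 'pair', abstract => 'A pair type' };
push @{ $to_index->{tag} }  => { key => 'key value' };
push @{ $to_index->{tag} }  => { key => 'variadic function' };

# After a successful merge, drain each queue. Counting stands in for
# writing to the full text index.
my %written;
for my $index (sort keys %{$to_index}) {
    while (my $record = shift @{ $to_index->{$index} }) {
        $written{$index}++;
    }
}
```

Deferring the writes this way means nothing reaches the indexes unless the merge and update steps have completed.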

libxml

  my $libxml = $indexer->libxml;

Returns the XML::LibXML object used for parsing and cleaning HTML documents.

index_dir

  my $index_dir = $indexer->index_dir;

Returns the path to the parent directory of all of the full-text indexes.

schemas

  my $schema = $indexer->schemas->{$index_name};

Returns a hash reference of Lucy::Plan::Schema objects used to define the structure of the full text indexes. The keys identify the indexes and the values are the corresponding Lucy::Plan::Schema objects. The supported indexes are:

doc
dist
extension
tag
user

Author

David E. Wheeler <david.wheeler@pgexperts.com>

Copyright and License

Copyright (c) 2011 David E. Wheeler.

This module is free software; you can redistribute it and/or modify it under the PostgreSQL License.

Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies.

In no event shall David E. Wheeler be liable to any party for direct, indirect, special, incidental, or consequential damages, including lost profits, arising out of the use of this software and its documentation, even if David E. Wheeler has been advised of the possibility of such damage.

David E. Wheeler specifically disclaims any warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The software provided hereunder is on an "as is" basis, and David E. Wheeler has no obligations to provide maintenance, support, updates, enhancements, or modifications.