The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

CPAN::Testers::Data::Generator - Download and summarize CPAN Testers data

SYNOPSIS

  % cpanstats
  # ... wait patiently, very patiently
  # ... then use the cpanstats MySQL database

DESCRIPTION

This distribution was originally written by Leon Brocard to download and summarize CPAN Testers data. However, all of the original code has been rewritten to use the CPAN Testers Statistics database generation code. This now means that all the CPAN Testers sites including the Reports site, the Statistics site and the CPAN Dependencies site, can use the same database.

This module retrieves and parses reports from the Metabase, generating or updating entries in the cpanstats database, which extracts specific metadata from the reports. The information in the cpanstats database is then presented via CPAN::Testers::WWW::Reports on the CPAN Testers Reports website.

A good example query from the cpanstats database for Acme-Colour would be:

  SELECT version, status, count(*) FROM cpanstats WHERE
  dist = "Acme-Colour" group by version, state;

To create a database from scratch can take several days, as there are now over 24 million submitted reports. As such updating from a known copy of the database is much more advisable. If you don't want to generate the database yourself, you can obtain a feed using CPAN::Testers::WWW::Report::Query::Reports.

With over 24 million reports in the database, if you do plan to run this software to generate the databases it is recommended you utilise a high-end processor machine. Even with a reasonable processor it can take over a week!

DATABASE SCHEMA

The cpanstats database schema is very straightforward, one main table with several index tables to speed up searches. The main table is as below:

  CREATE TABLE `cpanstats` (

    `id`          int(10) unsigned    NOT NULL AUTO_INCREMENT,
    `guid`        char(36)            NOT NULL DEFAULT '',
    `state`       varchar(32)         DEFAULT NULL,
    `postdate`    varchar(8)          DEFAULT NULL,
    `tester`      varchar(255)        DEFAULT NULL,
    `dist`        varchar(255)        DEFAULT NULL,
    `version`     varchar(255)        DEFAULT NULL,
    `platform`    varchar(255)        DEFAULT NULL,
    `perl`        varchar(255)        DEFAULT NULL,
    `osname`      varchar(255)        DEFAULT NULL,
    `osvers`      varchar(255)        DEFAULT NULL,
    `fulldate`    varchar(32)         DEFAULT NULL,
    `type`        int(2)              DEFAULT '0',
      
    PRIMARY KEY       (`id`),
    KEY `guid`        (`guid`),
    KEY `distvers`    (`dist`,`version`),
    KEY `tester`      (`tester`),
    KEY `state`       (`state`),
    KEY `postdate`    (`postdate`)
  
  )

It should be noted that 'postdate' refers to the YYYYMM formatted date, whereas the 'fulldate' field refers to the YYYYMMDDhhmm formatted date and time.

The metabase database schema is again very straightforward, and consists of one main table, as below:

  CREATE TABLE `metabase` (
  
    `guid`      char(36)            NOT NULL,
    `id`        int(10) unsigned    NOT NULL,
    `updated`   varchar(32)         DEFAULT NULL,
    `report`    longblob            NOT NULL,
    `fact`      longblob            NOT NULL,
  
    PRIMARY KEY     (`guid`),
    KEY `id`        (`id`),
    KEY `updated`   (`updated`)
  
  )

The id field is a reference to the cpanstats.id field.

The report field is JSON encoded, and is a cached version of the facts of a report, while the fact field is the full report fact, and associated child facts, Sereal encoded. Both are extracted from the returned fact from Metabase::Librarian.

See examples/cpanstats-createdb for the full list of tables used.

SIGNIFICANT CHANGES

v0.31 CHANGES

With the release of v0.31, a number of changes to the codebase were made as a further move towards CPAN Testers 2.0. The first change is the name for this distribution. Now titled 'CPAN-Testers-Data-Generator', this now fits more appropriately within the CPAN-Testers namespace on CPAN.

The second significant change is to now reference a MySQL cpanstats database. The SQLite version is still updated as before, as a number of other websites and toolsets still rely on that database file format. However, in order to make the CPAN Testers Reports website more dynamic, an SQLite database is not really appropriate for a high demand website.

The database creation code is now available as a standalone program, in the examples directory, and all the database communication is now handled by the new distribution CPAN-Testers-Common-DBUtils.

v0.41 CHANGES

In the next stage of development of CPAN Testers 2.0, the id field used within the database schema above for the cpanstats table no longer matches the NNTP ID value, although the id in the articles does still reference the NNTP ID, at least for the reports submitted prior to the switch to the Metabase in 2010.

In order to correctly reference the id in the articles table, you will need to use the function guid_to_nntp() with CPAN::Testers::Common::Utils, using the new guid field in the cpanstats table.

As of this release the cpanstats id field is a unique auto incrementing field.

The next release of this distribution will be focused on generation of stats using the Metabase storage API.

v1.00 CHANGES

Moved to Metabase API. The change to a definite major version number hopefully indicates that this is a major interface change. All previous NNTP access has been dropped and is no longer relavent. All report updates are now fed from the Metabase API.

INTERFACE

The Constructor

  • new

    Instatiates the object CPAN::Testers::Data::Generator. Accepts a hash containing values to prepare the object. These are described as:

      my $obj = CPAN::Testers::Data::Generator->new(
                    logfile => './here/logfile',
                    config  => './here/config.ini'
      );

    Where 'logfile' is the location to write log messages. Log messages are only written if a logfile entry is specified, and will always append to any existing file. The 'config' should contain the path to the configuration file, used to define the database access and general operation settings.

Public Methods

  • generate

    Starting from the last cached report, retrieves all the more recent reports from the Metabase Report Submission server, parsing each and recording each report in both the cpanstats database and the metabase cache database.

  • regenerate

    For a given date range, retrieves all the reports from the Metabase Report Submission server, parsing each and recording each report in both the cpanstats database and the metabase cache database.

    Note that as only 2500 can be returned at any one time due to Amazon SimpleDB restrictions, this method will only process the guids returned from a given start data, up to a maxiumu of 2500 guids.

    This method will return the guid of the last report processed.

  • rebuild

    In the event that the cpanstats database needs regenerating, either in part or for the whole database, this method allow you to do so. You may supply parameters as to the 'start' and 'end' values (inclusive), where all records are assumed by default. Records are rebuilt using the local metabase cache database.

  • reparse

    Rather than a complete rebuild the option to selective reparse selected entries is useful if there are reports which were previously unable to correctly supply a particular field, which now has supporting parsing code within the codebase.

    In addition there is the option to exclude fields from parsing checks, where they may be corrupted, and can be later amended using the 'cpanstats-update' tool.

  • parse

    Unlike reparse, parse is used to parse just missing reports. As such if a report has already been stored and cached, it won't be processed again, unless the 'force' option is used.

    In addition, as per reparse, there is the option to exclude fields from parsing checks, where they may be corrupted, and can be later amended using the 'cpanstats-update' tool.

  • tail

    Write to a file, the list of GUIDs returned from a tail request.

Private Methods

  • commit

    To speed up the transaction process, a commit is performed every 500 inserts. This method is used as part of the clean up process to ensure all transactions are completed.

  • get_tail_guids

    Get the list of GUIDs as would be seen for a tail log.

  • get_next_dates

    Get the list of dates to use in the next cycle of report retrieval.

  • get_next_guids

    Get the list of GUIDs for the reports that have been submitted since the last cached report.

  • retrieve_reports

    Abstracted loop of requesting GUIDs, then parsing, storing and caching each report as appropriate.

  • already_saved

    Given a guid, determines whether it has already been saved in the local metabase cache.

  • load_fact

    Get a specific report fact for a given GUID, from the local database.

  • get_fact

    Get a specific report fact for a given GUID, from the Metabase.

  • dereference_report

    When you retrieve the parent report fact from the database, you'll need to dereference it to ensure the child elements contain the child facts in the correct format for processing.

  • parse_report

    Parses a report extracting the metadata required for the cpanstats database.

  • reparse_report

    Parses a report (from a local metabase cache) extracting the metadata required for the stats database.

  • retrieve_report

    Given a guid will attempt to return the report metadata from the cpanstats database.

  • store_report

    Inserts the components of a parsed report into the cpanstats database.

  • cache_report

    Inserts a serialised report into a local metabase cache database.

  • cache_update

    For the current report will update the local metabase cache with the id used within the cpanstats database.

Very Private methods

The following modules load information enmasse to avoid DB connection hogging and IO blocking. Thus improving performance.

  • load_uploads

    Loads the upload information.

  • load_authors

    Loads information regarding each author's distribution.

  • load_perl_versions

    Loads all the known Perl versions.

  • save_perl_versions

    Saves any new Perl versions

HISTORY

The CPAN Testers was conceived back in May 1998 by Graham Barr and Chris Nandor as a way to provide multi-platform testing for modules. Today there are over 40 million tester reports and more than 100 testers each month giving valuable feedback for users and authors alike.

BECOME A TESTER

Whether you have a common platform or a very unusual one, you can help by testing modules you install and submitting reports. There are plenty of module authors who could use test reports and helpful feedback on their modules and distributions.

If you'd like to get involved, please take a look at the CPAN Testers Wiki, where you can learn how to install and configure one of the recommended smoke tools.

For further help and advice, please subscribe to the the CPAN Testers discussion mailing list.

  CPAN Testers Wiki
    - http://wiki.cpantesters.org
  CPAN Testers Discuss mailing list
    - http://lists.cpan.org/showlist.cgi?name=cpan-testers-discuss

BUGS, PATCHES & FIXES

There are no known bugs at the time of this release. However, if you spot a bug or are experiencing difficulties, that is not explained within the POD documentation, please send bug reports and patches to the RT Queue (see below).

Fixes are dependent upon their severity and my availability. Should a fix not be forthcoming, please feel free to (politely) remind me.

RT Queue - http://rt.cpan.org/Public/Dist/Display.html?Name=CPAN-Testers-Data-Generator

SEE ALSO

CPAN::Testers::Report, Metabase, Metabase::Fact, CPAN::Testers::Fact::LegacyReport, CPAN::Testers::Fact::TestSummary, CPAN::Testers::Metabase::AWS

CPAN::Testers::WWW::Statistics

http://www.cpantesters.org/, http://stats.cpantesters.org/, http://wiki.cpantesters.org/

AUTHOR

It should be noted that the original code for this distribution began life under another name. The original distribution generated data for the original CPAN Testers website. However, in 2008 the code was reworked to generate data in the format for the statistics data analysis, which in turn was reworked to drive the redesign of the all the CPAN Testers websites. To reflect the code changes, a new name was given to the distribution.

CPAN-WWW-Testers-Generator

  Original author:    Leon Brocard <acme@astray.com>   (C) 2002-2008
  Current maintainer: Barbie       <barbie@cpan.org>   (C) 2008-2010

CPAN-Testers-Data-Generator

  Original author:    Barbie       <barbie@cpan.org>   (C) 2008-2014

LICENSE

  This module is free software; you can redistribute it and/or
  modify it under the Artistic License 2.0.