NAME

MusicRoom::Text::Nearest - Select the closest matching names

DESCRIPTION

When handling music tags you often find that variations in spelling and interpretation make it difficult to identify matches. For example, are these two tracks the same?

    The Sugarhill Gang - Rapper's Delight (reentry)
    Sugar Hill Gang - The Rappers Delight (2)

The module uses a number of techniques (implemented in modules like MusicRoom::Text::SoundexNG and Text::WagnerFischer) to identify "nearby" values that it can suggest.
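
Text::WagnerFischer takes its name from the Wagner-Fischer algorithm for computing edit distance. As a rough illustration of that kind of measure, here is a minimal sketch with unit costs. It is not the code that either module actually runs, and edit_distance is a hypothetical helper name:

    use strict;
    use warnings;
    use List::Util qw(min);

    # Hypothetical helper: edit distance by the Wagner-Fischer dynamic
    # programme, keeping one row of the table at a time. Insert, delete
    # and substitute each cost 1.
    sub edit_distance
      {
        my($from,$to) = @_;
        my @from = split //, $from;
        my @to = split //, $to;

        # Row 0: cost of turning "" into each prefix of $to
        my @prev = (0 .. scalar(@to));
        for my $i (1 .. scalar(@from))
          {
            my @cur = ($i);
            for my $j (1 .. scalar(@to))
              {
                my $subst = $from[$i-1] eq $to[$j-1] ? 0 : 1;
                $cur[$j] = min($prev[$j] + 1,          # delete
                               $cur[$j-1] + 1,         # insert
                               $prev[$j-1] + $subst);  # substitute or keep
              }
            @prev = @cur;
          }
        return $prev[-1];
      }

    print edit_distance("kitten","sitting"), "\n";                # 3
    print edit_distance("sugarhill gang","sugar hill gang"), "\n"; # 1

A small distance relative to the length of the name suggests a likely match. How the module combines this with the SoundexNG result is not covered here.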

Here is an example of how to use the module:

    use IO::File;
    use MusicRoom::Text::Nearest;
    MusicRoom::Text::Nearest::add_categories(
        artist => 1,
        song => 
          {
            qualifiers => ['\s*\(1\)', '\s*\(2\)', '\s*\(3\)',
                           '\s*\(remix\)', '\s*\(re\-entry\)', '\s*\(re\-issue\)', 
                           '\s*\(live\)',],
          },
      );
    my $fh = IO::File->new("valid-artists.txt")
        or die "Cannot open valid-artists.txt: $!";
    MusicRoom::Text::Nearest::read_names($fh,"artist");
    $fh->close();

    my $artist = "Sugar Hill Gang";
    my $real_artist = MusicRoom::Text::Nearest::get("artist",$artist,"Checking artists");

This adds the categories "artist" and "song" to the valid sets. The qualifiers flag tells the module to ignore certain strings in the song names. Then we read a file containing the valid names. Finally we select the closest one of those to use as the name of the artist.

Of course this module won't help identify that "The Plastic Ono Band" and "John Lennon" are the same; that you have to do for yourself.
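
To make the effect of the qualifiers flag concrete, here is a minimal sketch of how such patterns could be stripped from a song name before comparison. It assumes the qualifier sits at the end of the name and is matched without regard to case. The module's own handling may differ, and strip_qualifiers is a hypothetical helper:

    use strict;
    use warnings;

    my @qualifiers = ('\s*\(1\)', '\s*\(2\)', '\s*\(3\)',
                      '\s*\(remix\)', '\s*\(live\)');

    # Hypothetical helper: remove any qualifier found at the end of a name
    sub strip_qualifiers
      {
        my($name) = @_;
        foreach my $pattern (@qualifiers)
          {
            $name =~ s/(?:$pattern)\z//i;
          }
        return $name;
      }

    print strip_qualifiers("The Rappers Delight (2)"), "\n";
    # prints "The Rappers Delight"

Note that a pattern written as '\s*\(re\-entry\)' does not match the "(reentry)" spelling in the first example track. A pattern such as '\s*\(re\-?entry\)' would cover both spellings.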

FORMAT

The module depends on being seeded with a good set of valid entries. This is normally done by reading a file containing a list of values. These files are in a particular format. Here is an example:

    # 1        -> src-tracks
    #  1       -> uk single
    #   1      -> uk totp
    #    1     -> au yearsong
    #     1    -> de decadesong
    #      1   -> us yearsong
    #       1  -> us riaaalbum
    #        1 -> nl single
    
     { X  1  4} Jason Donovan
     { 1      } Jason Downs
     {        } Jason Falkner
     {1    1  } Jason Mraz

There are two types of comments here: lines starting with a hash are treated as comments, and entries between curly brackets are ignored as well. These allow the files to keep track of the source of the names. In the above example we have files called "src-tracks", "uk single" and so on, with each character giving an indication of the count of those items in that file, so "Jason Donovan" is in "uk single" more than 9 times, in "de decadesong" once and in "nl single" four times.

Neither type of comment is necessary for the module; however, they do make it easier to see where the names came from. This makes managing the names easier (or indeed possible).
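
The two comment rules can be sketched as follows. This is an illustration of the format as described above, not the module's own reader. parse_names is a hypothetical helper, and trimming of surrounding white space is an assumption:

    use strict;
    use warnings;

    # Hypothetical helper: return the names found in a list of lines,
    # dropping both kinds of comment
    sub parse_names
      {
        my @names;
        foreach my $raw (@_)
          {
            my $line = $raw;               # copy, the argument may be read-only
            next if $line =~ /^#/;         # hash comment line
            $line =~ s/\{[^}]*\}//g;       # source flags in curly brackets
            $line =~ s/^\s+|\s+$//g;       # surrounding white space (assumed)
            push @names, $line if $line ne "";
          }
        return @names;
      }

    my @names = parse_names(
        "# 1        -> src-tracks",
        " { X  1  4} Jason Donovan",
        " {        } Jason Falkner",
        "",
      );
    print join(", ", @names), "\n";
    # prints "Jason Donovan, Jason Falkner"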

list($category)

List all the names registered in a particular category.

by_soundex($a,$b)

A sort routine that works by SoundexNG. Pretty much useless outside the module because of the scoping rules for $a and $b.
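
The scoping problem is a general Perl one: sort places the values being compared in the package variables $a and $b of the package where the sort call was compiled, so a comparison routine defined in another package does not see them. Perl's documented way around this is a ($$) prototype, which makes sort pass the two values in @_ instead. The sketch below shows that technique on a toy comparison. The package Other and by_length are hypothetical names, and this is not how by_soundex is written:

    use strict;
    use warnings;

    package Other;

    # With a ($$) prototype sort passes the two values in @_,
    # so the routine no longer relies on $a and $b
    sub by_length($$)
      {
        my($left,$right) = @_;
        return length($left) <=> length($right);
      }

    package main;

    my @sorted = sort Other::by_length qw(ccc a bb);
    print "@sorted\n";
    # prints "a bb ccc"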

add_handler($fun,$args)

Add an error handler to be called when an error is encountered in the module.

    sub found_error
      {
        my($str) = @_;
        print STDERR $str;
      }

    MusicRoom::Text::Nearest::add_handler(\&found_error);

error($str)

This invokes an error in the module. I can't see why anyone should want to call this outside the module.

add_categories(%categories)

Add a set of categories. This can be called like:

    MusicRoom::Text::Nearest::add_categories(
        artist => 1,
        song => 
          {
            qualifiers => ['\s*\(1\)', '\s*\(2\)', '\s*\(3\)',
                           '\s*\(remix\)', '\s*\(re\-entry\)', '\s*\(re\-issue\)', 
                           '\s*\(live\)',],
          },
      );

If a hash reference is passed as the value for a category it is treated as a set of flags controlling that category. Currently "qualifiers" is the only handled flag.

add_name($category,$name)

Add a new valid name into the category.

read_names($fh,$cat)

Read valid names from the file. Because this prepares all the matching patterns it can take some time to run.