
NAME

Algorithm::TicketClusterer - A Perl module for retrieving Excel-stored past tickets that are most similar to a new ticket. Tickets are commonly used in the software services industry and in customer support businesses to record requests for service, product complaints, user feedback, and so on.

SYNOPSIS

    use Algorithm::TicketClusterer;

    #  Extract the tickets from the Excel spreadsheet and subject the
    #  textual content of the tickets to various preprocessing and doc
    #  modeling steps.  The preprocessing steps consist of removing markup,
    #  dropping the words in a stop list, correcting spelling errors,
    #  detecting the need for antonyms, and, finally, adding word synonyms
    #  to the tickets in order to ground the tickets in a common
    #  vocabulary. The doc modeling steps consist of fitting a standard
    #  vector space model to the tickets.

    my $clusterer = Algorithm::TicketClusterer->new( 
    
                         excel_filename            => $excel_filename,
                         clustering_fieldname      => $fieldname_for_clustering,
                         which_worksheet           => $which_worksheet,
                         unique_id_fieldname       => $unique_id_fieldname,
                         raw_tickets_db            => $raw_tickets_db,
                         processed_tickets_db      => $processed_tickets_db,
                         stemmed_tickets_db        => $stemmed_tickets_db,
                         inverted_index_db         => $inverted_index_db,
                         tickets_vocab_db          => $tickets_vocab_db,
                         idf_db                    => $idf_db,
                         tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                         tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                         synset_cache_db           => $synset_cache_db,
                         stop_words_file           => $stop_words_file,
                         misspelled_words_file     => $misspelled_words_file,
                         add_synsets_to_tickets    => 1,
                         want_synset_caching       => 1,
                         max_num_syn_words         => 3,
                         min_word_length           => 4,
                         want_stemming             => 1,
                    );
    
    ## Extract information from Excel spreadsheets:
    $clusterer->get_tickets_from_excel();
    
    ## Apply cleanup filters and add synonyms:
    $clusterer->delete_markup_from_all_tickets();
    $clusterer->apply_filter_to_all_tickets();
    $clusterer->expand_all_tickets_with_synonyms();
    
    ## Construct the VSM doc model for the tickets:
    $clusterer->get_ticket_vocabulary_and_construct_inverted_index();
    $clusterer->construct_doc_vectors_for_all_tickets();

    #  Of the various constructor parameters shown above, the following two
    #  are critical to how information is extracted from an Excel
    #  spreadsheet: `clustering_fieldname' and `unique_id_fieldname'.  The
    #  first is the heading of the column that contains the textual content
    #  of the tickets.  The second is the heading of the column that
    #  contains a unique integer identifier for each ticket.

    #  The nine database-related constructor parameters (these end in the
    #  suffix `_db') are there in order to avoid repeated parsing of the
    #  spreadsheet and preprocessing of the tickets every time you need to
    #  make a retrieval for a new ticket.  The goal here is that after the
    #  ticket information has been ingested from a spreadsheet, you would
    #  want to carry out similar-ticket retrieval in real time.  (Whether
    #  or not real-time retrieval would be feasible in actual practice
    #  would also depend on what hardware you are using, obviously.)

    #  After the above preprocessing and doc modeling steps, you can
    #  extract the most similar past tickets for a new query ticket with a
    #  script in which the constructor call would look like:

    my $clusterer = Algorithm::TicketClusterer->new( 
    
                         clustering_fieldname      => $fieldname_for_clustering,
                         unique_id_fieldname       => $unique_id_fieldname,
                         raw_tickets_db            => $raw_tickets_db,
                         processed_tickets_db      => $processed_tickets_db,
                         stemmed_tickets_db        => $stemmed_tickets_db,
                         inverted_index_db         => $inverted_index_db,
                         tickets_vocab_db          => $tickets_vocab_db,
                         idf_db                    => $idf_db,
                         tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                         tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                         min_idf_threshold         => 1.8,
                         how_many_retrievals       => 5,
                    );
    
    my $query_tkt = 1393548;
    $clusterer->restore_ticket_vectors_and_inverted_index();
    my %retrieved = %{$clusterer->retrieve_similar_tickets_with_vsm($query_tkt)};
    foreach my $tkt_id (sort {$retrieved{$b} <=> $retrieved{$a}} keys %retrieved) {
        $clusterer->show_original_ticket_for_given_id( $tkt_id );
    }

    #  Of all the parameters shown above in the constructor call, the
    #  parameter min_idf_threshold plays a large role in what tickets are
    #  returned by the retrieval function. The value of this parameter
    #  depends on the number of tickets in your Excel spreadsheet.  If the
    #  number of tickets is in the low hundreds, this parameter is likely to
    #  require a value of 1.5 to 1.8.  If the number of tickets is in the
    #  thousands, the value of this parameter is likely to be between 2 and
    #  3. See the writeup on this parameter in the API description in the
    #  rest of this documentation.

CHANGES

Version 1.01 of the module removes the platform dependency of the functions used for reading the text files for stop words, misspelled words, etc.

DESCRIPTION

Algorithm::TicketClusterer is a perl5 module for retrieving previously processed Excel-stored tickets similar to a new ticket. Routing decisions made for the past similar tickets can be useful in expediting the routing of a new ticket.

Tickets are commonly used in the software services industry and in customer support businesses to record requests for service, product complaints, user feedback, and so on.

With regard to the routing of a ticket, you would want each new ticket to be handled by the tech support individual who is most qualified to address the issue raised in the ticket. Identifying the right individual for each new ticket in real time is no easy task for organizations that man large service centers and helpdesks. So if it were possible to quickly identify the previously processed tickets that are most similar to a new ticket, one could think of constructing semi-automated (or, perhaps, even fully automated) ticket routers.

Identifying old tickets similar to a new ticket is made challenging by the fact that folks who submit tickets often write them quickly and informally. The informal style of writing means that different people may use different colloquial terms to describe the same thing. And the quickness associated with their submission causes the tickets to frequently contain spelling and other errors such as conjoined words, fragmentation of long words, and so on.

This module is an attempt at dealing with these challenges.

The problem of different people using different words to describe the same thing is taken care of by using WordNet to add to each ticket a designated number of synonyms for each word in the ticket. The idea is that after all the tickets are expanded in this manner, they would become grounded in a common vocabulary. The synonym expansion of a ticket takes place only after the negated phrases (that is, the words preceded by 'no' or 'not') are replaced by their antonyms.

Obviously, expanding a ticket by synonyms makes sense only after it is corrected for spelling and other errors. What sort of errors one looks for and corrects would, in general, depend on the application domain of the tickets. (It is not uncommon for engineering services to use jargon words and acronyms that look like spelling errors to those not familiar with the services.) The module expects to see a file, supplied through the constructor parameter misspelled_words_file, that contains misspelled words in the first column and their corrected versions in the second column. An example of such a file is included in the examples directory. You would need to create your own version of such a file for your application domain. Since conjuring up the misspellings that your ticket submitters are likely to throw at you is futile, you might consider the following approach, which I prefer to actually reading the tickets for such errors: turn on the debugging options in the constructor for some initially collected spreadsheets and watch what sort of words WordNet is not able to supply any synonyms for. In a large majority of cases, these will be the misspelled words.
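
To make the expected format concrete, here is a hypothetical fragment of such a file; the entries shown are invented for illustration, and the file in the examples directory shows the exact format:

    recieve        receive
    databse        database
    conection      connection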

Expanding a ticket with synonyms is made complicated by the fact that some common words have such a large number of synonyms that they can overwhelm the relatively small number of words in a ticket. Adding too many synonyms in relation to the size of a ticket can not only distort the sense of the ticket but also increase the computational cost of processing all the tickets.

In order to deal with the pros and the cons of using synonyms, the present module strikes a middle ground: You can specify how many synonyms to use for a word (assuming that the number of synonyms supplied by WordNet is larger than the number specified). This allows you to experiment with retrieval precision by altering the number of synonyms used. The retained synonyms are selected randomly from those supplied by WordNet. (A smarter way to select synonyms would be to base them on the context. For example, you would not want to use the synonym `programmer' for the noun `developer' if your application domain is real-estate. However, such context-dependent selection of synonyms would take us into the realm of ontologies that I have chosen to stay away from in this first version of the module.)
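
The following is a minimal sketch of such capped random selection using WordNet::QueryData. It is purely illustrative, not the module's internal code, and, for simplicity, it looks at only the first noun sense of a word:

    use strict;
    use warnings;
    use List::Util qw(shuffle);
    use WordNet::QueryData;

    my $wn = WordNet::QueryData->new();

    #  Return up to $max synonyms for $word, chosen at random from the
    #  synset of its first noun sense (illustrative only):
    sub capped_synonyms {
        my ($word, $max) = @_;
        my @senses = $wn->querySense("$word#n");    # noun senses, if any
        return () unless @senses;
        my %uniq = map  { $_ => 1 }
                   grep { $_ ne $word }
                   map  { (split /#/, $_)[0] }      # strip "#pos#sense"
                   $wn->querySense( $senses[0], "syns" );
        my @syns = shuffle keys %uniq;
        return @syns > $max ? @syns[0 .. $max - 1] : @syns;
    }

    print join( ", ", capped_synonyms( "developer", 3 ) ), "\n";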

Another issue related to the overall run-time performance of this module is the computational cost of the calls to WordNet through its Perl interface WordNet::QueryData. This module uses what I have referred to as synset caching to make this process as efficient as possible. The result of each WordNet lookup is cached in a database file whose name you supply through the constructor option synset_cache_db. If you are doing a good job of catching spelling errors, the module will carry out a decreasing number of WordNet lookups as the tickets are scanned for expansion with synonyms. In an experiment with a spreadsheet that contained over 1400 real tickets, the last several hundred resulted in hardly any calls to WordNet.

As currently programmed, the synset cache is deleted and then created afresh at every call to the function that extracts information from an Excel spreadsheet. You would want to change this behavior of the module if you are planning to use it in a production environment where the different spreadsheets are likely to deal with the same application domain. To give greater persistence to the synset cache, comment out the unlink $self->{_synset_cache_db} line in the method get_tickets_from_excel(). After a few updates of the synset cache, the module would almost never need to make direct calls to WordNet, which would enhance the speed of the module even further.
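
Here is a sketch of what such caching amounts to, with a tied SDBM hash standing in for the cache; the file name and the comma-separated storage layout are assumptions made for illustration, not the module's actual internals:

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;
    use WordNet::QueryData;

    #  Map each word to a comma-separated list of synonyms so that
    #  WordNet is consulted only on a cache miss (illustrative only):
    tie my %synset_cache, 'SDBM_File', 'my_synset_cache',
        O_RDWR | O_CREAT, 0644
            or die "Cannot tie synset cache: $!";

    my $wn = WordNet::QueryData->new();

    sub cached_synonyms {
        my ($word) = @_;
        unless ( exists $synset_cache{$word} ) {
            my @senses = $wn->querySense("$word#n");
            my @syns = @senses
                       ? map { (split /#/, $_)[0] }
                         $wn->querySense( $senses[0], "syns" )
                       : ();
            $synset_cache{$word} = join ',', @syns;
        }
        return split /,/, $synset_cache{$word};
    }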

The textual content of the tickets, as produced by the preprocessing steps, is used for document modeling and the doc model thus created used subsequently for retrieving similar tickets. The doc modeling is carried out using the Vector Space Model (VSM) in which each ticket is represented by a vector whose size equals the size of the vocabulary used in all the tickets and whose elements represent the word frequencies in the ticket. After such a model is constructed, a query ticket is compared with the other tickets on the basis of the cosine similarity distance between the corresponding vectors.

My decision to use the simplest of the text models --- the Vector Space Model --- was based on the work carried out by Shivani Rao at Purdue, who has demonstrated that the simpler models are more effective at retrieval from software libraries than the more complex models. (See the paper by Shivani Rao and Avinash Kak at the MSR'11 Conference.) Although tickets, in general, are not the same as software libraries, I have a strong feeling that Shivani's conclusions would extend to other domains as well. Having said that, it is important to mention that there remains the possibility that automated ticket routing for some applications may respond better to more elaborate text models.

The module uses three mechanisms to speed up the retrieval of tickets similar to a query ticket: (1) It uses the inverted index for all the words to construct for each query ticket a candidate pool of only those tickets in the database that have words in common with the query ticket; (2) Only those query-ticket words are used for retrieval whose inverse-document-frequency values exceed a user-specified threshold; and (3) The module uses stemming to reduce the variants of the same word to a common root in order to limit the size of the vocabulary. The stemming used in the current module is rudimentary. However, it would be easy to plug into the module more powerful stemmers through their Perl interfaces. Future versions of this module may do exactly that.

THE THREE STAGES OF PROCESSING

The tickets are processed in the following three stages:

Ticket Preprocessing:

This stage involves extracting the textual content of each ticket from the Excel spreadsheet and subjecting it to the following steps: (1) deleting markup; (2) dropping the stop words supplied through a file whose name is provided as a value for the constructor parameter stop_words_file; (3) correcting spelling errors through the `bad-word good-word' entries in a file whose name is supplied as a value for the constructor parameter misspelled_words_file; (4) replacing negated words with their antonyms; and, finally, (5) adding synonyms.

Doc Modeling:

Doc modeling consists of creating a Vector Space Model for the tickets after they have been processed as described above. VSM modeling involves scanning the preprocessed tickets, stemming the words, and constructing a vocabulary for all of the stemmed words in all the tickets. Subsequently, the alphabetized list of the vocabulary serves as a vector template for the tickets. Each ticket is represented by a vector whose dimensionality equals the size of the vocabulary; each element of this vector is an integer that is the frequency of the vocabulary word corresponding to the index of the element.

Doc modeling also involves calculating the inverse document frequency (IDF) values for the words and the inverted index for the words. The IDF values are used to diminish the importance of the words that carry little discriminatory power vis-a-vis the tickets. The IDF for a word is the logarithm of the ratio of the total number of tickets to the number of tickets in which the word appears. Obviously, if a word were to appear in all the tickets, its IDF value would be zero. The inverted index entry for a word is the list of all the tickets that contain that word. The inverted index greatly expedites the retrieval of tickets similar to a given query ticket.
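
The following toy sketch shows the essence of these doc modeling computations on made-up stemmed tickets; the real module computes the same structures from its database files, so all the names here are stand-ins:

    use strict;
    use warnings;

    #  Toy data standing in for the stemmed tickets:
    my %stemmed_tickets = (
        101 => [ qw( printer jam tray paper ) ],
        102 => [ qw( printer driver crash ) ],
        103 => [ qw( password reset account ) ],
    );

    #  The alphabetized vocabulary serves as the vector template:
    my %vocab;
    $vocab{$_}++ for map { @$_ } values %stemmed_tickets;
    my @vocab_words = sort keys %vocab;

    #  One doc vector per ticket (word frequencies):
    my %doc_vecs;
    for my $id (keys %stemmed_tickets) {
        my %tf;
        $tf{$_}++ for @{ $stemmed_tickets{$id} };
        $doc_vecs{$id} = [ map { $tf{$_} || 0 } @vocab_words ];
    }

    #  IDF of a word: log10( #tickets / #tickets containing the word ):
    my $num_tickets = keys %stemmed_tickets;
    my %idf;
    for my $word (@vocab_words) {
        my $df = 0;
        for my $words (values %stemmed_tickets) {
            $df++ if grep { $_ eq $word } @$words;
        }
        $idf{$word} = log( $num_tickets / $df ) / log(10);
    }

    #  Inverted index: word => list of IDs of tickets containing it:
    my %inverted_index;
    for my $id (keys %stemmed_tickets) {
        my %seen;
        push @{ $inverted_index{$_} }, $id
            for grep { !$seen{$_}++ } @{ $stemmed_tickets{$id} };
    }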

Similarity Based Retrieval:

A query ticket is subjected to the same preprocessing steps as all the other tickets. Subsequently, it too is represented by a vector in the same manner. Using the stemmed words in the query ticket, the inverted index is used to create a candidate list of ticket vectors for matching with the query ticket vector. For this, only those query words are chosen whose IDF values exceed a threshold. Finally, we compute the cosine similarity distance between the query ticket vector and each of the ticket vectors in the candidate list. The matching tickets are returned in order of decreasing similarity.
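
Here is a schematic of this retrieval logic with stand-in data structures --- again a sketch of the technique, not the module's code:

    use strict;
    use warnings;
    use List::Util qw(sum0);

    #  Candidate pool: every ticket that shares at least one
    #  sufficiently high-IDF word with the query ticket:
    sub candidate_pool {
        my ($query_words, $inverted_index, $idf, $threshold) = @_;
        my %pool;
        for my $word (@$query_words) {
            next unless defined $idf->{$word}
                        && $idf->{$word} > $threshold;
            $pool{$_} = 1 for @{ $inverted_index->{$word} || [] };
        }
        return keys %pool;
    }

    #  Cosine similarity between two equal-length doc vectors:
    sub cosine_similarity {
        my ($v1, $v2) = @_;
        my $dot  = sum0( map { $v1->[$_] * $v2->[$_] } 0 .. $#$v1 );
        my $mag1 = sqrt( sum0( map { $_ ** 2 } @$v1 ) );
        my $mag2 = sqrt( sum0( map { $_ ** 2 } @$v2 ) );
        return ($mag1 && $mag2) ? $dot / ($mag1 * $mag2) : 0;
    }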


METHODS

The module provides the following methods for ticket preprocessing and for the retrieval of tickets most similar to a given ticket:

new()

A call to new() constructs a new instance of the Algorithm::TicketClusterer class:

    my $clusterer = Algorithm::TicketClusterer->new( 

                     excel_filename            => $excel_filename,
                     clustering_fieldname      => $fieldname_for_clustering,
                     unique_id_fieldname       => $unique_id_fieldname,
                     which_worksheet           => $which_worksheet,
                     raw_tickets_db            => $raw_tickets_db,
                     processed_tickets_db      => $processed_tickets_db,
                     stemmed_tickets_db        => $stemmed_tickets_db,
                     inverted_index_db         => $inverted_index_db,
                     tickets_vocab_db          => $tickets_vocab_db,
                     idf_db                    => $idf_db,
                     tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                     tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                     synset_cache_db           => $synset_cache_db,
                     stop_words_file           => $stop_words_file,
                     misspelled_words_file     => $misspelled_words_file,
                     add_synsets_to_tickets    => 1,
                     want_synset_caching       => 1,
                     min_idf_threshold         => 2.0,
                     max_num_syn_words         => 3,
                     min_word_length           => 4,
                     want_stemming             => 1,
                     how_many_retrievals       => 5,
                     debug1                    => 1,  # for processing, filtering Excel
                     debug2                    => 1,  # for doc modeling
                     debug3                    => 1,  # for retrieving similar tickets

                   );

Obviously, before you can invoke the constructor, you must provide values for the variables shown to the right of the big arrows. What these values should be is made clear by the following alphabetized list that describes each of the constructor parameters shown above:

add_synsets_to_tickets:

You can turn off the addition of synonyms to the tickets by setting this boolean parameter to 0.

clustering_fieldname:

This is for supplying to the constructor the heading of the column in your Excel spreadsheet that contains the textual data for the tickets. For example, if the column heading for the textual content of the tickets is `Description', you must supply this string as the value for the parameter clustering_fieldname.

debug1:

When this parameter is set, the module prints out information regarding what columns of the spreadsheet it is extracting information from, the headers for those columns, the index of the column that contains the textual content of the tickets, and of the column that contains the unique integer identifier for each ticket. If you are dealing with spreadsheets with a large number of tickets, it is best to pipe the output of the module into a file to see the debugging information.

debug2:

When this parameter is set, you will see how WordNet is being utilized to generate word synonyms. This debugging output is also useful to see the extent of misspellings in the tickets. If WordNet is unable to find the synonyms for a word, chances are that the word is not spelled correctly (or that it is a jargon word or a jargon acronym).

debug3:

This debug flag applies to the calculations carried out during the retrieval of similar tickets. When this flag is set, the module will display the candidate set of tickets to be considered for matching with the query ticket. This candidate set is chosen by using the inverted index to collect all the tickets that share words with the query ticket, provided the IDF value for each such word exceeds the threshold set by the constructor parameter min_idf_threshold.

excel_filename:

This is obviously the name of the Excel file that contains the tickets you want to process.

how_many_retrievals:

The integer value supplied for this parameter determines how many tickets that are most similar to a query ticket will be returned.

idf_db:

You store the inverse document frequencies for the vocabulary words in a database file whose name is supplied through this constructor parameter. As mentioned earlier, the IDF for a word is, in principle, the logarithm of the ratio of the total number of tickets to the DF (Document Frequency) for the word. The DF of a word is the number of tickets in which the word appears.

inverted_index_db:

If you plan to create separate scripts for the three stages of processing described earlier, you must store the inverted index in a database file so that it can be used by the script whose job is to carry out similarity based ticket retrieval. The inverted index is stored in a database file whose name is supplied through this constructor parameter.

max_num_syn_words:

As mentioned in DESCRIPTION, some words can have a very large number of synonyms --- much larger than the number of words that may exist in a typical ticket. If you were to add all such synonyms to a ticket, you would run the danger of altering the sense of the ticket, besides unnecessarily increasing the size of the vocabulary. This parameter places a cap on the number of synonyms added for any one word. When the number of synonyms returned by WordNet is greater than the value set for this parameter, the synonyms retained are chosen randomly from the list returned by WordNet.

min_idf_threshold:

First recall that IDF stands for Inverse Document Frequency. It is calculated during the second of the three stages of ticket processing described in the section THE THREE STAGES OF PROCESSING. The IDF value of a word gives us a measure of the discriminatory power of the word. Let's say you have a word that occurs in only one out of 1000 tickets. Such a word is obviously highly discriminatory and its IDF would be the logarithm (to base 10) of the ratio of 1000 to 1, which is 3. On the other hand, for a word that occurs in every one of 1000 tickets, the IDF value would be the logarithm of the ratio of 1000 to 1000, which is 0. So, for the case when you have 1000 tickets, the upper bound on IDF is 3 and the lower bound is 0. This constructor parameter controls which of the query words will be used for constructing the initial pool of tickets for matching. The larger the value of this threshold, the smaller the pool, obviously.

min_word_length:

This parameter sets the minimum number of characters in a word in order for it to be included for ticket processing.

misspelled_words_file:

The extent to which you can improve ticket retrieval precision through the addition of synonyms depends on the degree to which you can make corrections on the fly for the spelling errors that occur frequently in tickets. That fact makes the file you supply through this constructor parameter very important. For the current version of the module, this file must contain exactly two columns, with the first entry in each row being the misspelled word and the second entry the correctly spelled word. See this file in the examples directory for how to format it.

processed_tickets_db:

As mentioned earlier in DESCRIPTION, the tickets must be subjected to various preprocessing steps before they can be used for document modeling for the purpose of retrieval. Preprocessing consists of stop-word removal, spelling corrections, antonym detection, synonym addition, etc. The tickets resulting from preprocessing are stored in a database file whose name you supply through this constructor parameter.

raw_tickets_db:

The raw tickets extracted from the Excel spreadsheet are stored in a database file whose name you supply through this constructor parameter. The idea here is that we do not want to process an Excel spreadsheet for each new attempt at matching a query ticket with the previously recorded tickets in the same spreadsheet. It is much faster to load the database back into the runtime environment than to process a large spreadsheet.

stemmed_tickets_db:

As mentioned in the section THE THREE STAGES OF PROCESSING, one of the first things you do in the second stage of processing is to stem the words in the tickets. Stemming is important because it reduces the size of the vocabulary. To illustrate, stemming would reduce both the words `programming' and `programmed' to the common root 'program'. This module uses a very simple stemmer whose rules can be found in the utility subroutine _simple_stemmer(). It would be trivial to expand on these rules, or, for that matter, to use the Perl module Lingua::Stem::En for a full application of the Porter Stemming Algorithm. The stemmed tickets are saved in a database file whose name is supplied through this constructor parameter.

stop_words_file:

This constructor parameter is for naming the file that contains the stop words, these being words you do not wish to be included in the vocabulary. The format of this file must be as shown in the sample file stop_words.txt in the examples directory.

synset_cache_db:

As mentioned in DESCRIPTION, we expand each ticket with a certain number of synonyms for the words in the ticket for the purpose of grounding all the tickets in a common vocabulary. This entails making calls to WordNet through its Perl interface WordNet::QueryData. Since these calls can be expensive, you can vastly improve the runtime performance of the module by caching the results returned by WordNet. This constructor parameter is for naming a disk file in which the cache will be stored.

tickets_vocab_db:

This parameter is for naming the DBM in which the ticket vocabulary is stored after it is subject to stemming.

tkt_doc_vecs_db:

The database file named by this constructor parameter stores the document vector representations for the tickets. Each document vector has the same size as the vocabulary for all the tickets; each element of such a vector is the number of occurrences of the corresponding word in the ticket.

tkt_doc_vecs_normed_db:

The database file named by this parameter stores the normalized document vectors. Normalization of a ticket vector consists of factoring out the size of the ticket by dividing the term frequency for each word in the ticket by the number of words in the ticket, and then multiplying the result by the IDF value for the word.

unique_id_fieldname:

One of the columns of your Excel spreadsheet must contain a unique integer identifier for each ticket-bearing row of the sheet. The heading of this column, a string obviously, is supplied as the value for this constructor parameter.

want_stemming:

This boolean parameter determines whether or not the words extracted from the tickets would be subject to stemming. As mentioned elsewhere, stemming means that related words like `programming' and `programs' would both be reduced to the root word `program'. Stemming is important for limiting the size of the vocabulary.

want_synset_caching:

Turning this boolean parameter on is a highly effective way to improve the computational speed of the module. As mentioned earlier, it is important to ground the tickets in a common vocabulary and this module does that by adding to the tickets a designated number of the synonyms for the words in the tickets. However, the calls to WordNet for the synonyms through the Perl interface WordNet::QueryData can be expensive. Caching means that only one call would need to be made to WordNet for any given word regardless of how many times the word appears in all of the tickets.

which_worksheet:

This specifies the Excel worksheet that contains the tickets. Its value should be 1 for the first sheet, 2 for the second, and so on.


apply_filter_to_all_tickets()
    $clusterer->apply_filter_to_all_tickets()

The filtering consists of dropping words from the tickets that are in your stop-list file, fixing spelling errors using the `bad-word good-word' pairs in your spelling errors file, and deleting short words.
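
The sketch below illustrates these three filtering operations on a single list of words; the two hashes stand for data loaded from the files named by the constructor parameters stop_words_file and misspelled_words_file, and the entries shown are invented:

    use strict;
    use warnings;

    my %stop_words      = map { $_ => 1 } qw( the and for with );
    my %spelling_fixes  = ( databse => 'database', recieve => 'receive' );
    my $min_word_length = 4;

    sub filter_ticket_words {
        my ($words) = @_;
        my @kept;
        for my $word (@$words) {
            #  Fix spelling first, then apply the other two filters:
            $word = $spelling_fixes{$word} if exists $spelling_fixes{$word};
            next if exists $stop_words{$word};           # drop stop words
            next if length($word) < $min_word_length;    # drop short words
            push @kept, $word;
        }
        return \@kept;
    }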

construct_doc_vectors_for_all_tickets()
    $clusterer->construct_doc_vectors_for_all_tickets()

This method is used in the doc modeling stage of the computations. As stated earlier, doc modeling of the tickets consists of representing each ticket by a vector whose size equals that of the vocabulary and whose elements represent the frequencies of the corresponding words in the ticket. In addition to calculating the doc vectors, this method also constructs a normalized version of the doc vectors. The normalization for a ticket consists of multiplying the word frequencies in the vectors by the IDF values associated with the words and dividing the result by the total number of words in the ticket.
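
For a single ticket, the normalization rule just described amounts to the following; the vocabulary, IDF values, and doc vector below are invented stand-ins for structures the module computes for itself:

    use strict;
    use warnings;
    use List::Util qw(sum0);

    my @vocab_words = qw( account crash driver jam );
    my %idf         = ( account => 0.48, crash => 0.48,
                        driver  => 0.48, jam   => 0.48 );
    my @doc_vec     = ( 0, 1, 1, 0 );

    #  (frequency / ticket size) * IDF, element by element:
    my $ticket_size = sum0(@doc_vec);
    my @normed_vec  = map {
        ( $doc_vec[$_] / $ticket_size ) * $idf{ $vocab_words[$_] }
    } 0 .. $#vocab_words;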

delete_markup_from_all_tickets()
    $clusterer->delete_markup_from_all_tickets()

It is not uncommon for the textual content of a ticket to contain HTML markup. This method deletes such strings. Note that this method is not capable of deleting complex markup, such as HTML comment blocks, markup that crosses line boundaries, or angle brackets in the text that are meant to denote "less than" or "greater than". If your tickets require more sophisticated processing for the removal of markup, you might consider using the HTML::Restrict module.
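
A naive one-regex stripper of the kind described here might look as follows; this is a sketch of the idea, not the module's actual regex, and, as noted above, it cannot cope with comment blocks, tags that span lines, or bare angle brackets:

    use strict;
    use warnings;

    my $ticket_text = 'The <b>printer</b> fails on <i>every</i> job.';
    (my $cleaned = $ticket_text) =~ s/<[^>]*>//g;
    print "$cleaned\n";    # prints: The printer fails on every job.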

display_all_doc_vectors()
display_all_normalized_doc_vectors()

These two methods are useful for troubleshooting if things don't look right with regard to retrieval.

display_inverse_document_frequencies()
    $clusterer->display_inverse_document_frequencies()

As mentioned earlier, the document frequency (DF) of a word is the number of tickets in which the word appears. The IDF of a word is the logarithm of the ratio of the total number of tickets to the DF of the word. A call to this method displays the IDF values for the words in the vocabulary.

display_inverted_index()
display_inverted_index_for_given_word( $word )
display_inverted_index_for_given_query( $ticket_id )

The above three methods are useful for troubleshooting the issues that are related to the generation of the inverted index. The first method shows the entire inverted index, the second the inverted index for a single specified word, and the third for all the words in a query ticket.

display_tickets_vocab()
    $clusterer->display_tickets_vocab()

This method displays the ticket vocabulary constructed by a call to get_ticket_vocabulary_and_construct_inverted_index(). The vocabulary display consists of an alphabetized list of the words in all the tickets along with the frequency of each word.

expand_all_tickets_with_synonyms()
    $clusterer->expand_all_tickets_with_synonyms();

This is the final step in the preprocessing of the tickets before they are ready for the doc modeling stage. This method calls other functions internal to the module that ultimately make calls to WordNet through the Perl interface provided by the WordNet::QueryData module.

get_tickets_from_excel()
    $clusterer->get_tickets_from_excel()

This method calls on the Spreadsheet::ParseExcel module to extract the tickets from the old-style Excel spreadsheets and the Spreadsheet::XLSX module for doing the same from the new-style Excel spreadsheets.
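
For reference, here is a bare-bones illustration of reading cells from an old-style .xls file with Spreadsheet::ParseExcel; the module wraps logic of this sort (and uses Spreadsheet::XLSX for the .xlsx case):

    use strict;
    use warnings;
    use Spreadsheet::ParseExcel;

    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse('ExampleExcelFile.xls')
        or die $parser->error();
    my $sheet = $workbook->worksheet(0);    # the first worksheet

    my ($row_min, $row_max) = $sheet->row_range();
    my ($col_min, $col_max) = $sheet->col_range();
    for my $row ($row_min .. $row_max) {
        for my $col ($col_min .. $col_max) {
            my $cell = $sheet->get_cell($row, $col);
            next unless defined $cell;
            print $cell->value(), "\t";
        }
        print "\n";
    }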

get_ticket_vocabulary_and_construct_inverted_index()
    $clusterer->get_ticket_vocabulary_and_construct_inverted_index()

As mentioned in THE THREE STAGES OF PROCESSING, the second stage of processing --- doc modeling of the tickets --- starts with the stemming of the words in the tickets, constructing a vocabulary of all the stemmed words in all the tickets, and constructing an inverted index for the vocabulary words. All of these things are accomplished by this method.

restore_processed_tickets_from_disk()
    $clusterer->restore_processed_tickets_from_disk()

This loads into your script the output of the ticket preprocessing stage. This method is called internally by restore_ticket_vectors_and_inverted_index(), which you would use in your ticket retrieval script, assuming it is separate from the ticket preprocessing script.

restore_raw_tickets_from_disk()
    $clusterer->restore_raw_tickets_from_disk()    

With this method, you are spared the trouble of having to repeatedly parse the same Excel spreadsheet during the development phase as you are testing the module with different query tickets. This method is called internally by restore_ticket_vectors_and_inverted_index().

restore_stemmed_tickets_from_disk()
    $clusterer->restore_stemmed_tickets_from_disk();

This method is called internally by restore_ticket_vectors_and_inverted_index().

restore_ticket_vectors_and_inverted_index()
    $clusterer->restore_ticket_vectors_and_inverted_index()

If you are going to be doing ticket preprocessing and doc modeling in one script and ticket retrieval in another, then this is the first method you would need to call in the latter for the restoration of the VSM model for the tickets and the inverted index.

retrieve_similar_tickets_with_vsm()
    my $retrieved_hash_ref = $clusterer->retrieve_similar_tickets_with_vsm( $ticket_num )

It is this method that retrieves the tickets that are most similar to a query ticket. The method first utilizes the inverted index to construct a candidate list of the tickets that share words with the query ticket. Only those words play a role here whose IDF values exceed min_idf_threshold. Subsequently, the query ticket vector is matched with each of the ticket vectors in the candidate list. The method returns a reference to a hash whose keys are the IDs of the tickets that match the query ticket and whose values are the corresponding cosine similarity distances.

show_original_ticket_for_given_id()
    $clusterer->show_original_ticket_for_given_id( $ticket_num )

The argument to the method is the unique integer ID of a ticket for which you want to see all the fields as stored in the Excel spreadsheet.

show_raw_ticket_clustering_data_for_given_id()
    $clusterer->show_raw_ticket_clustering_data_for_given_id( $ticket_num )

While the previous method shows all the fields for a ticket, this method shows only the textual content --- the content you want to use for establishing similarity between a query ticket and the other tickets.

show_processed_ticket_clustering_data_for_given_id()
    $clusterer->show_processed_ticket_clustering_data_for_given_id( $ticket_num );

This is the method to call if you wish to examine the textual content of a ticket after it goes through the preprocessing steps. In particular, you will see the corrections made, the synonyms added, etc. You would need to set the argument $ticket_num to the unique integer ID of the ticket you are interested in.

store_processed_tickets_on_disk()
    $clusterer->store_processed_tickets_on_disk();

This stores in a database file the preprocessed textual content of the tickets.

store_raw_tickets_on_disk()
    $clusterer->store_raw_tickets_on_disk();

This method is called by the get_tickets_from_excel() method to store on the disk the tickets extracted from the Excel spreadsheet. Obviously, you can also call it in your own script for doing the same.

store_stemmed_tickets_and_inverted_index_on_disk()
    $clusterer->store_stemmed_tickets_and_inverted_index_on_disk()

This method stores in a database file the stemmed tickets and the inverted index that are produced at the end of the second stage of processing.

show_stemmed_ticket_clustering_data_for_given_id()
    $clusterer->show_stemmed_ticket_clustering_data_for_given_id( $ticket_num );

If you want to see what sort of a job the stemmer is doing for a ticket, this is the method to call. You would need to set the argument $ticket_num to the unique integer ID of the ticket you are interested in.

store_ticket_vectors()
    $clusterer->store_ticket_vectors()

As the name implies, this call stores the vectors, both regular and normalized, in a database file on the disk.

HOW THE MATCHING TICKETS ARE RETRIEVED

It is the method retrieve_similar_tickets_with_vsm() that returns the best ticket matches for a given query ticket. What this method returns is a hash reference; the keys in this hash are the integer IDs of the matching tickets, and the values are the cosine similarity distances between the query ticket and the matching tickets. The number of matching tickets returned by retrieve_similar_tickets_with_vsm() is set by the constructor parameter how_many_retrievals. Note that retrieve_similar_tickets_with_vsm() takes a single argument, which is the integer ID of the query ticket.

THE examples DIRECTORY

The examples directory contains the following two scripts that would be your quickest way to become familiar with this module:

For ticket preprocessing and doc modeling:

Run the script

    ticket_preprocessor_and_doc_modeler.pl

This will carry out preprocessing and doc modeling of the tickets that are stored in the Excel file ExampleExcelFile.xls that you will find in the same directory.

For retrieving similar tickets:

Next, run the script

    retrieve_similar_tickets.pl

to retrieve five tickets that are closest to the query ticket whose integer ID is supplied to the retrieve_similar_tickets_with_vsm() method in the script.

Note that the tickets in the ExampleExcelFile.xls file are contrived. The sole purpose of executing the above two scripts is to get you started with the use of this module.

HOW YOU CAN TURN THIS MODULE INTO A PRODUCTION-QUALITY TOOL

By a production-quality tool, I mean a software package that you can actually use in a production environment for automated or semi-automated ticket routing in your organization. I am assuming you already have the tools in place that insert in real-time the new tickets in an Excel spreadsheet.

Turning this module into a production tool will require that you find the best values to use for the following three parameters that are needed by the constructor: (1) min_idf_threshold for the minimum idf value for the words in a query ticket in order for them to be considered for matching with the other tickets; (2) min_word_length for discarding words that are too short; and (3) max_num_syn_words for how many synonyms to retain for a word if the number of synonyms returned by WordNet is too large. In addition, you must also come up with a misspelled-words file that is appropriate to your application domain and a stop-words file.

In order to find the best values to use for the parameters that are mentioned above, I suggest creating a graphical front-end for this module that would allow for altering the values of the three parameters listed above in response to the prevailing mis-routing rates for the tickets. The front-end will display to an operator the latest ticket that needs to be routed and a small set of the best-matching previously routed tickets as returned by this module. Used either in a fully-automated mode or a semi-automated mode, this front-end would contain a feedback recorder that would keep track of mis-routed tickets --- the mis-routed tickets would presumably bounce back to the central operator monitoring the front-end. The front-end display could be equipped with slider controls for altering the values used for the three parameters. Obviously, as a parameter is changed, some of the database files stored on the disk would need to be recomputed. The same would be the case if you make changes to the misspelled-words file or to the stop-words file.

REQUIRED

This module requires the following five modules:

    Spreadsheet::ParseExcel
    Spreadsheet::XLSX
    WordNet::QueryData
    Storable
    SDBM_File

the first for extracting information from the old-style Excel sheets that are commonly used for storing tickets, the second for extracting the same information from the new-style Excel sheets, the third for interfacing with WordNet to extract the synonyms and antonyms, the fourth for creating the various disk-based database files needed by the module, and the last for the disk-based hashes that lend persistence to the vocabulary extracted from the tickets and to the inverse document frequencies of the words.

EXPORT

None by design.

CAVEATS

An automated or semi-automated ticket router based on the concepts incorporated in this module may not be appropriate for all applications, especially in domains where highly jargonified expressions are used to describe faults and problems associated with an application.

BUGS

Please notify the author if you encounter any bugs. When sending email, please place the string 'TicketClusterer' in the subject line to get past my spam filter.

INSTALLATION

Download the archive from CPAN in any directory of your choice. Unpack the archive with a command that on a Linux machine would look like:

    tar zxvf Algorithm-TicketClusterer-1.01.tar.gz

This will create an installation directory for you whose name will be Algorithm-TicketClusterer-1.01. Enter this directory and execute the following commands for a standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install of the module in any directory of your choice by:

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

With a non-standard install, you may also have to set your PERL5LIB environment variable so that this module can find the other modules it requires. How you do that would depend on what platform you are working on. To install this module on a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment variable by

    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

THANKS

I owe Shivani Rao many thanks for sharing with me the deep insights she has developed over the years in practically every facet of information retrieval.

AUTHOR

Avinash Kak, kak@purdue.edu

If you send email, please place the string "TicketClusterer" in your subject line to get past my spam filter.

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

 Copyright 2014 Avinash Kak