NAME

Text::SenseClusters::LabelEvaluation::ReadingFilesData - Module for reading the data from a file as a single string object.

SYNOPSIS

        The following code snippet will show how to use this module.

Example 1: Reading the label file generated by SenseClusters.

                use Text::SenseClusters::LabelEvaluation::ReadingFilesData;
                
                # Name of the cluster labels file generated by SenseClusters.
                my $clusterFileName = "TVS.label";
        
                # Alternatively, take the file name from a driver object:
                # $clusterFileName = $driverObject->{$senseClusterLabelFileName};
        
                # Creating the read file object and reading the label examples.
                my $readClusterFileObject = 
                                Text::SenseClusters::LabelEvaluation::ReadingFilesData->new ($clusterFileName);
                my %labelSenseClustersHash = ();
                my $labelSenseClustersHashRef = 
                                $readClusterFileObject->readLinesFromClusterFile(\%labelSenseClustersHash);
                %labelSenseClustersHash = %$labelSenseClustersHashRef;
                                        
                # Iterating the Hash to print the value.
                foreach my $key (sort keys %labelSenseClustersHash){
                        foreach my $innerkey (sort keys %{$labelSenseClustersHash{$key}}){
                                print "$key :: $innerkey :: $labelSenseClustersHash{$key}{$innerkey} \n";
                        }
                }       
        
        

Example 2: Reading the user provided Gold Standard keys and their data.

                use Text::SenseClusters::LabelEvaluation::ReadingFilesData;
                # Reading the topic file name.
                my $topicsFileName = "TVS.txt";
        
                # Creating the read object, which will read the gold-standard keys and data provided by user.
                my $readFileObject =
                        Text::SenseClusters::LabelEvaluation::ReadingFilesData->new($topicsFileName);
        
                # Reading the Mapping with help of function.
                my ( $hashRef, $topicArrayRef ) = $readFileObject->readMappingFromTopicFile();
                
                # Reading the hash from its reference.
                my %mappingHash = %$hashRef;
                my @topicArray  = @$topicArrayRef;
                # Iterating the Hash to print the value.
                foreach my $key ( sort keys %mappingHash ) {
                        print "$key=$mappingHash{$key}\n";
                }
                # Iterating the array to print the topics.
                foreach my $key (@topicArray) {
                        print "$key\n";
                }

DESCRIPTION

        This module provides functions to read the label and topic files.
        
        The first function reads the labelled data generated by SenseClusters and 
        creates a hash from it. The format of the input file must match the format 
        of the label file generated by SenseClusters. 
        
        The second function reads a file into a string variable by removing all the 
        newline characters from it.
        
        The remaining functions read the user-provided file that contains the mapping 
        of cluster labels to gold standard keys, and/or data about the gold standard
        keys or the list of topics.
                        

Constructor: new()

This is the constructor, which creates an object of this class. Reference: http://perldoc.perl.org/perlobj.html

This constructor takes the following argument: 1. $fileNameArg : The name of the file whose data has to be read.

Function: readLinesFromClusterFile

This function reads lines from the file containing the cluster labels and builds a hash from them.

@argument1 : Name of the cluster file.

@argument2 : Reference of Hash ($labelSenseClustersHash) which will hold the information in the following format:

                Example:          Cluster0{
                                        Descriptive    => George Bush, Al Gore, White House, New York
                                        Discriminating => George Bush, York Times       
                                  } 
                                  Cluster1{
                                        Descriptive    => George Bush, BRITAIN London, Prime Minister
                                        Discriminating => BRITAIN London, Prime Minister        
                                  } 

@return : It will return the reference of the Hash mentioned above: $labelSenseClustersHashRef.

@description :

1. Read the file line by line.

2. Ignore the lines that do not follow one of the following formats:

        Cluster 0 (Descriptive): George Bush, Al Gore, White House, New York
        Cluster 0 (Discriminating): George Bush, BRITAIN London

3. Create the keys from "Cluster # (Descriptive)" or "Cluster # (Discriminating)" as "OuterKey: Cluster#" and "InnerKey: Descriptive" (or "Discriminating").

4. Store the keywords as the value in the hash, as in the example above, e.g.:

        $labelSenseClustersGlobalRef{Cluster0}{Discriminating} = "BRITAIN London, Prime Minister";
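
The steps above can be sketched as follows. This is a minimal illustration and not the module's actual implementation: the helper name parse_cluster_labels and the regular expression are assumptions, and the input is read from an in-memory string so that the example runs without a label file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): parse SenseClusters-style
# label lines into a nested hash keyed by cluster and label type.
sub parse_cluster_labels {
    my ($text) = @_;
    my %labels;
    open my $fh, '<', \$text or die "Cannot open in-memory text: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # Keep only "Cluster N (Descriptive): ..." or
        # "Cluster N (Discriminating): ..." lines; ignore everything else.
        next unless $line =~
            /^\s*Cluster\s+(\d+)\s+\((Descriptive|Discriminating)\)\s*:\s*(.*?)\s*$/;
        # Outer key: "Cluster#", inner key: label type, value: keywords.
        $labels{"Cluster$1"}{$2} = $3;
    }
    close $fh;
    return \%labels;
}

my $sample = <<'END';
A header line that does not match and is ignored
Cluster 0 (Descriptive): George Bush, Al Gore, White House, New York
Cluster 0 (Discriminating): George Bush, York Times
Cluster 1 (Descriptive): George Bush, BRITAIN London, Prime Minister
END

my $labels = parse_cluster_labels($sample);
for my $cluster (sort keys %$labels) {
    for my $type (sort keys %{ $labels->{$cluster} }) {
        print "$cluster :: $type :: $labels->{$cluster}{$type}\n";
    }
}
```

The sketch returns a hash reference, mirroring the @return described above, so the caller can dereference it the same way as in Example 1 of the SYNOPSIS.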

Function: readLinesFromTopicFile

This function reads lines from the topic file and returns the list of all the topics as a single string.

@argument1 : Name of the topic file.

@return : String containing the list of all the topics (labels) for the clusters.

@description : 1. Read the file line by line. 2. Remove the newline characters and build a string variable that contains the list of all the topics.
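
As a rough illustration of this behaviour, the sketch below joins all lines of a filehandle into one string. The helper name slurp_without_newlines is an assumption made for this example, and the input comes from an in-memory string rather than a real topic file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): read every line from a
# filehandle and concatenate them after stripping newline characters.
sub slurp_without_newlines {
    my ($fh) = @_;
    my $content = '';
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+//g;    # drop newline and carriage-return characters
        $content .= $line;
    }
    return $content;
}

my $text = "topic1, topic2,\ntopic3\n";
open my $fh, '<', \$text or die "Cannot open in-memory text: $!";
my $topics = slurp_without_newlines($fh);
close $fh;
print "$topics\n";
```

In this sketch the lines are concatenated without inserting a space, so words at a line boundary run together unless the file already separates them.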

Function: readMappingFromTopicFile

This function will read mapping provided by the user for the Cluster's label (Cluster#) and gold standard key(topic-name).

        Syntax of the file:
                <Cluster><#><Separator(:::)><topic>
        Example:
                 Cluster0:::topic1
                 Cluster1:::topic2
                 Cluster2:::topic0

@argument : $readFileObject : Object of the current file.

@return1 : \%clusterTopicMappingHash : DataType : (Reference to Hash) Reference of Hash containing the mapping between the Cluster's label and gold standard key.

@return2 : \@topicArray : DataType : (Reference to array) Reference of array containing the gold standard keys.

@description :

1. Read the file line by line.

2. Check whether the line contains "Cluster#:::".

3. Split such a line on the separator ":::".

4. If the resulting word array does not have 2 elements, ignore the line.

5. Ignore all the remaining lines.

        Reason for selecting the separator as ":::"
                1. It is very unlikely to occur in a document or text, so it can
                   be treated as a unique delimiter.
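
The mapping format can be parsed along the following lines. This is only a sketch under stated assumptions: the helper name parse_cluster_topic_mapping is invented for the example, it takes the lines as a list instead of reading a file, and it matches "Cluster" case-insensitively.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): extract "Cluster#:::topic"
# pairs from a list of lines. Returns a hash reference (cluster => topic)
# and an array reference with the topics in order of appearance.
sub parse_cluster_topic_mapping {
    my (@lines) = @_;
    my (%mapping, @topics);
    for my $line (@lines) {
        chomp(my $copy = $line);
        # Only lines of the form "Cluster#:::..." are of interest.
        next unless $copy =~ /^\s*Cluster\d+:::/i;
        my @parts = split /:::/, $copy;
        # Anything that does not split into exactly two fields is ignored.
        next unless @parts == 2;
        my ($cluster, $topic) = @parts;
        for ($cluster, $topic) { s/^\s+//; s/\s+$//; }   # trim whitespace
        $mapping{$cluster} = $topic;
        push @topics, $topic;
    }
    return (\%mapping, \@topics);
}

my ($mappingRef, $topicsRef) = parse_cluster_topic_mapping(
    "Cluster0:::topic1\n",
    "Cluster1:::topic2\n",
    "a line without the separator\n",
    "Cluster2:::topic0\n",
);
print "$_=$mappingRef->{$_}\n" for sort keys %$mappingRef;
print "$_\n" for @$topicsRef;
```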
                   

Function: readTopicDataFromTopicFile

This function will read data about the gold standard key(topic-name).

        Syntax of the file:
                <topicName><Separator(:::)><multi lines topic data>
        Example:

        topic1:::data1, data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
        data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
        topic2:::data2, data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2
        data2 data2 data2 data2 data2 data2 data2 data2 data2 data1 data1 data1 data1 data1     
        

@argument : $readFileObject : Object of the current file.

@return : \%topicDataHash : DataType : (Reference to Hash) Reference of Hash containing the topics and their corresponding data.

@description :

1. Read the file line by line.

2. Check whether the line contains ":::" and starts with one of the topics: a. This indicates the start of the topic's data. b. Keep reading lines until another "topic-name:::" or "cluster#:::" is encountered.

3. Build a hash with the topic as the key and the topic's data as the value.

4. Return the reference of this hash.
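
A possible way to collect the multi-line topic data is sketched below. It is not the module's code: the helper name parse_topic_data is hypothetical, and the sketch assumes that any key before ":::" that does not look like "cluster#" is a topic name and that topic names contain no whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): gather the data that
# follows each "topic:::" marker, including continuation lines.
sub parse_topic_data {
    my (@lines) = @_;
    my (%data, $current);
    for my $line (@lines) {
        chomp(my $copy = $line);
        $copy =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        next if $copy eq '';
        if ($copy =~ /^(\S+?):::(.*)$/) {
            my ($key, $rest) = ($1, $2);
            if ($key =~ /^cluster\d+$/i) {
                $current = undef;          # a mapping line ends the topic
                next;
            }
            $current = $key;               # start of a new topic's data
            $data{$current} = $rest;
        }
        elsif (defined $current) {
            $data{$current} .= " $copy";   # continuation of the topic's data
        }
    }
    return \%data;
}

my $topicData = parse_topic_data(
    "topic1:::data1, data1 data1\n",
    "data1 data1 data1\n",
    "topic2:::data2, data2 data2\n",
    "cluster0:::topic1\n",
);
print "$_ => $topicData->{$_}\n" for sort keys %$topicData;
```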

Function: readTopicNamesFromTopicFile

This function will list all the topics from the file provided by user.

        Syntax of the file:
                <Cluster#><Separator(:::)><topicName>
                <topicName><Separator(:::)><multi lines topic data>
                <topicName><Separator(:::)><multi lines topic data>
                <topicName><Separator(:::)><multi lines topic data>
                <Cluster#><Separator(:::)><topicName>
                <Cluster#><Separator(:::)><topicName>
                
        Example:

        topic1:::data1, data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
        data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1 data1
        topic2:::data2, data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2 data2
        data2 data2 data2 data2 data2 data2 data2 data2 data2 data1 data1 data1 data1 data1
        cluster0:::topic1
        cluster1:::topic2       
        cluster2:::topic0
        

@argument : $readFileObject : Object of the current file.

@return : \@topicNameArray : DataType : (Reference to array) Reference of array containing the list of topics.

@description :

1. Read the file line by line.

2. Check whether the line contains ":::". a. If it starts with "cluster", ignore it. b. Otherwise, split the line on the separator ":::" and store the results in an array. c. The first element of this array is the topic name. d. Push this topic name into the array of topic names.

3. Return the reference of the array of topic names.

Reason for selecting the separator as ":::" 1. It is very unlikely to occur in a document or text, so it can be treated as a unique delimiter.
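
The topic-name extraction described for this function could look roughly like the sketch below. The helper name parse_topic_names is an assumption for illustration, and the lines are passed in as a list instead of being read from the user's file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): return the topic names
# found before ":::" on every line that is not a "cluster#:::topic" line.
sub parse_topic_names {
    my (@lines) = @_;
    my @names;
    for my $line (@lines) {
        next unless $line =~ /:::/;           # only lines with the separator
        next if $line =~ /^\s*cluster/i;      # skip cluster mapping lines
        my ($name) = split /:::/, $line;      # first field is the topic name
        next unless defined $name;
        $name =~ s/^\s+//;
        $name =~ s/\s+$//;
        push @names, $name if length $name;
    }
    return \@names;
}

my $names = parse_topic_names(
    "topic1:::data1, data1 data1\n",
    "data1 data1 data1\n",
    "topic2:::data2, data2 data2\n",
    "cluster0:::topic1\n",
    "cluster1:::topic2\n",
);
print "$_\n" for @$names;
```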

SEE ALSO

http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/

Last modified by : $Id: ReadingFilesData.pm,v 1.5 2013/03/07 23:15:49 jhaxx030 Exp $

AUTHORS

        Anand Jha, University of Minnesota, Duluth
        jhaxx030 at d.umn.edu

        Ted Pedersen, University of Minnesota, Duluth
        tpederse at d.umn.edu

COPYRIGHT AND LICENSE

Copyright (C) 2012-2013 Ted Pedersen, Anand Jha

See http://dev.perl.org/licenses/ for more information.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

        The Free Software Foundation, Inc., 59 Temple Place, Suite 330, 
        Boston, MA  02111-1307  USA