NAME

Word2vec-Interface.pl - Word2Vec Package Driver

SYNOPSIS

This program houses a set of functions and utilities for use with UMLS Similarity.

USAGE

Usage: Word2vec-Interface.pl [OPTIONS]

Command-Line Arguments

List of command-line options.

--test

Description:

 Executes word2vec file and run-time checks.

Parameters:

 None

Output:

 None

Example:

 Word2vec-Interface.pl --test

--cos

Description:

 Computes cosine similarity of two words given a trained vector file.

Parameters:

 vector_binary_path (String)
 wordA              (String)
 wordB              (String)

Output:

 Cosine Similarity Value

Example:

 Word2vec-Interface.pl --cos "samples/samplevectors.bin" heart angina
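
 As a rough illustration of what the value reported by --cos represents, the following Python sketch computes the cosine similarity of two vectors. The function name and the toy vectors are hypothetical; the script's own implementation is not shown in this document and may differ, for example in how it treats zero-length vectors.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for "heart" and "angina".
heart = [1.0, 2.0, 3.0]
angina = [2.0, 4.0, 6.0]
print(cosine_similarity(heart, angina))
```

 Vectors that point in the same direction score close to 1, and orthogonal vectors score 0.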

--cosmulti

Description:

 Computes cosine similarity between multiple words given a trained vector file.

 Note: There is no limit to the number of words that can be concatenated with the colon character for each parameter.

   ie: --cosmulti "vectors.bin" acute:heart:attack chronic:obstructive:pulmonary:disease

Parameters:

 vector_binary_path (String)
 wordA1:wordA2      (String)
 wordB1:wordB2      (String)

Output:

 Cosine Similarity Value

Example:

 Word2vec-Interface.pl --cosmulti "samples/samplevectors.bin" heart:attack myocardial:infarction

--cosavg

Description:

 Computes cosine similarity average between multiple words given a trained vector file.

 Note: There is no limit to the number of words that can be concatenated with the colon character for each parameter.

   ie: --cosavg "vectors.bin" heart:attack six:sea:snakes:were:sailing

Parameters:

 vector_binary_path (String)
 wordA1:wordA2      (String)
 wordB1:wordB2      (String)

Output:

 Cosine Similarity Value

Example:

 Word2vec-Interface.pl --cosavg "samples/samplevectors.bin" heart:attack myocardial:infarction
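
 One plausible reading of --cosavg is that the vectors in each colon-separated group are averaged element-wise and the two averages are then compared by cosine similarity. The Python sketch below works under that assumption; the function names and the toy embeddings are hypothetical, and the script itself may compute the average differently.

```python
import math

def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_similarity(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cosine_of_averages(words_a, words_b, embeddings):
    """Split 'w1:w2' style arguments, average each group, then compare."""
    group_a = [embeddings[w] for w in words_a.split(":")]
    group_b = [embeddings[w] for w in words_b.split(":")]
    return cosine_similarity(average_vectors(group_a), average_vectors(group_b))

# Made-up 2-dimensional embeddings, for illustration only.
toy = {
    "heart": [1.0, 0.0],
    "attack": [0.0, 1.0],
    "myocardial": [1.0, 1.0],
    "infarction": [3.0, 3.0],
}
print(cosine_of_averages("heart:attack", "myocardial:infarction", toy))
```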

--cos2v

Description:

 Computes cosine similarity between two words, each in differing trained vector data files.

Parameters:

 vector_data_fileA_path (String)
 wordA                  (String)
 vector_data_fileB_path (String)
 wordB                  (String)

Output:

 Cosine Similarity Value

Example:

 Word2vec-Interface.pl --cos2v "samples/medline_vectors.bin" heart "samples/pubmed_vectors.bin" infarction

--multiwordcosuserinput

Description:

 Computes cosine similarity based on user input on a trained vector file.

 Note: There is no limit to the number of words that can be concatenated with the colon character for each comparison string.

Parameters:

 vector_binary_path

Output:

 None

Example:

 Word2vec-Interface.pl --multiwordcosuserinput "samples/samplevectors.bin"

--addvectors

Description:

 Adds two word vectors and prints the value.

Parameters:

 vector_binary_path (String)
 wordA              (String)
 wordB              (String)

Output:

 Summed word vectors string

Example:

 Word2vec-Interface.pl --addvectors "samples/samplevectors.bin" heart attack

--subtractvectors

Description:

 Subtracts two word vectors and outputs the value.

Parameters:

 vector_binary_path (String)
 wordA              (String)
 wordB              (String)

Output:

 Difference between word vectors string.

Example:

 Word2vec-Interface.pl --subtractvectors "vectors.bin" heart attack
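
 The --addvectors and --subtractvectors options can be pictured as element-wise operations on two word vectors. The Python sketch below assumes exactly that; the function names, the toy vectors, and the space-separated output string are hypothetical, since the script's exact output format is not documented here.

```python
def add_vectors(vec_a, vec_b):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(vec_a, vec_b)]

def subtract_vectors(vec_a, vec_b):
    """Element-wise difference (vec_a minus vec_b)."""
    return [a - b for a, b in zip(vec_a, vec_b)]

def to_vector_string(vector):
    """Join components with spaces (an assumed, illustrative format)."""
    return " ".join(str(x) for x in vector)

# Made-up 2-dimensional vectors standing in for "heart" and "attack".
heart = [1.0, 2.0]
attack = [0.5, 0.25]
print(to_vector_string(add_vectors(heart, attack)))
print(to_vector_string(subtract_vectors(heart, attack)))
```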

--w2vtrain

Description:

 Executes word2vec training based on user-specified options.

Parameters:

 -trainfile       file_path         (String)
 -outputfile      file_path         (String)
 -size            x                 (Integer)
 -window          x                 (Integer)
 -mincount        x                 (Integer)
 -sample          x.x               (Float)
 -negative        x                 (Integer)
 -alpha           x.x               (Float)
 -hs              x                 (Integer)
 -binary          x                 (Integer)
 -threads         x                 (Integer)
 -iter            x                 (Integer)
 -cbow            x                 (Integer)
 -classes         x                 (Integer)
 -read-vocab      file_path         (String)
 -save-vocab      file_path         (String)
 -debug           x                 (Integer)
 -overwrite       x                 (Integer)

 Note: Minimal required parameters to run are -trainfile and -outputfile. All other parameters not specified will be set to default settings.

Output:

 None

Example:

 Word2vec-Interface.pl --w2vtrain -trainfile "../../samples/textcorpus.txt" -outputfile "../../samples/tempvectors.bin" -size 200 -window 8 -sample 0.0001 -negative 25 -hs 0 -binary 0 -threads 20 -iter 15 -cbow 1 -overwrite 1

--w2ptrain

Description:

 Executes word2phrase conversion based on user-specified options and text corpus.

Parameters:

 -trainfile     file_path       (String)
 -outputfile    file_path       (String)
 -mincount      x               (Integer)
 -threshold     x               (Integer)
 -debug         x               (Integer)
 -overwrite     x               (Integer)

 Note: Minimal required parameters to run are -trainfile and -outputfile. All other parameters not specified will be set to default settings.

Example:

 Word2vec-Interface.pl --w2ptrain -trainfile "../../samples/textcorpus.txt" -outputfile "../../samples/phrasecorpus.txt" -min-count 10 -threshold -200 -overwrite 1

--cleantext

Description:

 Cleans text based on XML-to-W2V text corpus generation text normalization methods.
   - All Text Converted To Lowercase
   - Duplicate White Spaces Removed
   - "'s" (Apostrophe 's') Characters Removed
   - Hyphen "-" Replaced With Whitespace
   - All Characters Outside Of "a-z" and NewLine Characters Are Removed
   - Lastly, Whitespace Before And After Text Is Removed

Parameters:

 -inputfile       file_path       (String)
 -outputfile      file_path       (String)

 Note: Minimal required parameter to run is "-inputfile". All other parameters not specified will be set to default settings.

Example:

 Word2vec-Interface.pl --cleantext -inputfile "../../samples/text.txt"
 Word2vec-Interface.pl --cleantext -inputfile "../../samples/text.txt" -outputfile "../../samples/cleaned_text.txt"
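
 The normalization steps listed in the description can be sketched in Python as follows. This is an approximation under stated assumptions: the function name is hypothetical, spaces are assumed to be kept alongside "a-z" and newline characters, leading and trailing whitespace is trimmed per line, and the order of the steps is chosen for this sketch so that no double spaces remain after hyphens and other characters are removed.

```python
import re

def clean_text(text):
    text = text.lower()                      # all text converted to lowercase
    text = text.replace("'s", "")            # drop apostrophe-s
    text = text.replace("-", " ")            # hyphen becomes whitespace
    text = re.sub(r"[^a-z \n]", "", text)    # keep only a-z, space, newline
    text = re.sub(r" {2,}", " ", text)       # collapse duplicate spaces
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(lines).strip()          # trim surrounding whitespace

sample = "  The patient's  heart-rate was 72 BPM.\nFollow-up in 2 weeks.  "
print(clean_text(sample))
```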

--compiletextcorpus

Description:

 Executes Medline XML-To-W2V text corpus generation based on user-specified options.

Parameters:

 -workdir       file_path       (String)
 -savedir       file_path       (String)
 -startdate     "XX/XX/XXXX"    (String)
 -enddate       "XX/XX/XXXX"    (String)
 -title         x               (Integer)
 -abstract      x               (Integer)
 -qparse        x               (Integer)
 -compwordfile  file_path       (String)
 -threads       x               (Integer)
 -overwrite     x               (Integer)

 Note: Minimal required parameter to run is "-workdir". All other parameters not specified will be set to default settings.

Example:

 Word2vec-Interface.pl --compiletextcorpus -workdir "../../samples"
 Word2vec-Interface.pl --compiletextcorpus -workdir "../../samples" -savedir "../../samples/textcorpus.txt" -startdate 01/01/1900 -enddate 99/99/9999 -title 1 -abstract 1 -qparse 1 -compwordfile "../../samples/compoundword.txt" -threads 2 -overwrite 1

--converttotextvectors

Description:

 Converts user-specified word2vec binary formatted file to human-readable text.

 Note: This will freely convert all formats to plain text format.

Parameters:

 input_file_path  (String)
 output_file_path (String)

Output:

 None

Example:

 Word2vec-Interface.pl --converttotextvectors "binaryvectors.bin" "textvectors.bin"

--converttobinaryvectors

Description:

 Converts user-specified vector text data to word2vec binary formatted file.

 Note: This will freely convert all formats to word2vec binary format.

Parameters:

 input_file_path  (String)
 output_file_path (String)

Output:

 None

Example:

 Word2vec-Interface.pl --converttobinaryvectors "textvectors.bin" "binaryvectors.bin"

--converttosparsevectors

Description:

 Converts user-specified vector text data to sparse vector data formatted file.

 Note: This will freely convert all formats to sparse vector data format.

Parameters:

 input_file_path  (String)
 output_file_path (String)

Output:

 None

Example:

 Word2vec-Interface.pl --converttosparsevectors "textvectors.bin" "sparsevectors.bin"

--compoundifyfile

Description:

 Compoundifies file based on user-specified compound word file.

Parameters:

 input_file             (String)
 output_file            (String)
 compound_word_file     (String)

Output:

 Compoundified file using 'compound_word_file' data at 'output_file' path.

Example:

 Word2vec-Interface.pl --compoundifyfile "samples/textcorpus.txt" "samples/compoundedtext.txt" "samples/compoundword.txt"

--sortvectorfile

Description:

 Sorts specified vector file in alphanumeric order.

Parameters:

 input_file                                           (String)
 -overwrite    1 = Overwrite / 0 = Save to new file   (Integer)

Output:

 Generates a sorted vector file, either replacing the old file or saving to the file "sortedvectors.bin".

Example:

 Word2vec-Interface.pl --sortvectorfile "vectors.bin"

 Or

 Word2vec-Interface.pl --sortvectorfile "vectors.bin" -overwrite 1

 Or

 Word2vec-Interface.pl --sortvectorfile "vectors.bin" -overwrite 0

--findsimilarterms

Description:

 Prints the nearest n terms using cosine similarity as the metric of determining similar terms.

Parameters:

 -vectors   vector_binary_file          (String)
 -term      term                        (String)
 -neighbors number_of_similar_neighbors (Integer)

 ( Optional Parameter(s) )
 -threads   number of threads           (Integer)

Output:

 "number_of_similar_neighbors" value nearest similar terms using cosine similarity.

Example:

 Word2vec-Interface.pl --findsimilarterms -vectors vectors.bin -term heart -neighbors 10
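
 A nearest-neighbor lookup of this kind can be sketched in Python as below. The function names and toy embeddings are hypothetical, and the sketch assumes the query term itself is excluded from the results; the script's actual behavior, including how "-threads" divides the work, is not shown in this document.

```python
import heapq
import math

def cosine_similarity(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_similar_terms(term, embeddings, neighbors):
    """Return the `neighbors` terms most similar to `term`, best first."""
    target = embeddings[term]
    scored = (
        (cosine_similarity(target, vector), word)
        for word, vector in embeddings.items()
        if word != term  # assumption: the query term is not its own neighbor
    )
    return [(word, score) for score, word in heapq.nlargest(neighbors, scored)]

# Made-up 2-dimensional embeddings, for illustration only.
toy = {
    "heart": [1.0, 0.0],
    "cardiac": [0.9, 0.1],
    "lung": [0.5, 0.5],
    "table": [0.0, 1.0],
}
for word, score in find_similar_terms("heart", toy, 2):
    print(word, round(score, 4))
```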

--spearmans

Description:

 Computes Spearman's Rank Correlation Score between two files of a specific format.

 File Format:
 "score(float)<>term1<>term2"
 "score(float)<>term3<>term4"

 Note: Optional Parameters: -n -> Prints N value with Spearman's Rank Correlation Score

Parameters:

 input_file_a     (String)
 input_file_b     (String)
 (Optional Parameters)

Output:

 Spearman's Rank Correlation Score.

Example:

 Word2vec-Interface.pl --spearmans "samples/MiniMayoSRS.terms.comp_results" "Similarity/MiniMayoSRS.terms.coders"

 Or

 Word2vec-Interface.pl --spearmans "samples/MiniMayoSRS.terms.comp_results" "Similarity/MiniMayoSRS.terms.coders" -n
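
 To make the file format and the reported score concrete, the Python sketch below parses "score<>term1<>term2" lines and computes Spearman's rank correlation as the Pearson correlation of the ranks, with tied scores sharing their mean rank. The function names are hypothetical. The sketch assumes the two files are matched by term pair and that N is the number of shared pairs; the script may match entries differently.

```python
import math

def parse_scores(lines):
    """Map (term1, term2) -> score from 'score<>term1<>term2' lines."""
    scores = {}
    for line in lines:
        line = line.strip()
        if line:
            score, term1, term2 = line.split("<>")
            scores[(term1, term2)] = float(score)
    return scores

def average_ranks(values):
    """Rank values 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(lines_a, lines_b):
    """Return (correlation, n) over the term pairs present in both inputs."""
    a, b = parse_scores(lines_a), parse_scores(lines_b)
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared term pairs")
    ranks_a = average_ranks([a[pair] for pair in shared])
    ranks_b = average_ranks([b[pair] for pair in shared])
    mean_a, mean_b = sum(ranks_a) / n, sum(ranks_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean_a) ** 2 for x in ranks_a)
    var_b = sum((y - mean_b) ** 2 for y in ranks_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / math.sqrt(var_a * var_b), n

# Made-up scores, for illustration only.
computed = ["0.9<>heart<>angina", "0.5<>lung<>cough", "0.1<>knee<>table"]
gold = ["4.0<>heart<>angina", "2.5<>lung<>cough", "1.0<>knee<>table"]
print(spearman(computed, gold))
```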

--similarity

Description:

 Computes average, compound and summed cosine similarity values for a list of word comparisons in a specified file or directory.
 When using a directory of files, files to be parsed must end with ".sim" extension.

 Note: Optional Parameters: -all -> Computes Average, Compound and Summed files
                            -a   -> Only computes Average file
                            -c   -> Only computes Compound file
                            -s   -> Only computes Summed file

       Specifying no optional parameters implies "-all". Parameters can be combined to produce multiple results. See examples below.

Parameters:

 -sim     input_file             (String)
 -vectors vector_binary_file     (String)
 (Optional Parameters)

Output:

 Generates a text file with a list of cosine similarity values followed by the word pairs.

Example:

 Word2vec-Interface.pl --similarity -sim "samples/MiniMayoSRS.terms" -vectors "vectors.bin"

 Or

 Word2vec-Interface.pl --similarity -sim "samples/MiniMayoSRS.terms" -vectors "vectors.bin" -all

 Or

 Word2vec-Interface.pl --similarity -sim "samples/MiniMayoSRS.terms" -vectors "vectors.bin" -a -s

 Or

 Word2vec-Interface.pl --similarity -sim "samples/MiniMayoSRS.terms" -vectors "vectors.bin" -c

--wsd

Description:

 Word Sense Disambiguation: Reads an instance file and a sense file in SVL format, removes stop words using the user-specified stoplist, and assigns a sense
 identification number to each instance identification number by comparing every sense to the instance using cosine similarity. The sense with the highest
 cosine similarity to an instance is assigned to that instance.

 Warning: WSD instance and sense files must be in SVL format.

Parameters:

 No parameters        <- This will prompt the user to input required files for WSD processing. (Must be in SVL format)

 Or

 -instances  file_path              (String)
 -senses     file_path              (String)
 -vectors    vector_binary_file     (String)
 -stoplist   file_path              (String) <- (Not required)

 Or

 -dir        directory_of_files     (String)
 -vectors    vector_binary_file     (String)
 -stoplist   file_path              (String) <- (Not required)

 Or

 -list       file_path              (String)

 Note: "-list" parameter requires the input file to meet format specifications for use. See example "samples/wsdlist.txt" for details.
 Note: "-dir" parameter requires the user to specify the "-vectors" file path. "-stoplist" parameter is not required.

Output:

 None

Examples:

 Word2vec-Interface.pl --wsd -instances "ACE.instances.sval" -senses "ACE.senses.sval" -vectors "vectors.bin"
 Word2vec-Interface.pl --wsd -instances "ACE.instances.sval" -senses "ACE.senses.sval" -vectors "vectors.bin" -stoplist "stoplist"

 Word2vec-Interface.pl --wsd -dir "../../wsd" -vectors vectors.bin
 Word2vec-Interface.pl --wsd -dir "../../wsd" -vectors vectors.bin -stoplist "../../stoplist"

 Word2vec-Interface.pl --wsd -list "../../wsd/abbrevlist.txt"
 Word2vec-Interface.pl --wsd -list "../../wsd/abbrevlist.txt" -vectors vectors.bin
 Word2vec-Interface.pl --wsd -list "../../wsd/abbrevlist.txt" -vectors vectors.bin -stoplist "../../wsd/stoplist"
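
 The selection step described above (choose the sense with the highest cosine similarity to the instance) can be sketched in Python as follows. How the script builds a vector for an instance or a sense is not stated in this document; the sketch assumes the word vectors of the remaining words are averaged after stop-word removal. All function names, sense identifiers, and embeddings here are hypothetical, and SVL file parsing is left out.

```python
import math

def cosine_similarity(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_vector(text, embeddings, stoplist):
    """Average the vectors of known, non-stop words (assumed representation)."""
    words = [w for w in text.lower().split()
             if w not in stoplist and w in embeddings]
    if not words:
        return None
    dims = len(embeddings[words[0]])
    return [sum(embeddings[w][d] for w in words) / len(words) for d in range(dims)]

def assign_sense(instance_text, senses, embeddings, stoplist=frozenset()):
    """Return (sense_id, score) for the sense closest to the instance."""
    instance_vec = text_vector(instance_text, embeddings, stoplist)
    best_id, best_score = None, -2.0  # below any possible cosine value
    for sense_id, sense_text in senses.items():
        sense_vec = text_vector(sense_text, embeddings, stoplist)
        if instance_vec is None or sense_vec is None:
            continue
        score = cosine_similarity(instance_vec, sense_vec)
        if score > best_score:
            best_id, best_score = sense_id, score
    return best_id, best_score

# Made-up embeddings and sense definitions, for illustration only.
toy = {
    "temperature": [1.0, 0.0], "winter": [0.9, 0.1],
    "virus": [0.0, 1.0], "infection": [0.1, 0.9], "the": [0.5, 0.5],
}
senses = {"M1": "temperature", "M2": "virus infection"}
print(assign_sense("the winter temperature", senses, toy, stoplist={"the"}))
```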

--clean

Description:

 Cleans up word2vec directory. Removes C object and executable files.

 This is useful when moving the development directory between computers with different CPU architectures (x86/x64) and attempting to run word2vec executable files.
 Errors could occur when trying to run a 64-bit executable on a 32-bit machine. Cleaning up the word2vec directory and re-building the executable files
 resolves this issue.

Parameters:

 None

Output:

 None

Example:

 Word2vec-Interface.pl --clean

--version

Description:

 Displays the version information.

Parameters:

 None

Output:

 Displays version information to the console.

 Note: '--debuglog' and '--writelog' can also be combined to print debug statements to the console and write to their log files.

Example:

 Word2vec-Interface.pl --version

--help

Description:

 Displays the quick summary of program options.

Parameters:

 None

Output:

 Displays help information to the console.

Example:

 Word2vec-Interface.pl --help

Debugging Arguments

List of debugging options.

--debuglog

Description:

 Prints debugging statements to the console window.

 Note: This parameter can be specified anywhere within the parameter string.

Parameters:

 None

Output:

 Prints real-time debug log to the console window.

Examples:

 Word2vec-Interface.pl --test --debuglog

 Word2vec-Interface.pl --debuglog --test

 Word2vec-Interface.pl --debuglog --wsd -list "samples/wsd/abbrevlist.txt"

 Word2vec-Interface.pl --debuglog --w2vtrain -trainfile "samples/textcorpus.txt" -outputfile "samples/tempvectors.bin"

--writelog

Description:

 Writes debugging statements to log module files.

 Note: This parameter can be specified anywhere within the parameter string.

Parameters:

 None

Output:

 Writes debug log statements to specified log files. Each module will write to its respective log file.
 ie. 'interface.pm' module will write to log file 'InterfaceLog.txt'.

Examples:

 Word2vec-Interface.pl --test --writelog

 Word2vec-Interface.pl --writelog --test

 Word2vec-Interface.pl --writelog --wsd -list "samples/wsd/abbrevlist.txt"

 Word2vec-Interface.pl --writelog --w2vtrain -trainfile "samples/textcorpus.txt" -outputfile "samples/tempvectors.bin"

Command-Line Notes

Note that when using command-line parameters, multiple commands are supported.

ie. Word2vec-Interface.pl --compiletextcorpus -workdir "samples" -savedir "samples/textcorpus.txt" --w2vtrain -trainfile "samples/textcorpus.txt" -outputfile "samples/tempvectors.bin" --cos "samples/tempvectors.bin" of the

This string of commands instructs the script to compile a text corpus from the Medline XML files in the "samples" directory, run word2vec training on the newly compiled text corpus to create a trained word vector file at the specified path, and then use the newly trained vector data to compute the cosine similarity between the words "of" and "the".

This script supports as many consecutive commands as the user wishes to chain. All commands are checked for errors, and the script will exit gracefully if one occurs. To obtain a better understanding of any errors, enable the '--debuglog' or '--writelog' option.
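
One way to picture how a chained command line is divided is to group each argument under the "--command" token that precedes it, while treating '--debuglog' and '--writelog' as flags that may appear anywhere. The Python sketch below illustrates that idea only; the function name is hypothetical and the script's real argument parser is not shown in this document.

```python
GLOBAL_FLAGS = {"--debuglog", "--writelog"}

def split_commands(argv):
    """Group arguments under the '--command' token that precedes them."""
    global_flags, commands = [], []
    for token in argv:
        if token in GLOBAL_FLAGS:
            global_flags.append(token)       # allowed anywhere in the string
        elif token.startswith("--"):
            commands.append((token, []))     # start of a new command
        elif commands:
            commands[-1][1].append(token)    # parameter of the current command
        else:
            raise ValueError(f"argument {token!r} appears before any command")
    return global_flags, commands

argv = [
    "--compiletextcorpus", "-workdir", "samples",
    "--w2vtrain", "-trainfile", "samples/textcorpus.txt",
    "-outputfile", "samples/tempvectors.bin", "--debuglog",
    "--cos", "samples/tempvectors.bin", "of", "the",
]
flags, commands = split_commands(argv)
print(flags)
for name, params in commands:
    print(name, params)
```

Single-dash parameters such as "-workdir", and negative values such as "-200", stay attached to the command before them because only tokens beginning with two dashes start a new group.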

SYSTEM REQUIREMENTS

  • Perl (version 5.24.0 or better) - http://www.perl.org

CONTACT US

    If you have trouble installing and executing Word2vec-Interface.pl,
    please contact us at

    cuffyca at vcu dot edu.

AUTHOR

 Clint Cuffy, Virginia Commonwealth University

COPYRIGHT

Copyright (c) 2016

 Bridget T McInnes, Virginia Commonwealth University
 btmcinnes at vcu dot edu

 Clint Cuffy, Virginia Commonwealth University
 cuffyca at vcu dot edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.