Word2vec::Xmltow2v - Medline XML-To-W2V Module.
    use Word2vec::Xmltow2v;

    # Parameters: Debug Output = True, Write Log = False, StoreTitle = True,
    # StoreAbstract = True, Quick Parse = True, CompoundifyText = True,
    # Number Of Threads = 2 (Default = 1 Thread Per CPU Core)
    # Note: Specifying no parameters implies default settings.
    my $xmlconv = Word2vec::Xmltow2v->new( 1, 0, 1, 1, 1, 1, 2 );

    $xmlconv->SetWorkingDir( "Medline/XML/Directory/Here" );
    $xmlconv->SetSavePath( "textcorpus.txt" );
    $xmlconv->SetStoreTitle( 1 );
    $xmlconv->SetStoreAbstract( 1 );
    $xmlconv->SetBeginDate( "01/01/2004" );
    $xmlconv->SetEndDate( "08/13/2016" );
    $xmlconv->SetOverwriteExistingFile( 1 );

    # If Compound Word File Exists, Store It In Memory
    # And Create Compound Word Binary Search Tree
    $xmlconv->ReadCompoundWordDataFromFile( "compoundword.txt", 1 );
    $xmlconv->CreateCompoundWordBST();

    # Parse XML Files or Directory Of Files
    $xmlconv->ConvertMedlineXMLToW2V( "/xmlDirectory/" );
    undef( $xmlconv );

Word2vec::Xmltow2v is an XML-to-text module which converts Medline XML article title and abstract data, within a given date range, into a plain text corpus for use with Word2vec::Interface. If a compound word file is supplied, it also "compoundifies" the text during text corpus compilation.
Description:
Returns a new 'Word2vec::Xmltow2v' module object.

Note: Specifying no parameters implies default options.

Default Parameters:
    debugLog              = 0
    writeLog              = 0
    storeTitle            = 1
    storeAbstract         = 1
    quickParse            = 0
    compoundifyText       = 0
    numOfThreads          = Number of CPUs/CPU cores (1 thread per core/CPU)
    workingDir            = Current Directory
    savePath              = Current Directory
    beginDate             = "00/00/0000"
    endDate               = "99/99/9999"
    xmlStringToParse      = "(null)"
    textCorpusString      = ""
    twigHandler           = 0
    parsedCount           = 0
    tempDate              = ""
    tempStr               = ""
    outputFileName        = "textcorpus.txt"
    compoundWordAry       = ()
    compoundWordBST       = Word2vec::Bst->new()
    maxCompoundWordLength = 0
    overwriteExistingFile = 0
Input:
$debugLog -> Instructs module to print debug statements to the console. (1 = True / 0 = False)
$writeLog -> Instructs module to print debug statements to a log file. (1 = True / 0 = False)
$storeTitle -> Instructs module to store Medline article titles during text corpus compilation. (1 = True / 0 = False)
$storeAbstract -> Instructs module to store Medline article abstracts during text corpus compilation. (1 = True / 0 = False)
$quickParse -> Instructs module to utilize quick XML parsing functions for known Medline article title and abstract tags. (1 = True / 0 = False)
$compoundifyText -> Instructs module to compoundify text on the fly given a compound word file. This is automatically set when reading the compound word file into memory, regardless of user setting. (1 = True / 0 = False)
$numOfThreads -> Specifies the number of worker threads which parse Medline XML files simultaneously to create the text corpus. The speed-up scales roughly with the number of physical cores present on a given machine. (Positive integer value) e.g. Using four threads on an Intel i7 machine makes text corpus generation roughly four times faster than running single-threaded.
$workingDir -> Specifies the current working directory. (String)
$savePath -> Specifies the save path for text corpus generation. (String)
$beginDate -> Specifies the beginning date range for Medline article text corpus composition. (Format: XX/XX/XXXX)
$endDate -> Specifies the ending date range for Medline article text corpus composition. (Format: XX/XX/XXXX)
$xmlStringToParse -> Storage location for the current Medline XML file in memory. (String)
$textCorpusString -> Temporary storage location for text corpus generation in memory. (String)
$twigHandler -> XML::Twig object location.
$parsedCount -> Number of parsed Medline articles during text corpus generation.
$tempDate -> Temporary storage location for current Medline article date during text corpus compilation.
$tempStr -> Temporary storage location for current Medline article title/abstract during text corpus compilation.
$outputFileName -> Output file path/name.
$compoundWordAry -> Storage location for compound words, used to compoundify text. (Array) <- Deprecated
$compoundWordBST -> Storage location for compound words, used to compoundify text. (Binary Search Tree) <- Supersedes '$compoundWordAry'
$maxCompoundWordLength -> Maximum number of words able to be compoundified in one phrase. e.g. "six_sea_snakes_were_sailing" = 5 compoundified words. The compounding algorithm will attempt to compoundify no more than this set value, even though the compound word list could contain longer compounded phrases.
$overwriteExistingFile -> Instructs the module to either overwrite any existing text corpus files or append to the existing file.

Note: It is not recommended to specify all new() parameters, as this has not been thoroughly tested. Maximum recommended parameters to be specified include: "debugLog, writeLog, storeTitle, storeAbstract, quickParse, compoundifyText, numOfThreads, workingDir, savePath, beginDate, endDate"
Output:
Word2vec::Xmltow2v object.
Example:
    use Word2vec::Xmltow2v;

    # Note: Specifying no parameters implies default settings as listed above.
    my $xmlconv = Word2vec::Xmltow2v->new();
    undef( $xmlconv );

    # Or

    use Word2vec::Xmltow2v;

    # Parameters: Debug Output = True, Write Log = False, StoreTitle = True,
    # StoreAbstract = True, Quick Parse = True, CompoundifyText = True,
    # Use Multi-Threading (2 Threads)
    my $xmlconv = Word2vec::Xmltow2v->new( 1, 0, 1, 1, 1, 1, 2 );
    undef( $xmlconv );
Removes module objects and variables from memory.
None
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->DESTROY(); undef( $xmlconv );
Parses the specified Medline XML file, or directory of files, and creates a text corpus. Returns 0 if successful or -1 if an error occurs. Note: Supports plain Medline XML or gzipped XML files.
$filePath -> XML file path to parse. (This can be a single file or directory of XML/XML.gz files).
$value -> '0' = Successful / '-1' = Unsuccessful
    use Word2vec::Xmltow2v;

    # Note: Specifying no parameters implies default settings
    my $xmlconv = Word2vec::Xmltow2v->new();

    $xmlconv->SetSavePath( "testCorpus.txt" );
    $xmlconv->SetStoreTitle( 1 );
    $xmlconv->SetStoreAbstract( 1 );
    $xmlconv->SetBeginDate( "01/01/2004" );
    $xmlconv->SetEndDate( "08/13/2016" );
    $xmlconv->SetOverwriteExistingFile( 1 );
    $xmlconv->ConvertMedlineXMLToW2V( "/xmlDirectory/" );
    undef( $xmlconv );
Multi-Threaded Medline XML to text corpus conversion function.
$directory -> File or directory of files to parse.
$value -> '0' = Successful / '-1' = Unsuccessful
Warning: This is a private function called by 'ConvertMedlineXMLToW2V()'. It should not be called outside of xmltow2v module.
Parses passed string parameter for Medline XML article title and abstract data and appends found data to the text corpus.
$string -> Medline XML string data to parse.
Warning: This is a private function called by "ConvertMedlineXMLToW2V()" and "_ThreadedConvert()". It should not be called outside of xmltow2v module.
Checks passed string parameter to see if it contains relevant data and XML::Twig handler is initialized.
$string -> String data to check
Warning: This is a private function called by "_ParseXMLString()". It should not be called outside of xmltow2v module.
Checks passed string parameter for "(null)" string.
$string -> String data to be checked.
$value -> '1' = True/Null data or '0' = False/Valid data
Warning: This is a private function called by "new()" and "_ParseXMLString()". It should not be called outside of xmltow2v module.
Removes the XML Version string prior to parsing the XML string data. (Deprecated)
$string -> Medline XML string data
Parses 'MedlineCitationSet' tag data in Medline XML file.
$twigHandler -> XML::Twig handler
$root -> Beginning of XML directory to parse. ( Directory in Medline XML string data )
Warning: This is a private function and is called by xmltow2v's XML::Twig handler. It should not be called outside of xmltow2v module.
Parses 'MedlineArticle' tag data in Medline XML file.
$medlineArticle -> Current Medline article directory in XML data (XML::Twig directory)
$value -> '1' = Finished parsing Medline article.
Parses 'DateCreated' tag data in Medline XML file.
$article -> Current Medline article in XML data (XML::Twig directory)
$date -> 'XX/XX/XXXX' (Month/Day/Year)
Parses 'Article' tag data in Medline XML file. Fetches 'ArticleTitle', 'Journal' and 'Abstract' XML tags.
Parses 'Journal' tag data in Medline XML file. Fetches 'Title' XML tag.
$journalRoot -> Current Medline journal directory in XML data (XML::Twig directory)
Parses 'Abstract' tag data in Medline XML file. Fetches 'AbstractText' XML tag.
$abstractRoot -> Current Medline abstract directory in XML data (XML::Twig directory)
Parses 'DateCreated' tag data in Medline XML file. Used when 'QuickParse' member variable is enabled. Sets $tempDate member variable to parsed 'DateCreated' tag data.
$twigHandler -> 'XML::Twig' handler.
$article -> Current Medline article directory in XML data (XML::Twig directory)
Parses 'Journal' tag data in Medline XML file. Fetches 'Title' XML tag. Used when 'QuickParse' member variable is enabled. Sets $tempStr to parsed data and stores in text corpus.
$twigHandler -> 'XML::Twig' handler.
$journalRoot -> Current Medline journal directory in XML data (XML::Twig directory)
Parses 'Article' tag data in Medline XML file. Fetches 'ArticleTitle' and 'Abstract' XML tags. Used when 'QuickParse' member variable is enabled. Sets $tempStr to parsed data and stores in text corpus.
$twigHandler -> 'XML::Twig' handler.
$article -> Current Medline article directory in XML data (XML::Twig directory)
Parses 'Abstract' tag data in Medline XML file. Fetches 'AbstractText' XML tag. Used when 'QuickParse' member variable is enabled. Sets $tempStr to parsed data and stores in text corpus.
$twigHandler -> 'XML::Twig' handler.
$abstractRoot -> Current Medline abstract directory in XML data (XML::Twig directory)
Creates a binary search tree using compound word data in memory and stores the root node. Warning: Compound word file must be loaded into memory using ReadCompoundWordDataFromFile() prior to calling this method. This function also clears the compound word array upon completion, as it is no longer necessary.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" ); $xmlconv->CreateCompoundWordBST();
Compoundifies string parameter based on compound word data in memory using the compound word binary search tree. Warning: Compound word file must be loaded into memory using ReadCompoundWordDataFromFile() prior to calling this method.
$string -> String to compoundify
$string -> Compounded string or "(null)" if string parameter is not defined.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" ); $xmlconv->CreateCompoundWordBST(); my $compoundedString = $xmlconv->CompoundifyString( "String to compoundify" ); print( "Compounded String: $compoundedString\n" ); undef( $xmlconv );
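The following is a minimal, standalone sketch of how greedy longest-match compoundification with a maximum phrase length could work. It is not the module's implementation: it swaps the binary search tree for a plain hash lookup, and the names %compound and compoundify() are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the compound word list held in memory.
my %compound = map { $_ => 1 }
    ( "big dog", "respiratory failure", "seven large masses" );

# Joins known multi-word phrases with underscores, trying the longest
# candidate (at most $max_len words) first at each position.
sub compoundify {
    my ( $text, $max_len ) = @_;
    my @words = split ' ', $text;
    my @out;
    my $i = 0;

    while ( $i < @words ) {
        my $limit = $max_len;
        $limit = @words - $i if $limit > @words - $i;

        my $matched = 0;
        for ( my $n = $limit; $n >= 2; $n-- ) {
            my @candidate = @words[ $i .. $i + $n - 1 ];
            if ( $compound{ join ' ', @candidate } ) {
                push @out, join( '_', @candidate );
                $i += $n;
                $matched = 1;
                last;
            }
        }
        next if $matched;

        # No phrase starts here; keep the single word as-is.
        push @out, $words[ $i++ ];
    }
    return join ' ', @out;
}

print compoundify( "the big dog had respiratory failure", 3 ), "\n";
# prints "the big_dog had respiratory_failure"
```

With a maximum length of 2, the three-word phrase "seven large masses" is left untouched even though it is in the list, which mirrors the cap described for maxCompoundWordLength.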
Recursive method used by CompoundifyString() to fetch compound word data in binary search tree. Warning: This function requires specific parameters and should not be called outside of CompoundifyString() method.
$stringArrayRef -> Array reference containing string data
$oldNode -> Last 'Word2vec::Node' where a data match was found
$searchStr -> Search phrase
$index -> Current string array index
Word2vec::Node -> Last node containing positive search phrase match
Warning: This is a private function and is called by 'CompoundifyString()'. It should not be called outside of xmltow2v module.
Reads compound word file and stores in memory. $autoSetMaxCompWordLength parameter is not required to be set. This parameter instructs the method to auto set the maximum compound word length dependent on the longest compound word found. Note: $autoSetMaxCompWordLength options: defined = True and Undefined = False.
$filePath -> Compound word file path
$autoSetMaxCompWordLength -> If defined, the maxCompoundWordLength variable (the maximum length of a compoundified phrase the module's compoundify algorithm will permit) is automatically set to the length of the longest compound phrase found.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt", 1 ); undef( $xmlconv );
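As an illustration of what "auto set the maximum compound word length" means, the sketch below counts the words in each phrase and keeps the largest count. The function name longest_phrase_length() is hypothetical, and it assumes phrases are whitespace-separated words; the module's own file format and logic may differ.

```perl
use strict;
use warnings;

# Returns the word count of the longest phrase in the list (0 if empty).
sub longest_phrase_length {
    my @phrases = @_;
    my $longest = 0;
    for my $phrase ( @phrases ) {
        my @words = split ' ', $phrase;
        $longest = @words if @words > $longest;
    }
    return $longest;
}

print longest_phrase_length( "big dog", "six sea snakes were sailing" ), "\n";
# prints 5
```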
Saves compound word data in memory to a specified file location.
$savePath -> Path to save compound word list to file.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ReadCompoundWordDataFromFile( "samples/compoundword.txt" ); $xmlconv->SaveCompoundWordDataFromFile( "samples/newcompoundword.txt" ); undef( $xmlconv );
Reads a plain text file with utf8 encoding into memory. Returns string data if successful and "(null)" if unsuccessful.
$filePath -> Text file to read into memory
$string -> String data if successful or "(null)" if un-successful.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $textData = $xmlconv->ReadTextFromFile( "samples/textcorpus.txt" ); print( "Text Data: $textData\n" ); undef( $xmlconv );
Saves a plain text file with utf8 encoding in a specified location.
$savePath -> Path to save string data.
$string -> String to save
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $result = $xmlconv->SaveTextToFile( "text.txt", "Hello world!" ); print( "File saved\n" ) if $result == 0; print( "File unable to save\n" ) if $result == -1; undef( $xmlconv );
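For readers who want to see the return conventions in isolation, here is a small sketch of UTF-8 text saving and reading that follows the same 0 / -1 and "(null)" conventions described above. The function names are hypothetical and the code is not taken from the module; it only uses core Perl and File::Temp.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Writes $text to $path as UTF-8. Returns 0 on success, -1 on failure.
sub save_text_to_file {
    my ( $path, $text ) = @_;
    open( my $fh, '>:encoding(UTF-8)', $path ) or return -1;
    print {$fh} $text;
    close( $fh ) or return -1;
    return 0;
}

# Reads a UTF-8 file into a string. Returns "(null)" on failure.
sub read_text_from_file {
    my ( $path ) = @_;
    open( my $fh, '<:encoding(UTF-8)', $path ) or return "(null)";
    local $/;    # slurp the whole file at once
    my $text = <$fh>;
    close( $fh );
    return defined $text ? $text : "(null)";
}

# Usage: round-trip a string through a temporary file.
my $file = tempdir( CLEANUP => 1 ) . "/text.txt";
print "File saved\n" if save_text_to_file( $file, "Hello world!" ) == 0;
print read_text_from_file( $file ), "\n";
```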
Reads an XML file from a specified location. Returns string in memory if successful and "(null)" if unsuccessful.
$filePath -> File to read given path
Warning: This is a private function and is called by XML::Twig parsing functions. It should not be called outside of xmltow2v module.
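Since the module accepts both plain and gzipped Medline XML, a reader could handle the two cases roughly as sketched below. This is an assumption about one reasonable approach, not the module's code: read_xml_file() is a hypothetical name, the ".gz" suffix test is a simplification, and IO::Uncompress::Gunzip is a core Perl module.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw( gunzip $GunzipError );

# Returns the file contents as a string, or "(null)" if it cannot be read.
sub read_xml_file {
    my ( $path ) = @_;
    my $xml = "";

    if ( $path =~ /\.gz$/i ) {
        # Decompress straight into the scalar (raw bytes, not decoded).
        gunzip( $path => \$xml ) or return "(null)";
    }
    else {
        open( my $fh, '<', $path ) or return "(null)";
        local $/;    # slurp mode
        $xml = <$fh>;
        close( $fh );
        $xml = "" unless defined $xml;
    }
    return $xml;
}
```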
Saves text corpus data to specified file path. This method will append to any existing file if $appendToFile parameter is defined or "overwrite" option is disabled. Enabling "overwrite" option will overwrite any existing files.
$savePath -> Path to save the text corpus
$appendToFile -> Specifies whether the module will overwrite any existing data or append to existing text corpus data. Note: If this parameter is left undefined, the "Overwrite" member variable is used to decide.
Checks to see if $date is within $beginDate and $endDate range. Returns 1 if true and 0 if false. Note: Date Format: XX/XX/XXXX (Month/Day/Year)
$date -> Date to check against minimum and maximum date range. (String)
$beginDate -> Minimum date range (String)
$endDate -> Maximum date range (String)
$value -> '1' = True/Date is within specified range Or '0' = False/Date is not within specified range.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); print( "Is \"01/01/2004\" within the date range: \"02/21/1985\" to \"08/13/2016\"?\n" ); print( "Yes\n" ) if $xmlconv->IsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 1; print( "No\n" ) if $xmlconv->IsDateInSpecifiedRange( "01/01/2004", "02/21/1985", "08/13/2016" ) == 0; undef( $xmlconv );
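One simple way to compare "Month/Day/Year" strings is to convert each into a sortable integer of the form YYYYMMDD. The sketch below illustrates that idea only; date_to_key() and is_date_in_range() are hypothetical names, and the module may validate or compare dates differently.

```perl
use strict;
use warnings;

# Converts "MM/DD/YYYY" to the integer YYYYMMDD, or undef if malformed.
sub date_to_key {
    my ( $date ) = @_;
    return undef unless defined $date;
    my ( $month, $day, $year ) = $date =~ m|^(\d\d)/(\d\d)/(\d{4})$|
        or return undef;
    return $year * 10000 + $month * 100 + $day;
}

# Returns 1 if $date lies within [$begin, $end], otherwise 0.
sub is_date_in_range {
    my ( $date, $begin, $end ) = @_;
    my $key  = date_to_key( $date );
    my $low  = date_to_key( $begin );
    my $high = date_to_key( $end );
    return 0 unless defined $key && defined $low && defined $high;
    return ( $key >= $low && $key <= $high ) ? 1 : 0;
}

print is_date_in_range( "01/01/2004", "02/21/1985", "08/13/2016" ), "\n";
# prints 1
```

The default bounds "00/00/0000" and "99/99/9999" also work under this scheme, since they map to the smallest and largest possible keys.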
Checks to see if specified path is a file or directory.
$path -> File or directory path. (String)
$string -> Returns: "file" = file, "dir" = directory and "unknown" if the path is not a file or directory (undefined).
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $path = "path/to/a/directory"; print( "Is \"$path\" a file or directory? " . $xmlconv->IsFileOrDirectory( $path ) . "\n" ); $path = "path/to/a/file.file"; print( "Is \"$path\" a file or directory? " . $xmlconv->IsFileOrDirectory( $path ) . "\n" ); undef( $xmlconv );
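The three return values can be reproduced with Perl's built-in file test operators. The sketch below is an assumption about how such a check might look; file_or_directory() is a hypothetical name, not the module's method.

```perl
use strict;
use warnings;

# Returns "file", "dir", or "unknown" for the given path.
sub file_or_directory {
    my ( $path ) = @_;
    return "unknown" unless defined $path;
    return "file" if -f $path;    # plain file
    return "dir"  if -d $path;    # directory
    return "unknown";
}

print file_or_directory( "." ), "\n";
# prints "dir"
```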
Removes special characters from string parameter, removes extra spaces and converts text to lowercase. Note: This method is called when parsing and compiling Medline title/abstract data.
$string -> String passed to remove special characters from and convert to lowercase.
$string -> String with all special characters removed and converted to lowercase.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $str = "Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!"; print( "Original String: $str\n" ); $str = $xmlconv->RemoveSpecialCharactersFromString( $str ); print( "Modified String: $str\n" ); undef( $xmlconv );
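The normalization described above can be approximated with a few regular expressions. This sketch assumes that "special characters" means anything other than ASCII letters, digits, and whitespace; the module's actual character set may differ, and normalize_text() is a hypothetical name.

```perl
use strict;
use warnings;

# Lowercases the text, strips special characters, and collapses spaces.
sub normalize_text {
    my ( $text ) = @_;
    return "" unless defined $text;
    $text = lc $text;
    $text =~ s/[^a-z0-9\s]//g;    # drop everything but letters, digits, whitespace
    $text =~ s/\s+/ /g;           # collapse runs of whitespace
    $text =~ s/^ | $//g;          # trim leading/trailing space
    return $text;
}

print normalize_text( 'Heart Attack is$ an!@ also KNOWN as an Acute MYOCARDIAL inFARCTion!' ), "\n";
# prints "heart attack is an also known as an acute myocardial infarction"
```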
Returns file data type (string).
$filePath -> File to check located at file path
$string -> File type
    use Word2vec::Xmltow2v;

    my $xmlconv = Word2vec::Xmltow2v->new();
    my $fileType = $xmlconv->GetFileType( "samples/textcorpus.txt" );
    undef( $xmlconv );
Checks specified begin and end date strings for formatting and logic errors.
$value -> "0" = Passed Checks / "-1" = Failed Checks
    use Word2vec::Xmltow2v;

    my $xmlconv = Word2vec::Xmltow2v->new();
    print "Passed Date Checks\n" if ( $xmlconv->_DateCheck() == 0 );
    print "Failed Date Checks\n" if ( $xmlconv->_DateCheck() == -1 );
    undef( $xmlconv );

Returns the _debugLog member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$value -> '0' = False, '1' = True
    use Word2vec::Xmltow2v;

    my $xmlconv = Word2vec::Xmltow2v->new();
    my $debugLog = $xmlconv->GetDebugLog();
    print( "Debug Logging Enabled\n" ) if $debugLog == 1;
    print( "Debug Logging Disabled\n" ) if $debugLog == 0;
    undef( $xmlconv );

Returns the _writeLog member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $writeLog = $xmlconv->GetWriteLog(); print( "Write Logging Enabled\n" ) if $writeLog == 1; print( "Write Logging Disabled\n" ) if $writeLog == 0; undef( $xmlconv );
Returns the _storeTitle member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$value -> '1' = True / '0' = False
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $storeTitle = $xmlconv->GetStoreTitle(); print( "Store Title Option: Enabled\n" ) if $storeTitle == 1; print( "Store Title Option: Disabled\n" ) if $storeTitle == 0; undef( $xmlconv );
Returns the _storeAbstract member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().

    use Word2vec::Xmltow2v;

    my $xmlconv = Word2vec::Xmltow2v->new();
    my $storeAbstract = $xmlconv->GetStoreAbstract();
    print( "Store Abstract Option: Enabled\n" ) if $storeAbstract == 1;
    print( "Store Abstract Option: Disabled\n" ) if $storeAbstract == 0;
    undef( $xmlconv );

Returns the _quickParse member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $quickParse = $xmlconv->GetQuickParse(); print( "Quick Parse Option: Enabled\n" ) if $quickParse == 1; print( "Quick Parse Option: Disabled\n" ) if $quickParse == 0; undef( $xmlconv );
Returns the _compoundifyText member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $compoundify = $xmlconv->GetCompoundifyText(); print( "Compoundify Text Option: Enabled\n" ) if $compoundify == 1; print( "Compoundify Text Option: Disabled\n" ) if $compoundify == 0; undef( $xmlconv );
Returns the _numOfThreads member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$value -> Number of threads
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $numOfThreads = $xmlconv->GetNumOfThreads(); print( "Number of threads: $numOfThreads\n" ); undef( $xmlconv );
Returns the _workingDir member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$string -> Working directory string
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $workingDirectory = $xmlconv->GetWorkingDir(); print( "Working Directory: $workingDirectory\n" ); undef( $xmlconv );
Returns the _saveDir member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$string -> Save directory string
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $savePath = $xmlconv->GetSavePath(); print( "Save Directory: $savePath\n" ); undef( $xmlconv );
Returns the _beginDate member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$date -> Beginning date range - Format: XX/XX/XXXX (Mon/Day/Year)
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $date = $xmlconv->GetBeginDate(); print( "Date: $date\n" ); undef( $xmlconv );
Returns the _endDate member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$date -> End date range - Format: XX/XX/XXXX (Mon/Day/Year).
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $date = $xmlconv->GetEndDate(); print( "Date: $date\n" ); undef( $xmlconv );
Returns the XML data (string) to be parsed.
Returns the _xmlStringToParse member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$string -> Medline XML data string
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $xmlStr = $xmlconv->GetXMLStringToParse(); print( "XML String: $xmlStr\n" ); undef( $xmlconv );
Returns the _textCorpusStr member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$string -> Text corpus string
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $str = $xmlconv->GetTextCorpusStr(); print( "Text Corpus: $str\n" ); undef( $xmlconv );
Returns the _fileHandle member variable, which is initialized when the Word2vec::Xmltow2v object is created by new(). Warning: This is a private function. The file handle is used by the WriteLog() method. Do not manipulate this file handle, as errors can result.
$fileHandle -> Returns file handle for WriteLog() method.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $fileHandle = $xmlconv->GetFileHandle(); undef( $xmlconv );
Returns XML::Twig handler.
Returns the _twigHandler member variable, which is initialized when the Word2vec::Xmltow2v object is created by new(). Warning: This is a private function and should not be called or manipulated.
$twigHandler -> XML::Twig handler.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $xmlHandler = $xmlconv->GetTwigHandler(); undef( $xmlconv );
Returns the _parsedCount member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$value -> Number of parsed Medline articles.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $numOfParsed = $xmlconv->GetParsedCount(); print( "Number of parsed Medline articles: $numOfParsed\n" ); undef( $xmlconv );
Returns the _tempStr member variable, which is initialized when the Word2vec::Xmltow2v object is created by new(). Warning: This is a private function and should not be called or manipulated. Used by the module as a temporary storage location for parsed Medline 'Title' and 'Abstract' tag string data.
$string -> Temporary string storage location.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $tempStr = $xmlconv->GetTempStr(); print( "Temp String: $tempStr\n" ); undef( $xmlconv );
Returns the _tempDate member variable, which is initialized when the Word2vec::Xmltow2v object is created by new(). Used by the module as a temporary storage location for parsed Medline 'DateCreated' tag string data.
$date -> Date string - Format: XX/XX/XXXX (Mon/Day/Year).
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $date = $xmlconv->GetTempDate(); print( "Temp Date: $date\n" ); undef( $xmlconv );
Returns the _compoundWordAry member array reference, which is initialized when the Word2vec::Xmltow2v object is created by new(). Warning: Compound word data must be loaded in memory first via ReadCompoundWordDataFromFile().
$arrayReference -> Compound word array reference.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $arrayReference = $xmlconv->GetCompoundWordAry(); my @compoundWord = @{ $arrayReference }; print( "Compound Word Array: @compoundWord\n" ); undef( $xmlconv );
Returns the _compoundWordBST member variable, which is initialized when the Word2vec::Xmltow2v object is created by new().
$bst -> Compound word binary search tree.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $bst = $xmlconv->GetCompoundWordBST(); undef( $xmlconv );
Returns the _maxCompoundWordLength member variable, which is initialized when the Word2vec::Xmltow2v object is created by new(). Note: If not defined, it is automatically set to 20, and 20 is returned.
$value -> Maximum number of compound words in a given phrase.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $compoundWordLength = $xmlconv->GetMaxCompoundWordLength(); print( "Maximum Compound Word Length: $compoundWordLength\n" ); undef( $xmlconv );
Returns the _overwriteExistingFile member variable, which is initialized when the Word2vec::Xmltow2v object is created by new(). A value of '1' overwrites an existing text corpus; a value of '0' appends to the existing text corpus.
$value -> '1' = Overwrite existing file / '0' = Append to existing file.

    use Word2vec::Xmltow2v;

    my $xmlconv = Word2vec::Xmltow2v->new();
    my $overwriteExistingFile = $xmlconv->GetOverwriteExistingFile();
    print( "Overwrite Existing File? YES\n" ) if ( $overwriteExistingFile == 1 );
    print( "Overwrite Existing File? NO\n" ) if ( $overwriteExistingFile == 0 );
    undef( $xmlconv );
Sets member variable to passed integer parameter. Instructs module to store article title if true or omit if false.
$value -> '1' = Store Titles / '0' = Omit Titles
Output:
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetStoreTitle( 1 ); undef( $xmlconv );
Sets member variable to passed integer parameter. Instructs module to store article abstracts if true or omit if false.
$value -> '1' = Store Abstracts / '0' = Omit Abstracts
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetStoreAbstract( 1 ); undef( $xmlconv );
Sets member variable to passed string parameter. Represents the working directory.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetWorkingDir( "/samples/" ); undef( $xmlconv );
Sets member variable to passed string parameter. Represents the text corpus save path.
$string -> Text corpus save path
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetSavePath( "samples/textcorpus.txt" ); undef( $xmlconv );
Sets member variable to passed integer parameter. Instructs module to utilize quick parse routines to speed up text corpus compilation. This method is somewhat less accurate due to its non-exhaustive nature.
$value -> '1' = Enable Quick Parse / '0' = Disable Quick Parse
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetQuickParse( 1 ); undef( $xmlconv );
Sets member variable to passed integer parameter. Instructs module to utilize 'compoundify' option if true. Warning: This requires compound word data to be loaded into memory with ReadCompoundWordDataFromFile() method prior to executing text corpus compilation.
$value -> '1' = Compoundify text / '0' = Do not compoundify text
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetCompoundifyText( 1 ); undef( $xmlconv );
Sets member variable to passed integer parameter. Sets the requested number of threads to parse Medline XML files and compile the text corpus.
$value -> Integer (Positive value)
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetNumOfThreads( 4 ); undef( $xmlconv );
Sets member variable to passed string parameter. Sets beginning date range for earliest articles to store, by 'DateCreated' Medline tag, within the text corpus during compilation. Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)
$string -> Date string - Format: "XX/XX/XXXX"
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetBeginDate( "01/01/2004" ); undef( $xmlconv );
Sets member variable to passed string parameter. Sets ending date range for latest article to store, by 'DateCreated' Medline tag, within the text corpus during compilation. Note: Expected format - "XX/XX/XXXX" (Mon/Day/Year)
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetEndDate( "08/13/2016" ); undef( $xmlconv );
Sets member variable to passed string parameter. This string normally consists of Medline XML data to be parsed for text corpus compilation. Warning: This is a private function and should not be called or manipulated.
$string -> String
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetXMLStringToParse( "Hello World!" ); undef( $xmlconv );
Sets member variable to passed string parameter. Overwrites any stored text corpus data in memory to the string parameter. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetTextCorpusStr( "Hello World!" ); undef( $xmlconv );
Sets member variable to passed string parameter. Appends string parameter to text corpus string in memory. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->AppendStrToTextCorpus( "Hello World!" ); undef( $xmlconv );
Clears text corpus data in memory. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ClearTextCorpus(); undef( $xmlconv );
Sets member variable to passed string parameter. Sets temporary member string to passed string parameter. (Temporary placeholder for Medline Title and Abstract data). Note: This removes special characters and converts all characters to lowercase. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetTempStr( "Hello World!" ); undef( $xmlconv );
Appends string parameter to temporary member string in memory. Note: This removes special characters and converts all characters to lowercase. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->AppendToTempStr( "Hello World!" ); undef( $xmlconv );
Clears the temporary string storage in memory. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ClearTempStr(); undef( $xmlconv );
Sets member variable to passed string parameter. Sets temporary date string to passed string. Note: Date Format - "XX/XX/XXXX" (Mon/Day/Year) Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetTempDate( "08/13/2016" ); undef( $xmlconv );
Clears the temporary date storage location in memory. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ClearTempDate(); undef( $xmlconv );
Sets member variable to de-referenced passed array reference parameter. Stores compound word array by de-referencing array reference parameter. Note: Clears previous data if existing. Warning: This is a private function and should not be called or manipulated.
$arrayReference -> Array reference of compound words
use Word2vec::Xmltow2v; my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" ); my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetCompoundWordAry( \@compoundWordAry ); undef( $xmlconv );
Clears compound word array in memory. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ClearCompoundWordAry(); undef( $xmlconv );
Sets member variable to passed Word2vec::Bst parameter. Sets compound word binary search tree to passed binary tree parameter. Note: Un-defines previous binary tree if existing. Warning: This is a private function and should not be called or manipulated.
Word2vec::Bst -> Binary Search Tree
use Word2vec::Xmltow2v; my @compoundWordAry = ( "big dog", "respiratory failure", "seven large masses" ); @compoundWordAry = sort( @compoundWordAry ); my $arySize = @compoundWordAry; my $bst = Word2vec::Bst->new(); $bst->CreateTree( \@compoundWordAry, 0, $arySize, undef ); my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetCompoundWordBST( $bst ); undef( $xmlconv );
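To show why the compound word array is sorted before the tree is created, here is a rough standalone sketch of building a balanced binary search tree from a sorted array and searching it. build_tree and tree_contains are hypothetical stand-ins, and the hash-based node layout is an assumption; this is not how Word2vec::Bst is necessarily implemented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a BST builder, NOT the Word2vec::Bst implementation.
# Builds a balanced tree from the half-open index range [low, high) of a sorted array.
sub build_tree {
    my ( $sorted, $low, $high ) = @_;
    return undef if $low >= $high;                 # empty range: no node
    my $mid = int( ( $low + $high ) / 2 );         # middle element becomes the root
    return {
        value => $sorted->[ $mid ],
        left  => build_tree( $sorted, $low, $mid ),
        right => build_tree( $sorted, $mid + 1, $high ),
    };
}

# Iterative lookup using string comparison, matching the default sort order.
sub tree_contains {
    my ( $node, $phrase ) = @_;
    while ( defined( $node ) ) {
        my $cmp = $phrase cmp $node->{ value };
        return 1 if $cmp == 0;
        $node = $cmp < 0 ? $node->{ left } : $node->{ right };
    }
    return 0;
}

my @compoundWordAry = sort( "big dog", "respiratory failure", "seven large masses" );
my $tree = build_tree( \@compoundWordAry, 0, scalar( @compoundWordAry ) );

print tree_contains( $tree, "respiratory failure" ), "\n";    # prints 1
print tree_contains( $tree, "small cat" ), "\n";              # prints 0
```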
Clears/Un-defines existing compound word binary search tree from memory. Warning: This is a private function and should not be called or manipulated.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->ClearCompoundWordBST(); undef( $xmlconv );
Sets member variable to passed integer parameter. Sets the maximum number of words in a phrase considered when compoundifying. ie. "medical campus of Virginia Commonwealth University" can be interpreted as a compound word of 6 words. Setting this variable to 3 will only attempt to compoundify a maximum of three words at a time. The result would be "medical_campus_of virginia commonwealth university" even though an exact representation of the full compounded string can exist. Setting this variable to 6 will result in compounding all six words if they exist in the compound word array/bst. Warning: This is a private function and should not be called or manipulated.
$value -> Integer
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->SetMaxCompoundWordLength( 8 ); undef( $xmlconv );
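The effect of the maximum compound word length can be sketched as a greedy longest-match pass over the text. This standalone example is only an illustration: compoundify is a hypothetical helper, a hash lookup stands in for the binary search tree, and the compound word list below is assumed to contain both the three-word and the six-word phrase.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's implementation.
# Greedy longest-match compoundifying, limited to $maxLength words per phrase.
sub compoundify {
    my ( $text, $compoundWords, $maxLength ) = @_;
    my %lookup = map { lc( $_ ) => 1 } @{ $compoundWords };    # hash stands in for the BST
    my @words  = split( ' ', lc( $text ) );
    my @result = ();
    my $i      = 0;

    while ( $i < @words ) {
        my $matched = 0;
        my $limit   = $maxLength;
        $limit = @words - $i if $limit > @words - $i;          # do not run past the end

        # Try the longest candidate phrase first, down to two words.
        for ( my $len = $limit; $len >= 2; $len-- ) {
            my @slice = @words[ $i .. $i + $len - 1 ];
            if ( exists $lookup{ join( ' ', @slice ) } ) {
                push( @result, join( '_', @slice ) );
                $i += $len;
                $matched = 1;
                last;
            }
        }

        # No compound word starts here: keep the single word.
        if ( !$matched ) {
            push( @result, $words[ $i ] );
            $i++;
        }
    }
    return join( ' ', @result );
}

# Assumed compound word list for illustration only.
my @compoundWordAry = ( "medical campus of", "medical campus of virginia commonwealth university" );
my $text = "medical campus of Virginia Commonwealth University";

print compoundify( $text, \@compoundWordAry, 3 ), "\n";
# prints "medical_campus_of virginia commonwealth university"
print compoundify( $text, \@compoundWordAry, 6 ), "\n";
# prints "medical_campus_of_virginia_commonwealth_university"
```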
Sets member variable to passed integer parameter. Requires 0 = False or 1 = True. Sets option to overwrite existing text corpus during compilation if 1 or append to existing text corpus if 0.
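The difference between the two settings can be sketched with Perl's standard file open modes. This is a standalone illustration under the assumption that overwrite corresponds to opening the file with '>' and append to '>>'; save_corpus and read_file are hypothetical helpers, not module functions.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper, NOT part of Word2vec::Xmltow2v.
# Assumption: 1 = truncate and rewrite the file, 0 = append to it.
sub save_corpus {
    my ( $path, $text, $overwriteExistingFile ) = @_;
    my $mode = $overwriteExistingFile ? '>' : '>>';
    open( my $fh, $mode, $path ) or die "Cannot open $path: $!";
    print $fh $text;
    close( $fh );
}

# Hypothetical helper: read a whole file into a string.
sub read_file {
    my ( $path ) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    local $/ = undef;
    my $content = <$fh>;
    close( $fh );
    return defined( $content ) ? $content : "";
}

my ( $tmpFh, $path ) = tempfile( UNLINK => 1 );
close( $tmpFh );

save_corpus( $path, "first pass\n",  1 );    # overwrite: file holds only "first pass"
save_corpus( $path, "second pass\n", 0 );    # append: file holds both lines
my $appended = read_file( $path );

save_corpus( $path, "third pass\n",  1 );    # overwrite: earlier lines are discarded
my $overwritten = read_file( $path );

print $appended, $overwritten;
```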
Returns current time string in "Hour:Minute:Second" format.
$string -> XX:XX:XX ("Hour:Minute:Second")
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $time = $xmlconv->GetTime(); print( "Current Time: $time\n" ) if defined( $time ); undef( $xmlconv );
Returns current month, day and year string in "Month/Day/Year" format.
$string -> XX/XX/XXXX ("Month/Day/Year")
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); my $date = $xmlconv->GetDate(); print( "Current Date: $date\n" ) if defined( $date ); undef( $xmlconv );
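For readers who want strings in the same shape without loading the module, the two formats can be approximated with the core POSIX module. This is a sketch only: current_time and current_date are hypothetical names, and zero-padding and a 24-hour clock are assumptions, since the module's exact padding is not documented here.

```perl
use strict;
use warnings;
use POSIX qw( strftime );

# Hypothetical helpers, NOT the module's GetTime/GetDate.
# Assumption: zero-padded fields and a 24-hour clock.
sub current_time {
    return strftime( "%H:%M:%S", localtime() );    # "Hour:Minute:Second"
}

sub current_date {
    return strftime( "%m/%d/%Y", localtime() );    # "Month/Day/Year"
}

print "Current Time: ", current_time(), "\n";
print "Current Date: ", current_date(), "\n";
```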
Prints passed string parameter to the console, log file or both depending on user options. Note: A new line character is printed after the string unless the printNewLine parameter is 0; an undefined parameter also prints the new line character.
$string -> String to print to the console/log file. $value -> 0 = Do not print a new line character after the string; any other value, including 'undef', prints one.
use Word2vec::Xmltow2v; my $xmlconv = Word2vec::Xmltow2v->new(); $xmlconv->WriteLog( "Hello World" ); undef( $xmlconv );
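The new line rule described for the second parameter can be expressed as a small standalone sketch. format_log_line is a hypothetical helper that only models the rule as documented (0 suppresses the new line, everything else including undef prints it); it does not reproduce the module's console or log file handling.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's WriteLog.
# Models only the documented rule: 0 = no new line, anything else (including undef) = new line.
sub format_log_line {
    my ( $string, $printNewLine ) = @_;
    return $string if defined( $printNewLine ) && $printNewLine eq "0";
    return $string . "\n";
}

print format_log_line( "Hello World" );        # new line appended (parameter undefined)
print format_log_line( "Hello ", 0 );          # no new line
print format_log_line( "World", 1 );           # new line appended
```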
Clint Cuffy, Virginia Commonwealth University
Copyright (c) 2016
Bridget T McInnes, Virginia Commonwealth University btmcinnes at vcu dot edu Clint Cuffy, Virginia Commonwealth University cuffyca at vcu dot edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
To install Word2vec::Interface, copy and paste the appropriate command into your terminal.
cpanm
cpanm Word2vec::Interface
CPAN shell
perl -MCPAN -e shell
install Word2vec::Interface
For more information on module installation, please visit the detailed CPAN module installation guide.