CUICollector.pl - Scrapes MetaMap Machine Output (MMO) files to build a database of CUI bigram scores.
    $ perl CUICollector.pl --directory metamapped-baseline/2014/
    CUICollector 0.04 - (C) 2015 Keith Herbert and Bridget McInnes
    Released under the GNU GPL.
    Connecting to database CUI_Bigrams on localhost
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_01.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
    Entering scores into CUI_Bigrams
    ...
    Finished
Usage: CUICollector.pl [DATABASE OPTIONS] [OTHER OPTIONS] [FILES | DIRECTORIES]
Specify a directory containing *ONLY* compressed MetaMapped MEDLINE Baseline files: --directory /path/to/files/
Multiple directories may also be supplied: --directory /path/to/first/folder/ /path/to/second/folder/
Likewise, specify a list of individual files: --files text.out_01.txt.gz text_mm_out_42.txt.gz text_mm_out_314.txt.gz
Or a glob of files: --files /path/to/dir/*.gz
Or just one: --files text.out_01.txt.gz
Database to contain the CUI bigram scores. DEFAULT: CUI_Bigrams
If the database is not found in MySQL, CUICollector will create it for you.
Username required to access the CUI bigram database on MySQL. You will be prompted for it if it is not supplied as an argument.
Password required to access the CUI bigram database on MySQL. You will be prompted for it if it is not supplied as an argument.
Hostname where MySQL is located. DEFAULT: localhost
Socket file used by MySQL (mysql.sock or mysqld.sock). DEFAULT: mysql.sock
The port your MySQL server is using. DEFAULT: 3306
How many MetaMap files to read between writes to the database. DEFAULT: 5
MMO files can be rather large, so a low file_step reduces the memory footprint of the script because fewer files' worth of counts are held in memory at once. A higher file_step, on the other hand, reduces the number of write operations to the database.
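The trade-off can be illustrated with a minimal Python sketch of the batching idea. It is not the script's actual Perl code: `process_in_batches`, `fake_parse`, and `fake_write` are hypothetical names, and the per-file parser and database writer are stubbed out so that only the buffering behaviour is shown.

```python
from collections import Counter

def process_in_batches(files, file_step, parse_file, write_counts):
    """Accumulate counts in memory and flush them every `file_step` files.

    Hypothetical helper: returns the number of write operations performed.
    """
    pending = Counter()
    writes = 0
    for i, path in enumerate(files, start=1):
        pending.update(parse_file(path))   # memory grows with each file read
        if i % file_step == 0:
            write_counts(pending)          # one database write per batch
            pending.clear()                # release the buffered counts
            writes += 1
    if pending:                            # flush any leftover partial batch
        write_counts(pending)
        writes += 1
    return writes

# Stand-ins for the real parser and database writer (assumptions).
def fake_parse(path):
    return Counter({("C0000001", "C0000002"): 1})

flushed = []
def fake_write(counts):
    flushed.append(dict(counts))

files = [f"text.out_{n:02d}.gz" for n in range(1, 8)]
writes_low = process_in_batches(files, 2, fake_parse, fake_write)
writes_high = process_in_batches(files, 5, fake_parse, fake_write)
print(writes_low, writes_high)
```

With seven input files, a step of 2 triggers four writes while a step of 5 triggers two, at the cost of holding up to five files' worth of counts in memory.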
Sets the debug flag for testing. NOTE: extremely verbose.
Print the current status of the program to STDOUT. This indicates the files being processed and when the program is writing to the database. This is the default output setting.
Don't print anything to STDOUT.
Displays a quick summary of program options.
By default, CUICollector prints the current status of the program as it works through the MetaMap Machine Output (MMO) files (disable with `--quiet`). It creates a database (or connects to an existing one) and adds bigram scores for the CUIs it encounters in the MMO files.
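As a rough illustration of "create if missing, then add scores", the Python sketch below uses an in-memory SQLite database as a stand-in for MySQL. The table name `bigram_counts`, the helper `add_counts`, and the single-table schema are assumptions made for the example and do not reflect CUICollector's real schema or SQL.

```python
import sqlite3

def add_counts(conn, counts):
    """Create the (hypothetical) table if needed, then add counts to it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bigram_counts ("
        "cui_1 TEXT NOT NULL, cui_2 TEXT NOT NULL, n_11 INTEGER NOT NULL, "
        "PRIMARY KEY (cui_1, cui_2))"
    )
    for (cui_1, cui_2), n in counts.items():
        # Try to increment an existing row first.
        cur = conn.execute(
            "UPDATE bigram_counts SET n_11 = n_11 + ? "
            "WHERE cui_1 = ? AND cui_2 = ?",
            (n, cui_1, cui_2),
        )
        if cur.rowcount == 0:
            # No row for this bigram yet, so insert a new one.
            conn.execute(
                "INSERT INTO bigram_counts (cui_1, cui_2, n_11) "
                "VALUES (?, ?, ?)",
                (cui_1, cui_2, n),
            )
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
add_counts(conn, {("C0000001", "C0000002"): 2})
add_counts(conn, {("C0000001", "C0000002"): 3, ("C0000002", "C0000003"): 1})
rows = conn.execute(
    "SELECT cui_1, cui_2, n_11 FROM bigram_counts ORDER BY cui_1, cui_2"
).fetchall()
print(rows)
```

The second call adds to the count already stored for the first bigram rather than overwriting it, which is the behaviour one would expect when a run connects to an existing database.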
The resulting database will have four tables:
cui_1 cui_2 n_11
Each row records the number of times (n_11) that a particular CUI (cui_1) is immediately followed by another particular CUI (cui_2) in an utterance.
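The counting rule can be sketched in a few lines of Python. This is an illustration only: it assumes each utterance has already been reduced to a single ordered list of CUIs (how the script derives that sequence from MMO mappings is not described here), and the CUI strings and the name `count_cui_bigrams` are made up for the example.

```python
from collections import Counter

def count_cui_bigrams(utterances):
    """Count how often each CUI is immediately followed by another CUI."""
    counts = Counter()
    for cuis in utterances:
        # Pair each CUI with its successor; pairs never span two utterances.
        for cui_1, cui_2 in zip(cuis, cuis[1:]):
            counts[(cui_1, cui_2)] += 1
    return counts

# Hypothetical CUI sequences, one list per utterance (not real MMO data).
utterances = [
    ["C0000001", "C0000002", "C0000003"],
    ["C0000001", "C0000002"],
]
bigrams = count_cui_bigrams(utterances)
for (cui_1, cui_2), n_11 in sorted(bigrams.items()):
    print(cui_1, cui_2, n_11)
```

Each printed line corresponds to one row in the cui_1 / cui_2 / n_11 layout shown above.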
Keith Herbert, Virginia Commonwealth University
Amy Olex, Virginia Commonwealth University
Bridget McInnes, Virginia Commonwealth University
Copyright (c) 2015-2017
Keith Herbert, Virginia Commonwealth University herbertkb at vcu dot edu
Amy Olex, Virginia Commonwealth University alolex at vcu dot edu
Bridget McInnes, Virginia Commonwealth University btmcinnes at vcu dot edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.