NAME

CUICollector.pl - Scrapes MetaMap Machine Output (MMO) files to build a database of CUI bigram scores.

SYNOPSIS

    $ perl CUICollector.pl --directory metamapped-baseline/2014/ 
    CUICollector 0.04 - (C) 2015 Keith Herbert and Bridget McInnes
    Released under the GNU GPL.
    Connecting to database CUI_Bigrams on localhost
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_01.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_02.gz
    Parsing file: /home/share/data/metamapped-baseline/2014/text.out_03.gz
    Entering scores into CUI_Bigrams
    ...
    Finished

USAGE

Usage: CUICollector.pl [DATABASE OPTIONS] [OTHER OPTIONS] [FILES | DIRECTORIES]

INPUT

Required Arguments:

[FILES | DIRECTORIES]

Specify a directory containing *ONLY* compressed MetaMapped MEDLINE Baseline files: --directory /path/to/files/

Multiple directories may also be supplied: --directory /path/to/first/folder/ /path/to/second/folder/

Likewise, specify a list of individual files: --files text.out_01.txt.gz text_mm_out_42.txt.gz text_mm_out_314.txt.gz

Or a glob of files: --files /path/to/dir/*.gz

Or just one file: --files text.out_01.txt.gz
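Because a directory passed with --directory must contain only compressed files, it can help to check it first. The following is a minimal sketch of such a check, not part of CUICollector itself. The helper name `non_gz_entries` and the assumption that valid inputs end in `.gz` are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical pre-flight helper (not part of CUICollector): return the
# names of directory entries that do not end in ".gz", sorted by name.
sub non_gz_entries {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @stray = sort grep { $_ ne '.' && $_ ne '..' && $_ !~ /\.gz\z/ } readdir $dh;
    closedir $dh;
    return @stray;
}

# Example usage with a placeholder path:
#   my @stray = non_gz_entries('/path/to/files/');
#   warn "Unexpected entries: @stray\n" if @stray;
```

If the list comes back non-empty, the stray entries can be moved out of the directory, or the input can be given with --files and a glob instead.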

Optional Arguments:

--database STRING

Database to contain the CUI bigram scores. DEFAULT: CUI_Bigrams

If the database is not found in MySQL, CUICollector will create it for you.

--username STRING

A username is required to access the CUI bigram database in MySQL. You will be prompted for it if it is not supplied as an argument.

--password STRING

A password is required to access the CUI bigram database in MySQL. You will be prompted for it if it is not supplied as an argument.

--hostname STRING

Hostname of the machine where MySQL is running. DEFAULT: localhost

--socket STRING

Location of the MySQL socket file (mysql.sock or mysqld.sock). DEFAULT: mysql.sock

--port STRING

The port your MySQL server is using. DEFAULT: 3306

--file_step INTEGER

How many MetaMap files to read between writes to the database. DEFAULT: 5

MMO files can be rather large, so setting a low file_step reduces the memory footprint of the script. However, setting a higher file_step reduces the number of write operations to the database.

--debug

Sets the debug flag for testing. NOTE: extremely verbose.

--verbose

Print the current status of the program to STDOUT. This indicates the files being processed and when the program is writing to the database. This is the default output setting.

--quiet

Don't print anything to STDOUT.

--help

Displays the quick summary of program options.
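The trade-off described under --file_step (memory held between writes versus number of database writes) can be illustrated with a small sketch. This is not CUICollector's actual code. The function `process_in_batches` and its `$parse` and `$flush` callbacks are hypothetical stand-ins for reading one MMO file and writing to the database.

```perl
use strict;
use warnings;

# Hypothetical sketch: accumulate bigram counts in memory and hand them
# to $flush after every $file_step files, plus once for any remainder.
sub process_in_batches {
    my ($files, $file_step, $parse, $flush) = @_;
    my $counts      = {};   # "cui_1\tcui_2" => count since the last flush
    my $since_flush = 0;

    for my $file (@$files) {
        # $parse is assumed to return a list of [cui_1, cui_2] pairs.
        for my $pair ($parse->($file)) {
            $counts->{ join "\t", @$pair }++;
        }
        if (++$since_flush >= $file_step) {
            $flush->($counts);
            $counts      = {};   # release the batch before reading more files
            $since_flush = 0;
        }
    }

    # Write whatever is left from a final, partial batch.
    $flush->($counts) if $since_flush > 0;
    return;
}
```

With seven files and a file_step of 3, this sketch calls `$flush` three times (after files 3 and 6, and once at the end). A file_step of 1 would flush seven times while holding only one file's counts in memory at a time.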

OUTPUT

By default, CUICollector prints the current status of the program as it works through the MetaMap Machine Output (MMO) files (disable with `--quiet`). It creates a database (or connects to an existing one) and adds bigram scores of the CUIs it encounters in the MMO files.

The resulting database will have four tables:

N_11
    cui_1   cui_2   n_11
    

This shows the count (n_11) for every time a particular CUI (cui_1) is immediately followed by another particular CUI (cui_2) in an utterance.
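The counting rule above can be sketched as follows. This is a minimal illustration, assuming the CUIs of each utterance have already been extracted in order. How CUICollector extracts them from an MMO file is not shown here. The function name `count_cui_bigrams` and the CUI values are made-up placeholders.

```perl
use strict;
use warnings;

# Count how often each CUI is immediately followed by another CUI.
# Each argument is an array ref holding the CUIs of one utterance, in
# order. Pairs are never formed across two utterances.
sub count_cui_bigrams {
    my @utterances = @_;
    my %n_11;    # $n_11{cui_1}{cui_2} = count
    for my $cuis (@utterances) {
        for my $i (0 .. $#$cuis - 1) {
            $n_11{ $cuis->[$i] }{ $cuis->[ $i + 1 ] }++;
        }
    }
    return \%n_11;
}

# Placeholder CUIs, not real MetaMap output.
my $n_11 = count_cui_bigrams(
    [qw(C0000001 C0000002 C0000003)],
    [qw(C0000001 C0000002)],
);

# Print rows in the same column order as the N_11 table.
for my $cui_1 (sort keys %$n_11) {
    for my $cui_2 (sort keys %{ $n_11->{$cui_1} }) {
        print join("\t", $cui_1, $cui_2, $n_11->{$cui_1}{$cui_2}), "\n";
    }
}
```

For the two placeholder utterances, the pair (C0000001, C0000002) gets a count of 2 and (C0000002, C0000003) a count of 1. No pair is formed between the last CUI of the first utterance and the first CUI of the second.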

AUTHOR

 Keith Herbert, Virginia Commonwealth University
 Amy Olex, Virginia Commonwealth University
 Bridget McInnes, Virginia Commonwealth University

COPYRIGHT

Copyright (c) 2015-2017 Keith Herbert, Virginia Commonwealth University herbertkb at vcu dot edu

Amy Olex, Virginia Commonwealth University alolex at vcu dot edu

Bridget McInnes, Virginia Commonwealth University btmcinnes at vcu dot edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.