NAME

bp_genbank_ref_extractor - retrieves all related sequences for a list of searches on Entrez gene

SYNOPSIS

bp_genbank_ref_extractor [options] [Entrez Gene Queries]

DESCRIPTION

This script searches the Entrez Gene database and retrieves not only the gene sequence but also the related transcript and protein sequences.

The gene UIDs of multiple searches are collected before attempting to retrieve them, so each gene will only be analyzed once even if it appears in the results of more than one search.
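The idea of collecting UIDs first and analyzing each gene once can be sketched as follows. This is an illustration in Python, not the script's own Perl code; the function name and the UIDs are hypothetical, and keeping first-seen order is an assumption.

```python
def collect_unique_uids(search_results):
    """Merge the UID lists of several searches, dropping repeated UIDs."""
    seen = set()
    unique = []
    for uids in search_results:
        for uid in uids:
            if uid not in seen:
                seen.add(uid)
                unique.append(uid)
    return unique


# Hypothetical UIDs returned by two overlapping searches.
first_search = ["8349", "3017", "8347"]
second_search = ["3017", "79648"]

# UID 3017 appears in both searches but is kept only once.
unique_uids = collect_unique_uids([first_search, second_search])
print(unique_uids)
```

Only after this merged list is built would each gene be retrieved, so overlapping queries cost no extra retrievals.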

Note that by default no sequences are saved (see options and examples).

OPTIONS

Several options can be used to fine-tune the script's behaviour. It is possible to obtain extra base pairs upstream and downstream of the gene, control the naming of files, and choose the genome assembly to use.

See the BUGS section for problems that can occur when relying on the default values of options.

--assembly

When retrieving the sequence, a specific assembly can be defined. The expected value is a regex, which is matched case-insensitively. If it matches more than one assembly, the first match is used. It defaults to (primary|reference) assembly.
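The matching rule for --assembly (case-insensitive regex, first match wins) can be sketched like this. The sketch is in Python rather than the script's Perl; the function name and the list of assembly names are hypothetical, and using an unanchored search is an assumption.

```python
import re


def pick_assembly(pattern, assembly_names):
    """Return the first assembly name matching the regex, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    for name in assembly_names:
        if regex.search(name):
            return name
    return None  # no assembly matched


# Hypothetical assembly names, in the order they were reported.
available = ["Alternate HuRef", "Primary Assembly", "Reference assembly"]

# The documented default pattern matches two names; the first one is used.
print(pick_assembly(r"(primary|reference) assembly", available))
print(pick_assembly(r"Alternate HuRef", available))
```

Because the first match is used, a pattern that is too loose may silently select an assembly other than the intended one.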

--debug

If set, even more output will be printed, which may help with debugging. Unlike the messages from --verbose and --very-verbose, these will not appear in the log file unless this option is selected. This option also sets --very-verbose.

--downstream, --down

Specifies the number of extra base pairs to be retrieved downstream of the gene. These extra base pairs will only affect the gene sequence, not the transcripts or proteins.

--format

Specifies the format in which the sequences will be saved. Defaults to genbank format. Valid formats are 'genbank' and 'fasta'.

--genes

Specifies how the gene files are named. By default, gene sequences are not saved. If the option is given without a value, it defaults to the gene UID. Possible values are 'uid', 'name' and 'symbol' (the official symbol or nomenclature).

--help

Display the documentation (this text).

--limit

When making a query, limit the results to the first N hits, where N is the value given. This guards against overly unspecific queries, and a warning will be given if a query returns more results than the limit. The default value is 200. Note that this limit applies to each search separately.

--non-coding, --nonon-coding

Some protein coding genes have transcripts that are non-coding. By default, these sequences are saved as well. --nonon-coding can be used to ignore those transcripts.

--proteins

Specifies how the protein files are named. By default, protein sequences are not saved. If the option is given without a value, it defaults to the protein accession. Possible values are 'accession', 'description', 'gene' (the corresponding gene ID) and 'transcript' (the corresponding transcript accession).

Note that with any value other than 'accession', files may be overwritten: the same gene can encode more than one protein, and different proteins can have the same description.

--pseudo, --nopseudo

By default, sequences of pseudo genes will be saved. --nopseudo can be used to ignore those genes.

--save

Specifies the path of the directory where the sequence and log files will be saved. If the directory does not exist it will be created, although the path leading to it must already exist. Files in the directory may be overwritten if necessary. If unspecified, a directory named 'extracted sequences' in the current directory will be used.

--save-data

This option saves the data (gene UIDs, description, product accessions, etc.) to a file. As an optional value, the file format can be specified. Defaults to CSV.

Currently only CSV is supported.

Saving the data structure as a CSV file requires the installation of the Text::CSV module.

--transcripts, --mrna

Specifies how the transcript files are named. By default, transcript sequences are not saved. If the option is given without a value, it defaults to the transcript accession. Possible values are 'accession', 'description', 'gene' (the corresponding gene ID) and 'protein' (the protein the transcript encodes).

Note that with any value other than 'accession', files may be overwritten: the same gene can have more than one transcript, and different transcripts can have the same description. Also, non-coding transcripts will create problems when using 'protein', since they encode no protein.

--upstream, --up

Specifies the number of extra base pairs to be extracted upstream of the gene. These extra base pairs will only affect the gene sequence, not the transcripts or proteins.

--verbose, --v

If set, the program becomes verbose. For an extremely verbose program, use --very-verbose instead.

--very-verbose, --vv

If set, the program becomes extremely verbose. Setting this option automatically sets --verbose as well. For help with debugging, consider using --debug.

EXAMPLES

bp_genbank_ref_extractor --transcripts=accession '"homo sapiens"[organism] AND H2B'

Searches Entrez Gene with the query '"homo sapiens"[organism] AND H2B' and saves the transcript sequences of the hits. Note that, with the default value of --limit, only some of the hits may be extracted.

bp_genbank_ref_extractor --transcripts=accession --proteins=accession --format=fasta '"homo sapiens"[organism] AND H2B' '"homo sapiens"[organism] AND MCPH1'

Same as the first example, but also searches for '"homo sapiens"[organism] AND MCPH1', also saves the protein sequences, and saves everything in the fasta format.

bp_genbank_ref_extractor --genes --up=100 --down=500 '"homo sapiens"[organism] AND H2B'

Same search as the first example, but saves the genomic sequences instead, including 100 bp upstream and 500 bp downstream.

bp_genbank_ref_extractor --genes --assembly='Alternate HuRef' '"homo sapiens"[organism] AND H2B'

Same search as the first example, but saves the genomic sequences instead, taken from the Alternate HuRef genome assembly.

bp_genbank_ref_extractor --save-data=CSV '"homo sapiens"[organism] AND H2B'

Same search as the first example, but saves no sequences; instead, all the results are saved in a CSV file.

bp_genbank_ref_extractor --save='search results' --genes=name --upstream=200 --downstream=500 --nopseudo --nonon-coding --transcripts --proteins --format=fasta --save-data=CSV '"homo sapiens"[organism] AND H2B' '"homo sapiens"[organism] AND MCPH1'

Searches Entrez Gene for both '"homo sapiens"[organism] AND H2B' and '"homo sapiens"[organism] AND MCPH1' and saves the gene sequences of all hits (up to the default limit, and ignoring pseudogenes) plus 200 bp upstream and 500 bp downstream of them. It will also save the sequences of all transcripts and proteins of each gene (ignoring non-coding transcripts). The sequences are saved in the fasta format, inside a directory named 'search results', and the results are saved in a CSV file.

BUGS

If you find any bug, or have a feature request, please report it at https://redmine.open-bio.org/projects/bioperl or by e-mail to bioperl-l@lists.open-bio.org

  • When supplying options, it's possible to omit the value and use the default. However, when the expected value is a string, the next argument may be mistaken for the option's value. For example, when using the following command:

    bp_genbank_ref_extractor --transcripts 'H2A AND homo sapiens'

    we mean to search for 'H2A AND homo sapiens', saving only the transcripts and using the default as the base for the filenames. However, the search terms will be interpreted as the base for the filenames (and, since this is not a valid identifier, the script will return an error). To prevent this, you can either specify the value explicitly:

    bp_genbank_ref_extractor --transcripts 'accession' 'H2A AND homo sapiens'

    bp_genbank_ref_extractor --transcripts='accession' 'H2A AND homo sapiens'

    or you can use the double dash to stop processing options. Note that this should only be used after the last option. All arguments supplied after the double dash will be interpreted as search terms:

    bp_genbank_ref_extractor --transcripts -- 'H2A AND homo sapiens'
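The same ambiguity arises in any option parser that allows an option's value to be omitted. The sketch below reproduces it with Python's argparse, purely as an analogy: it is not the script's own parser, whose behaviour may differ in detail, and the program name and the 'accession' fallback are assumptions chosen to mirror the example above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="extractor_sketch")
    # Value is optional: a bare --transcripts falls back to 'accession'.
    parser.add_argument("--transcripts", nargs="?", const="accession", default=None)
    # Everything that is not an option is treated as a search query.
    parser.add_argument("queries", nargs="*")
    return parser


parser = build_parser()

# Without a separator, the query is taken as the option's value
# and no search terms are left.
ambiguous = parser.parse_args(["--transcripts", "H2A AND homo sapiens"])
print(ambiguous.transcripts, ambiguous.queries)

# With the double dash, the option uses its fallback value
# and the query is kept as a search term.
explicit = parser.parse_args(["--transcripts", "--", "H2A AND homo sapiens"])
print(explicit.transcripts, explicit.queries)
```

In both parsers the rule of thumb is the same: either give every string option an explicit value, or place the double dash after the last option.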

NOTES ON USAGE

  • Genes that are marked as 'live' and 'protein-coding' should have at least one transcript. However, this is not always true due to annotation mistakes. Such cases will produce a warning. When faced with this, be nice and write to the Entrez RefSeq maintainers at http://www.ncbi.nlm.nih.gov/RefSeq/update.cgi.

  • When creating the directories to save the files, if the directory already exists it will be used, and no error or warning will be issued unless --debug has been set. If a non-directory file already exists with that name, bp_genbank_ref_extractor exits with an error.

  • On the subject of verbosity, all messages are saved in the log file. The options --verbose and --very-verbose only affect their printing to standard output. Debug messages are different, as they will only show up (and be logged) if requested with --debug.

  • When saving a file, to avoid problems with limited filesystems such as NTFS or FAT, only some characters are allowed. All other characters will be replaced by an underscore. Allowed characters are:

    a-z 0-9 - + . , () {} []'

  • bp_genbank_ref_extractor tries to use the same file extensions that bioperl would expect when saving the file. If it is unable to, it will use the '.seq' extension.
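The filename clean-up described in the notes above (replace every character outside the allowed set with an underscore) can be sketched as follows. This is a Python illustration, not the script's Perl code. It assumes that letters are matched regardless of case, and that the space and the trailing apostrophe shown in the list are themselves allowed characters; the function name is hypothetical.

```python
import re

# Allowed set taken from the list above: a-z 0-9 - + . , () {} []
# Assumptions: case-insensitive letters, plus space and apostrophe allowed.
DISALLOWED = re.compile(r"[^a-z0-9\-+ .,(){}\[\]']", re.IGNORECASE)


def clean_filename(name):
    """Replace each character outside the allowed set with an underscore."""
    return DISALLOWED.sub("_", name)


# Slash, colon and asterisk are replaced; the space is kept.
print(clean_filename("H2B/histone: family*"))
```

Each disallowed character is replaced one for one, so the cleaned name always has the same length as the original.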