custom-refseq

Set of scripts to download the most recent version of NCBI's RefSeq database, parse redundancy, and reformat with custom headings (see Details below).

Designed to be run on Purdue's Research Computing Community Clusters

To Run

Create a working copy of the custom-refseq directory in your scratch directory.

cp -r custom-refseq $SCRATCH
cd $SCRATCH/custom-refseq

Edit the User Specified Variables section of run_get_refseqdb.sh as needed. If you are a member of the jwisecav-apps group on Purdue clusters, you do not need to changes these variables because they are are specified automatically in our shared .bashrc file.

#### User specified variables
SCRIPTSDIR=$CUSTOM_REFSEQ_SCRIPTS_DIR
MMETSPDIR=$MMETSP_PATH
CDHIT='/group/bioinfo/apps/apps/cd-hit-v4.6.5-2016-0304/cd-hit -M 0'
SFETCH='esl-sfetch'
DIAMOND='diamond'
CDHIT_THRESHOLD=0.95

Submit the run_get_refseqdb.sh script to the PBS queue. Specify your email and queue using -M and -q.

qsub -M your_email@email.com -q your_queue run_get_refseqdb.sh

Output

refseq.fa: Fasta formatted refseq database

refseq.fa.log: Log file with information on which accessions were not included in the final database. The most common reason a sequence is not included is due to redundancy at the species level.

refseq.fa.ssi: SSI index for quickly extracting sequences from the database.

refseq.dmnd: Pre-formatted database for use with Diamond Blast.

refseq.stats.txt: Stats file with information on the number of unique sequences present in each species in the database.

Details

Dealing with redundancy

The way this custom database handles redudancy is different from the standard RefSeq. Redundacy is parsed based on species taxonomy ID (taxid). NOTE: In cases where a sequence is not associated with a species taxid, the genus taxid is used. If a sequence is not associated with either a genus or species taxid, the sequence is skipped.

Therefore, here:

MULTISPECIES proteins (i.e. identical sequences assigned a shared accession number and present in multiple species) will be represented multiple times, once per unique taxid.
Identical sequences with different accession numbers but associated with a single taxid will be collapsed and represented only once.

Custom header

Sequence headers are formatted to allow easy parsing based on taxid as well as provide informative species and lineage information to the user. Each header contains the NCBI protein accession, taxid, genus/species name, and lineage information ('-' deliminted). An example is provided below.

>YP_009335880.1-334597-Yucca_schidigera-monocots

The list of possible lineages is provided in lineages.dmp and can be customized to fit the individual needs of a project. Keep in mind that the lineages are parsed hierarchically from most to least specific. So for the example above, if 'Viridiplantae' and 'monocots' are both present in lineages.dmp then sequences from Yucca schidigera will be labeled 'monocots'