BLAST notes

ApTranscriptome

JV at VGN suggested:

If you already have the NR database locally you can save a lot of time by creating database aliases for the various organisms you listed, instead of extracting sequences and reformatting.

The blastd_aliastool lets you create an alias for a subset of any database (like NR) by giving a list of GI numbers. You can easily obtain the GI numbers in list format from NCBI ENntrez.

After some searching, found clear example of how to do this on biostars (of course).

Advantages of yesterday’s method:

validated approach (my python script may have unknown bugs…)
can use Insecta taxon id (taxid) so I get all insect hits instead of simply the ones I provide

Approach:

Download pre-formatted nr databases using update_blastdb.pl, unzip. [1]

perl update_blastdb.pl nr # unzip each volume gunzip nr.01.tar.gz # and so forth
Find NCBI taxid for Insecta: 50557
Search the Entrez Protein database with query: “txid50557[ORGN]”
Select “Send to File” and choose format “GI list”
- generates list of 1,749,040 unique gi ids
Use the list of Insecta GIs from the previous step with the blastdb_aliastool to build an aliased blastdb of just insects:

blastdb_aliastool -gilist insecta.gi_list.txt -db nr -out nr_insecta -title “Insecta nr database”

which gives output:

Converted 1749040 GIs from insecta_gi.txt to binary format in nr_insecta.p.gil Created protein BLAST (alias) database nr_insecta with 838606 sequences

Compare this to 35,099,569 sequences in the full database!!!

Test search against new (aliased) database:

blastx -query query.fa -db nr_insecta

[1] First, I tried using my own formatted database generated by makeblastdb -in nr -dbtype prot -out nrdb but this failed with the error message: BLAST Database error: BLASTDB alias file creation failed. Some referenced files may be missing. After too much googling, discovered this was because I needed to specify -parse_seqids when making the database so that the ‘gi’ references would be available. So, the alternative at this point would be too (1) download the [nr fasta] sequences, make blast database:

makeblastdb -in nr -dbtype prot -out insectanrdb -parse_seqids

and then proceed as above.

To run in parallel, brute forced 18 PBS shell scripts to run

blastx -query /N/u/tg-johnsg/Mason/ApTranscriptome/results/AphVT_sub18.fasta -db nr_insecta -out AphVTsub18_blastx2nrinsecta -outfmt 5 -evalue 0.0001 -gapopen 11 -gapextend 1 -word_size 3 -matrix BLOSUM62 -max_target_seqs 20 -num_threads 30

As mason has 32 cores per node, I set -num_threads to 30.

for each subset of assembly.

Will merge and proceed to GO next.

Computing

Interesting collection of bioinformatics tools - biopieces