JV at VGN suggested:
If you already have the NR database locally you can save a lot of time by creating database aliases for the various organisms you listed, instead of extracting sequences and reformatting.
The blastd_aliastool lets you create an alias for a subset of any database (like NR) by giving a list of GI numbers. You can easily obtain the GI numbers in list format from NCBI ENntrez.
After some searching, found clear example of how to do this on biostars (of course).
Advantages of yesterday’s method:
Approach:
Download pre-formatted nr databases using update_blastdb.pl, unzip. [1]
perl update_blastdb.pl nr # unzip each volume gunzip nr.01.tar.gz # and so forth
Use the list of Insecta GIs from the previous step with the blastdb_aliastool
to build an aliased blastdb of just insects:
blastdb_aliastool -gilist insecta.gi_list.txt -db nr -out nr_insecta -title “Insecta nr database”
which gives output:
Converted 1749040 GIs from insecta_gi.txt to binary format in nr_insecta.p.gil Created protein BLAST (alias) database nr_insecta with 838606 sequences
Compare this to 35,099,569 sequences in the full database!!!
Test search against new (aliased) database:
blastx -query query.fa -db nr_insecta
[1] First, I tried using my own formatted database generated by makeblastdb -in nr -dbtype prot -out nrdb
but this failed with the error message: BLAST Database error: BLASTDB alias file creation failed. Some referenced files may be missing
. After too much googling, discovered this was because I needed to specify -parse_seqids
when making the database so that the ‘gi’ references would be available. So, the alternative at this point would be too (1) download the [nr fasta] sequences, make blast database:
makeblastdb -in nr -dbtype prot -out insectanrdb -parse_seqids
and then proceed as above.
To run in parallel, brute forced 18 PBS shell scripts to run
blastx -query /N/u/tg-johnsg/Mason/ApTranscriptome/results/AphVT_sub18.fasta -db nr_insecta -out AphVTsub18_blastx2nrinsecta -outfmt 5 -evalue 0.0001 -gapopen 11 -gapextend 1 -word_size 3 -matrix BLOSUM62 -max_target_seqs 20 -num_threads 30
As mason has 32 cores per node, I set -num_threads
to 30.
for each subset of assembly.
Will merge and proceed to GO next.
Interesting collection of bioinformatics tools - biopieces
This work is licensed under a Creative Commons Attribution 4.0 International License.