26 July 2013
Aphaenogaster, RNAseq, transcriptome, FLASH, digital normalization, and diginorm
Next post Previous postConfirmed yesterday that running velvet (or Trinity) for transcriptome assembly will more RAM than available on antlab. Initiated process to get server time on mason (512GB nodes!).
While waiting, started looking into alternate methods and assemblers. Two promising approaches to reduce data:
I thought that FLASH would not work to merge reads because insert size is 200 bp and reads are only 100 bp, so after filtering there may not be enough overlap. Example uses 175 bp inserts. However, as read length is normally-distributed (see FLASH paper) many inserts will be less than 200 bp and may merge so worth a try to reduce the number of reads in data set. Unmerged reads will simply be returned as is.
Awesome! For first sample, 67% of reads were merged after running FLASH:
Sample | Total Reads | Merged | Unmerged |
---|---|---|---|
A22-00-R1 | 16,351,272 | 11,060,076 | 5,291,196 |
And a histogram of the distribution of length of extended fragments.
Note that though the library was size selected to 200bp, the extended fragments have a mode length around 160, probably due to some trimming, but also must be enough overlap to allow combining.
Ran for all samples in ~ 1.5 hrs. 64-70% of reads merged - huge reduction in total file size for assembly!
After merging reads, perform digital normalization to discard redundant data (lower memory for assembly) and error removal.
Following the diginorm tutoral guidelines. Had to make changes as I am not doing this on EC2.
export PYTHONPATH=/opt/software/khmer/python
-n 4 -x 4e9
==> 1.6e10 ==> 16GB. ‘Rules of thumb’ suggested this adequate for up to a billion mRNAseq reads-n 4 -x 2e9
as each of my samples has far fewer than a billion reads-p
option keeps or rejects both reads based on whether either one is novel; though nobody has really explored the effects of this on assembly.Set up script to do this and started running…
This work is licensed under a Creative Commons Attribution 4.0 International License.