Back to real data from simulation. Include 100 Arabidopsis mRNA transcripts in transcriptome assembly pipeline and use these at end to evaluate quality of assembly. Simulate expression levels using rlsim
. Preliminary assembly of transcriptome resulted in 90,612 contigs with total length of 86,485,229 bp. Started with ~140e6 2x100 reads.
So the approximate coverage of real transcripts is (140e6 reads * 200 bp) / (8e7 total length) = r (140e6 * 200) / (8e7)
.
Thus, for 100 known transcripts of about 1000bp length, generate 100 mRNA * 1000bp * 350 coverage / 200 bp per fragment = r (100 * 1000 * 350)/200
Maybe an alternative to CD-HIT is to build transcript families.
Sailfish working on simulated data.
Correlation between TPM from sailfish and known expression levels = 0.976.
So, as promised, equivalent to cufflinks and much faster and easier to use!
Set up github repository in GotelliLab organization
PeerJ csl style for zotero
Brent Sitterly @ Dealer demoed shiny for R web apps. Looks pretty awesome. Already imagining an online supplement to paper that allows users to select pathways of interest to highlight.
High performance computing in R to get around limits of having to load all data in memory.
Re-ran data pre-processing for velvet-oases assembly. Workstation shut-down (WTF??) so had to restart diginorm which is the slowest part of pipeline (~6 hrs).
Started velvet-oases assembly and wrote script for Trinity in comparison.
Course idea: Reproducible research!
Completed velvet-oases assembly. In log file
Extracting loci from connection graph...
[282.496250] Counted 55229 mRNA loci
Transitive reduction of graph
and assemstats:
Assembly stat | Result |
---|---|
Total Trimmed Contigs | 104,783 |
Total Length | 99,167,303 |
Min contig size | 100 |
Median contig size | 442 |
Mean contig size | 946 |
Max contig size | 18,563 |
N50 Contig | 14,654 |
N50 Length | 1,986 |
N90 Contig | 56,313 |
N90 Length | 386 |
After clustering contigs using CAP3 (see below)
Assembly stat | Result |
---|---|
Total Contigs | 11,926 |
Total Trimmed Contigs | 11,926 |
Total Length | 17,556,252 |
Min contig size | 101 |
Median contig size | 979 |
Mean contig size | 1,472 |
Max contig size | 18,839 |
N50 Contig | 2,212 |
N50 Length | 2,471 |
N90 Contig | 7,147 |
N90 Length | 709 |
Wow. A ten-fold reduction down to 11,926 transcripts from 104,782 after oases assembly. Given my expectations for the genome of 17,000 genes with a median length of 1,000 bp = 170,000 the total length is awesomely close so I’m thinking this is the final assembly.
Of course, this assembly is probably too simplistic in that I’ve removed all isoforms. Probably should examine gene expression for both.
Turns out this is wrong - I was only looking at the ‘contigs’ output. There are about 62,018 singlets that were not merged to contigs, for a total of 73,944 transcripts.
Started script for mapping reads using sailfish..
So 55,229 loci with about 2 isoforms each (104,783 contigs)
BLAST against known simulated transcripts to evaluate assembly
Cluster loci using CAP3 or CD-HIT.
Similar question on seqanswers about how to deal with redundant contigs and using CAP3 or CD-HIT is also recommended.
Some nice explanations of how CD-HIT works and comparison of results using CAP3 and CD-HIT
Basically, CD-HIT
Problems are assumption that longest sequence is best representative and short word (kmer) filter.
CAP3 - looks for both overlapping sequences as well as shorter sequences that are completely within the longer sequences
TGICL is a wrapper to megablast and CAP3 which may be another option, but is poorly documented and doesn’t install correctly so I will skip.
CONSED for viewing assemblies
Hornett, E.A. & Wheat, C.W. (2012). Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species. BMC Genomics, 13, 361.
This work is licensed under a Creative Commons Attribution 4.0 International License.