Transcriptome simulation notes

Back to transcriptome assembly: resolve the issue of gene expression counts with simulated data before proceeding to the full Aphaenogaster transcriptome assembly.

The documentation for using BWA and samtools to get gene expression counts is too confusing, so I switched to the TopHat-Cufflinks protocol. This also seems to be the more widely used approach.

Note about the use of FPKM from cufflinks to identify differentially-expressed genes:

I want to find differentially expressed genes. Can I use Cufflinks in conjunction with count-based differential expression packages?

It’s possible, but we strongly advise against this. Current count-based differential expression tools are poorly suited to differential expression analysis in genomes with alternatively spliced genes. The main reason for this is that when a gene has multiple isoforms, a change in the total number of reads or fragments from that gene doesn’t always correspond to a change in expression for that gene. Conversely, a gene’s expression may change, but the total number of fragments generated by its isoforms may be very similar. In order to detect changes accurately, it’s necessary to estimate how many fragments came from each individual splice variant in each sample. Current count-based tools don’t do this (to our knowledge - please send us email if you know of one!). Even if they did, fragments that come from parts of genes that are shared by more than one splice variant can’t generally be assigned to a single isoform, so the fragment counts for each isoform are only estimates, and there is some uncertainty in the counts. Isoforms that are very similar will have a great deal of uncertainty surrounding their fragment counts. This uncertainty needs to be accounted for when testing for differential expression. So while you could use Cufflinks to estimate isoform-level counts, you’d be throwing away Cufflinks’ uncertainty, and thus have more confidence in the differences you see than you really should. This will probably lead to many false positives in your analysis. Furthermore, we do not normalize simply by the length to calculate FPKM but an effective length, as explained in our publications. Calculating counts from FPKM by multiplying by the length will give incorrect results. We strongly encourage you to consider using Cuffdiff to find differentially expressed genes and transcripts.
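To make the isoform argument in the quoted answer concrete for myself, here is a minimal sketch with made-up numbers. It assumes a toy model in which the expected number of fragments from a transcript is proportional to its abundance times its length, and it ignores the effective-length correction mentioned in the quote. The isoform lengths, molecule counts, sampling rate and the function name `expected_fragments` are all hypothetical.

```python
# Toy model: expected fragments from a transcript scale with
# (number of transcript molecules) x (transcript length).
# All lengths, abundances and the sampling rate are invented.

def expected_fragments(abundances, lengths, frags_per_kb=10):
    """Expected fragment count per isoform under the toy model."""
    return [a * length * frags_per_kb // 1000
            for a, length in zip(abundances, lengths)]

lengths = [1000, 3000]   # hypothetical short and long isoform (bp)

cond_a = [90, 10]        # 100 molecules, mostly the short isoform
cond_b = [10, 90]        # 100 molecules, mostly the long isoform
cond_c = [120, 0]        # 120 molecules, short isoform only

frags_a = sum(expected_fragments(cond_a, lengths))
frags_b = sum(expected_fragments(cond_b, lengths))
frags_c = sum(expected_fragments(cond_c, lengths))

# A vs B: same gene-level expression, different fragment totals.
print(sum(cond_a), sum(cond_b), frags_a, frags_b)
# A vs C: different gene-level expression, same fragment total.
print(sum(cond_a), sum(cond_c), frags_a, frags_c)
```

In this toy setting, conditions A and B have identical gene-level expression but very different gene-level fragment counts, while A and C differ in expression but give the same fragment total. Both are cases that a gene-level count comparison would get wrong.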

Some important issues for the planned regression-style analysis (as opposed to the standard cuffdiff comparison among samples):

1. Multiple isoforms
2. Uncertainty in fragment counts

For (1), see Anders et al. (2013).

ApTranscriptome

For fitting a function to multiple samples, follow section 8.6.2 of the limma user's guide.
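As a rough illustration of what a regression-style analysis means here, the sketch below fits an ordinary least-squares line to one gene's log2 expression across samples against a continuous covariate. This is plain OLS written out in Python, not limma's method, and it does not address either of the two issues listed above (multiple isoforms, uncertainty in fragment counts). The covariate values, the expression values and the function name `fit_line` are invented for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical covariate per sample (e.g. a treatment level) and
# hypothetical log2 expression values for a single gene.
covariate = [0.0, 1.0, 2.0, 3.0, 4.0]
log2_expr = [5.0, 5.5, 6.0, 6.5, 7.0]

intercept, slope = fit_line(covariate, log2_expr)
print(intercept, slope)
```

The slope is the estimated change in log2 expression per unit of the covariate. A real analysis would repeat the fit for every gene and would need a proper test statistic, which is what the limma workflow is meant to provide.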

emacs

Set up Emacs Speaks Shell for ESS-like functionality with shell commands in the terminal.

R

Hadley Wickham’s R curriculum

And a guide to software development for non-programmers

Website

Update to website

Reading

Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W. & Robinson, M.D. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8, 1765–1786.