09 October 2013

transcriptome, cufflinks, TPM, BWA, and expression

Next post Previous post

More transcriptome simulation notes

Awesome - found function read.fasta in RFLPtools package that reads BLAST results! Just had to change the blast command to include -outfmt 6 which specifies results in tabular format, -outfmt 10 for comma-separated format.

blastn -query testquery -df testdatabase -outfmt 6 -out out.txt

Then read into R using

test.res <- read.blast(file="out.txt")

Wrote scripts sim-assemly-eval.R that takes original fasta sequence and blast results, reporting the number of genes captured in assembly and compares average length of captured versus missing genes. Result:

Genes Count Average length (base pairs)
Starting 100 1094.11
Captured 86 1195.5
Missing 14 471.2857

For my simulation, 86 of the known genes were captured in assembly, with the missing genes being those that were less than 500 bp in length.

Oases actually didn’t do quite this good, reporting 76 Loci and a total of 203 transcripts ranging from 1 to 10 per locus, so it associated some transcipts that should have been seperate genes. For gene expression analyses, this shouldn’t be a problem as expression will be done on a per-transcript basis.

Should also compare length of assembled versus known gene.

Here’s where things get confusing. When I use the TopHat-Cufflinks pipeline for gene discovery and gene expression analysis, I only get 29 genes and/or transcripts!

Re-reading the Cufflinks paper

Users performing expression analysis with a de novo transcriptome assembly may wish to consider RSEM or IoSEM.

So, cufflinks may not be the right method here!

Reverting to BWA approach - this time calcuate Transcript per million (Wagner et al. 2012) from raw count data.

Next, try Sailfish for alignment-free expression counts…

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 4.0 International License.