08 April 2013

Next post Previous post

# Bioinformatics Monday

Ran seqtk for quality filtering over the weekend. While the program appeared to run fine, the output files were always 0 lines. Checked that grooming worked correctly:

In Galaxy, the files go from the original Illumina 1.3 format

@HWI-ST261:8:1:1222:2141#CGATGT/2
ATGCCGTGCTCGCGAACGAGCCGACTGGCGCGATTCAGCCACGTGCCAGTCAGGACACGCGTATGTGGACGGTCTGCGCAGAGCGTGTGTGTACGCGTAT
+
f_ffUdcdBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST261:8:1:1418:2082#CGATGT/2
ATCGCCGCATGCGAGGCGGAGAGGCTGGATGTGCAGGGGCAGACGGCGGCGGGTATACAGTTGCGCAACGAGCTCGTCTCGAGTGTCGCACGCGACACGA
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBB

and after grooming to fastq-sanger

@HWI-ST261:8:1:1222:2141#CGATGT/2
ATGCCGTGCTCGCGAACGAGCCGACTGGCGCGATTCAGCCACGTGCCAGTCAGGACACGCGTATGTGGACGGTCTGCGCAGAGCGTGTGTGTACGCGTAT
+
G@GG6EDE############################################################################################
@HWI-ST261:8:1:1418:2082#CGATGT/2
ATCGCCGCATGCGAGGCGGAGAGGCTGGATGTGCAGGGGCAGACGGCGGCGGGTATACAGTTGCGCAACGAGCTCGTCTCGAGTGTCGCACGCGACACGA
+
##########################

and confirmed that the BioPython write function did the same.

Spot checked that this is the correct fastq conversion. Illumina 1.3 encoding: 38=f, 31=_, 2=B, Sanger/Illumina 1.8: 38=G, 31=@, 3=#

NOTE - while checking this, found that I used the files in the directory: /home/tiffinp/stanton1/bmgc_incoming/110527_SN261_0347_B81JM9ABXX_L8/fastq_flt However, there were 3 directories provided by BMGC - fastq - fastq_flt - fastq_flt_syn

and they differ in file size. Based on file size, it appears that the files imported into Galaxy are the ‘fastq_flt’ files, so I should probably restart my workflow from these when all is running. However, the ‘fastq_flt_syn’ files are only marginally smaller and look the same in spot-checking, so probably another filter was applied so I will use these files.

Used seqtk to trim files. NOTE that no reads are removed by seqtk. I found the answer to this buried in a biostars post:

I have just added quality based trimming in seqtk. It uses the Phred algorithm to trim both ends. If after trimming, the read length is below 30 (by default), it tries to find a 30bp window with the highest sum of quality.

So, a second-step should be removing any files with length < 30 bp. Was also able to do this using seqtk

seqtk seq -L 31 in.fastq > out.fastq

Worked perfectly. As I had previously found from Galaxy, about 10 million more R2 reads are removed as they had high quality for only ~60 bp.

IC1 R1 55,277,504 54,330,756
IC1 R2 55,277,504 41,378,320
ENC4 R1 53,690,116 52,845,976
ENC4 R2 53,690,116 41,618,276
EWC3 R1 50,995,944 50,161,016
EWC3 R2 50,995,944 39,365,136

NOTE that this results in unpaired reads. Script available as part of velvet that removes singleton reads: https://github.com/dzerbino/velvet/blob/master/contrib/select_paired/select_paired.pl

## Lab Notebook

Checked off two items on website TODO list!

• set background image
• how to properly display tags

For tags, my question on stackoverflow was answered!

Useful reference for jekyll: https://github.com/mojombo/jekyll/wiki/Liquid-Extensions

For background image, was almost there before…just had to change css to

body {
background-image: url('/assets/img/natural_paper/natural_paper.png');
background-repeat: repeat;
}