Ran seqtk for quality filtering over the weekend. While the program appeared to run fine, the output files were always 0 lines. Checked that grooming worked correctly:
In Galaxy, the files go from the original Illumina 1.3 format
@HWI-ST261:8:1:1222:2141#CGATGT/2
ATGCCGTGCTCGCGAACGAGCCGACTGGCGCGATTCAGCCACGTGCCAGTCAGGACACGCGTATGTGGACGGTCTGCGCAGAGCGTGTGTGTACGCGTAT
+
f_ffUdcdBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST261:8:1:1418:2082#CGATGT/2
ATCGCCGCATGCGAGGCGGAGAGGCTGGATGTGCAGGGGCAGACGGCGGCGGGTATACAGTTGCGCAACGAGCTCGTCTCGAGTGTCGCACGCGACACGA
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBB
and after grooming to fastq-sanger
@HWI-ST261:8:1:1222:2141#CGATGT/2
ATGCCGTGCTCGCGAACGAGCCGACTGGCGCGATTCAGCCACGTGCCAGTCAGGACACGCGTATGTGGACGGTCTGCGCAGAGCGTGTGTGTACGCGTAT
+
G@GG6EDE############################################################################################
@HWI-ST261:8:1:1418:2082#CGATGT/2
ATCGCCGCATGCGAGGCGGAGAGGCTGGATGTGCAGGGGCAGACGGCGGCGGGTATACAGTTGCGCAACGAGCTCGTCTCGAGTGTCGCACGCGACACGA
+
##########################
and confirmed that the BioPython write function did the same.
Spot checked that this is the correct fastq conversion. Illumina 1.3 encoding: 38=f, 31=_, 2=B, Sanger/Illumina 1.8: 38=G, 31=@, 3=#
NOTE - while checking this, found that I used the files in the directory: /home/tiffinp/stanton1/bmgc_incoming/110527_SN261_0347_B81JM9ABXX_L8/fastq_flt However, there were 3 directories provided by BMGC - fastq - fastq_flt - fastq_flt_syn
and they differ in file size. Based on file size, it appears that the files imported into Galaxy are the ‘fastq_flt’ files, so I should probably restart my workflow from these when all is running. However, the ‘fastq_flt_syn’ files are only marginally smaller and look the same in spot-checking, so probably another filter was applied so I will use these files.
Used seqtk to trim files. NOTE that no reads are removed by seqtk. I found the answer to this buried in a biostars post:
I have just added quality based trimming in seqtk. It uses the Phred algorithm to trim both ends. If after trimming, the read length is below 30 (by default), it tries to find a 30bp window with the highest sum of quality.
So, a second-step should be removing any files with length < 30 bp. Was also able to do this using seqtk
seqtk seq -L 31 in.fastq > out.fastq
Worked perfectly. As I had previously found from Galaxy, about 10 million more R2 reads are removed as they had high quality for only ~60 bp.
Sample | Direction | Num reads | Reads after trimming |
---|---|---|---|
IC1 | R1 | 55,277,504 | 54,330,756 |
IC1 | R2 | 55,277,504 | 41,378,320 |
ENC4 | R1 | 53,690,116 | 52,845,976 |
ENC4 | R2 | 53,690,116 | 41,618,276 |
EWC3 | R1 | 50,995,944 | 50,161,016 |
EWC3 | R2 | 50,995,944 | 39,365,136 |
NOTE that this results in unpaired reads. Script available as part of velvet that removes singleton reads: https://github.com/dzerbino/velvet/blob/master/contrib/select_paired/select_paired.pl
Checked off two items on website TODO list!
For tags, my question on stackoverflow was answered!
Useful reference for jekyll: https://github.com/mojombo/jekyll/wiki/Liquid-Extensions
For background image, was almost there before…just had to change css to
body {
background-image: url('/assets/img/natural_paper/natural_paper.png');
background-repeat: repeat;
}
This work is licensed under a Creative Commons Attribution 4.0 International License.