To select the best enzyme combination for double-digest RADseq (ddRAD), we need to estimate the number of fragments generated by digestion with each pair of restriction enzymes. As an Aphaenogaster genome isn’t available, we use Pogonomyrmex barbatus as the closest reference. Note that Table 1 of the ddRADseq paper also reports simulated fragment recovery for Solenopsis invicta, Apis mellifera and Drosophila melanogaster.
Method: in silico digest using code suggested on SEQanswers:
cat pbar_scaffolds_v03.fasta | tr -d "\n" | grep -o -E "CCGG.{264,336}CCGG" | wc -l
where `pbar_scaffolds_v03.fasta` is the target genome FASTA file, and the first and second `CCGG` are replaced by the forward and reverse restriction enzyme recognition sequences, respectively. The fragment size range `{264,336}` (300 ± 36 bp) is based on the ‘wide’ size selection simulation of Peterson et al. (2012).
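The one-liner can be wrapped in a small shell function so that the enzyme pair and size window become arguments. This is only a sketch: `count_fragments` is a hypothetical helper name, and unlike the original command it drops the FASTA header lines and keeps each scaffold on its own line, so a match cannot span two scaffolds. Because `grep -o` reports non-overlapping matches, the result is best read as a rough estimate and not an exact digest.

```shell
# count_fragments FASTA FORWARD_SITE REVERSE_SITE [MIN] [MAX]
# Hypothetical helper (not from the original post). MIN and MAX default
# to the 'wide' window used above (264-336 bp between the two sites).
count_fragments() (
    fasta=$1; fwd=$2; rev=$3; min=${4:-264}; max=${5:-336}
    # Join wrapped sequence lines so that each scaffold is one line,
    # and drop the '>' header lines.
    awk '/^>/ { if (seq != "") print seq; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$fasta" |
        # grep exits non-zero when nothing matches; "|| true" keeps the
        # pipeline usable under "set -e -o pipefail".
        { grep -o -E "${fwd}.{${min},${max}}${rev}" || true; } |
        wc -l | tr -d ' '
)

# Example call (file name taken from the command above):
#   count_fragments pbar_scaffolds_v03.fasta GAATTC CCGG
```

The subshell body `( ... )` keeps the variables local without relying on `local`, which is not part of POSIX sh.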
Restriction Enzyme | Recognition sequence | Adapter |
---|---|---|
NlaIII | 5’CATG | P1-flex |
SphI | 5’GCATGC | P1-flex |
MluCI | 5’AATT | P2-flex |
EcoRI | 5’GAATTC | P2-flex |
MspI | 5’CCGG | specific |
SbfI | 5’CCTGCAGG | specific |
Number of fragments in P. barbatus and S. invicta for different double-digest combinations with a size window of 300 ± 36 bp. Combinations compatible with flex adapters are italicized.
Forward enzyme | Reverse enzyme | P. barbatus | S. invicta |
---|---|---|---|
SbfI | EcoRI | 23 | 20 |
SphI | EcoRI | 784 | 913 |
EcoRI | MspI | 7,866 | 9,114 |
NlaIII | EcoRI | 12,032 | 14,354 |
SphI | MluCI | 18,193 | 23,738 |
NlaIII | MluCI | 210,506 | 285,540 |
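
A table like this could be produced by looping over the enzyme pairs instead of editing the command by hand. The following is a minimal sketch, assuming the recognition sequences listed in the first table; `tabulate_digests` is a hypothetical name, and the counts it prints depend entirely on the FASTA file supplied.

```shell
# tabulate_digests FASTA [MIN] [MAX]
# Hypothetical helper: prints "forward<TAB>reverse<TAB>count" for each
# enzyme pair in the table above, using the same grep-based estimate.
tabulate_digests() (
    fasta=$1; min=${2:-264}; max=${3:-336}
    while read -r fwd_name fwd_site rev_name rev_site; do
        # One scaffold per line, headers dropped, then count matches.
        n=$(awk '/^>/ { if (seq != "") print seq; seq = ""; next }
                 { seq = seq $0 }
                 END { if (seq != "") print seq }' "$fasta" |
            { grep -o -E "${fwd_site}.{${min},${max}}${rev_site}" || true; } |
            wc -l | tr -d ' ')
        printf '%s\t%s\t%s\n' "$fwd_name" "$rev_name" "$n"
    done <<'EOF'
SbfI CCTGCAGG EcoRI GAATTC
SphI GCATGC EcoRI GAATTC
EcoRI GAATTC MspI CCGG
NlaIII CATG EcoRI GAATTC
SphI GCATGC MluCI AATT
NlaIII CATG MluCI AATT
EOF
)

# Example call (file name taken from the command above):
#   tabulate_digests pbar_scaffolds_v03.fasta
```

The genome is re-read once per pair, which is slow but keeps the sketch close to the original one-liner.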
Odd note: compared with the fragment estimates in Table 1 of Peterson et al. (2012), my estimate for SphI - MluCI is 10 times greater, and my estimate for NlaIII - MluCI is 20 times greater. It is unclear where the difference arises, since my estimates are comparable for the other combinations, and I can’t find anything in their Methods to explain their numbers.
EDIT: SCH pointed out that AATT is often repeated. I checked the methods of the paper, and they specify that repeats were masked for the simulations with the mouse genome. They likely did the same with the other genomes, so the discrepancy in numbers is probably due to repeats.
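One way to check the repeat explanation would be to count only fragments that lie entirely outside masked sequence. The sketch below assumes a soft-masked copy of the genome, with repeats written in lower case (for example, as produced by RepeatMasker with `-xsmall`). That is an assumption about the input, not something known about `pbar_scaffolds_v03.fasta`, and `count_unmasked_fragments` is a hypothetical name.

```shell
# count_unmasked_fragments FASTA FORWARD_SITE REVERSE_SITE [MIN] [MAX]
# Hypothetical helper. Assumes repeats are soft-masked (lower case).
# grep is case-sensitive by default, so upper-case sites and an
# [ACGT]-only interior exclude any fragment touching masked bases.
count_unmasked_fragments() (
    fasta=$1; fwd=$2; rev=$3; min=${4:-264}; max=${5:-336}
    awk '/^>/ { if (seq != "") print seq; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$fasta" |
        { grep -o -E "${fwd}[ACGT]{${min},${max}}${rev}" || true; } |
        wc -l | tr -d ' '
)

# Example call (masked file name is a placeholder):
#   count_unmasked_fragments masked_genome.fasta CATG AATT
```

If the masked count for the MluCI combinations falls much closer to the published values, that would support the repeat explanation. Note that fragments containing `N` are also excluded by this pattern.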
EDIT: See … post for results of empirical digestion.
This work is licensed under a Creative Commons Attribution 4.0 International License.