05 September 2014
VACC, Docker, STRUCTURE, and haplotype clustering
Got data in STRUCTURE format!
To get STRUCTURE to run, I had to set POPFLAG, USEPOPINFO, and LOCISPOP to 0.
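For reference, a sketch of what those settings might look like in STRUCTURE's parameter files (which of mainparams/extraparams each line lives in depends on the front end; values as described above):

```
#define POPFLAG     0   // data file has no PopFlag column
#define USEPOPINFO  0   // do not use prior population information
#define LOCISPOP    0   // do not treat sampling location as population
```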
Trained Laurel on ant room.
Workstation still in shop. The major bottleneck in proceeding with analysis is the (re-)installation of software. This is motivation to look into using Docker to build a reproducible computing image, so that not only the analysis but also the computational set-up is reproducible.
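A minimal sketch of what such an image might look like (base image and package names are hypothetical, not a tested recipe):

```dockerfile
# hypothetical reproducible-analysis image, 2014-era Docker syntax
FROM ubuntu:14.04
# package names are placeholders for whatever the analysis needs
RUN apt-get update && apt-get install -y \
    r-base \
    openjdk-7-jre-headless
# bake the analysis scripts into the image
COPY analysis/ /opt/analysis/
```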
Not so easy - Docker is only supported on 64-bit architectures, though there are steps for getting Docker to work on a 32-bit architecture.
Random awesomeness: iPipet - benchtop tool to track the transfer of samples and reagents using a tablet
Beginning to move analyses to the VACC. The cluster runs Red Hat Enterprise Linux 5 and much of the software is badly out of date. Had to install new versions of emacs, Java, and R.
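Without root access, these installs go into a user-writable prefix; a minimal sketch of the pattern (the $HOME/local path and the emacs example are assumptions, not the exact commands used):

```shell
# install software under $HOME/local instead of /usr (no root needed)
mkdir -p "$HOME/local/bin"
# typical from-source pattern, e.g. for emacs (version hypothetical):
#   ./configure --prefix="$HOME/local" && make && make install
# make the locally installed binaries take precedence
export PATH="$HOME/local/bin:$PATH"
# the first PATH entry is now the local bin directory
echo "$PATH" | cut -d: -f1
```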
As test case, doing MtGIRAFFE bioinformatics. Installed bcftools, beagle, etc.
It is 64-bit architecture so I hope to get Docker running, but Docker is only supported on RHEL 6+, so it may not work.
Also trying to work on Amazon Cloud.
Considering using BioBrew for computational setup.
Going on VACC - got necessary programs installed.
After haplotype inference with Beagle, found 104 unique haplotypes out of the 524 total haplotypes (2 × 262 accessions). Given the inbred nature of the accessions, I was surprised that so many accessions carried two different haplotypes - maybe imputation isn’t appropriate? I also checked the number of unique haplotypes using only the first or the second chromosome copy from each accession, with 94 and 96 unique haplotypes, respectively. So, either way, there are nearly 100 haplotypes in the 262 accessions. What is the best method to further reduce/cluster these?
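The unique-haplotype count itself is a one-liner once each chromosome copy is written as one string per line (haps.txt and the toy sequences below are stand-ins for the real Beagle-phased output):

```shell
# toy data: one haplotype string per chromosome copy
printf 'AGCT\nAGCT\nAGTT\nTGCT\nAGTT\n' > haps.txt
# total chromosome copies
wc -l < haps.txt
# unique haplotypes among them
sort -u haps.txt | wc -l
```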
As a first attempt, using Instruct.
Trying to install pandoc for rmarkdown on the VACC. Have to install locally, as I don’t have root access so can’t use the system package manager.
Followed these steps to convert the rpm to a cpio archive
rpm2cpio pandoc-184.108.40.206-1.x86_64.rpm | cpio -idv
and this seemed to work, as a new directory tree was created with /usr/bin/pandoc. But trying to run pandoc gave this error:
./pandoc: error while loading shared libraries: libffi.so.5: cannot open shared object file: No such file or directory
At this point…I quit.
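For the record, a common workaround for this class of missing-shared-library error (not tried here) is to extract the needed library the same rpm2cpio way into a local directory and point LD_LIBRARY_PATH at it; the $HOME/local/lib path below is an assumption:

```shell
# suppose libffi.so.5 had been extracted into $HOME/local/lib
mkdir -p "$HOME/local/lib"
# tell the dynamic linker to search the local directory first
export LD_LIBRARY_PATH="$HOME/local/lib:$LD_LIBRARY_PATH"
# confirm the local lib dir is first in the search path
echo "$LD_LIBRARY_PATH" | cut -d: -f1
```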
InStruct ran successfully! In a test of 1 to 10 clusters, 10 was the optimal number. That said…I’m starting to think this is the wrong approach. STRUCTURE-like programs are for clustering individuals using unlinked molecular markers; these SNPs are all from the same gene and so are highly linked.
While BEAGLE can be used for clustering, the clusters are an intermediate step for association mapping and are not easily viewed. Some Google-fu turned up the Haplosuite R code, which (maybe) gets around these issues (Teo & Small 2010).
This work is licensed under a Creative Commons Attribution 4.0 International License.