You are a new student at Xavier’s School for Gifted Youngsters, beginning a new research project using genetic data from the 1000 Genomes Project. The new hot question at Xavier’s School is how to uncover the genetic basis of superpowers. You’ve been told that if we can collect enough genetic data about people with superpowers and people without, we might be able to find the parts of the genome that are responsible for those superpowers. The statistics of how this is done can be fuzzy for now; we’ll explain them later.
Genetic data for a population is often stored in a specific kind of file you’ve never seen before. Visit Wikipedia’s entry for the “Variant Call Format” (VCF) file specification. Read through the description of the format, focusing on the column names and descriptions. Ignore the Common INFO fields section.
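To make the column layout concrete, here is a minimal Python sketch that splits one VCF data line into its eight fixed columns. The helper name `parse_vcf_line` and the example line are made up for illustration; they are not part of the course materials or of any VCF library.

```python
# The eight fixed, tab-delimited columns of a VCF data line, in order.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict keyed by column name (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # ALT may hold several comma-separated alternate alleles.
    record["ALT"] = record["ALT"].split(",")
    # Anything after the eighth column is FORMAT plus the per-sample genotype columns.
    record["SAMPLES"] = fields[8:]
    return record


# A made-up data line, not taken from the practice data.
example = "MT\t73\t.\tA\tG\t100\tPASS\t.\tGT\t1\t0"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
```

Lines that start with `#` are header lines rather than data lines, so a sketch like this would skip them before parsing.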
Your professor, Xavier, tells you that the data needs to be cleaned up to make analysis easier. He says that the data should have indels removed so that only SNPs remain. Further, multiallelic SNPs can be problematic, so the data should be pruned to contain only biallelic SNPs. You recognize these words and nod, even though you may not be 100% sure what they mean together. You open your trusty web browser to quickly look up the differences.
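If it helps to pin the vocabulary down, the sketch below labels a site from its REF and ALT columns. It is a simplification under stated assumptions: a site counts as a SNP only when REF and every ALT allele are single bases, as an indel when any ALT allele differs in length from REF, and as "other" otherwise (symbolic alleles and multi-base substitutions land there). The function name `classify_site` is hypothetical.

```python
BASES = {"A", "C", "G", "T"}


def classify_site(ref, alt):
    """Return (allele-count label, variant-type label) for one site (hypothetical helper)."""
    alts = alt.split(",")
    # One ALT allele plus the reference allele gives two alleles in total: biallelic.
    allelic = "biallelic" if len(alts) == 1 else "multiallelic"
    if ref in BASES and all(a in BASES for a in alts):
        kind = "SNP"
    elif any(len(a) != len(ref) for a in alts):
        kind = "indel"
    else:
        kind = "other"
    return allelic, kind


print(classify_site("A", "G"))    # one single-base ALT allele
print(classify_site("A", "G,T"))  # two single-base ALT alleles
print(classify_site("AT", "A"))   # ALT is shorter than REF
```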
Unfortunately, there’s a backlog at the DNA sequencing facility, so you can’t do your full data cleanup yet. To make good use of your time, you decide to begin testing the cleanup steps on a set of practice data, compressed into the file tinyMT.vcf.gz. After decompressing the file, you set off to remove any multiallelic sites using regular expressions.
tiny.biallelic.vcf
tinyMT.biallelic.indel.vcf
tinyMT.biallelic.vcf
and save the output to tinyMT.biallelic.SNP.vcf
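One way the two cleanup steps might look with regular expressions is sketched below in Python. The patterns assume tab-delimited data lines with REF in the fourth column and ALT in the fifth, and they keep every header line (anything starting with `#`). The function names and the demo lines are invented for illustration; the course may expect a different tool or pattern.

```python
import re

# A data line is multiallelic if its fifth column (ALT) contains a comma.
MULTIALLELIC = re.compile(r"^(?:[^\t]*\t){4}[^\t]*,")
# A biallelic SNP has exactly one base in REF and exactly one base in ALT.
BIALLELIC_SNP = re.compile(r"^(?:[^\t]*\t){3}[ACGT]\t[ACGT]\t")


def keep_biallelic(lines):
    """Drop data lines whose ALT column lists more than one allele."""
    return [ln for ln in lines if ln.startswith("#") or not MULTIALLELIC.match(ln)]


def keep_snps(lines):
    """Drop data lines whose REF or ALT is longer than a single base."""
    return [ln for ln in lines if ln.startswith("#") or BIALLELIC_SNP.match(ln)]


# Made-up demo lines standing in for the decompressed practice file.
lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MT\t73\t.\tA\tG\t100\tPASS\t.",
    "MT\t150\t.\tC\tT,A\t100\tPASS\t.",
    "MT\t302\t.\tA\tAC\t100\tPASS\t.",
]

biallelic = keep_biallelic(lines)  # removes the site at position 150
snps = keep_snps(biallelic)        # removes the insertion at position 302
print(len(lines), len(biallelic), len(snps))
```

Checking the header lines first matters here, because VCF meta-information lines often contain commas and would otherwise be at the mercy of the patterns.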
It’s a few days later, and your larger data set, called fullMT.vcf.gz, has (finally) arrived. You decompress the file and apply the same workflow to this new, larger data set, but are unsure where to go from here. In the lounge that week, you happen to overhear a fellow student mention that Dr. Otto Octavius (Doc Ock for short) believes that A-to-G transitions might have something to do with mutant superpowers. You set off to subset your data to include only these kinds of sites.
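As a rough sketch of that last step, the Python below reads a VCF (decompressing on the fly when the name ends in .gz) and keeps only the sites with REF equal to A and ALT equal to G. Treating "A-to-G" as exactly REF A and ALT G is an assumption, since the handout does not say whether strand or direction matters. The helper names and demo lines are made up.

```python
import gzip
import re

# REF (column 4) is exactly "A" and ALT (column 5) is exactly "G".
A_TO_G = re.compile(r"^(?:[^\t]*\t){3}A\tG\t")


def read_vcf_lines(path):
    """Read a VCF as text lines, using gzip when the file name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


def keep_a_to_g(lines):
    """Keep header lines and data lines whose REF is A and whose ALT is G."""
    return [ln for ln in lines if ln.startswith("#") or A_TO_G.match(ln)]


# Made-up demo lines; in practice these would come from read_vcf_lines(...).
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MT\t73\t.\tA\tG\t100\tPASS\t.",
    "MT\t146\t.\tT\tC\t100\tPASS\t.",
    "MT\t263\t.\tA\tG\t100\tPASS\t.",
    "MT\t750\t.\tG\tA\t100\tPASS\t.",
]
a_to_g = keep_a_to_g(demo)
print(len(a_to_g))
```

Note that the G-to-A site at position 750 is dropped under this reading; if the reverse direction should count as well, the pattern would need a second alternative.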
See the at_home_practice folder for instructions.