Last week, you used gatk
and samtools
to map sequencing data to a reference genome, and generate a .vcf
file. While your work involved a single chromosome for a single individual, GWAS often use thousands of times more data! In this week’s set of activities, you will perform a subsequent step of GWAS: the actual association part. You will be using plink
, a commonly-used tool that measures statistic signals of association. Using it will require simulated data for 500 individuals, including two inputs:
.vcf
file)At the most basic case, associations are measured by a Chi-square test for differences in allele counts in two groups:
Imagine 100 case individuals and 100 control individuals are genotyped at a candidate SNP implicated in a disease. Because humans are diploid, that means there will be a total of 200 alleles (two per individual) per group. Say the data falls out like so, with individuals having either a G or a C allele:
case | control | total | |
---|---|---|---|
G | 125 | 97 | 222 |
C | 75 | 103 | 178 |
total | 200 | 200 | 400 |
These values are then used in a Chi-square test. You can run it here.
When running a GWAS, you are essentially running thousands and thousands of Chi-square tests–one for every locus (row) in a VCF file. In the end, you have a collection of genome positions (plotted on the X-axis) and significance scores (on the Y-axis). This form of displaying results is called a Manhattan plot. The name comes from the idea that when there are lots of SNPs, it looks like a city skyline, like that of Manhattan.
I’ve provided a VCF file MT.500.vcf
that contains simulated genotype data for 500 individuals. You need to incorporate their phenotypic data together with genotype data to determine if any SNPs are associated with a modeled disease.
less -S
to disable word-wrap, making the file simpler to examine.For this exercise, you have access to phenotype data in two separate files: a list of control individuals is within healthy.txt
and a list of case individuals is within diseased.txt
.
.pheno
file by processing healthy.txt
and diseased.txt
using bash commands. The first two columns will both be the same value, followed by the phenotype in the third column. In this file format, a phenotype value of 1 means healthy, and a phenotype of 2 means diseased. You can use loops, awk
, or sed
plus the tools you already know. Save your commands to a text file for safekeeping. You will want it for later. See instructions here on formatting files using bash loops, awk
, and sed
.The result will ultimately be a file organized like so:
FID | SID | PHENO |
---|---|---|
sample1 | sample1 | 1 |
sample3 | sample3 | 1 |
sample4 | sample4 | 1 |
sample6 | sample6 | 1 |
sample2 | sample2 | 2 |
sample5 | sample5 | 2 |
sample7 | sample7 | 2 |
sample8 | sample8 | 2 |
…etc.
Note: The above headers can be included in the .pheno
file, but are not required. The sample and phenotype values above are just examples–your file will likely look different
Once you believe your .pheno
file is properly formatted, execute a case-control analysis using plink
.
Get Plink. Download it from its hosting website and extract the .zip
archive. The file is hosted at https://www.cog-genomics.org/plink2/ and you want the stable build for Linux (64-bit).
Run a case-control association test. You will need to run the --assoc
model, the --vcf
option followed by an input .vcf
filename, the --pheno
option followed by your input .pheno
filename. You can specify the output with --out prefix
which will give you output files prefix.assoc
, prefix.log
etc. Note: you may need to specify an additional option. plink
will complain to you if this is the case.
prefix.assoc
is the output of the association test. Look through its contents, which includes the fields
column | meaning |
---|---|
CHR | Chromosome |
SNP | SNP ID |
BP | Physical position (base-pair) |
A1 | Minor allele name (based on whole sample) |
F_A | Frequency of this allele in cases |
F_U | Frequency of this allele in controls |
A2 | Major allele name |
CHISQ | Basic allelic test chi-square (1df) |
P | Asymptotic p-value for this test |
OR | Estimated odds ratio (for A1, i.e. A2 is reference) |
association_testing
directory contains a script, PlotGWAS.sh
that will generate a plot for you. Execute it as PlotGWAS.sh filename.assoc
where filename matches your output from plink
. The output file will be GWAS.png
and you can view it using the command display GWAS.png
.Note: If you can’t display the image due to errors, see the instructor.
.pheno
file which includes each individual’s phenotypes.sample01
has its phenotype in the row where Sample ID is also sample01
.Address the following as a group. Submit answers (individually) to Collab during class before you head out.
Is a high or low chi-square value more significant? Is a high or a low p-value more significant? What about -log10(p-value)?
Did you have to include any additional options to get plink to run? If so, which?
The plink output refers to alleles as “major” and “minor.” What do these names mean?
How many degrees of freedom are used in a Chi-square test using a 2x2 contingency table?
How many rows are in the input vcf file? How does this compare to the output assoc file? Why does the answer make sense?
On your own time, complete the homework listed below.
As the saying goes, there’s more than one way to skin a cat! For your homework, you will need to submit examples of commands to generate a properly-formatted .pheno
file three different ways: one for each of the above methods (loops with variables, awk
, and sed
). You will submit your answers via the Tests & Quizzes tab of Collab.
Complete the remaining questions related to your work today on the Tests & Quizzes tab of Collab.
The early submission deadline is Friday Night at 11:59 PM. Final submission deadline is the start of class next week.
Feedback will be available Saturday morning at 12:00 AM. Any questions that were wrong can be submitted to me via email for up to half credit back on incorrect answers, if submitted prior to the start of next class.