Before, you downloaded a pre-generated VCF file containing genetic information of multiple individuals’ mitochondrial genome. Now, you will be generating your own VCF file starting from raw DNA sequencing data. We will pretend that we have already conducted the necessary DNA extraction, purification, and preparation for sequencing. The sample was sent to the sequencing facility, a few weeks have passed, and now the data is ready for us.
The general workflow will consist of:
Thus far, we’ve been working in your home directory for doing computational work. However, the home directory is a lower-performance disk with limited disk space and slower speed. Using the home directory for extensive computing work can result in slowed performance, accidentally filling up your allotted disk space, and failed tasks.
Aside from your home directory, you have a personal high-performance directory located at
/scratch/${USER} where the variable ${USER} corresponds to your computing ID. Simply change
to your /scratch/ directory using cd to access it. For example, if your computing ID is
mst3k, then you would use cd /scratch/mst3k.
While you have substantially more disk space and faster read/write speeds on /scratch/,
the storage there is technically not guaranteed to be backed up (hence the name, it’s like
a scratch pad where you do work and test things out, but it isn’t intended for permenant
long-term storage. For our class’s purposes, that’s fine because we aren’t worried about
storing out results for a long period.
Once you are in your personal directory on /scratch/ you can re-download the course’s
github repository via git clone https://github.com/cory-weller/BIOL4585 and then cd
into this week’s directory, named 06_mapping_sequencing_data.
DNA sequencing data is typically stored in a standardized format called FASTQ. As such,
files containing sequencing data will typically have the .fastq extension. Basically,
the format contains a string of sequenced nucleotides, and their corresponding quality scores.
You can see https://en.wikipedia.org/wiki/FASTQ_format for information on the format.
We’ll be retrieving data from the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI).
The SRA allows researchers to share their sequencing data with others, allowing for their research to be checked for reproducibility, for collaborative work, for archival purposes, and for allowing data to be repurposed for novel studies. There is a convenient tool for retrieving sequencing data from the SRA, directly from the command-line, called the sra-toolkit.
First, you’ll need to download the sra-toolkit from NCBI. Luckily, it’s hosted on their
FTP server, so you can download it directly using commands like wget so long as you have
the link: https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
/scratch/${USER}/BIOL4585/06_mapping_sequencing_data/Note that this file is a .tar.gz archive. This is different from a simple .gz file.
A .tar archive can contain multiple directories and files, and are not innately compressed.
Because this file is .tar.gz, that means it first bundled multiple files together into
a .tar file, and was subsequently compressed using gzip.
tar by entering man tar on the command line.You can then cd into the fully extracted sra-toolkit directory. List all of the contents of this
directory. You’ll notice a directory named bin which is a common naming convention. The
bin directory indicates executable binary applications. The tool we want to use is named
fastq-dump and is contained within the bin directory.
Execute fastq-dump without any arguments to see directions for using the tool. Note that
you will need to either give an explicit full path to the file (e.g. /scratch/$USER/BIOL4585/06_mapping_sequencing_data/sratoolkit.2.9.4-centos_linux64/bin/fastq-dump)
or a relative path starting with ./
Use fastq-dump using the flag --gzip to download compressed sequence data for accessionID SRR8203771.
This should take a minute or two, and afterwards you can use ls to verify you have a .fastq.gz file.
In order to actually put the sequencing reads into the places we think they ‘belong’ in the genome, we need to actually have a reference for what the genome looks like. This is what the reference genome is: a long string of nucleotide characters. A sequence read mapper then compares individual sequencing reads to the reference, and determines optimum placement based on character matches. The output of this process is a sequence alignment map (SAM) file.
retrieve_SRA.sh script
conveniently located in your alignment directory of the class repository. Read through the
script to get an idea of how you expect it to work. Then, retrieve accession NC_012920.1 using the script.In order for the reference to be used, we need to generate accessory index files using a tool called bwa.
Luckily, this tool is pre-installed on Rivanna, through a module that we can turn on using a set of
commands called module load.
bwa by entering it by itself in the command-line. Then, Activate the bwa module
by entering module load bwa. Then, enter bwa by itself in the command-line again. What happens?Note: Different software often runs in different ways, e.g. one might run such as
`
We want to use bwa index. With most tools, entering it with no arguments or commands will give
an error, and tell us how to use it. Enter bwa index to see how it wants to be ran. Then, use it
to generate the index files for our reference fasta file (MT_reference.fasta).
Next, we want to use bwa mem to align the sequencing reads to the reference. Run bwa mem to be
prompted with how the program wants to be ran. The expected output is our sequence alignment map, or
.sam file. In this case, you can either specify an output file name using -o MT.sam or by directing
the output using ` > MT.sam at the end of your command (but don't do both). Note that options like
-o MT.sam` need to come before the fastq input.
While the MT.sam file you generated contains the right information we need to create a VCF file,
it need some preparation before it’s fully ready. These steps include converting to a binary format,
sorting the file for faster processing, and creating another index file for the reference sequence.
Load the samtools module similar to how you loaded the bwa module
Use the samtools view command to convert it to a binary format with the .bam suffix. Run
samtools view without any other commands to get a list of options, and use the option that
indicates you want a .bam output. For an output file name, use MT.bam.
Use the samtools sort command to output a sorted BAM file. Your input will be MT.bam and the
output file name can be MT.sorted.bam.
Use the samtools faidx command to generate another necessary index file for the reference sequence.
GATK is a commonly-used toolkit with many tools packed into it. We will use it to validate our file (making sure it’s properly formatted) and for generating our final VCF output.
Load the gatk module
Use the gatk CreateSequenceDictionary command to generate a sequence dictionary for the reference sequence.
Successful completion should print Tool returned: 0.
gatk AddOrReplaceReadGroups command for our sorted .bam file. There are many arguments here:
MT.sorted.bamMT.sorted.readgroups.bam--RGLB) will be library1--RGPL) will be illumina--RGPU) will be platform1--RGSM) will be sample1Index our final .bam. file by running samtools index for MT.sorted.readgroups.bam
MT.sorted.readgroups.bam using the command gatk ValidateSamFileIf the validation returns with no errors, you’re ready for the final step of generating the VCF output!
gatk HaplotypeCaller command using our sorted .bam file with added readgroups. You will
need to specify the reference sequence file, and an output file name, such as MT.vcf. This may take a
few minutes to run to completion, but the end result should be a fully-formed VCF file! Look at the file contents
(e.g. using less) and scroll down. You should see a bunch of metadata starting with ##, followed
by a header starting with #CHROM POS ID etc. Below that should be rows of data. If not, something
went wrong. See: Troubleshooting.If you get through the final step and data rows are not present, you can check at which step things
went wrong. Your .sam and each of your .bam files should be relatively large (hundreds of MB),
so if any of them is unusually small, then something went wrong building that file.
You can check a file’s size with du -h filename (du stands for for disk usage). The -h option reports file size
in human-readible units of kilobytes, megabytes or gigabytes. For example, if the .sam file you generated is
very small, you’ll need to start over generating that file.
For this week, you have 5 questions related to the pipeline you worked through, as well as the final output that was generated. Once you are confident that you’ve successfully finished the pipeline and generated the VCF file, proceed to the “Tests & Quizzes” tab of the course collab page.
The “early” deadline is this Friday night at 11:59 PM. The “final” deadline is the start of next class (4:00 PM Monday Feb 25). You get one chance to submit online. If you submit prior to the early deadline, you will be able to check if your solutions were correct on 12:00 AM Saturday morning and later. Also, you will have the opportunity to resubmit your answers, along with explanations of why your previous answer was wrong and how you fixed it, via email. If you decide not to submit until after the early deadline, that’s your only submission. The choice is yours!