Many research projects involve using publicly available data. But, that data has to be transferred from its source (hosted on some server somewhere) to your personal directory (on Rivanna). Here, you will be retrieving publicly available data published with the 1000 Genomes Project.
During this activity, you will practice:
To download a file while in bash, you’ll need
bash
shellConnect to Rivanna and then open up a browser window to do some google searches. You’ll come back to Rivanna in a moment, but first you need to learn which command to use to download a file directly to Rivanna.
Search for how to download files using bash
. Your tutorials so far haven’t taught you how to download a file using bash
. Often, you’ll run into the problem of not knowing how to do something. Here, Google is your friend. Many examples will be posted on Stack Overflow and Stack Exchange, websites where people ask for help with coding. Odds are somebody as asked your question before: you just need to search the right terms. Because you’re curious how to download a file using bash
, perform a Google search for bash download file which will return some examples. This is a pretty standard process: if you want to learn how to do task using language, search language task. If you want to only see results posted on stackoverflow, you can search site:stackoverflow.com language thing. Once you know what command(s) you could use, the manual pages can inform you of detailed instructions of how to use the command’s options.
Locate the address of the file to download. Once you have an idea of which command you’ll use, you need the file’s address. The 1000 Genomes data is located on an FTP server, a computer with a File Transfer Protocol address. Basically, this means you can download files from it using a web address. You can view the list of 1000 genomes files in your browser: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Clicking the above link opens up a very plain directory listing of available files. Copy the link to the mitochondrial genome file. We want the file ALL.chrMT.<morestuffhere>.genotypes.vcf.gz
. Note that left clicking will download the file to your own computer, which is not what you want! You only need to copy the link to the file and combine it with the command you found via Google. Run the command in bash
to download the file. Be sure you’re getting the .vcf.gz
file and not the .vcf.gz.tbi
file.
Download the file to your Rivanna directory using the command and file location.
Double-check that it worked. You can see if the file has been downloaded by listing the contents of your folder with ls
. Notice the .vcf.gz
file name is long and complicated. You won’t want to type the whole thing out every time you manipulate the file. There a few options here: First, you could have specified a filename at the same time that you download the file (see examples or manual pages for the command you’re using to download the file). Second, you could download the file with the name it has now, and then rename it to something less complicated using mv
., e.g.
mv oldFileName newFileName
In either case, be sure the new file name communicates that it’s a mitochondrial genome, and that the name ends with the same file extension, .vcf.gz
). Third option, you can keep the full file name and use tab-completion after typing the first character(s).
Decompress the .gz
file containing the mitochondrial genome. The file extension .gz
indicates that the file has been compressed via gzip
to take up less space on the hard drive. You’ll need to decompress the file to read its contents. Perform the necessary Google search and/or use manual pages to learn how to decompress a .gz
file. Side note: because data compression is most effective on highly repetitive information, it can be used to measure how repetitive a pop song is. See https://pudding.cool/2017/05/song-repetition/ for a very well made web page briefly explaining data compression using stunning visual displays, plus evidence that Mariah Carey is more lyrically complicated than Beyonce (if only slightly).
Investigate the contents of .VCF
file. You might first think to look at the contents by printing the file with cat
. Give it a try and see what happens. Remember you can press CTRL+C
to cancel commands that are currently running.
less
command. Is this more effective?less
with the -S
option (note it is a capital S) to turn off word wrap. Is this more effective? The data seems to “line up” nicely when using less -S
because fields are separated by tab characters (that is, it uses a tab delimiter). It is very common for data to be delimited by tab characters because it’s much easier to read than being separated by commas.CHROM
, POS
, ID
, REF
, ALT
etc. The lines above that are metadata that we can ignore for now.wc
command with the -l
optiongrep #CHROM
to pull out all rows that include the “word” #CHROM
grep #CHROM filename.vcf | wc -w
. This first extracts the header row because it matches the #CHROM
pattern, then counts the number of ‘words’ in that row.