If you use R, you’ll almost certainly be using ggplot for generating figures, or plots. This will be a crash course in the syntax of ggplot.
First, complete the Data Camp chapter Data Visualization with ggplot2 (Part 1). This should give you an introduction to the basic syntax of ggplot before we work with it using some biological data.
Second, work through the exercises below. The numbered questions are available under the Tests & Quizzes tab of collab.
Log into Rivanna and start R (after loading the gcc and R/3.5.1 modules).
Then, while in R, install the ggplot2 package:
install.packages("ggplot2")
If prompted to select a mirror, choose one relatively close to us in the US for fastest downloads.
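Installing a package only needs to happen once, but it has to be loaded in every new R session. A minimal sketch, assuming the install above works on your system (the requireNamespace() check is just one way to avoid reinstalling each time):

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```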
It’s often helpful to take screenshots of your plots so you can look at the results later. You’ll also need to do so for the homework! The method for taking screenshots differs from one operating system to another.
On Windows, you can search for the Snipping Tool to capture a portion of your screen and save it to a file. Alternatively, copy a region to your clipboard using Windows Key + Shift + S, which you can then save after pasting into Paint.
On Mac, Command + Shift + 4 should let you save a portion of your screen to a file.
Linux users are on their own, but depending on your distribution, Shift + PrintScreen may work for you.
If the above do not work (likely due to older operating systems), you can always take a full-window screenshot with the PrintScreen button and crop it using your default image editing program (e.g. Paint, or Paintbrush).
(Note: You can save your plots within ggplot using the ggsave() function, though that’s typically done when generating final, high-resolution plots for presentations or publications. Screenshotting is the simpler approach for exploratory analysis.)
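For reference, here is a rough sketch of what saving with ggsave() can look like. The plot itself is a throwaway histogram, and the file name, size, and resolution are arbitrary choices for illustration, not course requirements:

```r
library(ggplot2)

# A throwaway plot to save: a histogram of one iris column.
p <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(bins = 20)

# Hypothetical output path; written to a temporary directory here.
out_file <- file.path(tempdir(), "example_plot.png")

# width and height are in inches by default; dpi sets the resolution.
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)
```

In practice you would replace the temporary path with a file name in your working directory.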
We’ll be using the premade iris dataframe, which describes measurements for different species of flower. You can see the dataframe by executing the command iris in the R console. We’ll be using iris as the name of the data for plotting within ggplot().
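Printing all of iris fills the console, so it can help to look at just a summary first. A small sketch using base R functions:

```r
# First few rows of the built-in iris data frame
head(iris)

# Column names exactly as they must be typed inside aes()
names(iris)

# Number of rows and columns
dim(iris)
```

Note that the column names are capitalized and use a period as a separator, which matters when you refer to them in a plot.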
First, let’s look at the relationship between sepal length and sepal width. Sepals are the green leafy parts at the base of the flower, beneath the petals.
Plot sepal width on the x-axis and sepal length on the y-axis. Note that you’ll need to match the spelling of the column names in the data exactly, otherwise you’ll get an "object not found" error. We’ll want a scatter plot, using geom_point().
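As a reminder of the general pattern, here is a sketch of a scatter plot on a different built-in dataset, mtcars, so that the iris version is left for you to write. The choice of mtcars and its wt and mpg columns is purely for illustration:

```r
library(ggplot2)

# mtcars stands in for iris here: car weight (wt) on x, fuel economy (mpg) on y.
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

p_scatter  # printing the object draws the plot
```

The same three pieces apply to iris: the data, the aes() mapping of columns to axes, and the geom that draws the points.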
Data is plotted in multiple dimensions. The ones you’re most familiar with will be the X and Y dimensions. If all we use are X and Y values, we can compare two variables at the same time.
To add another dimension to the data, we don’t need to make a visually 3D plot. We can add another dimension using things like color, shape, or size. These options all go within the aesthetics, or aes(), portion of a ggplot command. To visualize how sepal width and sepal length are related for each species, we’ll need to add species as a dimension of the data. Try it first using shape=, and then using color=.
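To show where these arguments go, here is a sketch again using mtcars as a stand-in, with the number of cylinders (cyl) as the extra dimension. The dataset and column choices are assumptions made for illustration only:

```r
library(ggplot2)

# cyl is numeric in mtcars, so wrap it in factor() to treat it as a category.
p_shape <- ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(cyl))) +
  geom_point()

p_color <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()
```

In iris, the Species column is already a factor, so it can be used inside aes() without the factor() wrapper.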
Screenshot the preferred display (with either shape or color as the dimension for species). Upload your plot…
…and paste the code used to generate it.
Does adding species as a dimension to the data provide extra information, compared to when all points looked the same? If so, what kind of patterns could you discern?
Was it easier to discern patterns using color or shape as the dimension for species? Why?
Thus far, we’ve looked at the relationship between length and width for sepals. Now we’re going to look at the relationship between length and width for petals.
Modify your ggplot command (the one you preferred, whether using shape or color for species) to now show the relationship between petal length and petal width. Upload a screenshot of your plot…
…and submit the code used to generate it.
Give your interpretation on petal size as a function of species. Is this relationship more or less clear than sepal size?
Currently, the data is in wide format. More advanced plotting techniques are typically easiest in long format (also called narrow format), where every sample * variable combination is on its own row. You can see a brief example of the difference on the Wikipedia page.
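To make the wide versus long distinction concrete, here is a sketch on a tiny made-up table, converted with melt() from the data.table package. The table and its column names (sample, group, height, weight) are invented for illustration:

```r
library(data.table)

# A small made-up table in wide format: one row per sample,
# one column per measured variable.
wide <- data.table(
  sample = c("s1", "s2"),
  group  = c("a", "b"),
  height = c(1.2, 3.4),
  weight = c(5.6, 7.8)
)

# Melt the two measurement columns; the remaining columns are kept as identifiers.
long <- melt(wide, measure.vars = c("height", "weight"))

long  # 2 samples x 2 variables, so 4 rows, with new "variable" and "value" columns
```

Note that iris ships as a plain data.frame, so as.data.table(iris) is one way to convert it before melting.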
Convert the data to long format using the melt() function, and save this long format data to a new variable. Note that you will need to have the data.table package loaded! When using melt(), you supply a data.table object to convert to long format, plus a vector of column names that are being melted into long format. Those column names are listed using the measure.vars= argument. You will be melting the columns that describe sepal and petal dimensions (but not species). How many rows does this new table have?
Now that we have data in long format, we can take advantage of plot faceting. Facets separate data into groups, kind of like when using by= notation in data.table. This allows you to look at data for specific groups (or combinations of groups). We’re going to facet data by species, so that each species generates its own sub-plot, separate from the others.
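As a rough illustration of faceting, here is a sketch that splits an mtcars scatter plot into one panel per cylinder count. As before, mtcars and the cyl column are stand-ins chosen for the example:

```r
library(ggplot2)

# One sub-plot per value of cyl, laid out automatically by facet_wrap().
p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)
```

The facet_grid() equivalent for a single row of panels would be facet_grid(. ~ cyl); the two functions mainly differ in how they arrange panels when there are several grouping variables.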
First, plot the long-format data unfaceted as a scatter plot, using geom_point(). The x-axis will contain the variable column, the y-axis will contain the value column, and data should be colored by species. Upload a screenshot of your plot…
…and the code used to generate it.
Next, facet the data by Species by adding either facet_wrap() or facet_grid() to the code you just used. See the documentation for these in R using ?, or search Google for more help. Upload a screenshot of your plot…
…and add the code used to generate it.
This recent plot is great, but there’s a problem of data overlapping. With scatter plots (geom_point()), any points that share identical X and Y values will overlap. We can display the data in a couple of different ways to get a better idea of the distribution.
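Three common alternatives can be swapped in for geom_point() without changing the rest of the command. A sketch on mtcars (again a stand-in dataset, with jitter settings picked arbitrarily):

```r
library(ggplot2)

# Shared base: a categorical x-axis and a numeric y-axis.
base <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))

# Jitter: nudges points sideways at random so overlapping points become visible.
p_jitter <- base + geom_jitter(width = 0.2, height = 0)

# Violin: draws a mirrored density estimate for each group.
p_violin <- base + geom_violin()

# Boxplot: summarizes each group by its median, quartiles, and outliers.
p_boxplot <- base + geom_boxplot()
```

Storing the shared part of the plot in a variable, as with base here, makes it easy to compare geoms side by side.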
Modify your code from 11, changing the scatter plot to each of three other methods: a jitter plot, using geom_jitter(), a violin plot, using geom_violin(), and a boxplot, using geom_boxplot(). What are the differences between these three plots? What are the similarities in what they tell you?
Currently, your plot’s labels default to column names. Choose your favorite plot from 13 and tidy up its labels with better descriptors and capitalization, plus a title. For these you can use the ggplot elements xlab(), ylab() and ggtitle(), or set them all at once using labs(). Upload a screenshot of the plot…
…as well as the code used to generate it.
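For reference, a sketch of setting axis labels and a title in one call with labs(), shown on an mtcars boxplot. The dataset and the label text are made up for the example:

```r
library(ggplot2)

p_labeled <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(
    x = "Number of cylinders",
    y = "Fuel economy (miles per gallon)",
    title = "Fuel economy by cylinder count"
  )
```

The same result can be built from separate pieces by adding xlab(), ylab(), and ggtitle() to the plot instead of labs().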