Overview

In this lesson you will learn how to create basic graphics using the ggplot2 package, and to customize those plots.

Objectives

  1. Use ggplot to create histograms, bar plots, and scatter plots.
  2. Customize those graphs to make them more comprehensible and attractive.

Readings

Lander, Chapter 7.2

1 ggplot2

Rather than go more deeply into the base graphics of R, it is worth turning to the more robust and beautiful ggplot2 graphics package, which is now fairly standard for R visualization.

The fundamental idea of ggplot is that you build up an image by adding together various functions. The core function is ggplot, which takes your data plus a few settings and (silently) stores them in a plot object, without drawing anything yet. The actual visualization is made by adding to ggplot various other functions (layers such as geom_histogram) that take the structured data and draw the actual graphics when the plot is printed.

As always, this is best illustrated with a few examples!
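
As a first, minimal sketch of that layering idea (the object names base and hist_plot are only illustrative, and the histogram layer is explained in the next section), you can store each step in a variable and check how many layers it holds:

```r
library(ggplot2)

# ggplot() alone only records the data and the aesthetic mappings;
# with no geom layer added there is nothing to draw yet.
base <- ggplot(data = airquality, aes(x = Temp))
length(base$layers)

# Adding a geom with `+` returns a new plot object that holds one layer.
hist_plot <- base + geom_histogram(bins = 30)
length(hist_plot$layers)

# The graphic is only drawn when the object is printed.
print(hist_plot)
```

Typing the whole expression at the console, as in the examples that follow, prints the plot automatically, which is why no explicit print() is needed there.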

1.1 Histogram

Here is our histogram from before, this time using ggplot2:

library(ggplot2)
ggplot(data=airquality,aes(x=Temp)) + geom_histogram()

The first function, ggplot, declares that the dataframe we are using is “airquality” and the x variable we want is “Temp” (“aes” stands for aesthetics, and can take many different aesthetic settings besides the variable names). geom_histogram then takes that data and outputs the histogram; it too can take various settings in the “()”, which we will explore later. You may notice that when you run geom_histogram without any settings there is a message by default suggesting you choose the number of bins yourself, which is eliminated by running, say,

ggplot(data=airquality,aes(x=Temp)) + geom_histogram(bins=30)
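
An alternative to fixing the number of bins is to fix their width. The sketch below assumes a width of 5 degrees is a sensible choice for this data; that value is only an illustration, and p_width is an illustrative name.

```r
library(ggplot2)

# binwidth sets the width of each bar in data units (degrees Fahrenheit here)
# instead of fixing how many bars there are.
p_width <- ggplot(data = airquality, aes(x = Temp)) +
  geom_histogram(binwidth = 5)
print(p_width)
```

Supplying either bins or binwidth is enough to silence the message.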

If we want a smooth density rather than a histogram, we can use:

ggplot(data=airquality,aes(x=Temp)) + geom_density(fill="blue",alpha=0.5)

Note that in addition to filling it with blue, we set the transparency (alpha) to 50%, just to make it look nice.
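
Because layers are simply added together, the histogram and the density can also share one plot. This is a sketch rather than a recommendation; it assumes ggplot2 version 3.3.0 or later for after_stat(), and the grey and white colors are arbitrary choices.

```r
library(ggplot2)

p_overlay <- ggplot(data = airquality, aes(x = Temp)) +
  # after_stat(density) rescales the bar heights so the total bar area is 1,
  # which puts the histogram on the same scale as the density curve
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey80", color = "white") +
  geom_density(fill = "blue", alpha = 0.5)
print(p_overlay)
```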

1.2 Boxplot

To do a boxplot, we follow a similar syntax:

ggplot(data=airquality,aes(x=1,y=Temp)) + geom_boxplot()

The reason we need both an x and a y is that geom_boxplot is by default designed to do boxplots over a number of groups. x=1 is just a way to get around this by making x a constant.
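
Since the constant x carries no information, you may want to hide its axis. One possible way, sketched here with standard theme settings (p_box is an illustrative name):

```r
library(ggplot2)

p_box <- ggplot(data = airquality, aes(x = 1, y = Temp)) +
  geom_boxplot() +
  # drop the x-axis title, tick labels, and tick marks,
  # since the constant x = 1 has no meaning
  xlab(NULL) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
print(p_box)
```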

1.3 Boxplot over x

But if we wanted to boxplot temperature by month, we would write

ggplot(data=airquality,aes(x=as.factor(Month),y=Temp)) + geom_boxplot() + xlab("Month")

Note that we have to change the x variable to a factor so that ggplot knows how to group it. We also added another function to change the x-axis label, which otherwise would be the ugly “as.factor(Month)”.
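
If you would rather see month names than numbers on the axis, one option is to build the factor yourself with explicit labels. This sketch assumes a copy of the data called aq and a new column called MonthName, both names chosen only for illustration:

```r
library(ggplot2)

# Work on a copy so the built-in airquality dataframe is left untouched.
aq <- airquality

# month.abb is base R's vector of abbreviated month names ("Jan", "Feb", ...);
# the data covers months 5 to 9.
aq$MonthName <- factor(aq$Month, levels = 5:9, labels = month.abb[5:9])

p_month <- ggplot(data = aq, aes(x = MonthName, y = Temp)) +
  geom_boxplot() +
  xlab("Month")
print(p_month)
```

Giving the levels explicitly keeps the months in calendar order rather than alphabetical order.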

1.4 Scatter plot

Finally, to do a scatter plot, you could probably almost guess the syntax by now, except for one thing. ggplot likes to have its data all in a single dataframe, so first we have to add our “airdate” variable to the original dataframe:

airdate <- as.Date(paste("1973","-",airquality$Month,"-",airquality$Day,sep=""))
airquality2 <- cbind(airquality,airdate)
ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + xlab("Date") + ylab("Temperature") 
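
Because the x variable is a true Date, the axis breaks and labels can be controlled with scale_x_date(). The sketch below rebuilds the date column so that it runs on its own; the year 1973 is the one given on the airquality help page (?airquality), and the monthly breaks and "%b" label format are just one reasonable choice.

```r
library(ggplot2)

# Rebuild the date column so this sketch is self-contained.
airquality2 <- airquality
airquality2$airdate <- as.Date(paste(1973, airquality$Month, airquality$Day, sep = "-"))

p_dates <- ggplot(data = airquality2, aes(x = airdate, y = Temp)) +
  geom_point() +
  # one tick per month, labelled with the abbreviated month name
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  xlab("Date") + ylab("Temperature")
print(p_dates)
```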

If we want to add a best-fit line, we can add geom_smooth() with a linear model (method=lm) as the smoother:

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + xlab("Date") + ylab("Temperature") + geom_smooth(method=lm, color="red", se=TRUE)
`geom_smooth()` using formula 'y ~ x'

We can also add a non-linear smoothed curve to this to show the overall trend, by specifying a “loess” smoother.

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + xlab("Date") + ylab("Temperature") + geom_smooth(method=loess , color="blue", se=TRUE)
`geom_smooth()` using formula 'y ~ x'
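
How closely the loess curve follows the points is governed by its span. The sketch below compares two span values picked purely for illustration (0.3 and 0.9); it also passes the formula explicitly, which avoids the message shown above, and turns off the confidence band to keep the comparison readable. The date column is rebuilt with the year from ?airquality so the block runs on its own.

```r
library(ggplot2)

airquality2 <- airquality
airquality2$airdate <- as.Date(paste(1973, airquality$Month, airquality$Day, sep = "-"))

p_span <- ggplot(data = airquality2, aes(x = airdate, y = Temp)) +
  geom_point() +
  # smaller span: the curve follows short-term swings more closely
  geom_smooth(method = "loess", formula = y ~ x, span = 0.3,
              color = "blue", se = FALSE) +
  # larger span: a smoother curve showing only the broad seasonal trend
  geom_smooth(method = "loess", formula = y ~ x, span = 0.9,
              color = "red", se = FALSE) +
  xlab("Date") + ylab("Temperature")
print(p_span)
```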

And if we wanted to connect the points (which probably isn’t a good idea in this case) we could use geom_line().

ggplot(data=airquality2,aes(x=airdate,y=Temp)) + geom_point() + geom_line() + xlab("Date") + ylab("Temperature")

2 Saving graphs

Finally, how do we save our plots?

This is easily done with ggplot2 using the ggsave function, which can be executed after you have constructed your image and by default saves the last plot displayed. For instance, our current image is now the plot of temperature vs date with the connecting lines. To save that as a PDF (a vector format that scales without losing quality, which makes it a good choice for inclusion in a document), we write:

ggsave("tempvsdate.pdf",width=6,height=4)

To save the file as a PNG (a good format for the web), we just change the file name to “tempvsdate.png”; ggsave picks the format from the file extension.
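
Relying on the “current image” can be fragile in a longer script, so it can help to store the plot in a variable and hand it to ggsave explicitly. A minimal sketch, in which the names temp_hist and out_file and the use of a temporary folder are assumptions for illustration:

```r
library(ggplot2)

temp_hist <- ggplot(data = airquality, aes(x = Temp)) +
  geom_histogram(bins = 30)

# tempdir() keeps this sketch from writing into your project folder;
# in a real script you would use a path such as "temphist.pdf".
out_file <- file.path(tempdir(), "temphist.pdf")

# plot= saves this specific object rather than the last plot displayed;
# width and height are given in inches.
ggsave(out_file, plot = temp_hist, width = 6, height = 4, units = "in")
file.exists(out_file)
```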

Another alternative is to use RStudio to save graphics. This gives you a little more room to tinker with the size and format, although it is always best practice to include the save within your script for reproducibility.

To save the image in the “Plots” pane using RStudio, click on the “Export” tab right below the “Plots” tab in the “Plots” pane. To save as a PDF, for instance, choose “Save as PDF…”, which gives you the options to set the size and directory, and most importantly, to preview the results so you can get the size right. Different sizes, even with the same width-to-height ratio, produce differently sized text and other features, so it’s sometimes worth tinkering to get the most aesthetically pleasing output.

3 Customizing ggplot

One thing to bear in mind with ggplot is that it is almost impossible to learn the syntax by heart, at least for all the little settings you will inevitably want. So it is essential to have a good reference guide. The best guides are essentially little cookbooks, where you can either look up in the index by the ingredient you want, or look through the pictures for something that looks good.

In addition to our textbook, I especially recommend Cookbook for R, based on the excellent R Graphics Cookbook by Winston Chang. The online version makes it especially easy to find what you need with the minimum of hair-pulling. Crib sheets and online references are always useful for R, but they are essential for a visual medium like graphics, where often you may not know exactly what you need until you see it.

Another excellent resource is the R Graph Gallery, which provides a visual encyclopedia of possible graphing needs and solutions.

We will explore here a few of the basic customizations that are possible with ggplot, but this is really only the tip of the iceberg. Visualization, with or without R, is its own entire discipline, which we can only touch on here.

3.1 Themes

One of the easiest and most powerful ways to improve plots is adding themes. The gray background, for instance, is a little odd and not entirely suited to many professional expectations, but the “classic” or “bw” (black and white) themes can make a plot look more suitable.

library(ggplot2)
ggplot(data=airquality,aes(x=as.factor(Month),y=Temp)) + geom_boxplot() + xlab("Month") + theme_bw()
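
The “classic” theme mentioned above works the same way, and complete themes can be adjusted further. In this sketch the font size of 14 and the bold axis titles are arbitrary choices made only to show the mechanism:

```r
library(ggplot2)

p_theme <- ggplot(data = airquality, aes(x = as.factor(Month), y = Temp)) +
  geom_boxplot() +
  xlab("Month") +
  # base_size sets the base font size in points
  theme_classic(base_size = 14) +
  # a theme() call added after a complete theme overrides individual elements
  theme(axis.title = element_text(face = "bold"))
print(p_theme)
```

The order matters: a complete theme such as theme_classic() added after theme() would replace the individual tweaks.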

Note that these themes change many more things than is visible in this example, including grid lines, panel borders, fonts, and legend styles.

3.2 Colors

R has a rich set of color palettes, though the default may not be to your liking. For instance, here is a variant on the previous plot where each month gets its own color, by altering the “fill” aesthetic. ggplot2 chooses the colors automatically based on its default palette. It also creates a legend with a somewhat ugly title.

ggplot(data=airquality,aes(x=as.factor(Month),y=Temp)) + geom_boxplot(aes(fill=as.factor(Month))) + xlab("Month") + theme_bw()

To swap the color palette for another preset one, you can use the RColorBrewer package, and add a color palette using scale_fill_brewer() for fills, and scale_color_brewer() if you want to recolor lines or points. To view all the palettes in that package, just run display.brewer.all().

library(RColorBrewer)
ggplot(data=airquality,aes(x=as.factor(Month),y=Temp)) + geom_boxplot(aes(fill=as.factor(Month))) + xlab("Month")  + scale_fill_brewer(palette="Spectral") + theme_bw()
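
If you want to see or reuse the individual colors of a palette, brewer.pal() returns them as hex codes, which can then be passed to scale_fill_manual(). The “Blues” palette and the name month_colors are illustrative choices, not part of the example above:

```r
library(ggplot2)
library(RColorBrewer)

# brewer.pal(n, name) returns n hex color codes from the named palette.
month_colors <- brewer.pal(5, "Blues")
month_colors

p_manual <- ggplot(data = airquality, aes(x = as.factor(Month), y = Temp)) +
  geom_boxplot(aes(fill = as.factor(Month))) +
  xlab("Month") +
  # scale_fill_manual() accepts any vector of colors, one per group
  scale_fill_manual(values = month_colors) +
  theme_bw()
print(p_manual)
```

The same function accepts color names such as "red" or your own hex codes, so it is also the route to a fully custom palette.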

3.3 Legends

Here we simply retitle the legend, although no legend is actually necessary in this plot, since the x-axis already labels each month. See here for more on legend formatting, which is a voluminous topic.

library(RColorBrewer)
ggplot(data=airquality,aes(x=as.factor(Month),y=Temp)) + geom_boxplot(aes(fill=as.factor(Month))) + xlab("Month")  + scale_fill_brewer(palette="Spectral") + theme_bw()  + guides(fill=guide_legend(title="Months"))
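
Since the legend repeats what the x-axis already shows, you might prefer to drop it altogether. One way to do so, sketched with a standard theme setting (p_nolegend is an illustrative name):

```r
library(ggplot2)

p_nolegend <- ggplot(data = airquality, aes(x = as.factor(Month), y = Temp)) +
  geom_boxplot(aes(fill = as.factor(Month))) +
  xlab("Month") +
  scale_fill_brewer(palette = "Spectral") +
  theme_bw() +
  # the x-axis already labels the months, so the legend is redundant;
  # this must come after theme_bw(), which would otherwise reset it
  theme(legend.position = "none")
print(p_nolegend)
```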