Overview
In this lesson you will learn how to process data, including aggregation and merging.
Objectives
Readings
Lander, Chapters 11, 12, 14, 15
We often wish to describe data aggregated over some variables: eg, mean income by political party affiliation, or mean height by gender. Aggregate
does this using a simple syntax.
R comes with many sample datasets built in, and many packages come with their own sample datasets to illustrate their functions. One such dataset is the airquality data, which measures various weather qualities in New York in over a series of days in 1973. To view the first 6 rows of any dataset, one can use the head function:
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Tail
gives the last 6 rows. Obviously airquality[1:6,]
would also do the trick.
To view, for instance, the mean temperature aggregated by month in this data, we would use aggregate:
aggregate(Temp ~ Month, data=airquality, mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
The first argument consists of the variables to be aggregated on the left of the ~
, and the variables by which to aggregate on the right. The second argument specifies the dataset name, and the third is the function to apply (which can be anything).
If we want to summarize mean temperature and ozone levels by month, we would write:
aggregate(cbind(Temp,Ozone) ~ Month, data=airquality,mean)
Month Temp Ozone
1 5 66.73077 23.61538
2 6 78.22222 29.44444
3 7 83.88462 59.11538
4 8 83.96154 59.96154
5 9 76.89655 31.44828
And if we want to aggregate over two variables, eg month and day, we would write:
aggmat <- aggregate(cbind(Temp,Ozone) ~ Month + Day, data=airquality,mean)
head(aggmat)
The notation here might seem a bit inconsistent, but the ~
is part of the “formula” notation in R that is used often in statistical models, as we will soon see (basically, like “<-
”, “~
” is yet another way to deal with the ambiguous “=
” sign).
The “Tidyverse” universe of R packages, which we have already mentioned, contains many useful functions, a number of which offer somewhat more intuitive ways for doing basic data analysis methods that have long been available in base R. For instance, summarize
provides another way to aggregate data using a somewhat different and, for some, more intuitive synatax.
For instance, to summarize (aggregate) our airquality data, we can write
library(tidyverse)
## ── Attaching packages ────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
summarize(group_by(airquality,Month),Temp_mean=mean(Temp))
## # A tibble: 5 x 2
## Month Temp_mean
## <int> <dbl>
## 1 5 65.5
## 2 6 79.1
## 3 7 83.9
## 4 8 84.0
## 5 9 76.9
Note that the output is a “Tibble,” which is like a data frame, and again can be converted to a data frame if you like with as.data.frame()
. To summarize by two variables we would write:
library(tidyverse)
summarize(group_by(airquality,Month,Day),Temp_mean=mean(Temp),Ozone_mean=mean(Ozone))
## # A tibble: 153 x 4
## # Groups: Month [5]
## Month Day Temp_mean Ozone_mean
## <int> <int> <dbl> <dbl>
## 1 5 1 67 41
## 2 5 2 72 36
## 3 5 3 74 12
## 4 5 4 62 18
## 5 5 5 56 NA
## 6 5 6 66 28
## 7 5 7 65 23
## 8 5 8 59 19
## 9 5 9 61 8
## 10 5 10 69 NA
## # … with 143 more rows
Note that this version of “mean” is a bit more of a stickler with NA values. But this approach allows us to easily summarize different variables by different functions. Eg, teh following summarizes the second variable by the count of observations instead of the mean of the values:
library(tidyverse)
summarize(group_by(airquality,Month),Temp_mean=mean(Temp),Ozone_count=n())
## # A tibble: 5 x 3
## Month Temp_mean Ozone_count
## <int> <dbl> <int>
## 1 5 65.5 31
## 2 6 79.1 30
## 3 7 83.9 31
## 4 8 84.0 31
## 5 9 76.9 30
One of the final basic tasks needed for data manipulation prior to serious analysis is merging different datasets. Unsurprisingly, the function for this included in R is merge
(although there are many other tools for doing this with larger or more complex datasets, using for instance the plyr
or data.table
packages).
Here is a somewhat extended example, adapted from R’s help documentation. First we create two datasets, “authors” and “books”:
authors <- data.frame(
surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
nationality = c("US", "Australia", "US", "UK", "Australia"),
stringsAsFactors=FALSE)
books <- data.frame(
name = c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core"),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
stringsAsFactors=FALSE)
authors
surname nationality
1 Tukey US
2 Venables Australia
3 Tierney US
4 Ripley UK
5 McNeil Australia
books
name title
1 Tukey Exploratory Data Analysis
2 Venables Modern Applied Statistics ...
3 Tierney LISP-STAT
4 Ripley Spatial Statistics
5 Ripley Stochastic Simulation
6 McNeil Interactive Data Analysis
7 R Core An Introduction to R
Now we wish to merge them by author name. We do this with the merge
function:
bookmerge <- merge(authors, books, by.x="surname",by.y="name")
bookmerge
surname nationality title
1 McNeil Australia Interactive Data Analysis
2 Ripley UK Spatial Statistics
3 Ripley UK Stochastic Simulation
4 Tierney US LISP-STAT
5 Tukey US Exploratory Data Analysis
6 Venables Australia Modern Applied Statistics ...
The first argument is the first data frame, the second argument is the second dataframe. by.x
is the column name(s) to match on from the first dataset, by.y
is the column name(s) from the second. If no by.x
or by.y
arguments are supplied, merge will look for variables with identical names to match on, but it is generally better to include the names explicitly even if they are the same. If you wish to merge on multiple columns, by.x
and by.y
can of course be vectors, such as by.x=c("name","height")
.
Finally, by default merge omits rows that aren’t matched in both datasets; to include those that are in the first data frame but not the second, include all.x=TRUE
; all.y=TRUE
does the same for the second data frame.
One of the other convenient functions from Tidyverse is join
, which merges data much as merge
does, but with a syntax that’s clearer to some users. It is particularly convenient of you have two tables where the columns you wish to match on are identically named, in which case you can run:
bookmerge <- inner_join(authors, books) # note this won't work since the columns have different names
inner_join
is equivalent to the default for merge
, which only retains rows that are in both data frames. If you want to keep all rows from both data frames, use full_join
; if you want just the unmatched rows from the first data frame (equivalent to all.x=TRUE
), use left_join
, and if you want all from the second, use right_join
. If you want to specify which columns to match, just use by=columnname
, and if they don’t have the same name in the two tables, you can use:
bookmerge <- inner_join(authors, books, by=c("surname"="name"))
One of the most innovative tools in the Tidyverse is the %>%
or “pipe” operator. Basically, this turns a sequence of steps where you save the intermediate result of each step into a pipeline which takes one input, pipes it through a series of steps, and produces one final output. For instance, consider the following steps
x <- 1:5
x1 <- x^2
x2 <- diff(x1)
x3 <- abs(x2)
x4 <- round(x3,2)
x4
## [1] 3 5 7 9
In the above sequence, we did a series of operations on x
in order to get our final output, x4
; we saved a bunch of values along the way, but we didn’t actually care about those intermediate values, and to be honest don’t particularly want them cluttering up our environment after we are done. We could of course always write it in nested form:
round(abs(diff((1:5)^2)),2)
## [1] 3 5 7 9
But that’s both hard to read and hard to debug. Here is how we woud write it in the Tidyverse lingo, using the pipe operator:
set.seed(1)
x4 <- (1:5)^2 %>%
diff() %>%
abs() %>%
round(2)
x4
## [1] 3 5 7 9
What’s going on here? Bascially, the pipe operator %>%
sends the output from top to bottom, implicitly putting the output from the previous function as the first argument in the next function. So (1:5)^2 %>% diff()
generates the sequence from 1 to 5, squares those numbers, and then pipes the output from that (a vector of length 5) into the diff function, which assumes (thanks the the pipe) that its first (and only) argument is the previous output. So instead of writing diff(x)
, we omit the x
and just write diff()
. So when the chain gets to round(2)
, once again the first argument (what to round) is omitted, and we just include the second argument, 2, ie how many digits to round to.
Somewhat confusingly, instead of saving the result of that long chain that begins with 1:5
at the end, we save it at the beginning, with x4 <- ...
there are other ways to do this, but this is the most clear about the fact that the final result is being saving as x4
.
This approach is especially useful not for algebra, but for data manipulation, where you wish to perform a sequence of transformations on a dataset and don’t want to keep intermediate steps. For instance, consider our summarize
approach above. We can rewrite this using the pipe operator as:
summaryair <-
airquality %>%
group_by(Month) %>%
summarize(Temp_mean=mean(Temp),Ozone_count=n())
summaryair
## # A tibble: 5 x 3
## Month Temp_mean Ozone_count
## <int> <dbl> <int>
## 1 5 65.5 31
## 2 6 79.1 30
## 3 7 83.9 31
## 4 8 84.0 31
## 5 9 76.9 30
This is considerably easier to read than the nested approach above, even had we broken it up into multiple intermediate stages, and it doesn’t waste memory saving things along the way, which can make a big difference with bigger datasets.
The pipe operator, along with much of the Tidyverse collection of functions, has become immensely popular, and it is worth knowing if only to be able to read a large portion of modern R code. There are some drawbacks, however. The pipe can sometimes make debugging harder if you have piped a long sequence of operations; there are times when you find you intermediate datasets and have to chop it up again later; it requires a series of new functions to replace standard R methods like subsetting (eg, mydata[,c("columna","columnb")]
needs a new approach); and the fact that first argument of each function is implicitly the output of the previous operation can be confusing for some readers, especially for functions that take two symmetrical arguments. But overall, it can be a useful tool, and has become quite popular.