Overview
In this lesson we learn to aggreate, merge, and manipulate strings in datasets.
Objectives
After completing this lesson, students should be able to:
Readings
Lander, Chapters 11.2, 12.2, 13
We often wish to describe data aggregated over some variables: eg, mean income by political party affiliation, or mean height by gender. Aggregate
does this using a simple syntax.
R comes with many sample datasets built in, and many packages come with their own sample datasets to illustrate their functions. One such dataset is the airquality data, which measures various weather qualities in New York in over a series of days in 1973. To view the first 6 rows of any dataset, one can use the head function:
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Tail
gives the last 6 rows. Obviously airquality[1:6,]
would also do the trick.
To view, for instance, the mean temperature aggregated by month in this data, we would use aggregate:
aggregate(Temp ~ Month, data=airquality, mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
The first argument consists of the variables to be aggregated on the left of the ~
, and the variables by which to aggregate on the right. The second argument specifies the dataset name, and the third is the function to apply (which can be anything).
If we want to summarize mean temperature and ozone levels by month, we would write:
aggregate(cbind(Temp,Ozone) ~ Month, data=airquality,mean)
Month Temp Ozone
1 5 66.73077 23.61538
2 6 78.22222 29.44444
3 7 83.88462 59.11538
4 8 83.96154 59.96154
5 9 76.89655 31.44828
And if we want to aggregate over two variables, eg month and day, we would write:
aggregate(cbind(Temp,Ozone) ~ Month + Day, data=airquality,mean)
The notation here might seem a bit inconsistent, but the ~
is part of the “formula” notation in R that is used often in statistical models, as we will soon see (basically, like “<-
”, “~
” is yet another way to deal with the ambiguous “=
” sign).
One of the final basic tasks needed for data manipulation prior to serious analysis is merging different datasets. Unsurprisingly, the function for this included in R is merge
(although there are many other tools for doing this with larger or more complex datasets, using for instance the plyr
or data.table
packages).
Here is a somewhat extended example, adapted from R’s help documentation. First we create two datasets, “authors” and “books”:
authors <- data.frame(
surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
nationality = c("US", "Australia", "US", "UK", "Australia"),
stringsAsFactors=FALSE)
books <- data.frame(
name = c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core"),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...",
"LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
stringsAsFactors=FALSE)
authors
surname nationality
1 Tukey US
2 Venables Australia
3 Tierney US
4 Ripley UK
5 McNeil Australia
books
name title
1 Tukey Exploratory Data Analysis
2 Venables Modern Applied Statistics ...
3 Tierney LISP-STAT
4 Ripley Spatial Statistics
5 Ripley Stochastic Simulation
6 McNeil Interactive Data Analysis
7 R Core An Introduction to R
Now we wish to merge them by author name. We do this with the merge
function:
bookmerge <- merge(authors, books, by.x="surname",by.y="name")
bookmerge
surname nationality title
1 McNeil Australia Interactive Data Analysis
2 Ripley UK Spatial Statistics
3 Ripley UK Stochastic Simulation
4 Tierney US LISP-STAT
5 Tukey US Exploratory Data Analysis
6 Venables Australia Modern Applied Statistics ...
The first argument is the first data frame, the second argument is the second dataframe. by.x
is the column name(s) to match on from the first dataset, by.y
is the column name(s) from the second. If no by.x
or by.y
arguments are supplied, merge will look for variables with identical names to match on, but it is generally better to include the names explicitly even if they are the same. If you wish to merge on multiple columns, by.x
and by.y
can of course be vectors, such as by.x=c("name","height")
.
Finally, by default merge omits rows that aren’t matched in both datasets; to include those that are in the first data frame but not the second, include all.x=TRUE
; all.y=TRUE
does the same for the second data frame.
One last tidbit. As we have seen, print("Hello")
will output the string “Hello”; we could include print(i)
in a for
loop, for instance, to keep track of progress.
To concatenate strings (eg, if you want to save a bunch of datasets, each with a different number), one uses the paste
function:
for(i in 1:3){
filename = paste("datafile",i,".txt",sep="")
print(filename)
}
[1] "datafile1.txt"
[1] "datafile2.txt"
[1] "datafile3.txt"
All the arguments in paste
are pasted together except for the last one, which specifies what to put in between each argument – in this case it was nothing "", but it could be a comma if we wanted to create a comma-separated data file (the hard way!).
The inverse of paste
is splitting a string with strsplit
:
sout <- strsplit("good,bad,happy,sad",",")
sout
[[1]]
[1] "good" "bad" "happy" "sad"
This splits the string(s) in the first argument at the string in the second argument, and saves the output as a list, which allows for easier subsequent processing.
R of course has many other functions for strings that could cover an entire course. The last ones we will cover here are finding and replacing. To test whether a string or value is in another set (including a vector or table), we can use %in%
:
c("fish","dog",2) %in% c("happy","fish","pie",2)
[1] TRUE FALSE TRUE
Note that in this example we do three searches (one for each element of the vector on the left of the %in%
), and it returns TRUE if a search element matches any of the elments of the vector being searched.
Partial matches don’t work with %in%
:
"fish" %in% "I would like to go fishing"
[1] FALSE
For partial matches, you can use grep
:
grep("fish", c("I would like to go fishing","dog my cats","fishsticks"))
[1] 1 3
Note that grep
outputs the element numbers for the elements in the searched vector where it finds matches. Grep
uses the powerful regex
search language, which we won’t cover here, but which allows very complex (though often slow) searches for string patterns.
The last essential string tool is replacement, using gsub
:
gout <- gsub("Sad","Happy",c("Sad Birthday","Sad dog"))
gout
[1] "Happy Birthday" "Happy dog"
The first argument is what to look for, the second is what to replace it with, and the third is what to search, which can of course be a vector of strings, not just a single string.