Overview

In this lesson we learn to aggreate, merge, and manipulate strings in datasets.

Objectives

After completing this lesson, students should be able to:

Create summary datasets and statistics using “aggregate.”
Merge datasets.
Manipulate strings.

Readings

Lander, Chapters 11.2, 12.2, 13

1 Aggregate

We often wish to describe data aggregated over some variables: eg, mean income by political party affiliation, or mean height by gender. Aggregate does this using a simple syntax.

R comes with many sample datasets built in, and many packages come with their own sample datasets to illustrate their functions. One such dataset is the airquality data, which measures various weather qualities in New York in over a series of days in 1973. To view the first 6 rows of any dataset, one can use the head function:

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Tail gives the last 6 rows. Obviously airquality[1:6,] would also do the trick.

1.1 Aggregate over a variable

To view, for instance, the mean temperature aggregated by month in this data, we would use aggregate:

aggregate(Temp ~ Month, data=airquality, mean)

  Month     Temp
1     5 65.54839
2     6 79.10000
3     7 83.90323
4     8 83.96774
5     9 76.90000

The first argument consists of the variables to be aggregated on the left of the ~, and the variables by which to aggregate on the right. The second argument specifies the dataset name, and the third is the function to apply (which can be anything).

1.2 Aggregate over two variables

If we want to summarize mean temperature and ozone levels by month, we would write:

aggregate(cbind(Temp,Ozone) ~ Month, data=airquality,mean)

  Month     Temp    Ozone
1     5 66.73077 23.61538
2     6 78.22222 29.44444
3     7 83.88462 59.11538
4     8 83.96154 59.96154
5     9 76.89655 31.44828

And if we want to aggregate over two variables, eg month and day, we would write:

aggregate(cbind(Temp,Ozone) ~ Month + Day, data=airquality,mean)

The notation here might seem a bit inconsistent, but the ~ is part of the “formula” notation in R that is used often in statistical models, as we will soon see (basically, like “<-”, “~” is yet another way to deal with the ambiguous “=” sign).

2 Merging data

One of the final basic tasks needed for data manipulation prior to serious analysis is merging different datasets. Unsurprisingly, the function for this included in R is merge (although there are many other tools for doing this with larger or more complex datasets, using for instance the plyr or data.table packages).

Here is a somewhat extended example, adapted from R’s help documentation. First we create two datasets, “authors” and “books”:

authors <- data.frame(
    surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"), 
    nationality = c("US", "Australia", "US", "UK", "Australia"),
    stringsAsFactors=FALSE)
books <- data.frame(
    name = c("Tukey", "Venables", "Tierney",
               "Ripley", "Ripley", "McNeil", "R Core"),
    title = c("Exploratory Data Analysis",
              "Modern Applied Statistics ...",
              "LISP-STAT",
              "Spatial Statistics", "Stochastic Simulation",
              "Interactive Data Analysis",
              "An Introduction to R"),
    stringsAsFactors=FALSE)

2.1 View data

authors

   surname nationality
1    Tukey          US
2 Venables   Australia
3  Tierney          US
4   Ripley          UK
5   McNeil   Australia

books

      name                         title
1    Tukey     Exploratory Data Analysis
2 Venables Modern Applied Statistics ...
3  Tierney                     LISP-STAT
4   Ripley            Spatial Statistics
5   Ripley         Stochastic Simulation
6   McNeil     Interactive Data Analysis
7   R Core          An Introduction to R

2.2 Merge by

Now we wish to merge them by author name. We do this with the merge function:

bookmerge <- merge(authors, books, by.x="surname",by.y="name")
bookmerge

   surname nationality                         title
1   McNeil   Australia     Interactive Data Analysis
2   Ripley          UK            Spatial Statistics
3   Ripley          UK         Stochastic Simulation
4  Tierney          US                     LISP-STAT
5    Tukey          US     Exploratory Data Analysis
6 Venables   Australia Modern Applied Statistics ...

The first argument is the first data frame, the second argument is the second dataframe. by.x is the column name(s) to match on from the first dataset, by.y is the column name(s) from the second. If no by.x or by.y arguments are supplied, merge will look for variables with identical names to match on, but it is generally better to include the names explicitly even if they are the same. If you wish to merge on multiple columns, by.x and by.y can of course be vectors, such as by.x=c("name","height").

Finally, by default merge omits rows that aren’t matched in both datasets; to include those that are in the first data frame but not the second, include all.x=TRUE; all.y=TRUE does the same for the second data frame.

3 Strings

One last tidbit. As we have seen, print("Hello") will output the string “Hello”; we could include print(i) in a for loop, for instance, to keep track of progress.

To concatenate strings (eg, if you want to save a bunch of datasets, each with a different number), one uses the paste function:

for(i in 1:3){
  filename = paste("datafile",i,".txt",sep="")
  print(filename)
}

[1] "datafile1.txt"
[1] "datafile2.txt"
[1] "datafile3.txt"

All the arguments in paste are pasted together except for the last one, which specifies what to put in between each argument – in this case it was nothing "", but it could be a comma if we wanted to create a comma-separated data file (the hard way!).

3.1 Split

The inverse of paste is splitting a string with strsplit:

sout <- strsplit("good,bad,happy,sad",",")
sout

[[1]]
[1] "good"  "bad"   "happy" "sad"

This splits the string(s) in the first argument at the string in the second argument, and saves the output as a list, which allows for easier subsequent processing.

3.2 Find

R of course has many other functions for strings that could cover an entire course. The last ones we will cover here are finding and replacing. To test whether a string or value is in another set (including a vector or table), we can use %in%:

c("fish","dog",2) %in% c("happy","fish","pie",2)

[1]  TRUE FALSE  TRUE

Note that in this example we do three searches (one for each element of the vector on the left of the %in%), and it returns TRUE if a search element matches any of the elments of the vector being searched.

3.3 Partial finding

Partial matches don’t work with %in% :

"fish" %in% "I would like to go fishing"

[1] FALSE

For partial matches, you can use grep:

grep("fish", c("I would like to go fishing","dog my cats","fishsticks"))

[1] 1 3

Note that grep outputs the element numbers for the elements in the searched vector where it finds matches. Grep uses the powerful regex search language, which we won’t cover here, but which allows very complex (though often slow) searches for string patterns.

3.4 Replacement

The last essential string tool is replacement, using gsub:

gout <- gsub("Sad","Happy",c("Sad Birthday","Sad dog"))
gout

[1] "Happy Birthday" "Happy dog"

The first argument is what to look for, the second is what to replace it with, and the third is what to search, which can of course be a vector of strings, not just a single string.

Computational Statistics 2.3: Data Manipulation