Overview

In this lesson you will learn some core functions and tools for dealing with textual data.

Objectives

  1. Manipulate strings, such as finding and replacing text.

Readings

Lander, Ch. 16

1 Strings

Strings are one last minor but important topic in data manipulation. As we have seen, print("Hello") will output the string “Hello”; we could include print(i) in a for loop, for instance, to keep track of progress.

To concatenate strings (e.g., if you want to save a bunch of datasets, each with a different number in its file name), use the paste function:

for(i in 1:3){
  filename = paste("datafile",i,".txt",sep="")
  print(filename)
}
[1] "datafile1.txt"
[1] "datafile2.txt"
[1] "datafile3.txt"

All the unnamed arguments to paste are pasted together. The named sep argument specifies what to put between them; in this case it was nothing (""), but it could be a comma if we wanted to create a comma-separated data file (the hard way!). If sep is omitted, paste uses a single space.
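As a small sketch of the separator behaviour described above (the variable names here are illustrative, not from the lesson), paste also works on whole vectors at once, paste0 is a shorthand for sep = "", and the collapse argument joins the elements of a vector into a single string:

```r
# paste() works element-wise on vectors; sep goes between the pieces.
filenames <- paste("datafile", 1:3, ".txt", sep = "")
filenames
# [1] "datafile1.txt" "datafile2.txt" "datafile3.txt"

# paste0() is shorthand for paste(..., sep = "").
filenames0 <- paste0("datafile", 1:3, ".txt")

# With no sep given, the default separator is a single space.
spaced <- paste("datafile", 1)
spaced
# [1] "datafile 1"

# collapse joins the elements of one vector into a single string.
csv_line <- paste(c("good", "bad", "happy", "sad"), collapse = ",")
csv_line
# [1] "good,bad,happy,sad"
```

Building the whole vector of names in one call avoids the loop when the names are only needed as data.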

2 Split

The inverse of paste is splitting a string with strsplit:

sout <- strsplit("good,bad,happy,sad",",")
sout
[[1]]
[1] "good"  "bad"   "happy" "sad"  

This splits each string in the first argument at every occurrence of the string in the second argument, and returns the output as a list with one element per input string, which allows for easier subsequent processing.
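A brief sketch of working with that list output, using made-up input strings: each input string gets its own list element, which can be pulled out with [[ ]] or flattened with unlist. Because the second argument is treated as a regular expression by default, splitting on a literal "." needs fixed = TRUE.

```r
# Two input strings give a list with two elements.
pieces <- strsplit(c("good,bad", "happy,sad,glad"), ",")
pieces[[1]]
# [1] "good" "bad"
pieces[[2]]
# [1] "happy" "sad"   "glad"

# Number of pieces per input string.
n_pieces <- lengths(pieces)
n_pieces
# [1] 2 3

# Flatten the list into one character vector.
all_pieces <- unlist(pieces)

# "." means "any character" as a regex, so ask for a literal match.
dotted <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
dotted
# [1] "a" "b" "c"
```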

3 Find

R of course has many other functions for strings, enough to fill an entire course. The last ones we will cover here are finding and replacing. To test whether a string or value is in another set (including a vector or table), we can use %in%:

c("fish","dog",2) %in% c("happy","fish","pie",2)
[1]  TRUE FALSE  TRUE

Note that in this example we do three searches (one for each element of the vector on the left of the %in%), and it returns TRUE if a search element matches any of the elements of the vector being searched.
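Since %in% returns one TRUE or FALSE per search element, the result can be used directly to filter a vector. The following is a minimal sketch with invented vectors (pets and wanted are not from the lesson); match is shown alongside because it reports the position of the first match rather than just whether one exists.

```r
pets   <- c("fish", "dog", "cat", "fish")
wanted <- c("fish", "cat")

# One logical value per element of pets.
is_wanted <- pets %in% wanted
is_wanted
# [1]  TRUE FALSE  TRUE  TRUE

# Use the logical vector to keep only the matching elements.
kept <- pets[is_wanted]
kept
# [1] "fish" "cat"  "fish"

# which() gives the positions of the TRUE values.
positions <- which(is_wanted)
positions
# [1] 1 3 4

# match() gives the position in wanted of each element's first match (NA if none).
first_match <- match(pets, wanted)
first_match
# [1]  1 NA  2  1
```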

4 Partial finding

Partial matches don’t work with %in% :

"fish" %in% "I would like to go fishing"
[1] FALSE

For partial matches, you can use grep:

grep("fish", c("I would like to go fishing","dog my cats","fishsticks"))
[1] 1 3

Note that grep outputs the element numbers for the elements in the searched vector where it finds matches. grep uses the powerful regex search language, which we won’t cover in detail here, but which allows very complex (though sometimes slow) searches for string patterns.
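A short sketch of some common variations on the search above, reusing the same example vector (the name phrases is illustrative): grepl returns TRUE/FALSE for every element instead of positions, value = TRUE returns the matching strings themselves, ignore.case relaxes capitalization, and fixed = TRUE turns off regex interpretation of the pattern.

```r
phrases <- c("I would like to go fishing", "dog my cats", "fishsticks")

# Logical result, one value per element (handy for filtering).
has_fish <- grepl("fish", phrases)
has_fish
# [1]  TRUE FALSE  TRUE

# Return the matching strings rather than their positions.
fish_phrases <- grep("fish", phrases, value = TRUE)
fish_phrases
# [1] "I would like to go fishing" "fishsticks"

# Matching is case-sensitive unless ignore.case = TRUE.
upper_strict <- grepl("FISH", phrases)
upper_loose  <- grepl("FISH", phrases, ignore.case = TRUE)

# fixed = TRUE treats the pattern as literal text, so "." is just a period.
has_period <- grepl(".", "no dots here", fixed = TRUE)
has_period
# [1] FALSE
```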

5 Replacement

The other essential string tool is replacement, using gsub:

gout <- gsub("Sad","Happy",c("Sad Birthday","Sad dog"))
gout
[1] "Happy Birthday" "Happy dog"     

The first argument is what to look for, the second is what to replace it with, and the third is what to search, which can of course be a vector of strings, not just a single string. gsub replaces every match it finds in each string.
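To illustrate how replacement behaves when a string contains more than one match, here is a small sketch with an invented input string: sub replaces only the first match in each string, gsub replaces all of them, and both are case-sensitive unless ignore.case = TRUE is given.

```r
pets_text <- "Sad dog, Sad cat"

# sub() replaces only the first match in each string.
first_only <- sub("Sad", "Happy", pets_text)
first_only
# [1] "Happy dog, Sad cat"

# gsub() replaces every match.
all_matches <- gsub("Sad", "Happy", pets_text)
all_matches
# [1] "Happy dog, Happy cat"

# Patterns are case-sensitive by default, so "sad" finds nothing here...
unchanged <- gsub("sad", "Happy", pets_text)

# ...unless ignore.case = TRUE is set.
any_case <- gsub("sad", "Happy", pets_text, ignore.case = TRUE)
```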

6 Regular expressions

grep, gsub, strsplit, and many other text search and replace functions can use a special notation system called “regular expressions” to find complex text patterns. Describing this syntax goes beyond what we can cover here; see here for a quick intro and here for a cheatsheet. The basic idea is that you use various bits of punctuation and syntax to describe abstract patterns, such as “[A-Z]” to match any uppercase letter, “[^A-Z]” to match any character other than an uppercase letter, “.” to match any character at all, and so on. When properly wielded, regular expressions can capture almost any textual pattern you can describe, such as proper names, where you want to copy from a text only pairs of words that both start with capital letters.
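The three patterns just mentioned can be tried directly with gsub. The sketch below uses invented example sentences, and the proper-name pattern is only one simple way to express "two adjacent capitalized words" (it assumes plain ASCII letters and would also pick up any other capitalized pair, such as a sentence-initial word followed by a name). regmatches with gregexpr extracts the matched text rather than replacing it.

```r
speech <- "Martin Luther King spoke in Washington"

# [A-Z] matches any single uppercase letter.
caps_hidden <- gsub("[A-Z]", "_", speech)
caps_hidden
# [1] "_artin _uther _ing spoke in _ashington"

# [^A-Z] matches anything that is NOT an uppercase letter.
caps_only <- gsub("[^A-Z]", "", speech)
caps_only
# [1] "MLKW"

# . matches any character at all.
all_x <- gsub(".", "x", "abc")
all_x
# [1] "xxx"

# Pairs of adjacent words that both start with a capital letter.
sentence <- "the president Abraham Lincoln met the writer Frederick Douglass"
name_pattern <- "[A-Z][a-z]+ [A-Z][a-z]+"
proper_names <- regmatches(sentence, gregexpr(name_pattern, sentence))[[1]]
proper_names
# [1] "Abraham Lincoln"    "Frederick Douglass"
```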

7 Reading text and counting words

Putting a few of the tools above to work, we can read in a raw text document using readLines, use strsplit and regex to chop it up at all non-letters, convert everything to lowercase, remove empty strings, and then count up all the strings using table. This gives us the word counts for the document, of which we show the top 10 words.

singleString <- paste(readLines("i_have_a_dream.txt"), collapse=" ")
splitstring <- strsplit(singleString,"[^a-zA-Z]")[[1]]
lowerstring <- tolower(splitstring)
lowerstring_noblanks <- lowerstring[lowerstring != ""]
wordcounts <- as.data.frame(table(lowerstring_noblanks))
wordcounts[order(-wordcounts$Freq)[1:10],]
##     lowerstring_noblanks Freq
## 459                  the  103
## 319                   of   99
## 474                   to   59
## 18                   and   54
## 1                      a   37
## 33                    be   33
## 512                   we   30
## 525                 will   27
## 458                 that   24
## 229                   is   23
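The same steps can be tried without the text file. The sketch below wraps them in a function and runs it on a few made-up lines; count_words and sample_lines are hypothetical names introduced here for illustration, and the splitting pattern adds a + so that runs of non-letters are treated as a single break (empty strings are still filtered out afterwards, as in the code above).

```r
# Hypothetical helper: word counts for a character vector of lines.
count_words <- function(lines) {
  single_string <- paste(lines, collapse = " ")
  # Split at runs of non-letters, then normalize case.
  words <- tolower(strsplit(single_string, "[^a-zA-Z]+")[[1]])
  # Drop any empty strings left by leading non-letters.
  words <- words[words != ""]
  counts <- as.data.frame(table(words), stringsAsFactors = FALSE)
  # Most frequent words first.
  counts[order(-counts$Freq), ]
}

# Made-up input standing in for readLines("i_have_a_dream.txt").
sample_lines <- c("I have a dream today.",
                  "I have a dream that one day",
                  "we will be free.")

sample_counts <- count_words(sample_lines)
head(sample_counts, 4)
```

To run it on the real document, pass readLines("i_have_a_dream.txt") in place of sample_lines.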