Overview
In this lesson you will learn some core functions and tools for dealing with textual data.
Objectives
Readings
Lander, Ch. 16
One last minor but important topic in data manipulation is working with text. As we have seen, print("Hello") will output the string “Hello”; we could include print(i) in a for loop, for instance, to keep track of progress.
To concatenate strings (e.g., if you want to save a bunch of datasets, each with a different number), use the paste function:
for (i in 1:3) {
  # Build a file name such as "datafile1.txt" with no separator between the parts
  filename <- paste("datafile", i, ".txt", sep = "")
  print(filename)
}
[1] "datafile1.txt"
[1] "datafile2.txt"
[1] "datafile3.txt"
All the arguments in paste are pasted together except for sep, which specifies what to put between them. In this case it was the empty string "", but it could be a comma if we wanted to create a comma-separated data file (the hard way!).
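As a minimal sketch of the comma case just mentioned, the following contrasts sep with the related collapse argument of paste. The column names are hypothetical and chosen only for illustration.

```r
# Hypothetical column names, used only for illustration
parts <- c("id", "name", "score")

# sep goes between the separate arguments passed to paste
header <- paste("id", "name", "score", sep = ",")
print(header)

# collapse joins the elements of a single vector into one string
header2 <- paste(parts, collapse = ",")
print(header2)

# paste is vectorized: a vector argument yields one string per element,
# so the loop above can also be written as a single call
filenames <- paste("datafile", 1:3, ".txt", sep = "")
print(filenames)
```

Both header and header2 hold "id,name,score"; the difference is only whether the pieces arrive as separate arguments or as one vector.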
The inverse of paste is splitting a string with strsplit:
sout <- strsplit("good,bad,happy,sad",",")
sout
[[1]]
[1] "good" "bad" "happy" "sad"
This splits the string(s) in the first argument at the string in the second argument, and saves the output as a list, which allows for easier subsequent processing.
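To illustrate why the result is a list, here is a small sketch with two input strings (hypothetical values). Each input string gets its own list element, and unlist flattens everything into one vector when the grouping is not needed.

```r
# Hypothetical input: two comma-separated strings
sout2 <- strsplit(c("good,bad", "happy,sad,calm"), ",")

length(sout2)          # one list element per input string
sout2[[2]]             # the pieces of the second string
sapply(sout2, length)  # number of pieces in each string

# Flatten into a single character vector
all_words <- unlist(sout2)
print(all_words)
```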
R has many other string functions, enough to fill an entire course. The last ones we will cover here are finding and replacing. To test whether a string or value is in another set (including a vector or table), we can use %in%:
c("fish","dog",2) %in% c("happy","fish","pie",2)
[1] TRUE FALSE TRUE
Note that in this example we do three searches (one for each element of the vector on the left of the %in%), and each returns TRUE if the search element matches any of the elements of the vector being searched.
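Because %in% returns one logical value per element on its left, a common use is to subset a vector. The sketch below uses hypothetical vectors named pets and allowed. Note also that in the example above, c("fish","dog",2) mixes strings and a number, so R coerces the 2 to the string "2" before comparing.

```r
# Hypothetical vectors, used only for illustration
pets <- c("fish", "dog", "cat", "bird")
allowed <- c("cat", "fish")

is_allowed <- pets %in% allowed
print(is_allowed)

# Keep only the elements that appear in the allowed set
kept <- pets[is_allowed]
print(kept)

# Negate the test to keep everything else
dropped <- pets[!is_allowed]
print(dropped)
```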
Partial matches don’t work with %in%:
"fish" %in% "I would like to go fishing"
[1] FALSE
For partial matches, you can use grep:
grep("fish", c("I would like to go fishing","dog my cats","fishsticks"))
[1] 1 3
Note that grep returns the indices of the elements in the searched vector that contain a match. It interprets its pattern as a regular expression (regex), a powerful search language that we won’t cover in detail here, but which allows very complex (though often slow) searches for string patterns.
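A few base R variations on the same search can be sketched as follows. The value and ignore.case arguments and the grepl function are not used elsewhere in this lesson; they are shown here only as commonly useful options.

```r
phrases <- c("I would like to go fishing", "dog my cats", "fishsticks")

# Indices of the matching elements (the default)
idx <- grep("fish", phrases)
print(idx)

# The matching strings themselves
hits <- grep("fish", phrases, value = TRUE)
print(hits)

# One TRUE/FALSE per element, convenient for subsetting
flags <- grepl("fish", phrases)
print(flags)

# Matching is case-sensitive unless ignore.case = TRUE
idx_ci <- grep("FISH", phrases, ignore.case = TRUE)
print(idx_ci)
```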
The complementary essential string tool is replacement, using gsub:
gout <- gsub("Sad","Happy",c("Sad Birthday","Sad dog"))
gout
[1] "Happy Birthday" "Happy dog"
The first argument is what to look for, the second is what to replace it with, and the third is what to search, which can of course be a vector of strings, not just a single string.
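As a brief sketch of two details worth knowing: gsub replaces every match within each string, whereas the related base R function sub (not covered above) replaces only the first, and both are case-sensitive by default. The input vector here is hypothetical.

```r
# Hypothetical input with a repeated word
x <- c("Sad Sad Birthday", "Sad dog")

# gsub replaces every occurrence in each string
all_replaced <- gsub("Sad", "Happy", x)
print(all_replaced)

# sub replaces only the first occurrence in each string
first_replaced <- sub("Sad", "Happy", x)
print(first_replaced)

# No match, because the search is case-sensitive: strings come back unchanged
unchanged <- gsub("sad", "Happy", x)
print(unchanged)
```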
grep, gsub, strsplit, and many other text search and replace functions can use a special notation system called “regular expressions” to find complex text patterns. Describing this syntax goes beyond what we can cover here; see here for a quick intro and here for a cheatsheet. The basic idea is that you combine punctuation and special syntax to describe abstract patterns, such as “[A-Z]” to match any uppercase letter, “[^A-Z]” to match any character other than an uppercase letter, “.” to match any character at all, and so on. When properly wielded, regular expressions can capture almost any textual pattern you can describe, such as proper names, where you want to extract from a text only pairs of words that both start with capital letters.
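The three patterns just described can be tried out directly. The sketch below also attempts the proper-name idea using gregexpr and regmatches from base R; the pattern for names relies on "+" (one or more of the preceding item), which is not introduced above, and the sample text is hypothetical.

```r
# Hypothetical strings, used only for illustration
words <- c("Alice", "bob", "R2D2", "x")

# "[A-Z]": does the string contain any uppercase letter?
has_upper <- grepl("[A-Z]", words)
print(has_upper)

# "[^A-Z]": delete everything that is not an uppercase letter
only_upper <- gsub("[^A-Z]", "", words)
print(only_upper)

# ".": every character matches, so each one is replaced
masked <- gsub(".", "*", "cat")
print(masked)

# Pairs of adjacent words that both start with a capital letter
text <- "we saw Rosa Parks and then Coretta King downtown"
name_pattern <- "[A-Z][a-z]+ [A-Z][a-z]+"
found <- regmatches(text, gregexpr(name_pattern, text))[[1]]
print(found)
```

This simple name pattern would also pick up a capitalized sentence-initial word followed by a name, so real text usually needs a more careful pattern.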
Putting a few of the tools above to work, we can read in a raw text document using readLines, use strsplit and regex to chop it up at all non-letters, convert everything to lowercase, remove empty strings, and then count up all the strings using table. This gives us the word counts for the document, of which we show the top 10 words.
singleString <- paste(readLines("i_have_a_dream.txt"), collapse=" ")
splitstring <- strsplit(singleString,"[^a-zA-Z]")[[1]]
lowerstring <- tolower(splitstring)
lowerstring_noblanks <- lowerstring[lowerstring != ""]
wordcounts <- as.data.frame(table(lowerstring_noblanks))
wordcounts[order(-wordcounts$Freq)[1:10],]
## lowerstring_noblanks Freq
## 459 the 103
## 319 of 99
## 474 to 59
## 18 and 54
## 1 a 37
## 33 be 33
## 512 we 30
## 525 will 27
## 458 that 24
## 229 is 23
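To try the same steps without the text file, here is a minimal sketch that substitutes a small hypothetical character vector for the output of readLines. It also shows where the empty strings come from: two adjacent non-letters (such as a period followed by a space) leave an empty piece between them.

```r
# Hypothetical stand-in for the lines that readLines would return
doc_lines <- c("The cat and the hat.", "The hat, the cat!")

# Join all lines into one string, then split at every non-letter
single <- paste(doc_lines, collapse = " ")
pieces <- strsplit(single, "[^a-zA-Z]")[[1]]
print(pieces)  # note the empty strings between adjacent separators

# Normalize case and drop the empty strings
words <- tolower(pieces)
words <- words[words != ""]

# Count each distinct word and sort by descending frequency
counts <- as.data.frame(table(words))
counts <- counts[order(-counts$Freq), ]
print(counts)
```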