In this lesson we examine how data is imported, stored, and exported with R.
After completing this lesson, students should be able to:
Lander, Chapters 5-6.
Data frames are like matrices, in that they store data in a rectangular format. Unlike matrices though, each column can be a different type of data (numbers, characters, dates, etc.). So data frames are the default way to store simple datasets in R.
Data frames can be created much like matrices: the data.frame() function automatically column-binds (cbinds) vectors of the same length together into a data frame:
df1 <- data.frame(1:4,rep(3,4),c(2,4,1,5))
df1
X1.4 rep.3..4. c.2..4..1..5.
1 1 3 2
2 2 3 4
3 3 3 1
4 4 3 5
As you can see, it also gives column and row names. We can get the column names with:
colnames(df1)
[1] "X1.4" "rep.3..4." "c.2..4..1..5."
To change the column names, we just assign a vector of strings to colnames(df1):
colnames(df1) <- c("c1","c2","c3")
df1
c1 c2 c3
1 1 3 2
2 2 3 4
3 3 3 1
4 4 3 5
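The renaming step can also be avoided by naming the columns when the data frame is created. The following is a minimal sketch; the object name df_named and the replacement column name "threes" are made up for illustration and are not part of the lesson's data.

```r
# Name the columns at creation time instead of renaming afterwards.
# df_named and "threes" are illustrative names only.
df_named <- data.frame(c1 = 1:4, c2 = rep(3, 4), c3 = c(2, 4, 1, 5))

colnames(df_named)   # the names supplied above
dim(df_named)        # number of rows, number of columns

# A single column can be renamed by position
colnames(df_named)[2] <- "threes"
colnames(df_named)
```

Naming columns up front tends to be less error-prone than relying on the automatically generated names such as X1.4.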
To work with one of the columns, we use the $ operator to pick it out of the data frame:
ourcol <- df1$c1
ourcol
[1] 1 2 3 4
ourcol is now a new vector with values taken from the first column of the data frame.
We can also pull out subsets of the data frame just as we did with matrices:
df1[2:3,2:3]
c2 c3
2 3 4
3 3 1
or we can pull out the same using column names:
df1[2:3,c("c2","c3")]
c2 c3
2 3 4
3 3 1
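Row numbers and a column name can be mixed as well: df1[c(1,3),"c3"] returns 2 1, the first and third entries of column c3. The same values can be picked out with df1$c3[c(1,3)], which also returns 2 1.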
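One detail worth seeing in code is that selecting a single column simplifies the result to a plain vector, while drop = FALSE keeps it as a data frame. This is a small sketch using a fresh copy of the example data; the names df_sub, v1, v2, d1, and big are illustrative only.

```r
# A fresh copy of the example data (df_sub is an illustrative name)
df_sub <- data.frame(c1 = 1:4, c2 = rep(3, 4), c3 = c(2, 4, 1, 5))

v1 <- df_sub[c(1, 3), "c3"]                 # one column: simplified to a vector
v2 <- df_sub$c3[c(1, 3)]                    # the same values via $
d1 <- df_sub[c(1, 3), "c3", drop = FALSE]   # drop = FALSE keeps a data frame

# A logical condition can also be used to select rows
big <- df_sub[df_sub$c3 > 1, ]

class(v1)
class(d1)
big
```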
Most importantly, data frames allow us to combine vector types:
df1 <- cbind(df1,c("truck","car","lettuce","porkchop"))
colnames(df1)[4] <- "things"
df1
c1 c2 c3 things
1 1 3 2 truck
2 2 3 4 car
3 3 3 1 lettuce
4 4 3 5 porkchop
To see the internal structure of our data frame, including what types the variables in it are, we can use the str function:
str(df1)
'data.frame': 4 obs. of 4 variables:
$ c1 : int 1 2 3 4
$ c2 : num 3 3 3 3
$ c3 : num 2 4 1 5
$ things: Factor w/ 4 levels "car","lettuce",..: 4 1 2 3
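Using str, we saw that our final column has been stored as a factor rather than as character data. (This is the behavior of R versions before 4.0.0; newer versions keep text as character by default.) To change it, we can use our earlier technique: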
df1$things <- as.character(df1$things)
str(df1)
'data.frame': 4 obs. of 4 variables:
$ c1 : int 1 2 3 4
$ c2 : num 3 3 3 3
$ c3 : num 2 4 1 5
$ things: chr "truck" "car" "lettuce" "porkchop"
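Whether text ends up as a factor depends on the R version (the default changed in R 4.0.0), so it can help to state the choice explicitly. The sketch below passes stringsAsFactors directly to data.frame(); the object names df_chr, df_fac, and back are illustrative and do not appear in the lesson.

```r
# Setting stringsAsFactors explicitly makes the result independent of R version.
things <- c("truck", "car", "lettuce", "porkchop")

df_chr <- data.frame(c1 = 1:4, things = things, stringsAsFactors = FALSE)
df_fac <- data.frame(c1 = 1:4, things = things, stringsAsFactors = TRUE)

class(df_chr$things)    # character
class(df_fac$things)    # factor
levels(df_fac$things)   # the distinct values, in sorted order

# Converting a factor back to plain text
back <- as.character(df_fac$things)
```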
RStudio also allows you to view certain rectangular datasets in a graphical manner that might be more familiar from other statistics programs:
View(df1)
You can even directly edit some data objects using
edit(df1)
But this is strongly discouraged, since it leaves no record of what you have done. It is much better to use textual commands to assign new values to your data.
If we want to save our data frame to a text file, we can export it using the write.table function:
write.table(df1,file="testdata.csv",row.names=FALSE,sep=",")
Here the first entry is the data frame name and the second entry is the name of the text file, which can include directory information, such as “/Users/nick/Desktop/testdata.csv”; a file without directory information is by default saved to your working directory.
The third option specifies that we don’t want a separate column for row names (which by default is just a column from 1 to N), and the fourth option sets the delimiter to put between each entry in the text file. A comma is standard (hence the name “comma-separated values” and the “.csv” suffix), but you may want something else if, for instance, you have a text column with commas in it.
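To see how write.table copes with commas inside a text column, here is a small sketch that writes to a temporary file rather than the working directory. The data frame df_out and its contents are invented for illustration.

```r
# Illustrative data with a comma inside a text field
df_out <- data.frame(id = 1:2,
                     note = c("plain", "has, a comma"),
                     stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")   # temporary file, removed when R exits
write.table(df_out, file = tmp, row.names = FALSE, sep = ",")

# With the default quote = TRUE, text fields are wrapped in double quotes,
# so the embedded comma is not mistaken for a delimiter.
readLines(tmp)

# Reading the file back recovers the original text
df_back <- read.table(tmp, header = TRUE, sep = ",", stringsAsFactors = FALSE)
df_back$note
```

If quoting is not an option for whatever program will read the file, a different delimiter such as a tab (sep = "\t") is a common alternative.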
To read a text file with data in rectangular format into R, we do something similar:
newdf <- read.table(file="testdata.csv",header=TRUE,sep=",",stringsAsFactors=FALSE)
newdf
c1 c2 c3 things
1 1 3 2 truck
2 2 3 4 car
3 3 3 1 lettuce
4 4 3 5 porkchop
The header option tells R that the first row contains the column names, and the stringsAsFactors option tells R to treat columns with text as text and not as factors (converting text to factors was the default before R 4.0.0).
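For comma-separated files, read.csv() is a convenience wrapper around read.table() with header = TRUE and sep = "," already set. The sketch below builds a tiny file on the fly so it does not depend on testdata.csv; the object names a, b, and typed are illustrative.

```r
# Build a small CSV in a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("c1,c2,things", "1,3,truck", "2,3,car"), tmp)

# These two calls read the same file in the same way
a <- read.table(tmp, header = TRUE, sep = ",", stringsAsFactors = FALSE)
b <- read.csv(tmp, stringsAsFactors = FALSE)
all.equal(a, b)

# colClasses sets the column types explicitly instead of letting R guess
typed <- read.csv(tmp, colClasses = c("integer", "numeric", "character"))
str(typed)
```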
To read in a file from the internet, the file name can simply be a URL.
landerfile <- read.table(file="http://www.jaredlander.com/data/Tomato%20First.csv",
header=TRUE,sep=",",stringsAsFactors=FALSE)
To quickly check it, you can examine the first six lines using head:
head(landerfile)
Round Tomato Price Source Sweet Acid Color Texture
1 1 Simpson SM 3.99 Whole Foods 2.8 2.8 3.7 3.4
2 1 Tuttorosso (blue) 2.99 Pioneer 3.3 2.8 3.4 3.0
3 1 Tuttorosso (green) 0.99 Pioneer 2.8 2.6 3.3 2.8
4 1 La Fede SM DOP 3.99 Shop Rite 2.6 2.8 3.0 2.3
5 2 Cento SM DOP 5.49 D Agostino 3.3 3.1 2.9 2.8
6 2 Cento Organic 4.99 D Agostino 3.2 2.9 2.9 3.1
Overall Avg.of.Totals Total.of.Avg
1 3.4 16.1 16.1
2 2.9 15.3 15.3
3 2.9 14.3 14.3
4 2.8 13.4 13.4
5 3.1 14.4 15.2
6 2.9 15.5 15.1
Google around and find a csv file somewhere online. Read it into R, and check it out with the head() and str() functions. Which variables look like they might have the wrong data type and would require recoding?
Some datasets also have an unwieldy number of columns to display. In that case, selecting just the first few rows and columns with landerfile[1:5,1:5] does the job.
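A few other functions help when sizing up a wide dataset before printing it. The sketch below uses mtcars, a dataset that ships with R, as a stand-in so that it runs without an internet connection; it is not part of the lesson's data.

```r
# mtcars is a built-in example dataset, used here only as a stand-in
dim(mtcars)            # number of rows and columns
names(mtcars)          # all column names at a glance

head(mtcars, n = 3)    # the n argument controls how many rows head() shows

# The same corner-of-the-table trick as above
peek <- mtcars[1:5, 1:5]
peek
```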
R can also read many non-text data formats, including those from SPSS, Stata, SAS, and others. To do so you need to install and load the foreign package, and then use (for instance) read.dta() rather than read.table().
Also of great use is saving to R’s native .RData format, which preserves variable types and is much more efficient than text files for reading and saving large datasets.
save(newdf,file="df1.RData")
rm(newdf)
load(file="df1.RData")
newdf
c1 c2 c3 things
1 1 3 2 truck
2 2 3 4 car
3 3 3 1 lettuce
4 4 3 5 porkchop
Note that the .RData file stores the complete R object, including its name: load() restores the object as newdf, regardless of what the file itself is called.
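The following sketch shows that behavior with a temporary file, and contrasts it with saveRDS() and readRDS(), which store a single object without its name so that it can be assigned to any name on reading. The objects scores, restored, and other_name are invented for illustration.

```r
# Illustrative object to save
scores <- data.frame(id = 1:3, score = c(90, 85, 70))

tmp <- tempfile(fileext = ".RData")
save(scores, file = tmp)
rm(scores)

restored <- load(tmp)   # load() returns the names of the objects it restored
restored
scores                  # the object is back under its original name

# saveRDS()/readRDS() handle one object and let you choose the name on reading
tmp_rds <- tempfile(fileext = ".rds")
saveRDS(scores, tmp_rds)
other_name <- readRDS(tmp_rds)
```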
Lists are the most general-purpose data containers in R that we will be using. Lists can hold collections of almost any type of R object: variables, vectors, data frames, other lists, etc.
list1 <- list(1:5,newdf,3)
list1
[[1]]
[1] 1 2 3 4 5
[[2]]
c1 c2 c3 things
1 1 3 2 truck
2 2 3 4 car
3 3 3 1 lettuce
4 4 3 5 porkchop
[[3]]
[1] 3
Note that the objects in a list are indexed with the double bracket [[ ]] and can be picked out that way:
list1[[1]][3]
[1] 3
In the above example, the [[1]] picks out the first object in the list (the vector 1:5) and the [3] picks out the third element in that vector (3).
The same works for the data frame stored in the list:
list1[[2]][2:4,3]
[1] 4 1 5
That is, the second through fourth items in the third column of the second element in list1.
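The difference between single and double brackets on a list is easy to miss, so here is a brief sketch. A single bracket returns a smaller list, while a double bracket returns the object itself. The list lst and the other names are illustrative.

```r
# Illustrative list with three elements
lst <- list(1:5, c("a", "b"), 3)

one_bracket <- lst[1]     # still a list, of length 1
two_bracket <- lst[[1]]   # the vector 1:5 itself

class(one_bracket)
class(two_bracket)

third <- lst[[1]][3]      # third element of the first object
rest  <- lst[2:3]         # a sub-list holding the second and third objects
```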
We haven’t given names to the three objects in list1, so currently they can only be picked out with numbers in [[ ]], just as a data frame without column names can only be identified with column numbers. The way to name objects in a list is much like that for a data frame:
names(list1) <- c("onetofive","df1","three")
list1$onetofive[3]
[1] 3
Note that str() shows you the internal structures of lists just as it does for data frames or any other object.
str(list1)
List of 3
$ onetofive: int [1:5] 1 2 3 4 5
$ df1 :'data.frame': 4 obs. of 4 variables:
..$ c1 : int [1:4] 1 2 3 4
..$ c2 : int [1:4] 3 3 3 3
..$ c3 : int [1:4] 2 4 1 5
..$ things: chr [1:4] "truck" "car" "lettuce" "porkchop"
$ three : num 3
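As with data frames, names can also be given when the list is created, and elements can be added or removed by name afterwards. This is a minimal sketch; lst2 and the element name label are hypothetical.

```r
# Names supplied at creation time (lst2 is an illustrative name)
lst2 <- list(onetofive = 1:5, three = 3)

lst2$onetofive[3]    # access by name with $
lst2[["three"]]      # or by name inside [[ ]]

# Assigning to a new name adds an element
lst2$label <- "new element"

# Assigning NULL removes an element
lst2$three <- NULL
names(lst2)
```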
Many R objects, including the outputs of many functions, are lists. The objects in your R workspace are likewise held together in a list-like container (an environment), and their names can be seen with
ls()
[1] "df1" "landerfile" "list1" "newdf" "ourcol"
This is also what the “Environment” pane in RStudio shows.
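Strictly speaking, ls() returns the names of the objects as a character vector, and the objects themselves can be gathered into a named list with mget(). The sketch below works in a separate, throwaway environment (called scratch here, an illustrative name) so that it does not touch your actual workspace.

```r
# A throwaway environment standing in for the workspace
scratch <- new.env()
assign("x", 1:3, envir = scratch)
assign("y", "hello", envir = scratch)

nm <- ls(envir = scratch)          # character vector of object names
nm

objs <- mget(nm, envir = scratch)  # named list of the objects themselves
str(objs)
```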
Since an .RData file can hold any collection of R objects, it is easy to save the entirety of your workspace to a file named simply .RData, which applications like RStudio can do automatically when you quit:
save.image()
To reload this file (which is just a hidden file named .RData in your working directory), simply write:
load(".RData")
Finally, to erase everything in your workspace, you can write:
rm(list=ls())
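Often only some objects need to go. rm() accepts individual names, and ls() takes a pattern argument for selecting names by regular expression. The sketch below demonstrates this inside a separate environment (ws, an illustrative name) so that running it does not erase anything in your real workspace.

```r
# A throwaway environment with three objects
ws <- new.env()
ws$keep <- 1
ws$tmp1 <- 2
ws$tmp2 <- 3

# Remove only the objects whose names start with "tmp"
rm(list = ls(envir = ws, pattern = "^tmp"), envir = ws)

remaining <- ls(envir = ws)
remaining
```

In the global workspace the same idea would be written without the envir arguments, for example rm(list = ls(pattern = "^tmp")).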