Computational Statistics 1.3: Data Sets

Working with Data and Data Frames
Data Frames
- Columns
- Subsets
- Mixed types
- Editing
Importing and Exporting Data
Lists

Working with Data and Data Frames

Overview

In this lesson we examine how data is imported, stored, and exported with R.

Objectives

After completing this lesson, students should be able to:

Create and manipulate data frames.
Import basic datasets such as csv files.
Export datasets to csv format.

Readings

Lander, Chapters 5-6.

Data Frames

Data frames are like matrices in that they store data in a rectangular format. Unlike matrices, though, each column can hold a different type of data (numbers, characters, dates, etc.), which makes data frames the default way to store simple datasets in R.

Data frames can be created much like matrices: the data.frame() function binds vectors of the same length together as columns, as cbind() does, to form a data frame:

df1 <- data.frame(1:4,rep(3,4),c(2,4,1,5))
df1

  X1.4 rep.3..4. c.2..4..1..5.
1    1         3             2
2    2         3             4
3    3         3             1
4    4         3             5

As you can see, data.frame() also generates column and row names automatically. We can get the column names with:

colnames(df1)

[1] "X1.4"          "rep.3..4."     "c.2..4..1..5."

Columns

To change the column names, we assign a vector of strings to colnames():

colnames(df1) <- c("c1","c2","c3")
df1
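
Column names can also be supplied when the data frame is created, which avoids the renaming step. The sketch below is illustrative only; df_named is a hypothetical name chosen so that df1 is left untouched.

```r
# Argument names in data.frame() become the column names.
# df_named is a hypothetical name used only for this example.
df_named <- data.frame(c1 = 1:4, c2 = rep(3, 4), c3 = c(2, 4, 1, 5))

colnames(df_named)   # the names given above
dim(df_named)        # number of rows, then number of columns
```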

To work with one of the columns, we use the $ operator to pick it out of the data frame:

ourcol <- df1$c1
ourcol

[1] 1 2 3 4

ourcol is now a new vector with values taken from the first column of the data frame.

Subsets

We can also pull out subsets of the data frame just as we did with matrices:

df1[2:3,2:3]

  c2 c3
2  3  4
3  3  1

or we can pull out the same using column names:

df1[2:3,c("c2","c3")]

  c2 c3
2  3  4
3  3  1

Question: what is df1[c(1,3),"c3"]? Answer: 2 1.
Question: what is df1$c3[c(1,3)]? Answer: also 2 1.
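
Two related points about subsetting can be sketched as follows: selecting a single column returns a plain vector unless drop = FALSE is given, and rows can be selected with a logical condition as well as with numbers. The data frame df_sub and the other names are hypothetical stand-ins that mirror df1.

```r
# df_sub is a hypothetical copy of the example data, so the block runs on its own.
df_sub <- data.frame(c1 = 1:4, c2 = rep(3, 4), c3 = c(2, 4, 1, 5))

# Selecting a single column returns a plain vector by default...
v <- df_sub[c(1, 3), "c3"]
# ...while drop = FALSE keeps the result as a one-column data frame.
d <- df_sub[c(1, 3), "c3", drop = FALSE]

# Rows can also be chosen with a logical condition on a column.
big <- df_sub[df_sub$c3 > 1, ]

v
d
big
```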

Mixed types

Most importantly, data frames allow us to combine vector types:

df1 <- cbind(df1,c("truck","car","lettuce","porkchop"))
colnames(df1)[4] <- "things"
df1

  c1 c2 c3   things
1  1  3  2    truck
2  2  3  4      car
3  3  3  1  lettuce
4  4  3  5 porkchop

To see the internal structure of our data frame, including what types the variables in it are, we can use the str function:

str(df1)

'data.frame':   4 obs. of  4 variables:
 $ c1    : int  1 2 3 4
 $ c2    : num  3 3 3 3
 $ c3    : num  2 4 1 5
 $ things: Factor w/ 4 levels "car","lettuce",..: 4 1 2 3

Editing

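Using str we saw that our final column has been stored as a factor rather than as a character vector. (This is the behavior of R versions before 4.0.0; from 4.0.0 on, strings are kept as characters by default, and the conversion below leaves the column unchanged.) To change it, we convert the column with as.character() and assign the result back: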

df1$things <- as.character(df1$things)
str(df1)

'data.frame':   4 obs. of  4 variables:
 $ c1    : int  1 2 3 4
 $ c2    : num  3 3 3 3
 $ c3    : num  2 4 1 5
 $ things: chr  "truck" "car" "lettuce" "porkchop"

RStudio also allows you to view certain rectangular datasets in a graphical manner that might be more familiar from other statistics programs:

View(df1)

You can even directly edit some data objects using

edit(df1)

But this is strongly discouraged, since it leaves no record of what you have done. It is much better to use textual commands to assign new values to your data.
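
As a minimal sketch of what editing with textual commands can look like, the lines below change a single cell, recode matching entries, and add a derived column. The data frame df_edit and the replacement values are made up for illustration.

```r
# df_edit is a hypothetical data frame used only for this example.
df_edit <- data.frame(c1 = 1:4,
                      things = c("truck", "car", "lettuce", "porkchop"),
                      stringsAsFactors = FALSE)

# Overwrite a single cell, addressed by row number and column name.
df_edit[2, "c1"] <- 20L

# Overwrite every entry that matches a condition.
df_edit$things[df_edit$things == "lettuce"] <- "cabbage"

# Add a new column computed from an existing one.
df_edit$c1_doubled <- df_edit$c1 * 2

df_edit
```

Because each change is a line of code, the script itself is the record of what was done to the data.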

Importing and Exporting Data

If we want to save our data frame to a text file, we can export it using the write.table function:

write.table(df1,file="testdata.csv",row.names=FALSE,sep=",")

Here the first entry is the data frame name and the second entry is the name of the text file, which can include directory information, such as “/Users/nick/Desktop/testdata.csv”; a file without directory information is by default saved to your working directory.

The third option specifies that we don’t want a separate column for row names (which by default just run from 1 to N), and the fourth option sets the delimiter placed between entries in the text file. A comma is standard (hence the name “comma-separated values” and the “.csv” suffix), but you may want something else if, for instance, you have a text column with commas in it.
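
To illustrate the choice of delimiter, here is a small sketch that writes a text column containing commas as tab-separated values and then looks at the raw file. The data, the names df_out and tab_path, and the use of tempfile() and quote=FALSE are assumptions made for the example, not part of the lesson's data.

```r
# Hypothetical data with a comma inside a text entry.
df_out <- data.frame(id = 1:2,
                     note = c("red, ripe", "green"),
                     stringsAsFactors = FALSE)

# tempfile() gives a throwaway path, so nothing lands in the working directory.
tab_path <- tempfile(fileext = ".tsv")

# sep = "\t" writes tab-separated values; quote = FALSE leaves strings unquoted.
write.table(df_out, file = tab_path, row.names = FALSE, sep = "\t", quote = FALSE)

# Inspect the raw text that was written, one string per line of the file.
raw_lines <- readLines(tab_path)
raw_lines
```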

Reading in a csv file

To read a text file with data in rectangular format into R, we do something similar:

newdf <- read.table(file="testdata.csv",header=TRUE,sep=",",stringsAsFactors=FALSE)
newdf

  c1 c2 c3   things
1  1  3  2    truck
2  2  3  4      car
3  3  3  1  lettuce
4  4  3  5 porkchop

The header option tells R that the first row holds the column names, and stringsAsFactors=FALSE tells R to treat text columns as text rather than as factors. (Converting text to factors was the default before R 4.0.0; since then the default is FALSE.)
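
For comma-separated files specifically, base R also provides write.csv() and read.csv(), which are wrappers around write.table() and read.table() with the comma delimiter (and, for reading, the header) already set. A minimal round trip, using a hypothetical data frame and a temporary file, might look like this:

```r
# df_round and csv_path are hypothetical names used only for this example.
df_round <- data.frame(c1 = 1:4,
                       things = c("truck", "car", "lettuce", "porkchop"),
                       stringsAsFactors = FALSE)
csv_path <- tempfile(fileext = ".csv")

# write.csv() is write.table() with sep = "," preset.
write.csv(df_round, file = csv_path, row.names = FALSE)

# read.csv() is read.table() with header = TRUE and sep = "," preset.
df_back <- read.csv(csv_path, stringsAsFactors = FALSE)

str(df_back)
```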

Reading data from the internet

To read in a file from the internet, the file name can simply be a URL.

landerfile <- read.table(file="http://www.jaredlander.com/data/Tomato%20First.csv",
                         header=TRUE,sep=",",stringsAsFactors=FALSE)

To quickly check it, you can examine the first six lines using head:

head(landerfile)

  Round             Tomato Price      Source Sweet Acid Color Texture
1     1         Simpson SM  3.99 Whole Foods   2.8  2.8   3.7     3.4
2     1  Tuttorosso (blue)  2.99     Pioneer   3.3  2.8   3.4     3.0
3     1 Tuttorosso (green)  0.99     Pioneer   2.8  2.6   3.3     2.8
4     1     La Fede SM DOP  3.99   Shop Rite   2.6  2.8   3.0     2.3
5     2       Cento SM DOP  5.49  D Agostino   3.3  3.1   2.9     2.8
6     2      Cento Organic  4.99  D Agostino   3.2  2.9   2.9     3.1
  Overall Avg.of.Totals Total.of.Avg
1     3.4          16.1         16.1
2     2.9          15.3         15.3
3     2.9          14.3         14.3
4     2.8          13.4         13.4
5     3.1          14.4         15.2
6     2.9          15.5         15.1

Exercise

Google around and find a csv file somewhere online. Read it into R, and check it out with the head() and str() functions. Which variables look like they might have the wrong data type and would require recoding?

Some datasets also have an unwieldy number of columns to display.

Question: how would you examine just the first 5 rows and first 5 columns? Answer: landerfile[1:5,1:5] does the job.
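
The same kind of quick inspection can be tried without an internet connection. The sketch below uses mtcars, a dataset that ships with R, as a stand-in for any wide data frame; it is not part of the lesson's data.

```r
# mtcars ships with R and stands in here for any wide data frame.
dim(mtcars)                  # number of rows, then number of columns

corner <- mtcars[1:5, 1:5]   # first 5 rows, first 5 columns
corner

head(mtcars, n = 3)          # head() shows 6 rows unless n is given
```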

Other data structures

R can also read many non-text data formats, including those from SPSS, Stata, SAS, and others. To do so you need to install and load the foreign package, and then use (for instance) read.dta() for Stata files rather than read.table().
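
As a rough sketch of how this works, the lines below write a small data frame in Stata format with write.dta() and read it back with read.dta(). The data and the names df_stata and dta_path are invented for the example, and the code assumes the foreign package is available (it is skipped otherwise).

```r
# foreign is a recommended package and usually ships with R; if it is
# missing, install.packages("foreign") would be needed first.
if (requireNamespace("foreign", quietly = TRUE)) {
  library(foreign)

  # Hypothetical data and a throwaway file path.
  df_stata <- data.frame(id = 1:3, score = c(2.5, 4, 1))
  dta_path <- tempfile(fileext = ".dta")

  write.dta(df_stata, file = dta_path)  # export in Stata format
  df_from_stata <- read.dta(dta_path)   # import it again

  str(df_from_stata)
}
```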

Also of great use is saving to R’s native .RData format, which preserves variable types and is much more efficient than text files for reading and saving large datasets.

save(newdf,file="df1.RData")
rm(newdf)
load(file="df1.RData")
newdf

  c1 c2 c3   things
1  1  3  2    truck
2  2  3  4      car
3  3  3  1  lettuce
4  4  3  5 porkchop

Note that the .RData file stores the full R object, including its name: load() restores newdf under its original name, whatever the file itself is called.
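
The point about names can be checked directly. In the sketch below, load() reports which objects it restored. For contrast, saveRDS() and readRDS() are shown as an alternative that stores a single object without its name. All object and file names here are hypothetical.

```r
# scores and the file paths are hypothetical, used only for this example.
scores <- c(2, 4, 1, 5)
rdata_path <- tempfile(fileext = ".RData")

save(scores, file = rdata_path)
rm(scores)

# load() recreates the object under its saved name and (invisibly)
# returns the names of everything it restored.
restored <- load(rdata_path)
restored
scores

# saveRDS()/readRDS() store a single object without its name,
# so the caller chooses the name on reading.
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, file = rds_path)
scores_copy <- readRDS(rds_path)
scores_copy
```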

Lists

Lists are the most general-purpose data containers in R that we will be using. Lists can hold collections of almost any type of R object: variables, vectors, data frames, other lists, etc.

list1 <- list(1:5,newdf,3)
list1

[[1]]
[1] 1 2 3 4 5

[[2]]
  c1 c2 c3   things
1  1  3  2    truck
2  2  3  4      car
3  3  3  1  lettuce
4  4  3  5 porkchop

[[3]]
[1] 3

List structure

Note that the objects in a list are indexed with the double bracket [[ ]] and can be picked out that way:

list1[[1]][3]

[1] 3

In the above example, the [[1]] picks out the first object in the list (the vector 1:5) and the [3] picks out the third element in that vector (3).

list1[[2]][2:4,3]

[1] 4 1 5

Here the [[2]] picks out the second object in the list (the data frame) and the [2:4,3] picks out the second through fourth items in its third column.
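
The difference between double and single brackets is worth seeing once: [[ ]] returns the object itself, while [ ] returns a shorter list that contains it. The list list_demo below is a hypothetical example.

```r
# list_demo is a hypothetical list used only for this example.
list_demo <- list(1:5, c("a", "b"), 3)

inner <- list_demo[[1]]  # the element itself: an integer vector
outer <- list_demo[1]    # a list of length 1 that contains that vector

class(inner)
class(outer)

# Single brackets can select several elements at once.
length(list_demo[c(1, 3)])
```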

List naming

We haven’t given names to the three objects in list1, so currently they can only be picked out with numbers in [[ ]], just as a data frame without column names can only be identified with column numbers. The way to name objects in a list is much like that for a data frame:

names(list1) <- c("onetofive","df1","three")
list1$onetofive[3]

[1] 3

Note that str() shows you the internal structures of lists just as it does for data frames or any other object.

str(list1)

List of 3
 $ onetofive: int [1:5] 1 2 3 4 5
 $ df1      :'data.frame':  4 obs. of  4 variables:
  ..$ c1    : int [1:4] 1 2 3 4
  ..$ c2    : int [1:4] 3 3 3 3
  ..$ c3    : int [1:4] 2 4 1 5
  ..$ things: chr [1:4] "truck" "car" "lettuce" "porkchop"
 $ three    : num 3
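
Names can also be given when the list is created, and a named element can be reached either with $ or with [[ ]] and a string. A short sketch, using the hypothetical list list_named:

```r
# list_named is a hypothetical list used only for this example.
list_named <- list(onetofive = 1:5, three = 3)

names(list_named)
list_named$onetofive[3]       # access by name with $
list_named[["onetofive"]][3]  # the same element using [[ ]] and a string

# Assigning to a new name appends an element.
list_named$label <- "demo"
length(list_named)
```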

R’s internal data

Many R objects, including the outputs of many functions, are lists. The objects in your workspace are held in a similar list-like container (the global environment), and their names can be seen with

ls()

[1] "df1"        "landerfile" "list1"      "newdf"      "ourcol"

This is also what the “Environment” pane in RStudio shows.
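
As one illustration of a function whose output is a list, the sketch below fits a simple regression with lm() on made-up numbers and inspects the result with the same tools used for lists above. The data values are invented for the example.

```r
# Made-up data for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)   # fit a simple linear regression

is.list(fit)       # TRUE: the fitted model is stored as a list
names(fit)         # the components it holds
fit$coefficients   # pick one component out with $
```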

Since an .RData file can hold any number of objects, it is easy to save the entirety of your workspace as a (nameless) .RData file, which applications like RStudio do automatically:

save.image()

To reload this file (which is just a hidden file named .RData in your working directory) simply write:

load(".RData")

Finally, to erase everything in your workspace, you can write:

rm(list=ls())