Overview

In this lesson, you will learn how to read in various types of data and clean up data you have found.

Objectives

  1. Read and write CSV files, as well as data files from other software packages.
  2. Clean and recode data.

Readings

Lander, Ch. 6

1 Importing and Exporting Data

Recall our simple data frame from earlier:

load(file="df1.RData",verbose=TRUE)
## Loading objects:
##   df1
df1
##   c1 c2 c3   things
## 1  1  3  2    truck
## 2  2  3  4      car
## 3  3  3  1  lettuce
## 4  4  3  5 porkchop

1.1 Writing to a csv file

If we want to save our data frame as a plain text file, we can export it using the write.table function:

write.table(df1,file="testdata.csv",row.names=FALSE,sep=",")

Here the first argument is the data frame and the second is the name of the text file, which can include directory information, such as “/Users/nick/Desktop/testdata.csv”; a file without directory information is by default saved to your working directory.

The third option specifies that we don’t want a separate column for row names (which by default is just a column from 1 to N), and the fourth option sets the delimiter to put between each entry in the text file. A comma is standard (hence the name “comma-separated values” and the “.csv” suffix), but you may want something else if, for instance, you have a text column with commas in it.
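
For example, to use a tab delimiter instead (a common choice when text fields themselves contain commas), you might write something like this; the .tsv filename here is just illustrative:

write.table(df1, file="testdata.tsv", row.names=FALSE, sep="\t")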

1.2 Reading in a csv file

To read a text file with data in rectangular format into R, we do something similar:

newdf <- read.table(file="testdata.csv",header=TRUE,sep=",",stringsAsFactors=FALSE)
newdf
##   c1 c2 c3   things
## 1  1  3  2    truck
## 2  2  3  4      car
## 3  3  3  1  lettuce
## 4  4  3  5 porkchop

The header option tells R that the first row is the column names, and the stringsAsFactors option tells R to treat columns with text as text and not as factors (the default).
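
After reading a file in, it is worth confirming that each column came in with the type you expect; base R’s str() gives a compact, one-line-per-column summary:

str(newdf)  # each column's name, type, and first few values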

1.3 Reading data from the internet

To read in a file from the internet, the file name can simply be a URL.

landerfile <- read.table(file="http://www.jaredlander.com/data/Tomato%20First.csv",
                         header=TRUE,sep=",",stringsAsFactors=FALSE)

To quickly check it, you can examine the first six rows (head’s default) using head:

head(landerfile)
##   Round             Tomato Price      Source Sweet Acid Color Texture Overall
## 1     1         Simpson SM  3.99 Whole Foods   2.8  2.8   3.7     3.4     3.4
## 2     1  Tuttorosso (blue)  2.99     Pioneer   3.3  2.8   3.4     3.0     2.9
## 3     1 Tuttorosso (green)  0.99     Pioneer   2.8  2.6   3.3     2.8     2.9
## 4     1     La Fede SM DOP  3.99   Shop Rite   2.6  2.8   3.0     2.3     2.8
## 5     2       Cento SM DOP  5.49  D Agostino   3.3  3.1   2.9     2.8     3.1
## 6     2      Cento Organic  4.99  D Agostino   3.2  2.9   2.9     3.1     2.9
##   Avg.of.Totals Total.of.Avg
## 1          16.1         16.1
## 2          15.3         15.3
## 3          14.3         14.3
## 4          13.4         13.4
## 5          14.4         15.2
## 6          15.5         15.1

Some datasets also have an unwieldy number of columns to display.

How would you examine just the first 5 rows and first 5 columns?
landerfile[1:5,1:5] does the job.
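
Before printing anything, dim() and names() give a quick sense of a dataset’s size and columns:

dim(landerfile)    # number of rows and columns
names(landerfile)  # column names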

1.4 Reading other data formats

Sometimes you encounter data in other formats. Some of the most popular statistical analysis platforms are SPSS (.sav), Stata (.dta), and SAS (.sas7bdat), each of which has its own proprietary file format, just as R has .RData. There are a number of packages designed to read these file types, including the foreign package, but perhaps the most up-to-date is haven, which is part of an entire family of packages called the “Tidyverse”, originally created by Hadley Wickham. If you install the tidyverse package you get a vast array of packages, including haven, and newer, more efficient methods for achieving many of the goals of data analysis.

So for instance, if you want to read a Stata .dta file you found, you would run something like:

library(haven) # installed with the tidyverse, but must be loaded explicitly
mystatafile <- read_dta("mtcars.dta")

One thing to bear in mind is that the Tidyverse has its own preferred data format, known as the tibble, a variant of the data frame. You can convert a tibble into a data frame using as.data.frame(), but tibbles are useful because they are more efficient with large datasets and often provide better ways of inspecting and manipulating data. For the most part, they work the same with all the functions we have already explored, and we will spend a bit more time discussing the Tidyverse world in subsequent lessons.
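
For instance, to move between the two representations (a sketch, continuing with mystatafile from above):

mydf <- as.data.frame(mystatafile) # tibble to plain data frame
mytbl <- tibble::as_tibble(mydf)   # and back to a tibble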

One other common data format one encounters is Excel. Excel files can often be opened in Excel and saved as a csv, but it is often easier (and preserves data types better) to just open an .xls or .xlsx file directly in R. Once again, the best package for doing this, readxl, is part of the Tidyverse, and once it is loaded, you can read Excel files with:

library(readxl) # also installed with the tidyverse, but loaded separately
myexcelfile <- read_excel("something.xlsx")
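
If the workbook has multiple sheets, read_excel can also select one by name or position (the sheet name here is hypothetical):

myexcelfile2 <- read_excel("something.xlsx", sheet = "Sheet2") # or sheet = 2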

You can learn more about these functions and many others at www.tidyverse.org.

2 Cleaning data

Most often, when using someone else’s data, the dataset you obtain is complex, ill-described, and full of missing and possibly even mis-coded data. If you are lucky, there is a decent “codebook” which describes the variables in the dataset and how they are coded, but even then, many details about how exactly the data were constructed are omitted. This is to some degree inevitable; the best you can do is inspect your dataset carefully and work through each of the variables you wish to examine, making each as clean as you can.

There are many packages for cleaning data, including a number in the Tidyverse, and the issues one may encounter are almost infinite and always idiosyncratic to the dataset at hand. So let’s take a close look at a single example that illustrates a number of common issues, sticking with base R functions for now.

The American National Election Studies (ANES) survey is a large, biennial political survey that asks a representative sample of Americans a large set of questions about their backgrounds and political views. It can be downloaded from the ANES website once you register, in a number of formats, along with extensive codebooks describing the data collection and coding procedures. Even though a huge amount of work has gone into preparing the dataset, it nevertheless requires a good deal of work by the researcher to prepare it for their own analysis.

2.1 Read dataset

In this case, let’s begin with perhaps the crudest version of the survey dataset, in raw text form. To even know how each cell in the data is separated requires reading the codebook or website, glancing at the raw text in a text editor, or looking at the suffix of the file, which is named ANES-Data.tsv, for “tab-separated values.” Once we know that, we can read the data in. It is not guaranteed that there is a header, so again, looking at the codebook or glancing at the data first in a text editor can be useful. It is essential to include header=TRUE if there is a header, though, since otherwise every column will be read as text to match the text of the first row (the column names). And as before, we use stringsAsFactors=FALSE just to make it easier to re-encode data as needed later.

anes_all <-  read.table("ANES-1948-2008/ANES-Data.tsv",sep="\t",header=TRUE,stringsAsFactors=FALSE)
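
Since a file this large cannot easily be eyeballed, a couple of quick checks help confirm it loaded correctly:

dim(anes_all)          # rows (respondents) and columns (variables)
head(names(anes_all))  # a few of the (cryptic) variable names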

2.2 Subset data

Now, in this example, I am interested in examining 2008 data only, and I know from the website that this dataset encompasses all surveys from 1948 to 2008. But sadly, the variable names in this dataset are extremely cryptic. So we have to consult the codebook even to know which variable encodes the year of the survey, which is for some reason called VCF0004. So we subset to that year (and again, there are faster ways to do this using the Tidyverse, but let’s stick with the basics here):

anes_2008 <- anes_all[anes_all$VCF0004==2008,]
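
A quick sanity check that the subset contains only 2008 respondents:

table(anes_2008$VCF0004)  # should show a single year: 2008
nrow(anes_2008)           # number of 2008 respondents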

Now, as it happens, I’m interested in just some very basic analysis: how various demographic features relate to voting. For the basic demographics, I know what I’m looking for: age, race, gender, education, income; and I also want party ID, whether they voted in 2008, and who they voted for. So I laboriously look through the codebook, which in this case is 388 pages, noting that some of these demographic features are represented in multiple variables, eg listing party ID by four groups (Democrat, Republican, Independent, Other) or via a spectrum (eg, Strong Republican, etc). So I pick out the ones I want, and create a new dataset with just those. Note how we create variable names that include not just the general name (eg “gender”), but also a hint of which way the variable is encoded when that is ambiguous (eg, “gender_male” indicates that 0 is non-male and 1 is male).

anes_2008tr <- data.frame(age=anes_2008$VCF0101,
                          race_white=anes_2008$VCF0106,
                          gender_male=anes_2008$VCF0104,
                          education=anes_2008$VCF0140a,
                          income=anes_2008$VCF0114,
                          ideology_con=anes_2008$VCF0803,
                          partyid_rep=anes_2008$VCF0301,
                          vote_rep=anes_2008$VCF0704a,
                          voted=anes_2008$VCF0702)
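
Before recoding, summary() on the trimmed dataset shows the raw numeric codes, including values (such as 0 or 9) that the codebook maps to DK or NA:

summary(anes_2008tr)  # ranges reveal codes that need recoding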

2.3 Recode data

Finally comes the hard part. For each of these variables, this particular codebook does a good job of describing what each number means, but for each one, a different number might mean NA (“not available”) or DK (“don’t know”), so we have to go through them one at a time. In most cases, one wants to preserve NA values for subsequent analysis, because missing data is meaningful and should never just be ignored or papered over. But in this case, I’d like to recode (“impute”) missing data elements to the median, mean, or mode of each variable. If it’s a binary or categorical variable, I want to use the mode; if it’s a continuous variable, I’ll use the median, which is often more robust than the mean, eg a few outliers won’t bias it. It’s also safer because sometimes you have a variable that is just integer values that you mistakenly believe to be continuous; if you use the median, your imputed value for the missing element will usually also be an integer (one caveat: R’s default median() averages the two middle values when there is an even number of elements, so the result may not be an integer; quantile(x, 0.5, type = 1) always returns an observed value if you need one).
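
Base R provides median() but no built-in mode for data values, so a minimal sketch of both imputation rules might look like this (stat_mode and impute_na are our own helpers, not base R functions):

# most frequent value, ignoring NAs; ties broken by first appearance
stat_mode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# fill NAs with the median (continuous) or the mode (categorical)
impute_na <- function(x, categorical = FALSE) {
  fill <- if (categorical) stat_mode(x) else median(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x  # eg: anes_2008tr$age <- impute_na(anes_2008tr$age)
}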

There are nice packages for imputing, such as imputeMissings, whose impute() function does a good job of automatically using the median or mode as needed. But for this small dataset, we do it ourselves to make sure it’s done right. For each variable, we have to think how it should be encoded for subsequent analysis. For instance, race is a categorical variable with many different possibilities. Do we want a separate “dummy” (0/1) variable for each possibility? In this case, we will just create a single variable for whether or not the respondent is in the modal category, White. Education and income are both coded as sets of integer levels that require the codebook to interpret. How should we encode the DK or NA values here? We choose the median, which is probably wrong (non-response is more likely to come from respondents with less than average education, for instance), but such non-response is rare and unlikely to bias our data too much. And so forth. Each line below recodes one variable, each via debatable rules. But the advantage, at least, of doing it this way is that one gets to know the dataset very well before moving on to analysis.
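
For reference, the package-based route is essentially a one-liner (a sketch, assuming the imputeMissings package is installed; its impute() defaults to the median for numeric columns and the mode for factors):

library(imputeMissings)
anes_auto <- impute(anes_2008tr)  # fills NA values with median or mode

Note, though, that impute() only fills true NA values, whereas in this dataset the missing data are encoded as numeric codes such as 0 or 9, so we recode by hand: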

anes_2008tr$race_white[anes_2008tr$race_white != 1] <- 0 # non-White to 0 (White=1)
anes_2008tr$gender_male[anes_2008tr$gender_male == 2] <- 0 # female (2) to 0 (male=1)
anes_2008tr$education[anes_2008tr$education > 7] <- 3 # DK/NA to middle
anes_2008tr$income[anes_2008tr$income == 0] <- 3 # DK/NA to middle
anes_2008tr$ideology_con[anes_2008tr$ideology_con == 9] <- 4 # DK to middle
anes_2008tr$ideology_con[anes_2008tr$ideology_con == 0] <- 4 # NA to middle
anes_2008tr$partyid_rep[anes_2008tr$partyid_rep == 0] <- 4 # DK/NA to middle
anes_2008tr$vote_rep <- anes_2008tr$vote_rep - 1 # make binary
anes_2008tr$vote_rep[anes_2008tr$vote_rep == -1] <- NA # DK/NA to NA
anes_2008tr$voted <- anes_2008tr$voted - 1 # make binary
anes_2008tr$voted[anes_2008tr$voted == -1] <- 0 # DK/NA to no

In this particular dataset, I don’t preserve any categorical variables with more than 2 categories, but one often does have a variable (such as religion) that should not be collapsed down to two categories. In that case, one would use as.factor() to recode the variable from a string back into a factor. One might then want to recode or rename the factors in various ways (eg, changing a “1” to “Catholic”, and so on), which can be complex. A good quick introduction to how to do this is in the Cookbook for R.
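
A minimal sketch of that pattern, using a hypothetical numeric religion code:

religion_code <- c(1, 2, 1, 3, 2)  # hypothetical raw codes
religion <- factor(religion_code, levels = c(1, 2, 3),
                   labels = c("Catholic", "Protestant", "Other"))
table(religion)  # counts per category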

2.4 Examine for outliers

Now that our data is partially recoded, it is always worth examining it again with summary(). It is also worth examining the data visually, using histograms of each variable, to see whether there are any “outliers”, or data elements that seem much larger or smaller than the rest of the elements in that variable. If so, it is always a judgement call about what to do. If it looks like something miscoded, or due to an entirely different mechanism (eg, a dataset of dogs with one element in the “height” variable being 5 feet, which is either a miscode, or a Great Dane that really is very unusual for a dog), one might want to drop it (set it to NA or drop that row) or recode it in some other way. That is a substantive question though, and depends entirely on the meaning of the data.
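
For instance (a sketch; the actual distributions depend on the data):

summary(anes_2008tr)                                  # recoded ranges at a glance
hist(anes_2008tr$age, main = "Age", xlab = "Age")     # look for implausible values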

2.5 Save

Finally, we want to save our data before using it in later analysis. In this case I’ll save it in a general text format instead of RData, so it can be used by others more easily if I want to share it. We will in fact see this dataset recur later.

write.table(anes_2008tr,"anes_2008tr.csv",sep=",",row.names=FALSE)