In this lesson we learn how functions and packages power R.
After completing this lesson, students should be able to:
Lander, Chapters 3, 8
We’ve now covered how to get data into and out of R – but what do we do with it once it’s in R? Having done the Data part of Data Analysis (or at least, some of it), it’s now time for some Analysis.
Before we get to the more complex analysis of statistics and machine learning, we need to understand the basics of how to manipulate, analyze, and visualize data using R. We will return to all of these topics repeatedly over the rest of the course, but this module gives a quick overview of how we do data analysis in R.
The most fundamental tool for data analysis in R is the function:
v <- c(2,4,1,5)
mv <- mean(v)
mv
[1] 3
mean() is a function that takes as its input a vector, and outputs the mean (a scalar).
R comes with a large set of functions pre-installed, although the naming of them is sufficiently idiosyncratic that you are unlikely to guess the name of the function (except for something as simple as mean) on your own; much better to look it up online.
To get a list of the functions in R’s base install, you can write
library(help = "base")
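The same list can also be pulled into R as a character vector, which makes it searchable. A minimal sketch, assuming only base R; the variable name base_functions is just for illustration:

```r
# names of all objects (mostly functions) in the base package
base_functions <- ls("package:base")

# how many there are, and the first few names
length(base_functions)
head(base_functions)

# check whether a particular name is among them
"mean" %in% base_functions
```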
To get help on a given function (eg, mean), write
help(mean)
# or
?mean
In RStudio that should automatically open the Help pane with the documentation for the given function. In the help file, the Arguments section lists the function’s inputs, and the Value section describes its output. In addition to the data taken as input, there are usually options for how the function operates. The Usage portion of the Help shows both how the function is used, and what the default inputs are for the various options.
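The argument names and defaults shown under Usage can also be inspected from the console. A small sketch using base R; note that mean() is a generic function, so its options and defaults live in its default method, mean.default:

```r
# the generic itself only shows x and ...
args(mean)

# the default method shows the options and their default values
args(mean.default)

# pull out a single default value
default_na_rm <- formals(mean.default)$na.rm
default_na_rm
```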
For instance, we can calculate the mean of a vector that has missing data (NA), but what do we do about those NA values? Does the mean exist or not? It depends on what you want:
v2 <- c(v,NA)
v2
[1] 2 4 1 5 NA
mean(v2)
[1] NA
mean(v2,na.rm=TRUE)
[1] 3
In this case, the default option is na.rm=FALSE: the NA values are not removed before calculating the mean, so the mean is NA if there are any NA values. We can instead set na.rm=TRUE to remove those values first, giving the previous answer (3). Many functions have options like these; it’s always good to consult the help first. na.rm in particular is common to many functions.
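To illustrate how widely na.rm is shared, here is the same vector passed to a few other summary functions. This is only a sketch; the variable names are made up for the example:

```r
v2 <- c(2, 4, 1, 5, NA)

# without na.rm, a single NA makes each result NA
sum(v2)
max(v2)

# with na.rm = TRUE, the NA is dropped first
total <- sum(v2, na.rm = TRUE)     # 12
largest <- max(v2, na.rm = TRUE)   # 5
spread <- sd(v2, na.rm = TRUE)

# count the missing values directly
n_missing <- sum(is.na(v2))        # 1
```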
It should be said that R is not especially helpful with its help files or especially clear with its naming conventions. It can often be an exercise in frustration to try to figure out what the function name for something should be (eg, calculating the standard deviation), how its options work, and why it’s not working for you – despite the fact that you are surely doing everything correctly!
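R does have a couple of built-in search tools that can help when you cannot guess a function’s name. A brief sketch; what the searches return depends on which packages you have installed and loaded:

```r
# apropos() lists objects whose names match a pattern
mean_like <- apropos("mean")
mean_like

# in an interactive session, ?? searches the installed help pages:
# ??"standard deviation"

# the standard deviation function turns out to be sd()
v <- c(2, 4, 1, 5)
sd(v)
```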
So in addition to the materials here and in our textbooks, don’t hesitate to Google. Google results, online R tutorials, and stackoverflow.com (among others) can be invaluable for quickly answering an annoying question. Sometimes you will encounter an unhelpful answer from a testy statistician; the best strategy is not to waste too much time on it but to move quickly on to a better answer. There are plenty of better answers out there.
One additional wrinkle is that searching for help for “R standard deviation” (for instance) can run into trouble because “R” is a very unhelpful name for a piece of software. Google has gotten better at knowing you mean the statistics software, but sometimes adding “statistics” to your search can help.
At the end of this module are a few links to helpful resources, but again, Google is often a good bet. But don’t underestimate the value of a well-written textbook! This is true especially if you have a basic problem or don’t know how to do something.
The reason R has become one of the dominant statistical software tools, though, is not due to the built-in functions – which are mainly similar to those in many other statistics programs – but because R is open-source and benefits from thousands of user-contributed functions across every domain of statistics and, increasingly, machine learning and computer science more generally.
These user contributions, called “packages,” are not stored in the base installation of R, but have to be installed from internet repositories. Luckily this process has largely been standardized and centralized. Finding which package does what you want can take some effort (there are often several that overlap in the functions they perform, often with differing function names!), but installing it once you’ve found it is relatively easy.
There are two important steps to keep in mind. First, installed packages are put in your Library. To see all your installed packages (including many that come with the basic R installation), in RStudio go to the Packages pane. Most packages, though, upon being installed onto your computer, are not automatically loaded into R. You only have to install a package once, but for every session using R, you need to load that package so it can be used. (This is to prevent R from being bogged down for those of us with hundreds of packages and functions installed.)
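The difference between installed and loaded can also be checked from the console rather than the Packages pane. A minimal sketch using base R functions; the variable names are only for illustration:

```r
# every package installed in your library (names only)
installed <- rownames(installed.packages())
head(installed)

# packages currently attached in this session
attached <- search()
attached

# the two questions are separate: is it installed? is it loaded?
"stats" %in% installed
"package:stats" %in% attached
```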
To load an already installed package in RStudio, just check the box next to the package name in the Packages pane. To install a new package, click on “Install” in the upper-left corner of the Packages pane; this of course requires knowing the package name.
One additional wrinkle, as you can see in the installation window, is that many packages come with “dependencies” – other packages or functions that they use to get their job done. Luckily, most packages specify their dependencies, and by default these are also installed.
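A package’s declared dependencies are recorded in its DESCRIPTION file, which you can read from within R. A quick sketch using the stats package, chosen here only because it ships with every R installation:

```r
# read the DESCRIPTION metadata of an installed package
desc <- packageDescription("stats")

# packages it depends on or imports (NULL if a field is absent)
desc$Depends
desc$Imports
```

When installing from the console, the dependencies argument of install.packages() controls which of these are installed along with the package.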
It is usually better to load packages using commands in your R script rather than the GUI, although it matters less for installing packages, which hopefully only happens once. The command for installing a package (here, the “ggplot2” package) is:
install.packages("ggplot2")
Again, you have to know the name of it to install it; this once again often involves Googling or other sources of external information.
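Because installation only needs to happen once, scripts often check whether a package is already present before installing it. A minimal sketch; install_if_missing is a hypothetical helper name, not a built-in function:

```r
# install a package only if it is not already in the library
install_if_missing <- function(pkg) {
  already_there <- requireNamespace(pkg, quietly = TRUE)
  if (!already_there) {
    install.packages(pkg)
  }
  # report (invisibly) whether the package was already installed
  invisible(already_there)
}

# "stats" ships with R, so nothing is installed here
was_installed <- install_if_missing("stats")
was_installed
```

Calling install_if_missing("ggplot2") would install ggplot2 the first time it is run and do nothing on later runs.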
To load a package in order to be able to use the functions in it, do:
library(ggplot2)
require() also works, and you will see that in many scripts.
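The two are not quite interchangeable: library() stops with an error if the package is missing, while require() returns FALSE with a warning and lets the script carry on. A short sketch of the difference, using a made-up package name to force a failure:

```r
# require() returns TRUE or FALSE instead of stopping with an error
# ("notARealPackage" is a made-up name, so this returns FALSE)
loaded <- suppressWarnings(require("notARealPackage", character.only = TRUE))
loaded

# library() with the same name would stop the script right away,
# which makes a missing package easier to notice

# a single function can also be used without loading its package
stats::median(c(2, 4, 1, 5))
```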
R is a very flexible and open programming language, so you can build your own functions and packages and share them with others easily. We won’t go into building full-scale packages here, but one of the most important skills in R is being able to create your own functions.
If you have done any programming elsewhere, creating a function in R will look familiar. Here is an example of a simple function:
# define the function named "doubleit"
doubleit <- function(x){
  doubled <- x*2
  return(doubled)
}
# now use the function with an input of 7 and save it as sevendoubled
sevendoubled <- doubleit(7)
sevendoubled
[1] 14
The first line of the definition specifies the name of the function (“doubleit”) and the arguments of the function (its inputs). “x” here is just an internal name used in the function; we could call it “a” or “thefirstargument” or whatever. The input (in this case 7) gets assigned temporarily to “x”, and then multiplied by 2 with the output stored temporarily in “doubled”. The return() line tells the function that the value of “doubled” is the output from the function.
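Your own functions can have options with default values, just like na.rm in mean(). A small sketch; scaleit and its argument names are made up for illustration:

```r
# multiply x by a factor; the factor defaults to 2
scaleit <- function(x, factor = 2){
  scaled <- x * factor
  return(scaled)
}

scaleit(7)               # uses the default: 14
scaleit(7, factor = 3)   # overrides the default: 21
scaleit(c(1, 2, 3))      # works element-wise on a vector: 2 4 6
```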
A function can have many inputs (a vector, or a set of scalars and vectors and lists, or anything at all) and can return a more complex output such as a list of R objects – just as we saw when examining the Help files for pre-existing functions.
For instance, here is a function that takes two numbers and calculates both their sum and difference:
sumdiff <- function(a,b){
  nsum <- a + b
  ndif <- a - b
  return(c(nsum,ndif))
}
sdoutput <- sumdiff(5,3)
sdoutput
[1] 8 2
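Returning an unnamed vector means you have to remember which position holds which result. One alternative is to return a named list, as mentioned above for more complex outputs. A sketch, with sumdiff_list as a hypothetical variant of sumdiff:

```r
sumdiff_list <- function(a, b){
  # name each element so the output is self-describing
  return(list(sum = a + b, difference = a - b))
}

sdlist <- sumdiff_list(5, 3)
sdlist$sum          # 8
sdlist$difference   # 2
```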