Overview

In this lesson we learn R programming tools such as “if”, “for” and “apply”.

Objectives

Create and use “if” conditionals.
Create and use “for” loops.
Apply “apply” for faster vector operations.

Readings

Lander, Chapters 9, 10, 11.1

1 Programming

The other important tools for writing functions and manipulating data are control statements and loops. These too will be familiar to those with some programming background.

For those with no programming experience, the key idea is that a program or script is executed by R one line at a time, often with generous use of brackets { } and parentheses to let the computer know where chunks start and end.

1.1 If

We have seen conditional statements implicitly before, but if allows us more control:

if(3 < 4){
  print("yup")
}

[1] "yup"

Again, like all scripts, this is executed by R like we read it – from the top down, one line at a time. If takes a truth condition as its input, and executes the stuff in the brackets {} if the input truth condition is TRUE.

Here’s another example:

if(3 == 4){
  print("yup")
}

Note the lack of output now.

1.2 Logical operators

Important logical operators include ==, <, >, <=, >=, as well as “and” &, “or” |, “not” ! and not equal to !=. For instance:

if((3 < 4) & (3 != 2)){
  print("yup")
}

[1] "yup"

Note the use of parentheses to make sure the logic is clear; you can sometimes get away with fewer parentheses, but it’s good form to be explicit.
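As a small sketch of how these operators behave on more than one value at a time (the vector x and the value y are made-up illustrations, not part of the lesson's data): & and | work element by element and return a vector of TRUE/FALSE values, while the doubled forms && and || expect a single TRUE/FALSE on each side, which is what an if test needs.

```r
# Hypothetical example values, chosen only for illustration
x <- c(1, 5, 3)

# & compares element by element and returns a logical vector
both <- (x < 4) & (x != 3)
print(both)
# [1]  TRUE FALSE FALSE

# && works on single TRUE/FALSE values, which suits an if() test
y <- 3
if ((y < 4) && (y != 2)) {
  print("yup")
}
# [1] "yup"
```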

1.3 Else

If can also do alternative actions when the input is false, via else. Here is an illustration in a function using if and else:

isitthree <- function(x){
  if(x == 3){
    return(TRUE)
  }else{
    return(FALSE)
  } 
}
isitthree(3)

[1] TRUE

You can also use a shorter function that combines them into ifelse:

isitless <- ifelse(3<4,1,0)
isitless

[1] 1

where the first argument is the test, the second is the output if the test is true, and the third is the output if the test is false.
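One reason ifelse is handy is that, unlike if, it tests every element of a vector in one go. Here is a minimal sketch, assuming a made-up vector called scores:

```r
# Hypothetical values, chosen only for illustration
scores <- c(1, 5, 3)

# The test is run on each element; 1 where it is TRUE, 0 where it is FALSE
flag <- ifelse(scores < 4, 1, 0)
flag
# [1] 1 0 1
```

This is often a quick way to build a new 0/1 variable from an existing column.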

2 Loops: For

R is notoriously slow for scripts that repeatedly loop through data, but often speed is not an issue, or you just need to write a loop to generate or manipulate your dataset.

The most common loop function is for:

for(i in 1:3){
  print(2*i)
}

[1] 2
[1] 4
[1] 6

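Here i is the loop variable: on each pass it takes the next value from the vector that comes after the “in”. Note that, unlike a variable inside a function, i is not local to the loop; it still exists after the loop ends, holding the last value it took. The vector after the “in” (eg, 1:3) can be any vector, including a column of data.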

for(j in c("frog","duck")){
  print(j)
}

[1] "frog"
[1] "duck"
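Often you want to save what the loop computes rather than just print it. One common pattern, sketched here with a made-up vector animals and a made-up result vector n_chars, is to create an empty result vector first and then loop over the positions 1, 2, ... using seq_along:

```r
# Hypothetical input vector, chosen only for illustration
animals <- c("frog", "duck", "heron")

# Create the result vector up front, one slot per input element
n_chars <- numeric(length(animals))

# seq_along(animals) is 1, 2, 3: the positions of the elements
for (k in seq_along(animals)) {
  n_chars[k] <- nchar(animals[k])   # number of letters in the k-th word
}
n_chars
# [1] 4 4 5
```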

2.1 Loops: While

While is another loop function, one that’s pretty self-explanatory:

i <- 1
while(i <= 2){
  print(3*i)
  i <- i + 1
}

[1] 3
[1] 6

Be careful with while loops though: if you set one up wrong, it can run forever because the truth condition never becomes FALSE. If possible, it is better to use a for loop, and be clever with your “in” vector if need be. One option is to use a break, which interrupts the loop if some condition is met partway:

for(i in 1:3){
  if(i == 2){
    break
  }
  print(i)
}

[1] 1
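The same idea can stand in for a while loop. In this sketch (the upper bound of 100 and the threshold of 10 are arbitrary assumptions for illustration), the for loop guarantees that the script stops eventually, and break ends it early once the running total passes the threshold:

```r
total <- 0
for (i in 1:100) {        # 100 is a hypothetical upper bound on the passes
  total <- total + i      # running sum 1 + 2 + 3 + ...
  if (total > 10) {
    break                 # stop as soon as the sum exceeds 10
  }
}
i
# [1] 5
total
# [1] 15
```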

3 Apply

Another way to step through and manipulate a vector or list of data is using R’s various apply functions, which like loops will apply any function iteratively to a set of data. Apply is often more compact than a for loop and is sometimes faster, though the speed gain is frequently small because apply itself loops internally. It can also be trickier to conceptualize how best to use it, and sometimes for loops are easier and more flexible.

Apply is best suited to matrices or data frames, and is generally used with functions that take as their input rows or columns. So instead of writing a for loop to step through each row, for instance, you can use apply to apply a function to all of those rows at once.

Thus the key option for apply is whether you want your function to operate on all the rows or all the columns of your data.

3.1 Apply to take the mean

Consider the following matrix:

m <- matrix(1:6,nrow=2,ncol=3)
m

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Say we want the mean of each column (eg, we might want the mean of each variable in a dataset). We could do this via:

apply(m,2,mean)

[1] 1.5 3.5 5.5

where m is the input data, 2 means we apply the function over columns (1 is rows), and mean is the function we are applying. We can use any function we want here, including those of our own creation.
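As a quick sketch of using a function of our own, here is a made-up helper called col_range (not a built-in R function) that returns the largest value minus the smallest, applied first over columns and then over rows of the same matrix:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

# Hypothetical helper: spread between the largest and smallest value
col_range <- function(v) {
  max(v) - min(v)
}

by_col <- apply(m, 2, col_range)   # 2 = one result per column
by_col
# [1] 1 1 1

by_row <- apply(m, 1, col_range)   # 1 = one result per row
by_row
# [1] 4 4
```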

Of course, R also has shortcuts for such common things as calculating row or column means:

colMeans(m)

[1] 1.5 3.5 5.5

3.2 Other ways to Apply

And R also has base functions for getting more detailed summaries of variables:

summary(m)

       V1             V2             V3      
 Min.   :1.00   Min.   :3.00   Min.   :5.00  
 1st Qu.:1.25   1st Qu.:3.25   1st Qu.:5.25  
 Median :1.50   Median :3.50   Median :5.50  
 Mean   :1.50   Mean   :3.50   Mean   :5.50  
 3rd Qu.:1.75   3rd Qu.:3.75   3rd Qu.:5.75  
 Max.   :2.00   Max.   :4.00   Max.   :6.00

Apply can also be applied to each element of the data individually, ie over both rows and columns with the setting 1:2 – though there are often better ways to do this. (Eg, for the following example, m==3 does the same thing.)

apply(m,1:2,isitthree)

      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,] FALSE FALSE FALSE

There are many other apply functions designed for different input data.
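Two of the most common are sapply and lapply, which step through the elements of a vector or list rather than the rows or columns of a matrix. A minimal sketch, where is_three is a made-up stand-in for the isitthree function above and values and lst are made-up data:

```r
# Hypothetical stand-in for isitthree(), defined here so the example runs alone
is_three <- function(x) {
  x == 3
}

# sapply applies the function to each element and returns a vector
values <- c(1, 3, 5)
hits <- sapply(values, is_three)
hits
# [1] FALSE  TRUE FALSE

# lapply does the same for a list, and returns a list
lst <- list(a = 1:3, b = 4:6)
means <- lapply(lst, mean)
means$a
# [1] 2
means$b
# [1] 5
```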

4 Tips for writing longer scripts and programs

Often in data analysis you need to write a script or program (terms we use interchangeably here), eg to go through a dataset you have acquired or built and fix variables, rows, or individual elements in the dataset. Generally there are faster ways to solve that particular problem using apply and other functions, but sometimes it’s easier to just write a script that goes through each row, each column, or each cell one at a time. Other times, you might want to write a script that simulates data in order to test or validate some theory or method, and that script needs to run many times in order to do many random simulations.
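To give a rough flavor of the simulation case, here is a minimal sketch. Every setting in it (the seed, 200 repetitions, 30 draws per repetition, the names n_sims and sim_means) is an arbitrary assumption for illustration. It repeatedly draws random data, stores the mean of each draw, and then looks at how those means are spread out:

```r
set.seed(1)                    # hypothetical seed, so the run can be repeated
n_sims <- 200                  # hypothetical number of repetitions
sim_means <- numeric(n_sims)   # empty vector to hold one result per repetition

for (s in 1:n_sims) {
  draws <- rnorm(30)           # 30 random draws from a standard normal
  sim_means[s] <- mean(draws)  # save the mean of this repetition
}

mean(sim_means)   # should be close to 0
sd(sim_means)     # should be close to 1/sqrt(30), about 0.18
```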

There are lots of different reasons why you might want to write longer scripts or programs, some of which we will touch on in later lessons here and in related courses. But rather than diving into an example with data, let’s just explore how to write a somewhat longer program and what strategies and practices work best for writing robust, readable code.

Note that everyone has different styles and feelings about how best to do these things, and here you’re just getting one perspective. There are many other good ways to write code, and if you go on to serious programming, many of the practices that are fine for writing quick, casual scripts may not be best practices for writing more complex code. But a few tips can at least be helpful for accomplishing the sort of quick-and-dirty jobs one often encounters in data analysis.

Introduction to R 2.2: Programming and Scripts