Overview

In this lesson we learn R programming tools such as “if”, “for” and “apply”.

Objectives

After completing this lesson, students should be able to:

Create and use “if” conditionals.
Create and use “for” loops.
Apply “apply” for faster vector operations.

Readings

Lander, Chapters 9, 10, 11.1

1 Programming

The other important tools for writing functions and manipulating data are control statements and loops. This too will be familiar to those with some programming background.

For those with no programming experience, the key idea is that a program or script is executed by R one line at a time, often with generous use of brackets { } and parentheses to let the computer know where chunks start and end.

1.1 If

We have seen conditional statements implicitly before, but if allows us more control:

if(3 < 4){
  print("yup")
}

[1] "yup"

Again, like all scripts, this is executed by R like we read it – from the top down, one line at a time. If takes a truth condition as its input, and executes the stuff in the brackets {} if the input truth condition is TRUE.

Here’s another example:

if(3 == 4){
  print("yup")
}

Note the lack of output now.

1.2 Logical operators

Inportant logical operators include ==, <, >, <=, >=, as well as “and” &, “or” |, “not” ! and not equal to !=. For instance:

if((3 < 4) & (3 != 2)){
  print("yup")
}

[1] "yup"

Note the use of parentheses to make sure the logic is clear; you can sometimes get away with less parentheses, but it’s good form to be explicit.

1.3 Else

If can also do alternative actions when the input is false, via else. Here is an illustration in a function using if and else:

isitthree <- function(x){
  if(x == 3){
    return(TRUE)
  }else{
    return(FALSE)
  } 
}
isitthree(3)

[1] TRUE

You can also user a shorter function that combines them into ifelse:

isitless <- ifelse(3<4,1,0)
isitless

[1] 1

where the first argument is the test, the second is the output for if it is true, and the third is the output for if the test is false.

2 Loops: For

R is notoriously slow for scripts that repeated loop through data, but often speed is not an issue or you just need to write a loop to generate or manipulate your dataset.

The most common loop function is for:

for(i in 1:3){
  print(2*i)
}

[1] 2
[1] 4
[1] 6

Here i, like in a function, is a local variable, used within the loop only. After the “in” comes a vector (eg, 1:3); it can be any vector, including a column of data.

for(j in c("frog","duck")){
  print(j)
}

[1] "frog"
[1] "duck"

2.1 Loops for updating variables

One common technique you see in programming using loops is to create an empty scalar or vector variable, and then repeatedly updating it inside a “for” loop. Say I wanted to multiply a number (say, 5) by 3 without actually using a multiplication function. Since multiplying 5 by 3 is the same as counting by 3s five times, I could write:

myprod <- 0
for(i in 1:5){
  myprod <- myprod + 3
}
myprod

## [1] 15

When that is done, myprod equals five three’s, or 15. Note how we use the same variable, myprod, over and over again, updating it each time.

If instead of outputting 3*5, you wanted to output a vector containing all the steps from 3 to 15, you could do something similar:

myvect <- c(3)
for(i in 2:5){
  myvect <- c(myvect, myvect[i-1] + 3)
}
myvect

## [1]  3  6  9 12 15

Note how we again recyle a variable, this time the vector myvect, growing it at each step by “gluing” another number on the end, in this case the previous last number in myvect + 3. The output will be whole sequence from 3 to 15. (And note also how we start with i=2, because if we started with i=1, things would break when it tried to figure out what myvect[i-1] was when i is 1.)

2.2 Loops: While

While is another loop function, one that’s pretty self-explanatory:

i <- 1
while(i <= 2){
  print(3*i)
  i <- i + 1
}

[1] 3
[1] 6

Be careful with while loops though – if you set it up wrong it can possibly run forever if the truth condition is never reached. If possible, better to use a for loop, and be clever with your “in” vector if need be. One option is to use a break which interrupts the loop if some condition is met partway:

for(i in 1:3){
  if(i == 2){
    break
  }
  print(i)
}

[1] 1

3 Apply

A faster way to step through and manipulate a vector or list of data is using R’s various apply functions, which like loops will apply any function iteratively to a set of data. Apply is in general much faster than a for loop, but it can be trickier to conceptualize how best to use it, and sometimes for loops, though slower, are more easy and flexible.

Apply is best suited to matrices or data frames, and is generally used with functions that take as their input rows or columns. So instead of writing a for loop to step through each row, for instance, you can use apply to apply a function to all of those rows at once.

Thus the key option for apply is whether you want your function to operate on all the rows or all the columns of your data.

3.1 Apply to take the mean

Consider the following matrix:

m <- matrix(1:6,nrow=2,ncol=3)
m

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Say we want the mean of each column (eg, we might want the mean of each variable in a dataset). We could do this via:

apply(m,2,mean)

[1] 1.5 3.5 5.5

where m is the input data, 2 means we apply the function over columns (1 is rows), and mean is the function we are applying. We can use any function we want here, including those of our own creation.

Of course, R also has shortcuts for such common things as calculating row or column means:

colMeans(m)

[1] 1.5 3.5 5.5

3.2 Other ways to Apply

And R also has base functions for getting more detailed summaries of variables:

summary(m)

       V1             V2             V3      
 Min.   :1.00   Min.   :3.00   Min.   :5.00  
 1st Qu.:1.25   1st Qu.:3.25   1st Qu.:5.25  
 Median :1.50   Median :3.50   Median :5.50  
 Mean   :1.50   Mean   :3.50   Mean   :5.50  
 3rd Qu.:1.75   3rd Qu.:3.75   3rd Qu.:5.75  
 Max.   :2.00   Max.   :4.00   Max.   :6.00

Apply can also be applied to each element of the data individually, ie over both rows and columns with the setting 1:2 – though there are often better ways to do this. (Eg, for the following example, m==3 does the same thing.)

apply(m,1:2,isitthree)

      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,] FALSE FALSE FALSE

There are many other apply functions designed for different input data.

4 Example: writing a program

Let’s say we want to write a program (or “script”) to check a famously simple mathematical conjecture. The Collatz Conjecture is very easy to state, but surprisingly has resisted nearly a century of efforts to prove that it is true. Consider the following algorithm: start with any number \(n\); if it’s even, halve it, but if it’s odd, triple it and add 1; and whatever number results, keep repeating that process over and over. The conjecture is that, whatever number you start with, you will eventually end up at 1.

Now, it may have turned out that this is true for some trivial reason, in the same way that the “divisibility by 9” rule turns out to be trivial (where a number is divisible by 9 if the sum of its digits is divisible by 9). But in this case, it turns out there no known reason why you end up at 1, and in fact mathematicians have been puzzled over how to prove it for decades.

But even though we can’t prove it, we can easily test computationally that you end up at 1 for any given number. So let’s write a program that tests it for some number \(n\), saving the somewhat random-seeming sequence of numbers you get starting from \(n\), and stops when it gets to 1.

4.1 Tips for coding

Here are a few tips for coding:

Write it out in English first as best you can.
Use lots of comments so it will be readable by you and others.
Indent contents within curly brackets; RStudio does this automatically.
Always include the closing parenthesis or bracket before filling in the contents so you don’t forget; RStudio helps with this too.
Check that each step of your program works before going to the next. Sometimes this requires writing temporary code (eg, printing out each step) that you erase later, just to make sure eveything is working.
It’s often easiest to do the easy stuff first and the hard stuff later, even if the hard stuff comes earlier in your program; you can often leave out the hard stuff or use a place-holder while you work on the easier things.
After you are done and it seems to work, try to stress-test it a bit by giving it unexpected inputs just to test it out. But for casual scripts, you don’t have make it robust to other users who might deliberately try to break it (eg, by feeding a numeric function a string or something else that’s problematic).

4.2 Start in English, then translate to R

So how do we do this? The best way to start is in English, with numbered steps. We’ll write out in comments in R what we want our program to do, in as much detail as we can manage, and then later we’ll translate each bit of English into R code, as best we can.

So we want a program that takes some \(n\), does the Collatz algorithm on it repeatedly, saves each number along the way, and stops when it gets to 1, and returns the sequence of numbers.

# 1. Take some n as an input
# 2a. If it is even, halve it
# 2b. If it is odd, triple it and add 1
# 3. Repeat 2 over and over, saving each number along the way
# 4. If the current number is 1, stop
# 5. If we've done this, say, 10,000 times and never got to 1, stop and say we gave up

To do part 1, we clearly want a function that operates on some input \(n\), and so we wrap everything inside a function, putting the close-bracket at the end. We’ll preserve all the comments and keep them above each part of the code as best we can. Note how we indent the comments, as we will for all the content inside a curly bracket.

# 1. Take some n as an input
collatzf <- function(n){

  # 2a. If it is even, halve it
  # 2b. If it is odd, triple it and add 1
  # 3. Repeat 2 over and over, saving each number along the way
  # 4. If the current number is 1, stop

}

For step 2, we clearly need an if statement. We know how to measure even and odd using the %% modulo function, where x %% y gives the remainder from dividing \(x\) by \(y\). So depending on whether it is even or odd, we create a new \(n\) that is the transformation of the original \(n\). Notice how we reuse the same variable \(n\), replacing the original value of \(n\) with the new value; this is a very common approach in programming, to repeated replace the vale of the same variable as the program runs so that we don’t have to create something new for each step.

# 1. Take some n as an input
collatzf <- function(n){
  
  # 2a. If it is even, halve it
  # 2b. If it is odd, triple it and add 1
  if(n %% 2 == 0){
    n <- n/2
  }else{
    n <- 3*n + 1
  }
  
  # 3. Repeat 2 over and over, saving each number along the way
  # 4. If the current number is 1, stop

}

Ok, now we need some loop that keeps repeating the process in 2, and also saves each value of \(n\) along the way. This actually wraps around the step in 2, so we’ll have to shuffle things a bit. First let’s create a vector at the top of the function to store our list of \(n\) values, which we will update for each repetition by gluing onto the end each new number. Note how we create a vector with just the initial \(n\) to begin, and then build it up by using c() to paste a new number onto the end of the vector. This too is a common trick when writing scripts where you don’t necessarily know how long it will run before it ends.

# 1. Take some n as an input
collatzf <- function(n){
  collatzv <- c(n)
  
  # 2a. If it is even, halve it
  # 2b. If it is odd, triple it and add 1
  if(n %% 2 == 0){
    n <- n/2
  }else{
    n <- 3*n + 1
  }
  collatzv <- c(collatzv,n)
  
  # 3. Repeat 2 over and over, saving each number along the way
  # 4. If the current number is 1, stop
  
}

Now we need to add the loop that repeats 2 until the current. We use while and indent again because of the curly brackets.

# 1. Take some n as an input
collatzv <- c()
collatzf <- function(n){
  collatzv <- c(n)
  
  # 3. Repeat 2 over and over, saving each number along the way
  # 4. If the current number is 1, stop
  while(n != 1){
    # 2a. If it is even, halve it
    # 2b. If it is odd, triple it and add 1
    if(n %% 2 == 0){
      n <- n/2
    }else{
      n <- 3*n + 1
    }
    collatzv <- c(collatzv,n)
  }
  
}

And finally, we want to return our vector so we can inspect it or use it later.

# 1. Take some n as an input
collatzv <- c()
collatzf <- function(n){
  collatzv <- c(n)
  
  # 3. Repeat 2 over and over, saving each number along the way
  # 4. If the current number is 1, stop
  while(n != 1){
    # 2a. If it is even, halve it
    # 2b. If it is odd, triple it and add 1
    if(n %% 2 == 0){
      n <- n/2
    }else{
      n <- 3*n + 1
    }
    collatzv <- c(collatzv,n)
  }
  
  return(collatzv)
}

4.3 Test it

So now let’s test it, with a few options, one of which we won’t show here.

collatzf(10)

## [1] 10  5 16  8  4  2  1

collatzf(100)

##  [1] 100  50  25  76  38  19  58  29  88  44  22  11  34  17  52  26  13  40  20
## [20]  10   5  16   8   4   2   1

And here’s a couple we won’t run. Try to guess what would go wrong with each:

# Not a good idea:
# collatzf(9.7)
# Very bad idea:
# collatzf(-1)
# Bad but at least just produces an error:
# collatzf("fish")

If you want to make the program more robust, you can put a step at the beginning that checks to make sure \(n\) is a positive integer and breaks the loop otherwise, before it gets to the while, which can run forever if given a negative input. (Interestingly, the input of 9.7 ought to never finish either, but if you run it, what you will see is that it does actually converge to 1: the number gets bigger and bigger until it is so big it is rounded to an integer in R’s memory, at which point the algorithm works to bring it down to 1 in about 444 total steps.)

4.4 Extension

Say instead we had wanted the Collatz sequence for every number between 1 and 100. Then we would start in the same way, writing our function step by step, and then after that we embed the function in a loop which creates the list of sequences. So again, you would work from the inside out, doing the easier part first (the function), and then putting that inside a loop that repeats it for every \(n\) and saves those sequences in a list.

collatzlist <- list()
for(i in 2:100){
  collatzlist[[i]] <- collatzf(i)
}

If you are curious, different numbers seem to take longer to converge than others. Here is how long they take to converge:

stepsto1 <- sapply(collatzlist,length)

And, getting ahead of ourselves, here is a plot of how long they take:

plot(1:length(stepsto1),stepsto1)

Why do some initial \(n\) vales take a long time to get to 1, and others not very long? I have no idea! Maybe the Wikipedia article has some insights…

Computational Statistics 2.2: Programming and Scripts