Overview
In this lesson we learn R programming tools such as “if”, “for” and “apply”.
Objectives
After completing this lesson, students should be able to:
Readings
Lander, Chapters 9, 10, 11.1
The other important tools for writing functions and manipulating data are control statements and loops. This too will be familiar to those with some programming background.
For those with no programming experience, the key idea is that a program or script is executed by R one line at a time, often with generous use of brackets {
}
and parentheses to let the computer know where chunks start and end.
We have seen conditional statements implicitly before, but if
allows us more control:
if(3 < 4){
print("yup")
}
[1] "yup"
Again, like all scripts, this is executed by R like we read it – from the top down, one line at a time. If
takes a truth condition as its input, and executes the stuff in the brackets {}
if the input truth condition is TRUE.
Here’s another example:
if(3 == 4){
print("yup")
}
Note the lack of output now.
Inportant logical operators include ==
, <
, >
, <=
, >=
, as well as “and” &
, “or” |
, “not” !
and not equal to !=
. For instance:
if((3 < 4) & (3 != 2)){
print("yup")
}
[1] "yup"
Note the use of parentheses to make sure the logic is clear; you can sometimes get away with less parentheses, but it’s good form to be explicit.
If
can also do alternative actions when the input is false, via else
. Here is an illustration in a function using if
and else
:
isitthree <- function(x){
if(x == 3){
return(TRUE)
}else{
return(FALSE)
}
}
isitthree(3)
[1] TRUE
You can also user a shorter function that combines them into ifelse
:
isitless <- ifelse(3<4,1,0)
isitless
[1] 1
where the first argument is the test, the second is the output for if it is true, and the third is the output for if the test is false.
R is notoriously slow for scripts that repeated loop through data, but often speed is not an issue or you just need to write a loop to generate or manipulate your dataset.
The most common loop function is for
:
for(i in 1:3){
print(2*i)
}
[1] 2
[1] 4
[1] 6
Here i
, like in a function, is a local variable, used within the loop only. After the “in” comes a vector (eg, 1:3); it can be any vector, including a column of data.
for(j in c("frog","duck")){
print(j)
}
[1] "frog"
[1] "duck"
One common technique you see in programming using loops is to create an empty scalar or vector variable, and then repeatedly updating it inside a “for” loop. Say I wanted to multiply a number (say, 5) by 3 without actually using a multiplication function. Since multiplying 5 by 3 is the same as counting by 3s five times, I could write:
myprod <- 0
for(i in 1:5){
myprod <- myprod + 3
}
myprod
## [1] 15
When that is done, myprod equals five three’s, or 15. Note how we use the same variable, myprod, over and over again, updating it each time.
If instead of outputting 3*5, you wanted to output a vector containing all the steps from 3 to 15, you could do something similar:
myvect <- c(3)
for(i in 2:5){
myvect <- c(myvect, myvect[i-1] + 3)
}
myvect
## [1] 3 6 9 12 15
Note how we again recyle a variable, this time the vector myvect
, growing it at each step by “gluing” another number on the end, in this case the previous last number in myvect
+ 3. The output will be whole sequence from 3 to 15. (And note also how we start with i=2
, because if we started with i=1
, things would break when it tried to figure out what myvect[i-1]
was when i is 1.)
While
is another loop function, one that’s pretty self-explanatory:
i <- 1
while(i <= 2){
print(3*i)
i <- i + 1
}
[1] 3
[1] 6
Be careful with while
loops though – if you set it up wrong it can possibly run forever if the truth condition is never reached. If possible, better to use a for
loop, and be clever with your “in” vector if need be. One option is to use a break
which interrupts the loop if some condition is met partway:
for(i in 1:3){
if(i == 2){
break
}
print(i)
}
[1] 1
A faster way to step through and manipulate a vector or list of data is using R’s various apply
functions, which like loops will apply any function iteratively to a set of data. Apply
is in general much faster than a for
loop, but it can be trickier to conceptualize how best to use it, and sometimes for
loops, though slower, are more easy and flexible.
Apply
is best suited to matrices or data frames, and is generally used with functions that take as their input rows or columns. So instead of writing a for
loop to step through each row, for instance, you can use apply
to apply a function to all of those rows at once.
Thus the key option for apply
is whether you want your function to operate on all the rows or all the columns of your data.
Consider the following matrix:
m <- matrix(1:6,nrow=2,ncol=3)
m
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Say we want the mean of each column (eg, we might want the mean of each variable in a dataset). We could do this via:
apply(m,2,mean)
[1] 1.5 3.5 5.5
where m
is the input data, 2 means we apply the function over columns (1 is rows), and mean
is the function we are applying. We can use any function we want here, including those of our own creation.
Of course, R also has shortcuts for such common things as calculating row or column means:
colMeans(m)
[1] 1.5 3.5 5.5
And R also has base functions for getting more detailed summaries of variables:
summary(m)
V1 V2 V3
Min. :1.00 Min. :3.00 Min. :5.00
1st Qu.:1.25 1st Qu.:3.25 1st Qu.:5.25
Median :1.50 Median :3.50 Median :5.50
Mean :1.50 Mean :3.50 Mean :5.50
3rd Qu.:1.75 3rd Qu.:3.75 3rd Qu.:5.75
Max. :2.00 Max. :4.00 Max. :6.00
Apply
can also be applied to each element of the data individually, ie over both rows and columns with the setting 1:2
– though there are often better ways to do this. (Eg, for the following example, m==3
does the same thing.)
apply(m,1:2,isitthree)
[,1] [,2] [,3]
[1,] FALSE TRUE FALSE
[2,] FALSE FALSE FALSE
There are many other apply
functions designed for different input data.
Let’s say we want to write a program (or “script”) to check a famously simple mathematical conjecture. The Collatz Conjecture is very easy to state, but surprisingly has resisted nearly a century of efforts to prove that it is true. Consider the following algorithm: start with any number \(n\); if it’s even, halve it, but if it’s odd, triple it and add 1; and whatever number results, keep repeating that process over and over. The conjecture is that, whatever number you start with, you will eventually end up at 1.
Now, it may have turned out that this is true for some trivial reason, in the same way that the “divisibility by 9” rule turns out to be trivial (where a number is divisible by 9 if the sum of its digits is divisible by 9). But in this case, it turns out there no known reason why you end up at 1, and in fact mathematicians have been puzzled over how to prove it for decades.
But even though we can’t prove it, we can easily test computationally that you end up at 1 for any given number. So let’s write a program that tests it for some number \(n\), saving the somewhat random-seeming sequence of numbers you get starting from \(n\), and stops when it gets to 1.
Here are a few tips for coding:
So how do we do this? The best way to start is in English, with numbered steps. We’ll write out in comments in R what we want our program to do, in as much detail as we can manage, and then later we’ll translate each bit of English into R code, as best we can.
So we want a program that takes some \(n\), does the Collatz algorithm on it repeatedly, saves each number along the way, and stops when it gets to 1, and returns the sequence of numbers.
# 1. Take some n as an input
# 2a. If it is even, halve it
# 2b. If it is odd, triple it and add 1
# 3. Repeat 2 over and over, saving each number along the way
# 4. If the current number is 1, stop
# 5. If we've done this, say, 10,000 times and never got to 1, stop and say we gave up
# 1. Take some n as an input
collatzf <- function(n){
# 2a. If it is even, halve it
# 2b. If it is odd, triple it and add 1
# 3. Repeat 2 over and over, saving each number along the way
# 4. If the current number is 1, stop
}
if
statement. We know how to measure even and odd using the %%
modulo function, where x %% y
gives the remainder from dividing \(x\) by \(y\). So depending on whether it is even or odd, we create a new \(n\) that is the transformation of the original \(n\). Notice how we reuse the same variable \(n\), replacing the original value of \(n\) with the new value; this is a very common approach in programming, to repeated replace the vale of the same variable as the program runs so that we don’t have to create something new for each step.# 1. Take some n as an input
collatzf <- function(n){
# 2a. If it is even, halve it
# 2b. If it is odd, triple it and add 1
if(n %% 2 == 0){
n <- n/2
}else{
n <- 3*n + 1
}
# 3. Repeat 2 over and over, saving each number along the way
# 4. If the current number is 1, stop
}
c()
to paste a new number onto the end of the vector. This too is a common trick when writing scripts where you don’t necessarily know how long it will run before it ends.# 1. Take some n as an input
collatzf <- function(n){
collatzv <- c(n)
# 2a. If it is even, halve it
# 2b. If it is odd, triple it and add 1
if(n %% 2 == 0){
n <- n/2
}else{
n <- 3*n + 1
}
collatzv <- c(collatzv,n)
# 3. Repeat 2 over and over, saving each number along the way
# 4. If the current number is 1, stop
}
Now we need to add the loop that repeats 2 until the current. We use while
and indent again because of the curly brackets.
# 1. Take some n as an input
collatzv <- c()
collatzf <- function(n){
collatzv <- c(n)
# 3. Repeat 2 over and over, saving each number along the way
# 4. If the current number is 1, stop
while(n != 1){
# 2a. If it is even, halve it
# 2b. If it is odd, triple it and add 1
if(n %% 2 == 0){
n <- n/2
}else{
n <- 3*n + 1
}
collatzv <- c(collatzv,n)
}
}
And finally, we want to return our vector so we can inspect it or use it later.
# 1. Take some n as an input
collatzv <- c()
collatzf <- function(n){
collatzv <- c(n)
# 3. Repeat 2 over and over, saving each number along the way
# 4. If the current number is 1, stop
while(n != 1){
# 2a. If it is even, halve it
# 2b. If it is odd, triple it and add 1
if(n %% 2 == 0){
n <- n/2
}else{
n <- 3*n + 1
}
collatzv <- c(collatzv,n)
}
return(collatzv)
}
So now let’s test it, with a few options, one of which we won’t show here.
collatzf(10)
## [1] 10 5 16 8 4 2 1
collatzf(100)
## [1] 100 50 25 76 38 19 58 29 88 44 22 11 34 17 52 26 13 40 20
## [20] 10 5 16 8 4 2 1
And here’s a couple we won’t run. Try to guess what would go wrong with each:
# Not a good idea:
# collatzf(9.7)
# Very bad idea:
# collatzf(-1)
# Bad but at least just produces an error:
# collatzf("fish")
If you want to make the program more robust, you can put a step at the beginning that checks to make sure \(n\) is a positive integer and breaks the loop otherwise, before it gets to the while
, which can run forever if given a negative input. (Interestingly, the input of 9.7 ought to never finish either, but if you run it, what you will see is that it does actually converge to 1: the number gets bigger and bigger until it is so big it is rounded to an integer in R’s memory, at which point the algorithm works to bring it down to 1 in about 444 total steps.)
Say instead we had wanted the Collatz sequence for every number between 1 and 100. Then we would start in the same way, writing our function step by step, and then after that we embed the function in a loop which creates the list of sequences. So again, you would work from the inside out, doing the easier part first (the function), and then putting that inside a loop that repeats it for every \(n\) and saves those sequences in a list.
collatzlist <- list()
for(i in 2:100){
collatzlist[[i]] <- collatzf(i)
}
If you are curious, different numbers seem to take longer to converge than others. Here is how long they take to converge:
stepsto1 <- sapply(collatzlist,length)
And, getting ahead of ourselves, here is a plot of how long they take:
plot(1:length(stepsto1),stepsto1)
Why do some initial \(n\) vales take a long time to get to 1, and others not very long? I have no idea! Maybe the Wikipedia article has some insights…