5  Programming

After learning the basics, programming in R is the next big step. There are already a vast number of R packages available, surely more than enough to cover everything you could possibly want to do? Why then, would you ever need to create you own R functions? Why not just stick to the functions from a package? Well, in some cases you’ll want to customise those existing functions to suit your specific needs. Or you may want to implement a new approach which means there won’t be any pre-existing packages that work for you. Both of these are not particularly common. Functions are mainly used to do one thing well in a simple manner without having to type of the code necessary to do that function each time. We can see functions as a short-cut to copy-pasting. If you have to do a similar task 4 times or more, build a function for it, and simply call that function 4 times or call it in a loop .

5.1 Looking behind the curtain

A good way to start learning to program in R is to see what others have done. We can start by briefly peeking behind the curtain. With many functions in R, if you want to have a quick glance at the machinery behind the scenes, we can simply write the function name but without the ().

Note that to view the source code of base R packages (those that come with R) requires some additional steps which we won’t cover here (see this link if you’re interested), but for most other packages that you install yourself, generally entering the function name without () will show the source code of the function.

What can have a look at the function to fit a linear model lm()

lm
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
{
    ret.x <- x
    ret.y <- y
    cl <- match.call()
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "subset", "weights", "na.action", 
        "offset"), names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf$drop.unused.levels <- TRUE
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval(mf, parent.frame())
    if (method == "model.frame") 
        return(mf)
    else if (method != "qr") 
        warning(gettextf("method = '%s' is not supported. Using 'qr'", 
            method), domain = NA)
    mt <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- as.vector(model.weights(mf))
    if (!is.null(w) && !is.numeric(w)) 
        stop("'weights' must be a numeric vector")
    offset <- model.offset(mf)
    mlm <- is.matrix(y)
    ny <- if (mlm) 
        nrow(y)
    else length(y)
    if (!is.null(offset)) {
        if (!mlm) 
            offset <- as.vector(offset)
        if (NROW(offset) != ny) 
            stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
                NROW(offset), ny), domain = NA)
    }
    if (is.empty.model(mt)) {
        x <- NULL
        z <- list(coefficients = if (mlm) matrix(NA_real_, 0, 
            ncol(y)) else numeric(), residuals = y, fitted.values = 0 * 
            y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
            0) else ny)
        if (!is.null(offset)) {
            z$fitted.values <- offset
            z$residuals <- y - offset
        }
    }
    else {
        x <- model.matrix(mt, mf, contrasts)
        z <- if (is.null(w)) 
            lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
                ...)
        else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
            ...)
    }
    class(z) <- c(if (mlm) "mlm", "lm")
    z$na.action <- attr(mf, "na.action")
    z$offset <- offset
    z$contrasts <- attr(x, "contrasts")
    z$xlevels <- .getXlevels(mt, mf)
    z$call <- cl
    z$terms <- mt
    if (model) 
        z$model <- mf
    if (ret.x) 
        z$x <- x
    if (ret.y) 
        z$y <- y
    if (!qr) 
        z$qr <- NULL
    z
}
<bytecode: 0x6392c6069a40>
<environment: namespace:stats>

What we see above is the underlying code for this particular function. We could copy and paste this into our own script and make any changes we deemed necessary, although tread carefully and test the changes you’ve made.

Don’t worry overly if most of the code contained in functions doesn’t make sense immediately. This will be especially true if you are new to R, in which case it seems incredibly intimidating. Honestly, it can be intimidating even after years of R experience. To help with that, we’ll begin by making our own functions in R in the next section.

5.2 Functions in R

Functions are the bread and butter of R, the essential sustaining elements allowing you to work with R. They’re made (most of the time) with the utmost care and attention but may end up being something of a Frankenstein’s monster - with weirdly attached limbs. But no matter how convoluted they may be they will always faithfully do the same thing.

This means that functions can also be very stupid.

If we asked you to go to the supermarket to get us some ingredients to make Balmoral chicken, even if you don’t know what the heck that is, you’d be able to guess and bring at least something back. Or you could decide to make something else. Or you could ask a chef for help. Or you could pull out your phone and search online for what Balmoral chicken is. The point is, even if we didn’t give you enough information to do the task, you’re intelligent enough to, at the very least, try to find a work around.

If instead, we asked a function to do the same, it would listen intently to our request, and then will simply return an error. It would then repeat this every single time we asked it to do the job when the task is not clear. The point here, is that code and functions cannot find workarounds to poorly provided information, which is great. It’s totally reliant on you, to tell it very explicitly what it needs to do step by step.

Remember two things: the intelligence of code comes from the coder, not the computer and functions need exact instructions to work.

To prevent functions from being too stupid you must provide the information the function needs in order for it to work. As with the Balmoral chicken example, if we’d supplied a recipe list to the function, it would have managed just fine. We call this “fulfilling an argument”. The vast majority of functions require the user to fulfill at least one argument.

This can be illustrated in the pseudocode below. When we make a function we can:

  • specify what arguments the user must fulfill (e.g. arg1 and arg2)
  • provide default values to arguments (e.g. arg2 = TRUE)
  • define what to do with the arguments (expression):
my_function <- function(arg1, arg2, ...) {
  expression
}

The first thing to note is that we’ve used the function function() to create a new function called my_function. To walk through the above code; we’re creating a function called my_function. Within the round brackets we specify what information (i.e. arguments) the function requires to run (as many or as few as needed). These arguments are then passed to the expression part of the function. The expression can be any valid R command or set of R commands and is usually contained between a pair of braces { }. Once you run the above code, you can then use your new function by typing:

my_function(arg1, arg2)

Let’s work through an example to help clear things up.

First we are going to create a data frame called dishes, where columns lasagna, stovies, poutine, and tartiflette are filled with 10 random values drawn from a bag (using the rnorm() function to draw random values from a Normal distribution with mean 0 and standard deviation of 1). We also include a “problem”, for us to solve later, by including 3 NA values within the poutine column (using rep(NA, 3)).

dishes <- data.frame(
  lasagna = rnorm(10),
  stovies = rnorm(10),
  poutine = c(rep(NA, 3), rnorm(7)),
  tartiflette = rnorm(10)
)

Let’s say that you want to multiply the values in the variables stovies and lasagna and create a new object called stovies_lasagna. We can do this “by hand” using:

stovies_lasagna <- dishes$stovies * dishes$lasagna

If this was all we needed to do, we can stop here. R works with vectors, so doing these kinds of operations in R is actually much simpler than other programming languages, where this type of code might require loops (we say that R is a vectorised language). Something to keep in mind for later is that doing these kinds of operations with loops can be much slower compared to vectorisation.

But what if we want to repeat this multiplication many times? Let’s say we wanted to multiply columns lasagna and stovies, stovies and tartiflette, and poutine and tartiflette. In this case we could copy and paste the code, replacing the relevant information.

lasagna_stovies <- dishes$lasagna * dishes$stovies
stovies_tartiflette <- dishes$stovies * dishes$stovies
poutine_tartiflette <- dishes$poutine * dishes$tartiflette

While this approach works, it’s easy to make mistakes. In fact, here we’ve “forgotten” to change stovies to tartiflette in the second line of code when copying and pasting. This is where writing a function comes in handy. If we were to write this as a function, there is only one source of potential error (within the function itself) instead of many copy-pasted lines of code (which we also cut down on by using a function).

Tip

As a rule of thumb if we have to do the same thing (by copy/paste & modify) 3 times or more, we just make a function for it.

In this case, we’re using some fairly trivial code where it’s maybe hard to make a genuine mistake. But what if we increased the complexity?

dishes$lasagna * dishes$stovies / dishes$lasagna + (dishes$lasagna * 10^(dishes$stovies))
-dishes$stovies - (dishes$lasagna * sqrt(dishes$stovies + 10))

Now imagine having to copy and paste this three times, and in each case having to change the lasagna and stovies variables (especially if we had to do it more than three times).

What we could do instead is generalize our code for x and y columns instead of naming specific dishes. If we did this, we could recycle the x * y code. Whenever we wanted to multiple columns together, we assign a dishes to either x or y. We’ll assign the multiplication to the objects lasagna_stovies and stovies_poutine so we can come back to them later.

# Assign x and y values
x <- dishes$lasagna
y <- dishes$stovies

# Use multiplication code
lasagna_stovies <- x * y

# Assign new x and y values
x <- dishes$stovies
y <- dishes$poutine

# Reuse multiplication code
stovies_poutine <- x * y

This is essentially what a function does. Let’s call our new function multiply_cols() and define it with two arguments, x and y. A function in R will simply return its last value. However, it is possible to force the function to return an earlier value if wanted/needed. Using the return() function is not strictly necessary in this example as R will automatically return the value of the last line of code in our function. We include it here to make this explicit.

multiply_cols <- function(x, y) {
  return(x * y)
}

Now that we’ve defined our function we can use it. Let’s use the function to multiple the columns lasagna and stovies and assign the result to a new object called lasagna_stovies_func

lasagna_stovies_func <- multiply_cols(x = dishes$lasagna, y = dishes$stovies)
lasagna_stovies_func
 [1]  0.15888066  0.07950564  0.10198302  0.17930855 -0.89100492 -2.07422678
 [7]  0.18301854  1.36082011 -0.26397192  0.36560595

If we’re only interested in multiplying dishes$lasagna and dishes$stovies, it would be overkill to create a function to do something once. However, the benefit of creating a function is that we now have that function added to our environment which we can use as often as we like. We also have the code to create the function, meaning we can use it in completely new projects, reducing the amount of code that has to be written (and retested) from scratch each time.

To satisfy ourselves that the function has worked properly, we can compare the lasagna_stovies variable with our new variable lasagna_stovies_func using the identical() function. The identical() function tests whether two objects are exactly identical and returns either a TRUE or FALSE value. Use ?identical if you want to know more about this function.

identical(lasagna_stovies, lasagna_stovies_func)
[1] TRUE

And we confirm that the function has produced the same result as when we do the calculation manually. We recommend getting into a habit of checking that the function you’ve created works the way you think it has.

Now let’s use our multiply_cols() function to multiply columns stovies and poutine. Notice now that argument x is given the value dishes$stoviesand y the value dishes$poutine.

stovies_poutine_func <- multiply_cols(x = dishes$stovies, y = dishes$poutine)
stovies_poutine_func
 [1]           NA           NA           NA  0.134984249  0.991411525
 [6] -1.138108599 -1.153075159  1.636003714  0.008278524 -0.076287115

So far so good. All we’ve really done is wrapped the code x * y into a function, where we ask the user to specify what their x and y variables are.

Using the function is a bit long since we have to retype the name of the data frame for each variable. For a bit of fun we can modify the function so that, we can specify the data frame as an argument and the column names without quoting them (as in a tidyverse style).

multiply_cols <- function(data, x, y) {
  temp_var <- data %>%
    select({{ x }}, {{ y }}) %>%
    mutate(xy = prod(.)) %>%
    pull(xy)
}

For this new version of the function, we added we added a data argument on line 1. On lines 3, we select the x and y variables provided as arguments. On line 4., we create the product of the 2 selected columns and on line 5. we extract the column we juste created. We also remove the return() function since it was not needed

Our function is now compatible with the pipe (either native |> or magrittr %>%) function. However, since the function now uses the pipe from magrittr 📦 and dplyr 📦 functions, we need to load the tidyverse 📦 package for it to work.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
lasagna_stovies_func <- multiply_cols(dishes, lasagna, stovies)
lasagna_stovies_func <- dishes |> multiply_cols(lasagna, stovies)

Now let’s add a little bit more complexity. If you look at the output of poutine_tartiflette some of the calculations have produced NA values. This is because of those NA values we included in poutine when we created the dishes data frame. Despite these NA values, the function appeared to have worked but it gave us no indication that there might be a problem. In such cases we may prefer if it had warned us that something was wrong. How can we get the function to let us know when NA values are produced? Here’s one way.

multiply_cols <- function(data, x, y) {
  temp_var <- data %>%
    select({{ x }}, {{ y }}) %>%
    mutate(xy = {
      .[1] * .[2]
    }) %>%
    pull(xy)
  if (any(is.na(temp_var))) {
    warning("The function has produced NAs")
    return(temp_var)
  } else {
    return(temp_var)
  }
}
stovies_poutine_func <- multiply_cols(dishes, stovies, poutine)
Warning in multiply_cols(dishes, stovies, poutine): The function has produced
NAs
lasagna_stovies_func <- multiply_cols(dishes, lasagna, stovies)

The core of our function is still the same, but we’ve now got an extra six lines of code (lines 6-11). We’ve included some conditional statements, if (lines 6-8) and else (lines 9-11), to test whether any NAs have been produced and if they have we display a warning message to the user. The next section of this Chapter will explain how these work and how to use them.

5.3 Conditional statements

x * y does not apply any logic. It merely takes the value of x and multiplies it by the value of y. Conditional statements are how you inject some logic into your code. The most commonly used conditional statement is if. Whenever you see an if statement, read it as ‘If X is TRUE, do a thing’. Including an else statement simply extends the logic to ‘If X is TRUE, do a thing, or else do something different’.

Both the if and else statements allow you to run sections of code, depending on a condition is either TRUE or FALSE. The pseudo-code below shows you the general form.

  if (condition) {
  Code executed when condition is TRUE
  } else {
  Code executed when condition is FALSE
  }

To delve into this a bit more, we can use an old programmer joke to set up a problem.

A programmer’s partner says: ‘Please go to the store and buy a carton of milk and if they have eggs, get six.’

The programmer returned with 6 cartons of milk.

When the partner sees this, and exclaims ‘Why the heck did you buy six cartons of milk?’

The programmer replied ‘They had eggs’

At the risk of explaining a joke, the conditional statement here is whether or not the store had eggs. If coded as per the original request, the programmer should bring 6 cartons of milk if the store had eggs (condition = TRUE), or else bring 1 carton of milk if there weren’t any eggs (condition = FALSE). In R this is coded as:

eggs <- TRUE # Whether there were eggs in the store

if (eggs == TRUE) { # If there are eggs
  n.milk <- 6 # Get 6 cartons of milk
} else { # If there are not eggs
  n.milk <- 1 # Get 1 carton of milk
}

We can then check n.milk to see how many milk cartons they returned with.

n.milk
[1] 6

And just like the joke, our R code has missed that the condition was to determine whether or not to buy eggs, not more milk (this is actually a loose example of the Winograd Scheme, designed to test the intelligence of artificial intelligence by whether it can reason what the intended referent of a sentence is).

We could code the exact same egg-milk joke conditional statement using an ifelse() function.

eggs <- TRUE
n.milk <- ifelse(eggs == TRUE, yes = 6, no = 1)

This ifelse() function is doing exactly the same as the more fleshed out version from earlier, but is now condensed down into a single line of code. It has the added benefit of working on vectors as opposed to single values (more on this later when we introduce loops). The logic is read in the same way; “If there are eggs, assign a value of 6 to n.milk, if there isn’t any eggs, assign the value 1 to n.milk”.

We can check again to make sure the logic is still returning 6 cartons of milk:

n.milk
[1] 6

Currently we’d have to copy and paste code if we wanted to change if eggs were in the store or not. We learned above how to avoid lots of copy and pasting by creating a function. Just as with the simple x * y expression in our previous multiply_cols() function, the logical statements above are straightforward to code and well suited to be turned into a function. How about we do just that and wrap this logical statement up in a function?

milk <- function(eggs) {
  if (eggs == TRUE) {
    6
  } else {
    1
  }
}

We’ve now created a function called milk() where the only argument is eggs. The user of the function specifies if eggs is either TRUE or FALSE, and the function will then use a conditional statement to determine how many cartons of milk are returned.

Let’s quickly try:

milk(eggs = TRUE)
[1] 6

And the joke is maintained. Notice in this case we have actually specified that we are fulfilling the eggs argument (eggs = TRUE). In some functions, as with ours here, when a function only has a single argument we can be lazy and not name which argument we are fulfilling. In reality, it’s generally viewed as better practice to explicitly state which arguments you are fulfilling to avoid potential mistakes.

OK, lets go back to the multiply_cols() function we created above and explain how we’ve used conditional statements to warn the user if NA values are produced when we multiple any two columns together.

multiply_cols <- function(data, x, y) {
  temp_var <- data %>%
    select({{ x }}, {{ y }}) %>%
    mutate(xy = {
      .[1] * .[2]
    }) %>%
    pull(xy)
  if (any(is.na(temp_var))) {
    warning("The function has produced NAs")
    return(temp_var)
  } else {
    return(temp_var)
  }
}

In this new version of the function we still use x * y as before but this time we’ve assigned the values from this calculation to a temporary vector called temp_var so we can use it in our conditional statements. Note, this temp_var variable is local to our function and will not exist outside of the function due something called R’s scoping rules. We then use an if statement to determine whether our temp_var variable contains any NA values. The way this works is that we first use the is.na() function to test whether each value in our temp_var variable is an NA. The is.na() function returns TRUE if the value is an NA and FALSE if the value isn’t an NA. We then nest the is.na(temp_var) function inside the function any() to test whether any of the values returned by is.na(temp_var) are TRUE. If at least one value is TRUE the any() function will return a TRUE. So, if there are any NA values in our temp_var variable the condition for the if() function will be TRUE whereas if there are no NA values present then the condition will be FALSE. If the condition is TRUE the warning() function generates a warning message for the user and then returns the temp_var variable. If the condition is FALSE the code below the else statement is executed which just returns the temp_var variable.

So if we run our modified multiple_columns() function on the columns dishes$stovies and dishes$poutine (which contains NAs) we will receive an warning message.

stovies_poutine_func <- multiply_cols(dishes, stovies, poutine)
Warning in multiply_cols(dishes, stovies, poutine): The function has produced
NAs

Whereas if we multiple two columns that don’t contain NA values we don’t receive a warning message

lasagna_stovies_func <- multiply_cols(dishes, lasagna, stovies)

5.4 Combining logical operators

The functions that we’ve created so far have been perfectly suited for what we need, though they have been fairly simplistic. Let’s try creating a function that has a little more complexity to it. We’ll make a function to determine if today is going to be a good day or not based on two criteria. The first criteria will depend on the day of the week (Friday or not) and the second will be whether or not your code is working (TRUE or FALSE). To accomplish this, we’ll be using if and else statements. The complexity will come from if statements immediately following the relevant else statement. We’ll use such conditional statements four times to achieve all combinations of it being a Friday or not, and if your code is working or not.

We also used the cat() function to output text formatted correctly.

good.day <- function(code.working, day) {
  if (code.working == TRUE && day == "Friday") {
    cat(
  "BEST.
  DAY.
    EVER.
      Stop while you are ahead and go to the pub!"
    )
  } else if (code.working == FALSE && day == "Friday") {
    cat("Oh well, but at least it's Friday! Pub time!")
  } else if (code.working == TRUE && day != "Friday") {
    cat("
  So close to a good day...
  shame it's not a Friday"
    )
  } else if (code.working == FALSE && day != "Friday") {
    cat("Hello darkness.")
  }
}
good.day(code.working = TRUE, day = "Friday")
BEST.
  DAY.
    EVER.
      Stop while you are ahead and go to the pub!
good.day(FALSE, "Tuesday")
Hello darkness.

Notice that we never specified what to do if the day was not a Friday? That’s because, for this function, the only thing that matters is whether or not it’s Friday.

We’ve also been using logical operators whenever we’ve used if statements. Logical operators are the final piece of the logical conditions jigsaw. Below is a table which summarises operators. The first two are logical operators and the final six are relational operators. You can use any of these when you make your own functions (or loops).

Operator Technical Description What it means Example
&& Logical AND Both conditions must be met if(cond1 == test && cond2 == test)
|| Logical OR Either condition must be met if(cond1 == test || cond2 == test)
< Less than X is less than Y if(X < Y)
> Greater than X is greater than Y if(X > Y)
<= Less than or equal to X is less/equal to Y if(X <= Y)
>= Greater than or equal to X is greater/equal to Y if(X >= Y)
== Equal to X is equal to Y if(X == Y)
!= Not equal to X is not equal to Y if(X != Y)

5.5 Loops

R is very good at performing repetitive tasks. If we want a set of operations to be repeated several times we use what’s known as a loop. When you create a loop, R will execute the instructions in the loop a specified number of times or until a specified condition is met. There are three main types of loop in R: the for loop, the while loop and the repeat loop.

Loops are one of the staples of all programming languages, not just R, and can be a powerful tool (although in our opinion, used far too frequently when writing R code).

5.5.1 For loop

The most commonly used loop structure when you want to repeat a task a defined number of times is the for loop. The most basic example of a for loop is:

for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

But what’s the code actually doing? This is a dynamic bit of code were an index i is iteratively replaced by each value in the vector 1:5. Let’s break it down. Because the first value in our sequence (1:5) is 1, the loop starts by replacing i with 1 and runs everything between the { }. Loops conventionally use i as the counter, short for iteration, but you are free to use whatever you like, even your pet’s name, it really does not matter (except when using nested loops, in which case the counters must be called different things, like SenorWhiskers and HerrFlufferkins).

So, if we were to do the first iteration of the loop manually

i <- 1
print(i)
[1] 1

Once this first iteration is complete, the for loop loops back to the beginning and replaces i with the next value in our 1:5 sequence (2 in this case):

i <- 2
print(i)
[1] 2

This process is then repeated until the loop reaches the final value in the sequence (5 in this example) after which point it stops.

To reinforce how for loops work and introduce you to a valuable feature of loops, we’ll alter our counter within the loop. This can be used, for example, if we’re using a loop to iterate through a vector but want to select the next row (or any other value). To show this we’ll simply add 1 to the value of our index every time we iterate our loop.

for (i in 1:5) {
  print(i + 1)
}
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6

As in the previous loop, the first value in our sequence is 1. The loop begins by replacing i with 1, but this time we’ve specified that a value of 1 must be added to i in the expression resulting in a value of 1 + 1.

i <- 1
i + 1
[1] 2

As before, once the iteration is complete, the loop moves onto the next value in the sequence and replaces i with the next value (2 in this case) so that i + 1 becomes 2 + 1.

i <- 2
i + 1
[1] 3

And so on. We think you get the idea! In essence this is all a for loop is doing and nothing more.

Whilst above we have been using simple addition in the body of the loop, you can also combine loops with functions.

Let’s go back to our data frame dishes. Previously in the Chapter we created a function to multiply two columns and used it to create our lasagna_stovies, stovies_poutine, and poutine_tartiflette objects. We could have used a loop for this. Let’s remind ourselves what our data look like and the code for the multiple_columns() function.

dishes <- data.frame(
  lasagna = rnorm(10),
  stovies = rnorm(10),
  poutine = c(rep(NA, 3), rnorm(7)),
  tartiflette = rnorm(10)
)
multiply_cols <- function(data, x, y) {
  temp_var <- data %>%
    select({{ x }}, {{ y }}) %>%
    mutate(xy = {
      .[1] * .[2]
    }) %>%
    pull(xy)
  if (any(is.na(temp_var))) {
    warning("The function has produced NAs")
    return(temp_var)
  } else {
    return(temp_var)
  }
}

To use a list to iterate over these columns we need to first create an empty list (remember Section 3.2.3?) which we call temp (short for temporary) which will be used to store the output of the for loop.

temp <- list()
for (i in 1:(ncol(dishes) - 1)) {
  temp[[i]] <- multiply_cols(dishes, x = colnames(dishes)[i], y = colnames(dishes)[i + 1])
}
Warning in multiply_cols(dishes, x = colnames(dishes)[i], y =
colnames(dishes)[i + : The function has produced NAs
Warning in multiply_cols(dishes, x = colnames(dishes)[i], y =
colnames(dishes)[i + : The function has produced NAs

When we specify our for loop notice how we subtracted 1 from ncol(dishes). The ncol() function returns the number of columns in our dishes data frame which is 4 and so our loop runs from i = 1 to i = 4 - 1 which is i = 3.

So in the first iteration of the loop i takes on the value 1. The multiply_cols() function multiplies the dishes[, 1] (lasagna) and dishes[, 1 + 1] (stovies) columns and stores it in the temp[[1]] which is the first element of the temp list.

The second iteration of the loop i takes on the value 2. The multiply_cols() function multiplies the dishes[, 2] (stovies) and dishes[, 2 + 1] (poutine) columns and stores it in the temp[[2]] which is the second element of the temp list.

The third and final iteration of the loop i takes on the value 3. The multiply_cols() function multiplies the dishes[, 3] (poutine) and dishes[, 3 + 1] (tartiflette) columns and stores it in the temp[[3]] which is the third element of the temp list.

Again, it’s a good idea to test that we are getting something sensible from our loop (remember, check, check and check again!). To do this we can use the identical() function to compare the variables we created by hand with each iteration of the loop manually.

lasagna_stovies_func <- multiply_cols(dishes, lasagna, stovies)
i <- 1
identical(
  multiply_cols(dishes, colnames(dishes)[i], colnames(dishes)[i + 1]),
  lasagna_stovies_func
)
[1] TRUE
stovies_poutine_func <- multiply_cols(dishes, stovies, poutine)
Warning in multiply_cols(dishes, stovies, poutine): The function has produced
NAs
i <- 2
identical(
  multiply_cols(dishes, colnames(dishes)[i], colnames(dishes)[i + 1]),
  stovies_poutine_func
)
Warning in multiply_cols(dishes, colnames(dishes)[i], colnames(dishes)[i + :
The function has produced NAs
[1] TRUE

If you can follow the examples above, you’ll be in a good spot to begin writing some of your own for loops. That said there are other types of loops available to you.

5.5.2 While loop

Another type of loop that you may use (albeit less frequently) is the while loop. The while loop is used when you want to keep looping until a specific logical condition is satisfied (contrast this with the for loop which will always iterate through an entire sequence).

The basic structure of the while loop is:

while (logical_condition) {
  expression
}

A simple example of a while loop is:

i <- 0
while (i <= 4) {
  i <- i + 1
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Here the loop will only continue to pass values to the main body of the loop (the expression body) when i is less than or equal to 4 (specified using the <= operator in this example). Once i is greater than 4 the loop will stop.

There is another, very rarely used type of loop; the repeat loop. The repeat loop has no conditional check so can keep iterating indefinitely (meaning a break, or “stop here”, has to be coded into it). It’s worthwhile being aware of it’s existence, but for now we don’t think you need to worry about it; the for and while loops will see you through the vast majority of your looping needs.

5.5.3 When to use a loop?

Loops are fairly commonly used, though sometimes a little overused in our opinion. Equivalent tasks can be performed with functions, which are often more efficient than loops. Though this raises the question when should you use a loop?

In general loops are implemented inefficiently in R and should be avoided when better alternatives exist, especially when you’re working with large datasets. However, loop are sometimes the only way to achieve the result we want.

Some examples of when using loops can be appropriate:

  • Some simulations (e.g. the Ricker model can, in part, be built using loops)

  • Recursive relationships (a relationship which depends on the value of the previous relationship [“to understand recursion, you must understand recursion”])

  • More complex problems (e.g., how long since the last badger was seen at site \(j\), given a pine marten was seen at time \(t\), at the same location \(j\) as the badger, where the pine marten was detected in a specific 6 hour period, but exclude badgers seen 30 minutes before the pine marten arrival, repeated for all pine marten detections)

  • While loops (keep jumping until you’ve reached the moon)

5.5.4 If not loops, then what?

In short, use the apply family of functions; apply(), lapply(), tapply(), sapply(), vapply(), and mapply(). The apply functions can often do the tasks of most “home-brewed” loops, sometimes faster (though that won’t really be an issue for most people) but more importantly with a much lower risk of error. A strategy to have in the back of your mind which may be useful is; for every loop you make, try to remake it using an apply function (often lapply or sapply will work). If you can, use the apply version. There’s nothing worse than realizing there was a small, tiny, seemingly meaningless mistake in a loop which weeks, months or years down the line has propagated into a huge mess. We strongly recommend trying to use the apply functions whenever possible.

lapply

Your go to apply function will often be lapply() at least in the beginning. The way that lapply() works, and the reason it is often a good alternative to for loops, is that it will go through each element in a list and perform a task (i.e. run a function). It has the added benefit that it will output the results as a list - something you’d have to otherwise code yourself into a loop.

An lapply() has the following structure:

lapply(X, FUN)

Here X is the vector which we want to do something to. FUN stands for how much fun this is (just kidding!). It’s also short for “function”.

Let’s start with a simple demonstration first. Let’s use the lapply() function create a sequence from 1 to 5 and add 1 to each observation (just like we did when we used a for loop):

lapply(0:4, function(a) {
  a + 1
})
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

Notice that we need to specify our sequence as 0:4 to get the output 1 ,2 ,3 ,4 , 5 as we are adding 1 to each element of the sequence. See what happens if you use 1:5 instead.

Equivalently, we could have defined the function first and then used the function in lapply()

add_fun <- function(a) {
  a + 1
}
lapply(0:4, add_fun)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

The sapply() function does the same thing as lapply() but instead of storing the results as a list, it stores them as a vector.

sapply(0:4, function(a) {
  a + 1
})
[1] 1 2 3 4 5

As you can see, in both cases, we get exactly the same results as when we used the for loop.