Faster for loops in R

The more I use R for data analysis at work, the more I think of how I can use it for new battery testing experiments I simply wouldn’t have been able to do before, and the more I think of how I can push it further to get more and more information. One such experiment is what I’ve been calling “resistance mapping”, which I published a couple of months ago and have written about with the help of an interactive data set here.

While working on that initially, I had to spend a bit of time streamlining both the amount of data I was taking and the code I was writing so that the analysis took a reasonable amount of time. R, as I discovered, can be quite slow when picking values out of even moderately large data sets.

For example, one of the first steps in analysing a dataset from any of my “resistance mapping” experiments is to make a new column, state, and assign each row a letter depending on the sign of the current: “R” for resting, or “D” or “C” for discharging or charging, respectively. When I first wrote this, I used the sapply function to go through the raw dataframe (called raw here) row by row, check the sign of the current I and return the letter. Later, assuming it would be faster, I changed this, along with several other functions using sapply, to unlist(mclapply()) to make use of all four of my Macbook’s cores (note: mclapply() and detectCores() come from the parallel package, and ncores <- detectCores() is set earlier in the program to make sure all cores are used). After that change, it looked like this:

raw$state <- unlist(mclapply(seq_len(nrow(raw)), function(i) {
  if(raw$I[i] == 0) {
    return("R")                    # resting
  } else {
    if(raw$I[i] < 0) return("D")   # discharging
    if(raw$I[i] > 0) return("C")   # charging
  }
}, mc.cores = ncores))

On a raw dataset with 112,863 rows and 9 columns, this took 2.06 seconds, which seemed like it should be possible to speed up considerably. Since I don’t need any of the other data in the dataframe for this step, I can iterate over the vector of currents directly rather than indexing into the dataframe row by row.

raw$state <- unlist(mclapply(raw$I, function(x) {
  if(x == 0) {
    return("R")
  } else {
    if(x < 0) return("D")
    if(x > 0) return("C")
  }
}, mc.cores = ncores))

This took 0.293 seconds, and gives the same result! Just for comparison, the same function but using sapply instead of unlist(mclapply()):

raw$state <- sapply(raw$I, function(x) {
  if(x == 0) {
    return("R")
  } else {
    if(x < 0) return("D")
    if(x > 0) return("C")
  }
})

This took 0.251 seconds, despite using only one of the four processor cores. I suppose the overhead of forking the worker processes and converting a list of more than 100,000 elements back into a vector eats up the parallel speed advantage.
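For this particular job there’s also a fully vectorised option, which I haven’t benchmarked here but include as a point of reference: nested ifelse() calls work on the whole vector at once, with no per-element function call at the R level.

# vectorised alternative (not timed in the experiments above)
raw$state <- ifelse(raw$I == 0, "R",
             ifelse(raw$I < 0, "D", "C"))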

Just for fun, I compared these with a couple of for loops which do the same thing. The first one adds an empty column to the data frame and then fills in the letters as it goes:

raw$state <- vector(length = length(raw$I), mode = "character")

# modifying raw$state in place, one element at a time
for (i in seq_along(raw$I)) {
  if(raw$I[i] == 0) {
    raw$state[i] <- "R"
  } else {
    if(raw$I[i] < 0) raw$state[i] <- "D"
    if(raw$I[i] > 0) raw$state[i] <- "C"
  }
}

This took 85 seconds(!!). The other way to do it would be to initialise the state variable separately, fill in all the blanks, and then cbind it to the data frame at the end.

state <- vector(length = length(raw$I), mode = "character")

# filling in a standalone vector, then attaching it at the end
for (i in seq_along(raw$I)) {
  if(raw$I[i] == 0) {
    state[i] <- "R"
  } else {
    if(raw$I[i] < 0) state[i] <- "D"
    if(raw$I[i] > 0) state[i] <- "C"
  }
}

raw <- cbind(raw, state)

This took 3.51 seconds, way faster than the first for loop but still slower than any of the *apply versions. This goes to show just how slow it is to modify values in a data frame in place: every raw$state[i] <- ... assignment goes through the data frame’s replacement functions, which do far more work (and copying) than assigning into a plain vector.
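To see that in-place penalty in isolation, here is a small self-contained sketch with made-up data (the sizes and names are for illustration only, not from my actual dataset):

# element-wise assignment into a data frame column vs. a plain vector
I <- runif(1e5, -1, 1)                # made-up current values
df <- data.frame(I = I, state = character(1e5), stringsAsFactors = FALSE)

system.time(for (i in seq_along(I)) df$state[i] <- "R")   # slow: data frame replacement functions
v <- character(1e5)
system.time(for (i in seq_along(I)) v[i] <- "R")          # fast: plain vector assignment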

What actually made the biggest relative difference, however, was to write the function in C++ instead, using the Rcpp library.

library(Rcpp)

cppFunction('CharacterVector signCurrent(NumericVector x) {
  int n = x.size();
  CharacterVector out(n);

  // classify each current value: zero = rest, negative = discharge, positive = charge
  for (int i = 0; i < n; ++i) {
    if (x[i] == 0) {
      out[i] = "R";
    } else if (x[i] < 0) {
      out[i] = "D";
    } else if (x[i] > 0) {
      out[i] = "C";
    }
  }
  return out;
}')

And then calling the function like this:

raw$state <- signCurrent(raw$I)

5 milliseconds! That’s fifty times faster than the sapply method and a ridiculous four orders of magnitude faster than the slower for loop. I really don’t know anything about C++ at the moment, other than that it is much better than R at doing this kind of thing – and that it is thankfully very easy to contract out such tasks to C++ from within an R program.
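As an aside: one-off timings like those quoted above can be noisy. For more robust comparisons, the microbenchmark package runs each expression many times; a minimal sketch, assuming the signCurrent() function above is already defined:

library(microbenchmark)

# compare the single-core sapply approach against the Rcpp function, 20 runs each
microbenchmark(
  sapply = sapply(raw$I, function(x) if (x == 0) "R" else if (x < 0) "D" else "C"),
  rcpp   = signCurrent(raw$I),
  times  = 20
)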

Some of these time savings don’t seem like much, but the relative differences are huge. To finish off this post, I’ll give an example of another, much more significant time-saving trick. The main part of my “resistance mapping” analysis involves building an output data frame where certain bits of information are collected and all of the dE/dI calculations used to estimate cell resistance are made. It looks a bit like this:

proc <- data.frame(
  rest = unique(raw$rests),
  state = sapply(unique(raw$rests), function(i) {
    tail(raw$state[raw$rests == i & raw$state != "R"], 1)
  }),

...

)

(rest or rests is a unique identifier for a period of current interruption used to determine the cell resistance, and proc is the name of the data frame containing the processed output data.) There are six such columns in the output data frame, and all but the rest variable are extracted using the same sort of sapply function. This code as it stands, producing an output dataframe of close to 10,000 rows, takes 6 minutes, 13 seconds to run. That is some significant waiting, especially when I want to do this with multiple data sets. Substituting unlist(mclapply()) for sapply() as before offers some boost, down to 2 minutes, 58 seconds. Better, but not great.
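For reference, that substitution looks like this for the state column (a sketch for just the one column; the other five follow the same pattern):

state <- unlist(mclapply(unique(raw$rests), function(i) {
  tail(raw$state[raw$rests == i & raw$state != "R"], 1)
}, mc.cores = ncores))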

But again, this sluggishness is down to R having to keep looking up values in the original big dataset. One strategy I’ve found effective is to break up the raw data frame into a list of much smaller data frames. Because it’s battery testing data, the input and output data will contain cycle number information. So, I can split the raw data up by cycle number (cycn), make an output table for each cycle number and then stitch everything back together at the end.

proc <- mclapply(split(raw, raw$cycn), function(cycle) {

  out <- data.frame(
    rest = unique(cycle$rests),
    state = sapply(unique(cycle$rests), function(i) {
      tail(cycle$state[cycle$rests == i & cycle$state != "R"], 1)
    }),

    ...

  )
  return(out)
}, mc.cores = ncores)

library(plyr)              # provides rbind.fill()
proc <- rbind.fill(proc)

The data here is for 76 cycles of a battery, containing 9,975 individual current interruptions to be analysed. What I’m doing now, therefore, is breaking the data into a list of 76 much smaller tables with an average of ~131 current interruptions each. Doing it this way takes about 4.1 seconds to give the same results: roughly 40 times faster, and a considerable time saving whichever way you look at it.
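If you want to check that the split-up version really does reproduce the original output, a direct comparison does the job. (The names proc_slow and proc_fast here are hypothetical, standing for the outputs of the single sapply() version and the split version above.)

# proc_slow: output of the single sapply() version
# proc_fast: output of the split-by-cycle version
all.equal(proc_slow, proc_fast, check.attributes = FALSE)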