Ridiculously Easy Code Optimizations in R: Part 4

A list of alternate function calls for better performance :)

Rahul Saxena
7 min read · May 8, 2020

My last post about R code optimization covered a set of strategic changes and mindset shifts needed to write code better suited to the R programming language. In this post, I'll go over alternate function calls that can bolster your code's performance without any restructuring per se.

Compiling Functions:

As we know, R code is interpreted when it is run, which makes functions written in R a little slower than their counterparts in C/C++, where the entire program is compiled first and then executed. R does, however, have a compiling facility that can make functions up to 4 times faster. It requires calling the cmpfun() function from the base compiler package.

Benchmarking:

library(compiler)
library(microbenchmark)

f <- function(n) {
  to_cubes <- 0
  for (i in seq_len(n)) {
    to_cubes <- to_cubes + (i * i * i) # function inlined
  }
  to_cubes
}
f_opt <- cmpfun(f)
n <- 10000
microbenchmark(f(n), f_opt(n))

Just-in-time compilation can also be enabled in R; it automatically compiles every function the first time it is run. Just add the following snippet at the very beginning of your script.

library(compiler)
enableJIT(1)

The argument to enableJIT() represents the "level" of compilation, with higher levels implying more intensive compilation.
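For reference, the levels roughly correspond to the following (per the ?enableJIT documentation):

library(compiler)
# Level 0: JIT compilation disabled
# Level 1: larger closures are compiled before their first use
# Level 2: small closures are also compiled before their second use
# Level 3: all top-level loops are compiled before they are executed
enableJIT(3)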

Using [a-z]*pply functions instead of for loops:

Using the apply() family of functions instead of for loops can yield significant speed gains. However, you have to get a bit creative when the loop body carries complex logic.

Benchmarking:

library(ggplot2) # provides autoplot() for microbenchmark results

m1 <- matrix(1:100, nrow = 25, ncol = 4)
apply_fun <- function() {
  apply(m1, 1, mean)
}
for_loop <- function() {
  res <- numeric(25)
  for (i in 1:25) {
    inter <- 0
    for (j in 1:4) {
      inter <- inter + m1[i, j]
    }
    res[i] <- inter / 4 # store the result instead of printing, for a fair timing
  }
  res
}

autoplot(microbenchmark(for_loop(), apply_fun(), times = 1000))

Reversing Elements:

Here, the idea is to use a single specialized function instead of two functions nested inside each other. The rev() function returns a reversed version of its argument. But if you wish to sort a vector in decreasing order, sort(x, decreasing = TRUE) is around 10% faster than rev(sort(x)).

Benchmarking:

reverse_fun <- function(n) {
  vec <- sample(seq_len(100000), n)
  rev(sort(vec))
}
reverse_fun_opt <- function(n) {
  vec <- sample(seq_len(100000), n)
  sort(vec, decreasing = TRUE)
}
n <- 1000
autoplot(microbenchmark(reverse_fun(n), reverse_fun_opt(n), times = 100))

Efficient Sorting:

There are currently three sorting algorithms that can be specified in the sort() function, namely c("shell", "quick", "radix"). The radix algorithm is a relatively new addition to R. The radix method (a non-default option) is typically the most computationally efficient choice for most situations; it is around 20% faster when sorting large vectors of doubles.

Benchmarking:

sort_fun <- function(n) {
  vec <- sample(seq_len(10000000), n)
  sort(vec)
}
sort_fun_opt <- function(n) {
  vec <- sample(seq_len(10000000), n)
  sort(vec, method = "radix")
}
n <- 1000000
autoplot(microbenchmark(sort_fun(n), sort_fun_opt(n)))

Determining the minimum and the maximum element position:

Here, we follow the same guideline: instead of nesting two functions, reach for a single, specialized function. So for a vector named "vec", which(vec == min(vec)) and which(vec == max(vec)) are slower than which.min(vec) and which.max(vec) respectively. (Note one behavioral difference: which.min() returns only the position of the first minimum, whereas the which() form returns every tied position.)

Benchmarking:

which_fun <- function(n) {
  vec <- sample(seq_len(100000), n)
  which(vec == min(vec)) # positions of all tied minima
  which(vec == max(vec))
}
which_fun_opt <- function(n) {
  vec <- sample(seq_len(100000), n)
  which.min(vec) # position of the first minimum only
  which.max(vec)
}
n <- 1000
autoplot(microbenchmark(which_fun(n), which_fun_opt(n)))

Row and Column Operations:

Instead of using for loops, or even the apply family of functions, on the rows and columns of a matrix/data frame for simple operations, it is often more efficient to use a single specialized function. For example, for a matrix called "mat", rowMeans(mat) and colSums(mat) are far more efficient than the equivalent apply(mat, 1, mean) and apply(mat, 2, sum) calls.

For more row and column operations look up the matrixStats package.
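For instance, a minimal sketch (assuming the matrixStats package is installed):

library(matrixStats)
mat <- matrix(rnorm(100), nrow = 10)
rowMedians(mat) # per-row medians without apply(mat, 1, median)
colSds(mat)     # per-column standard deviations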

Benchmarking:

rowCol_fun <- function() {
  mat <- matrix(1:81, nrow = 9, ncol = 9)
  apply(mat, 1, mean)
  apply(mat, 2, sum)
}
rowCol_fun_opt <- function() {
  mat <- matrix(1:81, nrow = 9, ncol = 9)
  rowMeans(mat)
  colSums(mat)
}
autoplot(microbenchmark(rowCol_fun(), rowCol_fun_opt()))

Detecting NA values:

Following the same reasoning of preferring a specialized function, use anyNA(val) instead of any(is.na(val)) to detect NA values in a vector named "val".

Benchmarking:

detectNA <- function(n) {
  val <- sample(seq_len(100000), n)
  val <- append(val, NA, as.integer(n / 2))
  any(is.na(val))
}
detectNA_opt <- function(n) {
  val <- sample(seq_len(100000), n)
  val <- append(val, NA, as.integer(n / 2))
  anyNA(val)
}
n <- 10000
autoplot(microbenchmark(detectNA(n), detectNA_opt(n)))

Efficient Column Extraction:

The idea here is to use the internal function .subset2() instead of the usual higher-level calls for extracting a column from a data set; .subset2() skips the method dispatch that [[ performs.

Benchmarking:

autoplot(microbenchmark(
  mtcars[, 11],
  mtcars$carb,
  mtcars[[c(11)]],
  mtcars[[11]],
  .subset2(mtcars, 11)
))

Efficient Value Extraction:

The idea here is the same: use the internal function .subset2() instead of the usual higher-level calls for extracting a single value from a data set.

Benchmarking:

autoplot(microbenchmark(
  mtcars[32, 11],
  mtcars$carb[32],
  mtcars[[c(11, 32)]],
  mtcars[[11]][32],
  .subset2(mtcars, 11)[32],
  times = 1000L
))

Using Simpler Data Structures:

Lists and data frames let you mix objects of any type. This flexibility, however, comes at the cost of efficiency. For homogeneous data, it is therefore recommended to use simpler structures such as a matrix.

Benchmarking:

mat <- matrix(1:352, nrow = 32, ncol = 11) # same dimensions as mtcars
df_fun <- function() {
  mtcars[4, ]
}
mat_fun <- function() {
  mat[4, ] # matrix created outside the function, so only indexing is timed
}
autoplot(microbenchmark(df_fun(), mat_fun()))

Removing the return keyword:

Although not generally recommended, in simple functions where a single return statement sits at the end, the return keyword can be removed to shave off a small amount of overhead. Be careful with this optimization in complex functions with multiple return statements.

Benchmarking:

with_return <- function() {
  res <- sum(sample(seq_len(1000), 10))
  return(res)
}
without_return <- function() {
  res <- sum(sample(seq_len(1000), 10))
  res
}
autoplot(microbenchmark(without_return(), with_return()))

A few more optimization tips:

Using ifelse() in place of if and else:

You can increase your code's readability, and often its speed, by adopting ifelse() in place of traditional if and else statements at appropriate places. This is not an absolute rule, however; for further reading you can refer to Benjamin's answer here.
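A small sketch of both sides of that trade-off (the vector x and the names labels and flag are just for illustration):

x <- rnorm(100000)
# vectorized: one ifelse() call replaces a loop full of if/else branches
labels <- ifelse(x > 0, "positive", "non-positive")
# scalar case: a plain if/else is usually faster than ifelse() here
flag <- if (length(x) > 0) "non-empty" else "empty"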

Removing variables and flushing memory:

Removing objects with rm() as soon as their utility ends is good programming practice. Flushing memory with gc() at the end of each iteration within loops can also lead to potential speed-ups.
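A minimal sketch of the practice (big and res are hypothetical names):

big <- matrix(rnorm(1000000), nrow = 1000)
res <- colSums(big)
rm(big) # drop the object as soon as it is no longer needed
gc()    # ask R to return the freed memory to the system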

Evaluate with && instead of & whenever possible:

The non-vectorized logical operators, && and ||, only evaluate their second operand if needed. This is efficient and leads to neater code. Take care not to use && or || on vectors, though: they only consider the first element, giving an incorrect answer (and recent versions of R raise an error for this).
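A short sketch of the distinction:

x <- c(5, 10)
# && short-circuits: x[1] > 1 is only evaluated if length(x) > 0 is TRUE
if (length(x) > 0 && x[1] > 1) message("first element exceeds 1")
# & is vectorized: use it for element-wise comparisons
x > 1 & x < 8 # returns TRUE FALSE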

Inline Expansion:

At the cost of readability and some code duplication, you can gain speed by expanding a function's body at the places where it is called. This works for simple functions, but take care as the function logic grows more complex.
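A small sketch with a hypothetical cube() helper:

library(microbenchmark)

cube <- function(i) i * i * i
f_call <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + cube(i) # pays call overhead every iteration
  s
}
f_inline <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i * i * i # body expanded in place
  s
}
microbenchmark(f_call(10000), f_inline(10000))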

Conclusion:

In my next post on R code optimization, we will focus primarily on code profiling: identifying the bottlenecks in your code and using benchmarking optimally to guide your coding strategy.

Also, I have been selected as a GSoC 2020 scholar under the R Project for Statistical Computing. I will be working on improving and adding functionality to a package called rco, which stands for the R Code Optimizer. So stay tuned to my blog if you want to witness the awesomeness of the rco package.

I wish you all the best for your next R script/project. If this post made sense to you, be sure to go through this, this and this too. If you liked this post, consider clapping here and following me.

You could also buy me a coffee to support my work.

Thank You and Godspeed.
