Ridiculously Easy Code Optimizations in R: Part 3


If you have a background in “traditional” programming languages like C++ or Java and then switched to R, a lot of things might have baffled you. Among the plethora of new concepts, you might have come across the phrase “write vectorized R code”. If you were like me, you would have simply ignored it and gone ahead with just adjusting to R’s new syntax.

A lot of beginner R users are not comfortable with the term “vectorize” and are not really familiar with the technique. So here I’ll show you why your code should be vectorized whenever possible, along with a few tips for optimizing it. If you understand the ins and outs of vectorization in R, it can help you write shorter, simpler, safer, and faster code.

Vectorization in R:

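Vectorization in R basically refers to how R functions under the hood: many of its functions operate on a whole vector at once, with the loop over the elements running in compiled code rather than in R itself.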

R naturally stores data in columns (column major order), so if you are not coding to that pattern you are fighting the language.
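To make “column major order” concrete, here is a small sketch using only base R. The variable names are illustrative and not taken from the original article.

```r
# Fill a 2 x 3 matrix from the vector 1:6; R fills it column by column.
m <- matrix(1:6, nrow = 2)
print(m)

# The first column holds the first two elements of the underlying vector.
first_col <- m[, 1]

# Flattening the matrix returns the elements in storage (column-major) order.
flat <- as.vector(m)

# A data frame is a list in which every column is its own vector.
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
is_list <- is.list(df)
```

Because each column is a contiguous vector, operating on a whole column at a time tends to fit the language better than walking through the data row by row.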

So, let’s benchmark a few vectorized and non-vectorized code snippets and test our claim. We’ll be adding two vectors element-wise.
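The comparison might look roughly like the sketch below. It is not the original benchmark code: the vector length `n`, the function names `add_loop` and `add_vectorized`, and the use of base R’s `system.time()` as the timer are all assumptions made for illustration.

```r
# A minimal sketch; n and all names here are illustrative assumptions.
set.seed(42)
n <- 1e6
x <- runif(n)
y <- runif(n)

# Non-vectorized: C++-style loop, one call to `+` per element.
add_loop <- function(a, b) {
  out <- numeric(length(a))  # result vector allocated once, up front
  for (i in seq_along(a)) {
    out[i] <- a[i] + b[i]
  }
  out
}

# Vectorized: a single call to `+` on the whole vectors.
add_vectorized <- function(a, b) a + b

# Time each approach; exact numbers depend on the machine.
loop_time <- system.time(z_loop <- add_loop(x, y))
vec_time <- system.time(z_vec <- add_vectorized(x, y))

print(loop_time)
print(vec_time)

# Both approaches must produce the same result.
stopifnot(identical(z_loop, z_vec))
```

A single `system.time()` run is noisy, so for a serious comparison you would repeat each measurement many times and look at the distribution rather than one number.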

Benchmarking Results:

[Figure: benchmarking results]

Plots of the Benchmark:

[Figure: benchmark plots]

I guess this 100x speed-up should be enough to convince you to at least consider a vectorized approach before you start writing C++-styled code in R syntax.

Please note that the for loop itself isn’t inherently slow. The slowdown comes from making many more function calls (+ is called in every iteration) compared with the single + call in the vectorized code. Each individual function call is quick, but together they add up.

This was the most basic example I could think of. Just imagine the fruit this kind of optimization will bear in complex scripts. Now, let’s move on to the second most important thing related to vectors that you must take care of.
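The difference in call counts can be made visible with a small sketch. The wrapper `counted_add` and the counter `calls` are hypothetical helpers invented for this illustration; they simply count how often addition is invoked.

```r
# Hypothetical helper: wraps `+` and counts how many times it is called.
calls <- 0L
counted_add <- function(a, b) {
  calls <<- calls + 1L
  a + b
}

x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Loop version: addition is invoked once per element.
calls <- 0L
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- counted_add(x[i], y[i])
}
loop_calls <- calls

# Vectorized version: addition is invoked once for the whole vector.
calls <- 0L
z_vec <- counted_add(x, y)
vec_calls <- calls

print(c(loop = loop_calls, vectorized = vec_calls))
```

With four elements the loop makes four calls and the vectorized version makes one; for a vector of a million elements the gap grows in the same proportion.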

Memory Pre-Allocation:

All R Gurus unanimously agree that growing an R vector is a cardinal sin.

By growing a vector, I mean expanding its size one element at a time. Now, let’s jump straight into the code to get a better understanding of what I mean.

So, here in the first unoptimized case we are working with a vector that was not initialized at all.

In the second case, we are working with a vector that is initialized with both the size and the type of entries that will be put in it.

In the third case, the vector is only initialized with the size and not the type.
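The three cases described above could be sketched as follows. This is a reconstruction under assumptions, not the original code: the function names, the length `n`, the values stored, and the particular ways of growing and initializing the vector are all illustrative.

```r
n <- 1e4  # illustrative size

# Case 1: vector not initialized with a size; it grows on every iteration.
grow_vector <- function(n) {
  v <- NULL
  for (i in seq_len(n)) {
    v <- c(v, i * 2)  # builds a new, longer vector each time
  }
  v
}

# Case 2: both the size and the type (double) are fixed up front.
prealloc_typed <- function(n) {
  v <- numeric(n)
  for (i in seq_len(n)) {
    v[i] <- i * 2
  }
  v
}

# Case 3: only the size is fixed; rep(NA, n) is a logical vector,
# so R coerces it to double when the first number is assigned.
prealloc_untyped <- function(n) {
  v <- rep(NA, n)
  for (i in seq_len(n)) {
    v[i] <- i * 2
  }
  v
}

# Time each case; exact numbers depend on the machine.
print(system.time(r1 <- grow_vector(n)))
print(system.time(r2 <- prealloc_typed(n)))
print(system.time(r3 <- prealloc_untyped(n)))

# All three cases must produce the same values.
stopifnot(identical(r1, r2), identical(r2, r3))
```

All three functions return the same vector; only the way the storage is obtained differs, which is what the benchmark is meant to isolate.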

Benchmarking Results:

[Figure: benchmarking results]

Plots of the Benchmark:

[Figure: benchmark plots]

In the benchmark, we observed a speed-up of 4x. That is a pretty good speed-up for a single-line change, isn’t it? Needless to say, this speed-up becomes more pronounced as the vector gets larger.

Fully understanding why this speed-up happens would require diving into dynamic memory allocation, which is not our focus here. The basic idea is that when an object grows inside a loop, R must repeatedly request a larger block of memory and copy the existing elements into it, which is a costly process that can also lead to memory fragmentation.

Conclusion:

In this article, you saw the huge speed gains that are up for grabs with two simple practices:

  1. Writing vectorized code whenever possible
  2. Pre-allocating memory whenever possible

Moving ahead, whenever you write your next R script, keep the above two points in mind, and also this and this.

I wish you all the best for your future projects, dear reader :)

You could also buy me a coffee to support my work.

Thank You and Godspeed.
