Ridiculously Easy Code Optimizations in R: Part 3
The love affair of R with everything vector ^-^
If you have a background in “traditional” programming languages like C++ or Java and then switched to R, a lot of things might have baffled you. Among the plethora of new concepts, you might have come across the phrase “write vectorized R code”. If you were like me, you would have simply ignored this and gone ahead with just adjusting to the new syntax of R.
A lot of beginner R users are not comfortable with the term “vectorize” and are not really familiar with the method. So here I’ll show you why you should vectorize your code whenever possible, along with a few tips for doing it well, because once you understand the ins and outs of vectorization in R, you can write shorter, simpler, safer, and faster code.
Vectorization in R:
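Vectorization in R is basically a way of referring to how R functions under the hood: many of its functions and operators work on an entire vector in a single call, with the loop over the elements running in compiled code rather than in R itself.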
R naturally stores data in columns (column major order), so if you are not coding to that pattern you are fighting the language.
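To see what column-major order looks like in practice, here is a tiny sketch. The matrix and data frame are made-up examples of mine, not part of any benchmark in this article.

```r
# A 2 x 3 matrix built from 1:6; R fills it column by column.
m <- matrix(1:6, nrow = 2)
print(m)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Flattening the matrix returns the elements in storage (column-major) order.
flat <- as.vector(m)
print(flat)
# [1] 1 2 3 4 5 6

# A data frame is a list of columns, and each column is itself a vector.
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
col_is_vector <- is.vector(df$a)
print(col_is_vector)
# [1] TRUE
```

So whenever you work on a whole column at once, you are working with the data the way R already has it laid out.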
So, let’s benchmark a few vectorized and non-vectorized code snippets and test our claim. We’ll be adding two arrays/vectors element wise.
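Here is a minimal sketch of such a comparison. The vector size, the function names `add_loop` and `add_vec`, and the use of base R's `system.time()` are my assumptions for illustration, so the exact timings and ratio you get will depend on your machine and R version.

```r
# Sketch only: n and the timing method are assumptions chosen for illustration.
n <- 1e6
set.seed(42)
a <- runif(n)
b <- runif(n)

# Non-vectorized: loop over the indices and call `+` once per element.
add_loop <- function(x, y) {
  out <- numeric(length(x))  # result vector, allocated once
  for (i in seq_along(x)) {
    out[i] <- x[i] + y[i]
  }
  out
}

# Vectorized: a single call to `+` on the whole vectors.
add_vec <- function(x, y) x + y

loop_time <- system.time(res_loop <- add_loop(a, b))
vec_time  <- system.time(res_vec  <- add_vec(a, b))

print(loop_time["elapsed"])
print(vec_time["elapsed"])

# Both approaches should give the same answer.
print(all.equal(res_loop, res_vec))
```

For more reliable numbers you would repeat each call many times and compare the distributions rather than trusting a single run.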
Plots of the Benchmark:
I guess this 100x speed-up should be enough to convince you to at least consider a vectorized approach before you start writing C++-styled code in R syntax.
Please note that it isn’t that the `for` loop is slow; rather, it is because we have many more function calls (`+` is being called in each iteration) compared to the single `+` call in the vectorized code. Each individual function call is quick, but the total combination is slow.
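If the idea of `+` being a function call sounds odd, this small sketch shows it directly. The values are made up for illustration.

```r
# In R, operators are ordinary functions; backticks let us call `+` by name.
s1 <- 1 + 2
s2 <- `+`(1, 2)
print(identical(s1, s2))
# [1] TRUE

# A single call on two length-5 vectors performs all five additions at once.
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)
total <- `+`(x, y)
print(total)
# [1] 11 22 33 44 55
```

A loop over these vectors would make five separate calls to `+` instead of one, and that per-call overhead is what adds up.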
This was the most basic example I could think of. Just imagine the fruit this kind of optimization will bear in complex scripts. Now, let’s move on to the second most important thing related to vectors that you must take care of.
All R Gurus unanimously agree that growing an R vector is a cardinal sin.
By growing a vector, I mean increasing the size of the vector as you go. Now, let’s jump straight into the code to get a better understanding of what I mean.
So, here in the first unoptimized case we are working with a vector that was not initialized at all.
In the second case, we are working with a vector that is initialized with both the size and the type of entries that will be put in it.
In the third case, the vector is only initialized with the size and not the type.
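The three cases can be sketched roughly as follows. The function names, the vector size, and the values being stored are my assumptions for illustration, and the timings will vary with your machine and R version.

```r
n <- 1e5  # assumed size, chosen only for illustration

# Case 1: the vector is not initialized, so it has to grow inside the loop.
grow <- function(n) {
  x <- NULL
  for (i in 1:n) x[i] <- i * 2
  x
}

# Case 2: both the size and the type (numeric) are fixed up front.
prealloc_typed <- function(n) {
  x <- numeric(n)
  for (i in 1:n) x[i] <- i * 2
  x
}

# Case 3: only the size is fixed; vector(length = n) defaults to logical,
# so R converts it to numeric on the first assignment.
prealloc_untyped <- function(n) {
  x <- vector(length = n)
  for (i in 1:n) x[i] <- i * 2
  x
}

print(system.time(r1 <- grow(n))["elapsed"])
print(system.time(r2 <- prealloc_typed(n))["elapsed"])
print(system.time(r3 <- prealloc_untyped(n))["elapsed"])

# All three produce the same vector; only the allocation strategy differs.
print(identical(r1, r2) && identical(r1, r3))
```

Notice that the only difference between the three functions is the single line that creates `x`.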
Plots of the Benchmark:
In the benchmark we observed a speed-up of 4x. That is pretty good for a single-line change, isn’t it? Needless to say, the speed-up becomes more pronounced as the vector gets larger.
To completely understand why this speed-up happens, we would have to dive into dynamic memory allocation, which is not our focus here. The basic idea is that when an object grows inside a loop, the program must repeatedly request more memory from the operating system and often copy the existing elements into the new space. This is a costly process, and it can also lead to memory fragmentation.
In this article you saw the huge speed gains that are up for grabs with two simple practices:
- Writing vectorized code whenever possible
- Preallocating memory whenever possible
I wish you all the best for your future projects, dear reader :)
Thank You and Godspeed.