Ridiculously Easy Code Optimizations in R: Part 1

Looking beyond read.csv()

Rahul Saxena
The Startup

--

My last story generated quite a buzz at my university, as it was about how I did my part to fight a global pandemic as a university student. Many of my friends were intrigued by a specific line of code in that story. That line was:

test_df <- readRDS("input_data/cleaned_alumni_2.rds")

Even with zero idea of what readRDS is, I think it would be amply clear to you that here, we are reading a data file named “cleaned_alumni_2” with a strange “.rds” extension from a folder named “input_data”.

If you arrived at the above inference yourself, congratulations, it is absolutely correct. Okay, so, “rds”, much like “csv”, is a file format. What makes it better, though, is that it is a native R data format (more on that later). Being native to R, doesn’t it make intuitive sense that loading an “rds” file into your R script/model/dashboard would be faster than loading a generic “csv” file?

Isn’t this getting exciting? A different file format, native to R, faster loading times, isn’t this so cool?

Now, your next logical question would most probably be: what should I do if I only have my data set as a “csv” file? Don’t worry, I’ll show you how to convert it to the “rds” format in just two lines of code.

original_dataset <- read.csv("filename.csv")
saveRDS(original_dataset, "new_filename.rds")

Well, that’s it, the entirety of it. The first parameter of the saveRDS() function is the data set that you want to save as an “rds” file, and the second is the new file name. Smooth. Now how do we read that data back?

converted_dataset <- readRDS("new_filename.rds")

So, that’s the end of it. That is how you replace your “read.csv()” function with the faster “readRDS()” function.

Benchmarking:

Well, it wouldn’t hurt to test our claims of speed-up, right? So, here we go!

Benchmarking readRDS and read.csv functions
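The benchmark above was shown as a screenshot; here is a minimal sketch of what it looks like in code. The data set and file names (sample.csv, sample.rds) are made up for illustration — substitute your own files.

```r
library(microbenchmark)

# Build a sample data set and save it in both formats
df <- data.frame(id = 1:100000, value = rnorm(100000))
write.csv(df, "sample.csv", row.names = FALSE)
saveRDS(df, "sample.rds")

# Time both readers over several runs
results <- microbenchmark(
  csv = read.csv("sample.csv"),
  rds = readRDS("sample.rds"),
  times = 10
)
print(results)
```

microbenchmark() runs each expression multiple times and reports the distribution of timings, which is more reliable than a single system.time() call.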

Well, the results obtained above are just amazing: a speed-up by a factor of almost 7.5x. That’s a huge benefit for changing a single line of code, isn’t it 😉.

You could try to run the above test on your system too, using the “csv” files available with you. If no “csv” files are available, then simply create a sample data set and use the write.csv() function to make one. As a bonus, check the file size of your “rds” file and compare it with the “csv” file (hint: rds files are compressed too… :D)
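That bonus check can be sketched in a few lines — the sample data set and file names below are placeholders:

```r
# Create a sample data set and save it in both formats
sample_df <- data.frame(id = 1:100000, value = rnorm(100000))

write.csv(sample_df, "sample.csv", row.names = FALSE)
saveRDS(sample_df, "sample.rds")   # saveRDS() gzip-compresses by default

# Compare on-disk sizes in bytes
file.size("sample.csv")   # plain text, larger
file.size("sample.rds")   # compressed binary, smaller
```

Because saveRDS() compresses with gzip by default, the “rds” file will typically be a fraction of the size of the equivalent plain-text “csv”.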

Now let us visually inspect the difference in the two functions:

Load the “ggplot2” and “microbenchmark” packages.
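A sketch of the plotting code, again on a made-up sample data set (the file names are placeholders):

```r
library(ggplot2)
library(microbenchmark)

# Sample data saved in both formats
df <- data.frame(id = 1:100000, value = rnorm(100000))
write.csv(df, "sample.csv", row.names = FALSE)
saveRDS(df, "sample.rds")

results <- microbenchmark(
  csv = read.csv("sample.csv"),
  rds = readRDS("sample.rds"),
  times = 10
)

# microbenchmark ships an autoplot() method for ggplot2,
# which draws the timing distributions of the two expressions
autoplot(results)
```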

The plot below visually represents what we observed earlier in our benchmarks.

Visual plot of benchmarks of read.csv() and readRDS()

Conclusion:

The compressed file size and faster loading times of “rds” provide huge benefits when your script has to load a data set again and again (for example, a deployed ML model or an R Shiny app). A smaller data file also means you use fewer resources on your hosting platform, which again is a good thing to have.

Being a computer science undergrad, I have a keen interest in optimizations, and we’ll keep on exploring this domain together 😃. Here I’ve not touched upon the various other reading/loading options provided by external libraries, because I’d promised an optimization in a single line. In future posts, we’ll explore those packages too.

You could also buy me a coffee to support my work.

Thank You and Godspeed.
