Ridiculously Easy Code Optimizations in R: Part 2

Exploring all options for fast and efficient I/O

Rahul Saxena
The Startup


In my last post, you saw how to speed up your I/O without loading any external libraries. Today, we will explore a few more options for efficient I/O that external libraries make available. So let’s jump right in.

Different Approaches for I/O:

Probably the most versatile package for reading and writing data in R is the rio package. It can read data in most common formats through a single import() function. You do not even need to pass the file format as a parameter, because the function infers it from the extension in the filename. Just how slick is that! +_+
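As a minimal sketch of the idea (assuming the rio package is installed, and using the built-in iris data and temporary file paths as stand-ins for your own files), the same import() call handles both a CSV file and an RDS file:

```r
library(rio)

# Hypothetical stand-in files: temporary paths instead of real data files.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# export() picks the output format from the file extension.
export(iris, csv_path)
export(iris, rds_path)

# import() does the same on the way in: no format argument needed.
iris_csv <- import(csv_path)
iris_rds <- import(rds_path)

dim(iris_csv)
dim(iris_rds)
```

Under the hood, rio hands the work to a format-specific reader, so its speed depends largely on that reader.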

The second package that we’ll talk about is readr. The readr package was developed by Hadley Wickham to read large flat files quickly. It provides replacements for functions like read.table() and read.csv(); the analogous functions in readr are read_table() and read_csv(). These functions are often much faster than their base R counterparts.
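Here is a small sketch of the base function next to its readr counterpart, assuming readr is installed. The file is a temporary CSV built from the built-in mtcars data, purely for illustration:

```r
library(readr)

# Hypothetical stand-in file: mtcars written to a temporary CSV.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Base R reader: returns a plain data.frame.
base_df <- read.csv(csv_path)

# readr reader: returns a tibble. All mtcars columns are numeric,
# so we declare every column as double up front.
readr_df <- read_csv(csv_path, col_types = cols(.default = col_double()))

class(base_df)
class(readr_df)
```

Note that read_csv() returns a tibble rather than a plain data frame, which can matter if downstream code relies on data.frame-specific behaviour.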

The third package that we’ll talk about is the data.table package. Apart from the fast fread() function, it provides a plethora of really awesome techniques for working with tabular data, though those are outside the scope of this post.
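A minimal fread() sketch, assuming data.table is installed and again using a temporary CSV of the iris data as a stand-in for a real file:

```r
library(data.table)

# Hypothetical stand-in file: iris written to a temporary CSV.
csv_path <- tempfile(fileext = ".csv")
fwrite(iris, csv_path)

# fread() detects the separator and column types on its own
# and returns a data.table by default.
dt <- fread(csv_path)

# Ask for a plain data.frame if the rest of your code expects one.
df <- fread(csv_path, data.table = FALSE)

class(dt)
class(df)
```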

The fourth package is feather. This relatively new package implements a binary file format with cross-language support. Feather was developed as a collaboration between R and Python developers to create a fast, light and language-agnostic format for storing data frames.
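A short round-trip sketch, assuming the feather package is installed. The temporary path and the iris data are only placeholders:

```r
library(feather)

# Hypothetical stand-in file: a temporary .feather path.
feather_path <- tempfile(fileext = ".feather")

# Write a data frame to the binary feather format...
write_feather(iris, feather_path)

# ...and read it back. The same file could also be opened from Python.
iris_feather <- read_feather(feather_path)

dim(iris_feather)
```

Keep in mind that you pay a one-off cost to convert your data to feather before you can enjoy the faster reads.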

Benchmarking

All right, so below is the code that we’ll be using to benchmark all the above approaches. Bear in mind that the results are only indicative and will vary with the size of the file being read. For example, a small speed-up on a small file may snowball into a massive speed-up on a larger one.

So, below is the screenshot of the benchmark performed on a file from my laptop hard drive along with the code used.
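Since the benchmark code was shared as a screenshot, here is a rough sketch of what such a benchmark could look like. It is not the exact code behind the results in this post: the microbenchmark package, the synthetic data frame, the temporary file paths and the number of repetitions are all assumptions made for illustration.

```r
library(microbenchmark)
library(readr)
library(data.table)
library(feather)
library(rio)

# Hypothetical stand-in data: replace with your own file(s).
set.seed(42)
n <- 10000
df <- data.frame(
  id    = seq_len(n),
  value = rnorm(n),
  label = sample(c("alpha", "beta", "gamma"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

csv_path     <- tempfile(fileext = ".csv")
rds_path     <- tempfile(fileext = ".rds")
feather_path <- tempfile(fileext = ".feather")

# Store the same data once per format.
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)
write_feather(df, feather_path)

# Time each reader on the same data; times = 10 is an arbitrary choice.
results <- microbenchmark(
  base_read_csv  = read.csv(csv_path),
  readr_read_csv = read_csv(csv_path, col_types = cols()),
  fread          = fread(csv_path),
  rio_import     = import(csv_path),
  readRDS        = readRDS(rds_path),
  read_feather   = read_feather(feather_path),
  times = 10
)

print(results)
```

Calling boxplot(results) afterwards gives a quick visual comparison of the timings. Swap in your own file paths and a realistic file size before drawing any conclusions.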

The results:

Benchmark for different I/O methods

Here, we have a plot for the above benchmark.

Visual representation of the above benchmark.

Conclusion

For my data set, it turned out that the fastest I/O method was fread. It may vary for the data set that you choose to work on. My data set was text-heavy; therefore, even reading from the native rds format turned out to be slow. For a more detailed explanation, head over here.

My intention with this post was to expose you to different methods of reading data. Generally speaking, feather tends to be the fastest method for I/O, but as you might have gathered by now, the overall efficiency depends on several factors: the file size, the contents of the file, what you intend to do with the file, and the overhead of converting a file to a specific format for its I/O, among others.

Therefore, for your future projects, benchmark all the options on your own datasets and then decide on a method of I/O.

A point worth noting is that both readr and the base methods can be made significantly faster by specifying the column types up front in the read function call. For example:

read.csv(file_name, colClasses = c("numeric", "factor"))
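To make this concrete, here is a small runnable sketch showing the base version next to a readr equivalent. It assumes readr is installed; the two-column file and its column names (value, group) are made up for the example.

```r
library(readr)

# Hypothetical two-column file: one numeric column, one categorical column.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(value = c(1.5, 2.5, 3.5), group = c("a", "b", "a")),
  csv_path,
  row.names = FALSE
)

# Base R: colClasses lists the type of each column, in order.
base_df <- read.csv(csv_path, colClasses = c("numeric", "factor"))

# readr: col_types names each column and its parser.
readr_df <- read_csv(
  csv_path,
  col_types = cols(value = col_double(), group = col_factor())
)

str(base_df)
str(readr_df)
```

Declaring the types saves the reader from having to guess them, and it also protects you from a column silently being read as the wrong type.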

One thing to keep in mind is that, although this post focuses on reading files, it demonstrates a wider principle: for small datasets, the speed and flexibility advantages of additional read functions can be offset by the cost of additional package dependencies (in terms of complexity and code maintenance). The real benefits kick in on large datasets.

It all might seem confusing, and maybe it is too much to wrap your head around right now, but be patient. Explore the internet and try these packages in RStudio until you are comfortable with all of these methods, and then settle on one.

Note: More experienced R users will have noticed that I did not mention the save() and load() duo. That is because I do not want you to use them at all. The reason, you ask? Hear it from Yihui himself!
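One commonly cited drawback, which may or may not be the exact point made in the linked post, is that load() restores objects under their original names and can silently overwrite whatever you already have in your workspace. The base R sketch below illustrates this with saveRDS() and readRDS() as the alternative; the variable names and temporary paths are made up for the example.

```r
x <- "original value"

# Hypothetical stand-in files: temporary paths.
rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

save(x, file = rdata_path)  # stores the object together with its name
saveRDS(x, rds_path)        # stores only the object itself

# Later in the session, x has moved on.
x <- "new value"

# readRDS() returns the object; you choose the name, x is untouched.
restored <- readRDS(rds_path)
x_after_readRDS <- x

# load() puts "x" back under its old name, overwriting the current x.
load(rdata_path)

restored
x_after_readRDS
x
```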

All the best for your projects, dear reader.

You could also buy me a coffee to support my work.

Thank You and Godspeed.
