The pros and cons of the most popular approaches to R programming
Programmers are passionate people. They’ll enter enthusiastic debates (read: heated arguments) about their favourite languages and frameworks, defending their preferred approaches from critics. Among R programmers, one of the biggest sources of debate is the choice between two frameworks: Base-R and tidyverse.
Base-R refers to all the functionality that comes built into the R programming language. The tidyverse is a collection of packages that add onto R, with its own ethos and stance on data analysis. Both are very popular, and people can’t stop debating which one is better.
Tweets from Base-R fans calling out tidyverse users for not being “real programmers” seem like an annual occurrence. It gets a little heated.
From my point of view, this rivalry is overblown. I think both approaches are simply different toolsets that you should use depending on your needs.
In this article, I’ll consider five questions that will help you choose between tidyverse and Base-R. Based on your situation, I’ll also give my verdict on which one you should choose.
Just as a carpenter wouldn’t trim floorboards with a butter knife, you should choose the right tools for the job when using R. Although Base-R and tidyverse offer much the same functionality, certain things are much easier to do in one approach than in the other.
For instance, tidyverse is often your best bet for quick and easy data manipulation. Grouping datasets by many variables to create summary statistics is much easier with packages like dplyr than with Base-R functions.
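As a rough illustration of that difference, here is a minimal sketch that computes the mean miles per gallon and horsepower for each combination of cylinders and gears in mtcars, first with dplyr and then with Base-R's aggregate(). The choice of columns and grouping variables is mine, purely for illustration.

```r
library(dplyr)

# tidyverse: group by two variables, then summarise each group
tidy_summary <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg), mean_hp = mean(hp), .groups = "drop")

# Base-R: the same group means via aggregate() with a formula
base_summary <- aggregate(cbind(mpg, hp) ~ cyl + gear, data = mtcars, FUN = mean)

tidy_summary
base_summary
```

Both versions produce the same group means. The dplyr pipeline names each step, while the Base-R call packs the grouping and the summary into a single formula.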
Yet, Base-R is better suited to other applications like running quick simulations. Depending on what your day-to-day work in R involves, your preferred framework might change.
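To make the simulation point concrete, here is a minimal sketch of the kind of quick simulation Base-R handles in a few lines: estimating how often a t-test rejects a null hypothesis that is actually true. The sample size, number of repetitions, and seed are arbitrary values chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

# Simulate 1,000 experiments: draw 30 values from a standard normal
# distribution and test whether their mean differs from zero
p_values <- replicate(1000, t.test(rnorm(30), mu = 0)$p.value)

# Proportion of false positives at the 5% level (expected to be near 0.05)
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

Everything here, from replicate() to t.test(), ships with R, so there is nothing to install or load first.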
It’s also worth considering your skill level and programming background when thinking about usability.
Beginners tend to favour tidyverse because it’s easier to read than Base-R. The syntax is consistent across functions, making it easier to learn, and the key functions have descriptive names, which enables reading code like a straightforward set of instructions.
That said, some seasoned programmers are thrown off by this and prefer the feel of Base-R. Unlike tidyverse, Base-R puts more focus on programmatic features that feel familiar to those coming from other languages.
When doing computationally expensive operations, execution time matters. In many situations, there’s a big difference in speed between Base-R and tidyverse.
To give an example of when Base-R is much faster, we can work with the mtcars dataset that’s built into R. When I benchmarked a basic operation, filtering the dataset to show only cars with six cylinders, it ran over 40 times faster in Base-R than in tidyverse!
library(dplyr)
library(microbenchmark)
results <- microbenchmark(mtcars %>% filter(cyl == 6),
                          mtcars[mtcars$cyl == 6, ]) %>%
  summary() %>%
  select(expression = expr, mean_execution_time = mean)
Sure, the tidyverse version is more readable for beginners and has other perks. But, if you’re running a script where you have to repeat that filter operation hundreds of times, a 40x performance boost is very handy.
Although there are many times when Base-R is faster than tidyverse, the opposite is sometimes true too. Even though Base-R usually wins out on speed for me, it’s worth checking on a case-by-case basis.
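One way to run that kind of check is to time both versions of the operation you actually care about. The sketch below relies on system.time() from Base-R and a hypothetical helper, time_it(), that I've named for illustration. It compares a grouped mean on a made-up data frame, assumes dplyr is installed, and makes no claim about which version wins on your machine.

```r
library(dplyr)

set.seed(1)
# Made-up data: one million rows split across 100 groups
df <- data.frame(group = sample(100, 1e6, replace = TRUE), value = rnorm(1e6))

# Hypothetical helper: elapsed seconds for running an expression `times` times
time_it <- function(expr, times = 5) {
  expr <- substitute(expr)  # capture the expression without evaluating it
  env <- parent.frame()     # evaluate it where time_it() was called
  system.time(for (i in seq_len(times)) eval(expr, env))[["elapsed"]]
}

base_time <- time_it(tapply(df$value, df$group, mean))
tidy_time <- time_it(df %>% group_by(group) %>% summarise(mean = mean(value)))

c(base = base_time, tidyverse = tidy_time)
```

For finer-grained numbers, the microbenchmark package used earlier reports a distribution of timings rather than a single total.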
Although being able to write great code on your own is important, there comes a time in every R user’s life when they must share it. Whether you’re a scientist, developer, or data analyst, having others be able to understand and work with your code is vital.
This is where you should heed your colleagues’ taste in R packages. If everyone you work with uses tidyverse, then consider defaulting to that at least some of the time to make collaboration easier. Likewise, if they all use Base-R.
Having an approach in common with your colleagues can also help when you encounter problems or stubborn bugs. Speaking from personal experience, I had a much easier time collaborating with my tidyverse-focused colleagues after I learned it myself, two years into my R journey.
That’s not to say you must limit yourself to tidyverse or Base-R based on the whims of your collaborators. Even though I and most people I work with default to using tidyverse, I write Base-R code for them every so often. But, it’s helpful to use their favoured approach as a foundation.
Beyond collaboration, one of the best things about learning R is the online community that comes with it. There are lots of people and organisations that share R tips and updates that can help you improve your code.
For both tidyverse and Base-R enthusiasts, there’s no shortage of community spirit. #RStats is a good place to pick up tips on social media. There are also plenty of blogs, on Medium and otherwise, that give Base-R and tidyverse tips.
For tidyverse fans, the weekly Tidy Tuesday initiative puts emphasis on creating stunning visualizations using tidyverse packages. The R for Data Science community has also spun out of the seminal book of the same name, authored by Hadley Wickham, co-creator of the tidyverse.
Many committed fans of Base-R have historically gathered in forums. Although many are also on social media, it seems to me that the tidyverse has more of a community presence on platforms like Twitter and Mastodon. Depending on where you spend your time online, you could learn a lot about either approach.
While the tidyverse is great, one area where it can falter is in software development. There are currently over 25 packages in the tidyverse, each requiring its own updates to stay current.
If you’re relying on lots of them for writing your own R package or other software, you can introduce lots of extra dependencies into your code. While depending on additional packages isn’t necessarily bad, it’s not ideal.
Your code’s functionality is now affected by updates to the packages it depends on; updates that you don’t control. The more dependencies you have, the harder it gets to reproduce your environment so others can run your code.
If you get serious about development with R and want to submit a package to CRAN, you’ll face strict limitations on dependencies for these (and other) reasons. Tidyverse packages can often be a no-go in this situation.
By contrast, Base-R introduces no extra dependencies. Problem solved.
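As a rough sketch of what dependency-free code looks like, here is a small function written only with Base-R. It filters rows, keeps a few columns, and adds a derived column, the sort of steps you might otherwise write with dplyr's filter(), select(), and mutate(). The function name and the kilometres-per-litre conversion are hypothetical, chosen only for illustration.

```r
# Hypothetical helper using only Base-R, so a package that contains it
# needs no extra entries in the Imports field of its DESCRIPTION file
six_cyl_efficiency <- function(data) {
  # Filter rows to six-cylinder cars and keep three columns
  out <- data[data$cyl == 6, c("mpg", "cyl", "wt"), drop = FALSE]
  # Add a derived column: US miles per gallon converted to km per litre
  out$km_per_litre <- out$mpg * 0.425144
  out
}

six_cyl_efficiency(mtcars)
```

The trade-off is readability: the bracket syntax is terser than a chain of named verbs, but the function will keep working regardless of what happens to any add-on package.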
So with all these things in mind, which should you choose: Base-R or tidyverse?
My answer: both.
Yes, it’s a cop-out. But seriously. Knowing about both approaches is the best way to expand your toolset and make sure you can tackle all kinds of tasks in R.
That said, many programmers still focus on one approach in their day-to-day work, adding parts from the other when needed. Here are a few reasons to choose each approach as your default.
Make tidyverse your default approach if:
- Most of your work involves data cleaning, visualization, and common statistics
- You’re newer to R and find it easier to read and understand than Base-R
- Most of your collaborators and online network use it too
Make Base-R your default approach if:
- Most of your work involves software or package development, advanced statistical procedures, or computationally expensive operations
- You’re used to other languages that have more in common with Base-R
- Most of your collaborators and online network use it too
This isn’t an exhaustive list of reasons to use each approach, but it can help you make the right choice for your circumstances.
As a researcher in psychology, I default to tidyverse for most of my data cleaning and simple analysis. However, I use Base-R when doing more complex statistical modelling and simulation, or when dependencies are an issue.
Most importantly, I don’t think there’s one correct approach. Using tidyverse doesn’t stop you from being a “real R programmer”, and using Base-R doesn’t stop you from writing neat code. They’re both just toolsets that you can use to make cool stuff with R.
Learn both, mix and match them, and use whatever is right for the job.