Top 15 R Libraries for Data Science in 2022


Published: 23rd Feb, 2022
Last updated: 23rd Feb, 2022
While many people opt for Python for data science tasks today, R remains a staple in the data scientist's toolkit. With its clean syntax, the pipe operator, and the ability to chain functions, R makes simple tasks like exploratory analysis or visualization remarkably easy, and it holds its ground on complex tasks like forecasting and modelling too. All in all, R is stronger than ever, with an ever-expanding list of libraries on the CRAN repository. In this article, we'll walk through some old staples and some newer R libraries for data science. You can learn more about data science with a structured data science course.

Top 15 R Libraries for Data Science in 2022

1. dplyr

dplyr (think "dataframe pliers") is perhaps the most used library in the tidyverse set of libraries. The tidyverse is a collection of data manipulation and cleansing libraries that work well together, can be chained together, and are maintained by the same organization.

With dplyr, you can easily perform data manipulation tasks. Each function is a verb that does exactly what it says it does. Some of the most commonly used functions in dplyr are select(), mutate(), filter(), summarise() and arrange(). 

A common paradigm in all tidyverse R libraries for data science is to use the pipe operator, %>%, which allows us to chain or pipe functions together. For example, you can use the syntax of, 

dataframe %>% select(col1, col2) %>% summarise(average = mean(col1)) 

The pipe operator takes the result of one function and passes it straight to the next, so processing flows left to right. This makes for clean, readable code that shows exactly what is happening.
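As a concrete sketch (using the built-in mtcars dataset, so nothing here is specific to your own data), the verbs chain together like this:

```r
library(dplyr)

# Filter, derive, and summarise in one readable pipeline
result <- mtcars %>%
  filter(cyl == 4) %>%               # keep only 4-cylinder cars
  select(mpg, wt) %>%                # keep two columns of interest
  mutate(wt_lbs = wt * 1000) %>%     # weight is recorded in 1000s of lbs
  summarise(avg_mpg = mean(mpg))     # average fuel efficiency

result
```

Each verb returns a dataframe, which is exactly why the chain composes so cleanly.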

2. tidyr

tidyr is the cousin of dplyr. While dplyr focuses on data wrangling and manipulation, tidyr's only priority is tidying the data from a format perspective. tidyr defines tidy data with the following tenets: 

  • Every column is a variable. 
  • Every row is an observation. 
  • Every cell is a single value. 

Data often arrives in formats such as JSON, which make sense from a programmer's perspective but not much from a data scientist's. These can be handled with tidyr's unnest_longer() function. The process is called rectangling; in other words, taking nested data and converting it into, you guessed it, rectangular data. 

Another super important task is pivoting. If you're familiar with Excel, you'll know pivoting data is a crucial step in any data analyst's playbook. The pivot_longer() and pivot_wider() functions, introduced in tidyr 1.0.0, handle this and replace the older gather() and spread() approaches. 

The last noteworthy task is completion, handled by the complete(), drop_na(), fill() and replace_na() functions. These make your data frame more "complete" and handle missing values by removal, inference, or imputation. 
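Here's a small sketch with a made-up sales table; the column names and values are invented purely for illustration:

```r
library(tidyr)

# Wide format: one column per year, with one missing value
sales <- data.frame(product = c("A", "B"),
                    `2020`  = c(10, NA),
                    `2021`  = c(15, 25),
                    check.names = FALSE)

# Pivot to long (tidy) format: one row per product-year observation
long <- sales %>%
  pivot_longer(c(`2020`, `2021`), names_to = "year", values_to = "units")

clean <- drop_na(long, units)   # drop the observation with a missing value
```

pivot_wider(long, names_from = year, values_from = units) would take it back to the wide layout.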

As you may have noticed, the tidyverse R libraries for data science focus on readability, which makes each iteration an improvement over the last. Each function is a clear verb that barely needs a definition. 

3. readr

You may be wondering why you'd need a separate library to read data when base R handles everything just fine. That's because readr offers some nifty improvements over the reading functions in base R. They aren't life-changing, but they are good to have: 

  • They provide a progress bar if the dataset is too large and takes time to load. So, you don't sit there thinking your R session has crashed. 
  • They are faster than the base R functions; the speed-up varies with the size of the dataset, but is often up to 10x. 
  • They read strings as character vectors instead of factors, and parse most common date/time formats automatically, unlike base R.
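A quick sketch, writing a tiny CSV to a temporary file so the example is self-contained:

```r
library(readr)

# Write a tiny CSV to a temp file, then read it back
path <- tempfile(fileext = ".csv")
writeLines(c("name,joined", "Ana,2021-05-01", "Ben,2022-01-15"), path)

df <- read_csv(path, show_col_types = FALSE)
# readr guesses that 'joined' is a Date; base read.csv() would leave it a string
```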

4. stringr 

Since we mentioned strings in the last library, let's talk about stringr. Base R doesn't handle strings gracefully; working with them as vectors feels clunky, especially when Python ships with a plethora of built-in string functions. stringr brings those functions (or their equivalents) to R.

The library covers the classic use-cases with functions such as str_length() and str_c() (concatenate). There are also seven pattern-matching functions in stringr, which make string search and count tasks much easier. Patterns can be fixed strings or regular expressions. 
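A few of the staples in action:

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

str_length(fruits)                  # 5 6 6
str_c(fruits, collapse = ", ")      # "apple, banana, cherry"
str_detect(fruits, "an")            # FALSE TRUE FALSE
str_count("banana", "na")           # 2 -- patterns can also be regex
```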

5. ggplot2

If you know anything about R, you've probably heard of ggplot2. ggplot2 is the most popular way to visualize data in R. It's also part of the tidyverse stack which means it integrates seamlessly with the other tidyverse libraries.  

The idea behind ggplot2 is the Grammar of Graphics: you have data, variables, and aesthetics (colour, axes, etc.). You provide the data, map variables to aesthetics, and the library handles the rest. The ggplot2 syntax relies on geometries, or geoms, and different geoms create different charts, geom_point() and geom_histogram() to name a couple. 

ggplot2 also offers some additional customisations like legends, themes, labels etc. which make it the most comprehensive plotting library available for R. 
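A minimal sketch of the grammar, again on the built-in mtcars data:

```r
library(ggplot2)

# Data -> aesthetic mappings -> geom layer -> optional labels
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel efficiency vs weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")

# In an interactive session, printing p renders the chart
```

Swapping geom_point() for another geom changes the chart type while the rest of the grammar stays the same.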

6. lubridate 

Dates are the usual suspects when an analysis goes wrong or the data makes little sense. That's because dates are rarely parsed correctly and reliably out of the box; we often have to select the locale manually, work out the format, parse it and so on. 

lubridate makes handling dates much easier with simple functions that parse datetime values automatically. It also has dedicated parsers such as ymd(), dmy() and mdy(), which read dates written in different orders into proper Date objects. Similar parsers are available for time and datetime values as well. 

Another core feature here is value extraction. Once a datetime value is parsed, functions like year(), month(), wday(), mday(), hour(), minute(), second() extract the relevant values for you to quickly use them without some clunky formatter or string subsetting. This makes your code more reliable as well. 
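A short sketch of parsing and extraction:

```r
library(lubridate)

d1 <- ymd("2022-02-23")    # year-month-day
d2 <- dmy("23/02/2022")    # day-month-year; separators don't matter
d1 == d2                   # TRUE -- both parse to the same Date

year(d1)                   # 2022
month(d1)                  # 2
wday(d1, label = TRUE)     # day of week as a labelled factor
```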

7. jsonlite 

If you've worked with data, you know how common the JSON format is, not only as input but often as a required deliverable. JSON can be a hassle to parse: there are format issues, and other things go wonky now and then. Enter jsonlite. jsonlite has functions for parsing, generating and prettifying JSON. It's easy to get started with and works out of the box. The toJSON() and fromJSON() functions are its core, and it supports streaming for both input and output.  
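A round trip looks like this (the dataframe is invented for illustration):

```r
library(jsonlite)

df <- data.frame(id = 1:2, name = c("Ana", "Ben"))

json <- toJSON(df, pretty = TRUE)   # dataframe -> JSON array of objects
back <- fromJSON(json)              # JSON -> dataframe again
```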

8. Shiny

Shiny is an interesting R Library because it does more than what you'd expect from R. Managed and developed by RStudio itself, Shiny lets you create and publish interactive dashboards and applications with your R code. 

The core philosophy behind Shiny is reactivity, expressed through reactive() components. Reactivity means that any change in the data or an upstream component is reflected in the components that depend on it. In other words, if the data changes, so do the visualisations, functions, tables et al.  

Shiny lets you use almost all HTML and CSS tags to style your apps and dashboard as required. It has a learning curve of its own but at the heart of it, it's still your analysis running. Shiny expertise is a much sought-after skill today as the landscape moves to quick analysis, interactivity and real-life dashboarding. 
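A minimal sketch of a reactive app; moving the slider re-renders the histogram:

```r
library(shiny)

ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # Re-runs automatically whenever input$bins changes
    hist(faithful$eruptions, breaks = input$bins)
  })
}

app <- shinyApp(ui, server)
# runApp(app) launches it in the browser
```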

9. tseries 

Time series analysis is a popular use-case, and the tseries library facilitates exactly that with functions for reading time series, running statistical tests, plotting OHLC charts and so on. The tseries functions lean towards financial time series analysis but are general-purpose enough for other cases as well. 

For example, tseries can plot OHLC data, the Open, High, Low and Close prices for stocks, using the plotOHLC() function. This is a standard stock-market analysis view that helps us compare stock trends. We can equally use tseries to chart any time series, such as weather or rainfall data. 
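A sketch with made-up OHLC values (real use would load actual price data):

```r
library(tseries)

# Hypothetical three-day OHLC series
prices <- ts(cbind(Open  = c(100, 102, 101),
                   High  = c(103, 104, 102),
                   Low   = c( 99, 101, 100),
                   Close = c(102, 101, 101)))

# plotOHLC(prices) draws the open-high-low-close bar chart

# tseries also ships statistical tests, e.g. the augmented
# Dickey-Fuller test for stationarity
result <- adf.test(as.ts(rnorm(100)))
```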

It's a nifty library with some really simple functions to make time series analysis tasks easier.

10. Prophet

Prophet by Facebook is one of the most popular forecasting libraries around. Its ease of setup and use make it the go-to library for anyone trying to forecast anything today. Prophet follows the standard R API of model fitting and returns a model object that you can plot() or predict() from. The library shines with its add_regressor() function, which lets you add as many additional regressors as you need. A regressor is any variable used to predict the response variable, and in forecasting the ability to add regressors often improves accuracy, since multiple inputs may affect the trend. 

For example, if you're predicting crop yield over time, you can add rainfall measurements as an additional regressor, along with any others that improve the forecast. The catch is that regressor data must be available for the period you're forecasting; if it isn't, you can always use Prophet to forecast the regressors themselves first. This widens the error margin but lets you use regressors that lack data for the forecast interval. 

Interestingly, prophet_plot_components() is a function that also gives you a component plot which shows the trend as well as the other timeseries components such as yearly, monthly or weekly plots. 
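A minimal sketch with an invented daily series; Prophet expects a dataframe with a ds (date) column and a y (value) column:

```r
library(prophet)

# Invented daily history: a smooth signal plus noise
history <- data.frame(
  ds = seq(as.Date("2021-01-01"), by = "day", length.out = 90),
  y  = sin(seq_len(90) / 7) + rnorm(90, sd = 0.1)
)

m <- prophet(history)                          # fit the model
future   <- make_future_dataframe(m, periods = 14)
forecast <- predict(m, future)

# plot(m, forecast) draws the forecast;
# prophet_plot_components(m, forecast) shows trend and seasonality
```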

11. RColorBrewer

While we've talked a lot about libraries that make life easier, RColorBrewer is a library that makes life fun! With this simple library, you can create palettes of colours to use in your ggplot2 plots. This is especially useful if you're creating plots for a company or organization that takes its brand seriously. If nothing else, it makes for plots that look better than the standard colours ggplot2 ships with.  
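Grabbing a palette takes one call:

```r
library(RColorBrewer)

# Four colours from the qualitative "Set2" scheme, as hex codes
pal <- brewer.pal(4, "Set2")
pal

# display.brewer.all() previews every available palette;
# in ggplot2, scale_fill_brewer(palette = "Set2") applies one directly
```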

12. githubinstall

As you may know, R libraries for data science come from the CRAN repository and its mirrors, such as MRAN. However, CRAN review takes time, and urgent fixes often ship on a library's GitHub page before they reach CRAN. Some libraries aren't on CRAN at all but still have fully maintained GitHub repositories. Be careful when installing libraries that haven't been vetted by CRAN, though. 

In cases like these, you can install packages straight from GitHub, and githubinstall makes that as easy as one line of code. You can also choose which branch to install from, among other parameters, which makes life much simpler if you're a power user who likes to stay up to date with new libraries. 

13. ggmap

Where would data visualization be if not for maps? The average person is rarely interested in graphs or charts, but show them their own state or country and they go "Aha!" in a split second. ggmap does exactly that. With a plethora of functions that let you select a map, choose a centre, and layer on any ggplot visualization, it makes plotting on maps much easier. 

You can also select map types with the appropriate parameters; people won't know your visualisations were created in R. It gets even better with integrations such as the Google Geocoding API, which work out of the box. You need an API key and a one-time configuration, but functions like geocode() make leveraging these APIs much simpler. 

Similar integrations are also available for OpenStreetMap and so on. 

14. sqldf

If you've worked in data analysis before, you may be experienced in SQL; regardless of the technology or language in vogue, SQL never leaves the room. sqldf takes that a step further: it lets you query your R dataframes as if they were SQL tables. Once the library is loaded, you pass SQL statements that reference your dataframe variables to the sqldf() function. It's as simple as sqldf("SELECT * FROM df").  
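A sketch with an invented dataframe:

```r
library(sqldf)

df <- data.frame(name  = c("Ana", "Ben", "Cal"),
                 score = c(90, 75, 82))

# Query the dataframe as if it were a SQL table
top <- sqldf("SELECT name, score FROM df WHERE score > 80 ORDER BY score DESC")
```

Behind the scenes, sqldf loads the dataframe into an in-memory SQLite database, runs the query, and returns the result as a dataframe.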

15. caret

If you're doing modelling, it helps to have caret in your toolkit. caret stands for Classification And REgression Training and is one of the most popular R libraries for data science. Its sole purpose is to make model building and training easier in R. You could call it the equivalent of scikit-learn in Python, although in my experience each has its own advantages. 

caret has functions to split data, train models with different classifiers (specified via the method parameter), and even offers a GridSearchCV equivalent for hyperparameter tuning through the tuneGrid argument of train(). Grid search and hyperparameter tuning in general make caret a fairly advanced library. 

Overall, Caret supports all standard classifiers and regressors. It also creates plots for your training process as well as the tuneGrid comparisons. The parameters to train() are powerful enough to let you control different resampling methods, performance metrics and so on. 
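A sketch of the split-train-tune workflow on the built-in iris data, using k-nearest neighbours as the classifier:

```r
library(caret)

set.seed(42)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# tuneGrid plays the role of scikit-learn's GridSearchCV parameter grid
fit <- train(Species ~ ., data = train_set,
             method    = "knn",
             tuneGrid  = expand.grid(k = c(3, 5, 7)),
             trControl = trainControl(method = "cv", number = 5))

preds <- predict(fit, newdata = test_set)
```

Cross-validation picks the best k from the grid, and the same train() call works for dozens of other model types by changing method.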

Caret may be one of the most powerful R libraries to ever exist. 

A Final Note

While these libraries serve different purposes, there is no one-size-fits-all when it comes to R, and that's what makes it so versatile. There are countless R libraries for data science. You can use data.table instead of the tidyverse functions and still get the same jobs done; glm() covers some of caret's modelling use-cases; plotly handles plotting as well as, if not better than, ggplot2. Instead of taking this list as a single source of truth, we urge you to explore and find the libraries that best fit your use-cases, programming style and the paradigms of your organization. 

Frequently Asked Questions 

1. Is R Good for Data Science?

Yes, R is still fantastic for data science. While adoption of Python has increased over the years, R has kept pace and stayed in the competition. In fact, KnowledgeHut provides a data science course covering R which might benefit you.  

2. What Does Library() Do in R?

library() is the command to load a library into your R session. The parentheses contain the name of the library. If the library is not installed, library() throws an error. 

3. How Do Libraries Work in R? 

When we load a library with the library() function, all of its exported functions become available for use. Alternatively, we can use the library_name::function_name() syntax to call a function without loading the whole library; this works as long as the library is installed. To explore more useful tips, try the KnowledgeHut data science course.  

Author

Deepansh Khurana

Deepansh Khurana is a data enthusiast with a penchant for exploratory analysis. He enjoys the open-ended nature of data analysis in general, and lives for coincidental insights and "aha!" moments. When not coding or playing with numbers, he writes prose and embraces art. He also writes a self-help newsletter to nudge people into better balance.