Top 15 R Programming Interview Questions and Answers for 2024

Beginner

1.
Table 1 holds the value of a specific parameter for each country and year. Provide an approach, or write a function in R, to reshape the data into the layout shown in Table 2 (the desired layout). Briefly explain your response.
Table 1 (Given input data)
Country 2011 2012 2013
Japan 2300 3100 6800
China 2700 3300 5400
India 4800 6200 9500
Assume this data exists in your data frame in R as “my_df”
Table 2 (Expected desired layout as output)
Country Year n
Japan 2011 2300
China 2011 2700
India 2011 4800
Japan 2012 3100
China 2012 3300
India 2012 6200
Japan 2013 6800
China 2013 5400
India 2013 9500
The objective is to capture the count (n) in a separate row for every country and every year.

We can use the gather() function from the tidyr package to accomplish this. (In current tidyr releases gather() is superseded by pivot_longer(), but it still works.)

Below is the required code.

# Load the tidyr package
library(tidyr)
# Reshape the data from the wide layout to the long layout
gather(my_df, "Year", "n", 2:4, convert = TRUE)

The parameters of gather() are:

my_df is the first parameter: the data frame to reshape.
“Year” is the second parameter: the name of the new key column, given as a character string.
“n” is the third parameter: the name of the new value column.
2:4 is the fourth parameter: the names or numeric indexes of the columns to collapse from the input dataset.
“convert = TRUE” is the last parameter: it converts the values in the key column, which start out as character strings taken from the old column names, to an appropriate type (here the years become numbers).
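To make the call above reproducible, here is a minimal sketch that first rebuilds Table 1 as a data frame and then reshapes it. How my_df is constructed is an assumption (the question only says it exists), and long_df is an illustrative name.

```r
library(tidyr)

# Rebuild Table 1; check.names = FALSE keeps "2011" etc. as column names
my_df <- data.frame(
  Country = c("Japan", "China", "India"),
  `2011` = c(2300, 2700, 4800),
  `2012` = c(3100, 3300, 6200),
  `2013` = c(6800, 5400, 9500),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse columns 2 to 4 into key/value pairs; convert = TRUE turns the
# year labels taken from the column names into numbers
long_df <- gather(my_df, "Year", "n", 2:4, convert = TRUE)
print(long_df)
```

The rows come out grouped by year (all countries for 2011, then 2012, then 2013), which matches the order of Table 2.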

This is one of the most frequently asked R programming interview questions for freshers and experienced professionals in recent times.

2.
The sample data in Table 1 has 4 columns, including a date column, “Col4”. Provide an approach using the separate() function in R to convert the data to the layout shown in Table 2. Explain briefly.
Table 1 (Input data layout)
Col1 Col2 Col3 Col4
AA 110 1007 2002-08-11
BB 45 1009 1999-08-12
CC 65 1005 2002-04-13
DD 40 1013 2001-08-14
EE 50 1010 2002-01-15
FF 45 1010 2002-07-16
Assume this data exists in your data frame in R as “my_df”
Table 2 (Expected desired layout as output)
Col1 Col2 Col3 year month day
AA 110 1007 2002 08 11
BB 45 1009 1999 08 12
CC 65 1005 2002 04 13
DD 40 1013 2001 08 14
EE 50 1010 2002 01 15
FF 45 1010 2002 07 16

We can use separate() to split the date field into three separate columns for the year, month and day values.

# This will load the tidyr package
library(tidyr)
# This will reshape the data in desired format
separate(my_df, Col4, c("year","month","day"),sep = "-")

separate() takes the following parameters to produce the desired format:

The first parameter is the data frame, which is my_df.
The second parameter is the column to split, here the date column. Any column can be split as needed.
The third parameter gives the names of the new columns to create.
The fourth parameter is the separator to split on. If it is omitted, separate() splits on any run of non-alphanumeric characters.
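The call above can be tried end to end with the following sketch. The construction of my_df from Table 1 is an assumption for illustration, and split_df is a made-up name.

```r
library(tidyr)

# Rebuild Table 1 with the date stored as text in Col4
my_df <- data.frame(
  Col1 = c("AA", "BB", "CC", "DD", "EE", "FF"),
  Col2 = c(110, 45, 65, 40, 50, 45),
  Col3 = c(1007, 1009, 1005, 1013, 1010, 1010),
  Col4 = c("2002-08-11", "1999-08-12", "2002-04-13",
           "2001-08-14", "2002-01-15", "2002-07-16"),
  stringsAsFactors = FALSE
)

# Split Col4 on "-" into three new character columns
split_df <- separate(my_df, Col4, c("year", "month", "day"), sep = "-")
print(split_df)
```

The new columns stay character vectors, which is why the leading zero in months such as "08" is preserved.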

3.
Given below are a sample input dataset and a code snippet. When we execute the code in Figure 2 against the dataset in Figure 1, is the output the same as the input data? Explain your response.

Figure 1 (input dataset)

Col1	Col2	Col3	Col4
AA	110	1007	2002-08-11
BB	45	1009	1999-08-12
CC	65	1005	2002-04-13
DD	40	1013	2001-08-14
EE	50	1010	2002-01-15
FF	45	1010	2002-07-16

Assume this data exists in your data frame in R as “my_df”

Figure 2 (code snippet)

my_df %>%
  separate(Col4,c("year","month","day")) %>%
  unite("Col4",month,day,year,sep = "/")

The output data will not be the same as the input.

The output will look like this:

Col1	Col2	Col3	Col4
AA	110	1007	08/11/2002
BB	45	1009	08/12/1999
CC	65	1005	04/13/2002
DD	40	1013	08/14/2001
EE	50	1010	01/15/2002
FF	45	1010	07/16/2002

The difference is in the format of Col4, which holds the date value.
separate() splits the date column into 3 parts: year, month and day.
unite() then joins these 3 parts back into a single column, again named Col4.
However, the parts are joined in the order month, day, year with "/" as the separator, so the date format changes from YYYY-MM-DD to MM/DD/YYYY.

Here we are converting from a non-tidy format to a tidy format and then back to a non-tidy format.
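As a quick check of the round trip, the sketch below runs the same pipeline on two rows of the input. The reduced my_df (only Col1 and Col4, two rows) and the name round_trip are assumptions to keep the example short; magrittr is loaded explicitly to provide %>%.

```r
library(tidyr)
library(magrittr)  # provides the %>% pipe

# Two rows from Figure 1, reduced to the columns that matter here
my_df <- data.frame(
  Col1 = c("AA", "DD"),
  Col4 = c("2002-08-11", "2001-08-14"),
  stringsAsFactors = FALSE
)

# Split the date, then glue the parts back together in a different order
round_trip <- my_df %>%
  separate(Col4, c("year", "month", "day")) %>%
  unite("Col4", month, day, year, sep = "/")

print(round_trip)
```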

This is one of the most frequently asked R programming interview questions and answers for freshers in recent times.

4.
When to use the following functions: apply(), lapply(), sapply(), tapply() in R? Explain.

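The differences are the following:

apply(): Applies a function over the rows or columns of a matrix or array; often used as an alternative to a for() loop.
lapply(): Applies a function to every element of a list or vector and returns the result as a list.
sapply(): Works like lapply() but simplifies the result to a vector or matrix where possible.
tapply(): Applies a function to subsets of a vector defined by a grouping factor, similar to the aggregate() function.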
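The four functions can be compared side by side in base R. This is a minimal sketch on made-up inputs; m, scores, values and groups are illustrative names, not part of the question.

```r
# A small matrix and a list to exercise each function
m <- matrix(1:6, nrow = 2)          # 2 rows, 3 columns
scores <- list(a = 1:3, b = 4:6)

# apply(): work over rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

# lapply(): one result per element, always returned as a list
mean_list <- lapply(scores, mean)

# sapply(): same computation, simplified to a named numeric vector
mean_vec <- sapply(scores, mean)

# tapply(): apply a function to the groups of a vector defined by a factor
values <- c(10, 20, 30, 40)
groups <- c("x", "y", "x", "y")
group_sums <- tapply(values, groups, sum)

print(row_sums)
print(mean_list)
print(mean_vec)
print(group_sums)
```

The only difference between mean_list and mean_vec is the container: a list in the first case, a named vector in the second.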

5.
Are the following code snippets the same or different? Explain why.

flights_mutate1 <- flights %>%
  mutate(speed = distance / air_time * 60) %>%
  select(carrier, arr_delay, speed)
flights_mutate2 <- flights %>%
  select(carrier, arr_delay, speed) %>%
  mutate(speed = distance / air_time * 60)

These are NOT the same. flights_mutate1 runs as expected, whereas flights_mutate2 throws an error. select() cannot be used first because the derived variable “speed” does not exist yet. It has to be created with mutate() first; only then can select() extract it along with the other variables from the data frame.
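The ordering issue can be reproduced without the full dataset. In this sketch, flights_small is a hypothetical three-row stand-in for flights with made-up values, and tryCatch() is used only to record whether the second pipeline fails.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (values are made up)
flights_small <- data.frame(
  carrier   = c("UA", "AA", "DL"),
  arr_delay = c(11, 20, -5),
  distance  = c(1400, 1089, 762),
  air_time  = c(227, 160, 116)
)

# mutate() first: speed exists by the time select() runs
ok <- flights_small %>%
  mutate(speed = distance / air_time * 60) %>%
  select(carrier, arr_delay, speed)

# select() first: speed is not a column yet, so select() raises an error
failed <- tryCatch(
  {
    flights_small %>%
      select(carrier, arr_delay, speed) %>%
      mutate(speed = distance / air_time * 60)
    FALSE
  },
  error = function(e) TRUE
)

print(ok)
print(failed)
```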

6.
We have a sample dataset called “pollution”. How do we use R functions to compute the median and variance of that dataset? Secondly, what changes to the code are needed to also display the number of observations in the dataset?

We can use the summarise() function from the dplyr package, which collapses the data into summary statistics such as the mean, median and variance.

We can add n() to the same summarise() call to get the number of observations as well.
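A minimal sketch of such a summarise() call follows. The pollution data frame here is a hypothetical stand-in: its column names (city, size, amount) and values are assumptions for illustration, as are the output column names.

```r
library(dplyr)

# Hypothetical stand-in for the "pollution" dataset
pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56)
)

# One row of summary statistics; n() adds the number of observations
stats <- pollution %>%
  summarise(
    median_amount   = median(amount),
    variance_amount = var(amount),
    n_obs           = n()
  )

print(stats)
```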

Expect to come across this, one of the most important R programming interview questions for experienced professionals in programming, in your next R interviews.

7. What is the difference between "n()" and "n_distinct()" functions in R? Explain with an example.

n() returns the number of rows in the current group (it takes no arguments), whereas n_distinct() returns the number of distinct values in a vector. For example, consider the sample “flights” dataset in R.

We first remove the NA values from air_time and distance before calling summarise().

n() then counts the total number of flights (rows) in the dataset, and n_distinct() gives the number of distinct carriers (airlines) in the dataset, which is 16.
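The same pattern can be sketched on a tiny data frame. flights_small and its values are hypothetical, so the counts here are 4 and 3 rather than the figures for the real dataset.

```r
library(dplyr)

# Hypothetical five-row stand-in for the flights data
flights_small <- data.frame(
  carrier  = c("UA", "UA", "AA", "DL", "AA"),
  air_time = c(227, NA, 160, 116, 150),
  distance = c(1400, 1416, 1089, 762, 1089)
)

counts <- flights_small %>%
  filter(!is.na(air_time), !is.na(distance)) %>%  # drop rows with NA values
  summarise(
    n_flights  = n(),                   # rows left after filtering
    n_carriers = n_distinct(carrier)    # distinct carrier codes
  )

print(counts)
```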

8. What is tidy data in R? Explain with an example.

Datasets come in many formats, but the tidyverse tools in R work best with one format: tidy data. The tidyr package in R helps reshape data into this format. For example, consider the “pollution” dataset:

Each variable is saved in its own column, each observation is saved in its own row, and each type of observational unit is stored in its own table (here, “pollution”). With this layout, R's vectorised operations automatically keep the values of each observation together.

library(tidyr) loads the package. If it is not installed yet, install it first with install.packages("tidyr").

Advanced

1.
What is the difference between "%>%" and "%<>%"?

Both operators come from the magrittr package. The differences are the following:

%>% is the forward pipe: it passes the left-hand side (LHS) value as the first argument of the right-hand side (RHS) call and returns the result.
%<>% is the assignment pipe: it makes the same call, but then also updates the LHS object with the resulting value.
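A short sketch of the two operators on a plain numeric vector (x and roots are illustrative names):

```r
library(magrittr)

x <- c(9, 16, 25)

# %>% passes x to sqrt() and returns the result; x itself is unchanged
roots <- x %>% sqrt()
x_after_pipe <- x

# %<>% makes the same call and then assigns the result back to x
x %<>% sqrt()

print(roots)
print(x_after_pipe)
print(x)
```

After the second statement, x holds the square roots, while x_after_pipe shows that the plain pipe left the original values in place.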

2. Which function is used to derive new variables using the dplyr package in R - from existing variables? Explain with an example.

The mutate() function in the dplyr package derives new variables (columns) from existing variables, returning one value per row. To collapse many observations into a single summary value, use summarise() instead. Below is an example.

We take the “flights” sample data from the nycflights13 package.

We then use mutate() to derive a new variable and select() to fetch selected columns from the data frame.

library(dplyr)
library(nycflights13)
flights <- as.data.frame(flights)
flights_mutate <- flights %>% mutate(speed = distance / air_time * 60) %>% select(carrier, arr_delay, speed)

This gives the desired result. The new derived variable is “speed”, computed as distance / air_time * 60. Since nycflights13 records distance in miles and air_time in minutes, the result is the speed in miles per hour.

3.
What does kable() do in R? Suppose we use the “airlines” sample dataset and print it in the R console both directly and through kable(): do we expect any differences, or are the results the same?

kable() renders an entire data frame as a formatted text table. It comes from the knitr package in R. When we execute the two statements from the R console, the kable() statement produces output that is much more legible. It is mainly used in R Markdown, where it makes tables in the documentation clearer.
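The difference can be seen with a small sketch. airlines_small is a hypothetical three-row stand-in for the “airlines” dataset, and tbl is an illustrative name.

```r
library(knitr)

# Hypothetical three-row stand-in for the airlines lookup table
airlines_small <- data.frame(
  carrier = c("AA", "DL", "UA"),
  name    = c("American Airlines Inc.", "Delta Air Lines Inc.",
              "United Air Lines Inc."),
  stringsAsFactors = FALSE
)

# Default data frame printing
print(airlines_small)

# kable() returns a formatted text table (Markdown-style in the console)
tbl <- kable(airlines_small)
print(tbl)
```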

4.
Consider the “flights” dataset in R. How can you find out how many flights go from a particular source to a particular destination? “flights” is a sample dataset that can be used from any R console, provided the required packages are installed.

We need to group the data by source and destination using group_by(), and then use summarise() to find the number of records in each group. That gives the desired result.
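A minimal sketch of this approach follows. flights_small is a hypothetical stand-in with made-up rows, and the column names origin and dest are assumed to match those of the flights dataset.

```r
library(dplyr)

# Hypothetical stand-in for the flights data (origin and destination only)
flights_small <- data.frame(
  origin = c("JFK", "JFK", "LGA", "EWR", "JFK"),
  dest   = c("LAX", "LAX", "ORD", "ORD", "SFO"),
  stringsAsFactors = FALSE
)

# One row per origin/destination pair with the number of flights
route_counts <- flights_small %>%
  group_by(origin, dest) %>%
  summarise(n_flights = n(), .groups = "drop")

print(route_counts)
```

To look up one particular route, filter the result, for example filter(route_counts, origin == "JFK", dest == "LAX").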

5.
We have an untidy dataset as shown below. Describe how you would make it tidy, in a format that is convenient to analyze in R.
Country 2011 2012 2013
FR 7000 6900 7000
DE 5800 6000 6200
US 15000 14000 13000
Note: the data above holds the case count per year for every country, but in an untidy (wide) format.

We would like to convert it into the format below, which is tidy and can be analyzed easily in R.

Country	Year	n
FR	2011	7000
DE	2011	5800
US	2011	15000
FR	2012	6900
DE	2012	6000
US	2012	14000
FR	2013	7000
DE	2013	6200
US	2013	13000
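One way to get from the wide table to the tidy one is gather() from tidyr, the same function used elsewhere in this article. This is a sketch under assumptions: cases_wide and cases_long are made-up names, and the data frame is rebuilt by hand from the table above.

```r
library(tidyr)

# Rebuild the untidy table; check.names = FALSE keeps the year column names
cases_wide <- data.frame(
  Country = c("FR", "DE", "US"),
  `2011` = c(7000, 5800, 15000),
  `2012` = c(6900, 6000, 14000),
  `2013` = c(7000, 6200, 13000),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse the three year columns into Year/n pairs
cases_long <- gather(cases_wide, "Year", "n", 2:4, convert = TRUE)
print(cases_long)
```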

A staple in senior R language interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the top interview questions to ask an R programmer.

6.
The sample dataset “cases” in R has the following data. Write code, or describe an approach, to tidy the data into the expected format shown below, and explain it.
Dataset sample:
Expected format of dataset:

We need to use the gather() function to reshape the dataset into a tidy format in R so that the expected output is achieved.

The first parameter of gather() is the data frame that needs to be reshaped. The second parameter is the name of the new key column, which is “year” here since we want to show the number of cases by year and by country. The third parameter is the name of the new value column, which is the count here. The fourth parameter gives the names or numeric indexes of the columns to collapse. There are different ways to achieve this, but the important part is to think about the approach and see how powerful packages such as tidyr can be leveraged.

7.
We have a sample dataset on tuberculosis (tb) from the EDAWR package. Will the two approaches below give the same or different results? Explain why.

Code snippet 1:

EDAWR::tb %>% gather("age","cases",4:6) %>% arrange(country, year, sex, age)

Code snippet 2:

EDAWR::tb %>% gather("age","cases",child:elderly) %>% arrange(country, year, sex, age)

Both code snippets yield the same output.

This is because 4:6 and child:elderly select the same three columns, once by column index and once by column name, so gather() reshapes the data identically in both cases.

The subsequent arrange() call then sorts both results by country, year, sex and age, giving the same organized output.
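Because EDAWR may not be installed, the equivalence can be checked on a small stand-in. tb_small below is hypothetical: it assumes the column layout country, year, sex, child, adult, elderly and uses made-up counts.

```r
library(tidyr)
library(dplyr)

# Hypothetical two-row stand-in for EDAWR::tb
tb_small <- data.frame(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1999, 1999),
  sex     = c("female", "male"),
  child   = c(1, 2),
  adult   = c(10, 20),
  elderly = c(5, 6),
  stringsAsFactors = FALSE
)

# Select the columns to collapse by position ...
by_index <- tb_small %>%
  gather("age", "cases", 4:6) %>%
  arrange(country, year, sex, age)

# ... and by name range
by_name <- tb_small %>%
  gather("age", "cases", child:elderly) %>%
  arrange(country, year, sex, age)

print(identical(by_index, by_name))
```

Selecting by name is usually the more robust choice, since it keeps working if columns are later reordered.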

This, along with other basic R interview questions for freshers, is a regular feature in R programming interviews. Be ready to tackle it with the approach mentioned.
