# R Programming Interview Questions Data Science

Barely 2 hours of your daily study time dedicated to these R programming interview questions can actually help convert your next R interview into a top job offer.


The differences are the following. Both operators come from the magrittr package.

%>% pipes the left-hand side (LHS) into the right-hand side (RHS) call as its first argument and returns the result.

%<>% pipes the LHS into the RHS call in the same way, and then updates the LHS object with the resulting value.
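As a minimal sketch of the difference between the two operators, the snippet below pipes a small numeric vector into sqrt(). The vector and the variable names x and y are made up for illustration.

```r
library(magrittr)

x <- c(9, 16, 25)

# %>% passes x to sqrt() and returns the result; x itself is not modified
y <- x %>% sqrt()

# %<>% passes x to sqrt() and then assigns the result back to x
x %<>% sqrt()

print(y)
print(x)
```

After the second call, x holds the square roots, because %<>% overwrote it in place.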

The mutate() function in the dplyr package derives new variables (columns) from existing variables and keeps one row per observation. To condense existing observations into summary values, use the summarise() function instead. Below is an example:

Take the flights data from the “nycflights13” package and view its first few records.

Now use the mutate() function to derive a new variable, and the select() function to fetch selected columns from the data frame:

library(dplyr)
library(nycflights13)
flights <- as.data.frame(flights)
flights_mutate <- flights %>% mutate(speed = distance / air_time * 60) %>% select(carrier, arr_delay, speed)

This gives the desired result (only a few records of the data frame are shown). The new derived variable is “speed”, computed with the formula [distance / air_time * 60]. Because air_time is recorded in minutes and distance in miles, speed comes out in miles per hour.
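To make the contrast between mutate() and summarise() concrete, here is a small sketch on a made-up data frame. The name toy and its three rows are hypothetical stand-ins for a few flights records.

```r
library(dplyr)

# Hypothetical toy data standing in for a few rows of flights
toy <- data.frame(
  carrier  = c("AA", "AA", "UA"),
  distance = c(1000, 500, 750),  # miles
  air_time = c(120, 60, 100)     # minutes
)

# mutate() keeps every row and adds a derived column
with_speed <- toy %>% mutate(speed = distance / air_time * 60)

# summarise() collapses the rows into a single summary row
overall <- toy %>% summarise(mean_distance = mean(distance))

print(with_speed)
print(overall)
```

The first result still has three rows, while the second has only one.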

Question Continued:

The kable() function displays an entire data frame as a formatted table. It comes from the knitr package in R. When the two statements above are executed from the R console, the kable() statement produces much more legible output. It is mainly used in R Markdown, where it makes documentation clearer.
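The two statements from the question are not reproduced here, so the sketch below assumes they were a plain print of a data frame and a kable() call on the same data. The data frame df is hypothetical.

```r
library(knitr)

# Hypothetical small data frame
df <- data.frame(carrier = c("AA", "UA"), flights = c(120L, 95L))

# Default console printing
print(df)

# kable() returns a text table with header and column separators
formatted <- kable(df)
print(formatted)
```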

Below is a snapshot of the differences when executing from the R console.

Question Continued:

We need to group the data by source and destination using the group_by() function, and then use summarise() to find the number of records in each group. That will provide the desired result.
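A minimal sketch of this grouping step follows. The question's data is not shown, so the routes data frame is made up, and the column names origin and dest are an assumption borrowed from the flights dataset.

```r
library(dplyr)

# Hypothetical routes; origin and dest are assumed column names
routes <- data.frame(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  dest   = c("LAX", "LAX", "ORD", "SFO")
)

# Count the records for every origin/destination pair
route_counts <- routes %>%
  group_by(origin, dest) %>%
  summarise(n = n(), .groups = "drop")

print(route_counts)
```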

Question Continued:

| Country | 2011 | 2012 | 2013 |
|---|---|---|---|
| FR | 7000 | 6900 | 7000 |
| DE | 5800 | 6000 | 6200 |
| US | 15000 | 14000 | 13000 |

P.S.: The data above holds the case count per year for every country, but it is represented in an untidy format.

We would like to convert it into the format below, which is tidy and can be analyzed easily in R.

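| Country | Year | n |
|---|---|---|
| FR | 2011 | 7000 |
| DE | 2011 | 5800 |
| US | 2011 | 15000 |
| FR | 2012 | 6900 |
| DE | 2012 | 6000 |
| US | 2012 | 14000 |
| FR | 2013 | 7000 |
| DE | 2013 | 6200 |
| US | 2013 | 13000 |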

Question Continued:

Dataset sample:

Expected format of dataset:

We need to use the gather() function to reshape the dataset into tidy format in R so that the expected output can be achieved.

gather() takes four main parameters. The first is the data frame that needs to be reshaped. The second is the name of the new key column, which is “year” here since we want to show the number of cases by year and by country. The third is the name of the new value column, which holds the count here. The fourth gives the names or numeric indexes of the columns to collapse. There are different ways to achieve this; the important part is to think through the approach and see how powerful packages such as “tidyr” can be leveraged to accomplish it.
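The four parameters can be sketched on the country table from the question. The data frame name cases is hypothetical, and the value column is called n to match the expected output. gather() is superseded in current tidyr, so a pivot_longer() call is included as an alternative that yields the same values in a different row order.

```r
library(tidyr)

# The untidy table; "cases" is a hypothetical name for the data frame
cases <- data.frame(
  Country = c("FR", "DE", "US"),
  `2011`  = c(7000, 5800, 15000),
  `2012`  = c(6900, 6000, 14000),
  `2013`  = c(7000, 6200, 13000),
  check.names = FALSE  # keep the year column names unchanged
)

# Parameters: data frame, key column name, value column name, columns to collapse
tidy_cases <- gather(cases, "year", "n", 2:4)

# Alternative with the newer interface (same values, rows ordered by country)
tidy_cases2 <- pivot_longer(cases, cols = 2:4, names_to = "year", values_to = "n")

print(tidy_cases)
```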

Question Continued:

Code snippet 1:

EDAWR::tb %>% gather("age","cases",4:6) %>% arrange(country, year, sex, age)

Code snippet 2:

EDAWR::tb %>% gather("age","cases",child:elderly) %>% arrange(country, year, sex, age)

Information on dataset:

Both code snippets will yield the same output.

The 4:6 and child:elderly portions pick the same columns to collapse, the first by column index and the second by column name.

After that reshaping, arrange() sorts the rows by country, year, sex and age in both cases, so both results come out in the same organized order.
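The equivalence can be checked on a stand-in data frame. EDAWR::tb itself is not loaded here; the tb object below is hypothetical and only assumes that child, adult and elderly sit in columns 4 to 6.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in with the assumed column layout of EDAWR::tb
tb <- data.frame(
  country = c("A", "A", "B"),
  year    = c(2000, 2001, 2000),
  sex     = c("f", "m", "f"),
  child   = c(1, 2, 3),
  adult   = c(10, 20, 30),
  elderly = c(5, 6, 7)
)

by_index <- tb %>% gather("age", "cases", 4:6) %>% arrange(country, year, sex, age)
by_name  <- tb %>% gather("age", "cases", child:elderly) %>% arrange(country, year, sex, age)

# Compare the two results
print(all.equal(by_index, by_name))
```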

## Beginner

Question Continued:

Table 1 (Given input data)

| Country | 2011 | 2012 | 2013 |
|---|---|---|---|
| Japan | 2300 | 3100 | 6800 |
| China | 2700 | 3300 | 5400 |
| India | 4800 | 6200 | 9500 |

Assume this data exists in your data frame in R as “my_df”

Table 2 (Expected desired layout as output)

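| Country | Year | n |
|---|---|---|
| Japan | 2011 | 2300 |
| China | 2011 | 2700 |
| India | 2011 | 4800 |
| Japan | 2012 | 3100 |
| China | 2012 | 3300 |
| India | 2012 | 6200 |
| Japan | 2013 | 6800 |
| China | 2013 | 5400 |
| India | 2013 | 9500 |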

The objective here is to capture the count (n) in a separate row for every year and every country.

We can use the gather() function in the tidyr package to accomplish this.

Below is the required code.

# This will load the “tidyr” package
library(tidyr)
# This will reshape the data in desired format
gather(my_df,"Year","n",2:4,convert = TRUE)

gather() function parameters:

• my_df is the first parameter: the data frame to reshape.
• “Year” is the second parameter: the name of the new key column, typically given as a character string.
• “n” is the third parameter: the name of the new value column.
• 2:4 is the fourth parameter: the names or numeric indexes of the columns to collapse from your input dataset.
• convert = TRUE is the last parameter used here: it converts the values in the key column from character strings to a more specific type, which is numeric in this case.
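The effect of convert = TRUE can be seen by running the call with and without it. This sketch rebuilds Table 1 as my_df; the names as_text and as_number are made up for illustration.

```r
library(tidyr)

# Table 1 as a data frame; check.names = FALSE keeps the year column names
my_df <- data.frame(
  Country = c("Japan", "China", "India"),
  `2011`  = c(2300, 2700, 4800),
  `2012`  = c(3100, 3300, 6200),
  `2013`  = c(6800, 5400, 9500),
  check.names = FALSE
)

as_text   <- gather(my_df, "Year", "n", 2:4)                  # Year stays text
as_number <- gather(my_df, "Year", "n", 2:4, convert = TRUE)  # Year becomes numeric

print(class(as_text$Year))
print(class(as_number$Year))
```

A numeric Year column is convenient when the years are later filtered or plotted as numbers.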

Question Continued:

Table1 (Input data layout)

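| Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|
| AA | 110 | 1007 | 2002-08-11 |
| BB | 45 | 1009 | 1999-08-12 |
| CC | 65 | 1005 | 2002-04-13 |
| DD | 40 | 1013 | 2001-08-14 |
| EE | 50 | 1010 | 2002-01-15 |
| FF | 45 | 1010 | 2002-07-16 |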

Assume this data exists in your data frame in R as “my_df”

Table 2 (Expected desired layout as output)

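| Col1 | Col2 | Col3 | year | month | day |
|---|---|---|---|---|---|
| AA | 110 | 1007 | 2002 | 08 | 11 |
| BB | 45 | 1009 | 1999 | 08 | 12 |
| CC | 65 | 1005 | 2002 | 04 | 13 |
| DD | 40 | 1013 | 2001 | 08 | 14 |
| EE | 50 | 1010 | 2002 | 01 | 15 |
| FF | 45 | 1010 | 2002 | 07 | 16 |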

We can use separate() to split the date field into three separate columns for the year, month and day values.

# This will load the tidyr package
library(tidyr)
# This will reshape the data in desired format
separate(my_df, Col4, c("year","month","day"),sep = "-")

The separate() function uses its parameters as follows to produce the desired format.

• The first parameter is the data frame, which is my_df.
• The second parameter is the column to split, here the date column. Any column can be split up as needed.
• The third parameter gives the names of the new columns to create.
• The fourth parameter, sep, is the string to split on. This is the separation criterion. By default, separate() splits on any run of non-alphanumeric characters.
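The following sketch runs separate() on the first two rows of Table 1, once with an explicit sep and once with the default. The names explicit and default_sep are hypothetical.

```r
library(tidyr)

# First two rows of Table 1; Col4 is stored as text
my_df <- data.frame(
  Col1 = c("AA", "BB"),
  Col2 = c(110, 45),
  Col3 = c(1007, 1009),
  Col4 = c("2002-08-11", "1999-08-12"),
  stringsAsFactors = FALSE
)

# Explicit separator
explicit <- separate(my_df, Col4, c("year", "month", "day"), sep = "-")

# Default separator: any run of non-alphanumeric characters
default_sep <- separate(my_df, Col4, c("year", "month", "day"))

print(explicit)
```

The new columns stay character strings, which is why the leading zero in the month is kept.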

Question Continued:

Figure1 (input dataset)

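| Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|
| AA | 110 | 1007 | 2002-08-11 |
| BB | 45 | 1009 | 1999-08-12 |
| CC | 65 | 1005 | 2002-04-13 |
| DD | 40 | 1013 | 2001-08-14 |
| EE | 50 | 1010 | 2002-01-15 |
| FF | 45 | 1010 | 2002-07-16 |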

Assume this data exists in your data frame in R as “my_df”

Figure2 (code snippet)

my_df %>%
separate(Col4,c("year","month","day")) %>%
unite("Col4",month,day,year,sep = "/")

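The output data will not be the same as the input.

The output will look like this:

| Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|
| AA | 110 | 1007 | 08/11/2002 |
| BB | 45 | 1009 | 08/12/1999 |
| CC | 65 | 1005 | 04/13/2002 |
| DD | 40 | 1013 | 08/14/2001 |
| EE | 50 | 1010 | 01/15/2002 |
| FF | 45 | 1010 | 07/16/2002 |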

The difference is in the format of Col4, which is the date value.
The separate() function splits this date column into 3 different parts.
The unite() function joins these 3 parts back into one column, which is Col4.
However, the parts are joined in month, day, year order with “/” as the separator, as specified in the code.

Here we are converting from non-tidy format to tidy format and then back again to non-tidy format.
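The round trip can be reproduced on two rows of the input. In this sketch dplyr is loaded only to make sure the %>% operator is available, and round_trip is a hypothetical name.

```r
library(tidyr)
library(dplyr)  # loaded here for the %>% operator

# Two rows of the input; Col4 is stored as text
my_df <- data.frame(
  Col1 = c("AA", "DD"),
  Col4 = c("2002-08-11", "2001-08-14"),
  stringsAsFactors = FALSE
)

round_trip <- my_df %>%
  separate(Col4, c("year", "month", "day")) %>%   # split into three columns
  unite("Col4", month, day, year, sep = "/")      # rejoin in a new order

print(round_trip)
```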

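save() function – saves one or more named objects (variables, data frames etc.) to an .RData file. load() restores them under their original names.

saveRDS() function – saves one object at a time. readRDS() returns that object, so it can be assigned to any name.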
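A short sketch of both approaches, using temporary files so nothing is written to the working directory. The objects a and b are made up for illustration.

```r
a <- 1:3
b <- data.frame(x = c("p", "q"))

# save() stores several named objects; load() restores them under the same names
rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)
rm(a, b)
load(rdata_path)

# saveRDS() stores a single object; readRDS() returns it for assignment
rds_path <- tempfile(fileext = ".rds")
saveRDS(b, file = rds_path)
b_copy <- readRDS(rds_path)

print(identical(b, b_copy))
```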

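FALSE. We have to use stringsAsFactors = FALSE. In R versions before 4.0.0, the default behaviour when creating data frames is to convert all character columns into factors. Since R 4.0.0, the default is stringsAsFactors = FALSE, so character columns stay as characters unless we ask for factors.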
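Setting the argument explicitly gives the same behaviour in every R version, as this small sketch with made-up data shows.

```r
# Explicit settings behave the same regardless of the R version's default
as_factor <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
as_char   <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)

print(class(as_factor$x))
print(class(as_char$x))
```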

We can use functions such as distinct(), pull(), select() and filter() from the dplyr package. filter() and distinct() are used to “manipulate cases” (rows). pull() and select() are used to “manipulate variables” (columns).
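The four functions can be tried on a small made-up data frame. The data and the result names below are hypothetical.

```r
library(dplyr)

# Hypothetical data with one duplicated row
df <- data.frame(
  id    = c(1, 2, 2, 3),
  group = c("a", "b", "b", "a"),
  score = c(10, 20, 20, 30)
)

unique_rows <- distinct(df)            # cases: drop duplicate rows
high_scores <- filter(df, score > 15)  # cases: keep rows matching a condition
two_columns <- select(df, id, score)   # variables: keep columns as a data frame
score_vec   <- pull(df, score)         # variables: extract one column as a vector

print(unique_rows)
print(score_vec)
```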

“table()” gives the frequency of one variable (in rows) against a second variable (in columns), whereas “dcast()” gives an aggregate of a third variable against two variables (one in rows and one in columns).
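As an illustration, the sketch below uses dcast() from the reshape2 package, which is an assumption since the answer does not name a package. The sales data is made up.

```r
library(reshape2)

# Hypothetical data
df <- data.frame(
  region  = c("N", "N", "S", "S"),
  product = c("x", "y", "x", "x"),
  sales   = c(10, 20, 30, 40)
)

# table(): number of rows for each region/product combination
counts <- table(df$region, df$product)

# dcast(): sum of sales for each region (rows) by product (columns)
totals <- dcast(df, region ~ product, value.var = "sales", fun.aggregate = sum)

print(counts)
print(totals)
```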

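The differences are the following:

apply(): Applies a function over the rows or columns of a matrix or data frame; often used as an alternative to a for() loop

lapply(): Applies a function to every item and returns the result as a list

sapply(): Works like lapply() but simplifies the result to a vector or matrix where possible; on a data frame the function is executed column-wise

tapply(): Applies a function to groups of values defined by a factor; similar to the aggregate() function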
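A compact sketch of the four functions in base R follows. The matrix, vectors and result names are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns

row_sums <- apply(m, 1, sum)  # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)  # MARGIN = 2: one result per column

squares_list <- lapply(1:3, function(i) i^2)  # always returns a list
squares_vec  <- sapply(1:3, function(i) i^2)  # simplified to a vector

# tapply(): mean of the values within each group
values <- c(10, 20, 30, 40)
groups <- c("a", "b", "a", "b")
group_means <- tapply(values, groups, mean)

print(row_sums)
print(squares_vec)
print(group_means)
```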

quantile(df$V1, c(0.1, 0.9)) will provide the desired values. It returns the 10% and 90% quantiles of the variable V1 in the dataset.
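As a quick check, the call can be run on a hypothetical data frame whose V1 column holds the integers 0 to 100.

```r
# Hypothetical data frame with a numeric column V1
df <- data.frame(V1 = 0:100)

# 10% and 90% quantiles of V1
q <- quantile(df$V1, c(0.1, 0.9))
print(q)
```

The result is a named numeric vector, with the names showing the requested probabilities.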

Question Continued:

flights_mutate1 <- flights %>% mutate(speed=distance/air_time*60) %>%
  select(carrier,arr_delay,speed)
flights_mutate2 <- flights %>% select(carrier,arr_delay,speed) %>%
  mutate(speed=distance/air_time*60)

These are NOT the same. flights_mutate1 will work as intended, whereas flights_mutate2 will throw an error. select() cannot be used first because the derived variable “speed” does not exist yet. It has to be created first using the mutate() function, and then the select() function can be used to extract specific variables from the data frame.
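The ordering problem can be reproduced without the full flights data. In this sketch toy is a hypothetical two-row stand-in, and tryCatch() is used only to capture the error instead of stopping the script.

```r
library(dplyr)

# Hypothetical stand-in for a few flights columns
toy <- data.frame(
  carrier = c("AA", "UA"), arr_delay = c(5, -3),
  distance = c(1000, 750), air_time = c(120, 100)
)

# mutate() first: speed exists by the time select() runs
ok <- toy %>%
  mutate(speed = distance / air_time * 60) %>%
  select(carrier, arr_delay, speed)

# select() first: speed does not exist yet, so an error is raised
failed <- tryCatch(
  toy %>%
    select(carrier, arr_delay, speed) %>%
    mutate(speed = distance / air_time * 60),
  error = function(e) "error"
)

print(ok)
print(failed)
```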

Question Continued:

We can use the summarise() function from the dplyr package, which will provide the mean and variance values.

We can also include n() in the same call to get the number of observations.

n() provides the number of values in a vector, whereas n_distinct() provides the number of distinct values in a vector. For example, if we take the sample “flights” dataset in R, we see the following characteristics:

We first remove the NA values from air_time and distance before using the summarise() function.

The n() function counts the total number of flights, that is rows, in the dataset. The n_distinct() function captures the number of distinct carriers / airlines in the dataset, which is 16.
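A minimal sketch of such a summary is shown below on made-up rows. The answer does not say which column the mean and variance refer to, so distance is an assumption here, as are the toy data and the output column names.

```r
library(dplyr)

# Hypothetical rows shaped like flights, including one missing air_time
toy <- data.frame(
  carrier  = c("AA", "AA", "UA", "DL"),
  distance = c(1000, 500, 750, 900),
  air_time = c(120, 60, NA, 110)
)

stats <- toy %>%
  filter(!is.na(air_time), !is.na(distance)) %>%  # drop rows with NA values
  summarise(
    mean_distance = mean(distance),
    var_distance  = var(distance),
    n_flights     = n(),                  # number of rows
    n_carriers    = n_distinct(carrier)   # number of distinct carriers
  )

print(stats)
```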

Datasets come in many formats, but R works best with just one format, and that is tidy data. The tidyr package in R reshapes data into this format. For example, look at the pollution dataset below:

Each variable is saved in its own column, each observation is saved in its own row, and each “type” of observation is stored in a single table (here it is the “pollution” table shown above). With this layout, R's vectorized operations automatically preserve observations.

library(tidyr) loads the required package in R. If the package is not installed already, run install.packages("tidyr") first.
