A data structure is a specific way of organizing and storing data. R supports five basic data structures: vector, matrix, list, data frame, and factor. This tutorial covers each of them in turn.
R's base data structures can be organized by their dimensionality (1d, 2d, 3d, nd) and by whether they are homogeneous (all elements must be of the same type) or heterogeneous (elements can be of different types).
Given an object, the best way to understand what data structures it’s composed of is to use str(). str() is short for structure and it gives a compact, human-readable description of any R data structure.
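As a minimal sketch of what str() reports, the snippet below applies it to two small objects. The names nums and mixed are hypothetical and exist only for this illustration.

```r
# Hypothetical objects, chosen only to show how str() summarises them
nums  <- c(1.5, 2.5, 3.5)
mixed <- list(id = 1L, label = "a", flags = c(TRUE, FALSE))

str(nums)   # one line: a numeric (double) vector of length 3
str(mixed)  # a header line plus one line per list component

# capture.output() stores what str() prints so it can be inspected
nums_summary  <- capture.output(str(nums))
mixed_summary <- capture.output(str(mixed))
```

For a list, str() prints one line per component, which makes nested structures easy to scan.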
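One of the basic data structures in R is the vector. Vectors come in two flavors: atomic vectors and lists. They have three common properties: type (typeof(), what it is), length (length(), how many elements it contains), and attributes (attributes(), additional arbitrary metadata).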
They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.
NB: is.vector() does not test if an object is a vector. Instead, it returns TRUE only if the object is a vector with no attributes apart from names. Use is.atomic(x) || is.list(x) to test whether an object is actually a vector.
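The difference can be seen in a short sketch. The attribute name my_attr is made up for this example; any attribute other than names has the same effect.

```r
x <- c(a = 1, b = 2)                 # atomic vector with names only
y <- structure(1:3, my_attr = "cm")  # atomic vector with an extra (hypothetical) attribute

is.vector(x)                # TRUE: names are the one attribute is.vector() tolerates
is.vector(y)                # FALSE: the extra attribute makes is.vector() fail
is.atomic(y) || is.list(y)  # TRUE: y is still a vector in the broader sense
```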
There are four basic types of atomic vectors that we will talk about in detail: logical, integer, double (often called numeric), and character. There are two rare types which we will skip for now: complex and raw.
Atomic vectors are usually created with c(), short for combine:
var <- c(1.9, 2.0, 7.5)
var
#Result: 1.9 2.0 7.5
# With the L suffix, you get an integer rather than a double
int_var <- c(2L, 8L, 100L)
int_var
#Result: 2 8 100
# Use TRUE and FALSE (or T and F) to create logical vectors
logical_var <- c(TRUE, FALSE, T, F)
logical_var
#Result: TRUE FALSE TRUE FALSE
chr_var <- c("example of", "some strings")
chr_var
#Result: "example of" "some strings"
Atomic vectors are always flat, even if you nest c()’s:
c(1, c(2.96, c(3.75, 9)))
#Result: 1.00 2.96 3.75 9.00
Missing values are specified with NA, which is a logical vector of length 1. NA will always be coerced to the correct type if used inside c(), or you can create NAs of a specific type with NA_real_ (a double vector), NA_integer_ and NA_character_.
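A quick way to check these statements is to ask typeof() about each kind of NA. The variable names with_na_int and with_na_chr are illustrative only.

```r
typeof(NA)             # "logical"
typeof(NA_real_)       # "double"
typeof(NA_integer_)    # "integer"
typeof(NA_character_)  # "character"

# Inside c(), the logical NA is coerced to the type of its neighbours
with_na_int <- c(1L, NA)
with_na_chr <- c("a", NA)
typeof(with_na_int)    # "integer"
typeof(with_na_chr)    # "character"
```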
Given a vector, you can determine its type with typeof(), or check if it’s a specific type with an “is” function: is.character(), is.double(), is.integer(), is.logical(), or, more generally, is.atomic().
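# 1.05L is not a valid integer literal (R warns and treats it as the double 1.05),
# so the value is written as 1.05 here; one double makes the whole vector double
num_var <- c(1.05, 8L, 10L)
typeof(num_var)
#Result: "double"
is.integer(num_var)
#Result: FALSE
is.atomic(num_var)
#Result: TRUE
is.double(num_var)
#Result: TRUE
is.numeric(num_var)
#Result: TRUE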
All elements of an atomic vector must be of the same type, so when you attempt to combine different types they will be coerced to the most flexible type. Types from least to most flexible are: logical, integer, double, and character.
For example, combining a character and an integer yields a character:
str(c("a", 1L, 0.95))
#Result: chr [1:3] "a" "1" "0.95"
# When a logical vector is coerced to an integer or double,
# TRUE becomes 1 and FALSE becomes 0. This is very useful in conjunction
# with sum() and mean()
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
#Result: 0 0 1
# Total number of TRUEs
sum(x)
#Result: 1
# Proportion of TRUEs
mean(x)
#Result: 0.3333333
Coercion can often happen automatically. Most mathematical functions (+, log, abs, etc.) will coerce to a double or integer, and most logical operations (&, |, any, etc) will coerce to a logical. One will usually get a warning message if the coercion might lose information. If confusion is likely, explicitly coerce with as.character(), as.double(), as.integer(), or as.logical().
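The sketch below shows one automatic and two explicit coercions. The vectors flags and codes are hypothetical, and suppressWarnings() is used only to keep the example quiet; without it, as.integer() would warn that NAs were introduced by coercion.

```r
flags <- c(TRUE, FALSE, TRUE)
flags + 1L              # arithmetic coerces logical to integer: 2 1 2
typeof(flags + 1L)      # "integer"

# Explicit coercion: "x" cannot be parsed as a number, so it becomes NA
codes <- suppressWarnings(as.integer(c("1", "2", "x")))
codes                   # 1 2 NA

as.logical(c(0, 1, 2))  # FALSE TRUE TRUE: any non-zero number is TRUE
```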
One key property of vectors is that elements can have names, and named elements can be selected by name:
v <- c(10, 20, 30)
names(v) <- c("John", "Tracey", "Harry")
print(v)
#Result
#   John Tracey  Harry
#     10     20     30
v["Tracey"]
#Result
# Tracey
#     20
Lists are quite different from atomic vectors as their elements can be of any type, including lists. One can construct lists by using list() instead of c():
x <- list(1:5, "a", c(TRUE, FALSE, T, F), c(2.9, 5.3))
str(x)
#Result
# List of 4
#  $ : int [1:5] 1 2 3 4 5
#  $ : chr "a"
#  $ : logi [1:4] TRUE FALSE TRUE FALSE
#  $ : num [1:2] 2.9 5.3
x <- list(list(list(list())))
str(x)
#Result
# List of 1
#  $ :List of 1
#   ..$ :List of 1
#   .. ..$ : list()
is.recursive(x)
#Result: TRUE
Lists are sometimes called recursive vectors, because a list can contain other lists. This is what makes them fundamentally different from atomic vectors.
c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them. Compare the results of a list() and c():
x <- list(list(1:9), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
#Result
# List of 2
#  $ :List of 1
#   ..$ : int [1:9] 1 2 3 4 5 6 7 8 9
#  $ : num [1:2] 3 4
str(y)
#Result
# List of 4
#  $ : num 1
#  $ : num 2
#  $ : num 3
#  $ : num 4
The typeof() a list is "list". You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() uses the same coercion rules as c().
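A small sketch of these functions, using hypothetical objects l, flat, and back:

```r
l <- list(1L, 2.5, "a")
typeof(l)             # "list"
is.list(l)            # TRUE

flat <- unlist(l)     # coerced to the most flexible type present: character
flat                  # "1" "2.5" "a"

back <- as.list(1:3)  # atomic vector -> list with one element per value
length(back)          # 3
```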
Lists are used to build many of the more complicated data structures in R. For example, both data frames and linear model objects (as produced by lm()) are lists.
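This can be checked with is.list(). The sketch assumes the built-in mtcars dataset; the model formula mpg ~ wt is an arbitrary choice for illustration.

```r
is.list(mtcars)    # TRUE: a data frame is a list of columns

mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)       # TRUE: the fitted model is a list with class "lm"
names(mod)[1:2]    # first two components: "coefficients" "residuals"
```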
Some key properties of Lists:
In R, every object has a mode, which indicates how it is stored in memory: as a number, as a character string, as a list of pointers to other objects, as a function, and so forth:
|Object|Example|Mode|
|---|---|---|
|Vector of numbers|c(2.7182, 3.1415)|numeric|
|Vector of character strings|c("John", "Tracey", "Harry")|character|
|Factor|factor(c("NY", "CA", "IL"))|numeric|
|List|list("John", "Tracey", "Harry")|list|
|Data frame|data.frame(x=1:3, y=c("NY", "CA", "IL"))|list|
The mode() function gives us this information.
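For instance, calling mode() on objects like those in the table returns the following. The last line, applying mode() to the print function, is an extra case added here for illustration.

```r
mode(c(2.7182, 3.1415))                             # "numeric"
mode(c("John", "Tracey", "Harry"))                  # "character"
mode(factor(c("NY", "CA", "IL")))                   # "numeric"
mode(list("John", "Tracey", "Harry"))               # "list"
mode(data.frame(x = 1:3, y = c("NY", "CA", "IL")))  # "list"
mode(print)                                         # "function"
```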
(Please refer to the attached write-up on arrays and matrices.)
A factor looks like a vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values. In simple terms: “A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values.”
There are two key uses for factors:
x <- factor(c("a", "b", "c", "d"))
x
#Result
# a b c d
# Levels: a b c d
class(x)
#Result: "factor"
levels(x)
#Result: "a" "b" "c" "d"
# You can't use values that are not in the levels
x[2] <- "e"
#Result
# Warning in `[<-.factor`(`*tmp*`, 2, value = "e"): invalid factor level, NA generated
# NB: before R 4.1.0, c() on factors returned their integer codes;
# since R 4.1.0 it returns a factor with the combined levels
c(factor("a"), factor("b"))
#Result in R < 4.1.0: 1 1
#Result in R >= 4.1.0: a b (Levels: a b)
Factors are quite useful when you know the possible values a variable may take, even if you don’t see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:
gen_char <- c("m", "m", "f")
gen_factor <- factor(gen_char, levels = c("m", "f"))
table(gen_char)
#Result
# gen_char
# f m
# 1 2
table(gen_factor)
#Result
# gen_factor
# m f
# 2 1
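The benefit is clearest when a level has no observations at all. As a hedged variation on the example above (the names gen_char_m and gen_factor_m are hypothetical), consider data in which only "m" occurs:

```r
gen_char_m   <- c("m", "m", "m")
gen_factor_m <- factor(gen_char_m, levels = c("m", "f"))

table(gen_char_m)    # only "m" is listed, with count 3
table(gen_factor_m)  # m: 3, f: 0; the empty group is still reported
```

The character vector gives no hint that "f" was a possible value, while the factor reports it with a count of zero.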
Sometimes when a data frame is read directly from a file, a column you expected to be numeric is read as a factor instead (or, since R 4.0.0, where read.csv() no longer converts strings to factors by default, as a character vector). This is caused by a non-numeric value in the column, often a missing value encoded in a special way such as "." or "-". To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the na.strings argument to read.csv() is often a good place to start.
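The two-step coercion and the na.strings approach can be sketched as follows. The factor f and the inline CSV text are made-up data in which "." marks a missing value.

```r
# Hypothetical column in which "." marks a missing value
f <- factor(c("1.5", "2", ".", "4"))

as.numeric(f)     # misleading: returns the internal integer codes, not the values
vals <- suppressWarnings(as.numeric(as.character(f)))
vals              # 1.5 2.0 NA 4.0
sum(is.na(vals))  # 1 missing value to review

# Reading with na.strings avoids the problem at the source
csv_text <- "id,score\n1,1.5\n2,.\n3,4"
d <- read.csv(text = csv_text, na.strings = ".")
typeof(d$score)   # "double"
```

Going through as.character() first matters because as.numeric() applied directly to a factor returns the level codes rather than the original numbers.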
A data frame is a powerful and flexible data structure, and most serious R applications involve data frames. It is the most common way of storing data in R and, if used systematically, makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.
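These relationships can be verified on a small data frame. The object d below is a hypothetical example.

```r
d <- data.frame(x = 1:3, y = c("a", "b", "c"))

identical(names(d), colnames(d))  # TRUE
length(d)                         # 2, the same as ncol(d)
nrow(d)                           # 3
rownames(d)                       # "1" "2" "3"
```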
A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list:
A few important points to remember when dealing with a data frame:
Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:
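As a minimal sketch of the two paradigms, list-style and matrix-style access look like this. The data frame people and its values are invented for illustration.

```r
people <- data.frame(name = c("John", "Tracey", "Harry"),
                     age = c(31, 45, 28),
                     stringsAsFactors = FALSE)

# List-style access: columns are list elements
people$age                   # 31 45 28
people[["name"]]             # "John" "Tracey" "Harry"
people["age"]                # a one-column data frame

# Matrix-style access: [row, column]
people[2, "name"]            # "Tracey"
older <- people[people$age > 30, ]  # rows for John and Tracey
```

Note that people["age"] keeps the data frame structure, while people[["age"]] and people$age return the underlying vector.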
#Create a data frame
df <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"))
str(df)
#Result (R versions before 4.0.0; from R 4.0.0 onward y is chr, not Factor)
# 'data.frame': 5 obs. of 2 variables:
#  $ x: int 1 2 3 4 5
#  $ y: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
One key point to remember while working with data frames is that, in R versions before 4.0.0, data.frame() turned strings into factors by default. Use stringsAsFactors = FALSE to suppress this behaviour; since R 4.0.0 this is the default:
df <- data.frame(
  x = 1:5,
  y = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE)
str(df)
#Result
# 'data.frame': 5 obs. of 2 variables:
#  $ x: int 1 2 3 4 5
#  $ y: chr "a" "b" "c" "d" ...
typeof(df)
#Result: "list"
Data frames can be combined with cbind() and rbind():
cbind(df, data.frame(z = 5:1))
#Result
#   x y z
# 1 1 a 5
# 2 2 b 4
# 3 3 c 3
# 4 4 d 2
# 5 5 e 1
rbind(df, data.frame(x = 10, y = "z"))
#Result
#    x y
# 1  1 a
# 2  2 b
# 3  3 c
# 4  4 d
# 5  5 e
# 6 10 z
When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. Use plyr::rbind.fill() to combine data frames that don’t have the same columns.
It’s a common mistake to try to create a data frame by cbind()-ing vectors together. This doesn’t work because cbind() creates a matrix unless one of the arguments is already a data frame. Instead, use data.frame() directly:
correct_arg <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = FALSE)
str(correct_arg)
#Result
# 'data.frame': 2 obs. of 2 variables:
#  $ a: int 1 2
#  $ b: chr "a" "b"
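For contrast, here is a sketch of what the cbind() attempt produces. The name bad is hypothetical.

```r
bad <- cbind(a = 1:2, b = c("a", "b"))

is.matrix(bad)      # TRUE: cbind() on plain vectors builds a matrix
is.data.frame(bad)  # FALSE
typeof(bad)         # "character": the integers were coerced to strings
```

Because a matrix is homogeneous, the integer column is silently converted to character, which is rarely what was intended.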
It’s also quite possible to have a column of a data frame that’s a matrix or array, as long as the number of rows matches the data frame:
dfm <- data.frame(x = 1:5, y = I(matrix(1:25, nrow = 5)))
str(dfm)
#Result
# 'data.frame': 5 obs. of 2 variables:
#  $ x: int 1 2 3 4 5
#  $ y: 'AsIs' int [1:5, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
dfm[5, "y"]
#Result
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    5   10   15   20   25
Take extra care with list and array columns: many functions that work with data frames assume that all columns are atomic vectors.
We hope you enjoyed this tutorial on the main data structures in R. The next step is to experiment with each of them yourself.