How R stores data, and how that affects your workflows

Let’s eat our vegetables

Data types

Each piece of information is assigned one class

  • Numeric
  • Integer
  • Logical
  • Character
  • Complex
  • Raw
  • Applications in ecology and evolution will rarely require use of complex or raw variables.

# Define a numeric variable and check its class
myvar <- 123
class(myvar)
## [1] "numeric"

# Define a string and check its class
myvar2 <- 'hello'
class(myvar2)
## [1] "character"

# Define a Logical
is.tuesday <- TRUE
class(is.tuesday)
## [1] "logical"
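The remaining classes from the list can be checked the same way. The short sketch below uses made-up variable names (n, i, z, r) and only base R; it shows that a bare number is stored as numeric unless the L suffix is used, and that the is.*() helpers test a value's class.

```r
# A bare number is stored as numeric (double); the L suffix makes an integer
n <- 5
i <- 5L
class(n)         # "numeric"
class(i)         # "integer"

# Complex and raw values exist, but are rarely needed
z <- 2 + 3i
r <- as.raw(255)
class(z)         # "complex"
class(r)         # "raw"

# is.*() functions test the class and return a logical
is.numeric(i)    # TRUE: integers also count as numeric
is.character(n)  # FALSE
```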

Data structures

Types of data structures

  • Scalars and vectors
  • Matrices and arrays
  • Data frames
  • Lists
  • Tibbles

Moving from highly-structured to less-structured

Scalars and Vectors

  • Scalars are variables that have length 1
  • Multiple scalars of the same class can be organized into a vector (also called an “atomic” vector)

All elements in a vector are of an identical class

# Define two scalars - both numeric
s1 <- 1
s2 <- 2

class(s1); class(s2)
## [1] "numeric"
## [1] "numeric"

# Arrange scalars into a vector
v <- c(s1, s2)
class(v); v
## [1] "numeric"
## [1] 1 2

All elements in a vector are of an identical class, so mixed inputs are coerced to a common class

# Define two scalars -- one numeric, one character
s1 <- 1
s2 <- "b"

class(s1); class(s2)
## [1] "numeric"
## [1] "character"

# Arrange scalars into a vector
v <- c(s1, s2)
class(v); v
## [1] "character"
## [1] "1" "b"
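The coercion seen above follows a fixed order, from logical to integer to numeric to character. The sketch below illustrates that order and shows one way to convert back explicitly; the name v_num is just for illustration, and suppressWarnings() is used only to keep the output quiet.

```r
# Combining classes always moves toward the more general one
class(c(TRUE, 1L))    # "integer"
class(c(1L, 1.5))     # "numeric"
class(c(1.5, "a"))    # "character"

# Explicit conversion back to numeric:
# values that cannot be parsed become NA (R normally warns about this)
v <- c("1", "b")
v_num <- suppressWarnings(as.numeric(v))
v_num                 # 1 NA
```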

Vectors can be of any class introduced above

numeric_vector <- c(1,2,3)
integer_vector <- c(1L, 2L, 3L)
character_vector <- c("a", "b", "c")
logical_vector <- c(TRUE, FALSE, FALSE)

class(numeric_vector)
## [1] "numeric"
class(integer_vector)
## [1] "integer"
class(character_vector)
## [1] "character"
class(logical_vector)
## [1] "logical"

Matrices

  • Matrices comprise vectors of the same class and the same length
    • e.g. a set of numeric vectors; a set of logical vectors, etc.
v1 <- c(1,2,3)
v2 <- c(4,5,6)
v3 <- c(7,8,9)

m1 <- matrix(c(v1,v2,v3), nrow = 3)

m1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

  • Matrices cannot hold vectors of different classes: R coerces every element to a common class (character in m2 below)
v1 <- c(1,2,3)
v2 <- c(4,5,6)
v3 <- c(7,8,9)
v4 <- c('a','b','c')
m1 <- matrix(c(v1,v2,v3), nrow = 3)
m2 <- matrix(c(v1,v2,v3,v4), nrow = 3)

m1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
m2
##      [,1] [,2] [,3] [,4]
## [1,] "1"  "4"  "7"  "a" 
## [2,] "2"  "5"  "8"  "b" 
## [3,] "3"  "6"  "9"  "c"

  • We can extract individual vectors (columns or rows) by indexing the matrix
  • matrixName[rowNumber,columnNumber]
v1 <- c(1,2,3)
v2 <- c(4,5,6)
v3 <- c(7,8,9)
v4 <- c('a','b','c')
m1 <- matrix(c(v1,v2,v3), nrow = 3)
m2 <- matrix(c(v1,v2,v3,v4), nrow = 3)


class(m1[,1])
## [1] "numeric"
class(m2[,1])
## [1] "character"
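A few more indexing patterns, as a minimal sketch on the same m1 matrix: leaving the row or column position blank returns the whole column or row.

```r
m1 <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3)

m1[2, 3]   # single element in row 2, column 3: 8
m1[2, ]    # entire second row: 2 5 8
m1[, 2]    # entire second column: 4 5 6
dim(m1)    # number of rows and columns: 3 3
```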

Data frames

  • Data frames comprise vectors that can be of different classes (but must be of the same length)
v1 <- c(1,2,3)
v2 <- c(4,5,6)
v3 <- c(7,8,9)
v4 <- c('a','b','c')

df1 <- data.frame(v1,v2,v3,v4)

df1
##   v1 v2 v3 v4
## 1  1  4  7  a
## 2  2  5  8  b
## 3  3  6  9  c

  • As with matrices, we can extract individual vectors (columns or rows) by indexing the data frame
  • dataframe[rowNumber,columnNumber]

But we can also use the syntax dataframe$columnName

v1 <- c(1,2,3)
v2 <- c(4,5,6)
v3 <- c(7,8,9)
v4 <- c('a','b','c')

df1 <- data.frame(v1,v2,v3,v4)

df1
##   v1 v2 v3 v4
## 1  1  4  7  a
## 2  2  5  8  b
## 3  3  6  9  c

class(df1[,1])
## [1] "numeric"
class(df1[,4])
## [1] "character"

class(df1$v1)
## [1] "numeric"
class(df1$v4)
## [1] "character"
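Indexing can also be done by column name or by a logical condition. The sketch below rebuilds df1 with explicit column names and assumes R 4.0 or later, where strings in a data frame stay character rather than becoming factors; the name big is only for illustration.

```r
df1 <- data.frame(v1 = c(1, 2, 3),
                  v2 = c(4, 5, 6),
                  v3 = c(7, 8, 9),
                  v4 = c("a", "b", "c"))

df1[2, "v4"]               # row 2 of column v4: "b"
df1[["v2"]]                # a whole column as a vector: 4 5 6

# Keep only the rows where v1 is greater than 1
big <- df1[df1$v1 > 1, ]
nrow(big)                  # 2
```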

Lists

  • Lists can comprise vectors that are of different classes and/or of different lengths
v1 <- c(1,2)
v2 <- c(4,5,6)
v3 <- c(7,8,9,10)
v4 <- c('a','b','c')

l1 <- list(v1 = v1, v2 = v2, v3 = v3, v4 = v4)

l1
## $v1
## [1] 1 2
## 
## $v2
## [1] 4 5 6
## 
## $v3
## [1]  7  8  9 10
## 
## $v4
## [1] "a" "b" "c"

  • As with matrices and data frames, we can extract individual elements by indexing the list

  • listName[[itemnumber]] or listName$itemName

v1 <- c(1,2)
v2 <- c(4,5,6)
v3 <- c(7,8,9,10)
v4 <- c('a','b','c', 'd', 'e')

l1 <- list(v1 = v1, v2 = v2, v3 = v3, v4 = v4)

l1
## $v1
## [1] 1 2
## 
## $v2
## [1] 4 5 6
## 
## $v3
## [1]  7  8  9 10
## 
## $v4
## [1] "a" "b" "c" "d" "e"

class(l1$v1); length(l1$v1)
## [1] "numeric"
## [1] 2
class(l1[[4]]); length(l1[[4]])
## [1] "character"
## [1] 5
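Single and double brackets behave differently on lists, which is a common source of confusion. A minimal sketch with a shortened version of l1:

```r
l1 <- list(v1 = c(1, 2), v4 = c("a", "b", "c"))

class(l1[1])        # "list": single brackets return a smaller list
class(l1[[1]])      # "numeric": double brackets return the element itself
class(l1[["v4"]])   # "character": double brackets also work with names
names(l1)           # "v1" "v4"
```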

Exercises

Tuesday exercises

  • For those new to programming in R:

https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures.html

  • For those with previous experience in R:

https://adv-r.hadley.nz/base-types.html

Data structures in R, pt. 2

Review

Each piece of information is assigned one class

  • Numeric
  • Integer
  • Logical
  • Character
  • Complex
  • Raw

This is a simplification; see here for advanced topics on data classes in R

Review

Information organized into data structures

  • Scalars and vectors
  • Matrices and arrays
  • Data frames
  • Lists

New today: Tibbles

Review

  • Vectors contain only one data type
# Numeric vector
numbers <- c(1,1.24,3.123,5.0)
numbers; class(numbers)
## [1] 1.000 1.240 3.123 5.000
## [1] "numeric"


# Adding a single character element turns this into a character vector
numbers <- c(1,1.24,3.123,5.0, "6 ")
numbers; class(numbers)
## [1] "1"     "1.24"  "3.123" "5"     "6 "
## [1] "character"


# R can coerce data into different types - sometimes convenient, sometimes dangerous!
numbers <- c(TRUE, TRUE, TRUE, TRUE, FALSE, 2, FALSE)
numbers; class(numbers)
## [1] 1 1 1 1 0 2 0
## [1] "numeric"
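To make "convenient" and "dangerous" concrete, here is a small sketch; the vector name present is hypothetical. Logicals counting as 1 and 0 makes tallying easy, while numbers that have silently become text are compared character by character.

```r
# Convenient: TRUE counts as 1 and FALSE as 0
present <- c(TRUE, TRUE, FALSE, TRUE)
sum(present)    # number of TRUE values: 3
mean(present)   # proportion of TRUE values: 0.75

# Dangerous: numbers stored as text are compared as text
"10" < "9"      # TRUE, because "1" sorts before "9"
10 < 9          # FALSE
```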

Review

  • Matrices are sets of vectors of the same length, organized into rows and columns
v1 <- c(1,2,3)
v2 <- c(4,5,6)
v3 <- c(7,8,9)

m <- matrix(c(v1,v2,v3), nrow = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
  • Matrices can be filled by columns (the default) or by rows
m2 <- matrix(c(v1,v2,v3), nrow = 3, byrow = TRUE)
m2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
  • Data in matrices is all of the same class
v1 <- c(1,2,3)
v2 <- c(4,5,6)
v3 <- c(7,8,'9 ')

m <- matrix(c(v1,v2,v3), nrow = 3)
m
##      [,1] [,2] [,3]
## [1,] "1"  "4"  "7" 
## [2,] "2"  "5"  "8" 
## [3,] "3"  "6"  "9 "

Review

  • Data frames comprise vectors of the same length but potentially different classes
v1 <- c(1,2,3)
v2 <- c("a","b","c")
v3 <- c(TRUE, TRUE, FALSE)


d <- data.frame(v1,v2,v3)
d 
##   v1 v2    v3
## 1  1  a  TRUE
## 2  2  b  TRUE
## 3  3  c FALSE
  • As above, the default behaviors are sometimes convenient and sometimes dangerous

R can “recycle” vectors to create data frames

v1 <- c(1,2,3)
v2 <- c("a", "b", "c", "d", "e", "f")
# v1 is "recycled" because its length (3) divides evenly into the length of v2 (6)

data.frame(v1, v2)
##   v1 v2
## 1  1  a
## 2  2  b
## 3  3  c
## 4  1  d
## 5  2  e
## 6  3  f
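Recycling only works when the longer length is a whole multiple of the shorter one. The sketch below, which uses tryCatch() so the script keeps running, shows that data.frame() stops with an error otherwise; the name result is only for illustration.

```r
v1 <- c(1, 2, 3)
v2 <- c("a", "b", "c", "d")   # length 4 is not a multiple of 3

# data.frame() refuses to recycle here and raises an error,
# which we catch and replace with a simple marker string
result <- tryCatch(
  data.frame(v1, v2),
  error = function(e) "error"
)
result   # "error"
```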

Review

  • Lists comprise vectors that can be of different classes and/or of different lengths.
    • Lists can also include data frames within them!
v1 <- c(1,2,3)
v2 <- c("a","b","c")
v3 <- c(TRUE, TRUE, FALSE)

m <- matrix(c(v1, v1+3, v1+6), nrow = 3)

d <- data.frame(v1,v2,v3)

l <- list(v1, v2, v3, m, d)
l
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
## [1]  TRUE  TRUE FALSE
## 
## [[4]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## [[5]]
##   v1 v2    v3
## 1  1  a  TRUE
## 2  2  b  TRUE
## 3  3  c FALSE
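Elements nested inside a list can be reached by chaining index operations. A minimal sketch, assuming the list elements are given names (numbers, mat, df) that are not part of the example above:

```r
v1 <- c(1, 2, 3)
m  <- matrix(c(v1, v1 + 3, v1 + 6), nrow = 3)
d  <- data.frame(v1, v2 = c("a", "b", "c"))

# Naming the elements makes them easier to retrieve later
l <- list(numbers = v1, mat = m, df = d)

l$mat[2, 3]          # element of the matrix inside the list: 8
l$df$v2[3]           # third value of column v2 in the data frame: "c"
l[["numbers"]][1]    # first value of the numeric vector: 1
```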

Tibbles

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.

  • Similarities to data frames:
    • Can include columns of different classes
    • All columns need to be of the same length (“rectangular” data set)
  • Differences
    • We will explore these as we go

Creating tibbles

  • Can be very similar to creating data frames
library(dplyr)
v1 <- c(1,2,3)
v2 <- c(4,5,6)

d <- data.frame(v1,v2)
t <- tibble(v1,v2)

class(d)
## [1] "data.frame"
class(t)
## [1] "tbl_df"     "tbl"        "data.frame"
  • Can convert existing data.frames into tibbles using as_tibble():
class(d)
## [1] "data.frame"

dt <- as_tibble(d)

class(dt)
## [1] "tbl_df"     "tbl"        "data.frame"
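One difference shows up as soon as we index. The sketch below assumes the tibble package is installed (dplyr re-exports its core functions) and compares single-column subsetting: a data frame drops to a vector, while a tibble stays a tibble.

```r
library(tibble)

d  <- data.frame(v1 = c(1, 2, 3), v2 = c(4, 5, 6))
dt <- as_tibble(d)

is.data.frame(dt)   # TRUE: a tibble is still a data frame

class(d[, 1])       # "numeric": the data frame returns a plain vector
class(dt[, 1])      # a one-column tibble, not a vector

# To get the vector out of a tibble, use [[ ]] or $
class(dt[[1]])      # "numeric"
class(dt$v1)        # "numeric"
```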

Important properties of tibbles

Tibbles reject row names

  • Data frames in R can have row names, but tibbles cannot.
  • Some examples with the mtcars dataset (built into R)
  • You might wonder: but the car names were important!
  • In tibble’s opinion: if it’s important, keep it as a column in your dataset.
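A possible version of the mtcars example, assuming the tibble package is installed. The column name "car" is our own choice, passed through the rownames argument of as_tibble().

```r
library(tibble)

# In the data frame, the car names live in the row names
head(rownames(mtcars), 3)

# Converting to a tibble drops the row names
mt <- as_tibble(mtcars)

# To keep them, ask for the row names to be stored as a regular column
mt_named <- as_tibble(mtcars, rownames = "car")
names(mt_named)[1]    # "car"
mt_named$car[1]       # "Mazda RX4"
```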

Viewing tibbles

  • Tibbles print more “cleanly” than data frames do

Example: print the mtcars data frame (built into R)

Example: print mtcars as a tibble
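The two printing examples might look like the sketch below, assuming the tibble package is installed. Run it interactively to compare: the data frame prints every row, while the tibble shows only the first rows along with each column's type.

```r
library(tibble)

# Data frame: all 32 rows and 11 columns are printed
print(mtcars)

# Tibble: a header with the dimensions, the column types,
# and only the first 10 rows
print(as_tibble(mtcars))
```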

Tibbles reject recycled values

  • Recall that if you try to make a data frame with vectors of different lengths, it works as long as one length is a multiple of the other
v1 <- c(1,2,3)
v2 <- c(4,5,6,7,8,9)

data.frame(v1,v2)
##   v1 v2
## 1  1  4
## 2  2  5
## 3  3  6
## 4  1  7
## 5  2  8
## 6  3  9

The exception to the rule: values of size one are recycled
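The same two vectors can be passed to tibble() to see the difference. This is a sketch assuming the tibble package is installed; tryCatch() is used so the error does not stop the script, and the names res and ok are only for illustration.

```r
library(tibble)

v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6, 7, 8, 9)

# tibble() refuses to recycle a length-3 vector to length 6
res <- tryCatch(tibble(v1, v2), error = function(e) "error")
res       # "error"

# A single value is the exception: it is repeated for every row
ok <- tibble(v1 = 1, v2)
ok$v1     # 1 1 1 1 1 1
```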

Tibbles can have non-vector columns

  • Recall that when we made a data frame, each column was a vector of the same length
v1 <- c(1,2,3)
v2 <- c(4,5,6)

data.frame(v1,v2)
##   v1 v2
## 1  1  4
## 2  2  5
## 3  3  6

What if we wanted one of our columns to have vectors in it?

  • E.g. Column 1 is site ID, and Column 2 is a vector of the species recorded there

  • Tibbles make it easier to have “list-columns”
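A minimal sketch of the site and species example, assuming the tibble package is installed. The site IDs and species names are made up for illustration.

```r
library(tibble)

# Each site has its own vector of species, stored in a list-column
sites <- tibble(
  site    = c("A", "B"),
  species = list(c("sp1", "sp2", "sp3"), c("sp2"))
)

sites$species[[1]]       # species recorded at site A: "sp1" "sp2" "sp3"
lengths(sites$species)   # number of species per site: 3 1
```

Each cell of the species column holds a whole vector, so sites with different numbers of species still fit in one rectangular table.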

What can be done with tibbles or data frames?

Lots!

Let’s do some exercises