Understanding How R Sees Data

RAdelaide 2024

Dr Stevie Pederson

Black Ochre Data Labs
Telethon Kids Institute

July 10, 2024

Vectors

Vectors

The key building blocks for R objects: Vectors

  • There is no such thing as a scalar in R
  • Everything is based around the concept of a vector

What is a vector?

Definition

A vector is zero or more values of the same type
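
A minimal sketch of this definition, using base R only. The object names x and empty_chr are made up for illustration: a single number is already a vector of length 1, and a vector can also have length zero.

```r
# A single value is simply a vector of length 1
x <- 42
is.vector(x)   # TRUE
length(x)      # 1

# "Zero or more values": an empty character vector has length 0
empty_chr <- character(0)
length(empty_chr)   # 0
typeof(empty_chr)   # "character"
```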

Vectors

A simple vector would be

 [1]  1  2  3  4  5  6  7  8  9 10

What type of values are in this vector?

Vectors

Another vector might be

[1] "a"     "cat"   "video"

What type of values are in this vector?

Vectors

[1] "742"       "Evergreen" "Tce"      

What type of values are in this vector?

The 4 Atomic Vector Types

  • Atomic Vectors are the building blocks for everything in R
  • There are four main types
  • Plus two we can ignore

Logical Vectors

  1. logical: Can only hold the values TRUE or FALSE
logi_vec <- c(TRUE, TRUE, FALSE)
print(logi_vec)
[1]  TRUE  TRUE FALSE

Integer Vectors

  1. logical
  2. integer: Counts, ranks or indexing positions
int_vec <- 1:5
print(int_vec)
[1] 1 2 3 4 5

Double Precision Vectors

  1. logical
  2. integer
  3. double: Often (& lazily) referred to as numeric
dbl_vec <- c(0.618, 1.414, 2)
print(dbl_vec)

Why are these called doubles?
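
One way to explore this question, offered as a sketch rather than a full answer: doubles are stored as 64-bit "double precision" floating point numbers, so they carry a fixed number of significant bits and arithmetic is not always exact.

```r
typeof(1)    # "double": plain numbers are doubles by default
typeof(1L)   # "integer": the L suffix requests an integer

# Number of bits of precision in the significand of a double
.Machine$double.digits   # 53

# Rounding means exact comparisons can fail
sqrt(2)^2 == 2                    # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE: comparison with a tolerance
```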

Character Vectors

  1. logical
  2. integer
  3. double
  4. character
char_vec <- c("blue", "red", "green")
print(char_vec)

The 4 Atomic Vector Types

These are the basic building blocks for all R objects

  1. logical
  2. integer
  3. double
  4. character
  • There are two more rare types we’ll ignore: complex & raw
  • All R data structures are built on these 6 vector types

Properties of a vector

What four defining properties might a vector have?

  1. The actual values
  2. Length, accessed by the function length()
  3. The type, accessed by the function typeof()
    • Similar but preferable to class()
  4. Any optional & additional attributes, accessed by the function attributes()
    • Holds data such as names etc.

Properties of a vector

Let’s try them on our vectors

typeof(char_vec)
length(int_vec)
attributes(logi_vec)
class(dbl_vec)
typeof(dbl_vec)

Were you surprised by any of the results?
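
For a vector with no names, attributes() has nothing to report. A small sketch with a named vector shows all four properties together. The name temps and its values are invented for illustration.

```r
temps <- c(mon = 21.5, tue = 19.0, wed = 23.2)

length(temps)       # 3
typeof(temps)       # "double"
class(temps)        # "numeric"
attributes(temps)   # a list with one component: names
```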

Working with Vectors

We can combine two vectors in R, using the function c()

c(1, 2)
[1] 1 2
  • The numbers 1 & 2 were both vectors with length 1
  • We have combined two vectors of length 1, to make a vector of length 2

Working with Vectors

What would happen if we combined two vectors of different types?

new_vec <- c(logi_vec, int_vec)
print(new_vec)
typeof(new_vec)

Working with Vectors

Q: What happened to the logical values?

Answer: R will coerce them into a common type (i.e. integers).
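
The same rule applies to the other atomic types. A short sketch of how c() picks the common type, moving from logical to integer to double to character:

```r
typeof(c(TRUE, 2L))    # "integer"
typeof(c(2L, 0.5))     # "double"
typeof(c(0.5, "a"))    # "character"

# Logical values become 1 (TRUE) and 0 (FALSE) when coerced to numbers
c(TRUE, FALSE, 2L)     # 1 0 2
```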

Coercion

Coercion

What other types could logical vectors be coerced into?

Try using the functions: as.integer(), as.double() & as.character() on logi_vec

Coercion

  1. Can numeric vectors be coerced into logical vectors?
  2. Can character vectors be coerced into numeric vectors?
simp_vec <- c(742, "Evergreen", "Terrace")
simp_vec
[1] "742"       "Evergreen" "Terrace"  
as.numeric(simp_vec)
[1] 742  NA  NA
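
A sketch for exploring the first question as well. The object name num_vec is made up, and suppressWarnings() is used here only to keep the output quiet.

```r
# Numeric to logical: zero becomes FALSE, anything else becomes TRUE
as.logical(c(0, 1, 2, -3.5))   # FALSE TRUE TRUE TRUE

# Character to logical: only a few spellings are recognised
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA

# Character to numeric: values that cannot be parsed become NA
num_vec <- suppressWarnings(as.numeric(c("742", "Evergreen")))
is.na(num_vec)   # FALSE TRUE
```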

Subsetting Vectors

Subsetting Vectors

One or more elements of a vector can be called using []

y <- c("A", "B", "C", "D", "E")
y[2]
[1] "B"
y[1:3]
[1] "A" "B" "C"

Subsetting Vectors

Double brackets ([[]]) can be used to return single elements only

y[[2]]
[1] "B"

If you tried y[[1:3]] you would receive an error message
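
To see this without stopping a script, the call can be wrapped in tryCatch(). This is only a sketch: the name res and the returned string are chosen for illustration.

```r
y <- c("A", "B", "C", "D", "E")

# [[ ]] accepts exactly one element, so a longer index raises an error
res <- tryCatch(
  y[[1:3]],
  error = function(e) "error"   # return a marker instead of stopping
)
res   # "error"
```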

Subsetting Vectors

If a vector has name attributes, we can call values by name

head(euro)
      ATS       BEF       DEM       ESP       FIM       FRF 
 13.76030  40.33990   1.95583 166.38600   5.94573   6.55957 
euro["ESP"]
    ESP 
166.386 

Subsetting Vectors

Try repeating the call-by-name approach using double brackets

euro["ESP"]
euro[["ESP"]]

What was the difference in the output?

  1. Using [] returned the vector with the identical structure
  2. Using [[]] removed the attributes & just gave the value

Subsetting Vectors

Is it better to call by position, or by name?

Things to consider:

  • Which is easier to type on the fly?
  • Which is easier to read?
  • Which is more robust to undocumented changes in an object?
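
A small sketch of the robustness point, using a made-up vector called prices: if the order of an object changes, a position may silently point at a different value, while a name still finds the same one.

```r
prices <- c(apple = 1.2, banana = 0.5, cherry = 3.0)
reordered <- sort(prices)   # order is now banana, apple, cherry

# By position: the same index returns different values
prices[[2]]      # 0.5 (banana)
reordered[[2]]   # 1.2 (apple)

# By name: the result is unaffected by the reordering
prices[["banana"]]      # 0.5
reordered[["banana"]]   # 0.5
```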

Extracting Multiple Values

What is really happening in this line?

euro[1:5]
      ATS       BEF       DEM       ESP       FIM 
 13.76030  40.33990   1.95583 166.38600   5.94573 

We are using the integer vector 1:5 to extract values from euro

int_vec
[1] 1 2 3 4 5
euro[int_vec]
      ATS       BEF       DEM       ESP       FIM 
 13.76030  40.33990   1.95583 166.38600   5.94573 

Vector Operations

R Functions are designed to work on vectors

dbl_vec - 1
dbl_vec > 1
dbl_vec^2
mean(dbl_vec)
sd(dbl_vec)
sqrt(int_vec)

This is one of the real strengths of R

Vector Operations

We can also combine the above logical test and subsetting

dbl_vec
[1] 0.618 1.414 2.000
dbl_vec > 1
[1] FALSE  TRUE  TRUE
dbl_vec[dbl_vec > 1]
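
The logical vector can also be saved and reused. A brief sketch, where keep is a name chosen for illustration:

```r
dbl_vec <- c(0.618, 1.414, 2)
keep <- dbl_vec > 1   # FALSE TRUE TRUE

dbl_vec[keep]   # 1.414 2.000: only positions where keep is TRUE
which(keep)     # 2 3: the positions themselves
sum(keep)       # 2: TRUE counts as 1, so this counts the matches
```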

Vector Operations

An additional logical test: %in% (read as: “is in”)

dbl_vec
[1] 0.618 1.414 2.000
int_vec
[1] 1 2 3 4 5
dbl_vec %in% int_vec

Returns TRUE/FALSE for each value in dbl_vec if it is in int_vec

NB: int_vec was coerced silently to a double vector
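
The same test works for character vectors and combines naturally with subsetting. A sketch with two made-up vectors, fruit and allowed:

```r
fruit <- c("apple", "kiwi", "plum")
allowed <- c("plum", "apple", "pear")

# One TRUE/FALSE per element on the left-hand side
fruit %in% allowed          # TRUE FALSE TRUE

# Keep only the values found in allowed
fruit[fruit %in% allowed]   # "apple" "plum"
```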

Matrices

Matrices

  • Vectors are strictly one dimensional and have a length.
  • A matrix is the two dimensional equivalent
int_mat <- matrix(1:6, ncol = 2)
print(int_mat)

Matrices

  • Matrices can only hold one type of value
    • i.e. logical, integer, double, character
  • Have additional attributes such as dim(), nrow() & ncol()
  • Can have optional rownames() & colnames()

Matrices

Some commands to try:

dim(int_mat)
nrow(int_mat)
typeof(int_mat)
class(int_mat)
attributes(int_mat)
colnames(int_mat)
length(int_mat)

Ask questions if anything is confusing

Matrices

  • Use square brackets to extract values by row & column
  • The form is x[row, col]
  • Leaving either row or col blank selects the entire row/column
int_mat[2, 2]
int_mat[1,]

How would we just get the first column?

Matrices

NB: Forgetting the comma when subsetting will treat the matrix as a single vector running down the columns

int_mat
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
int_mat[5]
[1] 5
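
A quick sketch of why position 5 holds the value in row 2, column 2: the values are stored as one vector, filled down the first column and then the second.

```r
int_mat <- matrix(1:6, ncol = 2)

# The underlying vector, running down the columns
as.vector(int_mat)            # 1 2 3 4 5 6

# Position 5 is the second element of the second column
int_mat[5] == int_mat[2, 2]   # TRUE
```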

Matrices

Requesting a row or column that doesn’t exist is the source of a very common error message

dim(int_mat)
[1] 3 2
int_mat[5,]
Error in int_mat[5, ] : subscript out of bounds

Arrays

Arrays extend matrices to 3 or more dimensions

Beyond the scope of today, but we just have more commas in the square brackets, e.g.

dim(iris3)
[1] 50  4  3
dimnames(iris3)
iris3[1,,]
         Setosa Versicolor Virginica
Sepal L.    5.1        7.0       6.3
Sepal W.    3.5        3.2       3.3
Petal L.    1.4        4.7       6.0
Petal W.    0.2        1.4       2.5
iris3[1:2,,]

Homogeneous Data Types

  • Vectors, Matrices & Arrays are the basic homogeneous data types
  • All are essentially just vectors
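
A sketch of what "essentially just vectors" can mean in practice: giving a plain vector a dim attribute is enough to turn it into a matrix. The name v is chosen for illustration.

```r
v <- 1:6
is.matrix(v)   # FALSE: just a vector

# Adding dimensions (3 rows, 2 columns) makes it a matrix
dim(v) <- c(3L, 2L)
is.matrix(v)   # TRUE

# The result matches a matrix built the usual way
identical(v, matrix(1:6, ncol = 2))   # TRUE
```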

Heterogeneous Data Types

Heterogeneous Data Types

Summary of main data types in R

Dimension   Homogeneous   Heterogeneous
1d          vector        list
2d          matrix        data.frame
3d+         array

Lists

A list is a heterogeneous vector.

  • Each component is an R object
  • Can be a vector, or matrix
  • Could be another list
  • Any other R object type we haven’t seen yet

These are incredibly common in R

Lists

Many R functions provide output as a list

testResults <- t.test(dbl_vec)
class(testResults)
typeof(testResults)
testResults

NB: There is a function (print.htest()) that tells R how to print the results to the Console

Lists

We can call the individual components of a list using the $ symbol followed by the name

testResults$statistic
testResults$conf.int
testResults$method

Note that each component is quite different to the others.

Subsetting Lists

A list is a vector so we can also subset using the [] method

testResults[1]
typeof(testResults[1])
  • Using single square brackets returns a list
    • i.e. is a subset of the larger object and of the same type

Subsetting Lists

Double brackets again retrieve a single element of the vector

  • Returns the actual component as the underlying R object
testResults[[1]]
typeof(testResults[[1]])

When would we use either method?


We can also use names instead of positions

testResults[c("statistic", "p.value")]
testResults[["statistic"]]
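
The difference between the two bracket types can be seen on a small hand-made list. The list rec and its contents are invented for illustration.

```r
rec <- list(id = 1L, tags = c("a", "b"))

# Single brackets: a shorter list, still wrapped
typeof(rec["tags"])     # "list"
length(rec["tags"])     # 1

# Double brackets: the component itself
typeof(rec[["tags"]])   # "character"
length(rec[["tags"]])   # 2
```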

Data Frames

Data Frames

Finally!

  • These are the most common type of data you will work with
  • Each column is a vector
  • Columns can be different types of vectors
  • Column vectors MUST be the same length

Data Frames

  • Analogous to matrices, but are specifically for heterogeneous data
  • Have many of the same attributes as matrices
    • dim(), nrow(), ncol(), rownames(), colnames()
  • colnames() & rownames() are NOT optional: they are assigned by default
    • tibble variants have simple row numbers as rownames

Data Frames

Let’s load pigs again

library(tidyverse)
pigs <- read_csv("data/pigs.csv")
head(pigs)


Try these commands

colnames(pigs)
dim(pigs)
nrow(pigs)

Data Frames

Individual entries can also be extracted using the square brackets

pigs[1:2, 1]

We can also refer to columns by name (same as matrices)

pigs[1:2, "len"]

Data Frames

Thinking of columns being vectors is quite useful

  • We can call each column vector of a data.frame using the $ operator
pigs$len[1:2]

This does NOT work for rows!!!
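
A sketch of the contrast, using a small base R data.frame rather than pigs. The object df and its columns are made up for illustration.

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

# A column is a plain vector
df$score    # 2.5 3.1 4.8

# A row must be requested with [row, ] and is still a data.frame,
# because its values may be of different types
df[2, ]
is.data.frame(df[2, ])   # TRUE
```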

Data Frames

  • R is column major by default (as are FORTRAN & Matlab)
    • Very common in the 1970s
  • Many other languages are row major, e.g. C/C++, Python (NumPy arrays)
  • R was designed for statistical analysis, but has developed capabilities far beyond this

We will see this advantage this afternoon

Data Frames & Lists

Data Frames & Lists

Data frames are actually special cases of lists

  • Each column of a data.frame is a component of a list
  • The components must all be vectors of the same length
  • Data Frames can be treated identically to a list
  • Have additional subsetting operations and attributes
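
A sketch of the list behaviour, again with a made-up base R data.frame called df. stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
df <- data.frame(
  id = 1:3,
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

is.list(df)   # TRUE
length(df)    # 2: the number of columns, i.e. list components

# List tools work column by column
sapply(df, typeof)   # id: "integer", label: "character"
```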

Data Frames & Lists

Forgetting the comma now gives a completely different result to a matrix!

pigs[1]

Was that what you expected?

Try using the double bracket method

Common Data Frame Errors

What do you think will happen if we type:

pigs[5]

Error: Column index must be at most 3 if positive, not 5

Working With R Objects

Name Attributes

How do we assign names?

named_vec <- c(a = 1, b = 2, c = 3)

OR we can name an existing vector

names(int_vec) <- c("a", "b", "c", "d", "e")

Name Attributes

Can we remove names?

The NULL, or empty, vector in R is created using c()

null_vec <- c()
length(null_vec)

Name Attributes

We can also use this to remove names

names(named_vec) <- c()

Don’t forget to put the names back…
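
One possible way to do the round trip, as a sketch: old_names is a name chosen for illustration, and NULL is used in place of c() since the two are identical.

```r
named_vec <- c(a = 1, b = 2, c = 3)
identical(c(), NULL)   # TRUE: c() with no arguments returns NULL

old_names <- names(named_vec)   # keep a copy before removing
names(named_vec) <- NULL
names(named_vec)                # NULL

names(named_vec) <- old_names   # put the names back
named_vec[["b"]]                # 2

# unname() returns a copy without names and leaves the original alone
unname(named_vec)               # 1 2 3
```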

Matrices

We can convert vectors to matrices, as earlier

int_mat <- matrix(1:6, ncol = 2)

R is column major so fills columns by default

row_mat <- matrix(1:6, ncol = 2, byrow = TRUE)

Matrices

We can assign row names & column names after creation

colnames(row_mat) <- c("odds", "evens")

Or using dimnames()

dimnames(row_mat)

This is a list of length 2 with rownames then colnames as the components.
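
Both sets of names can also be assigned in one step by giving dimnames() a list. A sketch, where the row names r1 to r3 are invented for illustration:

```r
row_mat <- matrix(1:6, ncol = 2, byrow = TRUE)

# First component: row names; second component: column names
dimnames(row_mat) <- list(
  c("r1", "r2", "r3"),
  c("odds", "evens")
)

row_mat["r2", "evens"]   # 4
```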

Lists

my_list <- list(int_vec, dbl_vec)
names(my_list) <- c("integers", "doubles")

OR

my_list <- list(integers = int_vec, doubles = dbl_vec)

Lists

What happens if we try this?

my_list$logical <- logi_vec

Data Frames

This is exactly the same as creating lists, but the names attribute will also be the colnames()

my_df <- data.frame(doubles = dbl_vec, logical = logi_vec)
names(my_df) == colnames(my_df)
[1] TRUE TRUE

Data Frames

What happens if we try to combine components that aren’t the same length?

my_df <- data.frame(
  integers = int_vec, 
  doubles = dbl_vec, logical = logi_vec
)
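
A sketch for exploring this safely with base R's data.frame(). The names res and ok are chosen for illustration. Lengths 5 and 3 cannot be reconciled, but data.frame() will recycle a shorter vector when the longer length is a whole multiple of it.

```r
# Lengths 5 and 3: an error, captured here as a message
res <- tryCatch(
  data.frame(a = 1:5, b = 1:3),
  error = function(e) conditionMessage(e)
)
is.character(res)   # TRUE: we got the error message back

# Lengths 6 and 3: b is recycled twice
ok <- data.frame(a = 1:6, b = 1:3)
nrow(ok)   # 6
ok$b       # 1 2 3 1 2 3
```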