Understanding How R Sees Data
RAdelaide 2024
Vectors
The key building blocks for R objects: Vectors
- There is no such thing as a scalar in R
- Everything is based around the concept of a vector
What is a vector?
Definition
A vector is zero or more values of the same type
Vectors
A simple vector would be
[1] 1 2 3 4 5 6 7 8 9 10
What type of values are in this vector?
Vectors
Another vector might be
[1] "a" "cat" "video"
What type of values are in this vector?
Vectors
[1] "742" "Evergreen" "Tce"
What type of values are in this vector?
The 4 Atomic Vector Types
- Atomic Vectors are the building blocks for everything in R
- There are four main types
- Plus two we can ignore
Logical Vectors
- logical: Can only hold the values TRUE or FALSE
logi_vec <- c(TRUE, TRUE, FALSE)
print(logi_vec)
[1] TRUE TRUE FALSE
- Spell out that when you type an object’s name, you’re calling print()
- Also mention that in the 70s we didn’t have printers, so it means print the object to screen
Integer Vectors
- logical
- integer: Counts, ranks or indexing positions
int_vec <- 1:5
print(int_vec)
[1] 1 2 3 4 5
Double Precision Vectors
- logical
- integer
- double: Often (& lazily) referred to as numeric
dbl_vec <- c(0.618, 1.414, 2)
print(dbl_vec)
Why are these called doubles?
Character Vectors
- logical
- integer
- double
- character
char_vec <- c("blue", "red", "green")
print(char_vec)
The 4 Atomic Vector Types
These are the basic building blocks for all R objects
- logical
- integer
- double
- character
- There are two rarer types we’ll ignore: complex & raw
- All R data structures are built on these 6 vector types
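As a quick check of the six types listed above, typeof() can be called on a single value of each. This is only a sketch; the example values are arbitrary, and the L suffix is R's way of writing an integer literal.

```r
# One value of each of the four main atomic types
typeof(TRUE)        # "logical"
typeof(2L)          # "integer" (the L suffix marks an integer literal)
typeof(2)           # "double"  (numbers typed without L are doubles)
typeof("two")       # "character"

# The two rarer types
typeof(2+0i)        # "complex"
typeof(as.raw(2))   # "raw"
```

Note that typing 2 on its own gives a double, not an integer.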
Properties of a vector
What four defining properties might a vector have?
- The actual values
- Length, accessed by the function length()
- The type, accessed by the function typeof()
  - Similar but preferable to class()
- Any optional & additional attributes \(\implies\) attributes()
  - Holds data such as names etc.
Properties of a vector
Let’s try them on our vectors
typeof(char_vec)
length(int_vec)
attributes(logi_vec)
class(dbl_vec)
typeof(dbl_vec)
Were you surprised by any of the results?
Working with Vectors
We can combine two vectors in R, using the function c()
c(1, 2)
[1] 1 2
- The numbers 1 & 2 were both vectors with length 1
- We have combined two vectors of length 1 to make a vector of length 2
Working with Vectors
What would happen if we combined two vectors of different types?
new_vec <- c(logi_vec, int_vec)
print(new_vec)
typeof(new_vec)
Working with Vectors
Q: What happened to the logical values?
Answer: R will coerce them into a common type (i.e. integers).
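The same rule applies to any mix of types: c() picks the most flexible type present, in the order logical, integer, double, character. A minimal sketch, re-creating logi_vec and int_vec so it runs on its own:

```r
logi_vec <- c(TRUE, TRUE, FALSE)
int_vec <- 1:5

# logical + integer: TRUE becomes 1L and FALSE becomes 0L
to_int <- c(logi_vec, int_vec)
print(to_int)            # 1 1 0 1 2 3 4 5
typeof(to_int)           # "integer"

# logical + double gives a double vector
typeof(c(logi_vec, 0.5)) # "double"

# Adding a single character value turns everything into text
to_chr <- c(logi_vec, "a")
print(to_chr)            # "TRUE" "TRUE" "FALSE" "a"
typeof(to_chr)           # "character"
```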
Coercion
What other types could logical vectors be coerced into?
Try using the functions: as.integer(), as.double() & as.character() on logi_vec
Coercion
- Can numeric vectors be coerced into logical vectors?
- Can character vectors be coerced into numeric vectors?
simp_vec <- c(742, "Evergreen", "Terrace")
simp_vec
[1] "742" "Evergreen" "Terrace"
as.numeric(simp_vec)
[1] 742 NA NA
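For the first question, a short sketch of coercion in the other direction may help. The values passed to as.logical() are arbitrary examples; suppressWarnings() is used only to hide the "NAs introduced by coercion" warning that as.numeric() raises.

```r
# Numeric to logical: zero becomes FALSE, any other number becomes TRUE
as.logical(c(0, 1, 2.5, -3))      # FALSE TRUE TRUE TRUE

# Character to numeric: text that does not look like a number becomes NA
simp_vec <- c(742, "Evergreen", "Terrace")
num_vec <- suppressWarnings(as.numeric(simp_vec))
is.na(num_vec)                    # FALSE TRUE TRUE

# Character to logical: only a few spellings are recognised
as.logical(c("TRUE", "true", "T", "yes"))   # TRUE TRUE TRUE NA
```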
Subsetting Vectors
One or more elements of a vector can be called using []
y <- c("A", "B", "C", "D", "E")
y[2]
[1] "B"
y[1:3]
[1] "A" "B" "C"
Subsetting Vectors
Double brackets ([[]]) can be used to return single elements only
y[[2]]
[1] "B"
If you tried y[[1:3]] you would receive an error message
Unless you’re the Seurat package developers…
Subsetting Vectors
If a vector has name attributes, we can call values by name
head(euro)
      ATS       BEF       DEM       ESP       FIM       FRF
 13.76030  40.33990   1.95583 166.38600   5.94573   6.55957
euro["ESP"]
    ESP
166.386
Subsetting Vectors
Try repeating the call-by-name approach using double brackets
euro["ESP"]
euro[["ESP"]]
What was the difference in the output?
- Using [] returned the vector with the identical structure
- Using [[]] removed the attributes & just gave the value
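The difference can be checked directly by inspecting the names of each result. A small sketch using the built-in euro vector; the object names single and double are just illustrative.

```r
single <- euro["ESP"]     # single brackets keep the names attribute
double <- euro[["ESP"]]   # double brackets return the bare value

names(single)             # "ESP"
names(double)             # NULL
attributes(double)        # NULL

# The underlying value is the same either way
unname(single) == double  # TRUE
```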
Ask about Seurat & mention its unconventional behaviour
Subsetting Vectors
Is it better to call by position, or by name?
Things to consider:
- Which is easier to type on the fly?
- Which is easier to read?
- Which is more robust to undocumented changes in an object?
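One way to think about the robustness question is to see what happens when an object is reordered. The vector prices below is a hypothetical example, not part of the workshop data.

```r
# Hypothetical named vector
prices <- c(apple = 1.2, banana = 0.5, cherry = 3.0)

# Something upstream reorders the object (here, sorting by value)
reordered <- sort(prices)

# Calling by position now returns a different element
prices[2]            # banana 0.5
reordered[2]         # apple  1.2

# Calling by name returns the same element in both
prices["banana"]     # banana 0.5
reordered["banana"]  # banana 0.5
```

Positions are quicker to type, but names survive changes in ordering.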
Extracting Multiple Values
What is really happening in this line?
euro[1:5]
      ATS       BEF       DEM       ESP       FIM
 13.76030  40.33990   1.95583 166.38600   5.94573
We are using the integer vector 1:5 to extract values from euro
int_vec
[1] 1 2 3 4 5
euro[int_vec]
      ATS       BEF       DEM       ESP       FIM
 13.76030  40.33990   1.95583 166.38600   5.94573
Vector Operations
R Functions are designed to work on vectors
dbl_vec - 1
dbl_vec > 1
dbl_vec^2
mean(dbl_vec)
sd(dbl_vec)
sqrt(int_vec)
This is one of the real strengths of R
Vector Operations
We can also combine the above logical test and subsetting
dbl_vec
[1] 0.618 1.414 2.000
dbl_vec > 1
[1] FALSE TRUE TRUE
dbl_vec[dbl_vec > 1]
Vector Operations
An additional logical test: %in% (read as: “is in”)
dbl_vec
[1] 0.618 1.414 2.000
int_vec
[1] 1 2 3 4 5
dbl_vec %in% int_vec
Returns TRUE/FALSE for each value in dbl_vec if it is in int_vec
NB: int_vec was coerced silently to a double vector
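A short sketch of how %in% behaves, re-creating both vectors so it runs on its own. It also shows that the result depends on which vector is on the left, and that the logical result can be used for subsetting.

```r
dbl_vec <- c(0.618, 1.414, 2)
int_vec <- 1:5

# One TRUE/FALSE per element of the left-hand vector
in_int <- dbl_vec %in% int_vec
print(in_int)            # FALSE FALSE TRUE

# The logical result can be used to subset
dbl_vec[in_int]          # 2

# Swapping the sides asks a different question
int_vec %in% dbl_vec     # FALSE TRUE FALSE FALSE FALSE
```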
Matrices
- Vectors are strictly one dimensional and have a length attribute.
- A matrix is the two dimensional equivalent
int_mat <- matrix(1:6, ncol = 2)
print(int_mat)
Matrices
- Matrices can only hold one type of value
  - i.e. logical, integer, double, character
- Have additional attributes such as dim(), nrow(), ncol()
- Can have optional rownames() & colnames()
Matrices
Some commands to try:
dim(int_mat)
nrow(int_mat)
typeof(int_mat)
class(int_mat)
attributes(int_mat)
colnames(int_mat)
length(int_mat)
Ask questions if anything is confusing
Matrices
- Use square brackets to extract values by row & column
- The form is x[row, col]
- Leaving either row or col blank selects the entire row/column
int_mat[2, 2]
int_mat[1,]
How would we just get the first column?
Matrices
NB: Forgetting the comma when subsetting will treat the matrix as a single vector running down the columns
int_mat
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
int_mat[5]
[1] 5
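Because int_mat holds the values 1 to 6, the position and the value happen to coincide above. The sketch below uses a hypothetical character matrix, chr_mat, to make the column-by-column counting easier to see.

```r
# Hypothetical matrix of letters, filled down the columns
chr_mat <- matrix(letters[1:6], ncol = 2)

# Without a comma, positions run down column 1, then column 2
chr_mat[5]                    # "e"
chr_mat[2, 2]                 # "e" (the same element)

# arrayInd() converts a single position into row & column
arrayInd(5, dim(chr_mat))     # row 2, column 2

# Dropping the dimensions shows the underlying vector
as.vector(chr_mat)            # "a" "b" "c" "d" "e" "f"
```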
Matrices
Requesting a row or column that doesn’t exist is the source of a very common error message
dim(int_mat)
[1] 3 2
int_mat[5,]
Error in int_mat[5, ] : subscript out of bounds
Arrays
Arrays extend matrices to 3 or more dimensions
Beyond the scope of today, but we just have more commas in the square brackets, e.g.
dim(iris3)
[1] 50 4 3
dimnames(iris3)
iris3[1,,]
         Setosa Versicolor Virginica
Sepal L.    5.1        7.0       6.3
Sepal W.    3.5        3.2       3.3
Petal L.    1.4        4.7       6.0
Petal W.    0.2        1.4       2.5
iris3[1:2,,]
Homogeneous Data Types
- Vectors, Matrices & Arrays are the basic homogeneous data types
- All are essentially just vectors
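One way to see that a matrix is essentially a vector is to add and remove the dim attribute by hand. This is only a sketch; the object name x is arbitrary.

```r
x <- 1:6
is.matrix(x)         # FALSE: a plain integer vector

# Adding a dim attribute turns the same values into a 3 x 2 matrix
dim(x) <- c(3, 2)
is.matrix(x)         # TRUE
typeof(x)            # "integer": the values are unchanged

# Removing the attribute gives back the plain vector
dim(x) <- NULL
is.matrix(x)         # FALSE
identical(x, 1:6)    # TRUE
```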
Heterogeneous Data Types
Summary of main data types in R
| Dimension | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | vector | list |
| 2d | matrix | data.frame |
| 3d+ | array | |
Lists
A list is a heterogeneous vector.
- Each component is an R object
  - Can be a vector, or matrix
  - Could be another list
  - Any other R object type we haven’t seen yet
These are incredibly common in R
Lists
Many R functions provide output as a list
testResults <- t.test(dbl_vec)
class(testResults)
typeof(testResults)
testResults
NB: There is a function (print.htest()) that tells R how to print the results to the Console
Lists
Explore the various attributes of the object testResults
attributes(testResults)
length(testResults)
names(testResults)
typeof(testResults)
Lists
We can call the individual components of a list using the $ symbol followed by the name
testResults$statistic
testResults$conf.int
testResults$method
Note that each component is quite different to the others.
Subsetting Lists
A list is a vector so we can also subset using the [] method
testResults[1]
typeof(testResults[1])
- Using single square brackets returns a list
  - i.e. is a subset of the larger object and of the same type
Subsetting Lists
Double brackets again retrieve a single element of the vector
- Returns the actual component as the underlying R object
testResults[[1]]
typeof(testResults[[1]])
When would we use either method?
We can also use names instead of positions
testResults[c("statistic", "p.value")]
testResults[["statistic"]]
Lists
- Note also the Environment Tab in the top right of RStudio
- Click the arrow next to testResults to expand the entry
- This is the output of str(testResults)
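To recap the list subsetting rules from this section, here is a small sketch on a hypothetical two-component list (demo_list is an illustrative name, not an object from the workshop).

```r
# Hypothetical list with two named components
demo_list <- list(integers = 1:5, doubles = c(0.618, 1.414, 2))

# Single brackets return a shorter list
sub_list <- demo_list["doubles"]
typeof(sub_list)       # "list"
length(sub_list)       # 1

# Double brackets (or $) return the component itself
component <- demo_list[["doubles"]]
typeof(component)      # "double"
identical(component, demo_list$doubles)   # TRUE

# Functions expecting a numeric vector need the component, not the sub-list
mean(component)
```

Use single brackets to keep several components together as a list, and double brackets or $ when the values themselves are needed.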
Data Frames
Finally!
- These are the most common type of data you will work with
- Each column is a vector
- Columns can be different types of vectors
- Column vectors MUST be the same length
Data Frames
- Analogous to matrices, but are specifically for heterogeneous data
- Have many of the same attributes as matrices
  - dim(), nrow(), ncol(), rownames(), colnames()
- colnames() & rownames() are NOT optional \(\implies\) assigned by default
  - tibble variants have simple row numbers as rownames
Data Frames
Let’s load pigs again
library(tidyverse)
pigs <- read_csv("data/pigs.csv")
head(pigs)
Try these commands
colnames(pigs)
dim(pigs)
nrow(pigs)
Data Frames
Individual entries can also be extracted using the square brackets
pigs[1:2, 1]
We can also refer to columns by name (same as matrices)
pigs[1:2, "len"]
Data Frames
Thinking of columns being vectors is quite useful
- We can call each column vector of a data.frame using the $ operator
pigs$len[1:2]
This does NOT work for rows!!!
Data Frames
- R is column major by default (as is FORTRAN & Matlab)
  - Very common in the 1970s
- Many other languages are row major, e.g. C/C++, Python
- R was designed for statistical analysis, but has developed capabilities far beyond this
We will see this advantage this afternoon
Data Frames & Lists
Data frames are actually special cases of lists
- Each column of a data.frame is a component of a list
- The components must all be vectors of the same length
- Data Frames can be treated identically to a list
- Have additional subsetting operations and attributes
Data Frames & Lists
Forgetting the comma now gives a completely different result to a matrix!
pigs[1]
Was that what you expected?
Try using the double bracket method
Common Data Frame Errors
What do you think will happen if we type:
pigs[5]
Error: Column index must be at most 3 if positive, not 5
Working With R Objects
Name Attributes
How do we assign names?
named_vec <- c(a = 1, b = 2, c = 3)
OR we can name an existing vector
names(int_vec) <- c("a", "b", "c", "d", "e")
Name Attributes
Can we remove names?
The NULL, or empty, vector in R is created using c()
null_vec <- c()
length(null_vec)
Name Attributes
We can also use this to remove names
names(named_vec) <- c()
Don’t forget to put the names back…
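A brief sketch of the steps above, including putting the names back. It also shows unname(), a base R alternative that returns a copy without names and leaves the original untouched.

```r
named_vec <- c(a = 1, b = 2, c = 3)

# c() with no arguments is NULL, so these two assignments are equivalent:
is.null(c())                      # TRUE

# unname() gives a copy without names; named_vec keeps its names
stripped <- unname(named_vec)
names(stripped)                   # NULL
names(named_vec)                  # "a" "b" "c"

# Remove the names in place, then put them back
names(named_vec) <- NULL
names(named_vec)                  # NULL
names(named_vec) <- c("a", "b", "c")
```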
Matrices
We can convert vectors to matrices, as earlier
int_mat <- matrix(1:6, ncol = 2)
R is column major so fills columns by default
row_mat <- matrix(1:6, ncol = 2, byrow = TRUE)
Matrices
We can assign row names & column names after creation
colnames(row_mat) <- c("odds", "evens")
Or using dimnames()
dimnames(row_mat)
This is a list of length 2 with rownames then colnames as the components.
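Since dimnames() is a list of row names then column names, both can be assigned in one step by supplying such a list. A minimal sketch; the row names r1 to r3 are made up for illustration.

```r
row_mat <- matrix(1:6, ncol = 2, byrow = TRUE)

# Assign row names and column names together as a list of length 2
dimnames(row_mat) <- list(
  c("r1", "r2", "r3"),    # hypothetical row names
  c("odds", "evens")      # column names
)

rownames(row_mat)         # "r1" "r2" "r3"
colnames(row_mat)         # "odds" "evens"

# Names can then be used for subsetting
row_mat["r2", "evens"]    # 4
```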
Lists
my_list <- list(int_vec, dbl_vec)
names(my_list) <- c("integers", "doubles")
OR
my_list <- list(integers = int_vec, doubles = dbl_vec)
Lists
What happens if we try this?
my_list$logical <- logi_vec
Data Frames
This is exactly the same as creating lists, but the names attribute will also be the colnames()
my_df <- data.frame(doubles = dbl_vec, logical = logi_vec)
names(my_df) == colnames(my_df)
[1] TRUE TRUE
Data Frames
What happens if we try to combine components that aren’t the same length?
my_df <- data.frame(
integers = int_vec,
doubles = dbl_vec, logical = logi_vec
)
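To explore this safely, the call can be wrapped in tryCatch() so that any error message is captured rather than stopping the script. This is only a sketch: the vectors are re-created so it runs on its own, and ok_df is an illustrative name.

```r
int_vec <- 1:5
dbl_vec <- c(0.618, 1.414, 2)
logi_vec <- c(TRUE, TRUE, FALSE)

# int_vec has length 5 but the other two have length 3
result <- tryCatch(
  data.frame(integers = int_vec, doubles = dbl_vec, logical = logi_vec),
  error = function(e) conditionMessage(e)   # keep the message as text
)
print(result)

# Trimming int_vec to length 3 gives three columns of equal length
ok_df <- data.frame(
  integers = int_vec[1:3],
  doubles = dbl_vec, logical = logi_vec
)
dim(ok_df)   # 3 3
```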