Understanding How R Sees Data
RAdelaide 2024
Vectors
Vectors
The key building blocks for R objects: Vectors
- There is no such thing as a scalar in R
- Everything is based around the concept of a vector
What is a vector?
Definition
A vector is zero or more values of the same type
Vectors
A simple vector would be
[1] 1 2 3 4 5 6 7 8 9 10
What type of values are in this vector?
Vectors
Another vector might be
[1] "a" "cat" "video"
What type of values are in this vector?
Vectors
[1] "742" "Evergreen" "Tce"
What type of values are in this vector?
The 4 Atomic Vector Types
- Atomic Vectors are the building blocks for everything in R
- There are four main types
- Plus two we can ignore
Logical Vectors
- logical: Can only hold the values TRUE or FALSE
logi_vec <- c(TRUE, TRUE, FALSE)
print(logi_vec)
[1] TRUE TRUE FALSE
- Spell out that when you type an object’s name, you’re calling print()
- Also mention that in the 1970s we didn’t have printers, so it means print the object to the screen
Integer Vectors
- logical
- integer: Counts, ranks or indexing positions
int_vec <- 1:5
print(int_vec)
[1] 1 2 3 4 5
Double Precision Vectors
- logical
- integer
- double: Often (& lazily) referred to as numeric
dbl_vec <- c(0.618, 1.414, 2)
print(dbl_vec)
Why are these called doubles?
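The name comes from double-precision floating point: each value is stored in 64 bits, twice the width of a 32-bit single-precision number. A minimal sketch of what that means in practice (the values are illustrative and not taken from the slides):

```r
# A bare number is stored as a double unless marked with the L suffix
typeof(2)    # "double"
typeof(2L)   # "integer"

# Number of bits of precision available to a double on this machine
.Machine$double.digits

# Finite precision means some decimal fractions are only approximated
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compare with a tolerance instead
```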
Character Vectors
- logical
- integer
- double
- character
char_vec <- c("blue", "red", "green")
print(char_vec)
The 4 Atomic Vector Types
These are the basic building blocks for all R objects
- logical
- integer
- double
- character
. . .
- There are two more rare types we’ll ignore: complex & raw
- All R data structures are built on these 6 vector types
Properties of a vector
What four defining properties might a vector have?
- The actual values
- Length, accessed by the function length()
- The type, accessed by the function typeof()
  - Similar but preferable to class()
- Any optional & additional attributes \(\implies\) attributes()
  - Holds data such as names etc.
Properties of a vector
Let’s try them on our vectors
typeof(char_vec)
length(int_vec)
attributes(logi_vec)
class(dbl_vec)
typeof(dbl_vec)
Were you surprised by any of the results?
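For reference, a sketch of what those calls return, with the four vectors recreated so the block runs on its own. The likely surprise is that class() reports "numeric" where typeof() reports "double".

```r
# Recreate the four vectors from the earlier slides
logi_vec <- c(TRUE, TRUE, FALSE)
int_vec <- 1:5
dbl_vec <- c(0.618, 1.414, 2)
char_vec <- c("blue", "red", "green")

typeof(char_vec)      # "character"
length(int_vec)       # 5
attributes(logi_vec)  # NULL: a plain vector carries no extra attributes
class(dbl_vec)        # "numeric"
typeof(dbl_vec)       # "double"
```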
Working with Vectors
We can combine two vectors in R, using the function c()
c(1, 2)
[1] 1 2
- The numbers 1 & 2 were both vectors with length 1
- We have combined two vectors of length 1, to make a vector of length 2
Working with Vectors
What would happen if we combined two vectors of different types?
. . .
new_vec <- c(logi_vec, int_vec)
print(new_vec)
typeof(new_vec)
Working with Vectors
Q: What happened to the logical values?
. . .
Answer: R will coerce them into a common type (i.e. integers).
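A short sketch of that coercion, with the vectors redefined so it runs on its own. The last two lines go one step beyond the slide: they assume the usual ordering logical, integer, double, character, where c() always moves to the most general type present.

```r
logi_vec <- c(TRUE, TRUE, FALSE)
int_vec <- 1:5

new_vec <- c(logi_vec, int_vec)
new_vec          # 1 1 0 1 2 3 4 5
typeof(new_vec)  # "integer"

# Adding a double or a character moves the result further along the ordering
typeof(c(new_vec, 0.5))   # "double"
typeof(c(new_vec, "a"))   # "character"
```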
Coercion
Coercion
What other types could logical vectors be coerced into?
. . .
Try using the functions as.integer(), as.double() & as.character() on logi_vec
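One possible run of that exercise. The sum() line is an extra illustration, not part of the slide, showing why logical-to-number coercion is handy.

```r
logi_vec <- c(TRUE, TRUE, FALSE)

as.integer(logi_vec)    # 1 1 0
as.double(logi_vec)     # 1 1 0
as.character(logi_vec)  # "TRUE" "TRUE" "FALSE"

# Because TRUE counts as 1, sum() counts the TRUE values
sum(logi_vec)           # 2
```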
Coercion
- Can numeric vectors be coerced into logical vectors?
- Can character vectors be coerced into numeric vectors?
simp_vec <- c(742, "Evergreen", "Terrace")
simp_vec
[1] "742" "Evergreen" "Terrace"
as.numeric(simp_vec)
[1] 742 NA NA
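A sketch covering both questions. The numeric input to as.logical() is made up for illustration, and suppressWarnings() is only there to hide the "NAs introduced by coercion" warning that as.numeric() raises.

```r
# Numeric to logical: zero becomes FALSE, anything else becomes TRUE
as.logical(c(0, 0.618, 2))   # FALSE TRUE TRUE

# Character to numeric: values that cannot be parsed become NA, with a warning
simp_vec <- c(742, "Evergreen", "Terrace")
num_vec <- suppressWarnings(as.numeric(simp_vec))
num_vec          # 742 NA NA
is.na(num_vec)   # FALSE TRUE TRUE
```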
Subsetting Vectors
Subsetting Vectors
One or more elements of a vector can be called using []
y <- c("A", "B", "C", "D", "E")
y[2]
[1] "B"
y[1:3]
[1] "A" "B" "C"
Subsetting Vectors
Double brackets ([[]]) can be used to return single elements only
y[[2]]
[1] "B"
If you tried y[[1:3]] you would receive an error message
Unless you’re the Seurat package developers…
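A minimal sketch of that failure on a plain character vector, using tryCatch() so the error is captured rather than stopping the script:

```r
y <- c("A", "B", "C", "D", "E")

# tryCatch() returns the error object instead of halting
result <- tryCatch(
  y[[1:3]],
  error = function(e) e
)
inherits(result, "error")   # TRUE: [[ ]] refuses more than one element here
conditionMessage(result)    # the text of the error message

y[1:3]   # single brackets are the right tool for multiple elements
```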
Subsetting Vectors
If a vector has name attributes, we can call values by name
head(euro)
ATS BEF DEM ESP FIM FRF
13.76030 40.33990 1.95583 166.38600 5.94573 6.55957
euro["ESP"]
ESP
166.386
Subsetting Vectors
Try repeating the call-by-name approach using double brackets
euro["ESP"]
euro[["ESP"]]
. . .
What was the difference in the output?
. . .
- Using [] returned the vector with the identical structure
- Using [[]] removed the attributes & just gave the value
Ask about Seurat & mention its unconventional behaviour
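The difference between the two calls can be checked directly by inspecting the names attribute of each result. The object names by_single and by_double are made up for this sketch.

```r
# euro is a named numeric vector shipped with R in the datasets package
by_single <- euro["ESP"]
by_double <- euro[["ESP"]]

names(by_single)   # "ESP": the names attribute is kept
names(by_double)   # NULL: only the bare value is returned
by_double          # 166.386
```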
Subsetting Vectors
Is it better to call by position, or by name?
Things to consider:
- Which is easier to type on the fly?
- Which is easier to read?
- Which is more robust to undocumented changes in an object?
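One way to explore the robustness question is a small experiment. The prices vector is hypothetical; it is simply reordered to mimic an undocumented change in an object.

```r
# Hypothetical example: the same prices stored in two different orders
prices <- c(apple = 1.2, banana = 0.5, cherry = 3)
reordered <- prices[c("cherry", "apple", "banana")]

# Position 2 now points at a different element
prices[2]      # banana
reordered[2]   # apple

# The name still retrieves the same value from both
prices[["banana"]] == reordered[["banana"]]   # TRUE
```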
Extracting Multiple Values
What is really happening in this line?
euro[1:5]
ATS BEF DEM ESP FIM
13.76030 40.33990 1.95583 166.38600 5.94573
. . .
We are using the integer vector 1:5 to extract values from euro
. . .
int_vec
[1] 1 2 3 4 5
euro[int_vec]
ATS BEF DEM ESP FIM
13.76030 40.33990 1.95583 166.38600 5.94573
Vector Operations
R functions are designed to work on vectors
dbl_vec - 1
dbl_vec > 1
dbl_vec^2
mean(dbl_vec)
sd(dbl_vec)
sqrt(int_vec)
This is one of the real strengths of R
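A sketch of what "working on vectors" means: each call is applied to every element without writing a loop. The explicit loop at the end is included only for comparison and is not how the slides suggest doing it.

```r
dbl_vec <- c(0.618, 1.414, 2)
int_vec <- 1:5

dbl_vec - 1      # -0.382  0.414  1.000
dbl_vec > 1      # FALSE TRUE TRUE
sqrt(int_vec)    # one square root per element

# The equivalent explicit loop, for comparison only
out <- numeric(length(dbl_vec))
for (i in seq_along(dbl_vec)) {
  out[i] <- dbl_vec[i] - 1
}
all.equal(out, dbl_vec - 1)   # TRUE
```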
Vector Operations
We can also combine the above logical test and subsetting
dbl_vec
[1] 0.618 1.414 2.000
dbl_vec > 1
[1] FALSE TRUE TRUE
. . .
dbl_vec[dbl_vec > 1]
Vector Operations
An additional logical test: %in% (read as: “is in”)
dbl_vec
[1] 0.618 1.414 2.000
int_vec
[1] 1 2 3 4 5
dbl_vec %in% int_vec
. . .
Returns TRUE/FALSE for each value in dbl_vec if it is in int_vec
NB: int_vec was coerced silently to a double vector
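A sketch of the result, plus the same test used for subsetting. The reversed call is an extra illustration showing that the output length follows the left-hand side.

```r
dbl_vec <- c(0.618, 1.414, 2)
int_vec <- 1:5

dbl_vec %in% int_vec            # FALSE FALSE TRUE
dbl_vec[dbl_vec %in% int_vec]   # 2

# %in% always returns one value per element on its left-hand side
int_vec %in% dbl_vec            # FALSE TRUE FALSE FALSE FALSE
```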
Matrices
Matrices
- Vectors are strictly one dimensional and have a length attribute.
- A matrix is the two dimensional equivalent
int_mat <- matrix(1:6, ncol = 2)
print(int_mat)
Matrices
- Matrices can only hold one type of value
- i.e. logical, integer, double, character
- Have additional attributes such as dim(), nrow(), ncol()
- Can have optional rownames() & colnames()
Matrices
Some commands to try:
dim(int_mat)
nrow(int_mat)
typeof(int_mat)
class(int_mat)
attributes(int_mat)
colnames(int_mat)
length(int_mat)
Ask questions if anything is confusing
Matrices
- Use square brackets to extract values by row & column
- The form is x[row, col]
- Leaving either row or col blank selects the entire row/column
int_mat[2, 2]
int_mat[1,]
How would we just get the first column?
Matrices
NB: Forgetting the comma when subsetting will treat the matrix as a single vector running down the columns
int_mat
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
int_mat[5]
[1] 5
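A sketch of why the fifth element is 5: the matrix is stored as one vector, filled down the first column and then the second. The first line also shows one way to pull out a whole column.

```r
int_mat <- matrix(1:6, ncol = 2)

int_mat[, 1]        # 1 2 3: leaving the row blank returns the whole first column
as.vector(int_mat)  # 1 2 3 4 5 6: the underlying vector, column by column

# The fifth element of that vector sits in row 2, column 2
int_mat[5] == int_mat[2, 2]   # TRUE
```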
Matrices
Requesting a row or column that doesn’t exist is the source of a very common error message
dim(int_mat)
[1] 3 2
int_mat[5,]
Error in int_mat[5, ] : subscript out of bounds
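One defensive pattern is to check the index against nrow() before subsetting. The helper safe_row() is hypothetical and written only for this sketch; it is not a base R function.

```r
int_mat <- matrix(1:6, ncol = 2)

# Hypothetical helper: return NULL instead of failing on a missing row
safe_row <- function(mat, i) {
  if (i < 1 || i > nrow(mat)) {
    return(NULL)
  }
  mat[i, ]
}

safe_row(int_mat, 2)   # 2 5
safe_row(int_mat, 5)   # NULL, rather than "subscript out of bounds"
```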
Arrays
Arrays extend matrices to 3 or more dimensions
Beyond the scope of today, but we just have more commas in the square brackets, e.g.
dim(iris3)
[1] 50 4 3
dimnames(iris3)
. . .
iris3[1,,]
Setosa Versicolor Virginica
Sepal L. 5.1 7.0 6.3
Sepal W. 3.5 3.2 3.3
Petal L. 1.4 4.7 6.0
Petal W. 0.2 1.4 2.5
iris3[1:2,,]
Homogeneous Data Types
- Vectors, Matrices & Arrays are the basic homogeneous data types
- All are essentially just vectors
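A sketch of what "essentially just vectors" means: giving an ordinary vector a dim attribute is enough to make R treat it as a matrix. The object name x is arbitrary.

```r
x <- 1:6
dim(x) <- c(3, 2)   # adding a dim attribute turns the vector into a matrix

is.matrix(x)        # TRUE
attributes(x)       # just $dim, holding 3 2
typeof(x)           # "integer": the values themselves are unchanged
all(x == matrix(1:6, ncol = 2))   # TRUE
```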
Heterogeneous Data Types
Heterogeneous Data Types
Summary of main data types in R
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1d | vector | list |
2d | matrix | data.frame |
3d+ | array | |
Lists
A list is a heterogeneous vector.
- Each component is an R object
- Can be a vector, or matrix
- Could be another list
- Any other R object type we haven’t seen yet
These are incredibly common in R
Lists
Many R functions provide output as a list
testResults <- t.test(dbl_vec)
class(testResults)
typeof(testResults)
testResults
NB: There is a function (print.htest()) that tells R how to print the results to the Console
Lists
Explore the various attributes of the object testResults
attributes(testResults)
length(testResults)
names(testResults)
typeof(testResults)
Lists
We can call the individual components of a list using the $ symbol followed by the name
testResults$statistic
testResults$conf.int
testResults$method
Note that each component is quite different to the others.
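A sketch that makes those differences concrete by checking the type and length of three components. dbl_vec is redefined so the block runs on its own.

```r
dbl_vec <- c(0.618, 1.414, 2)
testResults <- t.test(dbl_vec)

class(testResults)    # "htest": the class that print.htest() responds to
typeof(testResults)   # "list": the underlying structure

# Each component can be a different type and length
typeof(testResults$statistic)   # "double"
length(testResults$conf.int)    # 2: lower and upper limits
typeof(testResults$method)      # "character"
```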
Subsetting Lists
A list is a vector so we can also subset using the [] method
testResults[1]
typeof(testResults[1])
- Using single square brackets returns a list
- i.e. is a subset of the larger object and of the same type
Subsetting Lists
Double brackets again retrieve a single element of the vector
- Returns the actual component as the underlying R object
testResults[[1]]
typeof(testResults[[1]])
When would we use either method?
. . .
We can also use names instead of positions
testResults[c("statistic", "p.value")]
testResults[["statistic"]]
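A sketch of when the choice matters: single brackets keep the list wrapper, so only the double-bracket result can go straight into arithmetic. dbl_vec is redefined so the block runs on its own.

```r
dbl_vec <- c(0.618, 1.414, 2)
testResults <- t.test(dbl_vec)

typeof(testResults[1])     # "list": still a list, of length 1
typeof(testResults[[1]])   # "double": the component itself

# Only the extracted component can be used directly in arithmetic
testResults[["statistic"]] * 2
# testResults["statistic"] * 2 would fail, as a list is not numeric
```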
Lists
- Note also the Environment Tab in the top right of RStudio
- Click the arrow next to testResults to expand the entry
- This is the output of str(testResults)
Data Frames
Data Frames
Finally!
- These are the most common type of data you will work with
- Each column is a vector
- Columns can be different types of vectors
- Column vectors MUST be the same length
Data Frames
- Analogous to matrices, but are specifically for heterogeneous data
- Have many of the same attributes as matrices: dim(), nrow(), ncol(), rownames(), colnames()
- colnames() & rownames() are NOT optional \(\implies\) assigned by default
- tibble variants have simple row numbers as rownames
Data Frames
Let’s load pigs again
library(tidyverse)
pigs <- read_csv("data/pigs.csv")
head(pigs)
. . .
Try these commands
colnames(pigs)
dim(pigs)
nrow(pigs)
Data Frames
Individual entries can also be extracted using the square brackets
pigs[1:2, 1]
. . .
We can also refer to columns by name (same as matrices)
pigs[1:2, "len"]
Data Frames
Thinking of columns being vectors is quite useful
- We can call each column vector of a data.frame using the $ operator
pigs$len[1:2]
This does NOT work for rows!!!
Data Frames
R is column major by default (as is FORTRAN & Matlab)
- Very common in the 1970s
- Many other languages are row major, e.g. C/C++, Python
R was designed for statistical analysis, but has developed capabilities far beyond this
We will see this advantage this afternoon
Data Frames & Lists
Data Frames & Lists
Data frames are actually special cases of lists
- Each column of a data.frame is a component of a list
- The components must all be vectors of the same length
- Data Frames can be treated identically to a list
- Have additional subsetting operations and attributes
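A sketch of the list nature of a data frame. The small my_df object is hypothetical and stands in for a data set read from a file.

```r
# Hypothetical data frame, standing in for one read from a file
my_df <- data.frame(id = 1:3, score = c(0.618, 1.414, 2))

typeof(my_df)    # "list"
is.list(my_df)   # TRUE
length(my_df)    # 2: the number of columns, i.e. list components

# List tools work column by column
sapply(my_df, typeof)   # id = "integer", score = "double"
```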
Data Frames & Lists
Forgetting the comma now gives a completely different result to a matrix!
pigs[1]
Was that what you expected?
Try using the double bracket method
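A sketch of the comparison on a hypothetical base data.frame, since pigs is read from a file that is not recreated here. Without the comma, the brackets follow list rules and select columns.

```r
my_df <- data.frame(id = 1:3, score = c(0.618, 1.414, 2))   # hypothetical data

class(my_df[1])     # "data.frame": a one-column data frame
class(my_df[[1]])   # "integer": the column vector itself
my_df[1, ]          # with the comma, the first row is returned
```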
Common Data Frame Errors
What do you think will happen if we type:
pigs[5]
Error: Column index must be at most 3 if positive, not 5
Working With R Objects
Name Attributes
How do we assign names?
named_vec <- c(a = 1, b = 2, c = 3)
. . .
OR we can name an existing vector
names(int_vec) <- c("a", "b", "c", "d", "e")
Name Attributes
Can we remove names?
The NULL, or empty, vector in R is created using c()
null_vec <- c()
length(null_vec)
Name Attributes
We can also use this to remove names
names(named_vec) <- c()
Don’t forget to put the names back…
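A sketch of removing names and restoring them. unname() is an alternative not mentioned on the slide; it returns a copy without names and leaves the original untouched.

```r
named_vec <- c(a = 1, b = 2, c = 3)

is.null(c())               # TRUE: c() with no arguments is NULL
names(named_vec) <- NULL   # equivalent to assigning c()
names(named_vec)           # NULL

# Put the names back, then strip them from a copy only
named_vec <- c(a = 1, b = 2, c = 3)
unname(named_vec)          # 1 2 3
names(named_vec)           # "a" "b" "c"
```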
Matrices
We can convert vectors to matrices, as earlier
int_mat <- matrix(1:6, ncol = 2)
R is column major so fills columns by default
row_mat <- matrix(1:6, ncol = 2, byrow = TRUE)
Matrices
We can assign row names & column names after creation
colnames(row_mat) <- c("odds", "evens")
Or using dimnames()
dimnames(row_mat)
This is a list of length 2 with rownames then colnames as the components.
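A sketch of setting both sets of names in one step through dimnames(). The row names r1 to r3 are made up for illustration.

```r
row_mat <- matrix(1:6, ncol = 2, byrow = TRUE)

# Component 1 holds the row names, component 2 the column names
dimnames(row_mat) <- list(c("r1", "r2", "r3"), c("odds", "evens"))

row_mat["r2", "evens"]   # 4
colnames(row_mat)        # "odds" "evens"
```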
Lists
my_list <- list(int_vec, dbl_vec)
names(my_list) <- c("integers", "doubles")
OR
my_list <- list(integers = int_vec, doubles = dbl_vec)
Lists
What happens if we try this?
my_list$logical <- logi_vec
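A sketch of the outcome, with the list rebuilt from literal values so the block runs on its own. Removing the component again by assigning NULL is an extra step not shown on the slide.

```r
logi_vec <- c(TRUE, TRUE, FALSE)
my_list <- list(integers = 1:5, doubles = c(0.618, 1.414, 2))

my_list$logical <- logi_vec   # a name that does not exist yet adds a component
names(my_list)                # "integers" "doubles" "logical"

my_list$logical <- NULL       # assigning NULL removes the component again
length(my_list)               # 2
```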
Data Frames
This is exactly the same as creating lists, but the names attribute will also be the colnames()
my_df <- data.frame(doubles = dbl_vec, logical = logi_vec)
names(my_df) == colnames(my_df)
[1] TRUE TRUE
Data Frames
What happens if we try to combine components that aren’t the same length?
my_df <- data.frame(
  integers = int_vec,
  doubles = dbl_vec, logical = logi_vec
)
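A sketch of the outcome, wrapped in tryCatch() so the error is captured rather than stopping the script. The list at the end is included for contrast, since plain lists do not require equal lengths.

```r
int_vec <- 1:5
dbl_vec <- c(0.618, 1.414, 2)
logi_vec <- c(TRUE, TRUE, FALSE)

result <- tryCatch(
  data.frame(integers = int_vec, doubles = dbl_vec, logical = logi_vec),
  error = function(e) e
)
inherits(result, "error")   # TRUE: lengths 5 and 3 cannot be combined

# A list has no such restriction
my_list <- list(integers = int_vec, doubles = dbl_vec, logical = logi_vec)
lengths(my_list)            # 5 3 3
```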