Loading Data Into R
ASI: Introduction to R
Data In R
- Working with data in R is very different to Excel
- Can have complicated structures or be very simple (e.g. x <- 1:5)
- Spreadsheet-like data is very common
  - The R equivalent is known as a data.frame
  - Has many variants, e.g. tbl_df or tibble (SQL-inspired)
  - We’ll mainly use the tibble variant today
- We import the data as an R object
  - All analysis is performed on the R object
  - Almost never modify the source file
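A minimal sketch of the two extremes described above, using base R only. The column names id and label are made up for illustration.

```r
# A very simple structure: an integer vector
x <- 1:5

# A spreadsheet-like structure: each column is a vector of equal length
# (the column names `id` and `label` are illustrative only)
df <- data.frame(id = x, label = letters[x])

# One line per column, showing its type and first few values
str(df)
```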
Importing Data
- Cell formatting will be ignored by R
- Plots will also be ignored
- Blank rows are not fatal, just annoying
- Mixtures of numbers and text in a column are a problem
  - data.frames are structured with vectors as columns, and a vector holds a single data type
- Deleted cells are sometimes imported as blank rows/columns
- Comma-separated or tab-separated files are favoured for R
  - i.e. plain text, or just the data
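To see why mixed columns cause trouble, here is a small base R sketch. The value "unknown" is an invented example of text sitting in an otherwise numeric column.

```r
# A vector holds one type, so mixing numbers and text gives text
mixed <- c(4.2, 11.5, "unknown")
class(mixed)

# Converting back to numbers turns the text entry into NA
# (R normally warns about this; the warning is suppressed here)
as_numbers <- suppressWarnings(as.numeric(mixed))
as_numbers
```

A single stray text entry is enough to make the whole column character, which is why such columns are best cleaned up in the source data.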
Other Common Excel Issues
- Excel thinks everything is a date
  - Septin genes are now officially named SEPTIN1, not SEPT1, etc. (Ziemann, Eren, and El-Osta 2016)
Preparation
- File > New File > R Script (or Ctrl+Shift+N)
- Save as DataImport.R
Import Using the GUI
Importing Data
- Preview the file pigs.csv by clicking on it (View File)
  - Try in Excel if you prefer, but DO NOT save anything from Excel
- The data measures tooth (i.e. odontoblast) length in guinea pigs
  - Using 3 dose levels of Vitamin C (“Low”, “Med”, “High”)
  - Vitamin C was given in drinking water or using orange juice
  - “OJ” or “VC”
Using the GUI To Load Data
Click on pigs.csv and choose Import Dataset, then stop!
(Click Update if you don’t see this)
The Preview Window
- This is another preview of the data before we import it
- There are 3 columns: len, supp and dose
  - len is a double (numeric)
  - The other two are character columns
What just happened?
The code we copied has 3 lines:
- Loads the package readr using library(readr)
  - Packages are collections (i.e. libraries) of related functions
  - All readr functions are about importing data
- readr contains the function read_csv()
- read_csv() tells R what to do with a csv file
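The three lines would look roughly like the sketch below. The exact path RStudio writes depends on where pigs.csv lives on your machine, so this version assumes a small stand-in file in a temporary folder, holding only the four rows printed later in these slides.

```r
# Line 1: load the package that provides read_csv()
library(readr)

# Stand-in for the real pigs.csv (assumption: only four rows, for illustration)
pigs_file <- file.path(tempdir(), "pigs.csv")
writeLines(
  c("len,supp,dose", "4.2,VC,Low", "11.5,VC,Low", "7.3,VC,Low", "5.8,VC,Low"),
  pigs_file
)

# Line 2: read the file and store the result as the R object `pigs`
pigs <- read_csv(pigs_file)

# Line 3: open the spreadsheet-style preview (only in an interactive session)
if (interactive()) View(pigs)
```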
Let’s Demonstrate
- In the Environment Tab click the broom icon
  - This will delete everything from your R Environment
  - It won’t unload the packages
- Select the code we’ve just pasted and send it to the console
  - Reloading the packages won’t hurt
- Check the Environment Tab again and pigs is back
- You can delete the line View(pigs)
Realistically we only need to preview it the first time. Having that preview open every time actually ends up being really annoying.
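A rough console equivalent of the broom, assuming it behaves like rm(list = ls()). The object name tmp_values is made up for this sketch.

```r
# Create something to clear (hypothetical object name)
tmp_values <- 1:5

# Remove all (non-hidden) objects from the workspace
rm(list = ls())

# The object is gone...
exists("tmp_values")

# ...but attached packages are untouched
"package:stats" %in% search()
```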
Data Frame Objects
- The object pigs is known as a data.frame
  - Very similar to an SQL table
  - R equivalent to a spreadsheet
  - Missing values (blank cells) are usually filled with NA
  - Must have column names \(\implies\) row names becoming less common
- A tibble is a data.frame which prints nicely to your screen
  - Cannot have rownames though
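A small sketch of how a blank cell becomes NA on import. It assumes readr 2.0 or later, where literal text wrapped in I() is read as if it were a file. The second row is invented for illustration.

```r
library(readr)

# Literal CSV text; the second data row has a blank `len` cell
csv_text <- I("len,supp,dose\n4.2,VC,Low\n,OJ,High\n")
blanks <- read_csv(csv_text, show_col_types = FALSE)

blanks

# TRUE marks the blank cell, which was imported as NA
is.na(blanks$len)
```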
Tibble Objects
- readr uses a variant called a tbl_df or tbl (pronounced tibble)
  - A data.frame with nice bonus features (e.g. prints a summary only)
  - Similar to a SQL table
  - Can only have row numbers for row names
  - Is a foundational structure in the tidyverse
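To check that a tibble really is a data.frame with extras, compare the classes. This sketch assumes the tibble package is installed and uses two of the rows shown elsewhere in these slides.

```r
library(tibble)

# A plain data.frame and its tibble version
pigs_df  <- data.frame(len = c(4.2, 11.5), supp = "VC", dose = "Low")
pigs_tbl <- as_tibble(pigs_df)

class(pigs_df)            # just a data.frame
class(pigs_tbl)           # tbl_df and tbl, on top of data.frame
is.data.frame(pigs_tbl)   # so functions expecting a data.frame still work
```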
The Tidyverse
- The tidyverse is a collection of thematically-linked packages
  - Produced by developers from RStudio/Posit
  - Often referred to as tidy-programming or similar
- Calling library(tidyverse) loads the core packages of this collection
  - Several convenient packages in one line (9 in tidyverse 2.0.0)
- readr is one of these \(\implies\) usually just load the tidyverse
library(tidyverse)
Functions
Functions in R
head(pigs)
glimpse(pigs)
- Here we have called the functions 1) head() and 2) glimpse()
  - They were both executed on the object pigs
- Call the help page for head()
?head
(if you get multiple options, choose the one from utils)
Function Arguments
- head() prints the first part of an object
  - Useful for very large objects (e.g. if we had 1000 pigs)
- We can change the number of rows shown to us
head(pigs, 4)
# A tibble: 4 × 3
    len supp  dose
  <dbl> <chr> <chr>
1   4.2 VC    Low
2  11.5 VC    Low
3   7.3 VC    Low
4   5.8 VC    Low
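The second value passed to head() is the argument n, which can also be given by name. A short base R sketch, using a stand-in pigs built from the four rows shown above:

```r
# Stand-in for pigs (a plain data.frame with the four rows shown above)
pigs <- data.frame(
  len  = c(4.2, 11.5, 7.3, 5.8),
  supp = "VC",
  dose = "Low"
)

# Passing the argument by position or by name gives the same result
identical(head(pigs, 2), head(pigs, n = 2))

# A negative n drops rows from the end instead: all but the last row
head(pigs, n = -1)
```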
Understanding read_csv()
- Earlier we called the R function read_csv()
- Check the help page
?read_csv
- We have four functions shown but stick to read_csv()
Closing Comments
read_csv() Vs read.csv()
- RStudio now uses read_csv() from readr by default
- You will often see read.csv() in older scripts (from utils)
- The newer (readr) version is:
  - slightly faster
  - more user-friendly
  - gives informative messages
  - always returns a tibble
- Earlier functions in utils are read.*() (csv, delim etc.)
- readr has the functions read_*() (csv, tsv, delim etc.)
- I always use the newer ones
Reading Help Pages: Bonus Slide
- The bottom three functions are simplified wrappers to read_delim()
  - read_csv() calls read_delim() using delim = ","
  - read_csv2() calls read_delim() using delim = ";"
  - read_tsv() calls read_delim() using delim = "\t"
What function would we call for space-delimited files?
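One possible answer, sketched with literal text instead of a file (this assumes readr 2.0 or later for I()): call read_delim() directly and set delim yourself.

```r
library(readr)

# The same single row of data, comma-delimited and space-delimited
csv_text   <- I("len,supp,dose\n4.2,VC,Low\n")
space_text <- I("len supp dose\n4.2 VC Low\n")

from_csv   <- read_delim(csv_text, delim = ",", show_col_types = FALSE)
from_space <- read_delim(space_text, delim = " ", show_col_types = FALSE)

# Both give the same three columns
names(from_csv)
names(from_space)
```

This works when fields are separated by exactly one space. Files padded with varying amounts of whitespace may need a different approach.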
Loading Excel Files
- The package readxl is for loading .xls and .xlsx files
- Not part of the core tidyverse but very compatible
library(readxl)
- The main function is read_excel()
?read_excel
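A minimal sketch of read_excel() in use. It relies on datasets.xlsx, an example workbook that ships with readxl, rather than any file from this session.

```r
library(readxl)

# Path to an example workbook bundled with readxl
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets, then import the first one
excel_sheets(xlsx_path)
first_sheet <- read_excel(xlsx_path, sheet = 1)

head(first_sheet)
```

As with read_csv(), the result is a tibble, so everything covered above applies.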
References
Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biol. 17 (1): 177.