Importing Spreadsheet-Like Data
RAdelaide 2024
Importing Data
Data In R
- Working with data in R is very different to Excel
- Can have complicated structures or be simple (e.g.
x <- 1:5
) - Spreadsheet-like data is very common
- The
R
equivalent to a spreadsheet is known as adata.frame
- A common subtype is known as a
tibble
- These were originally called
tbl_df
objects referring to SQL tables - Is just a
data.frame
with pretty bows & ribbons
- A common subtype is known as a
Data In R
- We import the data as an
R
object- All analysis is performed on the
R
object - Almost never modify the source file
- All analysis is performed on the
Importing Data
- Cell formatting will be ignored by R
- Plots will also be ignored
- Blank rows/columns are not fatal, just annoying
- Mixtures of numbers and text in a column
data.frame
s are structured with vectors as columns
- Deleted cells are sometimes imported as blank rows/columns
- Comma-separated or tab-separated files are favoured for
R
- i.e. plain text, or just the data
- Can also import data from Stata, SPSS or other statistical languages
Other Common Excel Issues
If we’re not careful:
- Excel thinks everything is a date:
- Septin genes are now officially named SEPTIN2 etc (not SEPT2)
- Fractions are also not dates…
- Excel will remove leading zeroes (e.g. phone numbers, catalog ids)
- No record of any steps we’ve performed by clicking on something
- Very common sources of broken data \(\implies\) may need fixing
- ’000s of publications with dates for gene names or results
Mention my former collaborator who would often have completely different results in the ones I’d send him - To sort by p-value, he’d select the p-value column & sort (just that column) - There was no record of this. Only discovered by sitting down with him
Preparation
File
>New File
>R Script
(OrCtrl+Shift+N
)- Save as
GuineaPigs.R
Preparation
Then get the data for this exercise.
- Download the file
data.zip
from the workshop homepage - Place in your directory
RAdelaide24
- Extract which should create a folder named
data
- Make sure your files are in
data
not indata/data
- Make sure your files are in
- This should contain all files for the workshop
- Navigate to the
data
directory using theFiles
pane
(You should see pigs.csv
in data
)
Import Using the GUI
Importing Data
- Preview the file
pigs.csv
by clicking on it (View File
)- Try in Excel if you prefer, but DO NOT save anything from Excel
. . .
- The data measures tooth (i.e. odontoblast) length in guinea pigs
- Using 3 dose levels of Vitamin C (“Low”, “Med”, “High”)
- Vitamin C was given in drinking water or using orange juice
- “OJ” or “VC”
Importing Data
- This type of data is very easy to manage in
R
- Plain text with comma delimiters
- Simple column structure with column names
- No blank rows at the top or separating sub-tables
- No blank columns
- No rownames
Using the GUI To Load Data
Click on pigs.csv
, choose Import Dataset
then stop! 🛑
(Click Update
if you don’t see this)
The Preview Window
- This is another preview of the data before we import it
- There are 3 columns:
len
,supp
anddose
len
is a double (numeric)- The other two are character columns
The Preview Window
The Preview Window
- Select and copy all the code in the
Code Preview
Box- We’ll paste this somewhere in a minute…
. . .
- Click
Import
. . .
- Magic happens!!!
- Ignore the coloured text. This is just ‘helpful information’
. . .
Now paste the copied code at the top of your script
What just happened?
The code we copied has 3 lines:
library(readr)
loads the packagereadr
- Packages are collections (i.e. libraries) of related functions
- All
readr
functions are about importing data
readr
contains the functionread_csv()
read_csv()
tells R what to do with a csv file
What just happened?
The code we copied has 3 lines:
- This line actually loads the data into your
R Environment
- It creates an object named
pigs
by using the file name (pigs.csv) - Can change this name if we wish
What just happened?
The code we copied has 3 lines:
- Opens a preview in a familiar
Excel-like
format- I personally never use this
Close the preview by clicking the cross
What just happened?
- We have just loaded data using the default settings of
read_csv()
- The object
pigs
is now in ourR Environment
- The original file remains on our HDD without modification!!!
- The code is saved in our script
\(\implies\) we don’t need the GUI for this operation again!
Let’s Demonstrate
- In the
Environment Tab
click the broom icon 🧹- This will delete everything from your
R Environment
- It won’t unload the packages
- This will delete everything from your
. . .
- Highlight the code we’ve just pasted and click
Run
- Reloading the packages won’t hurt
. . .
- Check the
Environment Tab
again andpigs
is back
. . .
You can delete the line View(pigs)
Realistically we only need to preview it the first time. Having that preview open every time actually ends up being really annoying
Our First R Script
- This is now our first R script
- An R script is a plain text file
- No objects are stored
. . .
- Individual lines can be run using
Ctrl/Cmd + Enter
- Can select multiple lines & run
Our First R Script
- Insert a new line before
library(readr)
then type
## Load packages first
. . .
- This is a comment \(\implies\) no executable code
- Communicates to “future” you what you’re doing
- Only needs a single
#
to be a comment
Our First R Script
- Insert a line after
library(readr)
then enter
## Now load the file "pigs.csv"
## This contains odontoblast lengths from guinea pigs
Data Frame Objects
Data Frame Objects
- The object
pigs
is known as adata.frame
- Very similar to an SQL table
R
equivalent to a spreadsheet- Must have column names
- row names becoming less common (sometimes just the row numbers)
- Missing values (blank cells) are usually filled with
NA
Data Frame Objects
Instead of View()
\(\implies\) preview by typing the object name
pigs
# A tibble: 60 × 3
len supp dose
<dbl> <chr> <chr>
1 4.2 VC Low
2 11.5 VC Low
3 7.3 VC Low
4 5.8 VC Low
5 6.4 VC Low
6 10 VC Low
7 11.2 VC Low
8 11.2 VC Low
9 5.2 VC Low
10 7 VC Low
# ℹ 50 more rows
Gives a preview up to 10 lines with:
- The object type:
A tibble
- The full dimensions:
60 X 3
- Column names:
len
,supp
,dose
- Data types:
<dbl>
,<chr>
,<chr>
I personally find this more informative than View()
Tibble Objects
readr
uses a variant called atbl_df
ortbl
(pronounced tibble)- A
data.frame
with convenient features - Similar to a SQL table
- Can only have row numbers for row names
- Is a foundational structure in the
tidyverse
- A
The Tidyverse
- The
tidyverse
is a collection of thematically-linked packages- Produced by developers from RStudio/Posit
- Often referred to as tidy-programming or similar
- Calling
library(tidyverse)
loads all of these packages- \(>\) 30 convenient packages in one line
readr
is one of these \(\implies\) usually just load the tidyverse
library(tidyverse)
tidyverse_packages()
[1] "broom" "conflicted" "cli" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
The Tidyverse
- Replace
library(readr)
withlibrary(tidyverse)
- Execute this line
. . .
- This will load all 31 packages which form the tidyverse
- Reloading a package has no effect (is ignored)
Tibble Objects
- Some additional ways to inspect tibbles are:
head(pigs)
glimpse(pigs)
. . .
glimpse
is from the package (pillar
)- Loaded with
library(tidyverse)
- Loaded with
What were the differences between each method?
Functions
Functions in R
head(pigs)
glimpse(pigs)
- Here we have called the functions
head()
andglimpse()
- They were both executed on the object
pigs
- They were both executed on the object
. . .
- Call the help page for
head()
?head
(if you get multiple options, choose the one from utils
)
Functions in R
- The key place to look at is
head(x, ...)
## Default S3 method:
head(x, n = 6L, ...)
- there are two arguments to
head()
\(\implies\)x
andn
x
has no default value \(\implies\) we need to provide somethingn = 6L
meansn
has a default value of 6 (L \(\implies\)integer
)
Execute head()
to show the error!!!
Functions in R
Lower down the page you’ll see
Arguments
x
an object
n
an integer vector of length up to dim(x) (or 1, for non-dimensioned objects). Blah, blah, blah…
. . .
- Some of the rest is technical detail (can sometimes be very helpful)
Function Arguments
head()
prints the first part of an object- Useful for very large objects (e.g. if we had 1000 guinea pigs)
. . .
- We can change the number of rows shown to us
head(pigs, 4)
# A tibble: 4 × 3
len supp dose
<dbl> <chr> <chr>
1 4.2 VC Low
2 11.5 VC Low
3 7.3 VC Low
4 5.8 VC Low
Function Arguments
- Notice we didn’t provide these as named arguments
- If passing values in order \(\implies\) no need for names
. . .
head(pigs, 4)
# A tibble: 4 × 3
len supp dose
<dbl> <chr> <chr>
1 4.2 VC Low
2 11.5 VC Low
3 7.3 VC Low
4 5.8 VC Low
head(x = pigs, n = 4)
# A tibble: 4 × 3
len supp dose
<dbl> <chr> <chr>
1 4.2 VC Low
2 11.5 VC Low
3 7.3 VC Low
4 5.8 VC Low
Function Arguments
- If we explicitly name arguments \(\implies\) can pass in any order
head(n = 4, x = pigs)
# A tibble: 4 × 3
len supp dose
<dbl> <chr> <chr>
1 4.2 VC Low
2 11.5 VC Low
3 7.3 VC Low
4 5.8 VC Low
Function Arguments
- Check the help page for
glimpse()
- This is from the package
pillar
- This is from the package
- Try changing the
width
argument to see what happens
Understanding read_csv()
- Earlier we called the
R
functionread_csv()
- Check the help page
?read_csv
- We have four functions shown but stick to
read_csv()
Understanding read_csv()
read_csv(
file, col_names = TRUE, col_types = NULL, col_select = NULL,
id = NULL, locale = default_locale(),
na = c("", "NA"), quoted_na = TRUE,
quote = "\"", comment = "",
trim_ws = TRUE,
skip = 0, n_max = Inf,
guess_max = min(1000, n_max),
name_repair = "unique",
num_threads = readr_threads(),
progress = show_progress(),
show_col_types = should_show_types(),
skip_empty_rows = TRUE,
lazy = should_read_lazy()
)
- This function has way too many arguments (
file
,col_names
etc.)
- Most have default values given
- All were defined somewhere in the GUI
- Default assumes column names are the first row (
col_names = TRUE
)
- Ask which arguments have default before moving to the fragment…
Understanding read_csv()
All arguments
were defined somewhere in the GUI.
- Open the GUI Preview by clicking on the file again
- Uncheck the
First Row as Names
checkbox
. . .
- What happened to the code?
- How did the columns change?
. . .
Try clicking/unclicking a few more & try understand the consequences
Understanding read_csv()
- Column types can also be set using shorthand
- “ncc” is number, character, character
<- read_csv("data/pigs.csv", col_types = "ncc") pigs
- Changing one of the character columns to “n” will lead to
NA
s - Changing the numeric column to character would mean … ?
Closing Comments
read_csv()
Vs read.csv()
RStudio
now usesread_csv()
fromreadr
by default- You will often see
read.csv()
(fromutils
) in older scripts - The newer (
readr
) version is:- slightly faster
- more user-friendly
- handles large files more efficiently via indexing
- gives informative messages
- Earlier functions in
utils
areread.*()
(csv, delim etc.) readr
has the functionsread_*()
(csv, tsv, delim etc.)- I always use the newer ones
Reading Help Pages: Bonus Slide
- The bottom three functions are simplified wrappers to
read_delim()
read_csv()
callsread_delim()
usingdelim = ","
read_csv2()
callsread_delim()
usingdelim = ";"
read_tsv()
callsread_delim()
usingdelim = "\t"
What function would we call for space-delimited files?
Loading Excel Files
- The package
readxl
is for loading.xls
andxlsx
files. - Not part of the core tidyverse but very compatible
library(readxl)
. . .
- The main function is
read_excel()
?read_excel
Loading Excel Files
- This file contains multiple sheets
excel_sheets("data/RealTimeData.xlsx")
[1] "Sheet1" "Sheet2" "Sheet3"
I found this file after a random Google search for RT-PCR and Excel about 10 years ago. I didn’t keep track of who created it…
. . .
- Once again we can click on the file \(\implies\)
Import Dataset
Sheet1
looks pretty simple- First column has no name
. . .
- Use the drop-down menu to look at the
Sheet2
&Sheet3
- We’ll learn how to manage this in the next session