Loading Data Into R

ASI: Introduction to R

Author: Dr Stevie Pederson

Affiliation: Black Ochre Data Labs, The Kids Research Institute Australia

Published: September 2, 2025

Loading Data Into R

Data In R

Working with data in R is very different to Excel
Data can have complicated structures or be very simple (e.g. x <- 1:5)
Spreadsheet-like data is very common
- The R equivalent is known as a data.frame
- Has many variants, e.g. tbl_df or tibble (SQL-inspired)
- We’ll mainly use the tibble variant today

We import the data as an R object
- All analysis is performed on the R object
- Almost never modify the source file
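
As a minimal sketch of the two extremes mentioned above, the following builds a simple vector and a small tibble by hand. The object toy and its values are made up for illustration, and tibble() comes from the tibble package, which is assumed to be installed.

```r
library(tibble)

# A very simple R object: an integer vector
x <- 1:5

# A spreadsheet-like object: each column is a vector of the same length
# (toy and its values are hypothetical, not workshop data)
toy <- tibble(
  id = x,
  group = c("a", "a", "b", "b", "b")
)

is.data.frame(toy) # a tibble is still a data.frame
dim(toy)           # number of rows, then number of columns
```

All later steps work on objects like toy, never on the file the data came from.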

Importing Data

Cell formatting will be ignored by R
Plots will also be ignored
Blank rows are not fatal, just annoying
Mixtures of numbers and text in a column cause problems
- data.frames are structured with vectors as columns, and a vector can only hold one data type
Deleted cells are sometimes imported as blank rows/columns
Comma-separated or tab-separated files are favoured for R
- i.e. plain text, or just the data
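
To see what a mixed column does on import, here is a small sketch using a throwaway file. The file contents and the column names sample and weight are invented for this example, and read_csv() is assumed to be available from an installed readr (version 2 or later).

```r
library(readr)

# Write a tiny hypothetical csv where a numeric column contains one text entry
csv_file <- tempfile(fileext = ".csv")
writeLines(c("sample,weight", "a,1.5", "b,unknown", "c,3.2"), csv_file)

mixed <- read_csv(csv_file, show_col_types = FALSE)

# The whole weight column is imported as text, not numbers
class(mixed$weight)

# Converting back to numbers turns the text entry into NA (with a warning)
weight_num <- suppressWarnings(as.numeric(mixed$weight))
weight_num
```

A single stray text entry is enough to change the type of the entire column, which is why these are worth fixing at the source or straight after import.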

Other Common Excel Issues

Excel thinks everything is a date:
- Septin genes are now officially named SEPTIN1 not SEPT1 ¹ etc.
- Fractions are also not dates…
Excel will remove leading zeroes (e.g. phone numbers, catalog ids)
No record of any steps we’ve performed by clicking on something

Papers with genes as dates in gene lists (Ziemann, Eren, and El-Osta 2016)

These are very common sources of broken data $\implies$ may need fixing
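
One way to protect identifiers with leading zeroes is to declare the column types at import instead of relying on guessing. This is only a sketch: the file contents and the names catalog_id and price are hypothetical, and it assumes readr is installed.

```r
library(readr)

# A tiny hypothetical file with identifiers that start with zero
csv_file <- tempfile(fileext = ".csv")
writeLines(c("catalog_id,price", "00123,9.5", "04567,12"), csv_file)

# Explicitly ask for catalog_id as text and price as a number
ids <- read_csv(
  csv_file,
  col_types = cols(catalog_id = col_character(), price = col_double())
)

ids$catalog_id # the leading zeroes are still there
```

Because the import is written as code, the script itself becomes the record of what was done to the data.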

Mention my former collaborator, whose results were often completely different to the ones I’d sent him
- To sort by p-value, he’d select the p-value column & sort (just that column)
- There was no record of this. Only discovered by sitting down with him

Preparation

File > New File > R Script (Or Ctrl+Shift+N)
Save as DataImport.R

Preparation

Download the file data.zip from the workshop homepage
Place in your directory R_Training
Extract the archive here, which should create a folder named data

Make sure your files are in data not in data/data

This should contain all of today’s files
Navigate to the data directory using the Files pane

(You should see pigs.csv in there)

(Make sure people haven’t accidentally created data/data)

Import Using the GUI

Importing Data

Preview the file pigs.csv by clicking on it (View File)
- Try in Excel if you prefer, but DO NOT save anything from Excel

The data measures tooth (i.e. odontoblast) length in guinea pigs
- Using 3 dose levels of Vitamin C (“Low”, “Med”, “High”)
Vitamin C was given in drinking water or using orange juice
- “OJ” or “VC”

Importing Data

This type of data is very easy to manage in R
- Plain text with comma delimiters
- Simple column structure with column names
- No blank rows at the top or separating sub-tables
- No blank columns
- No rownames

Using the GUI To Load Data

Click on pigs.csv, choose Import Dataset, then stop!

(Click Update if you don’t see this)

The Preview Window

This is another preview of the data before we import it
There are 3 columns: len, supp and dose
- len is a double (numeric)
- The other two are character columns

The Preview Window

We also have a preview of the code we’re about to execute

The Preview Window

Select and copy all the code in the Code Preview Box
- We’ll paste this somewhere in a minute…

Click Import
Magic happens!!!
Ignore the red/blue text. This is just ‘helpful’ information

Now paste the copied code at the top of your script

What just happened?

The code we copied has 3 lines:

library(readr)
pigs <- read_csv("data/pigs.csv")
View(pigs)

The 1st line loads the package readr using library(readr)
- Packages are collections (i.e. libraries) of related functions
- All readr functions are about importing data
readr contains the function read_csv()
read_csv() tells R what to do with a csv file

What just happened?

The code we copied has 3 lines:

library(readr)
pigs <- read_csv("data/pigs.csv")
View(pigs)

The 2nd line actually loads the data into your R Environment
It created an object named pigs by using the file name (pigs.csv)
Can change this name if we wish

What just happened?

The code we copied has 3 lines:

library(readr)
pigs <- read_csv("data/pigs.csv")
View(pigs)

The 3rd line, View(pigs), opens a preview in a familiar Excel-like format
- I personally don’t use this

Close the preview by clicking the cross

What just happened?

We have just loaded data using the default settings of read_csv()
The object pigs is now in our R Environment
- The original file remains on our HDD without modification!!!
The code is saved in our script
$\implies$ we don’t need the GUI for this operation again!

Let’s Demonstrate

In the Environment Tab click the broom icon
- This will delete everything from your R Environment
- It won’t unload the packages

Select the code we’ve just pasted and send it to the console
Reloading the packages won’t hurt
Check the Environment Tab again and pigs is back

You can delete the line View(pigs)

Realistically we only need to preview it the first time. Having that preview open every time actually ends up being really annoying

Data Frame Objects

The object pigs is known as a data.frame
- Very similar to an SQL table
R equivalent to a spreadsheet
- Missing values (blank cells) are usually filled with NA
- Must have column names $\implies$ row names becoming less common

A tibble is a data.frame which prints nicely to your screen
- Cannot have rownames though
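
As a quick illustration of how blank cells become NA, the sketch below reads a throwaway file with two gaps. The file contents are invented (only loosely modelled on pigs.csv), and readr is assumed to be installed.

```r
library(readr)

# A hypothetical file with one blank numeric cell and one blank text cell
csv_file <- tempfile(fileext = ".csv")
writeLines(c("len,supp", "4.2,VC", ",OJ", "7.3,"), csv_file)

gaps <- read_csv(csv_file, show_col_types = FALSE)
gaps

# is.na() marks the missing values; colSums() counts them per column
is.na(gaps$len)
colSums(is.na(gaps))
```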

Data Frame Objects

Instead of View() $\implies$ preview by typing the object name

pigs

# A tibble: 60 × 3
     len supp  dose 
   <dbl> <chr> <chr>
 1   4.2 VC    Low  
 2  11.5 VC    Low  
 3   7.3 VC    Low  
 4   5.8 VC    Low  
 5   6.4 VC    Low  
 6  10   VC    Low  
 7  11.2 VC    Low  
 8  11.2 VC    Low  
 9   5.2 VC    Low  
10   7   VC    Low  
# ℹ 50 more rows

Gives a preview up to 10 lines with:

The object type: A tibble
The full dimensions: 60 × 3
Column names: len, supp, dose
Data types: <dbl>, <chr>, <chr>

I personally find this more informative than View()

Data Frame Objects

data.frame objects can be subset using square brackets [row, col]

pigs[1:3, ]

# A tibble: 3 × 3
    len supp  dose 
  <dbl> <chr> <chr>
1   4.2 VC    Low  
2  11.5 VC    Low  
3   7.3 VC    Low

Data Frame Objects

Columns can be selected by position or name

pigs[, 1]

# A tibble: 60 × 1
     len
   <dbl>
 1   4.2
 2  11.5
 3   7.3
 4   5.8
 5   6.4
 6  10  
 7  11.2
 8  11.2
 9   5.2
10   7  
# ℹ 50 more rows

pigs[, "len"]

# A tibble: 60 × 1
     len
   <dbl>
 1   4.2
 2  11.5
 3   7.3
 4   5.8
 5   6.4
 6  10  
 7  11.2
 8  11.2
 9   5.2
10   7  
# ℹ 50 more rows

Data Frame Objects

Entire columns can also be selected using $
$\implies$ doesn’t return a data.frame

pigs$len

 [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0 16.5 16.5 15.2 17.3 22.5
[16] 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
[31] 15.2 21.5 17.6  9.7 14.5 10.0  8.2  9.4 16.5  9.7 19.7 23.3 23.6 26.4 20.0
[46] 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0
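
The difference between the subsetting styles can be checked directly. This sketch uses pigs_demo, a hand-typed stand-in holding only the first three rows shown earlier, so that it runs without the workshop file; the tibble package is assumed to be installed.

```r
library(tibble)

# A small stand-in for pigs (first three rows only)
pigs_demo <- tibble(
  len = c(4.2, 11.5, 7.3),
  supp = c("VC", "VC", "VC"),
  dose = c("Low", "Low", "Low")
)

by_bracket <- pigs_demo[, "len"] # still a tibble with one column
by_dollar <- pigs_demo$len       # a plain numeric vector
by_double <- pigs_demo[["len"]]  # the same vector as using $

is.data.frame(by_bracket)
is.data.frame(by_dollar)
identical(by_dollar, by_double)

# Rows and columns can be chosen together in one call
pigs_demo[2:3, c("len", "dose")]
```

Use square brackets when the next step expects a table, and $ when it expects a vector (e.g. to pass to mean()).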

Data Frame Objects

Each column is a vector
- Exactly like a spreadsheet column
Vectors only contain one data type
- logical, integer, numeric (i.e. doubles), character
- len values are all numeric (or dbl)
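
The four vector types listed above can be checked with typeof(). The values here are arbitrary examples, not taken from the workshop data.

```r
typeof(c(TRUE, FALSE)) # "logical"
typeof(1:5)            # "integer"
typeof(c(4.2, 11.5))   # "double"
typeof(c("VC", "OJ"))  # "character"

# A trailing L asks for an integer; without it a number is stored as a double
typeof(5L)
typeof(5)

# is.numeric() is TRUE for both integers and doubles
is.numeric(5L)
is.numeric(5)
```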

Tibble Objects

readr uses a variant called a tbl_df or tbl (pronounced tibble)
- A data.frame with nice bonus features (e.g. prints a summary only)
- Similar to a SQL table
- Can only have row numbers for row names
- Is a foundational structure in the tidyverse

The Tidyverse

The tidyverse is a collection of thematically-linked packages
- Produced by developers from RStudio/Posit
- Often referred to as tidy-programming or similar
Calling library(tidyverse) loads all of these packages
- $>$ 10 convenient packages in one line
- readr is one of these $\implies$ usually just load the tidyverse

library(tidyverse)

The Tidyverse

Replace library(readr) with library(tidyverse) and execute

Some additional ways to inspect data frames are:

head(pigs)
glimpse(pigs)

glimpse() is loaded with library(tidyverse) (it comes from dplyr, not readr)

What were the differences between each method?

Functions

Functions in `R`

head(pigs)
glimpse(pigs)

Here we have called the functions 1) head() and 2) glimpse()
- They were both executed on the object pigs

Call the help page for head()

?head

(if you get multiple options, choose the one from utils)

Functions in `R`

The key place to look at is

head(x, ...)
## Default S3 method:
head(x, n = 6L, ...)

There are two named arguments to head() $\implies$ x and n
- x has no default value $\implies$ we need to provide something
- n = 6L means n has a default value of 6 (L $\implies$ integer)

Execute head() to show the error!!!
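
The same pattern of one required argument and one argument with a default can be reproduced in a few lines. The function first_values() below is hypothetical and written only for illustration; it is not how head() is implemented.

```r
# A hypothetical function written only to illustrate default arguments
first_values <- function(x, n = 6L) {
  x[seq_len(min(n, length(x)))]
}

first_values(letters)        # n falls back to its default of 6
first_values(letters, n = 2) # the default is overridden

# Leaving out x fails, because x has no default value
result <- tryCatch(first_values(), error = function(e) conditionMessage(e))
result
```

Here tryCatch() is used only so the error message can be captured and printed instead of stopping the script.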

Functions in `R`

Lower down the page you’ll see

Arguments

x an object
n an integer vector of length up to dim(x) (or 1, for non-dimensioned objects). Blah, blah, blah…

Some of the rest is technical detail (sometimes very helpful)

Function Arguments

head() prints the first part of an object
Useful for very large objects (e.g. if we had 1000 pigs)

We can change the number of rows shown to us

head(pigs, 4)

# A tibble: 4 × 3
    len supp  dose 
  <dbl> <chr> <chr>
1   4.2 VC    Low  
2  11.5 VC    Low  
3   7.3 VC    Low  
4   5.8 VC    Low

Function Arguments

Notice we didn’t provide these as named arguments
If passing values in order $\implies$ no need

head(pigs, 4)

# A tibble: 4 × 3
    len supp  dose 
  <dbl> <chr> <chr>
1   4.2 VC    Low  
2  11.5 VC    Low  
3   7.3 VC    Low  
4   5.8 VC    Low

head(x = pigs, n = 4)

# A tibble: 4 × 3
    len supp  dose 
  <dbl> <chr> <chr>
1   4.2 VC    Low  
2  11.5 VC    Low  
3   7.3 VC    Low  
4   5.8 VC    Low

Function Arguments

If we name the arguments, we can pass in any order we choose

head(x = pigs, n = 4)

# A tibble: 4 × 3
    len supp  dose 
  <dbl> <chr> <chr>
1   4.2 VC    Low  
2  11.5 VC    Low  
3   7.3 VC    Low  
4   5.8 VC    Low

head(n = 4, x = pigs)

# A tibble: 4 × 3
    len supp  dose 
  <dbl> <chr> <chr>
1   4.2 VC    Low  
2  11.5 VC    Low  
3   7.3 VC    Low  
4   5.8 VC    Low

Understanding `read_csv()`

Earlier we called the R function read_csv()
Check the help page

?read_csv

We have four functions shown but stick to read_csv()

Understanding `read_csv()`

read_csv(
  file,
  col_names = TRUE, col_types = NULL, col_select = NULL,
  id = NULL, locale = default_locale(), 
  na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "",
  trim_ws = TRUE,
  skip = 0, n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

This function has numerous arguments (file, col_names etc.)
Most have default values given
- All were defined somewhere in the GUI
- The default assumes there are column names in the first row (col_names = TRUE)

Understanding `read_csv()`

All arguments for the function were defined somewhere in the GUI.

Open the GUI Preview by clicking on the file again
Uncheck the First Row as Names check-box

Understanding `read_csv()`

All arguments for the function were defined somewhere in the GUI.

Open the GUI Preview by clicking on the file again
Uncheck the First Row as Names check-box
- What happened to the code?
- How did the columns change?

Try clicking/unclicking a few more & try to understand the consequences
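
The GUI check-boxes correspond to arguments that can also be typed directly. The sketch below writes a throwaway file shaped like the first rows of pigs.csv (the rows are copied by hand and the file is not the workshop file) and compares a few settings; readr is assumed to be installed.

```r
library(readr)

# A small stand-in file with a header row and three data rows
csv_file <- tempfile(fileext = ".csv")
writeLines(
  c("len,supp,dose", "4.2,VC,Low", "11.5,VC,Low", "7.3,OJ,Low"),
  csv_file
)

# Default: the first row supplies the column names
with_names <- read_csv(csv_file, show_col_types = FALSE)

# col_names = FALSE: the first row is treated as data
no_names <- read_csv(csv_file, col_names = FALSE, show_col_types = FALSE)

colnames(with_names)
colnames(no_names)
dim(with_names)
dim(no_names)

# n_max limits how many data rows are read
first_two <- read_csv(csv_file, n_max = 2, show_col_types = FALSE)
nrow(first_two)
```

Compare the column types of with_names and no_names to see why the header setting matters.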

Closing Comments

`read_csv()` Vs `read.csv()`

RStudio now uses read_csv() from readr by default
You will often see read.csv() in older scripts (from utils)
The newer (readr) version is:
- slightly faster
- more user-friendly
- gives informative messages
- always returns a tibble

Earlier functions in utils are read.*() (csv, delim etc.)
readr has the functions read_*() (csv, tsv, delim etc.)
I always use the newer ones
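
A quick way to see the difference in what the two functions return is to read the same file with both. The file below is a made-up two-row example, and readr is assumed to be installed.

```r
library(readr)

# A tiny hypothetical file read by both functions
csv_file <- tempfile(fileext = ".csv")
writeLines(c("len,supp", "4.2,VC", "11.5,OJ"), csv_file)

base_version <- read.csv(csv_file)                         # from utils
readr_version <- read_csv(csv_file, show_col_types = FALSE) # from readr

class(base_version)  # a plain data.frame
class(readr_version) # includes "tbl_df", i.e. a tibble
```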

Reading Help Pages: Bonus Slide

The bottom three functions are simplified wrappers to read_delim()
read_csv() calls read_delim() using delim = ","
read_csv2() calls read_delim() using delim = ";"
read_tsv() calls read_delim() using delim = "\t"
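
For delimiters without their own wrapper, read_delim() can be called directly. As a sketch, the made-up file below uses a pipe character between fields; the file and its contents are hypothetical, and readr is assumed to be installed.

```r
library(readr)

# A hypothetical pipe-delimited file
txt_file <- tempfile(fileext = ".txt")
writeLines(c("len|supp", "4.2|VC", "11.5|OJ"), txt_file)

piped <- read_delim(txt_file, delim = "|", show_col_types = FALSE)
piped
```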

What function would we call for space-delimited files?

Loading Excel Files

The package readxl is for loading .xls and .xlsx files.
Not part of the core tidyverse but very compatible

library(readxl)

The main function is read_excel()

?read_excel

Loading Excel Files

This file contains multiple sheets

excel_sheets("data/RealTimeData.xlsx")

[1] "Sheet1" "Sheet2" "Sheet3"

I found this file after a random Google search for RT-PCR and Excel about 10 years ago. I didn’t keep track of who created it…

Once again we can click on the file $\implies$ Import Dataset
- Sheet1 looks pretty simple
- First column has no name

Loading Excel Files

pcr <- read_excel("data/RealTimeData.xlsx")
colnames(pcr)[1] <- "Sample"

There are two pieces of data in the 1st column
- We’ll learn how to manage this in the next session

Have a look at the previews of Sheet2 and Sheet3
- Everything here is surprisingly easy to wrangle
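
Other sheets are selected with the sheet argument of read_excel(). So that it runs anywhere, the sketch below uses an example workbook bundled with readxl (located with readxl_example()) rather than RealTimeData.xlsx; for the workshop file the same pattern should apply with sheet = "Sheet2" or sheet = "Sheet3".

```r
library(readxl)

# readxl ships with example workbooks; this is NOT the workshop file
xlsx_file <- readxl_example("datasets.xlsx")

# List the sheet names, exactly as done for RealTimeData.xlsx
sheets <- excel_sheets(xlsx_file)
sheets

# Pick a sheet by name (a position such as sheet = 2 also works)
second_sheet <- read_excel(xlsx_file, sheet = sheets[2])
head(second_sheet, 3)
```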

References

Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biol. 17 (1): 177.

Footnotes

https://blog.genenames.org/newsletters/2020/08/28/Summer_newsletter/
https://blog.genenames.org/newsletters/2020/08/28/Summer_newsletter/
