Importing Spreadsheet-Like Data

RAdelaide 2025

Dr Stevie Pederson

Black Ochre Data Labs
The Kids Research Institute Australia

July 8, 2025

Importing Data

Data In R

Working with data in R is very different to Excel
Can have complicated structures or be simple (e.g. x <- 1:5)
Spreadsheet-like data is very common
Is known as a data.frame
- A common variant is known as a tibble
- These were originally called tbl_df objects referring to SQL tables
- Is just a data.frame with some convenience features

Data In R

We import the data as an R object
- All analysis is performed on the R object
- Almost never modify the source file
Once data is imported the source file is no longer involved

Common File Types

Excel files (xls/xlsx) have one or more spreadsheets
- Can import one sheet at a time using library(readxl)
Comma-separated Values (.csv)
- Plain text file
- Column-style data with commas delimiting the columns
Resemble a single spreadsheet but:
- No formatting
- No formulas \(\implies\) only values
- No graphs/figures

Common File Types

Tab-separated Values (.tsv)
- Like a .csv but with tabs separating columns
Sometimes see files saved as .txt
- Can be anything \(\implies\) need to check

Importing Data

When importing data into R:

Cell formatting will be ignored by R
Plots will also be ignored
Blank rows/columns are not fatal, just annoying
Mixtures of numbers and text in a column
- data.frame columns contain all the same type of value
Deleted cells are sometimes imported as blank rows/columns
Comma-separated or tab-separated files are often favoured for R
- i.e. plain text, or just the data

Other Common Excel Issues

If we’re not careful:

Excel thinks everything is a date:
- Septin genes are now officially named SEPTIN2 etc (not SEPT2)
- Fractions are also not dates…
Excel will remove leading zeroes (e.g. phone numbers, catalog ids)
No record of any steps we’ve performed by clicking on something

Very common sources of broken data \(\implies\) may need fixing
’000s of publications with dates for gene names in results

Preparation

File > New File > R Script (Or Ctrl+Shift+N)
Save as ImportPenguins.R

Preparation

Download the file data.zip from the workshop homepage
Place in your directory RAdelaide25
Extract to here which should create a folder named data

Make sure your files are in data not in data/data

This should contain all files for the workshop
Navigate to the data directory using the Files pane

(You should see penguins.csv in data)

Import Using the GUI

Importing Data

Preview the file penguins.csv by clicking on it (View File)
- Try in Excel if you prefer, but DO NOT(!!!) save anything from Excel

This is a common dataset as a simple .csv file
This type of data is very easy to manage in R
- Plain text with comma delimiters
- Simple column structure with column names
- No blank rows at the top or separating sub-tables
- No blank columns
- No rownames

Using the GUI To Load Data

Click on penguins.csv, choose Import Dataset then stop! 🛑

(Click Update if you don’t see this)

The Preview Window

We have a preview of the data

The Preview Window

We also have a preview of the code we’re about to execute

The Preview Window

Select and copy all the code in the Code Preview Box
- We’ll paste this somewhere in a minute…

Click Import

Magic happens!!!
- Ignore the coloured text. This is just ‘helpful information’

Now paste the copied code at the top of your script

What just happened?

The code we copied has 3 lines:

library(readr)
penguins <- read_csv("data/penguins.csv")
View(penguins)

library(readr) loads the package readr
- Packages are collections (i.e. libraries) of related functions
- All readr functions are about importing data
readr contains the function read_csv()
read_csv() tells R what to do with a csv file

What just happened?

The code we copied has 3 lines:

library(readr)
penguins <- read_csv("data/penguins.csv")
View(penguins)

This line actually loads the data into your R Environment
It creates an object named penguins by using the file name (penguins.csv)
Can change this name if we wish

What just happened?

The code we copied has 3 lines:

library(readr)
penguins <- read_csv("data/penguins.csv")
View(penguins)

Opens a preview in a familiar Excel-like format
- I personally never use this

Close the preview by clicking the cross

What just happened?

We have just loaded data using the default settings of read_csv()
The object penguins is now in our R Environment
- The original file remains on our HDD without modification!!!
The code is saved in our script
\(\implies\) we don’t need the GUI for this operation again!

Let’s Demonstrate

In the Environment Tab click the broom icon 🧹
- This will delete everything from your R Environment
- It won’t unload the packages

Highlight the code we’ve just pasted and click Run
- Reloading the packages won’t hurt

Check the Environment Tab again and penguins is back

You can delete the line View(penguins)

Our First R Script

This is now our first R script
- An R script is a plain text file
- No objects are stored

Individual lines can be run using Ctrl/Cmd + Enter
Can select multiple lines & run

Our First R Script

Insert a new line before library(readr) then type

## Load packages first

This is a comment \(\implies\) no executable code
- Communicates to “future” you what you’re doing
- Only needs a single # to be a comment

Our First R Script

Insert a line after library(readr) then enter

## Now load the file "penguins.csv"
## Taken from https://allisonhorst.github.io/palmerpenguins/index.html

Data Frame Objects

The object penguins is known as a data.frame
- Very similar to an SQL table
R equivalent to a spreadsheet
- Must have column names
- row names becoming less common (sometimes just the row numbers)
- Missing values (blank cells) are usually filled with NA

Data Frame Objects

Instead of View() \(\implies\) preview by typing the object name

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Data Frame Objects

Gives a preview up to 10 lines with:

The object type: A tibble
The full dimensions: 60 X 3
Column names: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year
Data types: <chr>, <chr>, <dbl> etc

I personally find this more informative than View()

Tibble Objects

readr uses a variant called a tbl_df or tbl (pronounced tibble)
- A data.frame with convenient features
- Similar to a SQL table
- Can only have row numbers for row names
- Is a foundational structure in the tidyverse

The Tidyverse

The tidyverse is a collection of thematically-linked packages
- Produced by developers from RStudio/Posit
- Often referred to as tidy-programming or similar
Calling library(tidyverse) loads multiple key packages
- readr is one of these \(\implies\) usually just load the tidyverse
- Will load dplyr, tidyr, readr, ggplot2 + others

The Tidyverse

Replace library(readr) with library(tidyverse)
Execute this line

This will load all core packages from the tidyverse
Reloading a package has no effect (is ignored)

Previewing Objects

Some additional ways to inspect tibbles are:

head(penguins)
glimpse(penguins)

glimpse is loaded with library(tidyverse)

What were the differences between each method?

Functions

Functions in `R`

head(penguins)
glimpse(penguins)

Here we have called the functions head() and glimpse()
- They were both executed on the object penguins

Call the help page for head()

?head

(if you get multiple options, choose the one from utils)

Functions in `R`

The key place to look at is

head(x, ...)
## Default S3 method:
head(x, n = 6L, ...)

there are two arguments to head() \(\implies\) x and n
- x has no default value \(\implies\) we need to provide something
- n = 6L means n has a default value of 6 (L \(\implies\) integer)

Functions in `R`

Lower down the page you’ll see

Arguments

x an object
n an integer vector of length up to dim(x) (or 1, for non-dimensioned objects). Blah, blah, blah…

Some of the rest is technical detail (can sometimes be very helpful)

Function Arguments

head() prints the first part of an object
Useful for very large objects (e.g. if we had 1000 penguins)

We can change the number of rows shown to us

head(penguins, 4)

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <chr>, year <dbl>

Function Arguments

If passing values in order \(\implies\) no need for names

head(penguins, 4)

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <chr>, year <dbl>

head(x = penguins, n = 4)

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <chr>, year <dbl>

Function Arguments

If we explicitly name arguments \(\implies\) can pass in any order

head(n = 4, x = penguins)

# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
# ℹ 2 more variables: sex <chr>, year <dbl>

Function Arguments

Check the help page for glimpse()
- This is from the package pillar
Try changing the width argument to see what happens
- Default value auto-detects screen width

Understanding `read_csv()`

Earlier we called the R function read_csv()
Check the help page

?read_csv

We have four functions shown but stick to read_csv()

Understanding `read_csv()`

read_csv(
  file, 
  col_names = TRUE, col_types = NULL, col_select = NULL,
  id = NULL, locale = default_locale(), 
  na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "",
  trim_ws = TRUE,
  skip = 0, n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

This function has way too many arguments (file, col_names etc.)
Most have default values given
col_names = TRUE \(\implies\) column names assumed as the first row

Understanding `read_csv()`

All arguments were defined somewhere in the GUI.

Open the GUI Preview by clicking on the file again
Uncheck the First Row as Names check-box

What happened to the code?
How did the columns change?

Try clicking/unclicking a few more & try understand the consequences

Understanding `read_csv()`

Column types can also be set using shorthand
- “ccnnnncn” is a mix of character & number columns

penguins <- read_csv("data/penguins.csv", col_types = "ccnnnncn")

Changing one of the character columns to “n” will lead to NAs
Changing the numeric column to character would mean … ?

Closing Comments

`read_csv()` Vs `read.csv()`

RStudio now uses read_csv() from readr by default
You will often see read.csv() (from utils) in older scripts
The newer (readr) version is:
- slightly faster
- more user-friendly
- handles large files more efficiently via indexing
- gives informative messages

Earlier functions in utils are read.*() (csv, delim etc.)
readr has the functions read_*() (csv, tsv, delim etc.)
I always use the newer ones

Loading Excel Files

The package readxl is for loading .xls and xlsx files.
Not part of the core tidyverse but very compatible

library(readxl)

The main function is read_excel()

?read_excel

Loading Excel Files

This file contains multiple sheets

excel_sheets("data/RealTimeData.xlsx")

[1] "Sheet1" "Sheet2" "Sheet3"

Once again we can click on the file \(\implies\) Import Dataset
- Sheet1 looks pretty simple
- First column has no name

Use the drop-down menu to look at the Sheet2 & Sheet3
We’ll learn how to manage these as challenges later

Importing Spreadsheet-Like Data

Importing Data

Data In R

Data In R

Common File Types

Common File Types

Importing Data

Other Common Excel Issues

Preparation

Preparation

Import Using the GUI

Importing Data

Using the GUI To Load Data

The Preview Window

The Preview Window

The Preview Window

What just happened?

What just happened?

What just happened?

What just happened?

Let’s Demonstrate

Our First R Script

Our First R Script

Our First R Script

Data Frame Objects

Data Frame Objects

Data Frame Objects

Data Frame Objects

Tibble Objects

The Tidyverse

The Tidyverse

Previewing Objects

Functions

Functions in R

Functions in R

Functions in R

Function Arguments

Function Arguments

Function Arguments

Function Arguments

Understanding read_csv()

Understanding read_csv()

Understanding read_csv()

Understanding read_csv()

Closing Comments

read_csv() Vs read.csv()

Loading Excel Files

Loading Excel Files

Functions in `R`

Functions in `R`

Functions in `R`

Understanding `read_csv()`

Understanding `read_csv()`

Understanding `read_csv()`

Understanding `read_csv()`

`read_csv()` Vs `read.csv()`