library(tidyverse)
library(readxl)
library(writexl)
Chapter 21
Spreadsheets
21.2.9 Exercises
The two main packages for reading data from and writing data to excel spreadsheets are readxl
and writexl
. But they are not core-tidyverse, so let us load them first.
The few important functions we will use are:
read_excel()
excel_sheets()
Question 1.
In an Excel file, create the following dataset and save it as survey.xlsx
. Alternatively, you can download it as an Excel file from here. Then, read it into R, with survey_id
as a character variable and n_pets
as a numerical variable.
= tibble(
survey_df survey_id = c(1:6),
n_pets = c(0,1,"N/A", "two", 2, "")
)
|>
survey_df write_xlsx("docs/survey.xlsx")
= read_excel(
df path = "docs/survey.xlsx",
col_names = TRUE,
col_types = c("text", "text"),
na = c("N/A", "")
|>
) mutate(
n_pets = ifelse(n_pets == "two", 2, n_pets),
n_pets = parse_number(n_pets)
)
df
# A tibble: 6 × 2
survey_id n_pets
<chr> <dbl>
1 1 0
2 2 1
3 3 NA
4 4 2
5 5 2
6 6 NA
Question 2.
In another Excel file, create the following data-set and save it as roster.xlsx
. Alternatively, you can download it as an Excel file from here. Then, read it into R. The resulting data frame should be called roster and should look like the following.
There are two ways of doing this:
Using the
read_excel()
function withfill()
function of thetidyr
package.Using the package
openxlsx
and its functionread.xlsx()
which has an argumentfillMergedCells = TRUE
to do the same task in one go. However, the output is a data.frame, which we must then convert to a tibble.
# Use readxl package with fill() from tidyr
read_excel(
path = "docs/roster.xlsx"
|>
) fill(group, subgroup)
# A tibble: 12 × 3
group subgroup id
<dbl> <chr> <dbl>
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 4
5 1 B 5
6 1 B 6
7 1 B 7
8 2 A 8
9 2 A 9
10 2 B 10
11 2 B 11
12 2 B 12
# Option 2: using the openxlsx package
library(openxlsx)
read.xlsx(
xlsxFile = "docs/roster.xlsx",
fillMergedCells = TRUE
|>
) as_tibble()
# A tibble: 12 × 3
group subgroup id
<dbl> <chr> <dbl>
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 4
5 1 B 5
6 1 B 6
7 1 B 7
8 2 A 8
9 2 A 9
10 2 B 10
11 2 B 11
12 2 B 12
Question 3.
In a new Excel file, create the following dataset and save it as sales.xlsx
. Alternatively, you can download it as an Excel file from here.
Read
sales.xlsx
in and save assales
. The data frame should look like the following, withid
andn
as column names and with 9 rows.= read_excel( sales "docs/sales.xlsx", skip = 4, col_names = c("id", "n") ) sales
# A tibble: 9 × 2 id n <chr> <chr> 1 Brand 1 n 2 1234.0 8.0 3 8721.0 2.0 4 1822.0 3.0 5 Brand 2 n 6 3333.0 1.0 7 2156.0 3.0 8 3987.0 6.0 9 3216.0 5.0
Modify sales further to get it into the following tidy format with three columns (
brand
,id
, andn
) and 7 rows of data. Note thatid
andn
are numeric,brand
is a character variable.|> sales mutate( brand = ifelse(str_detect(id, "Brand"), id, NA), id = parse_number(id), n = parse_number(n, na = "n")) |> fill(brand) |> drop_na() |> relocate(brand)
# A tibble: 7 × 3 brand id n <chr> <dbl> <dbl> 1 Brand 1 1234 8 2 Brand 1 8721 2 3 Brand 1 1822 3 4 Brand 2 3333 1 5 Brand 2 2156 3 6 Brand 2 3987 6 7 Brand 2 3216 5
Question 4.
Recreate the bake_sale
data frame, write it out to an Excel file using the write.xlsx()
function from the openxlsx
package.
= tibble(
bake_sale item = factor(c("brownie", "cupcake", "cookie")),
quantity = c(10, 5, 8)
)|>
bake_sale write.xlsx("docs/bake_sale.xlsx")
Question 5.
In Chapter 8 you learned about the janitor::clean_names()
function to turn columns names into snake case. Read the students.xlsx
file that we introduced earlier in this section and use this function to “clean” the column names.
# Option 1: Read in data from the google sheets
# library(googlesheets4)
# id = "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
# For an easy reprex, write in the data now, suing:--
# read_sheet(id) |>
# dput()
= structure(list(`Student ID` = c(1, 2, 3, 4, 5, 6), `Full Name` = c("Sunil Huffmann", "Barclay Lynn", "Jayendra Lyne", "Leon Rossini", "Chidiegwu Dunkel","Güvenç Attila"), favourite.food = c("Strawberry yoghurt", "French fries", "N/A", "Anchovies", "Pizza", "Ice cream"), mealPlan = c("Lunch only", "Lunch only", "Breakfast and lunch", "Lunch only", "Breakfast and lunch", "Lunch only"), AGE = list(4, 5, 7, NULL, "five", 6)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
raw_data
|>
raw_data ::clean_names() |>
janitoras_tibble()
# A tibble: 6 × 5
student_id full_name favourite_food meal_plan age
<dbl> <chr> <chr> <chr> <list>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only <dbl [1]>
2 2 Barclay Lynn French fries Lunch only <dbl [1]>
3 3 Jayendra Lyne N/A Breakfast and lunch <dbl [1]>
4 4 Leon Rossini Anchovies Lunch only <NULL>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch <chr [1]>
6 6 Güvenç Attila Ice cream Lunch only <dbl [1]>
Question 6.
What happens if you try to read in a file with .xlsx extension with read_xls()?
If we try to open a *.xlsx
file with read_xls()
, an error is displayed that Error:
filepath libxls error: Unable to open file
read_xls("docs/sales.xlsx")
21.3.6 Exercises
Question 1.
Read the students data set from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the read_excel()
and read_sheet()
functions. Are the resulting data frames in R exactly the same? If not, how are they different?
The two resulting data frames are not exactly the same. The data frame created from read_excel()
, i.e. df_xl
has the has variable AGE
saved as character because one of the values is written in characters, instead of a number. Whenever some data is numeric and some data is character , read_excel()
converts all data within a column into character format.
On the other hand, the data frame created from read_sheet()
of googlesheets4
package, i.e. df_gs
has this variable stored as a “list”, which contains both numeric
and character
types of data.
= read_excel("docs/students.xlsx")
df_xl
library(googlesheets4)
= "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
url_id
= read_sheet(url_id)
df_gs # Comparing the types of columns in the two data.frames
sapply(df_xl, class) == sapply(df_gs, class)
Student ID Full Name favourite.food mealPlan AGE
TRUE TRUE TRUE TRUE FALSE
class(df_xl$AGE)
[1] "character"
class(df_gs$AGE)
[1] "list"
sapply(df_gs$AGE, class)
[1] "numeric" "numeric" "numeric" "NULL" "character" "numeric"
Question 2.
Read the Google Sheet titled survey from https://pos.it/r4ds-survey, with survey_id as a character variable and n_pets as a numerical variable.
When we read Google sheets, using col_types
argument, we introduce NA
s by coercion.
= "https://docs.google.com/spreadsheets/d/1yc5gL-a2OOBr8M7B3IsDNX5uR17vBHOyWZq6xSTG2G8/edit#gid=0"
url_gs
read_sheet(
ss = url_gs,
col_types = "cd")
# A tibble: 6 × 2
survey_id n_pets
<chr> <dbl>
1 1 0
2 2 1
3 3 NA
4 4 NA
5 5 2
6 6 NA
Question 3.
Read the Google Sheet titled roster from https://pos.it/r4ds-roster. The resulting data frame should be called roster and should look like the following.
= "https://docs.google.com/spreadsheets/d/1LgZ0Bkg9d_NK8uTdP2uHXm07kAlwx8-Ictf8NocebIE/edit#gid=0"
url_gs1
read_sheet(
ss = url_gs1
|>
) fill(group, subgroup)
# A tibble: 12 × 3
group subgroup id
<dbl> <chr> <dbl>
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 4
5 1 B 5
6 1 B 6
7 1 B 7
8 2 A 8
9 2 A 9
10 2 B 10
11 2 B 11
12 2 B 12