library(tidyverse)
library(readxl)
library(writexl)Chapter 21
Spreadsheets
21.2.9 Exercises
The two main packages for reading data from and writing data to excel spreadsheets are readxl and writexl . But they are not core-tidyverse, so let us load them first.
The few important functions we will use are:
read_excel()excel_sheets()
Question 1.
In an Excel file, create the following dataset and save it as survey.xlsx. Alternatively, you can download it as an Excel file from here. Then, read it into R, with survey_id as a character variable and n_pets as a numerical variable.
survey_df = tibble(
survey_id = c(1:6),
n_pets = c(0,1,"N/A", "two", 2, "")
)
survey_df |>
write_xlsx("docs/survey.xlsx")
df = read_excel(
path = "docs/survey.xlsx",
col_names = TRUE,
col_types = c("text", "text"),
na = c("N/A", "")
) |>
mutate(
n_pets = ifelse(n_pets == "two", 2, n_pets),
n_pets = parse_number(n_pets)
)
df# A tibble: 6 × 2
survey_id n_pets
<chr> <dbl>
1 1 0
2 2 1
3 3 NA
4 4 2
5 5 2
6 6 NA
Question 2.
In another Excel file, create the following data-set and save it as roster.xlsx. Alternatively, you can download it as an Excel file from here. Then, read it into R. The resulting data frame should be called roster and should look like the following.
There are two ways of doing this:
Using the
read_excel()function withfill()function of thetidyrpackage.Using the package
openxlsxand its functionread.xlsx()which has an argumentfillMergedCells = TRUEto do the same task in one go. However, the output is a data.frame, which we must then convert to a tibble.
# Use readxl package with fill() from tidyr
read_excel(
path = "docs/roster.xlsx"
) |>
fill(group, subgroup)# A tibble: 12 × 3
group subgroup id
<dbl> <chr> <dbl>
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 4
5 1 B 5
6 1 B 6
7 1 B 7
8 2 A 8
9 2 A 9
10 2 B 10
11 2 B 11
12 2 B 12
# Option 2: using the openxlsx package
library(openxlsx)
read.xlsx(
xlsxFile = "docs/roster.xlsx",
fillMergedCells = TRUE
) |>
as_tibble()# A tibble: 12 × 3
group subgroup id
<dbl> <chr> <dbl>
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 4
5 1 B 5
6 1 B 6
7 1 B 7
8 2 A 8
9 2 A 9
10 2 B 10
11 2 B 11
12 2 B 12
Question 3.
In a new Excel file, create the following dataset and save it as sales.xlsx. Alternatively, you can download it as an Excel file from here.
Read
sales.xlsxin and save assales. The data frame should look like the following, withidandnas column names and with 9 rows.sales = read_excel( "docs/sales.xlsx", skip = 4, col_names = c("id", "n") ) sales# A tibble: 9 × 2 id n <chr> <chr> 1 Brand 1 n 2 1234.0 8.0 3 8721.0 2.0 4 1822.0 3.0 5 Brand 2 n 6 3333.0 1.0 7 2156.0 3.0 8 3987.0 6.0 9 3216.0 5.0Modify sales further to get it into the following tidy format with three columns (
brand,id, andn) and 7 rows of data. Note thatidandnare numeric,brandis a character variable.sales |> mutate( brand = ifelse(str_detect(id, "Brand"), id, NA), id = parse_number(id), n = parse_number(n, na = "n")) |> fill(brand) |> drop_na() |> relocate(brand)# A tibble: 7 × 3 brand id n <chr> <dbl> <dbl> 1 Brand 1 1234 8 2 Brand 1 8721 2 3 Brand 1 1822 3 4 Brand 2 3333 1 5 Brand 2 2156 3 6 Brand 2 3987 6 7 Brand 2 3216 5
Question 4.
Recreate the bake_sale data frame, write it out to an Excel file using the write.xlsx() function from the openxlsx package.
bake_sale = tibble(
item = factor(c("brownie", "cupcake", "cookie")),
quantity = c(10, 5, 8)
)
bake_sale |>
write.xlsx("docs/bake_sale.xlsx")Question 5.
In Chapter 8 you learned about the janitor::clean_names() function to turn columns names into snake case. Read the students.xlsx file that we introduced earlier in this section and use this function to “clean” the column names.
# Option 1: Read in data from the google sheets
# library(googlesheets4)
# id = "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
# For an easy reprex, write in the data now, suing:--
# read_sheet(id) |>
# dput()
raw_data = structure(list(`Student ID` = c(1, 2, 3, 4, 5, 6), `Full Name` = c("Sunil Huffmann", "Barclay Lynn", "Jayendra Lyne", "Leon Rossini", "Chidiegwu Dunkel","Güvenç Attila"), favourite.food = c("Strawberry yoghurt", "French fries", "N/A", "Anchovies", "Pizza", "Ice cream"), mealPlan = c("Lunch only", "Lunch only", "Breakfast and lunch", "Lunch only", "Breakfast and lunch", "Lunch only"), AGE = list(4, 5, 7, NULL, "five", 6)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
raw_data |>
janitor::clean_names() |>
as_tibble()# A tibble: 6 × 5
student_id full_name favourite_food meal_plan age
<dbl> <chr> <chr> <chr> <list>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only <dbl [1]>
2 2 Barclay Lynn French fries Lunch only <dbl [1]>
3 3 Jayendra Lyne N/A Breakfast and lunch <dbl [1]>
4 4 Leon Rossini Anchovies Lunch only <NULL>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch <chr [1]>
6 6 Güvenç Attila Ice cream Lunch only <dbl [1]>
Question 6.
What happens if you try to read in a file with .xlsx extension with read_xls()?
If we try to open a *.xlsx file with read_xls() , an error is displayed that Error: filepath libxls error: Unable to open file
read_xls("docs/sales.xlsx")21.3.6 Exercises
Question 1.
Read the students data set from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the read_excel() and read_sheet() functions. Are the resulting data frames in R exactly the same? If not, how are they different?
The two resulting data frames are not exactly the same. The data frame created from read_excel() , i.e. df_xl has the has variable AGE saved as character because one of the values is written in characters, instead of a number. Whenever some data is numeric and some data is character , read_excel() converts all data within a column into character format.
On the other hand, the data frame created from read_sheet() of googlesheets4 package, i.e. df_gs has this variable stored as a “list”, which contains both numeric and character types of data.
df_xl = read_excel("docs/students.xlsx")
library(googlesheets4)
url_id = "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
df_gs = read_sheet(url_id)
# Comparing the types of columns in the two data.frames
sapply(df_xl, class) == sapply(df_gs, class) Student ID Full Name favourite.food mealPlan AGE
TRUE TRUE TRUE TRUE FALSE
class(df_xl$AGE)[1] "character"
class(df_gs$AGE)[1] "list"
sapply(df_gs$AGE, class)[1] "numeric" "numeric" "numeric" "NULL" "character" "numeric"
Question 2.
Read the Google Sheet titled survey from https://pos.it/r4ds-survey, with survey_id as a character variable and n_pets as a numerical variable.
When we read Google sheets, using col_types argument, we introduce NAs by coercion.
url_gs = "https://docs.google.com/spreadsheets/d/1yc5gL-a2OOBr8M7B3IsDNX5uR17vBHOyWZq6xSTG2G8/edit#gid=0"
read_sheet(
ss = url_gs,
col_types = "cd")# A tibble: 6 × 2
survey_id n_pets
<chr> <dbl>
1 1 0
2 2 1
3 3 NA
4 4 NA
5 5 2
6 6 NA
Question 3.
Read the Google Sheet titled roster from https://pos.it/r4ds-roster. The resulting data frame should be called roster and should look like the following.
url_gs1 = "https://docs.google.com/spreadsheets/d/1LgZ0Bkg9d_NK8uTdP2uHXm07kAlwx8-Ictf8NocebIE/edit#gid=0"
read_sheet(
ss = url_gs1
) |>
fill(group, subgroup) # A tibble: 12 × 3
group subgroup id
<dbl> <chr> <dbl>
1 1 A 1
2 1 A 2
3 1 A 3
4 1 B 4
5 1 B 5
6 1 B 6
7 1 B 7
8 2 A 8
9 2 A 9
10 2 B 10
11 2 B 11
12 2 B 12