Chapter 21

Spreadsheets

Author

Aditya Dahiya

Published

August 15, 2023

21.2.9 Exercises

The two main packages for reading data from and writing data to excel spreadsheets are readxl and writexl . But they are not core-tidyverse, so let us load them first.

library(tidyverse)
library(readxl)
library(writexl)

The few important functions we will use are:

read_excel()
excel_sheets()

Question 1.

In an Excel file, create the following dataset and save it as survey.xlsx. Alternatively, you can download it as an Excel file from here. Then, read it into R, with survey_id as a character variable and n_pets as a numerical variable.

survey_df = tibble(
  survey_id = c(1:6),
  n_pets = c(0,1,"N/A", "two", 2, "")
)

survey_df |>
  write_xlsx("docs/survey.xlsx")

df = read_excel(
  path = "docs/survey.xlsx",
  col_names = TRUE,
  col_types = c("text", "text"),
  na = c("N/A", "")
  ) |>
  mutate(
    n_pets = ifelse(n_pets == "two", 2, n_pets),
    n_pets = parse_number(n_pets)
    )

df

# A tibble: 6 × 2
  survey_id n_pets
  <chr>      <dbl>
1 1              0
2 2              1
3 3             NA
4 4              2
5 5              2
6 6             NA

Question 2.

In another Excel file, create the following data-set and save it as roster.xlsx. Alternatively, you can download it as an Excel file from here. Then, read it into R. The resulting data frame should be called roster and should look like the following.

There are two ways of doing this:

Using the read_excel() function with fill() function of the tidyr package.
Using the package openxlsx and its function read.xlsx() which has an argument fillMergedCells = TRUE to do the same task in one go. However, the output is a data.frame, which we must then convert to a tibble.

# Use readxl package with fill() from tidyr
read_excel(
  path = "docs/roster.xlsx"
  ) |>
  fill(group, subgroup)

# A tibble: 12 × 3
   group subgroup    id
   <dbl> <chr>    <dbl>
 1     1 A            1
 2     1 A            2
 3     1 A            3
 4     1 B            4
 5     1 B            5
 6     1 B            6
 7     1 B            7
 8     2 A            8
 9     2 A            9
10     2 B           10
11     2 B           11
12     2 B           12

# Option 2: using the openxlsx package
library(openxlsx)
read.xlsx(
  xlsxFile = "docs/roster.xlsx",
  fillMergedCells = TRUE
  ) |>
  as_tibble()

# A tibble: 12 × 3
   group subgroup    id
   <dbl> <chr>    <dbl>
 1     1 A            1
 2     1 A            2
 3     1 A            3
 4     1 B            4
 5     1 B            5
 6     1 B            6
 7     1 B            7
 8     2 A            8
 9     2 A            9
10     2 B           10
11     2 B           11
12     2 B           12

Question 3.

In a new Excel file, create the following dataset and save it as sales.xlsx. Alternatively, you can download it as an Excel file from here.

Read sales.xlsx in and save as sales. The data frame should look like the following, with id and n as column names and with 9 rows.

sales = read_excel(
  "docs/sales.xlsx",
  skip = 4,
  col_names = c("id", "n")
)
sales

# A tibble: 9 × 2
  id      n    
  <chr>   <chr>
1 Brand 1 n    
2 1234.0  8.0  
3 8721.0  2.0  
4 1822.0  3.0  
5 Brand 2 n    
6 3333.0  1.0  
7 2156.0  3.0  
8 3987.0  6.0  
9 3216.0  5.0

Modify sales further to get it into the following tidy format with three columns (brand, id, and n) and 7 rows of data. Note that id and n are numeric, brand is a character variable.

sales |>
  mutate(
    brand = ifelse(str_detect(id, "Brand"), id, NA),
    id = parse_number(id),
    n = parse_number(n, na = "n")) |>
  fill(brand) |>
  drop_na() |>
  relocate(brand)

# A tibble: 7 × 3
  brand      id     n
  <chr>   <dbl> <dbl>
1 Brand 1  1234     8
2 Brand 1  8721     2
3 Brand 1  1822     3
4 Brand 2  3333     1
5 Brand 2  2156     3
6 Brand 2  3987     6
7 Brand 2  3216     5

Question 4.

Recreate the bake_sale data frame, write it out to an Excel file using the write.xlsx() function from the openxlsx package.

bake_sale = tibble(
  item = factor(c("brownie", "cupcake", "cookie")),
  quantity = c(10, 5, 8)
)
bake_sale |>
  write.xlsx("docs/bake_sale.xlsx")

Question 5.

In Chapter 8 you learned about the janitor::clean_names() function to turn columns names into snake case. Read the students.xlsx file that we introduced earlier in this section and use this function to “clean” the column names.

# Option 1: Read in data from the google sheets
# library(googlesheets4)
# id = "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"

# For an easy reprex, write in the data now, suing:--
# read_sheet(id) |>
#   dput()

raw_data = structure(list(`Student ID` = c(1, 2, 3, 4, 5, 6), `Full Name` = c("Sunil Huffmann", "Barclay Lynn", "Jayendra Lyne", "Leon Rossini", "Chidiegwu Dunkel","Güvenç Attila"), favourite.food = c("Strawberry yoghurt", "French fries", "N/A", "Anchovies", "Pizza", "Ice cream"), mealPlan = c("Lunch only", "Lunch only", "Breakfast and lunch", "Lunch only", "Breakfast and lunch", "Lunch only"), AGE = list(4, 5, 7, NULL, "five", 6)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

raw_data |>
  janitor::clean_names() |>
  as_tibble()

# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan           age      
       <dbl> <chr>            <chr>              <chr>               <list>   
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          <dbl [1]>
2          2 Barclay Lynn     French fries       Lunch only          <dbl [1]>
3          3 Jayendra Lyne    N/A                Breakfast and lunch <dbl [1]>
4          4 Leon Rossini     Anchovies          Lunch only          <NULL>   
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch <chr [1]>
6          6 Güvenç Attila    Ice cream          Lunch only          <dbl [1]>

Question 6.

What happens if you try to read in a file with .xlsx extension with read_xls()?

If we try to open a *.xlsx file with read_xls() , an error is displayed that Error: filepath libxls error: Unable to open file

read_xls("docs/sales.xlsx")

21.3.6 Exercises

Question 1.

Read the students data set from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the read_excel() and read_sheet() functions. Are the resulting data frames in R exactly the same? If not, how are they different?

The two resulting data frames are not exactly the same. The data frame created from read_excel() , i.e. df_xl has the has variable AGE saved as character because one of the values is written in characters, instead of a number. Whenever some data is numeric and some data is character , read_excel() converts all data within a column into character format.

On the other hand, the data frame created from read_sheet() of googlesheets4 package, i.e. df_gs has this variable stored as a “list”, which contains both numeric and character types of data.

df_xl = read_excel("docs/students.xlsx")

library(googlesheets4)
url_id = "1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"

df_gs = read_sheet(url_id)
# Comparing the types of columns in the two data.frames
sapply(df_xl, class) == sapply(df_gs, class)

    Student ID      Full Name favourite.food       mealPlan            AGE 
          TRUE           TRUE           TRUE           TRUE          FALSE

class(df_xl$AGE)

[1] "character"

class(df_gs$AGE)

[1] "list"

sapply(df_gs$AGE, class)

[1] "numeric"   "numeric"   "numeric"   "NULL"      "character" "numeric"

Question 2.

Read the Google Sheet titled survey from https://pos.it/r4ds-survey, with survey_id as a character variable and n_pets as a numerical variable.

When we read Google sheets, using col_types argument, we introduce NAs by coercion.

url_gs = "https://docs.google.com/spreadsheets/d/1yc5gL-a2OOBr8M7B3IsDNX5uR17vBHOyWZq6xSTG2G8/edit#gid=0"

read_sheet(
  ss = url_gs,
  col_types = "cd")

# A tibble: 6 × 2
  survey_id n_pets
  <chr>      <dbl>
1 1              0
2 2              1
3 3             NA
4 4             NA
5 5              2
6 6             NA

Question 3.

Read the Google Sheet titled roster from https://pos.it/r4ds-roster. The resulting data frame should be called roster and should look like the following.

url_gs1 = "https://docs.google.com/spreadsheets/d/1LgZ0Bkg9d_NK8uTdP2uHXm07kAlwx8-Ictf8NocebIE/edit#gid=0"

read_sheet(
  ss = url_gs1
) |>
fill(group, subgroup)

# A tibble: 12 × 3
   group subgroup    id
   <dbl> <chr>    <dbl>
 1     1 A            1
 2     1 A            2
 3     1 A            3
 4     1 B            4
 5     1 B            5
 6     1 B            6
 7     1 B            7
 8     2 A            8
 9     2 A            9
10     2 B           10
11     2 B           11
12     2 B           12