library(tidyverse)
library(gt)
library(gtExtras)
Chapter 27
Iteration
27.2.8 Exercises
Question 1
Practice your across() skills by:
- Computing the number of unique values in each column of palmerpenguins::penguins.
palmerpenguins::penguins |>
  summarise(across(.cols = everything(), .fns = n_distinct))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>
- Computing the mean of every column in mtcars.
mtcars |>
  summarise(across(everything(), mean))
       mpg    cyl     disp       hp     drat      wt     qsec     vs      am
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    gear   carb
1 3.6875 2.8125
- Grouping diamonds by cut, clarity, and color, then counting the number of observations and computing the mean of each numeric column.
ggplot2::diamonds |>
  group_by(cut, clarity, color) |>
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = function(x) mean(x, na.rm = TRUE)
    ),
    n = n()
  )
# A tibble: 276 × 11
# Groups:   cut, clarity [40]
   cut   clarity color carat depth table price     x     y     z     n
   <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
 1 Fair  I1      D     1.88   65.6  56.8 7383   7.52  7.42  4.90     4
 2 Fair  I1      E     0.969  65.6  58.1 2095.  6.17  6.06  4.01     9
 3 Fair  I1      F     1.02   65.7  58.4 2544.  6.14  6.04  4.00    35
 4 Fair  I1      G     1.23   65.3  57.7 3187.  6.52  6.43  4.23    53
 5 Fair  I1      H     1.50   65.8  58.4 4213.  6.96  6.86  4.55    52
 6 Fair  I1      I     1.32   65.7  58.4 3501   6.76  6.65  4.41    34
 7 Fair  I1      J     1.99   66.5  57.9 5795.  7.55  7.46  4.99    23
 8 Fair  SI2     D     1.02   64.7  58.6 4355.  6.24  6.17  4.01    56
 9 Fair  SI2     E     1.02   63.4  59.5 4172.  6.28  6.22  3.96    78
10 Fair  SI2     F     1.08   63.8  59.5 4520.  6.36  6.30  4.04    89
# ℹ 266 more rows
Question 2
What happens if you use a list of functions in across(), but don’t name them? How is the output named?
When across() receives a list of functions, the output columns are named according to the .names glue specification. Its default is:
- "{.col}" when .fns is a single function, and
- "{.col}_{.fn}" when .fns is a list of functions.
For a named list, {.fn} is the name given to each function, so applying mean to a column x produces x_mean. For an unnamed list, {.fn} is the position of the function in the list, so the output columns are named x_1, x_2, and so on.
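As a minimal sketch of these naming rules, the snippet below applies an unnamed and a named list of functions to a small made-up tibble, then overrides the default with the .names argument. The tibble toy and the glue string "{.fn}_of_{.col}" are illustrative assumptions, not part of the exercise.

```r
library(dplyr)

# Hypothetical one-column tibble, used only for illustration
toy <- tibble(x = c(1, 4, 9))

# Unnamed list: {.fn} falls back to the position of each function
unnamed <- toy |> mutate(across(x, list(sqrt, log)))
names(unnamed)
# "x" "x_1" "x_2"

# Named list: {.fn} is the name given in the list
named <- toy |> mutate(across(x, list(sqrt = sqrt, log = log)))
names(named)
# "x" "x_sqrt" "x_log"

# .names replaces the default "{.col}_{.fn}" specification
custom <- toy |>
  mutate(across(x, list(sqrt = sqrt, log = log), .names = "{.fn}_of_{.col}"))
names(custom)
# "x" "sqrt_of_x" "log_of_x"
```

Supplying .names is usually the safer choice with unnamed lists, since positional suffixes such as _1 and _2 say nothing about which function produced the column.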
# Create a sample data frame
data <- data.frame(
  A = c(1, 2, 3, NA, 4),
  B = c(4, NA, 5, 6, 7)
)
# Use across() without naming the functions
data %>%
  mutate(across(everything(), list(sqrt, log)))
   A  B      A_1       A_2      B_1      B_2
1  1  4 1.000000 0.0000000 2.000000 1.386294
2  2 NA 1.414214 0.6931472       NA       NA
3  3  5 1.732051 1.0986123 2.236068 1.609438
4 NA  6       NA        NA 2.449490 1.791759
5  4  7 2.000000 1.3862944 2.645751 1.945910
# Use across() with named functions
data %>%
  mutate(
    across(
      everything(), 
      list(sqrt = sqrt, 
           log = log,
           mean = \(x) mean(x, na.rm = TRUE)
           )
      )
    )
   A  B   A_sqrt     A_log A_mean   B_sqrt    B_log B_mean
1  1  4 1.000000 0.0000000    2.5 2.000000 1.386294    5.5
2  2 NA 1.414214 0.6931472    2.5       NA       NA    5.5
3  3  5 1.732051 1.0986123    2.5 2.236068 1.609438    5.5
4 NA  6       NA        NA    2.5 2.449490 1.791759    5.5
5  4  7 2.000000 1.3862944    2.5 2.645751 1.945910    5.5
Question 3
Adjust expand_dates() to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?
To remove the date columns after they have been expanded, add a select() step that drops every column of class Date.
We don’t need to embrace any arguments: expand_dates() takes only the data frame df, and the columns are chosen inside the function with where(is.Date) rather than passed in by the user.
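For contrast, here is a sketch of a variant in which embracing would be required. The function expand_dates_in() and its cols argument are hypothetical, introduced only to illustrate the point: once the caller supplies the columns as an unquoted selection, that argument has to be wrapped in {{ }} wherever it is used.

```r
library(dplyr)
library(lubridate)

# Hypothetical variant: the caller chooses which date columns to expand
expand_dates_in <- function(df, cols) {
  df |>
    mutate(
      # cols is a user-supplied selection, so it must be embraced
      across({{ cols }}, list(year = year, month = month, day = mday))
    ) |>
    # Drop the original columns using the same embraced selection
    select(!{{ cols }})
}

# Made-up example data
df_example <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16"))
)

result <- df_example |> expand_dates_in(date)
names(result)
```

The version in this answer avoids the issue because it has no column-selecting argument at all.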
Here’s the adjusted expand_dates() function:
expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    ) |>
    select(!where(is.Date))
}
df_date <- tibble(
  name = c("Amy", "Bob", "Charlie", "David", "Eva"),
  date = ymd(c("2009-08-03", "2010-01-16", "2012-05-20", "2013-11-30", "2015-07-12"))
)
df_date |> 
  expand_dates()
# A tibble: 5 × 4
  name    date_year date_month date_day
  <chr>       <dbl>      <dbl>    <int>
1 Amy          2009          8        3
2 Bob          2010          1       16
3 Charlie      2012          5       20
4 David        2013         11       30
5 Eva          2015          7       12
Question 4
Explain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?
show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}
nycflights13::flights |> show_missing(c(year, month, day))
The show_missing() function counts the missing values in each column of a data frame, within each group, and keeps only the columns that contain at least one missing value.
Each step of the pipeline works as follows:
- df |> group_by(pick({{ group_vars }})):
  - The pipe operator (|>) passes the data frame df to group_by().
  - group_by() groups the data by the variables in group_vars. Because group_by() uses data-masking rather than tidy-selection, pick() is needed so that group_vars can be a tidy selection such as c(year, month, day). Embracing group_vars in double curly braces ({{ }}) forwards the user's unquoted column names into pick().
- summarize(across({{ summary_vars }}, \(x) sum(is.na(x))), .groups = "drop"):
  - After grouping, summarize() collapses each group to a single row.
  - across() is applied to the columns selected by summary_vars, which is also embraced so that the user can pass unquoted column names. Its default, everything(), selects all non-grouping columns.
  - The anonymous function \(x) sum(is.na(x)) counts the missing values in each column: is.na() returns a logical vector, and sum() counts its TRUE values.
  - The .groups = "drop" argument removes the grouping structure, so the result is an ungrouped data frame.
- select(where(\(x) any(x > 0))):
  - select() chooses which columns to keep.
  - where() applies a predicate function to each column and selects the columns for which it returns TRUE.
  - The predicate \(x) any(x > 0) is TRUE when a column has at least one value greater than 0. For a summary column, that means at least one group contains missing values. The grouping columns year, month, and day are also kept, because their own values are greater than 0.
So, the special feature of where() being taken advantage of is that it accepts any function, here an anonymous one, that returns TRUE or FALSE for a column, and selects columns on that basis. In this case it drops the summary columns that have no missing values in any group.
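The last two steps can be traced on a small made-up tibble. This is only a sketch: toy, counts, and kept are hypothetical names, and the data are invented so that one column has a missing value and another has none.

```r
library(dplyr)

# Hypothetical data: column a has one NA, column b has none
toy <- tibble(
  g = c(1, 1, 2),
  a = c(NA, 2, 3),
  b = c(1, 2, 3)
)

# Steps 1 and 2: count the missing values per group
counts <- toy |>
  group_by(pick(g)) |>
  summarize(across(everything(), \(x) sum(is.na(x))), .groups = "drop")
counts

# Step 3: keep only the columns with at least one value above zero
kept <- counts |> select(where(\(x) any(x > 0)))
names(kept)
# "g" "a"
```

Column b is dropped because its count is 0 in every group, while g survives only because its own values (1 and 2) are positive.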
# Recreating the function
show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}
# Running an example
nycflights13::flights |> 
  show_missing(c(year, month, day)) |>
  slice_head(n = 5) |> 
  gt() |> gt_theme_538()
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | tailnum | air_time |
|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 4 | 4 | 5 | 11 | 0 | 11 | 
| 2013 | 1 | 2 | 8 | 8 | 10 | 15 | 2 | 15 | 
| 2013 | 1 | 3 | 10 | 10 | 10 | 14 | 2 | 14 | 
| 2013 | 1 | 4 | 6 | 6 | 6 | 7 | 2 | 7 | 
| 2013 | 1 | 5 | 3 | 3 | 3 | 3 | 1 | 3 |