Chapter 27

Iteration

Author

Aditya Dahiya

Published

October 10, 2023

library(tidyverse)
library(gt)
library(gtExtras)

27.2.8 Exercises

Question 1

Practice your across() skills by:

Computing the number of unique values in each column of palmerpenguins::penguins.

palmerpenguins::penguins |>
  summarise(across(.cols = everything(),
                   .fns = n_distinct))

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>

Computing the mean of every column in mtcars.

mtcars |>
  summarise(across(everything(), mean))

       mpg    cyl     disp       hp     drat      wt     qsec     vs      am
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    gear   carb
1 3.6875 2.8125

Grouping diamonds by cut, clarity, and color then counting the number of observations and computing the mean of each numeric column.

ggplot2::diamonds |>
  group_by(cut, clarity, color) |>
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = function(x) mean(x, na.rm = TRUE)
    ),
    n = n()
  )

# A tibble: 276 × 11
# Groups:   cut, clarity [40]
   cut   clarity color carat depth table price     x     y     z     n
   <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
 1 Fair  I1      D     1.88   65.6  56.8 7383   7.52  7.42  4.90     4
 2 Fair  I1      E     0.969  65.6  58.1 2095.  6.17  6.06  4.01     9
 3 Fair  I1      F     1.02   65.7  58.4 2544.  6.14  6.04  4.00    35
 4 Fair  I1      G     1.23   65.3  57.7 3187.  6.52  6.43  4.23    53
 5 Fair  I1      H     1.50   65.8  58.4 4213.  6.96  6.86  4.55    52
 6 Fair  I1      I     1.32   65.7  58.4 3501   6.76  6.65  4.41    34
 7 Fair  I1      J     1.99   66.5  57.9 5795.  7.55  7.46  4.99    23
 8 Fair  SI2     D     1.02   64.7  58.6 4355.  6.24  6.17  4.01    56
 9 Fair  SI2     E     1.02   63.4  59.5 4172.  6.28  6.22  3.96    78
10 Fair  SI2     F     1.08   63.8  59.5 4520.  6.36  6.30  4.04    89
# ℹ 266 more rows

Question 2

What happens if you use a list of functions in across(), but don’t name them? How is the output named?

When a list of functions is passed to across() without names, the output columns are still named automatically. The default value of the .names argument is:

"{.col}" when .fns is a single function, and
"{.col}_{.fn}" when .fns is a list. With a named list, {.fn} is the name given to each function.
With an unnamed list, {.fn} is the position of each function in the list, which gives "{.col}_1", "{.col}_2", and so on.

Thus, for a column x, a list element named mean produces the column x_mean, whereas the same function supplied without a name as the first element of the list produces x_1.

# Create a sample data frame
data <- data.frame(
  A = c(1, 2, 3, NA, 4),
  B = c(4, NA, 5, 6, 7)
)

# Use across() without naming the functions
data %>%
  mutate(across(everything(), list(sqrt, log)))

   A  B      A_1       A_2      B_1      B_2
1  1  4 1.000000 0.0000000 2.000000 1.386294
2  2 NA 1.414214 0.6931472       NA       NA
3  3  5 1.732051 1.0986123 2.236068 1.609438
4 NA  6       NA        NA 2.449490 1.791759
5  4  7 2.000000 1.3862944 2.645751 1.945910

# Use across() with a named list of functions
data %>%
  mutate(
    across(
      everything(), 
      list(sqrt = sqrt, 
           log = log,
           mean = \(x) mean(x, na.rm = TRUE)
           )
      )
    )

   A  B   A_sqrt     A_log A_mean   B_sqrt    B_log B_mean
1  1  4 1.000000 0.0000000    2.5 2.000000 1.386294    5.5
2  2 NA 1.414214 0.6931472    2.5       NA       NA    5.5
3  3  5 1.732051 1.0986123    2.5 2.236068 1.609438    5.5
4 NA  6       NA        NA    2.5 2.449490 1.791759    5.5
5  4  7 2.000000 1.3862944    2.5 2.645751 1.945910    5.5
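The default naming can also be checked by inspecting the column names directly, and it can be overridden with the .names argument of across(). The following is a minimal sketch on the same sample data; the glue specification "{.fn}_of_{.col}" and the object names unnamed and renamed are illustrative choices, not part of the exercise.

```r
library(dplyr)

data <- data.frame(
  A = c(1, 2, 3, NA, 4),
  B = c(4, NA, 5, 6, 7)
)

# Unnamed list: {.fn} falls back to the position of each function in the list
unnamed <- data |>
  summarise(across(everything(), list(min, max)))
names(unnamed)

# Named list plus .names: the glue specification replaces the default
# "{.col}_{.fn}" pattern (the pattern used here is an arbitrary example)
renamed <- data |>
  summarise(
    across(
      everything(),
      list(
        min = \(x) min(x, na.rm = TRUE),
        max = \(x) max(x, na.rm = TRUE)
      ),
      .names = "{.fn}_of_{.col}"
    )
  )
names(renamed)
```

Naming the list elements is usually the simpler fix; .names is useful mainly when the position of the column name within the output name matters.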

Question 3

Adjust expand_dates() to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?

To make expand_dates() remove the date columns once they have been expanded, we add a select() step that drops every column of class Date.

We don’t need to embrace any arguments: the only argument of the function is df, and the columns are chosen inside the function with where(is.Date) rather than supplied by the caller.

Here’s the adjusted expand_dates() function:

expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    ) |>
    select(!where(is.Date))
}

df_date <- tibble(
  name = c("Amy", "Bob", "Charlie", "David", "Eva"),
  date = ymd(c("2009-08-03", "2010-01-16", "2012-05-20", "2013-11-30", "2015-07-12"))
  )

df_date |> 
  expand_dates()

# A tibble: 5 × 4
  name    date_year date_month date_day
  <chr>       <dbl>      <dbl>    <int>
1 Amy          2009          8        3
2 Bob          2010          1       16
3 Charlie      2012          5       20
4 David        2013         11       30
5 Eva          2015          7       12
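For contrast, embracing would become necessary if the caller, rather than where(is.Date), decided which columns to expand. The sketch below assumes a hypothetical variant, expand_selected_dates(), with an extra cols argument; neither the function nor the sample data df_two_dates comes from the exercise.

```r
library(dplyr)
library(lubridate)

# Hypothetical variant: the caller chooses which date columns to expand,
# so `cols` has to be embraced wherever it is used for tidy selection
expand_selected_dates <- function(df, cols) {
  df |>
    mutate(
      across({{ cols }}, list(year = year, month = month, day = mday))
    ) |>
    select(!{{ cols }})
}

# Illustrative data with two date columns
df_two_dates <- tibble(
  name = c("Amy", "Bob"),
  start = ymd(c("2009-08-03", "2010-01-16")),
  end = ymd(c("2011-02-04", "2012-03-05"))
)

# Only `start` is expanded and removed; `end` is left untouched
result <- df_two_dates |> expand_selected_dates(start)
names(result)
```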

Question 4

Explain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?

show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}
nycflights13::flights |> show_missing(c(year, month, day))

The show_missing() function counts the missing values in each column of a data frame, by group, and returns only the columns that contain at least one missing value.

Each step of the pipeline works as follows:

df |> group_by(pick({{ group_vars }})):
- The pipe operator (|>) passes the data frame df to group_by().
- group_by() groups the data by the variables given in group_vars. The argument is embraced with double curly braces ({{ }}), a tidy evaluation feature that lets the caller supply unquoted column names and forwards them to the code inside the function.
- group_by() uses data masking, whereas a specification such as c(year, month, day) is a tidy selection. Wrapping the embraced argument in pick() makes it possible to use tidy selection inside a data-masking function.
summarize(across({{ summary_vars }}, \(x) sum(is.na(x))), .groups = "drop"):
- After grouping, summarize() reduces each group to a single row.
- across() applies the same function to every column selected by summary_vars. Like group_vars, summary_vars is embraced, so the caller can pass any tidy-select specification as an unquoted argument. The default, everything(), selects all non-grouping columns.
- The anonymous function \(x) sum(is.na(x)) counts the missing values in each column: is.na() returns a logical vector, and sum() counts its TRUE values.
- The .groups = "drop" argument removes the grouping structure after summarizing, so the result is an ungrouped data frame.
select(where(\(x) any(x > 0))):
- Finally, select() chooses the columns to keep.
- where() applies a predicate function to each column and selects the columns for which it returns TRUE.
- The predicate used here, \(x) any(x > 0), is TRUE when at least one value in column x is greater than 0. In the summary, a count greater than 0 means that the original column has missing values in at least one group.
- This step therefore keeps only the columns with missing values, together with the grouping columns year, month, and day, whose own values are positive numbers.

So, the special feature of where() being taken advantage of is that it accepts an arbitrary predicate function, which is applied to the values of each column, rather than only a type check such as is.numeric. Here the predicate inspects the summarised counts, so the output shows only the columns that have missing data in at least one group.
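The steps described above can be traced on a small example. This is a minimal sketch on made-up data: the tibble toy and its columns g, x, y, and z are assumptions for illustration, and the pipeline is split in two so that the intermediate counts can be inspected.

```r
library(dplyr)

# Made-up data: x and z contain missing values, y does not
toy <- tibble(
  g = c(1, 1, 2, 2),
  x = c(1, NA, 3, NA),
  y = c(1, 2, 3, 4),
  z = c(NA, NA, "u", "v")
)

# Steps 1 and 2: group, then count the missing values per column and group
na_counts <- toy |>
  group_by(pick(g)) |>
  summarize(across(everything(), \(v) sum(is.na(v))), .groups = "drop")
na_counts

# Step 3: keep the columns in which at least one value is positive.
# y is dropped because its counts are all 0; g is kept because its own
# values (1 and 2) are positive.
kept <- na_counts |>
  select(where(\(v) any(v > 0)))
names(kept)
```

Note that the predicate is also evaluated on the grouping column, so the result depends on the grouping values as well as on the counts.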

# Recreating the function
show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}

# Running an example
nycflights13::flights |> 
  show_missing(c(year, month, day)) |>
  slice_head(n = 5) |> 
  gt() |> gt_theme_538()

year  month  day  dep_time  dep_delay  arr_time  arr_delay  tailnum  air_time
2013      1    1         4          4         5         11        0        11
2013      1    2         8          8        10         15        2        15
2013      1    3        10         10        10         14        2        14
2013      1    4         6          6         6          7        2         7
2013      1    5         3          3         3          3        1         3