library(tidyverse)
library(gt)
library(gtExtras)
Chapter 27
Iteration
27.2.8 Exercises
Question 1
Practice your across() skills by:
Computing the number of unique values in each column of palmerpenguins::penguins.
palmerpenguins::penguins |>
  summarise(across(.cols = everything(), .fns = n_distinct))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>
Computing the mean of every column in mtcars.
mtcars |>
  summarise(across(everything(), mean))
       mpg    cyl     disp       hp     drat      wt     qsec     vs      am
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    gear   carb
1 3.6875 2.8125
Grouping diamonds by cut, clarity, and color, then counting the number of observations and computing the mean of each numeric column.
ggplot2::diamonds |>
  group_by(cut, clarity, color) |>
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = function(x) mean(x, na.rm = TRUE)
    ),
    n = n()
  )
# A tibble: 276 × 11
# Groups:   cut, clarity [40]
   cut   clarity color carat depth table price     x     y     z     n
   <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
 1 Fair  I1      D     1.88   65.6  56.8 7383   7.52  7.42  4.90     4
 2 Fair  I1      E     0.969  65.6  58.1 2095.  6.17  6.06  4.01     9
 3 Fair  I1      F     1.02   65.7  58.4 2544.  6.14  6.04  4.00    35
 4 Fair  I1      G     1.23   65.3  57.7 3187.  6.52  6.43  4.23    53
 5 Fair  I1      H     1.50   65.8  58.4 4213.  6.96  6.86  4.55    52
 6 Fair  I1      I     1.32   65.7  58.4 3501   6.76  6.65  4.41    34
 7 Fair  I1      J     1.99   66.5  57.9 5795.  7.55  7.46  4.99    23
 8 Fair  SI2     D     1.02   64.7  58.6 4355.  6.24  6.17  4.01    56
 9 Fair  SI2     E     1.02   63.4  59.5 4172.  6.28  6.22  3.96    78
10 Fair  SI2     F     1.08   63.8  59.5 4520.  6.36  6.30  4.04    89
# ℹ 266 more rows
Question 2
What happens if you use a list of functions in across(), but don’t name them? How is the output named?
Using a list of functions in across() without naming them results in automatically generated names for the output columns. The default naming convention (controlled by the .names argument) is:
- "{.col}" for the single function case;
- "{.col}_{.fn}" for the case where a named function list is used for .fns;
- "{.col}_1", "{.col}_2", etc. for the case where an unnamed list is used for .fns, because {.fn} then falls back to the position of each function in the list.
Thus, with a named list, the resulting columns are named like x_mean when the original column is x and the function is named mean. With an unnamed list, they are named x_1, x_2, and so on, in the order the functions appear in the list.
# Create a sample data frame
data <- data.frame(
  A = c(1, 2, 3, NA, 4),
  B = c(4, NA, 5, 6, 7)
)
# Use across() without naming the functions
data %>%
  mutate(across(everything(), list(sqrt, log)))
A B A_1 A_2 B_1 B_2
1 1 4 1.000000 0.0000000 2.000000 1.386294
2 2 NA 1.414214 0.6931472 NA NA
3 3 5 1.732051 1.0986123 2.236068 1.609438
4 NA 6 NA NA 2.449490 1.791759
5 4 7 2.000000 1.3862944 2.645751 1.945910
# Use across() with named new functions
data %>%
  mutate(
    across(
      everything(),
      list(
        sqrt = sqrt,
        log = log,
        mean = \(x) mean(x, na.rm = TRUE)
      )
    )
  )
A B A_sqrt A_log A_mean B_sqrt B_log B_mean
1 1 4 1.000000 0.0000000 2.5 2.000000 1.386294 5.5
2 2 NA 1.414214 0.6931472 2.5 NA NA 5.5
3 3 5 1.732051 1.0986123 2.5 2.236068 1.609438 5.5
4 NA 6 NA NA 2.5 2.449490 1.791759 5.5
5 4 7 2.000000 1.3862944 2.5 2.645751 1.945910 5.5
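The default names shown above can also be overridden with the .names argument of across(). The following is a minimal sketch, assuming the same sample data as above (rebuilt here under the hypothetical name df so the block runs on its own); the glue patterns "{.fn}_{.col}" and "{.col}_fn{.fn}" are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Hypothetical copy of the sample data used above
df <- data.frame(
  A = c(1, 2, 3, NA, 4),
  B = c(4, NA, 5, 6, 7)
)

# Named list: put the function name first instead of the column name
renamed <- df |>
  mutate(
    across(everything(), list(sqrt = sqrt, log = log), .names = "{.fn}_{.col}")
  )
names(renamed)

# Unnamed list: {.fn} is the position of each function in the list
positional <- df |>
  mutate(
    across(everything(), list(sqrt, log), .names = "{.col}_fn{.fn}")
  )
names(positional)
```

Naming the functions is usually the more readable choice, since position-based names such as A_1 say nothing about which transformation produced the column.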
Question 3
Adjust expand_dates() to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?
To make expand_dates() remove the date columns after they’ve been expanded, we add a select() step that drops the original date columns.
We don’t need to embrace any arguments: the function takes only the data frame df, and the columns are chosen by the predicate where(is.Date) rather than by a column expression supplied by the caller.
Here’s the adjusted expand_dates() function:
expand_dates <- function(df) {
  df |>
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    ) |>
    select(!where(is.Date))
}
df_date <- tibble(
  name = c("Amy", "Bob", "Charlie", "David", "Eva"),
  date = ymd(c("2009-08-03", "2010-01-16", "2012-05-20", "2013-11-30", "2015-07-12"))
)
df_date |>
  expand_dates()
# A tibble: 5 × 4
name date_year date_month date_day
<chr> <dbl> <dbl> <int>
1 Amy 2009 8 3
2 Bob 2010 1 16
3 Charlie 2012 5 20
4 David 2013 11 30
5 Eva 2015 7 12
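For contrast, embracing would become necessary if the caller were allowed to choose which date columns to expand. The sketch below assumes a hypothetical variant, expand_dates_in(), with an extra cols argument; neither the function name nor the sample tibble df_two_dates comes from the original exercise.

```r
library(dplyr)
library(lubridate)

# Hypothetical variant: `cols` is a tidy-select expression supplied by the
# caller, so it must be embraced wherever it is used
expand_dates_in <- function(df, cols) {
  df |>
    mutate(
      across({{ cols }}, list(year = year, month = month, day = mday))
    ) |>
    select(!{{ cols }})
}

# Hypothetical data with two date columns
df_two_dates <- tibble(
  name = c("Amy", "Bob"),
  start = ymd(c("2009-08-03", "2010-01-16")),
  end = ymd(c("2009-09-01", "2010-02-20"))
)

# Expand and drop only `start`; `end` is left untouched
result <- df_two_dates |> expand_dates_in(start)
names(result)
```

Here cols appears in two tidy-select contexts, across() and select(), and is embraced in both so that the caller's expression is used rather than a literal column named cols.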
Question 4
Explain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?
show_missing <- function(df, group_vars, summary_vars = everything()) {
df |>
group_by(pick({{ group_vars }})) |>
summarize(
across({{ summary_vars }}, \(x) sum(is.na(x))),
.groups = "drop"
) |>
select(where(\(x) any(x > 0)))
}
nycflights13::flights |> show_missing(c(year, month, day))
The show_missing() function counts the missing values in each column of a data frame, within each group defined by group_vars.
Each step of the pipeline is explained below:
df |> group_by(pick({{ group_vars }})):
- The pipe operator (|>) passes the data frame df to the next operation.
- group_by() groups the data by the variables specified by group_vars. The group_vars argument is enclosed in double curly braces ({{ }}), a tidy evaluation feature that lets the caller pass column names as unquoted arguments.
- Wrapping the embraced argument in pick() lets group_by(), which uses data masking, accept a tidy selection such as c(year, month, day).

summarize(across({{ summary_vars }}, \(x) sum(is.na(x))), .groups = "drop"):
- After grouping the data, summarize() computes one row of summary statistics per group.
- across() is applied to the variables specified by summary_vars. Like group_vars, summary_vars is embraced, so the caller can pass an unquoted selection of columns. It defaults to everything(), which here means every non-grouping column.
- Inside across(), the anonymous (lambda) function \(x) sum(is.na(x)) counts the missing values in each column: is.na() returns a logical vector for column x, and sum() counts its TRUE values.
- The .groups = "drop" argument drops the grouping structure after summarizing, so the result is an ungrouped data frame.

select(where(\(x) any(x > 0))):
- Finally, select() chooses the columns to display.
- where() selects columns based on a condition, supplied as a function that returns TRUE or FALSE for each column.
- The condition used here is \(x) any(x > 0), which checks whether any value in column x is greater than 0. After the previous step, each summary column holds counts of missing values, so a value above zero means that the column has missing data in at least one group.
- This step therefore keeps only the columns with at least one missing value and drops the columns that are complete in every group.
So, the special feature of where() being taken advantage of is that it accepts any function returning TRUE or FALSE for a column, including an anonymous function written on the spot, and not only type predicates such as is.numeric. Here it is used to keep only the columns whose missing-value count is positive in at least one group. In the example, the grouping columns year, month, and day are kept as well, because their values are all greater than zero.
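The three steps can be traced on a small data frame. This is a minimal sketch, assuming a hypothetical tibble toy with one grouping column g, one column with a missing value (x), and one complete column (y); it repeats the pipeline's logic inline rather than calling show_missing().

```r
library(dplyr)

# Hypothetical toy data: x has one missing value, y has none
toy <- tibble(
  g = c(1, 1, 2, 2),
  x = c(1, NA, 3, 4),
  y = c(1, 2, 3, 4)
)

# Steps 1 and 2: group, then count missing values per group and column
na_counts <- toy |>
  group_by(pick(g)) |>
  summarize(across(everything(), \(v) sum(is.na(v))), .groups = "drop")

# Step 3: keep only columns in which some value is greater than zero
result <- na_counts |>
  select(where(\(v) any(v > 0)))

names(na_counts)
names(result)
```

The column y is dropped in the last step because its missing-value count is zero in both groups, while g survives only because its own values happen to be positive.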
# Recreating the function
show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}
# Running an example
nycflights13::flights |>
  show_missing(c(year, month, day)) |>
  slice_head(n = 5) |>
  gt() |> gt_theme_538()
year | month | day | dep_time | dep_delay | arr_time | arr_delay | tailnum | air_time |
---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 4 | 4 | 5 | 11 | 0 | 11 |
2013 | 1 | 2 | 8 | 8 | 10 | 15 | 2 | 15 |
2013 | 1 | 3 | 10 | 10 | 10 | 14 | 2 | 14 |
2013 | 1 | 4 | 6 | 6 | 6 | 7 | 2 | 7 |
2013 | 1 | 5 | 3 | 3 | 3 | 3 | 1 | 3 |