library(tidyverse)
library(gt)
library(gtExtras)
library(nycflights13)
data("flights")
Chapter 26
Functions
Important stuff from Chapter 26, R for Data Science (2e)
To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2. To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
In R, the {{ }} (double curly braces, or “embrace”) operator tunnels a data-variable supplied as a function argument into a data-masking tidyverse function, so that callers can pass bare column names to your function.
The := operator lets the name of a new column come from a user-supplied argument: R’s syntax does not allow an embraced expression or a glue string on the left-hand side of =, so tidyverse functions accept name := value instead. (In the data.table package, := has a related but distinct meaning: it adds or updates columns by reference, modifying the existing data.table rather than creating a new one, which avoids unnecessary copying for large datasets.)
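As a sketch of how embracing and := combine, here is a small helper (the function name grouped_mean and its arguments are invented for illustration): {{ }} tunnels the bare column names through, and := lets the glue string on the left-hand side become the new column’s name.

```r
library(dplyr)  # loaded with the tidyverse

# Hypothetical helper: group_var and mean_var are embraced with {{ }},
# so callers can pass bare column names; ":=" allows the glue string
# "mean_{{mean_var}}" to become the name of the new column.
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize("mean_{{mean_var}}" := mean({{ mean_var }}, na.rm = TRUE))
}

mtcars |> grouped_mean(cyl, mpg)
# One row per value of cyl, with a column named mean_mpg
```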
26.2.5 Exercises
Question 1
Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?
mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
The following function computes the proportion of missing values in a vector. It needs one argument.
prop_missing <- function(x){
  mean(is.na(x))
}
_____________
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
The following function computes each element of a vector as a proportion of the sum of the vector. It needs one argument.
prop_element <- function(x){
  x / sum(x, na.rm = TRUE)
}
_____________
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)
The function below computes the percentage of each value relative to the sum of the vector, rounded to one decimal place.
perc_element <- function(x){
  round(x / sum(x, na.rm = TRUE) * 100, 1)
}
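As a quick sanity check, here are the three helpers restated compactly and applied to a small vector (the values are arbitrary, chosen so that the non-missing elements sum to 100):

```r
prop_missing <- function(x) mean(is.na(x))
prop_element <- function(x) x / sum(x, na.rm = TRUE)
perc_element <- function(x) round(x / sum(x, na.rm = TRUE) * 100, 1)

x <- c(10, NA, 30, 60)  # non-missing values sum to 100

prop_missing(x)  # 0.25: one of the four values is NA
prop_element(x)  # 0.1  NA 0.3 0.6
perc_element(x)  # 10   NA 30  60
```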
Question 2
In the second variant of rescale01()
, infinite values are left unchanged. Can you rewrite rescale01()
so that -Inf
is mapped to 0, and Inf
is mapped to 1?
rescale01 <- function(x) {

  # Replace -Inf with the minimum number (other than -Inf)
  min_value <- min(x[is.finite(x)])
  x[x == -Inf] <- min_value

  # Replace +Inf with the maximum number (other than +Inf)
  max_value <- max(x[is.finite(x)])
  x[x == Inf] <- max_value

  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
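A quick check that the endpoints behave as required (the input vector is arbitrary): -Inf is replaced by the smallest finite value before rescaling, so it maps to 0, and Inf maps to 1.

```r
# Restated compactly from the function above
rescale01 <- function(x) {
  x[x == -Inf] <- min(x[is.finite(x)])
  x[x == Inf]  <- max(x[is.finite(x)])
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(-Inf, 1, 3, 5, Inf))
# 0.0 0.0 0.5 1.0 1.0
```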
Question 3
Given a vector of birth-dates, write a function to compute the age in years.
# Function to compute age in years from birth dates
compute_age <- function(birth_dates) {

  # Convert the birth dates to Date objects
  birth_dates <- as.Date(birth_dates)

  # Get the current date
  current_date <- Sys.Date()

  # Calculate age in completed years using lubridate
  ages <- interval(birth_dates, current_date) %/% years(1)

  # Return the ages as a numeric vector
  return(ages)
}
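A quick call (the birth dates are arbitrary; the result depends on the day the code is run, so no output is shown). Note that %/% years(1) counts completed years, so a birthday that has not yet occurred this year is not counted.

```r
library(lubridate)  # loaded with the tidyverse

# Restated compactly from the function above
compute_age <- function(birth_dates) {
  interval(as.Date(birth_dates), Sys.Date()) %/% years(1)
}

compute_age(c("1990-06-15", "2000-02-29"))
```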
Question 4
Write your own functions to compute the variance and skewness of a numeric vector. You can look up the definitions on Wikipedia or elsewhere.
To compute the variance and skewness of a numeric vector, we can define a custom function for each:
- Variance, using the sample-variance formula s^2 = sum((x_i - x_bar)^2) / (n - 1):
# Function to compute the variance
compute_variance <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  sum_squared_diff <- sum((x - x_bar)^2)
  variance <- sum_squared_diff / (n - 1)
  return(variance)
}
- Skewness, using the formula g = sum((x_i - x_bar)^3) / ((n - 1) * s^3):
# Function to compute the skewness
compute_skewness <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  std_dev <- sqrt(compute_variance(x)) # Using previously defined variance function
  skewness <- sum((x - x_bar)^3) / ((n - 1) * std_dev^3)
  return(skewness)
}
These custom functions, compute_variance() and compute_skewness(), calculate the variance and skewness of a numeric vector from the formulas above, without relying on external packages.
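As a sanity check (functions restated compactly; the test vector is arbitrary), the variance should agree with base R’s var(), and the skewness of this right-tailed sample should be positive:

```r
compute_variance <- function(x) sum((x - mean(x))^2) / (length(x) - 1)
compute_skewness <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / ((n - 1) * sqrt(compute_variance(x))^3)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

all.equal(compute_variance(x), var(x))  # TRUE: matches base R's sample variance
compute_skewness(x) > 0                 # TRUE: the long tail is on the right
```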
Question 5
Write both_na()
, a summary function that takes two vectors of the same length and returns the number of positions that have an NA
in both vectors.
# Method 1
both_na <- function(x, y){

  x_na <- which(is.na(x))
  y_na <- which(is.na(y))

  # count the NA positions of x that are also NA positions of y
  common <- x_na %in% y_na
  sum(common)
}
# Method 2
both_na <- function(vector1, vector2) {
  if (length(vector1) != length(vector2)) {
    stop("Both vectors must have the same length.")
  }

  # Find the indices where both vectors have NA values, then count them
  na_indices <- which(is.na(vector1) & is.na(vector2))
  length(na_indices)
}
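A still more compact version is possible, since summing a logical vector counts its TRUE values. A small check on arbitrary input vectors:

```r
# Method 3: sum() over the elementwise AND counts positions that are NA in both
both_na <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

both_na(c(NA, 1, NA, 4), c(NA, NA, NA, 4))  # 2: positions 1 and 3
```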
Question 6
Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?
is_directory <- function(x) {
file.info(x)$isdir
}
This function tells you whether a path refers to a directory. It is useful because it returns a logical value, i.e., either TRUE or FALSE, and thus can be used directly inside other operations (such as if() conditions), it hides the file.info()$isdir incantation behind a name, and it is easy to remember in English.
_________
is_readable <- function(x) {
file.access(x, 4) == 0
}
This function tells us whether the current user has read permission for a file, i.e., whether there is read access to the file or not. It is useful because one does not have to remember the mode argument values of the file.access() function (4 tests readability), and the name is easy to remember in English for further use in conditions, for loops, etc.
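A quick demonstration (restating both helpers; "." is the working directory, which always exists and is a directory):

```r
is_directory <- function(x) file.info(x)$isdir
is_readable  <- function(x) file.access(x, 4) == 0

is_directory(".")  # TRUE
is_readable(".")   # TRUE on any system where the working directory can be read
```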
26.3.5 Exercises
Question 1
Using the datasets from nycflights13
, write a function that:
# Creating a function to display the results in a nice way
display_results <- function(data){
  data |>
    slice_head(n = 5) |>
    gt() |>
    cols_label_with(fn = ~ janitor::make_clean_names(., "title")) |>
    gt_theme_538()
}
Finds all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour: flights |> filter_severe()

filter_severe <- function(data, arrival_delay) {
  data |>
    filter(is.na({{arrival_delay}}) | ({{arrival_delay}} > 60))
}

# Running an example to show it works
flights |>
  filter_severe(arr_delay) |>
  select(1:10) |>
  display_results()
Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 811 | 630 | 101 | 1047 | 830 | 137 | MQ |
2013 | 1 | 1 | 848 | 1835 | 853 | 1001 | 1950 | 851 | MQ |
2013 | 1 | 1 | 957 | 733 | 144 | 1056 | 853 | 123 | UA |
2013 | 1 | 1 | 1114 | 900 | 134 | 1447 | 1222 | 145 | UA |
2013 | 1 | 1 | 1120 | 944 | 96 | 1331 | 1213 | 78 | EV |

Counts the number of cancelled flights and the number of flights delayed by more than an hour:
flights |> group_by(dest) |> summarize_severe()
summarize_severe <- function(data, arrival_delay){
  data |>
    summarize(
      cancelled_flights = sum(is.na({{arrival_delay}})),
      delayed_flights = sum({{arrival_delay}} > 60, na.rm = TRUE)
    )
}

# Running an example to show it works
flights |>
  group_by(dest) |>
  summarize_severe(arr_delay) |>
  display_results()
Dest | Cancelled Flights | Delayed Flights |
---|---|---|
ABQ | 0 | 25 |
ACK | 1 | 12 |
ALB | 21 | 59 |
ANC | 0 | 0 |
ATL | 378 | 1433 |

Finds all flights that were cancelled or delayed by more than a user supplied number of hours:
flights |> filter_severe(hours = 2)
filter_severe <- function(data, hours){
  data |>
    filter(arr_delay > (hours * 60))
}

# Running an example to show that it works
flights |>
  filter_severe(hours = 4) |>
  select(1:10) |>
  display_results()
Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 848 | 1835 | 853 | 1001 | 1950 | 851 | MQ |
2013 | 1 | 1 | 1815 | 1325 | 290 | 2120 | 1542 | 338 | EV |
2013 | 1 | 1 | 1842 | 1422 | 260 | 1958 | 1535 | 263 | EV |
2013 | 1 | 1 | 2115 | 1700 | 255 | 2330 | 1920 | 250 | 9E |
2013 | 1 | 1 | 2205 | 1720 | 285 | 46 | 2040 | 246 | AA |

Summarizes the weather to compute the minimum, mean, and maximum of a user-supplied variable:
weather |> summarize_weather(temp)
summarize_weather <- function(data, variable){
  data |>
    summarize(
      mean = mean({{variable}}, na.rm = TRUE),
      minimum = min({{variable}}, na.rm = TRUE),
      maximum = max({{variable}}, na.rm = TRUE)
    )
}

# Running an example to show it works
weather |>
  group_by(origin) |>
  summarize_weather(temp) |>
  display_results() |>
  fmt_number(decimals = 2)
Origin | Mean | Minimum | Maximum |
---|---|---|---|
EWR | 55.55 | 10.94 | 100.04 |
JFK | 54.47 | 12.02 | 98.06 |
LGA | 55.76 | 12.02 | 98.96 |

Converts the user supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e. hours + (minutes / 60)).
flights |> standardize_time(sched_dep_time)
# Method 1
standardize_time <- function(data, variable){
  data |>
    mutate(std_time = ({{variable}} %/% 100) + ({{variable}} %% 100) / 60) |>
    relocate(std_time, .after = {{variable}})
}

# Method 2 (after learning use of ":=" in Section 26.4.2)
standardize_time <- function(data, variable){
  data |>
    mutate({{variable}} := ({{variable}} %/% 100) + ({{variable}} %% 100) / 60)
}

# Running an example to show it works
flights |>
  standardize_time(sched_dep_time) |>
  select(1:10) |>
  display_results()
Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 517 | 5.250000 | 2 | 830 | 819 | 11 | UA |
2013 | 1 | 1 | 533 | 5.483333 | 4 | 850 | 830 | 20 | UA |
2013 | 1 | 1 | 542 | 5.666667 | 2 | 923 | 850 | 33 | AA |
2013 | 1 | 1 | 544 | 5.750000 | -1 | 1004 | 1022 | -18 | B6 |
2013 | 1 | 1 | 554 | 6.000000 | -6 | 812 | 837 | -25 | DL |
Question 2
For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: distinct()
, count()
, group_by()
, rename_with()
, slice_min()
, slice_sample()
.
In the context of the dplyr package in R, tidy evaluation is a form of non-standard evaluation that lets you refer to variables inside a data frame as if they were ordinary objects, and pass them as arguments to functions. It has two parts:
Data Masking: Data masking lets you compute with the columns of a data frame as if they were variables in the environment. Inside a data-masking verb such as mutate() or filter(), a data variable (a column) takes precedence over an environment variable with the same name, temporarily “masking” it. This prevents unintended interactions with objects in the global environment. For example:
data <- data.frame(x = 1:5, y = 6:10)

custom_function <- function(df) {
  # Inside mutate(), data masking makes x and y refer to columns of df
  df %>%
    mutate(z = x + y)
}

custom_function(data)
  x  y  z
1 1  6  7
2 2  7  9
3 3  8 11
4 4  9 13
5 5 10 15
In this example, custom_function takes a data frame df as an argument and uses the mutate() function from dplyr to create a new variable z. Inside the function, data masking ensures that x and y refer to the columns of df, not to any objects with those names in the global environment.
Tidy Selection: Tidy selection involves selecting or manipulating columns within a data frame using non-standard evaluation. It allows you to refer to column names as symbols, expressions, or selection helpers (such as all_of() or starts_with()), facilitating dynamic and programmatic column selection. For example:
data <- data.frame(x = 1:5, y = 6:10)
selected_columns <- c("x", "y")

data %>%
  select(all_of(selected_columns))
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10
In this example, the select() function uses tidy selection (via the all_of() helper) to select the columns named in the selected_columns character vector.
Thus, data masking is about computing with the values inside a data frame, while tidy selection is about choosing columns by name, position, or type.
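The precedence rule for data masking can be seen directly. In this sketch (names chosen for illustration), an environment variable and a column share the name n, and the .env pronoun from rlang is used to reach the environment variable explicitly:

```r
library(dplyr)

n  <- 100                  # an environment variable
df <- data.frame(n = 1:3)  # a column with the same name

df |> mutate(doubled = n * 2)       # data variable wins: doubled is 2, 4, 6
df |> mutate(doubled = .env$n * 2)  # .env pronoun: doubled is 200, 200, 200
```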
The functions in question use tidy evaluation as follows:

Function | Tidy-evaluated argument(s) | Style |
---|---|---|
distinct() | ... (the variables to use when determining uniqueness) | Data masking |
count() | ... (the variables to group by), wt | Data masking |
group_by() | ... (the variables or computations to group by) | Data masking |
rename_with() | .cols | Tidy selection |
slice_min() | order_by | Data masking |
slice_sample() | weight_by | Data masking |

In summary, most of these functions use data masking for their variable arguments; rename_with() is the exception, using tidy selection for its .cols argument.
Question 3
Generalize the following function so that you can supply any number of variables to count.
count_prop <- function(df, var, sort = FALSE) {
df |>
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
}
Here is the modified function, which uses pick() to allow the user to supply any number of variables to count:

count_prop <- function(df, vars, sort = FALSE) {
  df |>
    count(pick({{ vars }}), sort = sort) |>
    mutate(prop = n / sum(n))
}
# Testing the results
flights |>
  count_prop(c(dest, origin)) |>
  slice_head(n = 10) |>
  gt() |>
  gt_theme_538() |>
  fmt_number(prop, decimals = 4)
dest | origin | n | prop |
---|---|---|---|
ABQ | JFK | 254 | 0.0008 |
ACK | JFK | 265 | 0.0008 |
ALB | EWR | 439 | 0.0013 |
ANC | EWR | 8 | 0.0000 |
ATL | EWR | 5022 | 0.0149 |
ATL | JFK | 1930 | 0.0057 |
ATL | LGA | 10263 | 0.0305 |
AUS | EWR | 968 | 0.0029 |
AUS | JFK | 1471 | 0.0044 |
AVL | EWR | 265 | 0.0008 |
26.4.4 Exercises
Build up a rich plotting function by incrementally implementing each of the steps below:
Question 1
Draw a scatter plot given a dataset and x and y variables.
scatterplot <- function(data, x, y){
  data |>
    ggplot(aes(x = {{x}}, y = {{y}})) +
    geom_point() +
    theme_classic()
}
Question 2
Add a line of best fit (i.e. a linear model with no standard errors).
scatterplot <- function(data, x, y){
  data |>
    ggplot(aes(x = {{x}}, y = {{y}})) +
    geom_point() +
    geom_smooth(method = "lm",
                formula = y ~ x,  # refers to the x and y aesthetics, so no embracing here
                se = FALSE) +
    theme_classic()
}
Question 3
Add a title.
scatterplot <- function(data, x, y){
  ggplot(data, aes(x = {{x}}, y = {{y}})) +
    geom_point() +
    geom_smooth(method = "lm",
                formula = y ~ x,
                se = FALSE) +
    labs(title = rlang::englue("A scatter plot of {{y}} vs. {{x}}")) +
    theme_classic()
}
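Assuming the scatterplot() function above is defined, a quick try on a built-in dataset (mtcars, chosen arbitrarily); englue() builds the title from the supplied variable names:

```r
scatterplot(mtcars, wt, mpg)
# A scatter plot of mpg vs. wt with a fitted line and the title
# "A scatter plot of mpg vs. wt"
```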
26.5.1 Exercises
Question 1
Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.
is_prefix()
# is_prefix
f1 <- function(string, prefix) {
str_sub(string, 1, str_length(prefix)) == prefix
}
match_length()
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
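A quick demonstration of each (restated here; inputs arbitrary), which supports the suggested names is_prefix() and match_length():

```r
library(stringr)

f1 <- function(string, prefix) str_sub(string, 1, str_length(prefix)) == prefix
f3 <- function(x, y) rep(y, length.out = length(x))

f1(c("apple", "apricot", "banana"), "ap")  # TRUE TRUE FALSE
f3(1:5, 0)                                 # 0 0 0 0 0
```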
Question 2
Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
Here are some suggestions for renaming html_nodes() and its arguments (from the rvest package):

Original function: html_nodes()

- html_extract() - This name suggests the action of extracting nodes from an HTML document.
- select_html_nodes() - A more descriptive name that clarifies the purpose of the function.

Argument css:

- selector - A more explicit name to indicate that this argument takes CSS selectors.
- element_selector - If the purpose of the argument is to select HTML elements, this name makes it clear.

Argument xpath:

- xpath_expression - A more descriptive name that specifies the type of input expected.
- xml_path - A concise alternative that still hints at the purpose.

Argument trim:

- remove_whitespace - A name that better reflects the action performed when trim is set to TRUE.
- clean_whitespace - Another option to indicate the removal of whitespace.
Question 3
Make a case for why norm_r()
, norm_d()
etc. would be better than rnorm()
, dnorm()
. Make a case for the opposite. How could you make the names even clearer?
Function naming conventions matter for discoverability: grouping related functions under a common prefix makes them easy to find, especially with autocomplete.

Arguments for norm_r() and norm_d() (distribution-first names):

Discoverability: Putting the distribution name first groups the whole family together, so typing norm_ in an autocompleting editor surfaces norm_r(), norm_d(), norm_p(), and norm_q() at once.

Consistency: A shared family prefix matches the tidyverse convention seen in, e.g., str_detect() and str_replace() in stringr, or read_csv() and read_tsv() in readr.

Clarity: The norm_ prefix makes it obvious at a glance that all of these functions concern the normal distribution, with the suffix saying what each one does (r for random draws, d for density).
Arguments for rnorm()
and dnorm()
(Current Names):
Historical Convention: Functions like rnorm() and dnorm() have been part of base R for a long time. Changing their names would confuse users who are already familiar with them.

Alignment with Mathematical Notation: The r/d/p/q prefixes mirror standard statistical notation (random draws, density, distribution function, quantile function), making the functions easy to recognize for users with a statistics background.

Package Consistency: Maintaining the original names ensures consistency with the other functions in the stats package (e.g., pnorm(), qnorm(), runif(), dbinom()), which follow the same convention.
Even Clearer Names: If we want to make the function names even clearer, we can consider longer, more descriptive names, such as:

generate_random_samples_from_normal_distribution() for rnorm(): this explicitly states what the function does.

probability_density_function_of_normal_distribution() for dnorm(): this specifies the exact purpose of the function.
While these longer names are more descriptive, they may become unwieldy in code. Striking a balance between clarity and brevity is important.