library(tidyverse)
library(gt)
library(gtExtras)
library(nycflights13)
data("flights")

Chapter 26
Functions
Important stuff from Chapter 26, R for Data Science (2e)
To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2. To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
In tidyverse functions, the {{ }} (double curly braces, or "embracing") operator tunnels a user-supplied argument into a data-masking verb: it tells dplyr to look the argument up as a column of the data frame rather than as an ordinary variable in the calling environment.
The := operator is used when the name of a new column must itself come from a function argument or a glue string: inside mutate(), summarize(), and similar verbs, the usual = does not allow anything to be injected on its left-hand side, so := is used instead. (The same symbol is used by the data.table package to add or update columns by reference, i.e., modifying the existing data.table in place rather than copying it, which is efficient for large datasets.)
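A minimal sketch showing both operators inside a single helper (dplyr is attached via library(tidyverse) above; the function group_mean() and its arguments are made up purely for illustration):

# Compute the mean of a user-supplied column within user-supplied groups;
# the result column is named after the variable that was passed in.
group_mean <- function(data, group, var) {
  data |>
    group_by({{ group }}) |>                                       # {{ }} tunnels a column name
    summarize("mean_{{ var }}" := mean({{ var }}, na.rm = TRUE))   # := allows a glue-style name
}

# Example: mtcars |> group_mean(cyl, mpg) returns one mean_mpg value per value of cyl.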
26.2.5 Exercises
Question 1
Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?
mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
The following function computes the proportion of missing values in a vector. It needs one argument.
prop_missing <- function(x){
mean(is.na(x))
}
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
The following function expresses each element of a vector as a proportion of the vector’s sum. It needs one argument.
prop_element <- function(x){
x / sum(x, na.rm = TRUE)
}
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)
The function below expresses each value as a percentage of the vector’s sum, rounded to one decimal place.
perc_element <- function(x){
round(x / sum(x, na.rm = TRUE) * 100, 1)
}
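A quick check of the three helpers on a small vector with one missing value; the test vector is arbitrary and the expected results are shown as comments:

x <- c(1, 2, NA, 5)
prop_missing(x)   # 0.25
prop_element(x)   # 0.125 0.250 NA 0.625
perc_element(x)   # 12.5 25.0 NA 62.5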
Question 2
In the second variant of rescale01(), infinite values are left unchanged. Can you rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1?
rescale01 <- function(x) {
# Replace -Inf with the minimum number (other than -Inf)
min_value <- min(x[is.finite(x)])
x[x == -Inf] <- min_value
# Replace +Inf with the maximum number (other than +Inf)
max_value <- max(x[is.finite(x)])
x[x == Inf] <- max_value
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
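A quick sanity check; the input vector is arbitrary and the expected output is shown as a comment:

rescale01(c(-Inf, 0, 5, 10, Inf))   # 0.0 0.0 0.5 1.0 1.0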
Question 3
Given a vector of birthdates, write a function to compute the age in years.
# Function to compute age in years from birth dates
compute_age <- function(birth_dates) {
# Convert the birth dates to Date objects
birth_dates <- as.Date(birth_dates)
# Calculate the current date
current_date <- Sys.Date()
# Calculate age in years using lubridate
ages <- interval(birth_dates, current_date) %/% years(1)
# Return the ages as a numeric vector
return(ages)
}
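A quick usage sketch; the birthdates below are made up, lubridate (part of the core tidyverse loaded above) supplies interval() and years(), and the returned ages depend on the day the code is run:

compute_age(c("1990-05-15", "2000-01-01"))   # ages in whole years as of Sys.Date()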
Question 4
Write your own functions to compute the variance and skewness of a numeric vector. You can look up the definitions on Wikipedia or elsewhere.
To compute the variance and skewness of a numeric vector, we can define a custom function for each.
- Variance, using the sample formula Var(x) = sum((x - x̄)^2) / (n - 1):
# Function to compute the variance
compute_variance <- function(x) {
n <- length(x)
x_bar <- mean(x)
sum_squared_diff <- sum((x - x_bar)^2)
variance <- sum_squared_diff / (n - 1)
return(variance)
}

- Skewness, using the formula Skew(x) = sum((x - x̄)^3) / ((n - 1) * s^3), where s is the sample standard deviation:
# Function to compute the skewness
compute_skewness <- function(x) {
n <- length(x)
x_bar <- mean(x)
std_dev <- sqrt(compute_variance(x)) # Using previously defined variance function
skewness <- sum((x - x_bar)^3) / ((n - 1) * std_dev^3)
return(skewness)
}

These custom functions, compute_variance() and compute_skewness(), calculate the variance and skewness of a numeric vector from the formulas above, without relying on external packages.
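A quick check; the test vector is arbitrary. compute_variance() should agree exactly with base R’s var(), since both use the n - 1 denominator:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
all.equal(compute_variance(x), var(x))   # TRUE
compute_skewness(x)                      # positive, since the vector has a longer right tail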
Question 5
Write both_na(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.
# Method 1
both_na <- function(x, y) {
  x_na <- which(is.na(x))
  y_na <- which(is.na(y))
  # positions of x that are NA and are also NA positions of y, counted
  common <- x_na %in% y_na
  return(sum(common))
}
# Method 2
both_na <- function(vector1, vector2) {
  if (length(vector1) != length(vector2)) {
    stop("Both vectors must have the same length.")
  }
  # Find the positions where both vectors have NA values, then count them
  na_indices <- which(is.na(vector1) & is.na(vector2))
  return(length(na_indices))
}
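A quick check with two short vectors; only position 2 is NA in both, so either version should return 1:

x <- c(1, NA, NA, 4)
y <- c(NA, NA, 3, NA)
both_na(x, y)   # 1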
Question 6
Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?
is_directory <- function(x) {
file.info(x)$isdir
}
This function tells us whether a path points to a directory (TRUE) or not (FALSE). It is useful because it returns a logical value that can be used directly in other operations (e.g., if conditions or filters), hides the file.info(x)$isdir incantation behind a single call, and is easy to remember because it reads like plain English.
is_readable <- function(x) {
file.access(x, 4) == 0
}
This function tells us whether we have read permission for a file. It is useful because we do not have to remember that mode = 4 in file.access() tests read permission, or that file.access() signals success with 0 rather than TRUE, and the resulting name reads naturally when used in conditions, loops, and the like.
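A quick illustration; the paths are arbitrary and the results depend on the machine the code runs on:

is_directory(".")         # TRUE: the working directory is a directory
is_readable(tempfile())   # FALSE: the file does not exist yet, so it cannot be read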
26.3.5 Exercises
Question 1
Using the datasets from nycflights13, write a function that:
# Creating a function to display the results in a nice way
display_results <- function(data){
data |>
slice_head(n = 5) |>
gt() |>
cols_label_with(fn = ~ janitor::make_clean_names(., "title")) |>
gt_theme_538()
}

Finds all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour.

flights |> filter_severe()

filter_severe <- function(data, arrival_delay) {
  data |>
    filter(is.na({{arrival_delay}}) | ({{arrival_delay}} > 60))
}

# Running an example to show it works
flights |> filter_severe(arr_delay) |> select(1:10) |> display_results()

| Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 811 | 630 | 101 | 1047 | 830 | 137 | MQ |
| 2013 | 1 | 1 | 848 | 1835 | 853 | 1001 | 1950 | 851 | MQ |
| 2013 | 1 | 1 | 957 | 733 | 144 | 1056 | 853 | 123 | UA |
| 2013 | 1 | 1 | 1114 | 900 | 134 | 1447 | 1222 | 145 | UA |
| 2013 | 1 | 1 | 1120 | 944 | 96 | 1331 | 1213 | 78 | EV |

Counts the number of cancelled flights and the number of flights delayed by more than an hour.

flights |> group_by(dest) |> summarize_severe()

summarize_severe <- function(data, arrival_delay){
  data |>
    summarize(
      cancelled_flights = sum(is.na({{arrival_delay}})),
      delayed_flights = sum({{arrival_delay}} > 60, na.rm = TRUE)
    )
}

# Running an example to show it works
flights |> group_by(dest) |> summarize_severe(arr_delay) |> display_results()

| Dest | Cancelled Flights | Delayed Flights |
|---|---|---|
| ABQ | 0 | 25 |
| ACK | 1 | 12 |
| ALB | 21 | 59 |
| ANC | 0 | 0 |
| ATL | 378 | 1433 |

Finds all flights that were cancelled or delayed by more than a user supplied number of hours:

flights |> filter_severe(hours = 2)

filter_severe <- function(data, hours){
  data |>
    filter(arr_delay > (hours * 60))
}

# Running an example to show that it works
flights |> filter_severe(hours = 4) |> select(1:10) |> display_results()

| Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 848 | 1835 | 853 | 1001 | 1950 | 851 | MQ |
| 2013 | 1 | 1 | 1815 | 1325 | 290 | 2120 | 1542 | 338 | EV |
| 2013 | 1 | 1 | 1842 | 1422 | 260 | 1958 | 1535 | 263 | EV |
| 2013 | 1 | 1 | 2115 | 1700 | 255 | 2330 | 1920 | 250 | 9E |
| 2013 | 1 | 1 | 2205 | 1720 | 285 | 46 | 2040 | 246 | AA |

Summarizes the weather to compute the minimum, mean, and maximum of a user supplied variable:

weather |> summarize_weather(temp)

summarize_weather <- function(data, variable){
  data |>
    summarize(
      mean = mean({{variable}}, na.rm = TRUE),
      minimum = min({{variable}}, na.rm = TRUE),
      maximum = max({{variable}}, na.rm = TRUE)
    )
}

# Running an example to show it works
weather |> group_by(origin) |> summarize_weather(temp) |> display_results() |> fmt_number(decimals = 2)

| Origin | Mean | Minimum | Maximum |
|---|---|---|---|
| EWR | 55.55 | 10.94 | 100.04 |
| JFK | 54.47 | 12.02 | 98.06 |
| LGA | 55.76 | 12.02 | 98.96 |

Converts the user supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e. hours + (minutes / 60)).
flights |> standardize_time(sched_dep_time)
# Method 1
standardize_time <- function(data, variable){
data |>
mutate(std_time = {{variable}} %/% 100 + ({{variable}} %% 100) / 60) |>
relocate(std_time, .after = {{variable}})
}
# Method 2 (after learning use of ":=" in Section 26.4.2)
standardize_time <- function(data, variable){
data |>
mutate({{variable}} := {{variable}} %/% 100 + ({{variable}} %% 100) / 60)
}
# Running an example to show it works
flights |>
standardize_time(sched_dep_time) |>
select(1:10) |>
display_results()

| Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 517 | 5.250000 | 2 | 830 | 819 | 11 | UA |
| 2013 | 1 | 1 | 533 | 5.483333 | 4 | 850 | 830 | 20 | UA |
| 2013 | 1 | 1 | 542 | 5.666667 | 2 | 923 | 850 | 33 | AA |
| 2013 | 1 | 1 | 544 | 5.750000 | -1 | 1004 | 1022 | -18 | B6 |
| 2013 | 1 | 1 | 554 | 6.000000 | -6 | 812 | 837 | -25 | DL |
Question 2
For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: distinct(), count(), group_by(), rename_with(), slice_min(), slice_sample().
In the context of the dplyr package in R, tidy evaluation is a form of non-standard evaluation that lets you refer to the columns of a data frame as if they were ordinary variables, and lets your own functions pass such column references along. It has two parts:
Data Masking: Data masking lets you compute with columns as if they were variables: inside verbs such as filter(), mutate(), and summarize(), column names take precedence over (they "mask") objects of the same name in the surrounding environment, which prevents unintended interactions or conflicts. For example:

data <- data.frame(x = 1:5, y = 6:10)

custom_function <- function(df) {
  # Data masking: x and y refer to columns of df, not to objects in the global environment
  df %>% mutate(z = x + y)
}

custom_function(data)

  x  y  z
1 1  6  7
2 2  7  9
3 3  8 11
4 4  9 13
5 5 10 15

In this example, custom_function takes a data frame df as an argument and uses the mutate() function from dplyr to create a new variable z. Inside the function, data masking ensures that x and y refer to the columns within the df argument, not to the global data data frame.
Tidy Selection: Tidy selection is used for choosing columns within a data frame. It allows you to refer to columns by name, position, type, or with selection helpers such as all_of() and starts_with(), facilitating dynamic and programmatic column selection. For example:

data <- data.frame(x = 1:5, y = 6:10)
selected_columns <- c("x", "y")

data %>% select(all_of(selected_columns))

  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10

In this example, the select() function uses tidy selection to select the columns named in the selected_columns vector.
Thus, data masking is about computing with the values inside a data frame (filter(), mutate(), summarize(), ...), while tidy selection is about choosing which columns to operate on (select(), relocate(), rename_with(), ...).
The functions listed in the question use tidy evaluation as follows:
| Function | Description | Arguments using tidy evaluation | Data-masking or tidy-selection? |
|---|---|---|---|
| distinct() | Select distinct rows based on specified columns | ... | Data-masking |
| count() | Count the number of rows in groups defined by specific variables | ..., wt | Data-masking |
| group_by() | Group data by specific columns | ... | Data-masking |
| rename_with() | Rename columns based on a function or expression | .cols | Tidy-selection |
| slice_min() | Filter rows corresponding to the minimum value of a column | order_by, by | Data-masking (order_by); tidy-selection (by) |
| slice_sample() | Randomly sample rows from a data frame | weight_by, by | Data-masking (weight_by); tidy-selection (by) |
In summary, all of the functions above use tidy evaluation: rename_with() (and the optional by argument of the slice_*() functions) uses tidy-selection, while the column arguments of the remaining functions use data-masking.
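A minimal sketch contrasting the two kinds of arguments inside wrapper functions; the function names summarize_by() and shout_names() are made up for illustration:

# Data-masking: the grouping variable is embraced and looked up inside the data
summarize_by <- function(data, group) {
  data |>
    group_by({{ group }}) |>
    summarize(n = n())
}

# Tidy-selection: the columns are passed through to a <tidy-select> argument
shout_names <- function(data, cols) {
  data |>
    rename_with(toupper, .cols = {{ cols }})
}

# mtcars |> summarize_by(cyl)              # one row per value of cyl
# mtcars |> shout_names(starts_with("d"))  # renames disp and drat to DISP and DRAT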
Question 3
Generalize the following function so that you can supply any number of variables to count.
count_prop <- function(df, var, sort = FALSE) {
df |>
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
}
Here is the modified function, which uses pick() so that the user can supply any number of variables to count:
count_prop <- function(df, vars, sort = FALSE) {
df |>
count(pick({{ vars }}), sort = sort) |>
mutate(prop = n / sum(n))
}
# Testing the results
flights |>
count_prop(c(dest, origin)) |>
slice_head(n = 10) |>
gt() |> gt_theme_538() |> fmt_number(prop, decimals = 4)

| dest | origin | n | prop |
|---|---|---|---|
| ABQ | JFK | 254 | 0.0008 |
| ACK | JFK | 265 | 0.0008 |
| ALB | EWR | 439 | 0.0013 |
| ANC | EWR | 8 | 0.0000 |
| ATL | EWR | 5022 | 0.0149 |
| ATL | JFK | 1930 | 0.0057 |
| ATL | LGA | 10263 | 0.0305 |
| AUS | EWR | 968 | 0.0029 |
| AUS | JFK | 1471 | 0.0044 |
| AVL | EWR | 265 | 0.0008 |
26.4.4 Exercises
Build up a rich plotting function by incrementally implementing each of the steps below:
Question 1
Draw a scatterplot given a dataset and x and y variables.
scatterplot <- function(data, x, y){
data |>
ggplot(aes(x = {{x}},
y = {{y}})) +
geom_point() +
theme_classic()
}

Question 2
Add a line of best fit (i.e. a linear model with no standard errors).
scatterplot <- function(data, x, y){
data |>
ggplot(aes(x = {{x}},
y = {{y}})) +
geom_point() +
geom_smooth(method = "lm",
# the formula refers to the x and y aesthetics, not to the original column names
formula = y ~ x,
se = FALSE) +
theme_classic()
}

Question 3
Add a title.
scatterplot <- function(data, x, y){
ggplot(data, aes(x = {{x}},
y = {{y}})) +
geom_point() +
geom_smooth(method = "lm",
# the formula refers to the x and y aesthetics, not to the original column names
formula = y ~ x,
se = FALSE) +
labs(title = rlang::englue("A scatter plot of {{y}} vs. {{x}}")) +
theme_classic()
}
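A quick usage sketch with the built-in mtcars data (any data frame with two numeric columns would do); it draws the points, the fitted line, and the generated title "A scatter plot of mpg vs. wt":

scatterplot(mtcars, wt, mpg)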
26.5.1 Exercises
Question 1
Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.
is_prefix(): f1() checks whether each element of string starts with the given prefix, so a clearer name is is_prefix().
# is_prefix
f1 <- function(string, prefix) {
str_sub(string, 1, str_length(prefix)) == prefix
}
match_length(): f3() recycles y so that it has the same length as x, so a clearer name is match_length() (or recycle()).
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
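A quick demonstration of both functions; the inputs are arbitrary and the expected results are shown as comments:

f1(c("apple", "banana"), "app")   # TRUE FALSE
f3(1:5, c("a", "b"))              # "a" "b" "a" "b" "a"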
Question 2
Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
Here are some suggestions for renaming html_nodes() and its arguments (from the package rvest):
Original function: html_nodes()

- html_extract(): This name suggests the action of extracting nodes from an HTML document.
- select_html_nodes(): A more descriptive name that clarifies the purpose of the function.

Argument css:

- selector: A more explicit name to indicate that this argument takes CSS selectors.
- element_selector: If the purpose of the argument is to select HTML elements, this name makes it clear.

Argument xpath:

- xpath_expression: A more descriptive name that specifies the type of input expected.
- xml_path: A concise alternative that still hints at the purpose.

Argument trim:

- remove_whitespace: A name that better reflects the action performed when trim is set to TRUE.
- clean_whitespace: Another option to indicate the removal of whitespace.
Question 3
Make a case for why norm_r(), norm_d() etc. would be better than rnorm(), dnorm(). Make a case for the opposite. How could you make the names even clearer?
Base R's distribution functions put the action first as a single-letter prefix (r for random draws, d for densities) and the distribution second; the proposed norm_r(), norm_d() names reverse this, putting the distribution first.
Arguments for norm_r() and norm_d() (distribution-first names):
Discoverability: With a shared norm_ prefix, typing norm_ lets autocomplete list every function related to the normal distribution (random draws, densities, quantiles, and so on) in one place.
Consistency: These names follow the common tidyverse pattern of grouping related functions under a shared snake_case prefix (e.g., str_sub() and str_detect() in stringr).
Clarity: The names put the distribution first and the operation last, which can be more intuitive when you know which distribution you need but not yet which operation.
Readability: norm_r() reads as "normal distribution, random draws", making the family of related functions easy to scan.
Arguments for rnorm() and dnorm() (Current Names):
Historical Convention: Functions like rnorm() and dnorm() have been part of base R for a long time. Changing their names might lead to confusion for users who are already familiar with these functions.
Alignment with Mathematical Notation: The rnorm() and dnorm() names closely resemble conventional statistical notation, making it easier for users with a strong mathematical background to recognize and use these functions.
Package Consistency: Maintaining the original names ensures consistency with the other functions in the stats package (e.g., pnorm(), qnorm()) and with the r/d/p/q prefix pattern used for every other distribution (runif(), dbinom(), ...), so once you know the convention you can guess the name for any distribution.
Even Clearer Names: If we want to make the function names even clearer, we can consider longer, more descriptive names, such as:
- generate_random_samples_from_normal_distribution() for rnorm(): This explicitly states what the function does.
- probability_density_function_of_normal_distribution() for dnorm(): This specifies the exact purpose of the function.
While these longer names are more descriptive, they may become unwieldy in code. Striking a balance between clarity and brevity is important.