library(tidyverse)
library(gt)
library(gtExtras)
library(nycflights13)
data("flights")
Chapter 26
Functions
Important stuff from Chapter 26, R for Data Science (2e)
To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2. To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
In R, the {{ }} (double curly braces, or “embrace”) operator tunnels a data-variable supplied as a function argument into a data-masking tidyverse function, so that callers can pass bare column names to your function.
The := operator lets the name of a new column come from a user-supplied argument: R’s syntax does not allow an embraced expression or a glue string on the left-hand side of =, so tidyverse functions accept name := value instead. (In the data.table package, := has a related but distinct meaning: it adds or updates columns by reference, modifying the existing data.table rather than creating a new one, which avoids unnecessary copying for large datasets.)
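As a sketch of how embracing and := combine, here is a small helper (the function name grouped_mean and its arguments are invented for illustration): {{ }} tunnels the bare column names through, and := lets the glue string on the left-hand side become the new column’s name.

```r
library(dplyr)  # loaded with the tidyverse

# Hypothetical helper: group_var and mean_var are embraced with {{ }},
# so callers can pass bare column names; ":=" allows the glue string
# "mean_{{mean_var}}" to become the name of the new column.
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize("mean_{{mean_var}}" := mean({{ mean_var }}, na.rm = TRUE))
}

mtcars |> grouped_mean(cyl, mpg)
# One row per value of cyl, with a column named mean_mpg
```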
26.2.5 Exercises
Question 1
Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?
mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
The following function computes the proportion of missing values in a vector. It needs one argument.
prop_missing <- function(x){
  mean(is.na(x))
}
_____________
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
The following function computes each element of a vector as a proportion of the sum of the vector. It needs one argument.
prop_element <- function(x){
  x / sum(x, na.rm = TRUE)
}
_____________
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)
The function below computes the percentage of each value relative to the sum of the vector, rounded to one decimal place.
perc_element <- function(x){
  round(x / sum(x, na.rm = TRUE) * 100, 1)
}
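As a quick sanity check, here are the three helpers restated compactly and applied to a small vector (the values are arbitrary, chosen so that the non-missing elements sum to 100):

```r
prop_missing <- function(x) mean(is.na(x))
prop_element <- function(x) x / sum(x, na.rm = TRUE)
perc_element <- function(x) round(x / sum(x, na.rm = TRUE) * 100, 1)

x <- c(10, NA, 30, 60)  # non-missing values sum to 100

prop_missing(x)  # 0.25: one of the four values is NA
prop_element(x)  # 0.1  NA 0.3 0.6
perc_element(x)  # 10   NA 30  60
```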
Question 2
In the second variant of rescale01()
, infinite values are left unchanged. Can you rewrite rescale01()
so that -Inf
is mapped to 0, and Inf
is mapped to 1?
rescale01 <- function(x) {

  # Replace -Inf with the minimum number (other than -Inf)
  min_value <- min(x[is.finite(x)])
  x[x == -Inf] <- min_value

  # Replace +Inf with the maximum number (other than +Inf)
  max_value <- max(x[is.finite(x)])
  x[x == Inf] <- max_value

  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
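A quick check that the endpoints behave as required (the input vector is arbitrary): -Inf is replaced by the smallest finite value before rescaling, so it maps to 0, and Inf maps to 1.

```r
# Restated compactly from the function above
rescale01 <- function(x) {
  x[x == -Inf] <- min(x[is.finite(x)])
  x[x == Inf]  <- max(x[is.finite(x)])
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(-Inf, 1, 3, 5, Inf))
# 0.0 0.0 0.5 1.0 1.0
```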
Question 3
Given a vector of birth-dates, write a function to compute the age in years.
# Function to compute age in years from birth dates
compute_age <- function(birth_dates) {

  # Convert the birth dates to Date objects
  birth_dates <- as.Date(birth_dates)

  # Get the current date
  current_date <- Sys.Date()

  # Calculate age in completed years using lubridate
  ages <- interval(birth_dates, current_date) %/% years(1)

  # Return the ages as a numeric vector
  return(ages)
}
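A quick call (the birth dates are arbitrary; the result depends on the day the code is run, so no output is shown). Note that %/% years(1) counts completed years, so a birthday that has not yet occurred this year is not counted.

```r
library(lubridate)  # loaded with the tidyverse

# Restated compactly from the function above
compute_age <- function(birth_dates) {
  interval(as.Date(birth_dates), Sys.Date()) %/% years(1)
}

compute_age(c("1990-06-15", "2000-02-29"))
```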
Question 4
Write your own functions to compute the variance and skewness of a numeric vector. You can look up the definitions on Wikipedia or elsewhere.
To compute the variance and skewness of a numeric vector, we can define a custom function for each:
- Variance, using the sample-variance formula s^2 = sum((x_i - x_bar)^2) / (n - 1):
# Function to compute the variance
compute_variance <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  sum_squared_diff <- sum((x - x_bar)^2)
  variance <- sum_squared_diff / (n - 1)
  return(variance)
}
- Skewness, using the formula g = sum((x_i - x_bar)^3) / ((n - 1) * s^3):
# Function to compute the skewness
compute_skewness <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  std_dev <- sqrt(compute_variance(x)) # Using previously defined variance function
  skewness <- sum((x - x_bar)^3) / ((n - 1) * std_dev^3)
  return(skewness)
}
These custom functions, compute_variance() and compute_skewness(), calculate the variance and skewness of a numeric vector from the formulas above, without relying on external packages.
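As a sanity check (functions restated compactly; the test vector is arbitrary), the variance should agree with base R’s var(), and the skewness of this right-tailed sample should be positive:

```r
compute_variance <- function(x) sum((x - mean(x))^2) / (length(x) - 1)
compute_skewness <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / ((n - 1) * sqrt(compute_variance(x))^3)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

all.equal(compute_variance(x), var(x))  # TRUE: matches base R's sample variance
compute_skewness(x) > 0                 # TRUE: the long tail is on the right
```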
Question 5
Write both_na()
, a summary function that takes two vectors of the same length and returns the number of positions that have an NA
in both vectors.
# Method 1
both_na <- function(x, y){

  x_na <- which(is.na(x))
  y_na <- which(is.na(y))

  # count the NA positions of x that are also NA positions of y
  common <- x_na %in% y_na
  sum(common)
}
# Method 2
both_na <- function(vector1, vector2) {
  if (length(vector1) != length(vector2)) {
    stop("Both vectors must have the same length.")
  }

  # Find the indices where both vectors have NA values, then count them
  na_indices <- which(is.na(vector1) & is.na(vector2))
  length(na_indices)
}
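A still more compact version is possible, since summing a logical vector counts its TRUE values. A small check on arbitrary input vectors:

```r
# Method 3: sum() over the elementwise AND counts positions that are NA in both
both_na <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

both_na(c(NA, 1, NA, 4), c(NA, NA, NA, 4))  # 2: positions 1 and 3
```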
Question 6
Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?
is_directory <- function(x) {
file.info(x)$isdir
}
This function tells you whether a path refers to a directory. It is useful because it returns a logical value, i.e., either TRUE or FALSE, and thus can be used directly inside other operations (such as if() conditions), it hides the file.info()$isdir incantation behind a name, and it is easy to remember in English.
_________
is_readable <- function(x) {
file.access(x, 4) == 0
}
This function tells us whether the current user has read permission for a file, i.e., whether there is read access to the file or not. It is useful because one does not have to remember the mode argument values of the file.access() function (4 tests readability), and the name is easy to remember in English for further use in conditions, for loops, etc.
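A quick demonstration (restating both helpers; "." is the working directory, which always exists and is a directory):

```r
is_directory <- function(x) file.info(x)$isdir
is_readable  <- function(x) file.access(x, 4) == 0

is_directory(".")  # TRUE
is_readable(".")   # TRUE on any system where the working directory can be read
```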
26.3.5 Exercises
Question 1
Using the datasets from nycflights13
, write a function that:
# Creating a function to display the results in a nice way
display_results <- function(data){
  data |>
    slice_head(n = 5) |>
    gt() |>
    cols_label_with(fn = ~ janitor::make_clean_names(., "title")) |>
    gt_theme_538()
}
Finds all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour: flights |> filter_severe()

filter_severe <- function(data, arrival_delay) {
  data |>
    filter(is.na({{arrival_delay}}) | ({{arrival_delay}} > 60))
}

# Running an example to show it works
flights |>
  filter_severe(arr_delay) |>
  select(1:10) |>
  display_results()
Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 811 | 630 | 101 | 1047 | 830 | 137 | MQ |
2013 | 1 | 1 | 848 | 1835 | 853 | 1001 | 1950 | 851 | MQ |
2013 | 1 | 1 | 957 | 733 | 144 | 1056 | 853 | 123 | UA |
2013 | 1 | 1 | 1114 | 900 | 134 | 1447 | 1222 | 145 | UA |
2013 | 1 | 1 | 1120 | 944 | 96 | 1331 | 1213 | 78 | EV |

Counts the number of cancelled flights and the number of flights delayed by more than an hour:
flights |> group_by(dest) |> summarize_severe()
summarize_severe <- function(data, arrival_delay){
  data |>
    summarize(
      cancelled_flights = sum(is.na({{arrival_delay}})),
      delayed_flights = sum({{arrival_delay}} > 60, na.rm = TRUE)
    )
}

# Running an example to show it works
flights |>
  group_by(dest) |>
  summarize_severe(arr_delay) |>
  display_results()
Dest | Cancelled Flights | Delayed Flights |
---|---|---|
ABQ | 0 | 25 |
ACK | 1 | 12 |
ALB | 21 | 59 |
ANC | 0 | 0 |
ATL | 378 | 1433 |

Finds all flights that were cancelled or delayed by more than a user supplied number of hours:
flights |> filter_severe(hours = 2)
filter_severe <- function(data, hours){
  data |>
    filter(arr_delay > (hours * 60))
}

# Running an example to show that it works
flights |>
  filter_severe(hours = 4) |>
  select(1:10) |>
  display_results()
Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 848 | 1835 | 853 | 1001 | 1950 | 851 | MQ |
2013 | 1 | 1 | 1815 | 1325 | 290 | 2120 | 1542 | 338 | EV |
2013 | 1 | 1 | 1842 | 1422 | 260 | 1958 | 1535 | 263 | EV |
2013 | 1 | 1 | 2115 | 1700 | 255 | 2330 | 1920 | 250 | 9E |
2013 | 1 | 1 | 2205 | 1720 | 285 | 46 | 2040 | 246 | AA |

Summarizes the weather to compute the minimum, mean, and maximum of a user-supplied variable:
weather |> summarize_weather(temp)
summarize_weather <- function(data, variable){
  data |>
    summarize(
      mean = mean({{variable}}, na.rm = TRUE),
      minimum = min({{variable}}, na.rm = TRUE),
      maximum = max({{variable}}, na.rm = TRUE)
    )
}

# Running an example to show it works
weather |>
  group_by(origin) |>
  summarize_weather(temp) |>
  display_results() |>
  fmt_number(decimals = 2)
Origin | Mean | Minimum | Maximum |
---|---|---|---|
EWR | 55.55 | 10.94 | 100.04 |
JFK | 54.47 | 12.02 | 98.06 |
LGA | 55.76 | 12.02 | 98.96 |

Converts the user supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e. hours + (minutes / 60)).
flights |> standardize_time(sched_dep_time)
# Method 1
standardize_time <- function(data, variable){
  data |>
    mutate(std_time = ({{variable}} %/% 100) + ({{variable}} %% 100) / 60) |>
    relocate(std_time, .after = {{variable}})
}

# Method 2 (after learning use of ":=" in Section 26.4.2)
standardize_time <- function(data, variable){
  data |>
    mutate({{variable}} := ({{variable}} %/% 100) + ({{variable}} %% 100) / 60)
}

# Running an example to show it works
flights |>
  standardize_time(sched_dep_time) |>
  select(1:10) |>
  display_results()
Year | Month | Day | Dep Time | Sched Dep Time | Dep Delay | Arr Time | Sched Arr Time | Arr Delay | Carrier |
---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 517 | 5.250000 | 2 | 830 | 819 | 11 | UA |
2013 | 1 | 1 | 533 | 5.483333 | 4 | 850 | 830 | 20 | UA |
2013 | 1 | 1 | 542 | 5.666667 | 2 | 923 | 850 | 33 | AA |
2013 | 1 | 1 | 544 | 5.750000 | -1 | 1004 | 1022 | -18 | B6 |
2013 | 1 | 1 | 554 | 6.000000 | -6 | 812 | 837 | -25 | DL |
Question 2
For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: distinct()
, count()
, group_by()
, rename_with()
, slice_min()
, slice_sample()
.
In the context of the dplyr package in R, tidy evaluation is a form of non-standard evaluation that lets you refer to variables inside a data frame as if they were ordinary objects, and pass them as arguments to functions. It has two parts:
Data Masking: Data masking lets you compute with the columns of a data frame as if they were variables in the environment. Inside a data-masking verb such as mutate() or filter(), a data variable (a column) takes precedence over an environment variable with the same name, temporarily “masking” it. This prevents unintended interactions with objects in the global environment. For example:
data <- data.frame(x = 1:5, y = 6:10)

custom_function <- function(df) {
  # Inside mutate(), data masking makes x and y refer to columns of df
  df %>%
    mutate(z = x + y)
}

custom_function(data)
  x  y  z
1 1  6  7
2 2  7  9
3 3  8 11
4 4  9 13
5 5 10 15
In this example, custom_function takes a data frame df as an argument and uses the mutate() function from dplyr to create a new variable z. Inside the function, data masking ensures that x and y refer to the columns of df, not to any objects with those names in the global environment.
Tidy Selection: Tidy selection involves selecting or manipulating columns within a data frame using non-standard evaluation. It allows you to refer to column names as symbols, expressions, or selection helpers (such as all_of() or starts_with()), facilitating dynamic and programmatic column selection. For example:
data <- data.frame(x = 1:5, y = 6:10)
selected_columns <- c("x", "y")

data %>%
  select(all_of(selected_columns))
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10
In this example, the select() function uses tidy selection (via the all_of() helper) to select the columns named in the selected_columns character vector.
Thus, data masking is about computing with the values inside a data frame, while tidy selection is about choosing columns by name, position, or type.
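The precedence rule for data masking can be seen directly. In this sketch (names chosen for illustration), an environment variable and a column share the name n, and the .env pronoun from rlang is used to reach the environment variable explicitly:

```r
library(dplyr)

n  <- 100                  # an environment variable
df <- data.frame(n = 1:3)  # a column with the same name

df |> mutate(doubled = n * 2)       # data variable wins: doubled is 2, 4, 6
df |> mutate(doubled = .env$n * 2)  # .env pronoun: doubled is 200, 200, 200
```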
The functions in question use tidy evaluation as follows:

Function | Tidy-evaluated argument(s) | Style |
---|---|---|
distinct() | ... (the variables to use when determining uniqueness) | Data masking |
count() | ... (the variables to group by), wt | Data masking |
group_by() | ... (the variables or computations to group by) | Data masking |
rename_with() | .cols | Tidy selection |
slice_min() | order_by | Data masking |
slice_sample() | weight_by | Data masking |

In summary, most of these functions use data masking for their variable arguments; rename_with() is the exception, using tidy selection for its .cols argument.
Question 3
Generalize the following function so that you can supply any number of variables to count.
count_prop <- function(df, var, sort = FALSE) {
df |>
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
}
Here is the modified function, which uses pick() to allow the user to supply any number of variables to count:

count_prop <- function(df, vars, sort = FALSE) {
  df |>
    count(pick({{ vars }}), sort = sort) |>
    mutate(prop = n / sum(n))
}
# Testing the results
flights |>
  count_prop(c(dest, origin)) |>
  slice_head(n = 10) |>
  gt() |>
  gt_theme_538() |>
  fmt_number(prop, decimals = 4)
dest | origin | n | prop |
---|---|---|---|
ABQ | JFK | 254 | 0.0008 |
ACK | JFK | 265 | 0.0008 |
ALB | EWR | 439 | 0.0013 |
ANC | EWR | 8 | 0.0000 |
ATL | EWR | 5022 | 0.0149 |
ATL | JFK | 1930 | 0.0057 |
ATL | LGA | 10263 | 0.0305 |
AUS | EWR | 968 | 0.0029 |
AUS | JFK | 1471 | 0.0044 |
AVL | EWR | 265 | 0.0008 |
26.4.4 Exercises
Build up a rich plotting function by incrementally implementing each of the steps below:
Question 1
Draw a scatter plot given a dataset and x and y variables.
scatterplot <- function(data, x, y){
  data |>
    ggplot(aes(x = {{x}}, y = {{y}})) +
    geom_point() +
    theme_classic()
}
Question 2
Add a line of best fit (i.e. a linear model with no standard errors).
scatterplot <- function(data, x, y){
  data |>
    ggplot(aes(x = {{x}}, y = {{y}})) +
    geom_point() +
    geom_smooth(method = "lm",
                formula = y ~ x,  # refers to the x and y aesthetics, so no embracing here
                se = FALSE) +
    theme_classic()
}
Question 3
Add a title.
scatterplot <- function(data, x, y){
  ggplot(data, aes(x = {{x}}, y = {{y}})) +
    geom_point() +
    geom_smooth(method = "lm",
                formula = y ~ x,
                se = FALSE) +
    labs(title = rlang::englue("A scatter plot of {{y}} vs. {{x}}")) +
    theme_classic()
}
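Assuming the scatterplot() function above is defined, a quick try on a built-in dataset (mtcars, chosen arbitrarily); englue() builds the title from the supplied variable names:

```r
scatterplot(mtcars, wt, mpg)
# A scatter plot of mpg vs. wt with a fitted line and the title
# "A scatter plot of mpg vs. wt"
```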
26.5.1 Exercises
Question 1
Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.
is_prefix()
# is_prefix
f1 <- function(string, prefix) {
str_sub(string, 1, str_length(prefix)) == prefix
}
match_length()
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
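A quick demonstration of each (restated here; inputs arbitrary), which supports the suggested names is_prefix() and match_length():

```r
library(stringr)

f1 <- function(string, prefix) str_sub(string, 1, str_length(prefix)) == prefix
f3 <- function(x, y) rep(y, length.out = length(x))

f1(c("apple", "apricot", "banana"), "ap")  # TRUE TRUE FALSE
f3(1:5, 0)                                 # 0 0 0 0 0
```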
Question 2
Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
Here are some suggestions for renaming html_nodes() and its arguments (from the rvest package):

Original function: html_nodes()

- html_extract() - This name suggests the action of extracting nodes from an HTML document.
- select_html_nodes() - A more descriptive name that clarifies the purpose of the function.

Argument css:

- selector - A more explicit name to indicate that this argument takes CSS selectors.
- element_selector - If the purpose of the argument is to select HTML elements, this name makes it clear.

Argument xpath:

- xpath_expression - A more descriptive name that specifies the type of input expected.
- xml_path - A concise alternative that still hints at the purpose.

Argument trim:

- remove_whitespace - A name that better reflects the action performed when trim is set to TRUE.
- clean_whitespace - Another option to indicate the removal of whitespace.
Question 3
Make a case for why norm_r()
, norm_d()
etc. would be better than rnorm()
, dnorm()
. Make a case for the opposite. How could you make the names even clearer?
Function naming conventions matter for discoverability: grouping related functions under a common prefix makes them easy to find, especially with autocomplete.

Arguments for norm_r() and norm_d() (distribution-first names):

Discoverability: Putting the distribution name first groups the whole family together, so typing norm_ in an autocompleting editor surfaces norm_r(), norm_d(), norm_p(), and norm_q() at once.

Consistency: A shared family prefix matches the tidyverse convention seen in, e.g., str_detect() and str_replace() in stringr, or read_csv() and read_tsv() in readr.

Clarity: The norm_ prefix makes it obvious at a glance that all of these functions concern the normal distribution, with the suffix saying what each one does (r for random draws, d for density).
Arguments for rnorm()
and dnorm()
(Current Names):
Historical Convention: Functions like rnorm() and dnorm() have been part of base R for a long time. Changing their names would confuse users who are already familiar with them.

Alignment with Mathematical Notation: The r/d/p/q prefixes mirror standard statistical notation (random draws, density, distribution function, quantile function), making the functions easy to recognize for users with a statistics background.

Package Consistency: Maintaining the original names ensures consistency with the other functions in the stats package (e.g., pnorm(), qnorm(), runif(), dbinom()), which follow the same convention.
Even Clearer Names: If we want to make the function names even clearer, we can consider longer, more descriptive names, such as:

generate_random_samples_from_normal_distribution() for rnorm(): this explicitly states what the function does.

probability_density_function_of_normal_distribution() for dnorm(): this specifies the exact purpose of the function.
While these longer names are more descriptive, they may become unwieldy in code. Striking a balance between clarity and brevity is important.