Chapter 26

Functions

Author

Aditya Dahiya

Published

October 9, 2023

library(tidyverse)
library(gt)
library(gtExtras)
library(nycflights13)
data("flights")

Important stuff from Chapter 26, R for Data Science (2e)

26.2.5 Exercises

Question 1

Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?

mean(is.na(x))
mean(is.na(y))
mean(is.na(z))

The following function computes the proportion of missing values in a vector. It needs one argument.

prop_missing <- function(x){
  mean(is.na(x))
}

_____________

x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)

The following function computes each element of a vector as a proportion of the vector's sum. It needs one argument.

prop_element <- function(x){
  x / sum(x, na.rm = TRUE)
}

_____________

round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)

The following function computes the percentage of each value relative to the sum of the vector, rounded to one decimal place.

perc_element <- function(x){
  round(x / sum(x, na.rm = TRUE) * 100, 1)
}
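
A quick sanity check of all three functions on a small illustrative vector with one missing value:

x <- c(1, 2, NA, 5)
prop_missing(x)   # 0.25
prop_element(x)   # 0.125 0.250 NA 0.625
perc_element(x)   # 12.5 25.0 NA 62.5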

Question 2

In the second variant of rescale01(), infinite values are left unchanged. Can you rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1?

rescale01 <- function(x) {
  
  # Replace -Inf with the minimum number (other than -Inf)
  min_value <- min(x[is.finite(x)])
  x[x == -Inf] <- min_value

  # Replace +Inf with the maximum number (other than +Inf)
  max_value <- max(x[is.finite(x)])
  x[x == Inf] <- max_value

  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
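
A quick check, on an illustrative input, that the infinite values are now mapped to the endpoints:

rescale01(c(-Inf, 0, 5, 10, Inf))   # 0.0 0.0 0.5 1.0 1.0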

Question 3

Given a vector of birth-dates, write a function to compute the age in years.

# Function to compute age in years from birth dates

compute_age <- function(birth_dates) {
  
  # Convert the birth dates to Date objects
  birth_dates <- as.Date(birth_dates)
  
  # Get today's date
  current_date <- Sys.Date()
  
  # Calculate age in years using lubridate
  ages <- interval(birth_dates, current_date) %/% years(1)
  
  # Return the ages as a numeric vector
  return(ages)
}
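
For example (an illustrative call; the exact ages depend on the date the code is run):

compute_age(c("1990-05-15", "2000-01-01"))   # e.g. 33 and 23 if run in late 2023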

Question 4

Write your own functions to compute the variance and skewness of a numeric vector. You can look up the definitions on Wikipedia or elsewhere.

To compute the variance and skewness of a numeric vector, we can define custom functions for each: –

  1. Variance, calculated using the formula $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$:

# Function to compute the variance
compute_variance <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  sum_squared_diff <- sum((x - x_bar)^2)
  variance <- sum_squared_diff / (n - 1)
  return(variance)
}

  2. Skewness, calculated using the formula $g_1 = \frac{\sum_i (x_i - \bar{x})^3}{(n - 1)\, s^3}$:

# Function to compute the skewness
compute_skewness <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  std_dev <- sqrt(compute_variance(x))  # Using the variance function defined above
  skewness <- sum((x - x_bar)^3) / ((n - 1) * std_dev^3)
  return(skewness)
}

These custom functions compute_variance and compute_skewness will calculate the variance and skewness of a numeric vector based on the provided formulas, without relying on external packages.
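
As a quick sanity check on an illustrative vector (compute_variance() should agree with base R's var()):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
compute_variance(x)   # same value as var(x)
compute_skewness(x)   # positive, since the long tail is on the right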

Question 5

Write both_na(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

# Method 1
both_na <- function(x, y){
  
  x_na <- which(is.na(x))
  y_na <- which(is.na(y))
  
  # positions that are NA in x and also NA in y
  common <- x_na %in% y_na
  
  # return the number of such positions
  return(sum(common))
}

# Method 2
both_na <- function(vector1, vector2) {
  if (length(vector1) != length(vector2)) {
    stop("Both vectors must have the same length.")
  }
  
  # Find the indices where both vectors have NA values
  na_indices <- which(is.na(vector1) & is.na(vector2))
  
  # Return the number of such positions
  return(length(na_indices))
}
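
A quick check with two short illustrative vectors (positions 2 and 5 are NA in both):

x <- c(1, NA, 3, NA, NA)
y <- c(NA, NA, 3, 4, NA)
both_na(x, y)   # 2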

Question 6

Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?

is_directory <- function(x) {
  file.info(x)$isdir
}

This function tells us whether a path points to a directory (TRUE) or not (FALSE). It is useful because it returns a logical value that can be used directly inside other operations (such as filtering or if() conditions) instead of repeating the file.info()$isdir idiom, and the name is easy to remember in plain English.

_________

is_readable <- function(x) {
  file.access(x, 4) == 0
}

This function tells us whether a file has read permission, i.e., whether its contents can be accessed. It is useful because one does not have to remember the mode argument values of file.access() (mode 4 tests for read permission), and the name is easy to remember in plain English for further use in conditions, for loops, etc.
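
For instance, both helpers can be tried on the current working directory (the results depend on your system):

is_directory(".")   # TRUE, since "." is a directory
is_readable(".")    # TRUE if the working directory is readable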

26.3.5 Exercises

Question 1

Using the datasets from nycflights13, write a function that:

# Creating a function to display the results in a nice way

display_results <- function(data){
  data |>
    slice_head(n = 5) |> 
    gt() |> 
    cols_label_with(fn = ~ janitor::make_clean_names(., "title")) |>
    gt_theme_538()
}
  1. Finds all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour.

    flights |> filter_severe()
    filter_severe <- function(data, arrival_delay) {
      data |>
        filter(is.na({{arrival_delay}}) | ({{arrival_delay}} > 60))
    }
    
    # Running an example to show it works
    flights |>
      filter_severe(arr_delay) |>
      select(1:10) |>
      display_results()
    Year Month Day Dep Time Sched Dep Time Dep Delay Arr Time Sched Arr Time Arr Delay Carrier
    2013 1 1 811 630 101 1047 830 137 MQ
    2013 1 1 848 1835 853 1001 1950 851 MQ
    2013 1 1 957 733 144 1056 853 123 UA
    2013 1 1 1114 900 134 1447 1222 145 UA
    2013 1 1 1120 944 96 1331 1213 78 EV
  2. Counts the number of cancelled flights and the number of flights delayed by more than an hour.

    flights |> group_by(dest) |> summarize_severe()
    summarize_severe <- function(data, arrival_delay){
      data |>
        summarize(
          cancelled_flights = sum(is.na({{arrival_delay}})),
          delayed_flights = sum({{arrival_delay}} > 60, na.rm = TRUE)
        )
    }
    
    # Running an example to show it works
    flights |>
      group_by(dest) |>
      summarize_severe(arr_delay) |>
      display_results()
    Dest Cancelled Flights Delayed Flights
    ABQ 0 25
    ACK 1 12
    ALB 21 59
    ANC 0 0
    ATL 378 1433
  3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours:

    flights |> filter_severe(hours = 2)
    filter_severe <- function(data, hours){
      data |>
        filter(is.na(arr_delay) | (arr_delay > hours * 60))
    }
    
    # Running an example to show that it works
    flights |>
      filter_severe(hours = 4) |>
      select(1:10) |>
      display_results()
    Year Month Day Dep Time Sched Dep Time Dep Delay Arr Time Sched Arr Time Arr Delay Carrier
    2013 1 1 848 1835 853 1001 1950 851 MQ
    2013 1 1 1815 1325 290 2120 1542 338 EV
    2013 1 1 1842 1422 260 1958 1535 263 EV
    2013 1 1 2115 1700 255 2330 1920 250 9E
    2013 1 1 2205 1720 285 46 2040 246 AA
  4. Summarizes the weather to compute the minimum, mean, and maximum, of a user supplied variable:

    weather |> summarize_weather(temp)
    summarize_weather <- function(data, variable){
      data |>
        summarize(
          mean = mean({{variable}}, na.rm = TRUE),
          minimum = min({{variable}}, na.rm = TRUE),
          maximum = max({{variable}}, na.rm = TRUE)
        )
    }
    # Running an example to show it works
    weather |>
      group_by(origin) |>
      summarize_weather(temp) |>
      display_results() |> fmt_number(decimals = 2)
    Origin Mean Minimum Maximum
    EWR 55.55 10.94 100.04
    JFK 54.47 12.02 98.06
    LGA 55.76 12.02 98.96
  5. Converts the user supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e. hours + (minutes / 60)).

flights |> standardize_time(sched_dep_time)
# Method 1
standardize_time <- function(data, variable){
  data |>
    mutate(std_time = {{variable}} %/% 100 + ({{variable}} %% 100) / 60) |>
    relocate(std_time, .after = {{variable}})
}

# Method 2 (after learning use of ":=" in Section 26.4.2)
standardize_time <- function(data, variable){
  data |>
    mutate({{variable}} := {{variable}} %/% 100 + ({{variable}} %% 100) / 60)
}

# Running an example to show it works
flights |>
  standardize_time(sched_dep_time) |>
  select(1:10) |>
  display_results()
Year Month Day Dep Time Sched Dep Time Dep Delay Arr Time Sched Arr Time Arr Delay Carrier
2013 1 1 517 5.250000 2 830 819 11 UA
2013 1 1 533 5.483333 4 850 830 20 UA
2013 1 1 542 5.666667 2 923 850 33 AA
2013 1 1 544 5.750000 -1 1004 1022 -18 B6
2013 1 1 554 6.000000 -6 812 837 -25 DL

Question 2

For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: distinct(), count(), group_by(), rename_with(), slice_min(), slice_sample().

In the context of the dplyr package in R, tidy evaluation is the framework of non-standard evaluation that lets you refer to the columns of a data frame directly inside function calls, as if they were ordinary variables. It has two parts: –

  1. Data Masking: Data masking lets you use the columns of a data frame as if they were variables in their own right, so you can write arr_delay rather than flights$arr_delay. Inside a data-masking argument the data frame acts as a temporary environment, and its columns take precedence over (mask) objects of the same name in the surrounding environment. For example:

    data <- data.frame(x = 1:5, y = 6:10)
    
    custom_function <- function(df) {
    
      # Data masking: x and y refer to the columns of df
      df %>%
        mutate(z = x + y)
    }
    
    custom_function(data)
      x  y  z
    1 1  6  7
    2 2  7  9
    3 3  8 11
    4 4  9 13
    5 5 10 15

    In this example, custom_function() takes a data frame df as an argument and uses mutate() from dplyr to create a new variable z. Inside mutate(), data masking ensures that x and y refer to the columns of df rather than to any objects with the same names in the global environment.

  2. Tidy Selection: Tidy selection is used by arguments that choose columns rather than compute on their values. You can refer to columns by name, position, or type, and use selection helpers such as starts_with(), contains(), where(), and all_of(), which makes dynamic and programmatic column selection straightforward. For example: –

    data <- data.frame(x = 1:5, y = 6:10)
    
    selected_columns <- c("x", "y")
    
    data %>%
      select(all_of(selected_columns))
      x  y
    1 1  6
    2 2  7
    3 3  8
    4 4  9
    5 5 10

    In this example, the select() function uses tidy selection to select columns specified in the selected_columns vector.

Thus, data masking is about computing with the values inside a data frame (columns are used as variables in expressions), while tidy selection is about choosing which columns an operation applies to.

All of the following functions use Tidy Evaluation in the following way: –

distinct(): selects distinct (unique) rows based on the specified columns. It uses data masking.
  • The .data argument does not involve Data Masking or Tidy Selection.

  • The ... argument uses data masking: the variables (or expressions of variables) that determine uniqueness are evaluated inside the data frame.

  • The .keep_all argument does not involve Data Masking or Tidy Selection.

count(): counts the number of rows in groups defined by the specified variables. It uses data masking.
  • The x argument does not involve Data Masking or Tidy Selection.

  • The ... and wt arguments use data masking: the grouping variables and the optional weighting variable are evaluated inside the data frame.

  • The sort and name arguments do not involve Data Masking or Tidy Selection.

group_by(): groups the data by the specified columns. It uses data masking.
  • The .data argument does not involve Data Masking or Tidy Selection.

  • The ... argument uses data masking: the grouping variables (or expressions computed from them) are evaluated inside the data frame. It does not use tidy selection.

  • The .add argument does not involve Data Masking or Tidy Selection.

rename_with(): renames columns using a function. It uses tidy selection.
  • The .data argument does not involve Data Masking or Tidy Selection.

  • The .cols argument uses tidy selection: the columns to rename can be specified by name, position, or with helpers such as starts_with() and everything().

  • The .fn and ... arguments (the renaming function and any extra arguments passed to it) do not involve Data Masking or Tidy Selection.

slice_min(): filters the rows with the smallest values of a variable. It uses data masking.
  • The .data, n, and prop arguments do not involve Data Masking or Tidy Selection.

  • The order_by argument uses data masking: the variable (or expression) that determines the ordering is evaluated inside the data frame.

slice_sample(): randomly samples rows from a data frame. It uses data masking.
  • The .data, n, prop, and replace arguments do not involve Data Masking or Tidy Selection.

  • The weight_by argument uses data masking: the sampling-weight variable is evaluated inside the data frame.

In summary, the ... arguments of distinct(), count(), and group_by() (along with count()'s wt), the order_by argument of slice_min(), and the weight_by argument of slice_sample() use data masking, while the .cols argument of rename_with() uses tidy selection.
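
To make the distinction concrete, here is a small sketch using the flights data (the column choices are only illustrative):

# Data masking: bare column names (or expressions) are evaluated inside the data frame
flights |> distinct(origin, dest)
flights |> count(carrier, wt = distance)
flights |> group_by(month) |> slice_min(order_by = dep_delay, n = 1)
flights |> slice_sample(n = 5, weight_by = distance)

# Tidy selection: .cols accepts selection helpers such as starts_with()
flights |> rename_with(toupper, .cols = starts_with("arr"))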

Question 3

Generalize the following function so that you can supply any number of variables to count.

count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>
    mutate(prop = n / sum(n))
}

Here is the modified function, which allows the user to supply any number of variables to count: –

count_prop <- function(df, vars, sort = FALSE) {
  df |>
    count(pick({{ vars }}), sort = sort) |>
    mutate(prop = n / sum(n))
}

# Testing the results
flights |>
  count_prop(c(dest, origin)) |>
  slice_head(n = 10) |> 
  gt() |> gt_theme_538() |> fmt_number(prop, decimals = 4)
dest origin n prop
ABQ JFK 254 0.0008
ACK JFK 265 0.0008
ALB EWR 439 0.0013
ANC EWR 8 0.0000
ATL EWR 5022 0.0149
ATL JFK 1930 0.0057
ATL LGA 10263 0.0305
AUS EWR 968 0.0029
AUS JFK 1471 0.0044
AVL EWR 265 0.0008

26.4.4 Exercises

Build up a rich plotting function by incrementally implementing each of the steps below:

Question 1

Draw a scatter-plot given dataset and x and y variables.

scatterplot <- function(data, x, y){
  data |>
    ggplot(aes(x = {{x}},
               y = {{y}})) +
    geom_point() +
    theme_classic()
}

Question 2

Add a line of best fit (i.e. a linear model with no standard errors).

scatterplot <- function(data, x, y){
  data |>
    ggplot(aes(x = {{x}},
               y = {{y}})) +
    geom_point() +
    geom_smooth(method = "lm",
                formula = y ~ x,  # y and x here refer to the aesthetics, not the data columns
                se = FALSE) +
    theme_classic()
}

Question 3

Add a title.

scatterplot <- function(data, x, y){
    ggplot(data, aes(x = {{x}},
               y = {{y}})) +
    geom_point() +
    geom_smooth(method = "lm",
                formula = y ~ x,  # y and x here refer to the aesthetics, not the data columns
                se = FALSE) +
    labs(title = rlang::englue("A scatter plot of {{y}} vs. {{x}}")) +
    theme_classic()
}
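
For instance, the finished function can be tried on the built-in mtcars data (an illustrative call):

mtcars |> scatterplot(x = wt, y = mpg)
# the title becomes "A scatter plot of mpg vs. wt"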

26.5.1 Exercises

Question 1

Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.

is_prefix(): f1 checks whether string starts with prefix (returning TRUE or FALSE), so is_prefix() is a fitting name.

# is_prefix
f1 <- function(string, prefix) {
  str_sub(string, 1, str_length(prefix)) == prefix
}

match_length(): f3 recycles (repeats) y so that its length matches the length of x, so a name like match_length() describes it well.

f3 <- function(x, y) {
  rep(y, length.out = length(x))
}
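
A quick illustration of both, with illustrative inputs:

f1(c("apple", "banana"), "app")   # TRUE FALSE
f3(1:5, 1:2)                      # 1 2 1 2 1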

Question 2

Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.

Here are some suggestions for renaming html_nodes() and its arguments (from the package rvest):

  1. Original Function: html_nodes()

    • html_extract() - This name suggests the action of extracting nodes from an HTML document.

    • select_html_nodes() - A more descriptive name that clarifies the purpose of the function.

  2. Argument css:

    • selector - A more explicit name to indicate that this argument takes CSS selectors.

    • element_selector - If the purpose of the argument is to select HTML elements, this name makes it clear.

  3. Argument xpath:

    • xpath_expression - A more descriptive name that specifies the type of input expected.

    • xml_path - A concise alternative that still hints at the purpose.

  4. Argument trim:

    • remove_whitespace - A name that better reflects the action performed when trim is set to TRUE.

    • clean_whitespace - Another option to indicate the removal of whitespace.

Question 3

Make a case for why norm_r(), norm_d() etc. would be better than rnorm(), dnorm(). Make a case for the opposite. How could you make the names even clearer?

Function names in R are easiest to work with when related functions share a common, predictable pattern, because that is what both autocomplete and memory rely on.

Arguments for norm_r() and norm_d() (Custom Names):

  1. Grouping by distribution: All functions for the normal distribution share the norm_ prefix, so typing norm_ and pressing Tab would list norm_r(), norm_d(), norm_p(), and norm_q() together.

  2. Consistency: A common prefix for a family of related functions matches the convention used across the tidyverse (e.g., the str_*() functions in stringr or the read_*() functions in readr).

  3. Clarity: These names put the distribution first and the action second, which is natural when you already know which distribution you need and then decide whether you want random draws, the density, and so on.

  4. Readability: norm_r() reads roughly as "normal distribution, random sample" and norm_d() as "normal distribution, density function".

Arguments for rnorm() and dnorm() (Current Names):

  1. Historical Convention: Functions like rnorm() and dnorm() have been part of base R for a long time. Changing their names would confuse users who are already familiar with these functions.

  2. Grouping by action: The r, d, p, and q prefixes group functions by what they do across all distributions (rnorm(), runif(), and rbinom() all generate random draws), so once you know the pattern you can guess the name for any distribution; a small sketch follows this list.

  3. Alignment with Mathematical Notation: The single-letter prefixes echo the usual statistical shorthand (d for density, p for cumulative probability, q for quantile, r for random), which is easy to recognize for users with a mathematical background.

  4. Package Consistency: Keeping the original names ensures consistency with the other functions in the stats package (e.g., pnorm(), qnorm()), which follow the same convention.
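
A small illustrative sketch of this prefix-based grouping at the console:

rnorm(3)    # three random draws from the standard Normal distribution
dnorm(0)    # density of the standard Normal at 0 (about 0.399)
runif(3)    # same r- prefix, but draws from the Uniform distribution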

Even Clearer Names: If we want to make the function names even clearer, we can consider longer, more descriptive names, such as:

  • generate_random_samples_from_normal_distribution() for rnorm(): This explicitly states what the function does.
  • probability_density_function_of_normal_distribution() for dnorm(): This specifies the exact purpose of the function.

While these longer names are more descriptive, they may become unwieldy in code. Striking a balance between clarity and brevity is important.