Chapter 16

Regular Expressions

Author

Aditya Dahiya

Published

September 12, 2023


library(tidyverse)
library(babynames)
library(gt)
library(gtExtras)
library(janitor)

16.3.5 Exercises

Question 1

What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

The code below shows the analysis. The answers are:

  • The baby names with the most vowels, i.e. 8 vowels, are Mariadelrosario and Mariaguadalupe.

  • The baby names with the highest proportion of vowels, i.e. 1 (they are composed entirely of vowels), are Ai, Aia, Aoi, Ea, Eua, Ia, Ii and Io.

b1 = babynames |>
  mutate(
    nos_vowels = str_count(name, pattern = "[AEIOUaeiou]"),
    name_length = str_length(name),
    prop_vowels = nos_vowels / name_length
  )
  
b1 |> 
  group_by(name) |>
  summarise(nos_vowels = mean(nos_vowels)) |>
  arrange(desc(nos_vowels)) |>
  slice_head(n = 5)
# A tibble: 5 × 2
  name            nos_vowels
  <chr>                <dbl>
1 Mariadelrosario          8
2 Mariaguadalupe           8
3 Aaliyahmarie             7
4 Abigailmarie             7
5 Aliciamarie              7
b1 |> 
  group_by(name) |>
  summarise(prop_vowels = mean(prop_vowels)) |>
  filter(prop_vowels == 1) |>
  select(name) |>
  as_vector() |>
  str_flatten(collapse = ", ", last = " and ")
[1] "Ai, Aia, Aoi, Ea, Eua, Ia, Ii and Io"

Question 2

Replace all forward slashes in "a/b/c/d/e" with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We’ll discuss the problem very soon.)

The following code replaces the forward slashes with backslashes:

test_string = "a/b/c/d/e"
str_replace_all(test_string,
                pattern = "/",
                replacement = "\\\\") |>
  str_view()
[1] │ a\b\c\d\e

Further, when we try to do the same in reverse, we cannot simply type the backslashes, because \ is an escape character in both strings and regular expressions. To match a single literal backslash, we need to write four backslashes in the pattern:

# test_string2 = "a\b\c\d\e" throws an error because \ is an escape character
# Thus, we need to add four \ to include one in the final output.

test_string2 = str_replace_all(test_string,
                pattern = "/",
                replacement = "\\\\")
test_string2 |>
  str_replace_all(pattern = "\\\\",
                  replace = "/") |>
  str_view()
[1] │ a/b/c/d/e

Question 3

Implement a simple version of str_to_lower() using str_replace_all().

The following code implements str_to_lower() using the str_replace_all() function:

test_string3 = "Take The Match And Strike It Against Your Shoe."

# Replace every letter with its lower-case form
str_replace_all(test_string3,
                pattern = "[A-Za-z]",
                replacement = tolower)
[1] "take the match and strike it against your shoe."

Question 4

Create a regular expression that will match telephone numbers as commonly written in your country.

The regular expressions used in the code below match telephone numbers written in common USA formats and split them into clean columns in Table 1:

telephone_numbers = c(
  "555-123-4567",
  "(555) 555-7890",
  "888-555-4321",
  "(123) 456-7890",
  "555-987-6543",
  "(555) 123-7890"
)

telephone_numbers |>
  str_replace(" ", "-") |>
  str_replace("\\(", "") |>
  str_replace("\\)", "") |>
  as_tibble() |>
  separate_wider_regex(
    cols = value,
    patterns = c(
      area_code = "[0-9]+",
      "-| ",
      exchange_code = "[0-9]+",
      "-| ",
      line_number = "[0-9]+"
    )
  ) |>
  gt() |>
  gtExtras::gt_theme_538() |>
  cols_label_with(fn = ~ janitor::make_clean_names(., case = "title"))
Table 1:

Use of regex to match USA telephone numbers

Area Code Exchange Code Line Number
555 123 4567
555 555 7890
888 555 4321
123 456 7890
555 987 6543
555 123 7890

Further, telephone numbers in the USA are commonly written in several formats, including:

  1. (123) 456-7890

  2. 123-456-7890

  3. 123.456.7890

  4. 1234567890

You can use the following regular expression to match these commonly written telephone number formats:

^(\(\d{3}\)\s*|\d{3}[-.]?)\d{3}[-.]?\d{4}$

Explanation of the regular expression:

  • ^ and $ match the start and end of the string, ensuring that the entire string is matched.

  • (\(\d{3}\)\s*|\d{3}[-.]?) matches the area code, which can be enclosed in parentheses or separated by a hyphen or a period. It uses the | (OR) operator to allow either format.

    • \(\d{3}\) matches a three-digit area code enclosed in parentheses.

    • \s* matches zero or more whitespace characters (in case there’s space between the closing parenthesis and the next part of the number).

    • | is the OR operator.

    • \d{3}[-.]? matches a three-digit area code followed by an optional hyphen or period.

  • \d{3}[-.]?\d{4} matches the main part of the phone number, which is three digits followed by an optional hyphen or period, and then four more digits.
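As a quick check, the pattern can be written as an R string (doubling every backslash) and tested with str_detect(); all four formats listed above should match, while a malformed number should not. The name us_phone_regex below is introduced only for this illustration.

us_phone_regex <- "^(\\(\\d{3}\\)\\s*|\\d{3}[-.]?)\\d{3}[-.]?\\d{4}$"

str_detect(
  c("(123) 456-7890", "123-456-7890", "123.456.7890", "1234567890", "12-345-678"),
  us_phone_regex
)
[1]  TRUE  TRUE  TRUE  TRUE FALSE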

16.4.7 Exercises

Question 1

How would you match the literal string "'\? How about "$^$"?

To match the literal string "'\ in R using the stringr package, we need to escape the special characters in our regular expression pattern. Thus, the matching pattern, written as an R string, becomes "\"\'\\\\". Here’s an example:

# The string you want to match
input_string <- "\"'\\"
str_view(input_string)
[1] │ "'\
# Pattern to match the literal string
match_pattern <- "\"\'\\\\"
str_view(match_pattern)
[1] │ "'\\
# Use str_detect to check if the string contains the pattern
if (str_detect(input_string, match_pattern)) {
  print("Pattern found in the input string.")
} else {
  print("Pattern not found in the input string.")
}
[1] "Pattern found in the input string."

To match the literal string "$^$" in R using the stringr package, we need to escape the special characters in our regular expression pattern. Thus, the matching pattern becomes "\"\\$\\^\\$\"" (which the regex engine sees as "\$\^\$"). Here’s an example:

# The string you want to match
input_string <- "\"$^$\""
str_view(input_string)
[1] │ "$^$"
# Pattern to match the literal string
match_pattern <- "\"\\$\\^\\$\""
str_view(match_pattern)
[1] │ "\$\^\$"
# Use str_detect to check if the string contains the pattern
if (str_detect(input_string, match_pattern)) {
  print("Pattern found in the input string.")
} else {
  print("Pattern not found in the input string.")
}
[1] "Pattern found in the input string."

Question 2

Explain why each of these patterns don’t match a \: "\", "\\", "\\\".

Each of these patterns fails to match a single backslash \ for the following reasons: —

  1. "\" - This is not even a complete string: within a string, \ is an escape character, so here it escapes the closing quote and R keeps waiting for the string to be finished.

  2. "\\" - This string contains a single backslash, so the regular expression it defines is just \. In a regular expression, \ is also an escape character, so a lone trailing backslash is an incomplete escape sequence and the pattern is invalid.

  3. "\\\" - The first two backslashes produce a single backslash in the string, but the third one again escapes the closing quote, so the string is unterminated, just as in the first case.

To match a literal backslash "\", you need four backslashes in the pattern, "\\\\": the string "\\\\" contains two backslashes, which the regular expression engine reads as one escaped, i.e. literal, backslash.
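A minimal check of this (the object x below, introduced only for illustration, contains exactly one literal backslash):

x <- "a\\b"
str_view(x)
[1] │ a\b
str_detect(x, "\\\\")
[1] TRUE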

Question 3

Given the corpus of common words in stringr::words, create regular expressions that find all words that:

  1. Start with “y”.

    words |>
      str_view(pattern = "^y")
    [975] │ <y>ear
    [976] │ <y>es
    [977] │ <y>esterday
    [978] │ <y>et
    [979] │ <y>ou
    [980] │ <y>oung
  2. Don’t start with “y”.

    # To view words not starting with a y
    words |>
      str_view(pattern = "^(?!y)")
     [1] │ <>a
     [2] │ <>able
     [3] │ <>about
     [4] │ <>absolute
     [5] │ <>accept
     [6] │ <>account
     [7] │ <>achieve
     [8] │ <>across
     [9] │ <>act
    [10] │ <>active
    [11] │ <>actual
    [12] │ <>add
    [13] │ <>address
    [14] │ <>admit
    [15] │ <>advertise
    [16] │ <>affect
    [17] │ <>afford
    [18] │ <>after
    [19] │ <>afternoon
    [20] │ <>again
    ... and 954 more
    # Check the number of such words
    words |>
      str_view(pattern = "^(?!y)") |>
      length()
    [1] 974
    # Check the number of words starting with y and total number of words
    # to confirm the matter
    words |> length()
    [1] 980
  3. End with “x”.

    To display these, we can use the regular expression "x$":
    words |>
      str_view(pattern = "x$")
    [108] │ bo<x>
    [747] │ se<x>
    [772] │ si<x>
    [841] │ ta<x>
  4. Are exactly three letters long. (Don’t cheat by using str_length()!)

    To display these, we can use the regular expression "\\b\\w{3}\\b", where: –

    • \\b is a word boundary anchor, ensuring that the matched word is exactly three letters long and not part of a longer word.

    • \\w{3} matches exactly three word characters (letters).

    • str_subset is used to find all words in the dataset that match the specified regular expression pattern.

    # Finding letters exactly three letters long using regex
    words |>
      str_subset(pattern = "\\b\\w{3}\\b")
      [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad"
     [13] "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can"
     [25] "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg"
     [37] "end" "eye" "far" "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy"
     [49] "hit" "hot" "how" "job" "key" "kid" "lad" "law" "lay" "leg" "let" "lie"
     [61] "lot" "low" "man" "may" "mrs" "new" "non" "not" "now" "odd" "off" "old"
     [73] "one" "out" "own" "pay" "per" "put" "red" "rid" "run" "say" "see" "set"
     [85] "sex" "she" "sir" "sit" "six" "son" "sun" "tax" "tea" "ten" "the" "tie"
     [97] "too" "top" "try" "two" "use" "war" "way" "wee" "who" "why" "win" "yes"
    [109] "yet" "you"
    # Finding letters exactly three letters long using str_length()
    three_let_words = str_length(words) == 3
    words[three_let_words]
      [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad"
     [13] "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can"
     [25] "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg"
     [37] "end" "eye" "far" "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy"
     [49] "hit" "hot" "how" "job" "key" "kid" "lad" "law" "lay" "leg" "let" "lie"
     [61] "lot" "low" "man" "may" "mrs" "new" "non" "not" "now" "odd" "off" "old"
     [73] "one" "out" "own" "pay" "per" "put" "red" "rid" "run" "say" "see" "set"
     [85] "sex" "she" "sir" "sit" "six" "son" "sun" "tax" "tea" "ten" "the" "tie"
     [97] "too" "top" "try" "two" "use" "war" "way" "wee" "who" "why" "win" "yes"
    [109] "yet" "you"
    # Checking results
    words[three_let_words] |>
      length()
    [1] 110
    words |>
      str_view(pattern = "\\b\\w{3}\\b") |>
      length()
    [1] 110
  5. Have seven letters or more.

    To display these, we can use the regular expression "\\b\\w{7,}\\b" , where: –

    • \\b is a word boundary anchor to ensure that the matched word has no additional characters before or after it.

    • \\w{7,} matches words with 7 or more word characters (letters).

    • str_subset is used to filter the words in the dataset based on the regular expression pattern, so it selects words with 7 letters or more.

    words |>
      str_subset(pattern = "\\b\\w{7,}\\b")
      [1] "absolute"    "account"     "achieve"     "address"     "advertise"  
      [6] "afternoon"   "against"     "already"     "alright"     "although"   
     [11] "america"     "another"     "apparent"    "appoint"     "approach"   
     [16] "appropriate" "arrange"     "associate"   "authority"   "available"  
     [21] "balance"     "because"     "believe"     "benefit"     "between"    
     [26] "brilliant"   "britain"     "brother"     "business"    "certain"    
     [31] "chairman"    "character"   "Christmas"   "colleague"   "collect"    
     [36] "college"     "comment"     "committee"   "community"   "company"    
     [41] "compare"     "complete"    "compute"     "concern"     "condition"  
     [46] "consider"    "consult"     "contact"     "continue"    "contract"   
     [51] "control"     "converse"    "correct"     "council"     "country"    
     [56] "current"     "decision"    "definite"    "department"  "describe"   
     [61] "develop"     "difference"  "difficult"   "discuss"     "district"   
     [66] "document"    "economy"     "educate"     "electric"    "encourage"  
     [71] "english"     "environment" "especial"    "evening"     "evidence"   
     [76] "example"     "exercise"    "expense"     "experience"  "explain"    
     [81] "express"     "finance"     "fortune"     "forward"     "function"   
     [86] "further"     "general"     "germany"     "goodbye"     "history"    
     [91] "holiday"     "hospital"    "however"     "hundred"     "husband"    
     [96] "identify"    "imagine"     "important"   "improve"     "include"    
    [101] "increase"    "individual"  "industry"    "instead"     "interest"   
    [106] "introduce"   "involve"     "kitchen"     "language"    "machine"    
    [111] "meaning"     "measure"     "mention"     "million"     "minister"   
    [116] "morning"     "necessary"   "obvious"     "occasion"    "operate"    
    [121] "opportunity" "organize"    "original"    "otherwise"   "paragraph"  
    [126] "particular"  "pension"     "percent"     "perfect"     "perhaps"    
    [131] "photograph"  "picture"     "politic"     "position"    "positive"   
    [136] "possible"    "practise"    "prepare"     "present"     "pressure"   
    [141] "presume"     "previous"    "private"     "probable"    "problem"    
    [146] "proceed"     "process"     "produce"     "product"     "programme"  
    [151] "project"     "propose"     "protect"     "provide"     "purpose"    
    [156] "quality"     "quarter"     "question"    "realise"     "receive"    
    [161] "recognize"   "recommend"   "relation"    "remember"    "represent"  
    [166] "require"     "research"    "resource"    "respect"     "responsible"
    [171] "saturday"    "science"     "scotland"    "secretary"   "section"    
    [176] "separate"    "serious"     "service"     "similar"     "situate"    
    [181] "society"     "special"     "specific"    "standard"    "station"    
    [186] "straight"    "strategy"    "structure"   "student"     "subject"    
    [191] "succeed"     "suggest"     "support"     "suppose"     "surprise"   
    [196] "telephone"   "television"  "terrible"    "therefore"   "thirteen"   
    [201] "thousand"    "through"     "thursday"    "together"    "tomorrow"   
    [206] "tonight"     "traffic"     "transport"   "trouble"     "tuesday"    
    [211] "understand"  "university"  "various"     "village"     "wednesday"  
    [216] "welcome"     "whether"     "without"     "yesterday"  
  6. Contain a vowel-consonant pair.

    To display these, we can use the regular expression "[aeiou][^aeiou]" , where: –

    • [aeiou] matches any vowel (a, e, i, o, or u).

    • [^aeiou] matches any character that is not a vowel, which ensures that there’s a consonant after the vowel.

    words |>
      str_view(pattern = "[aeiou][^aeiou]")
     [2] │ <ab>le
     [3] │ <ab>o<ut>
     [4] │ <ab>s<ol><ut>e
     [5] │ <ac>c<ep>t
     [6] │ <ac>co<un>t
     [7] │ <ac>hi<ev>e
     [8] │ <ac>r<os>s
     [9] │ <ac>t
    [10] │ <ac>t<iv>e
    [11] │ <ac>tu<al>
    [12] │ <ad>d
    [13] │ <ad>dr<es>s
    [14] │ <ad>m<it>
    [15] │ <ad>v<er>t<is>e
    [16] │ <af>f<ec>t
    [17] │ <af>f<or>d
    [18] │ <af>t<er>
    [19] │ <af>t<er>no<on>
    [20] │ <ag>a<in>
    [21] │ <ag>a<in>st
    ... and 924 more
  7. Contain at least two vowel-consonant pairs in a row.

    To display these, we can use the regular expression "[aeiou][^aeiou][aeiou][^aeiou]" , where: —

    • [aeiou] matches any vowel (a, e, i, o, or u).

    • [^aeiou] matches any character that is not a vowel, i.e. a consonant.

    • [aeiou][^aeiou][aeiou][^aeiou] specifies the pattern for at least two consecutive vowel-consonant pairs.

    words |>
      str_view(pattern = "[aeiou][^aeiou][aeiou][^aeiou]")
      [4] │ abs<olut>e
     [23] │ <agen>t
     [30] │ <alon>g
     [36] │ <amer>ica
     [39] │ <anot>her
     [42] │ <apar>t
     [43] │ app<aren>t
     [61] │ auth<orit>y
     [62] │ ava<ilab>le
     [63] │ <awar>e
     [64] │ <away>
     [70] │ b<alan>ce
     [75] │ b<asis>
     [81] │ b<ecom>e
     [83] │ b<efor>e
     [84] │ b<egin>
     [85] │ b<ehin>d
     [87] │ b<enef>it
    [119] │ b<usin>ess
    [143] │ ch<arac>ter
    ... and 149 more
  8. Only consist of repeated vowel-consonant pairs.

    To display these, we can use the regular expression "^(?:[aeiou][^aeiou]){2,}$" , where: —

    • ^: This anchor asserts the start of the string.

    • (?: ... ): This is a non-capturing group used to group the pattern for a vowel-consonant pair.

    • [aeiou]: This character class matches any vowel (a, e, i, o, or u). It’s the first part of the vowel-consonant pair.

    • [^aeiou]: This character class matches any character that is not a vowel. It’s the second part of the vowel-consonant pair and matches a consonant.

    • {2,}: This quantifier specifies that the preceding pattern (the vowel-consonant pair) must occur at least twice.

    • $: This anchor asserts the end of the string.

    words |>
      str_view(pattern = "^(?:[aeiou][^aeiou]){2,}$")
     [64] │ <away>
    [265] │ <eleven>
    [279] │ <even>
    [281] │ <ever>
    [436] │ <item>
    [573] │ <okay>
    [579] │ <open>
    [586] │ <original>
    [591] │ <over>
    [905] │ <unit>
    [911] │ <upon>

Question 4

Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!

# Sample passage with mixed spellings
sample_text <- "The airplane is made of aluminum. The analog signal is stronger. Don't be an ass. The center is closed for defense training. I prefer a donut, while she likes a doughnut. His hair is gray, but hers is grey. We're modeling a new project. The skeptic will not believe it. Please summarize the report."

# Define the regular expressions
patterns_to_detect <- c(
  "air(?:plane|oplane)",
  "alumin(?:um|ium)",
  "analog(?:ue)?",
  "ass|arse",
  "cent(?:er|re)",
  "defen(?:se|ce)",
  "dou(?:gh)?nut",
  "gr(?:a|e)y",
  "model(?:ing|ling)",
  "skep(?:tic|tic)",
  "summar(?:ize|ise)"
)

# Find and highlight the spellings: wrap each pattern in a capturing group
# and use the back-reference \1 so every match is bolded in place
for (pattern in patterns_to_detect) {
  sample_text <- str_replace_all(sample_text,
                                 pattern = paste0("(", pattern, ")"),
                                 replacement = "**\\1**")
}

sample_text
[1] "The **airplane** is made of **aluminum**. The **analog** signal is stronger. Don't be an **ass**. The **center** is closed for **defense** training. I prefer a donut, while she likes a **doughnut**. His hair is **gray**, but hers is **gray**. We're **modeling** a new project. The **skeptic** will not believe it. Please **summarize** the report."
[2] "The **airplane** is made of **aluminum**. The **analog** signal is stronger. Don't be an **ass**. The **center** is closed for **defense** training. I prefer a donut, while she likes a **doughnut**. His hair is **grey**, but hers is **grey**. We're **modeling** a new project. The **skeptic** will not believe it. Please **summarize** the report."

Question 5

Switch the first and last letters in words. Which of those strings are still words?

We can switch the first and last letters in words with str_replace_all(), and then keep only those new strings that also appear in the original words vector, as shown in the code below:

# Code to switch the first and last letters in words
new_words = words |>
  str_replace_all(pattern = "\\b(\\w)(\\w*)(\\w)\\b", 
                  replacement = "\\3\\2\\1")

# Finding which of the new strings are part of the original "words"
tibble(
  original_words = words[new_words %in% words],
  new_words = new_words[new_words %in% words]
) |>
  gt() |>
  cols_label_with(columns = everything(),
                  fn = ~ make_clean_names(., case = "title")) |>
  opt_interactive()
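Since opt_interactive() renders an HTML widget, a quick non-interactive way to peek at the first few matches is sketched below; the interactive table above carries the full list.

head(sort(unique(new_words[new_words %in% words])))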

Question 6

Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)

A Regular Expression (Regex) is a formalized way to represent a pattern that can match various strings. It is language-independent and can be used in various programming languages, such as Python, R etc. On the other hand, a string that defines a Regular Expression is a regular character string that contains the textual representation of a regular expression. It is not the actual regular expression itself but rather a sequence of characters that programmers use to create and define regular expressions in their code. It is passed as an argument to a function or method provided by the programming language’s regular expression library to create a regular expression object.
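For instance, writeLines() makes the difference visible: the R string we type (with doubled backslashes) is not the same sequence of characters that the regex engine receives. The variable x below is a throw-away name used only for this illustration.

# The string typed in R ...
x <- "\\d{4}-\\d{2}-\\d{2}"
# ... versus the regular expression it defines
writeLines(x)
\d{4}-\d{2}-\d{2}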

Here’s a description of what each of these regular expressions matches:

  1. ^.*$: This regular expression matches an entire string. It starts with ^ (caret), which anchors the match to the beginning of the string, followed by .* which matches any number of characters (including none), and ends with $ (dollar sign), which anchors the match to the end of the string. So, it essentially matches any string, including an empty one.

  2. "\\{.+\\}": This is a string defining a regular expression matches strings that contain curly braces {} with content inside them. The double backslashes \\ are used to escape the curly braces, and .+ matches one or more of any characters within the braces. So, it would match strings like “{abc}” or “{123}”.

  3. \d{4}-\d{2}-\d{2}: This regular expression matches a date-like pattern in the format “YYYY-MM-DD.” Here, \d matches a digit, and {4}, {2}, and {2} specify the exact number of digits for the year, month, and day, respectively. So, it matches strings like “2023-09-14.”

  4. "\\\\{4}": This is a string that defines a regular expression which matches strings that contains exactly four backslashes. Each backslash is escaped with another backslash, so \\ matches a single backslash, and {4} specifies that exactly four backslashes must appear consecutively in the string. It matches strings like “\\\\abcd” but not “\\efg” (which contains only two backslashes).

  5. \..\..\..: This regular expression matches strings that contain three dots, each followed by any single character. The dot is a special character in regular expressions, so it is escaped as \. to match a literal dot; the unescaped . then matches any character, and this dot-plus-character pattern is repeated three times. So, it matches strings like “.a.b.c” or “.1.2.3”, as verified below:

    test_string = c("a.bc..de", "1.2.3", "x...y", ".1.2.3",
                    "a.b.c.", ".a.b.c")
    
    test_regex = "\\..\\..\\.."
    str_view(test_regex)
    [1] │ \..\..\..
    tibble(
      test_string = test_string,
      match_result = str_detect(test_string, test_regex)
    )
    # A tibble: 6 × 2
      test_string match_result
      <chr>       <lgl>       
    1 a.bc..de    FALSE       
    2 1.2.3       FALSE       
    3 x...y       FALSE       
    4 .1.2.3      TRUE        
    5 a.b.c.      FALSE       
    6 .a.b.c      TRUE        
  6. (.)\1\1: This regular expression matches strings that contain three consecutive identical characters. The parentheses (.) capture any single character, and \1 refers to the first captured character. So, it matches strings like “aaa” or “111.”

  7. "(..)\\1": This is a string that represents a regular expression which matches strings consisting of two identical characters repeated twice. The (..) captures any two characters, and \\1 refers to the first captured two characters. So, it matches strings like aa or 11 within double quotes.

16.6.4 Exercises

Question 1

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

  1. Find all words that start or end with x.

    The words are displayed below: —

    # Using a singular regular expression
    str_view(words, "(^xX)|(x$)")
    [108] │ bo<x>
    [747] │ se<x>
    [772] │ si<x>
    [841] │ ta<x>
    # Using a combination of multiple str_detect() calls
    words[str_detect(words, "^xX") | str_detect(words, "x$")]
    [1] "box" "sex" "six" "tax"
  2. Find all words that start with a vowel and end with a consonant.

    The words are displayed below: —

    # Using a singular regular expression
    pattern_b = "^(?i)[aeiou].*[^aeiou]$"
    str_subset(words, pattern_b)
      [1] "about"       "accept"      "account"     "across"      "act"        
      [6] "actual"      "add"         "address"     "admit"       "affect"     
     [11] "afford"      "after"       "afternoon"   "again"       "against"    
     [16] "agent"       "air"         "all"         "allow"       "almost"     
     [21] "along"       "already"     "alright"     "although"    "always"     
     [26] "amount"      "and"         "another"     "answer"      "any"        
     [31] "apart"       "apparent"    "appear"      "apply"       "appoint"    
     [36] "approach"    "arm"         "around"      "art"         "as"         
     [41] "ask"         "at"          "attend"      "authority"   "away"       
     [46] "awful"       "each"        "early"       "east"        "easy"       
     [51] "eat"         "economy"     "effect"      "egg"         "eight"      
     [56] "either"      "elect"       "electric"    "eleven"      "employ"     
     [61] "end"         "english"     "enjoy"       "enough"      "enter"      
     [66] "environment" "equal"       "especial"    "even"        "evening"    
     [71] "ever"        "every"       "exact"       "except"      "exist"      
     [76] "expect"      "explain"     "express"     "identify"    "if"         
     [81] "important"   "in"          "indeed"      "individual"  "industry"   
     [86] "inform"      "instead"     "interest"    "invest"      "it"         
     [91] "item"        "obvious"     "occasion"    "odd"         "of"         
     [96] "off"         "offer"       "often"       "okay"        "old"        
    [101] "on"          "only"        "open"        "opportunity" "or"         
    [106] "order"       "original"    "other"       "ought"       "out"        
    [111] "over"        "own"         "under"       "understand"  "union"      
    [116] "unit"        "university"  "unless"      "until"       "up"         
    [121] "upon"        "usual"      
    # Using a combination of multiple str_detect() calls
    words[
      str_detect(words, "^(?i)[aeiou]") &
      str_detect(words, "[^aeiou]$")  
    ]
      [1] "about"       "accept"      "account"     "across"      "act"        
      [6] "actual"      "add"         "address"     "admit"       "affect"     
     [11] "afford"      "after"       "afternoon"   "again"       "against"    
     [16] "agent"       "air"         "all"         "allow"       "almost"     
     [21] "along"       "already"     "alright"     "although"    "always"     
     [26] "amount"      "and"         "another"     "answer"      "any"        
     [31] "apart"       "apparent"    "appear"      "apply"       "appoint"    
     [36] "approach"    "arm"         "around"      "art"         "as"         
     [41] "ask"         "at"          "attend"      "authority"   "away"       
     [46] "awful"       "each"        "early"       "east"        "easy"       
     [51] "eat"         "economy"     "effect"      "egg"         "eight"      
     [56] "either"      "elect"       "electric"    "eleven"      "employ"     
     [61] "end"         "english"     "enjoy"       "enough"      "enter"      
     [66] "environment" "equal"       "especial"    "even"        "evening"    
     [71] "ever"        "every"       "exact"       "except"      "exist"      
     [76] "expect"      "explain"     "express"     "identify"    "if"         
     [81] "important"   "in"          "indeed"      "individual"  "industry"   
     [86] "inform"      "instead"     "interest"    "invest"      "it"         
     [91] "item"        "obvious"     "occasion"    "odd"         "of"         
     [96] "off"         "offer"       "often"       "okay"        "old"        
    [101] "on"          "only"        "open"        "opportunity" "or"         
    [106] "order"       "original"    "other"       "ought"       "out"        
    [111] "over"        "own"         "under"       "understand"  "union"      
    [116] "unit"        "university"  "unless"      "until"       "up"         
    [121] "upon"        "usual"      
  3. Are there any words that contain at least one of each different vowel?

    No, there are no such words in words, as both the single regular expression below and the combination of str_detect() calls shown after it confirm.

    pattern_c = "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u).+"
    str_subset(words, pattern_c)
    character(0)
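    The same check with a combination of multiple str_detect() calls, one per vowel, returns the same empty result:

    # Using a combination of multiple str_detect() calls
    words[
      str_detect(words, "a") &
      str_detect(words, "e") &
      str_detect(words, "i") &
      str_detect(words, "o") &
      str_detect(words, "u")
    ]
    character(0)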

Question 2

Construct patterns to find evidence for and against the rule “i before e except after c”?

The code given below provides annotated evidence for and against the rule “i before e except after c”.

# Creating the regexp's first to use in stringr functions
pattern_1a = "\\b\\w*ie\\w*\\b"
pattern_1b = "\\b\\w+ei\\w*\\b"

pattern_2a = "\\b\\w*cei\\w*\\b"
pattern_2b = "\\b\\w*cie\\w*\\b"

# Words which contain "i" before "e"
words[str_detect(words, pattern_1a)]
 [1] "achieve"    "believe"    "brief"      "client"     "die"       
 [6] "experience" "field"      "friend"     "lie"        "piece"     
[11] "quiet"      "science"    "society"    "tie"        "view"      
# Words which contain "e" before an "i", thus giving evidence against
# the rule, unless there is a preceding "c"
words[str_detect(words, pattern_1b)]
[1] "receive" "weigh"  
# Words which contain "e" before an "i" after "c", thus following the rule.
# That is, evidence in favour of the rule
words[str_detect(words, pattern_2a)]
[1] "receive"
# Words which contain an "i" before "e" after "c", thus violating the rule.
# That is, evidence against the rule
words[str_detect(words, pattern_2b)]
[1] "science" "society"

Question 3

colors() contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified.)

  • The R code col_vec = colours(distinct = TRUE) creates a vector col_vec containing a set of distinct color names available in R’s default color palette.

  • The code col_vec = col_vec[!str_detect(col_vec, "\\b\\w*\\d\\w*\\b")] filters the vector col_vec to exclude color names that contain any digits within them.

  • Finally, the code col_vec[str_detect(col_vec, "\\b(?:light|dark)\\w*\\b")] will return a subset of the col_vec vector containing color names that have modifiers like “light” or “dark” in them, effectively identifying color names with modifiers.

col_vec = colours(distinct = TRUE)
col_vec = col_vec[!str_detect(col_vec, "\\b\\w*\\d\\w*\\b")]

col_vec[str_detect(col_vec, "\\b(?:light|dark)\\w*\\b")]
 [1] "darkgoldenrod"        "darkgray"             "darkgreen"           
 [4] "darkkhaki"            "darkmagenta"          "darkolivegreen"      
 [7] "darkorange"           "darkorchid"           "darkred"             
[10] "darksalmon"           "darkseagreen"         "darkslateblue"       
[13] "darkslategray"        "darkturquoise"        "darkviolet"          
[16] "lightblue"            "lightcoral"           "lightcyan"           
[19] "lightgoldenrod"       "lightgoldenrodyellow" "lightgray"           
[22] "lightgreen"           "lightpink"            "lightsalmon"         
[25] "lightseagreen"        "lightskyblue"         "lightslateblue"      
[28] "lightslategray"       "lightsteelblue"       "lightyellow"         

Question 4

Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = "datasets")$results[, "Item"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.

The following code does the job; the purpose of each line is explained in the comments.

# Extract all base R datasets into a character vector
base_r_packs = data(package = "datasets")$results[, "Item"]

# Remove all the names of grouping data.frames given in parentheses
base_r_packs = str_replace_all(base_r_packs, 
                pattern = "\\([^()]+\\)", 
                replacement = "")
# Remove the whitespace, i.e., " ", left after removing the parenthesized names
base_r_packs = str_replace_all(base_r_packs, 
                pattern = "\\s+$", 
                replacement = "")

# Create the regular expression
huge_regex = str_c("\\b(", str_flatten(base_r_packs, "|"), ")\\b")
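As a quick, hedged check of the resulting pattern, str_detect() should flag text that mentions a base R dataset such as mtcars or iris; the sample sentences below are made up for illustration, and the exact result assumes the standard contents of the datasets package.

str_detect(
  c("We fit a model to the mtcars data.",
    "The iris measurements are a classic example.",
    "Nothing relevant appears here."),
  huge_regex
)
[1]  TRUE  TRUE FALSE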