library(tidyverse)
library(babynames)
library(gt)
library(gtExtras)
library(janitor)
Chapter 16
Regular Expressions
16.3.5 Exercises
Question 1
What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)
The code below shows the analysis. The answers are:
The baby names with the most vowels (8 each) are Mariadelrosario and Mariaguadalupe.
The baby names with the highest proportion of vowels are Ai, Aia, Aoi, Ea, Eua, Ia, Ii and Io. The denominator is the length of the name, and these names have a proportion of 1 because they consist entirely of vowels.
b1 = babynames |>
  mutate(
    nos_vowels = str_count(name, pattern = "[AEIOUaeiou]"),
    name_length = str_length(name),
    prop_vowels = nos_vowels / name_length
  )

b1 |>
  group_by(name) |>
  summarise(nos_vowels = mean(nos_vowels)) |>
  arrange(desc(nos_vowels)) |>
  slice_head(n = 5)
# A tibble: 5 × 2
name nos_vowels
<chr> <dbl>
1 Mariadelrosario 8
2 Mariaguadalupe 8
3 Aaliyahmarie 7
4 Abigailmarie 7
5 Aliciamarie 7
b1 |>
  group_by(name) |>
  summarise(prop_vowels = mean(prop_vowels)) |>
  filter(prop_vowels == 1) |>
  select(name) |>
  as_vector() |>
  str_flatten(collapse = ", ", last = " and ")
[1] "Ai, Aia, Aoi, Ea, Eua, Ia, Ii and Io"
Question 2
Replace all forward slashes in "a/b/c/d/e"
with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We’ll discuss the problem very soon.)
The following code replaces the forward slashes with backslashes:
test_string = "a/b/c/d/e"
str_replace_all(test_string,
                pattern = "/",
                replacement = "\\\\") |>
  str_view()
[1] │ a\b\c\d\e
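Undoing the transformation is trickier, because \ is an escape character both in R strings and in regular expressions. Typing the string directly as "a\b\c\d\e" throws an error, and a pattern that matches one backslash has to be written with four: the R parser reduces "\\\\" to two backslashes, which the regular expression engine then reads as one escaped, literal backslash.
# test_string2 = "a\b\c\d\e" throws an error because \ is an escape character,
# so we create the string with str_replace_all() instead.
test_string2 = str_replace_all(test_string,
                               pattern = "/",
                               replacement = "\\\\")

# Four backslashes in the pattern match one literal backslash in the string
test_string2 |>
  str_replace_all(pattern = "\\\\",
                  replacement = "/") |>
  str_view()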
[1] │ a/b/c/d/e
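To make the "four backslashes" point concrete, it can help to compare what is typed with what R actually stores. The following is a minimal sketch, assuming only that stringr is installed; the variable names `typed`, `with_backslashes` and `round_trip` are illustrative and not part of the original solution.

```r
library(stringr)

# "\\\\" as typed in R source is only two characters once parsed: \\
typed <- "\\\\"
str_length(typed)

# Used as a replacement, those two characters produce one literal backslash
with_backslashes <- str_replace_all("a/b/c/d/e", "/", typed)
writeLines(with_backslashes)
str_length(with_backslashes)   # 5 letters + 4 backslashes

# Used as a pattern, the same two characters are a regex escape that
# matches one literal backslash, so the transformation can be undone
round_trip <- str_replace_all(with_backslashes, typed, "/")
round_trip
```

`writeLines()` prints the raw contents of the string, whereas `print()` shows the escaped form, which is why the two displays differ.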
Question 3
Implement a simple version of str_to_lower()
using str_replace_all()
.
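The following code implements a simple str_to_lower() using str_replace_all(), with tolower() supplied as the replacement function. The pattern only needs to match uppercase letters, so the character class [A-Z] is sufficient (a | inside a character class would be treated as a literal pipe, not as "or"):
test_string3 = "Take The Match And Strike It Against Your Shoe."

str_replace_all(test_string3,
                pattern = "[A-Z]",
                replacement = tolower)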
[1] "take the match and strike it against your shoe."
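As an alternative sketch (not the approach used above), str_replace_all() also accepts a named vector of pattern = replacement pairs, so each uppercase letter can be mapped to its lowercase counterpart without a replacement function. The names `upper_to_lower` and `lowered` are illustrative.

```r
library(stringr)

test_string3 <- "Take The Match And Strike It Against Your Shoe."

# Named vector: the names are the patterns ("A", "B", ...),
# the values are the replacements ("a", "b", ...)
upper_to_lower <- setNames(letters, LETTERS)

lowered <- str_replace_all(test_string3, upper_to_lower)
lowered
```

This version only handles the 26 unaccented ASCII letters, which is the sense in which it is a "simple" str_to_lower().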
Question 4
Create a regular expression that will match telephone numbers as commonly written in your country.
The regular expressions in the code below match telephone numbers written in common USA formats and split them into the columns shown in Table 1:
telephone_numbers = c(
  "555-123-4567",
  "(555) 555-7890",
  "888-555-4321",
  "(123) 456-7890",
  "555-987-6543",
  "(555) 123-7890"
)

telephone_numbers |>
  # Normalise "(555) 555-7890" to "555-555-7890"
  str_replace(" ", "-") |>
  str_replace("\\(", "") |>
  str_replace("\\)", "") |>
  as_tibble() |>
  # Named patterns become columns; unnamed patterns are matched and dropped
  separate_wider_regex(
    cols = value,
    patterns = c(
      area_code = "[0-9]+",
      "-| ",
      exchange_code = "[0-9]+",
      "-| ",
      line_number = "[0-9]+"
    )
  ) |>
  gt() |>
  gtExtras::gt_theme_538() |>
  cols_label_with(fn = ~ janitor::make_clean_names(., case = "title"))
Area Code | Exchange Code | Line Number |
---|---|---|
555 | 123 | 4567 |
555 | 555 | 7890 |
888 | 555 | 4321 |
123 | 456 | 7890 |
555 | 987 | 6543 |
555 | 123 | 7890 |
Further, telephone numbers in the USA are commonly written in several formats, including:
(123) 456-7890
123-456-7890
123.456.7890
1234567890
You can use the following regular expression to match these commonly written telephone number formats:
^(\(\d{3}\)\s*|\d{3}[-.]?)\d{3}[-.]?\d{4}$
Explanation of the regular expression:
- ^ and $ match the start and end of the string, ensuring that the entire string is matched.
- (\(\d{3}\)\s*|\d{3}[-.]?) matches the area code, which can be enclosed in parentheses or followed by an optional hyphen or period. It uses the | (OR) operator to allow either format.
  - \(\d{3}\) matches a three-digit area code enclosed in parentheses.
  - \s* matches zero or more whitespace characters (in case there’s space between the closing parenthesis and the next part of the number).
  - | is the OR operator.
  - \d{3}[-.]? matches a three-digit area code followed by an optional hyphen or period.
- \d{3}[-.]?\d{4} matches the main part of the phone number, which is three digits followed by an optional hyphen or period, and then four more digits.
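The regular expression above can be checked against the four listed formats. This is a minimal sketch: the pattern is the one given above, written as an R string (so every backslash is doubled), while the two malformed numbers and the names `phone_regex`, `candidates` and `matched` are made up for illustration.

```r
library(stringr)

# The regex from the text, with backslashes doubled for the R string
phone_regex <- "^(\\(\\d{3}\\)\\s*|\\d{3}[-.]?)\\d{3}[-.]?\\d{4}$"

candidates <- c(
  "(123) 456-7890",
  "123-456-7890",
  "123.456.7890",
  "1234567890",
  "123-45-67890",   # hypothetical malformed number: digits grouped 3-2-5
  "12-3456-7890"    # hypothetical malformed number: digits grouped 2-4-4
)

matched <- str_detect(candidates, phone_regex)
setNames(matched, candidates)
```

The four listed formats should be detected and the two malformed numbers rejected, because the anchors ^ and $ force the whole string to follow the 3-3-4 grouping.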
16.4.7 Exercises
Question 1
How would you match the literal string "'\
? How about "$^$"
?
To match the literal string "'\ in R using the stringr package, we need to escape the special characters: the double quote for the R string, and the backslash for both the R string and the regular expression. Thus, the pattern is written as "\"\'\\\\". Here’s an example:
# The string you want to match
input_string <- "\"'\\"
str_view(input_string)
[1] │ "'\
# Pattern to match the literal string
match_pattern <- "\"\'\\\\"
str_view(match_pattern)
[1] │ "'\\
# Use str_detect to check if the string contains the pattern
if (str_detect(input_string, match_pattern)) {
  print("Pattern found in the input string.")
} else {
  print("Pattern not found in the input string.")
}
[1] "Pattern found in the input string."
To match the literal string "$^$" in R using the stringr package, we need to escape the regular expression metacharacters $ and ^, and the double quotes for the R string. Thus, the pattern is written as "\"\\$\\^\\$\"". Here’s an example:
# The string you want to match
input_string <- "\"$^$\""
str_view(input_string)
[1] │ "$^$"
# Pattern to match the literal string
match_pattern <- "\"\\$\\^\\$\""
str_view(match_pattern)
[1] │ "\$\^\$"
# Use str_detect to check if the string contains the pattern
if (str_detect(input_string, match_pattern)) {
  print("Pattern found in the input string.")
} else {
  print("Pattern not found in the input string.")
}
[1] "Pattern found in the input string."
Question 2
Explain why each of these patterns don’t match a \
: "\"
, "\\"
, "\\\"
.
None of these patterns matches a single backslash \, for the following reasons:
- "\" : In an R string, the backslash escapes the closing quotation mark, so the string is never terminated and R reports a syntax error before any regular expression is created.
- "\\" : This is a valid R string containing one backslash. The regular expression engine then receives a lone \, which is an escape character with nothing after it to escape, so it throws an error.
- "\\\" : The first two backslashes form one literal backslash in the string, but the third escapes the closing quotation mark, so the string is again unterminated.
To match a single backslash, the pattern must be written with four backslashes, "\\\\". The R parser turns the four typed backslashes into the two-character string \\, and the regular expression engine reads those two characters as an escaped backslash, which matches one literal \ in the input.
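The two patterns that are valid R strings can be tried directly. The sketch below assumes stringr is installed; `one_backslash`, `bad`, `good` and `literal` are illustrative names. The unterminated strings "\" and "\\\" cannot be included, because the script would not even parse.

```r
library(stringr)

# The three-character string a\b (the backslash is escaped in the R source)
one_backslash <- "a\\b"
writeLines(one_backslash)

# "\\" is a single backslash: an incomplete regex escape, so the engine errors.
# tryCatch() converts that error into the string "error" for inspection.
bad <- tryCatch(
  str_detect(one_backslash, "\\"),
  error = function(e) "error"
)

# "\\\\" is two backslashes: a regex that matches one literal backslash
good <- str_detect(one_backslash, "\\\\")

# fixed() skips the regex layer, so only the R string escape is needed
literal <- str_detect(one_backslash, fixed("\\"))

list(bad = bad, good = good, literal = literal)
```

Wrapping the pattern in fixed() is often the simplest choice when no regular expression features are needed.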
Question 3
Given the corpus of common words in stringr::words
, create regular expressions that find all words that:
Start with “y”.
words |> str_view(pattern = "^y")
[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
[980] │ <y>oung
Don’t start with “y”.
# To view words not starting with a y
words |> str_view(pattern = "^(?!y)")
[1] │ <>a [2] │ <>able [3] │ <>about [4] │ <>absolute [5] │ <>accept [6] │ <>account [7] │ <>achieve [8] │ <>across [9] │ <>act [10] │ <>active [11] │ <>actual [12] │ <>add [13] │ <>address [14] │ <>admit [15] │ <>advertise [16] │ <>affect [17] │ <>afford [18] │ <>after [19] │ <>afternoon [20] │ <>again ... and 954 more
# Check the number of such words
words |> str_view(pattern = "^(?!y)") |> length()
[1] 974
# Check the total number of words: the 974 words above plus the
# 6 words starting with y should add up to it
words |> length()
[1] 980
End with “x”.
- To display these, we can use the regular expression "x$".
words |> str_view(pattern = "x$")
[108] │ bo<x>
[747] │ se<x>
[772] │ si<x>
[841] │ ta<x>
Are exactly three letters long. (Don’t cheat by using str_length()!)
To display these, we can use the regular expression "\\b\\w{3}\\b", where:
- \\b is a word boundary anchor, ensuring that the matched word is exactly three letters long and not part of a longer word.
- \\w{3} matches exactly three word characters (letters, digits or underscores; the entries in words contain only letters).
- str_subset() is used to find all words in the dataset that match the specified regular expression pattern.
# Finding words exactly three letters long using regex
words |> str_subset(pattern = "\\b\\w{3}\\b")
[1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad" [13] "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can" [25] "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg" [37] "end" "eye" "far" "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy" [49] "hit" "hot" "how" "job" "key" "kid" "lad" "law" "lay" "leg" "let" "lie" [61] "lot" "low" "man" "may" "mrs" "new" "non" "not" "now" "odd" "off" "old" [73] "one" "out" "own" "pay" "per" "put" "red" "rid" "run" "say" "see" "set" [85] "sex" "she" "sir" "sit" "six" "son" "sun" "tax" "tea" "ten" "the" "tie" [97] "too" "top" "try" "two" "use" "war" "way" "wee" "who" "why" "win" "yes" [109] "yet" "you"
# Finding words exactly three letters long using str_length()
three_let_words = str_length(words) == 3
words[three_let_words]
[1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad" [13] "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can" [25] "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg" [37] "end" "eye" "far" "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy" [49] "hit" "hot" "how" "job" "key" "kid" "lad" "law" "lay" "leg" "let" "lie" [61] "lot" "low" "man" "may" "mrs" "new" "non" "not" "now" "odd" "off" "old" [73] "one" "out" "own" "pay" "per" "put" "red" "rid" "run" "say" "see" "set" [85] "sex" "she" "sir" "sit" "six" "son" "sun" "tax" "tea" "ten" "the" "tie" [97] "too" "top" "try" "two" "use" "war" "way" "wee" "who" "why" "win" "yes" [109] "yet" "you"
# Checking results
words[three_let_words] |> length()
[1] 110
words |> str_view(pattern = "\\b\\w{3}\\b") |> length()
[1] 110
Have seven letters or more.
To display these, we can use the regular expression "\\b\\w{7,}\\b", where:
- \\b is a word boundary anchor to ensure that the matched word has no additional characters before or after it.
- \\w{7,} matches words with 7 or more word characters (letters).
- str_subset() is used to filter the words in the dataset based on the regular expression pattern, so it selects words with 7 letters or more.
words |> str_subset(pattern = "\\b\\w{7,}\\b")
[1] "absolute" "account" "achieve" "address" "advertise" [6] "afternoon" "against" "already" "alright" "although" [11] "america" "another" "apparent" "appoint" "approach" [16] "appropriate" "arrange" "associate" "authority" "available" [21] "balance" "because" "believe" "benefit" "between" [26] "brilliant" "britain" "brother" "business" "certain" [31] "chairman" "character" "Christmas" "colleague" "collect" [36] "college" "comment" "committee" "community" "company" [41] "compare" "complete" "compute" "concern" "condition" [46] "consider" "consult" "contact" "continue" "contract" [51] "control" "converse" "correct" "council" "country" [56] "current" "decision" "definite" "department" "describe" [61] "develop" "difference" "difficult" "discuss" "district" [66] "document" "economy" "educate" "electric" "encourage" [71] "english" "environment" "especial" "evening" "evidence" [76] "example" "exercise" "expense" "experience" "explain" [81] "express" "finance" "fortune" "forward" "function" [86] "further" "general" "germany" "goodbye" "history" [91] "holiday" "hospital" "however" "hundred" "husband" [96] "identify" "imagine" "important" "improve" "include" [101] "increase" "individual" "industry" "instead" "interest" [106] "introduce" "involve" "kitchen" "language" "machine" [111] "meaning" "measure" "mention" "million" "minister" [116] "morning" "necessary" "obvious" "occasion" "operate" [121] "opportunity" "organize" "original" "otherwise" "paragraph" [126] "particular" "pension" "percent" "perfect" "perhaps" [131] "photograph" "picture" "politic" "position" "positive" [136] "possible" "practise" "prepare" "present" "pressure" [141] "presume" "previous" "private" "probable" "problem" [146] "proceed" "process" "produce" "product" "programme" [151] "project" "propose" "protect" "provide" "purpose" [156] "quality" "quarter" "question" "realise" "receive" [161] "recognize" "recommend" "relation" "remember" "represent" [166] "require" "research" "resource" "respect" 
"responsible" [171] "saturday" "science" "scotland" "secretary" "section" [176] "separate" "serious" "service" "similar" "situate" [181] "society" "special" "specific" "standard" "station" [186] "straight" "strategy" "structure" "student" "subject" [191] "succeed" "suggest" "support" "suppose" "surprise" [196] "telephone" "television" "terrible" "therefore" "thirteen" [201] "thousand" "through" "thursday" "together" "tomorrow" [206] "tonight" "traffic" "transport" "trouble" "tuesday" [211] "understand" "university" "various" "village" "wednesday" [216] "welcome" "whether" "without" "yesterday"
Contain a vowel-consonant pair.
To display these, we can use the regular expression "[aeiou][^aeiou]", where:
- [aeiou] matches any vowel (a, e, i, o, or u).
- [^aeiou] matches any character that is not a vowel, which ensures that there’s a consonant after the vowel.
words |> str_view(pattern = "[aeiou][^aeiou]")
[2] │ <ab>le [3] │ <ab>o<ut> [4] │ <ab>s<ol><ut>e [5] │ <ac>c<ep>t [6] │ <ac>co<un>t [7] │ <ac>hi<ev>e [8] │ <ac>r<os>s [9] │ <ac>t [10] │ <ac>t<iv>e [11] │ <ac>tu<al> [12] │ <ad>d [13] │ <ad>dr<es>s [14] │ <ad>m<it> [15] │ <ad>v<er>t<is>e [16] │ <af>f<ec>t [17] │ <af>f<or>d [18] │ <af>t<er> [19] │ <af>t<er>no<on> [20] │ <ag>a<in> [21] │ <ag>a<in>st ... and 924 more
Contain at least two vowel-consonant pairs in a row.
To display these, we can use the regular expression "[aeiou][^aeiou][aeiou][^aeiou]", where:
- [aeiou] matches any vowel (a, e, i, o, or u).
- [^aeiou] matches any single character that is not a vowel, i.e. a consonant.
- [aeiou][^aeiou][aeiou][^aeiou] therefore specifies two consecutive vowel-consonant pairs. Words with more than two pairs in a row also contain this pattern, so "at least two" is covered.
words |> str_view(pattern = "[aeiou][^aeiou][aeiou][^aeiou]")
[4] │ abs<olut>e [23] │ <agen>t [30] │ <alon>g [36] │ <amer>ica [39] │ <anot>her [42] │ <apar>t [43] │ app<aren>t [61] │ auth<orit>y [62] │ ava<ilab>le [63] │ <awar>e [64] │ <away> [70] │ b<alan>ce [75] │ b<asis> [81] │ b<ecom>e [83] │ b<efor>e [84] │ b<egin> [85] │ b<ehin>d [87] │ b<enef>it [119] │ b<usin>ess [143] │ ch<arac>ter ... and 149 more
Only consist of repeated vowel-consonant pairs.
To display these, we can use the regular expression "^(?:[aeiou][^aeiou]){2,}$", where:
- ^ : This anchor asserts the start of the string.
- (?: ... ) : This is a non-capturing group used to group the pattern for a vowel-consonant pair.
- [aeiou] : This character class matches any vowel (a, e, i, o, or u). It’s the first part of the vowel-consonant pair.
- [^aeiou] : This character class matches any character that is not a vowel. It’s the second part of the vowel-consonant pair and matches a consonant.
- {2,} : This quantifier specifies that the preceding pattern (the vowel-consonant pair) must occur at least twice.
- $ : This anchor asserts the end of the string.
words |> str_view(pattern = "^(?:[aeiou][^aeiou]){2,}$")
[64] │ <away>
[265] │ <eleven>
[279] │ <even>
[281] │ <ever>
[436] │ <item>
[573] │ <okay>
[579] │ <open>
[586] │ <original>
[591] │ <over>
[905] │ <unit>
[911] │ <upon>
Question 4
Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!
# Sample passage with mixed spellings
sample_text <- "The airplane is made of aluminum. The analog signal is stronger. Don't be an ass. The center is closed for defense training. I prefer a donut, while she likes a doughnut. His hair is gray, but hers is grey. We're modeling a new project. The skeptic will not believe it. Please summarize the report."

# Define the regular expressions; each must match both spellings
patterns_to_detect <- c(
  "a(?:ir|ero)plane",
  "alumin(?:um|ium)",
  "analog(?:ue)?",
  "ass|arse",
  "cent(?:er|re)",
  "defen(?:se|ce)",
  "do(?:ugh)?nut",
  "gr(?:a|e)y",
  "model(?:ing|ling)",
  "s[kc]eptic",
  "summar(?:ize|ise)"
)
# Find and highlight the spellings; "\\0" in the replacement refers to the
# whole match, so each occurrence is wrapped in ** without being altered
for (pattern in patterns_to_detect) {
  sample_text <- str_replace_all(sample_text, pattern, "**\\0**")
}
sample_text
[1] "The **airplane** is made of **aluminum**. The **analog** signal is stronger. Don't be an **ass**. The **center** is closed for **defense** training. I prefer a **donut**, while she likes a **doughnut**. His hair is **gray**, but hers is **grey**. We're **modeling** a new project. The **skeptic** will not believe it. Please **summarize** the report."
Question 5
Switch the first and last letters in words
. Which of those strings are still words
?
We can switch the first and last letters in words using str_replace_all(), as in the code below, which then displays the new strings that are still present in words:
# Code to switch the first and last letters in words
new_words = words |>
  str_replace_all(pattern = "\\b(\\w)(\\w*)(\\w)\\b",
                  replacement = "\\3\\2\\1")
# Finding which of the new strings are part of the original "words"
tibble(
  original_words = words[new_words %in% words],
  new_words = new_words[new_words %in% words]
) |>
  gt() |>
  cols_label_with(columns = everything(),
                  fn = ~ make_clean_names(., case = "title")) |>
  opt_interactive()
Question 6
Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
A Regular Expression (Regex) is a formalized way to represent a pattern that can match various strings. It is language-independent and can be used in various programming languages, such as Python, R etc. On the other hand, a string that defines a Regular Expression is a regular character string that contains the textual representation of a regular expression. It is not the actual regular expression itself but rather a sequence of characters that programmers use to create and define regular expressions in their code. It is passed as an argument to a function or method provided by the programming language’s regular expression library to create a regular expression object.
Here’s a description of what each of these regular expressions matches:
- ^.*$ : This regular expression matches an entire string. It starts with ^ (caret), which anchors the match to the beginning of the string, followed by .* which matches any number of characters (including none), and ends with $ (dollar sign), which anchors the match to the end of the string. So, it matches any string that contains no line breaks, including an empty one.
- "\\{.+\\}" : This is a string defining a regular expression that matches a pair of curly braces {} with content inside them. The double backslashes \\ are used to escape the curly braces, and .+ matches one or more of any character within the braces. So, it would match strings like “{abc}” or “{123}”.
- \d{4}-\d{2}-\d{2} : This regular expression matches a date-like pattern in the format “YYYY-MM-DD.” Here, \d matches a digit, and {4}, {2}, and {2} specify the exact number of digits for the year, month, and day, respectively. So, it matches strings like “2023-09-14.”
- "\\\\{4}" : This is a string that defines a regular expression which matches four consecutive backslashes. In the string, \\\\ stands for the regular expression \\, which matches a single literal backslash, and {4} specifies that exactly four backslashes must appear consecutively. It matches strings like “\\\\abcd” but not “\\efg” (which contains only two backslashes).
- \..\..\.. : This regular expression matches a literal dot followed by any character, three times in a row. The dot . is a special character in regular expressions, so it’s escaped with a backslash \. to match a literal dot. Thereafter, the unescaped . matches any character, and this pattern is repeated three times. So, it matches strings like “.a.b.c” or “.1.2.3”.
test_string = c("a.bc..de", "1.2.3", "x...y", ".1.2.3", "a.b.c.", ".a.b.c")
test_regex = "\\..\\..\\.."
str_view(test_regex)
[1] │ \..\..\..
tibble(
  test_string = test_string,
  match_result = str_detect(test_string, test_regex)
)
# A tibble: 6 × 2
  test_string match_result
  <chr>       <lgl>
1 a.bc..de    FALSE
2 1.2.3       FALSE
3 x...y       FALSE
4 .1.2.3      TRUE
5 a.b.c.      FALSE
6 .a.b.c      TRUE
- (.)\1\1 : This regular expression matches strings that contain three consecutive identical characters. The parentheses (.) capture any single character, and \1 refers to the first captured character. So, it matches strings like “aaa” or “111.”
- "(..)\\1" : This is a string that represents a regular expression which matches any two characters immediately followed by the same two characters. The (..) captures any two characters, and \\1 refers back to that captured pair. So, it matches strings like “abab” or “1212”.
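The two backreference patterns are easy to confuse, so a small check may help. This is a sketch with made-up test strings; `samples`, `triple` and `pair_twice` are illustrative names.

```r
library(stringr)

samples <- c("aaa", "abab", "aabb", "banana", "1212")

# (.)\1\1 : one character, then the same character twice more
triple <- str_detect(samples, "(.)\\1\\1")

# (..)\1 : a two-character chunk immediately followed by the same chunk
pair_twice <- str_detect(samples, "(..)\\1")

data.frame(samples, triple, pair_twice)
```

Here "aabb" matches neither pattern: it has no character three times in a row, and no two-character chunk that is immediately repeated, whereas "banana" contains "anan".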
16.6.4 Exercises
Question 1
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect()
calls.
Find all words that start or end with x.
The words are displayed below:
# Using a single regular expression
str_view(words, "^x|x$")
[108] │ bo<x>
[747] │ se<x>
[772] │ si<x>
[841] │ ta<x>
# Using a combination of multiple str_detect() calls
words[str_detect(words, "^x") | str_detect(words, "x$")]
[1] "box" "sex" "six" "tax"
Find all words that start with a vowel and end with a consonant.
The words are displayed below:
# Using a single regular expression
pattern_b = "^(?i)[aeiou].*[^aeiou]$"
str_subset(words, pattern_b)
[1] "about" "accept" "account" "across" "act" [6] "actual" "add" "address" "admit" "affect" [11] "afford" "after" "afternoon" "again" "against" [16] "agent" "air" "all" "allow" "almost" [21] "along" "already" "alright" "although" "always" [26] "amount" "and" "another" "answer" "any" [31] "apart" "apparent" "appear" "apply" "appoint" [36] "approach" "arm" "around" "art" "as" [41] "ask" "at" "attend" "authority" "away" [46] "awful" "each" "early" "east" "easy" [51] "eat" "economy" "effect" "egg" "eight" [56] "either" "elect" "electric" "eleven" "employ" [61] "end" "english" "enjoy" "enough" "enter" [66] "environment" "equal" "especial" "even" "evening" [71] "ever" "every" "exact" "except" "exist" [76] "expect" "explain" "express" "identify" "if" [81] "important" "in" "indeed" "individual" "industry" [86] "inform" "instead" "interest" "invest" "it" [91] "item" "obvious" "occasion" "odd" "of" [96] "off" "offer" "often" "okay" "old" [101] "on" "only" "open" "opportunity" "or" [106] "order" "original" "other" "ought" "out" [111] "over" "own" "under" "understand" "union" [116] "unit" "university" "unless" "until" "up" [121] "upon" "usual"
# Using a combination of multiple str_detect() calls
words[str_detect(words, "^(?i)[aeiou]") & str_detect(words, "[^aeiou]$")]
[1] "about" "accept" "account" "across" "act" [6] "actual" "add" "address" "admit" "affect" [11] "afford" "after" "afternoon" "again" "against" [16] "agent" "air" "all" "allow" "almost" [21] "along" "already" "alright" "although" "always" [26] "amount" "and" "another" "answer" "any" [31] "apart" "apparent" "appear" "apply" "appoint" [36] "approach" "arm" "around" "art" "as" [41] "ask" "at" "attend" "authority" "away" [46] "awful" "each" "early" "east" "easy" [51] "eat" "economy" "effect" "egg" "eight" [56] "either" "elect" "electric" "eleven" "employ" [61] "end" "english" "enjoy" "enough" "enter" [66] "environment" "equal" "especial" "even" "evening" [71] "ever" "every" "exact" "except" "exist" [76] "expect" "explain" "express" "identify" "if" [81] "important" "in" "indeed" "individual" "industry" [86] "inform" "instead" "interest" "invest" "it" [91] "item" "obvious" "occasion" "odd" "of" [96] "off" "offer" "often" "okay" "old" [101] "on" "only" "open" "opportunity" "or" [106] "order" "original" "other" "ought" "out" [111] "over" "own" "under" "understand" "union" [116] "unit" "university" "unless" "until" "up" [121] "upon" "usual"
Are there any words that contain at least one of each different vowel?
No, there are no such words in words.
# Using a single regular expression with one lookahead per vowel
pattern_c = "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u).+"
str_subset(words, pattern_c)
character(0)
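The exercise also asks for a version built from multiple str_detect() calls. One possible sketch is shown below: it runs one str_detect() per vowel and combines the results with &. The names `vowels`, `has_all_vowels` and the test word "education" are illustrative additions.

```r
library(stringr)

vowels <- c("a", "e", "i", "o", "u")

# One logical vector per vowel, combined with & so that every vowel
# must be present in a word for it to be kept
has_all_vowels <- function(x) {
  Reduce(`&`, lapply(vowels, function(v) str_detect(x, v)))
}

# Applied to the corpus used in the exercise
words[has_all_vowels(words)]

# Sanity check on words outside the corpus
has_all_vowels(c("education", "regular"))
```

Writing five explicit calls joined by & would work equally well; Reduce() just avoids the repetition.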
Question 2
Construct patterns to find evidence for and against the rule “i before e except after c”?
The code given below provides annotated evidence for and against the rule “i before e except after c”.
# Creating the regexps first to use in stringr functions
pattern_1a = "\\b\\w*ie\\w*\\b"
# Note: \\w+ requires at least one character before "ei", so words that
# start with "ei" (such as "eight" and "either") are not returned
pattern_1b = "\\b\\w+ei\\w*\\b"

pattern_2a = "\\b\\w*cei\\w*\\b"
pattern_2b = "\\b\\w*cie\\w*\\b"

# Words which contain "i" before "e"
words[str_detect(words, pattern_1a)]
[1] "achieve" "believe" "brief" "client" "die"
[6] "experience" "field" "friend" "lie" "piece"
[11] "quiet" "science" "society" "tie" "view"
# Words which contain "e" before an "i", thus giving evidence against
# the rule, unless there is a preceding "c"
words[str_detect(words, pattern_1b)]
[1] "receive" "weigh"
# Words which contain "e" before an "i" after "c", thus following the rule.
# That is, evidence in favour of the rule
words[str_detect(words, pattern_2a)]
[1] "receive"
# Words which contain an "i" before "e" after "c", thus violating the rule.
# That is, evidence against the rule
words[str_detect(words, pattern_2b)]
[1] "science" "society"
Question 3
colors()
contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then removed the colors that are modified).
- The R code col_vec = colours(distinct = TRUE) creates a vector col_vec containing the distinct built-in color names available in R.
- The code col_vec = col_vec[!str_detect(col_vec, "\\b\\w*\\d\\w*\\b")] filters the vector col_vec to exclude color names that contain any digits within them.
- Finally, the code col_vec[str_detect(col_vec, "\\b(?:light|dark)\\w*\\b")] returns the subset of col_vec containing color names that have modifiers like “light” or “dark” in them, effectively identifying color names with modifiers.
col_vec = colours(distinct = TRUE)
col_vec = col_vec[!str_detect(col_vec, "\\b\\w*\\d\\w*\\b")]

col_vec[str_detect(col_vec, "\\b(?:light|dark)\\w*\\b")]
[1] "darkgoldenrod" "darkgray" "darkgreen"
[4] "darkkhaki" "darkmagenta" "darkolivegreen"
[7] "darkorange" "darkorchid" "darkred"
[10] "darksalmon" "darkseagreen" "darkslateblue"
[13] "darkslategray" "darkturquoise" "darkviolet"
[16] "lightblue" "lightcoral" "lightcyan"
[19] "lightgoldenrod" "lightgoldenrodyellow" "lightgray"
[22] "lightgreen" "lightpink" "lightsalmon"
[25] "lightseagreen" "lightskyblue" "lightslateblue"
[28] "lightslategray" "lightsteelblue" "lightyellow"
Question 4
Create a regular expression that finds any base R dataset. You can get a list of these data-sets via a special use of the data()
function: data(package = "datasets")$results[, "Item"]
. Note that a number of old data-sets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.
The following code does the job, and the purpose of each line is explained in the annotations.
# Extract the names of all base R datasets into a character vector
base_r_packs = data(package = "datasets")$results[, "Item"]

# Remove the names of the grouping data frames given in parentheses
base_r_packs = str_replace_all(base_r_packs,
                               pattern = "\\([^()]+\\)",
                               replacement = "")
# Remove the trailing whitespace, i.e., " ", left after removing the parentheses
base_r_packs = str_replace_all(base_r_packs,
                               pattern = "\\s+$",
                               replacement = "")
# Create the regular expression
huge_regex = str_c("\\b(", str_flatten(base_r_packs, "|"), ")\\b")