library(tidyverse)
library(palmerpenguins)
penguins = penguins
Chapter 2
Data Visualization
2.2.5 Exercises
- How many rows are in penguins? How many columns?
The penguins data set has 344 rows and 8 columns.
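As a quick check of these counts, the dimensions can be read directly with base R. This is a minimal sketch; the names n_rows and n_cols are ours, chosen only for illustration.

```r
library(palmerpenguins)

# Number of rows (observations) and columns (variables) in penguins
n_rows = nrow(penguins)
n_cols = ncol(penguins)

# dim() returns both at once, rows first
dim(penguins)
```

Running glimpse(penguins) from the tidyverse would show the same two numbers along with the column types.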
- What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.
First, we list the names of the variables in the penguins data frame, shown in Table 1.
names(penguins) |>
  t() |>
  as_tibble() |>
  gt::gt()
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
---|---|---|---|---|---|---|---|
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
# Finding the details of the variables.
# ?penguins
The variable bill_depth_mm is “a number denoting bill depth (millimeters)” (Gorman, Williams, and Fraser 2014; Horst, Hill, and Gorman 2020).
- Make a scatter-plot of bill_depth_mm vs. bill_length_mm. That is, make a scatter-plot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
The scatter-plot is depicted below.
penguins |>
  ggplot(mapping = aes(x = bill_length_mm,
                       y = bill_depth_mm,
                       col = species)) +
  geom_point() +
  geom_smooth(se = FALSE,
              method = "lm") +
  theme_classic() +
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)")
We now compute the correlations and present them in a table built with the gt (Iannone et al. 2023) and gtExtras (Mock 2022) packages.
# Checking the correlation between the two variables
test1 = function(x) {cor.test(x$bill_length_mm, x$bill_depth_mm)$estimate}

# An empty data-frame to collect results
df = tibble(Penguins = NA,
            Correlation = NA,
            .rows = 4)

# Finding Correlation by each penguin variety
for (y in 1:3) {
  # 'corr' (rather than 'c') avoids masking the base function c()
  corr = penguins |>
    filter(species == unique(penguins$species)[y]) |>
    test1() |>
    format(digits = 2)
  df[y,2] = corr
  df[y,1] = unique(penguins$species)[y]
}

# Converting the nature of 1st column from factor to character
df$Penguins = as.character(df$Penguins)

# Storing the overall correlation
df[4,1] = "Overall"
df[4,2] = penguins |> test1() |> format(digits = 2)

# Displaying the result
gt::gt(df) |>
  gt::tab_header(title = "Correlation Coefficients",
                 subtitle = "Between Bill Length & Bill Depth amongst different penguins") |>
  gtExtras::gt_theme_538() |>
  gtExtras::gt_highlight_rows(rows = 4, fill = "#d4cecd")
Correlation Coefficients
Between Bill Length & Bill Depth amongst different penguins
Penguins | Correlation |
---|---|
Adelie | 0.39 |
Gentoo | 0.64 |
Chinstrap | 0.65 |
Overall | -0.24 |
Thus, the relationship is not apparent in a simple scatter plot: the overall correlation is even negative (-0.24). But if we plot a different colour for each species, we observe a positive correlation between bill length and bill depth within all three species. The correlation is strongest among Chinstrap (0.65) and Gentoo (0.64) penguins.
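The same within-species and overall correlations can also be computed without a loop. The sketch below is an alternative to the approach above, not a replacement for it: it uses cor() with use = "complete.obs" in place of cor.test(), and the names by_species and overall are ours.

```r
library(dplyr)
library(palmerpenguins)

# Correlation within each species; use = "complete.obs" drops rows with NA
by_species = penguins |>
  group_by(species) |>
  summarise(correlation = cor(bill_length_mm, bill_depth_mm,
                              use = "complete.obs"))

# Correlation ignoring species
overall = cor(penguins$bill_length_mm, penguins$bill_depth_mm,
              use = "complete.obs")

by_species
overall
```

The sign flip between the grouped and the pooled result is what the colouring by species makes visible in the plot.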
- What happens if you make a scatter-plot of species vs. bill_depth_mm? What might be a better choice of geom?
If we make a scatter-plot of species vs. bill_depth_mm, the following happens:
penguins |>
  ggplot(mapping = aes(x = species, y = bill_depth_mm)) +
  geom_point() +
  theme_bw() +
  labs(x = "Species", y = "Bill Depth (mm)")
This produces an awkward scatter-plot, since the x-axis variable is discrete rather than continuous, so the points pile up in three vertical strips. A better choice of geom might be a box-plot, which is a good way to present the relationship between a continuous (bill depth) and a categorical (species) variable. The box-plot below shows that the median bill depth (in mm) is lower in Gentoo penguins than in the other two species.
penguins |>
  ggplot(mapping = aes(x = species, y = bill_depth_mm)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = "Species", y = "Bill Depth (mm)")
- Why does the following give an error and how would you fix it?
ggplot(data = penguins) + geom_point()
The above code gives an error because we have passed only the data to the ggplot() call and have not specified the mapping aesthetics, i.e., the x-axis and y-axis variables that geom_point() requires for a scatter plot. We can fix the error as follows, in Figure 4:
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).
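It may help to see where the error in this exercise actually arises. In the sketch below (the names p_bad and err are ours), creating the plot object succeeds; the error is raised only when the plot is built, which is what printing a plot does behind the scenes.

```r
library(ggplot2)
library(palmerpenguins)

# Creating the plot object succeeds even though x and y are missing
p_bad = ggplot(data = penguins) + geom_point()

# The error appears only when the plot is built (printing triggers this)
err = tryCatch(
  {
    ggplot_build(p_bad)
    NULL
  },
  error = function(e) conditionMessage(e)
)

# err now holds the error message as text instead of stopping the script
err
```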
- What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
Within geom_point(), the na.rm argument controls how missing values are reported. If it is set to FALSE, as it is by default, the missing values are removed and the following warning message is displayed:
Warning message: Removed 2 rows containing missing values (`geom_point()`)
If it is set to TRUE, the missing values are removed silently. Here’s the code with na.rm = TRUE that produces Figure 5:
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(na.rm = TRUE)
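To confirm the difference between the two settings, one option is to count the warnings raised while each plot is rendered. This is only a sketch: count_warnings() is a hypothetical helper written for this illustration, not a ggplot2 function, and it renders the plot to a grob with ggplotGrob() rather than drawing it.

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical helper: render a plot to a grob and count the warnings raised
count_warnings = function(plot) {
  n = 0
  withCallingHandlers(
    ggplotGrob(plot),
    warning = function(w) {
      n <<- n + 1
      invokeRestart("muffleWarning")
    }
  )
  n
}

base_plot = ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm))

# Default (na.rm = FALSE): missing rows are dropped with a warning
n_default = count_warnings(base_plot + geom_point())

# na.rm = TRUE: missing rows are dropped silently
n_silent = count_warnings(base_plot + geom_point(na.rm = TRUE))
```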
- Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
The caption is added with the caption argument of the labs() function, as in the code below:
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(na.rm = TRUE) +
  labs(caption = "Data come from the palmerpenguins package.")
- Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?
penguins |>
  ggplot(mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = bill_depth_mm)) +
  geom_smooth()
The code above recreates Figure 7. bill_depth_mm should be mapped to the color aesthetic at the geom level, inside geom_point(). It should not be mapped at the global level, because then geom_smooth() would inherit it as well. Since bill_depth_mm is a continuous variable, not a categorical or factor variable with distinct levels, it cannot define groups for separate smoother lines. Luckily, ggplot2 recognizes this and still produces the same plot by dropping the color aesthetic for the smoother, with the warning “The following aesthetics were dropped during statistical transformation: colour”. So ggplot2 guesses our intentions and the plot works, but the code is not correct. The incorrect code is tested in Figure 8.
penguins |>
  ggplot(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = bill_depth_mm)) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: The following aesthetics were dropped during statistical transformation: colour
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
Warning: Removed 2 rows containing missing values (`geom_point()`).
- Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)) +
  geom_point() +
  geom_smooth(se = FALSE)
On visual inspection, I believe this code should create a scatter plot of penguins’ flipper lengths (on the x-axis) vs. body mass (on the y-axis), with the points coloured by the island the penguins live on. Further, a smoother line is fitted to show the relationship, with a separate line for each island. Since there are three islands, we expect three smoother lines on the plot, without standard error bands.
Now, let us check our predictions against the output of the code in Figure 9.
- Will these two graphs look different? Why/why not?
# Code 1
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth()
# Code 2
ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_smooth(data = penguins,
              mapping = aes(x = flipper_length_mm, y = body_mass_g))
No, these two graphs should look the same, since the data and the aesthetic mappings are identical in both. The only difference is that the second version repeats them in every geom, which is redundant.
Here’s the visual confirmation for both codes in Figure 10.
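Besides comparing the figures by eye, the data that ggplot2 computes for each layer can be compared directly. The following is a sketch under our own naming (p1, p2, quiet, same_points, same_smooth); it relies on layer_data(), which returns the computed data frame for a given layer of a plot.

```r
library(ggplot2)
library(palmerpenguins)

# Code 1: data and mapping given once, at the global level
p1 = ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth()

# Code 2: the same data and mapping repeated in each geom
p2 = ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_smooth(data = penguins,
              mapping = aes(x = flipper_length_mm, y = body_mass_g))

# Silence the smoothing message and the missing-value warnings
quiet = function(expr) suppressWarnings(suppressMessages(expr))

# Layer 1 is the points, layer 2 is the smoother
same_points = isTRUE(all.equal(quiet(layer_data(p1, 1)),
                               quiet(layer_data(p2, 1))))
same_smooth = isTRUE(all.equal(quiet(layer_data(p1, 2)),
                               quiet(layer_data(p2, 2))))
```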
2.4.3 Exercises
- Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
When we assign species to the y aesthetic, we get a horizontal bar plot instead of the vertical bar plot given in the textbook. The results are compared in Figure 11.
p1 = penguins |>
  ggplot(aes(x = species)) +
  geom_bar() +
  labs(caption = "Species on x-axis")
p2 = penguins |>
  ggplot(aes(y = species)) +
  geom_bar() +
  labs(caption = "Species on y-axis")
gridExtra::grid.arrange(p1, p2, ncol = 2)
- How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?
The output of the two plots is in Figure 12.
gridExtra::grid.arrange(
  ggplot(penguins, aes(x = species)) +
    geom_bar(color = "red") +
    labs(caption = "Color = Red"),
  ggplot(penguins, aes(x = species)) +
    geom_bar(fill = "red") +
    labs(caption = "Fill = Red"),
  ncol = 2)
The two plots differ in where the colour red appears. With color, red appears only on the borders of the bars; with fill, it fills the entire bars. Thus, fill is more useful for changing the colour of the bars.
- What does the bins argument in geom_histogram() do?
The bins argument sets the number of bins (i.e., the number of bars) in the histogram. The default value is 30. However, if binwidth is also specified, it overrides the bins argument.
- Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?
g1 = ggplot(diamonds, aes(x = carat)) +
  geom_histogram(fill = "white", color = "black") +
  theme_classic() +
  labs(x = NULL, y = NULL, subtitle = "Default Binwidth")
g2 = ggplot(diamonds, aes(x = carat)) +
  geom_histogram(fill = "white", color = "black", binwidth = 0.1) +
  theme_classic() +
  labs(x = NULL, y = NULL, subtitle = "Binwidth = 0.1")
g3 = ggplot(diamonds, aes(x = carat)) +
  geom_histogram(fill = "white", color = "black", binwidth = 0.2) +
  theme_classic() +
  labs(x = NULL, y = NULL, subtitle = "Binwidth = 0.2")
g4 = ggplot(diamonds, aes(x = carat)) +
  geom_histogram(fill = "white", color = "black", binwidth = 0.3) +
  theme_classic() +
  labs(x = NULL, y = NULL, subtitle = "Binwidth = 0.3")
g5 = ggplot(diamonds, aes(x = carat)) +
  geom_histogram(fill = "white", color = "black", binwidth = 0.5) +
  theme_classic() +
  labs(x = NULL, y = NULL, subtitle = "Binwidth = 0.5")
g6 = ggplot(diamonds, aes(x = carat)) +
  geom_histogram(fill = "white", color = "black", binwidth = 1) +
  theme_classic() +
  labs(x = NULL, y = NULL, subtitle = "Binwidth = 1")
gridExtra::grid.arrange(g1, g2, g3, g4, g5, g6,
                        ncol = 3, nrow = 2)
Thus, we see that the best binwidth is either the default chosen by ggplot2 (30 bins) or a binwidth of 0.2, since these reveal the most interesting patterns.
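The interplay between bins and binwidth described above can be checked on the computed histogram data. This is a small sketch with names of our own choosing (ten_bins, both, n_bins, widths); layer_data() returns one row per bin, with xmin and xmax giving the bin edges.

```r
library(ggplot2)

# Computed histogram data for bins = 10: one row per bin
ten_bins = layer_data(
  ggplot(diamonds, aes(x = carat)) + geom_histogram(bins = 10)
)
n_bins = nrow(ten_bins)

# When both are supplied, binwidth takes precedence over bins
both = layer_data(
  ggplot(diamonds, aes(x = carat)) +
    geom_histogram(bins = 10, binwidth = 0.5)
)

# Width of every bin in the second histogram
widths = both$xmax - both$xmin
```

In the second call the bins = 10 request is effectively ignored: every bin is 0.5 carat wide, and the number of bins follows from the range of the data.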
2.5.5 Exercises
- The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
The code below displays a summary of the mpg data set. The following variables are categorical: manufacturer (manufacturer name), model (model name), trans (type of transmission), drv (the type of drive train: front, rear or 4-wheel), fl (fuel type), and class (type of car). The numerical variables are displ (engine displacement, in litres), year (year of manufacture), cyl (number of cylinders), cty (city miles per gallon) and hwy (highway miles per gallon). We can see the type of each variable in square brackets in the column titled Variable in the output of the code below.
# Visualize summary of the data frame
mpg |>
  summarytools::dfSummary(plain.ascii = FALSE,
                          style = "grid",
                          graph.magnif = 0.75,
                          valid.col = FALSE,
                          na.col = FALSE,
                          headings = FALSE) |>
  view()
If we simply run mpg, we can still see this information in the R console output, from the type labels <chr> (for categorical variables) and <dbl> or <int> (for numerical variables).
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
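The same split into categorical and numerical variables can be obtained programmatically from the column types. This is a minimal sketch that treats character columns as categorical, as the answer above does; the names categorical_vars and numerical_vars are ours.

```r
library(ggplot2)

# Character columns are treated as categorical
categorical_vars = names(mpg)[sapply(mpg, is.character)]

# Double and integer columns are treated as numerical
numerical_vars = names(mpg)[sapply(mpg, is.numeric)]

categorical_vars
numerical_vars
```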
- Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
g1 = mpg |>
  ggplot(aes(x = hwy, y = displ)) +
  geom_point() +
  theme_minimal() +
  labs(caption = "Original Plot")
# Using numerical variable 'cty' to map to colour, size
g2 = mpg |>
  ggplot(aes(x = hwy, y = displ, color = cty)) +
  geom_point() +
  theme_minimal() +
  labs(caption = "cty mapped to color")
g3 = mpg |>
  ggplot(aes(x = hwy, y = displ, size = cty)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(caption = "cty mapped to size")
g4 = mpg |>
  ggplot(aes(x = hwy, y = displ, color = cty, size = cty)) +
  geom_point() +
  theme_minimal() +
  labs(caption = "cty mapped to size and color")
gridExtra::grid.arrange(g1, g2, g3, g4, ncol = 2)
So, we see that we can map a numerical variable to the color or size aesthetics, and ggplot2 will itself build a scale and display a legend. However, numerical (i.e., continuous) variables don’t map to the shape aesthetic, as there is no continuum among shapes. Accordingly, when cty is mapped to shape, the code throws the error below:
# Using numerical variable 'cty' to map to shape aesthetic
mpg |>
  ggplot(aes(x = hwy, y = displ, shape = cty)) +
  geom_point()
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `scale_f()`:
! A continuous variable cannot be mapped to the shape aesthetic
ℹ choose a different aesthetic or use `scale_shape_binned()`
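One way around this error, when the numerical variable has only a few distinct values, is to convert it to a factor before mapping it to shape. The sketch below uses cyl instead of cty; that substitution is ours, made because cyl takes only four values in mpg, whereas cty has too many distinct values for the default shape palette.

```r
library(ggplot2)

# factor(cyl) turns the integer column into a categorical one,
# so it can be mapped to shape without an error
p_shape = ggplot(mpg, aes(x = hwy, y = displ, shape = factor(cyl))) +
  geom_point()

# One shape code is assigned per level of the factor
shapes_used = unique(layer_data(p_shape)$shape)
length(shapes_used)
```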
Thus, the shape aesthetic works only with categorical variables, whereas color works with both numerical and categorical variables. The size aesthetic is best used with numerical variables (it can work with categorical variables, but then the sizes are assigned arbitrarily to the categories).
- In the scatter-plot of hwy vs. displ, what happens if you map a third variable to linewidth?
mpg |>
  ggplot(aes(x = hwy, y = displ, linewidth = cty)) +
  geom_point() +
  theme_minimal()
As we see in Figure 15, nothing changes when the linewidth aesthetic is added. This is because, according to the ggplot2 documentation, linewidth “scales the width of lines and polygon strokes.” Since we are plotting only point geoms and no lines, the aesthetic is not used in producing the output.
- What happens if you map the same variable to multiple aesthetics?
We can map the same variable to multiple aesthetics, and the output will display its variation in all of them. But this is redundant, and it clutters the plot with too much visual input.
For example, Figure 16 shows a hard-to-read plot where the class of the vehicle is mapped to size, shape and color. It works, but the same information is encoded three times.
mpg |>
  ggplot(aes(x = hwy, y = cty, size = class, color = class, shape = class)) +
  geom_point(alpha = 0.5) +
  theme_classic()
- Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?
Figure 17 shows the importance of coloring or faceting by species. Either approach lets us detect a fairly strong positive correlation that was not apparent in the simple scatter plot. This can perhaps be called an example of negative confounding (Mehio-Sibai et al. 2005) of the relation between bill depth and bill length by species.
p1 = penguins |>
  ggplot(mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  theme_classic() +
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)",
       subtitle = "No relation is apparent")
p2 = penguins |>
  ggplot(mapping = aes(x = bill_length_mm, y = bill_depth_mm, col = species)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  theme_classic() +
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)",
       subtitle = "Colouring by species reveals relations")
p3 = penguins |>
  ggplot(mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  facet_wrap(~species) +
  theme_classic() +
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)",
       subtitle = "Faceting also reveals the relations")
lay = rbind(c(1,1,2,2,2),
            c(3,3,3,3,3))
gridExtra::grid.arrange(p1, p2, p3, layout_matrix = lay)
- Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species,
                     shape = species)) +
  geom_point() +
  labs(color = "Species")
This code produces a plot with two legends because, in the last line, we have forced ggplot2 to title the color legend “Species”, while the shape legend keeps its default title, “species”. ggplot2 only merges legends that share the same title and labels, so it treats “species” and “Species” as two separate legends.
We can correct this issue in either of the following two ways. We can drop the labs() call, so that both legends keep the default title:
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species,
                     shape = species)) +
  geom_point()
or we can give both legends the same title:
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = bill_depth_mm,
                     color = species,
                     shape = species)) +
  geom_point() +
  labs(color = "Species", shape = "Species")
- Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?
The plots are produced in Figure 18.
g1 = ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill") +
  labs(subtitle = "Sub-figure A")
g2 = ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill") +
  labs(subtitle = "Sub-figure B")
gridExtra::grid.arrange(g1, g2, ncol = 2)
Sub-figure A answers the question: “On each of the three islands, what proportion of penguins belongs to each species?”
Sub-figure B answers the question about the distribution of each species’ population: “For each penguin species, what proportion of its total population is found on each island?”
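The proportions that the two stacked bar plots display can also be computed as tables, which makes the two questions concrete. This is a sketch using base R; the names counts, by_island and by_species are ours.

```r
library(palmerpenguins)

# Cross-tabulate islands (rows) against species (columns)
counts = table(island = penguins$island, species = penguins$species)

# Sub-figure A: within each island, the share of each species (rows sum to 1)
by_island = prop.table(counts, margin = 1)

# Sub-figure B: within each species, the share found on each island
# (columns sum to 1)
by_species = prop.table(counts, margin = 2)

round(by_island, 2)
round(by_species, 2)
```

For instance, the first table shows that every penguin recorded on Torgersen is an Adelie, while the second shows that all Gentoo penguins were recorded on Biscoe.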
2.6.1 Exercises
- Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?
ggplot(mpg, aes(x = class)) +
  geom_bar()
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
ggsave("mpg-plot.png")
The second plot, i.e., the scatter plot, is saved to the file “mpg-plot.png” in the working directory, because by default ggsave() saves only the most recent plot.
- What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?
To save the plot as a PDF file, we change the file name to “mpg-plot.pdf”; ggsave() then picks the device from the file extension. Alternatively, we can set the argument device = "pdf" in the ggsave() call. We can find out which types of image files work by reading the help for ggsave(), i.e., by running ?ggsave at the command prompt.
The documentation for the device argument of ggsave() tells us that the following work with it: a device function (e.g. png), or one of “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf” (windows only).
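As a sketch of the PDF case, the plot can be stored in an object and passed to ggsave() explicitly, which avoids depending on which plot happened to be the most recent one. The temporary output path and the width and height values below are our own choices for illustration.

```r
library(ggplot2)

# Keep the plot in an object so we can name it explicitly when saving
p = ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Write to a temporary directory; the .pdf extension selects the PDF device
out_file = file.path(tempdir(), "mpg-plot.pdf")
ggsave(filename = out_file, plot = p, width = 6, height = 4)

# Confirm that the file was written
saved = file.exists(out_file)
```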