Chapter 2

Data Visualization

Author

Aditya Dahiya

Published

July 29, 2023

library(tidyverse)
library(palmerpenguins)
penguins = penguins

2.2.5 Exercises

  1. How many rows are in penguins? How many columns?

The number of rows in penguins data-set is 344 and the number of columns is 8

  1. What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.

First, we find out the names of the variables in the penguins data frame in Table 1.

names(penguins) |>
  t() |>
  as_tibble() |>
  gt::gt()
Table 1:

List of variables in the penguins dataset

V1 V2 V3 V4 V5 V6 V7 V8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
# Finding the details of the variables.
# ?penguins

The variable name bill_depth_mm depicts “a number denoting bill depth (millimeters)”.[Gorman, Williams, and Fraser (2014)](Horst, Hill, and Gorman 2020)

  1. Make a scatter-plot of bill_depth_mm vs. bill_length_mm. That is, make a scatter-plot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

The scatter-plot is depicted below.

penguins |>
  ggplot(mapping = aes(x = bill_length_mm,
                       y = bill_depth_mm,
                       col = species)) +
  geom_point() +
  geom_smooth(se = FALSE,
              method = "lm") +
  theme_classic() +
  labs(x = "Bill Length (mm)", y = "Bill Depth (mm)")

Figure 1: Scatterplot of relation between Bill Length and Bill Depth

We now test the correlations, and create a beautiful table using gt (Iannone et al. 2023)and gtExtras packages.(Mock 2022)

# Checking the correlation between the two variables
test1 = function(x) {cor.test(x$bill_length_mm, x$bill_depth_mm)$estimate}

# An empty data-frame to collect results
df = tibble(Penguins = NA,
            Correlation = NA,
            .rows = 4)
# Finding Correlation by each penguin variety
for (y in 1:3) {
  c = penguins |>
      filter(species == unique(penguins$species)[y]) |>
      test1() |>
      format(digits = 2)
  df[y,2] = c
  df[y,1] = unique(penguins$species)[y]
  }
# Converting the nature of 1st column from factor to character
df$Penguins = as.character(df$Penguins)  
# Storing the overall correlation
df[4,1] = "Overall"
df[4,2] = penguins |> test1() |> format(digits = 2)

# Displaying the result
gt::gt(df) |>
  gt::tab_header(title = "Correlation Coefficitents",
                 subtitle = "Between Bill Length & Bill Depth amongst   
                 different penguins") |>
  gtExtras::gt_theme_538() |>
  gtExtras::gt_highlight_rows(rows = 4, fill = "#d4cecd")
Table 2:

Correlation Table amongst different types of penguins

Correlation Coefficitents
Between Bill Length & Bill Depth amongst different penguins
Penguins Correlation
Adelie 0.39
Gentoo 0.64
Chinstrap 0.65
Overall -0.24

Thus, we see that the relation is not apparent on a simple scatter plot, but if we plot a different colour for each species, we observe that there is positive correlation between Bill Length and Bill Depth, in all three species. The strongest correlation is amongst Gentoo and Chinstrap penguins.

  1. What happens if you make a scatter-plot of species vs. bill_depth_mm? What might be a better choice of geom?

    If we make a scatter-plot of species vs. bill_depth_mm, the following happens:-

    penguins |>
      ggplot(mapping = aes(x = species,
                           y = bill_depth_mm)) +
      geom_point() +
      theme_bw() +
      labs(x = "Species", y = "Bill Depth (mm)")

    Figure 2: Scatter plot of species vs. Bill Depth

    This produces an awkward scatter-plot, since the x-axis variable is discrete, and not continuous. A better choice of geom might be a box-plot, which is a good way to present the relationship between a continuous (Bill Depth) and a categorical (species) variable. which shows that the average Bill Depth (in mm) is lower in Gentoo penguins compared to the other two.

    penguins |>
      ggplot(mapping = aes(x = species,
                           y = bill_depth_mm)) +
      geom_boxplot() +
      theme_bw() +
      labs(x = "Species", y = "Bill Depth (mm)")

    Figure 3: Box-plot of species vs. Bill Depth
  2. Why does the following give an error and how would you fix it?

    ggplot(data = penguins) + geom_point()

    The above code will give an error, because we have only given the data to the ggplot call, but not specified the mapping aesthetics, i.e., the x-axis and y-axis for the scatter plot called by the geom_point() . We can fix the error as follows in Figure 4 :---

    ggplot(data = penguins,
           mapping = aes(x = bill_depth_mm,
                         y = bill_length_mm)) +
      geom_point()
    Warning: Removed 2 rows containing missing values (`geom_point()`).

    Figure 4: Corrected code to display the plot
  3. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

    Within the function geom_point() the na.rm argument can do one of the two things. If it is set to FALSE , as it is by default, then the missing values are removed but the following warning message is displayed:–

    Warning message: 
    Removed 2 rows containing missing values (`geom_point()`)

    But, if it is set to na.rm = TRUE, then the missing values are silently removed. Here’s the code with na.rm = TRUE to produce Figure 5 :---

    ggplot(data = penguins,
           mapping = aes(x = bill_depth_mm,
                         y = bill_length_mm)) +
      geom_point(na.rm = TRUE)

    Figure 5: Corrected code to display the plot with na.rm = TRUE
  4. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

    The caption is added here with the labs function with ggplot function below in (fi?)

    ggplot(data = penguins,
           mapping = aes(x = bill_depth_mm,
                         y = bill_length_mm)) +
      geom_point(na.rm = TRUE) +
      labs(caption = "Data come from the palmerpenguins package.")

    Figure 6: Plot with a caption added in ggplot call itself
  5. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

    penguins |>
      ggplot(mapping = aes(x = flipper_length_mm,
                           y = body_mass_g)) + 
      geom_point(mapping = aes(color = bill_depth_mm)) + 
      geom_smooth()

    Figure 7: Recreated figure using the ggplot2 code

    The code above recreates the Figure 7. The aesthetic should bill_depth_mm should be mapped the aesthetic colo in the geom_point() function level. It should not be done at the global level, because then it will even be an aesthetic for geom_smooth resulting in multiple smoother lines fitted for each level of bill_depth_mm , and possible result in an error because bill_depth_mm is not a categorical variable or a factor variable with certain distinct categories or levels.

    Luckily, ggplot2 recognizes this error and still produces the same plot by droppin the color aesthetic, i.e., The following aesthetics were dropped during statistical transformation: colour. So, ggplot2 is trying to guess our intentions, and it works, but the code not correct. The wrong code is tested at Figure 8.

    penguins |>
      ggplot(mapping = aes(x = flipper_length_mm,
                           y = body_mass_g,
                           color = bill_depth_mm)) + 
      geom_point() + 
      geom_smooth()
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
    Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
    Warning: The following aesthetics were dropped during statistical transformation: colour
    ℹ This can happen when ggplot fails to infer the correct grouping structure in
      the data.
    ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
      variable into a factor?
    Warning: Removed 2 rows containing missing values (`geom_point()`).

    Figure 8: The Wrong Code - Recreated figure is the same - but the code is fundamentally flawed
  6. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

    ggplot(data = penguins,
           mapping = aes(x = flipper_length_mm, 
                         y = body_mass_g, 
                         color = island)) +
      geom_point() +
      geom_smooth(se = FALSE)

    On visual inspection, I believe this code should create a scatter plot of penguins flipper lengths (on x-axis) vs. body mass (on y-axis), with the dots coloured by islands on the penguins. Further, a smoother line if fitted to show the relationship, with a separate smoother line for each island type. Thus, since we know there are three types of islands, we expect three smoother lines fitted to the plot, without the display of standard error intervals.

    Now, let us check our predictions with the code in the Figure 9 :--

    Figure 9: Plot generated from running the Code of Question 9
  7. Will these two graphs look different? Why/why not?

    # Code 1
    ggplot(data = penguins,
           mapping = aes(x = flipper_length_mm, 
                         y = body_mass_g)) +
      geom_point() +
      geom_smooth()
    
    # Code 2
    ggplot() +
      geom_point(data = penguins,
                 mapping = aes(x = flipper_length_mm, 
                               y = body_mass_g)
      ) +
      geom_smooth(data = penguins,
                  mapping = aes(x = flipper_length_mm, 
                                y = body_mass_g))

    Yes, these two graphs should look the same. Since, the data and the aesthetics mapped are the same in both. Only difference is that the second code has redundancy.

    Here’s the visual confirmation for both codes in Figure 10.

    Figure 10: Comparison of the two plots produced by the codes in Question 10

2.4.3 Exercises

  1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

    When we assign species to the y-axis, we get a horizontal bar plot, instead of the vertical bar plot given in the textbook. The results are compared in Figure 11 .

    p1 = penguins |>
          ggplot(aes(x = species)) +
          geom_bar() +
          labs(caption = "Species on x-axis")
    p2 = penguins |>
          ggplot(aes(y = species)) +
          geom_bar() +
          labs(caption = "Species on y-axis")
    gridExtra::grid.arrange(p1, p2, ncol = 2)

    Figure 11: Change in figure when species is assigned to y-axis
  2. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

    The output of the two plots is in Figure 12 .

    gridExtra::grid.arrange(
    
    ggplot(penguins, aes(x = species)) + geom_bar(color = "red") +
      labs(caption = "Color = Red"),
    
    ggplot(penguins, aes(x = species)) + geom_bar(fill = "red") +
      labs(caption = "Fill = Red"),
    
    ncol = 2)

    Figure 12: The two plots produced by the code given, with red color vs. red fill

    The two plots are different in where the colour red appears. As a color aesthetic, it appears only on the borders. But, as a fill aesthetic, it fills the entire bar(s).

    Thus, the aesthetic fill is more useful in changing the colour of the bars.

  3. What does the bins argument in geom_histogram() do?

    The bins argument tell the number of bins (i.e. number of bars) in the histogram to be plotted. The default value is 30. However, if the binwidth is also specified, then the binwidth argument over-rides the bins argument.

  4. Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What bin-width reveals the most interesting patterns?

    g1 = ggplot(diamonds, aes(x=carat)) + 
      geom_histogram(fill = "white", color = "black") + 
      theme_classic() + labs(x = NULL, y = NULL, 
                             subtitle = "Default Bindwidth")
    g2 = ggplot(diamonds, aes(x=carat)) + 
        geom_histogram(fill = "white", color = "black", 
                       binwidth = 0.1) + 
        theme_classic() + 
        labs(x = NULL, y = NULL, 
             subtitle = "Bindwidth = 0.1")
    g3 = ggplot(diamonds, aes(x=carat)) + 
        geom_histogram(fill = "white", color = "black", 
                       binwidth = 0.2) + 
        theme_classic() + 
        labs(x = NULL, y = NULL, 
             subtitle = "Bindwidth = 0.2")  
    g4 = ggplot(diamonds, aes(x=carat)) + 
        geom_histogram(fill = "white", color = "black", 
                       binwidth = 0.3) + 
        theme_classic() + 
        labs(x = NULL, y = NULL, 
             subtitle = "Bindwidth = 0.3")
    g5 = ggplot(diamonds, aes(x=carat)) + 
        geom_histogram(fill = "white", color = "black", 
                       binwidth = 0.5) + 
        theme_classic() + 
        labs(x = NULL, y = NULL, 
             subtitle = "Bindwidth = 0.5")
    g6 = ggplot(diamonds, aes(x=carat)) + 
        geom_histogram(fill = "white", color = "black", 
                       binwidth = 1) + 
        theme_classic() + 
        labs(x = NULL, y = NULL, 
             subtitle = "Bindwidth = 1")
    
    gridExtra::grid.arrange(g1, g2, g3, g4, g5, g6,
                            ncol = 3, nrow = 2)

    Figure 13: Histogram with different bin-widths tried out to select the most relevant one

    Thus, we see that best binwidth is either the default binwidth chosen by ggplot2 or the bind-width of 0.2 per bin, since it reveals the most interesting patterns.

2.5.5 Exercises

  1. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

    The code below displays the summary fo the mpg data-set. The following variables are categorical: manufacturer (manufacturer name), model (model name), trans (type of transmission), drv (the type of drive train: front, rear or 4-wheel), fl (fuel type), and class (type of car). The numerical variables are displ (engine displacement, in litres), year (year of manufacture) , cyl (number of cylinders) , cty (city miles per gallon) and hwy (highway miles per gallon). We can see these in the square parenthesis the column titled Variable in the output of the code below .

    # Visualize summary of the data frame
    mpg |>
      summarytools::dfSummary(plain.ascii  = FALSE, 
                              style = "grid", 
                              graph.magnif = 0.75, 
                              valid.col = FALSE,
                              na.col = FALSE,
                              headings = FALSE) |>
      view()

    If we simply run mpg , we can still see this information in the R console output, by the terms <chr> (for categorical variables) ; and, <dbl> or <int> (for numerical variables)

    mpg
    ## # A tibble: 234 × 11
    ##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
    ##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
    ##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
    ##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
    ##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
    ##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
    ##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
    ##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
    ##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
    ##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
    ##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
    ## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
    ## # ℹ 224 more rows
  2. Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

    g1 = mpg |>
          ggplot(aes(x = hwy, y = displ)) +
          geom_point() +
          theme_minimal() +
          labs(caption = "Original Plot")
    # Using numerical variable 'cty' to map to colour, size
    g2 = mpg |>
          ggplot(aes(x = hwy, y = displ, color = cty)) +
          geom_point() +
          theme_minimal()+
          labs(caption = "cty mapped to color")
    g3 = mpg |>
          ggplot(aes(x = hwy, y = displ, size = cty)) +
          geom_point(alpha = 0.5) +
          theme_minimal()+
          labs(caption = "cty mapped to size")
    g4 = mpg |>
          ggplot(aes(x = hwy, y = displ, 
                     color = cty, size = cty)) +
          geom_point() +
          theme_minimal()+
          labs(caption = "cty mapped to size and color")
    gridExtra::grid.arrange(g1, g2, g3, g4, ncol = 2)

    Figure 14: Scatterplots of different kinds for different aesthetic mappings

    So, we see that we can map a numerical variable to color or size aesthetics, and ggplot2 will itself make a scale and display the output with a legend. However, numerical variables (i.e., continuous variables) don’t map to shape aesthetic, as there cannot be any continuum amongst shapes. Accordingly, when mapped to shape, the code throws an error as below:---

    # Using numerical variable 'cty' to map to size aesthetic 
      mpg |>
          ggplot(aes(x = hwy, 
                     y = displ, 
                     shape = cty)) +
          geom_point()
    Error in `geom_point()`:
    ! Problem while computing aesthetics.
    ℹ Error occurred in the 1st layer.
    Caused by error in `scale_f()`:
    ! A continuous variable cannot be mapped to the shape aesthetic
    ℹ choose a different aesthetic or use `scale_shape_binned()`

    Thus, the shape aesthetic works only with categorical variables, whereas color works with both numerical and categorical variables; and, by definition size aesthetic should be used only with numerical variables (it can work with categorical variables, but then the sizes are assigned arbitrarily to different categories).

  3. In the scatter-plot of hwy vs. displ, what happens if you map a third variable to linewidth?

    mpg |>
          ggplot(aes(x = hwy, y = displ,
                     linewidth = cty)) +
          geom_point() +
          theme_minimal()

    Figure 15: Experiment with mapping line width to a third variable

    As we see, nothing changes with addition of the linewidth argument in Figure 15 . This is because the linewidth argument “scales the width of lines and polygon strokes.” in ggplot2 documentation. Since we are only plotting point geoms, and no lines, the argument is useless and not used to produce the output.

  4. What happens if you map the same variable to multiple aesthetics?

    We can map the same variable to multiple aesthetics, and the output will display its variations in all such aesthetics. But it is redundant, and make plot cluttery with too much visual input.

    For example, Figure 16 shows a poorly understandable plot where class of the vehicle is mapped to size, shape and color. It works, but there’s too much information redundancy.

    mpg |>
      ggplot(aes(x=hwy, y = cty,
                 size = class,
                 color = class,
                 shape = class)) +
      geom_point(alpha = 0.5) +
      theme_classic()

    Figure 16: A messy plot with Mutliple aesthetics defined by the same variable
  5. Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

    The Figure 17 shows the importance of coloring or faceting by species. This allows us to detect a fairly strong positive correlation which was not apparent in the simple scatter plot. This, perhaps, can be called an example of negative confounding (Mehio-Sibai et al. 2005) of the relation between bill depth and bill length by the species type.

    p1 = penguins |>
      ggplot(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm)) +
      geom_point() +
      geom_smooth(se = FALSE,
                  method = "lm") +
      theme_classic() +
      labs(x = "Bill Length (mm)", y = "Bill Depth (mm)",
           subtitle = "No relation is apparent")
    
    p2 = penguins |>
      ggplot(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm,
                           col = species)) +
      geom_point() +
      geom_smooth(se = FALSE,
                  method = "lm") +
      theme_classic() +
      labs(x = "Bill Length (mm)", y = "Bill Depth (mm)",
           subtitle = "Colouring  by species reveals relations")
    
    p3 = penguins |>
      ggplot(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm)) +
      geom_point() +
      geom_smooth(se = FALSE,
                  method = "lm") +
      facet_wrap(~species) +
      theme_classic() +
      labs(x = "Bill Length (mm)", y = "Bill Depth (mm)",
           subtitle = "Faceting also reveals the relations")
    
    lay = rbind(c(1,1,2,2,2),
                c(3,3,3,3,3))
    gridExtra::grid.arrange(p1, p2, p3, layout_matrix = lay)

    Figure 17: Adding color by species reveals a strong relationship
  6. Why does the following yield two separate legends? How would you fix it to combine the two legends?

ggplot(data = penguins,   
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm,      
                     color = species, 
                     shape = species)) +   
  geom_point() +   
  labs(color = "Species")

This code presents a plot with two legends because in the last line, we have forced ggplot2 to name out “Color” legend as the string “Species”. Thus, ggplot2 differentiates between “species” and “Species”.

We can correct this issue in either of the following two ways:--

ggplot(data = penguins,   
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm,      
                     color = species, 
                     shape = species)) +   
  geom_point()

or,

ggplot(data = penguins,   
       mapping = aes(x = bill_length_mm, 
                     y = bill_depth_mm,      
                     color = species, 
                     shape = species)) +   
  geom_point()
  1. Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

    The plots are produced in Figure 18 .

    g1 = ggplot(penguins, aes(x = island, 
                         fill = species)) +   
      geom_bar(position = "fill") +
      labs(subtitle = "Sub-figure A")
    
    g2 = ggplot(penguins, aes(x = species, 
                         fill = island)) +   
      geom_bar(position = "fill") +
      labs(subtitle = "Sub-figure B")
    
    gridExtra::grid.arrange(g1, g2, ncol = 2)

    Figure 18: The two stacked bar plots produced by the code

    The Sub-Figure A answers the question, that “On each of the three islands, what proportion of penguins belong to which species?”

    The Sub-Figure B answers the question reg. distribution of the population of each species of penguins, that is, “For each of the penguin species’, what proportion of each species total population is found on which island?”

2.6.1 Exercises

  1. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

    ggplot(mpg, aes(x = class)) +   
    geom_bar() 
    
    ggplot(mpg, aes(x = cty, y = hwy)) +   
    geom_point() 
    
    ggsave("mpg-plot.png")

    The second plot, i.e., the scatter plot is saved into the file “mpg-plot.png” in the working directory, because the function ggsave() saves only the most recent plot into the file.

  2. What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?

    To save the plot as a PDF file, we will need to add the arguments device = "pdf" to the ggsave() function call. We can find out the types of image files that would work by using the help for ggsave() function by running the code ?ggsave at the command prompt.

    The documentation for the device argument within ggsave() function tells us that following image document types work with it:--

    • a device function (e.g. png), or

    • one of “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg” or “wmf” (windows only).

References

Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” Edited by André Chiaradia. PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.
Horst, Allison M, Alison Presmanes Hill, and Kristen B Gorman. 2020. Allisonhorst/Palmerpenguins: V0.1.0. Zenodo. https://doi.org/10.5281/ZENODO.3960218.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2023. “Gt: Easily Create Presentation-Ready Display Tables.” https://CRAN.R-project.org/package=gt.
Mehio-Sibai, Abla, Manning Feinleib, Tarek A. Sibai, and Haroutune K. Armenian. 2005. “A Positive or a Negative Confounding Variable? A Simple Teaching Aid for Clinicians and Students.” Annals of Epidemiology 15 (6): 421–23. https://doi.org/10.1016/j.annepidem.2004.10.004.
Mock, Thomas. 2022. “gtExtras: Extending ’Gt’ for Beautiful HTML Tables.” https://CRAN.R-project.org/package=gtExtras.