Netflix Movie Runtimes vs. Viewership

Creating ridge-density plots and uncovering viewership patterns in Netflix’s movie catalog.

#TidyTuesday

{ggridges}

Density Ridges Plot

Author

Aditya Dahiya

Published

July 30, 2025

About the Data

This analysis uses Netflix viewing data from the TidyTuesday project (Week 30, 2025), which compiles viewing statistics from Netflix’s official Engagement Reports spanning late 2023 through the first half of 2025. The dataset, curated by Jen Richmond from RLadies-Sydney, captures approximately 99% of all Netflix viewing activity, representing over 95 billion hours of content consumption across a diverse range of genres and languages. The data includes two main components: movies and TV shows, with detailed information on viewing hours, release dates, runtime, global availability status, and calculated view counts (derived from hours viewed divided by runtime). This comprehensive dataset provides insights into viewing patterns, content performance over time, and the relationship between release timing and audience engagement on the world’s leading streaming platform.
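The view-count derivation mentioned above is plain arithmetic. A minimal sketch, using made-up numbers rather than values from the dataset (the variable names mirror the dataset's columns, but the figures are hypothetical):

```r
# Hypothetical title: 12 million hours viewed, runtime of 1 hour 30 minutes
hours_viewed  <- 12e6
runtime_hours <- 1 + 30 / 60

# Views = total hours viewed divided by the runtime in hours
views <- hours_viewed / runtime_hours
views
```

Because of this division, two titles with the same hours viewed can have very different view counts if their runtimes differ.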

Figure 1: The distribution of Netflix movie viewership across 13 runtime categories, from under 30 minutes to over 3 hours. Each row shows a density ridge for the spread of viewership within that duration category, overlaid with a boxplot marking the median and quartiles (outlier points are not drawn). Red dots mark the 10% trimmed mean viewership for each category, and the numbers on the right give the count of movies in each runtime bin. The x-axis uses a logarithmic scale to accommodate the wide range of viewership values.

How the Graphic Was Created

This visualization was created in R with the tidyverse, layering several geoms to show how Netflix movie viewership varies across runtime categories. I first converted the runtime strings (e.g., “1H 30M 0S”) into numeric hours by extracting the hour, minute, and second components with regular expressions and passing them to lubridate’s duration(). The movies were then binned into 13 ordered runtime categories using cut(), with breaks running from under 30 minutes to over 3 hours. The plot stacks three complementary geoms: geom_density_ridges() from the ggridges package shows the shape of the viewership distribution within each runtime category, geom_boxplot() marks the median and quartiles (outlier points are suppressed with outliers = FALSE), and geom_point() highlights the 10% trimmed mean viewership for each category. The fill colors come from the “khroma::smoothrainbow” palette via paletteer, and the text uses Google Fonts loaded with showtext, with ggtext rendering the Markdown caption. The x-axis uses a log10 transformation, with labels formatted by scales, to handle the wide range of viewership values, and annotations show the number of movies in each duration category as context for the distributions.
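To make the runtime conversion concrete, here is a minimal base-R sketch of the same idea. It is a stand-in, not the helper used in this post (which relies on stringr and lubridate); the function name runtime_to_hours() is hypothetical, and it handles one string at a time.

```r
# Illustrative base-R stand-in: convert one runtime string such as
# "1H 30M 0S" into decimal hours.
runtime_to_hours <- function(runtime_string) {
  # Return the number that precedes a unit letter, or 0 if the unit is absent
  grab <- function(unit) {
    pattern <- paste0("[0-9]+(?=", unit, ")")
    hit <- regmatches(
      runtime_string,
      regexpr(pattern, runtime_string, perl = TRUE)
    )
    if (length(hit) == 0) 0 else as.numeric(hit)
  }
  grab("H") + grab("M") / 60 + grab("S") / 3600
}

runtime_to_hours("1H 30M 0S")   # one and a half hours
runtime_to_hours("45M 0S")      # no hour component present

# For several strings, apply the function element by element
vapply(c("2H 0M 0S", "1H 15M 0S"), runtime_to_hours, numeric(1))
```

Treating a missing component as zero is what lets strings without an hour part still convert cleanly.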

Loading required libraries

Code

pacman::p_load(
  tidyverse,            # All things tidy
  
  scales,               # Nice Scales for ggplot2
  fontawesome,          # Icons display in ggplot2
  ggtext,               # Markdown text support for ggplot2
  showtext,             # Display fonts in ggplot2
  colorspace,           # Lighten and Darken colours

  patchwork,            # Composing Plots
  ggridges,             # Making ridgeline plot 
  gghalves              # Half geoms in ggplot2
)

movies <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/movies.csv')

shows <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-07-29/shows.csv')

Visualization Parameters

Code

# Font for titles
font_add_google("Saira",
  family = "title_font"
) 

# Font for body text
font_add_google("Saira Condensed",
  family = "body_font"
) 

# Font for the caption and y-axis text
font_add_google("Saira Extra Condensed",
  family = "caption_font"
) 

showtext_auto()

# A base Colour
bg_col <- "white"
seecolor::print_color(bg_col)

# Colour for highlighted text
text_hil <- "grey20"
seecolor::print_color(text_hil)

# Colour for the text
text_col <- "grey20"
seecolor::print_color(text_col)

line_col <- "grey30"

# Define Base Text Size
bts <- 80

# Caption stuff for the plot
sysfonts::font_add(
  family = "Font Awesome 6 Brands",
  regular = here::here("docs", "Font Awesome 6 Brands-Regular-400.otf")
)
github <- "&#xf09b"
github_username <- "aditya-dahiya"
xtwitter <- "&#xe61b"
xtwitter_username <- "@adityadahiyaias"
social_caption_1 <- glue::glue("<span style='font-family:\"Font Awesome 6 Brands\";'>{github};</span> <span style='color: {text_hil}'>{github_username}  </span>")
social_caption_2 <- glue::glue("<span style='font-family:\"Font Awesome 6 Brands\";'>{xtwitter};</span> <span style='color: {text_hil}'>{xtwitter_username}</span>")
plot_caption <- paste0(
  "**Data:** Netflix & Jen Richmond", 
  " |  **Code:** ", 
  social_caption_1, 
  " |  **Graphics:** ", 
  social_caption_2
  )
rm(github, github_username, xtwitter, 
   xtwitter_username, social_caption_1, 
   social_caption_2)

# Add text to plot-------------------------------------------------
plot_title <- "Movie Duration vs. Viewership Patterns"

plot_subtitle <- "While most Netflix movies fall between 90-150 minutes, the highest average viewership belongs to extreme outliers. Short films under 30 minutes and epic movies over 3 hours capture the most viewer attention per title, despite being fewer in number." |> 
  str_wrap(85)

str_view(plot_subtitle)

Exploratory Data Analysis and Wrangling

Code

# Credits: Claude Sonnet 4.0
# Function to convert runtime string to duration
convert_runtime <- function(runtime_string) {
  # Extract hours, minutes, and seconds using regex
  hours <- str_extract(runtime_string, "\\d+(?=H)") |> as.numeric()
  minutes <- str_extract(runtime_string, "\\d+(?=M)") |> as.numeric()
  seconds <- str_extract(runtime_string, "\\d+(?=S)") |> as.numeric()
  
  # Replace NA with 0
  hours <- ifelse(is.na(hours), 0, hours)
  minutes <- ifelse(is.na(minutes), 0, minutes)
  seconds <- ifelse(is.na(seconds), 0, seconds)
  
  # Create duration object
  duration(hours = hours, minutes = minutes, seconds = seconds)
}

# Apply to movies dataframe
movies <- movies |> 
  mutate(
    # Convert duration to a numerical / time duration variable
    runtime_duration = convert_runtime(runtime),
    # Convert runtime to hours for better visualization
    runtime_hours = as.numeric(runtime_duration, "hours")
  )

# Apply to shows dataframe
shows <- shows |> 
  mutate(
    # Convert duration to a numerical / time duration variable
    runtime_duration = convert_runtime(runtime),
    # Convert runtime to hours for better visualization
    runtime_hours = as.numeric(runtime_duration, "hours")
  )


# Explore the data with summarytools
# pacman::p_load(summarytools)

# movies |> 
#   dfSummary() |> 
#   view()
#   
# 
# shows |> 
#   dfSummary() |> 
#   view()


# Some trial examples to detect pattern in shows and movies duration
# shows |> 
#   ggplot(
#     mapping = aes(
#       x = as.numeric(runtime_duration, "hours")
#     )
#   ) +
#   geom_boxplot(alpha = 0.2) +
#   scale_x_log10(name = "Runtime (hours)")
# 
# shows |> 
#   ggplot(
#     mapping = aes(
#       x = as.numeric(runtime_duration, "hours"),
#       y = hours_viewed
#     )
#   ) +
#   geom_point(
#     alpha = 0.3
#   ) +
#   geom_smooth() +
#   scale_x_log10(name = "Runtime (hours)")
#   
#   
# shows |> 
#   ggplot(
#     mapping = aes(
#       x = as.numeric(runtime_duration, "hours"),
#       y = hours_viewed
#     )
#   ) +
#   geom_violin() +
#   scale_x_log10(name = "Runtime (hours)")
# 
# # I took some Ideas from AI: Claude Sonnet 4.0
# library(ggridges)    # For ridge plots
# 
# movies |> 
#   select(runtime_hours, hours_viewed, views) |> 
#   filter(runtime_hours > (1/2) & runtime_hours < 3) |> 
#   ggplot(aes(runtime_hours)) +
#   geom_boxplot()
  
# Planning a RIDGE PLOT - Shows distribution of views across runtime bins

df1 <- movies |> 
  select(runtime_hours, hours_viewed, views) |> 
  mutate(
    runtime_bin = cut(
      runtime_hours, 
      breaks = c(0, 0.5, 0.67, 0.83, 1, 1.17, 1.33, 1.5, 1.67, 1.83, 2, 2.5, 3, Inf),
      labels = c(
        "<30min", 
        "30-40min", 
        "40-50min", 
        "50-60min", 
        "1h-1h10m", 
        "1h10-1h20m", 
        "1h20-1h30m", 
        "1h30-1h40m", 
        "1h40-1h50m", 
        "1h50m-2h", 
        "2h-2h30m", 
        "2h30m-3h", 
        ">3 hours"
      ),
      include.lowest = TRUE,
      ordered_result = TRUE
    )
  ) |> 
  filter(
    !is.na(runtime_bin),
    # is.finite() is FALSE for NA, NaN and infinite values,
    # so a single check covers all of them
    is.finite(views),
    views > 0  # Remove zero or negative views (needed for the log10 axis)
  )

# df1$runtime_bin |> levels()

# pacman::p_load(ggridges, gghalves)

# df1 |> 
#   count(runtime_bin)

df2 <- df1 |> 
  group_by(runtime_bin) |> 
  summarise(
    # 10% trimmed mean: drops the lowest and highest 10% of values per bin
    mean_views = mean(views, na.rm = TRUE, trim = 0.1),
    median_views = median(views, na.rm = TRUE)
  )
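Two details of the wrangling above are easy to miss: cut() builds right-closed intervals by default, and a trimmed mean discards the extremes before averaging. The sketch below illustrates both with made-up numbers; the breaks, labels and values are simplified assumptions, not the ones used for the plot.

```r
# --- Binning: intervals are right-closed, i.e. (lower, upper] ---
runtime_min <- c(25, 30, 45, 90, 95, 200)   # hypothetical runtimes in minutes
bins <- cut(
  runtime_min,
  breaks = c(0, 30, 60, 90, 120, 180, Inf),
  labels = c("<=30min", "30-60min", "1h-1h30m", "1h30m-2h", "2h-3h", ">3h"),
  ordered_result = TRUE
)
# A film of exactly 90 minutes lands in "1h-1h30m", not "1h30m-2h"
data.frame(runtime_min, bin = as.character(bins))

# --- Trimmed mean: robust to a single blockbuster ---
views <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)  # hypothetical view counts
plain_mean   <- mean(views)              # pulled up by the value 1000
trimmed_mean <- mean(views, trim = 0.1)  # drops 1 value from each end
middle       <- median(views)
c(plain = plain_mean, trimmed = trimmed_mean, median = middle)
```

With ten values and trim = 0.1, one observation is removed from each tail, so the trimmed mean here is the average of 2 through 9.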

The Plot

Code

g <- df1 |> 
  ggplot(
    mapping = aes(
      x = views, 
      y = runtime_bin
    )
  ) +
  geom_density_ridges(
    mapping = aes(fill = runtime_bin),
    alpha = 0.7,
    scale = 1.2,
    colour = text_col
  ) +
  geom_boxplot(
    mapping = aes(fill = runtime_bin),
    outliers = FALSE,
    staplewidth = 1,
    linewidth = 0.3,
    width = 0.25,
    na.rm = TRUE,
    colour = text_col,
    alpha = 0.7
  ) +
  geom_point(
    data = df2,
    mapping = aes(x = mean_views),
    size = 6,
    stroke = 0.6,
    pch = 21,
    colour = text_col,
    fill = "red"
  ) +
  geom_label(
    data = df1 |> count(runtime_bin),
    aes(x = 2.5e6, label = scales::number(n, big.mark = ",")),
    size = bts / 3,
    hjust = 0,
    nudge_y = 0.4,
    fill = alpha(bg_col, 0.4),
    label.size = NA,
    family = "body_font",
    label.padding = unit(0.1, "lines")
  ) +
  annotate(
    geom = "label",
    x = 2.5e6,
    y = 13.65,
    label = "Number of movies in\nthe duration category",
    size = bts / 3,
    lineheight = 0.3,
    hjust = 0.5,
    vjust = 0,
    fill = alpha(bg_col, 0.4),
    label.size = NA,
    family = "body_font",
    label.padding = unit(0.1, "lines")
  ) +
  scale_x_continuous(
    labels = label_number(scale = 1e-6, suffix = "M"),
    limits = c(5e4, 5e6),
    transform = "log10",
    expand = expansion(0)
  ) +
  scale_fill_manual(
    values = paletteer::paletteer_d("khroma::smoothrainbow")[seq(from = 3, by = 2, length.out = 13)]
  ) +
  coord_cartesian(clip = "off") +
  labs(
    x = "Viewership, on log scale (millions)", 
    y = "Duration of the Movies",
    title = plot_title,
    subtitle = plot_subtitle,
    caption = plot_caption
  ) +
  ggthemes::theme_map(
    base_family = "body_font",
    base_size = bts
  ) +
  theme(
    
    legend.position = "none",
    
    # Overall
    text = element_text(
      margin = margin(0,0,0,0, "mm"),
      colour = text_col,
      lineheight = 0.3
    ),
    
    # Labels and Strip Text
    plot.title = element_text(
      margin = margin(5,0,5,0, "mm"),
      hjust = 0.5,
      vjust = 0.5,
      colour = text_hil,
      size = 3 * bts,
      family = "body_font"
      ),
    plot.subtitle = element_text(
      margin = margin(2,0,2,10, "mm"),
      vjust = 0.5,
      colour = text_hil,
      size = 1.1 * bts,
      hjust = 0,
      family = "body_font",
      lineheight = 0.3
      ),
    plot.caption = element_textbox(
      margin = margin(5,0,5,0, "mm"),
      hjust = 0.5,
      halign = 0.5,
      colour = text_hil,
      size = bts,
      family = "caption_font"
    ),
    plot.caption.position = "plot",
    plot.title.position = "plot",
    plot.margin = margin(0,0,0,0, "mm"),
    
    # Axes Lines, Ticks, Text and labels
    axis.line = element_line(
      colour = text_hil,
      linewidth = 0.3,
      arrow = arrow(length = unit(5, "mm"))
    ),
    axis.ticks = element_line(
      colour = text_hil,
      linewidth = 0.3
    ),
    axis.ticks.length = unit(2, "mm"),
    axis.text.x = element_text(
      margin = margin(1,1,1,1, "mm"),
      colour = text_hil
    ),
    axis.text.y = element_text(
      margin = margin(1,1,1,1, "mm"),
      colour = text_hil,
      family = "caption_font",
      size = 0.75 * bts,
      vjust = 0
    ),
    axis.title.y = element_text(
      colour = text_hil,
      margin = margin(2,2,2,2, "mm"),
      size = bts * 1.5
    ),
    axis.title.x = element_text(
      colour = text_hil,
      margin = margin(2,2,2,2, "mm"),
      size = bts * 1.5
    )
    
  )

ggsave(
  filename = here::here(
    "data_vizs",
    "tidy_netflix_data.png"
  ),
  plot = g,
  width = 400,
  height = 500,
  units = "mm",
  bg = bg_col
)
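One consequence of setting limits inside scale_x_continuous() is worth spelling out: by default, ggplot2 scales replace values outside the limits with NA before the statistics are computed, so the ridges and boxplots describe only the titles inside the axis range, while a summary computed beforehand (such as the trimmed means) uses every title. A minimal sketch of the effect, using hypothetical view counts and the same limits as the plot:

```r
# Hypothetical view counts; the limits match the plot's x-axis
views  <- c(2e4, 8e4, 3e5, 1.2e6, 4e6, 9e6, 1.5e8)
limits <- c(5e4, 5e6)

# On a log10 scale the axis spans two decades
decades <- log10(limits[2]) - log10(limits[1])

# Values outside the limits would be censored (set to NA) by the scale
in_range  <- views >= limits[1] & views <= limits[2]
n_kept    <- sum(in_range)
n_dropped <- sum(!in_range)

# A statistic computed on the censored data can differ from the full data
median_all  <- median(views)
median_kept <- median(views[in_range])
c(kept = n_kept, dropped = n_dropped,
  median_all = median_all, median_kept = median_kept)
```

If the statistics should reflect all titles, one option is to zoom with coord_cartesian(xlim = ...) instead, which changes the visible range without discarding data.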

Saving the thumbnail for the webpage

Code

library(magick)

# Resize the saved plot to a height of 400 px (aspect ratio preserved)
# and write it out as the thumbnail for the webpage
image_read(here::here("data_vizs", 
                      "tidy_netflix_data.png")) |> 
  image_resize(geometry = "x400") |> 
  image_write(
    here::here(
      "data_vizs", 
      "thumbnails", 
      "tidy_netflix_data.png"
    )
  )

Session Info

Code

pacman::p_load(
  tidyverse,            # All things tidy
  
  scales,               # Nice Scales for ggplot2
  fontawesome,          # Icons display in ggplot2
  ggtext,               # Markdown text support for ggplot2
  showtext,             # Display fonts in ggplot2
  colorspace,           # Lighten and Darken colours

  patchwork             # Composing Plots
)

sessioninfo::session_info()$packages |> 
  as_tibble() |> 
  dplyr::select(package, 
         version = loadedversion, 
         date, source) |> 
  dplyr::arrange(package) |> 
  janitor::clean_names(
    case = "title"
  ) |> 
  gt::gt() |> 
  gt::opt_interactive(
    use_search = TRUE
  ) |> 
  gtExtras::gt_theme_espn()

Table 1: R Packages and their versions used in the creation of this page and graphics
