Chapter 8: Geographic data I/O

Key Learnings from, and Solutions to the exercises in Chapter 8 of the book Geocomputation with R by Robin Lovelace, Jakub Nowosad and Jannes Muenchow.

Geocomputation with R
Textbook Solutions
Author

Aditya Dahiya

Published

January 19, 2025

Code
library(sf)        # Simple Features in R
library(terra)     # Handling rasters in R
library(tidyterra) # For plotting rasters in ggplot2
library(magrittr)  # Using pipes with raster objects
library(tidyverse) # All things tidy; Data Wrangling
library(spData)    # Spatial Datasets
library(patchwork) # Composing plots
library(gt)        # Display GT tables with R

8.1 Introduction

  • Importance of Data I/O: Geographic data input/output (I/O) is critical for geocomputation, ensuring accurate real-world applications and enabling data sharing.
  • Impact of Errors: Mistakes in data I/O (e.g., using outdated datasets) can cause significant issues later in projects.

8.2 File Formats

  • Geographic data storage: Stored as files or spatial databases. File formats can store either vector or raster data; databases can handle both, and the most common types are shown in Table 1.
Code
# A Code to fetch icons of different file formats
# Get file icons for common file types
file_format = c(
  "Shapefile", "GeoJSON", "KML", "GPKG", "CSV", "DXF", "MapInfo File", "GML", 
  "PostGIS", "TopoJSON", "GeoTIFF", "NetCDF", "JPEG", "PNG", "HDF4", "HDF5", 
  "GRIB", "Erdas Imagine", "ASCII Grid", "MBTiles")

temp_df <- tibble(
  file_format = file_format,
  id = 1:20
)

# Get a custom google search engine and API key
# Tutorial: https://developers.google.com/custom-search/v1/overview
# Tutorial 2: https://programmablesearchengine.google.com/
# google_api_key <- "LOAD YOUR GOOGLE API KEY HERE"
# my_cx <- "GET YOUR CUSTOM SEARCH ENGINE ID HERE"


# Load necessary packages
library(httr)
library(magick)

# Define function to download and save movie poster
download_icons <- function(file_format_name, id) {
  
  api_key <- google_api_key
  cx <- my_cx
  
  # Build the API request URL
  url <- paste0("https://www.googleapis.com/customsearch/v1?q=", 
                URLencode(paste0(file_format_name, " file icon svg")), 
                "&cx=", cx, 
                "&searchType=image&key=", api_key)
  
  # Make the request
  response <- GET(url)
  result <- content(response, "parsed")
  
  # Get the URL of the first image result
  image_url <- result$items[[1]]$link
  
  im <- magick::image_read(image_url) |> 
    image_resize("x100")
  
  # set background as white
  image_write(
    image = im,
    path = here::here("book_solutions", "images", "temp_icons", 
                      paste0("temp_icon_", id,".png")),
    format = "png"
    )
}

for (i in 1:20) {
  download_icons(
    file_format_name = temp_df$file_format[i],
    id = i
  )
}


# for (i in c(1:3, 5:14, 16:17, 19)) {
#   download_icons(
#     file_format_name = temp_df$file_format[i],
#     id = i
#   )
# }

# Some custom work
image_url_4 <- "https://icons.veryicon.com/png/o/construction-tools/supermap-gis-product-color-system-function/postgis-workspace.png"

image_url_15 <- "https://cdn.iconscout.com/icon/premium/png-256-thumb/hdf-file-2538306-2129659.png"

image_url_18 <- "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQA0Pb2JqaMO5UGuYCgZzRi4BXQCw9lbmTDvg&s"

image_url_20 <- "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlucEC8-sm9U0IXIfY-wh_c3nBFKLyg0JJdQ&s"

image_url_19 <- "https://cdn.iconscout.com/icon/premium/png-256-thumb/ascii-file-1934516-1634566.png?f=webp&w=256"

download_icon_custom <- function(id) {
  im <- magick::image_read(get(paste0("image_url_", id))) |>
    image_resize("x100")

  # set background as white
  image_write(
    image = im,
    path = here::here("book_solutions", "images", "temp_icons",
                      paste0("temp_icon_", id,".png")),
    format = "png"
    )
}

for (i in c(4, 15, 18, 20)) {
  download_icon_custom(i)
}

image_url_7 <- "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR0Zb1IW-Ui71fJQy4G5kj3Ue2zQop38Kn1mA&s"

image_url_12 <- "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTr8HP5xS8im3nK5Zm8l4iP8zDrGhoiakFbCA&s"

for (i in c(7, 12, 19)) {
  download_icon_custom(i)
}

image_url_11 <- "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSHPryNCuawMSnTpre5hHZZiGuWqAhRQsjWQg&s"

download_icon_custom(11)
Code
# Get file icons for common file types
file_format = c(
  "Shapefile", "GeoJSON", "KML", "GPKG", "CSV", "DXF", "MapInfo File", "GML", 
  "PostGIS", "TopoJSON", "GeoTIFF", "NetCDF", "JPEG", "PNG", "HDF4", "HDF5", 
  "GRIB", "Erdas Imagine", "ASCII Grid", "MBTiles")

temp_df <- tibble(
  file_format = file_format,
  id = 1:20
)


# Raster formats tibble
raster_formats <- tibble(
  Format = c(
    "GeoTIFF", "NetCDF", "JPEG", "PNG", "HDF4", "HDF5", 
    "GRIB", "Erdas Imagine", "ASCII Grid", "MBTiles"
  ),
  Description = c(
    "Tagged Image File Format for georeferenced raster data.",
    "Network Common Data Form for array-oriented scientific data.",
    "Compressed image format commonly used for photographic images.",
    "Portable Network Graphics format for raster graphics.",
    "Hierarchical Data Format v4 for scientific data.",
    "Hierarchical Data Format v5, successor to HDF4.",
    "Gridded Binary format for meteorological data.",
    "Raster format for Erdas Imagine software.",
    "ASCII representation of raster data.",
    "SQLite-based raster tile storage format."
  ),
  Extension = c(
    ".tif/.tiff", ".nc", ".jpg/.jpeg", ".png", ".hdf", 
    ".h5", ".grb/.grib", ".img", ".asc", ".mbtiles"
  ),
  GeoReferencing = c(
    "Yes", "Yes", "No", "No", "Yes", "Yes", 
    "Yes", "Yes", "Yes", "Yes"
  )
) |> 
  mutate(group = "Raster File Formats")

# Vector formats tibble
vector_formats <- tibble(
  Format = c("Shapefile", "GeoJSON", "KML", "GPKG", "CSV", "DXF", 
             "MapInfo File", "GML", "PostGIS", "TopoJSON"),
  Description = c(
    "Esri Shapefile format for vector geometries.",
    "GeoJSON format for encoding a variety of geographic data structures.",
    "Keyhole Markup Language for geographic data in XML.",
    "Geopackage format for spatial data storage.",
    "Comma-Separated Values for point data with attributes.",
    "Drawing Exchange Format for CAD data.",
    "MapInfo TAB format for vector geometries.",
    "Geography Markup Language for XML-based vector data.",
    "PostGIS spatial extension for PostgreSQL database.",
    "TopoJSON format for topology-preserving JSON data."
  ),
  Extension = c(".shp", ".geojson", ".kml", ".gpkg", ".csv", ".dxf", 
                ".tab", ".gml", "PostgreSQL", ".topojson"),
  GeoReferencing = c(
    "Yes", "Yes", "Yes", "Yes", "No", "No", 
    "Yes", "Yes", "Yes", "Yes"
  )
) |> 
  mutate(
    group = "Vector File Formats"
  )

# Display vector formats table
bind_rows(
  vector_formats,
  raster_formats
) |> 
  left_join(temp_df, by = join_by(Format == file_format)) |> 
  mutate(file_path = paste0("images/temp_icons/temp_icon_", id, ".png")) |> 
  select(-id) |> 
  group_by(group) |> 
  gt() |> 
  cols_label(
    Format = "File Format",
    Description = "Description",
    Extension = "File Extension",
    GeoReferencing = "Geo-referencing",
    file_path = "File Icon"
  ) |> 
  tab_footnote(
    md("_**Source**: GDAL documentation (https://gdal.org/)_")
  ) |> 
  tab_header(
    title = "Different kinds of file formats for geographical data"
  ) |> 
  gtExtras::gt_theme_espn() |> 
  fmt_image(
    columns = file_path
  ) |> 
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_body(
      columns = Format
    )
  ) |> 
  tab_style(
    style = list(
      cell_text(color = "darkgreen", weight = "bold")
    ),
    locations = cells_body(
      rows = GeoReferencing == "Yes",
      columns = GeoReferencing
    )
  ) |> 
  tab_style(
    style = list(
      cell_text(color = "red", weight = "bold")
    ),
    locations = cells_body(
      rows = GeoReferencing == "No",
      columns = GeoReferencing
    )
  ) |> 
  tab_style(
    style = list(
      cell_text(align = "center")
    ),
    locations = cells_body(
      columns = c(Extension, GeoReferencing)
    )
  ) |> 
  tab_style(
    style = list(
      cell_text(align = "left")
    ),
    locations = cells_body(
      columns = Description
    )
  ) |> 
  tab_style(
    style = list(
      cell_fill(color = "lightgrey"),    
      cell_text(weight = "bold", align = "center")      
    ),
    locations = cells_row_groups()           
  ) |> 
  tab_style(
    style = list(
      cell_text(align = "center")
    ),
    locations = cells_title()
  )
Table 1: Different kinds of File Formats dealing with Geo-Spatial Data
Different kinds of file formats for geographical data
File Format Description File Extension Geo-referencing File Icon
Vector File Formats
Shapefile Esri Shapefile format for vector geometries. .shp Yes
GeoJSON GeoJSON format for encoding a variety of geographic data structures. .geojson Yes
KML Keyhole Markup Language for geographic data in XML. .kml Yes
GPKG Geopackage format for spatial data storage. .gpkg Yes
CSV Comma-Separated Values for point data with attributes. .csv No
DXF Drawing Exchange Format for CAD data. .dxf No
MapInfo File MapInfo TAB format for vector geometries. .tab Yes
GML Geography Markup Language for XML-based vector data. .gml Yes
PostGIS PostGIS spatial extension for PostgreSQL database. PostgreSQL Yes
TopoJSON TopoJSON format for topology-preserving JSON data. .topojson Yes
Raster File Formats
GeoTIFF Tagged Image File Format for georeferenced raster data. .tif/.tiff Yes
NetCDF Network Common Data Form for array-oriented scientific data. .nc Yes
JPEG Compressed image format commonly used for photographic images. .jpg/.jpeg No
PNG Portable Network Graphics format for raster graphics. .png No
HDF4 Hierarchical Data Format v4 for scientific data. .hdf Yes
HDF5 Hierarchical Data Format v5, successor to HDF4. .h5 Yes
GRIB Gridded Binary format for meteorological data. .grb/.grib Yes
Erdas Imagine Raster format for Erdas Imagine software. .img Yes
ASCII Grid ASCII representation of raster data. .asc Yes
MBTiles SQLite-based raster tile storage format. .mbtiles Yes

Source: GDAL documentation (https://gdal.org/)

  • GDAL (Geospatial Data Abstraction Library):
    • Unified library for reading/writing raster and vector formats since the year 2000.
    • Powers GIS software like GRASS GIS, ArcGIS, and QGIS.
    • Supports over 200 vector and raster formats.
  • Common vector file formats:
    • ESRI Shapefile: Multi-file format; limited by 2GB size, 255 columns, and 10-character names.
    • GeoJSON: JSON-based, supports coordinates; open format.
    • KML/KMZ: XML-based; used for Google Earth.
    • GPX: GPS data exchange.
    • FlatGeobuf: Fast, single-file with streaming capabilities.
  • Common raster file formats:
    • GeoTIFF: Popular raster format embedding CRS; supports COG (Cloud Optimized GeoTIFF) for efficient online access.
    • Arc ASCII: Text-based raster format with header and cell values.
  • Database formats:
    • SQLite/SpatiaLite: Lightweight relational database with spatial extensions.
    • ESRI FileGDB: Proprietary database format supporting feature classes and topology.
    • GeoPackage: Open standard SQLite-based lightweight spatial database for vector, raster, and non-spatial data.
  • Open standards by OGC (Open Geospatial Consortium):
    • Promotes transparency and user-driven development.
    • Formats like GML, KML, and GeoPackage adhere to OGC standards.
ShapeFiles and their Limitations

ESRI Shapefiles have been a cornerstone in geospatial data exchange since their introduction in the early 1990s.

Reasons for Shapefile Popularity:

  1. Historical Prevalence: Introduced by ESRI in the early 1990s, Shapefiles became widely adopted due to their early entry into the GIS market.

  2. Simplicity and Compatibility: Shapefiles are straightforward to use and are supported by a vast array of GIS software, facilitating easy sharing and exchange of geographic information across different platforms.

  3. Open Specification: Despite being a proprietary format, ESRI published the technical specifications for Shapefiles, allowing other software developers to implement support for them, which contributed to their widespread adoption.

Contents of a ShapeFile Folder

  • .shp: Stores geometric shape data (points, lines, polygons).
  • .shx: Index file for quick spatial queries of shapes.
  • .dbf: Attribute table with associated data in tabular format.
  • .prj: Coordinate system and projection information.
  • .cpg: Encoding of character data (e.g., UTF-8).
  • .sbn/.sbx: Optional spatial index for faster searches.
  • .qpj: Alternative projection format used by some software.

Limitations of Shapefiles:

  1. Multi-File Structure: A single Shapefile consists of multiple files (e.g., .shp, .shx, .dbf), which can complicate file management and increase the risk of losing associated files.

  2. Attribute Constraints: Field names in the attribute table are limited to 10 characters, and the maximum number of fields is 255, restricting the complexity of attribute data that can be stored.

  3. File Size Limitation: Shapefiles have a maximum size limit of 2 GB, which can be restrictive for large datasets.

  4. Lack of Topological Information: Shapefiles do not store topological relationships between features, limiting their utility in analyses that require such information.

  5. Limited Data Types and Encoding: Shapefiles have poor support for Unicode, limiting international character representation, and they cannot store null values for numeric or text fields, which can lead to data inaccuracies.

Emerging Alternatives:

  1. GeoPackage: An open, standards-based format developed by the Open Geospatial Consortium (OGC), GeoPackage uses a single SQLite database file to store both vector and raster data, overcoming many of the limitations associated with Shapefiles.

  2. File Geodatabase: ESRI’s File Geodatabase offers a more robust solution with no size limitations, support for complex data types, and improved performance, making it a suitable alternative for users within the ESRI ecosystem.

  3. GeoJSON: A format based on JavaScript Object Notation (JSON), GeoJSON is widely used for web applications and supports a variety of geometry types, making it a flexible alternative for sharing geospatial data, especially in web environments.

  4. FlatGeobuf: A modern, open format designed for performant reading and writing of geospatial data, FlatGeobuf supports spatial indexing and is suitable for web and cloud environments.

  • Other new emerging formats:
    • LAS/LAZ: For lidar point clouds.
    • NetCDF/HDF: For multidimensional arrays.
    • Emerging formats: GeoArrow, Zarr.
  • Non-spatial formats:
    • Tabular formats (e.g., CSV, Excel) used for simple spatial data sharing.
    • Lack support for complex geometries and spatial metadata like CRS.
  • Additional resources:

8.3 Data input (I)

  • Loading vector data: Use sf::read_sf() to load vector data into R.
  • Loading raster data: Use terra::rast() to read raster data.
  • Data storage: Imported data is stored in RAM and becomes accessible in the .GlobalEnv of the session.
  • Workspace management: Imported objects can be listed with ls() and are visible in IDE ‘Environment’ panels.

8.3.1 Vector Data

  • Spatial vector data formats: Common formats include .geojson, .gpkg, .shp. Import these with sf::read_sf() or sf::st_read(), which rely on GDAL drivers.

  • Driver overview: Use sf::st_drivers() to see available GDAL drivers, their features, and file format capabilities (e.g., vector, raster support).

Code
st_drivers("vector") |> 
  as_tibble() |> 
  select(name, long_name, write, copy, is_raster) |> 
  mutate(
    write = if_else(write, "Yes", NA),
    copy = if_else(copy, "Yes", NA),
    is_raster = if_else(is_raster, "Yes", NA)
  ) |> 
  gt::gt() |> 
  cols_label(
    name = "Name",
    long_name = "Full Name of file format",
    write = "Can Write?",
    copy = "Can copy?",
    is_raster = "Also handles rasters?"
  ) |> 
  sub_missing(missing_text = "") |> 
  tab_style(
    locations = cells_body(
      columns = c(write, copy, is_raster)
    ),
    style = list(
      cell_text(
        align = "center"
      )
    )
  ) |> 
  tab_style(
    locations = cells_body(
      columns = name
    ),
    style = list(
      cell_text(
        font = "monospace"
      )
    )
  ) |> 
  opt_interactive()
Table 2: The drivers available to GDAL, and therefore {sf}, in R, accessed using st_driver()
  • read_sf() arguments and their purposes: —

    • File input:

      • dsn parameter specifies file paths, folders, database credentials, or GeoJSON strings.
      • Drivers are auto-detected from file extensions like .gpkg or .shp.
    • Multiple layers: Some formats (e.g., .gpkg) support multiple data layers. Use the layer argument in read_sf() to specify the desired layer.

    • Subset data reading:

      • query: Use OGR SQL queries for conditional imports (e.g., filter rows by column values).
      • wkt_filter: Filter geometries using Well-Known Text (WKT) spatial queries (e.g., buffers around a region).
  • Driver-specific options:

    • Formats like .csv require specifying coordinate column names via the options parameter. Refer to GDAL driver documentation for details.
  • Coordinate Reference Systems (CRS): If missing, assign CRS using sf::st_set_crs().

  • Other formats: .kml: Use sf::st_layers() to list layers in KML files before reading specific layers.

  • Alternative tools: duckdb, an R interface with spatial extension, offers additional options for data import and querying.

Note

Integrating DuckDB’s spatial extension with R and the {sf} package

  1. Install and Load DuckDB’s Spatial Extension: This setup enables geospatial data processing capabilities within DuckDB.

    library(DBI)
    con <- dbConnect(duckdb::duckdb())
    dbExecute(con, "INSTALL spatial;")
    dbExecute(con, "LOAD spatial;")
  2. Import Geospatial Data into DuckDB: This command reads the specified shapefile and creates a table named spatial_data containing the geospatial information.

    CREATE TABLE spatial_data AS
    SELECT * FROM ST_Read('path/to/your/file.shp');
  3. Perform Spatial Queries in DuckDB: With the spatial extension, you can execute spatial functions such as ST_Intersects, ST_Within, and ST_Contains directly within DuckDB.

    SELECT *
    FROM spatial_data
    WHERE ST_Within(geometry_column, ST_GeomFromText('POLYGON((...))'));
  4. Export Data to R for Use with {sf}: After processing, you can export the geospatial data back to R and convert it into an {sf} object for further analysis or visualization.

    spatial_df <- dbGetQuery(con, "SELECT * FROM spatial_data;")
    library(sf)
    sf_object <- st_as_sf(spatial_df, wkt = "geometry_column")

8.3.2 Raster data

  • Raster File Formats:
    • Raster data supports single-layer and multi-layer files.
    • Use the rast() function from the terra package to load raster files.
    • Reading Online Raster Data: Use the /vsicurl/ prefix to read raster files from web resources (e.g., HTTP, FTP).
Note

Cloud Optimized GeoTIFF (COG) and GeoTIFF Files

  • GeoTIFF:
    • GeoTIFF is a public domain metadata standard that allows geo-referencing information to be embedded within a TIFF file.
    • Stores raster data with spatial reference, essential for GIS and remote sensing applications.
    • Includes metadata like coordinate system, map projections, and georeferencing tags.
  • Cloud Optimized GeoTIFF (COG):
    • A specialized format of GeoTIFF designed for efficient access and processing over the web.
    • COG files can be accessed using URLs with the /vsicurl/ prefix in GDAL-supported libraries. Functions like terra::crop() or terra::extract() fetch required data portions on demand, without downloading the full dataset.
    • Thus, /vsicurl/ prefix establishes a connection to the COG file without loading it into RAM. Values are accessed only during specific operations, e.g., crop() or extract().
  • Advantages of COG:
    • Streaming Access: Reads only the necessary portions of data (e.g., specific coordinates), minimizing memory usage.
    • Compatibility: Readable by tools like GDAL, Rasterio, and R packages like terra and stars.
  • Virtual File Systems (VFS): GDAL provides several Virtual File System (VFS) prefixes (shown in Table 3) that allow direct access to various file systems, including compressed archives, network resources, and in-memory files. These prefixes can be utilized in R using functions like sf::read_sf() for vector data and terra::rast() for raster data. Source GDAL Virtual File Systems.
Code
# Create the tibble
file_access_table <- tibble(
  Prefix = c(
    "/vsicurl/",
    "/vsizip/",
    "/vsi7z/",
    "/vsis3/",
    "/vsigs/",
    "/vsicurl_streaming/"
  ),
  Description = c(
    "Enables GDAL to access files over [HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP), [HTTPS](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview), or [FTP](https://en.wikipedia.org/wiki/File_Transfer_Protocol) protocols, allowing direct reading of remote datasets such as [GeoTIFFs](https://gdal.org/drivers/raster/gtiff.html) or [Shapefiles](https://gdal.org/drivers/vector/shapefile.html) hosted online.",
    "Allows reading files directly from [ZIP archives](https://en.wikipedia.org/wiki/Zip_(file_format)) without extraction. ZIP files can contain various geospatial data formats like [Shapefiles](https://gdal.org/drivers/vector/shapefile.html), [GeoJSON](https://gdal.org/drivers/vector/geojson.html), or [KML files](https://gdal.org/drivers/vector/kml.html).",
    "Supports reading files directly from [7z archives](https://en.wikipedia.org/wiki/7z), a high-compression format. These archives can include geospatial files such as [GeoTIFFs](https://gdal.org/drivers/raster/gtiff.html) or [Shapefiles](https://gdal.org/drivers/vector/shapefile.html).",
    "Facilitates access to files stored in [Amazon S3](https://aws.amazon.com/s3/) buckets, enabling operations on remote datasets like [GeoTIFFs](https://gdal.org/drivers/raster/gtiff.html) or [Shapefiles](https://gdal.org/drivers/vector/shapefile.html) stored in the cloud.",
    "Provides access to files stored in [Google Cloud Storage](https://cloud.google.com/storage/), allowing GDAL to read datasets such as [GeoTIFFs](https://gdal.org/drivers/raster/gtiff.html) or [Shapefiles](https://gdal.org/drivers/vector/shapefile.html) directly from cloud storage.",
    "Enables streaming of files over [HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP), [HTTPS](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview), or [FTP](https://en.wikipedia.org/wiki/File_Transfer_Protocol) without seeking capabilities, suitable for sequential reading of large remote datasets like [GeoJSON](https://gdal.org/drivers/vector/geojson.html) files."
  )
)

# Display the tibble using gt
file_access_table |> 
  gt() |> 
  fmt_markdown(columns = Description) |> 
  tab_header(
    title = "Virtual File(s) Access Methods with GDAL",
    subtitle = md("Prefixes, Descriptions, and Example Usage for `sf::read_sf()` & `terra::rast()`")
  ) |> 
  gtExtras::gt_theme_espn() |> 
  tab_footnote(
    footnote = md("GDAL virtual file system prefixes, their descriptions, and example usage with `read_sf()` and `rast()` functions in R for accessing remote and compressed files. For more details, refer to the [GDAL Virtual File Systems documentation](https://gdal.org/user/virtual_file_systems.html) and the [terra package reference](https://rspatial.org/terra/).")
  ) |> 
  tab_style(
    style = cell_text(
      size = "smaller"
    ),
    locations = cells_footnotes()
  ) |> 
  tab_style(
    style = list(
      cell_text(
        font = "monospace",
        weight = "bold"
      )
    ),
    locations = cells_body(
      columns = c(Prefix)
    )
  )
Table 3: Table showcasing various GDAL virtual file system prefixes for accessing remote and compressed files
Virtual File(s) Access Methods with GDAL

Prefixes, Descriptions, and Example Usage for sf::read_sf() & terra::rast()

Prefix Description
/vsicurl/

Enables GDAL to access files over HTTP, HTTPS, or FTP protocols, allowing direct reading of remote datasets such as GeoTIFFs or Shapefiles hosted online.

/vsizip/

Allows reading files directly from ZIP archives without extraction. ZIP files can contain various geospatial data formats like Shapefiles, GeoJSON, or KML files.

/vsi7z/

Supports reading files directly from 7z archives, a high-compression format. These archives can include geospatial files such as GeoTIFFs or Shapefiles.

/vsis3/

Facilitates access to files stored in Amazon S3 buckets, enabling operations on remote datasets like GeoTIFFs or Shapefiles stored in the cloud.

/vsigs/

Provides access to files stored in Google Cloud Storage, allowing GDAL to read datasets such as GeoTIFFs or Shapefiles directly from cloud storage.

/vsicurl_streaming/

Enables streaming of files over HTTP, HTTPS, or FTP without seeking capabilities, suitable for sequential reading of large remote datasets like GeoJSON files.

GDAL virtual file system prefixes, their descriptions, and example usage with read_sf() and rast() functions in R for accessing remote and compressed files. For more details, refer to the GDAL Virtual File Systems documentation and the terra package reference.

8.4 Data Output (O)

8.4.1 Vector Data

  • Writing vector data: The counterpart of read_sf() is write_sf(). It writes sf objects to various geographic vector file formats, shown in Table 4
    • Automatic driver selection: write_sf() automatically determines the file driver based on the file name.
    • Overwriting files: By default, write_sf() overwrites files if the same file name is used.
Code
library(tibble)
library(gt)

file_types <- tibble(
  `File Type` = c("ESRI Shapefile", "GeoJSON", "GPKG (GeoPackage)", "KML", "CSV with WKT geometry"),
  Extension = c(".shp", ".geojson", ".gpkg", ".kml", ".csv"),
  `Use Cases` = c(
    "Widely used for vector data storage and exchange",
    "Web applications and data sharing",
    "Efficient storage and transfer of spatial data",
    "Sharing geographic data with Google Earth and similar applications",
    "Simple data exchange and interoperability"
  ),
  Advantages = c(
    "Broad compatibility with many GIS applications",
    "Human-readable; easy integration with web mapping libraries",
    "Single-file SQLite-based format; supports both vector and raster data",
    "Suitable for visualization in Google Earth",
    "Easy to create and read; compatible with non-GIS software"
  ),
  Disadvantages = c(
    "Limited attribute name lengths; consists of multiple files",
    "Larger file sizes; slower read/write performance",
    "Less widespread adoption compared to Shapefile",
    "Limited attribute support; not ideal for large datasets",
    "Lacks spatial indexing; larger file sizes; limited to simple geometries"
  )
)
file_types |> 
  gt() |> 
  tab_header(
    title = md("Comparison of File Formats Supported by `sf::st_write()`")
  ) |> 
  tab_footnote(
    footnote = md("*Sources: R Spatial Documentation ([r-spatial.github.io](https://r-spatial.github.io/sf/reference/st_write.html)) and GDAL Vector Drivers Documentation ([gdal.org](https://gdal.org/drivers/vector/index.html)).*")
  ) |> 
  gtExtras::gt_theme_espn() |> 
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_body(
      columns = `File Type`
    )
  ) |> 
  tab_style(
    style = list(
      cell_text(font = "monospace")
    ),
    locations = cells_body(
      columns = Extension
    )
  )
Table 4: The key file formats compatible with the st_write() function in the sf package, including their use cases, advantages, and disadvantages.

Comparison of File Formats Supported by sf::st_write()

File Type Extension Use Cases Advantages Disadvantages
ESRI Shapefile .shp Widely used for vector data storage and exchange Broad compatibility with many GIS applications Limited attribute name lengths; consists of multiple files
GeoJSON .geojson Web applications and data sharing Human-readable; easy integration with web mapping libraries Larger file sizes; slower read/write performance
GPKG (GeoPackage) .gpkg Efficient storage and transfer of spatial data Single-file SQLite-based format; supports both vector and raster data Less widespread adoption compared to Shapefile
KML .kml Sharing geographic data with Google Earth and similar applications Suitable for visualization in Google Earth Limited attribute support; not ideal for large datasets
CSV with WKT geometry .csv Simple data exchange and interoperability Easy to create and read; compatible with non-GIS software Lacks spatial indexing; larger file sizes; limited to simple geometries

Sources: R Spatial Documentation (r-spatial.github.io) and GDAL Vector Drivers Documentation (gdal.org).

  • Adding layers: Use the layer argument to add new layers to formats like GeoPackage instead of overwriting existing files.
    • Example: layer = "second_layer" for additional layers in a .gpkg file.
    • Writing spatial data as text: Use the layer_options argument to write spatial data to text files:
      • GEOMETRY=AS_XY: Creates two columns for coordinates (useful for point datasets).
      • GEOMETRY=AS_WKT: Creates a column with well-known text (WKT) representations for complex spatial data.
  • Alternative function: st_write() is equivalent to write_sf(). Key differences:
    • Does not overwrite files (throws an error instead).
    • Provides a summary of the written file format and object.

8.4.2 Raster Data

  • Saving Raster Data:
    • Use writeRaster() from the terra package to save SpatRaster objects. It supports custom file formats, shown in Table 5, data types, and GDAL options.
Code
library(tibble)
library(gt)

raster_formats <- tibble::tibble(
  `File Type` = c("GeoTIFF", "COG (Cloud Optimized GeoTIFF)", "NetCDF", "ESRI ASCII", "ENVI"),
  `Extension` = c(".tif", ".tif", ".nc", ".asc", ".envi"),
  `Use Cases` = c(
    "Widely used for geospatial data storage and exchange",
    "Efficient for web-based applications and cloud storage",
    "Ideal for climate and atmospheric data; supports multidimensional data",
    "Simple text format; useful for data exchange and readability",
    "Commonly used in remote sensing applications"
  ),
  `Advantages` = c(
    "Supports multiple bands; allows compression; retains georeferencing info",
    "Optimized for cloud storage; supports streaming and efficient access",
    "Handles large datasets efficiently; self-describing format",
    "Human-readable; easy to edit and share",
    "Supports multiple bands; retains metadata"
  ),
  `Disadvantages` = c(
    "Larger file sizes if not compressed",
    "Requires specific software to fully leverage streaming benefits",
    "Complexity in structure; may require specific software to read",
    "Larger file sizes; slower read/write operations",
    "Requires accompanying header file; less widely supported"
  )
)

raster_formats |> 
  gt() |> 
  tab_header(
    title = md("Comparison of Raster File Formats Supported by `terra::writeRaster()`")
  ) |> 
  tab_footnote(
    footnote = md("Sourced from the [terra package documentation](https://rdrr.io/cran/terra/man/writeRaster.html) and other geospatial data resources.")
  ) |> 
  gtExtras::gt_theme_espn() |> 
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_body(
      columns = `File Type`
    )
  ) |> 
  tab_style(
    style = list(
      cell_text(font = "monospace")
    ),
    locations = cells_body(
      columns = Extension
    )
  )
Table 5: This table summarizes key details about various raster file formats supported by the terra::writeRaster() function in R

Comparison of Raster File Formats Supported by terra::writeRaster()

File Type Extension Use Cases Advantages Disadvantages
GeoTIFF .tif Widely used for geospatial data storage and exchange Supports multiple bands; allows compression; retains georeferencing info Larger file sizes if not compressed
COG (Cloud Optimized GeoTIFF) .tif Efficient for web-based applications and cloud storage Optimized for cloud storage; supports streaming and efficient access Requires specific software to fully leverage streaming benefits
NetCDF .nc Ideal for climate and atmospheric data; supports multidimensional data Handles large datasets efficiently; self-describing format Complexity in structure; may require specific software to read
ESRI ASCII .asc Simple text format; useful for data exchange and readability Human-readable; easy to edit and share Larger file sizes; slower read/write operations
ENVI .envi Commonly used in remote sensing applications Supports multiple bands; retains metadata Requires accompanying header file; less widely supported

Sourced from the terra package documentation and other geospatial data resources.

  • Supported Data Types:
    • Seven types: INT1U, INT2S, INT2U, INT4S, INT4U, FLT4S, FLT8S.
    • Use unsigned integers (INT1U, INT2U, INT4U) for categorical data.
    • Use floats (FLT4S, FLT8S) for continuous data.
    • Default: FLT4S, but consider smaller types for efficiency based on data range.
Note

Optimizing GeoTIFFs for Efficient Map Serving

In his article “GeoTIFF Compression for Dummies,” Paul Ramsey discusses effective strategies for preparing GeoTIFF images for map serving. He recommends using JPEG compression, internal tiling, the YCBCR color space, and internal overviews to significantly reduce file sizes without noticeable loss in visual quality. These techniques enhance performance in open-source rendering engines like MapServer, GeoServer, and Mapnik. Additionally, Ramsey highlights the advantages of Cloud Optimized GeoTIFFs (COGs) for efficient access in cloud storage environments. Some key advantages of Cloud Optimized GeoTIFFs (COGs) over traditional GeoTIFF files:

  1. Direct Access to Data: COGs allow efficient partial reads, enabling users to access specific portions of the file (e.g., particular regions or zoom levels) without downloading the entire file.

  2. Optimized for Cloud Storage: They are structured to work seamlessly with cloud storage systems, enabling faster access and reducing storage and retrieval costs.

  3. Internal Overviews and Tiling: COGs come with built-in overviews (lower-resolution versions of the dataset) and tiling, improving performance for applications requiring multi-resolution access, such as web mapping.

  4. Compatibility with Web-Based Tools: Many open-source tools and cloud platforms, such as AWS S3 and Google Earth Engine, directly support COGs, facilitating their integration into cloud-based workflows.

  5. Improved Performance for Large Datasets: By accessing only the needed data chunks, COGs reduce network bandwidth requirements and improve responsiveness for large raster datasets.

  6. Interoperability: COGs are compatible with standard GeoTIFF readers while offering enhanced functionality when used with tools that understand their optimized format.

8.5 Geoportals

  • Interactive exploration of datasets in a browser is an effective way to understand available layers.

  • Downloading data with code is preferable for reproducibility and efficiency. Command-line techniques can initiate downloads using URLs or APIs.

Code
library(tidyverse)
library(gt)

geoportals <- tibble(
  `Portal Name` = c(
    "[Data.gov](https://catalog.data.gov/dataset?metadata_type=geospatial)",
    "[GEOSS Portal](https://www.geoportal.org/)",
    "[Copernicus Data Space Ecosystem](https://dataspace.copernicus.eu/)",
    "[SEDAC](https://sedac.ciesin.columbia.edu/)",
    "[INSPIRE Geoportal](http://inspire-geoportal.ec.europa.eu/)",
    "[EarthExplorer](https://earthexplorer.usgs.gov/)"
  ),
  `Description` = c(
    "Offers a vast array of geospatial datasets, including environmental data, climate records, and transportation networks.",
    "Provides access to global Earth observation data, including satellite imagery and in-situ measurements across agriculture, biodiversity, and climate.",
    "Delivers free access to Copernicus Sentinel satellite data, including high-resolution optical and radar imagery for environmental monitoring.",
    "Focuses on integrating socioeconomic and environmental data, offering population distribution, urban extents, and climate impact indicators.",
    "Facilitates access to European spatial data, including administrative boundaries, hydrography, and transportation networks adhering to INSPIRE standards.",
    "Provides access to remote sensing data like Landsat imagery and digital elevation models managed by the U.S. Geological Survey."
  ),
  Type = c(
    "Raster & Vector",
    "Raster & Vector",
    "Raster",
    "Vector",
    "Raster & Vector",
    "Raster"
  )
)

# Print the tibble
geoportals |> 
  gt() |> 
  fmt_markdown(columns = `Portal Name`) |> 
  cols_label_with(fn = snakecase::to_title_case) |> 
  tab_header(
    title = "Geospatial Data Portals for Raster and Vector Datasets"
  ) |> 
  tab_footnote(
    footnote = md("Data sources: Adapted from the textbook `Geocomputation with R`, *Section 8.5 Geoportals*.")
  ) |> 
  gtExtras::gt_theme_espn()
Table 6: Some Common Geoportals available freely on the internet.
Geospatial Data Portals for Raster and Vector Datasets
Portal Name Description Type

Data.gov

Offers a vast array of geospatial datasets, including environmental data, climate records, and transportation networks. Raster & Vector

GEOSS Portal

Provides access to global Earth observation data, including satellite imagery and in-situ measurements across agriculture, biodiversity, and climate. Raster & Vector

Copernicus Data Space Ecosystem

Delivers free access to Copernicus Sentinel satellite data, including high-resolution optical and radar imagery for environmental monitoring. Raster

SEDAC

Focuses on integrating socioeconomic and environmental data, offering population distribution, urban extents, and climate impact indicators. Vector

INSPIRE Geoportal

Facilitates access to European spatial data, including administrative boundaries, hydrography, and transportation networks adhering to INSPIRE standards. Raster & Vector

EarthExplorer

Provides access to remote sensing data like Landsat imagery and digital elevation models managed by the U.S. Geological Survey. Raster

Data sources: Adapted from the textbook Geocomputation with R, Section 8.5 Geoportals.

8.6 Geographic data packages

Code
library(tidyverse)
library(gt)

geographic_data_packages <- tibble(
  Package = c(
    "[climateR](https://cran.r-project.org/package=climateR)",
    "[elevatr](https://cran.r-project.org/package=elevatr)",
    "[FedData](https://cran.r-project.org/package=FedData)",
    "[geodata](https://cran.r-project.org/package=geodata)",
    "[osmdata](https://cran.r-project.org/package=osmdata)",
    "[osmextract](https://cran.r-project.org/package=osmextract)",
    "[rnaturalearth](https://cran.r-project.org/package=rnaturalearth)",
    "[rnoaa](https://cran.r-project.org/package=rnoaa)",
    "[giscoR](https://cran.r-project.org/package=giscoR)"
  ),
  Description = c(
    "Provides access to over 100,000 gridded climate and landscape datasets from more than 2,000 data providers, allowing users to retrieve data by specifying an area of interest.",
    "Facilitates access to point and raster elevation data from various sources, enabling users to obtain elevation information for specific locations or regions.",
    "Allows users to download datasets maintained by the U.S. federal government, including elevation, land cover, and other geospatial data.",
    "Simplifies the download and import of administrative boundaries, elevation models, WorldClim climate data, and more for various regions worldwide.",
    "Enables downloading and importing of small to medium-sized datasets from OpenStreetMap, suitable for detailed geographic analyses.",
    "Designed for downloading and importing large OpenStreetMap datasets, making it suitable for extensive geographic analyses.",
    "Provides access to Natural Earth vector and raster data, including country boundaries, rivers, and urban areas, useful for creating maps and visualizations.",
    "Interfaces with the National Oceanic and Atmospheric Administration (NOAA) to import climate and weather data, including forecasts, historical data, and severe weather alerts.",
    "Allows users to download and use geospatial data from GISCO (Geographical Information System of the European Commission), including administrative boundaries and other geographic features."
  ),
  Important_Functions = c(
    "`getDataAOI()` – Retrieve data for a specified area of interest.",
    "`get_elev_point()` – Get elevation data for specific points.",
    "`get_ned()` – Download National Elevation Dataset data.",
    "`worldclim_global()` – Access global climate data.",
    "`opq()` – Build an Overpass query; `add_osm_feature()` – Specify features to retrieve.",
    "`oe_read()` – Read OpenStreetMap data for a specified region.",
    "`ne_countries()` – Retrieve country boundaries; `ne_download()` – Download Natural Earth data.",
    "`meteo_pull_monitors()` – Retrieve meteorological data from specified monitors.",
    "`gisco_get_countries()` – Retrieve country boundaries; `gisco_get_nuts()` – Access NUTS regions."
  )
)

geographic_data_packages |> 
  gt() |> 
  fmt_markdown(columns = c(Package, 
                           Important_Functions)) |> 
  cols_label_with(fn = snakecase::to_title_case) |> 
  tab_header(
    title = "R Packages for Geographic Data"
  ) |> 
  gtExtras::gt_theme_espn() |> 
  tab_style(
    style = list(
      cell_text(font = "monospace")
    ),
    locations = cells_body(
      columns = Package
    )
  )
Table 7: Selected R Packages for Geographic Data Retrieval and Analysis
R Packages for Geographic Data
Package Description Important Functions

climateR

Provides access to over 100,000 gridded climate and landscape datasets from more than 2,000 data providers, allowing users to retrieve data by specifying an area of interest.

getDataAOI() – Retrieve data for a specified area of interest.

elevatr

Facilitates access to point and raster elevation data from various sources, enabling users to obtain elevation information for specific locations or regions.

get_elev_point() – Get elevation data for specific points.

FedData

Allows users to download datasets maintained by the U.S. federal government, including elevation, land cover, and other geospatial data.

get_ned() – Download National Elevation Dataset data.

geodata

Simplifies the download and import of administrative boundaries, elevation models, WorldClim climate data, and more for various regions worldwide.

worldclim_global() – Access global climate data.

osmdata

Enables downloading and importing of small to medium-sized datasets from OpenStreetMap, suitable for detailed geographic analyses.

opq() – Build an Overpass query; add_osm_feature() – Specify features to retrieve.

osmextract

Designed for downloading and importing large OpenStreetMap datasets, making it suitable for extensive geographic analyses.

oe_read() – Read OpenStreetMap data for a specified region.

rnaturalearth

Provides access to Natural Earth vector and raster data, including country boundaries, rivers, and urban areas, useful for creating maps and visualizations.

ne_countries() – Retrieve country boundaries; ne_download() – Download Natural Earth data.

rnoaa

Interfaces with the National Oceanic and Atmospheric Administration (NOAA) to import climate and weather data, including forecasts, historical data, and severe weather alerts.

meteo_pull_monitors() – Retrieve meteorological data from specified monitors.

giscoR

Allows users to download and use geospatial data from GISCO (Geographical Information System of the European Commission), including administrative boundaries and other geographic features.

gisco_get_countries() – Retrieve country boundaries; gisco_get_nuts() – Access NUTS regions.