R package for efficient raster processing using GDAL VRTs • vrtility

vrtility is an R package that aims to make the best use of GDAL’s VRT capabilities for efficient processing of large raster datasets - mainly with Earth Observation (EO) in mind. vrtility uses the VRT format to access awesome features such as pixel functions but also harnesses the VRT data structure to facilitate complex image processing tasks such as multi-band compositing and time series filtering.

Note: This package is under active development and is likely to change. Contributions and suggestions are still very welcome!

Features

Modular design: We’re basically creating remote sensing pipelines using nested VRTs. This allows for the easy addition of new pixel and masking functions, and the same approach could easily be adapted for deriving spectral indices or calculating complex time series functions. All powered by {gdalraster}.
vrtility enables the use of GDAL VRT Python and built-in (C++) pixel functions. The Python numpy-based pixel functions can be used to apply cloud masks and summarise pixel values (e.g. median) from multiple images (i.e. create a composite image). All Python environment and package management is handled by {reticulate}. Built-in GDAL pixel functions vary depending on the GDAL version, but are highly performant; recent GDAL versions even support expressions for more complex operations.
Efficient parallel processing using {mirai}
Advanced compositing methods that maintain spectral consistency, such as the geometric median and medoid.
Time series filtering functions to improve temporal consistency and reduce noise.
On-the-fly cloud mask filtering using pixel functions, including the ability to use OmniCloudMask cloud/shadow masking embedded within the VRT pipeline.
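
For readers new to VRT pixel functions, the following base R sketch assembles the kind of derived-band XML fragment that GDAL interprets when a Python pixel function is attached to a band with several sources. It is illustrative only and is not the XML that vrtility itself writes; the file names are hypothetical, and a complete VRT would also need the enclosing VRTDataset element with its size, SRS and geotransform. Note that GDAL only runs inline Python code when the GDAL_VRT_ENABLE_PYTHON configuration option permits it.

```r
# Hypothetical source files: any single-band rasters on a shared grid would do.
src_files <- c("scene_1.tif", "scene_2.tif", "scene_3.tif")

# One <SimpleSource> element per input image.
sources <- sprintf(
  paste0(
    "    <SimpleSource>\n",
    "      <SourceFilename relativeToVRT=\"1\">%s</SourceFilename>\n",
    "      <SourceBand>1</SourceBand>\n",
    "    </SimpleSource>"
  ),
  src_files
)

# A numpy pixel function using the argument list GDAL passes to Python.
py_code <- c(
  "import numpy as np",
  "def median(in_ar, out_ar, xoff, yoff, xsize, ysize,",
  "           raster_xsize, raster_ysize, buf_radius, gt, **kwargs):",
  "    out_ar[:] = np.median(in_ar, axis=0)"
)

# A derived band: the pixel function reduces all sources to one output value.
vrt_band <- c(
  "  <VRTRasterBand dataType=\"Float32\" band=\"1\" subClass=\"VRTDerivedRasterBand\">",
  "    <PixelFunctionType>median</PixelFunctionType>",
  "    <PixelFunctionLanguage>Python</PixelFunctionLanguage>",
  "    <PixelFunctionCode><![CDATA[",
  py_code,
  "]]></PixelFunctionCode>",
  sources,
  "  </VRTRasterBand>"
)

cat(vrt_band, sep = "\n")
```

Because the reduction lives in the XML rather than in a written file, nothing is computed until a reader requests pixels from the band.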

Installation

You can install vrtility from GitHub with:

# install.packages("pak")
pak::pkg_install("Permian-Global-Research/vrtility")

Example

Here is a simple example where we:

Define a bounding box and search a STAC catalog for Sentinel-2 data
Create a vrt_collection object - essentially a list of individual VRTs (each making up one image) which we refer to as vrt_blocks in this package.
Then, we apply the mask using pixel functions. This simply modifies the XML of the VRT “blocks”.
Because this set of images has more than one common spatial reference system (SRS) we warp the vrt_blocks to a new spatially-aligned vrt_collection using vrt_warp, giving us a vrt_collection_warped object.
These images are then “stacked” (combined into a single VRT with multiple layers in each VRTRasterBand), giving us a vrt_stack object.
A median pixel function is then added to the vrt_stack.
Finally, we calculate the median composite using vrt_compute with the gdalraster engine to write the output; in combination with the mirai package, this processes the data in parallel across bands and image tiles.

library(vrtility)
#> ✔ Using GDAL version 3.11.3
#> ℹ GDAL_CACHEMAX set to 6.247 GiB; to change this use
#>   vrtility::set_gdal_cache_max()
#  Set up asynchronous workers to parallelise vrt_collect and vrt_set_maskfun
mirai::daemons(10)

bbox <- gdalraster::bbox_from_wkt(
  wkt = "POINT (144.3 -7.6)",
  extend_x = 0.17,
  extend_y = 0.125
)

te <- bbox_to_projected(bbox)
trs <- attr(te, "wkt")

s2_stac <- hls_stac_query(
  bbox = bbox,
  start_date = "2023-01-01",
  end_date = "2023-12-31",
  stac_source = "https://planetarycomputer.microsoft.com/api/stac/v1/",
  collection = "hls2-s30",
  max_cloud_cover = 40,
  assets = c("B02", "B03", "B04", "B8A", "Fmask")
)
# number of items:
length(s2_stac$features)
#> [1] 10

system.time({
  median_composite <- vrt_collect(s2_stac) |>
    vrt_set_maskfun(
      mask_band = "Fmask",
      mask_values = c(0, 1, 2, 3),
      build_mask_pixfun = build_bitmask()
    ) |>
    vrt_warp(t_srs = trs, te = te, tr = c(30, 30)) |>
    vrt_stack() |>
    vrt_set_py_pixelfun(pixfun = median_numpy()) |>
    vrt_compute(
      engine = "gdalraster"
    )
})
#>    user  system elapsed 
#>   1.964   0.214  16.800


plot_raster_src(
  median_composite,
  c(3, 2, 1)
)
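
To make the masking and compositing steps more concrete, here is a base R sketch of what they amount to for a single pixel. It assumes that the mask values passed alongside build_bitmask() are bit positions in the Fmask band, and that bits 0 to 3 flag cirrus, cloud, adjacent cloud/shadow and cloud shadow (the HLS Fmask layout). The pixel values are made up, and this is not how vrtility implements the operation internally.

```r
# Hypothetical values for one pixel across five acquisitions.
fmask <- c(0L, 2L, 64L, 8L, 128L)          # bit-packed quality flags
red   <- c(0.05, 0.60, 0.06, 0.02, 0.07)   # surface reflectance

# Bit positions to treat as "bad" (assumed meaning: cirrus, cloud,
# adjacent to cloud/shadow, cloud shadow).
mask_bits <- c(0, 1, 2, 3)

# Combine the bit positions into one integer pattern: 1 + 2 + 4 + 8 = 15.
bit_pattern <- as.integer(sum(2^mask_bits))

# An observation is masked if any of the selected bits is set.
is_masked <- bitwAnd(fmask, bit_pattern) != 0L

# Replace masked observations with NA, then take the median of what is left.
red_clear <- ifelse(is_masked, NA_real_, red)
composite <- median(red_clear, na.rm = TRUE)

print(is_masked)
print(composite)
```

Here the second and fourth observations are dropped because their cloud and cloud shadow bits are set, while the observations with only higher bits set are kept.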

Asynchronous download/processing

{vrtility} uses {mirai}, alongside {purrr}, to manage asynchronous parallel processing. Setting mirai::daemons(n) before running the VRT pipeline can often improve performance, depending on the speed of the server holding the data. To use asynchronous processing in vrt_compute, we must set engine = "gdalraster", or engine = "warp" if we are computing a vrt_collection (multiple discrete images). Multi-processing is strongly recommended, especially for multi-band reduction methods (i.e. multiband_reduce).
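
As a rough illustration of the daemon model, the sketch below uses {mirai} on its own to run a few slow tasks on background workers. The slow_read function is a hypothetical stand-in for an I/O-bound step such as fetching one remote image; it is not part of vrtility, and the package's internal use of {mirai} may differ.

```r
library(mirai)

# Start two background worker processes (daemons).
daemons(2)

# Hypothetical stand-in for an I/O-bound task, e.g. reading one remote tile.
slow_read <- function(i) {
  Sys.sleep(0.5)
  i^2
}

# Launch the tasks without blocking; each call returns a 'mirai' straight away.
# Objects needed on the worker are passed as named arguments.
tasks <- lapply(
  1:4,
  function(i) mirai(slow_read(i), slow_read = slow_read, i = i)
)

# Wait for each task to resolve and collect its value.
results <- vapply(tasks, function(m) call_mirai(m)$data, numeric(1))

# Shut the workers down again.
daemons(0)

print(results)
```

With two daemons the four half-second tasks should finish in roughly one second rather than two, which is the same effect that helps when many remote files are read at once.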

Direct file access for on-disk or remote rasters

We can also directly access raster files on disk (or remote files as URLs), as shown here with this example dataset. Note that the inputs have multiple spatial reference systems, so we need to warp them (as in the above example) before stacking or applying any reduction-type function.

In this example, we create a medoid composite from the warped collection. Using medoid or other multi-band pixel functions (e.g. geomedian) can be extremely powerful but requires more compute than band-wise pixel functions.
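
The difference between a band-wise and a multi-band reduction can be seen with a toy pixel in base R. The sketch below uses the common definition of the medoid (the observation with the smallest summed Euclidean distance to all others); this is assumed to match what medoid() computes in spirit, but it is not the package's implementation, and the values are invented.

```r
# One pixel observed on three dates (rows) in two bands (columns).
obs <- rbind(
  c(0.10, 0.40),
  c(0.20, 0.10),
  c(0.30, 0.30)
)

# Band-wise median: each band is reduced independently.
bandwise <- apply(obs, 2, median)

# Medoid: the observed spectrum closest, in total, to all other spectra.
d <- as.matrix(dist(obs))              # pairwise Euclidean distances
medoid_idx <- unname(which.min(rowSums(d)))
medoid_px <- obs[medoid_idx, ]

print(bandwise)   # 0.2 0.3, a combination that was never observed
print(medoid_px)  # 0.3 0.3, the third observation
```

The medoid always returns a spectrum that was actually measured, which is why it preserves the relationship between bands, but it needs the full pairwise distance matrix per pixel and so costs more than a band-wise median.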

s2files <- fs::dir_ls(system.file("s2-data", package = "vrtility"))[1:4]

ex_collect <- vrt_collect(s2files)

ex_collect_mask <- vrt_set_maskfun(
  ex_collect,
  mask_band = "SCL",
  mask_values = c(0, 1, 2, 3, 8, 9, 10, 11)
)

par(mfrow = c(2, 2))

purrr::walk(
  seq_len(ex_collect$n_items),
  ~ plot(ex_collect, item = .x, bands = c(3, 2, 1))
)


purrr::walk(
  seq_len(ex_collect_mask$n_items),
  ~ plot(ex_collect_mask, item = .x, bands = c(3, 2, 1))
)


# extract a block to use as a template for warping
t_block <- ex_collect$vrt[[1]]

ex_composite <- vrt_warp(
  ex_collect_mask,
  t_srs = t_block$srs,
  te = t_block$bbox,
  tr = c(20, 20)
) |>
  vrt_set_scale(scale_value = 0.0001, offset = -0.1, band_idx = 1:4) |>
  multiband_reduce(reduce_fun = medoid())

purrr::walk(
  1:4,
  ~ plot_raster_src(ex_composite, bands = .x, minmax_pct_cut = c(2, 98))
)