What this solves

  • Most SDM workflows start the same: open climate rasters, sample at occurrence points, fit a model.
  • The wall is memory. A continental stack is several GB; R loads it whole and copies on every edit, so the session dies with cannot allocate vector of size ....
  • Two files hit this: a table too big to read, and a raster too big to open.
  • vectra reads in chunks and streams past the engine, so peak memory stays small at any file size. Same dplyr verbs, plain R package, no external runtime.

Below: each job done the familiar way and with vectra, side by side. Everything runs offline and sizes to the machine.

Setup

install.packages("vectra")      # on CRAN
library(vectra)

A big CSV: read.csv vs vectra

A stand-in occurrence table (one row per visit, a few climate columns, presence = 1/0), sized to this machine’s RAM so it behaves the same wherever it opens:

ram_gb <- gb(total_ram_bytes())
cores  <- parallel::detectCores()
csv_mb <- min(200, max(40, 0.01 * ram_gb * 1024))   # CSV size, scaled to RAM
n_rows <- as.integer(csv_mb * 1024^2 / 75)          # ~75 bytes per row

cat(sprintf("This machine: %.0f GB RAM, %d cores.\n", ram_gb, cores))
#> This machine: 64 GB RAM, 32 cores.
cat(sprintf("Demo CSV: %.1f million rows, about %.0f MB on disk.\n",
            n_rows / 1e6, csv_mb))
#> Demo CSV: 2.8 million rows, about 200 MB on disk.

Fill it with random values and write to a temporary CSV:

dat <- data.frame(
  bio1 = runif(n_rows), bio2 = runif(n_rows), bio3 = runif(n_rows),
  bio4 = runif(n_rows), presence = as.integer(runif(n_rows) < 0.3)
)
csv <- tempfile(fileext = ".csv")
write.csv(dat, csv, row.names = FALSE)
rm(dat); invisible(gc())

round(file.info(csv)$size / 1024^2)        # actual size on disk, MB
#> [1] 200

Goal: one number, the share of sites with the species present (mean(presence)). The usual way reads the whole file, then takes the mean. Right answer, but watch the memory:

reset_peak()
d <- read.csv(csv)
answer_readcsv <- mean(d$presence)
peak_readcsv   <- gc_mb("max")
rm(d); invisible(gc())

c(answer = round(answer_readcsv, 4), peak_MB = round(peak_readcsv))
#>    answer   peak_MB 
#>    0.2995 1419.0000

Same number, without holding the file: tbl_csv() opens it, dplyr verbs describe the work, collect() streams it past the engine one chunk at a time.

reset_peak()
answer_vectra <- (tbl_csv(csv) |>
  summarise(presence_rate = mean(presence)) |>
  collect())$presence_rate
peak_vectra <- gc_mb("max")

c(answer = round(answer_vectra, 4), peak_MB = round(peak_vectra))
#>   answer  peak_MB 
#>   0.2995 158.0000

Same number, a fraction of the memory:

c(read.csv_MB = round(peak_readcsv),
  vectra_MB   = round(peak_vectra),
  times_less  = round(peak_readcsv / peak_vectra, 1),
  same_answer = isTRUE(all.equal(answer_readcsv, answer_vectra)))
#> read.csv_MB   vectra_MB  times_less same_answer 
#>        1419         158           9           1

read.csv holds every row at once, so its peak grows with the file. vectra’s peak is set by the chunk size, so the same code runs on a CSV larger than RAM.

A raster: terra vs vectra

A two-band raster over Central Europe (band1 temperature, band2 precipitation), written to a GeoTIFF. In a real project this is the multi-GB stack.

nx <- 1200; ny <- 900
xmin <- 5; xmax <- 17; ymin <- 45; ymax <- 55
xres <- (xmax - xmin) / nx; yres <- (ymax - ymin) / ny

g <- expand.grid(col = 0:(nx - 1), row = 0:(ny - 1))
g$x <- xmin + (g$col + 0.5) * xres
g$y <- ymax - (g$row + 0.5) * yres
g$band1 <- 14 - 0.6 * (g$y - 50) + 0.2 * (g$x - 11) + rnorm(nrow(g), 0, 0.3)
g$band2 <- 750 + 25 * (g$y - 50) - 8 * (g$x - 11) + rnorm(nrow(g), 0, 10)

tif <- tempfile(fileext = ".tif")
write_tiff(g[, c("x", "y", "band1", "band2")], tif, crs = 4326L)

Sample both bands at 2000 survey points. With terra you read the raster and call extract:

N <- 2000
pts <- data.frame(x = runif(N, xmin + 0.3, xmax - 0.3),
                  y = runif(N, ymin + 0.3, ymax - 0.3))
library(terra)
#> terra 1.9.27
r <- rast(tif)
names(r) <- c("band1", "band2")
terra_vals <- extract(r, as.matrix(pts))
head(terra_vals, 3)
#>      band1    band2
#> 1 16.18802 641.3130
#> 2 16.95523 634.8926
#> 3 12.93592 789.4721

vectra does it in one call, tiff_extract_points(), reading only the parts that hold query points:

vectra_vals <- tiff_extract_points(tif, pts)
head(vectra_vals[, c("band1", "band2")], 3)
#>      band1    band2
#> 1 16.18802 641.3130
#> 2 16.95523 634.8926
#> 3 12.93592 789.4721
c(band1_max_diff = max(abs(vectra_vals$band1 - terra_vals$band1)),
  band2_max_diff = max(abs(vectra_vals$band2 - terra_vals$band2)))
#> band1_max_diff band2_max_diff 
#>              0              0

Identical values. vectra hands the raster back as a lazy table, so the same chunked engine reads it.

See the raster

The field we just sampled, with the survey points on top:

How vectra reads it

A full read loads the whole grid first. vectra reads only the row-strips holding query points, one at a time. The right-hand bar tracks cells resident in memory.

A raster as tidy data

The raster is a lazy table, so dplyr verbs apply directly; collect() reads it in chunks:

tbl_tiff(tif) |>
  filter(band1 > 8) |>
  mutate(temp_c = band1, precip_mm = band2) |>
  summarise(n_pixels = n(), mean_temp = mean(temp_c)) |>
  collect()
#>   n_pixels mean_temp
#> 1  1080000  14.00035

Fit a model off disk

Build a small modelling frame from the extracted points (standardize predictors, draw presence/absence) and write it to vectra’s .vtr:

occ <- data.frame(
  bio1  = as.numeric(scale(vectra_vals$band1)),
  bio12 = as.numeric(scale(vectra_vals$band2))
)
eta <- -0.2 + 1.3 * occ$bio1 - 0.9 * occ$bio12
occ$presence <- rbinom(nrow(occ), 1, plogis(eta))

vtr <- tempfile(fileext = ".vtr")
write_vtr(occ, vtr)

collect_chunked() folds a function over the file one chunk at a time, holding one chunk plus a running total. Here, the prevalence:

acc <- tbl(vtr) |>
  collect_chunked(
    function(acc, chunk) {
      acc$n <- acc$n + nrow(chunk)
      acc$p <- acc$p + sum(chunk$presence)
      acc
    },
    .init = list(n = 0L, p = 0L)
  )
round(acc$p / acc$n, 3)      # prevalence over the whole file
#> [1] 0.484

If the table fits in memory, collect() it and use glm() as normal:

fit <- glm(presence ~ bio1 + bio12, data = occ, family = binomial())
round(coef(fit), 3)
#> (Intercept)        bio1       bio12 
#>      -0.136       1.790      -0.321

Larger than memory? chunk_feeder() exposes the query as a resettable stream that biglm::bigglm() drives directly, never holding more than one chunk:

src <- function() tbl(vtr) |> select(presence, bio1, bio12)

fit_stream <- biglm::bigglm(
  presence ~ bio1 + bio12,
  data = chunk_feeder(src),
  family = binomial()
)
round(coef(fit_stream), 3)
#> (Intercept)        bio1       bio12 
#>      -0.136       1.790      -0.321

The streamed fit matches in-memory glm() to convergence tolerance, holding one chunk at a time. The same call scales to any .vtr or CSV larger than RAM.

The same fit, in motion: each chunk of rows folds into the running coefficients, then drops before the next arrives. The fit settles; the whole table never sits in memory.

A GAM: offload over groups

A GAM needs all its data at once: the spline basis and REML couple every row, so it can’t be folded chunk-by-chunk like a GLM. But you usually fit one GAM per group (species, region, site), not one over everything. offload() runs that off disk, one group at a time:

  • load one group’s rows into memory,
  • fit a full GAM on it and save the result,
  • release it and move to the next.

One group in memory at a time, one fit per group, never the whole table.

library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.9-4. For overview type '?mgcv'.
gdf <- data.frame(g = sample.int(6L, 3000L, TRUE), x = runif(3000))
gdf$y <- sin(2 * pi * gdf$x + gdf$g) + rnorm(nrow(gdf), 0, 0.3)
gvtr <- tempfile(fileext = ".vtr"); write_vtr(gdf, gvtr)

p    <- offload(tbl(gvtr) |> select(g, x, y), by = "g")
fits <- group_map(p, function(dd, key) gam(y ~ s(x), data = dd))
length(fits)        # one fitted GAM per group
#> [1] 6

The same loop in motion: load a group, fit it, save the result, release, next.