cannot allocate vector of size ....vectra reads in chunks and streams past the engine, so
peak memory stays small at any file size. Same dplyr verbs, plain R
package, no external runtime.Below: each job done the familiar way and with vectra, side by side. Everything runs offline and sizes to the machine.
read.csv vs vectraA stand-in occurrence table (one row per visit, a few climate
columns, presence = 1/0), sized to this machine’s RAM so it
behaves the same wherever it opens:
ram_gb <- gb(total_ram_bytes())
cores <- parallel::detectCores()
csv_mb <- min(200, max(40, 0.01 * ram_gb * 1024)) # CSV size, scaled to RAM
n_rows <- as.integer(csv_mb * 1024^2 / 75) # ~75 bytes per row
cat(sprintf("This machine: %.0f GB RAM, %d cores.\n", ram_gb, cores))
#> This machine: 64 GB RAM, 32 cores.
cat(sprintf("Demo CSV: %.1f million rows, about %.0f MB on disk.\n",
n_rows / 1e6, csv_mb))
#> Demo CSV: 2.8 million rows, about 200 MB on disk.Fill it with random values and write to a temporary CSV:
dat <- data.frame(
bio1 = runif(n_rows), bio2 = runif(n_rows), bio3 = runif(n_rows),
bio4 = runif(n_rows), presence = as.integer(runif(n_rows) < 0.3)
)
csv <- tempfile(fileext = ".csv")
write.csv(dat, csv, row.names = FALSE)
rm(dat); invisible(gc())
round(file.info(csv)$size / 1024^2) # actual size on disk, MB
#> [1] 200Goal: one number, the share of sites with the species present
(mean(presence)). The usual way reads the whole file, then
takes the mean. Right answer, but watch the memory:
reset_peak()
d <- read.csv(csv)
answer_readcsv <- mean(d$presence)
peak_readcsv <- gc_mb("max")
rm(d); invisible(gc())
c(answer = round(answer_readcsv, 4), peak_MB = round(peak_readcsv))
#> answer peak_MB
#> 0.2995 1419.0000Same number, without holding the file: tbl_csv() opens
it, dplyr verbs describe the work, collect() streams it
past the engine one chunk at a time.
reset_peak()
answer_vectra <- (tbl_csv(csv) |>
summarise(presence_rate = mean(presence)) |>
collect())$presence_rate
peak_vectra <- gc_mb("max")
c(answer = round(answer_vectra, 4), peak_MB = round(peak_vectra))
#> answer peak_MB
#> 0.2995 158.0000Same number, a fraction of the memory:
c(read.csv_MB = round(peak_readcsv),
vectra_MB = round(peak_vectra),
times_less = round(peak_readcsv / peak_vectra, 1),
same_answer = isTRUE(all.equal(answer_readcsv, answer_vectra)))
#> read.csv_MB vectra_MB times_less same_answer
#> 1419 158 9 1read.csv holds every row at once, so its peak grows with
the file. vectra’s peak is set by the chunk size, so the same code runs
on a CSV larger than RAM.
terra vs vectraA two-band raster over Central Europe (band1
temperature, band2 precipitation), written to a GeoTIFF. In
a real project this is the multi-GB stack.
nx <- 1200; ny <- 900
xmin <- 5; xmax <- 17; ymin <- 45; ymax <- 55
xres <- (xmax - xmin) / nx; yres <- (ymax - ymin) / ny
g <- expand.grid(col = 0:(nx - 1), row = 0:(ny - 1))
g$x <- xmin + (g$col + 0.5) * xres
g$y <- ymax - (g$row + 0.5) * yres
g$band1 <- 14 - 0.6 * (g$y - 50) + 0.2 * (g$x - 11) + rnorm(nrow(g), 0, 0.3)
g$band2 <- 750 + 25 * (g$y - 50) - 8 * (g$x - 11) + rnorm(nrow(g), 0, 10)
tif <- tempfile(fileext = ".tif")
write_tiff(g[, c("x", "y", "band1", "band2")], tif, crs = 4326L)Sample both bands at 2000 survey points. With terra you
read the raster and call extract:
N <- 2000
pts <- data.frame(x = runif(N, xmin + 0.3, xmax - 0.3),
y = runif(N, ymin + 0.3, ymax - 0.3))library(terra)
#> terra 1.9.27
r <- rast(tif)
names(r) <- c("band1", "band2")
terra_vals <- extract(r, as.matrix(pts))
head(terra_vals, 3)
#> band1 band2
#> 1 16.18802 641.3130
#> 2 16.95523 634.8926
#> 3 12.93592 789.4721vectra does it in one call, tiff_extract_points(),
reading only the parts that hold query points:
vectra_vals <- tiff_extract_points(tif, pts)
head(vectra_vals[, c("band1", "band2")], 3)
#> band1 band2
#> 1 16.18802 641.3130
#> 2 16.95523 634.8926
#> 3 12.93592 789.4721c(band1_max_diff = max(abs(vectra_vals$band1 - terra_vals$band1)),
band2_max_diff = max(abs(vectra_vals$band2 - terra_vals$band2)))
#> band1_max_diff band2_max_diff
#> 0 0Identical values. vectra hands the raster back as a lazy table, so the same chunked engine reads it.
The field we just sampled, with the survey points on top:
A full read loads the whole grid first. vectra reads only the row-strips holding query points, one at a time. The right-hand bar tracks cells resident in memory.
The raster is a lazy table, so dplyr verbs apply directly;
collect() reads it in chunks:
Build a small modelling frame from the extracted points (standardize
predictors, draw presence/absence) and write it to vectra’s
.vtr:
occ <- data.frame(
bio1 = as.numeric(scale(vectra_vals$band1)),
bio12 = as.numeric(scale(vectra_vals$band2))
)
eta <- -0.2 + 1.3 * occ$bio1 - 0.9 * occ$bio12
occ$presence <- rbinom(nrow(occ), 1, plogis(eta))
vtr <- tempfile(fileext = ".vtr")
write_vtr(occ, vtr)collect_chunked() folds a function over the file one
chunk at a time, holding one chunk plus a running total. Here, the
prevalence:
acc <- tbl(vtr) |>
collect_chunked(
function(acc, chunk) {
acc$n <- acc$n + nrow(chunk)
acc$p <- acc$p + sum(chunk$presence)
acc
},
.init = list(n = 0L, p = 0L)
)
round(acc$p / acc$n, 3) # prevalence over the whole file
#> [1] 0.484If the table fits in memory, collect() it and use
glm() as normal:
fit <- glm(presence ~ bio1 + bio12, data = occ, family = binomial())
round(coef(fit), 3)
#> (Intercept) bio1 bio12
#> -0.136 1.790 -0.321Larger than memory? chunk_feeder() exposes the query as
a resettable stream that biglm::bigglm() drives directly,
never holding more than one chunk:
src <- function() tbl(vtr) |> select(presence, bio1, bio12)
fit_stream <- biglm::bigglm(
presence ~ bio1 + bio12,
data = chunk_feeder(src),
family = binomial()
)
round(coef(fit_stream), 3)
#> (Intercept) bio1 bio12
#> -0.136 1.790 -0.321The streamed fit matches in-memory glm() to convergence
tolerance, holding one chunk at a time. The same call scales to any
.vtr or CSV larger than RAM.
The same fit, in motion: each chunk of rows folds into the running coefficients, then drops before the next arrives. The fit settles; the whole table never sits in memory.
A GAM needs all its data at once: the spline basis and REML couple
every row, so it can’t be folded chunk-by-chunk like a GLM. But you
usually fit one GAM per group (species, region, site), not one
over everything. offload() runs that off disk, one group at
a time:
One group in memory at a time, one fit per group, never the whole table.
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.9-4. For overview type '?mgcv'.
gdf <- data.frame(g = sample.int(6L, 3000L, TRUE), x = runif(3000))
gdf$y <- sin(2 * pi * gdf$x + gdf$g) + rnorm(nrow(gdf), 0, 0.3)
gvtr <- tempfile(fileext = ".vtr"); write_vtr(gdf, gvtr)
p <- offload(tbl(gvtr) |> select(g, x, y), by = "g")
fits <- group_map(p, function(dd, key) gam(y ~ s(x), data = dd))
length(fits) # one fitted GAM per group
#> [1] 6The same loop in motion: load a group, fit it, save the result, release, next.