vectra is an R-native columnar query engine for datasets larger than RAM.
Write dplyr-style pipelines against multi-GB files on a laptop. Data streams through a C11 pull-based engine one row group at a time, so peak memory stays bounded regardless of file size.
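The bounded-memory claim comes from chunked, pull-based execution. A minimal base-R sketch of the idea (plain read.csv chunking for intuition only — this is not vectra's engine, and `stream_sum` is an illustrative helper, not part of the package):

```r
# Sketch of pull-based streaming: consume a CSV in fixed-size chunks so peak
# memory is one chunk, never the whole file. Illustrative base R, not vectra.
stream_sum <- function(path, col, chunk_rows = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break
    chunk <- read.csv(text = c(header, lines))  # only this chunk is in memory
    total <- total + sum(chunk[[col]])
  }
  total
}
```

vectra applies the same pattern per row group, with the loop in C rather than R.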
## Quick Start
Point vectra at any file and query it with dplyr verbs. Nothing runs until collect().
```r
library(vectra)

# CSV — lazy scan with type inference
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

# GeoTIFF — climate rasters as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# SQLite — zero-dependency, no DBI required
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()
```

For repeated queries, convert to vectra’s native .vtr format for faster reads:
```r
write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()
```

Append new data without rewriting the file, or do a key-based diff between two snapshots:
```r
# Append new rows as a new row group — existing data untouched
append_vtr(new_rows_df, "data.vtr")

# Logical diff: what was added or deleted between two snapshots?
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)  # rows present in new but not old
d$deleted         # key values present in old but not new
```

Fuzzy string matching runs inside the C engine, with no round-trip to R:
```r
tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()
```

Use explain() to inspect the optimized query plan.
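A minimal call pattern (file and column names are placeholders, and the printed plan text depends on the engine version):

```r
# Ending a pipeline with explain() prints the optimized plan
# instead of executing the query.
tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  select(region, value) |>
  explain()
```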
## Why vectra
Querying large datasets in R usually means Arrow (requires compiled binaries matching your platform), DuckDB (links a 30 MB bundled library), or Spark (requires a JVM and cluster configuration).
vectra is a self-contained C11 engine compiled as a standard R extension. No external libraries beyond zlib, no JVM, no runtime configuration. It provides:
- Streaming execution: data flows one row group at a time, never fully in memory
- Zero-copy filtering: selection vectors avoid row duplication
- Query optimizer: column pruning skips unneeded columns at scan; predicate pushdown uses per-rowgroup min/max statistics to skip entire row groups
- Hash joins: build right, stream left — join a 50 GB fact table against a lookup without materializing both
- External sort: 1 GB memory budget with automatic spill-to-disk
- Window functions: row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax()
- String expressions: nchar(), substr(), grepl() evaluated in the engine without round-tripping to R
- Multiple data sources: .vtr, CSV, SQLite, GeoTIFF — all produce the same lazy query nodes
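The zone-map idea behind predicate pushdown fits in a few lines of base R (illustrative data, not vectra internals): each row group records per-column min/max, and a filter consults those bounds before touching any rows.

```r
# Three row groups with per-column min/max "zone maps" for column x.
row_groups <- list(
  list(min = 1,  max = 20),
  list(min = 21, max = 40),
  list(min = 41, max = 60)
)

# For the predicate x > 35, any group whose max is <= 35 can be skipped
# without reading a single row.
scanned <- which(vapply(row_groups, function(g) g$max > 35, logical(1)))
scanned  # only groups 2 and 3 are read; group 1 is pruned
```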
## Features
Full tidyselect support in select(), rename(), relocate(), and across(): starts_with(), ends_with(), contains(), matches(), where(), everything(), all_of(), any_of().
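For example, with hypothetical file and column names, the helpers compose inside a pipeline just as in dplyr:

```r
# Keep all temperature columns plus two named ones, then rename and collect.
# "data.vtr", temp_*, station, and year are placeholder names.
tbl("data.vtr") |>
  select(starts_with("temp_"), any_of(c("station", "year"))) |>
  rename(site = station) |>
  collect()
```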
## Installation
```r
# CRAN
install.packages("vectra")

# Development version
pak::pak("gcol33/vectra")
```

## Documentation
- Getting Started — Full walkthrough with runnable examples
- Format Backends — CSV, SQLite, Excel, GeoTIFF, and streaming conversion pipelines
- Joins — All join types, fuzzy joins, key coercion, and memory model
- String Operations — Pattern matching, fuzzy matching, and block lookups
- Indexing and Optimization — Hash indexes, zone-map pruning, column pruning, and reading explain() output
- Working with Large Data — Streaming pipelines, append/delete/diff, external sort, and memory budgeting
- Engine Reference — Execution model, types, coercion, .vtr format, and limitations
- Function Reference