pkgdown/mathjax-config.html

Skip to contents

vectra 0.5.0

Compression backend rewire

  • Replaced the bespoke v4 codec with tdc, a standalone typed-dimensional compression library vendored into src/tdc/. Encode and decode go through a self-describing block record (model + transform chain + entropy) rather than per-column tag constants. Deleted vtr_codec.c, vtr_encodings.c, vtr_compress.c, vtr1.c, and vtr_codec_internal.h.
  • The .vtr on-disk format is a deliberate breaking change: pre-0.5 files are not readable. write_vtr() and append_vtr() write the new container; tbl() reads only the new container.
  • Per-row-group column statistics (min/max) are carried in the container index so the scan layer can still prune unreachable row groups.
  • Parallel row-group reads are preserved.
  • Custom vendoring via tools/vendor_tdc.sh and configure / configure.win pull the latest upstream tdc on every install when the source checkout is present; the pre-vendored copy is used otherwise.

Known regression

  • The v4 dict-defer CHARSXP fast path is gone — duplicate strings now hit R’s CHARSXP hash per row. Will be re-implemented on top of tdc’s dictionary-encoded varlen output when it becomes a hot spot.

Fixes

  • man/write_vtr.Rd: replaced a literal percent sign in the compress argument description that produced malformed Rd output on build.
  • Windows: write_vtr(), append_vtr() and delete_vtr() now use MoveFileEx with a short retry loop for the final temp-to-target swap. Previously, a preceding tbl() read could leave the target file mmap’d pending GC, and the swap would fail with a sharing violation.

vectra 0.4.1

Star schema and lookup

  • New vtr_schema(), link(), and lookup() functions for star-schema workflows. Register a fact table with named dimension links once, then pull columns from any dimension without writing explicit joins. Only referenced dimensions are scanned.
  • lookup() reports unmatched keys per dimension by default, catching referential integrity issues before they propagate NAs silently.
  • Supports both "left" (default) and "inner" join modes, named keys for differing column names, and reusable schema objects across multiple queries.

vectra 0.3.2

  • Fix misaligned int64_t memory access in vtr_codec.c (UBSAN). Dictionary encoding wrote and read 8-byte offsets through an unaligned pointer; delta decoding had the same issue. All fixed with memcpy.

vectra 0.3.1

  • CRAN submission fixes: title case, quoted technical terms in DESCRIPTION, corrected documentation URLs.

vectra 0.3.0

File operations

  • append_vtr(df, path): append a data.frame as a new row group to an existing .vtr file. Existing row groups are never rewritten.
  • delete_vtr(path, row_ids): logically delete rows by 0-based physical index. Writes a tombstone side file (<path>.del); the .vtr file is never modified. Deletions are cumulative and excluded automatically on the next tbl() call.
  • diff_vtr(old_path, new_path, key_col): key-based logical diff between two .vtr files. Returns a list with added (a lazy vectra_node) and deleted (a vector of key values). Implemented as a single-pass C streaming engine with O(n_unique_keys) memory.

Expressions

  • tolower(), toupper(), trimws(): case conversion and whitespace trimming for string columns in filter() and mutate().
  • levenshtein(x, y) / levenshtein_norm(x, y): Levenshtein edit distance and normalised variant (0–1). Supports column-vs-column and column-vs-literal comparisons. Optional max_dist argument for early termination.
  • dl_dist(x, y) / dl_dist_norm(x, y): Damerau-Levenshtein distance (counts transpositions as cost 1) and normalised variant.
  • jaro_winkler(x, y): Jaro-Winkler similarity (0–1, higher = more similar). All string-similarity functions propagate NA and work in filter() and mutate().
  • resolve(fk, pk, value): scalar self-join — looks up value where pk == fk within the same batch. Useful for denormalising parent-child tables without a join.
  • propagate(parent_id, id, seed): tree-traversal aggregation — propagates non-NA seed values down a parent-child hierarchy until all reachable nodes are filled. Converges in O(depth) passes.

Format

  • .vtr format version 4 with a two-layer codec (no external dependencies):
    • Encoding: PLAIN (default), DICTIONARY (string columns with < 50% unique values), DELTA (monotonically increasing int64 columns).
    • Compression: custom LZ77 byte compressor (LZ_VTR, ~120 lines of C). Applied after encoding; skipped for buffers < 64 bytes or when compression does not reduce size. Files written with v4 are typically 30–60% smaller than v3. tbl() reads v1–v4 files; write_vtr() always writes v4.

vectra 0.2.2

Query optimizer

  • Column pruning: scan nodes only read columns needed by the query plan.
  • Predicate pushdown: filter predicates are attached to scan nodes and use .vtr v3 per-rowgroup min/max statistics to skip entire row groups.

Engine

  • .vtr format version 3 with per-column per-rowgroup statistics (min/max).
  • O(n log n) rank() and dense_rank() (replaces O(n²) comparison-based).
  • Nested expressions in summarise(): summarise(m = mean(x + y)) auto-inserts a hidden mutate.

Expressions

  • year(), month(), day(), hour(), minute(), second(): date/time component extraction for Date and POSIXct columns.
  • as.Date() and as.POSIXct() literals in filter expressions (e.g. filter(date > as.Date("2020-01-01"))).
  • as.Date(string_col): convert ISO-format date strings to Date values.
  • nchar(): returns string length as integer.
  • substr(x, start, stop): substring extraction (1-based, like R).
  • grepl(pattern, x): fixed string matching (no regex).
  • paste0(a, b): two-argument string concatenation.
  • gsub(pattern, replacement, x) / sub(): fixed-string replacement.
  • startsWith() / endsWith(): string prefix/suffix matching.
  • pmin() / pmax(): element-wise minimum/maximum.
  • log2(), log10(), sign(), trunc(): additional math functions.

Aggregation

  • sd() and var(): sample standard deviation and variance via Welford’s online algorithm. Returns NA for groups with fewer than 2 values (R semantics).
  • first() and last(): first and last non-NA value per group. Both support na.rm = TRUE.

Verbs

  • slice_min() and slice_max() gain a working with_ties parameter (default TRUE). Ties at the boundary are now included by default; use with_ties = FALSE for exactly n rows.
  • count() and tally() gain a working sort parameter. sort = TRUE returns results in descending order of the count column.
  • transmute() and reframe() now support across().
  • distinct(.keep_all = TRUE) with a column subset now emits a message when falling back to R.

Utilities

  • glimpse(): preview column names, types, and first few values without collecting the full result.
  • collect() now works on data.frames (no-op), so slice_min(...) |> collect() works regardless of the with_ties path.

Documentation

vectra 0.2.1

Engine

  • External merge sort with 1 GB memory budget and automatic spill-to-disk.
  • Sort-based group_by() |> summarise() path for spill-safe aggregation.
  • Chunked FULL join finalize (65,536 rows per batch).
  • Automatic type coercion (int64 <-> double) in join keys and bind_rows().
  • rank() and dense_rank() window functions.

Type system

Infrastructure

  • Engine reference vignette (vignette("engine")).
  • 17-scenario benchmark suite with baseline snapshots and regression thresholds.
  • ASAN/UBSAN CI job on Linux.
  • Benchmark smoke job on PRs.

vectra 0.1.0