String Operations and Fuzzy Matching
Gilles Colling
2026-04-04
Source: vignettes/string-ops.Rmd
Introduction
String-heavy datasets are common in ecology. Species names arrive with typos, observer notes contain free-text descriptions, and site labels mix upper and lower case in ways that break joins. Cleaning and matching these strings is often the slowest part of an analysis pipeline, both in wall-clock time and in the number of lines of R code required.
vectra handles all string operations inside its C engine, operating
directly on raw byte buffers rather than going through R’s internal
CHARSXP string pool. When you write
mutate(name_clean = tolower(trimws(species))), that
expression is serialized into an execution plan and evaluated in C as
data flows through the pipeline. No temporary R character vectors are
allocated, no intermediate copies exist, and the global string cache
that R maintains for every unique character value is bypassed
entirely.
This architecture has two practical consequences. First, string
operations scale linearly with data size and do not suffer from R’s O(n
log n) string interning overhead. Second, the CPU-intensive operations
(regex matching, edit-distance computation) can be parallelized with
OpenMP because the C engine owns the memory layout. For datasets with
more than a thousand rows, vectra automatically distributes
grepl (in regex mode), levenshtein, dl_dist, and
jaro_winkler across available cores.
This vignette walks through every string operation vectra supports,
from basic transformations through regex pattern matching to fuzzy
joins. We will use ecological data throughout: species names with
realistic errors, site labels from field surveys, and observer notes
that need cleaning before analysis. The examples are deliberately small
so that outputs fit on screen, but every operation shown here works
identically on datasets with millions of rows, streaming through memory
one row group at a time without materializing the full result until
collect() is called.
Basic string functions
We will start with a small dataset of tree observations where names need cleaning. The data has trailing spaces, inconsistent case, and a few columns we will want to slice apart.
library(vectra)
#>
#> Attaching package: 'vectra'
#> The following object is masked from 'package:stats':
#>
#> filter
trees <- data.frame(
species = c("Quercus robur ", " fagus sylvatica", "ACER platanoides",
"Betula pendula", " Pinus SYLVESTRIS "),
site = c("plot_A01", "plot_B12", "plot_A01", "plot_C03", "plot_B12"),
observer = c("J. Smith", "A. Mueller", "J. Smith",
"B. Novak", "A. Mueller"),
dbh_cm = c(45.2, 38.1, 22.7, 31.0, 55.3),
stringsAsFactors = FALSE
)
f <- tempfile(fileext = ".vtr")
write_vtr(trees, f)
Trimming whitespace
trimws() strips leading and trailing whitespace from
every value in a column. The C engine operates on the raw bytes, so the
overhead per string is a single pass to find the first and last
non-space characters. We apply it inside mutate() just as
we would in base R.
tbl(f) |>
mutate(species_clean = trimws(species)) |>
select(species, species_clean) |>
collect()
#> species species_clean
#> 1 Quercus robur Quercus robur
#> 2 fagus sylvatica fagus sylvatica
#> 3 ACER platanoides ACER platanoides
#> 4 Betula pendula Betula pendula
#> 5 Pinus SYLVESTRIS Pinus SYLVESTRIS
The original column is preserved alongside the cleaned version. If we
no longer need it, a subsequent select() drops it from the
pipeline before any downstream node sees it.
Case conversion
toupper() and tolower() convert ASCII
characters in place. These are pure byte-level operations (A-Z mapped to
a-z and vice versa), so they are effectively free compared to the I/O
cost of reading the column from disk.
tbl(f) |>
mutate(species_lower = tolower(trimws(species)),
species_upper = toupper(trimws(species))) |>
select(species_lower, species_upper) |>
collect()
#> species_lower species_upper
#> 1 quercus robur QUERCUS ROBUR
#> 2 fagus sylvatica FAGUS SYLVATICA
#> 3 acer platanoides ACER PLATANOIDES
#> 4 betula pendula BETULA PENDULA
#> 5 pinus sylvestris PINUS SYLVESTRIS
Chaining tolower(trimws(...)) creates a single
expression node in the plan. The engine evaluates the innermost function
first and passes the result directly to the outer one, with no
intermediate allocation visible from R.
String length
nchar() returns the byte length of each string as a
double (vectra’s integer type). This is useful for filtering out
suspiciously short or long entries that might indicate data-entry
errors.
tbl(f) |>
mutate(name_len = nchar(trimws(species))) |>
select(species, name_len) |>
collect()
#> species name_len
#> 1 Quercus robur 13
#> 2 fagus sylvatica 15
#> 3 ACER platanoides 16
#> 4 Betula pendula 14
#> 5 Pinus SYLVESTRIS 16
Because vectra operates on raw bytes, nchar() returns
the byte length, not the number of Unicode code points. For pure ASCII
text (English letters, digits, common punctuation), bytes and characters
are the same thing. But for multi-byte UTF-8 characters such as accented
letters (“ü” = 2 bytes, “é” = 2 bytes) or CJK ideographs (3 bytes each),
the reported length will be higher than what base::nchar()
returns in R. The same applies to substr(): its start and
stop positions refer to byte offsets. In ecological data, this rarely
matters because taxonomic names and site codes are overwhelmingly ASCII,
but it is worth keeping in mind if your data contains place names with
diacritics or observer names in non-Latin scripts.
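The distinction is easy to see outside R. The following short Python sketch (illustrative only, not vectra code) compares byte counts with code-point counts for the kinds of strings discussed above:

```python
# Byte length vs. code-point length under UTF-8.
# A byte-oriented engine reports the first number; a code-point count
# (what base R's nchar() returns by default) gives the second.
for s in ["Quercus", "\u00e9", "\u00fc", "\u6728"]:  # ASCII, e-acute, u-umlaut, a CJK ideograph
    print(repr(s), len(s.encode("utf-8")), len(s))
```

For pure ASCII the two numbers agree; each accented Latin letter adds one extra byte, and each CJK ideograph adds two.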
Extracting substrings
substr(x, start, stop) and
substring(x, first, last) both extract a portion of each
string by position. Positions are 1-based, following R convention. These
are useful for pulling fixed-width codes out of identifier columns.
tbl(f) |>
mutate(plot_letter = substr(site, 6, 6),
plot_number = substr(site, 7, 8)) |>
select(site, plot_letter, plot_number) |>
collect()
#> site plot_letter plot_number
#> 1 plot_A01 A 01
#> 2 plot_B12 B 12
#> 3 plot_A01 A 01
#> 4 plot_C03 C 03
#> 5 plot_B12 B 12
The site codes follow a plot_X## pattern, so character 6
is the letter and characters 7-8 are the numeric part. Both are returned
as strings.
Prefix and suffix tests
startsWith() and endsWith() return a
logical column. They take a literal string argument (not a column
reference) and match it against the beginning or end of each value.
Because the comparison is a simple memcmp on the byte prefix or suffix,
it runs faster than a regex equivalent.
tbl(f) |>
filter(startsWith(site, "plot_A")) |>
select(species, site) |>
collect()
#> species site
#> 1 Quercus robur plot_A01
#> 2 ACER platanoides plot_A01
tbl(f) |>
mutate(is_mueller = endsWith(observer, "Mueller")) |>
select(observer, is_mueller) |>
collect()
#> observer is_mueller
#> 1 J. Smith FALSE
#> 2 A. Mueller TRUE
#> 3 J. Smith FALSE
#> 4 B. Novak FALSE
#> 5 A. Mueller TRUE
startsWith and endsWith are particularly
useful in filter() for subsetting data by site code
prefixes or file path suffixes, since the C engine can evaluate them
without compiling a regex.
String concatenation
paste() and paste0() concatenate an
arbitrary number of arguments into a single string column. These are
N-ary functions: you can pass two columns, five columns, or a mix of
columns and literal strings. The C engine’s two-pass string builder
computes the total output length across all arguments first, allocates a
single buffer, and fills it in one sweep.
paste0: no separator
paste0() joins its arguments with no separator, which is
the common case for building identifiers or composite keys.
tbl(f) |>
mutate(record_id = paste0(site, "_", trimws(species))) |>
select(record_id) |>
collect()
#> record_id
#> 1 plot_A01_Quercus robur
#> 2 plot_B12_fagus sylvatica
#> 3 plot_A01_ACER platanoides
#> 4 plot_C03_Betula pendula
#> 5 plot_B12_Pinus SYLVESTRIS
Here we pass three arguments: the site column, a literal
underscore, and the trimmed species name. The engine handles mixed
column-and-literal arguments natively.
paste: custom separator
paste() uses a space as the default separator, matching
base R behaviour. The sep parameter overrides this.
Multi-column paste
When building composite labels from more than two columns, passing
all of them in a single paste() call is cleaner than
nesting multiple paste0() calls.
tbl(f) |>
mutate(full_label = paste(observer, site,
trimws(species), sep = " | ")) |>
select(full_label) |>
collect()
#> full_label
#> 1 J. Smith | plot_A01 | Quercus robur
#> 2 A. Mueller | plot_B12 | fagus sylvatica
#> 3 J. Smith | plot_A01 | ACER platanoides
#> 4 B. Novak | plot_C03 | Betula pendula
#> 5 A. Mueller | plot_B12 | Pinus SYLVESTRIS
The engine constructs each output string in a single allocation regardless of how many arguments are passed, so there is no performance penalty for concatenating five columns versus two.
Pattern matching: fixed strings
vectra’s grepl(), gsub(), and
sub() accept a fixed parameter that controls
whether the pattern is treated as a literal byte sequence or as a POSIX
regular expression. When fixed = TRUE (the default in
vectra), the engine uses a direct substring search, which avoids the
cost of compiling a regex automaton. For exact substring tests and
simple replacements, this is the right choice.
Note that vectra defaults to fixed = TRUE, which differs
from base R where fixed defaults to FALSE. We
chose this default because in our experience the majority of string
filtering and replacement tasks in data cleaning involve literal
substrings, not regex patterns. When you need actual regex, pass
fixed = FALSE explicitly. This is covered in the next
section.
Filtering with grepl
grepl(pattern, x, fixed = TRUE) returns a logical column
that is TRUE wherever the pattern appears as a literal
substring. We use it here to find all records where the observer’s
surname contains “Smith”.
tbl(f) |>
filter(grepl("Smith", observer, fixed = TRUE)) |>
collect()
#> species site observer dbh_cm
#> 1 Quercus robur plot_A01 J. Smith 45.2
#> 2 ACER platanoides plot_A01 J. Smith 22.7
Two rows match, both recorded by J. Smith. The fixed-string search scans each byte buffer with a Boyer-Moore-style skip, so it scales well even on columns with long string values.
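For intuition, here is a Python sketch of a Boyer-Moore-Horspool substring search, the family of skip-based scans referred to above. This is illustrative only; the C engine's actual implementation is not shown here and may differ:

```python
def horspool_find(text: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.
    Illustrative sketch of a Boyer-Moore-style skip search."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: how far the window can shift when the byte
    # under its last position mismatches.
    shift = {b: m - i - 1 for i, b in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Jump by the table entry for the byte under the window's end
        # (or the full pattern length if that byte never occurs in it).
        i += shift.get(text[i + m - 1], m)
    return -1
```

The skip table is what lets the scan advance several bytes at a time on long values, rather than re-testing every position.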
Replacing with gsub and sub
gsub(pattern, replacement, x, fixed = TRUE) replaces
every occurrence of the pattern; sub() replaces only the
first. Both accept fixed = TRUE for literal matching and
return a new string column.
A common cleaning task is standardizing separators. Suppose our site codes sometimes use underscores and sometimes hyphens.
messy_sites <- data.frame(
site = c("plot_A01", "plot-B12", "plot_A01", "plot-C03", "plot_B12"),
stringsAsFactors = FALSE
)
f2 <- tempfile(fileext = ".vtr")
write_vtr(messy_sites, f2)
tbl(f2) |>
mutate(site_clean = gsub("-", "_", site, fixed = TRUE)) |>
collect()
#> site site_clean
#> 1 plot_A01 plot_A01
#> 2 plot-B12 plot_B12
#> 3 plot_A01 plot_A01
#> 4 plot-C03 plot_C03
#> 5 plot_B12 plot_B12
Every hyphen becomes an underscore. Since fixed = TRUE
is the default, writing gsub("-", "_", site) without the
explicit argument produces the same result.
sub: first match only
sub() replaces only the first occurrence. This is useful
when a pattern might appear multiple times and we want to change just
the leading instance.
notes <- data.frame(
note = c("tree dead, bark loose, dead branches",
"alive, healthy canopy",
"dead standing, no bark"),
stringsAsFactors = FALSE
)
f3 <- tempfile(fileext = ".vtr")
write_vtr(notes, f3)
tbl(f3) |>
mutate(note_edited = sub("dead", "DEAD", note, fixed = TRUE)) |>
select(note, note_edited) |>
collect()
#> note note_edited
#> 1 tree dead, bark loose, dead branches tree DEAD, bark loose, dead branches
#> 2 alive, healthy canopy alive, healthy canopy
#> 3 dead standing, no bark DEAD standing, no bark
In the first row, only the first “dead” is uppercased; the second
“dead” in “dead branches” remains unchanged. gsub would
have replaced both.
Pattern matching: regex
When literal substring matching is not enough, vectra supports full
POSIX regular expressions by passing fixed = FALSE. The C
engine compiles each regex pattern once per batch using
regcomp() and applies it across all rows. For batches with
more than 1000 rows, the matching loop is parallelized with OpenMP, with
each thread compiling its own copy of the regex for thread safety.
The regex dialect is POSIX Extended Regular Expressions (ERE), the
same dialect used by grep -E on Unix systems. This means
character classes like [A-Z], quantifiers +,
*, ?, alternation with |, and
grouping with () all work as expected. Backreferences in
replacement strings (\\1, \\2) refer to
captured groups. Perl-specific extensions like lookaheads, lookbehinds,
and non-greedy quantifiers (*?) are not supported; this is
a deliberate constraint that keeps the regex engine simple and
predictable.
Regex filtering with grepl
We will build a dataset with species names that have various
formatting issues. The regex ^[A-Z][a-z]+ [a-z]+$ matches
properly formatted binomial names (capital genus, lowercase epithet,
single space).
species_list <- data.frame(
name = c("Quercus robur", "fagus sylvatica", "ACER PLATANOIDES",
"Betula pendula", "Pinus  sylvestris", "Tilia cordata"),
stringsAsFactors = FALSE
)
f4 <- tempfile(fileext = ".vtr")
write_vtr(species_list, f4)
tbl(f4) |>
mutate(valid_format = grepl("^[A-Z][a-z]+ [a-z]+$",
name, fixed = FALSE)) |>
collect()
#> name valid_format
#> 1 Quercus robur TRUE
#> 2 fagus sylvatica FALSE
#> 3 ACER PLATANOIDES FALSE
#> 4 Betula pendula TRUE
#> 5 Pinus  sylvestris FALSE
#> 6 Tilia cordata TRUE
Three names fail: “fagus sylvatica” (lowercase genus), “ACER PLATANOIDES” (all caps), and “Pinus  sylvestris” (double space). This kind of validation pass is a useful first step before attempting joins against a reference taxonomy.
Regex replacement with gsub
gsub() with fixed = FALSE supports POSIX
regex in the pattern and backreferences in the replacement string.
Backreferences \\1, \\2, etc. refer to
captured groups in the pattern.
Suppose we want to reformat species names from “Genus epithet” to “G. epithet” for compact display labels.
tbl(f4) |>
mutate(short_name = gsub("^([A-Za-z])[a-z]+ ([a-z]+)$",
"\\1. \\2",
name, fixed = FALSE)) |>
select(name, short_name) |>
collect()
#> name short_name
#> 1 Quercus robur Q. robur
#> 2 fagus sylvatica f. sylvatica
#> 3 ACER PLATANOIDES ACER PLATANOIDES
#> 4 Betula pendula B. pendula
#> 5 Pinus  sylvestris Pinus  sylvestris
#> 6 Tilia cordata T. cordata
The pattern captures the first letter of the genus as group
\\1 and the entire epithet as group \\2. Rows
that do not match the pattern are returned unchanged, which is
consistent with base R’s gsub behaviour.
sub with regex: first match only
sub() with regex works the same way but replaces only
the first match. This is handy for stripping a leading prefix that
varies in length.
records <- data.frame(
code = c("ID:001-Quercus", "ID:042-Fagus",
"ID:007-Betula", "ID:113-Pinus"),
stringsAsFactors = FALSE
)
f5 <- tempfile(fileext = ".vtr")
write_vtr(records, f5)
tbl(f5) |>
mutate(genus = sub("^ID:[0-9]+-", "", code, fixed = FALSE)) |>
select(code, genus) |>
collect()
#> code genus
#> 1 ID:001-Quercus Quercus
#> 2 ID:042-Fagus Fagus
#> 3 ID:007-Betula Betula
#> 4 ID:113-Pinus Pinus
The regex ^ID:[0-9]+- matches the literal “ID:” prefix,
one or more digits, and the trailing hyphen. Replacing it with an empty
string leaves just the genus name.
str_extract: capturing structured parts
str_extract(x, pattern) returns the first match of the
regex pattern, or NA if no match is found. When the pattern
contains a capture group (round brackets), str_extract
returns the contents of that group rather than the full match.
This is particularly useful for pulling structured components out of semi-formatted strings. Here we extract numeric plot identifiers from site codes.
sites <- data.frame(
site_code = c("Forest_Plot_042", "Meadow_Transect_007",
"Forest_Plot_113", "Wetland_Quad_019"),
stringsAsFactors = FALSE
)
f6 <- tempfile(fileext = ".vtr")
write_vtr(sites, f6)
tbl(f6) |>
mutate(plot_num = str_extract(site_code, "([0-9]+)")) |>
select(site_code, plot_num) |>
collect()
#> site_code plot_num
#> 1 Forest_Plot_042 042
#> 2 Meadow_Transect_007 007
#> 3 Forest_Plot_113 113
#> 4 Wetland_Quad_019 019
The capture group ([0-9]+) matches one or more digits.
Since there is a group, str_extract returns the group
content. Without the parentheses, it would return the full match, which
in this case is the same.
A more interesting example extracts the habitat type from the site code. The pattern uses a group to capture everything before the first underscore.
tbl(f6) |>
mutate(habitat = str_extract(site_code, "^([A-Za-z]+)_")) |>
select(site_code, habitat) |>
collect()
#> site_code habitat
#> 1 Forest_Plot_042 Forest
#> 2 Meadow_Transect_007 Meadow
#> 3 Forest_Plot_113 Forest
#> 4 Wetland_Quad_019 Wetland
The group captures “Forest”, “Meadow”, “Forest”, “Wetland” while the
trailing underscore is part of the match but not part of the group. This
approach is cleaner than combining substr with a computed
position.
Fuzzy string matching
Exact string matching breaks down when data comes from multiple sources with inconsistent spelling. A field botanist writes “Qurecus robur” (transposed letters), a database export has “Fagus silvatica” (single-letter substitution), and a citizen-science platform records “quercus ROBUR” (case mismatch). Fuzzy string matching quantifies how different two strings are and lets us match records that are close enough.
vectra provides three families of distance functions, all usable
inside mutate() and filter(). Each takes two
string arguments (columns or literals) and returns a numeric distance or
similarity score. The raw variants (levenshtein,
dl_dist) return integer edit counts; the normalized
variants (levenshtein_norm, dl_dist_norm)
divide by the length of the longer string to produce a value between 0
and 1; and jaro_winkler returns a similarity score where 1
means identical.
Levenshtein distance
levenshtein(x, y) counts the minimum number of
single-character edits (insertions, deletions, substitutions) needed to
transform x into y. The result is a
non-negative integer. levenshtein_norm(x, y) divides by the
length of the longer string, producing a value between 0 (identical) and
1 (completely different).
typos <- data.frame(
field_name = c("Qurecus robur", "Fagus silvatica",
"Acer platanodes", "Betula pendula",
"Pinus sylvestrs"),
ref_name = c("Quercus robur", "Fagus sylvatica",
"Acer platanoides", "Betula pendula",
"Pinus sylvestris"),
stringsAsFactors = FALSE
)
f7 <- tempfile(fileext = ".vtr")
write_vtr(typos, f7)
tbl(f7) |>
mutate(lev = levenshtein(field_name, ref_name),
lev_norm = levenshtein_norm(field_name, ref_name)) |>
collect()
#> field_name ref_name lev lev_norm
#> 1 Qurecus robur Quercus robur 2 0.15384615
#> 2 Fagus silvatica Fagus sylvatica 1 0.06666667
#> 3 Acer platanodes Acer platanoides 1 0.06250000
#> 4 Betula pendula Betula pendula 0 0.00000000
#> 5 Pinus sylvestrs Pinus sylvestris 1 0.06250000
“Qurecus robur” vs “Quercus robur” has a Levenshtein distance of 2 (the transposed “re” to “er” counts as two operations: delete and insert, or two substitutions depending on the alignment). “Betula pendula” matches exactly, giving a distance of 0.
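The distances above come from the classic Wagner-Fischer dynamic-programming recurrence. This Python sketch (illustrative; vectra computes this in C) reproduces the values in the table:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    to turn a into b (Wagner-Fischer, two-row form)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_norm(a: str, b: str) -> float:
    """Distance divided by the longer string's length (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For example, "Qurecus robur" vs "Quercus robur" gives 2, and 2/13 matches the 0.15384615 shown above.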
Damerau-Levenshtein distance
dl_dist(x, y) extends Levenshtein by counting
transpositions of two adjacent characters as a single edit operation.
This better reflects the kinds of errors humans make when typing.
dl_dist_norm(x, y) is the normalized variant.
tbl(f7) |>
mutate(dl = dl_dist(field_name, ref_name),
dl_norm = dl_dist_norm(field_name, ref_name)) |>
collect()
#> field_name ref_name dl dl_norm
#> 1 Qurecus robur Quercus robur 1 0.07692308
#> 2 Fagus silvatica Fagus sylvatica 1 0.06666667
#> 3 Acer platanodes Acer platanoides 1 0.06250000
#> 4 Betula pendula Betula pendula 0 0.00000000
#> 5 Pinus sylvestrs Pinus sylvestris 1 0.06250000
Compare the first row: “Qurecus” vs “Quercus” is a transposition of “r” and “e”. Levenshtein counts this as 2 edits, but Damerau-Levenshtein counts it as 1. For human-typed data, DL distance is usually the better default because transposition errors are common.
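Adding one case to the Levenshtein recurrence yields the restricted (optimal string alignment) form of Damerau-Levenshtein. This Python sketch assumes that variant, which agrees with the distances shown above:

```python
def dl_dist(a: str, b: str) -> int:
    """Optimal-string-alignment variant of Damerau-Levenshtein (sketch):
    an adjacent transposition counts as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Extra case: swap of two adjacent characters.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this extra case, "Qurecus robur" vs "Quercus robur" drops from 2 to 1.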
Jaro-Winkler similarity
jaro_winkler(x, y) returns a similarity score between 0
and 1, where 1 means the strings are identical. Unlike the edit-distance
functions, higher values indicate better matches. Jaro-Winkler gives
extra weight to matching prefixes, making it particularly effective for
personal names and short strings where the first few characters are
usually correct.
tbl(f7) |>
mutate(jw = jaro_winkler(field_name, ref_name)) |>
collect()
#> field_name ref_name jw
#> 1 Qurecus robur Quercus robur 0.9794872
#> 2 Fagus silvatica Fagus sylvatica 0.9447619
#> 3 Acer platanodes Acer platanoides 0.9875000
#> 4 Betula pendula Betula pendula 1.0000000
#> 5 Pinus sylvestrs Pinus sylvestris 0.9875000
The scores are all above 0.9 for these examples because the errors are small relative to string length. Jaro-Winkler is less informative for long strings with small differences (everything scores above 0.95) but excels at distinguishing between short strings like surnames.
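For reference, here is the standard Jaro-Winkler formulation as a Python sketch (vectra's C implementation may handle edge cases differently). It reproduces the 0.9794872 in the first row of the table above:

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: average of match ratios and transposition ratio."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    m = 0
    for i in range(len(a)):  # count characters matching within the window
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and a[i] == b[j]:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    a_m = [c for c, h in zip(a, a_hit) if h]
    b_m = [c for c, h in zip(b, b_hit) if h]
    t = sum(x != y for x, y in zip(a_m, b_m)) / 2  # transpositions
    return (m / len(a) + m / len(b) + (m - t) / m) / 3

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Boost the Jaro score by the common prefix length (capped at 4)."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

The prefix boost is why the metric favours strings whose first few characters agree: a two-character common prefix lifts "Qurecus robur" from a Jaro score of 0.9744 to 0.9795.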
Choosing a distance metric
The choice depends on the error profile of your data:
Damerau-Levenshtein (dl_dist / dl_dist_norm): the best general default. Handles insertions, deletions, substitutions, and transpositions. Use this for species names, place names, and any free-text field where typos are the primary error source.
Levenshtein (levenshtein / levenshtein_norm): appropriate when transpositions are rare or when you need compatibility with systems that use classical Levenshtein. The scores will be slightly more conservative (higher distance) than DL for transposed strings.
Jaro-Winkler (jaro_winkler): best for short strings (under 15 characters) and personal names. The prefix weighting helps when the first few characters are reliable and errors tend to occur later in the string. Less useful for long species names, where a single-character change barely moves the score.
Filtering by distance threshold
In practice, we often want to flag records that exceed some distance threshold from a reference value. Here we filter for species entries that are within a normalized DL distance of 0.15 from “Quercus robur”.
field_data <- data.frame(
species = c("Qurecus robur", "Quercus robor", "Fagus sylvatica",
"Quercus robur", "Quercis rubur", "Betula pendula"),
plot = c("A1", "A2", "B1", "A3", "B2", "C1"),
stringsAsFactors = FALSE
)
f8 <- tempfile(fileext = ".vtr")
write_vtr(field_data, f8)
tbl(f8) |>
mutate(dist = dl_dist_norm(species, "Quercus robur")) |>
filter(dist < 0.15) |>
collect()
#> species plot dist
#> 1 Qurecus robur A1 0.07692308
#> 2 Quercus robor A2 0.07692308
#> 3 Quercus robur A3 0.00000000
Only entries close to “Quercus robur” survive the filter. “Fagus sylvatica” and “Betula pendula” are too distant, and “Quercis rubur” falls just outside the cutoff: its two substitutions give a normalized distance of 2/13 ≈ 0.154, slightly above the 0.15 threshold.
Ranking by similarity
When matching against multiple candidates, sorting by distance gives us the best matches first.
tbl(f8) |>
mutate(sim = jaro_winkler(species, "Quercus robur")) |>
arrange(desc(sim)) |>
collect()
#> species plot sim
#> 1 Quercus robur A3 1.0000000
#> 2 Qurecus robur A1 0.9794872
#> 3 Quercus robor A2 0.9692308
#> 4 Quercis rubur B2 0.9358974
#> 5 Fagus sylvatica B1 0.4769231
#> 6 Betula pendula C1 0.3644689
The exact match scores 1.0, and the candidates with small typos rank next. This pattern is useful for building a shortlist of probable matches to review manually.
Fuzzy joins
Fuzzy matching within mutate() works when we already
have paired columns to compare. More often, we have two separate tables
and need to find which rows in a messy dataset correspond to which rows
in a clean reference. This is where fuzzy_join() comes
in.
Basic fuzzy join
fuzzy_join(x, y, by, method, max_dist) joins two vectra
tables using approximate string matching. The by argument
is a named character vector of length 1 specifying which columns to
compare. The result includes all columns from both sides plus a
fuzzy_dist column with the normalized distance of each
match.
ref <- data.frame(
canonical = c("Quercus robur", "Fagus sylvatica",
"Acer platanoides", "Betula pendula",
"Pinus sylvestris", "Tilia cordata"),
family = c("Fagaceae", "Fagaceae", "Sapindaceae",
"Betulaceae", "Pinaceae", "Malvaceae"),
stringsAsFactors = FALSE
)
f_ref <- tempfile(fileext = ".vtr")
write_vtr(ref, f_ref)
messy <- data.frame(
field_species = c("Qurecus robur", "Fagus silvatica",
"Acer platanodes", "Betla pendula",
"Pinis sylvestris"),
count = c(12L, 7L, 3L, 15L, 9L),
stringsAsFactors = FALSE
)
f_messy <- tempfile(fileext = ".vtr")
write_vtr(messy, f_messy)
fuzzy_join(
tbl(f_messy), tbl(f_ref),
by = c("field_species" = "canonical"),
method = "dl",
max_dist = 0.25
) |> collect()
#> field_species count canonical family fuzzy_dist
#> 1 Qurecus robur 12 Quercus robur Fagaceae 0.07692308
#> 2 Fagus silvatica 7 Fagus sylvatica Fagaceae 0.06666667
#> 3 Acer platanodes 3 Acer platanoides Sapindaceae 0.06250000
#> 4 Betla pendula 15 Betula pendula Betulaceae 0.07142857
#> 5 Pinis sylvestris 9 Pinus sylvestris Pinaceae 0.06250000
Each messy field name finds its closest match in the reference table.
The fuzzy_dist column shows how far apart they are, which
is useful for flagging matches that might need manual verification (say,
anything above 0.15).
Methods: dl, levenshtein, jw
The method parameter selects the distance algorithm. The
three options are "dl" (Damerau-Levenshtein, the default),
"levenshtein", and "jw" (Jaro-Winkler). When
using "jw", max_dist is interpreted as the
maximum dissimilarity (i.e.,
1 - jaro_winkler_score), so max_dist = 0.2
keeps matches with a Jaro-Winkler score of 0.8 or higher.
fuzzy_join(
tbl(f_messy), tbl(f_ref),
by = c("field_species" = "canonical"),
method = "jw",
max_dist = 0.15
) |> collect()
#> field_species count canonical family fuzzy_dist
#> 1 Qurecus robur 12 Quercus robur Fagaceae 0.02051282
#> 2 Fagus silvatica 7 Fagus sylvatica Fagaceae 0.05523810
#> 3 Acer platanodes 3 Acer platanoides Sapindaceae 0.01250000
#> 4 Betla pendula 15 Betula pendula Betulaceae 0.01666667
#> 5 Pinis sylvestris 9 Pinus sylvestris Pinaceae 0.02916667
Jaro-Winkler produces a tighter match set here because it penalizes differences in the prefix more heavily.
Blocking for performance
Without blocking, fuzzy_join compares every probe row
against every build row. For large tables this is quadratic. The
block_by parameter specifies a column for exact-match
blocking: only pairs that share the same blocking key are compared.
In a taxonomic context, the natural blocking key is the genus. We split the genus out on both sides and block on it, so “Qurecus robur” is compared only against reference rows whose blocking column is exactly “Quercus”.
ref_blocked <- data.frame(
genus = c("Quercus", "Fagus", "Acer",
"Betula", "Pinus", "Tilia"),
canonical = c("Quercus robur", "Fagus sylvatica",
"Acer platanoides", "Betula pendula",
"Pinus sylvestris", "Tilia cordata"),
family = c("Fagaceae", "Fagaceae", "Sapindaceae",
"Betulaceae", "Pinaceae", "Malvaceae"),
stringsAsFactors = FALSE
)
f_ref2 <- tempfile(fileext = ".vtr")
write_vtr(ref_blocked, f_ref2)
messy_blocked <- data.frame(
genus_field = c("Quercus", "Fagus", "Acer",
"Betula", "Pinus"),
field_species = c("Qurecus robur", "Fagus silvatica",
"Acer platanodes", "Betla pendula",
"Pinis sylvestris"),
count = c(12L, 7L, 3L, 15L, 9L),
stringsAsFactors = FALSE
)
f_messy2 <- tempfile(fileext = ".vtr")
write_vtr(messy_blocked, f_messy2)
fuzzy_join(
tbl(f_messy2), tbl(f_ref2),
by = c("field_species" = "canonical"),
method = "dl",
max_dist = 0.25,
block_by = c("genus_field" = "genus")
) |> collect()
#> genus_field field_species count genus canonical family
#> 1 Quercus Qurecus robur 12 Quercus Quercus robur Fagaceae
#> 2 Fagus Fagus silvatica 7 Fagus Fagus sylvatica Fagaceae
#> 3 Acer Acer platanodes 3 Acer Acer platanoides Sapindaceae
#> 4 Betula Betla pendula 15 Betula Betula pendula Betulaceae
#> 5 Pinus Pinis sylvestris 9 Pinus Pinus sylvestris Pinaceae
#> fuzzy_dist
#> 1 0.07692308
#> 2 0.06666667
#> 3 0.06250000
#> 4 0.07142857
#> 5 0.06250000
With blocking, each probe row is compared against at most one reference row (the one sharing its genus) instead of all six. On real datasets with thousands of reference species, this reduces the number of distance computations by orders of magnitude.
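The reduction is easy to count. A plain Python sketch using the genus keys from this example:

```python
from collections import Counter

# Blocking keys from the example: five probe rows, six reference rows.
probe_genus = ["Quercus", "Fagus", "Acer", "Betula", "Pinus"]
build_genus = ["Quercus", "Fagus", "Acer", "Betula", "Pinus", "Tilia"]

# Without blocking: every probe row is compared to every build row.
no_blocking = len(probe_genus) * len(build_genus)

# With blocking: each probe row is compared only to build rows
# that share its exact blocking key.
build_counts = Counter(build_genus)
with_blocking = sum(build_counts[g] for g in probe_genus)

print(no_blocking, with_blocking)  # 30 vs 5
```

Five comparisons instead of thirty here; with thousands of reference species and well-chosen keys, the ratio grows with the number of distinct blocks.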
Thread control
The n_threads parameter controls how many OpenMP threads
the distance computations use. The default is 4, which is a reasonable
starting point. On a workstation with many cores, increasing it to 8 or
16 can help for large joins. On a laptop where other processes need CPU
time, reducing it to 2 is reasonable.
fuzzy_join(
tbl(f_messy), tbl(f_ref),
by = c("field_species" = "canonical"),
method = "dl",
max_dist = 0.25,
n_threads = 8L
) |> collect()
Block lookups
fuzzy_join() is a streaming operation: the probe side
flows through the pipeline and the build side is materialized into a
hash table internally. But sometimes we need to look up the same
reference repeatedly with different query sets, and rebuilding the hash
table each time is wasteful.
materialize() converts a vectra node into a persistent
in-memory block. Once materialized, we can probe it with
block_lookup() (exact matching) or
block_fuzzy_lookup() (fuzzy matching) as many times as
needed without re-reading the source data.
Exact lookups
block_lookup(block, column, keys) performs a hash-based
exact match on a string column. The hash index is built lazily on the
first call and cached for subsequent calls.
blk <- tbl(f_ref) |>
select(canonical, family) |>
materialize()
block_lookup(blk, "canonical",
c("Quercus robur", "Betula pendula"))
#> query_idx canonical family
#> 1 1 Quercus robur Fagaceae
#> 2 2 Betula pendula Betulaceae
The result includes a query_idx column that maps each
result row back to the input key vector (1-based). Queries that match
multiple rows in the block return all matches.
Case-insensitive lookups
Field data often has inconsistent capitalization. The
ci = TRUE parameter performs case-insensitive matching
without requiring a separate lowercased column in the reference.
block_lookup(blk, "canonical",
c("quercus robur", "BETULA PENDULA"),
ci = TRUE)
#> query_idx canonical family
#> 1 1 Quercus robur Fagaceae
#> 2 2 Betula pendula Betulaceae
Both queries match despite the case mismatch. Internally, the engine builds a separate case-folded hash index on first use and caches it alongside the case-sensitive one.
Fuzzy lookups
block_fuzzy_lookup() computes string distances between
query keys and every row in the block (or, with blocking, only rows
sharing a blocking key). The interface mirrors fuzzy_join()
but operates on a materialized block.
blk2 <- tbl(f_ref2) |>
select(genus, canonical, family) |>
materialize()
block_fuzzy_lookup(
blk2, "canonical",
keys = c("Qurecus robur", "Fagus silvatica"),
method = "dl",
max_dist = 0.2,
block_col = "genus",
block_keys = c("Quercus", "Fagus"),
n_threads = 2L
)
#> query_idx fuzzy_dist genus canonical family
#> 1 1 0.07692308 Quercus Quercus robur Fagaceae
#> 2 2 0.06666667 Fagus Fagus sylvatica Fagaceae
The block_col and block_keys arguments work
together: the first query (“Qurecus robur”) is only compared against
rows where genus == "Quercus", and the second query against
rows where genus == "Fagus". This is the same blocking
logic as fuzzy_join() but applied to a materialized block
that can be reused across multiple lookup calls without re-reading
disk.
Performance
String operations in vectra fall into three performance tiers. Understanding which tier an operation belongs to helps predict where the bottleneck will be in a pipeline.
Tier 1: byte-level operations (essentially free)
toupper, tolower, trimws,
nchar, substr, startsWith,
endsWith, and paste/paste0 all
operate on raw bytes with O(n) single-pass algorithms. On a modern SSD,
reading a 100 MB string column takes roughly 50 ms; applying
tolower to it adds less than 1 ms. In any pipeline that
touches disk, these operations are effectively free.
Tier 2: pattern matching (depends on pattern complexity)
grepl, gsub, sub, and
str_extract compile a POSIX regex (or set up a fixed-string
scanner) and apply it to every row. Fixed-string matching
(fixed = TRUE) is in tier 1 territory. Regex matching cost
depends on the pattern: a simple alternation like ^(A|B|C)$
is nearly as fast as fixed-string matching, while a pattern with nested
quantifiers like (a+)+ can be exponentially slow on pathological inputs.
For batches larger than 1000 rows, the engine distributes the matching
loop across OpenMP threads with per-thread regex compilation, so
doubling the available cores roughly halves the wall time.
The practical advice: use fixed = TRUE (or omit the
parameter, since it defaults to TRUE) whenever the pattern
is a literal string. Reserve fixed = FALSE for cases where
you genuinely need regex features like character classes, alternations,
or capture groups.
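As an illustration of the difference, here is a sketch; it assumes filter() accepts grepl() the same way mutate() accepts the other string functions in this vignette.

```r
sp <- data.frame(species = c("Quercus robur", "Fagus sylvatica",
                             "Quercus petraea"),
                 stringsAsFactors = FALSE)
f_sp <- tempfile(fileext = ".vtr")
write_vtr(sp, f_sp)

# Literal substring: fixed-string scanner, tier-1 cost.
tbl(f_sp) |> filter(grepl("Quercus", species)) |> collect()

# Anchor plus alternation: genuine regex features, so fixed = FALSE.
tbl(f_sp) |>
  filter(grepl("^(Quercus|Fagus)", species, fixed = FALSE)) |>
  collect()
```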
Tier 3: fuzzy distance computation (CPU-bound)
levenshtein, dl_dist, and
jaro_winkler compute an O(mn) dynamic programming
alignment for each pair of strings, where m and n are the string
lengths. For column-vs-column comparisons in mutate(), this
is O(rows * m * n). For fuzzy joins without blocking, it is
O(probe_rows * build_rows * m * n). Blocking reduces this to
O(probe_rows * avg_partition_size * m * n), which is why it matters so
much. A loose max_dist does not change the number of
distance computations, but it increases the number of rows that survive
the threshold, and in a many-to-many fuzzy join this can cause a
combinatorial explosion of output rows. If both tables contain 10,000
rows and a generous threshold matches each probe row to 50 build rows,
the result has 500,000 rows. Tightening max_dist or adding
a block_by column are the two most effective ways to keep
the output manageable.
All three distance functions use OpenMP parallelization when the
input exceeds 1000 rows. The work is distributed with
schedule(dynamic, 64), meaning threads grab 64-row chunks
as they finish, which balances load well when string lengths vary.
Two-pass string building
String-producing expressions (everything that outputs a new string
column) use a two-pass strategy. The first pass computes the output
length for each row without writing any bytes. The second pass allocates
a single contiguous buffer and fills it. This avoids the
realloc-and-copy pattern that R uses internally, which can fragment
memory and cause O(n^2) behaviour on long strings. For
paste() with many arguments or gsub with
complex replacements, the two-pass approach keeps memory allocation
linear.
Practical guidance
Which distance metric for which use case
For taxonomic name matching, dl_dist_norm with a
threshold of 0.15-0.20 is the workhorse. It catches transpositions (the
most common typo type in Latin binomials), single-letter substitutions,
and dropped characters. If your reference list is large and you can
block on genus, the false-positive rate at 0.20 is low enough for
automated matching with a manual review pass for distances above
0.10.
For matching personal names (observers, collectors, authors),
jaro_winkler with a threshold of 0.10-0.15 (remember, this
is dissimilarity, so 0.10 means a similarity score of 0.90 or higher) is
the better choice. Names are short, and the prefix weighting in
Jaro-Winkler exploits the fact that people rarely mistype the first
letter of their own name.
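To see the two metrics side by side on a concrete pair of names, the column-vs-column form described under Tier 3 can be used; a sketch (output omitted, as the exact values depend on the implementation's normalization):

```r
obs <- data.frame(recorded  = c("J. Smiht", "Jon Smith"),
                  reference = c("J. Smith", "John Smith"),
                  stringsAsFactors = FALSE)
f_obs <- tempfile(fileext = ".vtr")
write_vtr(obs, f_obs)

# Damerau-Levenshtein distance vs Jaro-Winkler dissimilarity, row by row.
tbl(f_obs) |>
  mutate(dl = dl_dist(recorded, reference),
         jw = jaro_winkler(recorded, reference)) |>
  collect()
```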
For free-text fields where errors include missing words, reordered
words, or abbreviations, none of these character-level metrics work
well. Consider preprocessing with tolower,
trimws, and gsub to normalize the strings
before computing distances.
Choosing max_dist thresholds
Start conservative (low max_dist) and increase
gradually. A max_dist of 0.10 with DL distance catches most
single-character typos in strings of 10+ characters. At 0.20, you start
matching strings that differ by two edits, which increases recall but
also increases false positives. Above 0.30, the match set usually
contains too much noise to be useful without manual filtering.
For Jaro-Winkler, remember that max_dist is a
dissimilarity (1 minus the similarity score). A max_dist of
0.10 (score >= 0.90) is a reasonable starting point. Below 0.05 (score
>= 0.95), you catch only very minor differences.
A useful workflow is to run the fuzzy join once with a generous
threshold, sort the results by fuzzy_dist, and examine the
matches near the boundary. This gives an empirical sense of where true
matches end and false positives begin for your specific data.
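Sketched with hypothetical obs_tbl and ref_tbl handles (the by = syntax here mirrors the left_join() call later in this vignette and is an assumption about fuzzy_join()):

```r
# Deliberately generous threshold; we will prune by inspection.
matches <- fuzzy_join(
  obs_tbl, ref_tbl,              # hypothetical probe and reference tables
  by = c("clean" = "canonical"),
  method = "dl", max_dist = 0.3
) |> collect()

# Worst matches first: the boundary between true matches and noise
# usually becomes obvious within the first screenful.
matches[order(matches$fuzzy_dist, decreasing = TRUE), ]
```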
Handling encoding issues
vectra operates on raw bytes and assumes UTF-8 encoding. If your
input data is Latin-1 or another encoding, convert it to UTF-8 before
writing to .vtr format. In R,
iconv(x, from = "latin1", to = "UTF-8") handles this. Once
the data is in UTF-8, all string operations work correctly on ASCII
characters and pass through multi-byte sequences without corruption,
though the distance metrics count bytes rather than Unicode code points
for multi-byte characters.
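For example (sites_latin1.csv is a hypothetical file; the conversion itself is base R):

```r
raw  <- readLines("sites_latin1.csv")
utf8 <- iconv(raw, from = "latin1", to = "UTF-8")
stopifnot(all(validUTF8(utf8)))   # fail fast if conversion went wrong
# ...then parse into a data frame and write_vtr() as usual.
```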
This byte-level behaviour has a practical implication for distance
computation: if your strings contain accented characters encoded as
multi-byte UTF-8 sequences (e.g., “u” + combining accent), the edit
distance between “u” and “ü” will be larger than 1 because the engine
sees multiple bytes changing, not a single character. For Latin-alphabet
ecological data (species names, place names), this is rarely a problem
because accented characters are uncommon in standardized taxonomic
databases. If your data does include accented characters, consider
normalizing them to ASCII before computing distances (e.g., replacing
“ü” with “u” via gsub).
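Two base-R options for that normalization, sketched below: iconv() transliteration, whose output depends on the platform's iconv library, or chartr() for a known, fixed set of characters.

```r
x <- c("Hölzer", "Müller", "côte")

# Platform-dependent: //TRANSLIT asks iconv to approximate with ASCII.
iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

# Portable for a known character set: one-to-one replacement.
chartr("öüô", "ouo", x)
#> [1] "Holzer" "Muller" "cote"
```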
Common cleaning patterns
A typical cleaning pipeline chains several string operations before matching. Here is a pattern we use repeatedly for species name standardization.
raw_names <- data.frame(
species = c(" Quercus ROBUR ", "fagus sylvatica.",
"Acer platanoides (L.)", "BETULA pendula"),
stringsAsFactors = FALSE
)
f9 <- tempfile(fileext = ".vtr")
write_vtr(raw_names, f9)
tbl(f9) |>
mutate(clean = tolower(trimws(species))) |>
mutate(clean = gsub(".", "", clean, fixed = TRUE)) |>
mutate(clean = gsub(" +", " ", clean, fixed = FALSE)) |>
select(species, clean) |>
collect()
#> species clean
#> 1 Quercus ROBUR quercus robur
#> 2 fagus sylvatica. fagus sylvatica
#> 3 Acer platanoides (L.) acer platanoides (l)
#> 4 BETULA pendula betula pendula
The pipeline strips whitespace, lowercases everything, removes
periods, and collapses multiple spaces into one. Each
mutate() call creates a streaming node, so chaining them
costs no extra memory. After this cleaning, the strings are ready for
exact or fuzzy matching against a standardized reference.
Combining approaches
The most effective matching strategy layers exact and fuzzy methods. Start with an exact join on cleaned names (catching the easy cases at near-zero cost), then run a fuzzy join on the unmatched remainder. This avoids spending CPU cycles on distance computations for records that match perfectly after basic cleaning.
ref_clean <- data.frame(
canonical = c("quercus robur", "fagus sylvatica",
"acer platanoides", "betula pendula"),
status = c("accepted", "accepted", "accepted", "accepted"),
stringsAsFactors = FALSE
)
f_refc <- tempfile(fileext = ".vtr")
write_vtr(ref_clean, f_refc)
cleaned <- tbl(f9) |>
mutate(clean = tolower(trimws(species))) |>
mutate(clean = gsub(".", "", clean, fixed = TRUE)) |>
mutate(clean = gsub(" +", " ", clean, fixed = FALSE))
# Step 1: exact join on cleaned names
exact <- left_join(
cleaned, tbl(f_refc),
by = c("clean" = "canonical")
) |> collect()
exact
#> species clean status
#> 1 Quercus ROBUR quercus robur accepted
#> 2 fagus sylvatica. fagus sylvatica accepted
#> 3 BETULA pendula betula pendula accepted
#> 4 Acer platanoides (L.) acer platanoides (l)     <NA>
Records where status is NA did not match
exactly and would go to a second pass with fuzzy_join(). Here three of
the four records match after cleaning; only "acer platanoides (l)",
where the authority abbreviation survives the cleaning steps, falls
through to the fuzzy stage. The layered approach means the fuzzy join
handles only the genuinely ambiguous cases.
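The second pass can be sketched as follows, under the assumption that fuzzy_join() takes the same by = specification as left_join() (its exact signature may differ):

```r
# Step 2: fuzzy-match only the rows the exact join missed.
unmatched <- exact[is.na(exact$status), ]

f_un <- tempfile(fileext = ".vtr")
write_vtr(unmatched["clean"], f_un)

fuzzy_join(
  tbl(f_un), tbl(f_refc),
  by = c("clean" = "canonical"),
  method = "dl", max_dist = 0.2
) |> collect()
```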
This layered strategy reflects how taxonomic name resolution works in practice. The Global Biodiversity Information Facility (GBIF) backbone taxonomy, for instance, contains over 7 million names. Running a fuzzy join against the full list for every input record would be prohibitively expensive. Cleaning first, exact-matching second, and fuzzy-matching only the remainder keeps the computational cost proportional to the number of genuinely problematic records rather than the total dataset size. In a typical citizen-science dataset, 80-90% of records match exactly after basic normalization, leaving only 10-20% for the distance computation stage.