String Operations and Fuzzy Matching
Gilles Colling
2026-04-04
Source: vignettes/string-ops.Rmd
Introduction
String-heavy datasets are common in ecology. Species names arrive with typos, observer notes contain free-text descriptions, and site labels mix upper and lower case in ways that break joins. Cleaning and matching these strings is often the slowest part of an analysis pipeline, both in wall-clock time and in the number of lines of R code required.
vectra handles all string operations inside its C engine, operating
directly on raw byte buffers rather than going through R’s internal
CHARSXP string pool. When you write
mutate(name_clean = tolower(trimws(species))), that
expression is serialized into an execution plan and evaluated in C as
data flows through the pipeline. No temporary R character vectors are
allocated, no intermediate copies exist, and the global string cache
that R maintains for every unique character value is bypassed
entirely.
This architecture has two practical consequences. First, string
operations scale linearly with data size and do not suffer from R’s O(n
log n) string interning overhead. Second, the CPU-intensive operations
(regex matching, edit-distance computation) can be parallelized with
OpenMP because the C engine owns the memory layout. For datasets with
more than a thousand rows, vectra automatically distributes
grepl (in regex mode), levenshtein, dl_dist, and
jaro_winkler across available cores.
This vignette walks through every string operation vectra supports,
from basic transformations through regex pattern matching to fuzzy
joins. We will use ecological data throughout: species names with
realistic errors, site labels from field surveys, and observer notes
that need cleaning before analysis. The examples are deliberately small
so that outputs fit on screen, but every operation shown here works
identically on datasets with millions of rows, streaming through memory
one row group at a time without materializing the full result until
collect() is called.
Basic string functions
We will start with a small dataset of tree observations where names need cleaning. The data has trailing spaces, inconsistent case, and a few columns we will want to slice apart.
library(vectra)
#>
#> Attaching package: 'vectra'
#> The following object is masked from 'package:stats':
#>
#> filter
trees <- data.frame(
species = c("Quercus robur ", " fagus sylvatica", "ACER platanoides",
"Betula pendula", " Pinus SYLVESTRIS "),
site = c("plot_A01", "plot_B12", "plot_A01", "plot_C03", "plot_B12"),
observer = c("J. Smith", "A. Mueller", "J. Smith",
"B. Novak", "A. Mueller"),
dbh_cm = c(45.2, 38.1, 22.7, 31.0, 55.3),
stringsAsFactors = FALSE
)
f <- tempfile(fileext = ".vtr")
write_vtr(trees, f)
Trimming whitespace
trimws() strips leading and trailing whitespace from
every value in a column. The C engine operates on the raw bytes, so the
overhead per string is a single pass to find the first and last
non-space characters. We apply it inside mutate() just as
we would in base R.
tbl(f) |>
mutate(species_clean = trimws(species)) |>
select(species, species_clean) |>
collect()
#> species species_clean
#> 1 Quercus robur Quercus robur
#> 2 fagus sylvatica fagus sylvatica
#> 3 ACER platanoides ACER platanoides
#> 4 Betula pendula Betula pendula
#> 5 Pinus SYLVESTRIS Pinus SYLVESTRIS
The original column is preserved alongside the cleaned version. If we
no longer need it, a subsequent select() drops it from the
pipeline before any downstream node sees it.
Case conversion
toupper() and tolower() convert ASCII
characters in place. These are pure byte-level operations (A-Z mapped to
a-z and vice versa), so they are effectively free compared to the I/O
cost of reading the column from disk.
tbl(f) |>
mutate(species_lower = tolower(trimws(species)),
species_upper = toupper(trimws(species))) |>
select(species_lower, species_upper) |>
collect()
#> species_lower species_upper
#> 1 quercus robur QUERCUS ROBUR
#> 2 fagus sylvatica FAGUS SYLVATICA
#> 3 acer platanoides ACER PLATANOIDES
#> 4 betula pendula BETULA PENDULA
#> 5 pinus sylvestris PINUS SYLVESTRIS
Chaining tolower(trimws(...)) creates a single
expression node in the plan. The engine evaluates the innermost function
first and passes the result directly to the outer one, with no
intermediate allocation visible from R.
String length
nchar() returns the byte length of each string as a
double (vectra’s integer type). This is useful for filtering out
suspiciously short or long entries that might indicate data-entry
errors.
tbl(f) |>
mutate(name_len = nchar(trimws(species))) |>
select(species, name_len) |>
collect()
#> species name_len
#> 1 Quercus robur 13
#> 2 fagus sylvatica 15
#> 3 ACER platanoides 16
#> 4 Betula pendula 14
#> 5 Pinus SYLVESTRIS 16
Because vectra operates on raw bytes, nchar() returns
the byte length, not the number of Unicode code points. For pure ASCII
text (English letters, digits, common punctuation), bytes and characters
are the same thing. But for multi-byte UTF-8 characters such as accented
letters (“ü” = 2 bytes, “é” = 2 bytes) or CJK ideographs (3 bytes each),
the reported length will be higher than what base::nchar()
returns in R. The same applies to substr(): its start and
stop positions refer to byte offsets. In ecological data, this rarely
matters because taxonomic names and site codes are overwhelmingly ASCII,
but it is worth keeping in mind if your data contains place names with
diacritics or observer names in non-Latin scripts.
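The distinction is easy to see outside R. The following short Python sketch (illustrative only, not vectra code) compares byte counts with code-point counts for the kinds of strings discussed above:

```python
# Byte length vs. code-point length under UTF-8.
# A byte-oriented engine reports the first number; a code-point count
# (what base R's nchar() returns by default) gives the second.
for s in ["Quercus", "\u00e9", "\u00fc", "\u6728"]:  # ASCII, e-acute, u-umlaut, a CJK ideograph
    print(repr(s), len(s.encode("utf-8")), len(s))
```

For pure ASCII the two numbers agree; each accented Latin letter adds one extra byte, and each CJK ideograph adds two.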
Extracting substrings
substr(x, start, stop) and
substring(x, first, last) both extract a portion of each
string by position. Positions are 1-based, following R convention. These
are useful for pulling fixed-width codes out of identifier columns.
tbl(f) |>
mutate(plot_letter = substr(site, 6, 6),
plot_number = substr(site, 7, 8)) |>
select(site, plot_letter, plot_number) |>
collect()
#> site plot_letter plot_number
#> 1 plot_A01 A 01
#> 2 plot_B12 B 12
#> 3 plot_A01 A 01
#> 4 plot_C03 C 03
#> 5 plot_B12 B 12
The site codes follow a plot_X## pattern, so character 6
is the letter and characters 7-8 are the numeric part. Both are returned
as strings.
Prefix and suffix tests
startsWith() and endsWith() return a
logical column. They take a literal string argument (not a column
reference) and match it against the beginning or end of each value.
Because the comparison is a simple memcmp on the byte prefix or suffix,
it runs faster than a regex equivalent.
tbl(f) |>
filter(startsWith(site, "plot_A")) |>
select(species, site) |>
collect()
#> species site
#> 1 Quercus robur plot_A01
#> 2 ACER platanoides plot_A01
tbl(f) |>
mutate(is_mueller = endsWith(observer, "Mueller")) |>
select(observer, is_mueller) |>
collect()
#> observer is_mueller
#> 1 J. Smith FALSE
#> 2 A. Mueller TRUE
#> 3 J. Smith FALSE
#> 4 B. Novak FALSE
#> 5 A. Mueller TRUE
startsWith and endsWith are particularly
useful in filter() for subsetting data by site code
prefixes or file path suffixes, since the C engine can evaluate them
without compiling a regex.
String concatenation
paste() and paste0() concatenate an
arbitrary number of arguments into a single string column. These are
N-ary functions: you can pass two columns, five columns, or a mix of
columns and literal strings. The C engine’s two-pass string builder
computes the total output length across all arguments first, allocates a
single buffer, and fills it in one sweep.
paste0: no separator
paste0() joins its arguments with no separator, which is
the common case for building identifiers or composite keys.
tbl(f) |>
mutate(record_id = paste0(site, "_", trimws(species))) |>
select(record_id) |>
collect()
#> record_id
#> 1 plot_A01_Quercus robur
#> 2 plot_B12_fagus sylvatica
#> 3 plot_A01_ACER platanoides
#> 4 plot_C03_Betula pendula
#> 5 plot_B12_Pinus SYLVESTRIS
Here we pass three arguments: the site column, a literal
underscore, and the trimmed species name. The engine handles mixed
column-and-literal arguments natively.
paste: custom separator
paste() uses a space as the default separator, matching
base R behaviour. The sep parameter overrides this.
Multi-column paste
When building composite labels from more than two columns, passing
all of them in a single paste() call is cleaner than
nesting multiple paste0() calls.
tbl(f) |>
mutate(full_label = paste(observer, site,
trimws(species), sep = " | ")) |>
select(full_label) |>
collect()
#> full_label
#> 1 J. Smith | plot_A01 | Quercus robur
#> 2 A. Mueller | plot_B12 | fagus sylvatica
#> 3 J. Smith | plot_A01 | ACER platanoides
#> 4 B. Novak | plot_C03 | Betula pendula
#> 5 A. Mueller | plot_B12 | Pinus SYLVESTRIS
The engine constructs each output string in a single allocation regardless of how many arguments are passed, so there is no performance penalty for concatenating five columns versus two.
Pattern matching: fixed strings
vectra’s grepl(), gsub(), and
sub() accept a fixed parameter that controls
whether the pattern is treated as a literal byte sequence or as a POSIX
regular expression. When fixed = TRUE (the default in
vectra), the engine uses a direct substring search, which avoids the
cost of compiling a regex automaton. For exact substring tests and
simple replacements, this is the right choice.
Note that vectra defaults to fixed = TRUE, which differs
from base R where fixed defaults to FALSE. We
chose this default because in our experience the majority of string
filtering and replacement tasks in data cleaning involve literal
substrings, not regex patterns. When you need actual regex, pass
fixed = FALSE explicitly. This is covered in the next
section.
Filtering with grepl
grepl(pattern, x, fixed = TRUE) returns a logical column
that is TRUE wherever the pattern appears as a literal
substring. We use it here to find all records where the observer’s
surname contains “Smith”.
tbl(f) |>
filter(grepl("Smith", observer, fixed = TRUE)) |>
collect()
#> species site observer dbh_cm
#> 1 Quercus robur plot_A01 J. Smith 45.2
#> 2 ACER platanoides plot_A01 J. Smith 22.7
Two rows match, both recorded by J. Smith. The fixed-string search scans each byte buffer with a Boyer-Moore-style skip, so it scales well even on columns with long string values.
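For intuition, here is a Python sketch of a Boyer-Moore-Horspool substring search, the family of skip-based scans referred to above. This is illustrative only; the C engine's actual implementation is not shown here and may differ:

```python
def horspool_find(text: bytes, pattern: bytes) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.
    Illustrative sketch of a Boyer-Moore-style skip search."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: how far the window can shift when the byte
    # under its last position mismatches.
    shift = {b: m - i - 1 for i, b in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Jump by the table entry for the byte under the window's end
        # (or the full pattern length if that byte never occurs in it).
        i += shift.get(text[i + m - 1], m)
    return -1
```

The skip table is what lets the scan advance several bytes at a time on long values, rather than re-testing every position.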
Replacing with gsub and sub
gsub(pattern, replacement, x, fixed = TRUE) replaces
every occurrence of the pattern; sub() replaces only the
first. Both accept fixed = TRUE for literal matching and
return a new string column.
A common cleaning task is standardizing separators. Suppose our site codes sometimes use underscores and sometimes hyphens.
messy_sites <- data.frame(
site = c("plot_A01", "plot-B12", "plot_A01", "plot-C03", "plot_B12"),
stringsAsFactors = FALSE
)
f2 <- tempfile(fileext = ".vtr")
write_vtr(messy_sites, f2)
tbl(f2) |>
mutate(site_clean = gsub("-", "_", site, fixed = TRUE)) |>
collect()
#> site site_clean
#> 1 plot_A01 plot_A01
#> 2 plot-B12 plot_B12
#> 3 plot_A01 plot_A01
#> 4 plot-C03 plot_C03
#> 5 plot_B12 plot_B12
Every hyphen becomes an underscore. Since fixed = TRUE
is the default, writing gsub("-", "_", site) without the
explicit argument produces the same result.
sub: first match only
sub() replaces only the first occurrence. This is useful
when a pattern might appear multiple times and we want to change just
the leading instance.
notes <- data.frame(
note = c("tree dead, bark loose, dead branches",
"alive, healthy canopy",
"dead standing, no bark"),
stringsAsFactors = FALSE
)
f3 <- tempfile(fileext = ".vtr")
write_vtr(notes, f3)
tbl(f3) |>
mutate(note_edited = sub("dead", "DEAD", note, fixed = TRUE)) |>
select(note, note_edited) |>
collect()
#> note note_edited
#> 1 tree dead, bark loose, dead branches tree DEAD, bark loose, dead branches
#> 2 alive, healthy canopy alive, healthy canopy
#> 3 dead standing, no bark DEAD standing, no bark
In the first row, only the first “dead” is uppercased; the second
“dead” in “dead branches” remains unchanged. gsub would
have replaced both.
Pattern matching: regex
When literal substring matching is not enough, vectra supports full
POSIX regular expressions by passing fixed = FALSE. The C
engine compiles each regex pattern once per batch using
regcomp() and applies it across all rows. For batches with
more than 1000 rows, the matching loop is parallelized with OpenMP, with
each thread compiling its own copy of the regex for thread safety.
The regex dialect is POSIX Extended Regular Expressions (ERE), the
same dialect used by grep -E on Unix systems. This means
character classes like [A-Z], quantifiers +,
*, ?, alternation with |, and
grouping with () all work as expected. Backreferences in
replacement strings (\\1, \\2) refer to
captured groups. Perl-specific extensions like lookaheads, lookbehinds,
and non-greedy quantifiers (*?) are not supported; this is
a deliberate constraint that keeps the regex engine simple and
predictable.
Regex filtering with grepl
We will build a dataset with species names that have various
formatting issues. The regex ^[A-Z][a-z]+ [a-z]+$ matches
properly formatted binomial names (capital genus, lowercase epithet,
single space).
species_list <- data.frame(
name = c("Quercus robur", "fagus sylvatica", "ACER PLATANOIDES",
"Betula pendula", "Pinus  sylvestris", "Tilia cordata"),
stringsAsFactors = FALSE
)
f4 <- tempfile(fileext = ".vtr")
write_vtr(species_list, f4)
tbl(f4) |>
mutate(valid_format = grepl("^[A-Z][a-z]+ [a-z]+$",
name, fixed = FALSE)) |>
collect()
#> name valid_format
#> 1 Quercus robur TRUE
#> 2 fagus sylvatica FALSE
#> 3 ACER PLATANOIDES FALSE
#> 4 Betula pendula TRUE
#> 5 Pinus  sylvestris FALSE
#> 6 Tilia cordata TRUE
Three names fail: “fagus sylvatica” (lowercase genus), “ACER PLATANOIDES” (all caps), and “Pinus  sylvestris” (double space). This kind of validation pass is a useful first step before attempting joins against a reference taxonomy.
Regex replacement with gsub
gsub() with fixed = FALSE supports POSIX
regex in the pattern and backreferences in the replacement string.
Backreferences \\1, \\2, etc. refer to
captured groups in the pattern.
Suppose we want to reformat species names from “Genus epithet” to “G. epithet” for compact display labels.
tbl(f4) |>
mutate(short_name = gsub("^([A-Za-z])[a-z]+ ([a-z]+)$",
"\\1. \\2",
name, fixed = FALSE)) |>
select(name, short_name) |>
collect()
#> name short_name
#> 1 Quercus robur Q. robur
#> 2 fagus sylvatica f. sylvatica
#> 3 ACER PLATANOIDES ACER PLATANOIDES
#> 4 Betula pendula B. pendula
#> 5 Pinus  sylvestris Pinus  sylvestris
#> 6 Tilia cordata T. cordata
The pattern captures the first letter of the genus as group
\\1 and the entire epithet as group \\2. Rows
that do not match the pattern are returned unchanged, which is
consistent with base R’s gsub behaviour.
sub with regex: first match only
sub() with regex works the same way but replaces only
the first match. This is handy for stripping a leading prefix that
varies in length.
records <- data.frame(
code = c("ID:001-Quercus", "ID:042-Fagus",
"ID:007-Betula", "ID:113-Pinus"),
stringsAsFactors = FALSE
)
f5 <- tempfile(fileext = ".vtr")
write_vtr(records, f5)
tbl(f5) |>
mutate(genus = sub("^ID:[0-9]+-", "", code, fixed = FALSE)) |>
select(code, genus) |>
collect()
#> code genus
#> 1 ID:001-Quercus Quercus
#> 2 ID:042-Fagus Fagus
#> 3 ID:007-Betula Betula
#> 4 ID:113-Pinus Pinus
The regex ^ID:[0-9]+- matches the literal “ID:” prefix,
one or more digits, and the trailing hyphen. Replacing it with an empty
string leaves just the genus name.
str_extract: capturing structured parts
str_extract(x, pattern) returns the first match of the
regex pattern, or NA if no match is found. When the pattern
contains a capture group (round brackets), str_extract
returns the contents of that group rather than the full match.
This is particularly useful for pulling structured components out of semi-formatted strings. Here we extract numeric plot identifiers from site codes.
sites <- data.frame(
site_code = c("Forest_Plot_042", "Meadow_Transect_007",
"Forest_Plot_113", "Wetland_Quad_019"),
stringsAsFactors = FALSE
)
f6 <- tempfile(fileext = ".vtr")
write_vtr(sites, f6)
tbl(f6) |>
mutate(plot_num = str_extract(site_code, "([0-9]+)")) |>
select(site_code, plot_num) |>
collect()
#> site_code plot_num
#> 1 Forest_Plot_042 042
#> 2 Meadow_Transect_007 007
#> 3 Forest_Plot_113 113
#> 4 Wetland_Quad_019 019
The capture group ([0-9]+) matches one or more digits.
Since there is a group, str_extract returns the group
content. Without the parentheses, it would return the full match, which
in this case is the same.
A more interesting example extracts the habitat type from the site code. The pattern uses a group to capture everything before the first underscore.
tbl(f6) |>
mutate(habitat = str_extract(site_code, "^([A-Za-z]+)_")) |>
select(site_code, habitat) |>
collect()
#> site_code habitat
#> 1 Forest_Plot_042 Forest
#> 2 Meadow_Transect_007 Meadow
#> 3 Forest_Plot_113 Forest
#> 4 Wetland_Quad_019 Wetland
The group captures “Forest”, “Meadow”, “Forest”, “Wetland” while the
trailing underscore is part of the match but not part of the group. This
approach is cleaner than combining substr with a computed
position.
Fuzzy string matching
Exact string matching breaks down when data comes from multiple sources with inconsistent spelling. A field botanist writes “Qurecus robur” (transposed letters), a database export has “Fagus silvatica” (single-letter substitution), and a citizen-science platform records “quercus ROBUR” (case mismatch). Fuzzy string matching quantifies how different two strings are and lets us match records that are close enough.
vectra provides three families of distance functions, all usable
inside mutate() and filter(). Each takes two
string arguments (columns or literals) and returns a numeric distance or
similarity score. The raw variants (levenshtein,
dl_dist) return integer edit counts; the normalized
variants (levenshtein_norm, dl_dist_norm)
divide by the length of the longer string to produce a value between 0
and 1; and jaro_winkler returns a similarity score where 1
means identical.
Levenshtein distance
levenshtein(x, y) counts the minimum number of
single-character edits (insertions, deletions, substitutions) needed to
transform x into y. The result is a
non-negative integer. levenshtein_norm(x, y) divides by the
length of the longer string, producing a value between 0 (identical) and
1 (completely different).
typos <- data.frame(
field_name = c("Qurecus robur", "Fagus silvatica",
"Acer platanodes", "Betula pendula",
"Pinus sylvestrs"),
ref_name = c("Quercus robur", "Fagus sylvatica",
"Acer platanoides", "Betula pendula",
"Pinus sylvestris"),
stringsAsFactors = FALSE
)
f7 <- tempfile(fileext = ".vtr")
write_vtr(typos, f7)
tbl(f7) |>
mutate(lev = levenshtein(field_name, ref_name),
lev_norm = levenshtein_norm(field_name, ref_name)) |>
collect()
#> field_name ref_name lev lev_norm
#> 1 Qurecus robur Quercus robur 2 0.15384615
#> 2 Fagus silvatica Fagus sylvatica 1 0.06666667
#> 3 Acer platanodes Acer platanoides 1 0.06250000
#> 4 Betula pendula Betula pendula 0 0.00000000
#> 5 Pinus sylvestrs Pinus sylvestris 1 0.06250000
“Qurecus robur” vs “Quercus robur” has a Levenshtein distance of 2 (the transposed “re” to “er” counts as two operations: delete and insert, or two substitutions depending on the alignment). “Betula pendula” matches exactly, giving a distance of 0.
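The distances above come from the classic Wagner-Fischer dynamic-programming recurrence. This Python sketch (illustrative; vectra computes this in C) reproduces the values in the table:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    to turn a into b (Wagner-Fischer, two-row form)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_norm(a: str, b: str) -> float:
    """Distance divided by the longer string's length (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For example, "Qurecus robur" vs "Quercus robur" gives 2, and 2/13 matches the 0.15384615 shown above.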
Damerau-Levenshtein distance
dl_dist(x, y) extends Levenshtein by counting
transpositions of two adjacent characters as a single edit operation.
This better reflects the kinds of errors humans make when typing.
dl_dist_norm(x, y) is the normalized variant.
tbl(f7) |>
mutate(dl = dl_dist(field_name, ref_name),
dl_norm = dl_dist_norm(field_name, ref_name)) |>
collect()
#> field_name ref_name dl dl_norm
#> 1 Qurecus robur Quercus robur 1 0.07692308
#> 2 Fagus silvatica Fagus sylvatica 1 0.06666667
#> 3 Acer platanodes Acer platanoides 1 0.06250000
#> 4 Betula pendula Betula pendula 0 0.00000000
#> 5 Pinus sylvestrs Pinus sylvestris 1 0.06250000
Compare the first row: “Qurecus” vs “Quercus” is a transposition of “r” and “e”. Levenshtein counts this as 2 edits, but Damerau-Levenshtein counts it as 1. For human-typed data, DL distance is usually the better default because transposition errors are common.
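Adding one case to the Levenshtein recurrence yields the restricted (optimal string alignment) form of Damerau-Levenshtein. This Python sketch assumes that variant, which agrees with the distances shown above:

```python
def dl_dist(a: str, b: str) -> int:
    """Optimal-string-alignment variant of Damerau-Levenshtein (sketch):
    an adjacent transposition counts as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Extra case: swap of two adjacent characters.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this extra case, "Qurecus robur" vs "Quercus robur" drops from 2 to 1.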
Jaro-Winkler similarity
jaro_winkler(x, y) returns a similarity score between 0
and 1, where 1 means the strings are identical. Unlike the edit-distance
functions, higher values indicate better matches. Jaro-Winkler gives
extra weight to matching prefixes, making it particularly effective for
personal names and short strings where the first few characters are
usually correct.
tbl(f7) |>
mutate(jw = jaro_winkler(field_name, ref_name)) |>
collect()
#> field_name ref_name jw
#> 1 Qurecus robur Quercus robur 0.9794872
#> 2 Fagus silvatica Fagus sylvatica 0.9447619
#> 3 Acer platanodes Acer platanoides 0.9875000
#> 4 Betula pendula Betula pendula 1.0000000
#> 5 Pinus sylvestrs Pinus sylvestris 0.9875000
The scores are all above 0.9 for these examples because the errors are small relative to string length. Jaro-Winkler is less informative for long strings with small differences (everything scores above 0.95) but excels at distinguishing between short strings like surnames.
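For reference, here is the standard Jaro-Winkler formulation as a Python sketch (vectra's C implementation may handle edge cases differently). It reproduces the 0.9794872 in the first row of the table above:

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity: average of match ratios and transposition ratio."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    m = 0
    for i in range(len(a)):  # count characters matching within the window
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and a[i] == b[j]:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    a_m = [c for c, h in zip(a, a_hit) if h]
    b_m = [c for c, h in zip(b, b_hit) if h]
    t = sum(x != y for x, y in zip(a_m, b_m)) / 2  # transpositions
    return (m / len(a) + m / len(b) + (m - t) / m) / 3

def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Boost the Jaro score by the common prefix length (capped at 4)."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

The prefix boost is why the metric favours strings whose first few characters agree: a two-character common prefix lifts "Qurecus robur" from a Jaro score of 0.9744 to 0.9795.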
Choosing a distance metric
The choice depends on the error profile of your data:
Damerau-Levenshtein (dl_dist / dl_dist_norm): the best general default. Handles insertions, deletions, substitutions, and transpositions. Use this for species names, place names, and any free-text field where typos are the primary error source.
Levenshtein (levenshtein / levenshtein_norm): appropriate when transpositions are rare or when you need compatibility with systems that use classical Levenshtein. The scores will be slightly more conservative (higher distance) than DL for transposed strings.
Jaro-Winkler (jaro_winkler): best for short strings (under 15 characters) and personal names. The prefix weighting helps when the first few characters are reliable and errors tend to occur later in the string. Less useful for long species names, where a single-character change barely moves the score.
Filtering by distance threshold
In practice, we often want to flag records that exceed some distance threshold from a reference value. Here we filter for species entries that are within a normalized DL distance of 0.15 from “Quercus robur”.
field_data <- data.frame(
species = c("Qurecus robur", "Quercus robor", "Fagus sylvatica",
"Quercus robur", "Quercis rubur", "Betula pendula"),
plot = c("A1", "A2", "B1", "A3", "B2", "C1"),
stringsAsFactors = FALSE
)
f8 <- tempfile(fileext = ".vtr")
write_vtr(field_data, f8)
tbl(f8) |>
mutate(dist = dl_dist_norm(species, "Quercus robur")) |>
filter(dist < 0.15) |>
collect()
#> species plot dist
#> 1 Qurecus robur A1 0.07692308
#> 2 Quercus robor A2 0.07692308
#> 3 Quercus robur A3 0.00000000
Only entries close to “Quercus robur” survive the filter. “Fagus sylvatica” and “Betula pendula” are too distant, and “Quercis rubur” falls just outside the cutoff: its two substitutions give a normalized distance of 2/13 ≈ 0.154, slightly above the 0.15 threshold.
Ranking by similarity
When matching against multiple candidates, sorting by distance gives us the best matches first.
tbl(f8) |>
mutate(sim = jaro_winkler(species, "Quercus robur")) |>
arrange(desc(sim)) |>
collect()
#> species plot sim
#> 1 Quercus robur A3 1.0000000
#> 2 Qurecus robur A1 0.9794872
#> 3 Quercus robor A2 0.9692308
#> 4 Quercis rubur B2 0.9358974
#> 5 Fagus sylvatica B1 0.4769231
#> 6 Betula pendula C1 0.3644689
The exact match scores 1.0, and the candidates with small typos rank next. This pattern is useful for building a shortlist of probable matches to review manually.
Fuzzy joins
Fuzzy matching within mutate() works when we already
have paired columns to compare. More often, we have two separate tables
and need to find which rows in a messy dataset correspond to which rows
in a clean reference. This is where fuzzy_join() comes
in.
Basic fuzzy join
fuzzy_join(x, y, by, method, max_dist) joins two vectra
tables using approximate string matching. The by argument
is a named character vector of length 1 specifying which columns to
compare. The result includes all columns from both sides plus a
fuzzy_dist column with the normalized distance of each
match.
ref <- data.frame(
canonical = c("Quercus robur", "Fagus sylvatica",
"Acer platanoides", "Betula pendula",
"Pinus sylvestris", "Tilia cordata"),
family = c("Fagaceae", "Fagaceae", "Sapindaceae",
"Betulaceae", "Pinaceae", "Malvaceae"),
stringsAsFactors = FALSE
)
f_ref <- tempfile(fileext = ".vtr")
write_vtr(ref, f_ref)
messy <- data.frame(
field_species = c("Qurecus robur", "Fagus silvatica",
"Acer platanodes", "Betla pendula",
"Pinis sylvestris"),
count = c(12L, 7L, 3L, 15L, 9L),
stringsAsFactors = FALSE
)
f_messy <- tempfile(fileext = ".vtr")
write_vtr(messy, f_messy)
fuzzy_join(
tbl(f_messy), tbl(f_ref),
by = c("field_species" = "canonical"),
method = "dl",
max_dist = 0.25
) |> collect()
#> field_species count canonical family fuzzy_dist
#> 1 Qurecus robur 12 Quercus robur Fagaceae 0.07692308
#> 2 Fagus silvatica 7 Fagus sylvatica Fagaceae 0.06666667
#> 3 Acer platanodes 3 Acer platanoides Sapindaceae 0.06250000
#> 4 Betla pendula 15 Betula pendula Betulaceae 0.07142857
#> 5 Pinis sylvestris 9 Pinus sylvestris Pinaceae 0.06250000
Each messy field name finds its closest match in the reference table.
The fuzzy_dist column shows how far apart they are, which
is useful for flagging matches that might need manual verification (say,
anything above 0.15).
Methods: dl, levenshtein, jw
The method parameter selects the distance algorithm. The
three options are "dl" (Damerau-Levenshtein, the default),
"levenshtein", and "jw" (Jaro-Winkler). When
using "jw", max_dist is interpreted as the
maximum dissimilarity (i.e.,
1 - jaro_winkler_score), so max_dist = 0.2
keeps matches with a Jaro-Winkler score of 0.8 or higher.
fuzzy_join(
tbl(f_messy), tbl(f_ref),
by = c("field_species" = "canonical"),
method = "jw",
max_dist = 0.15
) |> collect()
#> field_species count canonical family fuzzy_dist
#> 1 Qurecus robur 12 Quercus robur Fagaceae 0.02051282
#> 2 Fagus silvatica 7 Fagus sylvatica Fagaceae 0.05523810
#> 3 Acer platanodes 3 Acer platanoides Sapindaceae 0.01250000
#> 4 Betla pendula 15 Betula pendula Betulaceae 0.01666667
#> 5 Pinis sylvestris 9 Pinus sylvestris Pinaceae 0.02916667
Jaro-Winkler produces a tighter match set here because it penalizes differences in the prefix more heavily.
Blocking for performance
Without blocking, fuzzy_join compares every probe row
against every build row. For large tables this is quadratic. The
block_by parameter specifies a column for exact-match
blocking: only pairs that share the same blocking key are compared.
In a taxonomic context, the natural blocking key is the genus. We split the genus out on both sides and block on it, so “Qurecus robur” is compared only against reference rows whose blocking column is exactly “Quercus”.
ref_blocked <- data.frame(
genus = c("Quercus", "Fagus", "Acer",
"Betula", "Pinus", "Tilia"),
canonical = c("Quercus robur", "Fagus sylvatica",
"Acer platanoides", "Betula pendula",
"Pinus sylvestris", "Tilia cordata"),
family = c("Fagaceae", "Fagaceae", "Sapindaceae",
"Betulaceae", "Pinaceae", "Malvaceae"),
stringsAsFactors = FALSE
)
f_ref2 <- tempfile(fileext = ".vtr")
write_vtr(ref_blocked, f_ref2)
messy_blocked <- data.frame(
genus_field = c("Quercus", "Fagus", "Acer",
"Betula", "Pinus"),
field_species = c("Qurecus robur", "Fagus silvatica",
"Acer platanodes", "Betla pendula",
"Pinis sylvestris"),
count = c(12L, 7L, 3L, 15L, 9L),
stringsAsFactors = FALSE
)
f_messy2 <- tempfile(fileext = ".vtr")
write_vtr(messy_blocked, f_messy2)
fuzzy_join(
tbl(f_messy2), tbl(f_ref2),
by = c("field_species" = "canonical"),
method = "dl",
max_dist = 0.25,
block_by = c("genus_field" = "genus")
) |> collect()
#> genus_field field_species count genus canonical family
#> 1 Quercus Qurecus robur 12 Quercus Quercus robur Fagaceae
#> 2 Fagus Fagus silvatica 7 Fagus Fagus sylvatica Fagaceae
#> 3 Acer Acer platanodes 3 Acer Acer platanoides Sapindaceae
#> 4 Betula Betla pendula 15 Betula Betula pendula Betulaceae
#> 5 Pinus Pinis sylvestris 9 Pinus Pinus sylvestris Pinaceae
#> fuzzy_dist
#> 1 0.07692308
#> 2 0.06666667
#> 3 0.06250000
#> 4 0.07142857
#> 5 0.06250000
With blocking, each probe row is compared against at most one reference row (the one sharing its genus) instead of all six. On real datasets with thousands of reference species, this reduces the number of distance computations by orders of magnitude.
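The reduction is easy to count. A plain Python sketch using the genus keys from this example:

```python
from collections import Counter

# Blocking keys from the example: five probe rows, six reference rows.
probe_genus = ["Quercus", "Fagus", "Acer", "Betula", "Pinus"]
build_genus = ["Quercus", "Fagus", "Acer", "Betula", "Pinus", "Tilia"]

# Without blocking: every probe row is compared to every build row.
no_blocking = len(probe_genus) * len(build_genus)

# With blocking: each probe row is compared only to build rows
# that share its exact blocking key.
build_counts = Counter(build_genus)
with_blocking = sum(build_counts[g] for g in probe_genus)

print(no_blocking, with_blocking)  # 30 vs 5
```

Five comparisons instead of thirty here; with thousands of reference species and well-chosen keys, the ratio grows with the number of distinct blocks.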
Thread control
The n_threads parameter controls how many OpenMP threads
the distance computations use. The default is 4, which is a reasonable
starting point. On a workstation with many cores, increasing it to 8 or
16 can help for large joins. On a laptop where other processes need CPU
time, reducing it to 2 is reasonable.
fuzzy_join(
tbl(f_messy), tbl(f_ref),
by = c("field_species" = "canonical"),
method = "dl",
max_dist = 0.25,
n_threads = 8L
) |> collect()
Block lookups
fuzzy_join() is a streaming operation: the probe side
flows through the pipeline and the build side is materialized into a
hash table internally. But sometimes we need to look up the same
reference repeatedly with different query sets, and rebuilding the hash
table each time is wasteful.
materialize() converts a vectra node into a persistent
in-memory block. Once materialized, we can probe it with
block_lookup() (exact matching) or
block_fuzzy_lookup() (fuzzy matching) as many times as
needed without re-reading the source data.
Exact lookups
block_lookup(block, column, keys) performs a hash-based
exact match on a string column. The hash index is built lazily on the
first call and cached for subsequent calls.
blk <- tbl(f_ref) |>
select(canonical, family) |>
materialize()
block_lookup(blk, "canonical",
c("Quercus robur", "Betula pendula"))
#> query_idx canonical family
#> 1 1 Quercus robur Fagaceae
#> 2 2 Betula pendula Betulaceae
The result includes a query_idx column that maps each
result row back to the input key vector (1-based). Queries that match
multiple rows in the block return all matches.
Case-insensitive lookups
Field data often has inconsistent capitalization. The
ci = TRUE parameter performs case-insensitive matching
without requiring a separate lowercased column in the reference.
block_lookup(blk, "canonical",
c("quercus robur", "BETULA PENDULA"),
ci = TRUE)
#> query_idx canonical family
#> 1 1 Quercus robur Fagaceae
#> 2 2 Betula pendula Betulaceae
Both queries match despite the case mismatch. Internally, the engine builds a separate case-folded hash index on first use and caches it alongside the case-sensitive one.
Fuzzy lookups
block_fuzzy_lookup() computes string distances between
query keys and every row in the block (or, with blocking, only rows
sharing a blocking key). The interface mirrors fuzzy_join()
but operates on a materialized block.
blk2 <- tbl(f_ref2) |>
select(genus, canonical, family) |>
materialize()
block_fuzzy_lookup(
blk2, "canonical",
keys = c("Qurecus robur", "Fagus silvatica"),
method = "dl",
max_dist = 0.2,
block_col = "genus",
block_keys = c("Quercus", "Fagus"),
n_threads = 2L
)
#> query_idx fuzzy_dist genus canonical family
#> 1 1 0.07692308 Quercus Quercus robur Fagaceae
#> 2 2 0.06666667 Fagus Fagus sylvatica Fagaceae
The block_col and block_keys arguments work
together: the first query (“Qurecus robur”) is only compared against
rows where genus == "Quercus", and the second query against
rows where genus == "Fagus". This is the same blocking
logic as fuzzy_join() but applied to a materialized block
that can be reused across multiple lookup calls without re-reading
disk.
Performance
String operations in vectra fall into three performance tiers. Understanding which tier an operation belongs to helps predict where the bottleneck will be in a pipeline.
Tier 1: byte-level operations (essentially free)
toupper, tolower, trimws,
nchar, substr, startsWith,
endsWith, and paste/paste0 all
operate on raw bytes with O(n) single-pass algorithms. On a modern SSD,
reading a 100 MB string column takes roughly 50 ms; applying
tolower to it adds less than 1 ms. In any pipeline that
touches disk, these operations are effectively free.
Tier 2: pattern matching (depends on pattern complexity)
grepl, gsub, sub, and
str_extract compile a POSIX regex (or set up a fixed-string
scanner) and apply it to every row. Fixed-string matching
(fixed = TRUE) is in tier 1 territory. Regex matching cost
depends on the pattern: a simple alternation like ^(A|B|C)$
is nearly as fast as fixed-string matching, while a pattern with nested
quantifiers like (a+)+ can be exponentially slow on pathological inputs.
For batches larger than 1000 rows, the engine distributes the matching
loop across OpenMP threads with per-thread regex compilation, so
doubling the available cores roughly halves the wall time.
The practical advice: use fixed = TRUE (or omit the
parameter, since it defaults to TRUE) whenever the pattern
is a literal string. Reserve fixed = FALSE for cases where
you genuinely need regex features like character classes, alternations,
or capture groups.
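As an illustration of the difference, here is a sketch; it assumes filter() accepts grepl() the same way mutate() accepts the other string functions in this vignette.

```r
sp <- data.frame(species = c("Quercus robur", "Fagus sylvatica",
                             "Quercus petraea"),
                 stringsAsFactors = FALSE)
f_sp <- tempfile(fileext = ".vtr")
write_vtr(sp, f_sp)

# Literal substring: fixed-string scanner, tier-1 cost.
tbl(f_sp) |> filter(grepl("Quercus", species)) |> collect()

# Anchor plus alternation: genuine regex features, so fixed = FALSE.
tbl(f_sp) |>
  filter(grepl("^(Quercus|Fagus)", species, fixed = FALSE)) |>
  collect()
```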
Tier 3: fuzzy distance computation (CPU-bound)
levenshtein, dl_dist, and
jaro_winkler compute an O(mn) dynamic programming
alignment for each pair of strings, where m and n are the string
lengths. For column-vs-column comparisons in mutate(), this
is O(rows * m * n). For fuzzy joins without blocking, it is
O(probe_rows * build_rows * m * n). Blocking reduces this to
O(probe_rows * avg_partition_size * m * n), which is why it matters so
much. A loose max_dist does not change the number of
distance computations, but it increases the number of rows that survive
the threshold, and in a many-to-many fuzzy join this can cause a
combinatorial explosion of output rows. If both tables contain 10,000
rows and a generous threshold matches each probe row to 50 build rows,
the result has 500,000 rows. Tightening max_dist or adding
a block_by column are the two most effective ways to keep
the output manageable.
All three distance functions use OpenMP parallelization when the
input exceeds 1000 rows. The work is distributed with
schedule(dynamic, 64), meaning threads grab 64-row chunks
as they finish, which balances load well when string lengths vary.
Two-pass string building
String-producing expressions (everything that outputs a new string
column) use a two-pass strategy. The first pass computes the output
length for each row without writing any bytes. The second pass allocates
a single contiguous buffer and fills it. This avoids the
realloc-and-copy pattern that R uses internally, which can fragment
memory and cause O(n^2) behaviour on long strings. For
paste() with many arguments or gsub with
complex replacements, the two-pass approach keeps memory allocation
linear.
Practical guidance
Which distance metric for which use case
For taxonomic name matching, dl_dist_norm with a
threshold of 0.15-0.20 is the workhorse. It catches transpositions (the
most common typo type in Latin binomials), single-letter substitutions,
and dropped characters. If your reference list is large and you can
block on genus, the false-positive rate at 0.20 is low enough for
automated matching with a manual review pass for distances above
0.10.
For matching personal names (observers, collectors, authors),
jaro_winkler with a threshold of 0.10-0.15 (remember, this
is dissimilarity, so 0.10 means a similarity score of 0.90 or higher) is
the better choice. Names are short, and the prefix weighting in
Jaro-Winkler exploits the fact that people rarely mistype the first
letter of their own name.
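To see the two metrics side by side on a concrete pair of names, the column-vs-column form described under Tier 3 can be used; a sketch (output omitted, as the exact values depend on the implementation's normalization):

```r
obs <- data.frame(recorded  = c("J. Smiht", "Jon Smith"),
                  reference = c("J. Smith", "John Smith"),
                  stringsAsFactors = FALSE)
f_obs <- tempfile(fileext = ".vtr")
write_vtr(obs, f_obs)

# Damerau-Levenshtein distance vs Jaro-Winkler dissimilarity, row by row.
tbl(f_obs) |>
  mutate(dl = dl_dist(recorded, reference),
         jw = jaro_winkler(recorded, reference)) |>
  collect()
```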
For free-text fields where errors include missing words, reordered
words, or abbreviations, none of these character-level metrics work
well. Consider preprocessing with tolower,
trimws, and gsub to normalize the strings
before computing distances.
Choosing max_dist thresholds
Start conservative (low max_dist) and increase
gradually. A max_dist of 0.10 with DL distance catches most
single-character typos in strings of 10+ characters. At 0.20, you start
matching strings that differ by two edits, which increases recall but
also increases false positives. Above 0.30, the match set usually
contains too much noise to be useful without manual filtering.
For Jaro-Winkler, remember that max_dist is a
dissimilarity (1 minus the similarity score). A max_dist of
0.10 (score >= 0.90) is a reasonable starting point. Below 0.05 (score
>= 0.95), you catch only very minor differences.
A useful workflow is to run the fuzzy join once with a generous
threshold, sort the results by fuzzy_dist, and examine the
matches near the boundary. This gives an empirical sense of where true
matches end and false positives begin for your specific data.
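Sketched with hypothetical obs_tbl and ref_tbl handles (the by = syntax here mirrors the left_join() call later in this vignette and is an assumption about fuzzy_join()):

```r
# Deliberately generous threshold; we will prune by inspection.
matches <- fuzzy_join(
  obs_tbl, ref_tbl,              # hypothetical probe and reference tables
  by = c("clean" = "canonical"),
  method = "dl", max_dist = 0.3
) |> collect()

# Worst matches first: the boundary between true matches and noise
# usually becomes obvious within the first screenful.
matches[order(matches$fuzzy_dist, decreasing = TRUE), ]
```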
Handling encoding issues
vectra operates on raw bytes and assumes UTF-8 encoding. If your
input data is Latin-1 or another encoding, convert it to UTF-8 before
writing to .vtr format. In R,
iconv(x, from = "latin1", to = "UTF-8") handles this. Once
the data is in UTF-8, all string operations work correctly on ASCII
characters and pass through multi-byte sequences without corruption,
though the distance metrics count bytes rather than Unicode code points
for multi-byte characters.
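For example (sites_latin1.csv is a hypothetical file; the conversion itself is base R):

```r
raw  <- readLines("sites_latin1.csv")
utf8 <- iconv(raw, from = "latin1", to = "UTF-8")
stopifnot(all(validUTF8(utf8)))   # fail fast if conversion went wrong
# ...then parse into a data frame and write_vtr() as usual.
```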
This byte-level behaviour has a practical implication for distance
computation: if your strings contain accented characters encoded as
multi-byte UTF-8 sequences (e.g., “u” + combining accent), the edit
distance between “u” and “ü” will be larger than 1 because the engine
sees multiple bytes changing, not a single character. For Latin-alphabet
ecological data (species names, place names), this is rarely a problem
because accented characters are uncommon in standardized taxonomic
databases. If your data does include accented characters, consider
normalizing them to ASCII before computing distances (e.g., replacing
“ü” with “u” via gsub).
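Two base-R options for that normalization, sketched below: iconv() transliteration, whose output depends on the platform's iconv library, or chartr() for a known, fixed set of characters.

```r
x <- c("Hölzer", "Müller", "côte")

# Platform-dependent: //TRANSLIT asks iconv to approximate with ASCII.
iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

# Portable for a known character set: one-to-one replacement.
chartr("öüô", "ouo", x)
#> [1] "Holzer" "Muller" "cote"
```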
Common cleaning patterns
A typical cleaning pipeline chains several string operations before matching. Here is a pattern we use repeatedly for species name standardization.
raw_names <- data.frame(
species = c(" Quercus ROBUR ", "fagus sylvatica.",
"Acer platanoides (L.)", "BETULA pendula"),
stringsAsFactors = FALSE
)
f9 <- tempfile(fileext = ".vtr")
write_vtr(raw_names, f9)
tbl(f9) |>
mutate(clean = tolower(trimws(species))) |>
mutate(clean = gsub(".", "", clean, fixed = TRUE)) |>
mutate(clean = gsub(" +", " ", clean, fixed = FALSE)) |>
select(species, clean) |>
collect()
#> species clean
#> 1 Quercus ROBUR quercus robur
#> 2 fagus sylvatica. fagus sylvatica
#> 3 Acer platanoides (L.) acer platanoides (l)
#> 4 BETULA pendula betula pendula
The pipeline strips whitespace, lowercases everything, removes
periods, and collapses multiple spaces into one. Each
mutate() call creates a streaming node, so chaining them
costs no extra memory. After this cleaning, the strings are ready for
exact or fuzzy matching against a standardized reference.
Combining approaches
The most effective matching strategy layers exact and fuzzy methods. Start with an exact join on cleaned names (catching the easy cases at near-zero cost), then run a fuzzy join on the unmatched remainder. This avoids spending CPU cycles on distance computations for records that match perfectly after basic cleaning.
ref_clean <- data.frame(
canonical = c("quercus robur", "fagus sylvatica",
"acer platanoides", "betula pendula"),
status = c("accepted", "accepted", "accepted", "accepted"),
stringsAsFactors = FALSE
)
f_refc <- tempfile(fileext = ".vtr")
write_vtr(ref_clean, f_refc)
cleaned <- tbl(f9) |>
mutate(clean = tolower(trimws(species))) |>
mutate(clean = gsub(".", "", clean, fixed = TRUE)) |>
mutate(clean = gsub(" +", " ", clean, fixed = FALSE))
# Step 1: exact join on cleaned names
exact <- left_join(
cleaned, tbl(f_refc),
by = c("clean" = "canonical")
) |> collect()
exact
#> species clean status
#> 1 Quercus ROBUR quercus robur accepted
#> 2 fagus sylvatica. fagus sylvatica accepted
#> 3 BETULA pendula betula pendula accepted
#> 4 Acer platanoides (L.) acer platanoides (l)     <NA>
Records where status is NA did not match
exactly and would go to a second pass with fuzzy_join(). Here three of
the four records match after cleaning; only "acer platanoides (l)",
where the authority abbreviation survives the cleaning steps, falls
through to the fuzzy stage. The layered approach means the fuzzy join
handles only the genuinely ambiguous cases.
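The second pass can be sketched as follows, under the assumption that fuzzy_join() takes the same by = specification as left_join() (its exact signature may differ):

```r
# Step 2: fuzzy-match only the rows the exact join missed.
unmatched <- exact[is.na(exact$status), ]

f_un <- tempfile(fileext = ".vtr")
write_vtr(unmatched["clean"], f_un)

fuzzy_join(
  tbl(f_un), tbl(f_refc),
  by = c("clean" = "canonical"),
  method = "dl", max_dist = 0.2
) |> collect()
```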
This layered strategy reflects how taxonomic name resolution works in practice. The Global Biodiversity Information Facility (GBIF) backbone taxonomy, for instance, contains over 7 million names. Running a fuzzy join against the full list for every input record would be prohibitively expensive. Cleaning first, exact-matching second, and fuzzy-matching only the remainder keeps the computational cost proportional to the number of genuinely problematic records rather than the total dataset size. In a typical citizen-science dataset, 80-90% of records match exactly after basic normalization, leaving only 10-20% for the distance computation stage.