pkgdown/mathjax-config.html

Skip to contents

taxify is designed for lists of a few hundred names and lists of a few hundred thousand names alike. The underlying engine (vectra) stores backbone databases in a columnar binary format (.vtr) that supports memory-mapped access, hash-indexed lookups, and OpenMP-parallel fuzzy joins. None of this requires special configuration from the user. But knowing how the pieces fit together helps when tuning a workflow for speed, memory, or disk usage at scale.

This vignette covers the performance-relevant internals, gives concrete timing guidance for different list sizes, and walks through four worked examples: exact vs. fuzzy matching, multi-backend fallback ordering, batch processing of very large lists, and pre-downloading resources before a batch run.

How taxify scales

The .vtr columnar format

Every backbone ships as a .vtr file: a binary columnar format written by the vectra C11 engine. Unlike CSV or TSV, the .vtr format stores each column contiguously on disk with lightweight compression. taxify never parses text at query time. There is no read.csv() step, no string splitting, no quote escaping. The backbone is already in a query-ready binary layout.

Backbones are distributed as pre-built .vtr files (hosted on Zenodo and GitHub Releases), so users download a single binary file that is ready to query immediately — no CSV parsing, no conversion step. The .vtr files are typically 30-50% smaller than the original Darwin Core CSV because the columnar layout compresses string columns more efficiently than row-oriented text.

Exact matching: hash-indexed lookups

When a backbone is first used in a session, vectra materializes it into an in-memory columnar block with hash indexes on the name and genus columns. Subsequent lookups against that block run in essentially constant time per name.

Exact matching uses block_lookup(), which resolves each input name via a hash index. This is an O(1) operation per name and the reason exact matching scales linearly with list size. A list of 100,000 clean plant names matches against WFO in seconds, not minutes.

The exact pipeline is more thorough than a simple string comparison. It runs five passes in sequence, each catching a different class of name variation:

  1. Case-sensitive exact match against the canonical name column.

  2. Case-insensitive match against a precomputed lowercased key.

  3. Latin orthographic normalization that maps common epithet variants (e.g., -ii to -i, -anum to -ana) to a canonical form.

  4. Infraspecific-to-species fallback that strips variety/subspecies qualifiers and matches against the binomial.

  5. Hybrid name normalization that resolves nothospecies formatting differences (e.g., Salix x rubens vs. Salix xrubens).

All five passes use hash lookups. A name that matches in pass 1 is never tested in passes 2-5. In practice, pass 1 resolves 85-95% of names from clean input, and passes 2-4 pick up another 2-5%. The total cost of exact matching is dominated by the hash lookups, which are O(1) per name regardless of backbone size.

Fuzzy matching: genus-blocked string distance

Fuzzy matching is fundamentally more expensive. For each unmatched name, vectra computes string distances (Damerau-Levenshtein by default) against all backbone entries that share the same genus. This genus-blocking strategy reduces the search space from millions of backbone entries to a few hundred or thousand (the typical number of species per genus). The computation is parallelized across cores via OpenMP, using 4 threads by default.

On a 4-core machine, fuzzy matching 5,000 names against WFO takes roughly 10-30 seconds depending on genus sizes. Large genera like Carex (~2,000 entries in WFO) or Astragalus (~3,000 entries) are more expensive per name than small genera. The cost grows with the number of names that fail exact matching and with the size of the backbone. Against GBIF’s 7 million rows, the same 5,000 names might take 30-90 seconds because each genus block is proportionally larger.

A secondary fuzzy pass handles misspelled genera. When the genus itself is wrong (e.g., Qurecus instead of Quercus), the genus-blocked join misses the name entirely. taxify runs a fallback pass that blocks on the first two characters of the name instead of the full genus. This catches most single- character genus typos while keeping the search space much smaller than a full cross-join.

The practical consequence: exact-only matching is fast at any scale, and fuzzy matching is the knob that controls how long a run takes.

Backbone loading and the session cache

The first time taxify() is called for a given backend, several things happen behind the scenes. The function resolves the backbone path through a four-step fallback: session cache, versioned directory on disk, legacy flat directory, and finally auto-download from Zenodo if no local copy exists. Once the path is known, vectra materializes the .vtr into an in-memory columnar block and builds hash indexes on the name and genus columns. This initialization step takes 1-3 seconds for WFO (~400,000 rows) and 5-10 seconds for GBIF (~7 million rows). Every subsequent taxify() call in the same R session reuses the materialized block. There is no repeated file I/O.

Two caches operate in parallel. The path cache (.taxify_cache) maps backend names to .vtr file paths on disk. Once a path is resolved, it stays cached so that ensure_backbone() does not re-scan the file system. The data cache (.taxify_env) holds the materialized columnar block itself, keyed by file path. It also stores the session manifest, version-check flags, enrichment paths, and coverage data for the genus register. Both are package-level environments that persist for the duration of the R session and are shared across all taxify() calls.

The first taxify() call in a session also triggers a version check. taxify fetches a manifest from GitHub (a small JSON file listing the latest version of each backbone) and compares it against the locally installed version. If a newer backbone is available, it is downloaded automatically. This check runs once per backend per session. Subsequent calls skip it entirely. If the network is unavailable, the check is skipped and the local copy is used as-is.

To see where backbones live on disk:

taxify_data_dir()
#> [1] "C:/Users/jane/AppData/Local/R/taxify"

On Linux this is typically ~/.local/share/R/taxify, on macOS ~/Library/Application Support/R/taxify. The path is determined by tools::R_user_dir("taxify", "data") and is shared across all R projects on the same machine. Two R sessions running on the same machine can read the same backbone files concurrently without conflict because the .vtr files are read-only at query time.

Worked example: exact vs. fuzzy matching

The simplest performance lever is the fuzzy argument. When input names are clean (e.g., names from a curated database, an existing taxonomic checklist, or the output of a previous taxify run), disabling fuzzy matching skips the string-distance computation entirely.

Consider a list of 10,000 plant names extracted from an herbarium database where names are already in standard binomial form. We time both modes:

# Assume `species_list` is a character vector of 10,000 plant names

# Exact + fuzzy (default)
t_fuzzy <- system.time(
  result_fuzzy <- taxify(species_list, backend = "wfo", fuzzy = TRUE)
)

# Exact only
t_exact <- system.time(
  result_exact <- taxify(species_list, backend = "wfo", fuzzy = FALSE)
)

t_fuzzy["elapsed"]
#> elapsed
#>   18.4

t_exact["elapsed"]
#> elapsed
#>    2.1

The exact-only run is about an order of magnitude faster. The exact pass matches most names on the first try through its five-pass pipeline (case-sensitive, case-insensitive, Latin orthographic normalization, infraspecific fallback, and hybrid normalization). Fuzzy matching picks up the remaining names with minor misspellings, but at a cost that scales with the number of unmatched names times the average genus size in the backbone.

The ratio between the two modes depends on input quality. For a list of names extracted from a curated database (GBIF occurrence records, a published checklist, or a previous taxify run), the exact pass resolves 95-99% of names and the fuzzy pass adds very little. For OCR-transcribed herbarium labels or citizen science data with frequent misspellings, the exact pass might resolve only 70-80% and the fuzzy pass becomes essential.

A practical two-pass pattern for large lists: run exact-only first, inspect the unmatched names, and decide whether the fuzzy pass is worth the time.

# Pass 1: exact only
result <- taxify(species_list, backend = "wfo", fuzzy = FALSE)

# How many names remain unmatched?
n_unmatched <- sum(result$match_type == "none")
message(n_unmatched, " names unmatched after exact pass")

# Pass 2: fuzzy only on the unmatched subset
if (n_unmatched > 0) {
  unmatched_names <- result$input_name[result$match_type == "none"]
  fuzzy_result <- taxify(unmatched_names, backend = "wfo", fuzzy = TRUE)

  # Merge back
  matched_rows <- fuzzy_result$match_type != "none"
  idx <- match(fuzzy_result$input_name[matched_rows],
               result$input_name)
  result[idx, ] <- fuzzy_result[matched_rows, ]
}

This pattern is especially useful when only 1-5% of names need fuzzy matching. The exact pass finishes in seconds even for 100,000 names, and the fuzzy pass operates on a much smaller subset. The total wall time is often less than half of what a single fuzzy = TRUE call would take, because the fuzzy engine does not need to allocate working memory or build query tables for the names that already matched exactly.

One subtlety: the second call to taxify() does not re-materialize the backbone. The session cache from the first call is still active, so the fuzzy-only pass starts immediately with the string-distance computation. There is no penalty for splitting the work into two calls.

Worked example: multi-backend fallback ordering

When taxify() receives multiple backends, it processes them as a sequential fallback chain. Names matched by an earlier backend are excluded from later ones. The order matters for performance: the first backend sees all names, the second sees only those that failed, and so on.

Suppose we have a mixed-kingdom species list from a freshwater ecology survey: mostly aquatic plants, some fish, a handful of invertebrates. WFO covers the plants, COL covers everything but is larger and slower to search. Putting WFO first means the plant names (the majority) are resolved quickly, and only the animal names fall through to COL.

# 8,000 names: ~6,000 plants, ~1,500 fish, ~500 invertebrates
t_wfo_first <- system.time(
  result_a <- taxify(survey_names,
                     backend = c("wfo", "col"),
                     fuzzy = TRUE)
)
t_wfo_first["elapsed"]
#> elapsed
#>   25.3

# Reversed order: COL first, WFO second
t_col_first <- system.time(
  result_b <- taxify(survey_names,
                     backend = c("col", "wfo"),
                     fuzzy = TRUE)
)
t_col_first["elapsed"]
#> elapsed
#>   41.7

The results are identical in terms of name resolution (both backends ultimately resolve the same names to accepted names), but the WFO-first ordering is faster because WFO’s smaller backbone (~400,000 rows) resolves 75% of the list before COL’s larger backbone (~4.5 million rows) is ever touched. The saving comes from two sources: the exact pass against WFO is faster (smaller hash table), and the fuzzy pass against COL runs on 2,000 names instead of 8,000. Since fuzzy matching cost scales linearly with the number of unmatched names, resolving 6,000 names via WFO’s fast exact pass eliminates the need for 6,000 fuzzy comparisons against COL’s much larger genus blocks.

Note that the backend column in the output records which backend resolved each name. This is useful for quality control: if a name was resolved by the second backend in the chain, it means the first backend either did not contain it or matched it differently. Inspecting the backend column after a multi-backend run can reveal patterns in taxonomic coverage gaps.

General guidelines for backend ordering:

  • Plant-only lists: "wfo" alone is sufficient. WFO has the most complete plant synonym coverage and a compact backbone.
  • Marine lists: "worms" first, then "col" or "gbif" for anything WoRMS misses.
  • Mixed-kingdom lists: put the backbone that covers the dominant kingdom first. For a list that is 80% plants and 20% animals, c("wfo", "col") is faster than c("col", "wfo").
  • Maximizing coverage: c("wfo", "col", "gbif") casts the widest net but involves three backbone loads. For lists under 10,000 names the extra loading time is negligible. For 100,000+ names, the extra fuzzy passes add up.

Backbone sizes on disk

Each backbone’s .vtr file is a one-time download stored in taxify_data_dir(). The sizes below are approximate and depend on the backbone version.

Backend Rows (approx.) .vtr size on disk Scope
WFO 400,000 50-70 MB Plants (vascular + bryophytes)
COL 4,500,000 250-350 MB All kingdoms
GBIF 7,000,000 500-700 MB All kingdoms (largest)
ITIS 800,000 80-120 MB North American focus
NCBI 2,500,000 200-300 MB Molecular/genomic taxa
OTT 3,500,000 300-400 MB Synthetic tree (multi-source)
WoRMS 600,000 60-80 MB Marine taxa

A full installation of all seven backbones occupies roughly 1.5-2 GB. Most workflows need only one or two. The WFO backbone alone covers the vast majority of plant taxonomy use cases at under 70 MB.

The download sizes are comparable to the on-disk sizes since the .vtr format is already compressed. No additional decompression step runs after download. The file that arrives on disk is the file that vectra reads at query time.

Enrichment files are much smaller. The largest enrichment is WCVP (native range data, ~2 million rows) at roughly 30-40 MB. Most enrichments are under 5 MB. A full set of 12 enrichments adds about 80-100 MB to disk usage.

Memory footprint

When a backbone is loaded for the first time in a session, vectra materializes it as an in-memory columnar block. The memory footprint is roughly 1.5-2x the .vtr file size because the columnar block includes hash indexes and decompressed string data. WFO occupies about 80-100 MB in memory, COL about 400-500 MB, and GBIF about 800 MB-1 GB. The block persists for the session and is reused by every taxify() call. Loading a second backbone (e.g., during a multi-backend fallback) adds its own block to memory. The two blocks coexist independently.

Fuzzy matching adds a transient memory cost on top of the backbone block. For each fuzzy pass, taxify writes a temporary .vtr containing the unmatched names and their genera, then passes it to vectra’s fuzzy_join() function. The join allocates a working buffer proportional to the number of unmatched names times the average genus block size. For 5,000 unmatched names against WFO, this working set is roughly 10-20 MB. For 50,000 unmatched names against GBIF, it can reach 100-200 MB. The temporary files and working buffers are freed after each fuzzy pass.

Enrichment .vtr files are loaded on demand. Calling add_conservation_status() loads the conservation_status enrichment (~60,000 rows, a few MB). Calling add_elton_traits() loads EltonTraits (~15,000 rows). Enrichment joins use a different mechanism than backbone matching: they build a temporary .vtr of unique accepted names, run an inner_join() against the enrichment .vtr, and fill the result via match(). The enrichment .vtr itself is not fully materialized into memory; only the joined subset is collected. The memory cost per enrichment join is proportional to the number of unique accepted names in the result, which is typically much smaller than the input list (synonyms collapse to shared accepted names).

For a typical session matching 50,000 plant names against WFO with two enrichments, expect about 150 MB of total memory usage from taxify’s caches. Matching the same names against GBIF with five enrichments brings that closer to 1.2 GB. On a machine with 8 GB of RAM, this leaves ample room for downstream analysis. On a shared server with 2-4 GB per process, the GBIF backbone might be tight.

If memory is tight, three strategies help:

  1. Use a smaller backbone. WFO at ~100 MB in memory is 8x lighter than GBIF. For plant-only lists there is no coverage penalty.
  2. Clear the cache between analysis phases. After matching is done and the result is saved, release the backbone from memory before running downstream models.
  3. Process enrichments in the same loop as matching (the chunk-and-write pattern shown below), rather than accumulating the full result in memory and enriching after.
# Match names
result <- taxify(species_list, backend = "gbif")

# Save result
saveRDS(result, "matched_names.rds")

# Free the backbone from memory
taxify_clear_cache()

# Now ~800 MB of RAM is available for downstream work
gc()

Worked example: batch processing a very large list

Lists above 100,000 names are common in biodiversity informatics. A national herbarium digitization project might produce 500,000 label transcriptions. A metabarcoding pipeline might output 200,000 OTU labels. Processing these in a single taxify() call works, but splitting into chunks gives two practical benefits: progress monitoring and memory stability.

taxify’s matching engine handles the full vector internally, and a single taxify() call on 500,000 names will produce correct results. But two practical issues arise at this scale. First, the fuzzy-join working set grows with input size: 500,000 names with 10% unmatched means 50,000 fuzzy comparisons, each scanning a genus block. The temporary .vtr files and distance matrices for this many comparisons can spike memory by several hundred MB. Second, if the R process is interrupted mid-run (Ctrl+C, OOM kill, session timeout), the entire result is lost. Chunking at 50,000-100,000 names keeps peak memory predictable and provides natural restart points.

# 300,000 names from a herbarium digitization project
all_names <- readLines("herbarium_names.txt")
chunk_size <- 50000

# Split into chunks
chunks <- split(all_names,
                ceiling(seq_along(all_names) / chunk_size))

# Process each chunk
results <- lapply(seq_along(chunks), function(i) {
  message(sprintf("Chunk %d/%d (%d names)...",
                  i, length(chunks), length(chunks[[i]])))
  taxify(chunks[[i]], backend = "wfo", fuzzy = TRUE, verbose = FALSE)
})

# Combine
result <- do.call(rbind, results)
nrow(result)
#> [1] 300000

The backbone stays in memory across chunks (the session cache is not cleared between calls), so each chunk after the first skips the initialization overhead. Only the fuzzy-join working set is allocated and freed per chunk.

For lists in the millions (e.g., processing all occurrence records from a GBIF download), consider writing results to disk after each chunk rather than accumulating in memory:

output_dir <- "results"
dir.create(output_dir, showWarnings = FALSE)

for (i in seq_along(chunks)) {
  message(sprintf("Chunk %d/%d", i, length(chunks)))
  res <- taxify(chunks[[i]], backend = "wfo",
                fuzzy = TRUE, verbose = FALSE)
  saveRDS(res, file.path(output_dir,
                         sprintf("chunk_%04d.rds", i)))
}

# Combine when needed
all_files <- list.files(output_dir, pattern = "\\.rds$",
                        full.names = TRUE)
result <- do.call(rbind, lapply(all_files, readRDS))

This pattern keeps R’s memory usage bounded by a single chunk regardless of total list size. It also makes the workflow resumable: if the process dies at chunk 47 of 60, we can check which .rds files exist and restart from chunk 48. Adding a simple skip condition handles this:

for (i in seq_along(chunks)) {
  out_file <- file.path(output_dir, sprintf("chunk_%04d.rds", i))
  if (file.exists(out_file)) next
  message(sprintf("Chunk %d/%d", i, length(chunks)))
  res <- taxify(chunks[[i]], backend = "wfo",
                fuzzy = TRUE, verbose = FALSE)
  saveRDS(res, out_file)
}

One detail worth noting: the chunk boundaries are arbitrary and do not affect matching quality. Each chunk is matched independently against the backbone. A name that appears in chunk 3 gets the same result as if it appeared in chunk 7 because the backbone is deterministic and the matching logic is stateless across calls. The only shared state between chunks is the session cache (the materialized backbone block), which improves performance by avoiding repeated initialization.

Cache management

taxify uses two internal environments for session-level caching. The path cache stores the disk location of each loaded backbone. The data cache stores materialized columnar blocks, the session manifest, version-check flags, and enrichment paths. Both persist until the R session ends or the user explicitly clears them.

taxify_clear_cache() removes all loaded backbone paths from memory. The next taxify() call will re-read from disk and re-materialize. This is useful after a large matching run when the backbone is no longer needed and the memory can be reclaimed.

# After finishing all matching work
taxify_clear_cache()
gc()

Clearing the cache does not delete any files from disk. The .vtr files remain in taxify_data_dir() and will be reloaded on the next use. The cost of reloading is the same 1-10 second initialization time that the first call in a session incurs. For a workflow where matching is done in one phase and downstream modelling in another, this is a worthwhile trade: spend 3 seconds reloading WFO later if needed, but free 100 MB of RAM for a memory-intensive ordination or species distribution model.

taxify_refresh_manifest() is a narrower operation: it invalidates the cached copy of the remote manifest (the JSON file listing the latest version of each backbone and enrichment). This forces the next taxify() call to re-check for updates. Normally the manifest is fetched once per session and cached. In a long-running R session (e.g., an RStudio session that stays open for days), calling taxify_refresh_manifest() before a batch run ensures you are working against the latest backbone version. If a new backbone release was published since the session started, the version check will detect it and trigger an automatic download.

Disk storage and sharing across projects

All taxify data lives under taxify_data_dir(), which resolves to the platform-specific user data directory via tools::R_user_dir("taxify", "data"). The layout is:

taxify_data_dir()/
  wfo/
    latest/
      wfo.vtr          # the backbone
      wfo.meta         # download provenance
      meta.json        # version metadata
  col/
    latest/
      col.vtr
      ...
  enrichment/
    conservation_status/
      latest/
        conservation_status.vtr
        meta.json
    woodiness/
      latest/
        woodiness.vtr
        meta.json
    ...

This directory is per-user, not per-project. A backbone downloaded once is available to every R project on the machine without duplication. There is no need to copy .vtr files into a project directory or version-control them.

If multiple users on a shared server need the same backbones, one user can download them and the others can set the R_USER_DATA_DIR environment variable (or symlink taxify_data_dir()) to a shared location. The .vtr files are read-only at query time, so concurrent access from multiple R sessions is safe. No file locking is needed.

To check how much disk space taxify is currently using:

# Total size of all backbones and enrichments
data_dir <- taxify_data_dir()
files <- list.files(data_dir, recursive = TRUE, full.names = TRUE)
total_mb <- sum(file.size(files), na.rm = TRUE) / 1048576
message(sprintf("taxify data: %.0f MB across %d files",
                total_mb, length(files)))

To remove a specific backbone (e.g., GBIF after finishing a project that needed it), delete its directory:

# Remove GBIF backbone (frees ~500-700 MB)
unlink(file.path(taxify_data_dir(), "gbif"), recursive = TRUE)

# Clear the session cache so taxify() doesn't try to use the old path
taxify_clear_cache()

Deleting a backbone directory is safe. The next taxify() call for that backend will re-download it from Zenodo if needed.

Worked example: pre-downloading resources

For a reproducible batch pipeline (e.g., a Makefile or targets plan), it is cleaner to separate the download step from the analysis step. Downloads can fail due to network issues, and you want to know about that before a 2-hour matching run starts.

taxify_download_vtr() downloads one or more backbone .vtr files. taxify_download_enrichment() does the same for enrichment layers. Both are idempotent: if the file already exists and the version is current, they return immediately.

# Pre-download everything needed for a multi-kingdom analysis
# with conservation status and trait enrichments

# Backbones
taxify_download_vtr(c("wfo", "col"))

# Enrichments
taxify_download_enrichment(c(
  "conservation_status",
  "woodiness",
  "eive",
  "elton_traits"
))

# Now the analysis can run fully offline
result <- taxify(species_list, backend = c("wfo", "col"))
result <- add_conservation_status(result)
result <- add_woodiness(result)

In a CI/CD or cluster environment, the download step can run in a setup script or container build phase. The matching step then operates entirely from local disk, with no network dependency and no risk of mid-run download failures. This separation also makes the pipeline reproducible: the download step pins a specific backbone version (recorded in the meta.json sidecar file), and the matching step uses whatever version is on disk.

For a targets or drake plan, the download calls fit naturally as upstream targets that the matching targets depend on. The return value (the .vtr path) can be passed through the dependency graph, though in practice the path is resolved internally by ensure_backbone() and does not need to be passed explicitly.

To see which enrichments are available and their current versions:

list_enrichments()
#>                name version    nrow static
#> 1 conservation_status 2026.04   59583  FALSE
#> 2               griis 2026.04   98131  FALSE
#> 3                wcvp 2026.04 1973234  FALSE
#> 4                eive     1.0   14835   TRUE
#> 5        elton_traits     1.0   15394   TRUE
#> 6              avonet     1.0   11009   TRUE
#> ...

Static enrichments (those based on published, version-locked datasets like EltonTraits 1.0 or PanTHERIA 1.0) are never re-downloaded after the initial fetch. Non-static enrichments (conservation_status, griis, wcvp, common_names) are checked once per session and updated if a newer build is available.

Practical scaling guidance

The table below summarizes recommended settings by list size. These are guidelines, not hard thresholds. The actual performance depends on input cleanliness (how many names need fuzzy matching), backbone size (WFO vs. GBIF), and hardware (number of cores, available RAM, disk speed).

Under 1,000 names. The defaults work well. taxify(names) with fuzzy = TRUE and a single backend completes in a few seconds. No tuning is needed. This is the regime for most interactive analysis: a field survey, a thesis species list, a table extracted from a paper. Memory usage is negligible.

1,000 to 50,000 names. If the input is clean (names from a curated database, a previous taxify run, or a standard checklist), consider fuzzy = FALSE. The exact pipeline handles case differences, Latin orthographic variants (e.g., -ii vs. -i endings), and infraspecific-to- species fallback without string-distance computation. Enabling fuzzy on a clean list of 50,000 names might add 30-60 seconds for no practical gain. If the input has known quality issues (OCR transcriptions, citizen science data), leave fuzzy on and expect 1-3 minutes against WFO.

50,000 to 500,000 names. Backend ordering starts to matter. Put the backbone that covers the dominant taxon group first. For a plant list, "wfo" alone suffices. For mixed-kingdom lists, c("wfo", "col") resolves most names on the faster WFO pass. Consider the two-pass pattern (exact first, fuzzy on unmatched) if only a small fraction of names have quality issues. The GBIF backbone at ~7 million rows is the most expensive for fuzzy matching; avoid it as the first backend unless the list is primarily non-plant, non-marine taxa not covered by COL.

Over 500,000 names. Batch in chunks of 50,000-100,000 names. The backbone stays cached across chunks, so there is no repeated initialization cost. Write results to disk per chunk if total memory is a concern. Clear the cache between analysis phases (matching, enrichment, downstream modelling) to keep memory usage bounded. If enriching with multiple layers, apply all enrichments to each chunk before writing rather than accumulating the full result in memory and enriching after.

# Pattern for 500,000+ names with enrichments
for (i in seq_along(chunks)) {
  res <- taxify(chunks[[i]], backend = "wfo",
                fuzzy = TRUE, verbose = FALSE)
  res <- add_conservation_status(res, verbose = FALSE)
  res <- add_woodiness(res, verbose = FALSE)
  saveRDS(res, sprintf("results/chunk_%04d.rds", i))
}

Backend selection cheat sheet:

List composition Recommended backend(s)
Plants only "wfo"
Plants + animals c("wfo", "col")
Marine taxa c("worms", "col")
Fungi c("fungorum", "col")
Algae c("algaebase", "col")
All kingdoms, maximum coverage c("wfo", "col", "gbif")
Molecular/genomic taxa c("ncbi", "col")
North American biodiversity c("itis", "col")

For any single-kingdom list, starting with the specialist backbone (WFO for plants, WoRMS for marine, NCBI for molecular) and falling back to COL or GBIF gives the best balance of speed and coverage. The specialist backbone resolves most names quickly (smaller backbone, faster exact pass), and the generalist backbone catches the remainder.

The fuzzy_threshold parameter

The default fuzzy threshold is 0.2 (normalized Damerau-Levenshtein distance: edits divided by the maximum of the two name lengths). A threshold of 0.2 allows roughly one edit per five characters, which catches single-character typos in binomials of typical length (15-25 characters). This is a good default for most use cases.

For large lists with noisy input (OCR, handwriting transcription), a slightly higher threshold like 0.25 catches more misspellings but also increases false positives. For clean input where fuzzy matching serves only as a safety net, a lower threshold like 0.1 or 0.15 reduces the risk of incorrect matches without sacrificing much recall.

The threshold also affects performance. A higher threshold means more backbone entries pass the distance filter, which means more candidate matches to evaluate and rank. The difference is modest for most inputs but can be noticeable for very large genera: at threshold 0.2, a query against Astragalus (~3,000 WFO entries) might return 5 candidates; at 0.3, it might return 20.

An alternative mode uses integer thresholds. Setting fuzzy_threshold = 2L allows at most 2 raw edit operations regardless of name length. This is useful for long infraspecific names where a normalized threshold of 0.2 might allow too many edits. Integer thresholds are not supported with the Jaro-Winkler method (fuzzy_method = "jw").

# Tight threshold for clean input
result <- taxify(clean_names, backend = "wfo",
                 fuzzy = TRUE, fuzzy_threshold = 0.1)

# Integer threshold: at most 2 edits, period
result <- taxify(noisy_names, backend = "wfo",
                 fuzzy = TRUE, fuzzy_threshold = 2L)

Summary of performance-relevant functions

Function Purpose
taxify(..., fuzzy = FALSE) Skip fuzzy matching for clean input
taxify(..., backend = c("wfo", "col")) Multi-backend fallback chain
taxify_data_dir() Find where backbones are stored
taxify_download_vtr() Pre-download backbone .vtr files
taxify_download_enrichment() Pre-download enrichment .vtr files
taxify_clear_cache() Free backbone memory after matching
taxify_refresh_manifest() Force re-check for backbone updates
list_enrichments() See available enrichments and versions