The taxonomic-resolution landscape in R

The R ecosystem has a rich set of taxonomic name-resolution tools. Each makes different design choices along three axes: where the data lives (local files or remote APIs), how many backbones are bundled, and what the package returns. The table below summarizes the options most likely to overlap with a taxify workflow.

| Package | Source data | Coverage | Access | Closest taxify analogue |
|---|---|---|---|---|
| taxize | ~20 web services (NCBI, ITIS, GBIF, EOL, IUCN, WoRMS, Tropicos, …) | All kingdoms | Live API | taxify(backend = c(...)) with the relevant local backbone(s) |
| WorldFlora | World Flora Online classification (WFO.match) | Land plants (vascular + bryophytes) | Local file | taxify(backend = "wfo") |
| lcvplants | Leipzig Catalogue of Vascular Plants | Vascular plants | Bundled in package | taxify(backend = "lcvp") |
| rWCVP | World Checklist of Vascular Plants (Kew) | Vascular plants | Local snapshot | taxify(backend = "wcvp") |
| taxadb | GBIF, ITIS, COL, NCBI, OTT, WFO snapshots | All kingdoms | Local DuckDB / MonetDB | taxify(backend = c(...)) |
| Taxonstand | The Plant List (retired by Kew in 2013, superseded by WCVP and WFO) | Vascular plants | Bundled in package | taxify(backend = c("wcvp", "wfo")) |
| U.Taxonstand | User-supplied or bundled checklists | Configurable | Local | taxify(backend = ...) plus add_data() |
| bdc | taxadb + GNR for the taxonomic step inside a larger biodiversity-cleaning workflow | All kingdoms | Local + API | taxify() for the matching step |
| TNRS | TNRS web service (BIEN / iDigBio) | Plants | Live API | taxify(backend = "wfo") or similar |
| rgbif, worrms, ritis | GBIF / WoRMS / ITIS web APIs | One backbone each | Live API | taxify(backend = "gbif" / "worms" / "itis") |

If your workflow already uses one of these and you are happy with it, there is no urgent reason to switch.

That said, there are situations where taxify offers a better fit:

  • Multiple backbones. taxify matches against seven backbones offline and can chain them in a single call: taxify(names, backend = c("wfo", "col", "gbif")).
  • Speed at scale. The matching engine is written in C with genus-blocked fuzzy joins. Ten thousand names resolve in seconds.
  • Enrichments. Results pipe directly into twelve published trait and status datasets (IUCN, GRIIS, WCVP, EIVE, EltonTraits, etc.) with a single |> chain.
  • Reproducibility. Backbones are versioned files on disk. The backbone_version column records exactly which snapshot was used.
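
To make the genus-blocking idea concrete, here is a minimal base-R sketch. It is not taxify's C implementation. It assumes the genus is the first word of each name, uses utils::adist() (plain Levenshtein, so a transposition costs two edits where Damerau-Levenshtein would charge one), and normalizes by the longer string's length. The helper names, the toy backbone, and the 0.2 cutoff are all hypothetical.

```r
# Toy backbone; a real backbone holds far more names.
backbone <- c("Quercus robur", "Quercus petraea", "Quercus pedunculata",
              "Pinus sylvestris", "Pinus nigra", "Rosa canina")

# Hypothetical helper: treat the first word of a name as its genus.
genus_of <- function(x) sub(" .*$", "", x)

# Split the backbone once into one block of names per genus.
blocks <- split(backbone, genus_of(backbone))

fuzzy_match_blocked <- function(query, blocks, max_dist = 0.2) {
  out <- data.frame(input_name = query, matched_name = NA_character_,
                    fuzzy_dist = NA_real_, stringsAsFactors = FALSE)
  candidates <- blocks[[genus_of(query)]]
  if (is.null(candidates)) return(out)  # unknown genus: nothing to compare
  # Edit distance to same-genus names only, scaled to the 0-1 range.
  d <- as.vector(utils::adist(query, candidates)) /
    pmax(nchar(query), nchar(candidates))
  best <- which.min(d)
  if (d[best] <= max_dist) {
    out$matched_name <- candidates[best]
    out$fuzzy_dist <- d[best]
  }
  out
}

queries <- c("Quercus pedonculata", "Pinus silvestris", "Betula pendla")
blocked <- do.call(rbind, lapply(queries, fuzzy_match_blocked, blocks = blocks))
blocked
```

Each query is compared only with names in its own genus, so the cost depends on genus size, not on the size of the whole backbone. The trade-off in this sketch is that a typo in the genus itself finds no block and returns no match.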

This vignette maps each package's functions to their taxify equivalents, walks through three side-by-side examples, and sets out plainly what taxify does not cover.

Function mapping: taxize to taxify

The table below maps the taxize name-resolution functions to their closest taxify equivalent.

| taxize function | taxify equivalent | Notes |
|---|---|---|
| gnr_resolve() | taxify() | Any backend; returns best match per name |
| classification() | taxify() | family, genus, rank columns in the output; add_col_info() for full hierarchy |
| synonyms() | taxify() | is_synonym + accepted_name columns in the output |
| tax_name() | taxify() | family, genus, rank columns |
| sci2comm() | add_common_names() | Pipe enrichment; GBIF vernacular names by language |

taxize also has functions that serve a different purpose (fetching database IDs, enumerating child taxa, retrieving occurrence or sequence data). These are not name-resolution functions, so taxify does not cover them. The “What taxify does not do” section below points to the right packages for those tasks.

The key structural difference: taxize returned results in varied formats depending on the function (classification() gave a nested list of data.frames, synonyms() another nested list, get_tsn() a character vector with attributes). taxify returns the same 16-column data.frame from every call. Synonym status, classification, and match quality are columns, not separate API calls.
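
The difference in shape can be illustrated without either package. The sketch below builds a toy stand-in for a classification()-style result (a named list with one hierarchy data.frame per name) and flattens it into one row per input. The pick_rank() helper and the toy data are hypothetical, and the flat table has only three columns, not taxify's sixteen.

```r
# Toy stand-in for a nested result: one hierarchy table per input name.
nested <- list(
  "Quercus robur" = data.frame(
    name = c("Fagaceae", "Quercus", "Quercus robur"),
    rank = c("family", "genus", "species"),
    stringsAsFactors = FALSE
  ),
  "Panthera leo" = data.frame(
    name = c("Felidae", "Panthera", "Panthera leo"),
    rank = c("family", "genus", "species"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: pull one rank out of a hierarchy table (NA if absent).
pick_rank <- function(hierarchy, rank) {
  hit <- hierarchy$name[hierarchy$rank == rank]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Flat shape: one row per input, one column per attribute.
flat <- data.frame(
  input_name = names(nested),
  family = vapply(nested, pick_rank, character(1), rank = "family"),
  genus  = vapply(nested, pick_rank, character(1), rank = "genus"),
  row.names = NULL,
  stringsAsFactors = FALSE
)
flat
```

With the flat shape, downstream code can filter, join, or bind results by column without first walking a list.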

Function mapping: WorldFlora to taxify

| WorldFlora function | taxify equivalent | Notes |
|---|---|---|
| WFO.match() | taxify(backend = "wfo") | Exact + fuzzy in one call |
| WFO.one() | taxify() | Best-match selection is automatic |
| WFO.match.fuzzyjoin() | taxify(fuzzy = TRUE) | Enabled by default; genus-blocked Damerau-Levenshtein |
| WFO.synonyms() | taxify() | is_synonym, accepted_name, accepted_id in output |

WorldFlora returns a wide data.frame with WFO-specific column names (scientificName, taxonID, taxonomicStatus, acceptedNameUsageID, plus authorship and bibliographic fields). taxify normalizes these into a backend-agnostic schema: matched_name, taxon_id, accepted_name, accepted_id, and so on. The WFO-specific columns are still accessible via add_wfo_info() when needed, but the default output is the same 16 columns whether the backend is WFO, COL, or GBIF.

taxify also handles backbone management automatically: the first taxify() call downloads the backbone, subsequent calls reuse the local copy, and a once-per-session version check keeps it current.
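
The download-once, reuse-afterwards pattern described here can be sketched in a few lines of base R. This is an illustration of the general caching idea only. get_backbone(), the fake_fetch() download stub, and the .txt file layout are hypothetical and do not reflect taxify's internals.

```r
# Session-level record of which backbones have already been version-checked.
.session <- new.env()

# Hypothetical helper; `fetch` stands in for a real download function.
get_backbone <- function(name, cache_dir, fetch) {
  path <- file.path(cache_dir, paste0(name, ".txt"))
  if (!file.exists(path)) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    fetch(path)                    # first call only: populate the cache
  }
  if (!isTRUE(.session[[name]])) {
    # A real implementation would compare local and remote versions here.
    .session[[name]] <- TRUE       # remember the check for this session
  }
  path
}

# Fake "download" that counts how often it runs.
n_downloads <- 0
fake_fetch <- function(path) {
  n_downloads <<- n_downloads + 1
  writeLines("Quercus robur", path)
}

cache_dir <- tempfile("backbone-cache")
p1 <- get_backbone("wfo", cache_dir, fake_fetch)  # downloads
p2 <- get_backbone("wfo", cache_dir, fake_fetch)  # reuses the local copy
n_downloads
```

The second call finds the file on disk and skips the fetch, which is why matching after the first call needs no network access.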

Function mapping: lcvplants to taxify

lcvplants wraps the Leipzig Catalogue of Vascular Plants and ships the LCVP table as bundled data. The package centres on LCVP() and lcvp_search().

| lcvplants function | taxify equivalent | Notes |
|---|---|---|
| LCVP(splist) | taxify(splist, backend = "lcvp") | Returns the standardized 16-column data.frame |
| lcvp_search() | taxify() | Search by name; same output schema |
| lcvp_fuzzy_search() | taxify(fuzzy = TRUE) | Genus-blocked Damerau-Levenshtein; on by default |
| tab_lcvp (data object) | taxify_data_dir() / lcvp / latest / lcvp.vtr | The LCVP snapshot is shipped as a .vtr file rather than an in-package data object |

The LCVP and WCVP backbones can be combined in a single fallback chain, so that a name WCVP does not match falls through to LCVP and then to WFO. The backend column records which authority supplied each match:

result <- taxify(plant_names, backend = c("wcvp", "lcvp", "wfo"))
result[, c("input_name", "accepted_name", "backend")]

Function mapping: rWCVP to taxify

rWCVP is the Kew package for the World Checklist of Vascular Plants. Its name-resolution side centres on wcvp_match_names() and wcvp_check_gbif(); it also has a strong distribution-query side that taxify does not replace.

| rWCVP function | taxify equivalent | Notes |
|---|---|---|
| wcvp_match_names() | taxify(backend = "wcvp") | Exact + fuzzy in one call |
| wcvp_check_gbif() | taxify(backend = c("wcvp", "gbif")) | Cascade WCVP first, GBIF as fallback |
| wcvp_distribution() | add_wcvp() | Native range by TDWG region (the add_wcvp() enrichment) |
| wcvp_synonyms() | taxify() | is_synonym and accepted_name columns in the output |
| get_wcvp() | automatic | The backbone downloads on first taxify(backend = "wcvp") call |

rWCVP’s distribution-query functions (wcvp_occ_mat(), wcvp_checklist()) operate on TDWG geography and are outside taxify’s scope. For native-range data joined to a name-resolved result, add_wcvp() covers the most common case; for full geographic queries, rWCVP remains the right tool.

Function mapping: taxadb to taxify

taxadb is the closest functional analogue to taxify. Both store backbone snapshots locally and avoid network calls at query time. The two packages differ in matching strategy and integration: taxadb returns a long-format table for exact-key joins, while taxify returns a flat one-row-per-input result with fuzzy matching, synonym resolution, and trait enrichment built in.

| taxadb function | taxify equivalent | Notes |
|---|---|---|
| td_create("itis") | automatic | First taxify(backend = "itis") call downloads the .vtr snapshot |
| filter_name(names, "itis") | taxify(names, backend = "itis") | Exact match against the local snapshot |
| filter_id(ids, "itis") | not exposed | Use vectra::tbl() directly on the .vtr if needed |
| synonyms(names, "itis") | taxify() | is_synonym, accepted_name, accepted_id in the output |
| clean_names() | automatic | taxify() runs the cleaning pipeline (authorship, qualifiers, hybrid markers, orthography) before matching |
| (no fuzzy match) | taxify(fuzzy = TRUE) | Genus-blocked Damerau-Levenshtein, on by default |
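
As a rough illustration of what a pre-matching cleaning step involves, the base-R sketch below handles the four categories named in the table (authorship, qualifiers, hybrid markers, orthography) in a deliberately simplified way. clean_name() is a hypothetical helper, not taxify's pipeline: it keeps only the first two words, so infraspecific names and leading hybrid markers are not handled.

```r
# Hypothetical, simplified cleaner for binomials.
clean_name <- function(x) {
  x <- gsub("[[:space:]]+", " ", trimws(x))       # orthography: whitespace
  is_hybrid <- grepl(" (\u00d7|x) ", x)            # flag hybrid marker (× or x)
  x <- gsub(" (\u00d7|x) ", " ", x)                # then drop the marker
  x <- gsub(" (cf|aff)\\. ", " ", x)               # qualifiers: cf., aff.
  x <- sub("^([^ ]+ [^ ]+).*$", "\\1", x)          # authorship: keep two words
  # orthography: capitalized genus, lowercase epithet
  x <- paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
  data.frame(cleaned = x, is_hybrid = is_hybrid, stringsAsFactors = FALSE)
}

raw <- c("Quercus robur L.", "quercus  cf. ROBUR",
         "Mentha \u00d7 piperita L.", "Pinus sylvestris")
cleaned <- clean_name(raw)
cleaned
```

Running the cleaner before matching means the first two inputs resolve to the same string, so both can be matched exactly without a fuzzy search.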

The two largest practical differences:

  • Matching scope. taxadb is built around exact lookups against pre-cleaned input. taxify cleans the input automatically and runs fuzzy matching on names that do not match exactly, which catches typos, orthographic variants, and authorship strings without a separate preprocessing step.
  • Output shape. taxadb returns multiple rows per input when a name has multiple matches (you pick the row you want with dplyr::filter). taxify returns one row per input with a best-match selection rule (ACCEPTED over SYNONYM, species rank over higher ranks, lowest ID as tiebreaker), and reports the match type and fuzzy distance as columns.
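
The best-match rule in the second bullet (ACCEPTED over SYNONYM, species rank over higher ranks, lowest ID as tiebreaker) amounts to a multi-key sort followed by keeping the first row per input. The base-R sketch below shows that reduction on a toy candidate table. The names, IDs, and column layout are hypothetical, and integer IDs are assumed so that "lowest" is unambiguous.

```r
# Toy long-format candidates: several rows per input name.
candidates <- data.frame(
  input_name = c("name A", "name A", "name A", "name B", "name B"),
  taxon_id   = c(3L, 1L, 2L, 6L, 7L),
  status     = c("ACCEPTED", "SYNONYM", "ACCEPTED", "ACCEPTED", "ACCEPTED"),
  rank       = c("species", "species", "species", "genus", "species"),
  stringsAsFactors = FALSE
)

# Sort keys, in priority order:
#   1. status: ACCEPTED before SYNONYM
#   2. rank:   species before anything else (FALSE sorts before TRUE)
#   3. id:     lowest first
status_order <- c("ACCEPTED", "SYNONYM")
ord <- order(candidates$input_name,
             match(candidates$status, status_order),
             candidates$rank != "species",
             candidates$taxon_id)
ranked <- candidates[ord, ]

# Keep the top-ranked row for each input: one row per name.
best <- ranked[!duplicated(ranked$input_name), ]
best
```

For "name A" the synonym with the lowest ID loses to the accepted rows, and ID 2 wins the tiebreak over ID 3. For "name B" the species-rank row wins despite its higher ID.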

For workflows that already use taxadb’s column-oriented querying for custom analyses, taxadb’s approach is a clean fit. For workflows that need a single resolved name per input plus enrichment joins, taxify’s flat output is closer to the goal.

Function mapping: Taxonstand to taxify

Taxonstand was built around The Plant List, which Kew retired in 2013 in favour of WCVP and WFO. The package still works, but the underlying taxonomy has not been updated since the retirement.

| Taxonstand function | taxify equivalent | Notes |
|---|---|---|
| TPL(splist) | taxify(splist, backend = c("wcvp", "wfo")) | Replace TPL with its successors |
| TPLck() | taxify() | Single-name check; same output schema |

The simplest migration is to replace each TPL(splist) call with taxify(splist, backend = c("wcvp", "wfo")), or with backend = c("lcvp", "wcvp", "wfo") to cascade across all three large vascular-plant backbones.

Example 1: Basic name resolution

With taxize, name resolution typically meant several separate calls: gnr_resolve() for matching, get_gbifid() for IDs, classification() for hierarchy, synonyms() for synonym status.

# --- taxize ---
library(taxize)

names <- c("Quercus robur", "Pinus sylvestris", "Betula pendula",
           "Panthera leo", "Salmo trutta")

resolved   <- gnr_resolve(names, best_match_only = TRUE)  # spelling match
gbif_ids   <- get_gbifid(names)                           # GBIF identifiers
class_list <- classification(gbif_ids, db = "gbif")       # hierarchy per name
syn_list   <- synonyms(gbif_ids, db = "gbif")             # synonym status

With taxify, all of that is one call:

# --- taxify ---
library(taxify)

names <- c("Quercus robur", "Pinus sylvestris", "Betula pendula",
           "Panthera leo", "Salmo trutta")

result <- taxify(names, backend = "gbif")

result$accepted_name
result$family
result$genus
result$is_synonym
result$taxon_id        # GBIF usage key

The output is a data.frame with 16 columns and one row per input name.

Example 2: WFO matching with fuzzy + synonyms

With WorldFlora, the typical workflow is: load the backbone, run exact matching, apply fuzzy matching separately, then pick the best match.

# --- WorldFlora ---
library(WorldFlora)

wfo_data <- read.delim("classification.txt")

names <- c("Quercus robur", "Quercus pedonculata",
           "Pinus silvestris", "Rosa canina")
exact <- WFO.match(names, WFO.data = wfo_data)
fuzzy <- WFO.match.fuzzyjoin(names, WFO.data = wfo_data)
best  <- WFO.one(fuzzy)

With taxify, exact matching, fuzzy matching, and synonym resolution happen in a single call:

# --- taxify ---
library(taxify)

names <- c("Quercus robur", "Quercus pedonculata",
           "Pinus silvestris", "Rosa canina")

result <- taxify(names, backend = "wfo")

# Misspellings are caught by fuzzy matching:
result[, c("input_name", "matched_name", "match_type", "fuzzy_dist")]
#   input_name           matched_name        match_type fuzzy_dist
# 1 Quercus robur        Quercus robur       exact              NA
# 2 Quercus pedonculata  Quercus pedunculata fuzzy           0.053
# 3 Pinus silvestris     Pinus sylvestris    fuzzy           0.063
# 4 Rosa canina          Rosa canina         exact              NA

# Synonyms resolved automatically:
result[, c("input_name", "is_synonym", "accepted_name")]

Quercus pedonculata is both a misspelling and a synonym. taxify handles both: the fuzzy matcher corrects the spelling to Quercus pedunculata, and the synonym resolver maps it to Quercus robur.
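
The two steps can be traced by hand in base R. The sketch below is an approximation under stated assumptions: it uses utils::adist() (Levenshtein, without the transposition operation) and assumes fuzzy_dist is the edit distance divided by the longer string's length. The toy backbone table is hypothetical; its one synonym relationship is the one described above.

```r
# Toy backbone: each name points at its accepted name.
backbone <- data.frame(
  name = c("Quercus robur", "Quercus pedunculata",
           "Pinus sylvestris", "Rosa canina"),
  accepted_name = c("Quercus robur", "Quercus robur",
                    "Pinus sylvestris", "Rosa canina"),
  stringsAsFactors = FALSE
)

input <- "Quercus pedonculata"

# Step 1: spelling. Find the closest backbone name by normalized distance.
d <- as.vector(utils::adist(input, backbone$name)) /
  pmax(nchar(input), nchar(backbone$name))
hit <- which.min(d)

# Step 2: synonymy. Follow the matched name to its accepted name.
resolved <- data.frame(
  input_name    = input,
  matched_name  = backbone$name[hit],
  fuzzy_dist    = d[hit],
  is_synonym    = backbone$name[hit] != backbone$accepted_name[hit],
  accepted_name = backbone$accepted_name[hit],
  stringsAsFactors = FALSE
)
resolved
```

One substitution in a 19-character name gives 1/19, roughly 0.053, which is consistent with the fuzzy_dist shown in the output above under this normalization.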

Example 3: Multi-backend fallback with enrichments

taxify can chain multiple backbones in a single call. Unmatched names cascade to the next backbone automatically.

library(taxify)

# Mixed kingdom input: plants, animals, fungi
names <- c(
  "Quercus robur",         # plant (WFO primary)
  "Panthera leo",          # animal (not in WFO, picked up by GBIF)
  "Amanita muscaria",      # fungus (not in WFO, picked up by GBIF)
  "Salmo trutta",          # fish (not in WFO, picked up by GBIF)
  "Arabidopsis thaliana"   # plant (in both WFO and GBIF)
)

# WFO first (best for plants), GBIF as fallback (all kingdoms)
result <- taxify(names, backend = c("wfo", "gbif"))

# The backend column shows which database matched each name:
result[, c("input_name", "backend", "family")]
#   input_name            backend family
# 1 Quercus robur         wfo     Fagaceae
# 2 Panthera leo          gbif    Felidae
# 3 Amanita muscaria      gbif    Amanitaceae
# 4 Salmo trutta          gbif    Salmonidae
# 5 Arabidopsis thaliana  wfo     Brassicaceae

# Enrich with traits:
result |>
  add_conservation_status() |>
  add_woodiness()

# Or join custom data:
my_traits <- data.frame(
  species = c("Quercus robur", "Panthera leo"),
  max_height_m = c(35, NA),
  body_mass_kg = c(NA, 190)
)
result |> add_data(my_traits, species_col = "species")
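
The cascade itself is simple to reason about: each backbone only sees the names that earlier backbones left unmatched. Below is a minimal base-R sketch of that logic with exact matching only. The toy lookup vectors and cascade_match() are hypothetical stand-ins, not taxify's implementation, and the last input is a made-up name that matches nothing.

```r
# Toy backbones as named vectors: scientific name -> family.
backbones <- list(
  wfo = c("Quercus robur" = "Fagaceae",
          "Arabidopsis thaliana" = "Brassicaceae"),
  gbif = c("Quercus robur" = "Fagaceae",
           "Panthera leo" = "Felidae",
           "Amanita muscaria" = "Amanitaceae",
           "Salmo trutta" = "Salmonidae",
           "Arabidopsis thaliana" = "Brassicaceae")
)

# Hypothetical helper: try each backbone in order, first match wins.
cascade_match <- function(input, backbones) {
  out <- data.frame(input_name = input, backend = NA_character_,
                    family = NA_character_, stringsAsFactors = FALSE)
  for (b in names(backbones)) {
    # Only names still unmatched are offered to this backbone.
    todo <- is.na(out$backend) & input %in% names(backbones[[b]])
    out$backend[todo] <- b
    out$family[todo] <- unname(backbones[[b]][input[todo]])
  }
  out
}

input <- c("Quercus robur", "Panthera leo", "Amanita muscaria",
           "Salmo trutta", "Arabidopsis thaliana", "Nonexistus fakus")
cascaded <- cascade_match(input, backbones)
cascaded
```

Order matters: Arabidopsis thaliana is present in both toy backbones but is credited to the first one in the chain, and a name absent from every backbone keeps NA in both columns.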

Key differences at a glance

Offline matching. taxify downloads backbone files once and matches locally. After the initial download (typically 50–300 MB depending on the backbone), no internet connection is needed.

Multi-backend. taxify supports seven backbones through a single function, with optional fallback chains that cascade unmatched names automatically.

Output format. taxify always returns a data.frame with 16 standardized columns, regardless of the backend:

| Column | Type | Content |
|---|---|---|
| input_name | character | Original name as submitted |
| matched_name | character | Closest match in the backbone |
| accepted_name | character | Accepted name after synonym resolution |
| taxon_id | character | Backend-specific ID of the matched name |
| accepted_id | character | ID of the accepted name |
| rank | character | Taxonomic rank (species, genus, family, etc.) |
| family | character | Family name |
| genus | character | Genus name |
| epithet | character | Specific epithet |
| authorship | character | Taxonomic authority |
| is_synonym | logical | Was the matched name a synonym? |
| is_hybrid | logical | Hybrid marker detected in the input? |
| match_type | character | "exact", "exact_ci", "fuzzy", or "none" |
| fuzzy_dist | numeric | Normalized edit distance (NA if exact) |
| backend | character | Which backend matched this name |
| backbone_version | character | Backend name, version, and download date |

Speed. taxify uses vectra’s C-level join engine with hash indexes and genus-blocked fuzzy joins, processing thousands of names per second.

Reproducibility. taxify stores each backbone as a versioned file on disk and records the version string in the backbone_version column of every result, so the same backbone file produces the same output indefinitely. A specific release can also be requested explicitly: taxify_download_vtr("wfo", version = "2024.06") downloads that release.
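
One way to put the backbone_version column to work is to save it next to the analysis output and compare it on later runs. The base-R sketch below does this on a hand-built stand-in for a taxify result. The version strings are made-up placeholders (the exact format of the column is not specified here), and the sidecar CSV is just one possible convention.

```r
# Stand-in for a taxify result; version strings are placeholders.
result <- data.frame(
  input_name = c("Quercus robur", "Panthera leo"),
  backend = c("wfo", "gbif"),
  backbone_version = c("wfo-placeholder-v1", "gbif-placeholder-v1"),
  stringsAsFactors = FALSE
)

# Provenance: one row per backbone actually used in this run.
provenance <- unique(result[, c("backend", "backbone_version")])

# Save it beside the analysis output.
log_file <- tempfile(fileext = ".csv")
write.csv(provenance, log_file, row.names = FALSE)

# On a later run, read it back and confirm the same snapshots are in use.
previous <- read.csv(log_file, stringsAsFactors = FALSE)
same_backbones <- identical(previous$backbone_version,
                            provenance$backbone_version)
same_backbones
```

If the comparison fails, the backbone was updated between runs, and any change in the resolved names can be attributed to the new snapshot.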

What taxify does not do

taxify is a name matcher. It resolves scientific names to accepted names, returns classification metadata, and joins enrichment layers. Several things that taxize or other packages handle are outside its scope.

Common-to-scientific name lookup. taxize had comm2sci() to go from “European robin” to Erithacus rubecula. taxify matches scientific names, not vernacular input. For that direction, rgbif::name_lookup() runs a full-text search of the GBIF backbone that also covers vernacular names and returns candidate taxa.

Downstream taxa. taxize’s downstream() returned all children of a higher taxon (e.g., all species in a genus). taxify does not enumerate children. For tree-based queries, the rotl package provides access to the Open Tree of Life synthetic tree, and rgbif’s name_usage() can list children of a GBIF usage key.

Phylogenetic trees. For phylogenetic data, use rotl (Open Tree of Life) or phylomatic.

Occurrence data. For occurrence data, rgbif and spocc are the standard tools.

Sequence data. For sequence retrieval, the rentrez package handles GenBank/NCBI queries directly.

Real-time API lookups. By design, taxify queries local files. If a name was added to a backbone yesterday and taxify’s local copy is from last month, taxify will not find it until the backbone is updated. For workflows where freshness matters more than reproducibility, a direct API client (rgbif, worrms, ritis) may be the better fit.

When the other packages are the better choice

taxify is one tool among several. A few situations where the related packages remain the right answer:

  • Distribution and range queries. rWCVP exposes WCVP’s TDWG-region geography directly through wcvp_distribution(), wcvp_occ_mat(), and wcvp_checklist(). taxify covers name resolution and the most common native-range join through add_wcvp(), but full geographic queries belong in rWCVP.

  • Live API access to upstream databases. taxize, rgbif, worrms, ritis, and TNRS query their backends in real time. If you need a name added to a backbone yesterday, or you want the latest annotation for a single taxon, these packages return that immediately. taxify works against the snapshot on disk and only sees changes when the backbone is updated.

  • Common-to-scientific lookups. taxify matches scientific names only. To go from a vernacular name such as “European robin” to Erithacus rubecula, use rgbif::name_lookup(), whose full-text search also covers vernacular names.

  • Downstream taxa enumeration. If the goal is to list all species in a family or all subspecies of a species, taxify does not provide that query. Use rgbif::name_usage(key, data = "children") or rotl::tol_subtree().

  • Wider biodiversity-data cleaning. bdc wraps the entire data-cleaning workflow (coordinate cleaning, dataset merging, taxonomic harmonization, occurrence flagging). taxify can replace its taxonomic step alone if you prefer offline backbones over taxadb + GNR, but the rest of bdc’s pipeline is outside taxify’s scope.

  • Interactive, per-name resolution with manual disambiguation. taxize had interactive modes where the user could pick among multiple candidates. taxify picks the best match automatically (accepted name over synonym, species rank over higher ranks, lowest ID as tiebreaker). If manual control over ambiguous matches is needed, direct API calls may be preferable.

  • Column-oriented querying of a backbone. taxadb stores backbones in DuckDB / MonetDB and exposes them through dplyr verbs, which is a natural fit if your analysis is itself a SQL-style transformation of the backbone. taxify exposes the underlying .vtr files through vectra for this kind of work, but taxadb’s dplyr surface is more ergonomic for custom queries.

Discovering available enrichments

taxify provides 12 enrichment datasets that cover conservation status, invasive species, functional traits, morphological measurements, and vernacular names. They are joined to a taxify result by piping it through the add_*() functions.

# See all available enrichments and their metadata
list_enrichments()

Each enrichment downloads automatically on first use and is cached locally, following the same pattern as backbones. The full list: add_conservation_status(), add_invasive_status(), add_wcvp(), add_eive(), add_elton_traits(), add_avonet(), add_pantheria(), add_amphibio(), add_common_names(), add_woodiness(), add_diaz_traits(), and add_leda().

Summary

Migrating from taxize, WorldFlora, lcvplants, rWCVP, taxadb, or Taxonstand to taxify means replacing the package’s resolution call with taxify(backend = ...) and optional add_*() enrichment pipes. The output is a flat 16-column data.frame, not nested lists or long-format join tables, and matching runs offline against versioned backbone files so results do not change between sessions unless the user explicitly updates the backbone.

For things taxify does not handle (distribution queries, downstream taxa, occurrence data, phylogenetic trees, sequence retrieval, live API freshness), the specialized packages (rWCVP, rgbif, rotl, spocc, rentrez, worrms, ritis) remain the right tools. taxify covers the name-matching step that comes before most of those.