The taxonomic-resolution landscape in R
The R ecosystem has a rich set of taxonomic name-resolution tools. Each takes a different design choice along three axes: where the data lives (local files or remote APIs), how many backbones are bundled, and what the package returns. The table below summarizes the options most likely to overlap with a taxify workflow.
| Package | Source data | Coverage | Access | Closest taxify analogue |
|---|---|---|---|---|
| taxize | ~20 web services (NCBI, ITIS, GBIF, EOL, IUCN, WoRMS, Tropicos, …) | All kingdoms | Live API | taxify(backend = c(...)) with the relevant local backbone(s) |
| WorldFlora | World Flora Online classification (WFO.match) | Land plants (vascular + bryophytes) | Local file | taxify(backend = "wfo") |
| lcvplants | Leipzig Catalogue of Vascular Plants | Vascular plants | Bundled in package | taxify(backend = "lcvp") |
| rWCVP | World Checklist of Vascular Plants (Kew) | Vascular plants | Local snapshot | taxify(backend = "wcvp") |
| taxadb | GBIF, ITIS, COL, NCBI, OTT, WFO snapshots | All kingdoms | Local DuckDB / MonetDB | taxify(backend = c(...)) |
| Taxonstand | The Plant List (last updated in 2013; superseded by WCVP and WFO) | Vascular plants | Bundled in package | taxify(backend = c("wcvp", "wfo")) |
| U.Taxonstand | User-supplied or bundled checklists | Configurable | Local | taxify(backend = ...) plus add_data() |
| bdc | taxadb + GNR for the taxonomic step inside a larger biodiversity-cleaning workflow | All kingdoms | Local + API | taxify() for the matching step |
| TNRS | TNRS web service (BIEN / iDigBio) | Plants | Live API | taxify(backend = "wfo") or similar |
| rgbif, worrms, ritis | GBIF / WoRMS / ITIS web APIs | One backbone each | Live API | taxify(backend = "gbif" / "worms" / "itis") |
If your workflow already uses one of these and you are happy with it, there is no urgent reason to switch.
That said, there are situations where taxify offers a better fit:
- Multiple backbones. taxify matches against seven backbones offline and can chain them in a single call: taxify(names, backend = c("wfo", "col", "gbif")).
- Speed at scale. The matching engine is written in C with genus-blocked fuzzy joins. Ten thousand names resolve in seconds.
- Enrichments. Results pipe directly into twelve published trait and status datasets (IUCN, GRIIS, WCVP, EIVE, EltonTraits, etc.) with a single |> chain.
- Reproducibility. Backbones are versioned files on disk. The backbone_version column records exactly which snapshot was used.
This vignette maps the APIs of these packages to their taxify equivalents, walks through three side-by-side examples, and states plainly what taxify does not cover.
Function mapping: taxize to taxify
The table below maps the taxize name-resolution functions to their closest taxify equivalent.
| taxize function | taxify equivalent | Notes |
|---|---|---|
| gnr_resolve() | taxify() | Any backend; returns best match per name |
| classification() | taxify() | family, genus, rank columns in the output; add_col_info() for full hierarchy |
| synonyms() | taxify() | is_synonym + accepted_name columns in the output |
| tax_name() | taxify() | family, genus, rank columns |
| sci2comm() | add_common_names() | Pipe enrichment; GBIF vernacular names by language |
taxize also has functions that serve a different purpose (fetching database IDs, enumerating child taxa, retrieving occurrence or sequence data). These are not name-resolution functions, so taxify does not cover them. The “What taxify does not do” section below points to the right packages for those tasks.
The key structural difference: taxize returned results in varied
formats depending on the function (classification() gave a
nested list of data.frames, synonyms() another nested list,
get_tsn() a character vector with attributes). taxify
returns the same 16-column data.frame from every call. Synonym status,
classification, and match quality are columns, not separate API
calls.
Function mapping: WorldFlora to taxify
| WorldFlora function | taxify equivalent | Notes |
|---|---|---|
| WFO.match() | taxify(backend = "wfo") | Exact + fuzzy in one call |
| WFO.one() | taxify() | Best-match selection is automatic |
| WFO.match.fuzzyjoin() | taxify(fuzzy = TRUE) | Enabled by default; genus-blocked Damerau-Levenshtein |
| WFO.synonyms() | taxify() | is_synonym, accepted_name, accepted_id in output |
WorldFlora returns a wide data.frame with WFO-specific column names
(scientificName, taxonID,
taxonomicStatus, acceptedNameUsageID, plus
authorship and bibliographic fields). taxify normalizes these into a
backend-agnostic schema: matched_name,
taxon_id, accepted_name,
accepted_id, and so on. The WFO-specific columns are still
accessible via add_wfo_info() when needed, but the default
output is the same 16 columns whether the backend is WFO, COL, or
GBIF.
taxify also handles backbone management automatically: the first
taxify() call downloads the backbone, subsequent calls
reuse the local copy, and a once-per-session version check keeps it
current.
Function mapping: lcvplants to taxify
lcvplants
wraps the Leipzig Catalogue of Vascular Plants and ships the LCVP table
as bundled data. The package centres on LCVP() and
lcvp_search().
| lcvplants function | taxify equivalent | Notes |
|---|---|---|
| LCVP(splist) | taxify(splist, backend = "lcvp") | Returns the standardized 16-column data.frame |
| lcvp_search() | taxify() | Search by name; same output schema |
| lcvp_fuzzy_search() | taxify(fuzzy = TRUE) | Genus-blocked Damerau-Levenshtein; on by default |
| tab_lcvp (data object) | taxify_data_dir() / lcvp / latest / lcvp.vtr | The LCVP snapshot is shipped as a .vtr file rather than an in-package data object |
The LCVP and WCVP backbones can be combined in a single fallback chain to arbitrate between the Leipzig and Kew vascular-plant authorities, for example taxify(splist, backend = c("lcvp", "wcvp")).
Function mapping: rWCVP to taxify
rWCVP is the Kew
package for the World Checklist of Vascular Plants. Its name-resolution
side centres on wcvp_match_names() and
wcvp_check_gbif(); it also has a strong distribution-query
side that taxify does not replace.
| rWCVP function | taxify equivalent | Notes |
|---|---|---|
| wcvp_match_names() | taxify(backend = "wcvp") | Exact + fuzzy in one call |
| wcvp_check_gbif() | taxify(backend = c("wcvp", "gbif")) | Cascade WCVP first, GBIF as fallback |
| wcvp_distribution() | add_wcvp() | Native range by TDWG region (the add_wcvp() enrichment) |
| wcvp_synonyms() | taxify() | is_synonym and accepted_name columns in the output |
| get_wcvp() | automatic | The backbone downloads on first taxify(backend = "wcvp") call |
rWCVP’s distribution-query functions (wcvp_occ_mat(),
generate_checklist()) operate on TDWG geography and are
outside taxify’s scope. For native-range data joined to a name-resolved
result, add_wcvp() covers the most common case; for full
geographic queries, rWCVP remains the right tool.
Function mapping: taxadb to taxify
taxadb is the closest functional analogue to taxify. Both store backbone snapshots locally and avoid network calls at query time. The two packages differ in matching strategy and integration: taxadb returns a long-format table for exact-key joins, while taxify returns a flat one-row-per-input result with fuzzy matching, synonym resolution, and trait enrichment built in.
| taxadb function | taxify equivalent | Notes |
|---|---|---|
| td_create("itis") | automatic | First taxify(backend = "itis") call downloads the .vtr snapshot |
| filter_name(names, "itis") | taxify(names, backend = "itis") | Exact match against the local snapshot |
| filter_id(ids, "itis") | not exposed | Use vectra::tbl() directly on the .vtr if needed |
| synonyms(names, "itis") | taxify() | is_synonym, accepted_name, accepted_id in the output |
| clean_names() | automatic | taxify() runs the cleaning pipeline (authorship, qualifiers, hybrid markers, orthography) before matching |
| (no fuzzy match) | taxify(fuzzy = TRUE) | Genus-blocked Damerau-Levenshtein, on by default |
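To make the cleaning step in the table more concrete, the following base-R sketch shows the kind of normalization it describes. The function clean_binomial() and its regular expressions are hypothetical and written only for illustration: they handle simple binomials and are not taxify's actual pipeline.

```r
# Hypothetical illustration only: not taxify's cleaning code.
clean_binomial <- function(x) {
  x <- gsub("\\s+", " ", trimws(x))                      # collapse runs of whitespace
  x <- gsub("\\b(cf|aff)\\.?\\s+", "", x, perl = TRUE)   # drop qualifiers such as "cf." and "aff."
  x <- gsub("\\s(\u00d7|x)\\s", " ", x)                  # drop a free-standing hybrid marker
  sub("^(\\S+ \\S+).*$", "\\1", x)                       # keep genus + epithet, dropping authorship
}

raw <- c("Quercus  robur L.", "Quercus cf. robur", "Salix \u00d7 fragilis L.")
cleaned <- clean_binomial(raw)
print(cleaned)
```

A real pipeline also has to deal with infraspecific ranks, parenthetical authors, and orthographic variants, which this sketch deliberately ignores.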
The two largest practical differences:
- Matching scope. taxadb is built around exact lookups against pre-cleaned input. taxify cleans the input automatically and runs fuzzy matching on names that do not match exactly, which catches typos, orthographic variants, and authorship strings without a separate preprocessing step.
- Output shape. taxadb returns multiple rows per input when a name has multiple matches (you pick the row you want with dplyr::filter). taxify returns one row per input with a best-match selection rule (ACCEPTED over SYNONYM, species rank over higher ranks, lowest ID as tiebreaker), and reports the match type and fuzzy distance as columns.
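The best-match rule described above can be pictured as an ordering over candidate rows. The sketch below is a base-R illustration under assumptions: the candidates table, its status and rank columns, and the integer IDs are all hypothetical, and taxify's internal implementation may differ.

```r
# Hypothetical candidate table: several backbone rows per input name.
candidates <- data.frame(
  input_name = c("Quercus robur", "Quercus robur", "Quercus robur",
                 "Rosa canina", "Rosa canina"),
  taxon_id   = c(300L, 100L, 200L, 20L, 10L),   # integers here for simple ordering
  status     = c("SYNONYM", "ACCEPTED", "ACCEPTED", "ACCEPTED", "ACCEPTED"),
  rank       = c("species", "genus", "species", "species", "species"),
  stringsAsFactors = FALSE
)

# Sort so the preferred row comes first within each input name:
# ACCEPTED before SYNONYM, species before higher ranks, then lowest ID.
ord <- order(candidates$input_name,
             candidates$status != "ACCEPTED",
             candidates$rank != "species",
             candidates$taxon_id)
ranked <- candidates[ord, ]

# Keep the first (best) row per input name.
best <- ranked[!duplicated(ranked$input_name), ]
print(best)
```

With taxadb-style output, this is roughly the filtering a user would write by hand; with taxify the equivalent choice is made before the result is returned.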
For workflows that already use taxadb’s column-oriented querying for custom analyses, taxadb’s approach is a clean fit. For workflows that need a single resolved name per input plus enrichment joins, taxify’s flat output is closer to the goal.
Function mapping: Taxonstand to taxify
Taxonstand was built around The Plant List, which was last updated in 2013 and has since been superseded by WCVP and WFO. The package still works, but the underlying taxonomy has been frozen since then.
| Taxonstand function | taxify equivalent | Notes |
|---|---|---|
| TPL(splist) | taxify(splist, backend = c("wcvp", "wfo")) | Replace TPL with its successors |
| TPLck() | taxify() | Single-name check; same output schema |
The simplest migration is to replace backend = "tpl"
with backend = c("wcvp", "wfo") (or
backend = c("lcvp", "wcvp", "wfo") for triple-arbitration
across the three large vascular-plant authorities).
Example 1: Basic name resolution
With taxize, name resolution typically meant several separate calls:
gnr_resolve() for matching, get_gbifid() for
IDs, classification() for hierarchy,
synonyms() for synonym status.
```r
# --- taxize ---
library(taxize)
names <- c("Quercus robur", "Pinus sylvestris", "Betula pendula",
           "Panthera leo", "Salmo trutta")
resolved <- gnr_resolve(names, best_match_only = TRUE)
gbif_ids <- get_gbifid(names)
class_list <- classification(gbif_ids, db = "gbif")
syn_list <- synonyms(gbif_ids, db = "gbif")
```
With taxify, all of that is one call:
```r
# --- taxify ---
library(taxify)
names <- c("Quercus robur", "Pinus sylvestris", "Betula pendula",
           "Panthera leo", "Salmo trutta")
result <- taxify(names, backend = "gbif")
result$accepted_name
result$family
result$genus
result$is_synonym
result$taxon_id  # GBIF usage key
```
The output is a data.frame with 16 columns and one row per input name.
Example 2: WFO matching with fuzzy + synonyms
With WorldFlora, the typical workflow is: load the backbone, run exact matching, apply fuzzy matching separately, then pick the best match.
```r
# --- WorldFlora ---
library(WorldFlora)
wfo_data <- read.delim("classification.txt")
names <- c("Quercus robur", "Quercus pedonculata",
           "Pinus silvestris", "Rosa canina")
exact <- WFO.match(names, WFO.data = wfo_data)
fuzzy <- WFO.match.fuzzyjoin(names, WFO.data = wfo_data)
best <- WFO.one(fuzzy)
```
With taxify, exact matching, fuzzy matching, and synonym resolution happen in a single call:
```r
# --- taxify ---
library(taxify)
names <- c("Quercus robur", "Quercus pedonculata",
           "Pinus silvestris", "Rosa canina")
result <- taxify(names, backend = "wfo")
# Misspellings are caught by fuzzy matching:
result[, c("input_name", "matched_name", "match_type", "fuzzy_dist")]
#            input_name        matched_name match_type fuzzy_dist
# 1       Quercus robur       Quercus robur      exact         NA
# 2 Quercus pedonculata Quercus pedunculata      fuzzy      0.053
# 3    Pinus silvestris    Pinus sylvestris      fuzzy      0.063
# 4         Rosa canina         Rosa canina      exact         NA
# Synonyms resolved automatically:
result[, c("input_name", "is_synonym", "accepted_name")]
```
Quercus pedonculata is both a misspelling and a synonym.
taxify handles both: the fuzzy matcher corrects the spelling to
Quercus pedunculata, and the synonym resolver maps it to
Quercus robur.
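The fuzzy_dist values shown above are consistent with an edit distance divided by the length of the longer string: one edit over 19 characters is about 0.053, and one edit over 16 characters is 0.0625. The sketch below reproduces that arithmetic in base R. It is an assumption about the normalization, not a statement of taxify's internals, and utils::adist() computes plain Levenshtein distance rather than the Damerau-Levenshtein variant taxify is described as using (the two agree for single substitutions like these).

```r
# Illustrative helper (hypothetical name): edit distance scaled by the longer string.
normalized_dist <- function(input, match) {
  edits <- mapply(function(a, b) as.numeric(utils::adist(a, b)),
                  input, match, USE.NAMES = FALSE)
  edits / pmax(nchar(input), nchar(match))
}

input   <- c("Quercus robur", "Quercus pedonculata", "Pinus silvestris")
matched <- c("Quercus robur", "Quercus pedunculata", "Pinus sylvestris")

d <- normalized_dist(input, matched)
print(data.frame(input, matched, dist = d))
```

A distance of 0 here corresponds to an exact match, which taxify reports as NA in fuzzy_dist rather than as zero.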
Example 3: Multi-backend fallback with enrichments
taxify can chain multiple backbones in a single call. Unmatched names cascade to the next backbone automatically.
```r
library(taxify)
# Mixed kingdom input: plants, animals, fungi
names <- c(
  "Quercus robur",        # plant (WFO primary)
  "Panthera leo",         # animal (not in WFO, picked up by GBIF)
  "Amanita muscaria",     # fungus (not in WFO, picked up by GBIF)
  "Salmo trutta",         # fish (not in WFO, picked up by GBIF)
  "Arabidopsis thaliana"  # plant (in both WFO and GBIF)
)
# WFO first (best for plants), GBIF as fallback (all kingdoms)
result <- taxify(names, backend = c("wfo", "gbif"))
# The backend column shows which database matched each name:
result[, c("input_name", "backend", "family")]
#             input_name backend       family
# 1        Quercus robur     wfo     Fagaceae
# 2         Panthera leo    gbif      Felidae
# 3     Amanita muscaria    gbif  Amanitaceae
# 4         Salmo trutta    gbif   Salmonidae
# 5 Arabidopsis thaliana     wfo Brassicaceae
# Enrich with traits:
result |>
  add_conservation_status() |>
  add_woodiness()
# Or join custom data:
my_traits <- data.frame(
  species = c("Quercus robur", "Panthera leo"),
  max_height_m = c(35, NA),
  body_mass_kg = c(NA, 190)
)
result |> add_data(my_traits, species_col = "species")
```
Key differences at a glance
Offline matching. taxify downloads backbone files once and matches locally. After the initial download (typically 50–300 MB depending on the backbone), no internet connection is needed.
Multi-backend. taxify supports seven backbones through a single function, with optional fallback chains that cascade unmatched names automatically.
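The cascading behaviour can be pictured with a small base-R model. Everything in this sketch is a stand-in: the backbones list holds two tiny hypothetical name-to-family lookups, and cascade_match() is an illustrative helper, not part of taxify. It only shows the control flow in which a name is offered to the next backbone when earlier ones did not match it.

```r
# Hypothetical stand-ins for two backbones: name -> family lookups.
backbones <- list(
  wfo  = c("Quercus robur" = "Fagaceae", "Arabidopsis thaliana" = "Brassicaceae"),
  gbif = c("Quercus robur" = "Fagaceae", "Panthera leo" = "Felidae",
           "Arabidopsis thaliana" = "Brassicaceae")
)

cascade_match <- function(input, backbones) {
  out <- data.frame(input_name = input, backend = NA_character_,
                    family = NA_character_, stringsAsFactors = FALSE)
  for (b in names(backbones)) {
    lookup <- backbones[[b]]
    # Only names still unmatched are tried against this backbone.
    hit <- is.na(out$backend) & out$input_name %in% names(lookup)
    out$backend[hit] <- b
    out$family[hit]  <- unname(lookup[out$input_name[hit]])
  }
  out
}

res <- cascade_match(c("Quercus robur", "Panthera leo", "Notaplant fictus"), backbones)
print(res)
```

Because earlier backbones win, the order of the backend vector encodes which authority is preferred when a name appears in more than one.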
Output format. taxify always returns a data.frame with 16 standardized columns, regardless of the backend:
| Column | Type | Content |
|---|---|---|
| input_name | character | Original name as submitted |
| matched_name | character | Closest match in the backbone |
| accepted_name | character | Accepted name after synonym resolution |
| taxon_id | character | Backend-specific ID of the matched name |
| accepted_id | character | ID of the accepted name |
| rank | character | Taxonomic rank (species, genus, family, etc.) |
| family | character | Family name |
| genus | character | Genus name |
| epithet | character | Specific epithet |
| authorship | character | Taxonomic authority |
| is_synonym | logical | Was the matched name a synonym? |
| is_hybrid | logical | Hybrid marker detected in the input? |
| match_type | character | "exact", "exact_ci", "fuzzy", or "none" |
| fuzzy_dist | numeric | Normalized edit distance (NA if exact) |
| backend | character | Which backend matched this name |
| backbone_version | character | Backend name, version, and download date |
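Because match quality lives in ordinary columns, a result can be triaged with plain subsetting. The sketch below uses a hand-built data.frame that mimics four of the columns in the table; the values, the review_cutoff threshold, and the assumption that unmatched rows carry NA in fuzzy_dist are all illustrative rather than taken from a real taxify run.

```r
# Mock result with a subset of the documented columns (values are made up).
result <- data.frame(
  input_name   = c("Quercus robur", "Quercus pedonculata",
                   "Pinus silvestris", "Notaplant fictus"),
  matched_name = c("Quercus robur", "Quercus pedunculata",
                   "Pinus sylvestris", NA),
  match_type   = c("exact", "fuzzy", "fuzzy", "none"),
  fuzzy_dist   = c(NA, 0.053, 0.063, NA),
  stringsAsFactors = FALSE
)

review_cutoff <- 0.06  # hypothetical threshold for manual review

# Names that found no match at all.
unmatched <- result[result$match_type == "none", ]

# Fuzzy matches whose distance exceeds the review threshold.
needs_review <- result[result$match_type == "fuzzy" &
                         !is.na(result$fuzzy_dist) &
                         result$fuzzy_dist > review_cutoff, ]

print(table(result$match_type))
print(needs_review$input_name)
```

The same pattern works for is_synonym or backend, for example to list every input whose accepted name differs from the name that was submitted.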
Speed. taxify uses vectra’s C-level join engine with hash indexes and genus-blocked fuzzy joins, processing thousands of names per second.
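The idea behind genus blocking is that a misspelled name is only compared against backbone names in the same genus, which shrinks the number of distance computations. The base-R sketch below illustrates the principle on a toy backbone. It is not vectra's engine: the helper names, the max_dist threshold, and the use of utils::adist() (plain Levenshtein) are assumptions for illustration, and this simple version cannot recover a name whose genus is itself misspelled.

```r
# Toy backbone and a genus extractor (first whitespace-delimited token).
backbone <- c("Quercus robur", "Quercus petraea", "Quercus pedunculata",
              "Pinus sylvestris", "Pinus nigra", "Rosa canina")
genus_of <- function(x) sub(" .*$", "", x)

# Pre-split the backbone into one block per genus.
blocks <- split(backbone, genus_of(backbone))

# Compare a name only against candidates in its own genus block.
match_in_block <- function(name, max_dist = 0.2) {
  pool <- blocks[[genus_of(name)]]
  if (is.null(pool)) return(NA_character_)          # genus not in backbone
  d <- as.numeric(utils::adist(name, pool)) / pmax(nchar(name), nchar(pool))
  if (min(d) > max_dist) return(NA_character_)      # nothing close enough
  pool[which.min(d)]
}

inputs  <- c("Quercus pedonculata", "Pinus silvestris", "Felis catus")
matched <- vapply(inputs, match_in_block, character(1), USE.NAMES = FALSE)
print(data.frame(inputs, matched))
```

On a backbone with hundreds of thousands of names, restricting each comparison to one genus is what keeps fuzzy matching tractable.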
Reproducibility. taxify stores backbone files locally and records the
version string in the backbone_version column of every result. The same
backbone file produces the same output indefinitely. A specific release
can also be pinned explicitly:
taxify_download_vtr("wfo", version = "2024.06") downloads that release.
What taxify does not do
taxify is a name matcher. It resolves scientific names to accepted names, returns classification metadata, and joins enrichment layers. Several things that taxize or other packages handle are outside its scope.
Common-to-scientific name lookup. taxize had
comm2sci() to go from “European robin” to Erithacus
rubecula. taxify matches scientific names, not vernacular input.
For that direction, the GBIF API (rgbif::name_suggest())
accepts common names and returns candidates.
Downstream taxa. taxize’s downstream()
returned all children of a higher taxon (e.g., all species in a genus).
taxify does not enumerate children. For tree-based queries, the rotl
package provides access to the Open Tree of Life synthetic tree, and
rgbif’s name_usage() can list children of a GBIF usage
key.
Phylogenetic trees. For phylogenetic data, use rotl (Open Tree of Life) or phylomatic.
Occurrence data. For occurrence data, rgbif and spocc are the standard tools.
Sequence data. For sequence retrieval, the rentrez package handles GenBank/NCBI queries directly.
Real-time API lookups. By design, taxify queries local files. If a name was added to a backbone yesterday and taxify’s local copy is from last month, taxify will not find it until the backbone is updated. For workflows where freshness matters more than reproducibility, a direct API client (rgbif, worrms, ritis) may be the better fit.
When the other packages are the better choice
taxify is one tool among several. A few situations where the related packages remain the right answer:
- Distribution and range queries. rWCVP exposes WCVP’s TDWG-region geography directly through wcvp_distribution(), wcvp_occ_mat(), and generate_checklist(). taxify covers name resolution and the most common native-range join through add_wcvp(), but full geographic queries belong in rWCVP.
- Live API access to upstream databases. taxize, rgbif, worrms, ritis, and TNRS query their backends in real time. If you need a name added to a backbone yesterday, or you want the latest annotation for a single taxon, these packages return that immediately. taxify works against the snapshot on disk and only sees changes when the backbone is updated.
- Common-to-scientific lookups. taxize had comm2sci() to go from “European robin” to Erithacus rubecula. taxify matches scientific names, not vernacular input. For that direction, rgbif::name_suggest() accepts common names and returns candidates.
- Downstream taxa enumeration. If the goal is to list all species in a family or all subspecies of a species, taxify does not provide that query. Use rgbif::name_usage(key, data = "children") or rotl::tol_subtree().
- Wider biodiversity-data cleaning. bdc wraps the entire data-cleaning workflow (coordinate cleaning, dataset merging, taxonomic harmonization, occurrence flagging). taxify can replace its taxonomic step alone if you prefer offline backbones over taxadb + GNR, but the rest of bdc’s pipeline is outside taxify’s scope.
- Interactive, per-name resolution with manual disambiguation. taxize had interactive modes where the user could pick among multiple candidates. taxify picks the best match automatically (accepted name over synonym, species rank over higher ranks, lowest ID as tiebreaker). If manual control over ambiguous matches is needed, direct API calls may be preferable.
- Column-oriented querying of a backbone. taxadb stores backbones in DuckDB / MonetDB and exposes them through dplyr verbs, which is a natural fit if your analysis is itself a SQL-style transformation of the backbone. taxify exposes the underlying .vtr files through vectra for this kind of work, but taxadb’s dplyr surface is more ergonomic for custom queries.
Discovering available enrichments
taxify bundles 12 enrichment datasets that cover conservation status,
invasive species, functional traits, morphological measurements, and
vernacular names. These are joined to the taxify result by piping
through add_*() functions.
```r
# See all available enrichments and their metadata
list_enrichments()
```
Each enrichment downloads automatically on first use and is cached
locally, following the same pattern as backbones. The full list:
add_conservation_status(),
add_invasive_status(), add_wcvp(),
add_eive(), add_elton_traits(),
add_avonet(), add_pantheria(),
add_amphibio(), add_common_names(),
add_woodiness(), add_diaz_traits(), and
add_leda().
Summary
Migrating from taxize, WorldFlora, lcvplants, rWCVP, taxadb, or
Taxonstand to taxify means replacing the package’s resolution call with
taxify(backend = ...) and optional add_*()
enrichment pipes. The output is a flat 16-column data.frame, not nested
lists or long-format join tables, and matching runs offline against
versioned backbone files so results do not change between sessions
unless the user explicitly updates the backbone.
For things taxify does not handle (distribution queries, downstream taxa, occurrence data, phylogenetic trees, sequence retrieval, live API freshness), the specialized packages (rWCVP, rgbif, rotl, spocc, rentrez, worrms, ritis) remain the right tools. taxify covers the name-matching step that comes before most of those.