taxify matches taxonomic names against locally stored Darwin Core backbone databases. Ten backends are available, each compiled from a different authoritative source. The backend we choose determines which names can be matched, what taxonomic opinion governs synonym resolution, and which extra metadata columns are available downstream. This vignette walks through the backends, explains how to combine them in fallback chains, and offers practical guidance on which combination to pick for a given project.
Backend overview
The table below summarizes the ten backends. “Approx. names” is the total number of name strings in the compiled backbone (accepted names plus synonyms); the actual species count is lower because each accepted species may have several synonym entries pointing to it.
| Backend | Full name | Scope | Approx. names | Source format |
|---|---|---|---|---|
| wfo | World Flora Online | Vascular plants, bryophytes | ~400k | Zenodo ZIP (classification.txt) |
| col | Catalogue of Life | All kingdoms | ~4.5M | ChecklistBank DwC-A (Taxon.tsv) |
| gbif | GBIF Backbone Taxonomy | All kingdoms | ~10M | GBIF simple.txt.gz (30 positional cols) |
| itis | Integrated Taxonomic Information System | All kingdoms, US focus | ~900k | SQLite dump from itis.gov |
| ncbi | NCBI Taxonomy | All life incl. viruses | ~2.5M | Pipe-delimited .dmp files (taxdump) |
| ott | Open Tree of Life | All life (synthetic) | ~4M | Pipe-delimited taxonomy.tsv + synonyms.tsv |
| worms | World Register of Marine Species | Marine and brackish | ~600k | ChecklistBank DwC-A |
| euromed | Euro+Med PlantBase | European/Mediterranean plants | ~132k | Semicolon-delimited CSV |
| fungorum | Species Fungorum Plus | Fungi | ~500k | ChecklistBank DwC-A |
| algaebase | AlgaeBase | Algae and cyanobacteria | ~170k | ChecklistBank DwC-A (CC BY-NC) |
A few things stand out. WFO is the standard reference for plant taxonomy and the default backend in taxify. It is maintained by the World Flora Online consortium and receives regular updates. The backbone includes all taxonomic ranks from kingdom down to form, with full synonym resolution and authorship.
COL and GBIF both cover all kingdoms, but they differ in curation strategy. COL is an expert-curated checklist assembled from over 160 sector databases, each maintained by a taxonomic authority for its group. GBIF’s backbone is assembled algorithmically from COL, ITIS, and dozens of other sources, which gives it broader raw coverage (~10M names vs COL’s ~4.5M) at the cost of occasional inconsistencies where source databases disagree. In practice, COL tends to give cleaner synonym resolution; GBIF tends to match more names.
ITIS was originally developed for North American fauna and remains
particularly strong on freshwater invertebrates, insects, and US-listed
species. Its coverage of non-American taxa is uneven. The ITIS backbone
is distributed as a SQLite dump; building it from source requires the
RSQLite package. Pre-built .vtr backbones avoid this
dependency.
NCBI Taxonomy is the gold standard for sequence-linked work. Every
GenBank and RefSeq record is linked to an NCBI tax_id, making this
backend essential for molecular ecology and metagenomics. It is
also the only backend that covers bacteria, archaea, and viruses in
meaningful depth. However, NCBI Taxonomy does not store authorship data,
so the authorship column is always NA for
NCBI-matched rows.
OTT (Open Tree of Life) is a synthetic taxonomy that merges NCBI,
GBIF, WoRMS, IRMNG, and several other sources into a single tree. Its
taxonomic scope is among the broadest of any single source, and it
provides cross-references to all of its constituent databases via the
sourceinfo field. The trade-off is that synthetic taxonomies can carry
conflicts and inconsistencies at the edges, where source databases
disagree about the placement of a taxon.
WoRMS is the authoritative source for marine species. It is curated by a network of over 300 taxonomic editors and covers marine, brackish, and some freshwater species. Beyond basic taxonomy, the WoRMS backbone stores habitat flags (marine, brackish, freshwater, terrestrial) and extinction status, some of which are accessible via the COL SpeciesProfile.
Euro+Med PlantBase is the taxonomic reference for the flora of Europe, the Mediterranean, and the Caucasus. It covers all native and introduced vascular plants in its geographic scope (~49k accepted names, ~83k synonyms). The backbone is built from the 2020 bulk download, updated via a PESI API delta refresh (April 2026) that resolved 1,014 reclassifications and synonym changes cross-referenced against WFO and POWO. Euro+Med is particularly useful for European vegetation surveys and datasets aligned with the European Vegetation Archive (EVA). Its data is licensed CC BY-SA 3.0.
Species Fungorum Plus is the specialist reference for fungal taxonomy, with ~500k names curated by the Royal Botanic Gardens, Kew. It covers Ascomycota, Basidiomycota, and other fungal phyla, including anamorphs and teleomorphs. For purely mycological datasets, it gives better synonym resolution than generalist databases.
AlgaeBase covers micro- and macroalgae, cyanobacteria, and some protists. It is the only backend licensed CC BY-NC (non-commercial use only). All other backends are open-access. taxify prints a license notice during AlgaeBase download to make this visible.
Downloading backbones
taxify auto-downloads backbones on first use. When we call
taxify(names, backend = "wfo") and no local WFO backbone
exists, taxify fetches the pre-built .vtr file from Zenodo,
writes it to taxify_data_dir(), and caches the path for the
remainder of the R session. Subsequent calls, whether in the same
session or future sessions, reuse the local copy instead of
downloading it again.
To download a backbone ahead of time (useful on a shared server or in
a Docker image), use taxify_download_vtr():
library(taxify)
# Download one backbone
taxify_download_vtr("wfo")
# Download several at once
taxify_download_vtr(c("wfo", "col", "worms"))
Pre-built .vtr files are hosted on Zenodo and typically
range from 50 MB (AlgaeBase) to 400 MB (GBIF), depending on the backend.
The files are compiled from the raw Darwin Core sources with precomputed
matching keys, embedded synonym resolution, and genus-level indexes, so
they are ready for querying the moment the download completes.
taxify checks for backbone updates once per R session. The first
taxify() call in a session fetches the manifest from
GitHub, compares each requested backend’s local version against the
latest available release, and downloads a new version only if one
exists. If the network is unavailable, taxify falls back to the bundled
manifest and uses whatever local copy is on disk. The version check is
logged to the console so there are no silent updates.
For backends with large source files, the build-from-source path also
exists. taxify_download("gbif") downloads the raw 1.5 GB
simple.txt.gz from GBIF, parses all 30 positional columns,
denormalizes the family hierarchy via self-joins, and compiles the
result into .vtr format. This is slower than downloading
the pre-built file but produces the same output. The build-from-source
path is mainly useful for CI pipelines or users who want to customize
the compilation step.
We can also pin a specific backbone version:
taxify_download_vtr("wfo", version = "2024.01")
Pinned versions are stored in their own directory
(taxify_data_dir()/wfo/2024.01/) and are never overwritten
by future updates. The “latest” slot
(taxify_data_dir()/wfo/latest/) is always overwritten when
a newer version becomes available. Pinning is useful for
reproducibility: a project can lock a specific backbone version and
produce identical results regardless of when the analysis is re-run.
Single-backend matching
The simplest use case: match plant names against WFO.
library(taxify)
plants <- c(
  "Quercus robur",
  "Quercus petraea",
  "Pinus sylvestris",
  "Acer pseudoplatanus",
  "Betula pendula",
  "Fagus sylvatica",
  "Picea abies"
)
result <- taxify(plants, backend = "wfo")
result[, c("input_name", "accepted_name", "family", "match_type", "backend")]
Every row in the output has 16 columns regardless of which backend
produced it: input_name, matched_name,
accepted_name, taxon_id,
accepted_id, rank, family,
genus, epithet, authorship,
is_synonym, is_hybrid,
match_type, fuzzy_dist, backend,
and backbone_version. The backend column
records "wfo" for matched rows and NA for
unmatched ones. The backbone_version column records the
backend name, version, and download date (e.g.,
"wfo:2024-12 (2026-04-01)") so we can cite the exact data
snapshot used.
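When reporting versions, it can help to split the provenance string into its parts. The following is a minimal base R sketch, assuming every non-missing value follows the `backend:version (date)` layout of the example above. `parse_backbone_version()` is a hypothetical helper written for illustration and is not part of taxify.

```r
# Hypothetical helper (not part of taxify): split provenance strings of the
# assumed form "backend:version (YYYY-MM-DD)" into separate columns.
parse_backbone_version <- function(x) {
  pattern <- "^([^:]+):([^ ]+) \\(([0-9]{4}-[0-9]{2}-[0-9]{2})\\)$"
  parts <- regmatches(x, regexec(pattern, x))
  # Each element holds the full match plus three capture groups; it is
  # shorter when the string does not follow the assumed layout.
  pick <- function(i) {
    vapply(parts,
           function(p) if (length(p) == 4L) p[i] else NA_character_,
           character(1))
  }
  data.frame(
    backend    = pick(2L),
    version    = pick(3L),
    downloaded = as.Date(pick(4L)),
    stringsAsFactors = FALSE
  )
}

provenance <- parse_backbone_version(
  c("wfo:2024-12 (2026-04-01)", "col:2025 (2026-04-01)")
)
provenance
```

Strings that do not fit the assumed layout come back as NA in all three columns, so a layout change would be visible rather than silently misparsed.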
When a name matches a synonym, taxify automatically resolves it to
the accepted name. The matched_name column shows the name
string that actually matched in the backbone (which may be a synonym),
while accepted_name shows the current accepted name after
resolution. The is_synonym column is TRUE for
resolved synonyms, FALSE for direct matches.
Fuzzy matching is on by default with a normalized Damerau-Levenshtein threshold of 0.2, roughly one edit per five characters. This catches common typos like transposed letters or missing diacritics. We can tighten the threshold for stricter matching:
result <- taxify(plants, backend = "wfo", fuzzy_threshold = 0.1)
Or disable fuzzy matching entirely to only accept exact and case-insensitive matches:
result <- taxify(plants, backend = "wfo", fuzzy = FALSE)
Two alternative distance metrics are available via the
fuzzy_method argument: "levenshtein" (standard
Levenshtein, no transposition handling) and "jw"
(Jaro-Winkler, which gives extra weight to a shared prefix and so suits
names that agree at the start and differ toward the end).
The default "dl" (Damerau-Levenshtein) is a good
all-rounder.
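To make the threshold concrete, here is a small base R sketch of a normalized edit distance. It rests on two assumptions: the distance is normalized by the length of the longer string (taxify's exact normalization is not stated here), and `utils::adist()` stands in for the metric even though it computes plain Levenshtein distance without transpositions. `normalized_lev()` is a hypothetical helper.

```r
# Normalized edit distance: raw distance divided by the length of the longer
# string. The normalization is an assumption made for this sketch.
# utils::adist() computes plain Levenshtein distance (no transpositions).
normalized_lev <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]
  d / max(nchar(a), nchar(b))
}

# One substituted letter in a 13-character name: 1 / 13, well under 0.2
normalized_lev("Quercus robur", "Quercus robor")

# Two swapped letters cost two edits under plain Levenshtein (2 / 5 = 0.4);
# Damerau-Levenshtein would count the swap as a single edit (1 / 5).
normalized_lev("Fagus", "Fagsu")
```

The second call illustrates why transposition handling matters for short names: the same typo doubles in cost when swaps are not treated as one edit.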
Backend-specific output differences
All ten backends produce the same 16-column output schema. This is a deliberate design choice: downstream code does not need to know which backend produced a match. That said, the content of those columns varies in ways worth knowing about.
Authorship. WFO’s scientificName is
already canonical (no authorship appended), so the
authorship column comes from a separate
scientificNameAuthorship field. COL and WoRMS store the
full scientificName with authorship included; taxify strips
it at build time to produce the canonical name used for matching, and
the stripped authorship is stored separately. NCBI and OTT have no
authorship data at all, so the authorship column is always
NA for those backends. GBIF and ITIS provide authorship.
Euro+Med provides authorship from its AuthorString field.
Species Fungorum and AlgaeBase provide authorship from their DwC-A
archives.
Taxon IDs. Each backend uses a different identifier
system. WFO IDs look like "wfo-0000000123". COL IDs are
opaque alphanumeric strings like "4LHBG". GBIF uses integer
keys ("2878688"). ITIS uses TSN integers
("183671"). NCBI uses NCBI Taxonomy IDs
("9606"). OTT uses OTT IDs ("770315"). WoRMS
uses AphiaIDs extracted from LSIDs; during build time, taxify strips the
urn:lsid:marinespecies.org:taxname: prefix and stores just
the numeric ID. Euro+Med uses TaxonUsageID integers from
the PlantBase export. Species Fungorum and AlgaeBase use ChecklistBank
dataset-specific IDs. All IDs are stored as character strings in the
taxon_id and accepted_id columns for
consistency, but their format is backend-specific and meaningful only
within that backend’s ecosystem. A taxon_id from WFO cannot
be looked up in the COL database, and vice versa.
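The WoRMS prefix stripping described above amounts to a literal string replacement. A minimal sketch in base R, using made-up numeric IDs purely for illustration:

```r
# Illustrative LSIDs; the numeric parts are invented for this sketch.
lsid <- c(
  "urn:lsid:marinespecies.org:taxname:123456",
  "urn:lsid:marinespecies.org:taxname:7890"
)

# fixed = TRUE treats the prefix literally, so the dots are not regex wildcards.
aphia_id <- sub("urn:lsid:marinespecies.org:taxname:", "", lsid, fixed = TRUE)
aphia_id

# The result stays character, in line with the taxon_id column type.
is.character(aphia_id)
```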
Classification depth. The base output always
includes family and genus. WFO provides these
directly from its classification file. COL stores the full Linnaean
hierarchy (kingdom through order) in the Taxon.tsv, though these extra
columns require add_col_info() to access. GBIF provides
family through a denormalized family_key self-join at build
time. ITIS, NCBI, and OTT resolve family and genus via parent-hierarchy
walks during backbone compilation; the walk traverses up to 25 levels of
the taxonomic tree. WoRMS has denormalized classification columns
directly in its DwC-A. Euro+Med resolves family and genus via a
hierarchy walk on IsChildTaxonOfID. The genus register
(covered later in this vignette) fills in higher classification fields
(kingdom_group, taxon_group,
life_form) for all backends.
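The parent-hierarchy walk can be pictured as a bounded loop over parent pointers. The sketch below is not taxify's build code; it assumes a toy table with `taxon_id`, `parent_id`, `rank`, and `name` columns, and `find_ancestor_at_rank()` is a hypothetical helper.

```r
# Walk up parent pointers until a taxon of the target rank is found,
# giving up after max_levels steps (25 mirrors the limit mentioned above).
find_ancestor_at_rank <- function(id, taxa, target_rank, max_levels = 25L) {
  current <- id
  for (level in seq_len(max_levels)) {
    row <- match(current, taxa$taxon_id)
    if (is.na(row)) return(NA_character_)            # dangling pointer
    if (taxa$rank[row] == target_rank) return(taxa$name[row])
    current <- taxa$parent_id[row]
    if (is.na(current)) return(NA_character_)        # reached the root
  }
  NA_character_                                      # depth limit hit
}

# Toy hierarchy for illustration
taxa <- data.frame(
  taxon_id  = c("1", "2", "3", "4"),
  parent_id = c(NA, "1", "2", "3"),
  rank      = c("order", "family", "genus", "species"),
  name      = c("Fagales", "Fagaceae", "Quercus", "Quercus robur"),
  stringsAsFactors = FALSE
)

find_ancestor_at_rank("4", taxa, "family")
find_ancestor_at_rank("4", taxa, "genus")
```

The depth limit doubles as a guard against cycles in malformed source data: a loop in the parent pointers ends with NA instead of running forever.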
Synonym handling. The backends represent synonymy in
very different ways internally. WFO and COL use the Darwin Core field
acceptedNameUsageID to point from a synonym row to its
accepted name. GBIF encodes synonyms via parent_key
pointing to the accepted taxon. NCBI represents synonyms as alternative
name strings for the same tax_id; during build, taxify
emits these as separate rows with synthetic IDs of the form
"123456_syn_1", "123456_syn_2", etc. OTT uses
a separate synonyms.tsv file with explicit
synonym-to-accepted mappings. All of these representations are
normalized at build time into the same is_synonym +
accepted_name + accepted_id schema, so the
output looks the same regardless of source.
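As an illustration of that normalization, the sketch below turns a toy NCBI-style name table into the common schema, including synthetic synonym IDs of the form described above. The tax_ids and names are invented, and the code is an assumption about how such a step could look, not taxify's actual build routine.

```r
# Toy NCBI-style name table: several name strings share one tax_id.
ncbi_names <- data.frame(
  tax_id     = c("111", "111", "111", "222", "222"),
  name       = c("Aus bus", "Aus buss", "Xus bus", "Cus dus", "Cus duss"),
  name_class = c("scientific name", "synonym", "synonym",
                 "scientific name", "synonym"),
  stringsAsFactors = FALSE
)

accepted <- ncbi_names[ncbi_names$name_class == "scientific name", ]
synonyms <- ncbi_names[ncbi_names$name_class == "synonym", ]

# Number the synonyms within each tax_id: 1, 2, ... per group
syn_index <- ave(seq_along(synonyms$tax_id), synonyms$tax_id, FUN = seq_along)

normalized <- rbind(
  data.frame(
    taxon_id      = accepted$tax_id,
    name          = accepted$name,
    is_synonym    = FALSE,
    accepted_id   = accepted$tax_id,
    accepted_name = accepted$name,
    stringsAsFactors = FALSE
  ),
  data.frame(
    taxon_id      = paste0(synonyms$tax_id, "_syn_", syn_index),
    name          = synonyms$name,
    is_synonym    = TRUE,
    accepted_id   = synonyms$tax_id,
    accepted_name = accepted$name[match(synonyms$tax_id, accepted$tax_id)],
    stringsAsFactors = FALSE
  )
)
normalized
```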
Synonym chains. Some backbones contain chained
synonyms, where synonym A points to synonym B, which points to accepted
name C. taxify resolves these chains at build time (up to 10 hops) so
the accepted_name always points to the terminal accepted
name. This happens transparently during backbone compilation and does
not affect query-time performance.
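Chain resolution of this kind can be sketched as repeated pointer-following with a hop limit. The example assumes a toy table in which `accepted_id` is NA for accepted names and otherwise points at another row; `resolve_chains()` is a hypothetical helper, not a taxify function.

```r
# Follow accepted_id pointers until they land on an accepted name,
# stopping after max_hops rounds (10 mirrors the limit mentioned above).
resolve_chains <- function(taxon_id, accepted_id, max_hops = 10L) {
  # Accepted names resolve to themselves; synonyms start at their target.
  target <- ifelse(is.na(accepted_id), taxon_id, accepted_id)
  for (hop in seq_len(max_hops)) {
    nxt <- accepted_id[match(target, taxon_id)]
    follow <- !is.na(nxt)        # the current target is itself a synonym
    if (!any(follow)) break
    target[follow] <- nxt[follow]
  }
  target
}

# Synonym A -> synonym B -> accepted name C
backbone <- data.frame(
  taxon_id    = c("A", "B", "C"),
  accepted_id = c("B", "C", NA),
  stringsAsFactors = FALSE
)
backbone$terminal_id <- resolve_chains(backbone$taxon_id, backbone$accepted_id)
backbone
```

Because each round updates all rows at once, the number of rounds equals the longest chain in the table rather than the number of rows.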
Multi-backend fallback chains
When a species list spans multiple kingdoms, a single backend may not cover everything. A wetland monitoring dataset might contain vascular plants, invertebrates, amphibians, algae, and fungi. No single backend covers all of these equally well. taxify’s fallback chain handles this: we pass a vector of backend names, and names are matched against each backend in order. A name matched by an earlier backend is never re-matched by a later one; it is removed from the pool.
mixed <- c(
  "Quercus robur",     # plant
  "Panthera leo",      # animal
  "Amanita muscaria",  # fungus
  "Salmo trutta",      # fish
  "Escherichia coli"   # bacterium
)
result <- taxify(mixed, backend = c("wfo", "col", "gbif"))
result[, c("input_name", "accepted_name", "match_type", "backend")]
The console output during matching shows the chain in action:
Matching 5 names against 3 backends: wfo -> col -> gbif
[wfo] Matching 5 names...
[col] Matching 4 remaining names...
[gbif] Matching 1 remaining names...
"Quercus robur" matches in WFO and is removed from the
pool. The remaining four names go to COL. If any are still unmatched
after COL (perhaps an obscure bacterial name), they go to GBIF. The
process continues until all names have been tried against all backends
or all names have matched.
The order of backends in the vector matters. It determines which
taxonomic opinion wins for each name. If "Quercus robur"
exists in both WFO and COL, putting WFO first means WFO’s taxonomic
opinion is used (its accepted name, family assignment, synonym
resolution). Putting COL first would give COL’s opinion. For names that
exist in multiple backends, the first backend in the chain always
wins.
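The "first backend wins" rule can be seen in a toy stand-in for the chain. This sketch is not taxify's implementation: each backend is reduced to a character vector of names it knows, only exact matching is done, and the contents of the toy backends are invented.

```r
# Toy fallback chain: try each "backend" in order on the names still pending.
match_chain <- function(input, backends) {
  matched_by <- rep(NA_character_, length(input))
  for (b in names(backends)) {
    pending <- is.na(matched_by)
    if (!any(pending)) break              # nothing left: skip later backends
    hit <- pending & input %in% backends[[b]]
    matched_by[hit] <- b                  # first backend in the chain wins
  }
  data.frame(input_name = input, backend = matched_by,
             stringsAsFactors = FALSE)
}

toy_backends <- list(
  wfo  = c("Quercus robur"),
  col  = c("Quercus robur", "Panthera leo"),
  gbif = c("Quercus robur", "Panthera leo", "Escherichia coli")
)
species <- c("Quercus robur", "Panthera leo", "Escherichia coli",
             "Nomen ignotum")

match_chain(species, toy_backends)        # wfo, col, gbif, NA
match_chain(species, rev(toy_backends))   # gbif claims every known name
```

Reversing the chain changes which backend is credited for each name even though the set of matched names stays the same, which is the ordering effect described above.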
This has practical consequences. If we put GBIF first, nearly every name
would match there (GBIF has ~10M names, the largest of any backend) and
the curated opinions from WFO, COL, or WoRMS would rarely be consulted.
For a plant-heavy list with some non-plant taxa mixed in,
c("wfo", "col") or c("wfo", "col", "gbif") is
a sensible ordering: we get WFO’s curated plant taxonomy for plants, and
COL or GBIF picks up the rest.
If all names have been matched by earlier backends, later backends are skipped entirely with a message:
[gbif] Skipped (all names matched)
Fuzzy matching runs independently within each backend in the chain. A name that fails exact matching in WFO gets fuzzy-matched against WFO. If it still fails, it moves to the next backend and gets exact-matched, then fuzzy-matched there. This means a misspelled plant name has the best chance of matching in WFO (the plant-specialist backend) before falling through to COL or GBIF.
Worked example: plants-only with WFO vs WFO + COL
WFO focuses on accepted vascular plant and bryophyte names. Its coverage is excellent for current taxonomy, but names that appear only in older literature, belong to genera not yet integrated into WFO, or are nomenclaturally orphaned (no clear accepted name) may be absent. COL inherits WFO’s plant taxonomy as one of its sector databases but supplements it with names from other sources, including historical synonyms and cultivar names.
plants <- c(
  "Quercus robur",
  "Quercus petraea",
  "Pinus sylvestris",
  "Acer pseudoplatanus",
  "Coffea arabica",
  "Welwitschia mirabilis",
  "Lepidodendron aculeatum",  # extinct lycopsid
  "Nothofagus cunninghamii",
  "Dracaena draco"
)
# WFO alone
wfo_result <- taxify(plants, backend = "wfo")
table(wfo_result$match_type)
If any names come back as "none", we can add COL as a
fallback:
# WFO first, COL as fallback
both_result <- taxify(plants, backend = c("wfo", "col"))
table(both_result$match_type)
both_result[, c("input_name", "accepted_name", "backend")]
The backend column now shows "wfo" for
names matched by WFO and "col" for names that only COL
could resolve. This tells us exactly where each match came from, which
matters for reproducibility. In a paper’s methods section, we can state
“plant names were resolved against WFO 2024-12, with unmatched names
resolved against COL 2025.”
The two-backend chain is especially valuable for large vegetation plot datasets. Most names resolve in WFO with its plant-optimized taxonomy, but the handful of edge cases (cultivars, historical names, genera recently moved between families) that fall through to COL would otherwise require manual resolution.
Worked example: mixed-kingdom list with COL + GBIF + WoRMS
An ecological monitoring dataset from a coastal estuary might contain vascular plants, invertebrates, fish, and marine algae. No single backend covers all of these well. COL has broad expert-curated coverage across kingdoms. GBIF fills gaps with its larger name pool. WoRMS provides an authoritative backstop for marine invertebrate synonymy, which can be slow to propagate to generalist databases.
estuary_species <- c(
  "Zostera marina",       # seagrass (plant)
  "Salicornia europaea",  # glasswort (plant)
  "Carcinus maenas",      # shore crab
  "Mytilus edulis",       # blue mussel
  "Platichthys flesus",   # European flounder
  "Nereis diversicolor",  # ragworm
  "Fucus vesiculosus",    # bladderwrack (brown alga)
  "Littorina littorea",   # common periwinkle
  "Arenicola marina",     # lugworm
  "Cerastoderma edule"    # common cockle
)
result <- taxify(estuary_species, backend = c("col", "gbif", "worms"))
result[, c("input_name", "accepted_name", "family", "backend")]
COL is a good first choice here because it covers all kingdoms with expert curation. Most of these names will resolve there. GBIF catches anything COL might miss, including names from national checklists that have not yet been incorporated into COL. WoRMS serves as a final backstop specifically for marine taxa, covering invertebrate synonyms that may lag in generalist databases. The chain is ordered from highest curation to broadest coverage.
If the list were predominantly marine with only a few terrestrial taxa, we might instead lead with WoRMS, for example backend = c("worms", "col", "gbif"), so that its authoritative marine taxonomy takes precedence.
Worked example: fungi with Species Fungorum + COL fallback
Mycological datasets benefit from using Species Fungorum Plus as the primary backend. It is curated specifically for fungi, with ~500k names including anamorphs, teleomorphs, and the pleomorphic naming changes introduced by the 2011 Melbourne Code. Synonym coverage for fungal genera is better than in generalist databases, where fungal taxonomy is often a secondary concern.
fungi <- c(
  "Amanita muscaria",
  "Boletus edulis",
  "Cantharellus cibarius",
  "Tuber melanosporum",
  "Saccharomyces cerevisiae",
  "Aspergillus niger",
  "Penicillium chrysogenum",
  "Agaricus bisporus",
  "Trametes versicolor",
  "Cordyceps militaris"
)
result <- taxify(fungi, backend = c("fungorum", "col"))
result[, c("input_name", "accepted_name", "is_synonym", "backend")]
Species Fungorum resolves the standard names. If any obscure, recently described, or historically orphaned species fall through, COL picks them up. For mixed datasets that include both fungi and plants, a three-backend chain works well:
mixed <- c(
  "Quercus robur",         # plant
  "Amanita muscaria",      # fungus
  "Lactarius deliciosus",  # fungus
  "Pinus sylvestris",      # plant
  "Russula emetica"        # fungus
)
result <- taxify(mixed, backend = c("wfo", "fungorum", "col"))
WFO handles plants, Species Fungorum handles fungi, and COL serves as
a catch-all for anything that falls through both specialist backends.
The genus register (see below) helps taxify skip backends that cannot
possibly match a given name. When taxify encounters
"Amanita muscaria" and the genus Amanita is
not in WFO’s coverage table, the name is marked out-of-scope for WFO
immediately and passed to the next backend without wasting time on fuzzy
matching.
Worked example: algae
For algal taxonomy, AlgaeBase is the specialist source. It covers micro- and macroalgae, cyanobacteria, and some protists. Its curation is particularly strong for freshwater and marine microalgae where generalist databases often have thin coverage and outdated synonymy.
algae <- c(
  "Chlamydomonas reinhardtii",
  "Chlorella vulgaris",
  "Ulva lactuca",
  "Fucus vesiculosus",
  "Sargassum muticum"
)
result <- taxify(algae, backend = c("algaebase", "col"))
result[, c("input_name", "accepted_name", "backend")]
AlgaeBase is licensed CC BY-NC. taxify prints a license notice during download so users are aware of the restriction before incorporating the data into a workflow. For commercial applications, COL or WoRMS can serve as alternatives, though with less specialized algal coverage.
Worked example: molecular ecology with NCBI
When reconciling species lists from metabarcoding or eDNA studies, names are often linked to NCBI accessions. Using the NCBI backend ensures that taxify’s accepted names align with the taxonomy used in GenBank.
edna_hits <- c(
  "Salmo trutta",
  "Phoxinus phoxinus",
  "Anguilla anguilla",
  "Cottus gobio",
  "Lampetra planeri",
  "Chironomus riparius",      # midge (insect)
  "Potamopyrgus antipodarum"  # New Zealand mud snail
)
result <- taxify(edna_hits, backend = c("ncbi", "col"))
result[, c("input_name", "accepted_name", "taxon_id", "backend")]
The taxon_id values for NCBI-matched rows are NCBI
tax_ids, which can be used directly to link back to GenBank records or
NCBI taxonomy pages. For names not found in NCBI (e.g., taxa without
sequenced representatives), COL provides a fallback.
The backend column
The backend column in taxify’s output is a plain
character column. For single-backend calls, every matched row shows the
same backend name. For multi-backend chains, the column records which
backend produced each match. Unmatched rows have
backend = NA.
We can use this column to count how many names each backend resolved:
result <- taxify(species_list, backend = c("wfo", "col", "gbif"))
table(result$backend, useNA = "ifany")
Or filter to rows matched by a specific backend:
wfo_matches <- result[result$backend == "wfo" & !is.na(result$backend), ]
col_matches <- result[result$backend == "col" & !is.na(result$backend), ]
This is useful for quality control. If we expected a purely plant-based list but see many names matched by COL instead of WFO, that tells us the list contains names outside WFO’s scope (perhaps algae classified as plants in older literature, or plant-associated organisms such as plant parasites).
The backbone_version column gives the full provenance
string for each row, combining the backend name, version, and download
date. For a paper’s methods section, we can extract the unique versions
used:
unique(result$backbone_version[!is.na(result$backbone_version)])
# e.g., c("wfo:2024-12 (2026-04-01)", "col:2025 (2026-04-01)")
These provenance strings identify both the taxonomic source and the exact snapshot used, making results fully reproducible even if the backend releases a new version between when we run the analysis and when a reviewer checks it.
Backend-specific extras
Three backends have dedicated enrichment functions that join
additional backend-specific columns to a taxify result. These functions
only enrich rows that were matched by the corresponding backend; rows
from other backends get NA in the new columns.
WFO extras: add_wfo_info()
Adds WFO-specific columns: scientificNameID,
parentNameUsageID, namePublishedIn,
higherClassification, taxonRemarks, and
infraspecificEpithet. The namePublishedIn
field is particularly useful for citing original descriptions, and
higherClassification provides the full taxonomic hierarchy
as a semicolon-separated string.
result <- taxify(plants, backend = "wfo") |>
  add_wfo_info()
result[, c("input_name", "accepted_name", "namePublishedIn")]
COL extras: add_col_info()
Adds COL classification columns (kingdom,
phylum, col_class, order),
nomenclatural metadata (notho,
nomenclaturalCode, nomenclaturalStatus,
namePublishedIn), infraspecificEpithet, and
SpeciesProfile flags (is_extinct, is_marine,
is_freshwater, is_terrestrial). The
class column is renamed to col_class to avoid
conflict with R’s class() function.
The SpeciesProfile flags come from a separate file in the COL DwC-A archive. They are especially useful for filtering: we can, for instance, exclude extinct species from a contemporary biodiversity analysis, or separate marine from terrestrial taxa in an estuarine dataset.
result <- taxify(species_list, backend = "col") |>
  add_col_info()
# Check which species are marine
result[result$is_marine == TRUE & !is.na(result$is_marine),
       c("input_name", "accepted_name", "kingdom", "is_marine")]
GBIF extras: add_gbif_info()
Adds GBIF-specific columns: notho_type (hybrid type),
nom_status (nomenclatural status),
bracket_authorship (basionym author),
bracket_year, gbif_year,
name_published_in, origin (how the name
entered the GBIF backbone), and infra_specific_epithet. The
origin field is useful for understanding provenance: values
like "SOURCE", "DENORMED_CLASSIFICATION", or
"VERBATIM_ACCEPTED" indicate how GBIF ingested the
name.
result <- taxify(species_list, backend = "gbif") |>
  add_gbif_info()
result[, c("input_name", "accepted_name", "origin", "nom_status")]
Combining extras in a multi-backend result
In a multi-backend result, we can pipe through multiple enrichment functions. Each one only touches rows from its own backend; the others are left alone.
result <- taxify(species_list, backend = c("wfo", "col", "gbif")) |>
  add_wfo_info() |>
  add_col_info() |>
  add_gbif_info()
This produces a wide data.frame with the union of all extra columns.
For a given row, only the columns from its backend are populated; the
rest are NA. Whether this is useful depends on the
analysis. For most workflows, the base 16 columns are sufficient, and
backend-specific extras are only needed when we require nomenclatural
details, habitat flags, or publication references that the standard
output does not include.
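The "only touch rows from its own backend" behaviour can be pictured as a lookup that is masked by the backend column. The sketch below is an assumption about the general pattern, not the code behind the add_*_info() functions; `enrich_from()` and the toy tables are hypothetical.

```r
# Join extra columns onto a result, but only for rows matched by `backend`.
enrich_from <- function(result, extras, backend, id_col = "taxon_id") {
  new_cols <- setdiff(names(extras), id_col)
  eligible <- !is.na(result$backend) & result$backend == backend
  row <- match(result$taxon_id, extras[[id_col]])
  row[!eligible] <- NA                    # other backends' rows stay NA
  for (col in new_cols) result[[col]] <- extras[[col]][row]
  result
}

# Toy result: the second row's ID happens to collide with an ID in the
# WFO extras table, but it was matched by a different backend.
toy_result <- data.frame(
  taxon_id = c("id-1", "id-2", NA),
  backend  = c("wfo", "col", NA),
  stringsAsFactors = FALSE
)
toy_wfo_extras <- data.frame(
  taxon_id        = c("id-1", "id-2"),
  namePublishedIn = c("Example Journal 1: 1", "Example Journal 2: 2"),
  stringsAsFactors = FALSE
)

enriched <- enrich_from(toy_result, toy_wfo_extras, backend = "wfo")
enriched
```

Masking on the backend column matters because identifiers are only meaningful within one backend: an accidental ID collision across backends must not pull in unrelated metadata.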
The genus register
taxify ships a unified genus register built from the union of genera
across eight of the backends (WFO, COL, GBIF, ITIS, NCBI, OTT,
WoRMS, and Euro+Med). The register contains ~100k genera, each with its
family, higher classification (kingdom through order, where available),
and a life_form label (e.g., "vascular plant",
"animal", "fungus"). The classification is
resolved by priority: WoRMS > COL > GBIF > Euro+Med > ITIS
> NCBI > OTT > WFO. If COL and WFO disagree about which family
a genus belongs to, COL’s assignment wins.
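Priority-based resolution of this kind reduces to sorting candidate rows by backend rank and keeping the first row per genus. A minimal sketch with invented genera and family labels (the backend codes follow the overview table; the register's real build step may differ):

```r
# Priority order as stated above, highest first
priority <- c("worms", "col", "gbif", "euromed", "itis", "ncbi", "ott", "wfo")

# Toy per-backend family assignments; the disagreements are invented.
assignments <- data.frame(
  genus   = c("Aus", "Aus", "Bus", "Bus", "Bus"),
  backend = c("wfo", "col", "ott", "gbif", "ncbi"),
  family  = c("Fam_wfo", "Fam_col", "Fam_ott", "Fam_gbif", "Fam_ncbi"),
  stringsAsFactors = FALSE
)

# Sort by genus, then by position in the priority vector, and keep the
# first (highest-priority) row for each genus.
ranked <- assignments[order(assignments$genus,
                            match(assignments$backend, priority)), ]
register <- ranked[!duplicated(ranked$genus), ]
register
```

Here "Aus" takes COL's family over WFO's, and "Bus" takes GBIF's over NCBI's and OTT's.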
The register serves two purposes in taxify’s matching pipeline.
First, it provides life_form, kingdom_group,
and taxon_group columns in taxify output for every matched
name, regardless of which backend matched it. These columns make it
possible to stratify results by broad taxonomic group without needing to
look up each family manually.
Second, the register enables out-of-scope detection. Before fuzzy
matching begins, taxify checks whether an unmatched name’s genus is in
the register. If the genus is known (it appears in the register) but not
covered by any of the requested backends (it does not appear in the
backend coverage table), taxify marks the name as
"out_of_scope" immediately. This avoids wasting time on
fuzzy matching against a backend that could never produce a match, and
gives the user a more informative signal than a plain
"none".
Looking up a genus
lookup_genus() returns the register row for a single
genus. The register is loaded into memory on first call and cached for
the session.
lookup_genus("Quercus")
#     genus kingdom phylum class   order   family      life_form
# 1 Quercus Plantae    ...   ... Fagales Fagaceae vascular plant
lookup_genus("Panthera")
#      genus  kingdom   phylum    class     order  family life_form
# 1 Panthera Animalia Chordata Mammalia Carnivora Felidae    animal
Checking backend coverage
taxify_register_coverage() shows which backends contain
a given genus and at what version. This is useful when diagnosing match
failures: if the genus does not appear in the coverage table for the
requested backend, the name is genuinely out of scope for that
backend.
taxify_register_coverage("Quercus")
#     genus backend version date_added
# 1 Quercus     col    2025 2026-04-01
# 2 Quercus    gbif current 2026-04-01
# 3 Quercus     wfo 2024-12 2026-04-01
A genus covered by all three backends can be matched by any of them. A genus covered only by GBIF (perhaps a recently described bacterial genus) will not match against WFO or COL.
Out-of-scope detection in practice
When taxify encounters an unmatched name whose genus appears in the
register but is not covered by any of the requested backends, it sets
match_type = "out_of_scope" instead of "none".
This distinction carries information: an out-of-scope result means the
name likely exists in a different backend rather than being a
misspelling or invalid name.
# Trying to match a marine invertebrate against WFO (plants only)
result <- taxify("Carcinus maenas", backend = "wfo")
result$match_type
# [1] "out_of_scope"
result$life_form
# [1] "animal"
The life_form column tells us the genus belongs to
animals, confirming that WFO is the wrong backend for this name. In a
pipeline, we can filter for "out_of_scope" rows and re-run
them against a broader backend. Or, more practically, we include the
right backends in the fallback chain from the start, and the
out-of-scope mechanism saves time by skipping the fuzzy matching step
for names that cannot possibly match.
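The out-of-scope rule described in this section (genus known to the register, but not covered by any requested backend) can be written as a two-condition check. The sketch uses toy register and coverage tables; `is_out_of_scope()` is a hypothetical helper, and taking the first word of the name as the genus is a simplifying assumption.

```r
# TRUE when the genus is in the register but none of the requested
# backends covers it; FALSE means fuzzy matching should still be tried.
is_out_of_scope <- function(name, register_genera, coverage, backends) {
  genus   <- sub(" .*$", "", name)          # first word of the name
  known   <- genus %in% register_genera
  covered <- genus %in% coverage$genus[coverage$backend %in% backends]
  known & !covered
}

# Toy register and coverage table
register_genera <- c("Quercus", "Carcinus")
coverage <- data.frame(
  genus   = c("Quercus", "Quercus", "Carcinus", "Carcinus"),
  backend = c("wfo", "col", "worms", "col"),
  stringsAsFactors = FALSE
)

unmatched <- c("Carcinus maenas", "Quercus robor", "Xyzus incognitus")
is_out_of_scope(unmatched, register_genera, coverage, backends = "wfo")
is_out_of_scope(unmatched, register_genera, coverage,
                backends = c("wfo", "col"))
```

With only WFO requested, the crab is flagged while the misspelled oak and the unknown genus go on to fuzzy matching; adding COL to the chain removes the flag because COL covers Carcinus.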
The taxify result’s print() method includes a tally of
out-of-scope names broken down by taxon_group, making it
easy to see at a glance what kinds of organisms were outside the
requested backend’s scope.
Practical guidance: choosing backends
The right backend depends on the taxonomic scope of the data. The recommendations below are organized by the most common use cases.
Pure vascular plant lists. Use
backend = "wfo". WFO is the standard reference,
well-curated, and updated regularly. If some names fall through (e.g.,
horticultural cultivars, nomenclaturally complex genera, or names from
older floras that use outdated synonymy), add COL:
backend = c("wfo", "col").
European vegetation data. Use
backend = c("euromed", "wfo"). Euro+Med PlantBase is the
taxonomic reference used by EVA and covers all native and introduced
vascular plants of Europe, the Mediterranean, and the Caucasus. Leading
with Euro+Med ensures European synonym resolution follows Euro+Med’s
taxonomic opinion, with WFO as fallback for non-European taxa or names
outside Euro+Med’s scope. Note that Euro+Med data is CC BY-SA 3.0.
Pure marine/aquatic lists. Use
backend = "worms". WoRMS is the authoritative source for
marine taxonomy, curated by domain experts, and includes habitat and
extinction flags. For estuarine or transitional lists that include some
terrestrial taxa, add COL: backend = c("worms", "col").
Pure fungal lists. Use
backend = "fungorum". Species Fungorum Plus is the
specialist reference for fungi. Add COL as fallback for obscure or
recently described species:
backend = c("fungorum", "col").
Pure algal lists. Use
backend = "algaebase". Add COL or WoRMS as fallback:
backend = c("algaebase", "col"). Remember that AlgaeBase data is
licensed CC BY-NC, which restricts commercial use.
Mixed-kingdom ecological datasets. Use
backend = c("col", "gbif"), or lead with a specialist
backend for the dominant taxon group. For a plant-dominated dataset with
some animals and fungi: backend = c("wfo", "col"). For a
marine biodiversity survey:
backend = c("worms", "col", "gbif"). For a forest inventory
that includes trees, fungi, insects, and epiphytes:
backend = c("wfo", "fungorum", "col").
Molecular/sequence-linked work. Use
backend = "ncbi". NCBI Taxonomy is the reference for
GenBank and the other INSDC sequence databases. It covers bacteria,
archaea, and viruses that other backends lack. For mixed molecular and
ecological work: backend = c("ncbi", "col").
Phylogenetic studies. Use
backend = "ott". OTT is the backbone of the Open Tree of
Life and merges multiple source taxonomies. Its cross-references to
NCBI, GBIF, WoRMS, and IRMNG make it a good bridge between different
identifier systems.
Maximum coverage / catch-all. Use
backend = c("col", "gbif"). COL provides expert-curated
taxonomy for ~4.5M names. GBIF’s backbone adds ~10M names from
additional sources. Together they cover virtually all described species
with a nomenclatural record. This combination is a reasonable default
when the taxonomic composition of the dataset is unknown.
General rule. Specialist backends first, generalist
backends second. Lead with the backend whose taxonomic opinion we trust
most for the dominant taxon group, and add broader backends as fallbacks
for the remainder. The backend column in the output lets us
audit exactly which taxonomic opinion was applied to each name.
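
As a rough sketch of such an audit, the backend column can be tabulated with base R. The data frame below is toy data standing in for a result from `backend = c("wfo", "fungorum", "col")`; the `input_name` column is a hypothetical name used only for illustration.

```r
# Toy stand-in for a multi-backend result; only `backend` is taken from
# the text, `input_name` is a hypothetical column name.
result <- data.frame(
  input_name = c("Quercus robur", "Fagus sylvatica",
                 "Vulpes vulpes", "Amanita muscaria"),
  backend    = c("wfo", "wfo", "col", "fungorum"),
  stringsAsFactors = FALSE
)

# How many names did each backend resolve?
backend_counts <- table(result$backend)
backend_counts

# Which names were resolved by which backend?
names_by_backend <- split(result$input_name, result$backend)
names_by_backend
```

A large share of names landing in the generalist fallback can be a hint that the chain should lead with a different specialist backend.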
Performance considerations
Backend size affects download time and, to a lesser extent, matching speed. WFO (~400k names) matches faster than GBIF (~10M names) for the same query. However, taxify uses index-accelerated genus-blocked joins at the C level (via vectra), so even GBIF matching is fast for typical species lists. A list of 5,000 names resolves against GBIF in under a second on modern hardware. The performance difference only becomes noticeable at scale (100k+ names) or with heavy fuzzy matching against a large backbone.
For large lists with a multi-backend chain, putting the most likely
backend first saves time. Names matched by the first backend skip all
later backends entirely. If 90% of a list is plants,
c("wfo", "col") is faster than c("col", "wfo")
because WFO is smaller and resolves most names on the first pass. The
remaining 10% of names go to COL, which is larger but only processes the
small residual.
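
The effect of chain order can be illustrated with a toy model in base R. This is a sketch of the general fallback idea only, assuming exact lookup against small character vectors; it is not how taxify implements matching, and `resolve_chain()` is a hypothetical helper written for this example.

```r
# Each toy "backend" is just a character vector of known names.
backends <- list(
  wfo = c("Quercus robur", "Fagus sylvatica"),
  col = c("Quercus robur", "Fagus sylvatica",
          "Vulpes vulpes", "Amanita muscaria")
)

# Hypothetical helper: try backends in order, passing only the still
# unmatched names to the next backend in the chain.
resolve_chain <- function(query, chain, backends) {
  matched_by <- rep(NA_character_, length(query))
  lookups <- integer(0)
  for (b in chain) {
    pending <- which(is.na(matched_by))
    if (length(pending) == 0L) break
    lookups[b] <- length(pending)  # number of names this backend has to process
    hit <- query[pending] %in% backends[[b]]
    matched_by[pending[hit]] <- b
  }
  list(matched_by = matched_by, lookups = lookups)
}

query <- c("Quercus robur", "Fagus sylvatica", "Vulpes vulpes")

wfo_first <- resolve_chain(query, c("wfo", "col"), backends)
col_first <- resolve_chain(query, c("col", "wfo"), backends)

wfo_first$lookups
col_first$lookups
```

With WFO first, the larger backend only sees the residual name. With COL first, every name is processed by the larger backend and is also resolved under COL's taxonomic opinion, so order affects both speed and results.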
Fuzzy matching is the most expensive step. It runs a genus-blocked fuzzy join with multi-threaded string distance computation. For names with misspelled genera (where genus blocking cannot help), taxify falls back to a 2-character prefix block that catches most genus-level typos while keeping the search space manageable.
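
To make the prefix-blocking idea concrete, here is a minimal base R sketch. It assumes plain Levenshtein distance via `utils::adist()` and a tiny candidate list; taxify's own implementation runs at the C level and may use a different distance measure and thresholds.

```r
# Toy list of candidate genera and one misspelled query genus.
candidates <- c("Quercus", "Quillaja", "Querula", "Fagus", "Fraxinus")
query <- "Qurecus"

# Block: keep only candidates that share the first two characters,
# so the distance computation runs on a small subset.
block <- candidates[substr(candidates, 1, 2) == substr(query, 1, 2)]

# Edit distance between the query and each candidate in the block.
d <- drop(utils::adist(query, block))

# Closest candidate within the block.
best <- block[which.min(d)]
best
```

The trade-off is visible in the sketch: a typo in the first two characters would put the correct genus outside the block, which is why this approach catches most, but not all, genus-level typos.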
Reproducibility
The backbone_version column encodes the exact data
snapshot used. For a published analysis, we recommend recording these
strings in the methods section or supplementary material. taxify pins
the backbone version at download time and does not update mid-session.
Version checks happen once per R session, and any update is logged to
the console with the old and new version numbers.
To lock a specific backbone version for a project:
taxify_download_vtr("wfo", version = "2024.01")
Pinned versions live in their own directories and are never
overwritten. The “latest” slot continues to track new releases
independently. A project that needs exact reproducibility can pin all
backends at specific versions and never use the “latest” slot. A project
that prefers to stay current can rely on the default “latest” behavior
and cite the backbone_version strings from the output.
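
One way to collect the version strings for a methods section is to extract the unique backend and version pairs from the result. The sketch below uses a toy data frame; the COL version string is a placeholder, not a real release identifier, and the output file name is arbitrary.

```r
# Toy stand-in for a result; the version strings are illustrative only.
result <- data.frame(
  backend          = c("wfo", "wfo", "col"),
  backbone_version = c("2024.01", "2024.01", "placeholder-col-version"),
  stringsAsFactors = FALSE
)

# One row per backend and snapshot actually used in the analysis.
versions <- unique(result[, c("backend", "backbone_version")])
versions

# Write the table next to the analysis outputs (a temp dir is used here).
out_file <- file.path(tempdir(), "backbone_versions.csv")
write.csv(versions, out_file, row.names = FALSE)
```

Keeping this file under version control alongside the analysis scripts gives a simple audit trail even when the "latest" slot is used.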
The manifest, which maps backend names to their download URLs and
latest versions, is cached per session and can be refreshed with
taxify_refresh_manifest(). For offline use, taxify falls
back to the bundled manifest shipped with the package.