the species names never quite match
Offline taxonomic name matching against local Darwin Core backbones, with matching done in C.
Hand it a column of messy species names. taxify cleans them, matches them against a backbone you already have on disk, resolves synonyms to accepted names, and returns one standardized data.frame. Every step runs locally against a versioned snapshot, so there are no API calls, no rate limits, and the same input gives the same output on any machine. The matching engine is written in C and runs on the vectra columnar engine.
library(taxify)
# match against WFO (downloads the backbone on first use, ~120 MB)
taxify(c(
"Quercus robur",
"Pinus abies", # synonym, resolved to Picea abies
"Quercus robus", # typo, fuzzy-corrected to Q. robur
"Taraxacum officinale"
))
Local, not over the wire
The usual route for name resolution, taxize, calls out to around twenty web services (NCBI, ITIS, GBIF, EOL, IUCN, WoRMS, Tropicos, …). That covers everything, but it ties each run to network latency, service uptime, and rate limits, and the answer can change between runs as upstream services update. taxify ships the backbones as pre-built local snapshots and matches against them in C, so a list of thousands of names resolves in seconds and any result can be reproduced from the recorded backbone version.
The closest local analogue is taxadb, which also stores backbone snapshots on disk; the migration vignette walks through the differences in matching strategy, output schema, and enrichment.
Ten backbones, one call
taxify ships ten backbones as compressed .vtr files, downloaded once and matched locally. Pass several and they form a fallback chain: a name unmatched by the first backbone cascades to the next.
# WFO first (plants), then GBIF for whatever WFO doesn't cover
taxify(
c("Quercus robur", "Panthera leo", "Amanita muscaria"),
backend = c("wfo", "gbif")
)
| Backend | Scope | Approx. names |
|---|---|---|
| WFO | Vascular plants | ~400k |
| COL | All kingdoms | ~4.5M |
| GBIF | All kingdoms | ~10M |
| ITIS | US focus, freshwater/marine | ~900k |
| NCBI Taxonomy | All life | ~2.5M |
| Open Tree of Life | All life (synthetic) | ~4M |
| WoRMS | Marine/aquatic | ~600k |
| Euro+Med | European/Mediterranean plants | ~132k |
| Species Fungorum | Fungi | ~329k |
| AlgaeBase | Algae | ~172k |
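
The fallback chain described above can be pictured with a few lines of base R. This is only a sketch of the cascading idea, not taxify's implementation: the two "backbones" are toy character vectors, `match_with_fallback()` is a hypothetical helper, and only exact matching is shown.

```r
# Toy stand-ins for two backbones; real backbones are far larger.
backbones <- list(
  wfo  = c("Quercus robur", "Taraxacum officinale"),
  gbif = c("Quercus robur", "Panthera leo", "Amanita muscaria")
)

# Hypothetical helper: walk the backbones in order and record the first
# one that contains each name (exact matches only in this sketch).
match_with_fallback <- function(input, backbones) {
  matched_by <- rep(NA_character_, length(input))
  for (backend in names(backbones)) {
    pending <- is.na(matched_by)          # names no earlier backbone matched
    if (!any(pending)) break
    hit <- input[pending] %in% backbones[[backend]]
    matched_by[pending][hit] <- backend
  }
  data.frame(input = input, backend = matched_by, stringsAsFactors = FALSE)
}

res <- match_with_fallback(
  c("Quercus robur", "Panthera leo", "Amanita muscaria", "Homo sapiens"),
  backbones
)
print(res)
```

A name found in the first backbone never reaches the second, and a name absent from every backbone stays unmatched (`NA` here).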
Names are cleaned before matching
Input names are normalized first, so the fuzzy pass only runs on names that genuinely differ from the backbone rather than on names that just carry extra authorship or qualifiers:
"Quercus robur L." -> "Quercus robur" # authorship stripped
"Pinus cf. sylvestris" -> "Pinus sylvestris" # qualifier removed
"Nothofagus x alpina" -> "Nothofagus alpina" # hybrid marker normalized
"Betula pendula (Roth) Doll" -> "Betula pendula" # parenthesized author strippedFuzzy matching is configurable (Damerau-Levenshtein, Levenshtein, or Jaro-Winkler, with a distance threshold), and runs genus-blocked so a typo only competes against names in the same genus.
On the same WFO backbone and the same set of 5,000 plant names (Windows, R 4.5.2), taxify's C matching against the local snapshot avoids the per-name cost of WorldFlora's CSV-into-RAM approach:
| | taxify | WorldFlora |
|---|---|---|
| Exact match (1,000 names) | 0.1 s | 1.3 s |
| Fuzzy match (1,000 names) | 1.0 s | 1,862 s (31 min) |
| Fuzzy match (5,000 names) | 1.1 s | ~83 min (extrapolated) |
| Backbone load | ~3 s (first call) | 33 s (CSV into RAM) |
What you get back
taxify() returns one row per input name with a fixed 16-column schema: the matched and accepted names, IDs, rank, family, genus, epithet, authorship, synonym and hybrid flags, the match type (exact, exact_ci, fuzzy, or none), the fuzzy distance, the backend, and the backbone version used. summary() prints a compact digest of how the batch resolved.
Trait and status enrichment
Twenty-seven enrichment layers join published trait and status data to your results through the backbone-resolved accepted name, so synonyms in either dataset land on the same key:
# plants
taxify(plant_names) |>
add_conservation_status() |> # IUCN Red List
add_invasive_status("AT") |> # GRIIS
add_woodiness() |> # Zanne et al.
add_eive() # EIVE indicator values
# fish
taxify(fish_names, backend = "col") |>
add_fishbase() |> # FishBase morphology & ecology
add_fish_traits() # FISHMORPH functional traits
Sources span all kingdoms: IUCN, GRIIS, GBIF common names, WCVP, EIVE, Diaz et al., LEDA, FungalTraits, FUNGuild, AlgaeTraits, EltonTraits, AVONET, PanTHERIA, AmphiBIO, FISHMORPH, FishBase, AnAge, GloNAF, LepTraits, AnimalTraits, regional plant-trait sets for France (Baseflor), Britain (Ecoflora), and Germany (FloraWeb), and more. The enrichments vignette lists the full set with references and licenses.
To join your own table, add_data() auto-detects the species column, matches it through the same backbone(s) used in the original call, and left-joins. It accepts data.frames, CSV, CSV.GZ, XLSX, SQLite, and .vtr.
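
The reason for joining through the accepted name can be shown with a small base R sketch. It is not how `add_data()` is implemented: the synonym table, the `resolve()` helper, and all column names below are assumptions made up for illustration.

```r
# Toy synonym table: every known name maps to its accepted name.
backbone <- data.frame(
  name     = c("Picea abies", "Pinus abies", "Quercus robur"),
  accepted = c("Picea abies", "Picea abies", "Quercus robur"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the accepted name for each input name.
resolve <- function(x) backbone$accepted[match(x, backbone$name)]

# Field data uses a synonym; the trait table uses the accepted name.
field_data <- data.frame(
  species = c("Pinus abies", "Quercus robur"),
  plot    = c("A", "B"),
  stringsAsFactors = FALSE
)
traits <- data.frame(
  species = c("Picea abies", "Quercus robur"),
  woody   = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Resolve both sides to the same key, then left-join on it.
field_data$accepted <- resolve(field_data$species)
traits$accepted     <- resolve(traits$species)
joined <- merge(
  field_data,
  traits[, c("accepted", "woody")],
  by = "accepted",
  all.x = TRUE
)
print(joined)
```

A join on the raw `species` column would miss the "Pinus abies" row. Joining on the resolved key keeps it.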
Installation
install.packages("pak")
pak::pak("gcol33/taxify") # vectra is installed automatically
Support
“Software is like sex: it’s better when it’s free.” — Linus Torvalds
I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.
If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.