taxify: messy name lists to accepted names, offline

What this solves

Every dataset here starts as a name column. Resolving it to accepted names in R, at scale and offline, is the job.
A plain merge() on raw strings silently drops every disagreeing row, so the matching has to come first.
taxify returns one standardized table from a character vector: cleaned, matched against a backbone you keep on disk, synonyms resolved. Matching runs in C against that local copy, so there are no web services and no rate limits, and the same input gives the same output on any machine that holds the same backbone release.

Below: each job done the familiar way and with taxify, side by side. Everything runs offline against the staged backbone.

Setup

install.packages("pak")
pak::pak("gcol33/taxify")     # the vectra engine installs with it

The first taxify() call downloads the WFO backbone once (about 150 MB, roughly 1 GB once unpacked). After that the package never touches the network. To stage it deliberately:

library(taxify)
taxify_download_vtr("wfo")
taxify_data_dir()             # where the backbone lives on this machine

One call

Hand taxify() a character vector. It cleans each name, matches it, and resolves synonyms.

field_names <- c(
  "Quercus robur L.",        # authorship to strip
  "Quercus robus",           # typo
  "cf. Betula pendula",      # field qualifier
  "FAGUS SYLVATICA",         # caps
  "Quercus pedunculata",     # historical synonym of Q. robur
  "Q. petraea",              # abbreviated genus
  "Pinus abies",             # synonym of Picea abies (a different genus)
  "Festuca rubrra",          # typo
  "Fallopia japonica",       # synonym of Reynoutria japonica (invasive)
  "Taraxacum officinale"
)

res <- taxify(field_names)
res[, c("input_name", "accepted_name", "family",
        "is_synonym", "match_type", "fuzzy_dist")]

#>              input_name        accepted_name       family is_synonym match_type
#> 1      Quercus robur L.        Quercus robur     Fagaceae      FALSE      exact
#> 2         Quercus robus        Quercus robur     Fagaceae      FALSE      fuzzy
#> 3    cf. Betula pendula       Betula pendula   Betulaceae      FALSE      exact
#> 4       FAGUS SYLVATICA      Fagus sylvatica     Fagaceae      FALSE   exact_ci
#> 5   Quercus pedunculata        Quercus robur     Fagaceae       TRUE      exact
#> 6            Q. petraea      Quercus petraea     Fagaceae      FALSE     abbrev
#> 7           Pinus abies          Picea abies     Pinaceae       TRUE      exact
#> 8        Festuca rubrra        Festuca rubra      Poaceae      FALSE      fuzzy
#> 9     Fallopia japonica  Reynoutria japonica Polygonaceae       TRUE      exact
#> 10 Taraxacum officinale Taraxacum officinale   Asteraceae      FALSE      exact
#>    fuzzy_dist
#> 1          NA
#> 2  0.07692308
#> 3          NA
#> 4          NA
#> 5          NA
#> 6          NA
#> 7          NA
#> 8  0.07142857
#> 9          NA
#> 10         NA

Ten names, ten rows. The list is small on purpose, so the whole table fits on screen and you can read every match. The hundred-name run further down shows how the same call behaves at field size.

Each row carries:

the original string and the accepted name
family, genus, and authorship
a synonym flag and the match type (exact, exact_ci, fuzzy, abbrev, none)
the fuzzy distance

The full table is 22 columns wide and holds the same shape for any input.

Each name takes a different route to its accepted name. The pipeline is the same for all of them; the route through it is what differs.

What happened to each name:

Quercus robur L. lost its authorship before matching.
Quercus robus and Festuca rubrra were typos; the fuzzy pass corrected both, comparing each only against names in its own genus (genus-blocked).
cf. Betula pendula lost the field qualifier.
FAGUS SYLVATICA matched after case folding (exact_ci).
Quercus pedunculata, Pinus abies, and Fallopia japonica resolved to their accepted names (Quercus robur, Picea abies, Reynoutria japonica) with is_synonym TRUE. The second crosses a genus boundary; the third is an older name for a well-known invader.
Q. petraea resolved to Quercus petraea on the genus initial plus epithet (match_type abbrev).
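
The cleaning steps in this list can be approximated in a few lines of base R. This is a rough sketch, not taxify's own code: clean_name() is a hypothetical helper, it handles only a leading cf./aff., trailing authorship, and case, and it would not cope with abbreviated genera or infraspecific ranks.

```r
# Hypothetical helper, NOT part of taxify: approximates three of the
# cleaning steps described above for plain "Genus epithet" names.
clean_name <- function(x) {
  x <- trimws(x)
  # drop a leading open-nomenclature qualifier such as "cf." or "aff."
  x <- sub("^(cf|aff)\\.?\\s+", "", x, ignore.case = TRUE)
  parts <- strsplit(x, "\\s+")
  vapply(parts, function(p) {
    if (length(p) == 0L || is.na(p[1])) return(NA_character_)
    # keep genus + epithet only; this discards trailing authorship ("L.")
    p <- tolower(p[seq_len(min(2L, length(p)))])
    # capitalise the genus, leave the epithet in lower case
    p[1] <- paste0(toupper(substring(p[1], 1, 1)), substring(p[1], 2))
    paste(p, collapse = " ")
  }, character(1))
}

cleaned <- clean_name(c("Quercus robur L.", "cf. Betula pendula", "FAGUS SYLVATICA"))
cleaned
# returns "Quercus robur", "Betula pendula", "Fagus sylvatica"
```

Keeping only the first two tokens is the simplest way to lose authorship, which is why the sketch stops at binomials; taxify reports which route each name took in match_type instead of hiding it.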

Why a typo barely costs anything

The fuzzy pass never compares a name against the whole backbone. It blocks on genus first, so Quercus robus is scored only against the other Quercus names, and a one-letter slip is found in a handful of comparisons instead of across every name on disk.

The default threshold allows about one edit per five characters, so common typos resolve while genuinely different names do not. A large genus is split further: the block key is the genus plus the first two letters of the epithet, which cuts a 31,000-name genus like Hieracium into sub-blocks of roughly a thousand, with a genus-only pass behind it so quality is unchanged.
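
The fuzzy_dist values printed earlier are consistent with a normalized edit distance. A minimal sketch, assuming the distance is the Levenshtein edit count divided by the length of the longer string and the cut-off is 0.2 (one edit per five characters); norm_dist() is a hypothetical helper and the package's exact formula may differ:

```r
# Assumption: distance = Levenshtein edits / length of the longer string.
# norm_dist() is illustrative only, not a taxify function.
norm_dist <- function(a, b) {
  edits <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b,
                  USE.NAMES = FALSE)
  edits / pmax(nchar(a), nchar(b))
}

d_typo1 <- norm_dist("Quercus robus",  "Quercus robur")   # 1 edit / 13 characters
d_typo2 <- norm_dist("Festuca rubrra", "Festuca rubra")   # 1 edit / 14 characters
d_other <- norm_dist("Quercus robus",  "Quercus rubra")   # 3 edits / 13 characters

c(d_typo1, d_typo2, d_other) <= 0.2
# the two typos pass the assumed threshold, the different species does not
```

Under that assumption the first two values come out at 1/13 and 1/14, the same figures as in the fuzzy_dist column above.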

When the typo lands on the genus itself, the genus block cannot retrieve the right candidates, so a second pass re-blocks on the first two letters of the full string. A genus slip that keeps those two letters still resolves to the right name; one that changes them falls through to no match. This second pass runs on the WFO, Catalogue of Life, and GBIF backbones.
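
The two blocking passes can be sketched on a toy backbone. Everything here is an assumption made for illustration: the six-name vector stands in for the backbone, fuzzy_blocked() is a hypothetical function, the fallback is triggered simply by an empty genus block, and the sub-blocking of large genera is left out. The real matcher runs in C and will differ in detail.

```r
# Toy stand-in for the backbone (real names, invented data structure).
backbone <- c("Quercus robur", "Quercus rubra", "Quercus petraea",
              "Festuca rubra", "Festuca ovina", "Fagus sylvatica")

genus_of <- function(x) sub("\\s.*$", "", x)

# Hypothetical matcher: score only within the genus block; if the genus is
# unknown, fall back to a block on the first two letters of the full string.
fuzzy_blocked <- function(name, backbone, max_dist = 0.2) {
  block <- backbone[genus_of(backbone) == genus_of(name)]
  if (length(block) == 0L) {
    block <- backbone[substr(backbone, 1, 2) == substr(name, 1, 2)]
  }
  if (length(block) == 0L) return(NA_character_)
  # normalized edit distance against the block only, never the whole backbone
  d <- drop(utils::adist(name, block)) / pmax(nchar(name), nchar(block))
  if (min(d) <= max_dist) block[which.min(d)] else NA_character_
}

fuzzy_blocked("Quercus robus", backbone)   # epithet typo: genus block finds Quercus robur
fuzzy_blocked("Quecus robur",  backbone)   # genus typo keeping "Qu": second pass finds it
fuzzy_blocked("Xuercus robur", backbone)   # genus typo changing the first letters: NA
```

The point of the design is the size of block: each name is scored against three or fewer candidates here, regardless of how long backbone is.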

Check the batch at a glance

summary() prints a digest, which is the fastest way to see whether a run went cleanly or something upstream needs attention.

summary(res)

#> ── taxify results ────────────────────────────────────────────────────────────
#>   backend: WFO v2024-12  |  10 names submitted
#> 
#>   matched        10  (exact: 6, case-insensitive: 1, fuzzy: 2, abbrev: 1)
#>   ────────────────────────────────────────────────────────────
#>   taxon groups: angiosperm: 8  gymnosperm: 1  unknown: 1

Offline, and how much faster

Every match runs against the local snapshot:

the same input gives the same output on any machine, so a run reproduces exactly
the backbone_version column records the exact WFO release and download date, which drops into a methods section

With the backbone already loaded (after the first call), a hundred-name field list resolves in a fraction of a second on this machine:

field <- c(
  "Quercus petraea", "Pinus sylvestris", "Picea abies", "Betula pendula",
  "Acer pseudoplatanus", "Acer platanoides", "Acer campestre", "Corylus avellana",
  "Fraxinus excelsior", "Carpinus betulus", "Sorbus aucuparia", "Tilia cordata",
  "Ulmus glabra", "Alnus glutinosa", "Salix caprea", "Populus tremula",
  "Prunus avium", "Prunus spinosa", "Crataegus monogyna", "Sambucus nigra",
  "Cornus sanguinea", "Viburnum opulus", "Euonymus europaeus", "Ligustrum vulgare",
  "Frangula alnus", "Juniperus communis", "Taxus baccata", "Larix decidua",
  "Abies alba", "Rosa canina", "Rubus idaeus", "Hedera helix",
  "Clematis vitalba", "Berberis vulgaris", "Betula pubescens", "Prunus padus",
  "Rhamnus cathartica", "Lonicera xylosteum", "Trifolium repens", "Trifolium pratense",
  "Festuca ovina", "Dactylis glomerata", "Plantago lanceolata", "Plantago major",
  "Plantago media", "Achillea millefolium", "Ranunculus acris", "Ranunculus repens",
  "Urtica dioica", "Poa pratensis", "Poa annua", "Galium mollugo",
  "Galium aparine", "Bellis perennis", "Cardamine pratensis", "Cirsium arvense",
  "Cirsium vulgare", "Daucus carota", "Heracleum sphondylium", "Anthriscus sylvestris",
  "Lotus corniculatus", "Medicago lupulina", "Vicia cracca", "Lathyrus pratensis",
  "Stellaria media", "Silene dioica", "Silene vulgaris", "Geranium pratense",
  "Geranium robertianum", "Glechoma hederacea", "Lamium album", "Prunella vulgaris",
  "Ajuga reptans", "Veronica chamaedrys", "Rumex acetosa", "Rumex obtusifolius",
  "Chenopodium album", "Capsella bursa-pastoris", "Senecio vulgaris", "Leucanthemum vulgare",
  "Centaurea jacea", "Knautia arvensis", "Campanula rotundifolia", "Primula veris",
  "Anemone nemorosa", "Filipendula ulmaria", "Lythrum salicaria",
  # a few naturalized aliens this group works with
  "Robinia pseudoacacia", "Solidago canadensis", "Solidago gigantea",
  "Impatiens glandulifera", "Heracleum mantegazzianum", "Prunus serotina",
  "Quercus rubra",
  # and the same kinds of mess as the ten-name list, at scale
  "Quercus robur L.", "FAGUS SYLVATICA", "Quercus robus",
  "cf. Taraxacum officinale", "Quercus pedunculata", "Festuca rubrra"
)

# named "timing" so the result does not mask base::t()
timing <- system.time(field_res <- taxify(field, verbose = FALSE))
cat(sprintf("%d names resolved in %.3f s on this machine\n",
            length(field), timing[["elapsed"]]))

#> 100 names resolved in 0.110 s on this machine

At this size you stop reading the table row by row. The fastest check is the digest from summary() (how many matched, and by which route); after that, the only rows that need a human’s eye are those that did not match exactly or that resolved through a synonym. Filtering to those turns a hundred rows into a handful:

field_res[field_res$match_type != "exact" | field_res$is_synonym %in% TRUE,
          c("input_name", "accepted_name", "match_type",
            "is_synonym", "fuzzy_dist")]

#>              input_name        accepted_name match_type is_synonym fuzzy_dist
#> 85     Anemone nemorosa Anemonoides nemorosa      exact       TRUE         NA
#> 96      FAGUS SYLVATICA      Fagus sylvatica   exact_ci      FALSE         NA
#> 97        Quercus robus        Quercus robur      fuzzy      FALSE 0.07692308
#> 99  Quercus pedunculata        Quercus robur      exact       TRUE         NA
#> 100      Festuca rubrra        Festuca rubra      fuzzy      FALSE 0.07142857
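
A cross-tabulation gives the same picture in counts. The sketch below uses a stand-in data frame holding the five rows printed above, so it runs without the backbone; on a real run the same table() call would take field_res directly.

```r
# Stand-in for the filtered rows printed above (values copied from that output).
flagged <- data.frame(
  input_name = c("Anemone nemorosa", "FAGUS SYLVATICA", "Quercus robus",
                 "Quercus pedunculata", "Festuca rubrra"),
  match_type = c("exact", "exact_ci", "fuzzy", "exact", "fuzzy"),
  is_synonym = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# rows: route taken; columns: whether the match resolved through a synonym
route_counts <- table(route = flagged$match_type, synonym = flagged$is_synonym)
route_counts
```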

For a quick look at the resolved table itself, head(field_res) shows the same 22 columns as the ten-name run; the full object goes straight into the joins and statistics below without ever being printed in full.

The package benchmark sizes that up against WorldFlora’s WFO.match, the tool many in this room reach for on the same task. Both run against a local copy and return the same matches; the difference is where the matching happens: taxify scores names in C against the compiled backbone, WorldFlora in R. The published numbers are for 1,000 plant names on Windows, R 4.5.2.

Exact matching is comparatively close (0.1 s against 1.3 s); the gap opens on fuzzy matching, where the genus blocking above keeps taxify near a second while the in-R scan grows with the list. The full table and method are in the package README.

Add your own attributes

Once names resolve to an accepted name, any table keyed on species joins cleanly. add_data():

takes a data frame, or a CSV, XLSX, or SQLite file
runs its species column through the same backbone
joins on the accepted name, so a synonym on either side still lines up

my_traits <- data.frame(
  species      = c("Quercus pedunculata",   # synonym of Q. robur
                   "Pinus sylvestris",
                   "Betula pendula"),
  seed_mass_mg = c(3200, 7.5, 0.2)
)

taxify(c("Quercus robur", "Pinus sylvestris", "Betula pendula")) |>
  add_data(my_traits, species_col = "species")

#>         input_name     matched_name    accepted_name       taxon_id
#> 1    Quercus robur    Quercus robur    Quercus robur wfo-0000292858
#> 2 Pinus sylvestris Pinus sylvestris Pinus sylvestris wfo-0000481648
#> 3   Betula pendula   Betula pendula   Betula pendula wfo-0000335449
#>      accepted_id    rank     family   genus    epithet authorship
#> 1 wfo-0000292858 species   Fagaceae Quercus      robur         L.
#> 2 wfo-0000481648 species   Pinaceae   Pinus sylvestris         L.
#> 3 wfo-0000335449 species Betulaceae  Betula    pendula       Roth
#>   accepted_authorship is_synonym is_hybrid match_type fuzzy_dist is_ambiguous
#> 1                <NA>      FALSE     FALSE      exact         NA        FALSE
#> 2                <NA>      FALSE     FALSE      exact         NA        FALSE
#> 3                <NA>      FALSE     FALSE      exact         NA        FALSE
#>   ambiguous_targets backend         backbone_version kingdom_group taxon_group
#> 1              <NA>     wfo wfo:2024-12 (2026-06-24)       plantae  angiosperm
#> 2              <NA>     wfo wfo:2024-12 (2026-06-24)       plantae  gymnosperm
#> 3              <NA>     wfo wfo:2024-12 (2026-06-24)       plantae  angiosperm
#>    life_form qualifier qualifier_position seed_mass_mg
#> 1 angiosperm      <NA>               <NA>       3200.0
#> 2 gymnosperm      <NA>               <NA>          7.5
#> 3 angiosperm      <NA>               <NA>          0.2
#> Sources: WFO 2024-12, my_traits | cite() for full citations

The trait table used Quercus pedunculata and the result used Quercus robur; a plain merge() would have missed that row. add_data() joins on the accepted name, so it lines up.
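
The difference is easy to reproduce in base R. In this sketch the cover values are invented, and the one-entry accepted lookup is a hypothetical stand-in for the backbone; it only illustrates why joining on the accepted name keeps the row that a raw-string merge() loses.

```r
plots <- data.frame(
  species = c("Quercus robur", "Pinus sylvestris", "Betula pendula"),
  cover   = c(40, 25, 10)                      # invented values
)
traits <- data.frame(
  species      = c("Quercus pedunculata", "Pinus sylvestris", "Betula pendula"),
  seed_mass_mg = c(3200, 7.5, 0.2)
)

# Raw-string join: the synonym row disappears without a warning.
raw <- merge(plots, traits, by = "species")
nrow(raw)      # 2

# Hypothetical synonym lookup standing in for the backbone.
accepted <- c("Quercus pedunculata" = "Quercus robur")
resolve <- function(x) {
  hit <- x %in% names(accepted)
  x[hit] <- unname(accepted[x[hit]])
  x
}

# Join on the accepted name instead: all three rows line up.
traits$species <- resolve(traits$species)
joined <- merge(plots, traits, by = "species")
nrow(joined)   # 3
```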

taxify also ships:

published trait and status layers (add_woodiness(), add_eive(), add_diaz_traits(), add_conservation_status(), and more)
a fallback chain across backbones for animals, fungi, and marine taxa (backend = c("wfo", "col", "gbif"))

The next section lays out the full menu of these layers; the one after stacks three and runs a quick analysis. Full detail is in the enrichments and backends vignettes.

The enrichment layers

Those trait and status layers are just the start. The package ships the same join for many more datasets, across the tree of life and for the conservation and invasion records this group works with. Each add_*() matches its own source against the backbone and attaches on the accepted name, so any of them stacks into a pipeline the same way.

The current layers, grouped by what they cover:

Covers                  Layers
Plants                  add_woodiness(), add_eive(), add_diaz_traits(), add_leda(), add_wcvp(), add_glonaf()
Birds & mammals         add_avonet(), add_elton_traits(), add_pantheria()
Amphibians & reptiles   add_amphibio(), add_lizard_traits()
Fish                    add_fish_traits(), add_fishbase()
Arthropods              add_arthropod_traits(), add_leptraits()
Fungi                   add_fungal_traits(), add_funguild()
Algae                   add_algae_traits()
Cross-taxon animals     add_animaltraits(), add_anage()
Status, alien & names   add_conservation_status(), add_invasive_status(), add_alien_first_records(), add_common_names()

For invasion work the GloNAF, GRIIS, and alien-first-record layers attach naturalized status, invasive status, and first-record years on the same accepted name, so a resolved species list carries its invasion history without a second join.

Add published attributes, then test an idea

Pick any three of these and stack them. Woodiness (woody or herbaceous), the five EIVE ecological indicator values (Dengler et al. 2023), and plant height from the Diaz global trait dataset (Diaz et al. 2022) all attach to the same hundred-name field list from above; once attached, the table is ready for ordinary R statistics straight away:

dat <- field_res |>
  add_woodiness() |>
  add_eive() |>
  add_diaz_traits()

head(dat[, c("accepted_name", "woodiness", "eive_light", "eive_reaction",
             "eive_nutrients", "plant_height_m")], 10)

#>          accepted_name woodiness eive_light eive_reaction eive_nutrients
#> 1      Quercus petraea     woody   5.826045      4.792116       3.548459
#> 2     Pinus sylvestris     woody   7.101249      5.133071       2.691805
#> 3          Picea abies     woody   4.362537      4.243331       4.311589
#> 4       Betula pendula     woody   6.840881      4.384956       3.641275
#> 5  Acer pseudoplatanus     woody   3.798876      5.919569       6.891850
#> 6     Acer platanoides     woody   3.871987      6.337649       5.656169
#> 7       Acer campestre     woody   5.046531      6.804980       5.790993
#> 8     Corylus avellana     woody   5.411088      5.743586       5.677502
#> 9   Fraxinus excelsior     woody   4.523195      7.190141       6.980345
#> 10    Carpinus betulus     woody   3.652234      5.623408       5.340294
#>    plant_height_m
#> 1       31.438500
#> 2       19.027893
#> 3       40.688265
#> 4       12.024730
#> 5       24.462656
#> 6       21.925662
#> 7       12.217433
#> 8        3.929159
#> 9       23.115589
#> 10      16.718244

Three layers attached in one pipeline (only the first ten rows shown). A few species lack an EIVE or a height value, so those cells are NA; cor.test(), lm(), and aov() drop incomplete rows by default, so the columns are ready for the tests below.
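
Summary functions treat missing values differently from the modelling functions. A minimal check on invented numbers, using only base R: mean() and cor() return NA when any value is missing unless na.rm or use is set.

```r
x <- c(5.8, NA, 4.4, 6.8, 3.8)   # invented values, not EIVE data
y <- c(3.5, 2.7, NA, 3.6, 6.9)

m_default <- mean(x)                           # NA: one missing value is enough
m_narm    <- mean(x, na.rm = TRUE)             # mean of the four observed values
r_default <- cor(x, y)                         # NA under the default use = "everything"
r_cc      <- cor(x, y, use = "complete.obs")   # uses only rows complete in both

n_complete <- sum(complete.cases(x, y))        # rows that the second cor() call used
n_complete
```

Checking complete.cases() before a test also tells you how many species the result actually rests on.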

A correlation

Do species on more base-rich soils also sit higher on the nutrient axis? Two EIVE columns and one cor.test():

cor.test(dat$eive_reaction, dat$eive_nutrients)

#> 
#>  Pearson's product-moment correlation
#> 
#> data:  dat$eive_reaction and dat$eive_nutrients
#> t = 3.2378, df = 96, p-value = 0.001654
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.1230070 0.4821711
#> sample estimates:
#>       cor 
#> 0.3137695

The trend is positive and, across the 98 species that have both values (df = 96), significant, though modest in strength: species of base-rich soils tend to carry higher nutrient values.

A linear model

The same question as a regression returns the slope and the same p-value:

summary(lm(eive_nutrients ~ eive_reaction, data = dat))

#> 
#> Call:
#> lm(formula = eive_nutrients ~ eive_reaction, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.9207 -1.1559 -0.0833  1.2259  3.5102 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)     2.4460     0.9035   2.707  0.00803 **
#> eive_reaction   0.4816     0.1487   3.238  0.00165 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.709 on 96 degrees of freedom
#>   (2 observations deleted due to missingness)
#> Multiple R-squared:  0.09845,    Adjusted R-squared:  0.08906 
#> F-statistic: 10.48 on 1 and 96 DF,  p-value: 0.001654

Each unit of soil-reaction value adds about half a point on the nutrient axis, and the model explains roughly a tenth of the variance. With a single predictor, the slope’s p-value is the same as the correlation’s.
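
The link between the two outputs is arithmetic: with one predictor, R-squared is the squared correlation, and both tests share one t statistic. A short check on simulated data (not the EIVE values), followed by the same arithmetic on the correlation reported above:

```r
set.seed(1)                        # simulated data for illustration only
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

ct  <- cor.test(x, y)
fit <- summary(lm(y ~ x))

r2_from_cor <- unname(ct$estimate)^2
p_from_lm   <- fit$coefficients["x", "Pr(>|t|)"]

same_r2 <- isTRUE(all.equal(r2_from_cor, fit$r.squared))   # TRUE
same_p  <- isTRUE(all.equal(ct$p.value, p_from_lm))        # TRUE

# The correlation reported above, squared, gives the reported R-squared:
r2_reported <- 0.3137695^2         # about 0.0985
```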

An ANOVA

Woodiness is a grouping factor, so an analysis of variance asks whether woody and herbaceous species differ in height:

height <- dat[!is.na(dat$plant_height_m), ]
summary(aov(plant_height_m ~ woodiness, data = height))

#>             Df Sum Sq Mean Sq F value   Pr(>F)    
#> woodiness    1   4699    4699   81.74 1.85e-14 ***
#> Residuals   95   5461      57                     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tapply(height$plant_height_m, height$woodiness, mean)

#> herbaceous      woody 
#>  0.5024754 14.4593877

Woody species average about fourteen metres against half a metre for the herbs, a difference that is clearly significant here. The result is unsurprising for height; the same three lines work for any attribute the package can attach. A boxplot shows the split:

boxplot(plant_height_m ~ woodiness, data = height,
        ylab = "height (m)", xlab = "")