merge() on raw strings silently drops every
disagreeing row, so the matching has to come first.

taxify returns one standardized table from a character
vector: cleaned, matched against a backbone you keep on disk, synonyms
resolved. Matching runs in C, so no web services, no rate limits, and
the same input gives the same output on any machine.

Below: each job done the familiar way and with taxify, side by side. Everything runs offline against the staged backbone.
The first taxify() call downloads the WFO backbone once
(about 150 MB, roughly 1 GB once unpacked). After that the package never
touches the network. To stage it deliberately:
Hand taxify() a character vector. It cleans each name,
matches it, and resolves synonyms.
field_names <- c(
"Quercus robur L.", # authorship to strip
"Quercus robus", # typo
"cf. Betula pendula", # field qualifier
"FAGUS SYLVATICA", # caps
"Quercus pedunculata", # historical synonym of Q. robur
"Q. petraea", # abbreviated genus
"Pinus abies", # synonym of Picea abies (a different genus)
"Festuca rubrra", # typo
"Fallopia japonica", # synonym of Reynoutria japonica (invasive)
"Taraxacum officinale"
)
res <- taxify(field_names)
res[, c("input_name", "accepted_name", "family",
        "is_synonym", "match_type", "fuzzy_dist")]
#> input_name accepted_name family is_synonym match_type
#> 1 Quercus robur L. Quercus robur Fagaceae FALSE exact
#> 2 Quercus robus Quercus robur Fagaceae FALSE fuzzy
#> 3 cf. Betula pendula Betula pendula Betulaceae FALSE exact
#> 4 FAGUS SYLVATICA Fagus sylvatica Fagaceae FALSE exact_ci
#> 5 Quercus pedunculata Quercus robur Fagaceae TRUE exact
#> 6 Q. petraea Quercus petraea Fagaceae FALSE abbrev
#> 7 Pinus abies Picea abies Pinaceae TRUE exact
#> 8 Festuca rubrra Festuca rubra Poaceae FALSE fuzzy
#> 9 Fallopia japonica Reynoutria japonica Polygonaceae TRUE exact
#> 10 Taraxacum officinale Taraxacum officinale Asteraceae FALSE exact
#> fuzzy_dist
#> 1 NA
#> 2 0.07692308
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 NA
#> 7 NA
#> 8 0.07142857
#> 9 NA
#> 10 NA
Ten names, ten rows. The list is small on purpose: each name takes a different route, so the whole table fits on screen and you can read every match. The hundred-name run further down shows how the same call behaves at field size.
Each row carries, among other columns, the match type (exact,
exact_ci, fuzzy, abbrev, none).

The full table is 22 columns wide and holds the same shape for any input.
Each name takes a different route to its accepted name. Some lose
authorship or a cf., one matches only after case folding,
one is a typo caught by fuzzy matching, one is a synonym resolved to the
current name. The pipeline is the same for all of them; the route
through it is what differs.
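The cleaning steps named above (dropping authorship, stripping a cf., folding case) could look roughly like this in base R. This is a minimal sketch under stated assumptions: clean_name() is a hypothetical helper, not taxify's implementation, and it only handles plain binomials, so infraspecific ranks and hybrid markers would need more care.

```r
# Hypothetical helper for illustration; not part of taxify.
clean_name <- function(x) {
  x <- trimws(x)
  # Strip a leading field qualifier such as "cf." or "aff."
  x <- sub("^(cf|aff)\\.\\s+", "", x, ignore.case = TRUE)
  # Keep the first two words (genus + epithet); whatever follows is
  # treated as authorship and dropped.
  words <- strsplit(x, "\\s+")[[1]]
  words <- words[seq_len(min(2L, length(words)))]
  # Fold case: capitalized genus, lower-case epithet.
  genus <- paste0(toupper(substring(words[1], 1, 1)),
                  tolower(substring(words[1], 2)))
  if (length(words) == 1L) return(genus)
  paste(genus, tolower(words[2]))
}

cleaned <- vapply(
  c("Quercus robur L.", "cf. Betula pendula", "FAGUS SYLVATICA"),
  clean_name, character(1), USE.NAMES = FALSE
)
cleaned
```

All three inputs come out as plain binomials ready for an exact lookup; which route a name took would be recorded separately, as match_type does in the real table.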
What happened to each name:

- Quercus robur L. lost its authorship before matching.
- Quercus robus and Festuca rubrra were typos; the fuzzy pass corrected both, comparing only against other Quercus and Festuca names (genus-blocked), so a one-letter slip costs little.
- cf. Betula pendula lost the field qualifier.
- FAGUS SYLVATICA matched after case folding (exact_ci).
- Quercus pedunculata, Pinus abies, and Fallopia japonica resolved to their accepted names (Quercus robur, Picea abies, Reynoutria japonica) with is_synonym TRUE. The second crosses a genus boundary; the third resolves to the current name of a well-known invader.
- Q. petraea resolved to Quercus petraea on the genus initial plus epithet (match_type abbrev).

The fuzzy pass never compares a name against the whole backbone. It
blocks on genus first, so Quercus robus is scored only
against the other Quercus names, and a one-letter slip is found
in a handful of comparisons instead of across every name on disk.
The default threshold allows about one edit per five characters, so common typos resolve while genuinely different names do not. A large genus is split further: the block key is the genus plus the first two letters of the epithet, which cuts a 31,000-name genus like Hieracium into sub-blocks of roughly a thousand, with a genus-only pass behind it so quality is unchanged.
When the typo lands on the genus itself, the genus block never pulls the name, so a second pass re-blocks on the first two letters of the full string. A genus slip that keeps those two letters still resolves to the right name; one that changes them falls through to no match. This second pass runs on the WFO, Catalogue of Life, and GBIF backbones.
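The blocking and threshold logic described above can be sketched in base R. Treat it as an illustration under assumptions, not the package's C code: fuzzy_match() and the six-name backbone are hypothetical, the distance is plain Levenshtein from utils::adist(), and dividing by the longer string's length is an assumed scaling that happens to reproduce the fuzzy_dist values in the ten-name table.

```r
# Toy backbone; a hypothetical stand-in for the staged WFO table.
backbone <- c("Quercus robur", "Quercus rubra", "Quercus petraea",
              "Festuca rubra", "Festuca ovina", "Picea abies")

# Hypothetical helper: genus-blocked fuzzy lookup with a two-letter fallback.
fuzzy_match <- function(name, backbone, max_dist = 0.2) {
  none <- data.frame(input = name, match = NA_character_, dist = NA_real_)
  # First pass: compare only against names in the same genus.
  genus <- sub(" .*$", "", name)
  block <- backbone[startsWith(backbone, paste0(genus, " "))]
  # Second pass: if the genus block is empty (typo in the genus),
  # re-block on the first two letters of the full string.
  if (length(block) == 0L) {
    block <- backbone[startsWith(backbone, substr(name, 1, 2))]
  }
  if (length(block) == 0L) return(none)
  # Edit distance scaled by length (assumed normalization).
  d <- as.vector(utils::adist(name, block)) / pmax(nchar(name), nchar(block))
  best <- which.min(d)
  # About one edit per five characters is allowed.
  if (d[best] > max_dist) return(none)
  data.frame(input = name, match = block[best], dist = d[best])
}

hits <- do.call(rbind, lapply(
  c("Quercus robus", "Festuca rubrra", "Quercis robur", "Kuercus robur"),
  fuzzy_match, backbone = backbone
))
hits
```

The first two names resolve inside their genus block, the third resolves through the two-letter fallback, and the fourth changes the leading letters and falls through to no match.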
summary() prints a digest, which is the fastest way to
see whether a run went cleanly or something upstream needs
attention.
#> ── taxify results ────────────────────────────────────────────────────────────
#> backend: WFO v2024-12 | 10 names submitted
#>
#> matched 10 (exact: 6, case-insensitive: 1, fuzzy: 2, abbrev: 1)
#> ────────────────────────────────────────────────────────────
#> taxon groups: angiosperm: 8 gymnosperm: 1 unknown: 1
Every match runs against the local snapshot: the
backbone_version column records the exact WFO
release and download date, which drops into a methods section.

With the backbone already loaded (after the first call), a hundred-name field list resolves in a fraction of a second on this machine:
field <- c(
"Quercus petraea", "Pinus sylvestris", "Picea abies", "Betula pendula",
"Acer pseudoplatanus", "Acer platanoides", "Acer campestre", "Corylus avellana",
"Fraxinus excelsior", "Carpinus betulus", "Sorbus aucuparia", "Tilia cordata",
"Ulmus glabra", "Alnus glutinosa", "Salix caprea", "Populus tremula",
"Prunus avium", "Prunus spinosa", "Crataegus monogyna", "Sambucus nigra",
"Cornus sanguinea", "Viburnum opulus", "Euonymus europaeus", "Ligustrum vulgare",
"Frangula alnus", "Juniperus communis", "Taxus baccata", "Larix decidua",
"Abies alba", "Rosa canina", "Rubus idaeus", "Hedera helix",
"Clematis vitalba", "Berberis vulgaris", "Betula pubescens", "Prunus padus",
"Rhamnus cathartica", "Lonicera xylosteum", "Trifolium repens", "Trifolium pratense",
"Festuca ovina", "Dactylis glomerata", "Plantago lanceolata", "Plantago major",
"Plantago media", "Achillea millefolium", "Ranunculus acris", "Ranunculus repens",
"Urtica dioica", "Poa pratensis", "Poa annua", "Galium mollugo",
"Galium aparine", "Bellis perennis", "Cardamine pratensis", "Cirsium arvense",
"Cirsium vulgare", "Daucus carota", "Heracleum sphondylium", "Anthriscus sylvestris",
"Lotus corniculatus", "Medicago lupulina", "Vicia cracca", "Lathyrus pratensis",
"Stellaria media", "Silene dioica", "Silene vulgaris", "Geranium pratense",
"Geranium robertianum", "Glechoma hederacea", "Lamium album", "Prunella vulgaris",
"Ajuga reptans", "Veronica chamaedrys", "Rumex acetosa", "Rumex obtusifolius",
"Chenopodium album", "Capsella bursa-pastoris", "Senecio vulgaris", "Leucanthemum vulgare",
"Centaurea jacea", "Knautia arvensis", "Campanula rotundifolia", "Primula veris",
"Anemone nemorosa", "Filipendula ulmaria", "Lythrum salicaria",
# a few naturalized aliens this group works with
"Robinia pseudoacacia", "Solidago canadensis", "Solidago gigantea",
"Impatiens glandulifera", "Heracleum mantegazzianum", "Prunus serotina",
"Quercus rubra",
# and the same kinds of mess as the ten-name list, at scale
"Quercus robur L.", "FAGUS SYLVATICA", "Quercus robus",
"cf. Taraxacum officinale", "Quercus pedunculata", "Festuca rubrra"
)
t <- system.time(field_res <- taxify(field, verbose = FALSE))
cat(sprintf("%d names resolved in %.3f s on this machine\n",
            length(field), t[["elapsed"]]))
#> 100 names resolved in 0.110 s on this machine
At this size you stop reading the table row by row. The fastest check
is the digest from summary() (matched, and by which route);
then the only rows that need a human’s eye are the ones that did not
match exactly or that resolved through a synonym. Filtering to those
turns a hundred rows into a handful:
field_res[field_res$match_type != "exact" | field_res$is_synonym %in% TRUE,
          c("input_name", "accepted_name", "match_type",
            "is_synonym", "fuzzy_dist")]
#> input_name accepted_name match_type is_synonym fuzzy_dist
#> 85 Anemone nemorosa Anemonoides nemorosa exact TRUE NA
#> 96 FAGUS SYLVATICA Fagus sylvatica exact_ci FALSE NA
#> 97 Quercus robus Quercus robur fuzzy FALSE 0.07692308
#> 99 Quercus pedunculata Quercus robur exact TRUE NA
#> 100 Festuca rubrra Festuca rubra fuzzy FALSE 0.07142857
For a quick look at the resolved table itself,
head(field_res) shows the same 22 columns as the ten-name
run; the full object goes straight into the joins and statistics below
without ever being printed in full.
The package benchmark sizes that up. The comparison is with
WorldFlora’s WFO.match, which many in this room reach for on the same
task. Both run against a local copy and return the same matches; the
difference is where the matching happens. taxify scores names in C
against the compiled backbone, WorldFlora in R. On 1,000 plant names
with fuzzy matching on (Windows, R 4.5.2), the published numbers are:
Exact matching is close (0.1 s against 1.3 s); the gap opens on fuzzy matching, where the genus blocking above keeps taxify near a second while the in-R scan grows with the list. The full table and method are in the package README.
Once names resolve to an accepted name, any table keyed on species
joins cleanly. add_data() does the join:
my_traits <- data.frame(
species = c("Quercus pedunculata", # synonym of Q. robur
"Pinus sylvestris",
"Betula pendula"),
seed_mass_mg = c(3200, 7.5, 0.2)
)
taxify(c("Quercus robur", "Pinus sylvestris", "Betula pendula")) |>
  add_data(my_traits, species_col = "species")
#> input_name matched_name accepted_name taxon_id
#> 1 Quercus robur Quercus robur Quercus robur wfo-0000292858
#> 2 Pinus sylvestris Pinus sylvestris Pinus sylvestris wfo-0000481648
#> 3 Betula pendula Betula pendula Betula pendula wfo-0000335449
#> accepted_id rank family genus epithet authorship
#> 1 wfo-0000292858 species Fagaceae Quercus robur L.
#> 2 wfo-0000481648 species Pinaceae Pinus sylvestris L.
#> 3 wfo-0000335449 species Betulaceae Betula pendula Roth
#> accepted_authorship is_synonym is_hybrid match_type fuzzy_dist is_ambiguous
#> 1 <NA> FALSE FALSE exact NA FALSE
#> 2 <NA> FALSE FALSE exact NA FALSE
#> 3 <NA> FALSE FALSE exact NA FALSE
#> ambiguous_targets backend backbone_version kingdom_group taxon_group
#> 1 <NA> wfo wfo:2024-12 (2026-06-24) plantae angiosperm
#> 2 <NA> wfo wfo:2024-12 (2026-06-24) plantae gymnosperm
#> 3 <NA> wfo wfo:2024-12 (2026-06-24) plantae angiosperm
#> life_form qualifier qualifier_position seed_mass_mg
#> 1 angiosperm <NA> <NA> 3200.0
#> 2 gymnosperm <NA> <NA> 7.5
#> 3 angiosperm <NA> <NA> 0.2
#> Sources: WFO 2024-12, my_traits | cite() for full citations
The trait table used Quercus pedunculata and the result used
Quercus robur; a plain merge() would have missed
that row. add_data() joins on the accepted name, so it
lines up.
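The dropped row is easy to reproduce in base R alone. In this sketch the one-entry synonym lookup and the resolve() helper are hypothetical stand-ins for the backbone; only merge() is real. It shows why the join has to happen on accepted names, and makes no claim about how add_data() works inside.

```r
# The field list uses the current name; the trait table uses a synonym.
plots <- data.frame(
  species = c("Quercus robur", "Pinus sylvestris", "Betula pendula")
)
traits <- data.frame(
  species      = c("Quercus pedunculata", "Pinus sylvestris", "Betula pendula"),
  seed_mass_mg = c(3200, 7.5, 0.2)
)

# Join on raw strings: the oak has no partner and vanishes without a warning.
raw_join <- merge(plots, traits, by = "species")
nrow(raw_join)    # 2

# Hypothetical synonym lookup standing in for the backbone.
accepted <- c("Quercus pedunculata" = "Quercus robur")
resolve <- function(x) {
  hit <- x %in% names(accepted)
  x[hit] <- accepted[x[hit]]
  x
}

# Resolve both sides first, then join on the accepted name.
plots$accepted_name  <- resolve(plots$species)
traits$accepted_name <- resolve(traits$species)
clean_join <- merge(plots, traits[, c("accepted_name", "seed_mass_mg")],
                    by = "accepted_name")
nrow(clean_join)  # 3
```

Passing all.x = TRUE to merge() would keep the oak row but leave its seed mass NA, which hides the problem instead of fixing it.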
taxify also ships:

- enrichment layers (add_woodiness(), add_eive(), add_diaz_traits(), add_conservation_status(), and more)
- a choice of backbones (backend = c("wfo", "col", "gbif"))

The next section lays out the full menu of these layers; the one after stacks three and runs a quick analysis. Full detail is in the enrichments and backends vignettes.
Those trait and status layers are just the start. The package ships
the same join for many more datasets, across the tree of life and for
the conservation and invasion records this group works with. Each
add_*() matches its own source against the backbone and
attaches on the accepted name, so any of them stacks into a pipeline the
same way.
The current layers, grouped by what they cover:
| covers | layers |
|---|---|
| Plants | add_woodiness(), add_eive(), add_diaz_traits(), add_leda(), add_wcvp(), add_glonaf() |
| Birds & mammals | add_avonet(), add_elton_traits(), add_pantheria() |
| Amphibians & reptiles | add_amphibio(), add_lizard_traits() |
| Fish | add_fish_traits(), add_fishbase() |
| Arthropods | add_arthropod_traits(), add_leptraits() |
| Fungi | add_fungal_traits(), add_funguild() |
| Algae | add_algae_traits() |
| Cross-taxon animals | add_animaltraits(), add_anage() |
| Status, alien & names | add_conservation_status(), add_invasive_status(), add_alien_first_records(), add_common_names() |
For invasion work the GloNAF, GRIIS, and alien-first-record layers attach naturalized status, invasive status, and first-record years on the same accepted name, so a resolved species list carries its invasion history without a second join.
Pick any three of these and stack them. Woodiness (woody or herbaceous), the five EIVE ecological indicator values (Dengler et al. 2023), and plant height from the Diaz global trait dataset (Diaz et al. 2022) all attach to the same hundred-name field list from above; once attached, the table is ready for ordinary R statistics straight away:
dat <- field_res |>
  add_woodiness() |>
  add_eive() |>
  add_diaz_traits()
head(dat[, c("accepted_name", "woodiness", "eive_light", "eive_reaction",
             "eive_nutrients", "plant_height_m")], 10)
#> accepted_name woodiness eive_light eive_reaction eive_nutrients
#> 1 Quercus petraea woody 5.826045 4.792116 3.548459
#> 2 Pinus sylvestris woody 7.101249 5.133071 2.691805
#> 3 Picea abies woody 4.362537 4.243331 4.311589
#> 4 Betula pendula woody 6.840881 4.384956 3.641275
#> 5 Acer pseudoplatanus woody 3.798876 5.919569 6.891850
#> 6 Acer platanoides woody 3.871987 6.337649 5.656169
#> 7 Acer campestre woody 5.046531 6.804980 5.790993
#> 8 Corylus avellana woody 5.411088 5.743586 5.677502
#> 9 Fraxinus excelsior woody 4.523195 7.190141 6.980345
#> 10 Carpinus betulus woody 3.652234 5.623408 5.340294
#> plant_height_m
#> 1 31.438500
#> 2 19.027893
#> 3 40.688265
#> 4 12.024730
#> 5 24.462656
#> 6 21.925662
#> 7 12.217433
#> 8 3.929159
#> 9 23.115589
#> 10 16.718244
Three layers attached in one pipeline (only the first ten rows
shown). A few species lack an EIVE or a height value, so those cells are
NA; the test and model functions used below drop incomplete rows on
their own, so the columns are ready to use.
Do species on more base-rich soils also sit higher on the nutrient
axis? Two EIVE columns and one cor.test():
#>
#> Pearson's product-moment correlation
#>
#> data: dat$eive_reaction and dat$eive_nutrients
#> t = 3.2378, df = 96, p-value = 0.001654
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.1230070 0.4821711
#> sample estimates:
#> cor
#> 0.3137695
The trend is positive and, across the 98 species with both values, significant, though modest in strength: base-rich soils tend to carry higher nutrient values.
The same question as a regression returns the slope and the same p-value:
#>
#> Call:
#> lm(formula = eive_nutrients ~ eive_reaction, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4.9207 -1.1559 -0.0833 1.2259 3.5102
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.4460 0.9035 2.707 0.00803 **
#> eive_reaction 0.4816 0.1487 3.238 0.00165 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.709 on 96 degrees of freedom
#> (2 observations deleted due to missingness)
#> Multiple R-squared: 0.09845, Adjusted R-squared: 0.08906
#> F-statistic: 10.48 on 1 and 96 DF, p-value: 0.001654
Each unit of soil-reaction value adds about half a point on the nutrient axis, and the model explains roughly a tenth of the variance. For a single predictor the model’s p-value matches the correlation’s.
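That the two p-values agree is easy to confirm on simulated data. The sketch below uses made-up numbers, not the EIVE values, and the column names reaction and nutrients are placeholders: it runs cor.test() and lm() on the same pair of columns, with two missing values, and compares what each returns.

```r
set.seed(1)
n <- 100
# Simulated stand-ins for two indicator columns (not real EIVE data).
toy <- data.frame(reaction = runif(n, min = 2, max = 9))
toy$nutrients <- 2.5 + 0.5 * toy$reaction + rnorm(n, sd = 1.7)
toy$nutrients[c(3, 57)] <- NA          # two species without a value

ct  <- cor.test(toy$reaction, toy$nutrients)
fit <- lm(nutrients ~ reaction, data = toy)

slope_p <- summary(fit)$coefficients["reaction", "Pr(>|t|)"]

# Same test in two forms: identical p-value, and R-squared equals r squared.
same_p  <- all.equal(unname(ct$p.value), unname(slope_p))
same_r2 <- all.equal(summary(fit)$r.squared, unname(ct$estimate)^2)

c(df = unname(ct$parameter), n_used = nobs(fit))
```

Both functions drop the two incomplete rows without being asked, which is why the degrees of freedom come out at 96 from 100 rows.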
Woodiness is a grouping factor, so an analysis of variance asks whether woody and herbaceous species differ in height:
#> Df Sum Sq Mean Sq F value Pr(>F)
#> woodiness 1 4699 4699 81.74 1.85e-14 ***
#> Residuals 95 5461 57
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> herbaceous woody
#> 0.5024754 14.4593877
Woody species average about fourteen metres against half a metre for the herbs, a difference that is clearly significant here. The result is unsurprising for height; the same three lines work for any attribute the package can attach. A boxplot shows the split:
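The calls behind the height comparison are not printed above, so the following is an assumption about their general shape and not the vignette's own code. It runs on ten invented heights with toy column names: aov() for the test and tapply() for the group means.

```r
# Invented heights for illustration only.
toy <- data.frame(
  woodiness = rep(c("woody", "herbaceous"), each = 5),
  height_m  = c(31, 19, 41, 12, NA, 0.3, 0.5, 0.8, 0.2, 0.6)
)

# One-way analysis of variance; the row with a missing height is dropped.
fit <- aov(height_m ~ woodiness, data = toy)
summary(fit)

# Group means; mean() needs na.rm = TRUE, unlike the model functions.
means <- tapply(toy$height_m, toy$woodiness, mean, na.rm = TRUE)
means
```

A call such as boxplot(height_m ~ woodiness, data = toy) would draw the split for the same toy table.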