Taxonomic name matching is rarely the end goal. Once
taxify() has resolved a list of species names against a
backbone, the next step is usually joining ecological trait data,
conservation assessments, or geographic range information to the matched
names. This step is where the real analytical value emerges, and it is
also where most workflows hit friction: trait databases use different
taxonomic authorities, store names with or without authorship strings,
treat synonyms inconsistently, and distribute data in incompatible
formats. Manually aligning names between a taxonomic backbone and a
trait database can consume hours even for moderately sized species
lists.
taxify ships with 22 enrichment layers that attach published trait
and status datasets to a taxify() result in a single pipe
call. Each enrichment is backed by a pre-built .vtr file
that downloads automatically on first use and caches locally for all
subsequent sessions. The enrichment system handles the name alignment
problem at build time, so the join at analysis time is a simple, fast,
exact-match operation.
This vignette covers the mechanics of how enrichments work, walks
through each of the 22 layers with worked examples, and discusses
practical strategies for combining enrichments, interpreting coverage
gaps, and choosing the right layers for a given taxon group. We also
discuss the add_data() function for joining custom datasets
that go beyond the built-in enrichments.
How enrichments work
Every add_*() function performs the same underlying
operation: a left join between the accepted_name column in
a taxify() result and the canonical_name
column in an enrichment .vtr file. Because the join key is
the accepted (resolved) name rather than the original input, synonyms
that were resolved during matching contribute automatically. If the
input contained “Pinus abies” and the backbone resolved it to “Picea
abies”, the enrichment join looks up “Picea abies” in the trait
database. This means we never have to worry about whether our species
list uses currently accepted names or outdated synonyms: the backbone
resolution step has already normalized everything.
This design has a deliberate consequence: enrichments only produce
values for rows that were successfully matched by taxify().
Rows where accepted_name is NA (unmatched
names) always receive NA in all enrichment columns. If a
species could not be resolved against the backbone, it cannot be looked
up in a trait database either. This is usually the correct behavior, but
it means that improving match rates upstream (by cleaning names, trying
a different backbone, or enabling fuzzy matching) directly improves
enrichment coverage downstream.
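The join semantics described above can be sketched in base R. This is a minimal illustration using mock data frames, not taxify's internal code: the column names accepted_name and canonical_name come from the text, while the rows and trait values are made up for the example.

```r
# Mock taxify() result: one synonym resolved, one name left unmatched.
result <- data.frame(
  input_name    = c("Pinus abies", "Picea abies", "Notareal species"),
  accepted_name = c("Picea abies", "Picea abies", NA),
  stringsAsFactors = FALSE
)

# Mock enrichment table keyed on canonical_name (illustrative values).
enrichment <- data.frame(
  canonical_name = c("Picea abies", "Quercus robur"),
  woodiness      = c("woody", "woody"),
  stringsAsFactors = FALSE
)

# Left join: every result row is kept; an NA or unknown key yields NA.
idx <- match(result$accepted_name, enrichment$canonical_name)
result$woodiness <- enrichment$woodiness[idx]
result
```

Both the synonym row and the accepted-name row pick up the same trait value, and the unmatched row stays NA.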
The join in detail
When we call an enrichment function, taxify executes the following steps:
1. Ensures the enrichment .vtr file is present on disk (downloading it if needed).
2. Extracts the unique accepted_name values from the result.
3. Writes those unique names into a temporary .vtr file.
4. Performs a vectra inner_join() between the temporary names and the enrichment .vtr on the name column.
5. Uses a vectorized match() to fill the new trait columns back into the original result. Rows without a match receive NA.
The operation is fast because it reduces to a single hash-based
lookup per unique accepted name, not per row. A result with 50,000 rows
but 8,000 unique accepted names only does 8,000 lookups. The vectra join
exploits hash indexes on the canonical_name column in the
enrichment .vtr, making even enrichments with hundreds of
thousands of rows resolve in under a second.
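The "look up each unique name once, then fill back" pattern can be approximated in base R. The sketch below assumes merge() as a stand-in for the vectra join and uses mock data; it illustrates the idea and is not the package's implementation.

```r
# 50 rows, but only three distinct accepted names (plus unmatched NAs).
accepted <- rep(c("Picea abies", "Quercus robur", "Fagus sylvatica", NA),
                times = c(20, 15, 10, 5))

enrichment <- data.frame(
  canonical_name = c("Picea abies", "Quercus robur"),
  plant_height_m = c(40, 25),          # illustrative values
  stringsAsFactors = FALSE
)

# Look up each unique, non-missing name once (inner-join semantics).
uniq <- unique(accepted[!is.na(accepted)])
hits <- merge(data.frame(canonical_name = uniq, stringsAsFactors = FALSE),
              enrichment, by = "canonical_name")

# A vectorized match() spreads the joined values back over all rows.
filled <- hits$plant_height_m[match(accepted, hits$canonical_name)]
length(uniq)   # number of lookups, independent of the row count
```

The join touches three names rather than fifty rows; names absent from the enrichment and unmatched rows both end up as NA.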
Cross-backbone name resolution
A subtle but important design decision underlies the enrichment
.vtr files: they are built to work with any of taxify’s
seven backends (WFO, COL, GBIF, ITIS, NCBI, OTT, WoRMS). Different
backbones sometimes accept different names for the same taxon. WFO might
accept “Senecio jacobaea” while COL accepts “Jacobaea vulgaris” for the
same species. If the enrichment .vtr only contained one of
these names, it would fail to match results from the other backbone.
The taxifydb build pipeline solves this by resolving every source
species name against each of the seven backends separately (not as a
fallback chain, which would only return the first match). The union of
all unique accepted_name values is collected per source
species. Each source row is then expanded: one enrichment row per
distinct accepted name, with the trait data duplicated. The final
.vtr is then deduplicated by canonical_name
(plus any group column for grouped enrichments).
In practice, backends agree on more than 90% of names, so this
expansion is modest (typically 1.1–1.5x the original row count). The
result is that add_conservation_status() works identically
whether the upstream taxify() call used WFO, COL, or GBIF.
We do not have to pick enrichments based on which backbone we used, and
we do not have to worry about backbone-specific name variants falling
through the cracks.
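A toy version of the expansion step makes the mechanics concrete. The per-backend resolutions below are hard-coded, hypothetical stand-ins for what the build pipeline would obtain from each backbone, and the trait values are illustrative.

```r
# One source row per species.
source_traits <- data.frame(
  source_name = c("Senecio jacobaea", "Quercus robur"),
  trait       = c("herbaceous", "woody"),
  stringsAsFactors = FALSE
)

# Hypothetical accepted names returned by two backends, in source-row order.
resolved <- list(
  WFO = c("Senecio jacobaea", "Quercus robur"),
  COL = c("Jacobaea vulgaris", "Quercus robur")
)

# Expand to one row per (source row, backend), duplicating the trait data.
expanded <- do.call(rbind, lapply(resolved, function(acc) {
  data.frame(canonical_name = acc, trait = source_traits$trait,
             stringsAsFactors = FALSE)
}))

# Deduplicate by canonical_name to get the final enrichment table.
expanded   <- expanded[!is.na(expanded$canonical_name), ]
enrichment <- expanded[!duplicated(expanded$canonical_name), ]
rownames(enrichment) <- NULL
nrow(enrichment) / nrow(source_traits)   # expansion factor
```

Both "Senecio jacobaea" and "Jacobaea vulgaris" now carry the same trait value, so a result from either backbone finds a match.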
Automatic download and caching
The first time an enrichment is requested in a session, taxify checks
whether a local copy exists on disk. If not, it downloads the pre-built
.vtr from GitHub Releases using the URL recorded in the
package manifest (inst/manifest.json). A
meta.json sidecar file is written alongside the
.vtr, recording the version string, whether the dataset is
static, and the download date. On subsequent calls within the same R
session, the file path is served from an in-memory cache (a
package-level environment), so the disk is not even touched. Across
sessions, the on-disk copy is reused without any network request.
For enrichments marked as “static” in the manifest (version-locked
datasets like Zanne et al. 2014 or PanTHERIA), version checks are
skipped entirely. These datasets have fixed, published versions that
will never change. For non-static enrichments (IUCN Red List, GRIIS,
WCVP, common names), taxify performs a lightweight version check once
per session by comparing the local meta.json version
against the manifest’s latest field. If a newer version is
available, it is downloaded automatically with a console message. This
check adds negligible latency because the manifest itself is cached.
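The update decision can be sketched as a small predicate. The helper name needs_update() is hypothetical, and the sketch assumes version strings are dotted numbers that utils::compareVersion() can order; the real package may compare versions differently.

```r
# Hypothetical helper: should a cached enrichment be refreshed?
# `meta` mimics the meta.json sidecar (version, static flag).
needs_update <- function(meta, manifest_latest) {
  if (isTRUE(meta$static)) return(FALSE)   # static datasets are never checked
  # Non-static: refresh only when the manifest version is strictly newer.
  utils::compareVersion(manifest_latest, meta$version) > 0
}

needs_update(list(version = "1.0",    static = TRUE),  "1.1")
needs_update(list(version = "2024.2", static = FALSE), "2025.1")
needs_update(list(version = "2025.1", static = FALSE), "2025.1")
```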
Fallback chain
If the pre-built .vtr download fails (network issues,
mirror outage, transient server errors), taxify does not stop
immediately. Instead, it attempts to build the enrichment from the
original source data. Each of the 22 enrichments has a build recipe in
an internal registry (.enrichment_build_registry) that
knows how to download the raw CSV, ZIP, or API response from the
upstream source, parse it into a data.frame with a
canonical_name column, and produce the .vtr
file locally. This build-from-source path is slower (it has to download
and parse raw data rather than a pre-built binary), but it means that
enrichments remain available even if the GitHub Releases mirror is
temporarily down.
If the build-from-source path also fails (e.g., because the .vtr file
cannot be written to disk), taxify falls back one more level to an “emergency
fallback”: it downloads and parses the source data in memory without
writing to disk, performs the join in-memory using a data.frame rather
than a .vtr, and issues a warning explaining the situation.
This emergency result is ephemeral and not cached. If all three paths
fail, an error is raised with a link to the GitHub issue tracker so the
failure can be reported.
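The three-level chain maps naturally onto a loop over tryCatch() calls. Everything in this sketch is hypothetical: the three path functions are stubs that simulate failures, and resolve_enrichment() is an illustrative name, not a taxify function.

```r
# Stubs standing in for the three resolution paths.
download_prebuilt   <- function(name) stop("mirror unavailable")
build_from_source   <- function(name) stop("build failed")
emergency_in_memory <- function(name) {
  data.frame(canonical_name = "Picea abies", woodiness = "woody",
             stringsAsFactors = FALSE)
}

# Try each path in order; warn on the emergency path; error if all fail.
resolve_enrichment <- function(name, paths) {
  for (label in names(paths)) {
    out <- tryCatch(paths[[label]](name), error = function(e) NULL)
    if (!is.null(out)) {
      if (label == "emergency") warning("using uncached in-memory fallback")
      return(out)
    }
  }
  stop("all enrichment paths failed for '", name, "'")
}

paths <- list(prebuilt = download_prebuilt, source = build_from_source,
              emergency = emergency_in_memory)
res <- suppressWarnings(resolve_enrichment("woodiness", paths))
```

Keeping the paths in an ordered list makes the precedence explicit and keeps the final error message in one place.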
The enrichment data directory
All enrichment .vtr files live under a single root
directory, organized by enrichment name and version:
taxify_data_dir()/
  enrichment/
    conservation_status/
      latest/
        conservation_status.vtr
        meta.json
    griis/
      latest/
        griis.vtr
        meta.json
    woodiness/
      latest/
        woodiness.vtr
        meta.json
    ...
The taxify_data_dir() function returns the
platform-appropriate data directory (typically
~/.local/share/taxify on Linux/macOS or
%LOCALAPPDATA%/taxify on Windows). This directory is also
where backbone .vtr files are stored, so a single
taxify_data_dir() call reveals where all taxify data lives
on the system. Enrichment files are modest in size: most are between 1
and 20 MB. The full set of 22 enrichments totals roughly 150-200 MB.
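Given that layout, the expected location of any enrichment file can be derived from its name. The helper below is a hypothetical convenience for checking whether a file is already cached; it uses a temporary directory as a stand-in for the root that taxify_data_dir() would return.

```r
# Hypothetical helper mirroring the directory layout shown above.
enrichment_path <- function(root, name) {
  file.path(root, "enrichment", name, "latest", paste0(name, ".vtr"))
}

root <- tempdir()   # stand-in for the real data directory
path <- enrichment_path(root, "griis")
file.exists(path)   # TRUE only once the enrichment has been downloaded
```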
Discovering enrichments
Before applying enrichments, we may want to see what is available.
list_enrichments() queries the taxify manifest and returns
a data.frame summarizing every available enrichment layer. The returned
columns are: name, version, nrow
(approximate row count), static (whether the dataset is
version-locked), trait_cols (comma-separated list of trait
column names), and source_url (the upstream data
source).
library(taxify)
list_enrichments()
#> name version nrow static trait_cols ...
#> 1 conservation_status ... 166000 FALSE conservation_status ...
#> 2 griis ... 23000 FALSE invasive_status ...
#> 3 wcvp ... 340000 FALSE native_status ...
#> ...
The static column is worth paying attention to. Static
enrichments (woodiness, PanTHERIA, AmphiBIO, EltonTraits, LEDA, Diaz
traits, FungalTraits, FUNGuild, AlgaeTraits, FISHMORPH, Meiri lizard
traits, LepTraits, AnimalTraits, NW European Arthropods) are based on
published, version-locked datasets that have a single definitive
release. These never trigger version checks, so they add zero network
overhead to a session. Non-static enrichments (conservation_status,
GRIIS, WCVP, common_names) are periodically updated as the upstream
source publishes new releases. For these, taxify checks once per session
whether a newer version is available and updates transparently if
so.
The nrow column gives a rough sense of enrichment size.
Conservation status has ~166,000 rows (one per assessed species), WCVP
has ~340,000 (one per species-region combination), and the smaller
enrichments like LEDA have ~8,000. These numbers include the
cross-backbone name expansion discussed earlier, so they are slightly
larger than the original source row counts.
The trait_cols column lists the columns that the
enrichment adds to a result. This is useful for planning which
enrichments to apply: if we need specific leaf area data, scanning the
trait_cols column reveals that LEDA provides
sla_mm2_mg. If we need diet composition data, the
trait_cols for elton_traits lists all 18 diet,
foraging, mass, and nocturnality columns. The source_url
column points to the original data source (Zenodo, Figshare, Dryad,
GBIF, etc.) for reference and citation.
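Because trait_cols is a comma-separated string, finding the enrichment that supplies a given column takes a small amount of string handling. The sketch below works on a mock table with the same column names; the rows are illustrative and provides() is a hypothetical helper.

```r
# Mock of the summary table (illustrative rows only).
enrichments <- data.frame(
  name       = c("leda", "elton_traits", "woodiness"),
  static     = c(TRUE, TRUE, TRUE),
  trait_cols = c("raunkiaer_life_form,sla_mm2_mg,seed_mass_mg",
                 "diet_inv,diet_fruit,foraging_canopy,nocturnal",
                 "woodiness"),
  stringsAsFactors = FALSE
)

# Which enrichments provide a given trait column?
provides <- function(tbl, col) {
  cols <- strsplit(tbl$trait_cols, ",", fixed = TRUE)
  tbl$name[vapply(cols, function(x) col %in% trimws(x), logical(1))]
}

provides(enrichments, "sla_mm2_mg")
provides(enrichments, "nocturnal")
```

Splitting on commas and testing exact membership avoids the false positives a plain substring search would give (for example, "seed_mass" matching unrelated columns).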
Pre-downloading enrichments
For workflows that run on computing clusters, in Docker containers,
or in any environment without reliable internet access, we can
pre-download enrichments before the analysis begins. The
taxify_download_enrichment() function accepts a character
vector of enrichment names and downloads each one to the local data
directory.
# Download a single enrichment
taxify_download_enrichment("conservation_status")
# Download several at once
taxify_download_enrichment(c("woodiness", "eive", "leda"))
# Download all of them
taxify_download_enrichment(c(
"conservation_status", "griis", "wcvp", "eive",
"elton_traits", "avonet", "pantheria", "amphibio",
"common_names", "woodiness", "diaz_traits", "leda",
"fungal_traits", "funguild", "algae_traits",
"fish_traits", "fishbase", "lizard_traits", "anage", "glonaf",
"leptraits", "animaltraits", "arthropod_traits", "alien_first_records",
"baseflor", "ecoflora", "floraweb"
))
After this step, all subsequent add_*() calls for these
enrichments will use the local copies without any network access. This
is particularly useful for reproducible pipelines: pre-downloading
enrichments at setup time guarantees that the analysis always uses the
same version of each dataset, regardless of whether newer versions are
published in the meantime.
The download function prints a confirmation message with the version
and file size for each enrichment. If an enrichment is already present
at the requested version, it is not re-downloaded. The .vtr
files live in taxify_data_dir()/enrichment/{name}/latest/
alongside their meta.json sidecar.
Pre-downloading is also useful for teaching and workshop settings
where many participants share a slow network connection. One person can
download the enrichments, copy the taxify_data_dir()
contents to a shared drive or USB stick, and distribute it to all
participants. Since the enrichment lookup path starts with the on-disk
check, the copied files will be found immediately without any network
access.
Simple enrichments
Simple enrichments add one or more columns via a flat join on
accepted_name. Eighteen of the twenty-two enrichment layers
use this pattern. They differ only in which columns they add and which
taxonomic groups they cover. We group them below by taxon focus,
starting with plants (which have the most enrichment layers), then
conservation status (cross-taxon), birds, mammals, amphibians,
vertebrates, fungi, algae, fish, reptiles, butterflies, and
arthropods.
Plant enrichments
Plants are the best-served taxon group in the enrichment system, with dedicated layers covering growth form, ecological niches, seed and height traits, a broad suite of functional traits, and regional trait compilations for the British, French, and German floras. This reflects the state of published plant trait databases: decades of investment in standardized trait measurement protocols have produced several large, open-access datasets that are straightforward to integrate.
Woodiness (Zanne et al. 2014)
The woodiness enrichment classifies ~50,000 plant species as woody, herbaceous, or variable. The dataset comes from Zanne et al. (2014), a landmark study on the radiation of angiosperms into freezing environments, published in Nature. The underlying classification draws on the world’s major herbarium and botanical databases.
plants <- taxify(c(
"Quercus robur", "Betula pendula", "Arrhenatherum elatius",
"Festuca rubra", "Salix caprea", "Cornus sanguinea"
))
plants |> add_woodiness()
#> input_name accepted_name woodiness
#> 1 Quercus robur Quercus robur woody
#> 2 Betula pendula Betula pendula woody
#> 3 Arrhenatherum elatius Arrhenatherum elatius herbaceous
#> 4 Festuca rubra Festuca rubra herbaceous
#> 5 Salix caprea Salix caprea woody
#> 6 Cornus sanguinea Cornus sanguinea woody
The three possible values are "woody",
"herbaceous", and "variable". The
"variable" category applies to species that exhibit both
growth forms depending on environmental conditions or ecotype. Coverage
is strongest for angiosperms (both monocots and dicots) and weaker for
ferns, lycophytes, and bryophytes. The dataset is static (CC0 license,
published 2014), so it never triggers version checks.
Woodiness is a coarse trait, but it is one of the most widely used in community ecology and macroecology. It separates plant strategies along a fundamental axis (persistent woody stems vs. annual or perennial herbaceous growth), making it valuable for community-weighted mean analyses, functional diversity indices, and biome classification.
EIVE ecological indicator values (Dengler et al. 2023)
EIVE 1.0 provides continuous ecological indicator values for ~14,500 European vascular plant species. It supersedes the classic ordinal Ellenberg indicator values, which were expert-assigned integers on a 1-9 scale, with statistically derived continuous scores based on species co-occurrence patterns across thousands of vegetation plots. Five niche axes are covered: light, temperature, moisture, soil reaction (pH), and nutrients.
grasses <- taxify(c(
"Arrhenatherum elatius", "Bromus erectus", "Festuca rubra",
"Dactylis glomerata", "Lolium perenne", "Poa pratensis"
))
grasses |> add_eive()
#> input_name eive_light eive_temperature eive_moisture eive_reaction eive_nutrients
#> 1 Arrhenatherum ... 7.2 5.8 4.3 7.1 6.5
#> 2 Bromus erectus 7.6 5.5 3.1 7.8 3.2
#> ...
Because EIVE is restricted to the European flora, species from other
continents will receive NA in all five columns. The
continuous values are on a scale comparable to the original Ellenberg
system (roughly 1-9) but allow fractional positions. This matters for
community-weighted mean (CWM) calculations: averaging ordinal Ellenberg
values treats the intervals between categories as equal (the difference
between 3 and 4 is the same as between 7 and 8), which is not
guaranteed. EIVE’s continuous scale makes CWM calculations statistically
cleaner. The five output columns are prefixed with eive_ to
avoid collision with columns from other sources:
eive_light, eive_temperature,
eive_moisture, eive_reaction, and
eive_nutrients.
The EIVE dataset is licensed under CC BY 4.0 and published on Zenodo. It is classified as static in the taxify manifest because its version (1.0) is a fixed publication. The reference is Dengler et al. (2023), Vegetation Classification and Survey 4:7-29.
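A community-weighted mean on the continuous EIVE scale is a one-liner once the columns are attached. The sketch below uses a mock plot with cover values invented for illustration; the two moisture values are taken from the example output above, and the third species stands for a taxon outside EIVE coverage.

```r
# Mock plot data: cover (%) plus an already-attached eive_moisture column.
plot_data <- data.frame(
  accepted_name = c("Arrhenatherum elatius", "Bromus erectus", "Exotic sp."),
  cover         = c(60, 30, 10),
  eive_moisture = c(4.3, 3.1, NA),
  stringsAsFactors = FALSE
)

# CWM over species with a value; report how much cover that represents.
ok      <- !is.na(plot_data$eive_moisture)
cwm     <- weighted.mean(plot_data$eive_moisture[ok], plot_data$cover[ok])
covered <- sum(plot_data$cover[ok]) / sum(plot_data$cover)
c(cwm = cwm, covered = covered)
```

Reporting the covered fraction alongside the CWM makes it clear how much of the community the indicator value actually describes.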
Diaz traits (Diaz et al. 2022)
The Diaz enrichment provides two key functional traits from the TRY database consortium: seed mass in milligrams and plant height in metres. These are species-level means compiled from thousands of individual measurements across multiple primary sources. Coverage spans ~46,000 plant species globally, making it one of the broader trait datasets available for plants.
trees <- taxify(c(
"Quercus robur", "Fagus sylvatica", "Picea abies",
"Pinus sylvestris", "Acer pseudoplatanus"
))
trees |> add_diaz_traits()
#> input_name seed_mass_mg plant_height_m
#> 1 Quercus robur 3200.0 25.0
#> 2 Fagus sylvatica 2200.0 30.0
#> 3 Picea abies 7.9 40.0
#> 4 Pinus sylvestris 6.5 25.0
#> 5 Acer pseudoplatanus 120.0 25.0
Seed mass and plant height sit on the two most important axes of the global spectrum of plant form and function described by Diaz et al. (2016, Nature 529:167-171). Seed mass captures the offspring size / offspring number trade-off (small-seeded species produce many propagules, large-seeded species invest in fewer, better-provisioned offspring). Plant height captures the competitive strategy axis (tall species intercept more light but invest more in structural tissue). Combining these two traits with EIVE or LEDA columns produces a reasonably complete functional characterization for European temperate species.
The output columns are seed_mass_mg and
plant_height_m. Both are numeric (NA_real_ for missing
values). The dataset is licensed under CC BY 3.0 and distributed via the
TRY File Archive.
LEDA Traitbase (Kleyer et al. 2008)
LEDA covers ~8,000 NW European plant species with 10 trait columns spanning life form, dispersal, seed, leaf, and clonality dimensions. It is one of the more column-rich plant enrichments, providing a broad functional profile in a single call.
meadow_spp <- taxify(c(
"Arrhenatherum elatius", "Trifolium pratense",
"Leucanthemum vulgare", "Plantago lanceolata",
"Achillea millefolium", "Centaurea jacea"
))
meadow_spp |> add_leda()
#> input_name raunkiaer_life_form dispersal_type sla_mm2_mg canopy_height_m ...
#> 1 Arrhenatherum ... hemicryptophyte anemochory 25.1 0.90 ...
#> 2 Trifolium pratense hemicryptophyte zoochory 22.3 0.30 ...
#> ...
The full column set includes:
- raunkiaer_life_form: the primary Raunkiaer life form (phanerophyte, chamaephyte, hemicryptophyte, geophyte, therophyte, helophyte, hydrophyte).
- raunkiaer_variable: 1 if the species is assigned to multiple life forms, 0 otherwise.
- dispersal_type: primary dispersal vector (anemochory, zoochory, hydrochory, autochory, barochory, dysochory).
- terminal_velocity_ms: seed terminal velocity in m/s (species median).
- seed_mass_mg: seed mass in mg (species median).
- canopy_height_m: canopy height in metres (species median).
- leaf_mass_mg: leaf dry mass in mg (species median).
- sla_mm2_mg: specific leaf area in mm^2/mg (species median).
- clonal_growth: capable of clonal growth (1 = yes, 0 = no).
- buoyancy: seed buoyancy classification.
The Raunkiaer life form column classifies plants by where their perennating buds sit during the unfavorable season. Phanerophytes (trees, tall shrubs) hold buds more than 25 cm above the soil; chamaephytes (low shrubs, cushion plants) keep them near the surface. Hemicryptophytes, the dominant group in temperate grasslands, position buds right at soil level. Below ground, geophytes store buds as bulbs or rhizomes, while therophytes skip the problem entirely by surviving as seeds. LEDA provides this classification at species level for the NW European flora, making it one of the few trait databases that includes Raunkiaer assignments for several thousand species.
One column-name collision to be aware of: LEDA’s
seed_mass_mg and the Diaz enrichment’s
seed_mass_mg share the same output column name. If both
enrichments are stacked in a pipe chain, the second one to run will
overwrite the first. The values may differ slightly because LEDA reports
the species median from its own measurements while Diaz reports the TRY
consortium mean. To keep both, apply one enrichment, rename the column,
then apply the second. Alternatively, choose whichever source is more
appropriate for the study: LEDA for NW European analyses (regional
measurements), Diaz for global analyses (worldwide compilation).
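The rename-between-calls workaround looks like this in outline. The enrich() function is a hypothetical stand-in for an add_*() call (it fills columns by name and overwrites existing ones), and the trait values are illustrative.

```r
# Mock result and two mock trait tables that share the seed_mass_mg name.
res  <- data.frame(accepted_name = c("Trifolium pratense", "Quercus robur"),
                   stringsAsFactors = FALSE)
leda <- data.frame(canonical_name = "Trifolium pratense", seed_mass_mg = 1.5,
                   stringsAsFactors = FALSE)
diaz <- data.frame(canonical_name = c("Trifolium pratense", "Quercus robur"),
                   seed_mass_mg = c(1.7, 3200), stringsAsFactors = FALSE)

# Hypothetical stand-in for add_*(): fills, and overwrites, trait columns.
enrich <- function(x, tbl) {
  idx <- match(x$accepted_name, tbl$canonical_name)
  for (col in setdiff(names(tbl), "canonical_name")) x[[col]] <- tbl[[col]][idx]
  x
}

res <- enrich(res, leda)
names(res)[names(res) == "seed_mass_mg"] <- "seed_mass_mg_leda"  # rename first
res <- enrich(res, diaz)                                         # then stack
names(res)
```

With the rename in place, both sources survive side by side and can be compared or coalesced later.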
Regional plant-trait compilations (Baseflor, Ecoflora, FloraWeb)
Three regional databases add trait detail for the European floras
they cover. Ecoflora and FloraWeb carry a region suffix on every column
(_uk for Britain, _de for Germany), while Baseflor’s French set is
unsuffixed, so the three can be chained without
clobbering one another or the pan-European layers above.
- Baseflor (Julve, Programme Catminat), via add_baseflor(): about 8,500 taxa of the French and neighbouring flora, with flowering months, pollination vector, dispersal mode, breeding system, flower colour, fruit type, woody growth form, and the continentality and salinity axes absent from EIVE.
- Ecoflora (Fitter & Peat 1994), via add_ecoflora(): the British Isles flora, with canopy height, leaf traits, life form, flowering phenology, pollination, seed weight, and British-calibrated Ellenberg values (18 _uk columns).
- FloraWeb (BfN; the BiolFlor data of Klotz, Kuehn & Durka 2002), via add_floraweb(): the German flora, with morphology, reproductive biology, the nine Ellenberg indicator values, ploidy and chromosome number, Grime CSR strategy type, and chorological distribution (59 _de columns).
taxify(c("Bellis perennis", "Achillea millefolium", "Calluna vulgaris")) |>
add_ecoflora() |>
add_floraweb()
Because the Ecoflora and FloraWeb columns are region-suffixed, one chain can attach
British, French, and German trait sets side by side for the same
species. FloraWeb and Ecoflora are bundled as pre-built datasets and
work offline; their German and British trait values are reported as
published, and the access date is the dataset version (neither portal
offers a versioned bulk export). Italian Ellenberg-type indicator values
are also available through add_pignatti(), which reads the
copy bundled in the TR8 package on demand; those values come from a
copyrighted publication and are not redistributed by taxify.
Mycorrhizal type (FungalRoot, Soudzilovskaia et al. 2020)
Most vascular plants form a symbiosis with root fungi, and the
type of that symbiosis is one of the most informative
functional traits a plant carries: it governs how the plant acquires
nutrients and which soil fungi it depends on.
add_fungalroot() attaches the mycorrhizal type from
FungalRoot, a global compilation of more than 36,000 plant-by-site
observations published on GBIF.
Unlike the enrichments above, FungalRoot joins on genus,
not accepted_name. Mycorrhizal type is conserved at the
genus level (the resolution FungalRoot itself recommends for inference),
so the value is a per-genus majority consensus and every species in a
covered genus inherits it, whether or not that exact binomial was
observed.
taxify(c("Quercus robur", "Pinus sylvestris", "Trifolium pratense",
"Vaccinium myrtillus", "Brassica oleracea")) |>
add_fungalroot()
#> input_name genus mycorrhizal_type mycorrhizal_status mycorrhizal_records
#> 1 Quercus robur Quercus EcM mycorrhizal 163
#> 2 Pinus sylvestris Pinus EcM mycorrhizal 500
#> 3 Trifolium pratense Trifolium AM mycorrhizal 193
#> 4 Vaccinium myrtillus Vaccinium ErM mycorrhizal 227
#> 5 Brassica oleracea Brassica NM non-mycorrhizal 59
Three columns are added:
- mycorrhizal_type: the genus-level consensus type. AM (arbuscular, by far the most common, formed by most herbs and many trees), EcM (ecto, typical of oaks, pines, birches, and other temperate forest trees), ErM (ericoid, confined to the Ericaceae), OM (orchid), NM (non-mycorrhizal, e.g. the Brassicaceae and many Cyperaceae), the dual types EcM-AM / ErM-EcM / ErM-AM, plus Other and uncertain.
- mycorrhizal_status: a coarse roll-up of the type, one of "mycorrhizal", "non-mycorrhizal", or "uncertain".
- mycorrhizal_records: how many FungalRoot observations support the genus-level consensus, so a one-record genus can be told apart from a well-sampled one.
Because the join is on genus, a plant whose genus is not in
FungalRoot returns NA, and a genus circumscribed
differently across backbones may not line up. Coverage is plant genera
only (about 4,000 genera). The dataset is distributed under CC BY-NC
4.0; the per-genus consensus is computed by taxify from the
per-observation labels, not FungalRoot’s own published genus
assignment.
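A per-genus majority vote followed by a genus-level join can be sketched in a few lines of base R. The observations are invented for illustration, the record count here is assumed to be the total number of observations per genus, and ties are broken by whichever type sorts first; the actual build may handle both differently.

```r
# Illustrative per-observation labels (not real FungalRoot records).
obs <- data.frame(
  genus = c("Quercus", "Quercus", "Quercus", "Trifolium", "Trifolium", "Brassica"),
  type  = c("EcM", "EcM", "AM", "AM", "AM", "NM"),
  stringsAsFactors = FALSE
)

# Majority vote per genus, plus the number of observations behind it.
consensus <- do.call(rbind, lapply(split(obs$type, obs$genus), function(x) {
  tab <- table(x)
  data.frame(mycorrhizal_type    = names(tab)[which.max(tab)],
             mycorrhizal_records = length(x),
             stringsAsFactors = FALSE)
}))
consensus$genus <- rownames(consensus)

# Genus-level join: every species in a covered genus inherits the value.
species   <- c("Quercus robur", "Quercus petraea", "Acer campestre")
genus     <- sub(" .*$", "", species)
inherited <- consensus$mycorrhizal_type[match(genus, consensus$genus)]
inherited
```

Quercus petraea inherits the Quercus consensus even though no observation names it, while Acer, absent from the mock data, returns NA.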
Conservation status (IUCN Red List)
The conservation status enrichment is the only enrichment that spans
all major taxonomic groups. Coverage includes ~166,000 species
assessed by the IUCN Red List, with representation across plants,
vertebrates, invertebrates, and fungi. A single column is added:
conservation_status, containing the standard IUCN category
code.
species <- taxify(c(
"Panthera tigris", "Ailuropoda melanoleuca",
"Gorilla gorilla", "Vulpes vulpes",
"Passer domesticus", "Quercus robur"
))
species |> add_conservation_status()
#> input_name conservation_status
#> 1 Panthera tigris EN
#> 2 Ailuropoda melanoleuca VU
#> 3 Gorilla gorilla CR
#> 4 Vulpes vulpes LC
#> 5 Passer domesticus LC
#> 6 Quercus robur LC
The seven categories in order of increasing threat are:
LC (Least Concern): population stable, no significant threats.
NT (Near Threatened): close to qualifying for a threatened category.
VU (Vulnerable): facing a high risk of extinction in the wild.
EN (Endangered): facing a very high risk of extinction.
CR (Critically Endangered): facing an extremely high risk.
EW (Extinct in the Wild): survives only in captivity or cultivation.
EX (Extinct): no known living individuals.
Species not yet assessed by the IUCN receive NA. The
IUCN has assessed nearly all mammals, birds, amphibians, and reptiles,
but only a fraction of invertebrates, fungi, and plants. For a
plant-focused study, coverage rates are likely to be lower (perhaps
10-30% of the species list) than for a vertebrate study (where 90-100%
coverage is typical). The enrichment also includes species assessed as
DD (Data Deficient), which indicates that the IUCN has examined the
species but lacks sufficient data to assign a threat category.
The conservation status enrichment is non-static: the IUCN publishes
updated assessments several times per year, and the taxify enrichment is
rebuilt when new assessments become available. The
summary() output will show the version string (e.g.,
“2025.1”) so that the exact assessment vintage can be cited.
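Since the seven categories have a natural order, it is often convenient to convert the conservation_status codes into an ordered factor before summarizing. This sketch operates on a plain character vector of codes; treating DD like NA for ordering purposes is an assumption made here, not something the enrichment does for us.

```r
status <- c("EN", "VU", "CR", "LC", "LC", "DD", NA)

# Ordered factor following increasing threat; DD and NA fall outside the order.
iucn_levels <- c("LC", "NT", "VU", "EN", "CR", "EW", "EX")
threat <- factor(status, levels = iucn_levels, ordered = TRUE)

# "Threatened" in the IUCN sense covers VU, EN, and CR.
threatened <- !is.na(threat) & threat >= "VU" & threat <= "CR"
table(threat, useNA = "ifany")
sum(threatened)
```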
Bird enrichments
Birds are served by two complementary enrichments that together provide a detailed functional and ecological profile. AVONET covers morphology and migration strategy; EltonTraits covers diet composition and foraging behavior. There is intentional overlap in body mass (both provide it), but the remaining columns are distinct.
AVONET (Tobias et al. 2022)
AVONET provides species-level morphological measurements for ~11,000 bird species worldwide, based on direct measurements of museum specimens and live birds. The enrichment adds 11 columns covering beak dimensions (length, depth), wing length, tail length, tarsus length, body mass, hand-wing index, primary habitat, trophic level, trophic niche, and migration strategy.
birds <- taxify(c(
"Parus major", "Cyanistes caeruleus", "Erithacus rubecula",
"Turdus merula", "Falco peregrinus", "Aquila chrysaetos"
))
birds |> add_avonet()
#> input_name beak_length wing_length avonet_body_mass_g migration trophic_niche ...
#> 1 Parus major 11.2 75.1 18.5 sedentary Invertivore ...
#> 2 Cyanistes caeruleus 9.8 67.2 11.0 sedentary Invertivore ...
#> 3 Erithacus rubecula 11.5 72.3 17.1 partial Invertivore ...
#> 4 Turdus merula 20.8 130.5 95.0 partial Omnivore ...
#> 5 Falco peregrinus 15.2 312.0 750.0 full Vertivore ...
#> 6 Aquila chrysaetos 37.5 607.0 4000.0 partial Vertivore ...
The morphological measurements (beak, wing, tail, tarsus) are all in
millimetres, representing species means across measured specimens. The
hand-wing index (hand_wing_index) quantifies wing
pointedness and correlates strongly with long-distance flight ability:
swifts, falcons, and shearwaters score high, while wrens, rails, and
pheasants sit at the low end of the spectrum. Dispersal ecology,
macroecology, and studies of range expansion all make heavy use of this
index.
The migration column has three possible values:
"sedentary" (non-migratory), "partial" (some
populations migrate), and "full" (obligate long-distance
migrant). The trophic_niche column uses categories like
"Invertivore", "Omnivore",
"Vertivore", "Frugivore",
"Granivore", "Nectarivore",
"Herbivore aquatic", and others.
AVONET is licensed under CC BY 4.0 and published on Figshare. The reference is Tobias et al. (2022), Ecology Letters 25:581-597. The dataset is classified as static in the taxify manifest.
EltonTraits (Wilman et al. 2014)
EltonTraits 1.0 covers both birds and mammals (~15,400 species total), making it the only enrichment that spans two vertebrate classes. It adds 18 columns: 10 diet composition percentages, 6 foraging stratum percentages, body mass, and nocturnality.
The diet columns express the percentage contribution of each food
category to the species’ diet: invertebrates (diet_inv),
endothermic vertebrates (diet_vend), ectothermic
vertebrates (diet_vect), fish (diet_vfish),
unknown vertebrates (diet_vunk), scavenging
(diet_scav), fruit (diet_fruit), nectar
(diet_nect), seeds and nuts (diet_seed), and
other plant material (diet_plantother). The 10 diet
percentages sum to 100 for each species.
The foraging columns express where in the vertical habitat structure
the species forages: below water surface (foraging_water),
on ground (foraging_ground), in understory
(foraging_understory), in mid to high vegetation
(foraging_midhigh), in canopy
(foraging_canopy), and aerial
(foraging_aerial). These 6 percentages also sum to 100.
birds <- taxify(c(
"Parus major", "Dendrocopos major", "Alcedo atthis",
"Tyto alba", "Apus apus"
))
birds |> add_elton_traits()
#> input_name diet_inv diet_fruit diet_seed foraging_canopy foraging_aerial nocturnal ...
#> 1 Parus major 60 10 20 50 0 0 ...
#> 2 Dendrocopos major 75 5 10 80 0 0 ...
#> 3 Alcedo atthis 0 0 0 0 0 0 ...
#> 4 Tyto alba 10 0 0 0 0 1 ...
#> 5 Apus apus 100 0 0 0 100 0 ...
The Common Swift (Apus apus) is a textbook example of a
species at the extreme of the foraging and diet axes: 100% invertebrate
diet, 100% aerial foraging, reflecting its life spent almost entirely on
the wing. The Barn Owl (Tyto alba) illustrates the nocturnal
flag: it is the only species in this example set with
nocturnal = 1. The elton_body_mass_g column
provides body mass in grams from literature compilation.
EltonTraits is particularly valuable for functional diversity analyses (computing Rao’s quadratic entropy or functional richness using diet and foraging traits as axes), food web construction (using diet percentages to parameterize trophic links), and macroecological studies of niche breadth. It is licensed under CC0 and published on Figshare. The reference is Wilman et al. (2014), Ecology 95:2027.
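As a small niche-breadth example, the diet percentages can be turned into a Shannon index per species. The sketch uses a reduced, illustrative set of diet columns (a real result has all ten), and shannon() is a helper defined here, not part of taxify.

```r
# Illustrative diet percentages for two species (each row sums to 100).
diet <- data.frame(
  diet_inv        = c(60, 100),
  diet_fruit      = c(10, 0),
  diet_seed       = c(20, 0),
  diet_plantother = c(10, 0),
  row.names = c("Parus major", "Apus apus")
)

stopifnot(all(rowSums(diet) == 100))   # sanity check on the percentages

# Shannon diet breadth: 0 for a species that uses a single food category.
shannon <- function(p) {
  p <- p[p > 0] / sum(p)
  -sum(p * log(p))
}
breadth <- apply(as.matrix(diet), 1, shannon)
breadth
```

The aerial insectivore scores zero, while the more generalist diet yields a clearly positive breadth.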
Mammal enrichments
PanTHERIA (Jones et al. 2009)
PanTHERIA covers ~5,400 mammal species with eight life-history and ecological traits: adult body mass, maximum longevity, mean litter size, gestation length, weaning age, home range size, diet breadth, and habitat breadth. It remains the most-cited source of mammalian life-history data in the ecological literature, despite being published in 2009.
mammals <- taxify(c(
"Vulpes vulpes", "Canis lupus", "Ursus arctos",
"Mustela nivalis", "Lutra lutra", "Lynx lynx"
))
mammals |> add_pantheria()
#> input_name pantheria_body_mass_g longevity_mo litter_size home_range_km2 ...
#> 1 Vulpes vulpes 5480.0 144 5.0 8.55 ...
#> 2 Canis lupus 31757.0 192 5.4 242.00 ...
#> 3 Ursus arctos 139000.0 396 2.0 488.00 ...
#> 4 Mustela nivalis 67.0 72 5.5 0.03 ...
#> 5 Lutra lutra 8000.0 180 2.3 15.00 ...
#> 6 Lynx lynx 20500.0 252 2.6 168.00 ...
The body mass column is named pantheria_body_mass_g to
distinguish it from AVONET’s avonet_body_mass_g and
EltonTraits’ elton_body_mass_g. This prefixing convention
prevents column-name collisions when stacking multiple enrichments on
the same result.
The Least Weasel (Mustela nivalis) in the example above illustrates the dynamic range of mammalian traits: at 67 g body mass and a home range of 0.03 km^2, it sits at the opposite end of the spectrum from the Brown Bear (Ursus arctos) at 139,000 g and 488 km^2. These allometric scaling relationships (body mass predicting home range, longevity, gestation, etc.) are a major reason PanTHERIA is so widely cited.
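The allometric relationship between body mass and home range can be explored with a log-log regression. The sketch below re-enters the six rows from the example output by hand, so it runs without taxify installed; with only six carnivores the fitted slope is purely illustrative and should not be read as an estimate for mammals in general.

```r
# Values copied from the add_pantheria() example output above
mammal_traits <- data.frame(
  input_name = c("Vulpes vulpes", "Canis lupus", "Ursus arctos",
                 "Mustela nivalis", "Lutra lutra", "Lynx lynx"),
  pantheria_body_mass_g = c(5480, 31757, 139000, 67, 8000, 20500),
  home_range_km2        = c(8.55, 242, 488, 0.03, 15, 168),
  stringsAsFactors = FALSE
)

# Allometric relationships are power laws, so they are linear on log-log axes
fit <- lm(log10(home_range_km2) ~ log10(pantheria_body_mass_g),
          data = mammal_traits)

# The slope is the scaling exponent of home range on body mass
slope <- unname(coef(fit)[2])
```

In a real analysis, rows with NA in either column would need to be handled first; lm() drops them silently by default.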
Because PanTHERIA was published in 2009, species described or
taxonomically split after that date will appear as NA. The
dataset is static (CC0 license), so it never triggers version checks.
The reference is Jones et al. (2009), Ecology 90:2648.
Amphibian enrichments
AmphiBIO (Oliveira et al. 2017)
AmphiBIO covers ~6,800 amphibian species with 13 trait columns. The continuous traits are body size (snout-vent length in mm), age at maturity (days), longevity (days), clutch/litter size, reproductive output per year, and offspring size (mm). The binary traits encode habitat and activity patterns: direct development (0/1), larval stage (0/1), aquatic habitat (0/1), fossorial habitat (0/1), arboreal habitat (0/1), diurnal activity (0/1), and nocturnal activity (0/1).
amphibians <- taxify(c(
"Bufo bufo", "Rana temporaria", "Salamandra salamandra",
"Triturus cristatus", "Hyla arborea", "Bombina variegata"
))
amphibians |> add_amphibio()
#> input_name body_size_mm arboreal aquatic direct_development nocturnal_amphibio ...
#> 1 Bufo bufo 150.0 0 0 0 1 ...
#> 2 Rana temporaria 110.0 0 1 0 0 ...
#> 3 Salamandra salamandra 200.0 0 0 0 1 ...
#> 4 Triturus cristatus 160.0 0 1 0 1 ...
#> 5 Hyla arborea 50.0 1 0 0 1 ...
#> 6 Bombina variegata 50.0 0 1 0 0 ...
The nocturnality column is named nocturnal_amphibio
rather than nocturnal to avoid colliding with EltonTraits’
nocturnal column. While it is unusual to stack both
AmphiBIO and EltonTraits on the same result (EltonTraits covers birds
and mammals, not amphibians), the precaution prevents surprises in
mixed-taxon workflows where both enrichments are applied to a single
data.frame.
The binary trait columns use integer values (0/1) rather than logical
(TRUE/FALSE), following the original dataset’s
encoding. This means filtering syntax uses == 1L rather
than bare column names:
result[result$arboreal == 1L, ].
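One caveat with the == 1L filter is that unmatched species carry NA in the trait columns, and base R subsetting with an NA index returns an all-NA row. The sketch below uses a small hand-built data.frame (the third row is a hypothetical unmatched name) to show the difference between the bare comparison and wrapping it in which().

```r
# Hand-built stand-in for an add_amphibio() result;
# "Unmatched name" is a hypothetical row with no AmphiBIO record.
amph <- data.frame(
  input_name = c("Bufo bufo", "Hyla arborea", "Unmatched name"),
  arboreal   = c(0L, 1L, NA_integer_),
  stringsAsFactors = FALSE
)

# Bare comparison: the NA in the logical index yields an extra all-NA row
naive <- amph[amph$arboreal == 1L, ]

# which() keeps only positions where the comparison is TRUE
safe <- amph[which(amph$arboreal == 1L), ]
```

Here naive has two rows (Hyla arborea plus a row of NAs), while safe contains only Hyla arborea.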
AmphiBIO is one of the few large-scale trait databases for amphibians, a taxon group that is relatively data-poor compared to birds and mammals. Coverage spans anurans (frogs and toads), urodeles (salamanders and newts), and caecilians. It is licensed under CC BY 4.0 and published in Scientific Data. The reference is Oliveira et al. (2017), Scientific Data 4:170123.
Fungal enrichments
Fungi have historically been underrepresented in trait databases compared to plants and animals, but two complementary datasets now provide detailed ecological and functional information. FungalTraits classifies genera by lifestyle, growth form, and interaction capabilities, while FUNGuild provides a guild-based trophic classification. Together they offer a reasonably complete functional profile for macrofungi and many microfungi.
FungalTraits (Polme et al. 2020)
FungalTraits is a genus-level database covering ~10,200 fungal genera
with nine trait columns. Unlike the species-level enrichments discussed
above, FungalTraits joins on genus rather than
accepted_name. This reflects the reality of fungal trait
data: most functional traits (lifestyle, growth form, decay strategy)
are conserved at the genus level, and the enormous diversity of
described fungal species (~150,000) makes species-level trait
compilation impractical for many attributes. The genus-level join means
that all species within a genus receive the same trait values, which is
ecologically reasonable for the traits covered.
fungi <- taxify(c(
"Amanita muscaria", "Boletus edulis", "Trametes versicolor",
"Agaricus bisporus", "Cantharellus cibarius"
))
fungi |> add_fungal_traits()
#> input_name primary_lifestyle growth_form fruitbody_type decay_substrate ...
#> 1 Amanita muscaria ectomycorrhizal agaricoid agaricoid <NA> ...
#> 2 Boletus edulis ectomycorrhizal boletoid boletoid <NA> ...
#> 3 Trametes versicolor saprotroph polyporoid polyporoid wood ...
#> 4 Agaricus bisporus saprotroph agaricoid agaricoid litter ...
#> 5 Cantharellus cibarius ectomycorrhizal cantharelloid cantharelloid <NA> ...
The nine trait columns capture complementary facets of fungal ecology:
- primary_lifestyle: the dominant trophic strategy (ectomycorrhizal, saprotroph, plant pathogen, animal parasite, lichen, endophyte, etc.).
- secondary_lifestyle: an additional lifestyle where applicable (many genera have a single lifestyle, so this is frequently NA).
- growth_form: the vegetative morphology (agaricoid, boletoid, polyporoid, corticioid, clavarioid, gasteroid, etc.).
- fruitbody_type: the reproductive structure type.
- decay_substrate: the primary substrate for saprotrophic genera (wood, litter, dung, soil).
- plant_pathogenic_capacity: a coarse classification of pathogenic potential for plant-associated genera.
- animal_biotrophic_capacity: analogous classification for animal associations.
- endophytic_interaction_capability: whether the genus includes endophytic species.
- ectomycorrhiza_exploration_type: for ectomycorrhizal genera, the exploration type of the mycelium (contact, short-distance, medium-distance, long-distance). This is ecologically important because exploration type governs nutrient acquisition strategy and competitive dynamics among ectomycorrhizal fungi.
The primary_lifestyle column is the single most
informative trait for broad ecological analyses. It separates
ectomycorrhizal fungi (mutualists that form nutrient-exchange networks
with plant roots) from saprotrophs (decomposers that drive nutrient
cycling) and pathogens (agents of disease and mortality). These three
groups have fundamentally different roles in ecosystem functioning, and
knowing which lifestyle a genus belongs to determines how it should be
interpreted in community analyses, food web models, and carbon cycling
studies.
FungalTraits is licensed under CC BY 4.0 and published in Fungal Diversity. The reference is Polme et al. (2020), Fungal Diversity 105:1-16. The dataset is classified as static in the taxify manifest.
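The genus-level join described above can be pictured as a match() on the first word of the accepted name. The sketch below is not taxify's implementation: how the package actually derives the genus key is an assumption here, and the lookup table is a three-row stand-in built from the example output.

```r
# Accepted names; Xylaria is deliberately missing from the lookup below
accepted_name <- c("Amanita muscaria", "Trametes versicolor",
                   "Trametes hirsuta", "Xylaria hypoxylon")

# Tiny stand-in for the genus-level trait table
genus_traits <- data.frame(
  genus             = c("Amanita", "Boletus", "Trametes"),
  primary_lifestyle = c("ectomycorrhizal", "ectomycorrhizal", "saprotroph"),
  stringsAsFactors  = FALSE
)

# Assumption: the genus is the first whitespace-delimited token of the name
genus <- sub(" .*$", "", accepted_name)

# Left join via match(): unmatched genera get NA
idx <- match(genus, genus_traits$genus)
joined <- data.frame(
  accepted_name     = accepted_name,
  genus             = genus,
  primary_lifestyle = genus_traits$primary_lifestyle[idx],
  stringsAsFactors  = FALSE
)
```

Both Trametes species receive the same lifestyle, which is the behaviour the genus-level design implies, and the genus absent from the lookup comes back as NA.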
FUNGuild (Nguyen et al. 2016)
FUNGuild provides trophic and guild classifications for ~13,000 fungal taxa at both genus and species levels. Where FungalTraits describes what a genus does ecologically (lifestyle, growth form, substrate preference), FUNGuild classifies taxa into guild categories that describe their functional role in the ecosystem. The two datasets are complementary: a genus like Trametes might be classified as “saprotroph” with “polyporoid” growth form in FungalTraits, and as “Wood Saprotroph” guild with “Saprotroph” trophic mode in FUNGuild. The FUNGuild classification is coarser but more directly interpretable for guild-based community analyses.
fungi <- taxify(c(
"Amanita muscaria", "Boletus edulis", "Trametes versicolor",
"Agaricus bisporus", "Cantharellus cibarius"
))
fungi |> add_funguild()
#> input_name trophic_mode guild funguild_growth_form confidence_ranking
#> 1 Amanita muscaria Symbiotroph Ectomycorrhizal Agaricoid Highly Probable
#> 2 Boletus edulis Symbiotroph Ectomycorrhizal Boletoid Highly Probable
#> 3 Trametes versicolor Saprotroph Wood Saprotroph Polyporoid Highly Probable
#> 4 Agaricus bisporus Saprotroph Litter Saprotroph Agaricoid Highly Probable
#> 5 Cantharellus cibarius Symbiotroph Ectomycorrhizal Cantharelloid Highly Probable
The four output columns provide a hierarchical classification:
- trophic_mode: the broadest category (Saprotroph, Symbiotroph, Pathotroph, or combinations like Saprotroph-Symbiotroph for genera with multiple trophic strategies).
- guild: a finer classification within each trophic mode (Wood Saprotroph, Litter Saprotroph, Ectomycorrhizal, Arbuscular Mycorrhizal, Plant Pathogen, Animal Pathogen, Lichenized, etc.).
- funguild_growth_form: the morphological category, named with the funguild_ prefix to avoid collision with FungalTraits’ growth_form column.
- confidence_ranking: how confident the guild assignment is (Highly Probable, Probable, or Possible). This column deserves attention: assignments at the “Possible” level are based on limited evidence and should be treated with caution in quantitative analyses. Filtering to “Highly Probable” and “Probable” assignments reduces coverage but improves reliability.
FUNGuild is particularly valuable for soil mycobiome studies, where operational taxonomic units (OTUs) from metabarcoding are classified into ecological guilds. The trophic mode and guild columns map directly onto the functional group categories used in fungal community ecology: the ratio of saprotrophs to symbiotrophs, or the proportion of pathotrophs in a community, are common response variables in studies of land use change, nutrient cycling, and plant-soil feedbacks.
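Filtering on confidence_ranking and computing a saprotroph-to-symbiotroph ratio takes only a few lines of base R. The sketch below works on a hand-built data.frame that copies the five rows of the example output and adds one hypothetical "Possible" assignment to show the filter doing something.

```r
# Stand-in for an add_funguild() result; the last row is hypothetical
guilds <- data.frame(
  input_name = c("Amanita muscaria", "Boletus edulis", "Trametes versicolor",
                 "Agaricus bisporus", "Cantharellus cibarius",
                 "Hypothetical taxon"),
  trophic_mode = c("Symbiotroph", "Symbiotroph", "Saprotroph",
                   "Saprotroph", "Symbiotroph", "Pathotroph"),
  confidence_ranking = c(rep("Highly Probable", 5), "Possible"),
  stringsAsFactors = FALSE
)

# Keep only the two more reliable confidence levels.
# %in% returns FALSE for NA, so unmatched rows are dropped as well.
reliable <- guilds[guilds$confidence_ranking %in%
                     c("Highly Probable", "Probable"), ]

# Ratio of saprotrophs to symbiotrophs among the retained taxa
n_sap <- sum(reliable$trophic_mode == "Saprotroph")
n_sym <- sum(reliable$trophic_mode == "Symbiotroph")
sap_sym_ratio <- n_sap / n_sym
```

Mixed modes such as Saprotroph-Symbiotroph would need an explicit decision (count in both groups, or exclude) before a ratio like this is meaningful.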
FUNGuild is published in Fungal Ecology. The reference is Nguyen et al. (2016), Fungal Ecology 20:241-248. The dataset is classified as static in the taxify manifest.
Algae enrichments
AlgaeTraits (Vranken et al. 2023)
AlgaeTraits provides morphological and ecological traits for ~1,745 European macroalgae species (seaweeds). Macroalgae are the dominant primary producers in coastal rocky ecosystems, yet they are conspicuously absent from the major plant trait databases (TRY, LEDA, EIVE) that focus exclusively on vascular plants. AlgaeTraits fills this gap for the European coastline, covering green algae (Chlorophyta), brown algae (Phaeophyceae), and red algae (Rhodophyta) with eight trait columns spanning morphology, habitat, and environmental tolerances.
seaweeds <- taxify(c(
"Fucus vesiculosus", "Ulva lactuca", "Laminaria digitata",
"Chondrus crispus", "Sargassum muticum"
))
seaweeds |> add_algae_traits()
#> input_name algae_body_size_cm algae_growth_form algae_calcification algae_tidal_zone ...
#> 1 Fucus vesiculosus 60.0 foliose none intertidal ...
#> 2 Ulva lactuca 30.0 foliose none intertidal ...
#> 3 Laminaria digitata 200.0 foliose none subtidal ...
#> 4 Chondrus crispus 15.0 foliose none intertidal ...
#> 5 Sargassum muticum 300.0 foliose none subtidal ...
All eight columns are prefixed with algae_ to clearly
distinguish them from terrestrial plant traits:
- algae_body_size_cm: maximum thallus length in centimetres.
- algae_growth_form: the morphological category (filamentous, foliose, corticated, leathery, calcareous, crustose, etc.).
- algae_calcification: whether the species produces calcium carbonate structures (none, articulated, crustose). Calcifying algae like coralline species play a critical role in reef construction and are particularly sensitive to ocean acidification.
- algae_life_span: the typical life span category (annual, perennial, pseudoperennial).
- algae_tidal_zone: the primary tidal zone (supralittoral, intertidal, subtidal).
- algae_wave_exposure: the preferred wave exposure regime (sheltered, moderately exposed, exposed).
- algae_environment: the salinity regime (marine, brackish, freshwater).
- algae_substrate: the preferred substrate type (rock, sand, epiphytic, free-living).
The body size column spans three orders of magnitude, from millimetre-scale filamentous algae to kelps exceeding 3 metres. This variation underpins the structural complexity of rocky shore communities: large canopy-forming species like Laminaria digitata create habitat for hundreds of associated species, while small turf-forming species dominate in disturbed or nutrient-enriched conditions. The growth form and tidal zone columns together define the ecological niche of each species along the shore gradient, making AlgaeTraits directly useful for intertidal community analyses, climate change impact assessments, and marine protected area planning.
The geographic scope is European, so species from other coastlines
will receive NA in all columns. The dataset is licensed
under CC BY 4.0 and published in Scientific Data. The reference
is Vranken et al. (2023), Scientific Data 10:826. The dataset
is classified as static in the taxify manifest.
Fish enrichments
Fish ecology has produced two large, complementary trait databases. FISHMORPH focuses on morphological measurements of freshwater species, while FishBase provides broader ecological and life-history data for both freshwater and marine fish. Together they provide detailed functional profiles for ichthyological studies.
FISHMORPH (Brosse et al. 2021)
FISHMORPH provides morphological trait data for ~8,300 freshwater fish species worldwide, based on standardized measurements from photographs and museum specimens. The 10 morphological traits capture the key axes of fish body shape variation that determine swimming performance, feeding mode, and habitat use.
freshwater_fish <- taxify(c(
"Salmo trutta", "Esox lucius", "Cyprinus carpio",
"Perca fluviatilis", "Silurus glanis"
))
freshwater_fish |> add_fish_traits()
#> input_name fish_body_elongation fish_eye_size fish_oral_gape_position fish_body_lateral_shape ...
#> 1 Salmo trutta 0.22 0.08 0.42 0.18 ...
#> 2 Esox lucius 0.18 0.06 0.50 0.15 ...
#> 3 Cyprinus carpio 0.35 0.05 0.38 0.25 ...
#> 4 Perca fluviatilis 0.30 0.07 0.40 0.22 ...
#> 5 Silurus glanis 0.15 0.03 0.48 0.12 ...
All columns are prefixed with fish_. Nine of them express
dimensionless morphological ratios; the tenth,
fish_max_body_length_cm, is an absolute length. The 10
traits are:
- fish_body_elongation: body depth relative to standard length. High values indicate deep-bodied species (cyprinids), low values indicate elongated species (eels, pike).
- fish_eye_size: eye diameter relative to head length. Large-eyed species tend to be visual predators in clear water.
- fish_oral_gape_position: the vertical position of the mouth, from ventral (benthic feeders) to dorsal (surface feeders).
- fish_body_lateral_shape: the lateral compression of the body.
- fish_pectoral_fin_size: pectoral fin area, associated with manoeuvrability and braking ability.
- fish_pectoral_fin_position: the vertical insertion of the pectoral fin on the body.
- fish_caudal_peduncle_throttling: the narrowing of the caudal peduncle, associated with sustained swimming efficiency.
- fish_caudal_fin_shape: the aspect ratio of the caudal fin. High values (forked tails) indicate cruising swimmers; low values (rounded tails) indicate ambush predators or benthic species.
- fish_fin_surface_ratio: total fin area relative to body area.
- fish_max_body_length_cm: the maximum recorded standard length in centimetres.
These morphological ratios are ecomorphological indicators: they predict how a species interacts with its physical environment. Body elongation and caudal fin shape together separate benthic, slow-moving species from pelagic, fast-cruising species. Oral gape position separates surface feeders from bottom feeders. Eye size and pectoral fin size relate to sensory ecology and manoeuvrability, respectively. The combination of these traits places each species in morphological space, making FISHMORPH directly useful for functional diversity calculations (Rao’s Q, functional richness, functional divergence) in freshwater fish community ecology.
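Placing species in morphological space usually starts with a standardized distance matrix. The sketch below re-enters the four ratio columns shown in the example output and computes Euclidean distances on z-scored traits; the choice of Euclidean distance and z-scoring is an assumption made for illustration, and dedicated functional diversity packages offer other options.

```r
# Values copied from the add_fish_traits() example output above
morph <- data.frame(
  fish_body_elongation    = c(0.22, 0.18, 0.35, 0.30, 0.15),
  fish_eye_size           = c(0.08, 0.06, 0.05, 0.07, 0.03),
  fish_oral_gape_position = c(0.42, 0.50, 0.38, 0.40, 0.48),
  fish_body_lateral_shape = c(0.18, 0.15, 0.25, 0.22, 0.12),
  row.names = c("Salmo trutta", "Esox lucius", "Cyprinus carpio",
                "Perca fluviatilis", "Silurus glanis")
)

# Standardize each trait to mean 0 and sd 1 so no single ratio dominates
morph_scaled <- scale(morph)

# Pairwise Euclidean distances between species in trait space
morph_dist  <- dist(morph_scaled)
dist_matrix <- as.matrix(morph_dist)
```

With these four traits the pike (Esox lucius) ends up closer to the catfish (Silurus glanis) than to the carp (Cyprinus carpio), consistent with the elongated versus deep-bodied contrast described above.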
The dataset covers freshwater fish only; marine species receive
NA. It is licensed under CC BY 4.0 and published in
Global Ecology and Biogeography. The reference is Brosse et
al. (2021), Global Ecology and Biogeography 30:2330-2345. The
dataset is classified as static in the taxify manifest.
FishBase (Froese & Pauly 2024)
FishBase is the most comprehensive fish database in the world, covering ~35,000 species across both freshwater and marine environments. The taxify enrichment extracts eight key ecological and life-history traits from the FishBase dataset, providing a broad functional profile that complements FISHMORPH’s morphological focus.
fish <- taxify(c(
"Gadus morhua", "Thunnus thynnus", "Hippocampus hippocampus",
"Squalus acanthias", "Salmo trutta"
))
fish |> add_fishbase()
#> input_name fb_body_length_cm fb_body_mass_g fb_trophic_level fb_depth_min_m fb_depth_max_m ...
#> 1 Gadus morhua 132.0 55500.0 4.4 0.0 600.0 ...
#> 2 Thunnus thynnus 458.0 684000.0 4.2 0.0 1000.0 ...
#> 3 Hippocampus hippocampus 15.0 NA 3.1 1.0 60.0 ...
#> 4 Squalus acanthias 160.0 11000.0 4.3 16.0 900.0 ...
#> 5 Salmo trutta 140.0 50000.0 3.4 0.0 332.0 ...
The eight columns are all prefixed with fb_:
- fb_body_length_cm: maximum total length in centimetres.
- fb_body_mass_g: maximum recorded body mass in grams.
- fb_trophic_level: the trophic level (continuous, typically 2.0-5.0). Herbivorous fish sit near 2.0, planktivores around 3.0, piscivores around 4.0-4.5, and apex predators above 4.5.
- fb_depth_min_m and fb_depth_max_m: the minimum and maximum depth range in metres. Together these define the vertical habitat envelope.
- fb_vulnerability: the intrinsic vulnerability index (0-100), a composite score based on maximum size, age, fecundity, and other life-history parameters. High values indicate species that are inherently more susceptible to overexploitation.
- fb_habitat: the primary habitat category (pelagic, demersal, bathydemersal, bathypelagic, reef-associated, etc.).
- fb_importance: the economic importance category (commercial, subsistence, minor commercial, gamefish, etc.).
The trophic level and vulnerability columns are particularly valuable for fisheries ecology and marine conservation. Trophic level quantifies the position of each species in the food web, and the well-documented pattern of “fishing down the food web” (declining mean trophic level of catches over time) is diagnosed using exactly this variable. Vulnerability provides a quick assessment of which species in a community are most at risk from fishing pressure, complementing the IUCN conservation status with a mechanistic, trait-based risk metric.
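The "fishing down the food web" diagnostic is a catch-weighted mean of trophic level. The sketch below takes the fb_trophic_level values from the example output and combines them with two invented catch vectors; the catch figures are hypothetical and exist only to make the arithmetic concrete.

```r
# Trophic levels copied from the add_fishbase() example output above
fish_traits <- data.frame(
  input_name = c("Gadus morhua", "Thunnus thynnus", "Hippocampus hippocampus",
                 "Squalus acanthias", "Salmo trutta"),
  fb_trophic_level = c(4.4, 4.2, 3.1, 4.3, 3.4),
  stringsAsFactors = FALSE
)

# Hypothetical catches in tonnes for two periods (not real landings data)
catch_early <- c(500, 300, 0, 150, 50)
catch_late  <- c(100,  50, 0,  50, 300)

# Mean trophic level of the catch = sum(TL * catch) / sum(catch)
mtl_early <- weighted.mean(fish_traits$fb_trophic_level, catch_early)
mtl_late  <- weighted.mean(fish_traits$fb_trophic_level, catch_late)

# A negative change indicates a shift toward lower-trophic-level catches
mtl_change <- mtl_late - mtl_early
```

With these invented numbers the mean trophic level falls from 4.275 to 3.77 as the catch shifts from cod and tuna toward trout.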
Note that FishBase is licensed under CC BY-NC 3.0 (non-commercial use). This is more restrictive than the CC BY 4.0 license used by most other enrichments. Users intending to use FishBase data in commercial applications should consult the FishBase terms of use. The reference is Froese, R. and D. Pauly (2024), FishBase, www.fishbase.org. The dataset is classified as non-static in the taxify manifest because FishBase is updated periodically.
Reptile enrichments
Meiri lizard traits (Meiri 2018)
The Meiri lizard trait database covers ~6,600 lizard species (Squamata, excluding snakes and amphisbaenians) with 10 life-history, morphological, and ecological traits. Lizards are the most species-rich group of non-avian reptiles, and this dataset provides the most comprehensive species-level trait compilation available for the group.
lizards <- taxify(c(
"Pogona vitticeps", "Lacerta agilis", "Iguana iguana",
"Varanus komodoensis", "Gekko gecko"
))
lizards |> add_lizard_traits()
#> input_name lizard_body_mass_g lizard_svl_mm lizard_tail_length_mm lizard_clutch_size ...
#> 1 Pogona vitticeps 350.0 230.0 250.0 18.0 ...
#> 2 Lacerta agilis 10.0 70.0 100.0 8.0 ...
#> 3 Iguana iguana 4000.0 450.0 700.0 35.0 ...
#> 4 Varanus komodoensis 70000.0 1500.0 1400.0 18.0 ...
#> 5 Gekko gecko 60.0 140.0 130.0 2.0 ...
All columns are prefixed with lizard_:
- lizard_body_mass_g: adult body mass in grams.
- lizard_svl_mm: snout-vent length in millimetres, the standard body size measurement for reptiles (excluding the tail, which is frequently lost and regenerated).
- lizard_tail_length_mm: tail length in millimetres.
- lizard_clutch_size: the mean number of eggs per clutch (or neonates per litter for viviparous species).
- lizard_clutch_frequency: the number of clutches produced per year.
- lizard_longevity_yr: maximum recorded longevity in years.
- lizard_diet: the primary diet category (insectivore, herbivore, omnivore, carnivore).
- lizard_habitat: the primary habitat type (terrestrial, arboreal, fossorial, saxicolous, semi-aquatic).
- lizard_activity_time: the primary activity period (diurnal, nocturnal, crepuscular, cathemeral).
- lizard_foraging_mode: the foraging strategy (sit-and-wait, active foraging, mixed). This trait is tightly linked to metabolic rate and energy budgets: active foragers have higher metabolic rates and larger home ranges, while sit-and-wait predators invest less energy in locomotion but rely on crypsis and ambush efficiency.
The body mass range spans more than four orders of magnitude, from sub-gram geckos to the 70 kg Komodo Dragon (Varanus komodoensis). This allometric range drives strong scaling relationships: metabolic rate, home range size, and prey size all scale predictably with body mass in lizards. The SVL measurement is preferred over total length because tail autotomy (voluntary tail shedding) makes total length unreliable; SVL provides a stable, comparable measure of body size across species and individuals.
The combination of clutch size, clutch frequency, and longevity captures the fast-slow life-history continuum. Small geckos with clutches of 1-2 eggs but multiple clutches per year represent a different strategy from large iguanas with single large clutches per season. This life-history variation is directly relevant to population viability analysis and conservation planning for reptile species.
The dataset covers lizards globally. Snakes and turtles are not
included; they receive NA in all columns. It is licensed
under CC BY 4.0 and published in Global Ecology and
Biogeography. The reference is Meiri (2018), Global Ecology and
Biogeography 27:1168-1172. The dataset is classified as static in
the taxify manifest.
Vertebrate enrichments (cross-class)
AnAge longevity and life-history (Tacutu et al. 2018)
AnAge is a curated database of aging and longevity records for ~4,700 vertebrate species spanning mammals, birds, reptiles, amphibians, and fish. It provides maximum longevity, body mass, metabolic rate, maturity age, gestation/incubation time, litter/clutch size, birth mass, growth rate, and body temperature. The unique value of AnAge over taxon-specific databases like PanTHERIA is its cross-class coverage: longevity and metabolic data can be compared directly across vertebrate classes.
vertebrates <- taxify(c(
"Vulpes vulpes", "Aquila chrysaetos", "Crocodylus niloticus",
"Bufo bufo", "Salmo salar"
), backend = c("col", "gbif"))
vertebrates |> add_anage()
#> input_name max_longevity_yr anage_body_mass_g metabolic_rate_w ...
#> 1 Vulpes vulpes 15.2 5480.0 10.41 ...
#> 2 Aquila chrysaetos 46.0 4210.0 8.94 ...
#> 3 Crocodylus niloticus 44.0 242500.0 NA ...
#> 4 Bufo bufo 36.0 48.0 NA ...
#> 5 Salmo salar 13.0 3400.0 NA ...
The body mass and litter size columns carry the anage_ prefix
to distinguish them from their PanTHERIA equivalents. The
max_longevity_yr column is the maximum recorded lifespan in
years, the most widely used parameter for cross-species aging
comparisons.
The dataset is compiled from the Human Ageing Genomic Resources (HAGR) and is freely available under CC BY. The reference is Tacutu et al. (2018), Nucleic Acids Research 46:D1083-D1090.
AnimalTraits body mass and metabolic rate (Herberstein et al. 2022)
AnimalTraits is a curated database of body mass and metabolic rate measurements covering ~2,000 species across arthropods (~1,700 species), vertebrates, molluscs, and annelids. Unlike taxon-specific databases, it provides a unified framework for cross-taxon allometric comparisons, which is particularly valuable for arthropods, a group underrepresented in other trait databases.
arthropods <- taxify(c(
"Drosophila melanogaster", "Apis mellifera",
"Tenebrio molitor", "Gryllus campestris"
), backend = c("col", "gbif"))
arthropods |> add_animaltraits()
#> input_name animaltraits_body_mass_kg animaltraits_metabolic_rate_w
#> 1 Drosophila melanogaster 0.000001030 0.000000218
#> 2 Apis mellifera 0.000100000 0.000012600
#> 3 Tenebrio molitor 0.000140000 0.000004850
#> 4 Gryllus campestris 0.000800000 NA
The data are stored as individual-level observations in the source
CSV; taxify’s parse function aggregates these to species-level medians.
Body mass is in kilograms and metabolic rate in watts (both SI units,
as published). The animaltraits_ prefix avoids collision
with body mass columns from other enrichments.
The dataset is licensed under CC0 (public domain) and published on Zenodo. The reference is Herberstein et al. (2022), Scientific Data 9:265.
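The aggregation from individual-level observations to species-level medians can be sketched with aggregate(). This is not taxify's parse function: the column names and all measurements below are hypothetical, chosen only to show how missing metabolic rates propagate.

```r
# Hypothetical individual-level observations (not real AnimalTraits rows)
obs <- data.frame(
  species = c("Apis mellifera", "Apis mellifera", "Apis mellifera",
              "Gryllus campestris", "Gryllus campestris"),
  body_mass_kg     = c(0.00009, 0.00010, 0.00012, 0.0007, 0.0009),
  metabolic_rate_w = c(0.000012, 0.000013, NA, NA, NA),
  stringsAsFactors = FALSE
)

# One row per species; na.rm = TRUE is passed through to median(), so a
# species with no metabolic measurements at all ends up with NA
species_medians <- aggregate(
  obs[c("body_mass_kg", "metabolic_rate_w")],
  by  = list(species = obs$species),
  FUN = median, na.rm = TRUE
)
```

The data.frame interface of aggregate() is used rather than the formula interface because the latter drops any row with an NA in any column before aggregating, which would discard valid body mass observations.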
Butterfly enrichments
LepTraits butterfly traits (Shirey et al. 2022)
LepTraits 1.0 is the most comprehensive open butterfly trait database, covering ~12,400 species of Papilionoidea globally. It provides wingspan, voltinism, diapause stage, four habitat affinity dimensions, host plant data, and adult flight phenology.
butterflies <- taxify(c(
"Vanessa cardui", "Pieris rapae", "Papilio machaon",
"Lycaena phlaeas", "Colias crocea"
), backend = c("col", "gbif"))
butterflies |> add_leptraits()
#> input_name wingspan_mm voltinism diapause_stage canopy_affinity ...
#> 1 Vanessa cardui 62.5 3.0 NA Open canopy ...
#> 2 Pieris rapae 47.5 4.0 pupa Open canopy ...
#> 3 Papilio machaon 75.0 2.0 pupa Open canopy ...
#> 4 Lycaena phlaeas 30.0 3.0 larva Open canopy ...
#> 5 Colias crocea 48.5 3.0 adult Open canopy ...
The wingspan is computed as the midpoint of the lower and upper
bounds reported in the dataset. Voltinism indicates the number of
generations per year. The four habitat affinities (canopy, edge,
moisture, disturbance) are categorical variables describing the species’
preferred environmental context. The flight_months column
counts the number of months with recorded adult flight activity.
The dataset is licensed under CC0 and published on Figshare. The reference is Shirey et al. (2022), Scientific Data 9:398.
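The two derived columns, wingspan midpoint and flight-month count, are simple arithmetic. The sketch below assumes hypothetical source column names and invented bounds chosen so that the midpoints reproduce the wingspan_mm values in the example output; the flight calendars are likewise illustrative.

```r
# Hypothetical raw bounds (column names and values are assumptions)
lep_raw <- data.frame(
  species           = c("Vanessa cardui", "Pieris rapae"),
  wingspan_lower_mm = c(55, 45),
  wingspan_upper_mm = c(70, 50),
  stringsAsFactors  = FALSE
)

# Illustrative flight calendar: one logical column per month
flight <- matrix(FALSE, nrow = 2, ncol = 12,
                 dimnames = list(lep_raw$species, month.abb))
flight["Vanessa cardui", 4:10] <- TRUE   # Apr-Oct
flight["Pieris rapae",   3:10] <- TRUE   # Mar-Oct

# Midpoint of the reported range
lep_raw$wingspan_mm <- (lep_raw$wingspan_lower_mm +
                          lep_raw$wingspan_upper_mm) / 2

# Number of months with recorded adult flight
lep_raw$flight_months <- unname(rowSums(flight))
```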
Arthropod enrichments
NW European Arthropod life-history traits (Logghe et al. 2025)
This dataset provides 28 life-history and ecological traits for ~4,900 arthropod species from Northwestern Europe, covering 10 orders including Coleoptera, Hemiptera, Orthoptera, Araneae, Diptera, Hymenoptera, and Lepidoptera. It is the most comprehensive open arthropod trait compilation for this region.
insects <- taxify(c(
"Abax parallelepipedus", "Pterostichus melanarius",
"Chorthippus parallelus", "Araneus diadematus"
), backend = c("col", "gbif"))
insects |> add_arthropod_traits()
#> input_name arthropod_body_size_mm arthropod_dispersal arthropod_voltinism arthropod_feeding_guild ...
#> 1 Abax parallelepipedus 18.5 0.01 1.0 carnivore ...
#> 2 Pterostichus melanarius 15.0 0.32 1.0 carnivore ...
#> 3 Chorthippus parallelus 17.0 0.10 1.0 herbivore ...
#> 4 Araneus diadematus 13.0 0.45 1.0 carnivore ...
All columns are prefixed with arthropod_. The
quantitative traits include body size (mm), dispersal ability (0-1 ratio
within order), mean voltinism, fecundity, development time (days),
lifespan (days), and thermal niche mean (°C). The categorical traits
include diurnality, feeding guild, and trophic range.
Because this dataset is geographically scoped to NW Europe (Belgium,
Luxembourg, Netherlands, northern France, UK, western Germany), species
from other regions will have NA values. The dataset is
particularly strong for Coleoptera, Hemiptera, and Orthoptera, with
near-complete coverage of the regional fauna in those orders.
The dataset is licensed under CC BY-NC and published in Biodiversity Data Journal. The reference is Logghe et al. (2025), Biodiversity Data Journal 13:e146785.
Group-based enrichments
Five enrichments filter by a grouping variable (country code, TDWG
botanical region code, GloNAF region code, or language code) and pivot
the result to wide format. The mechanics are the same across all of
them: when a single group value is requested, the output column uses the
base name (e.g., invasive_status). When multiple group
values are requested, each output column gets a suffix derived from the
group value (e.g., invasive_status_AT,
invasive_status_DE). Passing "all" as the
group value expands to every group present in the enrichment
dataset.
This design keeps the output tidy for the common case (one country,
one region, one language) while still supporting comparative analyses
across multiple groups without reshaping the data manually. Internally,
the group-based join performs one match() call per group
value, so requesting 10 countries costs roughly 10 times the computation
of a single country. This is still fast for typical use cases
(sub-second for results with tens of thousands of rows), but requesting
"all" on a large result may take a few seconds because it
iterates over all group values in the enrichment (196 countries for
GRIIS, dozens of TDWG regions for WCVP).
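The naming rule and the one-match()-per-group loop can be sketched in base R. The helper below, pivot_by_group(), is hypothetical and written for illustration only; it is not taxify's internal function, and the long-format table is a five-row stand-in for the GRIIS enrichment.

```r
# Hypothetical helper illustrating the per-group match() and column naming
pivot_by_group <- function(result, long, groups, base = "invasive_status") {
  for (g in groups) {
    sub <- long[long$country == g, ]                 # rows for this group
    idx <- match(result$accepted_name, sub$canonical_name)
    # Single group: bare column name; several groups: suffixed name
    col <- if (length(groups) == 1L) base else paste0(base, "_", g)
    result[[col]] <- sub[[base]][idx]                # NA where no record
  }
  result
}

result <- data.frame(
  accepted_name = c("Robinia pseudoacacia", "Ailanthus altissima",
                    "Quercus robur"),
  stringsAsFactors = FALSE
)

# Stand-in for the long-format enrichment (one row per species-country pair)
long <- data.frame(
  canonical_name = c("Robinia pseudoacacia", "Robinia pseudoacacia",
                     "Ailanthus altissima", "Ailanthus altissima",
                     "Quercus robur"),
  country = c("AT", "GB", "AT", "GB", "AT"),
  invasive_status = c("invasive", "invasive", "invasive",
                      "introduced", "native"),
  stringsAsFactors = FALSE
)

single <- pivot_by_group(result, long, "AT")
multi  <- pivot_by_group(result, long, c("AT", "GB"))
```

Because each group triggers its own subset and match(), the cost grows linearly with the number of groups, which is the scaling behaviour described above.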
Invasive species status (GRIIS)
The Global Register of Introduced and Invasive Species (GRIIS)
classifies species as native, introduced, or invasive on a per-country
basis. The dataset covers 196 countries with ~23,000 species-country
combinations. The country argument takes ISO 3166-1 alpha-2
codes (e.g., "AT" for Austria, "DE" for
Germany, "GB" for the United Kingdom).
Single country
plants <- taxify(c(
"Robinia pseudoacacia", "Ailanthus altissima",
"Impatiens glandulifera", "Quercus robur",
"Reynoutria japonica", "Solidago canadensis"
))
plants |> add_invasive_status(country = "AT")
#> input_name invasive_status
#> 1 Robinia pseudoacacia invasive
#> 2 Ailanthus altissima invasive
#> 3 Impatiens glandulifera invasive
#> 4 Quercus robur native
#> 5 Reynoutria japonica invasive
#> 6 Solidago canadensis invasive
With a single country code, the output column is simply
invasive_status without any suffix. The three possible
values are "native", "introduced", and
"invasive". Species not recorded in the GRIIS dataset for
the requested country receive NA. Note that NA
does not mean “native”; it means “no record” in the GRIIS database. Many
native species are simply not listed because GRIIS focuses on introduced
and invasive taxa.
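Because NA carries a distinct meaning here, it helps to keep it visible when summarising. The sketch below uses a hand-written status vector (not real GRIIS output) to show how to count missing records and relabel them explicitly rather than letting them vanish from a default table().

```r
# Illustrative status vector; NA means "no GRIIS record", not "native"
status <- c("invasive", "invasive", "native", NA, "introduced", NA)

# table() drops NA by default; useNA = "ifany" keeps it as its own level
status_counts <- table(status, useNA = "ifany")

# Number of species without a record for the requested country
no_record <- sum(is.na(status))

# Make the missing category explicit for reporting or plotting
status_labelled <- ifelse(is.na(status), "no GRIIS record", status)
```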
Multiple countries
When comparing invasive status across countries, pass a vector of codes. Each output column is suffixed with the corresponding country code.
plants |> add_invasive_status(country = c("AT", "DE", "GB"))
#> input_name invasive_status_AT invasive_status_DE invasive_status_GB
#> 1 Robinia pseudoacacia invasive invasive invasive
#> 2 Ailanthus altissima invasive invasive introduced
#> 3 Impatiens glandulifera invasive invasive invasive
#> 4 Quercus robur native native native
#> 5 Reynoutria japonica invasive invasive invasive
#> 6 Solidago canadensis invasive invasive introduced
This layout makes cross-country comparisons straightforward. Filtering for species that differ in status between countries is a matter of subsetting columns. The example below finds species classified as invasive in Austria but not (yet) classified as invasive in Germany:
result <- plants |> add_invasive_status(country = c("AT", "DE"))
# Species invasive in Austria but not in Germany.
# which() drops rows where either status is NA (no GRIIS record);
# without it, those rows would come back as all-NA rows.
result[which(result$invasive_status_AT == "invasive" &
result$invasive_status_DE != "invasive"), ]
This pattern is useful for identifying species that may be expanding their invasive range, or for comparing the regulatory status of non-native species across neighboring countries.
All countries
Passing country = "all" expands the result with one
column per country in the GRIIS dataset (196 countries). This produces a
wide data.frame with 196 additional columns, so it is best reserved for
full-scale screening exercises where the complete geographic profile of
each species matters.
plants |> add_invasive_status(country = "all")
# Adds invasive_status_AD, invasive_status_AE, ..., invasive_status_ZW
Resolving "all" to the full list of country
codes is cheap: if the manifest contains an
available_groups field for the GRIIS enrichment (which it
normally does), the codes are read directly from there without
scanning the .vtr file. This makes even the
"all" case fast to set up, though the subsequent join
across 196 groups naturally takes longer than a single-country join.
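With many country columns, summaries are easier to compute by selecting the columns through their shared prefix than by naming them one by one. The sketch below assumes a data.frame shaped like the multi-country output shown earlier (values copied from the AT/DE/GB example); `n_invasive` is a hypothetical helper column.

```r
# Toy stand-in for a multi-country add_invasive_status() result.
res <- data.frame(
  input_name = c("Robinia pseudoacacia", "Ailanthus altissima", "Quercus robur"),
  invasive_status_AT = c("invasive", "invasive", "native"),
  invasive_status_DE = c("invasive", "invasive", "native"),
  invasive_status_GB = c("invasive", "introduced", "native"),
  stringsAsFactors = FALSE
)

# Select the status columns by their shared prefix.
status_cols <- grep("^invasive_status_", names(res), value = TRUE)

# Count, per species, the countries where it is recorded as invasive.
# NA (no record) does not add to the count because of na.rm = TRUE.
res$n_invasive <- rowSums(res[status_cols] == "invasive", na.rm = TRUE)

print(res[, c("input_name", "n_invasive")])
```

The same prefix-based selection should scale from three columns to the full set produced by `country = "all"`.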
Alien species first records (Seebens et al.)
The Global Alien Species First Record Database (Seebens et al. 2017)
records the year each alien species was first documented in a given
country or territory. Unlike GRIIS (which records current status), this
enrichment provides a historical timeline of alien species arrivals. The
dataset covers all taxa (plants, animals, fungi) with ~77,000
species-country combinations across 241 countries. The
country argument takes ISO 3166-1 alpha-2 codes, same as
add_invasive_status().
Single country
aliens <- taxify(c(
"Robinia pseudoacacia", "Ailanthus altissima",
"Impatiens glandulifera", "Quercus robur",
"Ambrosia artemisiifolia", "Solidago canadensis"
))
aliens |> add_alien_first_records(country = "AT")
#> input_name alien_first_record alien_first_record_source alien_first_record_reference
#> 1 Robinia pseudoacacia 1850 NOBANIS NOBANIS
#> 2 Ailanthus altissima 1870 NOBANIS NOBANIS
#> 3 Impatiens glandulifera 1900 NOBANIS NOBANIS
#> 4 Quercus robur NA <NA> <NA>
#> 5 Ambrosia artemisiifolia 1863 NOBANIS NOBANIS
#> 6 Solidago canadensis 1850 NOBANIS NOBANIS
Each row gets three columns: alien_first_record (the
year as an integer), alien_first_record_source (the
database that contributed this record, e.g., “NOBANIS”, “GAVIA”,
“FishBase”), and alien_first_record_reference (the original
citation). Native species like Quercus robur receive
NA because they are not in the alien first records
database.
The source and reference columns provide row-level provenance. This
matters because a second first-records source (GBIF occurrence-based
records) will be added in a future version, and the source
column will distinguish which database contributed each record.
Multiple countries
aliens |> add_alien_first_records(country = c("AT", "DE", "GB"))
#> input_name alien_first_record_AT alien_first_record_DE alien_first_record_GB ...
#> 1 Robinia pseudoacacia 1850 1630 1640 ...
#> 2 Ailanthus altissima 1870 1780 1751 ...
#> 3 Impatiens glandulifera 1900 1839 1855 ...
With multiple countries, each of the three value columns gets a
country suffix: alien_first_record_AT,
alien_first_record_source_AT,
alien_first_record_reference_AT, etc. This makes
cross-country comparisons of invasion history straightforward.
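One comparison the wide layout supports is finding where each species was first recorded. The sketch below is a base R illustration on a hand-built data.frame with the years from the example output; the `earliest_year` and `earliest_country` columns are hypothetical names. It assumes every row has at least one non-NA year, so rows that are entirely NA would need separate handling.

```r
# Toy stand-in for a multi-country add_alien_first_records() result.
first <- data.frame(
  input_name = c("Robinia pseudoacacia", "Ailanthus altissima",
                 "Impatiens glandulifera"),
  alien_first_record_AT = c(1850L, 1870L, 1900L),
  alien_first_record_source_AT = c("NOBANIS", "NOBANIS", "NOBANIS"),
  alien_first_record_DE = c(1630L, 1780L, 1839L),
  alien_first_record_GB = c(1640L, 1751L, 1855L),
  stringsAsFactors = FALSE
)

# Match only the year columns: the prefix followed by a two-letter code.
# This excludes the *_source_* and *_reference_* columns.
year_cols <- grep("^alien_first_record_[A-Z]{2}$", names(first), value = TRUE)
years <- as.matrix(first[year_cols])

# Earliest year per species and the country it belongs to.
first$earliest_year <- apply(years, 1, min, na.rm = TRUE)
first$earliest_country <- sub(
  "^alien_first_record_", "",
  year_cols[apply(years, 1, which.min)]
)

print(first[, c("input_name", "earliest_year", "earliest_country")])
```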
Reshaping to long format
When working with multiple countries, the wide format can be unwieldy
for modelling, mapping, or timeline analyses. The
taxify_long() helper reshapes any group-based enrichment
columns back to long format:
aliens |>
add_alien_first_records(country = c("AT", "DE", "GB")) |>
taxify_long()
#> input_name country_code alien_first_record alien_first_record_source ...
#> 1 Robinia pseudoacacia AT 1850 NOBANIS ...
#> 2 Ailanthus altissima AT 1870 NOBANIS ...
#> ...
#> 7 Robinia pseudoacacia DE 1630 Long (2003) ...
#> ...
When cols and group_col are omitted,
taxify_long() auto-detects them from metadata stamped by
the add_*() functions. The result has one row per species
per country, with the base column names (no suffix) and a new
country_code column. The drop_na = TRUE
argument removes rows where all value columns are NA (e.g.,
native species with no alien first record in any queried country).
taxify_long() works with any group-based enrichment, not
just alien first records. It can reshape invasive_status,
native_status, or common_name columns just as
easily:
aliens |>
add_invasive_status(country = c("AT", "DE")) |>
taxify_long()
When multiple grouped enrichments share the same group column, they
are reshaped together. If an enrichment covers different groups than
another (e.g., GRIIS for AT/DE but first records for AT/DE/CH), the
missing combinations are padded with NA:
aliens |>
add_invasive_status(country = c("AT", "DE")) |>
add_alien_first_records(country = c("AT", "DE", "CH")) |>
taxify_long()
You can still provide cols and group_col
explicitly to override auto-detection or to rename the group column.
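To make the wide-to-long transformation concrete, here is a base R sketch of the same idea on a toy data.frame. It is not how taxify_long() is implemented; it only illustrates the shape of the result (one row per species per country, suffix-free value column, a `country_code` column) under the assumption that the country codes are known in advance.

```r
# Toy wide table with one suffixed value column per country.
wide <- data.frame(
  input_name = c("Robinia pseudoacacia", "Ailanthus altissima"),
  alien_first_record_AT = c(1850L, 1870L),
  alien_first_record_DE = c(1630L, 1780L),
  stringsAsFactors = FALSE
)

codes <- c("AT", "DE")

# Build one block per country, then stack the blocks.
long <- do.call(rbind, lapply(codes, function(code) {
  data.frame(
    input_name = wide$input_name,
    country_code = code,
    alien_first_record = wide[[paste0("alien_first_record_", code)]],
    stringsAsFactors = FALSE
  )
}))

# Rough analogue of dropping rows whose value column is NA.
long <- long[!is.na(long$alien_first_record), ]

print(long)
```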
Native range by botanical region (WCVP)
The World Checklist of Vascular Plants (WCVP) from the Royal Botanic Gardens, Kew, classifies ~340,000 plant species as native, introduced, or extinct in TDWG Level 2 botanical regions. TDWG (Taxonomic Databases Working Group, now TDWG Biodiversity Information Standards) defined a hierarchical system of geographic regions for recording plant distributions. Level 2 regions are continent-scale units.
The region argument takes TDWG Level 2 codes. Common
codes include:
- EUR (Europe)
- NAM (Northern America)
- SAM (Southern America)
- AFR (Africa)
- AUS (Australasia)
- ASI (Asia-Temperate)
- AST (Asia-Tropical)
- PAC (Pacific)
- ANT (Antarctica)
trees <- taxify(c(
"Quercus robur", "Quercus suber", "Eucalyptus globulus",
"Nothofagus pumilio", "Sequoiadendron giganteum"
))
trees |> add_wcvp(region = "EUR")
#> input_name native_status
#> 1 Quercus robur native
#> 2 Quercus suber native
#> 3 Eucalyptus globulus NA
#> 4 Nothofagus pumilio NA
#> 5 Sequoiadendron giganteum NA
Eucalyptus globulus returns NA for Europe
because it is native to Australia, not because it is absent from the
WCVP dataset. The dataset records where a species is natively
distributed, not where it has been planted or naturalized. This is an
important distinction: many cultivated species will show NA
in regions where they are widespread in gardens and plantations.
Querying multiple regions reveals each species’ native continental range:
trees |> add_wcvp(region = c("EUR", "AUS", "SAM"))
#> input_name native_status_EUR native_status_AUS native_status_SAM
#> 1 Quercus robur native NA NA
#> 2 Quercus suber native NA NA
#> 3 Eucalyptus globulus NA native NA
#> 4 Nothofagus pumilio NA NA native
#> 5 Sequoiadendron giganteum NA NA NA
Sequoiadendron giganteum (Giant Sequoia) returns
NA for all three regions because it is native to western
North America (NAM), which was not included in the query.
This illustrates that NA across every queried region does
not mean the species lacks native range data; it may simply mean we did
not ask about the right region.
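The suffixed columns can be collapsed into a single per-species summary of native regions. The sketch below works on a hand-built data.frame mirroring the multi-region output above; `native_regions` is a hypothetical column name, and an NA in it means only that none of the queried regions matched.

```r
# Toy stand-in for a multi-region add_wcvp() result.
wcvp <- data.frame(
  input_name = c("Quercus robur", "Eucalyptus globulus",
                 "Sequoiadendron giganteum"),
  native_status_EUR = c("native", NA, NA),
  native_status_AUS = c(NA, "native", NA),
  native_status_SAM = c(NA_character_, NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

region_cols <- grep("^native_status_", names(wcvp), value = TRUE)
regions <- sub("^native_status_", "", region_cols)

# TRUE where the species is recorded as native; NA cells become FALSE.
is_native <- !is.na(wcvp[region_cols]) & wcvp[region_cols] == "native"

# Collapse to a comma-separated list of queried regions per species.
wcvp$native_regions <- apply(is_native, 1, function(x) {
  if (any(x)) paste(regions[x], collapse = ", ") else NA_character_
})

print(wcvp[, c("input_name", "native_regions")])
```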
The full list of available TDWG codes can be retrieved
programmatically from the manifest (the available_groups
field for the wcvp enrichment), or from the TDWG geographic
standard documentation. WCVP is non-static: Kew updates the checklist
periodically, and the taxify enrichment is rebuilt when new versions are
published.
Naturalized alien flora by region (GloNAF)
The Global Naturalized Alien Flora (GloNAF) records which plant
species are naturalized in ~1,300 regions worldwide. Unlike GRIIS (which
classifies species as native/introduced/invasive per country), GloNAF
provides a binary naturalization flag per region with finer geographic
resolution, using TDWG-compatible codes extended with dot notation for
sub-national units (e.g., "USA.CA" for California).
plants <- taxify(c(
"Robinia pseudoacacia", "Ailanthus altissima",
"Impatiens glandulifera", "Quercus robur"
))
plants |> add_glonaf(region = "EUR")
#> input_name naturalized
#> 1 Robinia pseudoacacia 1
#> 2 Ailanthus altissima 1
#> 3 Impatiens glandulifera 1
#> 4 Quercus robur NA
The output column naturalized is 1 if the
species is recorded as naturalized in the queried region, and
NA otherwise. Multiple regions produce suffixed columns
(naturalized_EUR, naturalized_NAM). The
region = "all" option expands to all ~1,300 regions.
GloNAF complements GRIIS: GRIIS provides the invasion status dimension (native/introduced/invasive), while GloNAF provides the geographic coverage dimension (where has this species established self-sustaining populations?). Combining both gives a fuller picture of alien plant distributions.
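As a rough illustration of combining the two dimensions, the sketch below cross-tabulates a GRIIS-style status column against a GloNAF-style flag. The data.frame is hand-built and the values are illustrative; in a real workflow the two enrichments use different geographic units (countries versus regions), so the pairing has to be chosen deliberately. `is_naturalized` is a hypothetical helper column.

```r
# Toy table with one GRIIS-style and one GloNAF-style column.
combined <- data.frame(
  input_name = c("Robinia pseudoacacia", "Ailanthus altissima", "Quercus robur"),
  invasive_status = c("invasive", "invasive", "native"),
  naturalized = c(1, 1, NA),
  stringsAsFactors = FALSE
)

# Turn the 1/NA flag into a logical; NA means "not recorded as naturalized".
combined$is_naturalized <- !is.na(combined$naturalized) &
  combined$naturalized == 1

# Cross-tabulate status against the naturalization flag.
xt <- table(
  status = combined$invasive_status,
  naturalized = combined$is_naturalized,
  useNA = "ifany"
)
print(xt)
```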
The dataset is licensed under CC BY 4.0. The references are van Kleunen et al. (2019), Ecology 100:e02542 (v1.0) and Davis et al. (2025), Ecology e70245 (v2.0). GloNAF is classified as static in the taxify manifest.
Common (vernacular) names (GBIF)
The common names enrichment draws on GBIF’s vernacular name database,
which aggregates names from many national and regional nomenclature
sources. It is the most multilingual of the enrichments, covering dozens
of languages. The lang argument takes ISO 639-1 two-letter
language codes.
species <- taxify(c(
"Quercus robur", "Parus major", "Vulpes vulpes",
"Bufo bufo", "Picea abies"
))
species |> add_common_names()
#> input_name common_name
#> 1 Quercus robur Pedunculate Oak
#> 2 Parus major Great Tit
#> 3 Vulpes vulpes Red Fox
#> 4 Bufo bufo Common Toad
#> 5 Picea abies Norway Spruce
The default language is English (lang = "en"). Switching
to another language is a matter of changing the lang
argument:
species |> add_common_names(lang = "de")
#> input_name common_name
#> 1 Quercus robur Stieleiche
#> 2 Parus major Kohlmeise
#> 3 Vulpes vulpes Rotfuchs
#> 4 Bufo bufo Erdkroete
#> 5 Picea abies Gemeine Fichte
When multiple common names exist for a species in the requested
language, the first (most commonly used) entry is returned. Coverage
varies substantially by language: English and German have the broadest
coverage (most European and widespread species have entries). French,
Spanish, Portuguese, and Dutch also have good coverage. Less widely
spoken languages or languages with limited digital biodiversity
infrastructure may have gaps, resulting in NA for species
that do have common names in those languages but that have not been
digitized in GBIF’s aggregation.
The common names enrichment is non-static (GBIF updates its backbone
periodically) and licensed under CC0. When a single language is
requested, the output column is common_name. If multiple
languages were supported in a single call, they would follow the
group-based suffix pattern, but in practice the common usage pattern is
one language per call.
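Since the usual pattern is one language per call, a fallback across languages can be assembled after the fact. The sketch below assumes the names from two separate calls have been extracted into plain character vectors; the gap for Parus major is hypothetical and inserted only to show the fallback.

```r
# Common names from two hypothetical calls, aligned by species.
species <- c("Quercus robur", "Parus major", "Vulpes vulpes")
name_de <- c("Stieleiche", NA, "Rotfuchs")            # NA is a made-up gap
name_en <- c("Pedunculate Oak", "Great Tit", "Red Fox")

# Prefer the German name, fall back to English where German is missing.
label <- ifelse(is.na(name_de), name_en, name_de)

print(data.frame(species, label, stringsAsFactors = FALSE))
```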
Stacking enrichments
The add_*() functions return the same data.frame class
(taxify_result) they receive, preserving all attributes
including the metadata used by summary(). This means they
compose naturally with the pipe operator. A typical workflow chains
taxify() with several enrichment calls, building up the
desired set of columns incrementally.
library(taxify)
plant_result <- taxify(c(
"Quercus robur", "Fagus sylvatica", "Picea abies",
"Arrhenatherum elatius", "Festuca rubra", "Plantago lanceolata"
)) |>
add_conservation_status() |>
add_woodiness() |>
add_eive() |>
add_diaz_traits()
Each enrichment appends its columns to the right of the data.frame.
The result of this chain has the original 16 taxify columns plus
conservation_status, woodiness, the five EIVE
columns (eive_light, eive_temperature,
eive_moisture, eive_reaction,
eive_nutrients), and the two Diaz columns
(seed_mass_mg, plant_height_m). That is 25
columns total. Order within the chain does not affect the output because
each enrichment operates independently on the accepted_name
column. The only case where order matters is the column-name collision
between LEDA and Diaz (seed_mass_mg), discussed
earlier.
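A quick way to guard against a column-name collision is to compare names before attaching a second source, and to rename the existing column if both values should be kept. The sketch below uses two hand-built data.frames with illustrative values; the `incoming` table and the `_diaz` suffix are assumptions for the example, not part of taxify.

```r
# Toy result that already carries Diaz-style columns (values illustrative).
result <- data.frame(
  accepted_name = c("Quercus robur", "Fagus sylvatica"),
  seed_mass_mg = c(3500, 200),
  plant_height_m = c(30, 35),
  stringsAsFactors = FALSE
)

# Toy second source that would contribute a column of the same name.
incoming <- data.frame(
  canonical_name = c("Quercus robur", "Fagus sylvatica"),
  seed_mass_mg = c(3400, 210),
  stringsAsFactors = FALSE
)

# Columns that would collide, ignoring the join key.
clash <- intersect(names(result), setdiff(names(incoming), "canonical_name"))
print(clash)

# Keep the existing values under a source-specific name.
idx <- match(clash, names(result))
names(result)[idx] <- paste0(clash, "_diaz")
print(names(result))
```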
Here is a similar chain for birds, combining morphological measurements with diet data and vernacular names:
bird_result <- taxify(c(
"Parus major", "Cyanistes caeruleus", "Erithacus rubecula",
"Turdus merula", "Falco peregrinus"
)) |>
add_conservation_status() |>
add_avonet() |>
add_elton_traits() |>
add_common_names()
This produces 16 (base) + 1 (conservation) + 11 (AVONET) + 18
(EltonTraits) + 1 (common name) = 47 columns. Both AVONET and
EltonTraits contribute body mass, but in distinct columns
(avonet_body_mass_g and elton_body_mass_g), so
there is no overwriting.
And for mammals, combining life-history traits from PanTHERIA with diet data from EltonTraits and German common names:
mammal_result <- taxify(c(
"Vulpes vulpes", "Canis lupus", "Ursus arctos",
"Lutra lutra", "Lynx lynx"
)) |>
add_conservation_status() |>
add_pantheria() |>
add_elton_traits() |>
add_common_names(lang = "de")
Both EltonTraits and PanTHERIA cover mammals, so both contribute data to the mammal chain. EltonTraits provides diet composition percentages and foraging strata; PanTHERIA provides life-history traits like longevity, litter size, and home range. The combination gives a multidimensional view of each species’ ecology without any manual data assembly.
The same pattern works for fungi, combining lifestyle traits from FungalTraits with guild classifications from FUNGuild:
fungal_result <- taxify(c(
"Amanita muscaria", "Boletus edulis", "Trametes versicolor"
)) |>
add_conservation_status() |>
add_fungal_traits() |>
add_funguild()
FungalTraits provides the detailed ecological traits (lifestyle,
growth form, substrate, mycorrhizal exploration type), while FUNGuild
adds the trophic mode and guild classification. The two enrichments have
complementary column sets, so there is no overwriting except for growth
form, which is distinguished by the funguild_growth_form
prefix in FUNGuild. The confidence_ranking column from
FUNGuild is a useful quality filter: restricting to “Highly Probable”
assignments before downstream analysis reduces noise.
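The confidence filter can be sketched in base R as follows. The data.frame is hand-built and the rankings assigned to each species are illustrative, not actual FUNGuild values; only the `confidence_ranking` column name comes from the text above.

```r
# Toy stand-in for a result with a FUNGuild confidence column.
guilds <- data.frame(
  input_name = c("Amanita muscaria", "Boletus edulis", "Trametes versicolor"),
  confidence_ranking = c("Highly Probable", "Probable", NA),
  stringsAsFactors = FALSE
)

# %in% drops both lower rankings and NA (no FUNGuild entry) in one step.
high <- guilds[guilds$confidence_ranking %in% "Highly Probable", ]

print(high)
```

Using `%in%` rather than `==` avoids rows of NA appearing in the subset when a species has no guild assignment.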
Fish analyses can similarly combine morphological and ecological enrichments:
fish_result <- taxify(c(
"Salmo trutta", "Esox lucius", "Gadus morhua"
)) |>
add_conservation_status() |>
add_fish_traits() |>
add_fishbase()
FISHMORPH provides the morphological ratios (body elongation, fin
shape, eye size) that define the ecomorphological profile of each
species, while FishBase adds the ecological and life-history context
(trophic level, depth range, vulnerability). Note that Gadus
morhua (Atlantic Cod) will have NA values in all
FISHMORPH columns because FISHMORPH covers freshwater species only, but
it will be fully populated by FishBase. This kind of partial coverage
across complementary enrichments is expected and easy to diagnose from
the summary() output.
Enrichment chains can be as long as needed. Performance is linear in
the number of enrichments: each add_*() call performs one
join, regardless of how many enrichments have already been applied. A
chain of 10 enrichments on a 50,000-row result completes in seconds. The
enrichment files themselves are loaded via vectra’s memory-mapped I/O,
so even enrichments with hundreds of thousands of rows (like WCVP at
~340,000) do not consume large amounts of RAM.
The pipe chain pattern also plays well with reproducibility
workflows. The entire analysis, from raw species list to fully enriched
table, is captured in a single, readable expression. Saving this code
alongside the session info (including taxify version and enrichment
versions from summary()) gives a complete record of which
data was used to produce the results.
Coverage patterns
Not all species appear in all enrichments. Each dataset has a
taxonomic scope (plants, birds, mammals, amphibians, vertebrates,
butterflies, arthropods, fungi, algae, fish, lizards, or cross-taxon)
and a geographic scope (global, European, NW European). When an
enrichment has no data for a species, the corresponding columns contain
NA. Understanding coverage patterns is essential for
interpreting enriched results correctly and for choosing which
enrichments to apply.
mixed <- taxify(c(
"Quercus robur", # plant
"Parus major", # bird
"Vulpes vulpes", # mammal
"Bufo bufo", # amphibian
"Amanita muscaria", # fungus
"Salmo trutta" # fish
))
mixed |>
add_woodiness() |>
add_avonet() |>
add_pantheria() |>
add_amphibio() |>
add_fungal_traits() |>
add_fishbase()
#> input_name woodiness beak_length pantheria_body_mass_g body_size_mm primary_lifestyle fb_trophic_level ...
#> 1 Quercus robur woody NA NA NA <NA> NA ...
#> 2 Parus major NA 11.2 NA NA <NA> NA ...
#> 3 Vulpes vulpes NA NA 5480.0 NA <NA> NA ...
#> 4 Bufo bufo NA NA NA 150.0 <NA> NA ...
#> 5 Amanita muscaria NA NA NA NA ectomycorrhizal NA ...
#> 6 Salmo trutta NA NA NA NA <NA> 3.4 ...
Each species populates only the columns from enrichments that cover
its taxon group. The NA values are not errors or data
quality problems; they reflect the scope of the underlying datasets.
Quercus robur has a woodiness value but no beak length, body
mass, body size, fungal traits, or fish data. Amanita muscaria
has a primary lifestyle but no plant, bird, mammal, amphibian, or fish
traits. Salmo trutta has FishBase data but nothing from the
other taxon-specific enrichments. This is expected behavior.
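Per-column coverage on a mixed result can be computed directly from the NA pattern. The sketch below uses a hand-built data.frame with a subset of the rows and columns from the output above; it makes no call to taxify.

```r
# Toy excerpt of the mixed-taxon result shown above.
mixed <- data.frame(
  input_name = c("Quercus robur", "Parus major", "Vulpes vulpes"),
  woodiness = c("woody", NA, NA),
  beak_length = c(NA, 11.2, NA),
  pantheria_body_mass_g = c(NA, NA, 5480),
  stringsAsFactors = FALSE
)

# Every column except the name column is treated as a trait column.
trait_cols <- setdiff(names(mixed), "input_name")

# Fraction of rows with a non-NA value, per trait column.
coverage <- colMeans(!is.na(mixed[trait_cols]))

print(round(coverage, 2))
```

In a mixed-taxon table, low per-column fractions like these are what the taxonomic scope of each dataset would lead us to expect.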
Approximate coverage rates by enrichment
The following table summarizes the approximate species coverage of each enrichment, its taxonomic scope, and its geographic scope. Numbers are approximate because enrichments are updated periodically and because coverage depends somewhat on the backbone used (different backbones accept slightly different sets of names).
| Enrichment | Taxon scope | Geographic scope | ~Species |
|---|---|---|---|
| conservation_status | all groups | global | 166,000 |
| woodiness | plants | global | 50,000 |
| eive | plants | European | 14,500 |
| diaz_traits | plants | global | 46,000 |
| leda | plants | NW European | 8,000 |
| elton_traits | birds + mammals | global | 15,400 |
| avonet | birds | global | 11,000 |
| pantheria | mammals | global | 5,400 |
| amphibio | amphibians | global | 6,800 |
| fungal_traits | fungi | global | 10,200 genera |
| funguild | fungi | global | 13,000 |
| algae_traits | macroalgae | European | 1,745 |
| fish_traits | freshwater fish | global | 8,300 |
| fishbase | all fish | global | 35,000 |
| lizard_traits | lizards | global | 6,600 |
| anage | vertebrates | global | 4,700 |
| animaltraits | cross-taxon (arthropods+) | global | 2,000 |
| leptraits | butterflies | global | 12,400 |
| arthropod_traits | arthropods | NW European | 4,900 |
| griis | all groups | per country | 23,000 combos |
| glonaf | plants | global by region | 16,000 × 1,300 |
| wcvp | plants | global by region | 340,000 |
| common_names | all groups | multi-language | varies |
For a European plant survey, the enrichment with the highest absolute coverage is WCVP (~340,000 species), followed by conservation status (~166,000), woodiness (~50,000), Diaz traits (~46,000), EIVE (~14,500), and LEDA (~8,000). However, for a specifically NW European dataset, LEDA’s ~8,000 species may actually cover a larger fraction of the species list than the Diaz dataset, because LEDA is geographically focused on the same region.
Interpreting NA columns
When an entire column is NA for all rows in a result,
the most likely explanation is a taxon-scope mismatch. Woodiness covers
vascular plants, so a bird dataset will have NA in every
row of that column. The reverse holds for AVONET against a plant list,
or PanTHERIA against amphibians. This is expected behavior: the
enrichment data simply does not include species from that taxon
group.
A partially populated column (some rows NA, others
filled) means the enrichment covers the taxon group but the specific
species is not in the source dataset. Common reasons for per-species
NA include:
- Source dataset incomplete. No trait database covers 100% of described species. PanTHERIA covers ~5,400 of the ~6,500 described mammal species; roughly 1,100 mammals will have NA values.
- Recently described species. Species described or split after the dataset’s publication date will be absent. PanTHERIA (2009) misses all species described since 2009.
- Name alignment failure. Rare, but possible for taxa with ongoing taxonomic revisions where the backbone and the enrichment source use different name variants that the cross-backbone resolution did not capture. If a species consistently fails to match, filing a GitHub issue helps us improve the name alignment pipeline.
- Infraspecific taxa. Most enrichments operate at the species level. If the taxify result contains subspecies or varieties (e.g., “Quercus robur subsp. robur”), the enrichment may not have a matching entry at that rank.
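The distinction between a fully empty column and a partially populated one can be checked mechanically. The following sketch defines a hypothetical helper, `classify_coverage()`, and applies it to a hand-built bird table with illustrative values; the labels it returns are suggestions for where to look, not diagnoses.

```r
# Hypothetical helper: label each column by its NA pattern.
classify_coverage <- function(df, cols) {
  frac <- colMeans(!is.na(df[cols]))
  ifelse(
    frac == 0, "all NA (likely scope mismatch)",
    ifelse(frac == 1, "complete", "partial (per-species gaps)")
  )
}

# Toy bird result: woodiness cannot apply, body mass has one gap.
birds <- data.frame(
  input_name = c("Parus major", "Turdus merula", "Falco peregrinus"),
  conservation_status = c("LC", "LC", "LC"),
  woodiness = c(NA_character_, NA_character_, NA_character_),
  avonet_body_mass_g = c(17.5, NA, 780),
  stringsAsFactors = FALSE
)

out <- classify_coverage(
  birds, c("conservation_status", "woodiness", "avonet_body_mass_g")
)
print(out)
```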
It is worth noting that coverage is not the same as data quality. An
enrichment might cover 95% of the species in a result, but the trait
values for some of those species could be based on few measurements,
extrapolated from congeners, or derived from captive rather than wild
populations. The enrichment system does not expose confidence intervals
or sample sizes for individual trait values; that level of detail lives
in the original source databases. For analyses that require
measurement-level metadata (sample size, measurement uncertainty,
geographic origin of measurements), consult the original source cited on
the add_*() help page.
To check the overall enrichment rate for a result, the
summary() output reports the number of matched rows per
enrichment. We can also compute it directly:
result <- taxify(species_list) |> add_woodiness()
# Fraction of matched species with woodiness data
mean(!is.na(result$woodiness[!is.na(result$accepted_name)]))
The enrichment register in summary() output
Every add_*() call records metadata about the enrichment
in an attribute (taxify_meta) on the result data.frame.
This metadata includes the enrichment name, source label, version
string, and the count of rows that received non-NA trait
values. Calling summary() on a taxify result displays this
information alongside the standard match statistics, providing a compact
overview of the entire analysis pipeline.
result <- taxify(c("Quercus robur", "Fagus sylvatica", "Pinus sylvestris")) |>
add_conservation_status() |>
add_woodiness() |>
add_eive()
summary(result)
#> -- taxify results --------------------------------------------------------
#> backend: WFO v2024.12 | 3 names submitted
#>
#> matched 3 (exact: 3, case-insensitive: 0, fuzzy: 0)
#> --------------------------------------------------------
#> taxon groups: plant: 3
#>
#> enrichments:
#> conservation_status (IUCN Red List 2025.1) -- 3 of 3 matched
#> woodiness (Zanne et al. 2014 1.0) -- 3 of 3 matched
#> eive (EIVE 1.0 2023.1) -- 3 of 3 matched
The enrichment register lists each applied enrichment with its source name, version, and the fraction of successfully matched names. In this example, all three enrichments achieved 100% coverage (3 of 3 matched), which is expected for well-known European tree species. On a larger dataset with a broader taxonomic scope, we would typically see lower fractions, especially for enrichments with narrow geographic or taxonomic coverage.
The register is cumulative: applying more enrichments adds more
lines. This makes summary() a useful diagnostic at the end
of a pipe chain. If one enrichment shows unexpectedly low coverage
(e.g., “2 of 500 matched” for EIVE on a dataset that we expected to be
European plants), it signals a problem worth investigating. Common
causes include the species list containing non-plant taxa, non-European
species, or names at ranks other than species.
The version strings in the register provide exact provenance
information for the methods section of a paper. Rather than writing “we
used the IUCN Red List” (which version? downloaded when?), we can report
the version string directly from summary() (e.g., “IUCN Red
List v2025.1 as distributed by taxify enrichment conservation_status
v2025.04”).
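A provenance sentence for a methods section can be assembled from the register entries. The sketch below assumes the register has been copied by hand into a data.frame with `enrichment`, `source`, and `version` columns; that layout is an assumption for illustration and may not match how taxify stores its metadata internally. The values are taken from the summary() output shown above.

```r
# Hand-copied register entries (layout is hypothetical).
register <- data.frame(
  enrichment = c("conservation_status", "woodiness", "eive"),
  source = c("IUCN Red List", "Zanne et al. 2014", "EIVE 1.0"),
  version = c("2025.1", "1.0", "2023.1"),
  stringsAsFactors = FALSE
)

# One "name (source, version x)" fragment per enrichment.
parts <- sprintf(
  "%s (%s, version %s)",
  register$enrichment, register$source, register$version
)

sentence <- paste0(
  "Trait data were attached with taxify: ",
  paste(parts, collapse = "; "), "."
)
cat(sentence, "\n")
```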
Practical guidance: which enrichments for which taxa
The choice of enrichments depends on the taxonomic scope and geographic focus of the analysis. Below are recommended enrichment stacks for common use cases, with brief notes on what each enrichment contributes.
Vascular plants (European)
European plant ecology benefits from the richest set of enrichments. A full stack combines conservation status, growth form, mycorrhizal type, ecological niche position, global functional traits, regional functional traits, native range, and vernacular names.
result <- taxify(species_list, backend = "wfo") |>
add_conservation_status() |>
add_woodiness() |>
add_fungalroot() |>
add_eive() |>
add_diaz_traits() |>
add_leda() |>
add_wcvp(region = "EUR") |>
add_common_names()
EIVE and LEDA both cover European plants, but their trait columns are complementary. EIVE provides niche position along five environmental gradients (where does this species grow?). LEDA provides morphological and dispersal traits (what does this species look like, how does it disperse?). The combination produces a detailed functional profile suitable for community-weighted mean analyses, trait-based ordinations, and functional diversity calculations. The Diaz traits add a global perspective on seed mass and plant height that complements LEDA’s regional measurements.
The seed_mass_mg collision between LEDA and Diaz was
discussed earlier. In this stack, Diaz runs before LEDA, so the LEDA
seed_mass_mg will be the value in the final result. If the
Diaz value is preferred, reverse the order or omit
add_leda().
Vascular plants (global)
Outside Europe, EIVE and LEDA coverage drops to near zero. The global plant stack relies on the wider-coverage datasets and omits the regional enrichments.
result <- taxify(species_list, backend = "wfo") |>
add_conservation_status() |>
add_woodiness() |>
add_fungalroot() |>
add_diaz_traits() |>
add_wcvp(region = c("NAM", "SAM", "AFR")) |>
add_common_names()
Woodiness, Diaz traits, and FungalRoot mycorrhizal type all have global coverage, so they contribute useful data regardless of the geographic origin of the species list. WCVP can be queried for any TDWG region, providing native range information for the continents relevant to the study.
Birds
Birds are covered by two complementary enrichments: AVONET for morphology and migration, EltonTraits for diet and foraging. Together they provide a detailed functional profile spanning body plan, habitat use, dietary niche, and movement ecology.
result <- taxify(species_list, backend = "col") |>
add_conservation_status() |>
add_avonet() |>
add_elton_traits() |>
add_common_names()
Both AVONET and EltonTraits include body mass, stored in separate
columns (avonet_body_mass_g from specimen measurements and
elton_body_mass_g from literature compilation). Small
discrepancies between the two are expected. Large discrepancies for a
species can be informative: they may indicate measurement error in one
source, or a sexually dimorphic species for which the two sources
sampled different sexes.
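One way to screen for such discrepancies is a log ratio of the two mass columns with a tolerance threshold. The sketch below uses a hand-built data.frame whose mass values are invented for illustration; the 25% tolerance and the `log_ratio` and `flag` columns are arbitrary assumptions, not a recommendation from the source datasets.

```r
# Toy table with two body mass columns (values are invented).
birds <- data.frame(
  input_name = c("Parus major", "Cyanistes caeruleus", "Falco peregrinus"),
  avonet_body_mass_g = c(17.5, 10.8, 780),
  elton_body_mass_g = c(17.0, 11.0, 1100),
  stringsAsFactors = FALSE
)

# Symmetric measure of disagreement: log of the ratio of the two sources.
birds$log_ratio <- log(birds$avonet_body_mass_g / birds$elton_body_mass_g)

# Flag species whose sources differ by more than a factor of 1.25.
birds$flag <- abs(birds$log_ratio) > log(1.25)

print(birds[, c("input_name", "log_ratio", "flag")])
```

A log ratio treats over- and underestimates symmetrically, which a simple difference in grams would not do across species of very different size.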
Mammals
Mammals are covered by PanTHERIA (life-history traits) and
EltonTraits (diet and foraging behavior). The combination provides body
mass from two independent sources (pantheria_body_mass_g
and elton_body_mass_g), which can serve as a
cross-validation of the mass data.
result <- taxify(species_list, backend = "col") |>
add_conservation_status() |>
add_pantheria() |>
add_elton_traits() |>
add_common_names()
PanTHERIA contributes life-history variables that EltonTraits does not cover (longevity, litter size, gestation, weaning age, home range, diet breadth, habitat breadth). EltonTraits contributes the detailed diet composition percentages and foraging stratum data that PanTHERIA does not provide. There is no redundancy except body mass.
Amphibians
AmphiBIO is the sole dedicated amphibian enrichment. It can be combined with conservation status and common names.
result <- taxify(species_list, backend = "col") |>
add_conservation_status() |>
add_amphibio() |>
add_common_names()
Amphibians are the most threatened vertebrate class, with roughly 40% of assessed species listed in threatened categories (VU, EN, or CR) according to the IUCN. Conservation status is therefore particularly informative for amphibian analyses. The combination of AmphiBIO habitat traits (aquatic, fossorial, arboreal) with IUCN status can reveal associations between habitat specialization and extinction risk.
Fish
Fish are covered by two complementary enrichments: FISHMORPH for morphological traits of freshwater species, and FishBase for ecological and life-history traits across all fish (freshwater + marine). The WoRMS backend provides authoritative taxonomy for marine fish; COL and GBIF cover both freshwater and marine species.
result <- taxify(species_list, backend = "worms") |>
add_conservation_status() |>
add_fish_traits() |>
add_fishbase() |>
add_common_names()
For freshwater fish community studies, both enrichments contribute
data. FISHMORPH provides the ecomorphological ratios used in functional
diversity calculations, while FishBase adds trophic level, depth range,
and vulnerability. For marine fish studies, only FishBase will
contribute data (FISHMORPH covers freshwater species only). The
fb_vulnerability column from FishBase is particularly
useful alongside IUCN conservation status for prioritizing species in
fisheries management and marine spatial planning.
Reptiles (lizards)
Lizards are covered by the Meiri lizard traits enrichment, which provides life-history and ecological traits for ~6,600 species. Combined with conservation status, it gives a functional profile suitable for reptile community analyses and conservation assessments.
result <- taxify(species_list, backend = "col") |>
add_conservation_status() |>
add_lizard_traits() |>
add_common_names()
Snakes and turtles are not covered by the lizard enrichment. For
those groups, the add_data() function can join custom trait
datasets. For cross-class longevity and metabolic comparisons,
add_anage() covers reptiles alongside mammals, birds,
amphibians, and fish.
Butterflies
LepTraits is the dedicated butterfly enrichment, providing wingspan, voltinism, habitat affinities, and host plant data for ~12,400 species globally. For European butterfly ecology, it can be combined with the NW European Arthropod traits for additional life-history variables.
result <- taxify(species_list, backend = "col") |>
add_conservation_status() |>
add_leptraits() |>
add_common_names()
Arthropods (NW European)
For arthropod community studies in NW Europe, the arthropod traits enrichment provides the most comprehensive trait coverage. It can be combined with AnimalTraits for cross-taxon body mass comparisons and with LepTraits for additional butterfly-specific traits.
result <- taxify(species_list, backend = c("col", "gbif")) |>
add_conservation_status() |>
add_arthropod_traits() |>
add_animaltraits() |>
add_common_names()
For arthropod studies outside NW Europe, AnimalTraits provides body mass for ~1,700 arthropod species globally, though with fewer trait dimensions than the Logghe et al. dataset.
Fungi
Fungi are covered by two enrichments: FungalTraits for genus-level ecological traits and FUNGuild for trophic guild classifications. The COL and GBIF backends provide fungal taxonomy.
result <- taxify(species_list, backend = "col") |>
add_conservation_status() |>
add_fungal_traits() |>
add_funguild() |>
add_common_names()
FungalTraits provides the lifestyle, growth form, and interaction
capability traits that describe what each genus does ecologically.
FUNGuild adds the trophic mode and guild classification used in fungal
community ecology. The confidence_ranking column from
FUNGuild allows filtering to high-confidence assignments, which is
important for quantitative analyses where guild misclassification would
introduce systematic bias.
Macroalgae (European)
European macroalgae are covered by AlgaeTraits, which provides morphological and ecological traits for ~1,745 species. The WoRMS backend is recommended for marine algae taxonomy.
result <- taxify(species_list, backend = "worms") |>
add_conservation_status() |>
add_algae_traits() |>
add_common_names()
AlgaeTraits is geographically scoped to European coastlines. For
non-European macroalgae studies, the add_data() function
can join custom datasets. The key advantage of AlgaeTraits over general
plant trait databases is that it provides marine-specific traits (tidal
zone, wave exposure, calcification) that are not captured by terrestrial
plant databases like LEDA or EIVE.
Mixed-taxon datasets
When a dataset spans multiple taxon groups (e.g., a biodiversity survey with plants, birds, and mammals), there are two strategies.
The first is to apply all relevant enrichments to the full result and
accept NA values where taxonomic scope does not
overlap:
result <- taxify(species_list) |>
  add_conservation_status() |>
  add_woodiness() |>
  add_avonet() |>
  add_pantheria() |>
  add_amphibio() |>
  add_elton_traits()
This produces a wide data.frame where most cells in any given row are
NA (a plant row has woodiness data but NA for
beak length, body mass, etc.). The advantage is simplicity: one
data.frame, one summary() call, no manual splitting.
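To see how sparse such a wide table is, per-column coverage can be computed directly. The following is a minimal base R sketch on a mock data.frame; the trait column names and values are hypothetical stand-ins, not the actual columns produced by the add_*() functions.

```r
# Mock wide result: one plant, one bird, one mammal.
# Trait column names and values are hypothetical.
wide <- data.frame(
  accepted_name = c("Quercus robur", "Parus major", "Vulpes vulpes"),
  kingdom = c("Plantae", "Animalia", "Animalia"),
  woodiness = c("W", NA, NA),
  beak_length = c(NA, 11.2, NA),
  body_mass_g = c(NA, NA, 5500),
  stringsAsFactors = FALSE
)

trait_cols <- c("woodiness", "beak_length", "body_mass_g")

# Share of rows with a non-missing value, per trait column
coverage <- colMeans(!is.na(wide[trait_cols]))

# The same share broken down by kingdom, which separates structural
# gaps (a plant has no beak) from genuine data gaps
coverage_by_kingdom <- aggregate(
  !is.na(wide[trait_cols]),
  by = list(kingdom = wide$kingdom),
  FUN = mean
)
```

A column whose coverage is zero within a group is structurally missing for that group; low but non-zero coverage points to a gap in the source dataset instead.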
The second strategy is to split the result by taxon group, enrich each subset with the appropriate stack, and recombine:
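result <- taxify(species_list)
# bird_families and mammal_families are user-supplied character vectors of family names.
# %in% is used throughout so that rows with NA in kingdom or family are dropped,
# not returned as all-NA rows.
plants <- result[result$kingdom %in% "Plantae", ]
birds <- result[result$family %in% bird_families, ]
mammals <- result[result$family %in% mammal_families, ]
# Enrich each subset with the layers that apply to it
plants <- plants |> add_woodiness() |> add_eive()
birds <- birds |> add_avonet() |> add_elton_traits()
mammals <- mammals |> add_pantheria() |> add_elton_traits()
The second approach avoids wide data.frames with many NA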
columns and produces cleaner trait matrices for downstream analyses that
treat columns as features (ordination, clustering, machine learning).
The disadvantage is more code and the need to maintain separate
data.frames for each group.
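If the enriched subsets need to be recombined into one table afterwards, their columns will differ, so a plain rbind() fails. One way to handle this is a small helper that fills missing columns with NA before binding. This is a base R sketch; bind_fill() is a hypothetical helper, not part of taxify, and the mock subsets use made-up column names.

```r
# Hypothetical helper: row-bind data.frames with different column sets,
# filling columns that a subset lacks with NA.
bind_fill <- function(...) {
  parts <- list(...)
  all_cols <- unique(unlist(lapply(parts, names)))
  filled <- lapply(parts, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))
    }
    df[all_cols]  # same column order in every part
  })
  do.call(rbind, filled)
}

# Mock enriched subsets with illustrative columns
plants <- data.frame(accepted_name = "Quercus robur", woodiness = "W",
                     stringsAsFactors = FALSE)
birds <- data.frame(accepted_name = "Parus major", beak_length = 11.2,
                    stringsAsFactors = FALSE)

combined <- bind_fill(plants, birds)
```

The recombined table is as wide as the result of the first strategy, so this is mainly useful for reporting after the per-group analyses have been run on the clean subsets.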
The choice depends on the analysis goal. For a summary table in a paper (e.g., “species, conservation status, key traits”), the first approach works well. For functional diversity calculations or trait-based models that expect a complete trait matrix, the second approach typically produces better inputs because it avoids rows with structurally missing values (values that are missing by design, not by data limitation).
A middle ground is to apply add_conservation_status()
and add_common_names() to the full dataset (since both
cover all taxon groups), then split by group for the taxon-specific
enrichments. This gives us conservation data and vernacular names for
every species in a single data.frame, while keeping the taxon-specific
trait matrices clean.
Joining custom data
Beyond the built-in enrichments, add_data() joins any
external dataset to a taxify result. It accepts a file path (CSV,
CSV.GZ, XLSX, SQLite/DB, or VTR) or an in-memory data.frame. The
function first identifies the species name column, either from an
explicit species_col argument or automatically: it runs the first 10
rows of each character column through taxify() and selects the column
with the highest match rate. It then resolves those names through the
same backbone(s) used in the original taxify() call and joins on
accepted_id.
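The automatic column detection can be illustrated with a small base R sketch. It is not the package's implementation: a fixed vector of known names stands in for the taxify() call, and detect_species_col() is a hypothetical function written only to show the "highest match rate wins" idea.

```r
# Stand-in for a backbone lookup; the real detection runs names through taxify().
known_names <- c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica")

# Hypothetical helper: score each character column by the share of its
# first n values that match, and return the best-scoring column name.
detect_species_col <- function(df, known, n = 10) {
  char_cols <- names(df)[vapply(df, is.character, logical(1))]
  if (length(char_cols) == 0) stop("no character columns to test")
  rates <- vapply(char_cols, function(col) {
    mean(head(df[[col]], n) %in% known)
  }, numeric(1))
  names(rates)[which.max(rates)]
}

external <- data.frame(
  plot_id = c("A1", "A2", "B1"),
  species = c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica"),
  bark_thickness_mm = c(25, 15, 8),
  stringsAsFactors = FALSE
)

detected <- detect_species_col(external, known_names)
```

When a file contains several name-like columns (for example both an original and a corrected name), passing species_col explicitly avoids relying on the heuristic.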
result <- taxify(c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica"))
# From a CSV file (auto-detect species column)
result |> add_data("my_traits.csv")
# From a data.frame with explicit species column
my_traits <- data.frame(
  species = c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica"),
  bark_thickness_mm = c(25, 15, 8),
  shade_tolerance = c(0.6, 0.3, 0.8)
)
result |> add_data(my_traits, species_col = "species")
Because add_data() resolves names through the backbone
before joining, it handles synonyms correctly. If the external data uses
“Pinus abies” and the backbone resolves it to “Picea abies”, the join
still works. This is the recommended way to integrate local field data,
unpublished trait measurements, or datasets from sources not covered by
the built-in enrichments.
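The resolve-then-join idea can be shown without the package. In the sketch below a one-entry synonym table stands in for the backbone, the accepted name stands in for accepted_id as the join key, and the trait value is made up; resolve_names() is a hypothetical helper.

```r
# Stand-in for backbone resolution: synonym -> accepted name
synonyms <- c("Pinus abies" = "Picea abies")

# Hypothetical helper: replace known synonyms, pass other names through
resolve_names <- function(x) {
  out <- x
  hit <- x %in% names(synonyms)
  out[hit] <- unname(synonyms[x[hit]])
  out
}

# Matched result (mock) and external data that still uses the synonym
result <- data.frame(input_name = "Picea abies", accepted_name = "Picea abies",
                     stringsAsFactors = FALSE)
external <- data.frame(species = "Pinus abies", bark_thickness_mm = 12,
                       stringsAsFactors = FALSE)

# Joining on the raw name finds nothing; joining on the resolved name succeeds
naive <- merge(result, external, by.x = "accepted_name", by.y = "species",
               all.x = TRUE)
external$accepted_name <- resolve_names(external$species)
joined <- merge(result, external[c("accepted_name", "bark_thickness_mm")],
                by = "accepted_name", all.x = TRUE)
```

Here naive$bark_thickness_mm is NA while joined$bark_thickness_mm carries the value, which is the difference the backbone resolution makes.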
The cols argument can restrict which columns are joined
from the external data. If the external data has 50 columns but we only
need two, passing
cols = c("bark_thickness_mm", "shade_tolerance") avoids
cluttering the result with unwanted columns. The fuzzy
argument (default TRUE) enables fuzzy matching for names in
the external data that do not exact-match the backbone;
fuzzy_threshold controls the maximum allowed string
distance.
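As an illustration of threshold-based fuzzy matching, the sketch below uses Levenshtein distance from utils::adist(). The text does not state which distance metric or scale fuzzy_threshold uses, so both the metric and the max_dist value are assumptions, and fuzzy_lookup() is a hypothetical helper.

```r
backbone <- c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica")

# Hypothetical helper: for each name, take the closest backbone name and
# accept it only if the edit distance is within max_dist.
fuzzy_lookup <- function(x, candidates, max_dist = 2) {
  d <- utils::adist(x, candidates)          # rows: x, columns: candidates
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(x), best)]
  ifelse(best_dist <= max_dist, candidates[best], NA_character_)
}

# Two misspellings and one name that is absent from the backbone
matched <- fuzzy_lookup(
  c("Quercus robor", "Pinus silvestris", "Betula pendula"),
  backbone
)
```

The two misspellings are one edit away from their backbone names and are accepted, while the absent species stays NA. A tighter threshold reduces false matches between similar epithets at the cost of missing more typos.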
Column names from the external data that collide with existing
columns in the taxify result are automatically prefixed with
"data_" to prevent overwriting. If multiple rows in the
external data resolve to the same accepted_id with
identical trait values, they are deduplicated. If they resolve to the
same accepted_id with conflicting values (e.g., two
different height measurements for the same species),
add_data() raises an error asking the user to resolve the
ambiguity before joining. This strict handling of duplicates prevents
the row duplication that a plain merge() would produce.
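The collision and duplicate rules described above can be sketched in base R. Both helpers are hypothetical illustrations of the stated behaviour, not the code inside add_data(), and the IDs and values are mock data.

```r
# Hypothetical helper: prefix incoming column names that already exist
prefix_collisions <- function(new_cols, existing_cols, prefix = "data_") {
  clash <- new_cols %in% existing_cols
  new_cols[clash] <- paste0(prefix, new_cols[clash])
  new_cols
}

# Hypothetical helper: drop exact duplicate rows, then fail if any key
# still occurs more than once (meaning the remaining rows disagree)
collapse_duplicates <- function(df, key) {
  df <- unique(df)
  dup_keys <- unique(df[[key]][duplicated(df[[key]])])
  if (length(dup_keys) > 0) {
    stop("conflicting values for: ", paste(dup_keys, collapse = ", "))
  }
  df
}

# Identical duplicates collapse silently
ok <- data.frame(accepted_id = c("a", "a", "b"), height_m = c(30, 30, 25))
clean <- collapse_duplicates(ok, "accepted_id")

# Conflicting duplicates raise an error, captured here as a message
bad <- data.frame(accepted_id = c("a", "a"), height_m = c(30, 35))
msg <- tryCatch(collapse_duplicates(bad, "accepted_id"),
                error = function(e) conditionMessage(e))

renamed <- prefix_collisions(c("family", "height_m"), c("accepted_id", "family"))
```

In practice the conflict is usually resolved before the join, for example by averaging repeated measurements per species or by keeping the record from the preferred source.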
The add_data() function also supports SQLite databases
via the table argument, and .vtr files
directly (useful for sharing pre-built enrichments between
collaborators). For XLSX files, the openxlsx2 package is required
(listed in Suggests).
Data provenance and citation
Each enrichment draws on published, peer-reviewed datasets with their
own licenses and citation requirements. Citing the correct source and
version is a professional obligation when using these data in
publications. The summary() output includes the source and
version for each applied enrichment, providing a starting point for the
methods section. The original references are listed in each
add_*() function’s help page (accessible via
?add_avonet, ?add_leda,
?add_pantheria, etc.).
For reproducibility, the version recorded in meta.json
pins the exact build of each enrichment .vtr file that was
used. Static enrichments (Zanne 2014, PanTHERIA 2009, EltonTraits 2014,
AmphiBIO 2017, LEDA 2008, Diaz 2022, Seebens 2017, FungalTraits 2020,
FUNGuild 2016, AlgaeTraits 2023, FISHMORPH 2021, Meiri 2018, LepTraits
2022, AnimalTraits 2022, NW European Arthropods 2025, GloNAF 2019) have
fixed versions that never change. Non-static enrichments (IUCN, GRIIS,
WCVP, common names) are updated when the upstream source publishes a new
release, and the version in meta.json reflects which
release was used. Reporting the enrichment version in a publication
ensures that results can be reproduced even if the upstream data is
later revised or corrected.
The licenses of the source datasets range from CC0 (EltonTraits,
PanTHERIA, woodiness, common names, LepTraits, AnimalTraits) to CC BY
4.0 (EIVE, AmphiBIO, AVONET, GRIIS, GloNAF, FungalTraits, AlgaeTraits,
FISHMORPH, Meiri lizard traits), CC BY (AnAge), CC BY-NC (NW European
Arthropods), CC BY 3.0 (Diaz traits), and CC BY-NC 3.0 (FishBase). LEDA
and WCVP have their own terms published on their respective websites.
The taxify package itself does not redistribute these datasets in their
original form; the .vtr files are built from publicly
available sources and distributed via GitHub Releases. When using
enrichment data in a publication, cite the original source (the
reference on the ?add_* help page) and optionally note the
taxify enrichment version for reproducibility.
A minimal methods paragraph citing enrichments might read:
Taxonomic names were resolved against the WFO backbone (v2024.12) using taxify (v0.x.x). Conservation status was obtained from the IUCN Red List (v2025.1) via add_conservation_status(). Woodiness classification followed Zanne et al. (2014). Ecological indicator values were sourced from EIVE 1.0 (Dengler et al. 2023). All enrichment versions are recorded in the taxify result metadata and available via summary().
Summary
taxify’s enrichment system turns taxonomic name matching into a gateway to ecological trait data. The 22 built-in enrichments cover conservation status, growth form, ecological niches, functional traits, diet, morphology, life-history, geographic ranges, invasive status, and vernacular names across plants, birds, mammals, amphibians, vertebrates, butterflies, arthropods, fungi, algae, fish, and reptiles. All enrichments share the same underlying join mechanics, download automatically on first use, cache locally for subsequent sessions, and compose freely with the pipe operator.
The cross-backbone name resolution built into the .vtr
files means we do not have to worry about which backbone we used:
enrichments work identically with WFO, COL, GBIF, ITIS, NCBI, OTT, or
WoRMS results. The summary() method tracks which
enrichments have been applied, their source versions, and their coverage
rates, supporting both exploratory analysis and reproducible
reporting.
For taxa or traits not covered by the built-in layers,
add_data() integrates any external dataset using the same
backbone-resolved name matching. Between the built-in enrichments and
the add_data() escape hatch, most common ecological
analyses can go from raw species lists to trait-enriched analytical
tables in a single pipe chain.
The key properties of the enrichment system, to recap, are: automatic
download and caching (no manual data management), cross-backbone
compatibility (enrichments work regardless of which backend produced the
result), version tracking (the summary() method documents
exactly which data versions were used), and compositional design
(enrichments stack freely via the pipe operator without side effects or
ordering constraints). These properties together aim to make the path
from species names to trait-enriched analyses as short and reproducible
as possible.