Taxonomic name matching is rarely the end goal. Once taxify() has resolved a list of species names against a backbone, the next step is usually joining ecological trait data, conservation assessments, or geographic range information to the matched names. This step is where the real analytical value emerges, and it is also where most workflows hit friction: trait databases use different taxonomic authorities, store names with or without authorship strings, treat synonyms inconsistently, and distribute data in incompatible formats. Manually aligning names between a taxonomic backbone and a trait database can consume hours even for moderately sized species lists.

taxify ships with 22 enrichment layers that attach published trait and status datasets to a taxify() result in a single pipe call. Each enrichment is backed by a pre-built .vtr file that downloads automatically on first use and caches locally for all subsequent sessions. The enrichment system handles the name alignment problem at build time, so the join at analysis time is a simple, fast, exact-match operation.

This vignette covers the mechanics of how enrichments work, walks through the individual layers with worked examples, and discusses practical strategies for combining enrichments, interpreting coverage gaps, and choosing the right layers for a given taxon group. It also covers the add_data() function for joining custom datasets that go beyond the built-in enrichments.

How enrichments work

Every add_*() function performs the same underlying operation: a left join between the accepted_name column in a taxify() result and the canonical_name column in an enrichment .vtr file. Because the join key is the accepted (resolved) name rather than the original input, synonyms that were resolved during matching contribute automatically. If the input contained “Pinus abies” and the backbone resolved it to “Picea abies”, the enrichment join looks up “Picea abies” in the trait database. This means we never have to worry about whether our species list uses currently accepted names or outdated synonyms: the backbone resolution step has already normalized everything.

This design has a deliberate consequence: enrichments only produce values for rows that were successfully matched by taxify(). Rows where accepted_name is NA (unmatched names) always receive NA in all enrichment columns. If a species could not be resolved against the backbone, it cannot be looked up in a trait database either. This is usually the correct behavior, but it means that improving match rates upstream (by cleaning names, trying a different backbone, or enabling fuzzy matching) directly improves enrichment coverage downstream.
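The relationship between match rate and enrichment coverage can be checked with a few lines of base R. The sketch below uses a hand-built data.frame as a stand-in for an enriched taxify() result; the rows and values are illustrative assumptions, not real output.

```r
# Toy stand-in for an enriched result: one unmatched row (accepted_name is NA)
# and one matched row that the trait source does not cover.
result <- data.frame(
  accepted_name = c("Picea abies", "Quercus robur", NA, "Festuca rubra"),
  woodiness     = c("woody", "woody", NA, NA),
  stringsAsFactors = FALSE
)

matched <- !is.na(result$accepted_name)

# Share of input rows resolved against the backbone.
match_rate <- mean(matched)

# Share of all rows that received a trait value.
coverage_all <- mean(!is.na(result$woodiness))

# Share of matched rows that received a trait value. This isolates gaps in
# the trait source from gaps caused by unmatched names.
coverage_matched <- mean(!is.na(result$woodiness[matched]))
```

Reporting coverage among matched rows separately shows whether a gap should be fixed upstream (name matching) or reflects the limits of the trait dataset.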

The join in detail

When we call an enrichment function, taxify executes the following steps:

  1. Ensures the enrichment .vtr file is present on disk (downloading it if needed).

  2. Extracts the unique accepted_name values from the result.

  3. Writes those unique names into a temporary .vtr file.

  4. Performs a vectra inner_join() between the temporary names and the enrichment .vtr on the name column.

  5. Uses a vectorized match() to fill the new trait columns back into the original result. Rows without a match receive NA.

The operation is fast because it reduces to a single hash-based lookup per unique accepted name, not per row. A result with 50,000 rows but 8,000 unique accepted names only does 8,000 lookups. The vectra join exploits hash indexes on the canonical_name column in the enrichment .vtr, making even enrichments with hundreds of thousands of rows resolve in under a second.
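The steps above can be imitated in base R to see why the work scales with unique names rather than rows. This is a sketch only: it uses merge() and plain data.frames in place of vectra and .vtr files, and the toy tables are assumptions made for illustration.

```r
# Toy taxify()-like result: a resolved synonym, a repeated name, one unmatched row.
result <- data.frame(
  input_name    = c("Pinus abies", "Picea abies", "Quercus robur", "Foo bar"),
  accepted_name = c("Picea abies", "Picea abies", "Quercus robur", NA),
  stringsAsFactors = FALSE
)

# Toy enrichment table keyed on canonical_name.
enrichment <- data.frame(
  canonical_name = c("Picea abies", "Quercus robur", "Festuca rubra"),
  woodiness      = c("woody", "woody", "herbaceous"),
  stringsAsFactors = FALSE
)

# Steps 2 to 4: reduce to unique accepted names, then inner-join once per name.
keys <- unique(result$accepted_name[!is.na(result$accepted_name)])
hits <- merge(
  data.frame(canonical_name = keys, stringsAsFactors = FALSE),
  enrichment,
  by = "canonical_name"
)

# Step 5: vectorized fill-back. Rows without a hit (including NA keys) get NA.
idx <- match(result$accepted_name, hits$canonical_name)
result$woodiness <- hits$woodiness[idx]
```

Four input rows collapse to two lookups here, and the synonym row inherits the value of its accepted name.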

Cross-backbone name resolution

A subtle but important design decision underlies the enrichment .vtr files: they are built to work with any of taxify’s seven backends (WFO, COL, GBIF, ITIS, NCBI, OTT, WoRMS). Different backbones sometimes accept different names for the same taxon. WFO might accept “Senecio jacobaea” while COL accepts “Jacobaea vulgaris” for the same species. If the enrichment .vtr only contained one of these names, it would fail to match results from the other backbone.

The taxifydb build pipeline solves this by resolving every source species name against each of the seven backends separately (not as a fallback chain, which would only return the first match). The union of all unique accepted_name values is collected per source species. Each source row is then expanded: one enrichment row per distinct accepted name, with the trait data duplicated. The final .vtr is then deduplicated by canonical_name (plus any group column for grouped enrichments).

In practice, backends agree on more than 90% of names, so this expansion is modest (typically 1.1–1.5x the original row count). The result is that add_conservation_status() works identically whether the upstream taxify() call used WFO, COL, or GBIF. We do not have to pick enrichments based on which backbone we used, and we do not have to worry about backbone-specific name variants falling through the cracks.
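The expansion step can be sketched in base R. The per-backbone resolutions below are hypothetical (only the WFO/COL disagreement for Senecio jacobaea is taken from the text above), and the real build pipeline operates on .vtr files rather than data.frames.

```r
# Hypothetical per-backbone resolutions for two source species.
resolved <- data.frame(
  source_name   = c("Senecio jacobaea", "Senecio jacobaea",
                    "Quercus robur", "Quercus robur"),
  backbone      = c("WFO", "COL", "WFO", "COL"),
  accepted_name = c("Senecio jacobaea", "Jacobaea vulgaris",
                    "Quercus robur", "Quercus robur"),
  stringsAsFactors = FALSE
)

# Source trait table, one row per source species.
traits <- data.frame(
  source_name = c("Senecio jacobaea", "Quercus robur"),
  woodiness   = c("herbaceous", "woody"),
  stringsAsFactors = FALSE
)

# Union of distinct accepted names per source species.
name_map <- unique(resolved[, c("source_name", "accepted_name")])

# Expand: one enrichment row per distinct accepted name, traits duplicated.
expanded <- merge(name_map, traits, by = "source_name")
names(expanded)[names(expanded) == "accepted_name"] <- "canonical_name"

# Deduplicate by canonical_name and keep only the join key plus trait columns.
expanded <- expanded[!duplicated(expanded$canonical_name),
                     c("canonical_name", "woodiness")]

# Ratio of expanded rows to source rows (1.5 for this toy input).
expansion_factor <- nrow(expanded) / nrow(traits)
```

Both "Senecio jacobaea" and "Jacobaea vulgaris" end up as keys with the same trait value, which is what lets the join succeed regardless of the backbone used upstream.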

Automatic download and caching

The first time an enrichment is requested in a session, taxify checks whether a local copy exists on disk. If not, it downloads the pre-built .vtr from GitHub Releases using the URL recorded in the package manifest (inst/manifest.json). A meta.json sidecar file is written alongside the .vtr, recording the version string, whether the dataset is static, and the download date. On subsequent calls within the same R session, the file path is served from an in-memory cache (a package-level environment), so the disk is not even touched. Across sessions, the on-disk copy is reused without any network request.
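The lookup order described above (in-memory cache, then disk, then download) can be sketched as follows. The function name enrichment_path(), the cache environment, and the stub downloader are hypothetical and exist only for this illustration; they are not taxify's internal API.

```r
# Session-level cache, analogous to a package-level environment.
.path_cache <- new.env(parent = emptyenv())

enrichment_path <- function(name, root, download) {
  cached <- .path_cache[[name]]
  if (!is.null(cached)) return(cached)   # in-memory hit: disk is not touched

  path <- file.path(root, "enrichment", name, "latest", paste0(name, ".vtr"))
  if (!file.exists(path)) {              # on-disk miss: fetch once
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    download(name, path)
  }
  .path_cache[[name]] <- path
  path
}

# Demo with a temporary root and a stub downloader that counts its calls.
root  <- tempfile("taxify_demo_")
calls <- 0L
fake_download <- function(name, path) {
  calls <<- calls + 1L
  writeLines("placeholder", path)
}

p1 <- enrichment_path("woodiness", root, fake_download)
p2 <- enrichment_path("woodiness", root, fake_download)
```

The second call returns the cached path without consulting the disk or the downloader, so the stub is invoked exactly once.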

For enrichments marked as “static” in the manifest (version-locked datasets like Zanne et al. 2014 or PanTHERIA), version checks are skipped entirely. These datasets have fixed, published versions that will never change. For non-static enrichments (IUCN Red List, GRIIS, WCVP, common names), taxify performs a lightweight version check once per session by comparing the local meta.json version against the manifest’s latest field. If a newer version is available, it is downloaded automatically with a console message. This check adds negligible latency because the manifest itself is cached.
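The decision logic for version checks is simple enough to sketch with plain lists standing in for the parsed meta.json and manifest entry. The function name needs_update() and the list field names are assumptions for illustration; only the behavior (skip static datasets, check at most once per session, compare against the manifest's latest field) follows the description above.

```r
# Tracks which enrichments were already checked in this session.
.session_checked <- new.env(parent = emptyenv())

needs_update <- function(name, meta, manifest_entry) {
  # Static datasets never trigger a version check.
  if (isTRUE(meta$static)) return(FALSE)

  # Non-static datasets are checked at most once per session.
  if (exists(name, envir = .session_checked, inherits = FALSE)) return(FALSE)
  assign(name, TRUE, envir = .session_checked)

  !identical(meta$version, manifest_entry$latest)
}

# Static dataset: no update regardless of the manifest.
static_check <- needs_update(
  "woodiness",
  meta = list(version = "1.0", static = TRUE),
  manifest_entry = list(latest = "1.0")
)

# Non-static dataset with a newer manifest version: update on the first check only.
first_check <- needs_update(
  "conservation_status",
  meta = list(version = "2024.2", static = FALSE),
  manifest_entry = list(latest = "2025.1")
)
second_check <- needs_update(
  "conservation_status",
  meta = list(version = "2024.2", static = FALSE),
  manifest_entry = list(latest = "2025.1")
)
```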

Fallback chain

If the pre-built .vtr download fails (network issues, mirror outage, transient server errors), taxify does not stop immediately. Instead, it attempts to build the enrichment from the original source data. Each enrichment has a build recipe in an internal registry (.enrichment_build_registry) that knows how to download the raw CSV, ZIP, or API response from the upstream source, parse it into a data.frame with a canonical_name column, and produce the .vtr file locally. This build-from-source path is slower (it has to download and parse raw data rather than a pre-built binary), but it means that enrichments remain available even if the GitHub Releases mirror is temporarily down.

If the build-from-source also fails (e.g., the upstream source is unreachable), taxify falls back one more level to an “emergency fallback”: it downloads and parses the source data in memory without writing to disk, performs the join in-memory using a data.frame rather than a .vtr, and issues a warning explaining the situation. This emergency result is ephemeral and not cached. If all three paths fail, an error is raised with a link to the GitHub issue tracker so the failure can be reported.
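A three-level fallback of this kind is commonly written as a loop over tryCatch() calls. The sketch below is a generic version of that pattern, not taxify's implementation: resolve_with_fallbacks() and the three stub functions are hypothetical stand-ins for the pre-built download, the build-from-source recipe, and the emergency in-memory path.

```r
resolve_with_fallbacks <- function(name, paths) {
  failures <- character()
  for (label in names(paths)) {
    # Capture the error object instead of aborting, so the next path can run.
    result <- tryCatch(paths[[label]](name), error = function(e) e)
    if (!inherits(result, "error")) {
      if (identical(label, "emergency")) {
        warning("Using uncached in-memory fallback for '", name, "'",
                call. = FALSE)
      }
      return(list(path = label, value = result))
    }
    failures <- c(failures, paste0(label, ": ", conditionMessage(result)))
  }
  stop("All enrichment paths failed for '", name, "' (",
       paste(failures, collapse = "; "), ")", call. = FALSE)
}

# Stubs: the mirror is down, but the build-from-source recipe succeeds.
paths <- list(
  prebuilt  = function(name) stop("mirror unreachable"),
  build     = function(name) paste0(name, ".vtr built from source"),
  emergency = function(name) "in-memory data.frame"
)

outcome <- resolve_with_fallbacks("woodiness", paths)
```

Collecting the failure messages along the way means the final error can report why each path failed, which is useful when filing an issue.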

The enrichment data directory

All enrichment .vtr files live under a single root directory, organized by enrichment name and version:

taxify_data_dir()/
  enrichment/
    conservation_status/
      latest/
        conservation_status.vtr
        meta.json
    griis/
      latest/
        griis.vtr
        meta.json
    woodiness/
      latest/
        woodiness.vtr
        meta.json
    ...

The taxify_data_dir() function returns the platform-appropriate data directory (typically ~/.local/share/taxify on Linux/macOS or %LOCALAPPDATA%/taxify on Windows). This directory is also where backbone .vtr files are stored, so a single taxify_data_dir() call reveals where all taxify data lives on the system. Enrichment files are modest in size: most are between 1 and 20 MB. The full set of enrichments totals roughly 150 to 200 MB.

Discovering enrichments

Before applying enrichments, we may want to see what is available. list_enrichments() queries the taxify manifest and returns a data.frame summarizing every available enrichment layer. The returned columns are: name, version, nrow (approximate row count), static (whether the dataset is version-locked), trait_cols (comma-separated list of trait column names), and source_url (the upstream data source).

library(taxify)

list_enrichments()
#>              name version   nrow static                              trait_cols ...
#> 1 conservation_status  ...  166000  FALSE                   conservation_status ...
#> 2               griis  ...   23000  FALSE                      invasive_status ...
#> 3                wcvp  ...  340000  FALSE                        native_status ...
#> ...

The static column is worth paying attention to. Static enrichments (woodiness, PanTHERIA, AmphiBIO, EltonTraits, LEDA, Diaz traits, FungalTraits, FUNGuild, AlgaeTraits, FISHMORPH, Meiri lizard traits, LepTraits, AnimalTraits, NW European Arthropods) are based on published, version-locked datasets that have a single definitive release. These never trigger version checks, so they add zero network overhead to a session. Non-static enrichments (conservation_status, GRIIS, WCVP, common_names) are periodically updated as the upstream source publishes new releases. For these, taxify checks once per session whether a newer version is available and updates transparently if so.

The nrow column gives a rough sense of enrichment size. Conservation status has ~166,000 rows (one per assessed species), WCVP has ~340,000 (one per species-region combination), and the smaller enrichments like LEDA have ~8,000. These numbers include the cross-backbone name expansion discussed earlier, so they are slightly larger than the original source row counts.

The trait_cols column lists the columns that the enrichment adds to a result. This is useful for planning which enrichments to apply: if we need specific leaf area data, scanning the trait_cols column reveals that LEDA provides sla_mm2_mg. If we need diet composition data, the trait_cols for elton_traits lists all 18 diet, foraging, mass, and nocturnality columns. The source_url column points to the original data source (Zenodo, Figshare, Dryad, GBIF, etc.) for reference and citation.
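Scanning trait_cols by eye works for a handful of layers; for a programmatic search, the comma-separated strings can be split and tested. The table below is a hand-built stand-in with the same column names that list_enrichments() is described as returning; its rows and the helper name provides() are assumptions for illustration.

```r
# Stand-in for list_enrichments() output (three rows, abbreviated columns).
enrichments <- data.frame(
  name       = c("leda", "diaz_traits", "woodiness"),
  static     = c(TRUE, TRUE, TRUE),
  trait_cols = c("raunkiaer_life_form, dispersal_type, seed_mass_mg, sla_mm2_mg",
                 "seed_mass_mg, plant_height_m",
                 "woodiness"),
  stringsAsFactors = FALSE
)

# Return the names of all enrichments that provide a given trait column.
provides <- function(tbl, col) {
  cols <- strsplit(tbl$trait_cols, ",\\s*")
  tbl$name[vapply(cols, function(x) col %in% x, logical(1))]
}

sla_sources  <- provides(enrichments, "sla_mm2_mg")
seed_sources <- provides(enrichments, "seed_mass_mg")
```

A column that appears in more than one layer (seed_mass_mg here) is a sign that stacking those layers needs a decision about which source to keep.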

Pre-downloading enrichments

For workflows that run on computing clusters, in Docker containers, or in any environment without reliable internet access, we can pre-download enrichments before the analysis begins. The taxify_download_enrichment() function accepts a character vector of enrichment names and downloads each one to the local data directory.

# Download a single enrichment
taxify_download_enrichment("conservation_status")

# Download several at once
taxify_download_enrichment(c("woodiness", "eive", "leda"))

# Download all of them
taxify_download_enrichment(c(
  "conservation_status", "griis", "wcvp", "eive",
  "elton_traits", "avonet", "pantheria", "amphibio",
  "common_names", "woodiness", "diaz_traits", "leda",
  "fungal_traits", "funguild", "algae_traits",
  "fish_traits", "fishbase", "lizard_traits", "anage", "glonaf",
  "leptraits", "animaltraits", "arthropod_traits", "alien_first_records",
  "baseflor", "ecoflora", "floraweb"
))

After this step, all subsequent add_*() calls for these enrichments will use the local copies without any network access. This is particularly useful for reproducible pipelines: pre-downloading enrichments at setup time guarantees that the analysis always uses the same version of each dataset, regardless of whether newer versions are published in the meantime.

The download function prints a confirmation message with the version and file size for each enrichment. If an enrichment is already present at the requested version, it is not re-downloaded. The .vtr files live in taxify_data_dir()/enrichment/{name}/latest/ alongside their meta.json sidecar.

Pre-downloading is also useful for teaching and workshop settings where many participants share a slow network connection. One person can download the enrichments, copy the taxify_data_dir() contents to a shared drive or USB stick, and distribute it to all participants. Since the enrichment lookup path starts with the on-disk check, the copied files will be found immediately without any network access.

Simple enrichments

Simple enrichments add one or more columns via a flat join on accepted_name. Eighteen of the twenty-two enrichment layers use this pattern. They differ only in which columns they add and which taxonomic groups they cover. We group them below by taxon focus, starting with plants (which have the most enrichment layers), then conservation status (cross-taxon), birds, mammals, amphibians, vertebrates, fungi, algae, fish, reptiles, butterflies, and arthropods.

Plant enrichments

Plants are the best-served taxon group in the enrichment system, with dedicated layers covering growth form, ecological niches, seed and height traits, a broad suite of functional traits, and regional trait compilations for the British, French, and German floras. This reflects the state of published plant trait databases: decades of investment in standardized trait measurement protocols have produced several large, open-access datasets that are straightforward to integrate.

Woodiness (Zanne et al. 2014)

The woodiness enrichment classifies ~50,000 plant species as woody, herbaceous, or variable. The dataset comes from Zanne et al. (2014), a landmark study on the radiation of angiosperms into freezing environments, published in Nature. The underlying classification draws on the world’s major herbarium and botanical databases.

plants <- taxify(c(
  "Quercus robur", "Betula pendula", "Arrhenatherum elatius",
  "Festuca rubra", "Salix caprea", "Cornus sanguinea"
))

plants |> add_woodiness()
#>              input_name         accepted_name  woodiness
#> 1         Quercus robur         Quercus robur      woody
#> 2        Betula pendula        Betula pendula      woody
#> 3 Arrhenatherum elatius Arrhenatherum elatius herbaceous
#> 4         Festuca rubra         Festuca rubra herbaceous
#> 5          Salix caprea          Salix caprea      woody
#> 6      Cornus sanguinea      Cornus sanguinea      woody

The three possible values are "woody", "herbaceous", and "variable". The "variable" category applies to species that exhibit both growth forms depending on environmental conditions or ecotype. Coverage is strongest for angiosperms (both monocots and dicots) and weaker for ferns, lycophytes, and bryophytes. The dataset is static (CC0 license, published 2014), so it never triggers version checks.

Woodiness is a coarse trait, but it is one of the most widely used in community ecology and macroecology. It separates plant strategies along a fundamental axis (persistent woody stems vs. annual or perennial herbaceous growth), making it valuable for community-weighted mean analyses, functional diversity indices, and biome classification.

EIVE ecological indicator values (Dengler et al. 2023)

EIVE 1.0 provides continuous ecological indicator values for ~14,500 European vascular plant species. It supersedes the classic ordinal Ellenberg indicator values, which were expert-assigned integers on a 1-9 scale, with statistically derived continuous scores based on species co-occurrence patterns across thousands of vegetation plots. Five niche axes are covered: light, temperature, moisture, soil reaction (pH), and nutrients.

grasses <- taxify(c(
  "Arrhenatherum elatius", "Bromus erectus", "Festuca rubra",
  "Dactylis glomerata", "Lolium perenne", "Poa pratensis"
))

grasses |> add_eive()
#>          input_name eive_light eive_temperature eive_moisture eive_reaction eive_nutrients
#> 1 Arrhenatherum ...       7.2              5.8           4.3           7.1            6.5
#> 2    Bromus erectus       7.6              5.5           3.1           7.8            3.2
#> ...

Because EIVE is restricted to the European flora, species from other continents will receive NA in all five columns. The continuous values are on a scale comparable to the original Ellenberg system (roughly 1-9) but allow fractional positions. This matters for community-weighted mean (CWM) calculations: averaging ordinal Ellenberg values treats the intervals between categories as equal (the difference between 3 and 4 is the same as between 7 and 8), which is not guaranteed. EIVE’s continuous scale makes CWM calculations statistically cleaner. The five output columns are prefixed with eive_ to avoid collision with columns from other sources: eive_light, eive_temperature, eive_moisture, eive_reaction, and eive_nutrients.
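A community-weighted mean is the cover-weighted average of a trait across the species in a plot. The sketch below computes it in base R for one plot; the cover values and the non-European placeholder species are assumptions, and the two moisture values are the illustrative ones from the example output above.

```r
# One vegetation plot: percent cover per species plus an EIVE column.
# The third species stands for a non-European taxon without EIVE values.
community <- data.frame(
  accepted_name = c("Arrhenatherum elatius", "Bromus erectus", "Exotic species"),
  cover         = c(60, 30, 10),
  eive_moisture = c(4.3, 3.1, NA),
  stringsAsFactors = FALSE
)

# Community-weighted mean over species with a trait value. Weights are
# renormalized over the covered species only.
cwm <- function(trait, weight) {
  ok <- !is.na(trait) & !is.na(weight)
  sum(trait[ok] * weight[ok]) / sum(weight[ok])
}

cwm_moisture <- cwm(community$eive_moisture, community$cover)

# Share of total cover that actually contributed to the CWM.
cover_with_trait <- sum(community$cover[!is.na(community$eive_moisture)]) /
  sum(community$cover)
```

Reporting the share of cover with trait data alongside the CWM makes it clear how much of the plot the value represents.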

The EIVE dataset is licensed under CC BY 4.0 and published on Zenodo. It is classified as static in the taxify manifest because its version (1.0) is a fixed publication. The reference is Dengler et al. (2023), Vegetation Classification and Survey 4:7-29.

Diaz traits (Diaz et al. 2022)

The Diaz enrichment provides two key functional traits from the TRY database consortium: seed mass in milligrams and plant height in metres. These are species-level means compiled from thousands of individual measurements across multiple primary sources. Coverage spans ~46,000 plant species globally, making it one of the broader trait datasets available for plants.

trees <- taxify(c(
  "Quercus robur", "Fagus sylvatica", "Picea abies",
  "Pinus sylvestris", "Acer pseudoplatanus"
))

trees |> add_diaz_traits()
#>            input_name seed_mass_mg plant_height_m
#> 1       Quercus robur       3200.0           25.0
#> 2     Fagus sylvatica       2200.0           30.0
#> 3         Picea abies          7.9           40.0
#> 4    Pinus sylvestris          6.5           25.0
#> 5 Acer pseudoplatanus        120.0           25.0

Seed mass and plant height sit on the two most important axes of the global spectrum of plant form and function described by Diaz et al. (2016, Nature 529:167-171). Seed mass captures the offspring size / offspring number trade-off (small-seeded species produce many propagules, large-seeded species invest in fewer, better-provisioned offspring). Plant height captures the competitive strategy axis (tall species intercept more light but invest more in structural tissue). Combining these two traits with EIVE or LEDA columns produces a reasonably complete functional characterization for European temperate species.

The output columns are seed_mass_mg and plant_height_m. Both are numeric (NA_real_ for missing values). The dataset is licensed under CC BY 3.0 and distributed via the TRY File Archive.

LEDA Traitbase (Kleyer et al. 2008)

LEDA covers ~8,000 NW European plant species with 10 trait columns spanning life form, dispersal, seed, leaf, and clonality dimensions. It is among the more column-rich plant enrichments, providing a broad functional profile in a single call.

meadow_spp <- taxify(c(
  "Arrhenatherum elatius", "Trifolium pratense",
  "Leucanthemum vulgare", "Plantago lanceolata",
  "Achillea millefolium", "Centaurea jacea"
))

meadow_spp |> add_leda()
#>          input_name raunkiaer_life_form dispersal_type sla_mm2_mg canopy_height_m ...
#> 1 Arrhenatherum ...  hemicryptophyte     anemochory       25.1            0.90  ...
#> 2 Trifolium pratense hemicryptophyte     zoochory         22.3            0.30  ...
#> ...

The full column set includes:

  • raunkiaer_life_form: the primary Raunkiaer life form (phanerophyte, chamaephyte, hemicryptophyte, geophyte, therophyte, helophyte, hydrophyte).

  • raunkiaer_variable: 1 if the species is assigned to multiple life forms, 0 otherwise.

  • dispersal_type: primary dispersal vector (anemochory, zoochory, hydrochory, autochory, barochory, dysochory).

  • terminal_velocity_ms: seed terminal velocity in m/s (species median).

  • seed_mass_mg: seed mass in mg (species median).

  • canopy_height_m: canopy height in metres (species median).

  • leaf_mass_mg: leaf dry mass in mg (species median).

  • sla_mm2_mg: specific leaf area in mm^2/mg (species median).

  • clonal_growth: capable of clonal growth (1 = yes, 0 = no).

  • buoyancy: seed buoyancy classification.

The Raunkiaer life form column classifies plants by where their perennating buds sit during the unfavorable season. Phanerophytes (trees, tall shrubs) hold buds more than 25 cm above the soil; chamaephytes (low shrubs, cushion plants) keep them near the surface. Hemicryptophytes, the dominant group in temperate grasslands, position buds right at soil level. Below ground, geophytes store buds as bulbs or rhizomes, while therophytes skip the problem entirely by surviving as seeds. LEDA provides this classification at species level for the NW European flora, making it one of the few trait databases that includes Raunkiaer assignments for several thousand species.

One column-name collision to be aware of: LEDA’s seed_mass_mg and the Diaz enrichment’s seed_mass_mg share the same output column name. If both enrichments are stacked in a pipe chain, the second one to run will overwrite the first. The values may differ slightly because LEDA reports the species median from its own measurements while Diaz reports the TRY consortium mean. To keep both, apply one enrichment, rename the column, then apply the second. Alternatively, choose whichever source is more appropriate for the study: LEDA for NW European analyses (regional measurements), Diaz for global analyses (worldwide compilation).
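The rename-between-steps pattern looks like this in base R. The data.frame stands in for the result after the first enrichment, and the direct column assignment stands in for the second enrichment call; all seed mass values are made up for illustration.

```r
# Stand-in for the result after add_leda(): LEDA's species median.
res <- data.frame(
  accepted_name = c("Trifolium pratense", "Plantago lanceolata"),
  seed_mass_mg  = c(1.5, 1.8),
  stringsAsFactors = FALSE
)

# Rename before the second enrichment runs, so the LEDA value is preserved.
names(res)[names(res) == "seed_mass_mg"] <- "seed_mass_mg_leda"

# Stand-in for add_diaz_traits(), which writes seed_mass_mg again.
res$seed_mass_mg <- c(1.6, 1.7)
names(res)[names(res) == "seed_mass_mg"] <- "seed_mass_mg_diaz"

# With both sources kept, the disagreement can be inspected directly.
res$seed_mass_diff <- res$seed_mass_mg_diaz - res$seed_mass_mg_leda
```

In a real pipeline the two stand-ins would be the add_leda() and add_diaz_traits() calls, with the rename placed between them.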

Regional plant-trait compilations (Baseflor, Ecoflora, FloraWeb)

Three regional databases add trait detail for the European floras they cover. Ecoflora and FloraWeb carry a region suffix on every column (_uk for Britain, _de for Germany), while Baseflor’s French set is unsuffixed, so the three can be chained without clobbering one another or the pan-European layers above.

  • Baseflor (Julve, Programme Catminat), via add_baseflor(): about 8,500 taxa of the French and neighbouring flora, with flowering months, pollination vector, dispersal mode, breeding system, flower colour, fruit type, woody growth form, and the continentality and salinity axes absent from EIVE.
  • Ecoflora (Fitter & Peat 1994), via add_ecoflora(): the British Isles flora, with canopy height, leaf traits, life form, flowering phenology, pollination, seed weight, and British-calibrated Ellenberg values (18 _uk columns).
  • FloraWeb (BfN; the BiolFlor data of Klotz, Kuehn & Durka 2002), via add_floraweb(): the German flora, with morphology, reproductive biology, the nine Ellenberg indicator values, ploidy and chromosome number, Grime CSR strategy type, and chorological distribution (59 _de columns).

taxify(c("Bellis perennis", "Achillea millefolium", "Calluna vulgaris")) |>
  add_ecoflora() |>
  add_floraweb()

Because the British and German columns are region-suffixed and the Baseflor set is not, one chain can attach British, French, and German trait sets side by side for the same species. FloraWeb and Ecoflora are bundled as pre-built datasets and work offline; their German and British trait values are reported as published, and the access date serves as the dataset version (neither portal offers a versioned bulk export). Italian Ellenberg-type indicator values are also available through add_pignatti(), which reads the copy bundled in the TR8 package on demand; those values come from a copyrighted publication and are not redistributed by taxify.

Mycorrhizal type (FungalRoot, Soudzilovskaia et al. 2020)

Most vascular plants form a symbiosis with root fungi, and the type of that symbiosis is one of the most informative functional traits a plant carries: it governs how the plant acquires nutrients and which soil fungi it depends on. add_fungalroot() attaches the mycorrhizal type from FungalRoot, a global compilation of more than 36,000 plant-by-site observations published on GBIF.

Unlike the enrichments above, FungalRoot joins on genus, not accepted_name. Mycorrhizal type is conserved at the genus level (the resolution FungalRoot itself recommends for inference), so the value is a per-genus majority consensus and every species in a covered genus inherits it, whether or not that exact binomial was observed.

taxify(c("Quercus robur", "Pinus sylvestris", "Trifolium pratense",
         "Vaccinium myrtillus", "Brassica oleracea")) |>
  add_fungalroot()
#>          input_name     genus mycorrhizal_type mycorrhizal_status mycorrhizal_records
#> 1       Quercus robur   Quercus              EcM        mycorrhizal                 163
#> 2    Pinus sylvestris     Pinus              EcM        mycorrhizal                 500
#> 3  Trifolium pratense Trifolium               AM        mycorrhizal                 193
#> 4 Vaccinium myrtillus Vaccinium              ErM        mycorrhizal                 227
#> 5   Brassica oleracea  Brassica               NM    non-mycorrhizal                  59

Three columns are added:

  • mycorrhizal_type: the genus-level consensus type. AM (arbuscular, by far the most common, formed by most herbs and many trees), EcM (ecto, typical of oaks, pines, birches, and other temperate forest trees), ErM (ericoid, confined to the Ericaceae), OM (orchid), NM (non-mycorrhizal, e.g. the Brassicaceae and many Cyperaceae), the dual types EcM-AM / ErM-EcM / ErM-AM, plus Other and uncertain.
  • mycorrhizal_status: a coarse roll-up of the type, one of "mycorrhizal", "non-mycorrhizal", or "uncertain".
  • mycorrhizal_records: how many FungalRoot observations support the genus-level consensus, so a one-record genus can be told apart from a well-sampled one.

Because the join is on genus, a plant whose genus is not in FungalRoot returns NA, and a genus circumscribed differently across backbones may not line up. Coverage is plant genera only (about 4,000 genera). The dataset is distributed under CC BY-NC 4.0; the per-genus consensus is computed by taxify from the per-observation labels, not FungalRoot’s own published genus assignment.
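A per-genus majority consensus followed by a genus-level join can be sketched in base R. The observation table is invented for illustration, and two details are assumptions rather than documented behavior: ties are broken alphabetically, and mycorrhizal_records is taken to be the number of observations carrying the winning label.

```r
# Invented per-observation labels (FungalRoot itself has tens of thousands).
obs <- data.frame(
  genus = c("Quercus", "Quercus", "Quercus", "Brassica", "Brassica"),
  type  = c("EcM", "EcM", "AM", "NM", "NM"),
  stringsAsFactors = FALSE
)

# Majority vote per genus. which.max() returns the first maximum, so ties
# fall to the alphabetically first label (an assumption of this sketch).
by_genus <- split(obs$type, obs$genus)
consensus <- data.frame(
  genus = names(by_genus),
  mycorrhizal_type = unname(vapply(
    by_genus, function(x) names(which.max(table(x))), character(1))),
  mycorrhizal_records = unname(vapply(
    by_genus, function(x) as.integer(max(table(x))), integer(1))),
  stringsAsFactors = FALSE
)

# Join on genus: every species in a covered genus inherits the consensus.
species <- c("Quercus robur", "Brassica oleracea", "Zea mays")
genus   <- sub(" .*$", "", species)
idx     <- match(genus, consensus$genus)
out <- data.frame(
  accepted_name       = species,
  genus               = genus,
  mycorrhizal_type    = consensus$mycorrhizal_type[idx],
  mycorrhizal_records = consensus$mycorrhizal_records[idx],
  stringsAsFactors = FALSE
)
```

"Zea mays" receives NA because its genus is absent from the toy observation table, mirroring the behavior for genera not covered by FungalRoot.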

Conservation status (IUCN Red List)

The conservation status enrichment is the only enrichment that spans all taxonomic groups, although coverage varies widely between them. It includes ~166,000 species assessed by the IUCN Red List, with representation across plants, vertebrates, invertebrates, and fungi. A single column is added: conservation_status, containing the standard IUCN category code.

species <- taxify(c(
  "Panthera tigris", "Ailuropoda melanoleuca",
  "Gorilla gorilla", "Vulpes vulpes",
  "Passer domesticus", "Quercus robur"
))

species |> add_conservation_status()
#>               input_name conservation_status
#> 1        Panthera tigris                  EN
#> 2 Ailuropoda melanoleuca                  VU
#> 3        Gorilla gorilla                  CR
#> 4          Vulpes vulpes                  LC
#> 5      Passer domesticus                  LC
#> 6          Quercus robur                  LC

The seven categories in order of increasing threat are:

  • LC (Least Concern): population stable, no significant threats.

  • NT (Near Threatened): close to qualifying for a threatened category.

  • VU (Vulnerable): facing a high risk of extinction in the wild.

  • EN (Endangered): facing a very high risk of extinction.

  • CR (Critically Endangered): facing an extremely high risk.

  • EW (Extinct in the Wild): survives only in captivity or cultivation.

  • EX (Extinct): no known living individuals.

Species not yet assessed by the IUCN receive NA. The IUCN has assessed nearly all mammals, birds, amphibians, and reptiles, but only a fraction of invertebrates, fungi, and plants. For a plant-focused study, coverage rates are likely to be lower (perhaps 10-30% of the species list) than for a vertebrate study (where 90-100% coverage is typical). The enrichment also includes species assessed as DD (Data Deficient), which indicates that the IUCN has examined the species but lacks sufficient data to assign a threat category.
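Because the seven categories are ordered, it is often convenient to convert the codes to an ordered factor before filtering or summarizing. The sketch below works on a plain character vector standing in for the conservation_status column; treating DD as missing on the threat scale is a choice made here, not something the enrichment does.

```r
# The seven categories in order of increasing threat.
iucn_levels <- c("LC", "NT", "VU", "EN", "CR", "EW", "EX")

# Stand-in for a conservation_status column, including NA and DD.
status <- c("EN", "VU", "CR", "LC", "LC", "LC", NA, "DD")

# DD is not part of the threat ordering, so it becomes NA in the factor.
threat <- factor(status, levels = iucn_levels, ordered = TRUE)

# "Threatened" in the IUCN sense covers VU, EN, and CR.
threatened <- threat >= "VU" & threat <= "CR"

# Tally per category; unassessed and DD rows are counted under NA.
status_counts <- table(threat, useNA = "ifany")
```

Keeping the original character column alongside the factor preserves the distinction between not assessed (NA) and assessed but Data Deficient (DD).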

The conservation status enrichment is non-static: the IUCN publishes updated assessments several times per year, and the taxify enrichment is rebuilt when new assessments become available. The summary() output will show the version string (e.g., “2025.1”) so that the exact assessment vintage can be cited.

Bird enrichments

Birds are served by two complementary enrichments that together provide a detailed functional and ecological profile. AVONET covers morphology and migration strategy; EltonTraits covers diet composition and foraging behavior. There is intentional overlap in body mass (both provide it), but the remaining columns are distinct.

AVONET (Tobias et al. 2022)

AVONET provides species-level morphological measurements for ~11,000 bird species worldwide, based on direct measurements of museum specimens and live birds. The enrichment adds 11 columns covering beak dimensions (length, depth), wing length, tail length, tarsus length, body mass, hand-wing index, primary habitat, trophic level, trophic niche, and migration strategy.

birds <- taxify(c(
  "Parus major", "Cyanistes caeruleus", "Erithacus rubecula",
  "Turdus merula", "Falco peregrinus", "Aquila chrysaetos"
))

birds |> add_avonet()
#>            input_name beak_length wing_length avonet_body_mass_g migration trophic_niche ...
#> 1         Parus major        11.2        75.1               18.5 sedentary   Invertivore ...
#> 2 Cyanistes caeruleus         9.8        67.2               11.0 sedentary   Invertivore ...
#> 3  Erithacus rubecula        11.5        72.3               17.1   partial   Invertivore ...
#> 4       Turdus merula        20.8       130.5               95.0   partial      Omnivore ...
#> 5    Falco peregrinus        15.2       312.0              750.0      full     Vertivore ...
#> 6   Aquila chrysaetos        37.5       607.0             4000.0   partial     Vertivore ...

The morphological measurements (beak, wing, tail, tarsus) are all in millimetres, representing species means across measured specimens. The hand-wing index (hand_wing_index) quantifies wing pointedness and correlates strongly with long-distance flight ability: swifts, falcons, and shearwaters score high, while wrens, rails, and pheasants sit at the low end of the spectrum. Dispersal ecology, macroecology, and studies of range expansion all make heavy use of this index.

The migration column has three possible values: "sedentary" (non-migratory), "partial" (some populations migrate), and "full" (obligate long-distance migrant). The trophic_niche column uses categories like "Invertivore", "Omnivore", "Vertivore", "Frugivore", "Granivore", "Nectarivore", "Herbivore aquatic", and others.

AVONET is licensed under CC BY 4.0 and published on Figshare. The reference is Tobias et al. (2022), Ecology Letters 25:581-597. The dataset is classified as static in the taxify manifest.

EltonTraits (Wilman et al. 2014)

EltonTraits 1.0 covers both birds and mammals (~15,400 species total), making it the only enrichment that spans two vertebrate classes. It adds 18 columns organized into four groups: 10 diet composition percentages, 6 foraging stratum percentages, body mass, and nocturnality.

The diet columns express the percentage contribution of each food category to the species’ diet: invertebrates (diet_inv), endothermic vertebrates (diet_vend), ectothermic vertebrates (diet_vect), fish (diet_vfish), unknown vertebrates (diet_vunk), scavenging (diet_scav), fruit (diet_fruit), nectar (diet_nect), seeds and nuts (diet_seed), and other plant material (diet_plantother). The 10 diet percentages sum to 100 for each species.

The foraging columns express where in the vertical habitat structure the species forages: below water surface (foraging_water), on ground (foraging_ground), in understory (foraging_understory), in mid to high vegetation (foraging_midhigh), in canopy (foraging_canopy), and aerial (foraging_aerial). These 6 percentages also sum to 100.
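The sum-to-100 property gives a quick sanity check after enrichment, and the same columns yield a dominant diet category per species. The sketch uses the ten diet column names listed above on a hand-built table; the diet values are illustrative assumptions.

```r
diet_cols <- c("diet_inv", "diet_vend", "diet_vect", "diet_vfish", "diet_vunk",
               "diet_scav", "diet_fruit", "diet_nect", "diet_seed",
               "diet_plantother")

# Hand-built stand-in for the diet part of an add_elton_traits() result.
elton <- data.frame(
  accepted_name   = c("Parus major", "Apus apus"),
  diet_inv        = c(60, 100),
  diet_vend       = c(0, 0),
  diet_vect       = c(0, 0),
  diet_vfish      = c(0, 0),
  diet_vunk       = c(0, 0),
  diet_scav       = c(0, 0),
  diet_fruit      = c(10, 0),
  diet_nect       = c(0, 0),
  diet_seed       = c(20, 0),
  diet_plantother = c(10, 0),
  stringsAsFactors = FALSE
)

# Each species' diet percentages should add up to 100.
diet_total <- rowSums(elton[diet_cols])
stopifnot(all(diet_total == 100))

# Dominant diet category per species (first column wins on ties).
dominant_diet <- diet_cols[max.col(as.matrix(elton[diet_cols]),
                                   ties.method = "first")]
```

The same check applies to the six foraging columns. Rows that received NA from the join need to be excluded before calling rowSums().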

birds <- taxify(c(
  "Parus major", "Dendrocopos major", "Alcedo atthis",
  "Tyto alba", "Apus apus"
))

birds |> add_elton_traits()
#>       input_name diet_inv diet_fruit diet_seed foraging_canopy foraging_aerial nocturnal ...
#> 1    Parus major       60         10        20              50               0         0 ...
#> 2 Dendrocopos major    75          5        10              80               0         0 ...
#> 3   Alcedo atthis       0          0         0               0               0         0 ...
#> 4       Tyto alba       10          0         0               0               0         1 ...
#> 5       Apus apus      100          0         0               0             100         0 ...

The Common Swift (Apus apus) is a textbook example of a species at the extreme of the foraging and diet axes: 100% invertebrate diet, 100% aerial foraging, reflecting its life spent almost entirely on the wing. The Barn Owl (Tyto alba) illustrates the nocturnal flag: it is the only species in this example set with nocturnal = 1. The elton_body_mass_g column provides body mass in grams, compiled from the literature.

EltonTraits is particularly valuable for functional diversity analyses (computing Rao’s quadratic entropy or functional richness using diet and foraging traits as axes), food web construction (using diet percentages to parameterize trophic links), and macroecological studies of niche breadth. It is licensed under CC0 and published on Figshare. The reference is Wilman et al. (2014), Ecology 95:2027.

Mammal enrichments

PanTHERIA (Jones et al. 2009)

PanTHERIA covers ~5,400 mammal species with eight life-history and ecological traits: adult body mass, maximum longevity, mean litter size, gestation length, weaning age, home range size, diet breadth, and habitat breadth. It remains the most-cited source of mammalian life-history data in the ecological literature, despite being published in 2009.

mammals <- taxify(c(
  "Vulpes vulpes", "Canis lupus", "Ursus arctos",
  "Mustela nivalis", "Lutra lutra", "Lynx lynx"
))

mammals |> add_pantheria()
#>      input_name pantheria_body_mass_g longevity_mo litter_size home_range_km2 ...
#> 1  Vulpes vulpes               5480.0          144         5.0           8.55 ...
#> 2   Canis lupus              31757.0          192         5.4         242.00 ...
#> 3  Ursus arctos             139000.0          396         2.0         488.00 ...
#> 4 Mustela nivalis               67.0           72         5.5           0.03 ...
#> 5   Lutra lutra               8000.0          180         2.3          15.00 ...
#> 6     Lynx lynx              20500.0          252         2.6         168.00 ...

The body mass column is named pantheria_body_mass_g to distinguish it from AVONET’s avonet_body_mass_g and EltonTraits’ elton_body_mass_g. This prefixing convention prevents column-name collisions when stacking multiple enrichments on the same result.

The Least Weasel (Mustela nivalis) in the example above illustrates the dynamic range of mammalian traits: at 67 g body mass and a home range of 0.03 km^2, it sits at the opposite end of the spectrum from the Brown Bear (Ursus arctos) at 139,000 g and 488 km^2. These allometric scaling relationships (body mass predicting home range, longevity, gestation, etc.) are a major reason PanTHERIA is so widely cited.
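As a rough illustration of such a scaling relationship, the sketch below fits a log-log regression of home range on body mass using only the six values printed in the example output above. The data frame is typed in by hand so the snippet runs without taxify; six species are far too few for a meaningful estimate, so the fitted slope should be read as a demonstration of the method only.

```r
# Values copied from the example output above.
mammals <- data.frame(
  species = c("Vulpes vulpes", "Canis lupus", "Ursus arctos",
              "Mustela nivalis", "Lutra lutra", "Lynx lynx"),
  pantheria_body_mass_g = c(5480, 31757, 139000, 67, 8000, 20500),
  home_range_km2        = c(8.55, 242, 488, 0.03, 15, 168)
)

# Allometric relationships are linear on log-log axes:
# log10(home range) = a + b * log10(body mass).
fit <- lm(log10(home_range_km2) ~ log10(pantheria_body_mass_g),
          data = mammals)

# The slope b is the scaling exponent.
slope <- unname(coef(fit)[2])
slope
```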

Because PanTHERIA was published in 2009, species described or taxonomically split after that date will appear as NA. The dataset is static (CC0 license), so it never triggers version checks. The reference is Jones et al. (2009), Ecology 90:2648.

Amphibian enrichments

AmphiBIO (Oliveira et al. 2017)

AmphiBIO covers ~6,800 amphibian species with 13 trait columns. The continuous traits are body size (snout-vent length in mm), age at maturity (days), longevity (days), clutch/litter size, reproductive output per year, and offspring size (mm). The binary traits encode habitat and activity patterns: direct development (0/1), larval stage (0/1), aquatic habitat (0/1), fossorial habitat (0/1), arboreal habitat (0/1), diurnal activity (0/1), and nocturnal activity (0/1).

amphibians <- taxify(c(
  "Bufo bufo", "Rana temporaria", "Salamandra salamandra",
  "Triturus cristatus", "Hyla arborea", "Bombina variegata"
))

amphibians |> add_amphibio()
#>           input_name body_size_mm arboreal aquatic direct_development nocturnal_amphibio ...
#> 1          Bufo bufo        150.0        0       0                  0                  1 ...
#> 2    Rana temporaria        110.0        0       1                  0                  0 ...
#> 3 Salamandra salamandra    200.0        0       0                  0                  1 ...
#> 4  Triturus cristatus      160.0        0       1                  0                  1 ...
#> 5       Hyla arborea         50.0        1       0                  0                  1 ...
#> 6  Bombina variegata         50.0        0       1                  0                  0 ...

The nocturnality column is named nocturnal_amphibio rather than nocturnal to avoid colliding with EltonTraits’ nocturnal column. While it is unusual to stack both AmphiBIO and EltonTraits on the same result (EltonTraits covers birds and mammals, not amphibians), the precaution prevents surprises in mixed-taxon workflows where both enrichments are applied to a single data.frame.

The binary trait columns use integer values (0/1) rather than logical (TRUE/FALSE), following the original dataset’s encoding. This means filtering syntax uses == 1L rather than bare column names: result[result$arboreal == 1L, ].
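One caveat with this idiom is that species without an AmphiBIO match carry NA in the trait columns, and a logical index containing NA produces an all-NA row in base R. The sketch below uses a small hand-built data frame (not real add_amphibio() output) to show the difference between the plain comparison and a which()-wrapped one.

```r
# Hand-built stand-in for an enriched result; the third row mimics a
# species with no match in the enrichment (trait value is NA).
result <- data.frame(
  input_name = c("Hyla arborea", "Bufo bufo", "Unmatched species"),
  arboreal   = c(1L, 0L, NA_integer_)
)

# Plain comparison: the NA propagates and yields an extra all-NA row.
naive <- result[result$arboreal == 1L, ]

# which() keeps only positions where the comparison is TRUE.
safe <- result[which(result$arboreal == 1L), ]

nrow(naive)
nrow(safe)
```

subset(result, arboreal == 1L) drops the NA rows as well and may read more naturally in interactive work.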

AmphiBIO is one of the few large-scale trait databases for amphibians, a taxon group that is relatively data-poor compared to birds and mammals. Coverage spans anurans (frogs and toads), urodeles (salamanders and newts), and caecilians. It is licensed under CC BY 4.0 and published in Scientific Data. The reference is Oliveira et al. (2017), Scientific Data 4:170123.

Fungal enrichments

Fungi have historically been underrepresented in trait databases compared to plants and animals, but two complementary datasets now provide detailed ecological and functional information. FungalTraits classifies genera by lifestyle, growth form, and interaction capabilities, while FUNGuild provides a guild-based trophic classification. Together they offer a reasonably complete functional profile for macrofungi and many microfungi.

FungalTraits (Polme et al. 2020)

FungalTraits is a genus-level database covering ~10,200 fungal genera with nine trait columns. Unlike the species-level enrichments discussed above, FungalTraits joins on genus rather than accepted_name. This reflects the reality of fungal trait data: most functional traits (lifestyle, growth form, decay strategy) are conserved at the genus level, and the enormous diversity of described fungal species (~150,000) makes species-level trait compilation impractical for many attributes. The genus-level join means that all species within a genus receive the same trait values, which is ecologically reasonable for the traits covered.
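The genus-level join can be pictured with a few lines of base R. This is a sketch of the idea, not taxify's internal code: we assume here that the genus is the first word of accepted_name, and the genus_traits table is a two-row stand-in for the real enrichment file.

```r
# Stand-in for a taxify() result; only the join key is needed.
result <- data.frame(
  accepted_name = c("Amanita muscaria", "Amanita phalloides",
                    "Trametes versicolor", "Unknownus exampleus")
)

# Assumption for this sketch: genus = first word of the accepted name.
result$genus <- sub(" .*$", "", result$accepted_name)

# Two-row stand-in for the genus-level trait table.
genus_traits <- data.frame(
  genus             = c("Amanita", "Trametes"),
  primary_lifestyle = c("ectomycorrhizal", "saprotroph")
)

# Left join via match(): every species inherits its genus's value,
# and genera absent from the trait table get NA.
idx <- match(result$genus, genus_traits$genus)
result$primary_lifestyle <- genus_traits$primary_lifestyle[idx]
result
```

Both Amanita species receive the same lifestyle, which is exactly the behaviour described above.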

fungi <- taxify(c(
  "Amanita muscaria", "Boletus edulis", "Trametes versicolor",
  "Agaricus bisporus", "Cantharellus cibarius"
))

fungi |> add_fungal_traits()
#>          input_name primary_lifestyle  growth_form fruitbody_type decay_substrate ...
#> 1  Amanita muscaria   ectomycorrhizal     agaricoid      agaricoid            <NA> ...
#> 2    Boletus edulis   ectomycorrhizal       boletoid       boletoid            <NA> ...
#> 3 Trametes versicolor      saprotroph  polyporoid      polyporoid           wood  ...
#> 4 Agaricus bisporus       saprotroph     agaricoid      agaricoid          litter ...
#> 5 Cantharellus cibarius ectomycorrhizal cantharelloid cantharelloid          <NA> ...

The nine trait columns capture complementary facets of fungal ecology:

  • primary_lifestyle: the dominant trophic strategy (ectomycorrhizal, saprotroph, plant pathogen, animal parasite, lichen, endophyte, etc.).

  • secondary_lifestyle: an additional lifestyle where applicable (many genera have a single lifestyle, so this is frequently NA).

  • growth_form: the vegetative morphology (agaricoid, boletoid, polyporoid, corticioid, clavarioid, gasteroid, etc.).

  • fruitbody_type: the reproductive structure type.

  • decay_substrate: the primary substrate for saprotrophic genera (wood, litter, dung, soil).

  • plant_pathogenic_capacity: a coarse classification of pathogenic potential for plant-associated genera.

  • animal_biotrophic_capacity: analogous classification for animal associations.

  • endophytic_interaction_capability: whether the genus includes endophytic species.

  • ectomycorrhiza_exploration_type: for ectomycorrhizal genera, the exploration type of the mycelium (contact, short-distance, medium-distance, long-distance). This is ecologically important because exploration type governs nutrient acquisition strategy and competitive dynamics among ectomycorrhizal fungi.

The primary_lifestyle column is the single most informative trait for broad ecological analyses. It separates ectomycorrhizal fungi (mutualists that form nutrient-exchange networks with plant roots) from saprotrophs (decomposers that drive nutrient cycling) and pathogens (agents of disease and mortality). These three groups have fundamentally different roles in ecosystem functioning, and knowing which lifestyle a genus belongs to determines how it should be interpreted in community analyses, food web models, and carbon cycling studies.

FungalTraits is licensed under CC BY 4.0 and published in Fungal Diversity. The reference is Polme et al. (2020), Fungal Diversity 105:1-16. The dataset is classified as static in the taxify manifest.

FUNGuild (Nguyen et al. 2016)

FUNGuild provides trophic and guild classifications for ~13,000 fungal taxa at both genus and species levels. Where FungalTraits describes what a genus does ecologically (lifestyle, growth form, substrate preference), FUNGuild classifies taxa into guild categories that describe their functional role in the ecosystem. The two datasets are complementary: a genus like Trametes might be classified as “saprotroph” with “polyporoid” growth form in FungalTraits, and as “Wood Saprotroph” guild with “Saprotroph” trophic mode in FUNGuild. The FUNGuild classification is coarser but more directly interpretable for guild-based community analyses.

fungi <- taxify(c(
  "Amanita muscaria", "Boletus edulis", "Trametes versicolor",
  "Agaricus bisporus", "Cantharellus cibarius"
))

fungi |> add_funguild()
#>          input_name     trophic_mode                   guild funguild_growth_form confidence_ranking
#> 1  Amanita muscaria       Symbiotroph       Ectomycorrhizal            Agaricoid            Highly Probable
#> 2    Boletus edulis       Symbiotroph       Ectomycorrhizal             Boletoid            Highly Probable
#> 3 Trametes versicolor      Saprotroph        Wood Saprotroph           Polyporoid            Highly Probable
#> 4 Agaricus bisporus       Saprotroph      Litter Saprotroph            Agaricoid            Highly Probable
#> 5 Cantharellus cibarius   Symbiotroph       Ectomycorrhizal        Cantharelloid            Highly Probable

The four output columns provide a hierarchical classification:

  • trophic_mode: the broadest category (Saprotroph, Symbiotroph, Pathotroph, or combinations like Saprotroph-Symbiotroph for genera with multiple trophic strategies).

  • guild: a finer classification within each trophic mode (Wood Saprotroph, Litter Saprotroph, Ectomycorrhizal, Arbuscular Mycorrhizal, Plant Pathogen, Animal Pathogen, Lichenized, etc.).

  • funguild_growth_form: the morphological category, named with the funguild_ prefix to avoid collision with FungalTraits’ growth_form column.

  • confidence_ranking: how confident the guild assignment is (Highly Probable, Probable, or Possible). This column deserves attention: assignments at the “Possible” level are based on limited evidence and should be treated with caution in quantitative analyses. Filtering to “Highly Probable” and “Probable” assignments reduces coverage but improves reliability.

FUNGuild is particularly valuable for soil mycobiome studies, where operational taxonomic units (OTUs) from metabarcoding are classified into ecological guilds. The trophic mode and guild columns map directly onto the functional group categories used in fungal community ecology: the ratio of saprotrophs to symbiotrophs, or the proportion of pathotrophs in a community, are common response variables in studies of land use change, nutrient cycling, and plant-soil feedbacks.
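A minimal sketch of this workflow follows: filter to the more reliable confidence levels, then compute the saprotroph to symbiotroph ratio. The six-row table is invented for illustration (the OTU labels and assignments are hypothetical), but the column names and category labels follow the add_funguild() output described above.

```r
# Invented community table with FUNGuild-style columns.
guilds <- data.frame(
  taxon = paste("OTU", 1:6),
  trophic_mode = c("Saprotroph", "Saprotroph", "Symbiotroph",
                   "Symbiotroph", "Saprotroph", "Pathotroph"),
  confidence_ranking = c("Highly Probable", "Probable", "Highly Probable",
                         "Possible", "Possible", "Probable")
)

# Keep only the two more reliable confidence levels.
keep <- guilds$confidence_ranking %in% c("Highly Probable", "Probable")
reliable <- guilds[keep, ]

# Count taxa per trophic mode and form the ratio.
counts <- table(reliable$trophic_mode)
ratio <- counts[["Saprotroph"]] / counts[["Symbiotroph"]]
ratio
```

In a real metabarcoding study the counts would normally be weighted by read abundance rather than taxon presence; the unweighted version is used here to keep the example short.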

FUNGuild is published in Fungal Ecology. The reference is Nguyen et al. (2016), Fungal Ecology 20:241-248. The dataset is classified as static in the taxify manifest.

Algae enrichments

AlgaeTraits (Vranken et al. 2023)

AlgaeTraits provides morphological and ecological traits for ~1,745 European macroalgae species (seaweeds). Macroalgae are the dominant primary producers in coastal rocky ecosystems, yet they are conspicuously absent from the major plant trait databases (TRY, LEDA, EIVE) that focus exclusively on vascular plants. AlgaeTraits fills this gap for the European coastline, covering green algae (Chlorophyta), brown algae (Phaeophyceae), and red algae (Rhodophyta) with eight trait columns spanning morphology, habitat, and environmental tolerances.

seaweeds <- taxify(c(
  "Fucus vesiculosus", "Ulva lactuca", "Laminaria digitata",
  "Chondrus crispus", "Sargassum muticum"
))

seaweeds |> add_algae_traits()
#>          input_name algae_body_size_cm algae_growth_form algae_calcification algae_tidal_zone ...
#> 1 Fucus vesiculosus               60.0          foliose               none         intertidal ...
#> 2      Ulva lactuca               30.0          foliose               none         intertidal ...
#> 3 Laminaria digitata             200.0          foliose               none         subtidal   ...
#> 4    Chondrus crispus              15.0          foliose               none         intertidal ...
#> 5  Sargassum muticum             300.0          foliose               none         subtidal   ...

All eight columns are prefixed with algae_ to clearly distinguish them from terrestrial plant traits:

  • algae_body_size_cm: maximum thallus length in centimetres.

  • algae_growth_form: the morphological category (filamentous, foliose, corticated, leathery, calcareous, crustose, etc.).

  • algae_calcification: whether the species produces calcium carbonate structures (none, articulated, crustose). Calcifying algae like coralline species play a critical role in reef construction and are particularly sensitive to ocean acidification.

  • algae_life_span: the typical life span category (annual, perennial, pseudoperennial).

  • algae_tidal_zone: the primary tidal zone (supralittoral, intertidal, subtidal).

  • algae_wave_exposure: the preferred wave exposure regime (sheltered, moderately exposed, exposed).

  • algae_environment: the salinity regime (marine, brackish, freshwater).

  • algae_substrate: the preferred substrate type (rock, sand, epiphytic, free-living).

The body size column spans three orders of magnitude, from millimetre-scale filamentous algae to kelps exceeding 3 metres. This variation underpins the structural complexity of rocky shore communities: large canopy-forming species like Laminaria digitata create habitat for hundreds of associated species, while small turf-forming species dominate in disturbed or nutrient-enriched conditions. The growth form and tidal zone columns together define the ecological niche of each species along the shore gradient, making AlgaeTraits directly useful for intertidal community analyses, climate change impact assessments, and marine protected area planning.

The geographic scope is European, so species from other coastlines will receive NA in all columns. The dataset is licensed under CC BY 4.0 and published in Scientific Data. The reference is Vranken et al. (2023), Scientific Data 10:826. The dataset is classified as static in the taxify manifest.

Fish enrichments

Fish ecology has produced two large, complementary trait databases. FISHMORPH focuses on morphological measurements of freshwater species, while FishBase provides broader ecological and life-history data for both freshwater and marine fish. Together they provide detailed functional profiles for ichthyological studies.

FISHMORPH (Brosse et al. 2021)

FISHMORPH provides morphological trait data for ~8,300 freshwater fish species worldwide, based on standardized measurements from photographs and museum specimens. The 10 morphological traits capture the key axes of fish body shape variation that determine swimming performance, feeding mode, and habitat use.

freshwater_fish <- taxify(c(
  "Salmo trutta", "Esox lucius", "Cyprinus carpio",
  "Perca fluviatilis", "Silurus glanis"
))

freshwater_fish |> add_fish_traits()
#>        input_name fish_body_elongation fish_eye_size fish_oral_gape_position fish_body_lateral_shape ...
#> 1    Salmo trutta                 0.22          0.08                    0.42                    0.18 ...
#> 2    Esox lucius                  0.18          0.06                    0.50                    0.15 ...
#> 3  Cyprinus carpio                0.35          0.05                    0.38                    0.25 ...
#> 4 Perca fluviatilis               0.30          0.07                    0.40                    0.22 ...
#> 5   Silurus glanis                0.15          0.03                    0.48                    0.12 ...

All columns are prefixed with fish_. Nine of them are dimensionless morphological ratios; the tenth, fish_max_body_length_cm, is an absolute length in centimetres. The 10 traits are:

  • fish_body_elongation: body depth relative to standard length. High values indicate deep-bodied species (cyprinids), low values indicate elongated species (eels, pike).

  • fish_eye_size: eye diameter relative to head length. Large-eyed species tend to be visual predators in clear water.

  • fish_oral_gape_position: the vertical position of the mouth, from ventral (benthic feeders) to dorsal (surface feeders).

  • fish_body_lateral_shape: the lateral compression of the body.

  • fish_pectoral_fin_size: pectoral fin area, associated with manoeuvrability and braking ability.

  • fish_pectoral_fin_position: the vertical insertion of the pectoral fin on the body.

  • fish_caudal_peduncle_throttling: the narrowing of the caudal peduncle, associated with sustained swimming efficiency.

  • fish_caudal_fin_shape: the aspect ratio of the caudal fin. High values (forked tails) indicate cruising swimmers; low values (rounded tails) indicate ambush predators or benthic species.

  • fish_fin_surface_ratio: total fin area relative to body area.

  • fish_max_body_length_cm: the maximum recorded standard length in centimetres.

These morphological ratios are ecomorphological indicators: they predict how a species interacts with its physical environment. Body elongation and caudal fin shape together separate benthic, slow-moving species from pelagic, fast-cruising species. Oral gape position separates surface feeders from bottom feeders. Eye size and pectoral fin size relate to sensory ecology and manoeuvrability, respectively. The combination of these traits places each species in morphological space, making FISHMORPH directly useful for functional diversity calculations (Rao’s Q, functional richness, functional divergence) in freshwater fish community ecology.
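To make "morphological space" concrete, the sketch below standardizes the four ratios printed in the example output above and computes pairwise Euclidean distances between species. Distance matrices of this kind are a common starting point for functional diversity metrics; which metric and which package to use downstream is left open, and the values are typed in by hand so the snippet runs without taxify.

```r
# Values copied from the example output above (first four trait columns).
fish <- data.frame(
  fish_body_elongation    = c(0.22, 0.18, 0.35, 0.30, 0.15),
  fish_eye_size           = c(0.08, 0.06, 0.05, 0.07, 0.03),
  fish_oral_gape_position = c(0.42, 0.50, 0.38, 0.40, 0.48),
  fish_body_lateral_shape = c(0.18, 0.15, 0.25, 0.22, 0.12),
  row.names = c("Salmo trutta", "Esox lucius", "Cyprinus carpio",
                "Perca fluviatilis", "Silurus glanis")
)

# Centre and scale each trait so no single ratio dominates the distance.
trait_space <- scale(fish)

# Pairwise Euclidean distances between species in trait space.
trait_dist <- dist(trait_space)
round(as.matrix(trait_dist), 2)
```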

The dataset covers freshwater fish only; marine species receive NA. It is licensed under CC BY 4.0 and published in Global Ecology and Biogeography. The reference is Brosse et al. (2021), Global Ecology and Biogeography 30:2330-2345. The dataset is classified as static in the taxify manifest.

FishBase (Froese & Pauly 2024)

FishBase is the most comprehensive fish database in the world, covering ~35,000 species across both freshwater and marine environments. The taxify enrichment extracts eight key ecological and life-history traits from the FishBase dataset, providing a broad functional profile that complements FISHMORPH’s morphological focus.

fish <- taxify(c(
  "Gadus morhua", "Thunnus thynnus", "Hippocampus hippocampus",
  "Squalus acanthias", "Salmo trutta"
))

fish |> add_fishbase()
#>              input_name fb_body_length_cm fb_body_mass_g fb_trophic_level fb_depth_min_m fb_depth_max_m ...
#> 1          Gadus morhua            132.0        55500.0              4.4            0.0          600.0 ...
#> 2      Thunnus thynnus            458.0       684000.0              4.2            0.0         1000.0 ...
#> 3 Hippocampus hippocampus          15.0             NA              3.1            1.0           60.0 ...
#> 4    Squalus acanthias            160.0         11000.0              4.3           16.0          900.0 ...
#> 5         Salmo trutta            140.0        50000.0              3.4            0.0          332.0 ...

The eight columns are all prefixed with fb_:

  • fb_body_length_cm: maximum total length in centimetres.

  • fb_body_mass_g: maximum recorded body mass in grams.

  • fb_trophic_level: the trophic level (continuous, typically 2.0-5.0). Herbivorous fish sit near 2.0, planktivores around 3.0, piscivores around 4.0-4.5, and apex predators above 4.5.

  • fb_depth_min_m and fb_depth_max_m: the minimum and maximum depth range in metres. Together these define the vertical habitat envelope.

  • fb_vulnerability: the intrinsic vulnerability index (0-100), a composite score based on maximum size, age, fecundity, and other life-history parameters. High values indicate species that are inherently more susceptible to overexploitation.

  • fb_habitat: the primary habitat category (pelagic, demersal, bathydemersal, bathypelagic, reef-associated, etc.).

  • fb_importance: the economic importance category (commercial, subsistence, minor commercial, gamefish, etc.).

The trophic level and vulnerability columns are particularly valuable for fisheries ecology and marine conservation. Trophic level quantifies the position of each species in the food web, and the well-documented pattern of “fishing down the food web” (declining mean trophic level of catches over time) is diagnosed using exactly this variable. Vulnerability provides a quick assessment of which species in a community are most at risk from fishing pressure, complementing the IUCN conservation status with a mechanistic, trait-based risk metric.
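The mean trophic level of a catch is simply a catch-weighted average of fb_trophic_level. In the sketch below the trophic levels are taken from the example output above, while the catch tonnages for the two periods are invented to illustrate the calculation.

```r
# Trophic levels from the example output; tonnages are hypothetical.
catch <- data.frame(
  species = c("Gadus morhua", "Thunnus thynnus",
              "Squalus acanthias", "Salmo trutta"),
  fb_trophic_level = c(4.4, 4.2, 4.3, 3.4),
  tonnes_early     = c(50, 30, 15, 5),
  tonnes_late      = c(20, 10, 10, 60)
)

# Catch-weighted mean trophic level for each period.
mtl_early <- weighted.mean(catch$fb_trophic_level, catch$tonnes_early)
mtl_late  <- weighted.mean(catch$fb_trophic_level, catch$tonnes_late)

c(early = mtl_early, late = mtl_late)
```

A lower value in the later period is the signature usually described as fishing down the food web.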

Note that FishBase is licensed under CC BY-NC 3.0 (non-commercial use). This is more restrictive than the CC BY 4.0 license used by most other enrichments. Users intending to use FishBase data in commercial applications should consult the FishBase terms of use. The reference is Froese, R. and D. Pauly (2024), FishBase, www.fishbase.org. The dataset is classified as non-static in the taxify manifest because FishBase is updated periodically.

Reptile enrichments

Meiri lizard traits (Meiri 2018)

The Meiri lizard trait database covers ~6,600 lizard species (Squamata, excluding snakes and amphisbaenians) with 10 life-history, morphological, and ecological traits. Lizards are the most species-rich group of non-avian reptiles, and this dataset provides the most comprehensive species-level trait compilation available for the group.

lizards <- taxify(c(
  "Pogona vitticeps", "Lacerta agilis", "Iguana iguana",
  "Varanus komodoensis", "Gekko gecko"
))

lizards |> add_lizard_traits()
#>            input_name lizard_body_mass_g lizard_svl_mm lizard_tail_length_mm lizard_clutch_size ...
#> 1    Pogona vitticeps             350.0         230.0                 250.0               18.0 ...
#> 2      Lacerta agilis              10.0          70.0                 100.0                8.0 ...
#> 3       Iguana iguana            4000.0         450.0                 700.0               35.0 ...
#> 4 Varanus komodoensis           70000.0        1500.0                1400.0               18.0 ...
#> 5         Gekko gecko              60.0         140.0                 130.0                2.0 ...

All columns are prefixed with lizard_:

  • lizard_body_mass_g: adult body mass in grams.

  • lizard_svl_mm: snout-vent length in millimetres, the standard body size measurement for reptiles (excluding the tail, which is frequently lost and regenerated).

  • lizard_tail_length_mm: tail length in millimetres.

  • lizard_clutch_size: the mean number of eggs per clutch (or neonates per litter for viviparous species).

  • lizard_clutch_frequency: the number of clutches produced per year.

  • lizard_longevity_yr: maximum recorded longevity in years.

  • lizard_diet: the primary diet category (insectivore, herbivore, omnivore, carnivore).

  • lizard_habitat: the primary habitat type (terrestrial, arboreal, fossorial, saxicolous, semi-aquatic).

  • lizard_activity_time: the primary activity period (diurnal, nocturnal, crepuscular, cathemeral).

  • lizard_foraging_mode: the foraging strategy (sit-and-wait, active foraging, mixed). This trait is tightly linked to metabolic rate and energy budgets: active foragers have higher metabolic rates and larger home ranges, while sit-and-wait predators invest less energy in locomotion but rely on crypsis and ambush efficiency.

The body mass range spans roughly five orders of magnitude, from sub-gram geckos to the 70 kg Komodo Dragon (Varanus komodoensis). This allometric range drives strong scaling relationships: metabolic rate, home range size, and prey size all scale predictably with body mass in lizards. The SVL measurement is preferred over total length because tail autotomy (voluntary tail shedding) makes total length unreliable; SVL provides a stable, comparable measure of body size across species and individuals.

The combination of clutch size, clutch frequency, and longevity captures the fast-slow life-history continuum. Small geckos with clutches of 1-2 eggs but multiple clutches per year represent a different strategy from large iguanas with single large clutches per season. This life-history variation is directly relevant to population viability analysis and conservation planning for reptile species.

The dataset covers lizards globally. Snakes and turtles are not included; they receive NA in all columns. It is licensed under CC BY 4.0 and published in Global Ecology and Biogeography. The reference is Meiri (2018), Global Ecology and Biogeography 27:1168-1172. The dataset is classified as static in the taxify manifest.

Vertebrate enrichments (cross-class)

AnAge longevity and life-history (Tacutu et al. 2018)

AnAge is a curated database of aging and longevity records for ~4,700 vertebrate species spanning mammals, birds, reptiles, amphibians, and fish. It provides maximum longevity, body mass, metabolic rate, maturity age, gestation/incubation time, litter/clutch size, birth mass, growth rate, and body temperature. The unique value of AnAge over taxon-specific databases like PanTHERIA is its cross-class coverage: longevity and metabolic data can be compared directly across vertebrate classes.

vertebrates <- taxify(c(
  "Vulpes vulpes", "Aquila chrysaetos", "Crocodylus niloticus",
  "Bufo bufo", "Salmo salar"
), backend = c("col", "gbif"))

vertebrates |> add_anage()
#>              input_name max_longevity_yr anage_body_mass_g metabolic_rate_w ...
#> 1         Vulpes vulpes             15.2            5480.0            10.41 ...
#> 2   Aquila chrysaetos              46.0            4210.0             8.94 ...
#> 3 Crocodylus niloticus             44.0          242500.0               NA ...
#> 4            Bufo bufo             36.0              48.0               NA ...
#> 5         Salmo salar              13.0            3400.0               NA ...

The body mass and litter size columns use the anage_ prefix to distinguish them from their PanTHERIA equivalents. The max_longevity_yr column is the maximum recorded lifespan in years, the most widely used parameter for cross-species aging comparisons.

The dataset is compiled from the Human Ageing Genomic Resources (HAGR) and is freely available under CC BY. The reference is Tacutu et al. (2018), Nucleic Acids Research 46:D1083-D1090.

AnimalTraits body mass and metabolic rate (Herberstein et al. 2022)

AnimalTraits is a curated database of body mass and metabolic rate measurements covering ~2,000 species across arthropods (~1,700 species), vertebrates, molluscs, and annelids. Unlike taxon-specific databases, it provides a unified framework for cross-taxon allometric comparisons — particularly valuable for arthropods, which are underrepresented in other trait databases.

arthropods <- taxify(c(
  "Drosophila melanogaster", "Apis mellifera",
  "Tenebrio molitor", "Gryllus campestris"
), backend = c("col", "gbif"))

arthropods |> add_animaltraits()
#>                input_name animaltraits_body_mass_kg animaltraits_metabolic_rate_w
#> 1 Drosophila melanogaster              0.000001030                    0.000000218
#> 2         Apis mellifera              0.000100000                    0.000012600
#> 3       Tenebrio molitor              0.000140000                    0.000004850
#> 4    Gryllus campestris              0.000800000                           NA

The data is stored as individual-level observations in the source CSV; taxify’s parse function aggregates these to species-level medians. Body mass is in kilograms and metabolic rate in watts (both in SI units, as published). The animaltraits_ prefix avoids collision with body mass columns from other enrichments.
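The aggregation step can be sketched with base R's aggregate(). The observation table below is invented, the source column names are placeholders, and dropping missing values before taking the median is an assumption of this sketch rather than documented taxify behaviour.

```r
# Invented individual-level observations; column names are placeholders.
observations <- data.frame(
  species = c("Species a", "Species a", "Species a",
              "Species b", "Species b"),
  body_mass_kg     = c(1.0e-6, 1.2e-6, 0.9e-6, 1.0e-4, 1.4e-4),
  metabolic_rate_w = c(2.0e-7, NA, 2.4e-7, 1.2e-5, NA)
)

# One row per species, each trait reduced to its median.
# na.rm = TRUE (an assumption here) ignores missing measurements.
species_level <- aggregate(
  observations[c("body_mass_kg", "metabolic_rate_w")],
  by  = list(species = observations$species),
  FUN = median,
  na.rm = TRUE
)
species_level
```

The median is less sensitive than the mean to a single unusually large or small individual, which matters when a species is represented by only a handful of measurements.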

The dataset is licensed under CC0 (public domain) and published on Zenodo. The reference is Herberstein et al. (2022), Scientific Data 9:265.

Butterfly enrichments

LepTraits butterfly traits (Shirey et al. 2022)

LepTraits 1.0 is the most comprehensive open butterfly trait database, covering ~12,400 species of Papilionoidea globally. It provides wingspan, voltinism, diapause stage, four habitat affinity dimensions, host plant data, and adult flight phenology.

butterflies <- taxify(c(
  "Vanessa cardui", "Pieris rapae", "Papilio machaon",
  "Lycaena phlaeas", "Colias crocea"
), backend = c("col", "gbif"))

butterflies |> add_leptraits()
#>        input_name wingspan_mm voltinism diapause_stage canopy_affinity ...
#> 1   Vanessa cardui        62.5       3.0             NA    Open canopy ...
#> 2     Pieris rapae        47.5       4.0           pupa    Open canopy ...
#> 3  Papilio machaon        75.0       2.0           pupa    Open canopy ...
#> 4  Lycaena phlaeas        30.0       3.0          larva    Open canopy ...
#> 5    Colias crocea        48.5       3.0          adult    Open canopy ...

The wingspan is computed as the midpoint of the lower and upper bounds reported in the dataset. Voltinism indicates the number of generations per year. The four habitat affinities (canopy, edge, moisture, disturbance) are categorical variables describing the species’ preferred environmental context. The flight_months column counts the number of months with recorded adult flight activity.
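The midpoint calculation is a one-liner. In this sketch the lower and upper bound column names are placeholders, and the values are made up; only the output column name wingspan_mm comes from the example above.

```r
# Made-up bounds; wingspan_lower_mm and wingspan_upper_mm are
# placeholder names for the two bounds in the source data.
wings <- data.frame(
  species           = c("Species a", "Species b"),
  wingspan_lower_mm = c(55, 45),
  wingspan_upper_mm = c(70, 50)
)

# Midpoint of the reported range.
wings$wingspan_mm <- (wings$wingspan_lower_mm + wings$wingspan_upper_mm) / 2
wings$wingspan_mm
```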

The dataset is licensed under CC0 and published on Figshare. The reference is Shirey et al. (2022), Scientific Data 9:398.

Arthropod enrichments

NW European Arthropod life-history traits (Logghe et al. 2025)

This dataset provides 28 life-history and ecological traits for ~4,900 arthropod species from Northwestern Europe, covering 10 orders including Coleoptera, Hemiptera, Orthoptera, Araneae, Diptera, Hymenoptera, and Lepidoptera. It is the most comprehensive open arthropod trait compilation for this region.

insects <- taxify(c(
  "Abax parallelepipedus", "Pterostichus melanarius",
  "Chorthippus parallelus", "Araneus diadematus"
), backend = c("col", "gbif"))

insects |> add_arthropod_traits()
#>               input_name arthropod_body_size_mm arthropod_dispersal arthropod_voltinism arthropod_feeding_guild ...
#> 1 Abax parallelepipedus                   18.5                0.01                 1.0               carnivore ...
#> 2 Pterostichus melanarius                 15.0                0.32                 1.0               carnivore ...
#> 3 Chorthippus parallelus                  17.0                0.10                 1.0               herbivore ...
#> 4   Araneus diadematus                    13.0                0.45                 1.0               carnivore ...

All columns are prefixed with arthropod_. The quantitative traits include body size (mm), dispersal ability (0-1 ratio within order), mean voltinism, fecundity, development time (days), lifespan (days), and thermal niche mean (°C). The categorical traits include diurnality, feeding guild, and trophic range.

Because this dataset is geographically scoped to NW Europe (Belgium, Luxembourg, Netherlands, northern France, UK, western Germany), species from other regions will have NA values. The dataset is particularly strong for Coleoptera, Hemiptera, and Orthoptera, with near-complete coverage of the regional fauna in those orders.

The dataset is licensed under CC BY-NC and published in Biodiversity Data Journal. The reference is Logghe et al. (2025), Biodiversity Data Journal 13:e146785.

Group-based enrichments

Five enrichments filter by a grouping variable (country code, TDWG botanical region code, GloNAF region code, or language code) and pivot the result to wide format. The mechanics are the same across all of them: when a single group value is requested, the output column uses the base name (e.g., invasive_status). When multiple group values are requested, each output column gets a suffix derived from the group value (e.g., invasive_status_AT, invasive_status_DE). Passing "all" as the group value expands to every group present in the enrichment dataset.

This design keeps the output tidy for the common case (one country, one region, one language) while still supporting comparative analyses across multiple groups without reshaping the data manually. Internally, the group-based join performs one match() call per group value, so requesting 10 countries costs roughly 10 times the computation of a single country. This is still fast for typical use cases (sub-second for results with tens of thousands of rows), but requesting "all" on a large result may take a few seconds because it iterates over all group values in the enrichment (196 countries for GRIIS, dozens of TDWG regions for WCVP).
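The naming rule and the one-match()-per-group cost can be illustrated with a small base R sketch. This is not the package's implementation: join_by_group() is a hypothetical helper, and the enrichment table holds invented records rather than GRIIS data.

```r
# Hypothetical long-format enrichment table (invented records)
enrich <- data.frame(
  canonical_name  = c("Species a", "Species a", "Species b"),
  country         = c("AT", "DE", "AT"),
  invasive_status = c("invasive", "introduced", "native"),
  stringsAsFactors = FALSE
)

# Stand-in for a taxify() result
result <- data.frame(
  accepted_name = c("Species a", "Species b", "Species c"),
  stringsAsFactors = FALSE
)

# Sketch of a group-based join: one match() call per requested group value
join_by_group <- function(x, enrich, groups, value = "invasive_status") {
  for (g in groups) {
    rows <- enrich[enrich$country == g, ]
    idx  <- match(x$accepted_name, rows$canonical_name)
    # Base name for a single group, suffixed name for several groups
    col <- if (length(groups) == 1L) value else paste0(value, "_", g)
    x[[col]] <- rows[[value]][idx]
  }
  x
}

single <- join_by_group(result, enrich, "AT")
multi  <- join_by_group(result, enrich, c("AT", "DE"))
names(single)
names(multi)
```

Because the loop body runs once per group value, the cost grows linearly with the number of groups requested, which is the behavior described above.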

Invasive species status (GRIIS)

The Global Register of Introduced and Invasive Species (GRIIS) classifies species as native, introduced, or invasive on a per-country basis. The dataset covers 196 countries with ~23,000 species-country combinations. The country argument takes ISO 3166-1 alpha-2 codes (e.g., "AT" for Austria, "DE" for Germany, "GB" for the United Kingdom).

Single country

plants <- taxify(c(
  "Robinia pseudoacacia", "Ailanthus altissima",
  "Impatiens glandulifera", "Quercus robur",
  "Reynoutria japonica", "Solidago canadensis"
))

plants |> add_invasive_status(country = "AT")
#>            input_name invasive_status
#> 1 Robinia pseudoacacia        invasive
#> 2  Ailanthus altissima        invasive
#> 3 Impatiens glandulifera      invasive
#> 4        Quercus robur          native
#> 5  Reynoutria japonica        invasive
#> 6 Solidago canadensis        invasive

With a single country code, the output column is simply invasive_status without any suffix. The three possible values are "native", "introduced", and "invasive". Species not recorded in the GRIIS dataset for the requested country receive NA. Note that NA does not mean “native”; it means “no record” in the GRIIS database. Many native species are simply not listed because GRIIS focuses on introduced and invasive taxa.
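When summarizing a status column, it helps to keep the missing values visible rather than letting them disappear or be read as "native". The sketch below uses a hand-written status vector, not real GRIIS output, and the "no GRIIS record" label is our own choice.

```r
# Illustrative status vector; values are not taken from GRIIS
status <- c("invasive", "invasive", "invasive", "native", NA, NA)

# useNA = "ifany" keeps the NA count in the tabulation
counts <- table(status, useNA = "ifany")
counts

# Recode NA explicitly instead of folding it into "native"
status_label <- ifelse(is.na(status), "no GRIIS record", status)
table(status_label)
```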

Multiple countries

When comparing invasive status across countries, pass a vector of codes. Each output column is suffixed with the corresponding country code.

plants |> add_invasive_status(country = c("AT", "DE", "GB"))
#>            input_name invasive_status_AT invasive_status_DE invasive_status_GB
#> 1 Robinia pseudoacacia         invasive           invasive           invasive
#> 2  Ailanthus altissima         invasive           invasive         introduced
#> 3 Impatiens glandulifera       invasive           invasive           invasive
#> 4        Quercus robur           native             native             native
#> 5  Reynoutria japonica         invasive           invasive           invasive
#> 6 Solidago canadensis         invasive           invasive         introduced

This layout makes cross-country comparisons straightforward. Filtering for species that differ in status between countries is a matter of subsetting columns. The example below finds species classified as invasive in Austria but not (yet) classified as invasive in Germany:

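result <- plants |> add_invasive_status(country = c("AT", "DE"))
# Species invasive in Austria but not (yet) classified as invasive in Germany.
# %in% returns FALSE for NA, so species without a GRIIS record for Germany
# are kept as ordinary rows instead of producing all-NA rows.
result[result$invasive_status_AT %in% "invasive" &
       !(result$invasive_status_DE %in% "invasive"), ]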

This pattern is useful for identifying species that may be expanding their invasive range, or for comparing the regulatory status of non-native species across neighboring countries.

All countries

Passing country = "all" expands the result with one column per country in the GRIIS dataset (196 countries). This produces a wide data.frame with 196 additional columns, so it is best reserved for full-scale screening exercises where the complete geographic profile of each species matters.

plants |> add_invasive_status(country = "all")
# Adds invasive_status_AD, invasive_status_AE, ..., invasive_status_ZW

The resolution of "all" to the full list of country codes is done efficiently: if the manifest contains an available_groups field for the GRIIS enrichment (which it normally does), the codes are read directly from that field without scanning the .vtr file. This makes even the "all" case fast to set up, though the subsequent join across 196 groups naturally takes longer than a single-country join.

Alien species first records (Seebens et al.)

The Global Alien Species First Record Database (Seebens et al. 2017) records the year each alien species was first documented in a given country or territory. Unlike GRIIS (which records current status), this enrichment provides a historical timeline of alien species arrivals. The dataset covers all taxa (plants, animals, fungi) with ~77,000 species-country combinations across 241 countries. The country argument takes ISO 3166-1 alpha-2 codes, same as add_invasive_status().

Single country

aliens <- taxify(c(
  "Robinia pseudoacacia", "Ailanthus altissima",
  "Impatiens glandulifera", "Quercus robur",
  "Ambrosia artemisiifolia", "Solidago canadensis"
))

aliens |> add_alien_first_records(country = "AT")
#>              input_name alien_first_record alien_first_record_source alien_first_record_reference
#> 1   Robinia pseudoacacia               1850                   NOBANIS                      NOBANIS
#> 2    Ailanthus altissima               1870                   NOBANIS                      NOBANIS
#> 3 Impatiens glandulifera               1900                   NOBANIS                      NOBANIS
#> 4          Quercus robur                 NA                      <NA>                         <NA>
#> 5 Ambrosia artemisiifolia              1863                   NOBANIS                      NOBANIS
#> 6  Solidago canadensis                 1850                   NOBANIS                      NOBANIS

Each row gets three columns: alien_first_record (the year as an integer), alien_first_record_source (the database that contributed this record, e.g., “NOBANIS”, “GAVIA”, “FishBase”), and alien_first_record_reference (the original citation). Native species like Quercus robur receive NA because they are not in the alien first records database.

The source and reference columns provide row-level provenance. This matters because a second first-records source (GBIF occurrence-based records) will be added in a future version, and the source column will distinguish which database contributed each record.

Multiple countries

aliens |> add_alien_first_records(country = c("AT", "DE", "GB"))
#>              input_name alien_first_record_AT alien_first_record_DE alien_first_record_GB ...
#> 1   Robinia pseudoacacia                  1850                  1630                  1640 ...
#> 2    Ailanthus altissima                  1870                  1780                  1751 ...
#> 3 Impatiens glandulifera                  1900                  1839                  1855 ...

With multiple countries, each of the three value columns gets a country suffix: alien_first_record_AT, alien_first_record_source_AT, alien_first_record_reference_AT, etc. This makes cross-country comparisons of invasion history straightforward.
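One such comparison is finding the earliest documented arrival across the queried countries. The sketch below rebuilds the three rows printed above by hand so that it runs without taxify. The earliest_year and earliest_country columns are names we introduce here, not columns produced by the package.

```r
# Values copied from the printed output above
wide <- data.frame(
  input_name = c("Robinia pseudoacacia", "Ailanthus altissima",
                 "Impatiens glandulifera"),
  alien_first_record_AT = c(1850L, 1870L, 1900L),
  alien_first_record_DE = c(1630L, 1780L, 1839L),
  alien_first_record_GB = c(1640L, 1751L, 1855L),
  stringsAsFactors = FALSE
)

# Select only the year columns (two-letter suffix), not source/reference
year_cols <- grep("^alien_first_record_[A-Z]{2}$", names(wide), value = TRUE)

# Earliest documented arrival across the queried countries
wide$earliest_year <- do.call(pmin, c(wide[year_cols], na.rm = TRUE))

# Country holding that earliest record; NA if a row has no record at all
m <- as.matrix(wide[year_cols])
first_idx <- apply(m, 1, function(r) {
  if (all(is.na(r))) NA_integer_ else which.min(r)
})
wide$earliest_country <- sub("^alien_first_record_", "", year_cols[first_idx])

wide[, c("input_name", "earliest_year", "earliest_country")]
```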

Reshaping to long format

When working with multiple countries, the wide format can be unwieldy for modelling, mapping, or timeline analyses. The taxify_long() helper reshapes any group-based enrichment columns back to long format:

aliens |>
  add_alien_first_records(country = c("AT", "DE", "GB")) |>
  taxify_long()
#>              input_name country_code alien_first_record alien_first_record_source ...
#> 1   Robinia pseudoacacia           AT               1850                   NOBANIS ...
#> 2    Ailanthus altissima           AT               1870                   NOBANIS ...
#> ...
#> 7   Robinia pseudoacacia           DE               1630              Long (2003) ...
#> ...

When cols and group_col are omitted, taxify_long() auto-detects them from metadata stamped by the add_*() functions. The result has one row per species per country, with the base column names (no suffix) and a new country_code column. The drop_na = TRUE argument removes rows where all value columns are NA (e.g., native species with no alien first record in any queried country).

taxify_long() works with any group-based enrichment, not just alien first records. It can reshape invasive_status, native_status, or common_name columns just as easily:

aliens |>
  add_invasive_status(country = c("AT", "DE")) |>
  taxify_long()

When multiple grouped enrichments share the same group column, they are reshaped together. If an enrichment covers different groups than another (e.g., GRIIS for AT/DE but first records for AT/DE/CH), the missing combinations are padded with NA:

aliens |>
  add_invasive_status(country = c("AT", "DE")) |>
  add_alien_first_records(country = c("AT", "DE", "CH")) |>
  taxify_long()

You can still provide cols and group_col explicitly to override auto-detection or to rename the group column.
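To make the reshaping concrete, here is a base R sketch of the same wide-to-long idea on a hand-built data.frame. It is not the implementation of taxify_long(): to_long() is a hypothetical helper that handles a single value column, and its drop_na argument only mimics the behavior described above.

```r
# Hand-built wide table; the NA row stands in for a native species
wide <- data.frame(
  input_name = c("Robinia pseudoacacia", "Quercus robur"),
  alien_first_record_AT = c(1850L, NA),
  alien_first_record_DE = c(1630L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stack suffixed columns into one base column
# plus a group column
to_long <- function(x, base, groups, group_col = "country_code",
                    drop_na = FALSE) {
  pieces <- lapply(groups, function(g) {
    out <- x["input_name"]
    out[[group_col]] <- g
    out[[base]] <- x[[paste0(base, "_", g)]]
    out
  })
  long <- do.call(rbind, pieces)
  if (drop_na) long <- long[!is.na(long[[base]]), , drop = FALSE]
  rownames(long) <- NULL
  long
}

long <- to_long(wide, "alien_first_record", c("AT", "DE"), drop_na = TRUE)
long
```

With drop_na = TRUE the Quercus robur rows vanish, leaving one row per species-country combination that actually has a record.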

Native range by botanical region (WCVP)

The World Checklist of Vascular Plants (WCVP) from the Royal Botanic Gardens, Kew, classifies ~340,000 plant species as native, introduced, or extinct in TDWG Level 2 botanical regions. TDWG (the Taxonomic Databases Working Group, now known as Biodiversity Information Standards) defined a hierarchical system of geographic regions for recording plant distributions. Level 2 regions are continent-scale units.

The region argument takes TDWG Level 2 codes. Common codes include:

  • EUR (Europe)

  • NAM (Northern America)

  • SAM (Southern America)

  • AFR (Africa)

  • AUS (Australasia)

  • ASI (Asia-Temperate)

  • AST (Asia-Tropical)

  • PAC (Pacific)

  • ANT (Antarctica)

trees <- taxify(c(
  "Quercus robur", "Quercus suber", "Eucalyptus globulus",
  "Nothofagus pumilio", "Sequoiadendron giganteum"
))

trees |> add_wcvp(region = "EUR")
#>              input_name native_status
#> 1          Quercus robur        native
#> 2          Quercus suber        native
#> 3    Eucalyptus globulus          <NA>
#> 4    Nothofagus pumilio          <NA>
#> 5 Sequoiadendron giganteum        <NA>

Eucalyptus globulus returns NA for Europe because it is native to Australia, not because it is absent from the WCVP dataset. The dataset records where a species is natively distributed, not where it has been planted or naturalized. This is an important distinction: many cultivated species will show NA in regions where they are widespread in gardens and plantations.

Querying multiple regions reveals each species’ native continental range:

trees |> add_wcvp(region = c("EUR", "AUS", "SAM"))
#>              input_name native_status_EUR native_status_AUS native_status_SAM
#> 1          Quercus robur           native              <NA>              <NA>
#> 2          Quercus suber           native              <NA>              <NA>
#> 3    Eucalyptus globulus             <NA>            native              <NA>
#> 4    Nothofagus pumilio             <NA>              <NA>            native
#> 5 Sequoiadendron giganteum            <NA>              <NA>              <NA>

Sequoiadendron giganteum (Giant Sequoia) returns NA for all three regions because it is native to western North America (NAM), which was not included in the query. This illustrates that the absence of a region code from the query does not mean the species lacks native range data; it means we did not ask about the right region.
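A compact way to read a multi-region result is to collapse the suffixed columns into one string of region codes per species. The sketch below rebuilds part of the printed output by hand. The native_regions column is a name we introduce for illustration, and an NA in it means only "not native in any queried region".

```r
# Three rows rebuilt from the printed output above
wide <- data.frame(
  input_name = c("Quercus robur", "Eucalyptus globulus",
                 "Sequoiadendron giganteum"),
  native_status_EUR = c("native", NA, NA),
  native_status_AUS = c(NA, "native", NA),
  native_status_SAM = rep(NA_character_, 3),
  stringsAsFactors = FALSE
)

status_cols <- grep("^native_status_", names(wide), value = TRUE)
regions     <- sub("^native_status_", "", status_cols)

# Logical matrix: TRUE where a species is native in a queried region
is_native <- as.matrix(wide[status_cols]) == "native"
is_native[is.na(is_native)] <- FALSE

# Collapse region codes per species; NA if none of the queried regions match
wide$native_regions <- apply(is_native, 1, function(r) {
  if (any(r)) paste(regions[r], collapse = ";") else NA_character_
})

wide[, c("input_name", "native_regions")]
```

Sequoiadendron giganteum ends up with NA here for the reason given above: NAM was not among the queried regions.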

The full list of available TDWG codes can be retrieved programmatically from the manifest (the available_groups field for the wcvp enrichment), or from the TDWG geographic standard documentation. WCVP is non-static: Kew updates the checklist periodically, and the taxify enrichment is rebuilt when new versions are published.

Naturalized alien flora by region (GloNAF)

The Global Naturalized Alien Flora (GloNAF) records which plant species are naturalized in ~1,300 regions worldwide. Unlike GRIIS (which classifies species as native/introduced/invasive per country), GloNAF provides a binary naturalization flag per region with finer geographic resolution, using TDWG-compatible codes extended with dot notation for sub-national units (e.g., "USA.CA" for California).

plants <- taxify(c(
  "Robinia pseudoacacia", "Ailanthus altissima",
  "Impatiens glandulifera", "Quercus robur"
))

plants |> add_glonaf(region = "EUR")
#>              input_name naturalized
#> 1   Robinia pseudoacacia           1
#> 2    Ailanthus altissima           1
#> 3 Impatiens glandulifera           1
#> 4          Quercus robur          NA

The output column naturalized is 1 if the species is recorded as naturalized in the queried region, and NA otherwise. Multiple regions produce suffixed columns (naturalized_EUR, naturalized_NAM). The region = "all" option expands to all ~1,300 regions.

GloNAF complements GRIIS: GRIIS provides the invasion status dimension (native/introduced/invasive), while GloNAF provides the geographic coverage dimension (where has this species established self-sustaining populations?). Combining both gives a fuller picture of alien plant distributions.

The dataset is licensed under CC BY 4.0. The references are van Kleunen et al. (2019), Ecology 100:e02542 (v1.0) and Davis et al. (2025), Ecology e70245 (v2.0). GloNAF is classified as static in the taxify manifest.

Common (vernacular) names (GBIF)

The common names enrichment draws on GBIF’s vernacular name database, which aggregates names from many national and regional nomenclature sources. It is the most multilingual of the enrichments, covering dozens of languages. The lang argument takes ISO 639-1 two-letter language codes.

species <- taxify(c(
  "Quercus robur", "Parus major", "Vulpes vulpes",
  "Bufo bufo", "Picea abies"
))

species |> add_common_names()
#>     input_name common_name
#> 1 Quercus robur   Pedunculate Oak
#> 2   Parus major    Great Tit
#> 3  Vulpes vulpes    Red Fox
#> 4     Bufo bufo    Common Toad
#> 5   Picea abies    Norway Spruce

The default language is English (lang = "en"). Switching to another language is a matter of changing the lang argument:

species |> add_common_names(lang = "de")
#>     input_name common_name
#> 1 Quercus robur   Stieleiche
#> 2   Parus major    Kohlmeise
#> 3  Vulpes vulpes    Rotfuchs
#> 4     Bufo bufo    Erdkroete
#> 5   Picea abies    Gemeine Fichte

When multiple common names exist for a species in the requested language, the first (most commonly used) entry is returned. Coverage varies substantially by language: English and German have the broadest coverage (most European and widespread species have entries). French, Spanish, Portuguese, and Dutch also have good coverage. Less widely spoken languages or languages with limited digital biodiversity infrastructure may have gaps, resulting in NA for species that do have common names in those languages but that have not been digitized in GBIF’s aggregation.

The common names enrichment is non-static (GBIF updates its backbone periodically) and licensed under CC0. When a single language is requested, the output column is common_name. If multiple languages were supported in a single call, they would follow the group-based suffix pattern, but in practice the common usage pattern is one language per call.

Stacking enrichments

The add_*() functions return the same data.frame class (taxify_result) they receive, preserving all attributes including the metadata used by summary(). This means they compose naturally with the pipe operator. A typical workflow chains taxify() with several enrichment calls, building up the desired set of columns incrementally.

library(taxify)

plant_result <- taxify(c(
  "Quercus robur", "Fagus sylvatica", "Picea abies",
  "Arrhenatherum elatius", "Festuca rubra", "Plantago lanceolata"
)) |>
  add_conservation_status() |>
  add_woodiness() |>
  add_eive() |>
  add_diaz_traits()

Each enrichment appends its columns to the right of the data.frame. The result of this chain has the original 16 taxify columns plus conservation_status, woodiness, the five EIVE columns (eive_light, eive_temperature, eive_moisture, eive_reaction, eive_nutrients), and the two Diaz columns (seed_mass_mg, plant_height_m). That is 25 columns total. Order within the chain changes the column order but not the values, because each enrichment operates independently on the accepted_name column. The one exception is the column-name collision between LEDA and Diaz (seed_mass_mg), discussed earlier.

Here is a similar chain for birds, combining morphological measurements with diet data and vernacular names:

bird_result <- taxify(c(
  "Parus major", "Cyanistes caeruleus", "Erithacus rubecula",
  "Turdus merula", "Falco peregrinus"
)) |>
  add_conservation_status() |>
  add_avonet() |>
  add_elton_traits() |>
  add_common_names()

This produces 16 (base) + 1 (conservation) + 11 (AVONET) + 18 (EltonTraits) + 1 (common name) = 47 columns. Both AVONET and EltonTraits contribute body mass, but in distinct columns (avonet_body_mass_g and elton_body_mass_g), so there is no overwriting.

And for mammals, combining life-history traits from PanTHERIA with diet data from EltonTraits and German common names:

mammal_result <- taxify(c(
  "Vulpes vulpes", "Canis lupus", "Ursus arctos",
  "Lutra lutra", "Lynx lynx"
)) |>
  add_conservation_status() |>
  add_pantheria() |>
  add_elton_traits() |>
  add_common_names(lang = "de")

Both EltonTraits and PanTHERIA cover mammals, so both contribute data to the mammal chain. EltonTraits provides diet composition percentages and foraging strata; PanTHERIA provides life-history traits like longevity, litter size, and home range. The combination gives a multidimensional view of each species’ ecology without any manual data assembly.

The same pattern works for fungi, combining lifestyle traits from FungalTraits with guild classifications from FUNGuild:

fungal_result <- taxify(c(
  "Amanita muscaria", "Boletus edulis", "Trametes versicolor"
)) |>
  add_conservation_status() |>
  add_fungal_traits() |>
  add_funguild()

FungalTraits provides the detailed ecological traits (lifestyle, growth form, substrate, mycorrhizal exploration type), while FUNGuild adds the trophic mode and guild classification. The two enrichments have complementary column sets. Growth form is the one trait both report, and FUNGuild stores its version under the prefixed name funguild_growth_form, so nothing is overwritten. The confidence_ranking column from FUNGuild is a useful quality filter: restricting to “Highly Probable” assignments before downstream analysis reduces noise.
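The confidence filter is a one-line subset. The sketch below runs on a hand-built data.frame: only confidence_ranking is a column named in this vignette, while the guild column and all values are illustrative assumptions.

```r
# Toy table; species labels and guild values are made up
fungi <- data.frame(
  input_name = c("sp_a", "sp_b", "sp_c", "sp_d"),
  guild = c("Ectomycorrhizal", "Wood Saprotroph", "Plant Pathogen", NA),
  confidence_ranking = c("Highly Probable", "Probable", "Possible", NA),
  stringsAsFactors = FALSE
)

# %in% treats NA as "no match", so rows without a FUNGuild assignment
# are dropped instead of turning into all-NA rows
high_conf <- fungi[fungi$confidence_ranking %in% "Highly Probable", ]
high_conf
```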

Fish analyses can similarly combine morphological and ecological enrichments:

fish_result <- taxify(c(
  "Salmo trutta", "Esox lucius", "Gadus morhua"
)) |>
  add_conservation_status() |>
  add_fish_traits() |>
  add_fishbase()

FISHMORPH provides the morphological ratios (body elongation, fin shape, eye size) that define the ecomorphological profile of each species, while FishBase adds the ecological and life-history context (trophic level, depth range, vulnerability). Note that Gadus morhua (Atlantic Cod) will have NA values in all FISHMORPH columns because FISHMORPH covers freshwater species only, but it will be fully populated by FishBase. This kind of partial coverage across complementary enrichments is expected and easy to diagnose from the summary() output.

Enrichment chains can be as long as needed. Performance is linear in the number of enrichments: each add_*() call performs one join, regardless of how many enrichments have already been applied. A chain of 10 enrichments on a 50,000-row result completes in seconds. The enrichment files themselves are loaded via vectra’s memory-mapped I/O, so even enrichments with hundreds of thousands of rows (like WCVP at ~340,000) do not consume large amounts of RAM.

The pipe chain pattern also plays well with reproducibility workflows. The entire analysis, from raw species list to fully enriched table, is captured in a single, readable expression. Saving this code alongside the session info (including taxify version and enrichment versions from summary()) gives a complete record of which data was used to produce the results.

Coverage patterns

Not all species appear in all enrichments. Each dataset has a taxonomic scope (plants, birds, mammals, amphibians, vertebrates, butterflies, arthropods, fungi, algae, fish, lizards, or cross-taxon) and a geographic scope (global, European, NW European). When an enrichment has no data for a species, the corresponding columns contain NA. Understanding coverage patterns is essential for interpreting enriched results correctly and for choosing which enrichments to apply.

mixed <- taxify(c(
  "Quercus robur",     # plant
  "Parus major",       # bird
  "Vulpes vulpes",     # mammal
  "Bufo bufo",         # amphibian
  "Amanita muscaria",  # fungus
  "Salmo trutta"       # fish
))

mixed |>
  add_woodiness() |>
  add_avonet() |>
  add_pantheria() |>
  add_amphibio() |>
  add_fungal_traits() |>
  add_fishbase()
#>        input_name woodiness beak_length pantheria_body_mass_g body_size_mm primary_lifestyle fb_trophic_level ...
#> 1   Quercus robur     woody          NA                    NA           NA              <NA>               NA ...
#> 2     Parus major      <NA>        11.2                    NA           NA              <NA>               NA ...
#> 3   Vulpes vulpes      <NA>          NA                5480.0           NA              <NA>               NA ...
#> 4       Bufo bufo      <NA>          NA                    NA        150.0              <NA>               NA ...
#> 5 Amanita muscaria     <NA>          NA                    NA           NA   ectomycorrhizal               NA ...
#> 6    Salmo trutta      <NA>          NA                    NA           NA              <NA>              3.4 ...

Each species populates only the columns from enrichments that cover its taxon group. The NA values are not errors or data quality problems; they reflect the scope of the underlying datasets. Quercus robur has a woodiness value but no beak length, body mass, body size, fungal traits, or fish data. Amanita muscaria has a primary lifestyle but no plant, bird, mammal, amphibian, or fish traits. Salmo trutta has FishBase data but nothing from the other taxon-specific enrichments. This is expected behavior.

Approximate coverage rates by enrichment

The following table summarizes the approximate species coverage of each enrichment, its taxonomic scope, and its geographic scope. Numbers are approximate because enrichments are updated periodically and because coverage depends somewhat on the backbone used (different backbones accept slightly different sets of names).

Enrichment           Taxon scope                Geographic scope  ~Species
conservation_status  all groups                 global            166,000
woodiness            plants                     global            50,000
eive                 plants                     European          14,500
diaz_traits          plants                     global            46,000
leda                 plants                     NW European       8,000
elton_traits         birds + mammals            global            15,400
avonet               birds                      global            11,000
pantheria            mammals                    global            5,400
amphibio             amphibians                 global            6,800
fungal_traits        fungi                      global            10,200 genera
funguild             fungi                      global            13,000
algae_traits         macroalgae                 European          1,745
fish_traits          freshwater fish            global            8,300
fishbase             all fish                   global            35,000
lizard_traits        lizards                    global            6,600
anage                vertebrates                global            4,700
animaltraits         cross-taxon (arthropods+)  global            2,000
leptraits            butterflies                global            12,400
arthropod_traits     arthropods                 NW European       4,900
griis                all groups                 per country       23,000 combos
glonaf               plants                     global by region  16,000 × 1,300
wcvp                 plants                     global by region  340,000
common_names         all groups                 multi-language    varies

For a European plant survey, the enrichment with the highest absolute coverage is WCVP (~340,000 species), followed by conservation status (~166,000), woodiness (~50,000), Diaz traits (~46,000), EIVE (~14,500), and LEDA (~8,000). However, for a specifically NW European dataset, LEDA’s ~8,000 species may actually cover a larger fraction of the species list than the Diaz dataset, because LEDA is geographically focused on the same region.

Interpreting NA columns

When an entire column is NA for all rows in a result, the most likely explanation is a taxon-scope mismatch. Woodiness covers vascular plants, so a bird dataset will have NA in every row of that column. The reverse holds for AVONET against a plant list, or PanTHERIA against amphibians. This is expected behavior: the enrichment data simply does not include species from that taxon group.

A partially populated column (some rows NA, others filled) means the enrichment covers the taxon group but the specific species is not in the source dataset. Common reasons for per-species NA include:

  1. Source dataset incomplete. No trait database covers 100% of described species. PanTHERIA covers ~5,400 of the ~6,500 described mammal species; roughly 1,100 mammals will have NA values.
  2. Recently described species. Species described or split after the dataset’s publication date will be absent. PanTHERIA (2009) misses all species described since 2009.
  3. Name alignment failure. Rare, but possible for taxa with ongoing taxonomic revisions where the backbone and the enrichment source use different name variants that the cross-backbone resolution did not capture. If a species consistently fails to match, filing a GitHub issue helps us improve the name alignment pipeline.
  4. Infraspecific taxa. Most enrichments operate at the species level. If the taxify result contains subspecies or varieties (e.g., “Quercus robur subsp. robur”), the enrichment may not have a matching entry at that rank.

Coverage is not the same as data quality. An enrichment might cover 95% of the species in a result, but the trait values for some of those species could be based on few measurements, extrapolated from congeners, or derived from captive rather than wild populations. The enrichment system does not expose confidence intervals or sample sizes for individual trait values; that level of detail lives in the original source databases. For analyses that require measurement-level metadata (sample size, measurement uncertainty, geographic origin of measurements), consult the original source cited on the add_*() help page.

To check the overall enrichment rate for a result, the summary() output reports the number of matched rows per enrichment. We can also compute it directly:

result <- taxify(species_list) |> add_woodiness()
# Fraction of matched species with woodiness data
mean(!is.na(result$woodiness[!is.na(result$accepted_name)]))
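The same calculation extends to several enrichment columns at once, which also separates the "entire column is NA" case from the "partially populated" case. The sketch below uses a hand-built data.frame that mimics a few columns of an enriched result. The diagnosis labels are our own wording, not output from taxify.

```r
# Stand-in for an enriched result; the last row is an unmatched name
res <- data.frame(
  accepted_name    = c("Quercus robur", "Parus major", "Vulpes vulpes", NA),
  woodiness        = c("woody", NA, NA, NA),
  beak_length      = c(NA, 11.2, NA, NA),
  fb_trophic_level = c(NA_real_, NA, NA, NA),
  stringsAsFactors = FALSE
)

matched    <- !is.na(res$accepted_name)
trait_cols <- setdiff(names(res), "accepted_name")

# Fraction of matched names with a non-NA value, per trait column
coverage <- vapply(res[matched, trait_cols, drop = FALSE],
                   function(col) mean(!is.na(col)), numeric(1))

# Zero coverage usually points to a taxon-scope mismatch
diagnosis <- ifelse(coverage == 0, "no coverage (likely scope mismatch)",
             ifelse(coverage < 1, "partial coverage", "full coverage"))

data.frame(coverage = round(coverage, 2), diagnosis = diagnosis)
```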

The enrichment register in summary() output

Every add_*() call records metadata about the enrichment in an attribute (taxify_meta) on the result data.frame. This metadata includes the enrichment name, source label, version string, and the count of rows that received non-NA trait values. Calling summary() on a taxify result displays this information alongside the standard match statistics, providing a compact overview of the entire analysis pipeline.

result <- taxify(c("Quercus robur", "Fagus sylvatica", "Pinus sylvestris")) |>
  add_conservation_status() |>
  add_woodiness() |>
  add_eive()

summary(result)
#> -- taxify results --------------------------------------------------------
#>   backend: WFO v2024.12  |  3 names submitted
#>
#>   matched       3  (exact: 3, case-insensitive: 0, fuzzy: 0)
#>   --------------------------------------------------------
#>   taxon groups: plant: 3
#>
#>   enrichments:
#>     conservation_status  (IUCN Red List 2025.1)     -- 3 of 3 matched
#>     woodiness            (Zanne et al. 2014 1.0)    -- 3 of 3 matched
#>     eive                 (EIVE 1.0 2023.1)          -- 3 of 3 matched

The enrichment register lists each applied enrichment with its source name, version, and the fraction of successfully matched names. In this example, all three enrichments achieved 100% coverage (3 of 3 matched), which is expected for well-known European tree species. On a larger dataset with a broader taxonomic scope, we would typically see lower fractions, especially for enrichments with narrow geographic or taxonomic coverage.

The register is cumulative: applying more enrichments adds more lines. This makes summary() a useful diagnostic at the end of a pipe chain. If one enrichment shows unexpectedly low coverage (e.g., “2 of 500 matched” for EIVE on a dataset that we expected to be European plants), it signals a problem worth investigating. Common causes include the species list containing non-plant taxa, non-European species, or names at ranks other than species.

The version strings in the register provide exact provenance information for the methods section of a paper. Rather than writing “we used the IUCN Red List” (which version? downloaded when?), we can report the version string directly from summary() (e.g., “IUCN Red List v2025.1 as distributed by taxify enrichment conservation_status v2025.04”).

Practical guidance: which enrichments for which taxa

The choice of enrichments depends on the taxonomic scope and geographic focus of the analysis. Below are recommended enrichment stacks for common use cases, with brief notes on what each enrichment contributes.

Vascular plants (European)

European plant ecology benefits from the richest set of enrichments. A full stack combines conservation status, growth form, mycorrhizal type, ecological niche position, global functional traits, regional functional traits, native range, and vernacular names.

result <- taxify(species_list, backend = "wfo") |>
  add_conservation_status() |>
  add_woodiness() |>
  add_fungalroot() |>
  add_eive() |>
  add_diaz_traits() |>
  add_leda() |>
  add_wcvp(region = "EUR") |>
  add_common_names()

EIVE and LEDA both cover European plants, but their trait columns are complementary. EIVE provides niche position along five environmental gradients (where does this species grow?). LEDA provides morphological and dispersal traits (what does this species look like, how does it disperse?). The combination produces a detailed functional profile suitable for community-weighted mean analyses, trait-based ordinations, and functional diversity calculations. The Diaz traits add a global perspective on seed mass and plant height that complements LEDA’s regional measurements.

The seed_mass_mg collision between LEDA and Diaz was discussed earlier. In this stack, Diaz runs before LEDA, so the LEDA seed_mass_mg will be the value in the final result. If the Diaz value is preferred, reverse the order or omit add_leda().

Vascular plants (global)

Outside Europe, EIVE and LEDA coverage drops to near zero. The global plant stack relies on the wider-coverage datasets and omits the regional enrichments.

result <- taxify(species_list, backend = "wfo") |>
  add_conservation_status() |>
  add_woodiness() |>
  add_fungalroot() |>
  add_diaz_traits() |>
  add_wcvp(region = c("NAM", "SAM", "AFR")) |>
  add_common_names()

Woodiness, Diaz traits, and FungalRoot mycorrhizal type all have global coverage, so they contribute useful data regardless of the geographic origin of the species list. WCVP can be queried for any TDWG region, providing native range information for the continents relevant to the study.

Birds

Birds are covered by two complementary enrichments: AVONET for morphology and migration, EltonTraits for diet and foraging. Together they provide a detailed functional profile spanning body plan, habitat use, dietary niche, and movement ecology.

result <- taxify(species_list, backend = "col") |>
  add_conservation_status() |>
  add_avonet() |>
  add_elton_traits() |>
  add_common_names()

Both AVONET and EltonTraits include body mass, stored in separate columns (avonet_body_mass_g from specimen measurements and elton_body_mass_g from literature compilation). Small discrepancies between the two are expected. Large discrepancies are worth a closer look: they may indicate measurement error in one source, or a sexually dimorphic species for which the two sources sampled different sexes.
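
One way to screen for such cases is to compare the two mass columns on a log scale. The sketch below uses the column names given above on a toy data.frame with invented values; the 1.5-fold cutoff is an arbitrary assumption to be tuned per study.

```r
# Toy stand-in for an enriched bird result; masses are invented.
birds <- data.frame(
  accepted_name      = c("Species A", "Species B", "Species C", "Species D"),
  avonet_body_mass_g = c(17.0, 150.0, 100.0, NA),
  elton_body_mass_g  = c(18.0, 264.0, 102.0, 55.0),
  stringsAsFactors = FALSE
)

# The log ratio treats over- and underestimates symmetrically.
birds$log_ratio <- log(birds$avonet_body_mass_g / birds$elton_body_mass_g)

# Flag species whose two estimates differ by more than a factor of 1.5;
# rows with a missing mass in either source are never flagged.
cutoff <- log(1.5)
birds$flag <- !is.na(birds$log_ratio) & abs(birds$log_ratio) > cutoff

flagged <- birds$accepted_name[birds$flag]
print(flagged)
```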

Mammals

Mammals are covered by PanTHERIA (life-history traits) and EltonTraits (diet and foraging behavior). The combination provides body mass from two independent sources (pantheria_body_mass_g and elton_body_mass_g), which can serve as a cross-validation of the mass data.

result <- taxify(species_list, backend = "col") |>
  add_conservation_status() |>
  add_pantheria() |>
  add_elton_traits() |>
  add_common_names()

PanTHERIA contributes life-history variables that EltonTraits does not cover (longevity, litter size, gestation, weaning age, home range, diet breadth, habitat breadth). EltonTraits contributes the detailed diet composition percentages and foraging stratum data that PanTHERIA does not provide. There is no redundancy except body mass.

Amphibians

AmphiBIO is the sole dedicated amphibian enrichment. It can be combined with conservation status and common names.

result <- taxify(species_list, backend = "col") |>
  add_conservation_status() |>
  add_amphibio() |>
  add_common_names()

Amphibians are the most threatened vertebrate class, with roughly 40% of assessed species listed in threatened categories (VU, EN, or CR) according to the IUCN. Conservation status is therefore particularly informative for amphibian analyses. The combination of AmphiBIO habitat traits (aquatic, fossorial, arboreal) with IUCN status can reveal associations between habitat specialization and extinction risk.
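
A first look at such an association can be a simple proportion of threatened species per habitat class. The sketch below is a simplification under stated assumptions: it uses a toy data.frame, a single habitat column, and a status column named iucn_category, none of which are taken from the AmphiBIO or conservation status layers as shipped. Species that are Data Deficient or unassessed are set to NA rather than counted as non-threatened.

```r
# Toy stand-in; habitat and iucn_category are hypothetical column names.
amph <- data.frame(
  habitat       = c("aquatic", "aquatic", "arboreal", "arboreal",
                    "fossorial", "aquatic", "fossorial"),
  iucn_category = c("EN", "LC", "VU", "LC", "LC", "CR", "DD"),
  stringsAsFactors = FALSE
)

threatened_cats <- c("VU", "EN", "CR")
assessed_cats   <- c("LC", "NT", threatened_cats)

# TRUE/FALSE for assessed species, NA where no risk category is available.
assessed        <- amph$iucn_category %in% assessed_cats
amph$threatened <- ifelse(assessed, amph$iucn_category %in% threatened_cats, NA)

# Proportion of assessed species that are threatened, per habitat class.
prop_threatened <- tapply(amph$threatened, amph$habitat, mean, na.rm = TRUE)
print(round(prop_threatened, 2))
```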

Fish

Fish are covered by two complementary enrichments: FISHMORPH for morphological traits of freshwater species, and FishBase for ecological and life-history traits across all fish (freshwater + marine). The WoRMS backend provides authoritative taxonomy for marine fish; COL and GBIF cover both freshwater and marine species.

result <- taxify(species_list, backend = "worms") |>
  add_conservation_status() |>
  add_fish_traits() |>
  add_fishbase() |>
  add_common_names()

For freshwater fish community studies, both enrichments contribute data. FISHMORPH provides the ecomorphological ratios used in functional diversity calculations, while FishBase adds trophic level, depth range, and vulnerability. For marine fish studies, only FishBase will contribute data (FISHMORPH covers freshwater species only). The fb_vulnerability column from FishBase is particularly useful alongside IUCN conservation status for prioritizing species in fisheries management and marine spatial planning.

Reptiles (lizards)

Lizards are covered by the Meiri lizard traits enrichment, which provides life-history and ecological traits for ~6,600 species. Combined with conservation status, it gives a functional profile suitable for reptile community analyses and conservation assessments.

result <- taxify(species_list, backend = "col") |>
  add_conservation_status() |>
  add_lizard_traits() |>
  add_common_names()

Snakes and turtles are not covered by the lizard enrichment. For those groups, the add_data() function can join custom trait datasets. For cross-class longevity and metabolic comparisons, add_anage() covers reptiles alongside mammals, birds, amphibians, and fish.

Butterflies

LepTraits is the dedicated butterfly enrichment, providing wingspan, voltinism, habitat affinities, and host plant data for ~12,400 species globally. For European butterfly ecology, it can be combined with the NW European Arthropod traits for additional life-history variables.

result <- taxify(species_list, backend = "col") |>
  add_conservation_status() |>
  add_leptraits() |>
  add_common_names()

Arthropods (NW European)

For arthropod community studies in NW Europe, the arthropod traits enrichment provides the most comprehensive trait coverage. It can be combined with AnimalTraits for cross-taxon body mass comparisons and with LepTraits for additional butterfly-specific traits.

result <- taxify(species_list, backend = c("col", "gbif")) |>
  add_conservation_status() |>
  add_arthropod_traits() |>
  add_animaltraits() |>
  add_common_names()

For arthropod studies outside NW Europe, AnimalTraits provides body mass for ~1,700 arthropod species globally, though with fewer trait dimensions than the Logghe et al. dataset.

Fungi

Fungi are covered by two enrichments: FungalTraits for genus-level ecological traits and FUNGuild for trophic guild classifications. The COL and GBIF backends provide fungal taxonomy.

result <- taxify(species_list, backend = "col") |>
  add_conservation_status() |>
  add_fungal_traits() |>
  add_funguild() |>
  add_common_names()

FungalTraits provides the lifestyle, growth form, and interaction capability traits that describe what each genus does ecologically. FUNGuild adds the trophic mode and guild classification used in fungal community ecology. The confidence_ranking column from FUNGuild allows filtering to high-confidence assignments, which is important for quantitative analyses where guild misclassification would introduce systematic bias.
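
Filtering on confidence_ranking is a plain subset once the column is present. The sketch below uses a toy data.frame; the guild column name is a placeholder, and the three ranking labels are assumed to follow the FUNGuild convention of "Highly Probable", "Probable", and "Possible".

```r
# Toy stand-in for a result enriched with FUNGuild columns.
fungi <- data.frame(
  accepted_name      = c("Taxon A", "Taxon B", "Taxon C", "Taxon D"),
  guild              = c("Ectomycorrhizal", "Wood Saprotroph", "Plant Pathogen", NA),
  confidence_ranking = c("Highly Probable", "Possible", "Probable", NA),
  stringsAsFactors = FALSE
)

# Keep only the two stronger confidence levels; %in% treats NA as no match,
# so taxa without a guild assignment are dropped as well.
keep_levels <- c("Highly Probable", "Probable")
high_conf   <- fungi[fungi$confidence_ranking %in% keep_levels, ]

print(high_conf$accepted_name)
```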

Macroalgae (European)

European macroalgae are covered by AlgaeTraits, which provides morphological and ecological traits for 1,745 species. The WoRMS backend is recommended for marine algae taxonomy.

result <- taxify(species_list, backend = "worms") |>
  add_conservation_status() |>
  add_algae_traits() |>
  add_common_names()

AlgaeTraits is geographically scoped to European coastlines. For non-European macroalgae studies, the add_data() function can join custom datasets. The key advantage of AlgaeTraits over general plant trait databases is that it provides marine-specific traits (tidal zone, wave exposure, calcification) that are not captured by terrestrial plant databases like LEDA or EIVE.

Mixed-taxon datasets

When a dataset spans several major taxon groups (e.g., a biodiversity survey with plants, birds, and mammals), there are two strategies.

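The first is to apply all relevant enrichments to the full result in a single pipe chain (for example, add_woodiness(), add_avonet(), and add_pantheria() on the same taxify() result) and accept NA values where taxonomic scope does not overlap.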

This produces a wide data.frame where most cells in any given row are NA (a plant row has woodiness data but NA for beak length, body mass, etc.). The advantage is simplicity: one data.frame, one summary() call, no manual splitting.

The second strategy is to split the result by taxon group, enrich each subset with the appropriate stack, and recombine:

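result <- taxify(species_list)

# Split by taxon group. bird_families and mammal_families are character
# vectors of family names that must be defined beforehand.
# %in% (rather than ==) keeps rows with a missing kingdom out of the subset.
plants  <- result[result$kingdom %in% "Plantae", ]
birds   <- result[result$family %in% bird_families, ]
mammals <- result[result$family %in% mammal_families, ]

# Enrich each subset with the stack that matches its taxonomic scope.
plants  <- plants |> add_woodiness() |> add_eive()
birds   <- birds |> add_avonet() |> add_elton_traits()
mammals <- mammals |> add_pantheria() |> add_elton_traits()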

The second approach avoids wide data.frames with many NA columns and produces cleaner trait matrices for downstream analyses that treat columns as features (ordination, clustering, machine learning). The disadvantage is more code and the need to maintain separate data.frames for each group.
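
If the enriched subsets do need to be recombined into one table later, the columns no longer line up, so a plain rbind() fails. The sketch below shows one base-R way to do it on toy data.frames; bind_fill() is a hypothetical helper written for this example (not part of taxify), and the trait column names are placeholders.

```r
# Row-bind data.frames with different columns, filling absent ones with NA.
bind_fill <- function(...) {
  dfs      <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA                     # add each missing column as NA
    }
    df[all_cols]                          # same column order everywhere
  })
  do.call(rbind, padded)
}

# Toy stand-ins for two enriched subsets.
plants <- data.frame(
  accepted_name = c("Species A", "Species B"),
  woodiness     = c("woody", "herbaceous"),
  stringsAsFactors = FALSE
)
birds <- data.frame(
  accepted_name  = "Species C",
  beak_length_mm = 12.5,
  stringsAsFactors = FALSE
)

combined <- bind_fill(plants, birds)
print(combined)
```

The recombined table is equivalent in shape to the result of the first strategy, so this step is only worthwhile when a single table is needed for reporting.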

The choice depends on the analysis goal. For a summary table in a paper (e.g., “species, conservation status, key traits”), the first approach works well. For functional diversity calculations or trait-based models that expect a complete trait matrix, the second approach typically produces better inputs because it avoids rows with structurally missing values (values that are missing by design, not by data limitation).

A middle ground is to apply add_conservation_status() and add_common_names() to the full dataset (since both cover all taxon groups), then split by group for the taxon-specific enrichments. This gives us conservation data and vernacular names for every species in a single data.frame, while keeping the taxon-specific trait matrices clean.

Joining custom data

Beyond the built-in enrichments, add_data() joins any external dataset to a taxify result. It accepts a file path (CSV, CSV.GZ, XLSX, SQLite/DB, or VTR) or an in-memory data.frame. The species name column is either given explicitly via the species_col argument or detected automatically: add_data() runs the first 10 rows of each character column through taxify() and selects the column with the highest match rate. The names in that column are then resolved through the same backbone(s) used in the original taxify() call, and the external data is joined on accepted_id.

result <- taxify(c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica"))

# From a CSV file (auto-detect species column)
result |> add_data("my_traits.csv")

# From a data.frame with explicit species column
my_traits <- data.frame(
  species = c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica"),
  bark_thickness_mm = c(25, 15, 8),
  shade_tolerance = c(0.6, 0.3, 0.8)
)
result |> add_data(my_traits, species_col = "species")

Because add_data() resolves names through the backbone before joining, it handles synonyms correctly. If the external data uses “Pinus abies” and the backbone resolves it to “Picea abies”, the join still works. This is the recommended way to integrate local field data, unpublished trait measurements, or datasets from sources not covered by the built-in enrichments.
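
The automatic detection of the species column can be pictured with a small stand-alone sketch. This is not taxify's implementation: detect_species_col() is a hypothetical helper, and mock_match() stands in for a real taxify() lookup by testing membership in a fixed list of known names.

```r
# Pick the character column whose first n values match best.
# is_match must return a logical vector, one element per input name.
detect_species_col <- function(df, is_match, n = 10) {
  char_cols <- names(df)[vapply(df, is.character, logical(1))]
  if (length(char_cols) == 0) stop("no character columns to test")
  rates <- vapply(char_cols, function(col) {
    mean(is_match(head(df[[col]], n)))    # match rate on the sample
  }, numeric(1))
  names(rates)[which.max(rates)]
}

# Mock matcher: a stand-in for resolving names against a backbone.
known      <- c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica")
mock_match <- function(x) x %in% known

traits <- data.frame(
  site  = c("A", "B", "C"),
  taxon = c("Quercus robur", "Pinus sylvestris", "Fagus sylvatica"),
  bark_thickness_mm = c(25, 15, 8),
  stringsAsFactors = FALSE
)

detected <- detect_species_col(traits, mock_match)
print(detected)
```

Sampling only the first rows keeps detection cheap, but it can pick the wrong column when those rows are unrepresentative; passing species_col explicitly avoids that risk.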

The cols argument can restrict which columns are joined from the external data. If the external data has 50 columns but we only need two, passing cols = c("bark_thickness_mm", "shade_tolerance") avoids cluttering the result with unwanted columns. The fuzzy argument (default TRUE) enables fuzzy matching for names in the external data that do not exact-match the backbone; fuzzy_threshold controls the maximum allowed string distance.

Column names from the external data that collide with existing columns in the taxify result are automatically prefixed with "data_" to prevent overwriting. If multiple rows in the external data resolve to the same accepted_id with identical trait values, they are deduplicated. If they resolve to the same accepted_id with conflicting values (e.g., two different height measurements for the same species), add_data() raises an error asking the user to resolve the ambiguity before joining. This strict handling of duplicates prevents the row duplication that a plain merge() would produce.
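
The duplicate and collision rules described above can be sketched in base R as follows. Both helpers are hypothetical illustrations of the described behavior, not taxify internals, and the accepted_id values are made up.

```r
# Drop exact duplicate rows, then refuse ids that still occur more than once.
collapse_by_id <- function(df, id_col = "accepted_id") {
  df      <- unique(df)
  dup_ids <- unique(df[[id_col]][duplicated(df[[id_col]])])
  if (length(dup_ids) > 0) {
    stop("conflicting values for: ", paste(dup_ids, collapse = ", "))
  }
  df
}

# Prefix external columns whose names already exist in the result.
prefix_collisions <- function(ext, existing_cols,
                              id_col = "accepted_id", prefix = "data_") {
  clash <- names(ext) %in% existing_cols & names(ext) != id_col
  names(ext)[clash] <- paste0(prefix, names(ext)[clash])
  ext
}

# Identical duplicates collapse to one row per id.
same <- data.frame(accepted_id = c("id1", "id1", "id2"),
                   height_m    = c(30, 30, 25))
deduped <- collapse_by_id(same)

# Conflicting duplicates raise an error instead of duplicating rows.
conflict <- data.frame(accepted_id = c("id1", "id1"),
                       height_m    = c(30, 35))
msg <- tryCatch(collapse_by_id(conflict), error = function(e) conditionMessage(e))

# A column named like one already in the result gets the prefix.
renamed <- prefix_collisions(deduped, existing_cols = c("accepted_id", "height_m"))

print(nrow(deduped))
print(msg)
print(names(renamed))
```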

The add_data() function also supports SQLite databases via the table argument, and .vtr files directly (useful for sharing pre-built enrichments between collaborators). For XLSX files, the openxlsx2 package is required (listed in Suggests).

Data provenance and citation

Each enrichment draws on published, peer-reviewed datasets with their own licenses and citation requirements. Citing the correct source and version is a professional obligation when using these data in publications. The summary() output includes the source and version for each applied enrichment, providing a starting point for the methods section. The original references are listed in each add_*() function’s help page (accessible via ?add_avonet, ?add_leda, ?add_pantheria, etc.).

For reproducibility, the version recorded in meta.json pins the exact build of each enrichment .vtr file that was used. Static enrichments (Zanne 2014, PanTHERIA 2009, EltonTraits 2014, AmphiBIO 2017, LEDA 2008, Diaz 2022, Seebens 2017, FungalTraits 2020, FUNGuild 2016, AlgaeTraits 2023, FISHMORPH 2021, Meiri 2018, LepTraits 2022, AnimalTraits 2022, NW European Arthropods 2025, GloNAF 2019) have fixed versions that never change. Non-static enrichments (IUCN, GRIIS, WCVP, common names) are updated when the upstream source publishes a new release, and the version in meta.json reflects which release was used. Reporting the enrichment version in a publication ensures that results can be reproduced even if the upstream data is later revised or corrected.

The licenses of the source datasets range from CC0 (EltonTraits, PanTHERIA, woodiness, common names, LepTraits, AnimalTraits) to CC BY 4.0 (EIVE, AmphiBIO, AVONET, GRIIS, GloNAF, FungalTraits, AlgaeTraits, FISHMORPH, Meiri lizard traits), CC BY (AnAge), CC BY-NC (NW European Arthropods), CC BY 3.0 (Diaz traits), and CC BY-NC 3.0 (FishBase). LEDA and WCVP have their own terms published on their respective websites. The taxify package itself does not redistribute these datasets in their original form; the .vtr files are built from publicly available sources and distributed via GitHub Releases. When using enrichment data in a publication, cite the original source (the reference on the ?add_* help page) and optionally note the taxify enrichment version for reproducibility.

A minimal methods paragraph citing enrichments might read:

Taxonomic names were resolved against the WFO backbone (v2024.12) using taxify (v0.x.x). Conservation status was obtained from the IUCN Red List (v2025.1) via add_conservation_status(). Woodiness classification followed Zanne et al. (2014). Ecological indicator values were sourced from EIVE 1.0 (Dengler et al. 2023). All enrichment versions are recorded in the taxify result metadata and available via summary().

Summary

taxify’s enrichment system turns taxonomic name matching into a gateway to ecological trait data. The 22 built-in enrichments cover conservation status, growth form, ecological niches, functional traits, diet, morphology, life-history, geographic ranges, invasive status, and vernacular names across plants, birds, mammals, amphibians, vertebrates, butterflies, arthropods, fungi, algae, fish, and reptiles. All enrichments share the same underlying join mechanics, download automatically on first use, cache locally for subsequent sessions, and compose freely with the pipe operator.

The cross-backbone name resolution built into the .vtr files means we do not have to worry about which backbone we used: enrichments work identically with WFO, COL, GBIF, ITIS, NCBI, OTT, or WoRMS results. The summary() method tracks which enrichments have been applied, their source versions, and their coverage rates, supporting both exploratory analysis and reproducible reporting.

For taxa or traits not covered by the built-in layers, add_data() integrates any external dataset using the same backbone-resolved name matching. Between the built-in enrichments and the add_data() escape hatch, most common ecological analyses can go from raw species lists to trait-enriched analytical tables in a single pipe chain.

The key properties of the enrichment system, to recap, are: automatic download and caching (no manual data management), cross-backbone compatibility (enrichments work regardless of which backend produced the result), version tracking (the summary() method documents exactly which data versions were used), and compositional design (enrichments stack freely via the pipe operator, with order mattering only where two layers supply the same column name, as with seed_mass_mg in LEDA and Diaz). These properties together aim to make the path from species names to trait-enriched analyses as short and reproducible as possible.