Joining custom data with add

The problem

Taxonomic name matching is rarely the last step. After taxify() resolves your species list to accepted names and IDs, the next task is usually to attach trait data, occurrence records, or measurement tables from external sources. The trouble is that external datasets almost never use the same names as your backbone. A CSV of leaf trait measurements might record Pinus nigra subsp. laricio, while the backbone stores the accepted name as Pinus nigra. A colleague’s spreadsheet might list Picea excelsa (a synonym retired decades ago), while WFO recognises Picea abies.

Joining on raw species strings misses these cases: the rows do not match, and the merged data.frame has NAs where values should exist. The standard workaround is to run the external names through the backbone first, resolve them to accepted IDs, and then join on those IDs. add_data() wraps that entire workflow into a single pipe step.

library(taxify)

Joining a data.frame

The most common case: we have trait measurements in a data.frame sitting in our R session, and we want to attach them to a taxify() result. Here we create a small table of specific leaf area (SLA) and maximum height for five European tree species.

# Our taxify result
species <- c(
  "Quercus robur", "Fagus sylvatica", "Picea abies",
  "Pinus sylvestris", "Betula pendula"
)
result <- taxify(species, backend = "wfo")

# External trait data — note one synonym and one subspecies
traits <- data.frame(
  taxon = c(
    "Quercus robur", "Fagus sylvatica", "Picea excelsa",
    "Pinus sylvestris", "Betula pendula"
  ),
  sla = c(18.2, 24.1, 6.5, 8.0, 22.3),
  max_height_m = c(35, 40, 50, 30, 25)
)

# Join — "Picea excelsa" resolves to "Picea abies" through the backbone
result <- result |> add_data(traits, species_col = "taxon")

add_data() takes the names from the taxon column, runs them through the same backbone(s) used in the original taxify() call, resolves each to an accepted_id, and left-joins on that ID. Picea excelsa is a synonym of Picea abies in WFO, so the SLA and height values land on the correct row even though the literal strings differ. The output has two new columns, sla and max_height_m, appended to the existing result.

Joining from a CSV file

When the data lives in a file rather than in memory, we can pass the path directly. add_data() reads .csv and .csv.gz files via vectra’s CSV reader, which handles large files efficiently without loading everything into R at once. Tab-separated files (.tsv, .tsv.gz) are also supported.

result <- taxify(species, backend = "wfo")
result <- result |> add_data("path/to/leaf_traits.csv")

If the CSV has a single obvious species-name column, auto-detection picks it up. If there are several plausible character columns, or if the names are encoded in a column with an unusual name like latin_binomial, specifying species_col avoids ambiguity.

result |> add_data("leaf_traits.csv", species_col = "latin_binomial")

The same pattern works for compressed CSV files. A .csv.gz path is detected and decompressed transparently.

result |> add_data("global_leaf_traits.csv.gz", species_col = "species")

Joining from an Excel file

Spreadsheets are common in ecology, especially for hand-curated trait databases shared among collaborators. add_data() reads .xlsx files via the openxlsx2 package, which must be installed separately.

# install.packages("openxlsx2")  # if not already installed
result |> add_data("bird_morphometry.xlsx")

When sheet, start_row, and species_col are all left at their defaults, add_data() scans the workbook to find the right combination automatically. It tests each sheet and up to 20 candidate header rows, probing character columns against the backbone until it finds species names. This handles the common case where a colleague’s spreadsheet has a title block, column descriptions, or notes above the actual data table.

The scan reports what it found:

Scanning Excel layout...
  Detected: sheet 'measurements', header row 3, species column 'latin_name' (90% match rate)

To skip auto-detection, specify any combination of sheet, start_row, and species_col explicitly:

# Specific sheet by name or number
result |> add_data("bird_morphometry.xlsx", sheet = "measurements")
result |> add_data("bird_morphometry.xlsx", sheet = 2)

# Known header row (e.g., rows 1-2 are title/notes)
result |> add_data("bird_morphometry.xlsx", start_row = 3)

# All three specified — no scanning at all
result |> add_data("bird_morphometry.xlsx", sheet = 1, start_row = 3,
                   species_col = "latin_name")

SQLite databases

When trait data lives in a SQLite database, add_data() reads the table via vectra’s SQLite reader (which depends on DBI and RSQLite under the hood). Because a single .sqlite or .db file can hold many tables, the table argument is mandatory here. Omitting it raises an informative error rather than guessing.

# SQLite — requires DBI and RSQLite
result |> add_data(
  "traits.sqlite",
  table = "plant_traits",
  species_col = "species"
)

This is particularly handy when we already maintain a relational database of measurements across projects. We can point add_data() at the relevant table without exporting to CSV first, and the backbone matching still runs the same way it does for any other format.

vectra native format

If we have pre-built .vtr files (the columnar format that taxify uses internally for backbone storage), they can be passed directly. vectra reads these with near-zero overhead because no parsing or type inference is needed.

result |> add_data("prebuilt_traits.vtr", species_col = "canonical_name")

This is mainly useful when sharing processed trait tables between team members or across projects. A .vtr file produced by one workflow can be re-used in another without converting back through CSV. The export_data() function makes this easy:

# Save a taxify result (with enrichments) as .vtr
result |> export_data("processed_traits.vtr")

# A colleague can load it directly
other_result |> add_data("processed_traits.vtr")

export_data() also supports .csv, .tsv, and .xlsx for interoperability with tools outside R.

result |> export_data("for_excel_users.xlsx")
result |> export_data("for_python.csv")

Joining from a TSV file

Tab-separated files work the same way as CSV. add_data() reads .tsv and .tsv.gz files via read.delim().

result |> add_data("leaf_traits.tsv", species_col = "species")
result |> add_data("leaf_traits.tsv.gz")

Other file formats

For formats not directly supported (.parquet, .rds), reading the file into a data.frame first and passing it to add_data() works in every case.

my_data <- readRDS("legacy_traits.rds")
result |> add_data(my_data, species_col = "sp")

Species column auto-detection

When species_col is not specified, add_data() probes each character column in the external data. It takes the first 10 rows of each column, runs them through taxify() against the same backbone, and picks the column with the highest match rate. A column needs at least 50% of its probe names to match before it qualifies.

# Auto-detection in action
traits <- data.frame(
  site = c("A", "A", "B", "B"),
  species = c("Quercus robur", "Fagus sylvatica",
              "Betula pendula", "Picea abies"),
  habitat = c("forest", "forest", "forest edge", "boreal"),
  sla = c(18.2, 24.1, 22.3, 6.5)
)

# Three character columns: site, species, habitat
# Only "species" will produce >50% backbone matches
result |> add_data(traits)

Auto-detection works well when the species column contains clean binomial names and the other character columns contain obviously non-taxonomic strings (site codes, habitat descriptions, observer names). It can fail when column names are ambiguous, or when species names are heavily misspelled or use common names. In those situations, specifying species_col explicitly saves time and avoids a confusing error message.

Selecting columns with `cols`

By default, add_data() joins all columns from the external data except the species column. When the external dataset has dozens of columns and we only need two or three, the cols argument keeps the output tidy.

# Full trait table with many columns
big_traits <- data.frame(
  species = c("Quercus robur", "Fagus sylvatica"),
  sla = c(18.2, 24.1),
  max_height_m = c(35, 40),
  leaf_nitrogen = c(2.1, 2.4),
  wood_density = c(0.56, 0.58),
  seed_mass_mg = c(3500, 220),
  bark_thickness_mm = c(25, 8)
)

# Only join SLA and wood density
result |> add_data(big_traits, species_col = "species",
                   cols = c("sla", "wood_density"))

This also helps when some columns in the external data would create name collisions (discussed below) that we would rather avoid entirely.

Column name collisions

If the external data has columns with the same name as columns already present in the taxify() result, add_data() prefixes the incoming columns with data_. Existing columns in the taxify result remain unchanged regardless of what the external data contains.

# The taxify result already has a "family" column
# External data also has a "family" column (taxonomic family from a
# different source) plus a "leaf_area" column
external <- data.frame(
  species = c("Quercus robur", "Fagus sylvatica"),
  family = c("Fagaceae", "Fagaceae"),
  leaf_area = c(45.2, 38.7)
)

result |> add_data(external, species_col = "species")
# Output gains "data_family" (from external) and "leaf_area" (no collision)

A message prints when collisions are detected, listing the renamed columns. If we know in advance that a collision will occur and we do not need the conflicting column, filtering it out via cols is cleaner than letting the rename happen.

Duplicate species handling

External datasets sometimes contain the same species more than once. This happens with repeated measurements across sites, multiple literature sources compiled into one table, or subspecies that resolve to the same accepted species. add_data() distinguishes two cases.

Exact duplicates occur when all trait values for a given species are identical across the repeated rows. This is harmless: the duplicates are collapsed into a single row with a warning.

# Harmless: same species, same values (perhaps from two sites)
dup_ok <- data.frame(
  species = c("Quercus robur", "Quercus robur", "Fagus sylvatica"),
  sla = c(18.2, 18.2, 24.1)
)
result |> add_data(dup_ok, species_col = "species")
# Warning: 1 duplicate rows ... deduplicated.

Conflicting duplicates occur when the same species appears with different trait values. “Conflicting” here means that at least one of the selected trait columns differs between two rows that share the same accepted_id. The comparison is column-by-column and treats two NA values as equal (both missing counts as agreement). So if two rows for Quercus robur have SLA values of 18.2 and 21.5, that is a conflict. If both rows have SLA of 18.2 but one has NA for wood density while the other has 0.56, that is also a conflict. Only when every selected column matches exactly across all rows for a given species do we consider the duplicates identical.

Because add_data() cannot decide which value is correct when a conflict exists, it raises an error and names the offending species.

# Conflicting: same species, different SLA values
dup_bad <- data.frame(
  species = c("Quercus robur", "Quercus robur", "Fagus sylvatica"),
  sla = c(18.2, 21.5, 24.1)
)
result |> add_data(dup_bad, species_col = "species")
# Error: 1 species resolved to the same accepted_id but have
#   different trait values.
#   Examples: 'Quercus robur' (wfo-0000309171)

The fix depends on the data. If the duplicates represent within-species variation (e.g., measurements from different populations), aggregating before joining is the right approach. If they represent data entry errors, removing the bad rows resolves the issue. The cols argument can also help when only some columns conflict: selecting the non-conflicting subset lets the join proceed.

# Aggregate first, then join
library(stats)
dup_agg <- aggregate(sla ~ species, data = dup_bad, FUN = mean)
result |> add_data(dup_agg, species_col = "species")

Note that duplicates are checked after backbone resolution, not on the raw names. If the external data lists both Picea excelsa and Picea abies with different SLA values, those two names resolve to the same accepted species and trigger the conflicting-duplicate error. This is intentional: the join key is the accepted ID, and conflicting values for the same key cannot coexist.

How the join works

The full pipeline inside add_data() has five steps:

Read the external data. File paths are dispatched by extension (.csv, .csv.gz, .tsv, .tsv.gz, .xlsx, .sqlite, .vtr). Data.frames pass through directly. The format detection is based solely on the file extension, so a misnamed file (e.g., a tab-separated file saved as .csv) will produce a read error rather than silent misparse.
Identify the species column, either from the explicit species_col argument or via auto-detection. When auto-detecting, add_data() samples the first 10 rows of every character column and runs each sample through taxify() against the same backbone(s). The column whose sample achieves the highest match rate wins, provided it clears the 50% threshold. Columns containing site codes, habitat labels, or observer names rarely match any backbone entry, so the true species column tends to stand out clearly.
Match the species names through the same backbone(s) used in the original taxify() call. This produces an accepted_id for each row in the external data. The backbone choice is read from the taxify_meta attribute that taxify() attaches to its output, so we do not need to specify it again. Fuzzy matching is on by default (controlled via fuzzy and fuzzy_threshold). Any names that fail to resolve are dropped from the joinable pool, and their count appears in the summary message at the end.
Check for duplicates. After backbone resolution, any rows that share the same accepted_id are inspected. Exact duplicates (identical trait values across all selected columns) are collapsed with a warning. Conflicting duplicates raise an error, as described in the section above.
Left join on accepted_id. Every row in the original taxify() result that has a matching accepted_id in the external data receives the trait columns. Rows without a match get NA. Column name collisions are resolved by prefixing the incoming columns with data_.

The join preserves every row of the original result; nothing is dropped. Species present in the external data but absent from the original result are ignored. A summary message reports how many species were matched and how many names in the external data could not be resolved through the backbone.

Controlling fuzzy matching

By default, add_data() uses the same fuzzy matching as taxify() to resolve names in the external data. Fuzzy matching catches typos and minor spelling differences, but it can produce false matches for short or similar names. The fuzzy_threshold argument controls how permissive the matching is. Lower values are stricter.

# Strict: only very close matches
result |> add_data(traits, species_col = "taxon", fuzzy_threshold = 0.1)

# Exact matching only (no fuzzy)
result |> add_data(traits, species_col = "taxon", fuzzy = FALSE)

Disabling fuzzy matching entirely (fuzzy = FALSE) is useful when the external data is already well-curated and we want to avoid any risk of cross-species contamination from approximate string matches.

Combining add_data() with enrichments

add_data() fits naturally into a pipe chain alongside the built-in enrichment functions. Custom data and pre-built enrichments use the same accepted_id join key, so they can be stacked in any order.

result <- taxify(species, backend = "wfo") |>
  add_conservation_status() |>
  add_woodiness() |>
  add_data(traits, species_col = "taxon")

Each step appends columns to the result. The final data.frame contains the core taxify() output, IUCN conservation status, woodiness classification, and our custom SLA and height measurements, all aligned by accepted species identity.

Joining custom data with add_data()