
Hybrid names in taxonomy

Botanical nomenclature uses a dedicated marker for hybrids: the multiplication sign (×, U+00D7). This marker appears in three distinct positions, each signalling a different kind of hybrid.

A nothogenus places the marker before the genus name, signalling an intergeneric hybrid (a cross between species in two different genera). Leyland cypress is a well-known example:

×Cupressocyparis leylandii

A nothospecies places the marker before the specific epithet, with the genus the same on both sides of the cross:

Mentha ×piperita

Peppermint (Mentha ×piperita, a cross of M. aquatica and M. spicata) is the classic case. The third form, a hybrid formula, names both parent species explicitly, joined by the multiplication sign:

Salix alba × Salix fragilis

In real-world data, the multiplication sign is frequently replaced by a lowercase or uppercase “x”. Herbarium databases, spreadsheet exports, and OCR outputs rarely preserve the Unicode character. taxify accepts all three forms (×, x, X) and normalizes them internally. The detection logic distinguishes a standalone “x” used as a hybrid marker from an “x” that is part of a word (e.g., the genus Saxifraga) by requiring the letter to stand alone as its own whitespace-delimited token.
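The boundary rule can be illustrated with a small base R sketch. The helper `has_hybrid_marker()` is hypothetical and is not part of taxify; it only shows one way a standalone marker could be told apart from an “x” inside a word, assuming that × counts wherever it appears and that a letter x counts only as a separate token.

```r
# Hypothetical helper, not taxify's own code: flag names that contain a
# standalone hybrid marker.
has_hybrid_marker <- function(x) {
  # U+00D7 counts anywhere in the name.
  has_sign <- grepl("\u00d7", x, fixed = TRUE)
  # A letter x/X counts only when preceded by start-of-string or whitespace
  # and followed by whitespace, so "Saxifraga" is not flagged.
  has_letter <- grepl("(^|\\s)[xX]\\s", x, perl = TRUE)
  has_sign | has_letter
}

has_hybrid_marker(c(
  "Mentha x piperita",
  "Saxifraga granulata",
  "x Cupressocyparis leylandii",
  "Mentha \u00d7piperita"
))
# TRUE FALSE TRUE TRUE
```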

How taxify detects hybrids

Detection happens early in the pipeline, during name cleaning and before any backbone matching. When taxify() receives an input vector, each name passes through clean_names(), which calls the internal detect_hybrid() function. The function tokenizes the name, looks for the hybrid marker in specific positions, and classifies the result as nothogenus, nothospecies, formula, or non-hybrid.
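To make the position-based classification concrete, here is a minimal sketch in base R. The function `classify_hybrid()` is a hypothetical stand-in: the internal detect_hybrid() is not shown in this article and may work differently. The sketch assumes that the position of the first marker token alone decides the type.

```r
# Hypothetical classifier, assuming the first marker position decides the type.
classify_hybrid <- function(name) {
  # Turn U+00D7 into a space-padded "x", then split on whitespace.
  name <- gsub("\u00d7", " x ", name, fixed = TRUE)
  tokens <- strsplit(trimws(name), "\\s+")[[1]]
  marker <- which(tolower(tokens) == "x")

  if (length(marker) == 0) return(NA_character_)  # non-hybrid
  pos <- marker[1]
  if (pos == 1) return("nothogenus")    # marker before the genus
  if (pos == 2) return("nothospecies")  # marker before the epithet
  "formula"                             # marker between two parent names
}

vapply(
  c("Quercus robur", "\u00d7Cupressocyparis leylandii",
    "Mentha x piperita", "Salix alba x Salix fragilis"),
  classify_hybrid, character(1), USE.NAMES = FALSE
)
# NA "nothogenus" "nothospecies" "formula"
```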

The output of taxify() includes an is_hybrid column (logical) that records whether a hybrid marker was found in the original input. This column is always present regardless of whether the name ultimately matched a backbone record. The finer classification into nothogenus, nothospecies, or formula is not exposed directly in the main output; it becomes available through add_hybrid_info(), which we cover below after looking at how matched hybrids behave in the result table.

After detection, the hybrid marker is stripped from the name before matching. For a nothospecies like “Mentha ×piperita”, the cleaned form becomes “Mentha piperita”. For a hybrid formula like “Salix alba × Salix fragilis”, only the first parent binomial (“Salix alba”) is retained as the cleaned name, since formulas are not single taxon names and cannot match a backbone record directly.

For nothospecies, taxify also constructs a secondary search form with the multiplication sign reinserted (“Mentha × piperita”) and attempts to match that against the backbone. Some backbones store nothospecies with the × character in the canonical name, so this secondary attempt can recover matches that the stripped form misses.
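The two search forms for a nothospecies can be sketched as follows. `nothospecies_forms()` is a hypothetical helper written for this illustration; it assumes the input has already been recognized as a nothospecies, with the marker as the second token.

```r
# Hypothetical helper: build the stripped form and the form with the
# multiplication sign reinserted for a nothospecies name.
nothospecies_forms <- function(name) {
  name <- gsub("\u00d7", " x ", name, fixed = TRUE)
  tokens <- strsplit(trimws(name), "\\s+")[[1]]
  # Assumption: genus, marker, epithet, in that order.
  stopifnot(length(tokens) >= 3, tolower(tokens[2]) == "x")
  genus <- tokens[1]
  epithet <- tokens[3]
  c(
    stripped  = paste(genus, epithet),
    with_sign = paste(genus, "\u00d7", epithet)
  )
}

nothospecies_forms("Mentha x piperita")
# stripped: "Mentha piperita", with_sign: "Mentha × piperita"
```

A matcher could try the first element against the backbone and fall back to the second only when the first finds nothing.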

Worked example: matching a mixed species list

Consider a list that includes ordinary species, a nothospecies, a nothogenus, and a hybrid formula. We pass them all to taxify() in a single call.

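library(taxify)

# One ordinary species, two nothospecies, one nothogenus, one hybrid formula
species_list <- c(
  "Quercus robur",
  "Mentha x piperita",
  "x Cupressocyparis leylandii",
  "Salix alba x Salix fragilis",
  "Platanus x hispanica"
)

# Match against the WFO backbone and show the columns relevant to hybrids
result <- taxify(species_list, backend = "wfo")
result[, c("input_name", "accepted_name", "is_hybrid", "match_type")]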

The expected output looks roughly like this:

input_name                   accepted_name         is_hybrid  match_type
Quercus robur                Quercus robur         FALSE      exact
Mentha x piperita            Mentha × piperita     TRUE       exact
x Cupressocyparis leylandii  NA                    TRUE       none
Salix alba x Salix fragilis  Salix alba            TRUE       exact
Platanus x hispanica         Platanus × hispanica  TRUE       exact

Several things are visible here. The two nothospecies (Mentha, Platanus) matched successfully because WFO stores these as accepted names with the × character in the canonical name. The nothogenus ×Cupressocyparis returned no match because intergeneric hybrid genera are less commonly included in backbone databases. The hybrid formula matched only the first parent (Salix alba), since the formula itself is not a single taxon name.

The is_hybrid column is TRUE for all four hybrid inputs, regardless of whether the name matched. This column records a property of the input, not of the match result.
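Because is_hybrid is independent of the match outcome, the two columns can be combined to list hybrids that still need attention. The sketch below rebuilds the result table above by hand as a plain data frame (`result_demo`, a stand-in so the example runs without a backbone) and filters it with base R.

```r
# Stand-in for the taxify() result shown above, typed in by hand so the
# example runs without the package or a backbone download.
result_demo <- data.frame(
  input_name = c("Quercus robur", "Mentha x piperita",
                 "x Cupressocyparis leylandii",
                 "Salix alba x Salix fragilis", "Platanus x hispanica"),
  accepted_name = c("Quercus robur", "Mentha \u00d7 piperita", NA,
                    "Salix alba", "Platanus \u00d7 hispanica"),
  is_hybrid = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  match_type = c("exact", "exact", "none", "exact", "exact"),
  stringsAsFactors = FALSE
)

# Hybrid inputs that did not match any backbone record
unmatched_hybrids <- result_demo[
  result_demo$is_hybrid & result_demo$match_type == "none", ]
unmatched_hybrids$input_name
# "x Cupressocyparis leylandii"
```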

Extracting hybrid details with add_hybrid_info()

The add_hybrid_info() function takes a taxify() result and parses the input_name column to extract structured hybrid information. It adds three columns:

  • hybrid_parent_1: the first parent binomial (for formulas) or NA

  • hybrid_parent_2: the second parent binomial (for formulas, with abbreviated genera expanded) or NA

  • hybrid_type: one of "nothogenus", "nothospecies", "formula", or NA for non-hybrids

For nothogenus and nothospecies names, both parent columns are NA because the input names only the hybrid itself, not its parents. The parent species of Mentha ×piperita (Mentha aquatica and Mentha spicata) are not encoded in the name string. Only hybrid formulas carry both parent names explicitly.

result |> add_hybrid_info()

The three new columns for our five-name example:

input_name                   hybrid_type   hybrid_parent_1  hybrid_parent_2
Quercus robur                NA            NA               NA
Mentha x piperita            nothospecies  NA               NA
x Cupressocyparis leylandii  nothogenus    NA               NA
Salix alba x Salix fragilis  formula       Salix alba       Salix fragilis
Platanus x hispanica         nothospecies  NA               NA

Worked example: parsing hybrid formulas

Hybrid formulas appear in botanical and horticultural datasets more often than one might expect. Field botanists record them when the parentage of a specimen is known or suspected. The formulas vary in notation: some spell out both genera in full, others abbreviate the second genus.

formulas <- c(
  "Salix alba x Salix fragilis",
  "Quercus pyrenaica x Q. petraea",
  "Populus nigra x Populus deltoides",
  "Rosa canina x R. gallica"
)

formula_result <- taxify(formulas, backend = "wfo")
formula_result <- formula_result |> add_hybrid_info()

formula_result[, c("input_name", "hybrid_type",
                    "hybrid_parent_1", "hybrid_parent_2")]

The parsed columns:

input_name                         hybrid_type  hybrid_parent_1    hybrid_parent_2
Salix alba x Salix fragilis        formula      Salix alba         Salix fragilis
Quercus pyrenaica x Q. petraea     formula      Quercus pyrenaica  Quercus petraea
Populus nigra x Populus deltoides  formula      Populus nigra      Populus deltoides
Rosa canina x R. gallica           formula      Rosa canina        Rosa gallica

The genus abbreviation “Q.” in the second example was expanded to “Quercus” automatically. taxify infers the full genus from the first parent in the formula. The same expansion happened for “R.” to “Rosa” in the fourth row. This expansion is purely textual: the first token of the first parent is used as the genus for the second parent whenever the second parent’s genus field matches the pattern of a single capital letter followed by a period.
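The textual expansion rule can be sketched in a few lines of base R. `expand_genus()` is a hypothetical helper, not taxify's implementation. It handles one formula at a time and, like the rule described above, does not check that the abbreviated letter actually matches the first parent's genus.

```r
# Hypothetical helper: expand an abbreviated genus in the second parent
# using the first token of the first parent.
expand_genus <- function(parent_1, parent_2) {
  genus_1 <- sub("\\s.*$", "", parent_1)  # first token of parent 1
  # A single capital letter plus period at the start is replaced by that genus.
  sub("^[A-Z]\\.\\s+", paste0(genus_1, " "), parent_2)
}

expand_genus("Quercus pyrenaica", "Q. petraea")
# "Quercus petraea"
expand_genus("Salix alba", "Salix fragilis")
# "Salix fragilis" (already spelled out, left unchanged)
```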

What matches and what does not

The three hybrid types have different matching profiles against backbone databases.

Nothospecies are the best-supported form. WFO and COL both store many nothospecies as accepted names, with the × character as part of the canonical name. Mentha ×piperita, Platanus ×hispanica, and Narcissus ×medioluteus are examples that appear in both backbones. taxify tries two search forms in order: first the stripped form (“Mentha piperita”), then the form with the × reinserted (“Mentha × piperita”). At least one of these typically matches.

Nothogenera have lower coverage. Intergeneric hybrids like ×Cupressocyparis, ×Triticosecale, and ×Festulolium exist in some backbones but are absent from others. WFO includes several nothogenera relevant to agriculture and horticulture. COL’s coverage varies by taxonomic group. When a nothogenus does not match, the output row will have match_type = "none" and accepted_name = NA, but is_hybrid will still be TRUE.

Hybrid formulas will not match a backbone record directly, because the formula is not a taxon name. taxify extracts the first parent binomial as the cleaned name for matching, so the result row reflects the match status of the first parent. To resolve both parents, match them separately.

# Match both parents of a hybrid formula separately
parents <- c("Salix alba", "Salix fragilis")
parent_result <- taxify(parents, backend = "wfo")

This approach gives a full match result (accepted name, synonym status, authorship) for each parent individually. In a dataset with many hybrid formulas, we can extract the parent columns from add_hybrid_info() and feed them back through taxify() as a batch.

# Batch-resolve all hybrid formula parents
info <- result |> add_hybrid_info()
formula_rows <- info[info$hybrid_type == "formula" & !is.na(info$hybrid_type), ]

all_parents <- unique(na.omit(c(
  formula_rows$hybrid_parent_1,
  formula_rows$hybrid_parent_2
)))

parent_matches <- taxify(all_parents, backend = "wfo")

The multiplication sign and its substitutes

The Unicode multiplication sign (U+00D7) is the correct character for hybrid notation under the International Code of Nomenclature for algae, fungi, and plants (ICN). In practice, data arrive with three common representations:

  1. The Unicode character itself: × (common in well-curated databases)

  2. A lowercase x surrounded by spaces (common in spreadsheets and field data)

  3. An uppercase X surrounded by spaces (less common, but occurs in older databases and OCR output)

taxify normalizes all three forms internally. The detect_hybrid() function replaces every occurrence of U+00D7 with a space-padded “x” and then works with a uniform token stream, so the downstream logic only needs to handle one representation. The space-boundary requirement prevents false positives: “Saxifraga” does not trigger hybrid detection because the “x” sits within a word rather than standing alone between tokens.
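A minimal sketch of such a normalization step, assuming the target representation is a lowercase, space-padded “x”. `normalize_marker()` is a hypothetical name, and the internal code may differ in detail.

```r
# Hypothetical normaliser: reduce all three marker spellings to " x ".
normalize_marker <- function(x) {
  # U+00D7 becomes a space-padded lowercase x.
  x <- gsub("\u00d7", " x ", x, fixed = TRUE)
  # A standalone uppercase X becomes lowercase; X inside a word is untouched.
  x <- gsub("(^|\\s)X(?=\\s)", "\\1x", x, perl = TRUE)
  # Collapse runs of whitespace and trim the ends.
  trimws(gsub("\\s+", " ", x))
}

normalize_marker(c("Mentha \u00d7piperita",
                   "Mentha x piperita",
                   "Mentha X piperita"))
# all three become "Mentha x piperita"
```

After this step, downstream code only needs to look for the token "x".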

A subtlety arises with mojibake. In UTF-8, × is the two-byte sequence 0xC3 0x97. When that text is read with a Latin-1 or Windows-1252 encoding, the two bytes are misinterpreted as two separate characters: “Ã” (U+00C3) followed by the control character U+0097 under Latin-1, or “Ã” followed by an em dash (U+2014) under Windows-1252. The name cleaning pipeline detects and repairs both of these common misreadings before hybrid detection runs, so names corrupted by encoding errors are still handled correctly.
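The repair can be sketched as two fixed-string replacements. The patterns follow from the UTF-8 encoding of × (bytes 0xC3 0x97): byte 0xC3 decodes to U+00C3 in both legacy encodings, while byte 0x97 decodes to U+0097 in Latin-1 and to U+2014 in Windows-1252. `repair_times_sign()` is a hypothetical helper, not the pipeline's actual code.

```r
# Hypothetical repair step for the two common misreadings of U+00D7.
repair_times_sign <- function(x) {
  x <- gsub("\u00c3\u0097", "\u00d7", x, fixed = TRUE)  # Latin-1 misreading
  gsub("\u00c3\u2014", "\u00d7", x, fixed = TRUE)       # Windows-1252 misreading
}

# Corrupted inputs written with escapes so the example does not depend on
# the encoding of this file.
broken <- c("Mentha \u00c3\u0097piperita", "Mentha \u00c3\u2014piperita")
repaired <- repair_times_sign(broken)
all(repaired == "Mentha \u00d7piperita")
# TRUE
```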

Practical notes

Which backbones have the most hybrids. WFO has the broadest coverage of plant nothospecies and nothogenera, reflecting its focus on the world flora. COL includes hybrids across all kingdoms but coverage is uneven. GBIF aggregates data from many sources and includes hybrid names where the contributing checklists provide them. ITIS, NCBI, and OTT have minimal hybrid coverage.

Hybrid detection is input-side only. taxify detects hybrids in the names that you supply. It does not scan the backbone for hybrid records. If a backbone stores “Mentha × piperita” as an accepted name, taxify will match your input against it, but the backbone record’s own hybrid status is not exposed as a separate field. The is_hybrid column reflects your input, not the backbone.

Formulas with infraspecific ranks. The parser expects binomials (genus plus epithet) on both sides of the × marker. Formulas that include subspecies or variety ranks (e.g., “Salix alba var. vitellina × Salix fragilis”) will still be detected as formulas, but the parent extraction may include the rank and infraspecific epithet as part of the parent name. This is generally the desired behavior, since the full trinomial identifies the parent more precisely than the binomial alone.
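The effect is easy to see with a simple splitter. `split_formula()` below is a hypothetical illustration that cuts the formula at the standalone marker and keeps everything on each side. If only binomials are wanted, the first two tokens of each parent can be kept afterwards.

```r
# Hypothetical splitter: keep everything on either side of the marker.
split_formula <- function(formula) {
  strsplit(formula, "\\s+(x|X|\u00d7)\\s+")[[1]]
}

parents <- split_formula("Salix alba var. vitellina x Salix fragilis")
parents
# "Salix alba var. vitellina" "Salix fragilis"

# Optional: reduce each parent to genus plus specific epithet
binomials <- sub("^(\\S+\\s+\\S+).*$", "\\1", parents)
binomials
# "Salix alba" "Salix fragilis"
```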

Authorship in hybrid names. Hybrid names sometimes carry authorship strings (e.g., “Mentha ×piperita L.”). The name cleaning pipeline strips authorship before matching, so the presence of an author string does not interfere with hybrid detection or matching.

# Authorship is stripped; hybrid detection still works
taxify("Mentha x piperita L.", backend = "wfo")

Adding hybrid info is lightweight. add_hybrid_info() operates entirely on the input_name column via string parsing. It does not re-query any backbone or access any files on disk. On a result with 10,000 rows, the function completes in milliseconds.