Hybrid names in taxonomy
Botanical nomenclature uses a dedicated marker for hybrids: the multiplication sign (×, U+00D7). This marker appears in three distinct positions, each signalling a different kind of hybrid.
A nothogenus places the marker before the genus name, signalling an intergeneric hybrid (a cross between species in two different genera). Leyland cypress is a well-known example:
×Cupressocyparis leylandii
A nothospecies places the marker before the specific epithet, signalling a cross between two species of the same genus:
Mentha ×piperita
Peppermint (Mentha ×piperita, a cross of M. aquatica and M. spicata) is the classic case.
The third form, a hybrid formula, names both parent species explicitly, joined by the multiplication sign:
Salix alba × Salix fragilis
In real-world data, the multiplication sign is frequently replaced by
a lowercase or uppercase “x”. Herbarium databases, spreadsheet exports,
and OCR outputs rarely preserve the Unicode character. taxify accepts
all three forms (×, x, X) and
normalizes them internally. The detection logic distinguishes a
standalone “x” used as a hybrid marker from an “x” that is part of a
word (e.g., the genus Saxifraga) by requiring whitespace
boundaries around the letter.
How taxify detects hybrids
Detection happens early in the pipeline, during name cleaning and
before any backbone matching. When taxify() receives an
input vector, each name passes through clean_names(), which
calls the internal detect_hybrid() function. The function
tokenizes the name, looks for the hybrid marker in specific positions,
and classifies the result as nothogenus, nothospecies, formula, or
non-hybrid.
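The position-based classification can be sketched in base R. This is a minimal illustration, not the package's internal detect_hybrid(): the function name classify_hybrid_sketch() is hypothetical, and the rule it applies (marker as first token, as second token, or later) is an assumption inferred from the three marker positions described earlier.

```r
# Hypothetical sketch; taxify's real detect_hybrid() may work differently.
classify_hybrid_sketch <- function(name) {
  # Treat the multiplication sign (U+00D7) like a standalone "x" token.
  name <- gsub("\u00d7", " x ", name, fixed = TRUE)
  tokens <- strsplit(trimws(name), "[[:space:]]+")[[1]]
  pos <- which(tolower(tokens) == "x")
  if (length(pos) == 0) return("non-hybrid")
  if (pos[1] == 1) return("nothogenus")    # marker before the genus
  if (pos[1] == 2) return("nothospecies")  # marker before the epithet
  "formula"                                # marker between two parents
}

classify_hybrid_sketch("Mentha \u00d7piperita")
classify_hybrid_sketch("Salix alba x Salix fragilis")
```

Because the comparison is made on whole tokens, a name such as "Saxifraga granulata" falls through to the non-hybrid branch.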
The output of taxify() includes an
is_hybrid column (logical) that records whether a hybrid
marker was found in the original input. This column is always present
regardless of whether the name ultimately matched a backbone record. The
finer classification into nothogenus, nothospecies, or formula is not
exposed directly in the main output; it becomes available through
add_hybrid_info(), which we cover below after looking at
how matched hybrids behave in the result table.
After detection, the hybrid marker is stripped from the name before matching. For a nothospecies like “Mentha ×piperita”, the cleaned form becomes “Mentha piperita”. For a hybrid formula like “Salix alba × Salix fragilis”, only the first parent binomial (“Salix alba”) is retained as the cleaned name, since formulas are not single taxon names and cannot match a backbone record directly.
For nothospecies, taxify also constructs a secondary search form with the multiplication sign reinserted (“Mentha × piperita”) and attempts to match that against the backbone. Some backbones store nothospecies with the × character in the canonical name, so this secondary attempt can recover matches that the stripped form misses.
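The two search forms can be illustrated with a short base R sketch. The helper search_forms_sketch() is hypothetical and only mirrors the behaviour described above; it is not the package's cleaning code. One simplification is labelled in the comments: for formulas it keeps every token before the marker rather than parsing a strict binomial.

```r
# Hypothetical helper: derive the cleaned form and, for nothospecies,
# the secondary form with the multiplication sign reinserted.
search_forms_sketch <- function(name) {
  name <- gsub("\u00d7", " x ", name, fixed = TRUE)
  tokens <- strsplit(trimws(name), "[[:space:]]+")[[1]]
  pos <- which(tolower(tokens) == "x")[1]  # NA when no marker is present
  secondary <- NA_character_
  if (is.na(pos)) {
    cleaned <- tokens
  } else if (pos >= 3) {
    # Formula: keep the tokens before the marker (assumed first parent).
    cleaned <- tokens[seq_len(pos - 1)]
  } else {
    cleaned <- tokens[-pos]                # strip the marker
    if (pos == 2) {
      secondary <- paste(c(tokens[1], "\u00d7", tokens[-(1:2)]), collapse = " ")
    }
  }
  list(cleaned = paste(cleaned, collapse = " "), secondary = secondary)
}

search_forms_sketch("Mentha x piperita")
search_forms_sketch("Salix alba x Salix fragilis")
```

For "Mentha x piperita" the sketch returns "Mentha piperita" as the cleaned form and "Mentha × piperita" as the secondary form; for the Salix formula it returns "Salix alba" and no secondary form.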
Worked example: matching a mixed species list
Consider a list that includes ordinary species, a nothospecies, a
nothogenus, and a hybrid formula. We pass them all to
taxify() in a single call.
library(taxify)
names <- c(
"Quercus robur",
"Mentha x piperita",
"x Cupressocyparis leylandii",
"Salix alba x Salix fragilis",
"Platanus x hispanica"
)
result <- taxify(names, backend = "wfo")
result[, c("input_name", "accepted_name", "is_hybrid", "match_type")]
The expected output looks roughly like this:
| input_name | accepted_name | is_hybrid | match_type |
|---|---|---|---|
| Quercus robur | Quercus robur | FALSE | exact |
| Mentha x piperita | Mentha × piperita | TRUE | exact |
| x Cupressocyparis leylandii | NA | TRUE | none |
| Salix alba x Salix fragilis | Salix alba | TRUE | exact |
| Platanus x hispanica | Platanus × hispanica | TRUE | exact |
Several things are visible here. The two nothospecies (Mentha, Platanus) matched successfully because WFO stores these as accepted names with the × character in the canonical name. The nothogenus ×Cupressocyparis returned no match because intergeneric hybrid genera are less commonly included in backbone databases. The hybrid formula matched only the first parent (Salix alba), since the formula itself is not a single taxon name.
The is_hybrid column is TRUE for all four hybrid inputs,
regardless of whether the name matched. This column records a property
of the input, not of the match result.
Extracting hybrid details with add_hybrid_info()
The add_hybrid_info() function takes a
taxify() result and parses the input_name
column to extract structured hybrid information. It adds three
columns:
- hybrid_parent_1: the first parent binomial (for formulas) or NA
- hybrid_parent_2: the second parent binomial (for formulas, with abbreviated genera expanded) or NA
- hybrid_type: one of "nothogenus", "nothospecies", "formula", or NA for non-hybrids
For nothogenus and nothospecies names, both parent columns are NA because the input names only the hybrid itself, not its parents. The parent species of Mentha ×piperita (Mentha aquatica and Mentha spicata) are not encoded in the name string. Only hybrid formulas carry both parent names explicitly.
result |> add_hybrid_info()
The three new columns for our five-name example:
| input_name | hybrid_type | hybrid_parent_1 | hybrid_parent_2 |
|---|---|---|---|
| Quercus robur | NA | NA | NA |
| Mentha x piperita | nothospecies | NA | NA |
| x Cupressocyparis leylandii | nothogenus | NA | NA |
| Salix alba x Salix fragilis | formula | Salix alba | Salix fragilis |
| Platanus x hispanica | nothospecies | NA | NA |
Worked example: parsing hybrid formulas
Hybrid formulas appear in botanical and horticultural datasets more often than one might expect. Field botanists record them when the parentage of a specimen is known or suspected. The formulas vary in notation: some spell out both genera in full, others abbreviate the second genus.
formulas <- c(
"Salix alba x Salix fragilis",
"Quercus pyrenaica x Q. petraea",
"Populus nigra x Populus deltoides",
"Rosa canina x R. gallica"
)
formula_result <- taxify(formulas, backend = "wfo")
formula_result <- formula_result |> add_hybrid_info()
formula_result[, c("input_name", "hybrid_type",
"hybrid_parent_1", "hybrid_parent_2")]
| input_name | hybrid_type | hybrid_parent_1 | hybrid_parent_2 |
|---|---|---|---|
| Salix alba x Salix fragilis | formula | Salix alba | Salix fragilis |
| Quercus pyrenaica x Q. petraea | formula | Quercus pyrenaica | Quercus petraea |
| Populus nigra x Populus deltoides | formula | Populus nigra | Populus deltoides |
| Rosa canina x R. gallica | formula | Rosa canina | Rosa gallica |
The genus abbreviation “Q.” in the second example was expanded to “Quercus” automatically. taxify infers the full genus from the first parent in the formula. The same expansion happened for “R.” to “Rosa” in the fourth row. This expansion is purely textual: the first token of the first parent is used as the genus for the second parent whenever the second parent’s genus field matches the pattern of a single capital letter followed by a period.
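The textual expansion rule can be sketched in a few lines of base R. The function expand_second_parent() is a hypothetical name used for illustration; it follows the rule as stated above (a single capital letter followed by a period is replaced by the first token of the first parent) and is not taken from the package source.

```r
# Hypothetical sketch of the genus-abbreviation expansion.
expand_second_parent <- function(parent_1, parent_2) {
  # The genus is the first whitespace-delimited token of the first parent.
  genus <- strsplit(trimws(parent_1), "[[:space:]]+")[[1]][1]
  # Replace a leading "X." abbreviation; full genus names are left untouched.
  sub("^[A-Z]\\.[[:space:]]*", paste0(genus, " "), parent_2)
}

expand_second_parent("Quercus pyrenaica", "Q. petraea")
expand_second_parent("Salix alba", "Salix fragilis")
```

Like the rule it illustrates, the sketch never checks that the abbreviated letter agrees with the first parent's genus, so an input such as "Quercus robur x S. alba" would be expanded to "Quercus alba".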
What matches and what does not
The three hybrid types have different matching profiles against backbone databases.
Nothospecies are the best-supported form. WFO and COL both store many nothospecies as accepted names, with the × character as part of the canonical name. Mentha ×piperita, Platanus ×hispanica, and Narcissus ×medioluteus are examples that appear in both backbones. taxify’s matching logic handles the marker correctly: it first tries the stripped form (“Mentha piperita”) and then the form with the × reinserted (“Mentha × piperita”). At least one of these typically matches.
Nothogenera have lower coverage. Intergeneric
hybrids like ×Cupressocyparis, ×Triticosecale, and ×Festulolium exist in
some backbones but are absent from others. WFO includes several
nothogenera relevant to agriculture and horticulture. COL’s coverage
varies by taxonomic group. When a nothogenus does not match, the output
row will have match_type = "none" and
accepted_name = NA, but is_hybrid will still
be TRUE.
Hybrid formulas will not match a backbone record directly, because the formula is not a taxon name. taxify extracts the first parent binomial as the cleaned name for matching, so the result row reflects the match status of the first parent. To resolve both parents, match them separately.
# Match both parents of a hybrid formula separately
parents <- c("Salix alba", "Salix fragilis")
parent_result <- taxify(parents, backend = "wfo")
This approach gives a full match result (accepted name, synonym
status, authorship) for each parent individually. In a dataset with many
hybrid formulas, we can extract the parent columns from
add_hybrid_info() and feed them back through
taxify() as a batch.
# Batch-resolve all hybrid formula parents
info <- result |> add_hybrid_info()
formula_rows <- info[info$hybrid_type == "formula" & !is.na(info$hybrid_type), ]
all_parents <- unique(na.omit(c(
formula_rows$hybrid_parent_1,
formula_rows$hybrid_parent_2
)))
parent_matches <- taxify(all_parents, backend = "wfo")
The multiplication sign and its substitutes
The Unicode multiplication sign (U+00D7) is the correct character for hybrid notation under the International Code of Nomenclature. In practice, data arrive with three common representations:
- The Unicode character itself: × (common in well-curated databases)
- A lowercase x surrounded by spaces (common in spreadsheets and field data)
- An uppercase X surrounded by spaces (less common, but occurs in older databases and OCR output)
taxify normalizes all three forms internally. The
detect_hybrid() function replaces every occurrence of
U+00D7 with a space-padded “x” and then works with a uniform token
stream, so the downstream logic only needs to handle one representation.
The space-boundary requirement prevents false positives: “Saxifraga”
does not trigger hybrid detection because the “x” sits within a word
rather than standing alone between tokens.
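The normalization step and the space-boundary check might look like the following base R sketch. Both function names are hypothetical, and the extra step of collapsing repeated whitespace is an assumption added for tidiness; the package's own code may differ.

```r
# Hypothetical sketch of marker normalization.
normalize_marker <- function(name) {
  # Replace every U+00D7 with a space-padded "x".
  name <- gsub("\u00d7", " x ", name, fixed = TRUE)
  # Collapse runs of whitespace left behind by the padding (assumed step).
  gsub("[[:space:]]+", " ", trimws(name))
}

# A marker counts only when "x" or "X" stands alone as a token.
has_hybrid_marker <- function(name) {
  tokens <- strsplit(normalize_marker(name), " ", fixed = TRUE)[[1]]
  any(tolower(tokens) == "x")
}

normalize_marker("Mentha \u00d7piperita")   # "Mentha x piperita"
has_hybrid_marker("Platanus X hispanica")   # TRUE
has_hybrid_marker("Saxifraga granulata")    # FALSE
```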
A subtlety arises with mojibake. When UTF-8 text containing the × character is read as Latin-1 or Windows-1252, the two-byte sequence (0xC3 0x97) is misinterpreted as two separate characters: “Ã” followed by the control character U+0097 under Latin-1, or “Ã” followed by an em dash (U+2014) under Windows-1252. The name cleaning pipeline detects and repairs both of these common misreadings before hybrid detection runs, so names corrupted by encoding errors are still handled correctly.
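A repair of this kind can be sketched as two fixed-string substitutions. The function repair_times_mojibake() is a hypothetical name, and the approach is an assumption about how such a repair could be done, not a description of the package's routine. The two patterns are the character pairs that result when the UTF-8 bytes of U+00D7 (0xC3 0x97) are decoded as Latin-1 or as Windows-1252.

```r
# Hypothetical repair step for a mis-decoded multiplication sign.
repair_times_mojibake <- function(name) {
  # Latin-1 misreading: U+00C3 followed by the C1 control U+0097.
  name <- gsub("\u00c3\u0097", "\u00d7", name, fixed = TRUE)
  # Windows-1252 misreading: U+00C3 followed by U+2014.
  gsub("\u00c3\u2014", "\u00d7", name, fixed = TRUE)
}

repair_times_mojibake("Mentha \u00c3\u2014piperita")
```

Running the repair before hybrid detection means the later steps only ever see the genuine multiplication sign.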
Practical notes
Which backbones have the most hybrids. WFO has the broadest coverage of plant nothospecies and nothogenera, reflecting its focus on the world flora. COL includes hybrids across all kingdoms but coverage is uneven. GBIF aggregates data from many sources and includes hybrid names where the contributing checklists provide them. ITIS, NCBI, and OTT have minimal hybrid coverage.
Hybrid detection is input-side only. taxify detects
hybrids in the names that you supply. It does not scan the backbone for
hybrid records. If a backbone stores “Mentha × piperita” as an accepted
name, taxify will match your input against it, but the backbone record’s
own hybrid status is not exposed as a separate field. The
is_hybrid column reflects your input, not the backbone.
Formulas with infraspecific ranks. The parser expects binomials (genus plus epithet) on both sides of the × marker. Formulas that include subspecies or variety ranks (e.g., “Salix alba var. vitellina × Salix fragilis”) will still be detected as formulas, but the parent extraction may include the rank and infraspecific epithet as part of the parent name. This is generally the desired behavior, since the full trinomial identifies the parent more precisely than the binomial alone.
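To see why the rank ends up in the parent name, consider a simple split on the standalone marker. The helper split_formula() below is hypothetical and only approximates the parsing described here; it assumes that everything on each side of the marker is kept as the parent name.

```r
# Hypothetical sketch: split a hybrid formula on the standalone marker.
split_formula <- function(name) {
  name <- gsub("\u00d7", " x ", name, fixed = TRUE)
  # The marker must be bounded by whitespace on both sides.
  parts <- strsplit(name, "[[:space:]]+[xX][[:space:]]+")[[1]]
  trimws(parts)
}

split_formula("Salix alba var. vitellina \u00d7 Salix fragilis")
```

The first element is "Salix alba var. vitellina", so the rank and infraspecific epithet travel with the parent name.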
Authorship in hybrid names. Hybrid names sometimes carry authorship strings (e.g., “Mentha ×piperita L.”). The name cleaning pipeline strips authorship before matching, so the presence of an author string does not interfere with hybrid detection or matching.
# Authorship is stripped; hybrid detection still works
taxify("Mentha x piperita L.", backend = "wfo")
Adding hybrid info is lightweight.
add_hybrid_info() operates entirely on the
input_name column via string parsing. It does not re-query
any backbone or access any files on disk. On a result with 10,000 rows,
the function completes in milliseconds.