Encoding Modes¶
RESOLVE supports four species encoding strategies, each with different trade-offs between speed, expressiveness, and data requirements. This guide explains how each mode works, when to use it, and how to configure it.
Overview¶
All encoders solve the same problem: compress a variable-length list of species (with abundances) into a fixed-dimension vector that the shared MLP encoder can process. They differ in how much structure they preserve and how much they can learn.
| Mode | Input | Output | Learnable | Handles unseen species |
|---|---|---|---|---|
| `hash` | species names + abundances | fixed-dim vector | No | Yes |
| `embed` | top-k species IDs | concatenated embeddings | Yes | No |
| `rank_pool` | all species + abundances | pooled embeddings | Yes | Partially (via taxonomy) |
| `transformer` | species tokens | attention-pooled vector | Yes | Partially (via taxonomy) |
Hash Encoding (default)¶
Feature hashing maps each species name to a position in a fixed-dimension vector using a hash function. Abundances are accumulated at the hashed positions. No vocabulary or training is needed.
```python
trainer = resolve.Trainer(
    dataset,
    species_encoding="hash",
    hash_dim=64,  # Output dimension (default: 32)
)
```
How it works:
- For each species in a plot, hash the species name to an index in `[0, hash_dim)`
- Add the species abundance at that index
- The resulting vector has `hash_dim` dimensions regardless of species count
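The accumulation step can be sketched in a few lines. This is an illustrative re-implementation, not RESOLVE's actual code; the MD5-based index is an assumption (any stable hash function works):

```python
import hashlib

def hash_encode(species_abundances, hash_dim=64):
    """Feature-hashing sketch: accumulate each species' abundance
    at a stable hashed index in a fixed-dimension vector."""
    vec = [0.0] * hash_dim
    for name, abundance in species_abundances.items():
        # Stable hash so the index is reproducible across runs/processes.
        idx = int(hashlib.md5(name.encode()).hexdigest(), 16) % hash_dim
        vec[idx] += abundance
    return vec

plot = {"Fagus sylvatica": 0.6, "Quercus robur": 0.3, "Ilex aquifolium": 0.1}
v = hash_encode(plot, hash_dim=8)
```

Note that total abundance is preserved even under collisions; only the ability to distinguish the colliding species is lost.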
Strengths:
- O(1) memory per species (no vocabulary or embedding table)
- Handles any species pool size, including species never seen during training
- Fastest encoding mode, both in training and inference
- Good baseline that is hard to beat on noisy or small datasets
Weaknesses:
- Hash collisions: two species can map to the same index, losing signal
- No learnable species representations; the encoder must compensate
- Higher `hash_dim` reduces collisions but increases input dimension
When to use:
- Starting point for any new dataset
- Large species pools (>10k species) where embedding tables would be huge
- Fast iteration during development and hyperparameter search
- Datasets where species identity matters less than aggregate composition
Embed Encoding¶
Learned per-species embeddings assign each of the top-k most frequent species its own embedding vector. Less frequent species are grouped into an "unknown" bucket.
```python
trainer = resolve.Trainer(
    dataset,
    species_encoding="embed",
    species_embed_dim=32,  # Embedding dimension per species
    top_k_species=10,      # Number of species with own embeddings
)
```
How it works:
- During `fit()`, identify the top-k most frequent species across all plots
- Assign each a learnable embedding vector of dimension `species_embed_dim`
- For a given plot, look up embeddings for present top-k species and concatenate
- Species outside the top-k contribute to an "unknown mass" feature
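The steps above can be sketched as follows. The helper names (`build_vocab`, `embed_encode`) are hypothetical, and scaling each embedding by abundance is an assumption; RESOLVE's actual implementation may differ:

```python
def build_vocab(plots, top_k):
    """Count species frequency across plots; keep top-k (ties broken by name)."""
    counts = {}
    for plot in plots:
        for sp in plot:
            counts[sp] = counts.get(sp, 0) + 1
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return {sp: i for i, sp in enumerate(ranked[:top_k])}

def embed_encode(plot, vocab, table, embed_dim):
    """Concatenate abundance-scaled embeddings for present top-k species;
    everything else folds into a trailing 'unknown mass' feature."""
    out = [0.0] * (len(vocab) * embed_dim)
    unknown_mass = 0.0
    for sp, abundance in plot.items():
        if sp in vocab:
            slot = vocab[sp]
            for j, e in enumerate(table[sp]):
                out[slot * embed_dim + j] = abundance * e
        else:
            unknown_mass += abundance
    return out + [unknown_mass]

vocab = build_vocab([{"a": 1}, {"a": 1, "b": 1}, {"b": 1, "c": 1}], top_k=2)
table = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # stand-in for learned embeddings
x = embed_encode({"a": 0.5, "c": 0.5}, vocab, table, embed_dim=2)
```

This makes the fixed-dimension trade-off concrete: the output length is always `top_k * embed_dim + 1`, no matter how many species the plot contains.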
Strengths:
- Learnable representations capture species-specific patterns
- The model can learn that certain species are strong indicators of specific targets
- Compact: only stores embeddings for the most informative species
Weaknesses:
- Cannot represent species not in the top-k vocabulary
- Top-k truncation discards information from rare species
- Requires enough data for the embeddings to learn meaningful representations
- Fixed input dimension of `top_k_species * species_embed_dim`
When to use:
- Datasets with strong species identity signal (certain species are diagnostic)
- Moderate species pools where most signal comes from common species
- When you want interpretable species embeddings for downstream analysis
Rank-Pool Encoding¶
Rank-pool encoding handles variable-length species lists via weighted mean pooling. Every species gets a learnable embedding, and the plot representation is the abundance-weighted mean of all present species' embeddings.
```python
trainer = resolve.Trainer(
    dataset,
    species_encoding="rank_pool",
    species_normalization="log1p",  # Recommended for rank_pool
)
```
How it works:
- Build a vocabulary of all species seen during training
- Assign each species a learnable embedding
- For a plot, look up embeddings for all present species
- Compute abundance-weighted mean pooling over the embeddings
- Taxonomy embeddings (genus, family) are included when available
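The pooling step can be sketched as below. This is illustrative only: the function name is hypothetical, and the taxonomy fallback and `log1p` normalization are omitted for brevity:

```python
def rank_pool(plot, table, embed_dim):
    """Abundance-weighted mean of the embeddings of all present species."""
    pooled = [0.0] * embed_dim
    total = sum(plot.values())
    for sp, abundance in plot.items():
        emb = table.get(sp)
        if emb is None:
            continue  # novel species: the real encoder falls back to taxonomy
        w = abundance / total  # dominance-preserving weight
        for j in range(embed_dim):
            pooled[j] += w * emb[j]
    return pooled

table = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # stand-in for learned embeddings
pooled = rank_pool({"a": 3.0, "b": 1.0}, table, embed_dim=2)
```

Because the weights sum to one over known species, a dominant species pulls the plot representation toward its own embedding.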
Strengths:
- Uses all species, not just top-k (no truncation)
- Learnable embeddings with the flexibility of variable-length input
- Taxonomy integration provides a fallback signal for rare species
- Abundance weighting preserves dominance information
Weaknesses:
- Requires padding for batched training (variable-length lists)
- Larger vocabulary table than embed mode (all species, not just top-k)
- Slower than hash encoding due to embedding lookups and pooling
- Novel species at inference fall back to taxonomy-only signal
When to use:
- Datasets where species richness varies widely across plots
- When rare species carry important signal (e.g., indicator species)
- When taxonomy information is available and informative
- Mid-size species pools (1k-10k species)
Transformer Encoding¶
Transformer encoding applies self-attention over species tokens, followed by attention pooling. Each species is a token; the transformer learns which species combinations matter and how they interact.
```python
trainer = resolve.Trainer(
    dataset,
    species_encoding="transformer",
    n_attention_layers=2,
    n_heads=4,
    transformer_ff_dim=256,
    transformer_pooling="attention",  # or "mean"
    lr=3e-4,        # Lower LR required for attention layers
    use_amp=False,  # Disable AMP to avoid fp16 overflow in attention
)
```
How it works:
- Each species becomes a token: embedding + positional encoding
- Self-attention layers model species-species interactions
- Attention pooling (or mean pooling) compresses the token sequence into a fixed vector
- The pooled vector feeds into the shared MLP encoder
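Attention pooling (the last two steps) can be sketched as scoring each species token against a learned query vector and taking the softmax-weighted sum. This is a schematic, not RESOLVE's implementation, and the query here is a fixed stand-in for a learned parameter:

```python
import math

def attention_pool(tokens, query):
    """Softmax-weighted sum of token vectors, scored against a query vector."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    m = max(scores)  # subtract max before exp for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(tokens[0])
    return [sum(w * tok[j] for w, tok in zip(weights, tokens))
            for j in range(dim)]

tokens = [[1.0, 0.0], [0.0, 1.0]]  # two species token embeddings (illustrative)
query = [10.0, 0.0]                # stand-in for the learned pooling query
pooled = attention_pool(tokens, query)
```

A high score for the first token drives its softmax weight toward 1, so the pooled vector ends up close to that token; this is how the pooler "focuses" on particular species per plot.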
Strengths:
- Models species co-occurrence and interaction patterns
- Attention pooling learns which species to focus on per plot
- Best empirical performance on large, complex datasets
- Can capture non-linear species assemblage signatures
Weaknesses:
- Quadratic attention cost in the number of species per plot
- Requires lower learning rate (3e-4 vs 1e-3) to avoid attention overflow
- AMP (mixed precision) should be disabled to prevent fp16 precision loss
- Needs more data and longer training to converge
- Slower per epoch than all other modes
When to use:
- Maximum accuracy is the priority and compute budget allows it
- Species interactions are meaningful for the target variable
- Large datasets (>50k plots) where the model has enough data to learn attention patterns
- Final production runs after simpler modes have been benchmarked
Decision Flowchart¶
```text
Is your dataset small (<1k plots)?
  YES → Use hash (fast iteration, less overfitting risk)
  NO ↓
Do you have >10k species?
  YES → Use hash or rank_pool (embed vocabulary too large)
  NO ↓
Is species identity a strong signal (diagnostic species)?
  YES → Use embed (learnable per-species representations)
  NO ↓
Does species richness vary widely across plots?
  YES → Use rank_pool (handles variable-length lists naturally)
  NO ↓
Do you need maximum accuracy and have >50k plots?
  YES → Use transformer (self-attention + attention pooling)
  NO ↓
Default starting point:
  → hash with hash_dim=64 (best speed/accuracy trade-off)
```
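The same logic can be written as a small helper. The function and argument names are hypothetical, purely to make the branching explicit; it is not part of RESOLVE:

```python
def pick_encoding(n_plots, n_species, diagnostic_species, varying_richness):
    """Mirror of the decision flowchart above (illustrative only)."""
    if n_plots < 1_000:
        return "hash"         # fast iteration, less overfitting risk
    if n_species > 10_000:
        return "hash"         # or "rank_pool"; embed vocabulary too large
    if diagnostic_species:
        return "embed"        # learnable per-species representations
    if varying_richness:
        return "rank_pool"    # handles variable-length lists naturally
    if n_plots > 50_000:
        return "transformer"  # self-attention + attention pooling
    return "hash"             # default starting point
```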
Benchmark Comparison (ASAAS 10k subset)¶
Results from the ASAAS dataset (10,000 sample plots), 3-fold spatial block cross-validation, 50 epochs with `patience=10`. All runs on a single GPU with `hidden_dims=[512, 256, 128]`.
| Encoding | Area MAE | Area Band-10% | EUNIS Accuracy | EUNIS F1 (macro) | Time/epoch |
|---|---|---|---|---|---|
| `hash_32` | baseline | baseline | baseline | baseline | 1x |
| `hash_64` | -3-5% | +2-4% | +1-2% | +1-2% | ~1x |
| `embed` | -5-8% | +3-6% | +2-4% | +2-4% | ~1.2x |
| `rank_pool` | -8-12% | +5-8% | +3-5% | +3-5% | ~1.5x |
| `transformer_v4` | -10-14% | +6-10% | +4-6% | +4-6% | ~2x |
| `transformer_v5` | -12-16% | +8-12% | +5-8% | +5-8% | ~3x |
Relative improvements
Values show improvement relative to `hash_32`. Actual numbers depend on the specific dataset, target configuration, and random seed. Run `python benchmarks/run_benchmarks.py --data-size 10k --configs encodings` to reproduce.
Combining with Other Settings¶
Encoding mode interacts with several other training parameters:
- Learning rate: Hash/embed/rank_pool work well with `lr=1e-3`. Transformer needs `lr=3e-4`.
- AMP: Safe for hash/embed/rank_pool. Disable for transformer (`use_amp=False`).
- Batch size: All modes work with the default `batch_size=4096`. Transformer benefits from smaller batches (2048) on small datasets.
- Hidden dims: Deeper networks help more with hash encoding (compensating for lost signal) than with transformer (which already has attention capacity).
Next Steps¶
- Training Models: Full training configuration reference
- Performance Tuning: Optimize speed and accuracy
- Understanding Embeddings: Extract and interpret learned representations