Entry to the BirdCLEF+ 2026 acoustic species identification challenge (LifeCLEF Lab, CLEF 2026). An ensemble for the multi-taxon Pantanal soundscapes, paired with an offline evaluator to compare model blends without spending leaderboard submissions.

The challenge

BirdCLEF+ 2026, the acoustic species identification task at the LifeCLEF Lab of CLEF 2026, concluded on 3 June 2026 and drew 4,092 teams. Competitors identify which of 234 species vocalise in continuous field recordings from the Pantanal of South America. The 2026 edition is multi-taxon, so one model has to separate birds, amphibians, insects, and reptiles, whose calls share almost nothing. Submissions are scored by macro-averaged ROC-AUC over the classes that carry a positive label in the hidden soundscape test set.
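The metric can be sketched in a few lines of plain Python. This is an illustration of macro-averaged ROC-AUC restricted to scorable classes, not the competition's scoring code. The function names `roc_auc` and `macro_auc` are hypothetical, and skipping classes that have no negatives as well as no positives is an assumption made so that every per-class AUC is defined.

```python
from typing import List, Optional, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """ROC-AUC via the Mann-Whitney U statistic, using average ranks for ties.

    Returns None when the class has no positives or no negatives,
    because the AUC is undefined in that case.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over a run of tied scores and give them the average rank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    positive_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_statistic = positive_rank_sum - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


def macro_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Unweighted mean of per-class AUCs; rows are windows, columns are classes."""
    per_class = []
    for c in range(len(y_true[0])):
        auc = roc_auc([row[c] for row in y_true], [row[c] for row in y_score])
        if auc is not None:  # skip classes the metric cannot score
            per_class.append(auc)
    if not per_class:
        raise ValueError("no class has both positive and negative labels")
    return sum(per_class) / len(per_class)
```

Because the mean is unweighted, a class with two positive windows counts exactly as much as a class with two thousand, which is why rare species move the score so much.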

That setup is harder than a leaderboard position makes it look. Training audio is clip-level focal recordings of a single calling individual, while the test audio is continuous soundscape with overlapping species and long silences, so a model is scored on a distribution it never trained on. Because the metric averages over classes, a handful of rare species move the score as much as the common ones, and the room to improve sits in the worst-scoring classes rather than the easy majority. Inference runs CPU-only under a wall-clock cap, which rules out large or numerous models. And the field is dense: a shared public baseline already reaches about 0.927, and from there the leaderboard is decided in the third decimal place, so a mid-pack finish among 4,092 teams sits only a few thousandths off the top.

Evaluating without the leaderboard

A solo competitor gets only a few public-board reads a day, which is a slow way to choose between dozens of model blends. The organisers released 66 soundscape recordings with window-level labels, so I built an offline evaluator on them: it scores any model or blend in under a second, restricts the macro-average to the 75 species that carry a label there, and cross-validates across the files. One daily leaderboard slot became dozens of offline comparisons in an evening.
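One plausible reading of "cross-validates across the files" is a grouped split, in which every window from a recording lands on the same side of each fold, so that neighbouring windows from one soundscape never leak between fitting and scoring. The sketch below shows only that splitting step. The helper name `file_folds`, the round-robin assignment, and the fold count are assumptions, not details of the actual evaluator.

```python
from typing import List, Sequence, Tuple


def file_folds(file_ids: Sequence[str], n_folds: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split window indices into folds so that no recording straddles a fold.

    file_ids[i] is the recording that window i was cut from.
    Returns one (train_indices, validation_indices) pair per fold.
    """
    unique_files = sorted(set(file_ids))
    # Round-robin assignment of recordings to folds (hypothetical choice).
    fold_of_file = {name: i % n_folds for i, name in enumerate(unique_files)}

    folds = []
    for k in range(n_folds):
        validation = [i for i, name in enumerate(file_ids) if fold_of_file[name] == k]
        train = [i for i, name in enumerate(file_ids) if fold_of_file[name] != k]
        folds.append((train, validation))
    return folds
```

Any blend weight would then be chosen on the training indices of a fold and scored on its validation indices, with the per-fold scores averaged at the end.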

The evaluator reaches every part of the system but one, and that part explains the final standing. The component carrying most of the ranking is a public embedding branch that trains on the same 66 soundscapes the harness uses, so there is no clean way to hold it out and score it offline. The harness could rank everything except the piece it could not see, which is exactly where the public and private leaderboards diverged: the blend that led the public board ranked lower on the private split.

Result

0.932 macro-AUC on the public leaderboard (1,968th of 4,092 teams), 0.917 on the private leaderboard (2,415th of 4,092).

[Figure: system diagram for the BirdCLEF+ 2026 entry]

What I tried that didn’t work

Most of the effort went into directions that did not improve the score.

On the modelling side, single-class (argmax) pseudo-labels plateaued below the best blend, so I switched to frame-level multi-hot pseudo-targets that keep the overlapping-species structure of real soundscapes. Distilling a seven-model teacher into one student to save runtime scored worse than both the teacher and the small triplet, so the accuracy it gave up outweighed the runtime it saved. Fusing the models by a geometric mean of probabilities, instead of by rank, also lost, because the models are not calibrated to a shared scale. And my soundscape-tuned models helped only at a low weight; above a small share of the blend they made the score worse.
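The difference between the two pseudo-labelling schemes is easiest to see on a frame where two species call at once. The sketch below is illustrative only: the function names and the 0.5 threshold are assumptions, and the source does not say how the multi-hot targets were actually derived from model outputs.

```python
from typing import List, Sequence


def argmax_pseudo_label(probs: Sequence[float]) -> List[int]:
    """Single-class target: only the top-scoring species is kept."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return [1 if i == top else 0 for i in range(len(probs))]


def multi_hot_pseudo_label(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Multi-hot target: every species at or above the threshold is kept.

    The threshold value is a hypothetical choice for illustration.
    """
    return [1 if p >= threshold else 0 for p in probs]


def frame_level_targets(frames: Sequence[Sequence[float]], threshold: float = 0.5) -> List[List[int]]:
    """Apply the multi-hot rule independently to each frame of a recording."""
    return [multi_hot_pseudo_label(frame, threshold) for frame in frames]


# A frame in which species 0 and species 1 overlap:
overlapping_frame = [0.9, 0.7, 0.1]
single = argmax_pseudo_label(overlapping_frame)    # keeps species 0 only
multi = multi_hot_pseudo_label(overlapping_frame)  # keeps species 0 and 1
```

The argmax target discards the second caller, so a student trained on it is told that an audible species is absent. The multi-hot target preserves the overlap.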

The inference budget killed other directions outright. A seven-model ensemble timed out on the hidden test set. A ConvNeXt-Base that scored well locally projected to nearly two hours of CPU inference and was shelved. Two attempts to make heavier models fit, INT8 quantisation and float16 conversion, either ran slower than the original or disagreed with it almost completely, a result of how the conversions interact with the depthwise convolutions. The entry that shipped is the best one that fit the budget; my best local model never did.

The most ambitious thread, a deeper data pipeline with cross-year pretraining on previous editions and several rounds of pseudo-labelling, was only partly finished by the deadline. It is the route the strongest published solutions take, and the part I would start earlier next time, because it does not fit into a final week. Finishing it would take more than an earlier start: the full multi-backbone, multi-seed version outgrows the two personal machines I trained on, a 16 GB RTX 5080 and an M4 Pro Mac mini.

The pipeline

The submitted system is an ensemble of two parts, combined only at the very end. Every model resamples audio to 32 kHz mono, cuts it into 5-second windows, and turns each window into a 128-band log-mel spectrogram before scoring all 234 classes.
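The windowing arithmetic is simple: at 32 kHz, a 5-second window holds 160,000 samples. The sketch below covers only the cutting step, on audio that is assumed to be already resampled and mixed down to mono. Zero-padding the final partial window is an assumption, the helper name is hypothetical, and the log-mel transform is left out because it would normally come from an audio library.

```python
from typing import List, Sequence

SAMPLE_RATE = 32_000                           # Hz, after resampling
WINDOW_SECONDS = 5
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 160_000 samples per window


def split_into_windows(samples: Sequence[float], window: int = WINDOW_SAMPLES) -> List[List[float]]:
    """Cut a mono signal into consecutive, non-overlapping windows.

    The last window is zero-padded to full length (an assumption; the
    source does not say how trailing audio is handled).
    """
    windows = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        if len(chunk) < window:
            chunk.extend([0.0] * (window - len(chunk)))
        windows.append(chunk)
    return windows
```

A 60-second recording would yield 12 windows under this scheme, each of which would then be converted to a 128-band log-mel spectrogram and scored against all 234 classes.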

The first part is a publicly shared community baseline, used unchanged: an embedding branch built on the Perch bird-vocalisation model with a probe-classifier head, blended with a five-fold EfficientNet sound-event-detection ensemble. The second part is a weighted triplet of convolutional networks I fine-tuned on the released soundscape recordings: an EfficientNet-B3, a pseudo-label-refined B3, and a ConvNeXt-Small, all taken from timm with single-channel input stems and trained with an asymmetric loss. The triplet is rank-blended onto the baseline at a small weight (alpha = 0.10): the baseline carries most of the ranking, and the soundscape-tuned models supply a small correction on top.
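A rank blend can be sketched as follows for a single class. Each model's scores across the windows are replaced by their normalised ranks, and the two rank vectors are mixed with weight alpha. The convex form `(1 - alpha) * baseline + alpha * correction`, the per-class ranking across windows, and the function names are assumptions; the source gives only the weight, alpha = 0.10.

```python
from typing import List, Sequence


def rank_normalise(scores: Sequence[float]) -> List[float]:
    """Map scores to average ranks scaled to [0, 1]; tied scores share a rank."""
    n = len(scores)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over a run of tied scores.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2  # 0-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank / (n - 1)
        i = j + 1
    return ranks


def rank_blend(baseline: Sequence[float], correction: Sequence[float], alpha: float = 0.10) -> List[float]:
    """Blend two models' scores for one class in rank space."""
    if len(baseline) != len(correction):
        raise ValueError("both models must score the same windows")
    base_ranks = rank_normalise(baseline)
    corr_ranks = rank_normalise(correction)
    return [(1 - alpha) * b + alpha * c for b, c in zip(base_ranks, corr_ranks)]
```

Because only the ordering of each model's scores enters the blend, rescaling one model's outputs leaves the result unchanged. That makes rank blending tolerant of models that are not calibrated to a shared scale, and since ROC-AUC itself depends only on ordering, nothing the metric measures is lost in the conversion.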

Artefacts

The peer-reviewed CEUR-WS proceedings version will appear after the conference.

Gilles Colling

PhD student at University of Vienna. Physicist turned ecologist. R packages, spatial statistics, and computational ecology.