Hierarchical spatial modelling for applied population and community ecology

Jun 24-27, 2024

From June 24 to 27, 2024, Jeff Doser and Marc Kéry led a workshop on hierarchical Bayesian spatial models at the Swiss Ornithological Institute in Sempach, Switzerland. Participants brought their own laptops with R, using the `spOccupancy` and `spAbundance` packages throughout the sessions. While the course didn’t require prior experience with Bayesian or spatial statistics, a solid understanding of regression models in R was expected. All course materials, including lectures and code, were made available via GitHub.

Course Highlights

The course combined theoretical background with hands-on implementation, with about 70 percent of the time dedicated to lectures and the remaining 30 percent to exercises. It began with an overview of hierarchical modelling and spatial statistics, then moved into scalable modelling techniques using `spOccupancy` and `spAbundance`. Over four packed days, participants worked through topics including single- and multi-species occupancy models, joint species distribution models with imperfect detection, and multi-season models. Later sessions focused on modelling count data using spatial GLMMs, N-mixture models, and hierarchical distance sampling. Real-world applications, such as analyses of Swiss bird data, anchored the exercises in practice. Each day ran from 9:00 to around 17:30, with generous breaks and time set aside for questions and discussion.

Introduction to Hierarchical Models for Distribution and Abundance

The opening session of the workshop laid the foundation for hierarchical modelling by introducing the concept of hierarchical models (HMs) and explaining how they help us answer core ecological questions. These questions include: How many individuals are there? Where are they? How do abundance and distribution vary across space and time? And what environmental or demographic factors drive these patterns?

At the heart of this framework lies the idea of point patterns. In theory, the most fundamental way to describe a population or community is as a realization of a stochastic point process. A point pattern consists of a set of spatial coordinates, with each point representing an individual organism. This can be modeled using a homogeneous or inhomogeneous point process, depending on whether intensity (the expected density of individuals) varies across space.

However, real-world data rarely take this ideal form. It’s difficult to record exact locations for every individual, and spatially continuous models require mathematical tools like integrals that biologists often avoid. Moreover, data often come in aggregated formats: counts per site, presence/absence in grid cells, or summaries by administrative region. So instead of modeling point patterns directly, we summarize them into two main quantities:

  • Abundance (N): The number of individuals per site or unit area.

  • Distribution (z or ψ): Presence/absence or the probability of occurrence.

The lecture emphasized that distribution is not an independent quantity: It is a deterministic function of abundance. In particular, if abundance follows a Poisson distribution with rate \( \lambda \), then occupancy probability is given by:

\[\psi = 1 - e^{-\lambda}\]

This makes distribution a lower-information summary of abundance. It’s useful when abundance data are unavailable or detection is very low, but it cannot reveal much once all sites are saturated (i.e., when \( \psi \to 1 \)).
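
This saturation is easy to see numerically; a quick base-R illustration (no packages required):

```r
# Occupancy probability implied by Poisson abundance: psi = 1 - exp(-lambda)
lambda <- c(0.1, 0.5, 1, 2, 5, 10)
psi <- 1 - exp(-lambda)
round(data.frame(lambda, psi), 3)
# Beyond lambda of about 5, psi is essentially 1, so presence/absence
# data carry almost no information about further increases in abundance.
```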

The discussion also highlighted the scale dependence of both abundance and distribution. How we discretize space (how large our sampling units are) directly affects the interpretation of results. For instance, small units tend to have low occupancy (since many will contain zero individuals), while very large units may have occupancy close to one. This modifiable areal unit problem (MAUP) complicates comparisons across studies and scales.

Another key theme was measurement error. While point patterns have three types of error (location error, false positives, and false negatives), abundance and distribution data typically deal with only two: missed detections and false positives. Accounting for these errors is essential to avoid biased inference.

The second half of the session introduced hierarchical models as the statistical framework best suited to deal with these complexities. A hierarchical model is a sequence of random processes, where each submodel represents a biological or observation process. For example:

  • Abundance model: \( N_j \sim \text{Poisson}(\lambda_j) \)

  • Observation model: \( y_{j,k} \sim \text{Binomial}(N_j, p_{j,k}) \)

Alternatively, for presence/absence:

  • State model: \( z_j \sim \text{Bernoulli}(\psi_j) \)

  • Observation model: \( y_{j,k} \sim \text{Bernoulli}(z_j \cdot p_{j,k}) \)

These models can incorporate covariates and spatial effects using generalized linear model (GLM) formulations. For instance:

\[\log(\lambda_j) = \beta_0 + \beta_1 x_j + w_j,\]

where \( w_j \) is a spatial random effect.
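
A minimal simulation of this two-stage structure makes the separation of process and observation concrete. The sketch below uses base R with purely hypothetical parameter values:

```r
set.seed(123)
J <- 100; K <- 3                   # sites and repeat surveys
x <- rnorm(J)                      # a site-level covariate
lambda <- exp(0.5 + 1 * x)         # expected abundance (log link)
N <- rpois(J, lambda)              # ecological process: latent abundance
p <- 0.4                           # per-individual detection probability
y <- matrix(rbinom(J * K, size = rep(N, K), prob = p), nrow = J)
# Observation process: counts y[j, k] systematically underestimate N[j];
# the hierarchical model's task is to recover N and p from repeated counts.
```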

The benefit of this structure is twofold: First, it separates the ecological signal from the observation noise; second, it allows for flexible and realistic modeling of complex datasets across space, time, and species.

The takeaway message was clear: Hierarchical models are not just statistical tools, but conceptual frameworks that match the layered nature of ecological data. Whether we are modeling occupancy, abundance, detection, or their spatial and temporal patterns, hierarchical models allow us to describe the world more realistically and to account for the uncertainty that pervades all ecological observations.

Foundations & Core Spatial Concepts

The first lecture of the occupancy modeling workshop, led by Jeffrey Doser, introduced the fundamental concepts behind species distribution models that account for imperfect detection. The problem is simple in theory but difficult in practice: when studying a species’ presence or absence across a landscape, how do we separate biological truth from observation error? This is particularly important in ecology, where even skilled observers may miss a species that is present, and absences may reflect limitations of detection rather than true absence.

To set the stage, the lecture began by discussing the kinds of questions ecologists often want to answer. These include understanding where species live, how effective conservation efforts have been, whether management changes influence entire communities, and how distributions shift over time. Each of these depends on having accurate knowledge of species occurrence, but our observations are always partial, noisy, and incomplete. Occupancy models were designed to deal with this reality.

We started by comparing different types of species data: presence-only data (where we know only where a species has been seen), presence-absence data (which also record when a species wasn’t observed), and count data (which record how many individuals were detected). The focus here was on presence-absence data collected through repeated surveys, which MacKenzie et al. (2002) call detection/nondetection data. For example, at a given site, a species might be seen during the first and last visit, but missed in between. This type of dataset allows us to start modeling both the probability that a species is present and the probability that it is detected.

Occupancy models work by separating the biological process (whether the species actually occupies the site) from the observation process (whether it was detected). Each site has a latent (unobserved) occupancy state \( z_j \), which equals 1 if the species is present and 0 if absent. At the same time, each survey \( k \) at site \( j \) yields an observation \( y_{j,k} \), which is a function of both occupancy and detection. Even if the species is present, it may not be detected.

The model is built around two parts. First, the occupancy model, which describes the true (but unobserved) state of nature. We define the probability of occupancy at site \( j \) as \( \psi_j \), and model this as

\[z_j \sim \text{Bernoulli}(\psi_j)\]

Often, occupancy is influenced by site-level environmental variables—say, elevation, vegetation cover, or land use. These are included through a logistic regression:

\[\text{logit}(\psi_j) = \beta_0 + \beta_1 x_{1j} + \dots + \beta_R x_{Rj}\]

Second, the observation model, which accounts for the fact that a species may go undetected even if it is present. If the species is at the site, we model detection using

\[y_{j,k} \sim \text{Bernoulli}(p_{j,k}),\]

where \( p_{j,k} \) is the detection probability, possibly depending on covariates like weather, observer skill, or time of day:

\[\text{logit}(p_{j,k}) = \alpha_0 + \alpha_1 w_{1jk} + \dots + \alpha_Q w_{Qjk}\]

Together, these two components form a hierarchical model that allows us to estimate both presence and detectability. This is powerful because it lets us distinguish a true absence from a non-detection.

Of course, the model depends on some assumptions. Most importantly, the occupancy state must remain constant during the survey period: This is the closure assumption. The model also assumes no false positives (i.e. if the species is detected, it is truly there), that detections are independent across surveys, and that all relevant variation in detection is explained by covariates.

Compared with classical likelihood-based approaches, Bayesian models provide full posterior distributions, making uncertainty transparent and interpretable. They are also more flexible and handle latent variables and hierarchical structures more naturally, especially when extending to spatial or multi-species models. In practice, Bayesian occupancy models are estimated using MCMC (Markov chain Monte Carlo), which iteratively samples from the posterior distribution of each parameter.

We briefly discussed how this works in practice. After specifying priors for each parameter, the MCMC sampler proposes new values, accepts or rejects them based on the posterior density, and repeats this process thousands of times. The resulting samples can be summarized using posterior means, medians, or credible intervals. Diagnostics like trace plots and R-hat values help assess convergence.

The hands-on component of the lecture used the `spOccupancy` R package, which simplifies fitting single-species, spatial, and multi-species occupancy models in a Bayesian framework. For this session, we worked with presence-absence data from the Swiss Breeding Bird Survey (MHB), focusing on the European goldfinch. Each site had multiple surveys, and the aim was to estimate occupancy across Switzerland while accounting for imperfect detection.
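
As a rough sketch of what such an analysis looks like (covariate names are placeholders, and `data.list` is assumed to be a pre-formatted list with the detection matrix `y`, site-level `occ.covs`, and survey-level `det.covs`):

```r
library(spOccupancy)
out <- PGOcc(occ.formula = ~ elev + I(elev^2) + forest,
             det.formula = ~ date,
             data = data.list,
             n.samples = 5000, n.burn = 1000,
             n.thin = 4, n.chains = 3)
summary(out)   # posterior summaries plus convergence diagnostics
```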

The next part of the workshop introduced spatial models. Many ecological datasets exhibit spatial autocorrelation, meaning that observations closer in space tend to be more similar than those further apart. If we ignore this dependence, we risk biased estimates and underestimated uncertainty. Spatial models incorporate coordinates directly into the analysis, allowing us to represent ecological processes that vary smoothly across the landscape.

The foundation is the spatial linear model. Suppose we observe a response \( y_i \) at each site \( i \). We model this as

\[y_i = \mathbf{x}_i^\top \beta + w_i + \epsilon_i,\]

where \( \mathbf{x}_i \) are covariates, \( \beta \) are regression coefficients, \( w_i \) is a spatial random effect, and \( \epsilon_i \) is independent error. The spatial random effect is modeled with a Gaussian process prior:

\[\mathbf{w} \sim \text{MVN}(0, \mathbf{C}),\]

where \(\mathbf{C}\) is a covariance matrix with entries determined by distances between sites. A common choice is the exponential covariance function,

\[\text{Cov}(w_i, w_j) = \sigma^2 \exp(-\phi d_{ij}),\]

where \( \sigma^2 \) is the variance, \( d_{ij} \) is the distance between sites \( i \) and \( j \), and \( \phi \) controls the rate at which correlation decays with distance. This structure allows nearby sites to share more information than distant ones.

Gaussian processes can be embedded in occupancy models by letting the occupancy probability at each site depend not only on covariates but also on a spatial random effect. This provides a flexible way to capture unmeasured spatial structure, such as habitat heterogeneity or dispersal processes. In practice, these models are fit using Bayesian methods, often implemented in packages like `spOccupancy`.

We explored how changing the parameters affects the spatial surface. Increasing \( \sigma^2 \) increases the overall variability in the spatial effect, while changing \( \phi \) alters the spatial scale of correlation. Small values of \( \phi \) yield long-range dependence, while large values yield short-range dependence. Visualizing these surfaces helps to understand what the model is capturing.
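
To build intuition, one can draw a single surface from this model in base R (hypothetical parameter values; the Cholesky factor turns independent normals into correlated ones):

```r
set.seed(42)
J <- 400
coords <- cbind(runif(J), runif(J))       # random sites on a unit square
d <- as.matrix(dist(coords))              # pairwise distances
sigma2 <- 1; phi <- 6                     # variance and decay (effective range ~ 3/phi = 0.5)
C <- sigma2 * exp(-phi * d)               # exponential covariance matrix
w <- drop(crossprod(chol(C), rnorm(J)))   # one draw of w ~ MVN(0, C)
plot(coords, pch = 16, cex = 1.5,
     col = heat.colors(12)[as.integer(cut(w, 12))])
# Larger sigma2 stretches the color scale; larger phi makes patches smaller.
```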

The key advantage of spatial models is that they reduce bias and improve inference by properly accounting for spatial dependence. However, this comes at a computational cost, since Gaussian process models require storing and inverting large covariance matrices.

The final part of the foundations addressed the computational challenges of fitting Gaussian process models when the number of locations becomes large. This is the so-called big-\( n \) problem. When we assign a Gaussian process prior to the spatial random effect, the covariance matrix \(\mathbf{C}\) is \( J \times J \), where \( J \) is the number of locations. For small \( J \), such as 5 or 10, inversion is trivial. But as \( J \) grows, the cost of storing and inverting \(\mathbf{C}\) scales as \( O(J^2) \) and \( O(J^3) \), respectively. Already at \( J = 1000 \) computation becomes very slow, and by \( J = 1600 \) a full GP fit is prohibitively expensive for routine use.

This difficulty arises because Gaussian processes require evaluating multivariate normal likelihoods, inverting dense matrices, and storing all pairwise distances. These demands quickly overwhelm both memory and processing time.

Several strategies have been developed to make Gaussian processes scalable:

  • Low-rank methods approximate the full covariance structure using a smaller set of basis functions, reducing the dimensionality of the spatial surface.

  • Sparse methods impose sparsity in the covariance or precision matrix, drastically reducing computational requirements.

  • Nearest Neighbor Gaussian Processes (NNGPs) approximate dependencies locally, conditioning each site only on a subset of its nearest neighbors. This retains much of the spatial structure while making computation feasible for thousands of sites.

Among these, the lecture focused in detail on Nearest Neighbor Gaussian Processes (NNGPs), introduced by Datta et al. (2016) and further developed by Finley et al. (2019). NNGPs offer a sparse and scalable approximation to full GPs, while retaining the same model structure and interpretation. That is, NNGPs have the same spatial parameters as GPs and generate similar spatial surfaces, but can be applied to much larger datasets.

The conceptual foundation of NNGPs is surprisingly intuitive. First, spatial locations are ordered, often just along the x-axis. Then, for each location, we define a set of its \( m \) nearest neighbors (subject to the ordering). The key idea is that the spatial random effect at each location is assumed to depend only on the values of these neighbors. This assumption introduces conditional independence, which is the heart of what makes the approach computationally efficient.

Instead of a dense \( J \times J \) covariance matrix, NNGPs produce a sparse matrix. Each entry depends only on a small set of nearby locations. As a result, the storage requirement drops to \( O(Jm^2) \), and computation scales as \( O(Jm^3) \). Because \( m \ll J \), the savings are substantial. And unlike low-rank methods, NNGPs preserve local variation and perform well in capturing short-range spatial structure.

In practical terms, the `spOccupancy` and `spAbundance` R packages implement NNGPs by ordering sites along the Easting coordinate and using default neighbor sizes like \( m = 15 \). Users can adjust this value and compare different options using criteria like the Widely Applicable Information Criterion (WAIC). Even relatively small neighbor sets tend to perform well, and increasing \( m \) improves accuracy at the cost of computation.

We then looked at how NNGPs are built. Mathematically, they rely on rewriting the Gaussian process as a product of conditional densities. Graphical models are used to impose the neighbor structure, introducing sparsity in a principled way. In the end, the only change to the standard GP model is replacing the dense covariance matrix \( \mathbf{C} \) with a sparse NNGP-derived version. The rest of the modeling workflow (prior specification, likelihood structure, MCMC fitting) remains the same.

An example was presented with a simulated dataset of 1,600 locations. A full GP took significantly longer to fit, while the NNGP version completed in a fraction of the time with nearly identical results.
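
In `spOccupancy`, switching between the two is a single argument; a sketch of such a comparison (placeholder covariate, `data.list` assumed to include site coordinates):

```r
library(spOccupancy)
fit.full <- spPGOcc(occ.formula = ~ cov1, det.formula = ~ 1,
                    data = data.list, cov.model = "exponential",
                    NNGP = FALSE,                   # full Gaussian process
                    n.batch = 400, batch.length = 25)
fit.nngp <- spPGOcc(occ.formula = ~ cov1, det.formula = ~ 1,
                    data = data.list, cov.model = "exponential",
                    NNGP = TRUE, n.neighbors = 15,  # sparse NNGP approximation
                    n.batch = 400, batch.length = 25)
# With ~1,600 sites the NNGP fit runs in a fraction of the time,
# with near-identical posterior summaries.
```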

Spatial Occupancy & Survey Design

The focus shifted to the implementation and interpretation of spatial occupancy models, specifically, how to fit and understand them using the `spOccupancy` package. While previous lectures laid out the theory behind spatial processes and modeling frameworks like Gaussian processes and their scalable approximations, this lecture applied those tools directly to the occupancy context.

The foundation remains the same: the basic occupancy model consists of two components. The ecological or occupancy sub-model defines the true presence or absence of a species at each site, while the detection sub-model accounts for imperfect observations. These are combined in a hierarchical Bayesian framework, allowing uncertainty to be modeled explicitly.

This time, the motivation was framed using a real-world example: The European goldfinch in Switzerland. The goal was to determine whether including spatial structure improves model performance and produces a more realistic species distribution map. The short answer is yes: spatial models provide more accurate predictions, more realistic estimates of uncertainty, and deeper insight into unmeasured ecological processes.

But why is residual spatial autocorrelation so common in species distributions? First, because ecological drivers like habitat suitability or environmental gradients often have spatial structure, and many of these drivers are unmeasured or unknown. Second, because species themselves interact with space, through dispersal, conspecific attraction, and biotic interactions. As a result, nearby locations tend to be more similar than distant ones, even after accounting for known covariates.

The spatial occupancy model reflects this reality by extending the standard occupancy formulation. In its spatial form, the occupancy probability \( \psi_j \) at site \( j \) is given by:

\[\text{logit}(\psi_j) = \mathbf{X}_j \boldsymbol{\beta} + w(s_j),\]

where \( \mathbf{X}_j \boldsymbol{\beta} \) is the usual linear predictor based on covariates and \( w(s_j) \) is a spatial random effect evaluated at location \( s_j \). This term captures residual spatial variation. Detection remains modeled in the same way, typically as:

\[\text{logit}(p_{j,k}) = \mathbf{W}_{j,k} \boldsymbol{\alpha}\]

Together, these define a full spatial occupancy model that can be fitted either with a full Gaussian process or with the more scalable NNGP approach.

The lecture then moved on to discuss prior distributions, especially for the regression coefficients \( \boldsymbol{\beta} \). In traditional generalized linear models, it is common to assign Normal priors like \( \text{Normal}(0, 1000) \). But this turns out to be highly informative in logistic models like occupancy, because coefficients live on the logit scale. A large prior variance leads to probabilities near 0 or 1, which can bias estimation. Instead, a more appropriate default is:

\[\boldsymbol{\beta} \sim \text{Normal}(0, 2.72)\]

This choice places more uniform weight on the probability scale and avoids the overconfidence induced by broader priors.

For the spatial parameters, the same structure introduced in Lecture 2 applies. The spatial decay parameter \( \phi \) governs how quickly spatial correlation drops with distance, and the spatial variance \( \sigma^2 \) controls the magnitude of the random effect. These parameters are often weakly identifiable, especially \( \phi \), and the lecture introduced the adaptive Metropolis algorithm used by `spOccupancy` to address this.

In contrast to a standard Metropolis algorithm, which uses a fixed proposal variance, the adaptive version adjusts the tuning variance throughout the MCMC run. This is done in batches, typically every 25 samples, and attempts to achieve a target acceptance rate of 0.43. Each batch provides feedback about how the chain is mixing, and the tuning variance is updated accordingly. This helps accelerate convergence, especially for parameters like \( \phi \), which are difficult to sample efficiently due to their strong influence on the shape of the spatial surface.

In practice, users of `spOccupancy` need to specify four key values when using the adaptive Metropolis implementation:

  1. `tuning`: The initial tuning variance.

  2. `accept.rate`: The target acceptance rate.

  3. `n.batch`: Number of MCMC batches.

  4. `batch.length`: Number of MCMC samples per batch.

The total number of MCMC samples is the product of `n.batch` and `batch.length`.
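
In a call to `spPGOcc()`, these four controls might look as follows (a sketch; covariates and values are placeholders):

```r
out.sp <- spPGOcc(occ.formula = ~ forest, det.formula = ~ date,
                  data = data.list, cov.model = "exponential",
                  NNGP = TRUE, n.neighbors = 15,
                  tuning = list(phi = 0.5),  # initial tuning variance for phi
                  accept.rate = 0.43,        # target acceptance rate
                  n.batch = 800,             # number of MCMC batches
                  batch.length = 25)         # samples per batch: 800 x 25 = 20,000 total
```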

The classic framework for occupancy studies uses a grid of \( J \) sites, each visited \( K \) times. These repeated surveys allow estimation of both the occupancy probability \( \psi \) and the detection probability \( p \), which are confounded in single-visit datasets. The observed detection histories, such as \([1, 0, 0, 1]\), are used to infer whether a species is truly present but missed, or genuinely absent.

Choosing the right number of sites and visits is nontrivial. It depends on your goals. Do you want to estimate occupancy with a given precision? Compare two habitats? Detect a trend over time? Evaluate a new detection method? Or model population dynamics? The optimal survey effort looks different for each objective.

For basic estimation of occupancy, the main concern is the cumulative detection probability, defined as:

\[p^* = 1 - (1 - p)^K\]

This represents the probability of detecting the species at least once across \( K \) visits, given that it is present and detection probability per visit is \( p \). To design your survey, you can define the total number of surveys \( T = J \times K \), and then optimize the allocation between sites and visits.
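
A few lines of R show how quickly \( p^* \) accumulates with extra visits:

```r
# Cumulative detection probability p* = 1 - (1 - p)^K
p <- c(0.2, 0.4, 0.6)                 # per-visit detection probabilities
K <- 1:5                              # number of visits
pstar <- outer(p, K, function(p, K) 1 - (1 - p)^K)
dimnames(pstar) <- list(paste0("p=", p), paste0("K=", K))
round(pstar, 2)
# With p = 0.2, even five visits give p* of only ~0.67;
# with p = 0.6, three visits already exceed 0.9.
```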

MacKenzie & Royle (2005) offer practical advice based on simulations. When detection is low (\( p \ll 0.5 \)), it is better to concentrate effort on fewer sites with more visits. When occupancy is low (\( \psi \ll 0.5 \)), the opposite holds: more sites with fewer visits yield more information. If detection is your main goal, prioritize repeated visits. If surveying new sites is costly, say, due to travel or access issues, it's again better to revisit fewer places.

In most cases, \( K \geq 3 \) is recommended to allow stable estimation of detection probabilities. When your study includes multiple species, it's often safest to design around the rarest or hardest-to-detect species. That design usually works for everything else.

The lecture then introduced alternative designs that deviate from the standard “\( J \) sites with \( K \) visits” format. The removal design stops visiting a site after the first detection. If a species is detected on the first visit, there's no need to return. This can save time and resources. It's particularly effective when occupancy is moderate to high (\( \psi \geq 0.4 \)). The removal design yields lower standard errors than the standard design under these conditions.

A second alternative is the conditional design, where every site is visited once, and only those with a detection are revisited. This approach was explored by Specht et al. (2017) and others. It is especially efficient for rare species with low detection probability, where many sites are likely to yield no detections at all. The first visit acts as a filter, identifying candidates for further effort.

A third hybrid is the mixed design, also known as “double sampling” or “fractional replication.” Here, a subset of sites receives only one visit (\( J_S \)), while another subset (\( J_R \)) is visited \( K \) times. This design can reduce field costs when some data are already available or when resources are limited. However, single-visit data only help if the species is very easy to detect. Otherwise, their contribution to parameter estimation is marginal. Still, when combined with repeated visits, even a few single-visit data points can slightly improve inference, particularly when occupancy and detection are modeled using continuous covariates.

The lecture then turned to dynamic objectives. Suppose you want to detect a change in occupancy over time. How much effort is needed to detect a decline of 50% with power \( = 0.8 \)? Guillera-Arroita & Lahoz-Monfort (2012) addressed this by simulating occupancy change scenarios. When both \( \psi \) and \( p \) are low, the required effort becomes huge. However, when detection is high, increasing the number of visits beyond two offers little additional power. When detection is low, increasing \( K \) yields strong gains in power.

Design must also take into account various sources of additional heterogeneity or error. For instance, variation in abundance, environmental covariates, observer skill, and species behavior can all affect detection. Royle (2006) modeled detection heterogeneity explicitly. False-positive detections introduce another layer of complexity, requiring models that distinguish between true and false observations. Royle & Link (2006) and Miller et al. (2011) introduced frameworks for handling such errors.

Spatial autocorrelation, as discussed in earlier lectures, can bias parameter estimates if ignored. Guélat & Kéry (2018) showed that ignoring spatial structure may result in false confidence in model predictions. Meanwhile, species availability also matters. A species may be present at a site but not detectable during the survey due to behavioral or life-history constraints. This is not a false negative: It’s a different process altogether, often tied to habitat use or seasonal activity.

So how should one choose the best design? It depends not just on the goal of the study, but also on the system being studied. Static systems might require estimation of a single occupancy value, while dynamic systems call for estimates of colonization, extinction, or trends over time. Sessile organisms like plants, and species with limited mobility like many amphibians, differ from highly mobile species like birds. The spatial and temporal scale of interest also influences design decisions, as does the assumed closure period.

There are two main ways to select a suitable design. One is through simulation. By generating artificial data under plausible parameter values, fitting the intended model, and comparing the estimates, researchers can explore how bias, precision, or power behave under different survey setups. This method is accessible and forces researchers to understand their assumptions. It's especially helpful when sample sizes are small.

The other is through mathematical optimization, where analytic or numerical solutions are derived for design parameters that maximize power or minimize error. This approach is more elegant but less intuitive and often requires asymptotic assumptions or complex derivations.

The lecture ended with a simulation exercise, where students explored how the precision of estimated occupancy depends on the number of visits, occupancy probability, and detection probability. The results reinforced earlier messages: there is no one-size-fits-all design, but thoughtful choices can make a huge difference in study success.
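
A stripped-down version of that exercise can be written in base R alone: simulate detection histories under chosen values of \( J \), \( K \), \( \psi \), and \( p \), refit the single-season model by maximum likelihood, and examine the spread of the estimates (all values below are hypothetical):

```r
sim.once <- function(J, K, psi, p) {
  z <- rbinom(J, 1, psi)                             # true occupancy states
  y <- matrix(rbinom(J * K, 1, rep(z, K) * p), nrow = J)
  det <- rowSums(y)                                  # detections per site
  nll <- function(par) {                             # negative log-likelihood
    psi.h <- plogis(par[1]); p.h <- plogis(par[2])
    ll <- ifelse(det > 0,
                 log(psi.h) + det * log(p.h) + (K - det) * log(1 - p.h),
                 log(psi.h * (1 - p.h)^K + 1 - psi.h))
    -sum(ll)
  }
  plogis(optim(c(0, 0), nll)$par[1])                 # return psi-hat
}
est <- replicate(500, sim.once(J = 100, K = 3, psi = 0.4, p = 0.3))
sd(est)  # precision of psi-hat under this design; vary J, K, psi, p to compare
```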

Multi-species & Temporal Occupancy

After building a solid understanding of single-species models, we moved into the realm of multi-species occupancy models (MSOMs), a powerful extension that leverages the structure of community data to improve inference, particularly for rare or hard-to-detect species. The statistical motivation is clear. Many of the species that conservationists are most interested in (Species of Greatest Conservation Need) are precisely those that are hardest to model due to low detection rates. When only a handful of detections exist across a landscape, single-species models tend to collapse under uncertainty. Multi-species models mitigate this by pooling information: not in a simplistic way, but by structuring the model hierarchically so that each species has its own parameters, drawn from a community-level distribution.

In MSOMs, the key structure looks similar to the standard occupancy model, just extended across species:

  • The occupancy state \( z_{i,j} \) describes whether species \( i \) occupies site \( j \).

  • The observation data \( y_{i,j,k} \) record detections for species \( i \) at site \( j \) and replicate \( k \).

  • The occupancy probability \( \psi_{i,j} \) and detection probability \( p_{i,j,k} \) are modeled using covariates, but their species-specific effects are treated as random effects drawn from a common community distribution.

For instance, suppose we have a covariate \( x_j \) for each site. The model might specify:

\[\text{logit}(\psi_{i,j}) = \beta_{0,i} + \beta_{1,i} x_j\]

with

\[\beta_{0,i} \sim \text{Normal}(\mu_{\beta_0}, \sigma_{\beta_0}^2), \quad \beta_{1,i} \sim \text{Normal}(\mu_{\beta_1}, \sigma_{\beta_1}^2)\]

This hierarchical structure allows us to estimate both species-specific effects and the community-wide trends (i.e., the mean and variance of these effects across species). Importantly, species with fewer detections are shrunk toward the community mean, resulting in more stable estimates, especially useful when dealing with rare species.

There’s also a compelling ecological motivation. Historically, conservation efforts have focused on single species, but there is growing interest in understanding communities as wholes, especially in the face of biodiversity loss and habitat degradation. MSOMs provide a framework to assess community-wide responses to environmental gradients, spatial structure, and anthropogenic pressures, while still retaining species-specific inference.

The lecture introduced the `msPGOcc()` function in the `spOccupancy` package, which fits these models using Pólya-Gamma data augmentation. The required input is a three-dimensional array of detection-nondetection data: species \( \times \) site \( \times \) replicate. The arguments are nearly identical to the single-species function `PGOcc()`, making it straightforward to extend existing analyses to a multi-species context.
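
A sketch of the data structure and call (placeholder covariates; `data.list$y` assumed to be formatted as described):

```r
library(spOccupancy)
str(data.list$y)   # e.g. num [1:N, 1:J, 1:K]: species x site x replicate
out.ms <- msPGOcc(occ.formula = ~ forest + elev,
                  det.formula = ~ date,
                  data = data.list,
                  n.samples = 10000, n.burn = 2000,
                  n.thin = 8, n.chains = 3)
```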

One of the main strengths of MSOMs is the ability to derive community-level metrics. Because the model estimates site-by-species occupancy probabilities \( \psi_{i,j} \), we can compute the expected species richness at each site simply by summing occupancy probabilities across species:

\[\text{Richness}_j = \sum_i \psi_{i,j}\]

These quantities aren’t model parameters; they’re derived quantities, made available through the full Bayesian framework. The same logic applies to metrics of diversity, beta-diversity, or composition. Inference is then available not only at the species level, but across entire communities.
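
For instance, assuming the posterior occupancy states in the fitted object above are stored as an array with dimensions samples × species × sites, per-site richness can be derived as:

```r
# Sum the latent occupancy states over species within each MCMC sample
rich.samples <- apply(out.ms$z.samples, c(1, 3), sum)   # samples x sites
rich.mean <- colMeans(rich.samples)                     # posterior mean richness
rich.ci <- apply(rich.samples, 2, quantile, probs = c(0.025, 0.975))
```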

However, MSOMs are not without downsides. They are slower to run than single-species models, and the need to format and handle multi-dimensional arrays can introduce complexity. While `spOccupancy` simplifies much of this, users still need to be careful when specifying species order, initial values, and priors. Another challenge is defining the community: Which species should be included? As Pacifici et al. (2014) noted, community boundaries are often fuzzy, especially in mobile or transient systems.

One limitation of standard MSOMs is that they do not model residual species correlations. Each species is treated as conditionally independent, given the covariates and random effects. Yet in reality, species co-occur for reasons beyond shared environment: competition, facilitation, and unmeasured factors all contribute to residual associations.

To address this, the lecture introduced latent factor MSOMs, a hybrid approach that combines the strengths of MSOMs and Joint Species Distribution Models (JSDMs). The core idea is to represent unmeasured influences via latent factors: Hidden variables that account for residual correlations between species.

Instead of explicitly estimating a huge interspecies covariance matrix (which becomes intractable as the number of species grows), we reduce the dimensionality. Each species has a set of factor loadings, representing how strongly it responds to each latent factor. These loadings create a “signature” for each species. Species with similar signatures will tend to co-occur, even if the underlying cause isn’t directly observed.

This model is fitted in `spOccupancy` using the `lfMsPGOcc()` function. The user specifies the number of latent factors \( q \), and the model estimates both the latent factors and species-specific loadings. A common starting point is to use 5–10 latent factors, depending on the size of the community and the amount of variation expected.

The latent factor model provides a second set of derived quantities: The residual correlation matrix between species. This matrix reflects how species co-occur after accounting for known covariates. However, caution is needed in interpreting these correlations. They do not necessarily imply biological interactions. As emphasized in Kéry and Royle (2021), residual correlation may reflect unmeasured covariates, sampling artifacts, or scale mismatches. Still, these matrices are useful for hypothesis generation and model-based ordination.

We concluded with a discussion of priors and constraints. Without restrictions, the factor loadings matrix is not identifiable. A common solution is to fix the diagonal of the loadings matrix to 1 and the upper triangle to 0. The remaining entries receive a \( \text{Normal}(0, 1) \) prior. This ensures model identifiability and interpretability. Ordering of species in the data therefore matters, especially for the first \( q \) species, whose loadings are constrained. Good practice includes placing common, well-detected species early in the list and reordering if convergence is poor.
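
A hedged sketch of the latent factor fit (same placeholder data as before; note the species ordering advice above):

```r
# Species are assumed to be ordered so that common, well-detected
# species occupy the first q rows of data.list$y.
out.lf <- lfMsPGOcc(occ.formula = ~ forest + elev,
                    det.formula = ~ date,
                    data = data.list, n.factors = 3,
                    n.samples = 10000, n.burn = 2000, n.chains = 3)
```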

We extended the multi-species framework introduced earlier by integrating spatial autocorrelation into community models. While single-species spatial occupancy models are already well-established, applying spatial structure across many species simultaneously brings unique challenges and benefits. Jeffrey Doser presented two distinct formulations of spatial MSOMs, highlighting their differences in power, scalability, and ecological interpretation.

The motivation for spatial MSOMs is straightforward. In community models, spatial autocorrelation becomes even more important than in single-species models. Different species respond to different environmental drivers, many of which are unmeasured or spatially structured. If these latent spatial processes are not explicitly modeled, parameter estimates may be biased and predictive performance degraded. Additionally, many species are rare, and their spatial patterns can only be recovered by leveraging spatial borrowing of strength.

The first approach introduced was the Type 1 spatial MSOM, which directly parallels the single-species spatial occupancy model. Here, each species receives its own spatial random effect \( w_i(s_j) \), modeled via a Nearest Neighbor Gaussian Process (NNGP). The occupancy probability for species \( i \) at site \( j \) is defined as:

\[\text{logit}(\psi_{i,j}) = \mathbf{X}_j \boldsymbol{\beta}_i + w_i(s_j)\]

This is essentially a stacked set of independent spatial occupancy models, one for each species, fitted simultaneously. The NNGP formulation ensures computational feasibility by approximating the full Gaussian process through sparse conditioning. However, estimating \( N \) spatial surfaces, one per species, quickly becomes burdensome when the number of species grows. The method is best suited to moderate datasets (e.g., fewer than 10 species and 100–500 locations), where each species has sufficient detections to estimate its own spatial field.

Type 1 spatial MSOMs retain all species-level parameters from the non-spatial version, and the detection model remains unchanged:

\[\text{logit}(p_{i,j,k}) = \mathbf{W}_{j,k} \boldsymbol{\alpha}_i\]

While intuitive and direct, this approach has practical limitations. First, computation becomes slow and memory-intensive for large communities. Second, rare species may not have enough data to reliably estimate a full spatial surface. As a result, this model is only used in select cases, especially when researchers are specifically interested in mapping spatial variation for a small number of well-detected species.

To overcome these limitations, the lecture introduced a more scalable and powerful method: the spatial factor MSOM.

This model builds on the latent factor MSOM introduced earlier, where residual species correlations are explained through a small number of latent variables. The key idea here is to allow these latent variables, called spatial factors, to vary smoothly across space. Each spatial factor \( f_q(s_j) \) is modeled as a spatial process (again using NNGPs for efficiency), and the species-specific spatial effects are reconstructed as weighted sums:

\[w_i(s_j) = \sum_{q=1}^{Q} \lambda_{i,q} f_q(s_j)\]

Here, \( \lambda_{i,q} \) are the species-specific loadings, and \( Q \) is the number of spatial factors, typically much smaller than the number of species. Instead of estimating \( N \) spatial surfaces, we estimate only \( Q \), and then use the loadings to derive species-specific spatial variation. The latent factors act as “missing covariates” that capture shared spatial structure across species.

The occupancy model now takes the form:

\[\text{logit}(\psi_{i,j}) = \mathbf{X}_j \boldsymbol{\beta}_i + \sum_{q=1}^{Q} \lambda_{i,q} f_q(s_j)\]

This spatial factor formulation is implemented in `spOccupancy` via the `sfMsPGOcc()` function. Just like in the non-spatial latent factor model, the user specifies the number of spatial factors, priors on the community-level parameters, and the restrictions required to ensure identifiability. Specifically, the factor loadings matrix \( \Lambda \) must be constrained, typically by fixing the upper triangle to zero and the diagonal to one for the first \( Q \) species.
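
A sketch of the corresponding call (placeholders as before; `data.list` now also includes site coordinates):

```r
out.sf <- sfMsPGOcc(occ.formula = ~ forest + elev,
                    det.formula = ~ date,
                    data = data.list, n.factors = 3,   # Q spatial factors
                    cov.model = "exponential",
                    NNGP = TRUE, n.neighbors = 15,
                    n.batch = 400, batch.length = 25)
```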

This model handles spatial autocorrelation, species correlations, and imperfect detection in a unified framework. It scales to large communities (examples in the lecture included 2,619 sites and 98 species) and enables biodiversity metrics to be derived from posterior distributions. It is also more efficient than Type 1 spatial MSOMs, because far fewer spatial surfaces are estimated.

Moreover, the spatial factor MSOM facilitates model-based ordination, a way of visualizing community structure in reduced dimensions. Because species are represented by their loadings on latent spatial factors, the loadings define a low-dimensional embedding of ecological niches. This can be visualized using ordination plots, revealing patterns of niche partitioning, spatial association, or community clustering.

One important caveat, again, is interpretability. The latent spatial factors are not tied to any specific environmental gradient. They are emergent summaries of residual variation. While this does not reduce their inferential power, it does mean that causal interpretation must be cautious. Still, when combined with covariates, these models offer a rich view of species–environment relationships and spatial community structure.

The lecture concluded with a hands-on exercise using a simulated plant community, designed to illustrate niche partitioning along a spatial gradient. Participants explored how species loadings aligned with the spatial factors, and how the estimated occupancy surfaces varied across the landscape. This exercise reinforced the conceptual power of spatial factor MSOMs: Not just as tools for mapping presence or absence, but as models of ecological process and pattern.

The final layer of complexity in the occupancy modeling framework was the integration of time. While spatial models capture autocorrelation across landscapes, many ecological questions also involve dynamics across seasons or years. Multi-season models allow us to track how species distributions change over time and how occupancy patterns evolve due to colonization, extinction, or other temporal processes.

The goal of a multi-season occupancy model is to understand both spatial and temporal variation in species occurrence. For example, researchers might want to know how bat distributions in the western USA are shifting, whether occupancy is increasing or decreasing over time, or how environmental change is influencing range dynamics. Repeated surveys over multiple seasons provide the data needed to answer these questions.

From a statistical perspective, including multiple seasons improves our ability to estimate spatial random effects. More importantly, it allows us to model both the underlying ecological process and the observation process across time. The data structure follows a robust design: for each of \( J \) sites, data are collected over \( T \) seasons, and within each season \( t \), site \( j \) is visited \( K_{jt} \) times. This structure can accommodate missing data and unbalanced sampling efforts, where some sites are only visited in some years or with varying intensity.

Multi-season occupancy models assume closure within each season. That is, the occupancy state remains fixed for a given site during that season, but changes are allowed between seasons. A site may be occupied in one year and unoccupied in the next, reflecting colonization or local extinction. This distinction underpins the conceptual difference between single- and multi-season models.

Three types of multi-season models were discussed.

First, the dynamic occupancy model, introduced by MacKenzie et al. (2003), explicitly models colonization and extinction between seasons. These models are more mechanistic, treating occupancy status as a Markov process that evolves through time:

\[\begin{aligned}
z_{j,t} \mid z_{j,t-1} = 0 &\sim \text{Bernoulli}(\gamma_{j,t}) && (\text{colonization}) \\
z_{j,t} \mid z_{j,t-1} = 1 &\sim \text{Bernoulli}(1 - \epsilon_{j,t}) && (\text{persistence})
\end{aligned}\]

These models are conceptually rich but require more data to estimate transitions reliably, particularly when species are rare or detection is low.

Second, the stacked occupancy model simplifies things by treating each site-season combination as an independent unit. In this framework, the site-year is considered the “sampling unit,” and a standard single-season occupancy model is fitted across all combinations. This design sacrifices the mechanistic interpretation of colonization and extinction but is easier to fit and often more robust when data are sparse. Site-level random effects are used to account for repeated measures at the same location over time, reducing pseudoreplication:

\[\text{logit}(\psi_{j,t}) = \mathbf{X}_{j,t} \boldsymbol{\beta} + \alpha_j,\]

where \( \alpha_j \sim \text{Normal}(0, \sigma^2_\text{site}) \) is a site-level random effect.

Third, spatio-temporal occupancy models extend the stacked framework by including spatial and temporal random effects to capture autocorrelation across both dimensions. These models are especially useful for prediction and trend estimation, offering flexible structures that can accommodate nonlinear trends or autocorrelated processes.

Spatial effects are modeled using Gaussian processes or their scalable approximations. Temporal effects can be included in one of two ways:

  1. Unstructured temporal random effects, which assign each year a separate intercept.

  2. AR(1) processes, where the temporal random effects follow an autoregressive structure:

\[\eta_t \sim \text{Normal}(\rho \eta_{t-1}, \sigma^2_\text{time})\]

This formulation allows occupancy probability to vary smoothly over time, capturing temporal dependence in species dynamics.

In the `spOccupancy` package, dedicated functions implement these models, such as `tPGOcc()` for non-spatial and `stPGOcc()` for spatial multi-season occupancy models.

All multi-season models in `spOccupancy` rely on adaptive Metropolis-Hastings sampling. As in earlier spatial models, users must specify the number of MCMC batches and batch length. Temporal and spatial effects are modeled independently in these “separable” models, though recent developments in nonseparable models allow more complex interactions between time and space.

The lecture emphasized that spatial and temporal random effects are not just statistical adjustments. They represent latent ecological processes (unmeasured habitat variation, dispersal limitation, climate anomalies) that drive real species distributions. Properly accounting for these structures improves both inference and prediction.

One of the most exciting developments in this area is the ability to estimate spatially varying trends in occupancy. For example, we may want to know whether a species is declining in one region while increasing in another. This is possible by modeling interactions between spatial coordinates and time, or by allowing regression coefficients to vary across space:

\[\text{logit}(\psi_{j,t}) = \beta_0 + \beta_1(t) + f(s_j),\]

where \( \beta_1(t) \) may be time-varying, and \( f(s_j) \) is a spatial effect. These models can be used to generate maps of temporal change, showing where species are and where they are going.

We pushed the boundaries of spatial occupancy modeling by introducing spatially-varying coefficient (SVC) models. While previous approaches focused on modeling residual spatial autocorrelation, usually via a spatially-varying intercept, SVC models allow any coefficient in the model to vary smoothly across space. This is a powerful extension, especially for ecologists interested in understanding how species-environment relationships differ across geographic regions.

The central insight behind SVC models is simple: Species-environment relationships are often nonstationary. That is, the effect of a covariate on species occurrence may differ in different parts of the range. These variations can arise from many sources: Local climate, historical disturbance regimes, fine-scale habitat structure, soil characteristics, or species interactions. For example, a temperature gradient may have very different effects at the core of a species’ range versus its edges. These context-dependent responses are widespread in ecology and require models that allow for such flexibility.

Formally, spatially-varying species–environment relationships allow the slope coefficient itself to become a spatial surface. Consider a general occupancy model with one covariate \( x(s_j) \) at site \( s_j \). A standard model might define:

\[\text{logit}(\psi_j) = \beta_0 + \beta_1 x(s_j)\]

In an SVC model, this becomes:

\[\text{logit}(\psi_j) = \beta_0 + \left[\beta_1 + w_1(s_j)\right] x(s_j)\]

Here, \( w_1(s_j) \) is a spatial adjustment to the covariate coefficient, modeled using a Gaussian process. This term allows the strength—and even the direction—of the relationship between \( x \) and occupancy to change across the landscape.

To build toward this, the lecture reviewed five common approaches for modeling variation in species–environment relationships:

  1. Linear models, where the covariate effect is fixed across space.

  2. Quadratic models, which introduce curvature (e.g., unimodal responses).

  3. Stratum-specific models, which assign different effects to discrete regions (e.g., habitat types), but only allow coarse-grained variation.

  4. Interaction models, where a covariate’s effect depends on another variable (e.g., temperature × vegetation type), but the interaction is still explicitly defined.

  5. SVC models, which treat the interaction variable as unknown and let the data determine how relationships vary over space.

Among these, SVC models are the most flexible. They generalize the stratum model by allowing each site to have its own slope coefficient, with smoothing imposed by a spatial correlation structure. They also generalize the interaction model by letting the interaction structure emerge from the data, rather than being pre-specified.

The SVC model shares the same detection submodel as previous occupancy models, but the ecological submodel now includes multiple spatial processes: One for each spatially-varying coefficient (SVC), plus one for the spatially-varying intercept (SVI). Each of these random surfaces is modeled using a Nearest Neighbor Gaussian Process (NNGP), with its own spatial decay and variance parameters. This flexibility comes at a computational cost, but also enables much richer inference.

Implementation in `spOccupancy` is done using the `svcPGOcc()` function for single-season models. Users specify which covariates should be treated as spatially varying. Each SVC can receive its own priors, and separate spatial parameters are estimated for each. Because these models are data-hungry, it is recommended to use more informative priors on the spatial decay parameter \( \phi \), particularly to avoid fitting extremely small effective spatial ranges. For example, setting the upper bound of \( \phi \) based on the 25th percentile of intersite distances helps stabilize inference.
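
A sketch of an SVC fit (covariate names are placeholders; the prior bound on \( \phi \) would be passed through the `priors` argument):

```r
# Candidate upper bound for phi: effective range no smaller than the
# 25th percentile of intersite distances (effective range ~ 3 / phi).
phi.ub <- 3 / quantile(as.numeric(dist(data.list$coords)), 0.25)
out.svc <- svcPGOcc(occ.formula = ~ tmax + forest,
                    det.formula = ~ date,
                    data = data.list,
                    svc.cols = c("(Intercept)", "tmax"),  # SVI plus one SVC
                    cov.model = "exponential",
                    NNGP = TRUE, n.batch = 400, batch.length = 25)
```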

The lecture emphasized the practical uses of SVC models, especially for mapping spatial variation in species–environment relationships. Using the `predict()` function, one can generate maps of the spatially-varying coefficients. These maps show how the effect of a covariate changes across space, which can inform conservation decisions, identify ecological thresholds, or detect regions of species vulnerability.

Several extensions are supported:

  • `svcMsPGOcc()` fits multi-species SVC occupancy models using a spatial factor approach.

  • `svcTPGOcc()` fits multi-season SVC models for single species.

  • `svcTMsPGOcc()` fits multi-season, multi-species SVC models.

But one shouldn’t assign spatial variation to every covariate by default. Spatially-varying coefficients (SVCs) should only be used when there is a clear ecological rationale suggesting that the effect of a covariate differs across space. It is good practice to compare SVC models to simpler, nested alternatives that are grounded in explicit hypotheses. A spatially-varying intercept should always be included to account for underlying spatial structure in the baseline. Once fitted, results should be visualized and interpreted in ecological context, with particular attention to the scale and direction of spatial variation.

Abundance Models

We introduced a different but related modeling framework: spatial generalized linear mixed models (GLMMs) for abundance estimation. Unlike occupancy models, which only estimate presence or absence, GLMMs for abundance aim to model count data directly. This includes relative abundance models, where detection is not explicitly modeled, but where we still want to understand spatial and environmental drivers of population size.

There are good reasons to estimate abundance instead of, or in addition to, occupancy. Abundance patterns often reflect more fine-scale ecological processes. Abundance is also the foundation for conservation assessments, extinction risk evaluations, sustainable harvest estimates, and quantification of ecosystem services. Moreover, abundance can vary within areas where occupancy remains constant. Two populations may both be present, but one may be barely holding on while the other is thriving.

Absolute abundance is the total number of individuals in an area, but it is difficult to measure directly. Most ecological surveys do not meet the assumptions required for absolute abundance estimation, such as in distance sampling or capture-recapture. Instead, many studies rely on relative abundance: An index that is assumed to be proportional to true abundance. Relative abundance can be modeled using standard count data, provided that survey effort is standardized and detection biases are minimized or accounted for with covariates.

The most basic count model is the Poisson, which assumes that the variance equals the mean. In practice, this is often too restrictive, so a Negative Binomial (NB) model is commonly used instead. It includes a dispersion parameter \( k \), where smaller values indicate more overdispersion. When \( k \to \infty \), the model converges to the Poisson.

For very large counts or continuous abundance data like biomass, Gaussian GLMMs may also be appropriate. In all of these models, abundance is modeled as a function of covariates with an appropriate link function, typically log or identity.

\[y_j \sim \text{Poisson}(\lambda_j), \quad \log(\lambda_j) = \mathbf{X}_j \boldsymbol{\beta} + \alpha_j\]

where \( y_j \) is the observed count at site \( j \), \( \mathbf{X}_j \) are site-level covariates, and \( \alpha_j \) is a random effect—possibly spatial.

Importantly, detection probability is not modeled separately in these GLMMs. This means that the resulting estimates reflect a mixture of detection and true abundance. If changes in detection are minimal across space, then relative abundance patterns primarily reflect real ecological variation. If detection varies strongly and unpredictably, however, the results can be misleading. For this reason, GLMMs for abundance should include covariates or random effects that capture variation in detection, such as observer ID or survey conditions.

The spatial extension of GLMMs follows the same principles discussed in earlier occupancy lectures. A spatially structured random effect is added to the model, modeled using a Nearest Neighbor Gaussian Process (NNGP):

\[\log(\lambda_j) = \mathbf{X}_j \boldsymbol{\beta} + w(s_j)\]

where \( w(s_j) \) is the spatial random effect at location \( s_j \), governed by an exponential covariance function with a spatial decay parameter \( \phi \) and variance \( \sigma^2 \). This allows for spatial smoothing and improved prediction at unsampled locations.

Models are fitted using the `spAbundance` package, which mirrors the syntax and structure of `spOccupancy`. However, the MCMC samplers used for GLMMs (e.g. Poisson and NB) are based on adaptive Metropolis-Hastings rather than Pólya-Gamma, which makes them less efficient and slower to converge. As a result, users often need to run longer chains: More samples and more batches to ensure convergence.

For a typical run, the user specifies:

  • Number of MCMC batches

  • Batch length

  • Tuning variance

  • Target acceptance rate (usually 0.43)

The sampler automatically adjusts the tuning variance during the run to maintain the desired acceptance rate. This is done every 25 iterations (one batch), and the process repeats across all batches. If the acceptance rate is too low, the tuning variance is increased; if too high, it is decreased.

Spatial GLMMs can be fitted using the `spAbund()` function. Non-spatial GLMMs use `abund()`. The same framework supports different likelihoods: Poisson, NB, Gaussian, and zero-inflated Gaussian. The models are not limited to abundance; GLMMs can be applied to any response variable, making them versatile tools in applied ecology.
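
A sketch of a spatial NB fit with `spAbund()` (placeholder covariates; initial tuning variances would be set via the `tuning` argument):

```r
library(spAbundance)
out.nb <- spAbund(formula = ~ scale(forest) + scale(elev),
                  data = data.list, family = "NB",
                  cov.model = "exponential",
                  NNGP = TRUE, n.neighbors = 15,
                  accept.rate = 0.43,
                  n.batch = 1000, batch.length = 25)
```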

One key challenge in spatial GLMMs is the confounding between overdispersion and spatial autocorrelation. In a spatial NB GLMM, two components account for extra variance: the NB dispersion parameter \( k \) and the spatial random effect. If both try to explain the same variation, the model may fail to converge. To avoid this, informative priors should be placed on the spatial decay parameter \( \phi \). A recommended approach is to set the upper bound of the uniform prior on \( \phi \) to \( 3 / q_{0.15} \), where \( q_{0.15} \) is the 15th percentile of all pairwise site distances.
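
Computing that bound takes two lines of base R (assuming `data.list$coords` holds the site coordinates):

```r
q15 <- quantile(as.numeric(dist(data.list$coords)), 0.15)  # 15th percentile distance
phi.ub <- 3 / q15  # upper bound; pass via, e.g., priors = list(phi.unif = c(lower, phi.ub))
```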

After single-species models, the lecture moved to multi-species abundance models, which are structured similarly to multi-species occupancy models. These GLMMs are sometimes called multivariate GLMMs or abundance-based Joint Species Distribution Models (JSDMs). As in MSOMs, each species’ regression coefficients are drawn from community-level distributions, allowing for borrowing of strength across the community. There are three main formulations:

  1. `msAbund()` – Multi-species model without residual species correlation.

  2. `lfMsAbund()` – Latent factor model that accounts for species correlation.

  3. `sfMsAbund()` – Spatial factor model that includes both spatial autocorrelation and species correlation.

These models allow users to estimate relative abundance across communities, generate biodiversity metrics like richness and Shannon diversity, visualize community structure via ordination, and assess co-occurrence patterns. The lecture ended with two exercises: modeling relative abundance of the northern cardinal across North Carolina using a spatial GLMM, and estimating Shannon diversity across multiple sites with a spatial multi-species GLMM.

We returned to the topic of abundance, but this time under a very different framework. While GLMMs for count data model relative abundance, N-mixture models aim to estimate true latent abundance by explicitly accounting for imperfect detection. These models are an extension of occupancy models and rely on repeated count surveys of unmarked individuals to separate the detection process from the ecological process.

The premise is elegant: just as repeated detection/non-detection surveys allow us to estimate occupancy in the presence of imperfect detection, repeated counts at each site can be used to infer how many individuals were likely present, even if we missed some of them on each visit. This is the heart of the N-mixture model.

Let \( N_j \) be the latent abundance at site \( j \). During each visit \( k \), we record a count \( y_{j,k} \), which is assumed to follow a Binomial distribution:

\[y_{j,k} \sim \text{Binomial}(N_j, p)\]

Here, \( p \) is the detection probability, and the assumption is that \( N_j \) individuals were available for detection, but each had an independent probability \( p \) of being observed on that visit. The latent abundance \( N_j \) itself is modeled with a Poisson or Negative Binomial prior:

\[N_j \sim \text{Poisson}(\lambda_j)\]

with

\[\log(\lambda_j) = \mathbf{X}_j \boldsymbol{\beta}\]

This hierarchical structure allows us to estimate both the detection process and the true abundance distribution across sites.
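A quick base-R simulation makes the two linked levels explicit (all values illustrative):

```r
# Simulating the N-mixture data-generating process described above
set.seed(42)
J <- 150; K <- 3                     # sites and repeat visits
x <- rnorm(J)                        # one site-level covariate
lambda <- exp(0.5 + 0.8 * x)         # expected abundance
N <- rpois(J, lambda)                # latent true abundance per site
p <- 0.4                             # per-individual detection probability
y <- matrix(rbinom(J * K, size = rep(N, K), prob = p),
            nrow = J, ncol = K)      # repeated counts y[j, k] ~ Binomial(N[j], p)
```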

However, N-mixture models come with strong assumptions:

  1. Closure within the sampling period (no births, deaths, immigration, or emigration between visits).

  2. No false positives (all detections must be true individuals).

  3. Independence of detections.

  4. No individual heterogeneity in detection probability (every individual at a site has the same \( p \)).

  5. Correct distributional assumptions (the Poisson or NB must represent the data-generating process).

In practice, these assumptions are often violated or approximated, and the robustness of the model depends heavily on how closely reality follows these rules.

The implementation of N-mixture models in `spAbundance` is straightforward using the `NMix()` function. Users can choose Poisson or Negative Binomial distributions for \( N_j \). The model allows both random intercepts and slopes, though only as uncorrelated effects. Compared to occupancy models, convergence is slower and requires longer MCMC runs. This is because we are now estimating a latent integer variable for each site, making the parameter space larger and the likelihood more complex.
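A sketch of the corresponding call, assuming a J x K count matrix `y` and covariate lists `abund.covs` and `det.covs` (names follow the usual `spAbundance` conventions and should be checked against the documentation):

```r
library(spAbundance)

# Hedged sketch of a single-species N-mixture fit with NB abundance
data.list <- list(y = y, abund.covs = abund.covs, det.covs = det.covs)
out.nmix <- NMix(abund.formula = ~ scale(elev),   # abundance model (illustrative)
                 det.formula = ~ scale(effort),   # detection model (illustrative)
                 data = data.list,
                 family = 'NB',                   # negative binomial abundance
                 n.batch = 800, batch.length = 25,
                 accept.rate = 0.43,
                 n.burn = 10000, n.thin = 10, n.chains = 3)
summary(out.nmix)
```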

The lecture then introduced the spatial extension of N-mixture models. Just like in previous spatial occupancy and abundance models, we can add a spatially structured random effect to the model for \( \lambda_j \), using a Gaussian process (or NNGP) to capture spatial autocorrelation:

\[\log(\lambda_j) = \mathbf{X}_j \boldsymbol{\beta} + w(s_j)\]

Here, \( w(s_j) \) is the spatial random effect at site \( j \), modeled with an exponential decay function. Importantly, spatial autocorrelation is only added to the abundance part of the model (not the detection component).

One advantage of N-mixture models over occupancy models is that count data provide more information, which often leads to better estimation of the spatial surface. Compared to binary detection/non-detection data, counts are more informative, especially in estimating spatial covariance parameters like the decay rate \( \phi \) and variance \( \sigma^2 \). The result is that spatial random effects are often estimated with lower uncertainty in N-mixture models than in occupancy models with the same number of sites.

That said, the flexibility of N-mixture models comes with considerable challenges, beginning with the question of model fit.

Posterior predictive checks (PPCs) are critical for evaluating model fit. The standard approach is to:

  1. Fit the model and draw replicate datasets from the posterior predictive distribution.

  2. Optionally bin the data (e.g., total counts across visits or sites).

  3. Compute fit statistics (e.g., mean, variance) for the observed and replicated datasets.

  4. Compare the distributions (large discrepancies indicate lack of fit).

In `spAbundance`, both conditional and marginal replicate datasets can be generated. Conditional replicates are based on the sampled values of \( N_j \), while marginal replicates involve simulating new \( N_j \) values from the prior. Marginal PPCs are generally more sensitive and should be preferred when checking overall model fit.
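In code, such a check takes only a couple of lines; a sketch, assuming `out.nmix` is a fitted N-mixture object, and noting that the `type` argument for choosing marginal versus conditional replicates is our reading of the lecture's description (verify against the package documentation):

```r
# Posterior predictive check with a Freeman-Tukey fit statistic,
# binning the data by site (group = 1); the resulting Bayesian p-value
# should be near 0.5, with values near 0 or 1 signaling lack of fit
ppc.out <- ppcAbund(out.nmix, fit.stat = 'freeman-tukey', group = 1,
                    type = 'marginal')  # assumed argument for marginal replicates
summary(ppc.out)
```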

WAIC is available for model comparison using `waicAbund()`, but it is computationally demanding due to the need to integrate over latent \( N_j \). Despite this, it provides a principled way to compare alternative models, such as Poisson versus NB, provided the chains have converged and the models fit the data well.
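Usage is a one-liner per fitted model; a sketch comparing hypothetical Poisson and NB fits (`out.pois`, `out.nb`):

```r
# Lower WAIC indicates better expected predictive performance;
# integrating over the latent N_j makes this call slow for large datasets
waicAbund(out.pois)
waicAbund(out.nb)
```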

The most important section of the lecture was the discussion of difficulties with N-mixture models. While the theory is elegant, the practice can be treacherous. When assumptions are met and the data are not overly dispersed, spatial N-mixture models can yield reliable abundance estimates. But when overdispersion is present, especially in detection, the model can break down.

A central issue is identifiability. It is extremely difficult to distinguish between overdispersion in detection and overdispersion in abundance. Knape et al. (2018) showed that a beta-binomial Poisson N-mixture model and a binomial NB N-mixture model are nearly indistinguishable. This means the model may attribute detection noise to abundance variance, or vice versa, leading to inflated and implausible estimates of abundance.

Moreover, when closure is violated, the model can severely overestimate true abundance. This happens frequently in mobile species, or when counts are affected by transient pulses of activity. Kéry (2010) showed that estimates can easily become unreasonably high if closure fails and the model is forced to explain excessive variation.

As a result, spatial N-mixture models are arguably the most sensitive and difficult to fit among the models in `spAbundance`. They may require highly informative priors to converge. They are also prone to producing misleading results when used on overdispersed or messy data.

Several practical solutions were discussed. One is to use more informative priors on spatial or dispersion parameters to reduce identifiability issues. Another is to consider modeling relative abundance using spatial GLMMs instead, especially if the closure assumption is suspect. It is also important to run extensive posterior predictive checks and simulations to validate model behavior.

N-mixture models are powerful tools, but with that power comes responsibility. When used carefully, they offer a way to bridge the gap between detection-limited data and estimates of true ecological abundance. But when misused or blindly applied, they can lead to highly biased or even misleading conclusions.

We introduced the most complex model class presented so far: spatial multi-species N-mixture models. These models extend the hierarchical abundance estimation framework to entire communities, integrating both species correlations and spatial structure while still accounting for imperfect detection. They represent the synthesis of nearly every idea discussed throughout the course: latent abundance, repeated counts, spatial autocorrelation, and species community structure, woven into a single, high-dimensional model.

As with occupancy models, many surveys, particularly point counts and transect surveys, record data for multiple species. These community-level datasets can be analyzed using a hierarchical structure in which species-specific parameters are modeled as random effects drawn from shared distributions. This allows for borrowing strength across species, which is especially valuable for rare or poorly sampled species.

The basic setup mirrors the single-species N-mixture model. For each species \( i \), site \( j \), and visit \( k \), we model the observed count \( y_{i,j,k} \) as

\[y_{i,j,k} \sim \text{Binomial}(N_{i,j}, p_{i,j,k})\]

where \( N_{i,j} \) is the latent abundance of species \( i \) at site \( j \), and \( p_{i,j,k} \) is the detection probability on visit \( k \). The abundance itself is typically modeled using a Poisson or Negative Binomial distribution, i.e.:

\[N_{i,j} \sim \text{Poisson}(\lambda_{i,j}), \quad \log(\lambda_{i,j}) = \mathbf{X}_{j} \boldsymbol{\beta}_i\]

Species-level coefficients \( \boldsymbol{\beta}_i \) are then modeled hierarchically as random effects from a shared community-level distribution, allowing for community-level inference as well as species-specific results.

Three main formulations of multi-species N-mixture models were introduced, all implemented in the `spAbundance` package.

  1. Multi-species N-mixture models (`msNMix()`) model each species separately but share species-level effects through a hierarchical prior. No residual correlations between species are included. These are the most stable and fastest to fit, and work well when interspecies correlations are weak or not of interest.

  2. Latent factor multi-species N-mixture models (`lfMsNMix()`) introduce a set of latent variables that capture residual species correlations. These latent factors are shared across species, and each species receives a loading vector that determines how it responds to each factor. This structure allows for co-occurrence modeling, ordination, and improved inference when species show correlated responses.

  3. Spatial factor multi-species N-mixture models (`sfMsNMix()`) extend the latent factor model by adding spatial autocorrelation. Each latent factor is now modeled as a spatial process (typically a Nearest Neighbor Gaussian Process, NNGP), which allows both residual species correlation and spatial structure to be modeled simultaneously. This is the most powerful and flexible formulation, capable of handling complex communities with spatial structure and correlated responses.

Notably, there is no non-factor spatial multi-species model in the package. This is because modeling separate spatial surfaces for each species, as done in `spMsPGOcc()` for occupancy, becomes computationally infeasible for count data. The algorithms scale poorly with large count data due to heavier likelihood evaluation and slower mixing. Simulations in Doser et al. (2023) showed that the spatial factor model performs comparably even when species correlations are absent, so it’s the preferred default.
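As a rough sketch of what a spatial factor fit might look like, assuming a three-dimensional count array `y.arr` (species x sites x visits), covariate lists `abund.covs` and `det.covs`, and coordinates `coords` (argument and data-list names follow the usual `spAbundance` conventions and should be verified against the documentation):

```r
library(spAbundance)

# Hedged sketch: spatial factor multi-species N-mixture model;
# covariate names are illustrative
data.list <- list(y = y.arr, abund.covs = abund.covs,
                  det.covs = det.covs, coords = coords)
out.sf <- sfMsNMix(abund.formula = ~ scale(forest),
                   det.formula = ~ scale(date),
                   data = data.list,
                   n.factors = 3,                      # spatial latent factors
                   cov.model = 'exponential',
                   NNGP = TRUE, n.neighbors = 15,
                   n.batch = 2000, batch.length = 25,  # long runs are often needed
                   accept.rate = 0.43, n.chains = 3)
summary(out.sf)
```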

As with other model classes, fitting these models comes with caveats. Posterior convergence can be slow, especially for overdispersed count data. Multi-species N-mixture models may require hundreds of thousands of MCMC iterations to reach convergence. They are also sensitive to initial values: a bad starting point can lead to poor mixing or outright convergence failure. To deal with this, the lecture recommended:

  • First running the model for a small number of iterations to generate sensible initial values.

  • Then running the full model with multiple chains, using the saved initial values.

  • Using the terminal or script-based runs to parallelize chains and reduce computational burden.

All the usual constraints and prior recommendations from earlier lectures apply here. Informative priors for the spatial decay parameter \( \phi \), careful ordering of species for latent factor models, and convergence diagnostics like trace plots and effective sample size are all critical.

We then introduced hierarchical distance sampling, a method designed to estimate true abundance or density while explicitly accounting for imperfect detection. Unlike count-based GLMMs, and unlike N-mixture models that rely on repeated counts, distance sampling uses spatial information, specifically the distance between the observer and each detected individual, to model detectability. This additional structure allows for stronger identification of abundance, even when detection rates are low or highly variable.

Distance sampling is widely used in field ecology. The core idea is that detection probability decreases with distance: an animal standing right next to the observer is almost always seen, while one at the edge of the detection zone is easily missed. By recording the distance to each detected individual, ecologists can build a detection function \( g(x) \) that models this relationship. A key assumption is that detection is perfect at zero distance, i.e., \( g(0) = 1 \), and that detection probability monotonically decreases with increasing distance.

Distance data can be collected as either continuous distances or binned into discrete intervals. In practice, most modeling, including all functions in `spAbundance`, uses binned data. This simplifies the likelihood and allows modeling via multinomial distributions. For example, a survey might record detections in bins such as 0–25 m, 25–50 m, and so on. Each bin records how many individuals were observed within that range. Binned distances can be derived from continuous measurements after data collection.
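For instance, continuous measurements can be binned after the fact with base R (illustrative values):

```r
# Binning continuous distances (in meters) into 25 m intervals
dists <- c(3.2, 41.7, 12.5, 88.0, 55.3)   # distances to detected individuals
breaks <- seq(0, 100, by = 25)            # bin edges: 0-25, 25-50, 50-75, 75-100 m
table(cut(dists, breaks = breaks, include.lowest = TRUE))  # detections per bin
```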

A key quantity in distance sampling is the detection probability \( p \), which is not constant but a function of both the detection function \( g(x) \) and the distribution of individuals with respect to the observer. In most applications, the latter is assumed to be uniform: individuals are equally likely to occur at any distance from the transect line or point count center.

Two main detection functions are supported in `spAbundance`:

  1. The half-normal, where detection probability drops off smoothly and symmetrically around the observer.

  2. The negative exponential, where detection decays rapidly with distance.

Each function has a scale parameter \( \sigma \), which determines how quickly detection declines. Covariates can influence \( \sigma \), allowing detection probability to vary by time, observer, habitat, or weather. A typical (half-normal) detection function is parameterized as:

\[g_j(x) = \exp\left(-\frac{x^2}{2\sigma_j^2}\right), \quad \log(\sigma_j) = \mathbf{V}_j \boldsymbol{\alpha}\]

where \( \mathbf{V}_j \) are detection covariates for site \( j \) and \( \boldsymbol{\alpha} \) are the corresponding coefficients.
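Given \( \sigma \) and the uniform-distribution assumption, the per-bin detection probabilities follow by averaging \( g(x) \) over each bin; a sketch for the line-transect case, where perpendicular distances are uniform out to the truncation distance (for point counts, the integrand would instead be weighted by \( 2x/B^2 \) to reflect the growth of area with distance):

```r
# Averaging the half-normal detection function over each distance bin,
# assuming individuals are uniform in distance out to the truncation
# distance B (line-transect case); all values are illustrative
g <- function(x, sigma) exp(-x^2 / (2 * sigma^2))  # half-normal detection function
sigma <- 40                                        # scale parameter (meters)
breaks <- seq(0, 100, by = 25)                     # distance bin edges
B <- max(breaks)                                   # truncation distance
pi.det <- sapply(seq_len(length(breaks) - 1), function(k) {
  integrate(g, breaks[k], breaks[k + 1], sigma = sigma)$value / B
})
pi.det        # P(individual falls in bin k and is detected)
sum(pi.det)   # overall detection probability p
```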

Finally, we introduced the extension of hierarchical distance sampling (HDS) models from single species to multi-species frameworks. Many ecological surveys, such as transect counts or point counts, collect data simultaneously on multiple species. Just as with occupancy models, GLMMs, and N-mixture models, the single-species HDS model can be generalized to account for community-level structure.

In multi-species distance sampling data, each observation is organized by species \( i = 1, 2, \ldots, I \), sites \( j = 1, 2, \ldots, J \), and distance bins \( k = 1, 2, \ldots, K \). For each site and species, the number of individuals detected within each distance interval is recorded. This structure allows the detection process to be modeled explicitly through detection functions, while abundance is estimated hierarchically.

Three main types of multi-species HDS models were presented, all available in the `spAbundance` package:

  1. Basic multi-species HDS models (`msDS()`), where species are modeled hierarchically as random effects but no residual species correlations are included.

  2. Latent factor multi-species HDS models (`lfMsDS()`), where species are modeled hierarchically and residual correlations are captured through latent factors.

  3. Spatial factor multi-species HDS models (`sfMsDS()`), which extend the latent factor approach by adding spatial autocorrelation, allowing the model to capture both species correlations and spatial structure simultaneously.

The general structure of the models follows what has been seen earlier. For species \( i \) at site \( j \), the latent abundance \( N_{i,j} \) is modeled as

\[N_{i,j} \sim \text{Poisson}(\lambda_{i,j})\]

with a log-linear model for abundance,

\[\log(\lambda_{i,j}) = \mathbf{X}_j \boldsymbol{\beta}_i\]

where \( \mathbf{X}_j \) represents site-level covariates and \( \boldsymbol{\beta}_i \) are species-specific parameters. Observed counts in distance bins follow a multinomial distribution conditional on abundance and the detection function:

\[\mathbf{y}_{i,j} \sim \text{Multinomial}(N_{i,j}, \boldsymbol{\pi}_{i,j})\]

with \( \boldsymbol{\pi}_{i,j} \) representing the probabilities of detection in each distance bin. These probabilities are derived from a detection function \( g(x) \), such as the half-normal, which links the likelihood of detection to distance from the observer.
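Concretely, once a "not detected" outcome is appended so the cell probabilities sum to one, the observation model can be simulated directly (illustrative numbers):

```r
# Simulating the multinomial observation model above for one species-site
# combination; the appended final cell collects undetected individuals
set.seed(7)
N <- 30                                  # latent abundance N_{i,j}
pi.det <- c(0.35, 0.20, 0.10, 0.05)      # detection probability per distance bin
probs <- c(pi.det, 1 - sum(pi.det))      # add the "not detected" outcome
counts <- rmultinom(1, size = N, prob = probs)
y <- counts[seq_along(pi.det)]           # observed binned counts y_{i,j}
```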

In the latent factor formulation, species-specific coefficients \( \boldsymbol{\beta}_i \) are not independent but linked through a lower-dimensional set of latent variables. Each species has loadings on these latent factors, enabling the model to capture shared ecological responses or unmeasured gradients. The spatial factor model builds on this by modeling each latent factor as a spatial process, allowing for residual spatial autocorrelation in abundance across sites.

The lecture concluded with an applied example estimating forest bird density in the northeastern United States using a spatial factor multi-species HDS model. Students worked through the process of loading multi-species count data, binning detections into distance intervals, specifying covariates, and fitting the model. The example highlighted how the models can recover both species-specific density estimates and community-level spatial patterns, offering insights into biodiversity across landscapes.

This session tied together the main themes of the workshop. By the end, it was clear that multi-species HDS models allow ecologists to move from single-species inference to full community-level analysis, incorporating detection, abundance, correlation, and spatial structure in one unified framework.