BART (Bayesian Additive Regression Trees) Workshop

Sep 22-24, 2025

From September 22 to 24, 2025, Jeremy Yoder and Colin Carlson led a workshop on Bayesian Additive Regression Trees (BART) and their application to species distribution modeling. The sessions combined conceptual explanations with hands-on coding, using the R packages `embarcadero` and `dbarts`. We worked through a complete workflow, starting from classical species distribution models and moving toward Bayesian tree ensembles with hierarchical extensions. The workshop did not require prior experience with BART or boosted trees, but a basic understanding of regression models in R was expected.

Takeaways

Across three days, the workshop built a continuous path from classical regression models to Bayesian tree ensembles, connecting ecological intuition with modern statistical tools. Through a mix of lectures and live coding in R, we worked step by step from basic species distribution models toward BART and its random-intercept extensions. Using the Joshua tree (Yucca brevifolia) dataset as a running example, we compared how different modeling approaches capture environmental structure across the southwestern United States.

Each session deepened the same idea: ecological responses are nonlinear, interactive, and uncertain, and modeling should reflect this complexity rather than simplify it away. BART provided a natural framework for that goal, combining the flexibility of decision trees with the interpretability of regression and the rigor of Bayesian inference.

We learned BART modeling as a way to move beyond the constraints of linear models, using ensembles of regression trees to uncover nonlinear relationships, capture interactions automatically, and quantify uncertainty while accounting for regional structure in the data.

Species Distribution Modeling (SDM)

The first lecture set the foundation for everything that followed. Before jumping into Bayesian frameworks or tree ensembles, we revisited the basic structure of species distribution models (SDMs) and what they actually estimate. The guiding question was simple: given a set of environmental conditions, what is the probability that a species occurs at a given site?

Model structure

At its simplest, an SDM models a binary response variable \(Y_i\), representing presence (1) or absence (0) at site \(i\), as a function of environmental predictors \(\mathbf{X}_i\). In classical form:

\[Y_i \sim \text{Bernoulli}(p_i)\]

and

\[\text{logit}(p_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki},\]

where \(\beta_0\) is the intercept, \(\beta_j\) are the coefficients for predictors \(X_j\), and \(p_i\) is the estimated probability of presence.

This logistic regression framework assumes that each predictor contributes linearly on the logit scale and that effects combine additively. Both assumptions are restrictive for ecological data, where predictors often interact or have nonlinear responses.

Presence/Absence and Pseudo-Absences

The Joshua tree (Yucca brevifolia) dataset provided an ideal test case. True absence data were unavailable, so pseudo-absences were generated by randomly sampling points outside known occurrences. This allowed us to fit a model despite incomplete detection. The response variable \(Y_i\) thus reflected both presences and pseudo-absences, forming a quasi-Bernoulli sample.

However, pseudo-absences introduce a sampling bias because the background distribution does not necessarily represent true absences. If pseudo-absences are drawn from regions with distinct environmental conditions, the estimated coefficients \(\beta_j\) may be skewed. The instructors emphasized that understanding this bias is crucial when interpreting parameter estimates or predicted probabilities.
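As a concrete illustration, here is a minimal pseudo-absence sketch. The object names `presences` and `background` are hypothetical placeholders for a data frame of occurrence records and a data frame of candidate background points, each carrying the same environmental columns:

```r
# Minimal pseudo-absence sketch: sample background points at random and
# combine them with presences into one presence/pseudo-absence data frame.
set.seed(42)
n_pa <- nrow(presences)                                 # 1:1 presence:background ratio
pseudo_abs <- background[sample(nrow(background), n_pa), ]

dat <- rbind(
  transform(presences,  y = 1),                         # known occurrences
  transform(pseudo_abs, y = 0)                          # sampled background points
)
```

How `background` is delimited (the whole study region, a buffer around occurrences, etc.) is exactly where the sampling bias discussed above enters.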

Environmental Predictors

Predictors were derived from BioClim climatic layers; for our modelling set we used:

  • Mean annual temperature (\(\text{BIO1}\))

  • Annual precipitation (\(\text{BIO12}\))

  • Temperature seasonality (\(\text{BIO4}\))

  • Precipitation of the driest quarter (\(\text{BIO17}\))

Each \(X_j\) was scaled before fitting, making coefficients \(\beta_j\) comparable in magnitude. We also discussed the ecological meaning of each variable. For instance, \(\text{BIO12}\) often reflects water limitation, while \(\text{BIO4}\) can indicate sensitivity to temperature variability.
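A minimal sketch of this step, assuming the `dat` data frame from the pseudo-absence sketch and hypothetical column names `bio1`, `bio4`, `bio12`, and `bio17` for the four BioClim layers:

```r
# Scale each predictor, then fit the logit-linear SDM described above.
vars <- c("bio1", "bio4", "bio12", "bio17")
dat[vars] <- scale(dat[vars])                 # center and scale each X_j

m_glm <- glm(y ~ bio1 + bio4 + bio12 + bio17,
             family = binomial(link = "logit"),
             data = dat)
summary(m_glm)                                # coefficients beta_j on the logit scale
```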

Interpretation

Once the model was fitted, predicted probabilities \(\hat{p}_i\) were visualized across environmental gradients:

\[\hat{p}_i = \frac{1}{1 + e^{-(\beta_0 + \sum_j \beta_j X_{ji})}}\]

Plotting \(\hat{p}_i\) against temperature or precipitation revealed typical sigmoidal relationships, highlighting thresholds where probability of occurrence changes sharply. Even when responses were clearly nonlinear (for example, tolerance to intermediate temperature ranges) the logit-linear assumption forced the model into overly simple patterns.
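A short sketch of that visualization, continuing with the hypothetical `m_glm` fit above: the other scaled predictors are held at their means (zero) while temperature varies.

```r
# Predicted probability of presence along a temperature gradient.
grid <- data.frame(bio1  = seq(-2, 2, length.out = 200),
                   bio4  = 0, bio12 = 0, bio17 = 0)
grid$p_hat <- predict(m_glm, newdata = grid, type = "response")

plot(grid$bio1, grid$p_hat, type = "l",
     xlab = "Mean annual temperature (scaled BIO1)",
     ylab = "Predicted probability of presence")
```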

Motivation for Flexible Models

To address these limitations, the instructors introduced the idea of tree-based methods. Unlike regression, which assumes a fixed functional form, tree models partition predictor space recursively, allowing \(p_i\) to change arbitrarily across regions of \(\mathbf{X}\). Formally, a decision tree can be expressed as a step function:

\[f(\mathbf{X}) = \sum_{m=1}^{M} c_m \, I(\mathbf{X} \in R_m),\]

where \(R_m\) are partitions of the predictor space and \(c_m\) are fitted constants. Ensemble methods such as Boosted Regression Trees (BRTs) or Random Forests combine many such trees to produce smoother, more stable predictions. We compared logistic regression and BRT predictions for the same dataset. The BRT captured nonlinear responses and variable interactions automatically, setting the conceptual bridge toward BART, where this flexibility is embedded within a Bayesian framework.
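To make the step-function idea concrete, here is a hedged sketch of a single classification tree fitted with `rpart`; the workshop compared against BRTs, so a lone tree is shown only to illustrate recursive partitioning.

```r
# A single tree: each printed split is one of the decision rules defining R_m.
library(rpart)

m_tree <- rpart(factor(y) ~ bio1 + bio4 + bio12 + bio17,
                data = dat, method = "class",
                control = rpart.control(maxdepth = 4))
m_tree
```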

Bayesian Modeling

The second lecture introduced the Bayesian framework that underpins BART. The shift from classical to Bayesian modeling wasn’t presented as a new method but as a new way of reasoning about models: treating parameters as random variables and uncertainty as an integral part of inference.

The Bayesian Framework

In the Bayesian approach, all unknown quantities (coefficients, predictions, even latent processes) are treated as random variables with probability distributions. The goal is to update prior beliefs using data according to Bayes’ theorem:

\[p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{p(y)},\]

where

  • \(p(\theta)\) is the prior, representing beliefs about parameters before seeing the data,

  • \(p(y \mid \theta)\) is the likelihood, describing how data are generated given parameters,

  • \(p(\theta \mid y)\) is the posterior, the updated belief after observing data,

  • \(p(y)\) is the marginal likelihood, a normalizing constant.

Inference is based on the full posterior distribution rather than on point estimates such as \(\hat{\beta}\) or \(\hat{p}\). Every model quantity carries an explicit measure of uncertainty.

From Likelihood to Posterior

In logistic regression, the likelihood for \(n\) independent observations is:

\[p(\mathbf{y} \mid \boldsymbol{\beta}) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i},\]

with

\[\text{logit}(p_i) = \beta_0 + \sum_{j=1}^{k} \beta_j X_{ji}.\]

Assuming a normal prior \(\beta_j \sim \mathcal{N}(0, \sigma^2_\beta)\) for each coefficient, the posterior becomes:

\[p(\boldsymbol{\beta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}) \, p(\boldsymbol{\beta})\]

Since this posterior cannot be derived analytically, it’s typically approximated using Markov Chain Monte Carlo (MCMC) methods, drawing samples \(\boldsymbol{\beta}^{(1)}, \ldots, \boldsymbol{\beta}^{(S)}\) from \(p(\boldsymbol{\beta} \mid \mathbf{y})\). These samples form the basis for estimating means, credible intervals, and predictive distributions.
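To make the sampling idea tangible, here is a self-contained random-walk Metropolis sketch for this Bayesian logistic regression. It is illustrative only; in practice one would use `brms` or Stan, and the proposal scale would need tuning.

```r
# Random-walk Metropolis for Bayesian logistic regression with N(0, 2.5^2) priors.
log_post <- function(beta, X, y, sd_prior = 2.5) {
  eta <- as.vector(X %*% beta)
  loglik   <- sum(y * eta - log1p(exp(eta)))            # Bernoulli log-likelihood
  logprior <- sum(dnorm(beta, 0, sd_prior, log = TRUE)) # independent normal priors
  loglik + logprior
}

X <- model.matrix(~ bio1 + bio12, data = dat)           # intercept + two predictors
y <- dat$y
S <- 5000
draws <- matrix(NA_real_, S, ncol(X))
beta  <- rep(0, ncol(X))
lp    <- log_post(beta, X, y)

set.seed(1)
for (s in 1:S) {
  prop    <- beta + rnorm(ncol(X), 0, 0.05)             # random-walk proposal
  lp_prop <- log_post(prop, X, y)
  if (log(runif(1)) < lp_prop - lp) {                   # Metropolis acceptance step
    beta <- prop
    lp   <- lp_prop
  }
  draws[s, ] <- beta
}

keep <- draws[-(1:1000), ]                              # discard burn-in
colMeans(keep)                                          # posterior means
apply(keep, 2, quantile, c(0.025, 0.975))               # 95% credible intervals
```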

Application to Species Distribution

In the SDM context, Bayesian inference allows us to model detection uncertainty, account for imperfect data, and incorporate spatial or hierarchical structure. For example, random effects for regions or observers can be introduced with their own priors:

\[\beta_{0r} \sim \mathcal{N}(\mu_0, \sigma^2_r), \]

where \(\beta_{0r}\) is the region-specific intercept and \(\sigma^2_r\) controls variability across regions. Such extensions are central to hierarchical Bayesian models and later reappear in random-intercept BART.

From Parametric to Flexible Priors

The final part of the session connected Bayesian regression to the motivation for tree-based models. In a standard logistic regression, the prior structure is simple: typically independent normals on \(\beta_j\). But for nonlinear and interactive effects, defining appropriate priors becomes difficult.

Tree-based Bayesian models, like BART, sidestep this by placing priors on functions rather than individual parameters. Instead of \(\beta_j\), BART defines priors over trees that partition predictor space and contribute small additive effects. These functional priors allow the model to learn nonlinearities while preserving a coherent probabilistic interpretation.

By the end of the lecture, the group had implemented a simple Bayesian logistic regression in R using `brms`, compared posterior distributions of coefficients, and visualized predictive uncertainty. This exercise provided the conceptual bridge between regression models and the probabilistic logic that BART extends to high-dimensional nonlinear problems.
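A hedged sketch of what that `brms` fit might look like; predictor names and priors are placeholders, not the workshop's exact settings.

```r
library(brms)

m_bayes <- brm(
  y ~ bio1 + bio4 + bio12 + bio17,
  data   = dat,
  family = bernoulli(link = "logit"),
  prior  = set_prior("normal(0, 2.5)", class = "b"),   # weakly informative slopes
  chains = 4, iter = 2000, cores = 4
)

summary(m_bayes)    # posterior means and 95% credible intervals
plot(m_bayes)       # posterior densities and trace plots
```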

Bayesian Additive Regression Trees (BART)

The third lecture introduced the central method of the workshop: Bayesian Additive Regression Trees (BART). To understand it, we first revisited how a single regression tree works. Then we built up to the idea of combining many of them under a Bayesian framework.

Regression Trees

A regression tree models a response \(y_i\) by recursively partitioning the predictor space into regions where the response is approximately constant. Conceptually, it works like a structured game of “20 questions”. At each step, the model asks a yes-or-no question about one predictor variable and divides the data into two groups that are more homogeneous in their response values.

Formally, consider a dataset \(\{(\mathbf{X}_i,y_i)\}_{i=1}^n\) with predictors \(\mathbf{X}_i=(X_{i1},X_{i2},\ldots,X_{ip})\). The goal is to partition the predictor space into \(M\) disjoint regions \(R_1,R_2,\ldots,R_M\), such that within each region \(R_m\), the response \(y_i\) is well approximated by a constant \(\mu_m\). The model is defined as:

\[f(\mathbf{X}_i)=\sum_{m=1}^{M}\mu_m I(\mathbf{X}_i\in R_m),\]

where \(I(\cdot)\) is an indicator function equal to 1 if \(\mathbf{X}_i\) belongs to region \(R_m\) and 0 otherwise. The parameters of this model are both the region boundaries (the splits) and the terminal node means \(\mu_m\). To build the tree, we choose the splits that minimize the residual sum of squares (RSS):

\[\text{RSS} = \sum_{m=1}^{M} \sum_{i \in R_m} (y_i - \mu_m)^2\]

At each step, the algorithm considers every variable \(j\) and every possible split point \(s\), dividing the data into two regions:

\[R_1(j, s) = \{ \mathbf{X} \mid X_j < s \}, \qquad R_2(j, s) = \{ \mathbf{X} \mid X_j \ge s \}.\]

For each candidate split, we compute the resulting total error:

\[E(j, s) = \sum_{i \in R_1(j, s)} (y_i - \bar{y}_{R_1})^2 + \sum_{i \in R_2(j, s)} (y_i - \bar{y}_{R_2})^2\]

and choose \((j^*, s^*)\) that minimizes \(E(j, s)\). This greedy algorithm continues recursively, fitting new splits within each region until a stopping criterion is reached, typically a minimum node size or maximum depth.

Each region \(R_m\) thus defines a rectangular subset of predictor space, and each \(\mu_m = \bar{y}_{R_m}\) is simply the average response within that subset. In geometric terms, the fitted function \(f(\mathbf{X})\) is a piecewise-constant approximation of the true regression surface, flat within each region, with discontinuities at split boundaries.
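A minimal sketch of that split search, purely for intuition (real implementations such as `rpart` are far more efficient):

```r
# Exhaustive search for the best first split (j*, s*) minimizing E(j, s).
best_split <- function(X, y) {
  best <- list(error = Inf, var = NA, split = NA)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {
      left  <- y[X[, j] <  s]
      right <- y[X[, j] >= s]
      if (length(left) == 0 || length(right) == 0) next
      err <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (err < best$error) best <- list(error = err, var = j, split = s)
    }
  }
  best
}

# Example: best first split for the response on two (hypothetical) predictors
# best_split(as.matrix(dat[, c("bio1", "bio12")]), dat$y)
```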

Intuitively, each split can be seen as an ecological decision rule. The first split might divide “cool” from “warm” climates based on a temperature threshold (e.g., \(X_{\text{temp}} < 15\)°C). Within each branch, the next split might depend on precipitation (e.g., \(X_{\text{precip}} > 400\) mm), producing finer subdivisions of environmental space. The process continues until every observation belongs to one of \(M\) homogeneous “niches,” each characterized by a distinct mean response.

This recursive partitioning has two important implications.

First, it allows automatic interaction modeling: the effect of one variable can depend on previous splits. Temperature may only matter under certain precipitation regimes, or vice versa, without needing to specify interactions manually.

Second, it provides a nonparametric representation: the model makes no assumptions about linearity or additivity, only that locally similar conditions yield similar responses.

However, single trees are prone to overfitting and instability. Because each split depends on the previous ones, small data perturbations can lead to very different trees. The fitted function is discontinuous, jumping sharply at region boundaries, and may not generalize well outside the training sample. Despite these limitations, regression trees remain the conceptual foundation of ensemble methods like Random Forests, Boosted Regression Trees, and Bayesian Additive Regression Trees (BART). All of them retain the core idea of recursively partitioning predictor space, but they combine many trees to produce smoother, more stable, and often more interpretable models.

From Trees to Ensembles

To stabilize predictions, tree-based ensembles like Random Forests and Boosted Regression Trees (BRTs) combine many trees into a smoother model. Random forests average predictions from many independent trees, while boosting adds trees sequentially to model residuals left by previous ones.

The general ensemble form is:

\[f(\mathbf{X}) = \sum_{m=1}^{M} g(\mathbf{X}; T_m, \mu_m),\]

where each \(g\) represents a single regression tree. In BRTs, each tree is trained on the residuals of the previous ensemble, scaled by a learning rate to prevent overfitting. The process is deterministic and optimized via gradient descent.
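A hedged BRT sketch using the `gbm` package; the settings are illustrative, not those used in the workshop.

```r
library(gbm)

# Each tree fits the residuals of the current ensemble, scaled by `shrinkage`.
m_brt <- gbm(y ~ bio1 + bio4 + bio12 + bio17,
             data = dat,
             distribution = "bernoulli",
             n.trees = 2000,
             interaction.depth = 3,    # allows up to 3-way interactions per tree
             shrinkage = 0.01)         # learning rate

p_brt <- predict(m_brt, newdata = dat, n.trees = 2000, type = "response")
```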

BART takes this same additive structure but replaces the deterministic boosting process with a fully Bayesian formulation.

The BART Model

In BART, the model for continuous outcomes is:

\[y_i = \sum_{m=1}^{M} g(\mathbf{X}_i; T_m, \mu_m) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)\]

For binary responses, BART uses a probit link:

\[P(Y_i = 1 \mid \mathbf{X}_i) = \Phi\!\left(\sum_{m=1}^{M} g(\mathbf{X}_i; T_m, \mu_m)\right)\]

Each tree is weakly regularized, contributing only a small adjustment to the overall prediction. The prior enforces this behavior by penalizing deep trees and shrinking leaf values toward zero.

Priors and Regularization

BART places priors over:

  • Tree structure: the probability of a node splitting decreases with depth, \(P(\text{split at depth } d) = \alpha (1 + d)^{-\beta}\)

  • Terminal node values: \(\mu_{mj} \sim \mathcal{N}(0, \sigma_\mu^2)\)

  • Error variance: \(\sigma^2 \sim \nu \lambda / \chi^2_\nu\)

These priors act as built-in regularization, keeping trees shallow and smoothing the ensemble. Each tree captures a minor portion of the signal; together, they reconstruct complex functions.
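These prior hyperparameters map directly onto arguments of `dbarts::bart`; the sketch below shows one plausible call, using common default values rather than the workshop's exact choices.

```r
library(dbarts)

fit_bart <- bart(
  x.train = dat[, c("bio1", "bio4", "bio12", "bio17")],
  y.train = dat$y,             # binary response -> probit BART
  ntree   = 200,               # ensemble size M
  k       = 2,                 # shrinkage of terminal-node values mu_m
  base    = 0.95, power = 2,   # alpha and beta of the depth prior
  ndpost  = 1000, nskip = 500, # posterior draws and burn-in
  keeptrees = TRUE             # keep trees so predict() works later
)
```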

Posterior Sampling

Instead of optimizing via gradient descent, BART draws samples from the posterior:

\[P(\{T_m, \mu_m\}_{m=1}^{M}, \sigma^2 \mid y, X) \propto P(y \mid \{T_m, \mu_m\}, \sigma^2, X) \prod_{m=1}^{M} P(T_m) P(\mu_m \mid T_m) P(\sigma^2)\]

Sampling proceeds with a backfitting Gibbs sampler:

  1. Randomly select a tree.

  2. Remove its contribution from the ensemble to compute residuals.

  3. Update that tree to model the residuals.

  4. Reinsert it and repeat for all trees.

Over many iterations, the ensemble converges toward the posterior distribution of possible models.

Interpretation

Predictions are obtained by averaging over posterior samples:

\[\hat{y}_i = \frac{1}{S} \sum_{s=1}^{S} f^{(s)}(\mathbf{X}_i)\]

and uncertainty naturally arises from the spread of \(f^{(s)}\).

In ecological applications, this posterior uncertainty is not noise; it reflects variation in model structure and parameter estimates that corresponds to genuine uncertainty about species–environment relationships.

The lecture closed with live coding using `embarcadero` and `dbarts`, fitting BART to the Joshua tree dataset and visualizing how the sum-of-trees structure captured nonlinear temperature-precipitation responses far beyond the reach of traditional regression.
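A hedged sketch of that posterior averaging, continuing from the `dbarts` fit above. For binary data, my recollection is that `$yhat.train` holds posterior draws of the sum-of-trees function on the probit scale, so `pnorm()` converts them to probabilities; check the package documentation.

```r
# Posterior draws: one row per MCMC sample, one column per training observation.
post_prob <- pnorm(fit_bart$yhat.train)

p_mean <- colMeans(post_prob)                              # posterior mean of p_i
p_ci   <- apply(post_prob, 2, quantile, c(0.025, 0.975))   # 95% credible intervals
```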

Interpreting Models — Partials and Spartials

Once BART models are fitted, the focus shifts from prediction to interpretation. Unlike classical regressions with explicit coefficients, BART encodes relationships implicitly through the structure and frequency of tree splits. Understanding how predictors influence the response requires summarizing the model’s posterior behavior rather than inspecting parameter estimates.

Partial Dependence

A partial dependence plot (PDP) shows how the predicted response changes with one predictor while averaging over all others. For a single variable \(x_j\), the partial function is defined as

\[\bar{f}(x_j) = \frac{1}{n} \sum_{i=1}^{n} f(x_j, \mathbf{X}_{i, -j}),\]

where \(\mathbf{X}_{i, -j}\) represents all predictors except \(x_j\). The function \(\bar{f}(x_j)\) expresses the marginal effect of \(x_j\) on the response, averaged over the distribution of other variables.

In BART, this quantity is computed over posterior samples, which allows uncertainty in model structure and parameter values to propagate through to the partial curve:

\[\hat{f}(x_j) = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n} \sum_{i=1}^{n} f^{(s)}(x_j, \mathbf{X}_{i, -j}),\]

where each \(f^{(s)}\) represents a different draw from the posterior distribution of trees.

Partial dependence plots reveal nonlinearities and thresholds that would be invisible in a linear model. In species distribution models, they often highlight ecological tolerances, such as a response to temperature that rises to an optimum before declining; combined across two variables, they can also reveal interactions.
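The computation itself is simple enough to sketch generically. The helper below works with any model whose `predict()` returns probabilities; it is shown with the earlier glm for brevity, and for BART one would repeat it over posterior draws, as in the formula above.

```r
# Partial dependence of the response on one variable, averaging over the others.
partial_curve <- function(model, data, var, grid = 50) {
  xs <- seq(min(data[[var]]), max(data[[var]]), length.out = grid)
  pd <- sapply(xs, function(x) {
    tmp <- data
    tmp[[var]] <- x                                            # fix x_j
    mean(predict(model, newdata = tmp, type = "response"))     # average over X_{-j}
  })
  data.frame(x = xs, partial = pd)
}

# Example: marginal effect of (scaled) mean annual temperature
# pd_bio1 <- partial_curve(m_glm, dat, "bio1")
```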

Spatial Partials (“Spartials”)

While partials describe marginal effects in predictor space, spatial partials project those effects back into geographic space. A spatial partial fixes one variable (e.g. temperature) and computes predictions across all locations, holding other predictors at their observed values. This produces a spatial map of the effect of that variable, isolating how the environment modifies the predicted probability of presence across space. Mathematically, this is the same as evaluating \(\bar{f}(x_j)\) over the raster grid:

\[\hat{y}_{r} = \frac{1}{S} \sum_{s=1}^{S} f^{(s)}(x_{j,r}, \mathbf{X}_{r, -j}),\]

for each raster cell \(r\). The result can be visualized as a continuous spatial gradient, showing, for example, where temperature or precipitation most strongly limits species presence.

These maps can reveal geographic zones of model sensitivity or environmental control, helping connect statistical relationships to real ecological constraints.
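One simple way to realize this, assuming a hypothetical data frame `cells` of raster-cell coordinates with a scaled `bio1` column and the `pd_bio1` curve from the sketch above, is to evaluate the partial curve at each cell's value and map the result:

```r
# Spatial partial: project the BIO1 partial response onto the raster grid.
cells$spartial_bio1 <- approx(pd_bio1$x, pd_bio1$partial,
                              xout = cells$bio1, rule = 2)$y

# The spartial_bio1 column can then be rasterized and plotted as a map.
```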

Uncertainty and Variable Importance

Because BART is Bayesian, all interpretations include credible intervals. The spread of partial dependence curves or spatial partial maps reflects epistemic uncertainty, uncertainty in the model given limited data. Areas or variables with wide posterior intervals often signal sparse data or strong covariation among predictors.

Variable importance in BART is derived from the frequency and depth of tree splits. Predictors that are used frequently near the top of trees tend to explain more variance in the response. However, this frequency-based importance is diagnostic rather than explanatory: it indicates where the model found structure, not which variables causally drive occurrence.

Random-Intercept BART

The final lecture introduced Random-Intercept BART (riBART), an extension designed to handle structured ecological data. It is used in situations where observations are grouped by site, region, year, or any hierarchical unit that introduces dependence. Traditional BART assumes that all data points are independent, an assumption often violated in spatial or temporal studies. Random-intercept BART incorporates group-level variation directly into the model, combining the flexibility of tree ensembles with the structure of mixed models.

From Fixed to Random Effects

In a linear mixed model, we write the response as:

\[y_i = \alpha_{g[i]} + \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i,\]

where \(\alpha_{g[i]}\) is a random intercept for group \(g[i]\) (e.g. site or region), drawn from a group-level distribution such as

\[\alpha_{g[i]} \sim \text{Normal}(0, \sigma_\alpha^2).\]

This accounts for differences in baseline response between groups without estimating a separate parameter for each one. Random-intercept BART extends this logic to the nonparametric setting:

\[y_i = f(\mathbf{x}_i) + a_{g[i]} + \varepsilon_i,\]

where \(f(\mathbf{x}_i)\) is the sum-of-trees prediction, and \(a_{g[i]}\) is a random effect that captures structured residual variation. Both components are estimated jointly in the Bayesian framework, allowing uncertainty in \(f\) and \(a\) to interact naturally.

In hierarchical ecological data, such as multiple observations per site, spatial clusters of plots, or repeated surveys across years, this structure is very important. Without it, models can conflate within-group correlation with environmental effects, leading to overconfident estimates and inflated variable importance.
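A hedged riBART sketch using `dbarts::rbart_vi`; the argument names are recalled from the package and should be checked against its documentation, and `state` is a hypothetical grouping column.

```r
library(dbarts)

fit_ri <- rbart_vi(
  y ~ bio1 + bio4 + bio12 + bio17,
  data      = dat,
  group.by  = dat$state,      # one random intercept a_g per state
  n.samples = 1000,
  n.burn    = 500,
  n.chains  = 2
)
```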

Posterior Estimation

The random effects \(a_{g[i]}\) are treated as latent parameters and updated within the Gibbs sampler alongside the tree structures. After integrating over posterior draws, predictions take the form:

\[\hat{y}_i = \frac{1}{S} \sum_{s=1}^{S} \big( f^{(s)}(\mathbf{x}_i) + a^{(s)}_{g[i]} \big),\]

with corresponding credible intervals derived from the full posterior sample.

This formulation naturally partitions uncertainty into two components: variation in the environmental response \(f\) and variation attributable to group-level heterogeneity \(a\).

In practice, the addition of random intercepts allows the model to fit local deviations without distorting global relationships. For example, two sites might experience the same temperature and precipitation but differ in baseline suitability due to unmeasured soil conditions or sampling effects; the random effect absorbs this difference.

Application to the Joshua Tree Dataset

Using riBART, we modeled Joshua tree presence across multiple U.S. states. Each state received its own random intercept, capturing regional biases not explained by climate alone. This formulation revealed subtle differences in baseline occupancy probabilities, with Nevada and California showing higher intercepts than Utah, while the main environmental response (temperature-precipitation curvature) remained consistent across regions. The resulting model combined three strengths:

  1. Flexibility: Nonlinear environmental responses via tree ensembles

  2. Structure: Hierarchical grouping via random intercepts

  3. Uncertainty quantification: Full posterior inference for both components

Together, these features allow BART to move beyond simple species-environment relationships, modeling ecological data in their hierarchical and uncertain form.