Multivariate models in
mgcv
(Pederson et al. 2019)[https://doi.org/10.7717/peerj.6876]Flexible dynamic models with
mvgam
(Clark et al. 2022)[https://doi.org/10.1111/2041-210X.13974]
Multivariate models in mgcv
(Pederson et al. 2019)[https://doi.org/10.7717/peerj.6876]
Flexible dynamic models with mvgam
(Clark et al. 2022)[https://doi.org/10.1111/2041-210X.13974]
mgcv
Often we have multiple related response variables
Examples: different locations, species, or strata
Goal: model shared and individual trends across responses while accounting for correlation and partial pooling
mgcv
Lots of ways we can think to model variation by time series
Different intercepts / trends (harbor seal example in MARSS)
Site:year random effects (GLMMs)
Factor smooths (specific to GAMs)
mgcv
?s()
function to specify smooth terms.gam(response ~ s(predictor, by = factor_variable), data = dataset)
by = factor
, a separate smooth is estimated for each levelWhat are the differences in these models?
s(time, by = x, bs = “cr”)
s(time, by = location)
s(time, by = location, bs = “fs”)
\[x \cdot f(time)\] - te(time, x) or ti(time, x) become non-linear
What about when by
is a factor? [Hint: don’t do this!]
s(time, by = location)
Unique smooths and penalties per location, location parsed to numeric?
We need to explicitly tell mgcv
that we want to use a factor smooth
s(time, by = location, bs = “fs”)
Unique smooths per location
penalty is shared across all levels of the factor (same wiggliness)
data("harborSeal") long_seal <- as.data.frame(harborSeal) |> pivot_longer(cols = starts_with("CoastalEstuaries"): starts_with("Georgia.Strait"), names_to = "location", values_to = "value") long_seal$location <- as.factor(long_seal$location)
Model G: global smooth
Model GS: global smooth, time series - specific smooths
Model S: time series - specific smooths
fit_G <- gam(value ~ s(Year, k=8, bs="cr"), #<< data=long_seal, family="gaussian")
fit_G <- gam(value ~ s(Year, k=8, bs="cr") + s(location, k=12, bs="re"), #<< data=long_seal, family="gaussian")
fit_GS <- gam(value ~ s(Year, k=8, bs="cr") + s(Year, by = location, k = 8, bs="fs") + #<< s(location, k=12, bs="re"), data=long_seal, family="gaussian")
fit_S <- gam(value ~ s(Year, by = location, k = 8, bs="fs") + s(location, k=12, bs="re"), data=long_seal, family="gaussian")
print(AIC(fit_G, fit_GS, fit_S))
Add flexibility for time - series specific smooths
Similar to MARSS model with either shared (single trend) or MARSS model where each time series is a different trend
Wiggliness of both global and factor smooths can be specified
mgcv
Lots of flexibility
We could have used gamm
for additional flexibility in random effects
Implements mixed models (year:site random effects, etc)
mvgam
DGAMs are an extension of GAMs designed to handle time-varying and dynamic data
Particularly useful for modeling temporal dependencies / nonlinear relationships
Can be used with multivariate time series data
DGAMs incorporate dynamic smooth terms that evolve over time
GAMs are great, but uncertainty blows up as we get further from data (forecast)
Random walk / AR1 processes also have uncertainty balloon
DGAMs represent a hybrid approach that can work particularly well for forecasting
y ~ s(t, k = 12)
y ~ s(t, k = 50)
Remember that GAMs are typically represented as \[ E[Y_t] = \beta_0 + \sum_{i=1}^{I} s(x_{i,t})\] where \(Y_t\) is the response variable at time \(t\), \(\beta_0\) is the intercept, and \(s(x_{i,t})\) are smooth functions of the predictors
Because the response can be non-Gaussian we use a link function \(g^{=1}()\) to relate the mean of the response to the linear predictor
\[ E[Y_t] = g^{=1} \left( \beta_0 + \sum_{i=1}^{I} s(x_{i,t}) \right)\] where \(g\) is the link function (e.g., logit, log, identity)
DGAMs add a dynamic component to this equation, allowing for time-varying smooths
\[ E[Y_t] = g^{-1} \left( \beta_0 + \sum_{i=1}^{I} s(x_{i,t}) + z_{t} \right)\] where \(z_{t} = \psi_{0} + \psi_{1}z_{t-1} + \epsilon_t\) and \(\epsilon_t \sim N(0, \sigma_{\epsilon})\) and \(g^{-1}\) is the inverse link function
Random walks can be specified with any dynamic model that we’ve already discussed (AR, MA processes, etc)
\(z_{t} = \psi_{0} + \psi_{1}z_{t-1} + \epsilon_t\)
\(z_{t} = \psi_{0} + \psi_{1}z_{t-1} + \psi_{2}z_{t-2} + \epsilon_t\)
For multiple time series, there can be > 1 latent trend, and we can estimate factor loadings (DFA). For time series \(j\)
\[ E[Y_{j,t}] = g^{-1} \left( \beta_0 + \sum_{i=1}^{I} s(x_{i,j,t}) + \sum_{m=1}^{M} z_{m,t}\theta_{j} \right)\]
The mvgam
package provides tools for fitting DGAMs in R
Bayesian estimation with Stan backend
Supports multivariate time series modeling with dynamic smooths
Provides an interface to easily define dynamic terms in the model
mvgam
As expected, this is giving the same result as mgcv
fit <- mvgam(y ~ s(time, k = 50), data = df, family = "gaussian")
mvgam
Before, we were using factor smooths to model harbor seal trends
We can fit these exact same models (without trend) in mvgam
fit_GS <- mvgam(value ~ s(Year, k=8, bs="cr") +#<< s(Year, by = location, k = 8, bs="fs") + s(location, k=12, bs="re"), data=long_seal, family="gaussian", chains = 1)
plot_mvgam_smooth(fit_GS)
Jacox MG, CA Edwards, EL Hazen, SJ Bograd (2018) Coastal upwelling revisited: Ekman, Bakun, and improved upwelling indices for the U.S. west coast, J. Geophysical Research, 123(10), 7332-7350.
CUTI estimates of vertical transport near the coast
BEUTI estimates of vertical nitrate flux
CUTI collected at monthly time steps
1 deg latitude resolution (31 - 47) – we’ll use every other
print(head(cuti_long))
fit_trend <- mvgam(cuti ~ s(time, bs="cr") + s(time, by = latitude, bs="fs") + s(latitude, k=9, bs="re") + s(month, k = 12, bs="cc"),#<<, data=cuti2324, share_obs_params = TRUE,#<<, family="gaussian", chains = 1)
plot_mvgam_smooth(fit_trend, smooth = "month")
plot_mvgam_smooth(fit_trend, smooth = "time")
plot(fit_trend)
Next we’ll add a trend as a random walk
By default this is creating a separate trend for each time series
May have too many competing smooths etc
fit_trend_rw <- mvgam(cuti ~ s(time, bs="cr") + s(time, by = latitude, bs="fs") + s(latitude, k=9, bs="re") + s(month, k = 12, bs="cc"), trend_model = RW(), data=cuti_long[cuti_long$year>2022,], share_obs_params = TRUE,#<<, family="gaussian", chains = 1)
plot(fit_trend_rw)
plot_mvgam_smooth(fit_trend_rw)
plot_mvgam_trend(fit_trend_rw)
Our smooths and the trend are competing to explain the process
Let’s drop out the main smooth now and let that be explained by the trend
fit_trend_ar1 <- mvgam(cuti ~ s(time, bs="cr") + s(latitude, k=9, bs="re") + s(month, k = 12, bs="cc"), trend_model = 'AR1', data=cuti_long[cuti_long$year>2022,], share_obs_params = TRUE, family="gaussian", chains = 1)
But maybe we’re trying to estimate too many trends
Instead, model trends as dynamic factors (n = 2)
fit_trend_fac <- mvgam(cuti ~ s(time, bs="cr") + s(latitude, k=9, bs="re") + s(month, k = 12, bs="cc"), trend_model = 'AR1', use_lv = TRUE,#<<, n_lv = 2,#<<, data=cuti_long[cuti_long$year>2022,], share_obs_params = TRUE, family="gaussian", chains = 1)
What do factors look like?
By default mvdlm models these as AR1
DFA model with AR1 trend + seasonal smooth
plot(fit_trend_fac, type = 'factors')
pars <- rstan::extract(fit_trend_fac$model_output) print(apply(pars$lv_coefs, c(2,3), mean))
Factor model seems to be struggling (maybe more so in middle latitudes)
Factors for this application probably not flexible enough
Multivariate models in mgcv
allow for flexible modeling of multiple time series
Factor smooths provide a way to relationships that differ across different levels of a factor
mvgam
extends this approach to include dynamic components, allowing for time-varying smooths and latent trends
For time series DGAMs can be used to model temporal dependencies and nonlinear relationships in multivariate time series data
All approaches have tradeoffs between fit to the training data and forecast error
GAM fit and forecast error depends on the number of basis functions used (wiggliness)
MARSS / DFA fit and forecast error depends on the number of factors and how they are modeled