Model diagnostics and model selection

Overview of today’s material

Diagnostics
Approaches for model selection
Cross validation
Quantifying forecast performance

Residual Diagnostics in State-Space Models

State-space models in MARSS assume:

Observation equation: \(y_t = Zx_t + a + v_t\)
State equation: \(x_t = bx_{t-1} + u + w_t\)

Where: - \(v_t \sim N(0, R)\): observation error - \(w_t \sim N(0, Q)\): process error

What Types of Residuals Can We Get?

MARSS() provides:

Model residuals: \[y_t - \hat{y}_t\]
- How well the model predicts observations
State residuals: \[x_t - E[x_t | y]\]
- Deviations from estimated latent state

Residuals workflow

For each of model and state residuals
Plot residuals vs time
Check ACF/PACF for autocorrelation
QQ plot + histogram for normality
Formal tests (optional): Shapiro-Wilk, Ljung-Box

Residuals workflow

DHARMa package has great built in diagnostics
Not compatible with simulation steps in MARSS
But residual checks are straightforward and we can implement our own equivalents

Example data: global ocean temperature anomalies

simple univariate state space model

data("gtemp_ocean")
fit <- MARSS(gtemp_ocean)

What Types of Residuals Can We Get?

Residuals:

resid <- MARSSresiduals(fit, type = c("tT", "tt1", "tt"), 
                        normalize = FALSE)

Extract residuals of interest

“tT” smoothed residuals conditioned on all data y
“tt1” one-step ahead residuals, conditioned on data up to t-1 (innovation residuals)
“tt” contemporaneous residuals conditioned on data up to time t

Plotting model residuals

Is there a trend? Are they stationary?

resid_mod <- residuals(fit, type = "tT")
ggplot(resid_mod, aes(t, .resids, color = type)) + 
  geom_point() + xlab("Time") + ylab("Residuals") + 
  geom_hline(aes(yintercept = 0), col="black")

ACFs and PCFs of residuals

Spikes = autocorrelation

resid1 <- resid_mod[(resid_mod$type=="ytT"),]
acf(resid1$.resids, main = "ACF: Model Residuals")

ACFs and PCFs of residuals

PACF at lag 2: maybe we’d be better off with AR(2) model?

resid1 <- resid_mod[(resid_mod$type=="ytT"),]
pacf(resid1$.resids, main = "PACF: Model Residuals")

Normality checks

qqnorm(resid1$.resid)
qqline(resid1$.resid)

Normality checks

Skewness or multimodality?

hist(resid1$.resid, 40,
     main = "Histogram of Model Residuals", xlab = "Residual")

Extensions for multivariate models

All of the previous diagnostics are appropriate for multivariate models
Are residuals correlated across multiple time series?
Why might this occur and how to fix it?

How good are our models?

Several candidate models might be built based on

hypotheses / mechanisms
diagnostics / summaries of fit

Models can be evaluated by their ability to explain data

OR by the tradeoff in the ability to explain data, and ability to predict future data
OR just in their predictive abilities
- Hindcasting
- Forecasting

How good are our models?

We can illustrate with an example to the harborSealWA dataset in MARSS

\(y_{t} = b_{0}+b_{1}*t+e_{t}\)

How good are our models?

## 
## Call:
## lm(formula = SJI ~ Year, data = harborSealWA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46099 -0.08022  0.06576  0.13286  0.21464 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.392e+02  1.601e+01  -8.697 1.85e-07 ***
## Year         7.397e-02  8.043e-03   9.197 8.69e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 16 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8409, Adjusted R-squared:  0.831 
## F-statistic: 84.58 on 1 and 16 DF,  p-value: 8.693e-08

How good are our models?

Our regression model had a pretty good SS

\[SS=\sum _{ i=1 }^{ n }{ { \left( y_{ i }-E[{ y }_{ i }] \right) }^{ 2 } } \]

But SS is problematic
- as we consider more complex models, they’ll inevitably reduce SS
- there’s no cost or penalty for having too many parameters

Model selection

Lots of metrics have been developed to overcome this issue and penalize complex models

Occam’s razor: “the law of briefness”
Principle of parsimony: choose the simplest possible model that explains the data pretty well
- choose a model that minimizes bias and variance

Model selection

https://www.ncbi.nlm.nih.gov/books/NBK543534/figure/ch8.Fig3/

Model selection: AIC

Akaike’s Information Criterion (AIC, Akaike 1973)

Attempts to balance the goodness of fit of the model against the number of parameters
Based on deviance = minus twice negative log likelihood

Deviance = \[-2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right)\]

Deviance is a measure of model fit to data
- lower values are better
- Maximizing likelihood is equivalent to minimizing negative likelihood

Model selection: AIC

Why the large focus on AIC?
Heavily used in ecology (Burnham and Anderson 2002)[https://www.springer.com/gp/book/9780387953649]
Also the default in many stepwise model selection procedures in R
forecast, glmulti, bestglm, AICcmodavg, MuMIn

Model selection: AIC

Many base functions in R support the extraction of AIC

y = cumsum(rnorm(20))
AIC(lm(y~1))
AIC(glm(y~1))
AIC(mgcv::gam(y~1))
AIC(glmmTMB::glmmTMB(y~1))
AIC(lme4::lmer(y~1))
AIC(stats::arima(y))
AIC(forecast::Arima(y))
AIC(MARSS:MARSS(y))

Model selection: AIC

Many *IC approaches to model selection also rely on deviance. Where they differ is how they structure the penalty term.

For AIC, the penalty is 2 * number of parameters (\(k\)),

\[AIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + 2k\]

This is not affected by sample size, \(n\)

Model selection: AIC

AIC aims to find the best model to predict data generated from the same process that generated your observations

Downside: AIC has a tendency to overpenalize, especially for more complex models

Equivalent to significance test w/ \(\alpha\) = 0.16

Alternative: Schwarz/Bayesian Information Criterion (SIC/BIC)

Not Bayesian!
Relies on Laplace approximation to posterior
\(\alpha\) becomes a function of sample size

Model selection: AIC

BIC is measure of explanatory power (rather than balancing explanation / prediction)

\[BIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + k\cdot ln(n)\]

Tendency of BIC to underpenalize

AIC and BIC for time series forecasting

Smallest AIC similar to minimizing one-step ahead forecasts using MSE Rob Hyndman’s blog
AIC approximates LOOCV Stone (1977)
BIC approximates k-fold cross validation Shao (1997)

Bayesian model selection

The big difference between the Bayesian and maximum likelihood approaches are that

ML methods are maximizing the likelihood over the parameter space
Bayesian methods are integrating over the parameter space, asking ‘what values are best, on average?’

Many of the ML methods discussed were designed for models with only fixed effects.

What about correlated parameters, nested or hierarchical models?

Bayesian model selection

Again, lots of options that have evolved quickly over the last several decades

Bayes factors (approximated by BIC)
- can be very difficult to calculate for complex models
Deviance Information Criterion (DIC)
- Spiegelhalter et al. (2002)
- DIC is easy to get out of some programs (JAGS)
Widely applicable information criterion (WAIC)
- Watanabe (2010)
Leave One Out Information Criterion (LOOIC)
- Vehtari et al. 2017, Vehtari et al. 2019

Bridging ML / Bayesian model selection

J.T. Thorson 2024: Measuring complexity for hierarchical models using effective degrees of freedom

similar to mgcv in that we can approximate the effective degrees of freedom
but applied to models with random effects / hierarchical structure
has generally not been available in maximum likelihood

Cross validation

Recent focus in ecology & fisheries on prediction

Dietze et al. 2017

Maris et al. 2017

Pennekamp et al. 2017

Pennekamp et al. 2018

Szuwalkski & Thorson 2017

Anderson et al. 2017

Resampling techniques

Jackknife

Hold out each data point, recomputing some statistic (and then average across 1:n)

Bootstrap

Similar to jackknife, but with resampling

Cross-validation (k-fold)

Divide dataset into k-partitions
How well do (k-1) partitions predict kth set of points?
Relationship between LOOCV and AIC / BIC

Data split: test/training sets (e.g. holdout last 5 data pts)

Resampling techniques: bootstrap

Bootstrap or jackknife approaches are useful

generally used in the context of time series models to generate new or pseudo-datasets
posterior predictive checks in Bayesian models generate new data from posterior draws
state space models: use estimated deviations / errors to simulate, estimate CIs

Examples

MARSS::MARSSboot()
MARSS::MARSSinnovationsboot()
forecast::bld.mbb.bootstrap()
forecast::forecast(..., bootstrap=TRUE)

Resampling techniques: K-fold cross validation

As an example, we’ll use a time series of body temperature from the beavers dataset

data(beavers)
beaver = dplyr::filter(beaver2, time>200)

Resampling techniques: K-fold cross validation

Choose model (e.g. State space model w/MARSS)
Partition data
Fit & prediction

Resampling techniques: K-fold cross validation

set.seed(123)
K = 5 
beaver$part = sample(1:K, size=nrow(beaver), replace=T)
beaver$pred = 0
beaver$pred_se = 0
for(k in 1:K) {
  y = beaver$temp
  y[which(beaver$part==k)] = NA
    mod = MARSS(y, model=list("B"="unequal"))
    beaver$pred[beaver$part==k] = 
      mod$states[1,which(beaver$part==k)]
    beaver$pred_se[beaver$part==k] = 
      mod$states.se[1,which(beaver$part==k)]
}

Resampling techniques: K-fold cross validation

How large should K be?
Bias/variance tradeoff:
Low K: low variance, larger bias, quicker to run. ML approaches recommend 5-10
High K (LOOCV): low bias, high variance, computationally expensive

Resampling techniques: K-fold cross validation

Resampling techniques: repeated K-fold cross validation

To remove effect of random sampling / partitioning, repeat K-fold cross validation and average predictions for a given data point
caret() package in R does this for some classes of models
Data splitting for time series

Resampling techniques: repeated K-fold cross validation

Need to specify repeats

train_control = caret::trainControl(method="repeatedcv", 
                number=5, repeats=20)

Again this is extendable across many widely used models

Resampling techniques

What about for time series data?

Previous resampling was random
No preservation of order (autocorrelation)

Resampling techniques: LTOCV

Leave Time Out Cross Validation = leave each year out in turn
Predict using historical and future data
Re-analyze the beaver data using LTOCV

Resampling techniques: future (aka forward chain) CV

Leave Future Out Cross Validation: only evaluate models on future data

Fold 1: training[1], test[2]
Fold 2: training[1:2], test[3]
Fold 3: training[1:3], test[4]
Fold 4: training[1:4], test[5]

Resampling techniques: LFOCV

Apply MARSS model to beaver dataset
Assign partitions in order, 1:5

beaver$part = ceiling(5*seq(1,nrow(beaver)) / (nrow(beaver)))

iterate through 2:5 fitting the model and forecasting

Resampling techniques: LFOCV

Bayesian cross validation

LOOIC (Leave-one out cross validation) + preferred over alternatives

WAIC (widely applicable information criterion)

Both available in loo::loo()

Additional reading: https://cran.r-project.org/web/packages/loo/vignettes/loo2-example.html

Prediction and forecast evaluations

Let’s fit an ARMA(1,1) model to the global temperature data, after 1st differencing to remove trend

Prediction and forecast evaluations

All models will have forecast uncertainty increase with time
Rate of increase depends on the model

plot(f1)

Quantifying forecast performance

One of the most widelty used metrics is mean square error (MSE)

\[MSE=E\left[ { e }_{ t }^{ 2 } \right] =E\left[ { \left( { x }_{ t }-{ \hat { x } }_{ t } \right) }^{ 2 } \right]\]

Root mean squared error (RMSE) also very common

Quantifying forecast performance

Like with model selection, the bias-variance tradeoff is important

principle of parsimony

MSE can be rewritten as

\[MSE=Var\left( { \hat { x } }_{ t } \right) +Bias{ \left( { x }_{ t },{ \hat { x } }_{ t } \right) }^{ 2 }\] * Smaller MSE = lower bias + variance

Quantifying forecast performance

MSE and all forecast metrics can be calculated for

single data points
entire time series
future forecasts

\[MSE={ \frac { \sum _{ t=1 }^{ n }{ { \left( { x }_{ t }-{ \hat { x } }_{ t } \right) }^{ 2 } } }{ n } }\]

Do you care just about predicting the final outcome of a forecast, or also the trajectory to get there?

Variants of MSE

Root mean square error, RMSE (quadratic score)

RMSE = \(\sqrt { RMSE }\)
on the same scale as the data
also referred to as RMSD, root mean square deviation

Mean absolute error, MAE (linear score) \[E\left[ \left| { x }_{ t }-{ \hat { x } }_{ t } \right| \right]\]

Median absolute error, MdAE

\[median\left[ \left| { x }_{ t }-{ \hat { x } }_{ t } \right| \right]\]

Scale independent measures of performance

Better when applying statistics of model(s) to multiple datasets MSE or RMSE will be driven by time series that is larger in magnitude

Percent Error Statistics

Percent Error:

\[{ p }_{ t }=\frac { { e }_{ t }\cdot 100 }{ { Y }_{ t } }\]

Mean Absolute Percent Error (MAPE):

\[MAPE\quad =\quad E\left[ \left| { p }_{ t } \right| \right]\] Root Mean Square Percent Error (RMSPE):

\[RMSPE\quad =\quad \sqrt { E\left[ { p }_{ t }^{ 2 } \right] }\]

Issues with percent error statistics

\[{ p }_{ t }=\frac { { e }_{ t }\cdot 100 }{ { Y }_{ t } }\]

What happens when Y = 0?
Distribution of percent errors tends to be highly skewed / long tails
MAPE tends to put higher penalty on positive errors
See Hyndman & Koehler (2006)

Scaled error statistics

Define scaled error as

\[{ q }_{ t }=\frac { { e }_{ t } }{ \frac { 1 }{ n-1 } \sum _{ i=2 }^{ n }{ \left( { Y }_{ i }-{ Y }_{ i-1 } \right) } }\]

denominator is MAE from random walk model, so performance is gauged relative to that
this does not allow for missing data

Absolute scaled error (ASE)

\[ASE=\left| { q }_{ t } \right|\] Mean absolute scaled error (MASE)

\[MASE=E\left[ \left| { q }_{ t } \right| \right]\]

Interpreting ASE and MASE

All values are relative to the naïve random walk model

Values < 1 indicate better performance than RW model
Values > 1 indicate worse performance than RW model

Implementation in R

Fit an ARIMA model to ‘airmiles’,holding out last 3 points

n = length(airmiles)
air.model = auto.arima(log(airmiles[1:(n-3)]))

Implementation in R

Forecast the fitted model 3 steps ahead
Use holdout data to evaluate accuracy

air.forecast = forecast(air.model, h = 3)
plot(air.forecast)

Implementation in R

Evaluate RMSE / MASE statistics for 3 holdouts

accuracy(air.forecast, log(airmiles[(n-2):n]), test = 3)

##                  ME      RMSE       MAE       MPE     MAPE     MASE
## Test set -0.4183656 0.4183656 0.4183656 -4.051598 4.051598 2.010664

Evaluate RMSE / MASE statistics for only last holdout

accuracy(air.forecast, log(airmiles[(n-2):n]), test = 1)

##                  ME      RMSE       MAE       MPE     MAPE     MASE
## Test set -0.1987598 0.1987598 0.1987598 -1.960106 1.960106 0.955239

MSE vs MAPE vs MASE

Raw statistics (e.g. MSE, RMSE) shouldn’t be applied for data of different scale

Percent error metrics (e.g. MAPE) may be skewed & undefined for real zeroes

Scaled error metrics (ASE, MASE) have been shown to be more robust meta-analyses of many datasets + Hyndman & Koehler (2006)

Scoring rules

Metrics (RMSE, etc.) evaluate point estimates of predictions vs. observations
But what if we also care about how uncertain our predictions / forecasts are?
limited to applications of parametric methods
Scoring rules
Draper (2005)
Gneting and Raftery (2012)
R packages: scoring, scoringRules

Overview of today’s material

Residual Diagnostics in State-Space Models

What Types of Residuals Can We Get?

Residuals workflow

Residuals workflow

Example data: global ocean temperature anomalies

Example data: global ocean temperature anomalies

What Types of Residuals Can We Get?

Plotting model residuals

ACFs and PCFs of residuals

ACFs and PCFs of residuals

Normality checks

Normality checks

Extensions for multivariate models

How good are our models?

How good are our models?

How good are our models?

How good are our models?

Model selection

Model selection

Model selection: AIC

Model selection: AIC

Model selection: AIC

Model selection: AIC

Model selection: AIC

Model selection: AIC

AIC and BIC for time series forecasting

Bayesian model selection

Bayesian model selection

Bridging ML / Bayesian model selection

Cross validation

Resampling techniques

Resampling techniques: bootstrap

Resampling techniques: K-fold cross validation

Resampling techniques: K-fold cross validation

Resampling techniques: K-fold cross validation

Resampling techniques: K-fold cross validation

Resampling techniques: K-fold cross validation

Resampling techniques: K-fold cross validation

Resampling techniques: repeated K-fold cross validation

Resampling techniques: repeated K-fold cross validation

Resampling techniques

Resampling techniques: LTOCV

Resampling techniques: future (aka forward chain) CV

Resampling techniques: LFOCV

Resampling techniques: LFOCV

Bayesian cross validation

Prediction and forecast evaluations

Prediction and forecast evaluations

Quantifying forecast performance

Quantifying forecast performance

Quantifying forecast performance

Variants of MSE

Scale independent measures of performance

Percent Error Statistics

Issues with percent error statistics

Scaled error statistics

Interpreting ASE and MASE

Implementation in R

Implementation in R

Implementation in R

MSE vs MAPE vs MASE

Scoring rules

Scoring rules

Scoring rules

Questions?