January 29 2019

Overview of today’s material

  • Approaches for model selection
  • Cross validation
  • Quantifying forecast performance

How good are our models?

Several candidate models might be built based on

  • hypotheses / mechanisms
  • diagnostics / summaries of fit

Models can be evaluated by their ability to explain data

  • OR by the tradeoff in the ability to explain data, and ability to predict future data
  • OR just in their predictive abilities
    • Hindcasting
    • Forecasting

How good are our models?

We can illustrate with an example using the harborSealWA dataset in the MARSS package

How good are our models?

Metrics like the sum of squares are appealing,

\[SS=\sum _{ i=1 }^{ n }{ { \left( y_{ i }-E[{ y }_{ i }] \right) }^{ 2 } } \]

How good are our models?
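Output like the regression summary below can be generated with a simple linear model; a minimal sketch, assuming the harborSealWA data (logged counts) from the MARSS package:

library(MARSS)
data(harborSealWA)
# the data ship as a matrix; coerce to a data frame for lm()
harborSealWA = as.data.frame(harborSealWA)
fit = lm(SJI ~ Year, data = harborSealWA)
summary(fit)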

## 
## Call:
## lm(formula = SJI ~ Year, data = harborSealWA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46099 -0.08022  0.06576  0.13286  0.21464 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.392e+02  1.601e+01  -8.697 1.85e-07 ***
## Year         7.397e-02  8.043e-03   9.197 8.69e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 16 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8409, Adjusted R-squared:  0.831 
## F-statistic: 84.58 on 1 and 16 DF,  p-value: 8.693e-08

How good are our models?

Our regression model had a pretty good sum-of-squares

  • But SS is problematic
    • as we consider more complex models, they’ll inevitably reduce SS
    • there’s no cost or penalty for having too many parameters

Model selection

Lots of metrics have been developed to overcome this issue and penalize complex models

  • Occam’s razor: “the law of briefness”

  • Principle of parsimony: choose the simplest possible model that explains the data pretty well
    • choose a model that minimizes bias and variance

Model selection: AIC

Akaike’s Information Criterion (AIC, Akaike 1973)

  • Attempts to balance the goodness of fit of the model against the number of parameters

  • Based on the deviance = minus two times the log-likelihood

Deviance = \[-2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right)\]

  • Deviance is a measure of model fit to data
    • lower values are better
    • Maximizing the likelihood is equivalent to minimizing the negative log-likelihood (the deviance)

Model selection: AIC

Many *IC approaches to model selection also rely on deviance. Where they differ is how they structure the penalty term.

For AIC, the penalty is 2 * number of parameters (\(k\)),

\[AIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + 2k\]

  • But what about sample size, \(n\)?

Model selection: AIC

Small sample AIC

\[AICc=AIC+\frac { 2k(k+1) }{ n-k-1 }\]

  • What happens to this term as n increases?
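As \(n\) increases, the correction term shrinks toward zero, so AICc converges to AIC. Both can be computed by hand from any model with a log-likelihood; a minimal sketch using a simulated intercept-only regression (not the lecture data):

set.seed(123)
y = cumsum(rnorm(20))
fit = lm(y ~ 1)
k = attr(logLik(fit), "df")   # number of estimated parameters (incl. residual variance)
n = length(y)
aic = -2 * as.numeric(logLik(fit)) + 2 * k
aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
c(by_hand = aic, builtin = AIC(fit), small_sample = aicc)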

Model selection: AIC

AIC aims to find the best model to predict data generated from the same process that generated your observations

Downside: AIC has a tendency to favor overly complex models, because its penalty for extra parameters is relatively light

  • Equivalent to significance test w/ \(\alpha\) = 0.16

Alternative: Schwarz/Bayesian Information Criterion (SIC/BIC)

  • Not Bayesian!
  • Relies on Laplace approximation to posterior
  • \(\alpha\) becomes a function of sample size

Model selection: BIC

BIC is a measure of explanatory power (rather than balancing explanation / prediction)

\[BIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + k\cdot ln(n)\]

  • Tendency of BIC to favor overly simple models, because its penalty grows with sample size
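Continuing the sketch above, BIC replaces the constant 2 in the penalty with \(ln(n)\) (here \(ln(20) \approx 3\), so the penalty is heavier than AIC's):

# BIC by hand for the same intercept-only fit from the AIC sketch
bic = -2 * as.numeric(logLik(fit)) + k * log(n)
c(by_hand = bic, builtin = BIC(fit))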

Model selection: AIC vs BIC

Philosophical differences between AIC / BIC

  • AIC / AICc tries to choose a model that approximates reality
    • does not assume that reality exists in your set of candidate models
  • BIC assumes that one of your models is truth
    • This model will tend to be favored more as sample size increases

Model selection: AIC

Many base functions in R support the extraction of AIC

y = cumsum(rnorm(20))
AIC(lm(y ~ 1))
AIC(glm(y ~ 1))
AIC(mgcv::gam(y ~ 1))
AIC(glmmTMB::glmmTMB(y ~ 1))
# lme4::lmer() requires at least one random effect, e.g. a grouping factor
g = rep(1:4, each = 5)
AIC(lme4::lmer(y ~ 1 + (1 | g)))
AIC(stats::arima(y))
AIC(forecast::Arima(y))

Bayesian model selection

Largely beyond the scope of this class, but as an overview there are several options

  • Bayes factors (approximated by BIC)
    • can be very difficult to calculate for complex models
  • Deviance Information Criterion (DIC)
    • Spiegelhalter et al. (2002)
    • DIC is easy to get out of some programs (JAGS)

DIC is also attempting to balance bias and variance

Bayesian model selection

Like AIC, DIC is estimated from (1) the deviance and (2) a penalty, the effective number of parameters \(p_{D}\).

\[DIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + 2p_{D}\]

Options for \(p_{D}\) are

  • \(E[D]-D\left( \bar { \theta } \right)\) (Spiegelhalter et al. 2002) and

  • \(\frac { Var(D\left( \theta \right) ) }{ 2 }\) (Gelman et al. 2004)
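A minimal sketch of how \(p_{D}\) and DIC might be computed from MCMC output, assuming `dev` holds posterior draws of the deviance (e.g. monitored in JAGS) and `dev_at_mean` is the deviance evaluated at the posterior means; both are placeholders here:

dev = rnorm(2000, mean = 120, sd = 2)        # placeholder posterior draws of the deviance
dev_at_mean = 118                            # placeholder value of D(theta-bar)
pD_spiegelhalter = mean(dev) - dev_at_mean   # E[D] - D(theta-bar)
pD_gelman = var(dev) / 2                     # Var(D) / 2
DIC = dev_at_mean + 2 * pD_spiegelhalter     # deviance + 2 * p_D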

Bayesian model selection

The big difference between the Bayesian and maximum likelihood approaches is that

  • ML methods are maximizing the likelihood over the parameter space
  • Bayesian methods are integrating over the parameter space, asking ‘what values are best, on average?’

Many of the ML methods discussed were designed for models with only fixed effects.

  • What about correlated parameters, nested or hierarchical models?

Cross validation

Resampling techniques

Jackknife

  • Hold out each data point, recomputing some statistic (and then average across 1:n)

Bootstrap

  • Similar to jackknife, but with resampling

Cross-validation (k-fold)

  • Divide dataset into k-partitions
  • How well do (k-1) partitions predict kth set of points?
  • Relationship between LOOCV and AIC

Data split: test/training sets (e.g. hold out the last 5 data points)

Resampling techniques

Bootstrap or jackknife approaches are useful

  • generally used in the context of time series models to generate new or pseudo-datasets

  • state space models: use estimated deviations / errors to simulate, estimate CIs

Examples

MARSS::MARSSboot()
MARSS::MARSSinnovationsboot()
forecast::bld.mbb.bootstrap()
forecast::forecast(..., bootstrap=TRUE)
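For example, a sketch of two of these, using the built-in WWWusage series as a stand-in for any ts object:

library(forecast)
# block bootstrap: returns a list of bootstrapped versions of the series
bs = bld.mbb.bootstrap(WWWusage, 10)
# bootstrap (simulation-based) prediction intervals instead of Gaussian ones
fit = auto.arima(WWWusage)
fc = forecast(fit, h = 5, bootstrap = TRUE)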

Resampling techniques

As an example, we’ll use a time series of body temperature from the beavers dataset

library(ggplot2)
data(beavers)   # loads beaver1 and beaver2
beaver = dplyr::filter(beaver2, time > 200)
ggplot(beaver, aes(time, temp)) + 
  geom_point() + geom_line() + 
  ylab("Temperature") + xlab("Time")

Resampling techniques: K-fold cross validation

  • Choose model (e.g. regression)
  • Partition data
  • Fit & prediction

Things to highlight: the number of folds, and how the sampling is done

K = 5 
set.seed(123)   # for reproducible partitions
# randomly assign each observation to one of K folds
beaver$part = sample(1:K, size = nrow(beaver), replace = TRUE)
beaver$pred = 0
for(i in 1:K) {
  # fit to all folds except the ith, then predict the held-out fold
  mod = lm(temp ~ time, data = beaver[beaver$part != i, ])
  beaver$pred[beaver$part == i] = 
    predict(mod, newdata = beaver[beaver$part == i, ])
}
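The held-out predictions can then be summarized with an error statistic, e.g. RMSE (a sketch):

# root mean squared error of the K-fold (out-of-sample) predictions
sqrt(mean((beaver$temp - beaver$pred)^2))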

Resampling techniques: K-fold cross validation

  • How large should K be?

  • Bias/variance tradeoff:

  • Low K: low variance, larger bias, quicker to run. ML approaches recommend 5-10

  • High K (LOOCV): low bias, high variance, computationally expensive

Resampling techniques: repeated K-fold cross validation

  • To remove effect of random sampling / partitioning, repeat K-fold cross validation and average predictions for a given data point

  • the caret package in R

Resampling techniques: repeated K-fold cross validation

  • Need to specify repeats
train_control = caret::trainControl(method="repeatedcv", 
                number=5, repeats=20)
  • Again this is extendable across many widely used models
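For example, a sketch of passing this control object to caret::train(), assuming the same beaver regression used above; output like the summary on the next slide comes from a fit of this form:

library(caret)
model = train(temp ~ time, data = beaver, method = "lm",
              trControl = train_control)
model   # prints resampled RMSE and R-squared, averaged over folds and repeats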

Resampling techniques: repeated K-fold cross validation

summary(model)

## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5553 -0.2160 -0.1019  0.2139  0.6868 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.613e+01  1.266e-01  285.30   <2e-16 ***
## time        8.795e-04  7.441e-05   11.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2905 on 85 degrees of freedom
## Multiple R-squared:  0.6218, Adjusted R-squared:  0.6173 
## F-statistic: 139.7 on 1 and 85 DF,  p-value: < 2.2e-16

Resampling techniques

What about for time series data?

  • Previous resampling was random
  • No preservation of order (autocorrelation)

Resampling techniques: forward chain CV

Idea: only evaluate models on subsequent data

  • Fold 1: training[1], test[2]
  • Fold 2: training[1:2], test[3]
  • Fold 3: training[1:3], test[4]
  • Fold 4: training[1:4], test[5]

Resampling techniques: forward chain CV

Example with an ARIMA(1,1,1) model

  • Assign partitions in order, 1:5
beaver$part = ceiling(5*seq(1,nrow(beaver)) / (nrow(beaver)))
  • iterate through 2:5 fitting the model and forecasting

Resampling techniques: forward chain CV

Example code:

library(forecast)
for(i in 2:5) {
  # fit to all partitions before the ith
  mod = Arima(beaver$temp[which(beaver$part < i)], order = c(1,1,1))
  # forecast ahead over the ith partition and store the predictions
  beaver$pred[which(beaver$part == i)] = 
    as.numeric(forecast(mod, h = length(which(beaver$part == i)))$mean)
}

Bayesian cross validation

Prediction and forecast evaluations

Forecasting with arima()

  • Let’s fit an ARMA(1,1) model to the global temperature data, after 1st differencing to remove trend

  • You can use either the arima() or the Arima() function; Arima() (in the forecast package) is a wrapper for arima()

Forecasting with arima()

Seasonal component not included because these data are collected annually

library(forecast)
library(astsa)   # globtemp: annual global temperature deviations
ar.global.1 = Arima(globtemp, order = c(1,1,1),
              seasonal = list(order = c(0,0,0),   # seasonal order (0,0,0) = no seasonal terms
              period = 12))

f1 = forecast(ar.global.1, h = 10)
  • h = number of time steps to forecast into the future

Forecasting with arima()

summary(f1)
## 
## Forecast method: ARIMA(1,1,1)
## 
## Model Information:
## Series: globtemp 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.3138  -0.6975
## s.e.  0.1400   0.0989
## 
## sigma^2 estimated as 0.01035:  log likelihood=117.81
## AIC=-229.62   AICc=-229.44   BIC=-220.91
## 
## Error measures:
##                      ME      RMSE        MAE      MPE     MAPE      MASE
## Training set 0.01578913 0.1006282 0.08541372 20.22738 76.15972 0.9569172
##                     ACF1
## Training set 0.002808596
## 
## Forecasts:
##      Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## 2016      0.7942681 0.6638616 0.9246747 0.5948285 0.9937078
## 2017      0.7705020 0.6173184 0.9236855 0.5362279 1.0047760
## 2018      0.7630437 0.5967695 0.9293178 0.5087493 1.0173381
## 2019      0.7607031 0.5840228 0.9373834 0.4904939 1.0309123
## 2020      0.7599686 0.5739514 0.9459858 0.4754799 1.0444574
## 2021      0.7597381 0.5649751 0.9545011 0.4618738 1.0576024
## 2022      0.7596658 0.5565764 0.9627551 0.4490674 1.0702641
## 2023      0.7596431 0.5485686 0.9707176 0.4368325 1.0824537
## 2024      0.7596359 0.5408715 0.9784004 0.4250646 1.0942073
## 2025      0.7596337 0.5334418 0.9858256 0.4137030 1.1055644

Forecasting with arima()

plot(f1)

Quantifying forecast performance

One of the most widely used metrics is mean square error (MSE)

\[MSE=E\left[ { e }_{ t }^{ 2 } \right] =E\left[ { \left( { x }_{ t }-{ \hat { x } }_{ t } \right) }^{ 2 } \right]\]

  • Root mean squared error (RMSE) also very common

Quantifying forecast performance

Like with model selection, the bias-variance tradeoff is important

  • principle of parsimony

MSE can be rewritten as

\[MSE=Var\left( { \hat { x } }_{ t } \right) +Bias{ \left( { x }_{ t },{ \hat { x } }_{ t } \right) }^{ 2 }\]

  • Smaller MSE = lower bias + variance

Quantifying forecast performance

MSE and all forecast metrics can be calculated for

  • single data points
  • entire time series
  • future forecasts

\[MSE={ \frac { \sum _{ t=1 }^{ n }{ { \left( { x }_{ t }-{ \hat { x } }_{ t } \right) }^{ 2 } } }{ n } }\]

  • Do you care just about predicting the final outcome of a forecast, or also the trajectory to get there?

Variants of MSE

Root mean square error, RMSE (quadratic score)

  • RMSE = \(\sqrt { MSE }\)
  • on the same scale as the data
  • also referred to as RMSD, root mean square deviation

Mean absolute error, MAE (linear score)

\[E\left[ \left| { x }_{ t }-{ \hat { x } }_{ t } \right| \right]\]

Median absolute error, MdAE

\[median\left[ \left| { x }_{ t }-{ \hat { x } }_{ t } \right| \right]\]
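All of these are easy to compute directly; a sketch with short hypothetical vectors of observations and predictions:

obs = c(10, 12, 9, 14, 15)     # hypothetical observed values
pred = c(11, 11, 10, 15, 13)   # hypothetical predicted values
e = obs - pred
c(MSE = mean(e^2),
  RMSE = sqrt(mean(e^2)),
  MAE = mean(abs(e)),
  MdAE = median(abs(e)))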

Scale independent measures of performance

Scale-independent measures are better when comparing models across multiple datasets; MSE or RMSE will be dominated by whichever time series is larger in magnitude

Percent Error Statistics

Percent Error:

\[{ p }_{ t }=\frac { { e }_{ t }\cdot 100 }{ { Y }_{ t } }\]

Mean Absolute Percent Error (MAPE):

\[MAPE\quad =\quad E\left[ \left| { p }_{ t } \right| \right]\]

Root Mean Square Percent Error (RMSPE):

\[RMSPE\quad =\quad \sqrt { E\left[ { p }_{ t }^{ 2 } \right] }\]
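Using the same hypothetical obs and pred from the sketch above:

p = 100 * (obs - pred) / obs             # percent errors
c(MAPE = mean(abs(p)), RMSPE = sqrt(mean(p^2)))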

Issues with percent error statistics

  • What happens when Y = 0?

  • Distribution of percent errors tends to be highly skewed / long tails

  • MAPE tends to put higher penalty on positive errors

  • See Hyndman & Koehler (2006)

Scaled error statistics

Define scaled error as

\[{ q }_{ t }=\frac { { e }_{ t } }{ \frac { 1 }{ n-1 } \sum _{ i=2 }^{ n }{ \left| { Y }_{ i }-{ Y }_{ i-1 } \right| } }\]

  • denominator is MAE from random walk model, so performance is gauged relative to that
  • this does not allow for missing data

Absolute scaled error (ASE)

\[ASE=\left| { q }_{ t } \right|\]

Mean absolute scaled error (MASE)

\[MASE=E\left[ \left| { q }_{ t } \right| \right]\]
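Continuing the same sketch, with the denominator taken as the in-sample MAE of the one-step naive (random walk) forecast; in practice this scaling is computed from the training series:

scale = mean(abs(diff(obs)))   # (1/(n-1)) * sum of |Y_i - Y_(i-1)|
q = (obs - pred) / scale       # scaled errors; ASE values are abs(q)
c(MASE = mean(abs(q)))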

Interpreting ASE and MASE

All values are relative to the naïve random walk model

  • Values < 1 indicate better performance than RW model

  • Values > 1 indicate worse performance than RW model

Implementation in R

  • Fit an ARIMA model to ‘airmiles’, holding out the last 3 points
n = length(airmiles)
air.model = auto.arima(log(airmiles[1:(n-3)]))

Implementation in R

  • Forecast the fitted model 3 steps ahead
  • Use holdout data to evaluate accuracy
air.forecast = forecast(air.model, h = 3)
plot(air.forecast)

Implementation in R

Evaluate RMSE / MASE statistics for the 3rd (last) holdout point (the test argument selects which holdout points to use; leave it NULL to use all three)

accuracy(air.forecast, log(airmiles[(n-2):n]), test = 3)
##                  ME      RMSE       MAE       MPE     MAPE     MASE
## Test set -0.4183656 0.4183656 0.4183656 -4.051598 4.051598 2.010664

Evaluate RMSE / MASE statistics for only the 1st holdout point

accuracy(air.forecast, log(airmiles[(n-2):n]), test = 1)
##                  ME      RMSE       MAE       MPE     MAPE     MASE
## Test set -0.1987598 0.1987598 0.1987598 -1.960106 1.960106 0.955239

Performance metrics summary

Raw statistics (e.g. MSE, RMSE) shouldn’t be compared across data on different scales

Percent error metrics (e.g. MAPE) may be skewed & undefined for real zeroes

Scaled error metrics (ASE, MASE) have been shown to be more robust in meta-analyses across many datasets (Hyndman & Koehler 2006)

Questions?