- Approaches for model selection
- Cross validation
- Quantifying forecast performance
January 29 2019
Several candidate models might be built based on different hypotheses about the underlying process, different covariates, or different error structures
Models can be evaluated by their ability to explain data
We can illustrate this with an example using the harborSealWA dataset in the MARSS package
Metrics like the sum of squares are appealing because they are simple to compute, but they always favor more complex models: adding parameters can only reduce the residual SS
\[SS=\sum _{ i=1 }^{ n }{ { \left( y_{ i }-E[{ y }_{ i }] \right) }^{ 2 } } \]
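To make this concrete, the SS can be computed directly from a fitted regression. The lm() call below is the one summarized in the output that follows; the as.data.frame() step is just a precaution in case the data are stored as a matrix.

library(MARSS)                               # provides the harborSealWA data (log seal counts)
data(harborSealWA)
harborSealWA = as.data.frame(harborSealWA)   # coerce in case it is stored as a matrix
fit = lm(SJI ~ Year, data = harborSealWA)    # linear trend in log counts at San Juan Islands
sum(resid(fit)^2)                            # residual sum of squares, SS
summary(fit)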
## 
## Call:
## lm(formula = SJI ~ Year, data = harborSealWA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46099 -0.08022  0.06576  0.13286  0.21464 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.392e+02  1.601e+01  -8.697 1.85e-07 ***
## Year         7.397e-02  8.043e-03   9.197 8.69e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 16 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.8409, Adjusted R-squared:  0.831 
## F-statistic: 84.58 on 1 and 16 DF,  p-value: 8.693e-08
Our regression model had a pretty good sum-of-squares
Lots of metrics have been developed to overcome this issue and penalize complex models
Occam’s razor: the law of parsimony (prefer the simplest adequate model)
Akaike’s Information Criterion (AIC, Akaike 1973)
Attempts to balance the goodness of fit of the model against the number of parameters
Based on the deviance = minus twice the log likelihood
Deviance = \[-2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right)\]
Many *IC approaches to model selection also rely on deviance. Where they differ is how they structure the penalty term.
For AIC, the penalty is 2 * number of parameters (\(k\)),
\[AIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + 2k\]
Small sample AIC
\[AICc=AIC+\frac { 2k(k+1) }{ n-k-1 }\]
AIC aims to find the best model to predict data generated from the same process that generated your observations
Downside: AIC has a tendency to select overly complex models (it can under-penalize), especially as datasets grow
Alternative: Schwarz/Bayesian Information Criterion (SIC/BIC)
BIC is a measure of explanatory power (rather than balancing explanation / prediction)
\[BIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + k\cdot ln(n)\]
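As a sketch, all three criteria can be computed by hand from the log likelihood and checked against R's built-in AIC() and BIC() functions (the simulated random walk y is just an example):

set.seed(42)
y = cumsum(rnorm(20))                             # example series: a random walk
fit = lm(y ~ 1)
n = length(y)
k = attr(logLik(fit), "df")                       # parameters counted by logLik (mean + sigma)
aic  = -2 * as.numeric(logLik(fit)) + 2 * k       # matches AIC(fit)
aicc = aic + 2 * k * (k + 1) / (n - k - 1)        # small-sample correction
bic  = -2 * as.numeric(logLik(fit)) + k * log(n)  # matches BIC(fit)
c(AIC = aic, AICc = aicc, BIC = bic)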
Philosophical differences between AIC / BIC: AIC approximates out-of-sample predictive ability (no model is assumed to be "true"), while BIC assumes the true model is among the candidates and tries to identify it; BIC's penalty grows with sample size, so it favors simpler models
Many base functions in R support the extraction of AIC
y = cumsum(rnorm(20))
AIC(lm(y ~ 1))
AIC(glm(y ~ 1))
AIC(mgcv::gam(y ~ 1))
AIC(glmmTMB::glmmTMB(y ~ 1))
g = rep(1:4, each = 5)               # lmer() needs at least one random effect
AIC(lme4::lmer(y ~ 1 + (1 | g)))
AIC(stats::arima(y))
AIC(forecast::Arima(y))
Bayesian model selection is largely beyond the scope of this class, but as an overview there are several options
DIC (the Deviance Information Criterion) also attempts to balance bias and variance
Like AIC, DIC is estimated from (1) the deviance, and (2) a penalty, the effective number of parameters \(p_{D}\).
\[DIC = -2\cdot ln\left( L(\underline { \theta } |\underline { y } ) \right) + 2p_{D}\]
Options for \(p_{D}\) are
\(E[D]-D\left( \bar { \theta } \right)\) (Spiegelhalter et al. 2002) and
\(\frac { Var(D\left( \theta \right) ) }{ 2 }\) (Gelman et al. 2004)
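A toy sketch of how \(p_{D}\) and DIC are computed from posterior draws, assuming a normal model with known variance and a flat prior so the posterior can be sampled directly (no MCMC software needed):

set.seed(1)
y = rnorm(25, mean = 2, sd = 1)                  # data, known sd = 1
n = length(y)
theta = rnorm(4000, mean(y), sqrt(1 / n))        # draws from the posterior of the mean
dev = sapply(theta, function(m) -2 * sum(dnorm(y, m, 1, log = TRUE)))  # deviance per draw
D_bar = mean(dev)                                # E[D]
D_at_mean = -2 * sum(dnorm(y, mean(theta), 1, log = TRUE))             # D(theta_bar)
pD_1 = D_bar - D_at_mean                         # Spiegelhalter et al. (2002)
pD_2 = var(dev) / 2                              # Gelman et al. (2004)
c(pD_1 = pD_1, pD_2 = pD_2, DIC = D_at_mean + 2 * pD_1)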
The big difference between the Bayesian and maximum likelihood approaches is that ML criteria evaluate the likelihood at point estimates of the parameters, whereas the Bayesian criteria use the full posterior distribution
Many of the ML methods discussed were designed for models with only fixed effects.
There has been a recent focus in ecology & fisheries on prediction, which motivates resampling-based approaches such as:
Jackknife
Bootstrap
Cross-validation (k-fold)
Data split: test/training sets (e.g. hold out the last 5 data points, as sketched below)
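A minimal sketch of the data-split idea, holding out the last 5 points of a simulated series (the series and the ARIMA order are arbitrary choices for illustration):

set.seed(1)
y = cumsum(rnorm(40))
train = y[1:35]
test  = y[36:40]
fit  = forecast::Arima(train, order = c(1, 1, 0))   # fit to the training data only
pred = forecast::forecast(fit, h = 5)$mean          # forecast the 5 held-out points
sqrt(mean((test - pred)^2))                         # out-of-sample RMSE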
Bootstrap or jackknife approaches are useful, and are generally used in the context of time series models to generate new or pseudo-datasets
For state space models: use the estimated deviations / errors to simulate new datasets and estimate CIs
Examples
MARSS::MARSSboot()
MARSS::MARSSinnovationsboot()
forecast::bld.mbb.bootstrap()
forecast::forecast(..., bootstrap=TRUE)
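For instance, a small sketch using the built-in WWWusage series (the series and model choices here are arbitrary, not from the original slides):

library(forecast)
# generate bootstrapped versions of a series (Box-Cox + Loess decomposition with a
# moving block bootstrap of the remainder); the first list element is the original series
boot_series = bld.mbb.bootstrap(WWWusage, num = 10)

# bootstrapped (simulation-based) prediction intervals rather than normal-theory ones
fit = auto.arima(WWWusage)
fc  = forecast(fit, h = 10, bootstrap = TRUE)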
As an example, we'll use a time series of body temperature from the beavers dataset (specifically the beaver2 series)

data(beavers)                                  # loads beaver1 and beaver2 from the datasets package
beaver = dplyr::filter(beaver2, time > 200)    # keep observations after time 200
library(ggplot2)
ggplot(beaver, aes(time, temp)) + 
  geom_point() + 
  geom_line() + 
  ylab("Temperature") + 
  xlab("Time")
Things to highlight: the number of folds, and how the sampling is done
K = 5
# randomly assign each observation to one of K folds
beaver$part = sample(1:K, size = nrow(beaver), replace = TRUE)
beaver$pred = 0
for(i in 1:K) {
  # fit to all folds except the i-th, then predict the held-out fold
  mod = lm(temp ~ time, data = beaver[beaver$part != i, ])
  beaver$pred[beaver$part == i] = predict(mod, newdata = beaver[beaver$part == i, ])
}
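Continuing the chunk above, one natural summary of the held-out predictions is an out-of-sample RMSE (this line is a sketch, not part of the original code):

# every observation was predicted by a model fit without that observation
sqrt(mean((beaver$temp - beaver$pred)^2))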
How large should K be?
Bias/variance tradeoff:
Low K: lower variance but larger bias, quicker to run; machine learning practice generally recommends K of 5-10
High K (the extreme being LOOCV): low bias, high variance, computationally expensive
To remove effect of random sampling / partitioning, repeat K-fold cross validation and average predictions for a given data point
The caret package in R handles this:
train_control = caret::trainControl(method="repeatedcv", number=5, repeats=20)
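The model-fitting step isn't shown here; with caret it would look something like the following sketch (the object name model matches the summary() call below, but the exact formula and method are assumptions):

# linear regression of temperature on time, evaluated with 5-fold CV repeated 20 times
model = caret::train(temp ~ time, data = beaver,
                     method = "lm", trControl = train_control)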
summary(model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5553 -0.2160 -0.1019  0.2139  0.6868 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.613e+01  1.266e-01  285.30   <2e-16 ***
## time        8.795e-04  7.441e-05   11.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2905 on 85 degrees of freedom
## Multiple R-squared:  0.6218, Adjusted R-squared:  0.6173 
## F-statistic: 139.7 on 1 and 85 DF,  p-value: < 2.2e-16
What about for time series data?
Idea: only evaluate models on subsequent data
Example with an ARIMA(1,1,1) model

# assign each observation to one of 5 contiguous, time-ordered blocks
beaver$part = ceiling(5 * seq(1, nrow(beaver)) / nrow(beaver))
Example code:
library(forecast)
for(i in 2:5) {
  # fit to all blocks before block i, then forecast block i
  mod = Arima(beaver$temp[which(beaver$part < i)], order = c(1, 1, 1))
  beaver$pred[which(beaver$part == i)] = 
    as.numeric(forecast(mod, h = length(which(beaver$part == i)))$mean)
}
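Continuing the loop above, forecast skill over the evaluation blocks could be summarized with an out-of-sample RMSE (a sketch):

# blocks 2-5 were never used to fit the models that predicted them
idx = which(beaver$part > 1)
sqrt(mean((beaver$temp[idx] - beaver$pred[idx])^2))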
LOOCV (leave-one-out cross validation) is generally preferred over alternatives
WAIC (widely applicable information criterion) is a fast, asymptotically equivalent alternative
Both are implemented in the loo package: loo::loo() and loo::waic()
Additional reading: https://cran.r-project.org/web/packages/loo/vignettes/loo2-example.html
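As a sketch of the workflow, assuming the beaver model is refit in a Bayesian framework (rstanarm is one option; any package that returns pointwise log likelihoods would work):

library(rstanarm)
library(loo)
fit_bayes = stan_glm(temp ~ time, data = beaver, refresh = 0)  # Bayesian linear regression
loo(fit_bayes)    # PSIS approximation to leave-one-out cross validation
waic(fit_bayes)   # WAIC, for comparison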
Let’s fit an ARMA(1,1) model to the global temperature data, after 1st differencing to remove trend
You can use the arima() function or the Arima() function; Arima() from the forecast package is a wrapper for arima()
Seasonal component not included because these data are collected annually
library(astsa)      # globtemp: annual global temperature deviations
library(forecast)
# seasonal order is (0,0,0), so the seasonal argument has no effect here
ar.global.1 = Arima(globtemp, order = c(1, 1, 1),
                    seasonal = list(order = c(0, 0, 0), period = 12))
f1 = forecast(ar.global.1, h = 10)
summary(f1)
## 
## Forecast method: ARIMA(1,1,1)
## 
## Model Information:
## Series: globtemp 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.3138  -0.6975
## s.e.  0.1400   0.0989
## 
## sigma^2 estimated as 0.01035:  log likelihood=117.81
## AIC=-229.62   AICc=-229.44   BIC=-220.91
## 
## Error measures:
##                      ME      RMSE        MAE      MPE     MAPE      MASE
## Training set 0.01578913 0.1006282 0.08541372 20.22738 76.15972 0.9569172
##                     ACF1
## Training set 0.002808596
## 
## Forecasts:
##      Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## 2016      0.7942681 0.6638616 0.9246747 0.5948285 0.9937078
## 2017      0.7705020 0.6173184 0.9236855 0.5362279 1.0047760
## 2018      0.7630437 0.5967695 0.9293178 0.5087493 1.0173381
## 2019      0.7607031 0.5840228 0.9373834 0.4904939 1.0309123
## 2020      0.7599686 0.5739514 0.9459858 0.4754799 1.0444574
## 2021      0.7597381 0.5649751 0.9545011 0.4618738 1.0576024
## 2022      0.7596658 0.5565764 0.9627551 0.4490674 1.0702641
## 2023      0.7596431 0.5485686 0.9707176 0.4368325 1.0824537
## 2024      0.7596359 0.5408715 0.9784004 0.4250646 1.0942073
## 2025      0.7596337 0.5334418 0.9858256 0.4137030 1.1055644
plot(f1)
One of the most widely used metrics is the mean square error (MSE)
\[MSE=E\left[ { e }_{ t }^{ 2 } \right] =E\left[ { \left( { x }_{ t }-{ \hat { x } }_{ t } \right) }^{ 2 } \right]\]
Like with model selection, the bias-variance tradeoff is important
MSE can be rewritten as
\[MSE=Var\left( { \hat { x } }_{ t } \right) +Bias{ \left( { x }_{ t },{ \hat { x } }_{ t } \right) }^{ 2 }\]
Smaller MSE = lower bias + variance
MSE and all of the forecast metrics below can be calculated for individual forecasts or averaged over many; for \(n\) forecasts the sample estimate is
\[MSE={ \frac { \sum _{ t=1 }^{ n }{ { \left( { x }_{ t }-{ \hat { x } }_{ t } \right) }^{ 2 } } }{ n } }\]
Root mean square error, RMSE (quadratic score)
Mean absolute error, MAE (linear score) \[E\left[ \left| { x }_{ t }-{ \hat { x } }_{ t } \right| \right]\]
Median absolute error, MdAE
\[median\left[ \left| { x }_{ t }-{ \hat { x } }_{ t } \right| \right]\]
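A quick sketch computing these scale-dependent metrics by hand, using hypothetical observation and forecast vectors (the numbers are made up for illustration):

obs  = c(0.62, 0.70, 0.68, 0.75, 0.80)   # hypothetical observations
pred = c(0.65, 0.66, 0.71, 0.78, 0.74)   # hypothetical forecasts
e = obs - pred
c(MSE  = mean(e^2),
  RMSE = sqrt(mean(e^2)),
  MAE  = mean(abs(e)),
  MdAE = median(abs(e)))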
Scale-independent metrics are better when applying statistics of model(s) to multiple datasets, because MSE or RMSE will be driven by whichever time series is larger in magnitude
Percent Error:
\[{ p }_{ t }=\frac { { e }_{ t }\cdot 100 }{ { Y }_{ t } }\]
Mean Absolute Percent Error (MAPE):
\[MAPE = E\left[ \left| { p }_{ t } \right| \right]\]
Root Mean Square Percent Error (RMSPE):
\[RMSPE\quad =\quad \sqrt { E\left[ { p }_{ t }^{ 2 } \right] }\]
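And the percent-error metrics, using the same style of hypothetical values:

obs  = c(0.62, 0.70, 0.68, 0.75, 0.80)   # hypothetical observations
pred = c(0.65, 0.66, 0.71, 0.78, 0.74)   # hypothetical forecasts
p = 100 * (obs - pred) / obs             # percent errors
c(MAPE  = mean(abs(p)),
  RMSPE = sqrt(mean(p^2)))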
What happens when Y = 0?
Distribution of percent errors tends to be highly skewed / long tails
MAPE tends to put higher penalty on positive errors
See Hyndman & Koehler (2006)
Define scaled error as
\[{ q }_{ t }=\frac { { e }_{ t } }{ \frac { 1 }{ n-1 } \sum _{ i=2 }^{ n }{ \left| { Y }_{ i }-{ Y }_{ i-1 } \right| } }\]
Absolute scaled error (ASE)
\[ASE=\left| { q }_{ t } \right|\]
Mean absolute scaled error (MASE)
\[MASE=E\left[ \left| { q }_{ t } \right| \right]\]
All values are relative to the naïve random walk model
Values < 1 indicate better performance than RW model
Values > 1 indicate worse performance than RW model
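A sketch of the scaled-error calculation for a hypothetical series Y and forecasts F_hat (made-up values, for illustration only):

Y     = c(10, 12, 11, 13, 15, 14)   # hypothetical series
F_hat = c(11, 11, 12, 12, 14, 15)   # hypothetical forecasts for the same points
e = Y - F_hat
scale = mean(abs(diff(Y)))     # mean absolute error of the naive random walk forecast
q = e / scale                  # scaled errors
c(MASE = mean(abs(q)))         # < 1 means better than the naive RW forecasts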
library(forecast)
n = length(airmiles)
# fit to the log of the series, holding out the last 3 observations
air.model = auto.arima(log(airmiles[1:(n - 3)]))
air.forecast = forecast(air.model, h = 3)
plot(air.forecast)
Evaluate RMSE / MASE statistics for the last of the 3 holdouts (test = 3 selects the third test-set element)
accuracy(air.forecast, log(airmiles[(n-2):n]), test = 3)
##                  ME      RMSE       MAE       MPE     MAPE     MASE
## Test set -0.4183656 0.4183656 0.4183656 -4.051598 4.051598 2.010664
Evaluate RMSE / MASE statistics for only the first holdout (test = 1)
accuracy(air.forecast, log(airmiles[(n-2):n]), test = 1)
##                  ME      RMSE       MAE       MPE     MAPE     MASE
## Test set -0.1987598 0.1987598 0.1987598 -1.960106 1.960106 0.955239
Raw statistics (e.g. MSE, RMSE) shouldn't be compared across data measured on different scales
Percent error metrics (e.g. MAPE) may be highly skewed and are undefined when the true value is zero
Scaled error metrics (ASE, MASE) have been shown to be more robust for meta-analyses across many datasets (Hyndman & Koehler 2006)