13.10 NEON EFI Aquatics Challenge
In generating forecasts, it’s often a good idea to use multiple training / test sets. One question with this challenge is what’s the best lag (or order of AR model) for generating forecasts? Starting with lag-1, we can evaluate the effects of different lags by starting at some point in the dataset (e.g. using 50% of the observations), and generate and evaluate forecasts. We then iterate, adding a day of data, generating and evaluating forecasts, and repeating through the rest of the data. [An alternative approach would be to adopt a moving window, where each prediction would be based on the same number of historical data points].
I’ve generalized the code above into a function that takes the initial Lake Barco dataset, and
<- function(data, last_obs, n_forecast, n_lag) {
create_stan_data <- dplyr::filter(data, indx %in% seq(last_obs + 1,
o2_test + n_forecast)))
(last_obs
<- dplyr::filter(data, indx <= last_obs, !is.na(oxygen))
o2_train <- o2_train$indx
o2_x <- o2_train$oxygen
o2_y <- o2_train$oxygen_sd
o2_sd <- nrow(o2_train)
n_o2
<- list(n = last_obs, n_o2 = n_o2, n_lag = n_lag,
stan_data n_forecast = n_forecast, o2_x = o2_x, o2_y = o2_y, o2_sd = o2_sd)
return(list(train = o2_train, stan_data = stan_data, test = o2_test))
}
Now we can try iterating over sets of data with a lag 1 model
<- 7
n_forecast <- 1
n_lag <- NA
rmse for (i in 500:(nrow(data) - n_lag)) {
<- create_stan_data(data, last_obs = i, n_forecast = n_forecast,
dat_list n_lag = n_lag)
# fit the model. opimizing can be sensitive to starting
# values, so let's try
<- -1e+100
best_map for (j in 1:10) {
<- rstan::optimizing(m, data = dat_list$stan_data)
test_fit if (test_fit$value > best_map) {
<- test_fit
fit <- test_fit$value
best_map
}
}if (fit$return_code == 0) {
# extract forecasts
<- fit$par[grep("forecast", names(fit$par))]
pred
# evaluate predictions
<- sqrt(mean((dat_list$test$oxygen - pred)^2,
rmse[i] na.rm = T))
} }