## 13.10 NEON EFI Aquatics Challenge

In generating forecasts, it’s often a good idea to use multiple training / test sets. One question with this challenge is what’s the best lag (or order of AR model) for generating forecasts? Starting with lag-1, we can evaluate the effects of different lags by starting at some point in the dataset (e.g. using 50% of the observations), and generate and evaluate forecasts. We then iterate, adding a day of data, generating and evaluating forecasts, and repeating through the rest of the data. [An alternative approach would be to adopt a moving window, where each prediction would be based on the same number of historical data points].

I’ve generalized the code above into a function that takes the initial Lake Barco dataset, and

create_stan_data <- function(data, last_obs, n_forecast, n_lag) {
o2_test <- dplyr::filter(data, indx %in% seq(last_obs + 1,
(last_obs + n_forecast)))

o2_train <- dplyr::filter(data, indx <= last_obs, !is.na(oxygen))
o2_x <- o2_train$indx o2_y <- o2_train$oxygen
o2_sd <- o2_train$oxygen_sd n_o2 <- nrow(o2_train) stan_data <- list(n = last_obs, n_o2 = n_o2, n_lag = n_lag, n_forecast = n_forecast, o2_x = o2_x, o2_y = o2_y, o2_sd = o2_sd) return(list(train = o2_train, stan_data = stan_data, test = o2_test)) } Now we can try iterating over sets of data with a lag 1 model n_forecast <- 7 n_lag <- 1 rmse <- NA for (i in 500:(nrow(data) - n_lag)) { dat_list <- create_stan_data(data, last_obs = i, n_forecast = n_forecast, n_lag = n_lag) # fit the model. opimizing can be sensitive to starting # values, so let's try best_map <- -1e+100 for (j in 1:10) { test_fit <- rstan::optimizing(m, data = dat_list$stan_data)
if (test_fit$value > best_map) { fit <- test_fit best_map <- test_fit$value
}
}
if (fit$return_code == 0) { # extract forecasts pred <- fit$par[grep("forecast", names(fit$par))] # evaluate predictions rmse[i] <- sqrt(mean((dat_list$test\$oxygen - pred)^2,
na.rm = T))
}
}