13.11 NEON EFI Aquatics Challenge

So far, these models have only included oxygen. We can easily bring in the temperature data, which has it’s own standard error associated with it. We can set up a null model modeling temperature and oxygen independently, with different lags allowed

First we have to modify our data function to also generate temperature data.

create_stan_data <- function(data, last_obs, n_forecast, n_lag_o2, 
    n_lag_temp) {
    # create test data
    o2_test <- dplyr::filter(data, indx %in% seq(last_obs + 1, 
        (last_obs + n_forecast)))
    temp_test <- dplyr::filter(data, indx %in% seq(last_obs + 
        1, (last_obs + n_forecast)))

    o2_train <- dplyr::filter(data, indx <= last_obs, !is.na(oxygen))
    o2_x <- o2_train$indx
    o2_y <- o2_train$oxygen
    o2_sd <- o2_train$oxygen_sd
    n_o2 <- nrow(o2_train)

    temp_train <- dplyr::filter(data, indx <= last_obs, !is.na(temperature))
    temp_x <- temp_train$indx
    temp_y <- temp_train$temperature
    temp_sd <- temp_train$temperature_sd
    n_temp <- nrow(temp_train)

    stan_data <- list(n = last_obs, n_lag_o2 = n_lag_o2, n_lag_temp = n_lag_temp, 
        n_forecast = n_forecast, n_o2 = n_o2, o2_x = o2_x, o2_y = o2_y, 
        o2_sd = o2_sd, n_temp = n_temp, temp_x = temp_x, temp_y = temp_y, 
        temp_sd = temp_sd)

    return(list(o2_train = o2_train, temp_train = temp_train, 
        stan_data = stan_data, o2_test = o2_test, temp_test = temp_test))
}
m <- stan_model(file = "model_02.stan")

where model_02.stan is the Stan model file that you can download here.

# Now we can try iterating over sets of data with a lag 1
# model
n_forecast <- 7
n_lag <- 1
rmse <- NA
for (i in 500:(nrow(data) - n_lag)) {
    dat_list <- create_stan_data(data, last_obs = i, n_forecast = n_forecast, 
        n_lag_o2 = n_lag, n_lag_temp = n_lag)

    # fit the model. opimizing can be sensitive to starting
    # values, so let's try
    best_map <- -1e+100
    for (j in 1:10) {
        test_fit <- rstan::optimizing(m, data = dat_list$stan_data)
        if (test_fit$value > best_map) {
            fit <- test_fit
            best_map <- test_fit$value
        }
    }

    # extract forecasts
    o2_pred <- fit$par[grep("o2_forecast", names(fit$par))]
    temp_pred <- fit$par[grep("temp_forecast", names(fit$par))]

    pred <- c(o2_pred, temp_pred)
    obs <- c(dat_list$o2_test$oxygen, dat_list$temp_test$temperature)
    # evaluate predictions
    rmse[i] <- sqrt(mean((obs - pred)^2, na.rm = T))
    print(rmse)
}

Question: why do the future predictions of either temp or oxygen sometimes fall to 0, and appear to do poorly?

Question: modify the script above to loop over various oxygen and temperature lags. Using a subset of data, say observations 500:800, which combination yields the best predictions?