Packages

library(ggplot2)
library(atsalibrary)
library(tidyverse)
library(forecast)
library(rEDM)

Data

data(sockeye, package="atsalibrary")
dat <- sockeye %>% subset(region=="WOOD" & brood_year>=1960)
endtrain <- 2000
train.idx <- which(dat$brood_year <= endtrain)
test.idx <- which(dat$brood_year > endtrain)
train.year <- dat$brood_year[train.idx]
test.year <- dat$brood_year[test.idx]

Part 1 NN

The forecast for \(t\) is the spawners in \(t-1\). If this was a good model, then it would be a line where for a given value on the y-axis there would be only one value on the x-axis. It’s a bit of a cloud but let’s use it anyhow.

tdat <- subset(dat, brood_year<=endtrain)
n <- nrow(tdat)
plot(tdat$spawners[2:n], 
     tdat$spawners[1:(n-1)], 
     xlab="t (forecast)", ylab="t-1 (observation)")

Task 1

  1. Get the spawners in \(2001\) (\(t-1\)). Call it x

    yr <- 2001
    x <- subset(dat, brood_year==(yr-1))$spawners
  2. Find the 2 years in the training data where spawners are closest to x. Since I will need to get the forecasts from my training data, I don’t want to include the last year of the training data \(n\). If I did that, then I wouldn’t be able to get the forecast because \(n+1\) is not in my training data.

    You can use sort(..., index.return=TRUE).

    n <- nrow(tdat)
    idx <- sort(abs(x-tdat$spawners[1:(n-1)]), index.return=TRUE)$ix[1:2]
    tdat$brood_year[idx]
    ## [1] 1992 1975
  3. Take the mean of the spawners in the \(t+1\) years to get your forecast. Note I will use dat here in case, the best matched year is the

    fc <- mean(tdat$spawners[idx+1])
    fc
    ## [1] 998.5

    I wrote a little function to do my forecast. I pass in the year and it gives me a forecast. It finds the 2 closest years in the training data by using sort(..., index.return=TRUE).

    frnn <- function(yr, nn=2){
      x <- subset(dat, brood_year==(yr-1))$spawners
      idx <- sort(abs(x-tdat$spawners[1:(n-1)]), index.return=TRUE)$ix[1:nn]
      fc <- tdat$spawners[idx+1]
      mean(fc)
    }

The forecast for 2002 is 1565.

Task 2

Make all my forecasts.

fr <- c()
for (yr in test.year){
  fr <- c(fr, frnn(yr))
}
df <- data.frame(brood_year=(endtrain+1):2005, nn=fr)
df$observed <- dat$spawners[test.idx]

Plot the forecasts. Hmm not so good since we have trend in our data.

ggplot(df, aes(x=brood_year, y=nn)) + geom_line() +
  geom_point(aes(x=brood_year, y=observed)) +
  ggtitle("NN forecasts")

Task 3

The RMSE for the NN forecasts is

sqrt(mean((df$observed-df$nn)^2))
## [1] 318.9716

random walk forecasts

Function to get the forecasts

frrw <- function(yr){
  subset(dat, brood_year==(yr-1))$spawners
}

Add the RW forecasts to my data frame.

fr <- c()
for (yr in test.year){
  fr <- c(fr, frrw(yr))
}
df$rw <- fr

Hmm, the NN forecasts are much worse than my RW forecasts. But my NN strategy was very crude and I didn’t account for the trend in the data. I didn’t even weight by distance.

sqrt(
  c(RW=mean((df$observed-df$rw)^2),
  NN=mean((df$observed-df$nn)^2))
  )
##       RW       NN 
## 140.2854 318.9716

Ok that is the idea. Let’s try the EDM algorithm in the rEDM package.

Part 2 EDM

Empirical dynamic modeling for time series

Task 1

Use the code on slide 59 to fit the training data tdat with the Simplex EDM model.

https://nwfsc-timeseries.github.io/atsa/Lectures/Week%208/lec_15_semiparametric.html#59

mod <- rEDM::simplex(as.numeric(tdat$spawners), E=1:10)

Task 2

Let’s use E=5.

Correlation plot.

p1 <- ggplot(mod, aes(E,unlist(rho))) + 
  geom_line() + theme_bw() + geom_point() + 
  xlab("E (lags)") + ylab("rho = cor(obs,pred)")

RMSE plot.

gridExtra::grid.arrange(p1, p2, nrow = 1)

Task 4

Use the code for slide 64 to use data 1960 to 2000 for your library (training data) and data 2001 to 2005 for your prediction (test data).

lib is your training data. The function wants the index. But we will leave off the last year in our training data because we need that for our \(t+1\) forecast. The function will not return the prediction for the first year of our test data. I will pass in an extra year at the beginning of our test data.

mod <- rEDM::simplex(as.numeric(dat$spawners), 
      E=5, stats_only=FALSE,
      lib=c(min(train.idx), max(train.idx)-1), 
      pred=c(min(test.idx)-1, max(test.idx)))
# get rid of the extra years
pred <- mod$model_output$E5
pred <- pred[pred$Index %in% test.idx,]
# add to my data frame
df$simplex <- pred$Predictions
df2 <- pivot_longer(df, cols=-brood_year)
ggplot(subset(df2, name!="observed"), aes(x=brood_year, y=value, col=name)) +
  geom_point() +
  geom_line(data=subset(df2, name=="observed"), aes(x=brood_year, y=value)) +
  ggtitle("Observed vs forecasts")

Task 5

Better! Still not as good as random walk though. But this is a very unfair comparison since the EDM models only get data well in the past and the RW gets last year.

sqrt(
  c(RW=mean((df$observed-df$rw)^2),
  NN=mean((df$observed-df$nn)^2),
  Simplex=mean((df$observed-df$simplex)^2))
)
##       RW       NN  Simplex 
## 140.2854 318.9716 156.0928

Part 3 S-Map

Simplex needs a value for E. I will use E=5 from the Simplex model.

mod <- rEDM::s_map(as.numeric(tdat$spawners), 
                  E=5, stats_only=FALSE)

Use the \(\theta\) plot to look for non-linearity. There is evidence of non-linearity. Note that \(\rho\) is not very big.

ggplot(mod$stats, aes(theta, unlist(rho))) + geom_line() + geom_point() + 
  xlab("Theta") + ylab("rho = cor(obs,pred)") + theme_bw()

Let’s modify the code in the lecture to look at other E. It actually looks like E=1 is best.

sdat <- as.numeric(tdat$spawners)
mod <- rEDM::s_map(sdat, E=1, stats_only=FALSE)
smap.df <- mod$stats
smap.df$E <- 1

for(i in 2:5) {
  mod <- rEDM::s_map(sdat, E=i, stats_only=FALSE)
  temp <- mod$stats
  temp$E <- i
  smap.df <- rbind(smap.df,temp)
}
smap.df$E <- as.factor(smap.df$E)
p1 <- ggplot(smap.df, aes(theta, unlist(rho), group=E,col=E)) + 
  geom_line() + geom_point() + 
  xlab("Theta") + ylab("rho = cor(obs,pred)")

p2 <- ggplot(smap.df, aes(theta, unlist(rmse), group=E,col=E)) + 
  geom_line() + geom_point() + 
  xlab("Theta") + ylab("RMSE")

gridExtra::grid.arrange(p1, p2, nrow=1)

S-Map predictions

We can do predictions in the same way we did with Simplex. I will use E=1 and theta=1 here.

mod <- rEDM::s_map(as.numeric(dat$spawners), 
      E=1, theta=1, stats_only=FALSE,
      lib=c(min(train.idx), max(train.idx)-1), 
      pred=c(min(test.idx)-1, max(test.idx)))

Add the predictions to my data frame.

# get rid of the extra years
pred <- mod$model_output$theta1
pred <- pred[pred$Index %in% test.idx,]
# add to my data frame
df$smap <- pred$Predictions

Make a plot.

df2 <- pivot_longer(df, cols=-brood_year)
ggplot(subset(df2, name!="observed"), aes(x=brood_year, y=value, col=name)) +
  geom_point() +
  geom_line(data=subset(df2, name=="observed"), aes(x=brood_year, y=value)) +
  ggtitle("Observed vs forecasts")

Finally compared the RMSE. S-Map didn’t do so well.

sqrt(
c(RW=mean((df$observed-df$rw)^2),
  NN=mean((df$observed-df$nn)^2),
  Simplex=mean((df$observed-df$simplex)^2),
  SMap.E1=mean((df$observed-df$smap)^2))
)
##       RW       NN  Simplex  SMap.E1 
## 140.2854 318.9716 156.0928 223.6768

I will look at various E and \(\theta\) and compare the RMSE for the predictions. It looks like for the predictions E=5, \(\theta=2\) is good.

sdat <- dat$spawners
mod <- rEDM::s_map(sdat, 
      E=1, stats_only=FALSE,
      lib=c(min(train.idx), max(train.idx)-1), 
      pred=c(min(test.idx)-1, max(test.idx)))
smap.df <- mod$stats
smap.df$E <- 1

for(i in 2:5) {
  mod <- rEDM::s_map(sdat, 
      E=i, stats_only=FALSE,
      lib=c(min(train.idx), max(train.idx)-1), 
      pred=c(min(test.idx)-1, max(test.idx)))
  temp <- mod$stats
  temp$E <- i
  smap.df <- rbind(smap.df,temp)
}
smap.df$E <- as.factor(smap.df$E)
p1 <- ggplot(smap.df, aes(theta, unlist(rho), group=E,col=E)) + 
  geom_line() + geom_point() + 
  xlab("Theta") + ylab("rho = cor(obs,pred)")

p2 <- ggplot(smap.df, aes(theta, unlist(rmse), group=E,col=E)) + 
  geom_line() + geom_point() + 
  xlab("Theta") + ylab("RMSE")

gridExtra::grid.arrange(p1, p2, nrow=1)

I will add that to the model.

mod <- rEDM::s_map(as.numeric(dat$spawners), 
      E=5, theta=2, stats_only=FALSE,
      lib=c(min(train.idx), max(train.idx)-1), 
      pred=c(min(test.idx)-1, max(test.idx)))
pred <- mod$model_output$theta2
pred <- pred[pred$Index %in% test.idx,]
# add to my data frame
df$smap2 <- pred$Predictions

That was data-dredging on my part to pick the model based on how it performed in the test data. But the best S-Map model did perform best. Note, I didn’t try the same thing with the Simplex data however.

sqrt(
c(RW=mean((df$observed-df$rw)^2),
  NN=mean((df$observed-df$nn)^2),
  Simplex=mean((df$observed-df$simplex)^2),
  SMap.E1=mean((df$observed-df$smap)^2),
  SMap.E15=mean((df$observed-df$smap2)^2))
)
##       RW       NN  Simplex  SMap.E1 SMap.E15 
## 140.2854 318.9716 156.0928 223.6768 118.5397

Summary. Compare to ARIMA

As an experiment, I will try an ARIMA model.

testdat <- dat[test.idx,]
mod <- forecast::auto.arima(as.numeric(tdat$spawners))
pred <- forecast::forecast(mod, h=5)
df$arima <- as.data.frame(pred)[,1]

Best model selected by auto.arima is a random walk with autocorrelated error (ARIMA(0,1,1)). That model is considerably better than the others even though for the forecast I did not use the test data at all. The forecast is just 5 years past the training data.

sqrt(
c(RW=mean((df$observed-df$rw)^2),
  NN=mean((df$observed-df$nn)^2),
  Simplex=mean((df$observed-df$simplex)^2),
  SMap.E1=mean((df$observed-df$smap)^2),
  SMap.E15=mean((df$observed-df$smap2)^2),
  ARIMA=mean((df$observed-df$arima)^2))
)
##       RW       NN  Simplex  SMap.E1 SMap.E15    ARIMA 
## 140.2854 318.9716 156.0928 223.6768 118.5397  90.0254