Lab 1: Forecasting with ARIMA models

Author

Zoe Rand, Madison Shipley, Emma Timmins-Schiffman

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(forecast)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Code
library(here)
here() starts at /Users/eli.holmes/Documents/GitHub/fish550-2023
Code
library(cowplot)

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp

Data

We chose to work with all of the data so that we could compare the ability of ARIMA models to forecast across different patterns of population dynamics and different sizes of training data sets.

Code
ruggerone_data <- readRDS(here::here("Lab-1", "Data_Images", "ruggerone_data.rds"))


Question your team will address

  1. If we approached these data as stock managers, how confident could we be in projected return data 5, 10, or 20 years into the future?
  2. What types of population patterns are ARIMA models best suited for?
  3. Is one species or region “easier” to forecast than others?

Initial plan

For each species we will subset by region and test for stationarity. Then, for forecast horizons of 5, 10, and 20 years in each region, we will run auto.arima. We will look at the forecasts and assess accuracy using RMSE to determine what forecast horizon could be appropriate for management use.

We will pick a couple of regions for each species to demonstrate the ACF and PACF, and examine the residuals of the fitted models.

What we actually did

It was harder to compare models across species and regions than we assumed, so we created functions to streamline the process. After researching forecast accuracy metrics, we also decided to use MASE instead of RMSE. It was difficult to decipher why some regions were easily forecasted while others returned unreliable results. Additionally, each species had different dynamics, and we had to vary our methods to find models that worked.
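
For reference, MASE scales the out-of-sample mean absolute error by the in-sample mean absolute error of a one-step naive forecast, so values below 1 beat the naive benchmark. A minimal sketch of the calculation (the `test`, `fc`, and `train` arguments are placeholders; in practice `forecast::accuracy()` computes this for us):

Code
#Hand-rolled MASE for a non-seasonal series: the scale is the in-sample MAE
#of a one-step naive (random walk) forecast, so MASE < 1 beats that benchmark
mase <- function(test, fc, train){
  scale <- mean(abs(diff(train))) #naive in-sample MAE
  mean(abs(test - fc)) / scale    #scaled out-of-sample MAE
}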

Sockeye: Combined Regions

Subset and format the data to analyze just sockeye.

Code
SockByRegion<-ruggerone_data %>%
  filter(region != "japan") %>%
  filter(region != "korea") %>%
  group_by(species, region, year) %>%
  summarize(total = sum(returns, na.rm=TRUE)) %>% 
  mutate(lnreturns = log(total)) %>%
  filter(species == "sockeye")
`summarise()` has grouped output by 'species', 'region'. You can override using
the `.groups` argument.

Create a time series object, train on the first 59 years of data (1952-2010), and forecast the last 5 years (2011-2015).

Code
sockeye.ts<-ts(SockByRegion$lnreturns, start=SockByRegion$year[1])

train.sockeye<-window(sockeye.ts, start=1952, end=2010)
test.sockeye<-window(sockeye.ts, start=2011, end=2015)

Assess the ACF and PACF of the training data set.
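
The diagnostic plots are not reproduced here; a minimal sketch of the calls, assuming the `train.sockeye` series created above:

Code
#ACF and PACF of the sockeye training series
par(mfrow=c(1,2))
acf(train.sockeye)
pacf(train.sockeye)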

Look at the options for fitting an ARIMA to the data and then choose a final model. The full (non-stepwise) search selects ARIMA(0,1,2); however, the stepwise trace below shows that ARIMA(1,1,1) is also a good fit, with AICc within about 0.3 of the ARIMA(0,1,2). The full dataset requires differencing (d = 1).

Code
fit <- forecast::auto.arima(train.sockeye, trace=T)

 ARIMA(2,1,2) with drift         : Inf
 ARIMA(0,1,0) with drift         : 61.95027
 ARIMA(1,1,0) with drift         : 60.11204
 ARIMA(0,1,1) with drift         : 57.27974
 ARIMA(0,1,0)                    : 59.87487
 ARIMA(1,1,1) with drift         : Inf
 ARIMA(0,1,2) with drift         : 57.23854
 ARIMA(1,1,2) with drift         : Inf
 ARIMA(0,1,3) with drift         : Inf
 ARIMA(1,1,3) with drift         : Inf
 ARIMA(0,1,2)                    : 55.53198
 ARIMA(0,1,1)                    : 55.33664
 ARIMA(1,1,1)                    : 55.19379
 ARIMA(1,1,0)                    : 57.99975
 ARIMA(2,1,1)                    : 57.46972
 ARIMA(1,1,2)                    : 57.48549
 ARIMA(2,1,0)                    : 58.91271
 ARIMA(2,1,2)                    : Inf

 Best model: ARIMA(1,1,1)                    
Code
fit.final.sock<-forecast::auto.arima(train.sockeye, approximation = F, stepwise = F)

Plot the 5 year forecast against the actual data for the last part of the dataset. The real data are represented by the black dots; the forecast is represented by the black line.
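
The plot is not reproduced here; a sketch using the same autoplot/fortify idiom as the later sections:

Code
#5 year forecast (line and intervals) against the held-out data (points)
fit.final.sock %>%
  forecast(h=5) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.sockeye))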

Assess how well forecasting performs for sockeye returns by region

Create objects needed for plotting.

Code
regions<-unique(SockByRegion$region)
regionskey<-c("Cook Inlet", "E. Kamchatka", "Kodiak", "Russia", "N.British Columbia",
              "Prince William Sound", "S. Alaska Pen.", "S. British Columbia", "SE Alaska", "W. Kamchatka", "Washington", "W. Alaska")
names(regionskey)<-regions
forecastlevels<-c(5, 10, 20)
Allcombs<-expand_grid(regions, forecastlevels)

Plot ACF and PACF for each region.

Code
ACFandPACF<-function(reg){
  Sockdat<-SockByRegion %>% filter(region == reg)
  #create time series
  datts <- ts(Sockdat$lnreturns, start=Sockdat$year[1])
  return(list(a = acf(datts, plot = FALSE), p = pacf(datts, plot = FALSE)))
}
Code
DiagPlots<-lapply(regions, ACFandPACF)
names(DiagPlots)<-regions

ACF plots

PACF plots
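
The panels themselves are not reproduced here; they were drawn with the same loops used in the chum section below, e.g. for the ACFs:

Code
#ACF plots for each region (swap [[1]] for [[2]] to get the PACFs)
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots[[r]][[1]], main = paste0("Region: ", regionskey[r]))
}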

Fit ARIMA models to each region.

Code
FitModFunction<-function(reg, forelevel){
  #filter region
  Sockdat<-SockByRegion %>% filter(region == reg)
  #create time series
  datts <- ts(Sockdat$lnreturns, start=Sockdat$year[1]) #this assumes the first year in data is the start of the time series (they are in order) 
  cutoff<-2015-forelevel
  train <- window(datts, 1952, cutoff)
  test <- window(datts, cutoff+1, 2015)
  
  mod <- auto.arima(train)
  
  #testing to be sure that this is the best model (is the best model the simplest if it is within 2 AIC values?)
  trace <- capture.output({
    # assign so it doesn't pollute the output
    model <- auto.arima(datts, trace = TRUE)
  })
  con    <- textConnection(trace)
  models <- read.table(con, sep=":")
  close(con)
  
  #getting the "best models" that are within 2 AIC units
  BestMods<-models%>% filter(row_number() != nrow(models)) %>% mutate(AIC = replace(V2, V2 == "Inf", 99999), AIC = as.numeric(AIC), DeltaAIC = AIC-min(AIC)) %>% filter(DeltaAIC <= 2.0)
  for(i in 1:nrow(BestMods)){
    BestMods$Mod[i]<-strsplit(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
    BestMods$npar[i]<-sum(as.numeric(BestMods$Mod[i][[1]][c(1,3)]))
    if(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][2] == " with drift         "){
      BestMods$npar[i] = BestMods$npar[i] + 1
    }
  }
  
  New<-BestMods %>% filter(npar == min(npar))
  if(0 %in% New$DeltaAIC){
      #auto arima picked the best model
      res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }else{
      #of the models with the fewest parameters, pick the lowest AIC
      newmod<-New %>% filter(AIC == min(AIC)) %>% select(Mod)
      mod<-Arima(train, order = as.numeric(strsplit(newmod$Mod[[1]], "[,]")), include.constant = TRUE)
      res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
    }

  return(list(Fit = mod, MASE = res, Bm = BestMods)) #include best mods for testing to see that it's doing what I want
}
RegionModsSock<-mapply(FitModFunction, Allcombs$regions, Allcombs$forecastlevels, SIMPLIFY = FALSE)

Extract MASE for comparisons of models across regions.

Code
RegionMASESock<-sapply(RegionModsSock, function(x){y<-x$MASE})
RegionBestModSock<-sapply(RegionModsSock, function(x){y<-as.character(x$Fit)})

ResultsTableSock<-Allcombs %>% add_column(Model = RegionBestModSock, MASE = RegionMASESock)

How well does auto.arima do in choosing a model? Is it different from what we would choose looking at ACF and PACF? For Cook Inlet, auto.arima selected ARIMA(4,1,1), but PACF has a significant lag at 6 and ACF trails off.

Code
sock.ci<-subset(SockByRegion, region=='ci')
ci.ts<-ts(sock.ci$lnreturns, start=sock.ci$year[1])
forecast::ndiffs(ci.ts, test='adf')
[1] 1
Code
forecast::ndiffs(ci.ts, test='kpss')
[1] 1
Code
train.ci5<-window(ci.ts, start=1952, end=2010)
test.ci5<-window(ci.ts, start=2011, end=2015)
ci.final5<-forecast::auto.arima(train.ci5, approximation = F, stepwise = F)
accuracy(forecast(ci.final5, h=5), test.ci5)
                     ME      RMSE       MAE       MPE     MAPE      MASE
Training set 0.02471079 0.3370305 0.2702999 -38.06705 62.87923 0.8967555
Test set     0.34919358 0.4082214 0.3491936  17.84353 17.84353 1.1584957
                     ACF1 Theil's U
Training set  0.008040561        NA
Test set     -0.080588177  1.829245
Code
#MASE is just above 1

acf(train.ci5)

Code
pacf(train.ci5)

Code
#select ARIMA (6,1,0)
fit.ci2 <- Arima(train.ci5, order=c(6,1,0), include.mean=TRUE)
accuracy(forecast(fit.ci2, h=5), test.ci5)
                     ME      RMSE       MAE       MPE     MAPE      MASE
Training set 0.03439583 0.3641608 0.2787817 -40.97753 65.59960 0.9248951
Test set     0.29776606 0.3308204 0.2977661  15.30517 15.30517 0.9878781
                     ACF1 Theil's U
Training set -0.007356741        NA
Test set      0.206715703  1.669537
Code
#MASE for the test set is just under 1, slightly better than the auto.arima fit

Here are the plots comparing the two models for Cook Inlet. The plots look very similar, but the forecasts differ somewhat for the last 2 years.
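
The plots are not reproduced here; a sketch of the comparison using the cowplot idiom from later sections:

Code
#auto.arima's choice vs. the hand-picked ARIMA(6,1,0)
p1<-forecast(ci.final5, h=5) %>% autoplot() +
  geom_point(aes(x=x, y=y), data=fortify(test.ci5))
p2<-forecast(fit.ci2, h=5) %>% autoplot() +
  geom_point(aes(x=x, y=y), data=fortify(test.ci5))
plot_grid(p1, p2, ncol=2)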

Chum: Regional

Looking at data:

Getting a subset of the data (removing regions with no data):

Code
#removing Korea and Japan because there's no data
ChumByRegion<-ruggerone_data %>%
  filter(region != "japan") %>%
  filter(region != "korea") %>%
  group_by(species, region, year) %>%
  summarize(total = sum(returns, na.rm=TRUE)) %>% 
  mutate(lnreturns = log(total)) %>%
  filter(species == "chum")
`summarise()` has grouped output by 'species', 'region'. You can override using
the `.groups` argument.
Code
head(ChumByRegion)
# A tibble: 6 × 5
# Groups:   species, region [1]
  species region  year total lnreturns
  <chr>   <chr>  <dbl> <dbl>     <dbl>
1 chum    ci      1952 1.25     0.226 
2 chum    ci      1953 1.44     0.363 
3 chum    ci      1954 1.93     0.656 
4 chum    ci      1955 0.957   -0.0436
5 chum    ci      1956 2.11     0.748 
6 chum    ci      1957 2.76     1.01  
Code
#making sure all the regions cover all the years (or at least start and end)
ChumByRegion %>% group_by(region) %>% summarise(startyear = min(year), endyear = max(year))
# A tibble: 12 × 3
   region startyear endyear
   <chr>      <dbl>   <dbl>
 1 ci          1952    2015
 2 e_kam       1952    2015
 3 kod         1952    2015
 4 m_i         1952    2015
 5 nbc         1952    2015
 6 pws         1952    2015
 7 s_pen       1952    2015
 8 sbc         1952    2015
 9 seak        1952    2015
10 w_kam       1952    2015
11 wa          1952    2015
12 wak         1952    2015
Code
#all start in 1952 and end in 2015

Creating tibble to loop through:

Code
#regions vector
regions<-unique(ChumByRegion$region)
#regions key
regionskey<-c("Cook Inlet", "E. Kamchatka", "Kodiak", "Russia", "N.British Columbia",
              "Prince William Sound", "S. Alaska Pen.", "S. British Columbia", "SE Alaska", "W. Kamchatka", "Washington", "W. Alaska")
names(regionskey)<-regions #for plotting
#forecast levels
forecastlevels<-c(5, 10, 20)
#all combinations
Allcombs<-expand_grid(regions, forecastlevels)

Function for ACF and PACF

Code
#ACF and PACF
ACFandPACF<-function(reg){
  Chumdat<-ChumByRegion %>% filter(region == reg)
  #create time series
  datts <- ts(Chumdat$lnreturns, start=Chumdat$year[1])
  return(list(a = acf(datts, plot = FALSE), p = pacf(datts, plot = FALSE)))
}
#loop through regions/levels
DiagPlots<-lapply(regions, ACFandPACF)
names(DiagPlots)<-regions
Code
#ACF plots for each region
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots[[r]][[1]], main = paste0("Region: ", regionskey[r]))
}

Code
#PACF plots for each region
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots[[r]][[2]], main = paste0("Region: ", regionskey[r]))
}

Code
#function for ARIMA models
FitModFunction<-function(reg, forelevel){
  #This function takes a region and forecast level, subsets the data accordingly,
  #and sets up the time series object and the train and test sets. It uses
  #auto.arima to find the "best" model, then checks this against the other
  #candidates with DeltaAICc < 2 and their number of parameters. If auto.arima
  #picked the model with both the lowest AICc and the fewest parameters, that
  #model is used to forecast and forecast accuracy is checked with MASE.
  #Otherwise the simpler model is refit with Arima and accuracy checked with that.
  #filter region
  Chumdat<-ChumByRegion %>% filter(region == reg)
  #create time series
  datts <- ts(Chumdat$lnreturns, start=Chumdat$year[1]) #this assumes the first year in data is the start of the time series (they are in order) 
  cutoff<-2015-forelevel
  train <- window(datts, 1952, cutoff)
  test <- window(datts, cutoff+1, 2015)
  
  mod <- auto.arima(train)
  
  #testing to be sure that this is the best model (is the best model the simplest if it is within 2 AIC values?)
  trace <- capture.output({
    # assign so it doesn't pollute the output
    model <- auto.arima(datts, trace = TRUE)
  })
  con    <- textConnection(trace)
  models <- read.table(con, sep=":")
  close(con)
  
  #getting the "best models" that are within 2 AIC units
  BestMods<-models%>% filter(row_number() != nrow(models)) %>% mutate(AIC = replace(V2, V2 == "Inf", 99999), AIC = as.numeric(AIC), DeltaAIC = AIC-min(AIC)) %>% filter(DeltaAIC <= 2.0)
  for(i in 1:nrow(BestMods)){
    BestMods$Mod[i]<-strsplit(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
    BestMods$npar[i]<-sum(as.numeric(BestMods$Mod[i][[1]][c(1,3)]))
    if(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][2] == " with drift         "){
      BestMods$npar[i] = BestMods$npar[i] + 1
    }
  }
  
  New<-BestMods %>% filter(npar == min(npar))
  if(0 %in% New$DeltaAIC){
      #auto arima picked the best model
      res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }else{
      #of the models with the fewest parameters, pick the lowest AIC
      newmod<-New %>% filter(AIC == min(AIC)) %>% select(Mod)
      mod<-Arima(train, order = as.numeric(strsplit(newmod$Mod[[1]], "[,]")), include.constant = TRUE)
      res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
    }

  return(list(Fit = mod, MASE = res, Bm = BestMods)) #include best mods for testing to see that it's doing what I want
}
Code
RegionModsChum<-mapply(FitModFunction, Allcombs$regions, Allcombs$forecastlevels, SIMPLIFY = FALSE)
Code
RegionMASEChum<-sapply(RegionModsChum, function(x){y<-x$MASE})
RegionBestModChum<-sapply(RegionModsChum, function(x){y<-as.character(x$Fit)})
#combine into tables
ResultsTableChum<-Allcombs %>% add_column(Model = RegionBestModChum, MASE = RegionMASEChum)
knitr::kable(head(ResultsTableChum))
|regions | forecastlevels|Model                           |      MASE|
|:-------|--------------:|:-------------------------------|---------:|
|ci      |              5|ARIMA(1,1,1)                    | 0.4232248|
|ci      |             10|ARIMA(1,1,1)                    | 0.3650863|
|ci      |             20|ARIMA(0,0,2) with non-zero mean | 1.3678453|
|e_kam   |              5|ARIMA(1,0,0) with non-zero mean | 2.0352335|
|e_kam   |             10|ARIMA(1,0,0) with non-zero mean | 1.6504750|
|e_kam   |             20|ARIMA(1,0,0) with non-zero mean | 1.4845904|

Pink: Combined Regions

Subset the data to look at pink salmon only.

Code
#Filter by species (Pink)
dat <- ruggerone_data %>%  
  filter(species=="pink" & region=="ci") %>% 
  mutate(log.returns = log(returns)) %>% 
  select(year, log.returns)

Plot the data

Code
#Plot by region
ruggerone_data %>% 
  filter(species=="pink") %>% 
  ggplot(aes(x=year, y=log(returns))) + 
  geom_line() + 
  ggtitle("pink salmon log abundance by region") +
  facet_wrap(~region)

Note that there are no data for Korea, and WA has a lot of 0 values and very low returns, so we will filter both regions out. All remaining 0 values (-Inf in log space) were set to NA.

Code
PinkByRegion<-ruggerone_data %>%
  filter(region != "korea") %>% 
  filter(region != "wa") %>%#Remove WA too, it's trouble 
  group_by(species, region, year) %>%
  summarize(total = sum(returns, na.rm=TRUE)) %>% 
  mutate(lnreturns = log(total)) %>%
  mutate(lnreturns = ifelse(lnreturns == -Inf, NA, lnreturns)) %>%
  filter(species == "pink")%>% 
  print(n=5)
`summarise()` has grouped output by 'species', 'region'. You can override using
the `.groups` argument.
# A tibble: 768 × 5
# Groups:   species, region [12]
  species region  year total lnreturns
  <chr>   <chr>  <dbl> <dbl>     <dbl>
1 pink    ci      1952  4.36     1.47 
2 pink    ci      1953  1.30     0.264
3 pink    ci      1954  4.67     1.54 
4 pink    ci      1955  2.67     0.981
5 pink    ci      1956  3.57     1.27 
# ℹ 763 more rows

Identify start and end years for each region.

Code
PinkByRegion %>% group_by(region) %>% summarise(startyear = min(year), endyear = max(year))
# A tibble: 12 × 3
   region startyear endyear
   <chr>      <dbl>   <dbl>
 1 ci          1952    2015
 2 e_kam       1952    2015
 3 japan       1952    2015
 4 kod         1952    2015
 5 m_i         1952    2015
 6 nbc         1952    2015
 7 pws         1952    2015
 8 s_pen       1952    2015
 9 sbc         1952    2015
10 seak        1952    2015
11 w_kam       1952    2015
12 wak         1952    2015

The next step is to check stationarity using all of the pink data. Let's plot the data in aggregate.

Code
PinkByRegion %>%
  group_by(year) %>%
  summarize(total = sum(total, na.rm=T)) %>%
  ggplot(aes(x=year, y=log(total))) +
  geom_line() +
  ylab('Log (Returns)') +
  xlab('Year')

Next, we create a time series object and look at the differenced series and its ACF.

Code
total.pink<-PinkByRegion %>%
  group_by(year) %>%
  summarize(lntotal=log(sum(total, na.rm=T)))

pink.ts<-ts(total.pink$lntotal, 
            start=total.pink$year[1])
plot(diff(pink.ts)) #something odd happened between 1990 and 2005

Code
acf(diff(pink.ts)) #ruh roh, ACF correlation for entire series 

The ACF shows a lot of correlation across lags; that's probably because pinks have a very consistent two-year cycle.

Let’s try a forecast model to see what happens.

Code
#Let's train and test with a 10 year period
train.pink<-window(pink.ts, start=1952, end=2005)
test.pink<-window(pink.ts, start=2006, end=2015)

fit <- forecast::auto.arima(train.pink, trace=T)

 ARIMA(2,1,2) with drift         : Inf
 ARIMA(0,1,0) with drift         : 44.48651
 ARIMA(1,1,0) with drift         : -0.8966894
 ARIMA(0,1,1) with drift         : 18.15923
 ARIMA(0,1,0)                    : 42.45004
 ARIMA(2,1,0) with drift         : -1.222687
 ARIMA(3,1,0) with drift         : -2.155628
 ARIMA(4,1,0) with drift         : 0.3831016
 ARIMA(3,1,1) with drift         : 0.2902468
 ARIMA(2,1,1) with drift         : Inf
 ARIMA(4,1,1) with drift         : Inf
 ARIMA(3,1,0)                    : -4.270823
 ARIMA(2,1,0)                    : -3.360303
 ARIMA(4,1,0)                    : -1.841485
 ARIMA(3,1,1)                    : -1.889482
 ARIMA(2,1,1)                    : Inf
 ARIMA(4,1,1)                    : 0.4019518

 Best model: ARIMA(3,1,0)                    
Code
fit.final.pink<-forecast::auto.arima(train.pink, approximation = F, stepwise = F)

#15 year forecast, not so believable
fit.final.pink %>%
  forecast(h=15) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.pink))

The best model was the ARIMA(3,1,0), but the forecast doesn’t look particularly great.

Given the life history of pink salmon, let’s parse the data set into two pieces, even and odd years.

Code
#Even Years: Totals 
PinkByRegion_even<-PinkByRegion %>% 
  filter(year %% 2 == 0)

#Odd Years: Totals 
PinkByRegion_odd<-PinkByRegion %>% 
  filter(year %% 2 == 1)

#Trends
PinkByRegion_even %>%
  group_by(year) %>%
  summarize(total = sum(total, na.rm=T)) %>%
  ggplot(aes(x=year, y=log(total))) +
  geom_line() +
  ylab('Log (Returns)') +
  xlab('Year') +
  ggtitle('Total Pinks (Even Years)')

Code
PinkByRegion_odd %>%
  group_by(year) %>%
  summarize(total = sum(total, na.rm=T)) %>%
  ggplot(aes(x=year, y=log(total))) +
  geom_line() +
  ylab('Log (Returns)') +
  xlab('Year') +
  ggtitle('Total Pinks (Odd Years)')

Let’s look at the ACF and PACF for even years:

Code
#Even Years -- Total
total.pink_even<-PinkByRegion_even %>%
  group_by(year) %>%
  summarize(lntotal=log(sum(total, na.rm=T)))

pink.ts_even<-ts(total.pink_even$lntotal, 
            start=total.pink_even$year[1], frequency = 0.5)
plot(diff(pink.ts_even)) #Looks pretty stationary

Code
acf(diff(pink.ts_even)) #This looks much better

The differenced series and ACF plots look better here than for the aggregate data set. Let's look at a forecast.

Code
#Train and test for a 10 year period
train.pink_even<-window(pink.ts_even, start=1952, end=2004)
test.pink_even<-window(pink.ts_even, start=2006, end=2014)

fit <- forecast::auto.arima(train.pink_even, trace=T)

 ARIMA(2,1,2) with drift         : Inf
 ARIMA(0,1,0) with drift         : 2.346942
 ARIMA(1,1,0) with drift         : 4.419114
 ARIMA(0,1,1) with drift         : 4.222938
 ARIMA(0,1,0)                    : 0.1554815
 ARIMA(1,1,1) with drift         : Inf

 Best model: ARIMA(0,1,0)                    
Code
fit.final.pink_even<-forecast::auto.arima(train.pink_even, approximation = F, stepwise = F)

fit.final.pink_even %>%
  forecast(h=15) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.pink_even))

Code
#That is a straight line..... not very good.....

The best model was ARIMA(0,1,0), but the forecast doesn't capture the trend.

Let’s repeat the next steps for odd years.

Code
#Odd Years -- Total

total.pink_odd<-PinkByRegion_odd %>%
  group_by(year) %>%
  summarize(lntotal=log(sum(total, na.rm=T)))

pink.ts_odd<-ts(total.pink_odd$lntotal, 
                 start=total.pink_odd$year[1], frequency = 0.5)
plot(diff(pink.ts_odd)) #Looks pretty stationary

Code
acf(diff(pink.ts_odd)) #This also looks better

These look stationary and the ACF looks better.

Code
#Train and test for a 10 year period
train.pink_odd<-window(pink.ts_odd, start=1953, end=2005)
test.pink_odd<-window(pink.ts_odd, start=2007, end=2015)

fit <- forecast::auto.arima(train.pink_odd, trace=T)

 ARIMA(2,1,2) with drift         : Inf
 ARIMA(0,1,0) with drift         : 4.01998
 ARIMA(1,1,0) with drift         : 5.551776
 ARIMA(0,1,1) with drift         : 4.461601
 ARIMA(0,1,0)                    : 1.771054
 ARIMA(1,1,1) with drift         : Inf

 Best model: ARIMA(0,1,0)                    
Code
fit.final.pink_odd<-forecast::auto.arima(train.pink_odd, approximation = F, stepwise = F)

fit.final.pink_odd %>%
  forecast(h=15) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.pink_odd))

The best model was again ARIMA(0,1,0), but the forecast for only odd years also doesn't capture the trend.

Pinks: Regional Considerations

Because of pink salmon life history, even and odd years were modeled separately in the regional models.

Code
#Even Years First 
#Differenced plots for all Regions 
PinkByRegion_even %>%
  group_by(region) %>%
  mutate(diff_total = c(NA, diff(total))) %>%
  ggplot(aes(x = year, y = diff_total)) +
  geom_line() +
  facet_wrap(~region, scales = "free_y") +
  ylab("Difference in Total Returns") +
  xlab("Year") +
  ggtitle("Diff by Region (Even Years)") 
Warning: Removed 1 row containing missing values (`geom_line()`).

Code
  #ggfortify::ggstat_acf(method = "ma", na.action = na.pass)

# Odd Years First 
#Differenced plots for all Regions 
PinkByRegion_odd %>%
  group_by(region) %>%
  mutate(diff_total = c(NA, diff(total))) %>%
  ggplot(aes(x = year, y = diff_total)) +
  geom_line() +
  facet_wrap(~region, scales = "free_y") +
  ylab("Difference in Total Returns") +
  xlab("Year") +
  ggtitle("Diff by Region (Odd Years)") 
Warning: Removed 1 row containing missing values (`geom_line()`).

To test stationarity outside the auto.arima function, the ADF and KPSS tests were compared for all years, even years, and odd years. Note that the null hypothesis of the ADF test is that the series is non-stationary (we want to reject it), while the null hypothesis of the KPSS test is that the series is stationary.

Code
#Augmented Dicky Fuller 
tseries::adf.test(pink.ts, k=0)
Warning in tseries::adf.test(pink.ts, k = 0): p-value smaller than printed
p-value

    Augmented Dickey-Fuller Test

data:  pink.ts
Dickey-Fuller = -6.7912, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
Code
tseries::adf.test(pink.ts_even, k=0)

    Augmented Dickey-Fuller Test

data:  pink.ts_even
Dickey-Fuller = -2.7882, Lag order = 0, p-value = 0.2688
alternative hypothesis: stationary
Code
tseries::adf.test(pink.ts_odd, k=0)

    Augmented Dickey-Fuller Test

data:  pink.ts_odd
Dickey-Fuller = -3.3957, Lag order = 0, p-value = 0.07573
alternative hypothesis: stationary
Code
tseries::kpss.test(pink.ts, null = c("Level", "Trend"))
Warning in tseries::kpss.test(pink.ts, null = c("Level", "Trend")): p-value
smaller than printed p-value

    KPSS Test for Level Stationarity

data:  pink.ts
KPSS Level = 1.2192, Truncation lag parameter = 3, p-value = 0.01
Code
tseries::kpss.test(pink.ts_even, null = c("Level", "Trend"))

    KPSS Test for Level Stationarity

data:  pink.ts_even
KPSS Level = 0.64625, Truncation lag parameter = 3, p-value = 0.01843
Code
tseries::kpss.test(pink.ts_odd, null = c("Level", "Trend"))

    KPSS Test for Level Stationarity

data:  pink.ts_odd
KPSS Level = 0.64278, Truncation lag parameter = 3, p-value = 0.01875

A complication is that the data sets split into even and odd years did not pass the stationarity tests; additional steps, such as differencing before fitting, may have been warranted.
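
As a quick check (not part of the original analysis), `forecast::ndiffs` reports how many differences each split series would need to pass a stationarity test:

Code
#number of differences needed for stationarity, by the KPSS test
forecast::ndiffs(pink.ts_even, test = "kpss")
forecast::ndiffs(pink.ts_odd, test = "kpss")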

Code
#Functions for Regional ACF and PACF 
#======================================================
ACFandPACF<-function(reg){
  Pinkdat<-PinkByRegion %>% filter(region == reg)
  #create time series
  datts <- ts(Pinkdat$lnreturns, start=Pinkdat$year[1]) 
  return(list(a = acf(datts, plot = FALSE), p = pacf(datts, plot = FALSE)))
}

ACFandPACF_even<-function(reg){
  Pinkdat<-PinkByRegion_even %>% filter(region == reg)
  #create time series
  datts <- ts(Pinkdat$lnreturns, start=Pinkdat$year[1]) 
  return(list(a = acf(datts, plot = FALSE), p = pacf(datts, plot = FALSE)))
}

ACFandPACF_odd<-function(reg){
  Pinkdat<-PinkByRegion_odd %>% filter(region == reg)
  #create time series
  datts <- ts(Pinkdat$lnreturns, start=Pinkdat$year[1]) 
  return(list(a = acf(datts, plot = FALSE), p = pacf(datts, plot = FALSE)))
}

FitModFunction<-function(reg, forelevel){
  #filter region
  Pinkdat<-PinkByRegion %>% filter(region == reg)
  #create time series
  datts <- ts(Pinkdat$lnreturns, start=Pinkdat$year[1], frequency = 0.5) #frequency set to 0.5
  cutoff<-2014-forelevel
  train <- window(datts, Pinkdat$year[1], cutoff)
  test <- window(datts, cutoff+1, 2014)
  
  mod <- auto.arima(train)
  #testing to be sure that this is the best model (is the best model the simplest if it is within 2 AIC values?)
  trace <- capture.output({
    # assign so it doesn't pollute the output
    model <- auto.arima(datts, trace = TRUE)
  })
  con    <- textConnection(trace)
  models <- read.table(con, sep=":")
  close(con)
  
  #getting the "best models" that are within 2 AIC units
  BestMods<-models%>% filter(row_number() != nrow(models)) %>% mutate(AIC = replace(V2, V2 == "Inf", 99999), AIC = as.numeric(AIC), DeltaAIC = AIC-min(AIC)) %>% filter(DeltaAIC <= 2.0)
  for(i in 1:nrow(BestMods)){
    BestMods$Mod[i]<-strsplit(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
    BestMods$npar[i]<-sum(as.numeric(BestMods$Mod[i][[1]][c(1,3)]))
    if(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][2] == " with drift         "){
      BestMods$npar[i] = BestMods$npar[i] + 1
    }
  }
  
  New<-BestMods %>% filter(npar == min(npar))
  if(0 %in% New$DeltaAIC){
    #auto arima picked the best model
    res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }else{
    #of the models with the fewest parameters, pick the lowest AIC
    newmod<-New %>% filter(AIC == min(AIC)) %>% select(Mod)
    mod<-Arima(train, order = as.numeric(strsplit(newmod$Mod[[1]], "[,]")), include.constant = TRUE)
    res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }
  
  return(list(Fit = mod, MASE = res, Bm = BestMods)) #include best mods for testing to see that it's doing what I want
  
}

FitModFunction_even<-function(reg, forelevel){
  #filter region
  Pinkdat<-PinkByRegion_even %>% filter(region == reg)
  #create time series
  datts <- ts(Pinkdat$lnreturns, start=Pinkdat$year[1], frequency = 0.5) #set Frequency for 0.5
  cutoff<-2014-forelevel
  train <- window(datts, Pinkdat$year[1], cutoff)
  test <- window(datts, cutoff+1, 2014)
  
  mod <- auto.arima(train)
  #testing to be sure that this is the best model (is the best model the simplest if it is within 2 AIC values?)
  trace <- capture.output({
    # assign so it doesn't pollute the output
    model <- auto.arima(datts, trace = TRUE)
  })
  con    <- textConnection(trace)
  models <- read.table(con, sep=":")
  close(con)
  
  #getting the "best models" that are within 2 AIC units
  BestMods<-models%>% filter(row_number() != nrow(models)) %>% mutate(AIC = replace(V2, V2 == "Inf", 99999), AIC = as.numeric(AIC), DeltaAIC = AIC-min(AIC)) %>% filter(DeltaAIC <= 2.0)
  for(i in 1:nrow(BestMods)){
    BestMods$Mod[i]<-strsplit(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
    BestMods$npar[i]<-sum(as.numeric(BestMods$Mod[i][[1]][c(1,3)]))
    if(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][2] == " with drift         "){
      BestMods$npar[i] = BestMods$npar[i] + 1
    }
  }
  
  New<-BestMods %>% filter(npar == min(npar))
  if(0 %in% New$DeltaAIC){
    #auto arima picked the best model
    res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }else{
    #of the models with the fewest parameters, pick the lowest AIC
    newmod<-New %>% filter(AIC == min(AIC)) %>% select(Mod)
    mod<-Arima(train, order = as.numeric(strsplit(newmod$Mod[[1]], "[,]")), include.constant = TRUE)
    res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }
  
  return(list(Fit = mod, MASE = res, Bm = BestMods)) #include best mods for testing to see that it's doing what I want
  
}

FitModFunction_odd<-function(reg, forelevel){
  #filter region
  Pinkdat<-PinkByRegion_odd %>% filter(region == reg)
  #create time series
  datts <- ts(Pinkdat$lnreturns, start=Pinkdat$year[1], frequency = 0.5) 
  cutoff<-2015-forelevel
  train <- window(datts, Pinkdat$year[1], cutoff)
  test <- window(datts, cutoff+1, 2015)
  
  mod <- auto.arima(train)
  #testing to be sure that this is the best model (is the best model the simplest if it is within 2 AIC values?)
  trace <- capture.output({
    # assign so it doesn't pollute the output
    model <- auto.arima(datts, trace = TRUE)
  })
  con    <- textConnection(trace)
  models <- read.table(con, sep=":")
  close(con)
  
  #getting the "best models" that are within 2 AIC units
  BestMods<-models%>% filter(row_number() != nrow(models)) %>% mutate(AIC = replace(V2, V2 == "Inf", 99999), AIC = as.numeric(AIC), DeltaAIC = AIC-min(AIC)) %>% filter(DeltaAIC <= 2.0)
  for(i in 1:nrow(BestMods)){
    BestMods$Mod[i]<-strsplit(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
    BestMods$npar[i]<-sum(as.numeric(BestMods$Mod[i][[1]][c(1,3)]))
    if(strsplit(strsplit(BestMods$V1[i], "[(]")[[1]][2], "[)]")[[1]][2] == " with drift         "){
      BestMods$npar[i] = BestMods$npar[i] + 1
    }
  }
  
  New<-BestMods %>% filter(npar == min(npar))
  if(0 %in% New$DeltaAIC){
    #auto arima picked the best model
    res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }else{
    #of the models with the fewest parameters, pick the lowest AIC
    newmod<-New %>% filter(AIC == min(AIC)) %>% select(Mod)
    mod<-Arima(train, order = as.numeric(strsplit(newmod$Mod[[1]], "[,]")), include.constant = TRUE)
    res<-accuracy(forecast(mod, h=forelevel), test)[2,"MASE"] #test set MASE
  }
  
  return(list(Fit = mod, MASE = res, Bm = BestMods)) #include best mods for testing to see that it's doing what I want
  
}

#================================================================

The next step is to define the regions.

Code
#regions vector
regions<-unique(PinkByRegion$region)
#regions key
regionskey<-c("Cook Inlet", "E. Kamchatka", "Japan", "Kodiak", "Russia", "N.British Columbia",
              "Prince William Sound", "S. Alaska Pen.", "S. British Columbia", "SE Alaska", "W. Kamchatka", "W. Alaska")
names(regionskey)<-regions #for plotting

#loop through regions/levels
#all
DiagPlots<-lapply(regions, ACFandPACF)
   names(DiagPlots)<-regions

#Even
DiagPlots_even<-lapply(regions, ACFandPACF_even)
   names(DiagPlots_even)<-regions
   
#Odd
DiagPlots_odd<-lapply(regions, ACFandPACF_odd)
   names(DiagPlots_odd)<-regions   

Let's look at the regional breakdowns of the ACF and PACF, first for all years:

Code
#All Years
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots[[r]][[1]], main = paste0("Region: ", regionskey[r]))
}

Code
#PACF plots for each region
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots[[r]][[2]], main = paste0("Region: ", regionskey[r]))
}

The regional ACFs show differing amounts of autocorrelation and lag structure. Now let's look at even years:

Code
#Even Years 
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots_even[[r]][[1]], main = paste0("Region: ", regionskey[r]))
}

Code
#PACF plots for each region
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots_even[[r]][[2]], main = paste0("Region: ", regionskey[r]))
}

And then for odd years:

Code
#Odd Years 
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots_odd[[r]][[1]], main = paste0("Region: ", regionskey[r]))
}

Code
#PACF plots for each region
par(mfrow=c(3,4))
for(r in 1:length(regions)){
  plot(DiagPlots_odd[[r]][[2]], main = paste0("Region: ", regionskey[r]))
}

The next step is to look at the ARIMA models for regions by all years, even and odd years, and define forecast levels. Let’s start with all years.

Code
#forecast levels
forecastlevels<-c(5, 10, 20)
#all combinations
Allcombs<-expand_grid(regions, forecastlevels)

#================================================

#All
RegionMods<-mapply(FitModFunction, Allcombs$regions, Allcombs$forecastlevels, SIMPLIFY = FALSE)
# head(RegionMods_even)
# names(RegionMods_even)

Now let’s consider even years.

Code
#Even
RegionMods_even<-mapply(FitModFunction_even, Allcombs$regions, Allcombs$forecastlevels, SIMPLIFY = FALSE)
head(RegionMods_even)
$ci
$ci$Fit
Series: train 
ARIMA(0,1,0) 

sigma^2 = 0.3224:  log likelihood = -23.88
AIC=49.77   AICc=49.92   BIC=51.1

$ci$MASE
[1] 0.3602762

$ci$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(0,1,0)                      55.48062 55.48062        0 0, 1, 0    0


$ci
$ci$Fit
Series: train 
ARIMA(0,1,0) 

sigma^2 = 0.3052:  log likelihood = -21.47
AIC=44.93   AICc=45.1   BIC=46.19

$ci$MASE
[1] 1.243461

$ci$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(0,1,0)                      55.48062 55.48062        0 0, 1, 0    0


$ci
$ci$Fit
Series: train 
ARIMA(0,1,0) 

sigma^2 = 0.2563:  log likelihood = -15.5
AIC=33.01   AICc=33.22   BIC=34.05

$ci$MASE
[1] 0.3950404

$ci$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(0,1,0)                      55.48062 55.48062        0 0, 1, 0    0


$e_kam
$e_kam$Fit
Series: train 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
        mean
      2.6102
s.e.  0.1257

sigma^2 = 0.4743:  log likelihood = -29.83
AIC=63.65   AICc=64.11   BIC=66.39

$e_kam$MASE
[1] 0.2398603

$e_kam$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(0,0,0) with non-zero mean   71.15545 71.15545  0.43946 0, 0, 0    0
2  ARIMA(1,0,0) with non-zero mean   72.55604 72.55604  1.84005 1, 0, 0    1
3  ARIMA(0,0,1) with non-zero mean   70.71599 70.71599  0.00000 0, 0, 1    1
4  ARIMA(1,0,1) with non-zero mean   72.39343 72.39343  1.67744 1, 0, 1    2
5  ARIMA(0,0,2) with non-zero mean   71.53128 71.53128  0.81529 0, 0, 2    2


$e_kam
$e_kam$Fit
Series: train 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
        mean
      2.5867
s.e.  0.1318

sigma^2 = 0.4873:  log likelihood = -28.1
AIC=60.19   AICc=60.69   BIC=62.79

$e_kam$MASE
[1] 0.219349

$e_kam$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(0,0,0) with non-zero mean   71.15545 71.15545  0.43946 0, 0, 0    0
2  ARIMA(1,0,0) with non-zero mean   72.55604 72.55604  1.84005 1, 0, 0    1
3  ARIMA(0,0,1) with non-zero mean   70.71599 70.71599  0.00000 0, 0, 1    1
4  ARIMA(1,0,1) with non-zero mean   72.39343 72.39343  1.67744 1, 0, 1    2
5  ARIMA(0,0,2) with non-zero mean   71.53128 71.53128  0.81529 0, 0, 2    2


$e_kam
$e_kam$Fit
Series: train 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
        mean
      2.6917
s.e.  0.1209

sigma^2 = 0.337:  log likelihood = -18.74
AIC=41.48   AICc=42.12   BIC=43.67

$e_kam$MASE
[1] 0.2382605

$e_kam$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(0,0,0) with non-zero mean   71.15545 71.15545  0.43946 0, 0, 0    0
2  ARIMA(1,0,0) with non-zero mean   72.55604 72.55604  1.84005 1, 0, 0    1
3  ARIMA(0,0,1) with non-zero mean   70.71599 70.71599  0.00000 0, 0, 1    1
4  ARIMA(1,0,1) with non-zero mean   72.39343 72.39343  1.67744 1, 0, 1    2
5  ARIMA(0,0,2) with non-zero mean   71.53128 71.53128  0.81529 0, 0, 2    2
Code
#names(RegionMods_even) #should be three for each region

Now let’s consider odd years.

Code
#Odd 
RegionMods_odd<-mapply(FitModFunction_odd, Allcombs$regions, Allcombs$forecastlevels, SIMPLIFY = FALSE)
head(RegionMods_odd)
$ci
$ci$Fit
Series: train 
ARIMA(1,0,0) with zero mean 

Coefficients:
         ar1
      0.6039
s.e.  0.1467

sigma^2 = 0.4057:  log likelihood = -27.79
AIC=59.57   AICc=60.03   BIC=62.31

$ci$MASE
[1] 2.144842

$ci$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(1,0,0) with non-zero mean    75.3159 75.31590  0.36364 1, 0, 0    1
2  ARIMA(1,0,0) with zero mean       74.95226 74.95226  0.00000 1, 0, 0    1


$ci
$ci$Fit
Series: train 
ARIMA(1,0,0) with zero mean 

Coefficients:
         ar1
      0.6037
s.e.  0.1610

sigma^2 = 0.4189:  log likelihood = -26.28
AIC=56.57   AICc=57.07   BIC=59.16

$ci$MASE
[1] 1.555645

$ci$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(1,0,0) with non-zero mean    75.3159 75.31590  0.36364 1, 0, 0    1
2  ARIMA(1,0,0) with zero mean       74.95226 74.95226  0.00000 1, 0, 0    1


$ci
$ci$Fit
Series: train 
ARIMA(1,0,0) with zero mean 

Coefficients:
         ar1
      0.6430
s.e.  0.1528

sigma^2 = 0.3645:  log likelihood = -19.87
AIC=43.74   AICc=44.37   BIC=45.93

$ci$MASE
[1] 1.425649

$ci$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(1,0,0) with non-zero mean    75.3159 75.31590  0.36364 1, 0, 0    1
2  ARIMA(1,0,0) with zero mean       74.95226 74.95226  0.00000 1, 0, 0    1


$e_kam
$e_kam$Fit
Series: train 
ARIMA(0,1,0) 

sigma^2 = 0.1936:  log likelihood = -16.75
AIC=35.49   AICc=35.65   BIC=36.82

$e_kam$MASE
[1] 0.1721234

$e_kam$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(1,1,0) with drift           60.55848 60.55848  1.02617 1, 1, 0    2
2  ARIMA(0,1,0)                      59.53231 59.53231  0.00000 0, 1, 0    0


$e_kam
$e_kam$Fit
Series: train 
ARIMA(0,0,1) with non-zero mean 

Coefficients:
         ma1    mean
      0.5514  3.9961
s.e.  0.1250  0.1117

sigma^2 = 0.1551:  log likelihood = -12.29
AIC=30.59   AICc=31.63   BIC=34.48

$e_kam$MASE
[1] 0.2364036

$e_kam$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(1,1,0) with drift           60.55848 60.55848  1.02617 1, 1, 0    2
2  ARIMA(0,1,0)                      59.53231 59.53231  0.00000 0, 1, 0    0


$e_kam
$e_kam$Fit
Series: train 
ARIMA(0,0,1) with non-zero mean 

Coefficients:
         ma1    mean
      0.5404  3.9140
s.e.  0.1481  0.1168

sigma^2 = 0.1433:  log likelihood = -8.97
AIC=23.94   AICc=25.28   BIC=27.21

$e_kam$MASE
[1] 0.1918676

$e_kam$Bm
                                 V1        V2      AIC DeltaAIC     Mod npar
1  ARIMA(1,1,0) with drift           60.55848 60.55848  1.02617 1, 1, 0    2
2  ARIMA(0,1,0)                      59.53231 59.53231  0.00000 0, 1, 0    0
Code
#names(RegionMods_odd) #should be three for each region

Now, we’re extracting the MASE and creating our final model tables

Code
#getting MASE
#All
RegionMASE<-sapply(RegionMods, function(x){y<-x$MASE})
RegionBestMod<-sapply(RegionMods, function(x){y<-as.character(x$Fit)})
#Even
RegionMASE_even<-sapply(RegionMods_even, function(x){y<-x$MASE})
RegionBestMod_even<-sapply(RegionMods_even, function(x){y<-as.character(x$Fit)})
#Odd
RegionMASE_odd<-sapply(RegionMods_odd, function(x){y<-x$MASE})
RegionBestMod_odd<-sapply(RegionMods_odd, function(x){y<-as.character(x$Fit)})
#combine into tables
#all
ResultsTable<-Allcombs %>% add_column(Model = RegionBestMod, MASE = RegionMASE)
ResultsTable
# A tibble: 36 × 4
   regions forecastlevels Model                            MASE
   <chr>            <dbl> <chr>                           <dbl>
 1 ci                   5 ARIMA(2,0,0) with non-zero mean 0.460
 2 ci                  10 ARIMA(2,0,0) with non-zero mean 0.549
 3 ci                  20 ARIMA(1,0,1) with non-zero mean 0.711
 4 e_kam                5 ARIMA(2,0,0) with non-zero mean 0.134
 5 e_kam               10 ARIMA(2,0,0) with non-zero mean 0.111
 6 e_kam               20 ARIMA(2,0,0) with non-zero mean 0.196
 7 japan                5 ARIMA(0,0,0) with non-zero mean 0.744
 8 japan               10 ARIMA(0,0,0) with non-zero mean 0.901
 9 japan               20 ARIMA(0,1,1)                    0.853
10 kod                  5 ARIMA(0,1,1) with drift         0.221
# ℹ 26 more rows
Code
#even
ResultsTable_even<-Allcombs %>% add_column(Model = RegionBestMod_even, MASE = RegionMASE_even)
ResultsTable_even
# A tibble: 36 × 4
   regions forecastlevels Model                            MASE
   <chr>            <dbl> <chr>                           <dbl>
 1 ci                   5 ARIMA(0,1,0)                    0.360
 2 ci                  10 ARIMA(0,1,0)                    1.24 
 3 ci                  20 ARIMA(0,1,0)                    0.395
 4 e_kam                5 ARIMA(0,0,0) with non-zero mean 0.240
 5 e_kam               10 ARIMA(0,0,0) with non-zero mean 0.219
 6 e_kam               20 ARIMA(0,0,0) with non-zero mean 0.238
 7 japan                5 ARIMA(0,1,0)                    0.743
 8 japan               10 ARIMA(0,1,0)                    0.500
 9 japan               20 ARIMA(1,0,0) with non-zero mean 0.731
10 kod                  5 ARIMA(0,0,0) with non-zero mean 0.155
# ℹ 26 more rows
Code
#Odd
ResultsTable_odd<-Allcombs %>% add_column(Model = RegionBestMod_odd, MASE = RegionMASE_odd)
ResultsTable_odd
# A tibble: 36 × 4
   regions forecastlevels Model                            MASE
   <chr>            <dbl> <chr>                           <dbl>
 1 ci                   5 ARIMA(1,0,0) with zero mean     2.14 
 2 ci                  10 ARIMA(1,0,0) with zero mean     1.56 
 3 ci                  20 ARIMA(1,0,0) with zero mean     1.43 
 4 e_kam                5 ARIMA(0,1,0)                    0.172
 5 e_kam               10 ARIMA(0,0,1) with non-zero mean 0.236
 6 e_kam               20 ARIMA(0,0,1) with non-zero mean 0.192
 7 japan                5 ARIMA(1,1,0)                    0.898
 8 japan               10 ARIMA(0,1,0)                    0.557
 9 japan               20 ARIMA(1,0,0) with non-zero mean 0.550
10 kod                  5 ARIMA(0,1,0)                    0.124
# ℹ 26 more rows

Results

Sockeye:

Plot MASE for three different forecast periods - 5, 10, and 20 years - across all regions. MASE < 1 is a “good” value.
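
The bar chart is not reproduced here; it uses the same ggplot idiom as the chum plot below, applied to ResultsTableSock (and assumes the sockeye regionskey is in scope):

Code
ggplot(ResultsTableSock) +
  geom_bar(aes(x = regions, y = MASE, fill = as.factor(forecastlevels)), stat = "identity", position = "dodge") +
  geom_hline(aes(yintercept = 1), linetype = "dashed") +
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region") +
  ggtitle("Sockeye") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))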

None of the ARIMA models performed well for the forecasts of 20 years of data. Below is a comparison of the three lengths of forecasted data for South British Columbia where MASE was below 1 for the 5 year forecast but >1 for the 10 and 20 year forecasts.

Code
sock.sbc<-subset(SockByRegion, region=='sbc')
sbc.ts<-ts(sock.sbc$lnreturns, start=sock.sbc$year[1])
#create training and test datasets for the 5, 10, and 20 year forecasts
train.sbc5<-window(sbc.ts, start=1952, end=2010)
test.sbc5<-window(sbc.ts, start=2011, end=2015)
sbc.final5<-forecast::auto.arima(train.sbc5, approximation = F, stepwise = F)

train.sbc10<-window(sbc.ts, start=1952, end=2005)
test.sbc10<-window(sbc.ts, start=2006, end=2015)
sbc.final10<-forecast::auto.arima(train.sbc10, approximation = F, stepwise = F)

train.sbc20<-window(sbc.ts, start=1952, end=1995)
test.sbc20<-window(sbc.ts, start=1996, end=2015)
sbc.final20<-forecast::auto.arima(train.sbc20, approximation = F, stepwise = F)

Here are plots of the three forecast scenarios for sockeye in South British Columbia.
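
The three panels are not reproduced here; a sketch using the cowplot idiom from elsewhere in this report:

Code
#5, 10, and 20 year forecasts against the held-out data
p5<-forecast(sbc.final5, h=5) %>% autoplot() +
  geom_point(aes(x=x, y=y), data=fortify(test.sbc5))
p10<-forecast(sbc.final10, h=10) %>% autoplot() +
  geom_point(aes(x=x, y=y), data=fortify(test.sbc10))
p20<-forecast(sbc.final20, h=20) %>% autoplot() +
  geom_point(aes(x=x, y=y), data=fortify(test.sbc20))
plot_grid(p5, p10, p20, ncol=3)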

Looking at stationarity:

Code
Ndiff<-sapply(RegionBestModSock, function(x){
  a<-strsplit(strsplit(strsplit(x, "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
  return(a[[1]][2])}
)

tibble(Ndiff = Ndiff, region = Allcombs$regions, level = Allcombs$forecastlevels) %>%
  ggplot() + geom_bar(aes(x = region, y = Ndiff, fill = as.factor(level)), stat = "identity", position = "dodge") +
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region", y = "Number of Differences") + 
  ggtitle("Number of differences to achieve stationarity (Sockeye)") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

For Sockeye, 6 regions required differencing for all of the three subsets of the data (3 forecasting levels). Two regions required differencing for one subset of the data; four regions were stationary at any level of subsetting.

None of the regional models for the 5-year forecast had autocorrelation in the residuals based on the results of the Ljung-Box test (p-value > 0.05).
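
A sketch of that check (per the Allcombs ordering, the 5-year fits are every third element of RegionModsSock):

Code
#Ljung-Box residual checks for the twelve 5 year sockeye fits
for(i in seq(1, length(RegionModsSock), by = 3)){
  checkresiduals(RegionModsSock[[i]]$Fit, plot = FALSE)
}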

Chum:

Code
ggplot(ResultsTableChum) + 
  geom_bar(aes(x = regions, y = MASE, fill = as.factor(forecastlevels)), stat = "identity", position = "dodge") + 
  geom_hline(aes(yintercept = 1), linetype = "dashed") + 
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region") + 
  ggtitle("Chum") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

Many of the 5 and 10 year forecasts for chum seem to perform well across regions. Models fit to the Russia data do the worst at forecasting for that region. Additionally, forecasts are poor across all of the horizons tested in E. Kamchatka, SE Alaska, and W. Kamchatka.

20 year forecast for Kodiak:

Code
chum.kod<-subset(ChumByRegion, region=='kod')
chum.ts<-ts(chum.kod$lnreturns, start=chum.kod$year[1])
#test datasets for plotting
test.kod20<-window(chum.ts, start=1996, end=2015)
forecast(RegionModsChum[[9]]$Fit, h = 20) %>% autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.kod20))

10 year forecast for Russia:

Code
chum.russ<-subset(ChumByRegion, region=='m_i')
chum.ts<-ts(chum.russ$lnreturns, start=chum.russ$year[1])
#test datasets for plotting
test.russ10<-window(chum.ts, start=2006, end=2015)
forecast(RegionModsChum[[11]]$Fit, h = 10) %>% autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.russ10))

It looks like returns really increase at the end of the time series in this region, which is likely why ARIMA models don’t forecast this well.

Looking at stationarity:

Code
Ndiff<-sapply(RegionBestModChum, function(x){
  a<-strsplit(strsplit(strsplit(x, "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
  return(a[[1]][2])}
)

tibble(Ndiff = Ndiff, region = Allcombs$regions, level = Allcombs$forecastlevels) %>%
  ggplot() + geom_bar(aes(x = region, y = Ndiff, fill = as.factor(level)), stat = "identity", position = "dodge") +
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region", y = "Number of Differences") + 
  ggtitle("Number of differences to achieve stationarity (Chum)") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

For Chum, time series for 5 of the regions were stationary (using tests from auto.arima) and 7 required 1 difference to be stationary.

There was significant autocorrelation (based on Ljung-Box test) in 4 of the regional models for Chum:

Code
ac_mods_chum<-c(26, 27, 31, 32) #indexes of models with autocorrelated residuals
for(i in 1:length(ac_mods_chum)){
  print(paste(regionskey[ResultsTableChum$regions[ac_mods_chum[i]]], ResultsTableChum$forecastlevels[ac_mods_chum[i]]))
  checkresiduals(RegionModsChum[[ac_mods_chum[i]]]$Fit)
}
[1] "SE Alaska 10"


    Ljung-Box test

data:  Residuals from ARIMA(1,0,0) with non-zero mean
Q* = 25.498, df = 9, p-value = 0.002467

Model df: 1.   Total lags used: 10

[1] "SE Alaska 20"


    Ljung-Box test

data:  Residuals from ARIMA(1,0,0) with non-zero mean
Q* = 15.578, df = 8, p-value = 0.04884

Model df: 1.   Total lags used: 9

[1] "NA 5"


    Ljung-Box test

data:  Residuals from ARIMA(0,1,1)
Q* = 35.287, df = 9, p-value = 5.302e-05

Model df: 1.   Total lags used: 10

[1] "NA 10"


    Ljung-Box test

data:  Residuals from ARIMA(0,1,1)
Q* = 34.843, df = 9, p-value = 6.35e-05

Model df: 1.   Total lags used: 10

This suggests that ARIMA models may not be the best fit in this instance.

Pink:

Plot MASE for three different forecast periods - 5, 10, and 20 years - across all regions. MASE < 1 is a “good” value.

Code
#All Years
ggplot(ResultsTable) + 
  geom_bar(aes(x = regions, y = MASE, fill = as.factor(forecastlevels)), stat = "identity", position = "dodge") + 
  geom_hline(aes(yintercept = 1), linetype = "dashed") + 
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region") + 
  ggtitle("Pinks all Years") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

Code
#Even Years 
ggplot(ResultsTable_even) + 
  geom_bar(aes(x = regions, y = MASE, fill = as.factor(forecastlevels)), stat = "identity", position = "dodge") + 
  geom_hline(aes(yintercept = 1), linetype = "dashed") + 
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region") + 
  ggtitle("Pinks Even Years") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

Code
#Odd Years 
ggplot(ResultsTable_odd) + 
  geom_bar(aes(x = regions, y = MASE, fill = as.factor(forecastlevels)), stat = "identity", position = "dodge") + 
  geom_hline(aes(yintercept = 1), linetype = "dashed") + 
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region") + 
  ggtitle("Pinks Odd Years") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

Despite MASE values that look fairly good for the pink model runs, some of the forecasts were underwhelming. Two regions, Cook Inlet and SE Alaska, were examined more closely, with models fit to all years, even years, and odd years.

The first region, Cook Inlet, performed poorly (relative to some of the other regions) when looking at MASE values.

Code
#Cook Inlet
#----------------------------------------------------
#All Years
pink.ci<-subset(PinkByRegion, region=='ci')
ci.ts<-ts(pink.ci$lnreturns, start=pink.ci$year[1], frequency = 1)

#create training and test datasets for the 5 year forecast 
train.ci5<-window(ci.ts, start=1952, end=2010)
test.ci5<-window(ci.ts, start=2011, end=2015)
ci.final5<-forecast::auto.arima(train.ci5, approximation = F, stepwise = F)

#create training and test datasets for the 10 year forecast 
train.ci10<-window(ci.ts, start=1952, end=2005)
test.ci10<-window(ci.ts, start=2006, end=2015)
ci.final10<-forecast::auto.arima(train.ci10, approximation = F, stepwise = F)

#create training and test datasets for the 20 year forecast 
train.ci20<-window(ci.ts, start=1952, end=1995)
test.ci20<-window(ci.ts, start=1996, end=2015)
ci.final20<-forecast::auto.arima(train.ci20, approximation = F, stepwise = F)

#------------------------------
#Even Years
pink.ci_even<-subset(PinkByRegion_even, region=='ci')
ci.ts_even<-ts(pink.ci_even$lnreturns, start=pink.ci_even$year[1], frequency = 0.5)

#create training and test datasets for the 5 year forecast 
train.ci_even5<-window(ci.ts_even, start=1952, end=2010)
test.ci_even5<-window(ci.ts_even, start=2011, end=2014)
ci_even.final5<-forecast::auto.arima(train.ci_even5, approximation = F, stepwise = F)

#create training and test datasets for the 10 year forecast 
train.ci_even10<-window(ci.ts_even, start=1952, end=2005)
test.ci_even10<-window(ci.ts_even, start=2006, end=2014)
ci_even.final10<-forecast::auto.arima(train.ci_even10, approximation = F, stepwise = F)

#create training and test datasets for the 20 year forecast 
train.ci_even20<-window(ci.ts_even, start=1952, end=1995)
test.ci_even20<-window(ci.ts_even, start=1996, end=2014)
ci_even.final20<-forecast::auto.arima(train.ci_even20, approximation = F, stepwise = F)

#------------------------------
#Odd Years
pink.ci_odd<-subset(PinkByRegion_odd, region=='ci')
ci.ts_odd<-ts(pink.ci_odd$lnreturns, start=pink.ci_odd$year[1], frequency = 0.5)

#create training and test datasets for the 5 year forecast 
train.ci_odd5<-window(ci.ts_odd, start=1953, end=2009)
test.ci_odd5<-window(ci.ts_odd, start=2011, end=2015)
ci_odd.final5<-forecast::auto.arima(train.ci_odd5, approximation = F, stepwise = F)

#create training and test datasets for the 10 year forecast 
train.ci_odd10<-window(ci.ts_odd, start=1953, end=2005)
test.ci_odd10<-window(ci.ts_odd, start=2007, end=2015)
ci_odd.final10<-forecast::auto.arima(train.ci_odd10, approximation = F, stepwise = F)

#create training and test datasets for the 20 year forecast 
train.ci_odd20<-window(ci.ts_odd, start=1953, end=1995)
test.ci_odd20<-window(ci.ts_odd, start=1997, end=2015)
ci_odd.final20<-forecast::auto.arima(train.ci_odd20, approximation = F, stepwise = F)

Let's look at forecast plots for the best-performing models fit to all data, even years, and odd years.

Code
#Plots
plot_5 <- ci.final5 %>%
  forecast(h=5) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci5))

plot_10 <- ci.final10 %>%
  forecast(h=10) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci10))

plot_20 <- ci.final20 %>%
  forecast(h=20) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci20))

plot_even_5 <- ci_even.final5 %>%
  forecast(h=5) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci_even5))

plot_even_10 <- ci_even.final10 %>%
  forecast(h=10) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci_even10))

plot_even_20 <- ci_even.final20 %>%
  forecast(h=20) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci_even20))

plot_odd_5 <- ci_odd.final5 %>%
  forecast(h=5) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci_odd5))

plot_odd_10 <- ci_odd.final10 %>%
  forecast(h=10) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci_odd10))

plot_odd_20 <- ci_odd.final20 %>%
  forecast(h=20) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.ci_odd20))

plot_grid(plot_5, plot_10, plot_20, plot_even_5, plot_even_10, plot_even_20, plot_odd_5, plot_odd_10, plot_odd_20, ncol = 3, nrow = 3)

None of the models fit well. Let’s look at one more region.

Code
#SE AK
#---------------------------------------------------
#All Years
pink.seak<-subset(PinkByRegion, region=='seak')
seak.ts<-ts(pink.seak$lnreturns, start=pink.seak$year[1], frequency = 1) #annual data: frequency = 1, matching the Cook Inlet all-years series

#create training and test datasets for the 5 year forecast 
train.seak5<-window(seak.ts, start=1952, end=2010)
test.seak5<-window(seak.ts, start=2011, end=2015)
seak.final5<-forecast::auto.arima(train.seak5, approximation = F, stepwise = F)

#create training and test datasets for the 10 year forecast 
train.seak10<-window(seak.ts, start=1952, end=2005)
test.seak10<-window(seak.ts, start=2006, end=2015)
seak.final10<-forecast::auto.arima(train.seak10, approximation = F, stepwise = F)

#create training and test datasets for the 20 year forecast 
train.seak20<-window(seak.ts, start=1952, end=1995)
test.seak20<-window(seak.ts, start=1996, end=2015)
seak.final20<-forecast::auto.arima(train.seak20, approximation = F, stepwise = F)

#----------------------------------------------------
#Even Years
pink.seak_even<-subset(PinkByRegion_even, region=='seak')
seak.ts_even<-ts(pink.seak_even$lnreturns, start=pink.seak_even$year[1], frequency = 0.5)

#create training and test datasets for the 5 year forecast 
train.seak_even5<-window(seak.ts_even, start=1952, end=2010)
test.seak_even5<-window(seak.ts_even, start=2012, end=2014)
seak_even.final5<-forecast::auto.arima(train.seak_even5, approximation = F, stepwise = F)

#create training and test datasets for the 10 year forecast 
train.seak_even10<-window(seak.ts_even, start=1952, end=2004)
test.seak_even10<-window(seak.ts_even, start=2006, end=2014)
seak_even.final10<-forecast::auto.arima(train.seak_even10, approximation = F, stepwise = F)

#create training and test datasets for the 20 year forecast 
train.seak_even20<-window(seak.ts_even, start=1952, end=1994)
test.seak_even20<-window(seak.ts_even, start=1996, end=2014)
seak_even.final20<-forecast::auto.arima(train.seak_even20, approximation = F, stepwise = F)

#------------------------------
#Odd Years
pink.seak_odd<-subset(PinkByRegion_odd, region=='seak')
seak.ts_odd<-ts(pink.seak_odd$lnreturns, start=pink.seak_odd$year[1], frequency = 0.5)

#create training and test datasets for the 5 year forecast 
train.seak_odd5<-window(seak.ts_odd, start=1953, end=2009)
test.seak_odd5<-window(seak.ts_odd, start=2011, end=2015)
seak_odd.final5<-forecast::auto.arima(train.seak_odd5, approximation = F, stepwise = F)

#create training and test datasets for the 10 year forecast 
train.seak_odd10<-window(seak.ts_odd, start=1953, end=2005)
test.seak_odd10<-window(seak.ts_odd, start=2007, end=2015)
seak_odd.final10<-forecast::auto.arima(train.seak_odd10, approximation = F, stepwise = F)

#create training and test datasets for the 20 year forecast 
train.seak_odd20<-window(seak.ts_odd, start=1953, end=1995)
test.seak_odd20<-window(seak.ts_odd, start=1997, end=2015)
seak_odd.final20<-forecast::auto.arima(train.seak_odd20, approximation = F, stepwise = F)

Again, let’s look at forecast plots for the best-performing models fit to all data, even years only, and odd years only.

Code
#Plots: overlay the held-out test observations (points) on each forecast
plot_5 <- seak.final5 %>%
  forecast(h=5) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak5))

plot_10 <- seak.final10 %>%
  forecast(h=10) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak10))

plot_20 <- seak.final20 %>%
  forecast(h=20) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak20))

plot_even_5 <- seak_even.final5 %>%
  forecast(h=5) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak_even5))

plot_even_10 <- seak_even.final10 %>%
  forecast(h=10) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak_even10))

plot_even_20 <- seak_even.final20 %>%
  forecast(h=20) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak_even20))

plot_odd_5 <- seak_odd.final5 %>%
  forecast(h=5) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak_odd5))

plot_odd_10 <- seak_odd.final10 %>%
  forecast(h=10) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak_odd10))

plot_odd_20 <- seak_odd.final20 %>%
  forecast(h=20) %>%
  autoplot() + geom_point(aes(x=x, y=y), data=fortify(test.seak_odd20))

plot_grid(plot_5, plot_10, plot_20, plot_even_5, plot_even_10, plot_even_20, plot_odd_5, plot_odd_10, plot_odd_20, ncol = 3, nrow = 3)

These forecasts were all fairly flat and did not capture the observed trends well; a quick test-set MASE check (sketched below) puts a number on that impression.
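
Held-out accuracy can be computed directly with forecast::accuracy(); a sketch using the SE AK objects above:

Code
#Sketch: score the 5-year SE AK forecast against the held-out test set.
#accuracy() returns training- and test-set metrics, including MASE.
fc.seak5 <- forecast(seak.final5, h = length(test.seak5))
forecast::accuracy(fc.seak5, test.seak5)["Test set", "MASE"]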

Next, we plotted the number of differences (d) selected by the best model for each region, for all years, even years, and odd years.

Code
#All Years 
#extract d from the best-model name strings, e.g. "ARIMA(p,d,q) with drift"
Ndiff<-sapply(RegionBestMod, function(x){
  a<-strsplit(strsplit(strsplit(x, "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
  return(a[[1]][2])}
)

tibble(Ndiff = Ndiff, region = Allcombs$regions, level = Allcombs$forecastlevels) %>%
  ggplot() + geom_bar(aes(x = region, y = Ndiff, fill = as.factor(level)), stat = "identity", position = "dodge") +
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region", y = "Number of Differences") + 
  ggtitle("Number of differences to achieve stationarity (Pink-All Years)") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

Code
#Even Years 
Ndiff_even<-sapply(RegionBestMod_even, function(x){
  a<-strsplit(strsplit(strsplit(x, "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
  return(a[[1]][2])}
)

tibble(Ndiff = Ndiff_even, region = Allcombs$regions, level = Allcombs$forecastlevels) %>%
  ggplot() + geom_bar(aes(x = region, y = Ndiff, fill = as.factor(level)), stat = "identity", position = "dodge") +
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region", y = "Number of Differences") + 
  ggtitle("Number of differences to achieve stationarity (Pink-Even Years)") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

Code
#Odd Years 
Ndiff_odd<-sapply(RegionBestMod_odd, function(x){
  a<-strsplit(strsplit(strsplit(x, "[(]")[[1]][2], "[)]")[[1]][1],"[,]")
  return(a[[1]][2])}
)

tibble(Ndiff = Ndiff_odd, region = Allcombs$regions, level = Allcombs$forecastlevels) %>%
  ggplot() + geom_bar(aes(x = region, y = Ndiff, fill = as.factor(level)), stat = "identity", position = "dodge") +
  scale_x_discrete(labels = as_labeller(regionskey)) +
  labs(fill = "Forecast Levels", x = "Region", y = "Number of Differences") + 
  ggtitle("Number of differences to achieve stationarity (Pink-Odd Years)") + theme_bw() + theme(axis.text.x=element_text(angle=-90, hjust = 0, vjust = 0.5 ))

When considering models that used all data, five regions needed differencing. In models that considered even years only, six regions needed differencing. In models that considered odd years only, ten regions needed differencing.

Finally, we’ll look at the residuals for the models whose Ljung-Box test indicated autocorrelation. There were only a handful: four of the all-years models, two even-year models, and one odd-year model.

Code
#Residuals 
#all
ac_mods_Pink<- c(10, 11, 15, 29)

for(i in 1:length(ac_mods_Pink)){
  print(paste(regionskey[ResultsTable$regions[ac_mods_Pink[i]]], ResultsTable$forecastlevels[ac_mods_Pink[i]]))
  checkresiduals(RegionMods[[ac_mods_Pink[i]]]$Fit)
}
[1] "Kodiak 5"


    Ljung-Box test

data:  Residuals from ARIMA(0,1,1) with drift
Q* = 12.949, df = 5, p-value = 0.02386

Model df: 1.   Total lags used: 6

[1] "Kodiak 10"


    Ljung-Box test

data:  Residuals from ARIMA(0,1,1) with drift
Q* = 13.668, df = 4, p-value = 0.008432

Model df: 1.   Total lags used: 5

[1] "Russia 20"


    Ljung-Box test

data:  Residuals from ARIMA(0,0,0) with non-zero mean
Q* = 13.669, df = 4, p-value = 0.008429

Model df: 0.   Total lags used: 4

[1] "SE Alaska 10"


    Ljung-Box test

data:  Residuals from ARIMA(0,1,1) with drift
Q* = 7.1988, df = 4, p-value = 0.1257

Model df: 1.   Total lags used: 5
Code
#Even
ac_mods_Pink_even<-c(19, 20) 

for(i in 1:length(ac_mods_Pink_even)){
  print(paste(regionskey[ResultsTable_even$regions[ac_mods_Pink_even[i]]], ResultsTable_even$forecastlevels[ac_mods_Pink_even[i]]))
  checkresiduals(RegionMods_even[[ac_mods_Pink_even[i]]]$Fit)
}
[1] "Prince William Sound 5"


    Ljung-Box test

data:  Residuals from ARIMA(0,0,0) with non-zero mean
Q* = 19.847, df = 6, p-value = 0.002949

Model df: 0.   Total lags used: 6

[1] "Prince William Sound 10"


    Ljung-Box test

data:  Residuals from ARIMA(0,0,0) with non-zero mean
Q* = 11.944, df = 5, p-value = 0.03556

Model df: 0.   Total lags used: 5
Code
#Odd
ac_mods_Pink_odd<-c(8) 

for(i in 1:length(ac_mods_Pink_odd)){
  print(paste(regionskey[ResultsTable_odd$regions[ac_mods_Pink_odd[i]]], ResultsTable_odd$forecastlevels[ac_mods_Pink_odd[i]]))
  checkresiduals(RegionMods_odd[[ac_mods_Pink_odd[i]]]$Fit)
}
[1] "Japan 10"


    Ljung-Box test

data:  Residuals from ARIMA(0,1,0)
Q* = 11.71, df = 5, p-value = 0.03899

Model df: 0.   Total lags used: 5
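
For reference, the Ljung-Box statistics that checkresiduals() prints can be reproduced with stats::Box.test(); a sketch using the first flagged model above (fit$arma stores p, q, P, Q, period, d, D, so p + q matches the reported model df):

Code
#Sketch: reproduce the Ljung-Box test reported by checkresiduals() above
#(lag = 6 matches the "Total lags used" for the Kodiak 5 model).
fit <- RegionMods[[ac_mods_Pink[1]]]$Fit
Box.test(residuals(fit), lag = 6, type = "Ljung-Box",
         fitdf = sum(fit$arma[1:2]))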


Discussion

Comparing the models across the three forecast horizons for SBC underlines that the data chosen for a forecast matter. Not only does the estimated ARIMA fit better at the shorter horizon (where it is trained on more data), but the training data for the 20-year forecast are not stationary, so the ARIMA parameters are very different from those estimated for the 5- and 10-year forecasts.
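
Stationarity of a training set can be checked directly: forecast::ndiffs() estimates, via a unit-root (KPSS) test, how many differences a series needs. A sketch on one of the 20-year training sets above:

Code
#Sketch: number of differences needed for stationarity (KPSS test by default);
#a non-zero answer means the raw training series is not stationary.
forecast::ndiffs(train.seak20)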

For sockeye, six regions had at least one forecast with a well-performing model (MASE < 1); however, six regions did not achieve MASE < 1 even for the 5-year forecast. None of the 20-year forecasts resulted in MASE < 1. This suggests that our data are stochastic enough that forecasting more than 5 years into the future is difficult. Another hypothesis is that the training data set is not long enough to generate a good model for the 20-year forecast, but from looking at the data we think the large inter-annual variability in returns is what makes it difficult to fit an accurate model. In general, the models that fit the data better (lower MASE) tended to include differencing and were more likely to have higher-order parameters.
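
For context, MASE scales the forecast’s mean absolute error by the in-sample mean absolute error of a one-step naive (random-walk) forecast, so MASE < 1 means the model outperformed simply carrying the last observation forward. A sketch of the non-seasonal definition:

Code
#Sketch of non-seasonal MASE: forecast MAE scaled by the in-sample MAE of a
#one-step naive forecast on the training data.
mase <- function(actual, pred, train){
  mean(abs(actual - pred)) / mean(abs(diff(train)))
}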

Models fit to chum do better than models fit to sockeye. There may be something about the pattern of chum population fluctuations that is more amenable to ARIMA modeling.

Model fits to pink salmon depended on data assumptions driven by their cyclic (two-year) life history, and while the MASE results were lower than for the other stocks, ARIMA models are likely not an appropriate method for assessing or predicting pink returns with any level of certainty.

Description of each team member’s contributions

All team members helped decide on the goal and ran the analyses for the individual species and all regions. Each team member wrote the code and created the results for one species. ZR researched approaches for measuring forecast accuracy and created functions to run the ARIMA models over multiple regions and select the best model (even when it differed from the one selected by auto.arima). ETS and MS modified this code to work with their own species. All team members helped write and edit the report.