Hidden Markov Models

Author

Liz Elmstrom, Terrance Wang, Emma Timmins-Schiffman

Published

May 9, 2023

General Questions

As part of the California Current Integrated Ecosystem Report, NOAA scientists do annual updates of stoplight charts for ecosystem indicators.

Through this analysis, we used these data to ask whether the stoplight biological indicators undergo regime or “transition” shifts through time.

Methods

Using the stoplight dataset, we developed 2- and 3-state multivariate HMMs with copepod-related biological indicators as our response variables. We then tested whether PDO-related variables affecting transition probablities. In all models, we assumed all responses are Gaussian. To facilitate the visual comparison across response variables, we ran models on both standardized (z-scored) and non-standardized data.

Hypotheses

Copepods and other zooplankton are responsive to oceanographic conditions that alter their physical environment and food sources. We hypothesize that 4 indicators of the copepod community - richness, N biomass, S biomass, and biological transition - will be significantly impacted by PDO transitions. If copepods are sensitive to PDO, we should see shifts in the copepod community on a ~decadal scale. Covariates related to the PDO should strengthen the model if there is a strong link between the PDO and copepods.

Data

Load the data

Code
library(depmixS4)
Loading required package: nnet
Loading required package: MASS
Loading required package: Rsolnp
Loading required package: nlme
Code
library(ggplot2)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.1     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::%||%()     masks base::%||%()
✖ dplyr::collapse() masks nlme::collapse()
✖ dplyr::filter()   masks stats::filter()
✖ dplyr::lag()      masks stats::lag()
✖ dplyr::select()   masks MASS::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
stoplight <- read.csv(here::here("Lab-4", "stoplight.csv"))

Wrangle the data

Code
cop.dat<-stoplight[9:12,]
cop.row<-cop.dat[,1]
cop.dat<-cop.dat[,3:26]
rownames(cop.dat)<-cop.row

#look at the data over time
cop.t<-t(cop.dat)
rows <- gsub("X","", rownames(cop.t))
row.names(cop.t)<-rows
cop.df<-data.frame(cop.t)
cop.df$Year<-row.names(cop.df)

# to make this tidyr happy
cop.df.long = cop.df %>% pivot_longer(-Year, names_to="indicator", values_to = "value") %>%
  mutate(Year=as.numeric(Year))

cop.df.long %>% group_by(indicator) %>% mutate(value=scale(value)) %>% 
  ggplot(aes(x=Year,y=value,color=indicator)) + geom_line() + theme_bw() + labs(y="Z score")

Code
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
Code
covar <- stoplight %>%
  filter(Ecosystem.Indicators %in% 
    c('Copepod_richness','N_copepod','S_copepod',"Deep_temperature","SST", "PDO_DecMarch","Biological_transition"))%>%
  select(!Type)%>% 
  t() %>%
  as_tibble %>% 
  row_to_names(row_number = 1)%>%
  mutate_if(is.character, as.numeric)
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
Code
covar$Year <- 1998:2021

# to make this tidyr happy
cv.df = covar %>% pivot_longer(-Year, names_to="variable", values_to="value") %>%
  # z scores
  group_by(variable) %>% mutate(z_score = scale(value)) %>% group_by(Year) %>%
  filter(variable %in% c("N_copepod","S_copepod","Copepod_richness","Biological_transition"))

# save this as the non z score version
big.df = cv.df %>%
  select(-z_score) %>%
  # repivot for data analysis 
  pivot_wider(names_from="variable", values_from="value")

# save as z score version
big.df.z = cv.df %>%
  select(-value) %>%
  # repivot for data analysis 
  pivot_wider(names_from="variable", values_from="z_score")

Results

Multivariate HMM - 1-3 states

Here we evaluate the data support for a multivariate HMM 1, 2, or 3 state model.

Code
# loop through a few states and compare AIC
set.seed(123)
AIC_tbl = data.frame(nstate=0, AIC=0)
fit_list = NULL
for (i in 1:3){
mod<-depmix(list(Copepod_richness ~ 1, N_copepod ~ 1, S_copepod ~ 1,Biological_transition ~ 1),
                nstates = i,
                family=list(gaussian(), gaussian(),gaussian(),gaussian()),
                data=big.df)
fitmod <- fit(mod)
fit_list[[i]] = fitmod
# Transition matrix
# summary(fitmod, which = "transition")
# AIC
AIC_tbl[i,"nstate"] = i
AIC_tbl[i,"AIC"] = AIC(fitmod)
}
converged at iteration 1 with logLik: -214.936 
converged at iteration 9 with logLik: -160.1462 
converged at iteration 5 with logLik: -127.0371 
Code
# do the same but with z scores
set.seed(123)
AIC_tbl_z = data.frame(nstate=0, AIC=0)
fit_list_z = NULL
for (i in 1:3){
mod<-depmix(list(Copepod_richness ~ 1, N_copepod ~ 1, S_copepod ~ 1,Biological_transition ~ 1),
                nstates = i,
                family=list(gaussian(), gaussian(),gaussian(),gaussian()),
                data=big.df.z)
fitmod <- fit(mod)
fit_list_z[[i]] = fitmod
# Transition matrix
# summary(fitmod, which = "transition")
# AIC
AIC_tbl_z[i,"nstate"] = i
AIC_tbl_z[i,"AIC"] = AIC(fitmod)
}
converged at iteration 1 with logLik: -134.1752 
converged at iteration 9 with logLik: -79.38551 
converged at iteration 5 with logLik: -46.27637 
Code
AIC_tbl
  nstate      AIC
1      1 445.8719
2      2 358.2925
3      3 318.0742
Code
AIC_tbl_z
  nstate      AIC
1      1 284.3505
2      2 196.7710
3      3 156.5527

2 states seem like the most reasonable. Though a 3rd state lowers the AIC, adding more states to an HMM usually lowers the AIC. We proceed with a 2 state HMM due to its interpretability (there are 2 states in the PDO fluctuation, our system of interest).

HMM 2-State plots

Code
# compare HMM with raw data and z score data
# RAW DATA
#most probable states
prstates2 <- apply(posterior(fit_list[[2]]) [,c("S1", "S2")], 1, which.max)
Warning in .local(object, ...): Argument 'type' not specified and will default
to 'viterbi'. This default may change in future releases of depmixS4. Please
see ?posterior for alternative options.
Code
# plot(prstates2, type='b', xlab='Time', ylab='State')

#estimated data/fits
mu2 <- summary(fit_list[[2]])[,1]
Initial state probabilities model 
pr1 pr2 
  0   1 

Transition matrix 
        toS1  toS2
fromS1 0.782 0.218
fromS2 0.446 0.554

Response parameters 
Resp 1 : gaussian 
Resp 2 : gaussian 
Resp 3 : gaussian 
Resp 4 : gaussian 
    Re1.(Intercept) Re1.sd Re2.(Intercept) Re2.sd Re3.(Intercept) Re3.sd
St1          -1.334  1.065           0.280  0.142          -0.146  0.101
St2           4.030  2.227          -0.366  0.394           0.325  0.187
    Re4.(Intercept) Re4.sd
St1         102.265  27.54
St2         219.239  81.85
Code
pred2 <- data.frame("year"=seq(min(big.df$Year), max(big.df$Year)), "fit" = mu2[prstates2])

# Z SCORE DATA
#most probable states
prstates2_z <- apply(posterior(fit_list_z[[2]]) [,c("S1", "S2")], 1, which.max)
Warning in .local(object, ...): Argument 'type' not specified and will default
to 'viterbi'. This default may change in future releases of depmixS4. Please
see ?posterior for alternative options.
Code
# plot(prstates2, type='b', xlab='Time', ylab='State')

#estimated data/fits
mu2_z <- summary(fit_list_z[[2]])[,1]
Initial state probabilities model 
pr1 pr2 
  0   1 

Transition matrix 
        toS1  toS2
fromS1 0.782 0.218
fromS2 0.446 0.554

Response parameters 
Resp 1 : gaussian 
Resp 2 : gaussian 
Resp 3 : gaussian 
Resp 4 : gaussian 
    Re1.(Intercept) Re1.sd Re2.(Intercept) Re2.sd Re3.(Intercept) Re3.sd
St1          -0.649  0.341           0.581  0.338          -0.650  0.370
St2           1.070  0.713          -0.958  0.937           1.072  0.683
    Re4.(Intercept) Re4.sd
St1          -0.548  0.342
St2           0.904  1.016
Code
pred2_z <- data.frame("year"=seq(min(big.df$Year), max(big.df$Year)), "fit" = mu2_z[prstates2_z])

# Raw data
ggplot() + 
  geom_point(data=cv.df,aes(x=Year,y=value,color=variable)) +
  geom_line(data=pred2, aes(year, fit)) + 
  ggtitle("Copepods - raw data and predictions") + 
  ylab("Indicator Value") + xlab("Year") + theme_bw()

Code
# Z score
ggplot() + 
  geom_point(data=cv.df,aes(x=Year,y=z_score,color=variable),shape=4,size=3) +
  geom_line(data=pred2_z, aes(year, fit), linetype = "dashed") + 
  ggtitle("Copepods - z-score data and predictions") + 
  ylab("Indicator Value") + xlab("Year") + theme_bw()

Code
# Transition Tables
summary(fit_list[[2]], which = "transition")
Transition matrix 
        toS1  toS2
fromS1 0.782 0.218
fromS2 0.446 0.554
Code
summary(fit_list_z[[2]], which = "transition")
Transition matrix 
        toS1  toS2
fromS1 0.782 0.218
fromS2 0.446 0.554

There seems to be 2 state system, where the first state is associated with high copepod richness and S copepod, and low N copepod The second state is associated with low copepod richness and S copepod, and high N copepod. The probability of transitioning from state 1 to state 2 is 0.218. The probability of transitioning from state 2 to state 1 is 0.446.

Our analysis shows that z-scoring the indicator data shows the same state fits as the raw data results, albeit at different scales. Z-scores are only useful for plotting purposes. Transition probabilities are identical. Though z-scoring the data is helpful in visualizing the contrast among states, we proceed with raw data when adding covariates.

Multivariate HMM with covariates

Here we test the data support for including SST and Deep temperature as covariates that affect the transition probabilities of our 2-state HMM model.

Code
#building the 2 state model
set.seed(123)
# model sel
cov.terms<-c("SST","Deep_temperature")
# Create a table of indicators as to whethr that predictor is included
mod.ind <- as.matrix(expand.grid(c(FALSE,TRUE),c(FALSE,TRUE)))
mod.tab <- data.frame(mod.ind)
names(mod.tab) <- cov.terms

# Define a column for storing AIC
mod.tab$AIC  <- NA
fit_covar_list = NULL
for (i in 1:nrow(mod.tab)){
  ind_on=as.matrix(mod.tab[i,1:length(cov.terms)])
  
  # Set which covariates should be included in formulas
  if (length(cov.terms[ind_on])==0){
    formula = ~1
  }else{
    formula =  paste("~",paste(cov.terms[ind_on],collapse="+"),sep="")
  }

  mod.2covar <-depmix(list(Copepod_richness ~ 1, N_copepod ~ 1, S_copepod ~ 1,Biological_transition ~ 1),
                nstates = 2,# Could change this to three. Adding states almost always improves AICs
                family=list(gaussian(), gaussian(),gaussian(),gaussian()),
                transition = eval(parse(text=formula)),## Affects the probability of transitioning
                data=covar)
  fit_covar_list[[i]]= fit(mod.2covar)
  
  mod.tab[i,"AIC"] = AIC((fit_covar_list[[i]]))
}
converged at iteration 9 with logLik: -160.1462 
converged at iteration 5 with logLik: -157.8728 
converged at iteration 8 with logLik: -159.0089 
converged at iteration 7 with logLik: -153.9941 
Code
mod.tab
    SST Deep_temperature      AIC
1 FALSE            FALSE 358.2925
2  TRUE            FALSE 357.7455
3 FALSE             TRUE 360.0179
4  TRUE             TRUE 353.9881

Inclusion of both SST and Deep Temp as covariates for the transition matrix seems to improve the 2-state model.

Effect of covariates

Code
convert_probs = function(fit, coef_value = 0) {
  # extract coefficients
  p1 <- fit@transition[[1]]@parameters$coefficients
  p2 <- fit@transition[[2]]@parameters$coefficients
  
  # assemble transition matrix in mlogit space
  m <- matrix(0, 2, 2)
  m[1,] <- p1[1,] + p1[2,] * coef_value
  m[2,] <- p2[1,] + p2[2,] * coef_value
  
  # exponentiate
  m <- exp(m)
  
  # normalize
  m[1,] <- m[1,] / sum(m[1,])
  m[2,] <- m[2,] / sum(m[2,])
  return(m)
}

Checking out the covariate effects separately.

Code
SST_mod <- fit_covar_list[[2]]
#summary(SST_mod)

SST_mod@transition
[[1]]
Model of type multinomial (mlogit), formula: ~SST
Coefficients: 
            St1       St2
(Intercept)   0 62.299316
SST           0 -4.599023
Probalities at zero values of the covariates.
8.785182e-28 1 

[[2]]
Model of type multinomial (mlogit), formula: ~SST
Coefficients: 
            St1        St2
(Intercept)   0  5.7093173
SST           0 -0.3275636
Probalities at zero values of the covariates.
0.003303982 0.996696 
Code
convert_probs(SST_mod, coef_value = 0)
             [,1]     [,2]
[1,] 8.785182e-28 1.000000
[2,] 3.303982e-03 0.996696
Code
convert_probs(SST_mod, coef_value = 1)
             [,1]      [,2]
[1,] 8.731344e-26 1.0000000
[2,] 4.578688e-03 0.9954213

SST has a negative effect on the transition from state 2 to state 1. The transition probability from state 1 to state 2 stays roughly the same.

Code
DT_mod <- fit_covar_list[[3]]
#summary(DT_mod)

DT_mod@transition
[[1]]
Model of type multinomial (mlogit), formula: ~Deep_temperature
Coefficients: 
                 St1       St2
(Intercept)        0 -24.24479
Deep_temperature   0   2.78945
Probalities at zero values of the covariates.
1 2.955431e-11 

[[2]]
Model of type multinomial (mlogit), formula: ~Deep_temperature
Coefficients: 
                 St1       St2
(Intercept)        0 35.136158
Deep_temperature   0 -4.392846
Probalities at zero values of the covariates.
5.502504e-16 1 
Code
convert_probs(DT_mod, coef_value = 0)
             [,1]         [,2]
[1,] 1.000000e+00 2.955431e-11
[2,] 5.502504e-16 1.000000e+00
Code
convert_probs(DT_mod, coef_value = 1)
             [,1]         [,2]
[1,] 1.000000e+00 4.809097e-10
[2,] 4.449887e-14 1.000000e+00

Deep temperature has a positive effect on the transition probabilites.

Inspecting the best model

Code
summary(fit_covar_list[[4]])
Initial state probabilities model 
pr1 pr2 
  1   0 

Transition model for state (component) 1 
Model of type multinomial (mlogit), formula: ~SST + Deep_temperature
Coefficients: 
                 St1       St2
(Intercept)        0 124.42887
SST                0 -90.16694
Deep_temperature   0 137.30232
Probalities at zero values of the covariates.
9.145973e-55 1 

Transition model for state (component) 2 
Model of type multinomial (mlogit), formula: ~SST + Deep_temperature
Coefficients: 
                 St1         St2
(Intercept)        0  5.54344693
SST                0 -0.33007969
Deep_temperature   0  0.02560362
Probalities at zero values of the covariates.
0.003897764 0.9961022 


Response parameters 
Resp 1 : gaussian 
Resp 2 : gaussian 
Resp 3 : gaussian 
Resp 4 : gaussian 
    Re1.(Intercept) Re1.sd Re2.(Intercept) Re2.sd Re3.(Intercept) Re3.sd
St1           4.056  2.211          -0.371  0.391           0.327  0.187
St2          -1.329  1.067           0.281  0.142          -0.145  0.102
    Re4.(Intercept) Re4.sd
St1         219.667 81.944
St2         102.467 27.677
Code
#most probable states
prstates_covar <- apply(posterior(fit_covar_list[[4]]) [,c("S1", "S2")], 1, which.max)
Warning in .local(object, ...): Argument 'type' not specified and will default
to 'viterbi'. This default may change in future releases of depmixS4. Please
see ?posterior for alternative options.
Code
plot(prstates_covar, type='b', xlab='Time', ylab='State')

Code
#estimated data
mu_covar <- summary(fit_covar_list[[4]])[,1]
Initial state probabilities model 
pr1 pr2 
  1   0 

Transition model for state (component) 1 
Model of type multinomial (mlogit), formula: ~SST + Deep_temperature
Coefficients: 
                 St1       St2
(Intercept)        0 124.42887
SST                0 -90.16694
Deep_temperature   0 137.30232
Probalities at zero values of the covariates.
9.145973e-55 1 

Transition model for state (component) 2 
Model of type multinomial (mlogit), formula: ~SST + Deep_temperature
Coefficients: 
                 St1         St2
(Intercept)        0  5.54344693
SST                0 -0.33007969
Deep_temperature   0  0.02560362
Probalities at zero values of the covariates.
0.003897764 0.9961022 


Response parameters 
Resp 1 : gaussian 
Resp 2 : gaussian 
Resp 3 : gaussian 
Resp 4 : gaussian 
    Re1.(Intercept) Re1.sd Re2.(Intercept) Re2.sd Re3.(Intercept) Re3.sd
St1           4.056  2.211          -0.371  0.391           0.327  0.187
St2          -1.329  1.067           0.281  0.142          -0.145  0.102
    Re4.(Intercept) Re4.sd
St1         219.667 81.944
St2         102.467 27.677
Code
pred_covar <- data.frame("year"=seq(min(covar$Year), max(covar$Year)), "fit" = mu_covar[prstates_covar])

ggplot(covar) + 
  geom_point(data=cv.df,aes(x=Year,y=z_score,color=variable),size=2,alpha=0.7) +
  geom_line(data=pred2, aes(year, scale(fit),color="w/o cov"))+
  geom_line(data=pred_covar, aes(year, scale(fit),color="w cov")) + 
  ggtitle("2 state model - with vs without covariates") + 
  ylab("Z score") + xlab("Year") + theme_bw()

Code
## Fits almost the same with and without covariates

Adding covariates only slightly changes the fits of states, but since the AIC is decreased by < 10 it may not be worth including the covariates in the final model. The transition from state 1 to 2 only differ by 1 year around 1998 and 2006.

Discussion

Our best model was a 2 state model using N copepod, S copepod, copepod richness, and biological transition as indicators with SST and deep temperature as covariates. The 2nd state of high copepod richness (warm PDO), high S copepod biomass (winter dominant), low N copepod biomass (summer dominant), and late biological transition (winter>summer) confirms what we know about copepod ecology in the CA Current. The transition matrix suggests that the systems tends to remain in the 1st state on average. This 1st state tends to be associated with the cool PDO.

Temperature covariates seem to be minmally important in explaining the state transitions. However, if covariates are included, warmer temperatures are generally associated with increased copepod diversity.

Team contributions

All team members contributed to experimental design and planning. ETS did the first draft of the 2- and 3- state HMMs and provided feedback and editing of the other team members’ work. Liz wrote the initial code to include the effect of covariates on the transition probabilities. Terrance improved and cleaned up code, and created the for loops for model selection with z scores. All members edited and contributed to writing the Methods, Results and Discussions sections.