12 Mar 2019

Modeling zero-inflated data

Zero-inflated = data with an excess of zero counts

The degree of zero-inflation may be extreme


Or it may be more subtle

Lots of options for dealing with zeros

  1. Remove them from the dataset and ignore them

  2. Transform your data

  3. Work with more complicated statistical models

1. Dropping zeros


Can work in a limited number of cases BUT

Becomes problematic for

  • models that depend on autoregressive structure

  • models that don’t allow for missing data / NAs

  • datasets where the zeros aren’t random (Daphnia example)

  • datasets where zeros are caused by the response (density dependence, etc.)

2. Transforming data

With zero-inflated data, it's common to add a small constant and then transform the result – ln(), sqrt(), etc.

  • Problem: what constant should be added before we apply the transformation?

Example: Lake WA Conochilus


Lots of problems with this approach generally

  • the choice of constant is subjective

  • it leads to biases in the estimated parameters

Example: let’s fit auto.arima() to this time series using different constants

Example: Lake WA Conochilus

log(Conochilus + 0.001) = ARIMA(3,1,1) + no drift

log(Conochilus + 0.01) = ARIMA(3,1,2) + no drift

log(Conochilus + 0.1) = ARIMA(2,1,1) + no drift

log(Conochilus + 1) = ARIMA(5,1,0) + drift
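
A minimal sketch of how these fits can be reproduced, assuming the 'forecast' package and the lakeWAplanktonRaw data from the MARSS package:

library(forecast)
library(MARSS)
data(lakeWAplankton)  # contains lakeWAplanktonRaw
dat = as.data.frame(lakeWAplanktonRaw)

# refit the same series after adding different constants
for (const in c(0.001, 0.01, 0.1, 1)) {
  fit = auto.arima(log(dat$Conochilus + const))
  print(fit)
}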

References for adding constants (or not)

Ekwaru et al. 2017

O'Hara and Kotze 2010

3. Using more complicated statistical models

  1. Tweedie distribution

  2. Delta (or hurdle) models

Tweedie distribution

  • Tweedie is very flexible

  • Skewed continuous distribution with point mass at zero
    • unlike Gamma, which is undefined at zero

Implementation in R

Several R packages provide Tweedie family support

  • Specify link function (q) - relates predicted values to covariates

\(\mu_i^q = x_i^T b\)

  • Specify variance power (p) - relates variance to mean

\(\text{var}(y_i) = \phi \mu_i^p\)

  • When 1 < p < 2, the Tweedie is a compound Poisson-Gamma distribution with a point mass at zero (illustrated in the sketch below)
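
As a quick illustration, a sketch simulating from this region of the Tweedie with the 'tweedie' package's random number generator:

library(tweedie)
set.seed(123)
# compound Poisson-Gamma: exact zeros plus skewed positive values
y = rtweedie(1000, power = 1.3, mu = 2, phi = 1)
mean(y == 0)    # proportion of exact zeros
hist(y, breaks = 50)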

Let’s apply GLMs with Tweedie to Lake WA data

library(statmod)
library(MARSS)
data(lakeWAplankton)  # contains lakeWAplanktonRaw
# Tweedie GLM: var.power is the Tweedie p; link.power = 0 is a log link
g = glm(Conochilus ~ 1,
  family = tweedie(var.power = 1.3, link.power = 0),
  data = as.data.frame(lakeWAplanktonRaw))
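
Because link.power = 0 corresponds to a log link, the fitted intercept can be back-transformed to a mean density:

exp(coef(g))  # implied mean Conochilus density on the natural scale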

What’s the right variance power to choose?

Package ‘tweedie’ does ML estimation of the p parameter.

library(tweedie)
# profile the log-likelihood over a grid of candidate p values
tp = tweedie.profile(Conochilus ~ 1,
  p.vec = seq(1, 2, by = 0.1),
  data = as.data.frame(lakeWAplanktonRaw))

What’s the right variance power to choose?

plot(tp, xlab="p", ylab="log lik")
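
The profile object also stores the ML estimate of p and an approximate confidence interval:

tp$p.max  # ML estimate of the power parameter
tp$ci     # approximate confidence interval for p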

Other implementations in R

The Tweedie family is also supported in 'mgcv'. The p parameter may be fixed

library(mgcv)
gam(Conochilus ~ s(Year), family = Tweedie(1.25, power(0.1)),
  data = as.data.frame(lakeWAplanktonRaw))

or estimated

gam(Conochilus ~ s(Year), family = tw(),
  data = as.data.frame(lakeWAplanktonRaw))
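
With tw(), the estimated power shows up in the reported family when summarizing the fit; a minimal sketch:

fit = gam(Conochilus ~ s(Year), family = tw(),
  data = as.data.frame(lakeWAplanktonRaw))
summary(fit)  # the family line reports the estimated p, e.g. Tweedie(p = ...)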

Delta (aka hurdle) models

Tweedie models every data point as being generated from the same process

Alternative: use delta (or hurdle) models

  • These model the zeros with one sub-model, and the positive values with a second sub-model

  • Very widely used in fisheries (index standardization etc)

Sub-models

Binomial data (logit link): \(\text{logit}(p_i) = \log(p_i/(1-p_i)) = B X_i\)

Positive model (generally log link): \(\log(\mu_i) = C Z_i\)

  • Poisson, Gamma, NegBinomial, etc
  • \(Z_i\) and \(X_i\) may be identical
  • Where are the error terms? (GLMMs)

Time series

Adding these error terms allows us to turn a conventional GLM into a time series model with autoregressive behavior, e.g. \[Y_t \sim \text{Bernoulli}(p_t)\] \[\text{logit}(p_t) = B X_t \] \[X_{t} = \phi X_{t-1} + \epsilon_{t}\] \[\epsilon_t \sim \text{Normal}(0, \sigma)\]

  • Any DLM or SS model can be extended to include non-normal errors
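
To make this concrete, a minimal simulation of the Bernoulli model with an AR(1) state (the parameter values here are illustrative assumptions):

set.seed(1)
n = 100; phi = 0.8; sigma = 0.5
x = numeric(n)
for (t in 2:n) x[t] = phi * x[t-1] + rnorm(1, 0, sigma)
p = plogis(x)                      # inverse-logit
y = rbinom(n, size = 1, prob = p)  # Bernoulli observations
plot(y, type = "h")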

Hurdle models

Delta-GLM: we'll start by fitting 2 models to the Conochilus data

lakeWAplanktonRaw = as.data.frame(lakeWAplanktonRaw)
# presence/absence indicator for the binomial sub-model
lakeWAplanktonRaw$present = 
  ifelse(lakeWAplanktonRaw$Conochilus > 0, 1, 0)

# sub-model 1: probability of presence
model_01 = glm(present ~ Year + as.factor(Month),
  data = lakeWAplanktonRaw, family = binomial)
# sub-model 2: Gamma GLM fit to the positive densities only
model_pos = glm(Conochilus ~ Year + as.factor(Month),
  data = lakeWAplanktonRaw[which(lakeWAplanktonRaw$Conochilus > 0), ],
  family = Gamma(link = "log"))

Hurdle models

Now we can combine predictions from these models for estimates of Conochilus density

# expected density = Pr(present) * E[density | present]
lakeWAplanktonRaw$glm_pred = predict(model_01, 
  newdata = lakeWAplanktonRaw, type = "response") * 
  predict(model_pos, 
  newdata = lakeWAplanktonRaw, type = "response")
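
One quick check of fit, used in the model comparison table later, is the correlation between predictions and observations:

cor(lakeWAplanktonRaw$glm_pred, lakeWAplanktonRaw$Conochilus,
  use = "pairwise.complete.obs")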

Hurdle models

In the last example, we used glm() to put the model pieces together ourselves.

Alternatives:

pscl::hurdle()
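
pscl::hurdle() fits both parts in a single call, but expects count data. A sketch, assuming (purely for illustration) that we round Conochilus to integer counts:

library(pscl)
dat = as.data.frame(lakeWAplanktonRaw)
# hypothetical integer version of the response, for illustration only
dat$Conochilus_int = round(dat$Conochilus)
fit = hurdle(Conochilus_int ~ Year + as.factor(Month),
  data = dat, dist = "negbin", zero.dist = "binomial")
summary(fit)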

Hurdle models via mgcv

Given that the hurdle model fits with glm() weren't great, it may be worthwhile to try a GAM

Delta-GAM: we'll fit the same 2 sub-models to the Conochilus data, now with smooth terms

library(mgcv)
lakeWAplanktonRaw = as.data.frame(lakeWAplanktonRaw)
lakeWAplanktonRaw$present = 
  ifelse(lakeWAplanktonRaw$Conochilus > 0, 1, 0)

# cyclic cubic spline on Month captures seasonality
model_01 = gam(present ~ s(Year) + 
    s(Month, bs = "cc", k = 12), 
  data = lakeWAplanktonRaw, family = binomial)
model_pos = gam(Conochilus ~ s(Year) + 
    s(Month, bs = "cc", k = 12),
  data = lakeWAplanktonRaw[which(lakeWAplanktonRaw$Conochilus > 0), ], 
  family = Gamma(link = "log"))

Hurdle models

Predictions via the GAM. These look slightly better, but still don't capture the really high observations

lakeWAplanktonRaw$gam_pred = predict(model_01, 
  newdata=lakeWAplanktonRaw, type="response") * 
  predict(model_pos, 
  newdata=lakeWAplanktonRaw, type="response")

Extensions with adding mixed effects

Several R packages support mixed effects in zero-inflated models. 'glmmTMB' makes ML estimation relatively straightforward if your hurdle model is Poisson or negative binomial.

  • Instead we’ll add mixed effects with the Tweedie
mod <- glmmTMB(Conochilus~as.factor(Month) + (1|Year), 
  family=tweedie(), lakeWAplanktonRaw)
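
Predictions on the response scale can then be correlated with the observations, as in the comparison on the next slide (a sketch):

glmm_pred = predict(mod, newdata = lakeWAplanktonRaw, type = "response")
cor(glmm_pred, lakeWAplanktonRaw$Conochilus,
  use = "pairwise.complete.obs")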

Extensions with adding mixed effects

Comparing these models so far,

Model | Cor(pred, obs)
----- | --------------
GLM   | 0.355
GAM   | 0.377
GLMM  | 0.339

brms

Bayesian hurdle models, optionally with smooth functions and random effects!

Start with month (factor) and Year (linear trend) in the positive model; the hurdle part is intercept-only

library(brms)
# bf() holds both formulas: the positive part and the hurdle (hu) part
mod = brm(bf(Conochilus ~ as.factor(Month) + Year, 
    hu ~ 1), 
  data = lakeWAplanktonRaw, family = hurdle_gamma(), 
  chains = 1, iter = 1000)

brms

Now we can add month covariates to the hurdle part, and a smooth function on Year

  • We use k to select the wiggliness of the smooth.
  • It can be selected automatically, but probably better to put some thought into this!
library(brms)
mod = brm(bf(Conochilus ~ as.factor(Month) + s(Year, k = 10), 
    hu ~ as.factor(Month)), 
  data = lakeWAplanktonRaw, family = hurdle_gamma(), 
  chains = 1, iter = 1000)

brms

Or we could include the month term as a random effect (intercept)

library(brms)
# Month now enters both parts as a random intercept
mod = brm(bf(Conochilus ~ (1|Month) + s(Year, k = 10), 
    hu ~ (1|Month)), 
  data = lakeWAplanktonRaw, family = hurdle_gamma(), 
  chains = 1, iter = 1000)

brms

Each of these models can be evaluated with LOOIC for model selection, e.g.

library(brms)
mod = brm(bf(Conochilus ~ as.factor(Month) + s(Year, k = 10), 
    hu ~ as.factor(Month)), 
  data = lakeWAplanktonRaw, family = hurdle_gamma(), 
  chains = 1, iter = 1000)
loo::loo(mod)
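
With two or more fitted models, the LOO estimates can be compared directly; here mod2 stands in for a hypothetical second fit:

# mod2 is a second, hypothetical brms fit
loo_compare(loo(mod), loo(mod2))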

Summary: Advantages and disadvantages of Tweedie and delta-GLMMs

Advantages of Tweedie

  • Single link function
    • it can be awkward to have multiple link functions (logit, log) with covariate effects in each
    • one coefficient per covariate to interpret

Advantages of delta-GLMMs

  • More flexible
  • Estimation can be faster than for the Tweedie (particularly when the power parameter p is being estimated)
  • A mechanistic model for the zeros that you don't get with the Tweedie