Zero-inflated = data with excess of zero counts
The degree of zero-inflation may be extreme
12 Mar 2019
Zero-inflated = data with excess of zero counts
The degree of zero-inflation may be extreme
Zero-inflated = data with excess of zero counts
Or more subtle
Remove them from the dataset and ignore them
Transform your data
Work with complicated statistical models
Can work in a limited number of cases BUT
Becomes problematic for
models that depend on autoregressive structure
models that don’t allow for missing data / NAs
datasets where the zeros aren’t random (Daphnia example)
zeros are caused by response (density dependence, etc)
With zero-inflated data, it’s common to add a small number and transform the results – ln(), sqrt(), etc.
Lots of problems with this approach generally
choice of constant is subjective
leads to biases in parameters being estimated
Example: let’s fit auto.arima() to this time series using different constants
log(Conochilus + 0.001) = ARIMA(3,1,1) + no drift
log(Conochilus + 0.01) = ARIMA(3,1,2) + no drift
log(Conochilus + 0.1) = ARIMA(2,1,1) + no drift
log(Conochilus + 1) = ARIMA(5,1,0) + drift
Ekwaru et al. 2017 link
O’hara and Kotze 2010 link
Tweedie distribution
Delta or hurdle- models
Tweedie is very flexible
Lots of existing family support
\(μ_i^q = x_i^Tb\)
\(var(y_i) = phi * μ_i^p\)
library(statmod) g = glm(Conochilus ~ 1, family= tweedie(var.power=1.3,link.power=0), data=as.data.frame(lakeWAplanktonRaw))
Package ‘tweedie’ does ML estimation of the p parameter.
library(tweedie) tp = tweedie.profile(Conochilus ~ 1, p.vec = seq(1,2,by=0.1), data=as.data.frame(lakeWAplanktonRaw))
plot(tp, xlab="p", ylab="log lik")
Also supported in ‘mgcv’. p parameter may be fixed
gam(Conochilus~s(year), family=Tweedie(1.25,power(.1)), data=as.data.frame(lakeWAplanktonRaw))
or estimated
gam(Conochilus~s(year), family=tw(), data=as.data.frame(lakeWAplanktonRaw))
Tweedie models every data point as being generated from the same process
Alternative: use delta- or hurdle- models
These model the zeros with one sub-model, and the positive values with a second sub-model
Very widely used in fisheries (index standardization etc)
Binomial data (logit link) \(logit(p_i) = log(p_i/(1-p_i)) = BX_i\)
Positive model (log link generally) \(log(u_i) = CZ_i\)
Adding these error terms allow us to turn conventional GLM into time series models with autoregressive behavior, e.g. \[Y_t \sim Bernoulli(p_t)\] \[logit(p_t) = BX_t \] \[X_{t} = BX_{t-1} + \epsilon_{t-1}\] \[\epsilon_t \sim Normal(0, \sigma)\]
Delta-GLM: We’ll start fitting 2 models to the Conochilus data
lakeWAplanktonRaw = as.data.frame(lakeWAplanktonRaw) lakeWAplanktonRaw$present = ifelse(lakeWAplanktonRaw$Conochilus > 0, 1, 0) model_01 = glm(present ~ Year + as.factor(Month), data=lakeWAplanktonRaw, family = binomial) model_pos = glm(Conochilus ~ Year + as.factor(Month), data=lakeWAplanktonRaw[which(lakeWAplanktonRaw$Conochilus>0),], family = Gamma(link="log"))
Now we can combine predictions from these models for estimates of Conochilus density
lakeWAplanktonRaw$glm_pred = predict(model_01, newdata=lakeWAplanktonRaw, type="response") * predict(model_pos, newdata=lakeWAplanktonRaw, type="response")
In the last example, we used glm() to put the model pieces together ourselves.
Alternatives:
pscl::hurdle()
Given the fits to the hurdle model with GLM() weren’t great, it may be worthwhile to try a GAM
Delta-GAM: We’ll start fitting 2 models to the Conochilus data
lakeWAplanktonRaw = as.data.frame(lakeWAplanktonRaw) lakeWAplanktonRaw$present = ifelse(lakeWAplanktonRaw$Conochilus > 0, 1, 0) model_01 = gam(present ~ s(Year) + s(Month,bs="cc",k=12), data=lakeWAplanktonRaw, family = binomial) model_pos = gam(Conochilus ~ s(Year) + s(Month,bs="cc",k=12), data=lakeWAplanktonRaw[which(lakeWAplanktonRaw$Conochilus>0),], family = Gamma(link="log"))
Predictions via the GAM. These look maybe slightly better but don’t appear to capture really high observations
lakeWAplanktonRaw$gam_pred = predict(model_01, newdata=lakeWAplanktonRaw, type="response") * predict(model_pos, newdata=lakeWAplanktonRaw, type="response")
Several R packages out there. ‘glmmTMB’ makes ML estimation relatively straightforward if your hurdle model = Poisson or NegBin.
mod <- glmmTMB(Conochilus~as.factor(Month) + (1|Year), family=tweedie(), lakeWAplanktonRaw)
Comparing these models so far,
Model | Cor (pred,obs) |
---|---|
GLM | 0.355 |
GAM | 0.377 |
GLMM | 0.339 |
Bayesian hurdle models, optionally with smooth functions and random effects!
Start with month (factor) and Year (linear trend) in positive model, hurdle is intercept only
library(brms) mod = brm(bf(Conochilus~ as.factor(Month) + Year, hu ~ 1, data = lakeWAplanktonRaw, family = hurdle_gamma(), chains=1, iter=1000)
Now we can add month covariates to the hurdle part, and a smooth function on Year
library(brms) mod = brm(bf(Conochilus~ as.factor(Month) + s(Year, k=10), hu ~ as.factor(Month), data = lakeWAplanktonRaw, family = hurdle_gamma(), chains=1, iter=1000)
Or we could include the month term as a random effect (intercept)
library(brms) mod = brm(bf(Conochilus~ as.factor(Month) + s(Year, k=10), hu ~ as.factor(Month), data = lakeWAplanktonRaw, family = hurdle_gamma(), chains=1, iter=1000)
Each of these models can be evaluated with LOOIC for model selection, e.g.
library(brms) mod = brm(bf(Conochilus~ as.factor(Month) + s(Year, k=10), hu ~ as.factor(Month), data = lakeWAplanktonRaw, family = hurdle_gamma(), chains=1, iter=1000) loo::loo(mod)
Advantages of Tweedie
Advantages of delta-GLMMs