Data often have spatial attributes
Ideal world:
- Plug spatial covariates into a GLM / GLMM
- Residuals are uncorrelated
16 May 2023
Data often have spatial attributes
Ideal world:
Need ‘wiggly’/smooth surface for approximating all spatial variables missing from model (‘latent’ variables)
Several approaches exist
gam(y ~ s(lon, lat) + s(time), ...) gam(y ~ s(lon, lat) + s(time) + ti(lon, lat, time, d=c(2,1)), ...) gam(y ~ s(lon, lat, by = "time") + s(time), ...)
SPDE differs in that it explicitly estimates meaningful parameters for spatial covariance function
SPDE and GAMs can produce the identical results
Miller, D.L., Glennie, R. & Seaton, A.E. Understanding the Stochastic Partial Differential Equation Approach to Smoothing. JABES 25, 1–16 (2020)
Flexible, can be exponential or Gaussian shape
Estimate spatial field as random effects
High dimensional datasets computationally challenging
Gaussian process predictive process models:
–
–
Data covary spatially (data that are closer are more similar)
Relationship between distance and covariance can be described with a spatial covariance function
Covariance function in 2D may be
A 2 dimensional “Gaussian Process”
A realization from a multivariate normal distribution with some covariance function
A way of estimating a wiggly surface to account for spatial and/or spatiotemporal correlation in data.
Alternatively, a way of estimating a wiggly surface to account for “latent” or unobserved variables.
As a bonus, it provides useful covariance parameter estimates: spatial variance and the distance at data points are effectively uncorrelated (“range”)
RandomFields::RFsimulate()
simulates univariate / multivariate fieldsfields::sim.rf()
simulates random fields on a gridgeoR::grf()
simulates random fields with irregular observationsglmmfields::sim_glmmfields()
simulates random fields with/without extreme valuessdmTMB::sdmTMB_simulate()
simulates univariate fields with sdmTMB
??? Homework: try to work through some of these yourself. Make some plots, and see how changing the covariance affects the smoothness of these fields.
Large observation error looks like noise
\(\sigma_{obs}\) >> \(\sigma_{O}\), \(\sigma_{E}\)
Georeferenced data often involve 1000s or more points
Like in the 1-D setting, we need to approximate the spatial field
sdmTMB uses an approach from INLA
SPDE: stochastic partial differential equation
The solution to a specific SPDE is a Gaussian random field (GRF) with Matérn covariance
This, and sparse precision matrices, let us efficiently fit approximations to GRFs to large spatial datasets
INLA is software that performs data wrangling for SPDE estimation
Implementing the SPDE with INLA requires constructing a ‘mesh’
INLA::inla.mesh.2d()
: lets many arguments be customized
INLA::meshbuilder()
: Shiny app for constructing a mesh, provides R code
Meshes can include barriers / islands / coastlines with shapefiles
INLA books https://www.r-inla.org/learnmore/books
sdmTMB has a function make_mesh()
to quickly construct a basic mesh
Details in next set of slides
sdmTMB is a user-friendly R package for modeling spatial processes
Familiar syntax to widely used functions/packages; glm()
, mgcv, glmmTMB, etc.
Performs fast (marginal) maximum likelihood estimation via TMB
Widely applicable to species distribution modelling, stock assessment inputs, bycatch models, etc. that include spatially referenced data
From CRAN:
install.packages("sdmTMB", dependencies = TRUE)
From source on GitHub:
xcode-select --install
Warning: The INLA package is large, not on CRAN, and can be slow to install. If you run into problems, try installing it separately first:
Prepare data (convert to UTMs, scale covariates, …)
Construct a mesh
Fit the model
Inspect the model (and possibly refit the model)
Predict from the model
Calculate any derived quantities
Prepare data: add_utm_columns()
Construct a mesh: make_mesh()
Fit the model: sdmTMB()
Inspect the model: print()
, sanity()
, tidy()
, residuals()
Predict from the model: predict()
Get derived quantities: get_index()
Prepare the data in wide format (i.e. row = observation)
Make sure there is no NA in the used covariates
NA in the response value is OK (internally checked)
Need a projection that preserves constant distance
UTMs work well for regional analyses
Helper function: sdmTMB::add_utm_columns()
Internally, guesses at UTM zone and uses the sf package:sf::st_as_sf()
, sf::st_transform()
, and sf::st_coordinates()
Or use the sf or sp packages yourself
d <- data.frame( lat = c(52.1, 53.4), lon = c(-130.0, -131.4) ) d <- sdmTMB::add_utm_columns(d, c("lon", "lat"))
## Detected UTM zone 9N; CRS = 32609.
## Visit https://epsg.io/32609 to verify.
d
units = "km"
make_mesh()
has 2 shortcuts to mesh construction
K-means algorithm: used to cluster points (e.g., n_knots = 100
); approach taken in VAST; sensitive to random seed
argument!
Cutoff: minimum allowed distance between vertices (e.g., cutoff = 10
) - the preferred option
Alternatively, build any INLA mesh and supply it to the mesh
argument in make_mesh()
Size of mesh has the single largest impact on fitting speed
cutoff
is in units of x and y (minimum triangle size)
d <- data.frame(x = runif(500), y = runif(500)) mesh <- make_mesh(d, xy_cols = c("x", "y"), cutoff = 0.1) mesh$mesh$n plot(mesh)
Set up is similar to glmmTMB()
. Common arguments:
fit <- sdmTMB( formula, data, mesh, time = NULL, family = gaussian(link = "identity"), spatial = c("on", "off"), spatiotemporal = c("iid", "ar1", "rw", "off"), silent = TRUE, ... )
See ?sdmTMB
sdmTMB uses a similar formula interface to widely used R packages
A formula is used to specify fixed effects and (optionally) random intercepts
# linear effect of x1: formula = y ~ x1 # add smoother effect of x2: formula = y ~ x1 + s(x2) # add random intercept by group g: formula = y ~ x1 + s(x2) + (1 | g)
# smoother effect of x: formula = y ~ s(x) # basis dimension of 5: formula = y ~ s(x, k = 5) # bivariate smoother effect of x & y: formula = y ~ s(x, y) # smoother effect of x1 varying by x2: formula = y ~ s(x1, by = x2) # other kinds of mgcv smoothers: formula = ~ s(month, bs = "cc", k = 12)
Smoothers are penalized (‘p-splines’), i.e. data determine ‘wiggliness’
Polynomials and omitting the intercept:
# transformations using `I()` notation y ~ depth + I(depth^2) # polynomial functions using `poly` y ~ poly(depth, degree = 2) # omit intercept y ~ -1 + as.factor(year) y ~ 0 + as.factor(year)
cpue ~ breakpt(temperature)
cpue ~ logistic(temperature)
Many of the same families used in glm()
, glmmTMB()
, mgcv::gam()
can be used here
Includes: gaussian()
, Gamma()
, binomial()
, poisson()
, Beta()
, student()
, tweedie()
, nbinom1()
, nbinom2()
, truncated_nbinom1()
, truncated_nbinom2()
, delta_gamma()
, delta_lognormal()
, delta_beta()
, and more…
All have link
arguments
See ?sdmTMB::Families
Useful for positive continuous data with zeros (e.g., biomass density per unit effort)
Dispersion ( \(\phi\) ) and power ( \(p\) ) parameters allow for a wide variety of shapes including many zeros
Also known as compound Poisson-Gamma distribution
Useful for continuous data with extreme values.
Can also be used to deal with positive data when using log link
df
parameters controls for the amount of tail in the distribution i.e. degree of “extreme” values
P.S: beware when df << 3. All noise is assimilated in the “observation error”
Delta/hurdle model has 2 sub-models:
presence/absence (binomial, logit link)
positive model (link varies by family)
family
argument to sdmTMB can be a list()
for convenience, many delta- families implemented: delta_gamma
, delta_lognormal
, delta_truncated_nbinom2
etc
Positive components may be modeled as a mixture of 2 distributions
Finite mixture model (2 components)
Also referred to as “ECE” (extreme catch event) model, Thorson et al. (2012)
Mechanisms: shoaling, etc.
See gamma_mix()
and lognormal_mix()
But requires data
A spatial field can be thought of as a spatial intercept
Spatiotemporal variation represents separate fields estimated for each time slice (possibly correlated)
sdmTMB()
estimates a spatial fieldfit <- sdmTMB( y ~ x, family = gaussian(), data = dat, mesh = mesh, spatial = "on", #<< ... )
If shared process across time slices isn’t of interest
If magnitude of spatiotemporal variability >> spatial variation
If confounded with other parameters (more later)
sdmTMB()
estimates a spatiotemporal field if the time
argument is specifiedfit <- sdmTMB( y ~ x, family = gaussian(), data = dat, mesh = mesh, time = "year", # column in `data` #<< spatiotemporal = "iid", #<< ... )
None (spatiotemporal = "off"
)
Independent (spatiotemporal = "iid"
)
Random walk (spatiotemporal = "rw"
)
Autoregressive (spatiotemporal = "ar1"
)
Why include spatiotemporal fields?
Inspecting, summarizing, predicting, etc.
Covered in examples in next slides.
predict()
: ?predict.sdmTMB
residuals()
: ?residuals.sdmTMB
tidy()
: ?tidy.sdmTMB
print()
sanity()
get_index()
...
data("pcod") mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 20) pcod$fyear <- as.factor(pcod$year) fit <- sdmTMB( density ~ fyear, time_varying = NULL, data = pcod, mesh = mesh, time = "year", spatiotemporal = "iid", family = tweedie(link = "log") )
sanity(fit)
nd <- replicate_df(qcs_grid, "year", unique(pcod$year)) nd$fyear <- as.factor(nd$year) predictions <- predict(fit, newdata = nd, return_tmb_object = TRUE)
get_index
to derive annual estimatesindx <- get_index(predictions) head(indx)