Fitting spatial and spatiotemporal models

16 May 2023

Motivating questions

Data often have spatial attributes
Ideal world:
- Plug spatial covariates into a GLM / GLMM
- Residuals are uncorrelated

Reality

Residual spatial autocorrelation

Modeling spatial autocorrelation

Need ‘wiggly’/smooth surface for approximating all spatial variables missing from model (‘latent’ variables)
Several approaches exist
- 2D smooths in mgcv
- Random fields and the Stochastic Partial Differential Equation (SPDE)

Spatial smoothing with GAMs (mgcv)

gam(y ~ s(lon, lat) + s(time), ...)

gam(y ~ s(lon, lat) + s(time) + 
      ti(lon, lat, time, d=c(2,1)), ...)

gam(y ~ s(lon, lat, by = "time") + 
      s(time), ...)

Spatial smoothing with SPDE (INLA, inlabru)

SPDE differs in that it explicitly estimates meaningful parameters for spatial covariance function
SPDE and GAMs can produce the identical results
Miller, D.L., Glennie, R. & Seaton, A.E. Understanding the Stochastic Partial Differential Equation Approach to Smoothing. JABES 25, 1–16 (2020)

Matérn covariance

Flexible, can be exponential or Gaussian shape

Predictive process models

Estimate spatial field as random effects
High dimensional datasets computationally challenging
Gaussian process predictive process models:
- Estimate values at a subset of locations in the time series
- ‘knots’, ‘vertices’, or ‘control points’
- Use covariance function to interpolate from knots to locations of observations

Predictive process models

More knots (vertical dashed lines) = more wiggliness & parameters to estimate

Spatial data types

Lattice: gridded data, e.g. interpolated SST from satellite observations

–

Areal: data collected in neighboring spatial areas, e.g. commercial catch records by state / county

–

Georeferenced: data where observations are associated with latitude and longitude
- Locations may be unique or repeated (stations)

Why is space important?

Data covary spatially (data that are closer are more similar)
Relationship between distance and covariance can be described with a spatial covariance function
Covariance function in 2D may be
- isotropic (same covariance in each direction)
- anisotropic (different in each direction)

What is a random field?

Random field

A 2 dimensional “Gaussian Process”
A realization from a multivariate normal distribution with some covariance function

Random field

A way of estimating a wiggly surface to account for spatial and/or spatiotemporal correlation in data.
Alternatively, a way of estimating a wiggly surface to account for “latent” or unobserved variables.
As a bonus, it provides useful covariance parameter estimates: spatial variance and the distance at data points are effectively uncorrelated (“range”)

Many ways to simulate random fields

RandomFields::RFsimulate() simulates univariate / multivariate fields
fields::sim.rf() simulates random fields on a grid
geoR::grf() simulates random fields with irregular observations
glmmfields::sim_glmmfields() simulates random fields with/without extreme values
sdmTMB::sdmTMB_simulate() simulates univariate fields with sdmTMB

??? Homework: try to work through some of these yourself. Make some plots, and see how changing the covariance affects the smoothness of these fields.

Effects of changing variance and range

Effects of adding noise

Large observation error looks like noise
\(\sigma_{obs}\) >> \(\sigma_{O}\), \(\sigma_{E}\)

Moderate observation errors

\(\sigma_{obs}\) = \(\sigma_{O}\) = \(\sigma_{E}\)

Small observation errors

\(\sigma_{obs}\) << \(\sigma_{O}\), \(\sigma_{E}\)

Estimating random fields

Georeferenced data often involve 1000s or more points
Like in the 1-D setting, we need to approximate the spatial field
- Options include nearest neighbor methods, covariance tapering, etc.
sdmTMB uses an approach from INLA
- for VAST users, this is the same
- INLA books:
  https://www.r-inla.org/learnmore/books

INLA and the SPDE approach

SPDE: stochastic partial differential equation
The solution to a specific SPDE is a Gaussian random field (GRF) with Matérn covariance
This, and sparse precision matrices, let us efficiently fit approximations to GRFs to large spatial datasets
INLA is software that performs data wrangling for SPDE estimation
- INLA also performs approximate Bayesian estimation
- sdmTMB uses INLA to wrangle matrices, but uses TMB for maximum likelihood estimation

Introducing meshes

Implementing the SPDE with INLA requires constructing a ‘mesh’

Mesh construction

A unique mesh is generally made for each dataset
Rules of thumb:
- More triangles = more computation time
- More triangles = more fine-scale spatial predictions
- Borders with coarser resolution reduce number of triangles
- Use minimum edge size to avoid meshes becoming too fine
- Want fewer vertices than data points
- Triangle edge size needs to be smaller than spatial range
“How to make a bad mesh?” Haakon Bakka’s book

Building your own mesh

INLA::inla.mesh.2d(): lets many arguments be customized
INLA::meshbuilder(): Shiny app for constructing a mesh, provides R code
Meshes can include barriers / islands / coastlines with shapefiles
INLA books https://www.r-inla.org/learnmore/books

Simplifying mesh construction in sdmTMB

sdmTMB has a function make_mesh() to quickly construct a basic mesh

Details in next set of slides

Example: cutoff = 50km

Example: cutoff = 25km

Example: cutoff = 10km

sdmTMB highlights

sdmTMB is a user-friendly R package for modeling spatial processes

Familiar syntax to widely used functions/packages; glm(), mgcv, glmmTMB, etc.
Performs fast (marginal) maximum likelihood estimation via TMB
Widely applicable to species distribution modelling, stock assessment inputs, bycatch models, etc. that include spatially referenced data

Installing sdmTMB

From CRAN:

install.packages("sdmTMB", dependencies = TRUE)

From source on GitHub:

Install a C++ compiler
- e.g., Rtools on Windows
- e.g., Xcode command line tools on a Mac:
  xcode-select --install

Installing sdmTMB

Warning: The INLA package is large, not on CRAN, and can be slow to install. If you run into problems, try installing it separately first:

https://www.r-inla.org/download-install

General sdmTMB workflow

Prepare data (convert to UTMs, scale covariates, …)
Construct a mesh
Fit the model
Inspect the model (and possibly refit the model)
Predict from the model
Calculate any derived quantities

General sdmTMB workflow

Prepare data: add_utm_columns()
Construct a mesh: make_mesh()
Fit the model: sdmTMB()
Inspect the model: print(), sanity(), tidy(), residuals()
Predict from the model: predict()
Get derived quantities: get_index()

Preparing the data

Prepare the data in wide format (i.e. row = observation)

Make sure there is no NA in the used covariates

NA in the response value is OK (internally checked)

Preparing the “space”: the use of UTMs

Need a projection that preserves constant distance

UTMs work well for regional analyses

Helper function: sdmTMB::add_utm_columns()

Internally, guesses at UTM zone and uses the sf package:
sf::st_as_sf(), sf::st_transform(), and sf::st_coordinates()

Or use the sf or sp packages yourself

Example of adding UTM columns

d <- data.frame(
  lat = c(52.1, 53.4), 
  lon = c(-130.0, -131.4)
)
d <- sdmTMB::add_utm_columns(d, c("lon", "lat"))

## Detected UTM zone 9N; CRS = 32609.

## Visit https://epsg.io/32609 to verify.

Note default units = "km"
Why? Range parameter estimated in units of X and Y
Should be not too big or small for estimation

Constructing a mesh

make_mesh() has 2 shortcuts to mesh construction

K-means algorithm: used to cluster points (e.g., n_knots = 100); approach taken in VAST; sensitive to random seed argument!
Cutoff: minimum allowed distance between vertices (e.g., cutoff = 10) - the preferred option

Alternatively, build any INLA mesh and supply it to the mesh argument in make_mesh()

Constructing a mesh

Size of mesh has the single largest impact on fitting speed

cutoff is in units of x and y (minimum triangle size)

d <- data.frame(x = runif(500), y = runif(500))
mesh <- make_mesh(d, xy_cols = c("x", "y"), cutoff = 0.1)
mesh$mesh$n
plot(mesh)

Fitting the model: sdmTMB()

Set up is similar to glmmTMB(). Common arguments:

fit <- sdmTMB(
  formula,
  data,
  mesh,
  time = NULL,
  family = gaussian(link = "identity"),
  spatial = c("on", "off"),
  spatiotemporal = c("iid", "ar1", "rw", "off"),
  silent = TRUE,
  ...
)

See ?sdmTMB

Formula interface

sdmTMB uses a similar formula interface to widely used R packages

A formula is used to specify fixed effects and (optionally) random intercepts

# linear effect of x1:
formula = y ~ x1

# add smoother effect of x2:
formula = y ~ x1 + s(x2)

# add random intercept by group g:
formula = y ~ x1 + s(x2) + (1 | g)

Smoothers (as in mgcv)

# smoother effect of x:
formula = y ~ s(x)

# basis dimension of 5:
formula = y ~ s(x, k = 5)

# bivariate smoother effect of x & y:
formula = y ~ s(x, y)

# smoother effect of x1 varying by x2:
formula = y ~ s(x1, by = x2)

# other kinds of mgcv smoothers:
formula = ~ s(month, bs = "cc", k = 12)

Smoothers are penalized (‘p-splines’), i.e. data determine ‘wiggliness’

Other common R formula options

Polynomials and omitting the intercept:

# transformations using `I()` notation
y ~ depth + I(depth^2)

# polynomial functions using `poly`
y ~ poly(depth, degree = 2)

# omit intercept
y ~ -1 + as.factor(year)
y ~ 0 + as.factor(year)

Breakpoint functions for threshold analyses

cpue ~ breakpt(temperature)

Logistic functions for threshold analyses

cpue ~ logistic(temperature)

Syntax: families and links

Many of the same families used in glm(), glmmTMB(), mgcv::gam() can be used here

Includes: gaussian(), Gamma(), binomial(), poisson(), Beta(), student(), tweedie(), nbinom1(), nbinom2(), truncated_nbinom1(), truncated_nbinom2(), delta_gamma(), delta_lognormal(), delta_beta(), and more…

All have link arguments

See ?sdmTMB::Families

An aside on the Tweedie

Useful for positive continuous data with zeros (e.g., biomass density per unit effort)

Dispersion ( \(\phi\) ) and power ( \(p\) ) parameters allow for a wide variety of shapes including many zeros

Also known as compound Poisson-Gamma distribution

An aside on the student-t

Useful for continuous data with extreme values.

Can also be used to deal with positive data when using log link

df parameters controls for the amount of tail in the distribution i.e. degree of “extreme” values

P.S: beware when df << 3. All noise is assimilated in the “observation error”

An aside on delta models

Delta/hurdle model has 2 sub-models:
presence/absence (binomial, logit link)
positive model (link varies by family)
family argument to sdmTMB can be a list()
for convenience, many delta- families implemented: delta_gamma, delta_lognormal, delta_truncated_nbinom2 etc

An aside on mixture models

Positive components may be modeled as a mixture of 2 distributions

Finite mixture model (2 components)
Also referred to as “ECE” (extreme catch event) model, Thorson et al. (2012)
Mechanisms: shoaling, etc.
See gamma_mix() and lognormal_mix()
But requires data

Spatial vs. spatiotemporal fields - notion

A spatial field can be thought of as a spatial intercept
- a wiggly spatial process that is constant in time. e.g. areas that has on average a higher/lower animal density, constant “hot/cold-spot”
Spatiotemporal variation represents separate fields estimated for each time slice (possibly correlated)
- wiggly spatial processes that change through time. e.g. inter-annual variability in “hot/cold-spot” locations

Spatial fields can be turned on/off

By default sdmTMB() estimates a spatial field

fit <- sdmTMB(
  y ~ x,
  family = gaussian(),
  data = dat,
  mesh = mesh,
  spatial = "on", #<<
  ...
)

Why not estimate a spatial field?

If shared process across time slices isn’t of interest
If magnitude of spatiotemporal variability >> spatial variation
If confounded with other parameters (more later)

Spatiotemporal fields can be turned on/off

By default sdmTMB() estimates a spatiotemporal field if the time argument is specified

fit <- sdmTMB(
  y ~ x,
  family = gaussian(),
  data = dat,
  mesh = mesh,
  time = "year", # column in `data` #<<
  spatiotemporal = "iid", #<<
  ...
)

Types of spatiotemporal fields

None (spatiotemporal = "off")
Independent (spatiotemporal = "iid")
Random walk (spatiotemporal = "rw")
Autoregressive (spatiotemporal = "ar1")

Independent (IID) spatiotemporal fields

Useful if pattern changes much between years

AR1 spatiotemporal fields

Useful if pattern are related between years.
P.S: Random walk = AR1 with 1.0 correlation

Spatiotemporal fields

Why include spatiotemporal fields?
- If the data are collected in both space and time and there are ‘latent’ spatial processes that vary through time
- E.g., effect of water temperature on abundance if temperature wasn’t in the model
- Represents all the missing variables that vary through time
- Why would a field be IID vs RW/AR1?
- Do we expect hotspots to be independent with each time slice or adapt slowly over time?

After model fitting

Inspecting, summarizing, predicting, etc.

Covered in examples in next slides.

predict(): ?predict.sdmTMB
residuals(): ?residuals.sdmTMB
tidy(): ?tidy.sdmTMB
print()
sanity()
get_index()
...

Examples of workflow / other functions

data("pcod")
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 20)
pcod$fyear <- as.factor(pcod$year)

fit <- sdmTMB(
  density ~ fyear,
  time_varying = NULL,
  data = pcod, 
  mesh = mesh,
  time = "year",
  spatiotemporal = "iid",
  family = tweedie(link = "log")
)

Did the model converge?

sanity(fit)

Generating an index of abundance

First predict to some regular grid

nd <- replicate_df(qcs_grid, "year", unique(pcod$year))
nd$fyear <- as.factor(nd$year)
predictions <- predict(fit, 
               newdata = nd, 
               return_tmb_object = TRUE)

Call `get_index` to derive annual estimates

indx <- get_index(predictions)

head(indx)

Motivating questions

Reality

Modeling spatial autocorrelation

Spatial smoothing with GAMs (mgcv)

Spatial smoothing with SPDE (INLA, inlabru)

Matérn covariance

Predictive process models

Predictive process models

Spatial data types

Why is space important?

What is a random field?

Random field

Random field

Many ways to simulate random fields

Effects of changing variance and range

Effects of adding noise

Moderate observation errors

Small observation errors

Estimating random fields

INLA and the SPDE approach

Introducing meshes

Mesh construction

Building your own mesh

Simplifying mesh construction in sdmTMB

Example: cutoff = 50km

Example: cutoff = 25km

Example: cutoff = 10km

sdmTMB highlights

Installing sdmTMB

Installing sdmTMB

General sdmTMB workflow

General sdmTMB workflow

Preparing the data

Preparing the “space”: the use of UTMs

Example of adding UTM columns

Constructing a mesh

Constructing a mesh

Fitting the model: sdmTMB()

Formula interface

Smoothers (as in mgcv)

Other common R formula options

Breakpoint functions for threshold analyses

Logistic functions for threshold analyses

Syntax: families and links

An aside on the Tweedie

An aside on the student-t

An aside on delta models

An aside on mixture models

Spatial vs. spatiotemporal fields - notion

Spatial fields can be turned on/off

Why not estimate a spatial field?

Spatiotemporal fields can be turned on/off

Types of spatiotemporal fields

Independent (IID) spatiotemporal fields

AR1 spatiotemporal fields

Spatiotemporal fields

After model fitting

Examples of workflow / other functions

Did the model converge?

Generating an index of abundance

Call get_index to derive annual estimates

Call `get_index` to derive annual estimates