What we’ve learned so far

  • Time series can be useful for identifying structure, improving precision, and accuracy of forecasts

  • Modeling multivariate time series

    • e.g. MARSS() function, with each observed time series mapped to a single discrete state
  • Using DFA

    • Structure determined by factor loadings

Response generally the same variable (not separate species)

  • Making inference about population as a whole involves jointly modeling both time series

Lots of time series -> complicated covariance matrices

  • For unconstrained covariance matrices, the number of covariance matrix parameters = \(m*(m+1)/2\)

    • problematic for m > 5 or so
  • MARSS solutions: diagonal matrices or ‘equalvarcov’

  • DFA runs into same issues with estimating unconstrained \(R\) matrix

Potential problems with both the MARSS and DFA approach

  • Sites separated by large distances may be grouped together

  • Sites close to one another may be found to have very different dynamics

Switching gears: spatial models

  • Data often have spatial attributes

  • Ideal world:

    • Plug spatial covariates into a GLM / GLMM
    • Residuals are uncorrelated

Reality

  • Residual spatial autocorrelation

Modeling spatial autocorrelation

  • Need ‘wiggly’/smooth surface for approximating all spatial variables missing from model (‘latent’ variables)

  • Several approaches exist

    • 2D smooths in mgcv
    • Random fields and the Stochastic Partial Differential Equation (SPDE)
    • Nearest neighbor methods

Spatial smoothing with GAMs (mgcv)

gam(y ~ s(lon, lat) + s(time), ...)

gam(y ~ s(lon, lat) + s(time) + 
      ti(lon, lat, time, d=c(2,1)), ...)

gam(y ~ s(lon, lat, by = "time") + 
      s(time), ...)

Spatial smoothing with SPDE (INLA, inlabru)

  • SPDE differs in that it explicitly estimates meaningful parameters for spatial covariance function

  • SPDE and GAMs can produce the identical results

  • Miller, D.L., Glennie, R. & Seaton, A.E. Understanding the Stochastic Partial Differential Equation Approach to Smoothing. JABES 25, 1–16 (2020)

Matérn covariance

Flexible, can be exponential or Gaussian shape

Predictive process models

  • Estimate spatial field as random effects

  • High dimensional datasets computationally challenging

  • Gaussian process predictive process models:

    • Estimate values at a subset of locations in the time series
    • ‘knots’, ‘vertices’, or ‘control points’
    • Use covariance function to interpolate from knots to locations of observations

Predictive process models

  • More knots (vertical dashed lines) = more wiggliness & parameters to estimate

Spatial data types

  • Lattice: gridded data, e.g. interpolated SST from satellite observations

  • Areal: data collected in neighboring spatial areas, e.g. commercial catch records by state / county

  • Georeferenced: data where observations are associated with latitude and longitude
    • Locations may be unique or repeated (stations)

Why is space important?

  • Data covary spatially (data that are closer are more similar)

  • Relationship between distance and covariance can be described with a spatial covariance function

  • Covariance function in 2D may be

    • isotropic (same covariance in each direction)
    • anisotropic (different in each direction)

What is a random field?

Random field

  • A 2 dimensional “Gaussian Process”

  • A realization from a multivariate normal distribution with some covariance function

Random field

  • A way of estimating a wiggly surface to account for spatial and/or spatiotemporal correlation in data.

  • Alternatively, a way of estimating a wiggly surface to account for “latent” or unobserved variables.

  • As a bonus, it provides useful covariance parameter estimates: spatial variance and the distance at data points are effectively uncorrelated (“range”)

Many ways to simulate random fields

  • RandomFields::RFsimulate() simulates univariate / multivariate fields
  • fields::sim.rf() simulates random fields on a grid
  • geoR::grf() simulates random fields with irregular observations
  • glmmfields::sim_glmmfields() simulates random fields with/without extreme values
  • sdmTMB::sdmTMB_simulate() simulates univariate fields with sdmTMB

??? Homework: try to work through some of these yourself. Make some plots, and see how changing the covariance affects the smoothness of these fields.

Effects of changing variance and range

Effects of adding noise

  • Large observation error looks like noise

  • \(\sigma_{obs}\) >> \(\sigma_{O}\), \(\sigma_{E}\)

Moderate observation errors

  • \(\sigma_{obs}\) = \(\sigma_{O}\) = \(\sigma_{E}\)

Small observation errors

  • \(\sigma_{obs}\) << \(\sigma_{O}\), \(\sigma_{E}\)

Estimating random fields

  • Georeferenced data often involve 1000s or more points

  • Like in the 1-D setting, we need to approximate the spatial field

    • Options include nearest neighbor methods, covariance tapering, etc.
  • sdmTMB uses an approach from INLA

Introducing meshes

Implementing the SPDE with INLA requires constructing a ‘mesh’

Mesh construction

  • A unique mesh is generally made for each dataset
  • Rules of thumb:
    • More triangles = more computation time
    • More triangles = more fine-scale spatial predictions
    • Borders with coarser resolution reduce number of triangles
    • Use minimum edge size to avoid meshes becoming too fine
    • Want fewer vertices than data points
    • Triangle edge size needs to be smaller than spatial range
  • “How to make a bad mesh?” Haakon Bakka’s book

Building your own mesh

  • INLA::inla.mesh.2d(): lets many arguments be customized

  • INLA::meshbuilder(): Shiny app for constructing a mesh, provides R code

  • Meshes can include barriers / islands / coastlines with shapefiles

  • INLA books https://www.r-inla.org/learnmore/books

Simplifying mesh construction in sdmTMB

sdmTMB has a function make_mesh() to quickly construct a basic mesh

Details in next set of slides

Example: cutoff = 50km

Example: cutoff = 25km

Example: cutoff = 10km

sdmTMB highlights

sdmTMB is a user-friendly R package for modeling spatial processes

  • Familiar syntax to widely used functions/packages; glm(), mgcv, glmmTMB, etc.

  • Performs fast (marginal) maximum likelihood estimation via TMB

  • Widely applicable to species distribution modelling, stock assessment inputs, bycatch models, etc. that include spatially referenced data

General sdmTMB workflow

  1. Prepare data (convert to UTMs, scale covariates, …)

  2. Construct a mesh

  3. Fit the model

  4. Inspect the model (and possibly refit the model)

  5. Predict from the model

  6. Calculate any derived quantities

General sdmTMB workflow

  1. Prepare data: add_utm_columns()

  2. Construct a mesh: make_mesh()

  3. Fit the model: sdmTMB()

  4. Inspect the model: print(), sanity(), tidy(), residuals()

  5. Predict from the model: predict()

  6. Get derived quantities: get_index(), get_cog(), get_eao()

Preparing the data

Prepare the data in wide format (i.e. row = observation)

Make sure there is no NA in the used covariates

NA in the response value is OK (internally checked)

Preparing the “space”: the use of UTMs

Need a projection that preserves constant distance

UTMs work well for regional analyses

Helper function: sdmTMB::add_utm_columns()

Internally, guesses at UTM zone and uses the sf package:
sf::st_as_sf(), sf::st_transform(), and sf::st_coordinates()

Or use the sf or sp packages yourself

Example of adding UTM columns

d <- data.frame(
  lat = c(52.1, 53.4), 
  lon = c(-130.0, -131.4)
)
d <- sdmTMB::add_utm_columns(d, c("lon", "lat"))
## Detected UTM zone 9N; CRS = 32609.
## Visit https://epsg.io/32609 to verify.
d
  • Note default units = "km"
  • Why? Range parameter estimated in units of X and Y
  • Should be not too big or small for estimation

Constructing a mesh

make_mesh() has 2 shortcuts to mesh construction

  1. K-means algorithm: used to cluster points (e.g., n_knots = 100); approach taken in VAST; sensitive to random seed argument!

  2. Cutoff: minimum allowed distance between vertices (e.g., cutoff = 10) - the preferred option

Alternatively, build any INLA mesh and supply it to the mesh argument in make_mesh()

Constructing a mesh

Size of mesh has the single largest impact on fitting speed

cutoff is in units of x and y (minimum triangle size)

d <- data.frame(x = runif(500), y = runif(500))
mesh <- make_mesh(d, xy_cols = c("x", "y"), cutoff = 0.1)
mesh$mesh$n
plot(mesh)

Fitting the model: sdmTMB()

Set up is similar to glmmTMB(). Common arguments:

fit <- sdmTMB(
  formula,
  data,
  mesh,
  time = NULL,
  family = gaussian(link = "identity"),
  spatial = c("on", "off"),
  spatiotemporal = c("iid", "ar1", "rw", "off"),
  silent = TRUE,
  ...
)

See ?sdmTMB

Formula interface

sdmTMB uses a similar formula interface to widely used R packages

A formula is used to specify fixed effects and (optionally) random intercepts

# linear effect of x1:
formula = y ~ x1

# add smoother effect of x2:
formula = y ~ x1 + s(x2)

# add random intercept by group g:
formula = y ~ x1 + s(x2) + (1 | g)

Smoothers (as in mgcv)

# smoother effect of x:
formula = y ~ s(x)

# basis dimension of 5:
formula = y ~ s(x, k = 5)

# bivariate smoother effect of x & y:
formula = y ~ s(x, y)

# smoother effect of x1 varying by x2:
formula = y ~ s(x1, by = x2)

# other kinds of mgcv smoothers:
formula = ~ s(month, bs = "cc", k = 12)

Smoothers are penalized (‘p-splines’), i.e. data determine ‘wiggliness’

Other common R formula options

Polynomials and omitting the intercept:

# transformations using `I()` notation
y ~ depth + I(depth^2)

# polynomial functions using `poly`
y ~ poly(depth, degree = 2)

# omit intercept
y ~ -1 + as.factor(year)
y ~ 0 + as.factor(year)

Syntax: families and links

An aside on the Tweedie

Useful for positive continuous data with zeros (e.g., biomass density per unit effort)

Dispersion ( \(\phi\) ) and power ( \(p\) ) parameters allow for a wide variety of shapes including many zeros

Also known as compound Poisson-Gamma distribution

An aside on the student-t

Useful for continuous data with extreme values.

Can also be used to deal with positive data when using log link

df parameters controls for the amount of tail in the distribution i.e. degree of “extreme” values

P.S: beware when df << 3. All noise is assimilated in the “observation error”

An aside on delta models

  • Delta/hurdle model has 2 sub-models:

  • presence/absence (binomial, logit link)

  • positive model (link varies by family)

  • family argument to sdmTMB can be a list()

  • for convenience, many delta- families implemented: delta_gamma, delta_lognormal, delta_truncated_nbinom2 etc

Spatial vs. spatiotemporal fields - notion

  • A spatial field can be thought of as a spatial intercept

    • a wiggly spatial process that is constant in time. e.g. areas that has on average a higher/lower animal density, constant “hot/cold-spot”
  • Spatiotemporal variation represents separate fields estimated for each time slice (possibly correlated)

    • wiggly spatial processes that change through time. e.g. inter-annual variability in “hot/cold-spot” locations

Spatial fields can be turned on/off

  • By default sdmTMB() estimates a spatial field
fit <- sdmTMB(
  y ~ x,
  family = gaussian(),
  data = dat,
  mesh = mesh,
  spatial = "on", #<<
  ...
)

Extracting spatial variance and range

mesh <- make_mesh(pcod_2011, c("X", "Y"), cutoff = 20)
fit <- sdmTMB(
  density ~ s(depth),
  data = pcod_2011, mesh = mesh,
  family = tweedie(link = "log")
)

Extracting spatial variance and range

print(tidy(fit, "ran_pars"))
## # A tibble: 4 × 5
##   term      estimate std.error conf.low conf.high
##   <chr>        <dbl>     <dbl>    <dbl>     <dbl>
## 1 range        16.8    13.7       3.40      83.3 
## 2 phi          13.7     0.663    12.4       15.0 
## 3 sigma_O       2.20    1.23      0.735      6.59
## 4 tweedie_p     1.58    0.0153    1.55       1.61

Why not estimate a spatial field?

  • If shared process across time slices isn’t of interest

  • If magnitude of spatiotemporal variability >> spatial variation

  • If confounded with other parameters (more later)

Spatiotemporal fields can be turned on/off

  • By default sdmTMB() estimates a spatiotemporal field if the time argument is specified
fit <- sdmTMB(
  y ~ x,
  family = gaussian(),
  data = dat,
  mesh = mesh,
  time = "year", # column in `data` #<<
  spatiotemporal = "iid", #<<
  ...
)

Types of spatiotemporal fields

  • None (spatiotemporal = "off")

  • Independent (spatiotemporal = "iid")

  • Random walk (spatiotemporal = "rw")

  • Autoregressive (spatiotemporal = "ar1")

How spatiotemporal fields relate to MARSS models

  • Random effects estimated at knot locations

  • IID = independent by year (no time series)

  • RW = \[x_{t} = x{t-1} + \epsilon_{t}\]

  • AR = \[x_{t} = b \cdot x{t-1} + \epsilon_{t}\]

Independent (IID) spatiotemporal fields

  • Useful if pattern changes much between years

AR1 spatiotemporal fields

  • Useful if pattern are related between years.
    P.S: Random walk = AR1 with 1.0 correlation

Spatiotemporal fields

  • Why include spatiotemporal fields?

    • If the data are collected in both space and time and there are ‘latent’ spatial processes that vary through time
    • E.g., effect of water temperature on abundance if temperature wasn’t in the model
    • Represents all the missing variables that vary through time
    • Why would a field be IID vs RW/AR1?
    • Do we expect hotspots to be independent with each time slice or adapt slowly over time?

Time - varying coefficients

  • Specify these with the time_varying formula

  • time_varying_type controls structure (“rw”, “ar1”, etc)

  • Covariates shouldn’t be in both time_varying and main formulas

  • Intercept included implicitly, but can be turned off with -1 or 0

Time - varying coefficients

Example: time varying intercept

fit <- sdmTMB(
  y ~ 0, # global intercept off
  time_varying = ~ 1,
  time_varying_type = "rw",
  family = gaussian(),
  time = "year", # column in `data` #<<
  spatiotemporal = "iid", #<<
  ...
)

Time - varying coefficients

Example: time varying intercept and slope

fit <- sdmTMB(
  y ~ 0, # global intercept off
  time_varying = ~ x,
  time_varying_type = "rw",
  family = gaussian(),
  time = "year", # column in `data` #<<
  spatiotemporal = "iid", #<<
  ...
)

Time - varying coefficients

Example: time varying slope

fit <- sdmTMB(
  y ~ 1, # global intercept on
  time_varying = ~ 0 + x,
  time_varying_type = "rw",
  family = gaussian(),
  time = "year", # column in `data` #<<
  spatiotemporal = "iid", #<<
  ...
)

Spatially - varying coefficients

  • Environmental processes may have different impacts across a large region

  • spatial_varying allows separate fields to be estimated for each coefficient

  • Unlike time-varying we do want the effect in both the main and spatial_varying formulas

Spatially - varying coefficients

fit <- sdmTMB(
  y ~ x,
  spatial_varying = ~ 0 + x,
  family = gaussian(),
  time = "year", # column in `data` #<<
  spatiotemporal = "iid", #<<
  ...
)

Spatially - varying coefficients

  • Example via Philina English (DFO) for sdmTMB paper

After model fitting

Inspecting, summarizing, predicting, etc.

Covered in examples in next slides.

  • predict(): ?predict.sdmTMB
  • residuals(): ?residuals.sdmTMB
  • tidy(): ?tidy.sdmTMB
  • print()
  • sanity()
  • get_index()
  • ...

Examples of workflow / other functions

data("pcod")
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 20)
pcod$fyear <- as.factor(pcod$year)

fit <- sdmTMB(
  density ~ fyear,
  time_varying = NULL,
  data = pcod, 
  mesh = mesh,
  time = "year",
  spatiotemporal = "iid",
  family = tweedie(link = "log")
)

Did the model converge?

print(sanity(fit))
## $hessian_ok
## [1] TRUE
## 
## $eigen_values_ok
## [1] TRUE
## 
## $nlminb_ok
## [1] TRUE
## 
## $range_ok
## [1] TRUE
## 
## $gradients_ok
## [1] TRUE
## 
## $se_magnitude_ok
## [1] TRUE
## 
## $se_na_ok
## [1] TRUE
## 
## $sigmas_ok
## [1] TRUE
## 
## $all_ok
## [1] TRUE

Diagnostics / residuals

Generating an index of abundance

  • First predict to some regular grid
nd <- replicate_df(qcs_grid, "year", unique(pcod$year))
nd$fyear <- as.factor(nd$year)
predictions <- predict(fit, 
               newdata = nd, 
               return_tmb_object = TRUE)

Call get_index to derive annual estimates

indx <- get_index(predictions)
print(indx)
##   year       est       lwr      upr  log_est        se  type
## 1 2003 209277.01 151513.00 289063.4 12.25141 0.1647925 index
## 2 2004 316826.48 244648.36 410299.2 12.66611 0.1319067 index
## 3 2005 355482.15 274277.01 460729.7 12.78123 0.1323170 index
## 4 2007  91540.99  66923.25 125214.4 11.42454 0.1598195 index
## 5 2009 150187.87 109089.00 206770.6 11.91964 0.1631269 index
## 6 2011 274230.96 212808.46 353381.7 12.52173 0.1293790 index
## 7 2013 274541.65 204421.77 368713.8 12.52286 0.1504709 index
## 8 2015 326960.45 241466.23 442725.0 12.69759 0.1546506 index
## 9 2017 151502.44 110227.70 208232.5 11.92836 0.1622752 index