Gaussian process models
Neural network models
Empirical dynamic modeling
23 Feb 2021
Last week, we discussed exponential smoothing models; we'll start this week with a quick recap of GAMs
GAMs estimate the trend using a smooth function,
\[E\left[ Y \right] ={ B }_{ 0 }+f(x)\] where, as in regression, we assume \(Y\sim Normal\left( E\left[ Y \right] ,\sigma \right)\)
The smooth function \(f(x)\) is generated by some underlying series of functions
Basis splines (“B-splines”) are a common choice
The bs() function in the splines package lets us create the basis matrix \(\textbf{B}\)
In addition to the degree, we specify where to evaluate these functions (at locations 1:100 here)
The \(\textbf{B}\) matrix is generated as a first step, and the model becomes
\(E[Y] = b_{0} \cdot \ \textbf{x} + \textbf{B}\cdot \textbf{b}\)
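As a minimal sketch of these two steps in R (the response y below is simulated purely for illustration):
library(splines)
x = 1:100                              # locations to evaluate the basis functions
B = bs(x, degree = 3, df = 10)         # cubic B-spline basis matrix (100 x 10)
y = sin(x / 10) + rnorm(100, 0, 0.2)   # made-up response, for illustration only
fit = lm(y ~ x + B)                    # coefficients correspond to b0 and the vector b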
Spacing knots: evenly distributed?
For data that are regularly spaced in time, this probably isn’t a big deal
Cubic spline (default) on the ‘airmiles’ dataset (n = 23), a function estimated at 10 equally spaced locations (grey vertical lines).
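A rough sketch of a figure like this one, assuming mgcv::gam() with a cubic regression spline and 10 knots (the slide's exact function and settings are an assumption):
library(mgcv)
df = data.frame(t = seq_along(airmiles), miles = as.numeric(airmiles))
fit = gam(miles ~ s(t, k = 10, bs = "cr"), data = df)   # cubic regression spline, 10 knots
plot(df$t, df$miles, xlab = "Time index", ylab = "Air miles")
lines(df$t, predict(fit))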
Recapping, GAMs estimate the underlying trend using a smooth function,
\[E\left[ Y \right] ={ B }_{ 0 }+f(x)\]
We’re going to leave GAMs alone for now, but there are lots of great references out there
Similarities between GAMs and GP models:
Differences:
We have some function we want to approximate
We could use GP to estimate the latent values at all observed locations
What are the downsides to this?
Instead, consider estimating them at a subset of points and extrapolating (aka Kriging)
Lots of applications in Fisheries and Ecology
Especially with applications to spatial models
Several options for estimating f(x) at knot locations
Gaussian Process models use the covariance function, \(\Sigma\)
We could estimate the elements of \(\Sigma\) as an unconstrained matrix (e.g. ‘unconstrained’ in MARSS)
We could try to zero out some elements of \(\Sigma\)
Instead, we’ll use a covariance function (aka kernel). Common choices include the exponential and Gaussian (squared exponential) functions
For example with the exponential function,
\[{ \Sigma}_{i,j}={\sigma}^{2}exp\left( -{ d }_{ i,j }/\tau \right)\]
\({\sigma}^{2}\) is the variance parameter (estimated)
\({ d }_{ i,j }\) is the distance between points, e.g. \(|{x}_{i}-{x}_{j}|\)
distance could be distance in time, space, etc
\(\tau\) is a scaling parameter (estimated)
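As a concrete sketch, here's how \(\Sigma\) could be built in R with the exponential covariance function (the knot locations and parameter values below are arbitrary):
x_knots = seq(1, 100, length.out = 10)   # 10 knot locations in time
d = as.matrix(dist(x_knots))             # pairwise distances d_ij
sigma = 1                                # sigma^2 is the variance parameter (would be estimated)
tau = 20                                 # scaling parameter (would be estimated)
Sigma = sigma^2 * exp(-d / tau)          # Sigma_ij = sigma^2 * exp(-d_ij / tau)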
Question:
For our exponential function, how do \(\sigma\) and \(\tau\) control ‘wiggliness’?
Larger values of \(\sigma\) introduce more variability in \(f(x)\) at the knot locations
Larger values of \(\tau\) will make the ‘exp(…)’ term closer to 1, so values of \(f(x)\) are more strongly correlated and the function is smoother
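A quick simulation makes this concrete; a minimal sketch (arbitrary parameter values, using MASS::mvrnorm to draw realizations of \(f(x)\)):
library(MASS)
set.seed(123)
x = 1:50
d = as.matrix(dist(x))
sim_f = function(sigma, tau) {
  mvrnorm(1, mu = rep(0, length(x)), Sigma = sigma^2 * exp(-d / tau))
}
plot(x, sim_f(sigma = 1, tau = 1), type = "l")      # small tau: rapidly varying ('wiggly')
lines(x, sim_f(sigma = 1, tau = 20), col = "red")   # large tau: smoother, stronger correlation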
Revisiting univariate state space models, what are some reasons the AR process is used?
\[x_t = x_{t-1}+w_t, \,\,\, w_t \sim N(0,q)\]
\[y_t = x_t + v_t, \,\,\, v_t \sim N(0,r)\]
Mechanism may be AR or RW, BUT also
AR process is just one flavor of constraining estimation
Convenience / estimation of \(q\) and \(r\)
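For reference, a minimal sketch simulating this random-walk state-space model (the variances q and r are arbitrary):
set.seed(1)
n = 100
q = 0.1                          # process (state) variance
r = 0.5                          # observation variance
w = rnorm(n, 0, sqrt(q))
x = cumsum(w)                    # x_t = x_{t-1} + w_t
y = x + rnorm(n, 0, sqrt(r))     # y_t = x_t + v_t
plot(y); lines(x)                # observations with the latent state overlaid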
Any of the univariate SS or multivariate models (DFA, MARSS) can be modified by swapping out an AR latent process for a GP one!
Example: Gaussian process DFA
Using a GP-DFA estimation model, we can see our ability to recover the process improve from 4 to 10 to 25 knots. 4 knots:
Using a GP-DFA estimation model, we can see our ability to recover the process improve from 4 to 10 to 25 knots. 10 knots:
Using a GP-DFA estimation model, we can see our ability to recover the process improve from 4 to 10 to 25 knots. 25 knots:
Neural networks are widely used in lots of fields, and are becoming more widely used in fisheries / ecology:
Ward et al. 2014 link
Coro et al. 2016 link
Joseph et al. 2020 link
Some NNet jargon:
Inputs are predictors (including lagged data)
The hidden layer contains the latent variables / process
Neurons control dimensionality of hidden layer (a collection of hidden neurons = hidden layer)
The output is the predictions, validated against observable data
Neural networks offer an advantage over many approaches we’ve seen in that they’re non-linear
Example: suppose we have 3 predictors, \({ X }_{ 1 }\), \({ X }_{ 2 }\), \({ X }_{ 3 }\)
Just like regression, the neuron estimates coefficients (aka weights) for each of the predictors.
\(E[Y] = f({ b }_{ 0 } + { b }_{ 1 }*{ X }_{ 1 } + { b }_{ 2 }*{ X }_{ 2 } + { b }_{ 3 }*{ X }_{ 3 })\)
Note: \({ b }_{ 0 }\) is sometimes called the bias, but it is similar to the intercept in regression
Estimate coefficients between input and hidden layer \(a_{1} = f(\theta_{1,1}x_{1} + \theta_{1,2}x_{2} + \theta_{1,3}x_{3})\)
\(a_{2} = f(\theta_{2,1}x_{1} + \theta_{2,2}x_{2} + \theta_{2,3}x_{3})\)
\(a_{3} = f(\theta_{3,1}x_{1} + \theta_{3,2}x_{2} + \theta_{3,3}x_{3})\)
f() is logistic/sigmoid function
And again between hidden layer and output \(g(x) = f(b_{1}a_{1} + b_{2}a_{2} + b_{3}a_{3})\)
f() is logistic/sigmoid function
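A tiny numeric sketch of this forward pass (all inputs and weights below are made up, and bias terms are omitted for brevity):
sigmoid = function(z) 1 / (1 + exp(-z))        # f(), the logistic/sigmoid function
x = c(0.5, -1.2, 0.3)                          # inputs X1, X2, X3
Theta = matrix(c( 0.2, -0.4, 0.1,
                  0.7,  0.3, -0.5,
                 -0.1,  0.6,  0.2), nrow = 3, byrow = TRUE)   # input-to-hidden weights theta_{i,j}
a = sigmoid(Theta %*% x)                       # hidden neurons a1, a2, a3
b = c(0.5, -0.3, 0.8)                          # hidden-to-output weights b1, b2, b3
g = sigmoid(sum(b * a))                        # output g(x)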
Implementation in R
We’ll talk about examples in 2 packages
First the forecast package – function nnetar
This package implements NNet models with autoregression, where the autoregressive predictors are lagged values of the response time series \(Y\)
Rob Hyndman has some great tutorials / vignettes for more in-depth info. more on nnetar here
We’ll apply this to daily flow data from the Cedar River
Using the ‘nnetar’ function, there are several important arguments to consider
mod = nnetar(y=dat$val, p=..., size=...)
p represents the embedding dimension or number of lags to include
size represents the dimension of the hidden layer (# neurons)
Each of these has defaults, but we’ll do a couple sensitivities
First, let’s look at varying the number of lagged predictors
mod_1 = nnetar(y=dat$val, p=1, size=1)
mod_5 = nnetar(y=dat$val, p=5, size=1)
mod_15 = nnetar(y=dat$val, p=15, size=1)
Even with 1 node in the hidden layer, predictions are pretty good
Only very slight differences here (\(\rho=0.999\)) – slight ones in Feb/March for example
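One way to quantify this is to correlate the in-sample fits directly; a small sketch using the models above (fitted() returns NAs for the first p observations, hence ‘complete.obs’):
cor(fitted(mod_1), fitted(mod_15), use = "complete.obs")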
Ok, now a sensitivity to the size of the hidden layer
mod_1 = nnetar(y=dat$val, p=1, size=1)
mod_5 = nnetar(y=dat$val, p=1, size=5)
mod_15 = nnetar(y=dat$val, p=1, size=15)
Again, the fit with 1 neuron looks pretty good
And there only appear to be slight differences as we add more neurons
Selecting the size of the network and number of lags (embedding dimension) can be tricky. Many estimation routines will do this for you.
For our flow data for example, we can choose to not specify p or size
mod = nnetar(y=dat$val)
Output here is reported as NNAR(p,k), with p equal to the embedding dimension and k the number of hidden nodes
mod
## Series: dat$val 
## Model:  NNAR(4,2) 
## Call:   nnetar(y = dat$val)
## 
## Average of 20 networks, each of which is
## a 4-2-1 network with 13 weights
## options were - linear output units 
## 
## sigma^2 estimated as 8136
Models are trained on 1-step ahead forecasts
The network is re-fit from lots of random starting weights, and the forecasts are averaged
Point forecasts can be generated from the fitted object as before,
f = forecast(mod, h = 10)
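By default these are point forecasts only; prediction intervals for nnetar models can be simulated with the PI argument (slower), e.g.
f = forecast(mod, h = 10, PI = TRUE)   # simulate prediction intervals
autoplot(f)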
Alternative estimation routines also exist in ‘tsDyn’ package
nnetTs(x, m, d = 1, steps = d, size)
Just like ‘nnetar’, we specify the embedding dimension (m, the number of lags) and the size of the hidden layer (size)
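As a sketch of an analogous fit (the argument values below mirror the NNAR(4,2) model chosen by nnetar above; they are an assumption, not the slide's code):
library(tsDyn)
mod_ts = nnetTs(dat$val, m = 4, size = 2)   # 4 lags, 2 hidden neurons
mod_ts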
Simplex link
S-Map link
Convergent cross-mapping link
Hao Ye’s Vignette link
Yair Daon’s Vignette link
Owen Petchey’s Vignette link
Chang et al. 2017 link
Kuriyama et al. 2020 link
These tools generally represent nearest neighbor forecasting (projecting) routines
Like NNets, there is a lag (embedding dimension) that needs to be chosen
Also need to specify the number of nearest neighbors (default Simplex = E+1)
For example, we could use a value of \(E\) = 3
There’s some optimal embedding dimension we can select
predictions are likely affected strongly by recent dynamics
it is less likely that conditions in the distant past are also useful at making projections
as a result, predictability may increase slightly with greater values of \(E\) and then eventually decline
Internally, forecasts will be made based on the library of predictors
http://deepeco.ucsd.edu/simplex/
Examples: let’s start with the classic ‘lynx’ dataset
mod = rEDM::simplex(as.numeric(lynx), E=1:10)
Predictability increases a lot when E=2, but is pretty flat after that
Similar patterns with RMSE
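A small sketch to visualize both (assuming the legacy rEDM interface used above, which returns a data frame with columns E, rho, and rmse):
par(mfrow = c(1, 2))
plot(mod$E, mod$rho, type = "b", xlab = "Embedding dimension (E)", ylab = "rho")
plot(mod$E, mod$rmse, type = "b", xlab = "Embedding dimension (E)", ylab = "RMSE")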
We can also pull out predictions (off by default) with the ‘stats_only’ argument,
mod = rEDM::simplex(as.numeric(lynx), E=1:10, stats_only=FALSE)
We can also play with out-of-sample forecasting by specifying the data to be used in the library (‘lib’) and the data to be used for prediction (‘pred’). For example, to forecast the last 14 data points of the lynx series, we could use
mod = rEDM::simplex(as.numeric(lynx), E=1:10, stats_only=FALSE, lib=c(1,100), pred=c(101,114))
As a second example, let’s fit this to the water data from the Cedar River.
mod = rEDM::simplex(dat$val, E=1:10)
For this application, it’s also interesting to compare the Simplex fits against the neural network time series model. Here, the ‘forecast skill’ (rho) is 0.9817 for the best model (E=3).
Fitting the nnet model yields a slightly higher correlation (0.988)
mod_nn = nnetar(y=dat$val)
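A correlation like the one reported above can be computed from the 1-step ahead fitted values; a minimal sketch (how the slide's value was calculated is an assumption):
cor(fitted(mod_nn), dat$val, use = "complete.obs")   # in-sample 1-step ahead skill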
Beyond Simplex: in the interest of time, we haven’t talked about SMAP or Cross Mapping
S-map (rEDM::s_map) is similar to Simplex, but also estimates a non-linear parameter \(\theta\)
Cross mapping (rEDM::ccm) models causality in multiple time series, using information in lags
Autocorrelated red-noise may appear predictable
Distinguish red-noise from deterministic model with S-maps
S-maps fit local linear ‘maps’ and add a non-linear parameter \(\theta\)
rEDM Tutorial
mod = rEDM::s_map(as.numeric(lynx), E=2, stats_only=FALSE)
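A small sketch of sweeping over \(\theta\) (this assumes the legacy rEDM s_map interface, where theta can be a vector and rho is returned for each value). If skill improves for \(\theta > 0\) relative to \(\theta = 0\) (the linear case), that is evidence of non-linear dynamics rather than red noise:
mod_smap = rEDM::s_map(as.numeric(lynx), E = 2, theta = c(0, 0.1, 0.5, 1, 2, 5))
plot(mod_smap$theta, mod_smap$rho, type = "b", xlab = "theta (non-linearity)", ylab = "rho")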