Gaussian process models
Neural network models
Empirical dynamic modeling
26 Feb 2019
Last week, we discussed exponential smoothing and in lab touched on GAMs
GAMs estimate the trend using a smooth function,
\[E\left[ Y \right] ={ B }_{ 0 }+f(x)\]
where like regression, we assume \(Y\sim Normal\left( E\left[ Y \right] ,\sigma \right)\)
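For reference, a minimal GAM of this form can be fit with the mgcv package; the data frame and variable names below are hypothetical placeholders, not from the lab:
library(mgcv)
df = data.frame(year = 1950:2000, y = rnorm(51))   # placeholder data
fit = gam(y ~ s(year), data = df)                  # E[Y] = B0 + f(year)
summary(fit)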
For data that are regularly spaced in time, knot placement probably isn't a big deal
Let's try again, this time knocking a few holes in the data. Removing years 1950:1953 and 1955:1959, the knot locations are no longer equally spaced, and weighted more toward the locations of data points.
Recapping, GAMs are estimating the underlying trend using a smooth function,
\[E\left[ Y \right] ={ B }_{ 0 }+f(x)\]
We're going to leave GAMs alone for now, but there are lots of great references out there. Examples:
Similarities between GAMs and GP models:
Differences:
We have some function we want to approximate
We could estimate the latent values at all observed locations
What are the downsides to this?
Instead, consider estimating them at a subset of points and extrapolating (aka Kriging)
Lots of applications in Fisheries and Ecology
Especially with applications to spatial models
Several options for estimating f(x) at knot locations
Gaussian Process models use the covariance function, \(\Sigma\)
We could estimate elements of \(\Sigma\) as an unconstrained matrix (e.g. 'unconstrained' in MARSS)
We could try to zero out some elements of \(\Sigma\)
Instead, we'll use a covariance function (aka kernel). Common choices are
For example with the exponential function,
\[{ \Sigma}_{i,j}={\sigma}^{2}exp\left( -{ d }_{ i,j }/\tau \right)\]
\({\sigma}^{2}\) is the variance parameter (estimated)
\({ d }_{ i,j }\) is the distance between points, e.g. \(|{x}_{i}-{x}_{j}|\)
distance could be distance in time, space, etc
\(\tau\) is a scaling parameter (estimated)
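As a rough sketch of how that covariance matrix could be built in R (the locations and parameter values below are made up for illustration):
x = 1:10                           # locations (e.g. time points)
sigma2 = 1                         # variance parameter, sigma^2
tau = 2                            # scaling parameter
d = as.matrix(dist(x))             # pairwise distances |x_i - x_j|
Sigma = sigma2 * exp(-d / tau)     # exponential covariance function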
Question:
For our exponential function, how do \(\sigma\) and \(\tau\) control 'wiggliness'?
Larger values of \(\sigma\) introduce more variability between \(f(x)\) at knot locations
Larger values of \(\tau\) will make the \(exp\left( -{ d }_{ i,j }/\tau \right)\) term closer to 1, so points remain correlated over longer distances and \(f(x)\) is smoother
Revisiting univariate state space models, what are some reasons the AR process is used?
\[x_t = x_{t-1}+w_t, \,\,\, w_t \sim N(0,q)\]
\[y_t = x_t + v_t, \,\,\, v_t \sim N(0,r)\]
Mechanism may be AR or RW, BUT also:
The AR process is just one flavor of constraining estimation
Convenience / estimation of \(q\) and \(r\)
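For intuition, here's a minimal simulation of this model in R; the values of \(q\) and \(r\) below are made up for illustration:
set.seed(123)
n = 50
q = 0.1; r = 0.2                 # assumed process and observation variances
w = rnorm(n, 0, sqrt(q))         # w_t ~ N(0, q)
x = cumsum(w)                    # x_t = x_{t-1} + w_t (random walk)
y = x + rnorm(n, 0, sqrt(r))     # y_t = x_t + v_t, v_t ~ N(0, r)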
Any of the univariate SS or multivariate models (DFA, MARSS) can be modified by swapping out an AR latent process for a GP one!
Example: Gaussian process DFA
Using a GP-DFA estimation model, we can see our ability to recover the process improve from 4 to 10 to 25 knots. 4 knots:
Using a GP-DFA estimation model, we can see our ability to recover the process improve from 4 to 10 to 25 knots. 10 knots:
Using a GP-DFA estimation model, we can see our ability to recover the process improve from 4 to 10 to 25 knots. 25 knots:
Neural networks are widely used in lots of fields, but not widely used in fisheries; a few examples:
Ward et al. 2014 link
Coro et al. 2016 link
Some NNet jargon:
Inputs are predictors (including lagged data)
The hidden layer contains the latent variables / process
Neurons control the dimensionality of the hidden layer (a collection of hidden neurons = a hidden layer)
Output is the predictions, validated against observable data
Neural networks offer an advantage over many approaches we've seen in that they're non-linear
Example:
\({ X }_{ 1 }\), \({ X }_{ 2 }\), \({ X }_{ 3 }\)
Just like regression, the neuron estimates coefficients (aka weights) for each of the predictors.
\(E[Y] = f({ b }_{ 0 } + { b }_{ 1 }*{ X }_{ 1 } + { b }_{ 2 }*{ X }_{ 2 } + { b }_{ 3 }*{ X }_{ 3 })\)
Note: \({ b }_{ 0 }\) is sometimes called the bias – but is similar to intercept in regression
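To make that concrete, here's what a single neuron's computation might look like in R, assuming a logistic activation function (the inputs and weights below are made-up numbers):
X = c(1.2, -0.5, 0.3)                  # X1, X2, X3 (assumed inputs)
b = c(0.1, 0.8, -0.4, 0.2)             # b0 (bias) and b1, b2, b3 (assumed weights)
f = function(z) 1 / (1 + exp(-z))      # logistic activation; one common choice
EY = f(b[1] + sum(b[2:4] * X))         # E[Y] = f(b0 + b1*X1 + b2*X2 + b3*X3)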
Implementation in R
We'll talk about examples in 2 packages
First the forecast package – function nnetar
This package implements NNet models with autoregression, where the autoregression is defined by lagged values of the response time series Y
Rob Hyndman has some great tutorials / vignettes for more in-depth info. More on nnetar is available here
We'll apply this to daily flow data from the Cedar River
library(waterData)
library(lubridate)
library(ggplot2)
dat = importDVs(staid = "12119000")                   # daily flow values, Cedar River gauge
dat = dat[which(year(date(dat$dates)) == 2018), ]     # keep only 2018
ggplot(dat, aes(dates, val)) + geom_point() + ylab("Flow")
Using the 'nnetar' function, there are several important arguments to consider
mod = nnetar(y=dat$val, p=..., size=...)
p represents the embedding dimension or number of lags to include
size represents the dimension of the hidden layer (# neurons)
Each of these has defaults, but we'll do a couple of sensitivity analyses
First, let's look at varying the number of lagged predictors
mod_1 = nnetar(y=dat$val, p=1, size=1)
mod_5 = nnetar(y=dat$val, p=5, size=1)
mod_15 = nnetar(y=dat$val, p=15, size=1)
Even with embedding dimension = 1, predictions are pretty good
Only very slight differences here – slight ones in Feb/March for example
Ok, now a sensitivity to the size of the hidden layer
mod_1 = nnetar(y=dat$val, p=1, size=1)
mod_5 = nnetar(y=dat$val, p=1, size=5)
mod_15 = nnetar(y=dat$val, p=1, size=15)
Again, the fit with 1 neuron looks pretty good
And there only appear to be slight differences as we add more neurons
Selecting the size of the network and number of lags (embedding dimension) can be tricky. Many estimation routines will do this for you.
For our flow data, for example, we can leave p and size unspecified
mod = nnetar(y=dat$val)
Output here is reported as NNAR(p,k), with p the embedding dimension and k the number of hidden nodes
mod
## Series: dat$val
## Model: NNAR(4,2)
## Call: nnetar(y = dat$val)
##
## Average of 20 networks, each of which is
## a 4-2-1 network with 13 weights
## options were - linear output units
##
## sigma^2 estimated as 8136
Models are trained on 1-step ahead forecasts
Weights are randomized from lots of starting values and forecasts averaged
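The number of random starts averaged over is controlled by the repeats argument (20 is the default, matching the output above); for example:
mod = nnetar(y = dat$val, repeats = 50)   # average 50 randomly initialized networks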
Point forecasts can be used from the fitted object as before,
f = forecast(mod, h = 10)
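The point forecasts are stored in f$mean, and the forecast object can be plotted directly:
f$mean        # point forecasts for the next 10 time steps
autoplot(f)   # plot the series with the forecast appended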
Alternative estimation routines also exist in 'tsDyn' package
nnetTs(x, m, d = 1, steps = d, size)
Just like 'nnetar', m is the embedding dimension (number of lags) and size is the dimension of the hidden layer
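A hedged sketch of what a comparable fit to the flow data might look like (the lag and hidden-layer values here are arbitrary choices, not recommendations):
library(tsDyn)
mod_tsdyn = nnetTs(x = dat$val, m = 4, size = 2)   # 4 lags, 2 hidden neurons (assumed values)
predict(mod_tsdyn, n.ahead = 10)                   # 10-step-ahead forecasts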
Simplex link
S-Map link
Convergent cross-mapping link
Hao Ye's Vignette link
Yair Daon's Vignette link
Owen Petchey's Vignette link
These tools generally represent nearest neighbor forecasting (projecting) routines
Like NNets, there is a lag (embedding dimension) that needs to be chosen
Also need to specify the number of nearest neighbors (default Simplex = E+1)
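If I recall the older rEDM interface correctly, this is set through the num_neighbors argument of simplex(); treat the argument name below as an assumption (the built-in lynx series is used just as a runnable example):
mod_nn4 = rEDM::simplex(as.numeric(lynx), E = 3, num_neighbors = 4)   # default is E+1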
First, the embedding dimension. We'll start with a lag / embedding dimension of \(E\) = 5
Or we could use a value of \(E\) = 3
There's some optimal embedding dimension we can select
predictions are likely affected strongly by recent dynamics
it is less likely that conditions in the distant past are also useful at making projections
as a result, predictability may increase slightly with greater values of \(E\) and then eventually decline
Internally, forecasts will be made based on the library of predictors
Examples: let's start with the classic 'lynx' dataset
mod = rEDM::simplex(as.numeric(lynx), E=1:10)
Predictability increases a lot when E=2, but is pretty flat after that
ggplot(mod, aes(E,rho)) + geom_line()
Similar patterns with RMSE
ggplot(mod, aes(E,rmse)) + geom_line()
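Because simplex() returns a data frame with one row per value of E, picking the best embedding dimension is a one-liner:
best_E = mod$E[which.max(mod$rho)]   # embedding dimension with the highest forecast skill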
We can also pull out predictions (off by default) with the 'stats_only' argument,
mod = rEDM::simplex(as.numeric(lynx), E=1:10, stats_only=FALSE)
ggplot(mod[[2]]$model_output, aes(obs, pred)) + geom_point()
We can also play with out-of-sample forecasting by specifying the data to be used in the library ('lib') and the data to be used for prediction ('pred'). For example, to forecast the last 14 data points of the lynx series, we could use
mod = rEDM::simplex(as.numeric(lynx), E=1:10, stats_only=FALSE, lib=c(1,100), pred=c(101,114))
As a second example, let's fit this to the water data from the Cedar River.
mod = rEDM::simplex(dat$val, E=1:10)
ggplot(mod, aes(E,rho)) + geom_line()
For this application, it's also interesting to compare the Simplex fits against the neural network time series model. Here, the 'forecast skill' (rho) is 0.9817 for the best model (E=3).
Fitting the nnet model yields a slightly higher correlation (0.988)
mod_nn = nnetar(y=dat$val)
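One way to compute a comparable correlation from the fitted object (a guess at the calculation, not necessarily how the 0.988 above was produced):
cor(fitted(mod_nn), dat$val, use = "complete.obs")   # one-step-ahead fits vs. observations; the first few fits are NA because of the lags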