The expected value of the response variable is modeled as:
\[ E[Y] = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_k(x_k) \]
where \(\beta_0\) is the intercept and each \(f_j\) is a smooth function of predictor \(x_j\).
Linear Models: Assume a linear relationship between predictors and the response: \[ E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k \]
GAMs: Allow for flexible, non-linear relationships by fitting smooth functions: \[ E[Y] = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_k(x_k) \]
Key Advantage: GAMs can capture complex trends in data that a simple linear relationship cannot adequately model, and they support flexible response families
`mgcv`

- A very powerful R package for fitting GAMs
- Univariate smooth terms are expressed with `s()`
- We’ll cover more complicated smooths later
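As a minimal sketch (simulated data; `s()` uses its default thin plate basis), a univariate smooth is specified like this:

```r
# Minimal sketch: fit a GAM with one univariate smooth term using s()
library(mgcv)

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.2)
dat <- data.frame(x = x, y = y)

fit <- gam(y ~ s(x), data = dat)
summary(fit)  # reports the effective degrees of freedom of s(x)
plot(fit)     # visualize the estimated smooth
```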
To better understand how B-splines work, let’s visualize basis functions and see how they combine to form a smooth function
Basis Functions: Functions that form the building blocks of splines (used to construct the overall smooth function)
B-splines are defined using a set of basis functions, each associated with a knot. The smooth curve is formed by a weighted sum of these basis functions.
B-spline Basis: The basis functions for B-splines have the following properties: each is a piecewise polynomial, and each is non-zero only over a limited range of \(x\) (local support), so each coefficient influences the curve only locally.
Weighted Sum of Basis Functions: The final spline is created by summing these basis functions, each weighted by a coefficient. This allows the spline to adapt to the data and capture non-linear relationships.
\[ f(x) = \sum_{i=1}^{n} w_i \cdot B_i(x) \]
To understand how the weighted sum of basis functions creates a smooth function, let’s visualize how different basis functions combine to form the final curve.
```r
# Example weight vectors from the slides: each choice produces a different curve
weights <- c(0.4, -0.5, 0.3, 0.1, 0.6, -0.2)
weights <- c(0.2, -0.1, 0.3)
weights <- c(0.6, -0.1, 0.01)
```
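A sketch of how such a weight vector combines with a basis matrix to form \(f(x)\), using base R’s `splines::bs()` (the `df = 6` basis matching the six-element weight vector is an illustrative choice):

```r
# Sketch: build a B-spline basis with splines::bs() and form
# f(x) = sum_i w_i * B_i(x) as a weighted sum of the basis columns
library(splines)

x <- seq(0, 1, length.out = 200)
B <- bs(x, df = 6, intercept = TRUE)           # 6 basis functions B_1..B_6
weights <- c(0.4, -0.5, 0.3, 0.1, 0.6, -0.2)   # one coefficient per basis function

f <- as.vector(B %*% weights)                  # the resulting smooth curve

matplot(x, B, type = "l", lty = 1, ylab = "basis / f(x)")
lines(x, f, lwd = 3)                           # overlay the weighted sum
```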
`mgcv`

- Setting `k` controls the number of basis functions, and the number of knots is generally closely related (e.g. `k-2` for `cr`)
- It doesn’t take many basis functions to create a flexible spline
- Flexibility is controlled by the basis dimension `k` and by the smoothing penalty
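A sketch of these controls on simulated data (the specific `k` value is illustrative):

```r
# Sketch: bs= chooses the basis type, k= its dimension (simulated data)
library(mgcv)

set.seed(2)
dat <- data.frame(x = runif(300, 0, 10))
dat$y <- sin(dat$x) + rnorm(300, sd = 0.3)

fit_tp <- gam(y ~ s(x, bs = "tp", k = 10), data = dat)  # thin plate (the default)
fit_cr <- gam(y ~ s(x, bs = "cr", k = 10), data = dat)  # cubic regression spline

# The penalty shrinks the effective degrees of freedom well below k
summary(fit_tp)$edf
```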
Important Considerations:

- Knots and Basis Functions: the `mgcv` default is to distribute them based on the data
- Thin plate regression splines (`tp`): flexible, non-linear smooths for complex, irregular data
- Cubic regression splines (`cr`): piecewise cubic polynomials, flexible with knots for general trends
- Cyclic cubic splines (`cc`): for periodic data (e.g., angles, time-of-day); handles cyclical patterns
- Gaussian process smooths (`gp`): smooths with uncertainty estimates, used for spatial / time-series models
- B-splines (`bs`): piecewise polynomials, flexible but computationally efficient
- P-splines (`ps`): a combination of B-splines and a penalty for smoothness, good for large datasets

`cor(obs, pred)` ~ 0.86

`economics` dataset
- `psavert`: personal savings rate
- `unemploy`: number of unemployed (in thousands)
- `date`: date of observation
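The `economics` data ship with ggplot2; a quick look at the columns used here:

```r
# Quick look at the economics data from ggplot2
library(ggplot2)

data(economics)
head(economics[, c("date", "psavert", "unemploy")])
range(economics$date)  # span of the monthly series
```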
DLMs are flexible and allow a random walk in the intercept, the covariates, or both. Presumably we can do the same with a GAM.
What about a smooth (random walk) on the covariate? Is this correct, and why or why not?
This is fitting a flexible non-linear model
However, a non-linear smooth of unemployment totally ignores the time aspect
If you thought we needed to include time, you are correct
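A sketch of that non-linear model (the variable name `ln_unemploy` is assumed to be constructed as log unemployment):

```r
# Sketch: smooth of log unemployment, with no time component at all
library(mgcv)
library(ggplot2)

econ <- economics
econ$ln_unemploy <- log(econ$unemploy)  # assumed transformation

fit_nonlinear <- gam(psavert ~ s(ln_unemploy), data = econ)
summary(fit_nonlinear)
```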
The prior model was fitting a non-linear smooth of the covariate
What about a smooth of the covariate and time?
Here we add a 2D smooth, `s(ln_unemploy, time_num)`
What is the 2D smooth of ln_unemploy and time doing?
There’s maybe some slight non-linearity here
```r
# Simulate predictors
library(mgcv)
set.seed(123)
n <- 400
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)

# Create a response with an interaction:
# the response surface is non-additive -- x1's effect depends on x2
y <- sin(x1) * cos(x2) + rnorm(n, sd = 0.3)
dat <- data.frame(x1 = x1, x2 = x2, y = y)

model <- gam(y ~ s(x1, x2), data = dat)
vis.gam(model, view = c("x1", "x2"), plot.type = "contour", color = "topo")
```
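Applying the same idea to the economics data (a sketch; `ln_unemploy` and `time_num` are assumed to be log unemployment and a numeric version of the date):

```r
# Sketch: 2D smooth of the covariate and time on the economics data
library(mgcv)
library(ggplot2)

econ <- economics
econ$ln_unemploy <- log(econ$unemploy)     # assumed transformation
econ$time_num    <- as.numeric(econ$date)  # assumed time index

fit_2d <- gam(psavert ~ s(ln_unemploy, time_num), data = econ)
vis.gam(fit_2d, view = c("ln_unemploy", "time_num"), plot.type = "contour")
```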
- `s(ln_unemploy, time_num)` is fitting a 2D smooth that includes the interaction between the covariate and time
- This is a flexible model, but not the same as a time-varying slope
- Another way to model the interaction is with the `by` argument
- This lets the smooth vary / creates separate smooths for different values of the `by` covariate
Why?

- `s(ln_unemploy, by = time_num)` is letting the shape / smooth effect of `ln_unemploy` vary by time step
- But the ordering of years isn’t preserved
- Neighboring years should be similar
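The `by=` mechanism is easiest to see with a factor, where mgcv fits a separate smooth per level; with a numeric `by` such as `time_num`, the smooth is instead multiplied by that covariate (a varying-coefficient term). A sketch on simulated data:

```r
# Sketch: by= with a factor gives one smooth per group (simulated data)
library(mgcv)

set.seed(3)
dat <- data.frame(x = runif(300, 0, 10),
                  g = factor(sample(c("a", "b"), 300, replace = TRUE)))
dat$y <- ifelse(dat$g == "a", sin(dat$x), cos(dat$x)) + rnorm(300, sd = 0.3)

fit_by <- gam(y ~ g + s(x, by = g), data = dat)
summary(fit_by)  # separate smooth terms, one per level of g
```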
[Plot: `psavert` against `time_num`]
How to decide between these two approaches?

- Diagnostics: R2, RMSE, AIC, etc.
- `compare_gams` (available in the slide Rmd)
```r
k <- compare_gams(fit_nonlinear, fit_dlm, response = "psavert", data = economics)
print(knitr::kable(k, digits = 3, caption = "Comparison of GAMs"))
```
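`compare_gams` itself is defined in the slide Rmd; a hypothetical sketch of what such a helper might compute:

```r
# Hypothetical sketch of a compare_gams-style helper (the real one lives in the
# slide Rmd): collect AIC, GCV, adjusted R2, and RMSE for each fitted GAM
compare_gams_sketch <- function(fits, response, data) {
  obs <- data[[response]]
  do.call(rbind, lapply(fits, function(fit) {
    pred <- as.numeric(predict(fit))
    data.frame(AIC  = AIC(fit),
               GCV  = as.numeric(fit$gcv.ubre),  # GCV score under default method
               R2   = summary(fit)$r.sq,         # adjusted R-squared
               RMSE = sqrt(mean((obs - pred)^2)))
  }))
}
```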
| Model | AIC | GCV | R2 | RMSE | Shapiro_p |
|---|---|---|---|---|---|
| Non-linear GAM | 1421.12 | 0.759 | 0.918 | 1.146 | 0 |
| Time-varying GAM | 1495.46 | 0.868 | 0.906 | 1.194 | 0 |
Time-varying intercepts are more often supported than time-varying slopes. Why?

- They capture the overall trend (level) in the data
- The covariate may not be variable enough
- Intercepts may not be identifiable without priors, etc.
- This will vary case by case, and in general you want to be careful when comparing / constructing models