This was a more open-ended homework. Grading was 3/5 if you made a solid attempt at questions 1-8 but didn’t do anything extra (extra plots, more analysis, tables, discussion of what is going on). Anything extra you did (including for question 1-8) added points. Most tried some of the optional questions and got 4/5 points. A couple students did did 4+ optional or additional analyses plus delved deeper into questions 1-8 (5/5 points).
The following are examples of ‘extra’ analyses:
RMarkdown equation tips
$$ $$
is a shortcut for \begin{equation} \end{equation}
. Use one or the other but not both.
bmatrix
in the amsmath package has a limit of 10 columns in matrices. You can change this by including a line in your YAML at the top of your document to change the default.
---
title: "My Title"
author: My Name
header-includes:
- \setcounter{MaxMatrixCols}{20}
output:
pdf_document: default
html_document: default
---
load("ECNNO2.RData")
Load the packages.
library(MARSS)
library(ggplot2)
library(ggmap)
library(tidyr)
library(dplyr)
These are the site locations and names. Note that the site numbers do not indicate how close the sites are to each other.
There are 23 years of monthly data for each site and at each site there are 1-3 samples (tubes) taken.
The winter NO\(_2\) is higher and more variable than summer (coal plants?). You’ll focus on the winter average data for the first part of the homework and the monthly data for the second part of the homework.
dat = ECNNO2 %>%
subset(Month %in% c("Nov", "Dec", "Jan")) %>%
subset(TubeID=="E1") %>%
group_by(Year, SiteCode, TubeID) %>%
summarize(log.mean = log(mean(Value, na.rm=TRUE)))
# replace NaN with NA
dat[is.na(dat)] <- as.numeric(NA)
Make this into a matrix.
datwide <- pivot_wider(
dat,
names_from = Year,
values_from = log.mean
)
# make into a matrix
datmat <- as.matrix(datwide[,3:23])
rownames(datmat) <- datwide$SiteCode
mod.list1 <- list(Z = matrix(1, 11, 1),
R = "diagonal and unequal")
fit1 <- MARSS(datmat, model=mod.list1)
This model has an AICc value of 27.5528735.
autoplot(fit1, plot.type = "xtT")
R="diagonal and equal"
The observation errors independent with equal variance.We can fit each and see which is most supported.
Z <- matrix(1, 11, 1)
fit1b <- MARSS(datmat, model=list(Z=Z, R="diagonal and equal"))
fit1c <- MARSS(datmat, model=list(Z=Z, R="unconstrained"))
fit1d <- MARSS(datmat, model=list(Z=Z, R="equalvarcov"))
And then look at the table.
df <- data.frame(
R=c("diagonal and unequal","diagonal and equal",
"unconstrained", "equalvarcov"),
AICc=c(fit1$AICc, fit1b$AICc, fit1c$AICc, fit1d$AICc)
)
knitr::kable(df)
R | AICc |
---|---|
diagonal and unequal | 27.55287 |
diagonal and equal | 62.23435 |
unconstrained | 131.39249 |
equalvarcov | 64.53942 |
Not too bad but T03 especially seems to have a trend issue (residuals are not temporally stationary). T11 has some outliers.
resids <- residuals(fit1)
You can use plot(fit1)
and get some basic residuals plots but let’s make them ourselves.
ggplot(resids, aes(x=t, y=.resids)) + geom_point() +
stat_smooth() + geom_hline(yintercept=0) +
facet_wrap(~.rownames)
ggplot(resids, aes(x=.resids)) +
geom_histogram() +
facet_wrap(~.rownames) +
ylab("") + xlab("Residuals") +
geom_vline(xintercept = 0)
par(mfrow = c(3,4), mar=c(2,2,2,2))
sites <- rownames(datmat)
for(i in sites){
tmp <- subset(resids, .rownames==i)$.resids
qqnorm(tmp, main = i, pch =16, xpd = NA, xlab="", ylab="")
qqline(tmp)
}
par(mfrow = c(3,4), mar=c(2,2,3,2))
sites <- rownames(datmat)
for(i in sites){
tmp <- subset(resids, .rownames==i)$.resids
acf(tmp, na.action=na.pass, main=i)
}
ggplot(resids, aes(x = .fitted, y = .resids)) + geom_point() +
facet_wrap(~ .rownames, scales = "free") + xlab("Fitted values") + ylab("Residuals") +
geom_hline(yintercept=0)
The key here is setting up the \(\mathbf{Z}\) matrix (0s and 1s) and then making that in R.
The \(\mathbf{Z}\) will look like
\[ \mathbf{Z} = \begin{matrix} T01 \\ T03 \\ T04 \\ T05 \\ T06 \\ T07 \\ T08 \\ T09 \\ T10 \\ T11 \\ T12 \end{matrix} \begin{bmatrix} 1&0&0 \\ 0&0&1 \\ 0&1&0 \\ 1&0&0 \\ 1&0&0 \\ 0&1&0 \\ 1&0&0 \\ 1&0&0 \\ 1&0&0 \\ 0&0&1 \\ 0&1&0 \end{bmatrix} \]
Z <- matrix(0, 11, 3)
Z[c(1,4,5,7,8,9),1] <- 1
Z[c(3, 6, 11),2] <- 1
Z[c(2, 10),3] <- 1
We have to decide what to assume about \(\mathbf{Q}\). Let’s start with independent with different variances.
mod.list2 <- list(Z = Z,
R = "diagonal and unequal")
fit2 <- MARSS(datmat, model=mod.list2)
The AICc for this model is quite a bit larger: 48.671774.
Let’s try an unconstrained \(\mathbf{Q}\).
mod.list3 <- list(Z = Z,
R = "diagonal and unequal",
Q = "unconstrained")
fit3 <- MARSS(datmat, model=mod.list3)
The AICc for this model is smaller 25.3222632.
If we look at the \(\mathbf{Q}\) variance-covariance matrix we can see why the independent assumption was bad.
require(corrplot)
Qmat <- coef(fit3, type="matrix")$Q
M <- cov2cor(Qmat)
corrplot(M)
The states for this model are similar but have different trends.
plot(fit3, plot.type="xtT")
## plot type = "xtT" Estimated states
MARSS()
model will look like so.MARSS(datmat, model=list(R=..., Z=..., Q=...))
See comments above.
Some ideas for OPTIONAL extra analyses
OPTIONAL. Go back to question 1 and the model with 1 state. Instead of assuming that all \(y\) observe the same \(x\), assume they can observed a ‘stretched’ \(x\) so \(y_{i,t} = z_i x_t + v_t\). Fit that model and look at the estimated \(\mathbf{Z}\) matrix. What does it say? Is it more supported than a model without ‘stretching’?
Note you will need to remove the mean from the data and set A="zero"
to fit a model where you estimate \(\mathbf{Z}\).
newdat <- zscore(datmat, mean.only=TRUE)
MARSS(newdat, model=list(R=.., Z=..., A="zero"))
The idea here is that we have the \(x\) scale 1 to 1 to one \(y\) (let’s say the first) and then use \(z_i\) to scale the \(x\) for the other \(y\). This gets us \(y_{i,t} = z_i x_t + v_{i,t}\).
The \(\mathbf{Z}\) matrix for this model is an 11-row, 1-column matrix as before, but now each entry is \(z\), not 1, to model the “stretching”. Note you could have also fit this by estimating \(z_1\). It would work but it’s just that one of the \(z\) is underdetermined and should be fixed.
\[ \mathbf{Z} = \begin{bmatrix} 1 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \\ z_8 \\ z_9 \\ z_{10} \\ z_{11} \\ z_{12} \\ \end{bmatrix} \]
Z2 <- matrix(list(1),11,1)
Z2[2:11,] <- paste0("z",3:12)
mod.list4 <- list(Z = Z2,
R = "diagonal and unequal",
A = "zero")
datmat2 <- zscore(datmat, mean.only=TRUE)
fit4 <- MARSS(datmat2, model=mod.list4)
Note, we cannot compare AICcs because datmat2 and datmat are different. datmat2 has the mean subtracted off. But we can fit model 1 to datmat2 and then compare AICcs.
mod.list5 <- list(Z = matrix(1,11,1),
R = "diagonal and unequal",
A = "zero")
fit5 <- MARSS(datmat2, model=mod.list4)
Instead of this, I just took the mean of the data and smoothed the data. It looks fairly similar to the state for fit1. The offset is because the MARSS model makes \(x\) match \(y_1\).
df <- data.frame(y = apply(datmat, 2, mean, na.rm=TRUE),
t = 1:21)
plot(smooth.spline(df$y), type="l", ylim=c(1.5,3.2))
lines(fit1$states[1,])
points(df$y)
M <- cor(t(datmat), use="complete.obs")
corrplot::corrplot(M, order = "hclust", addrect = 5)
fit <- MARSS(datmat, model=list(R="diagonal and unequal", Q="unconstrained"))
Qmat <- coef(fit, type="matrix")$Q
rownames(Qmat) <- colnames(Qmat) <- rownames(datmat)
M <- cov2cor(Qmat)
corrplot(M, order = "hclust", addrect = 3)
Data for part 2 is datmat2
below. What the code is doing: take the ECNNO2 monthly data, get just tube E1, make a column with Year-Mon, remove the unneeded columns, pivot wider, make into a matrix.
datwide <- ECNNO2 %>%
subset(TubeID=="E1") %>%
mutate(Year.Mon=paste(Year, Month, sep="-")) %>%
select(-Year, -Month, -TubeID, -Date) %>%
pivot_wider(
names_from = Year.Mon,
values_from = Value
)
# make into a matrix
datmat2 <- as.matrix(datwide[,-1])
# deal with NaN
datmat2[is.na(datmat2)] <- NA
# log the data
datmat2[datmat2==0] <- NA
datmat2 <- log(datmat2)
rownames(datmat2) <- datwide$SiteCode
Create covariate
TT <- dim(datmat2)[2] # time steps
period <- 12
cos.t <- cos(2 * pi * seq(TT)/period)
sin.t <- sin(2 * pi * seq(TT)/period)
covariate <- rbind(cos.t, sin.t)
Fit model
fit.seas1 <- MARSS(datmat2, model=list(Z = matrix(1, 11, 1),
R = "diagonal and unequal",
C = "unconstrained",
c = covariate))
Let’s look at the state. It is cyclic as we expect.
plot(fit.seas1, plot.type="xtT")
## plot type = "xtT" Estimated states
Let’s see if we have
There are problems. This model is not capturing all the seasonality.
par(mfrow = c(3,4), mar=c(2,2,3,2))
resids <- residuals(fit.seas1)
sites <- rownames(datmat)
for(i in sites){
tmp <- subset(resids, .rownames==i)$.resids
acf(tmp, na.action=na.pass, main=i)
}
fit.seas2 <- MARSS(datmat2, model=list(Z = matrix(1, 11, 1),
R = "diagonal and unequal",
D = "unconstrained",
d = covariate))
Much better but we still see the 11-12 month signal in the residuals.
par(mfrow = c(3,4), mar=c(2,2,3,2))
resids <- residuals(fit.seas2)
sites <- rownames(datmat)
for(i in sites){
tmp <- subset(resids, .rownames==i)$.resids
acf(tmp, na.action=na.pass, main=i)
lines(c(12,12),c(.5,1), col=2)
}
Let’s also try a model where the seasonality is in the observation but is the same across sites.
The D will be
\[ \begin{bmatrix} D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c\\ D_s&D_c \end{bmatrix} \]
fit.seas3 <- MARSS(datmat2, model=list(Z = matrix(1, 11, 1),
R = "diagonal and unequal",
D = matrix(c("Ds", "Dc"), 11, 2, byrow=TRUE),
d = covariate))
par(mfrow = c(3,4), mar=c(2,2,3,2))
resids <- residuals(fit.seas3)
sites <- rownames(datmat)
for(i in sites){
tmp <- subset(resids, .rownames==i)$.resids
acf(tmp, na.action=na.pass, main=i)
lines(c(12,12),c(.5,1), col=2)
}
Let’s also try a model with the 3 regions and seasonality.
fit.seas4 <- MARSS(datmat2, model=list(Z = Z,
R = "diagonal and unequal",
Q = "unconstrained",
D = "unconstrained",
d = covariate))
The first model has the seasonality in the underlying state and the seasonality is the same for each site. The second one has the seaonality in the observation and the seasonality is different for each site.
And then look at the table.
df <- data.frame(
model=c("season in one state", "season in observation. all different", "season in observation. one season", "season in observation. all different. 3 states"),
AICc=c(fit.seas1$AICc, fit.seas2$AICc, fit.seas3$AICc, fit.seas4$AICc)
)
knitr::kable(df)
model | AICc |
---|---|
season in one state | 2293.079 |
season in observation. all different | 2213.435 |
season in observation. one season | 2293.080 |
season in observation. all different. 3 states | 2175.940 |
Ideas for OPTIONAL extra analyses
Make seasonal effects matrix
D.est1 <- coef(fit.seas2, type="matrix")$D
# The time series of net seasonal effects
seasonality.1 <- D.est1 %*% covariate[,1:12]
rownames(seasonality.1) <- rownames(datmat)
colnames(seasonality.1) <- month.abb
matplot(t(seasonality.1),type="l",bty="n",xaxt="n",ylab="Fourier", col=c(2:3,1,4:11), lwd=3, lty=c(2:3,1,4:11))
axis(1,labels=month.abb, at=1:12,las=2,cex.axis=0.75)
legend("top", legend=rownames(datmat), col=c(2:3,1,4:11), bty="n", lwd=3, cex=0.6, seg.len=4, lty=c(2:3,1,4:11))
The sites with the strongest winter versus summer difference are near London plus T04. T04 is quite odd since it is not near cities or coal plants. It is up north in the Lake District.
par(mfrow=c(1,2))
jan.jul.diff <- sort(seasonality.1[,1]-seasonality.1[,7])
p <- barplot(jan.jul.diff, main="Jan - Jul Difference", ylim=c(0,1))
locs <- which(names(jan.jul.diff) %in% c("T08","T10","T06","T09"))
text(p[locs], rep(.3,4), rep("London",4), srt=90)
locs <- which(names(jan.jul.diff) %in% c("T04"))
text(p[locs], rep(.3,1), rep("Lake District",1), srt=90)
mean.vals <- sort(apply(datmat, 1, mean, na.rm=TRUE))
p <- barplot(mean.vals, main="Mean values")
locs <- which(names(mean.vals) %in% c("T08","T10","T06","T09"))
text(p[locs], rep(1.3,4), rep("London",4), srt=90)
locs <- which(names(mean.vals) %in% c("T04"))
text(p[locs], rep(.75,1), rep("Lake District",1), srt=90)
Make seasonal effects matrix
D.est1 <- coef(fit.seas2, type="matrix")$D
# The time series of net seasonal effects
seasonality.1 <- D.est1 %*% covariate[,1:12]
rownames(seasonality.1) <- rownames(datmat)
colnames(seasonality.1) <- month.abb
matplot(t(seasonality.1),type="l",bty="n",xaxt="n",ylab="Fourier", col=c(2:3,1,4:11), lwd=3, lty=c(2:3,1,4:11))
axis(1,labels=month.abb, at=1:12,las=2,cex.axis=0.75)
legend("top", legend=rownames(datmat), col=c(2:3,1,4:11), bty="n", lwd=3, cex=0.6, seg.len=4, lty=c(2:3,1,4:11))
OPTIONAL. Use the model from question 8. Structure your \(\mathbf{D}\) matrix into the 3 regions so that each region has the same seasonal effect (same \(\mathbf{D}\) values).
Region 1 = T01, T05, T06, T08, T09, T10
Region 2 = T04, T07, T12
Region 3 = T03, T11
In this case the D is \[ \begin{bmatrix} D_{s,1}&D_{c,1}\\ D_{s,3}&D_{c,3}\\ D_{s,2}&D_{c,2}\\ D_{s,1}&D_{c,1}\\ D_{s,1}&D_{c,1}\\ D_{s,2}&D_{c,2}\\ D_{s,1}&D_{c,1}\\ D_{s,1}&D_{c,1}\\ D_{s,1}&D_{c,1}\\ D_{s,3}&D_{c,3}\\ D_{s,2}&D_{c,2} \end{bmatrix} \] Set up your D like so
D <- matrix("", 11, 2)
D[c(1,4,5,7,8,9),] <- c("Ds1", "Dc1")
D[c(3, 6, 11),] <- c("Ds2", "Dc2")
D[c(2, 10),] <- c("Ds3", "Dc3")
Fit the model
fit.seas6 <- MARSS(datmat2, model=list(Z = matrix(1, 11, 1),
R = "diagonal and unequal",
D = D,
d = covariate))
## Success! abstol and log-log tests passed at 75 iterations.
## Alert: conv.test.slope.tol is 0.5.
## Test with smaller values (<0.1) to ensure convergence.
##
## MARSS fit is
## Estimation method: kem
## Convergence test: conv.test.slope.tol = 0.5, abstol = 0.001
## Estimation converged in 75 iterations.
## Log-likelihood: -1116.538
## AIC: 2293.076 AICc: 2293.775
##
## Estimate
## A.T03 -0.58704
## A.T04 -0.91045
## A.T05 -0.64273
## A.T06 0.88807
## A.T07 -2.10286
## A.T08 0.16799
## A.T09 -0.21467
## A.T10 -0.16770
## A.T11 -1.12102
## A.T12 -1.74289
## R.(T01,T01) 0.07482
## R.(T03,T03) 0.16085
## R.(T04,T04) 0.15124
## R.(T05,T05) 0.10653
## R.(T06,T06) 0.07894
## R.(T07,T07) 0.19398
## R.(T08,T08) 0.07369
## R.(T09,T09) 0.06904
## R.(T10,T10) 0.08261
## R.(T11,T11) 0.13404
## R.(T12,T12) 0.25288
## U.U -0.00165
## Q.Q 0.03959
## x0.x0 2.41853
## D.Ds1 0.22809
## D.Ds3 0.13298
## D.Ds2 0.22868
## D.Dc1 0.26647
## D.Dc2 0.19598
## D.Dc3 0.15692
## Initial states (x0) defined at t=0
##
## Standard errors have not been calculated.
## Use MARSSparamCIs to compute CIs and bias estimates.
Make seasonal effects matrix
D.est <- matrix(coef(fit.seas6)$D, 3, 2)
# The time series of net seasonal effects
seasonality.2 <- D.est %*% covariate[,1:12]
rownames(seasonality.2) <- c("Region 1", "Region 2", "Region 3")
colnames(seasonality.2) <- month.abb
matplot(t(seasonality.2),type="l",bty="n",xaxt="n",ylab="Fourier", lty=1:3, lwd=3, col=1:3)
axis(1,labels=month.abb, at=1:12,las=2,cex.axis=0.75)
legend("top", lty=1:3, legend=c("Region 1", "Region 2", "Region 3"), cex=1, col=1:3, bty="n", lwd=3)
We need to create a covariate matrix with a row for each month. If t=i in datmat is month j, then the j-th row, i-th column will be a 1. This is a little tricky since you would have had to guess that you need to set A equal to zero, just like you have to do in a linear regression with factors.
c.fac <- matrix(diag(1,12),12, ncol(datmat2))
rownames(c.fac) <- month.abb
mod.list <- list(Z = matrix(1, 11, 1),
R = "diagonal and unequal",
D = "unconstrained",
d = c.fac,
A = "zero")
fit.seas7 <- MARSS(datmat2, model=mod.list)
Make seasonal effects matrix. I will remove the mean so we can compare the seasonality without the mean of the data.
D.est <- coef(fit.seas7, type="matrix")$D
seasonality.7 <- D.est %*% c.fac[,1:12]
seasonality.7 <- zscore(seasonality.7, mean.only=TRUE)
rownames(seasonality.7) <- rownames(datmat2)
colnames(seasonality.7) <- month.abb
matplot(t(seasonality.7),type="l",bty="n",xaxt="n",ylab="Effect", col=c(2:3,1,4:11), lwd=3, lty=c(2:3,1,4:11))
axis(1,labels=month.abb, at=1:12,las=2,cex.axis=0.75)
title("Each month as a factor")
legend("top", legend=rownames(datmat), col=c(2:3,1,4:11), bty="n", lwd=3, cex=0.6, seg.len=4, lty=c(2:3,1,4:11))
It doesn’t substantially change our conclusions about seasonality.
Let’s try a 3rd order polynomial. We need to create a covariate matrix with a row for month, month\(^2\) and month\(^3\). However these would be correlated so I will use the poly()
function to create 3rd polynomial covariates that are orthogonal.
mon <- rep(1:12, 23)
c.poly <- t(poly(mon, 3))
rownames(c.poly) <- c("m1", "m2", "m3")
mod.list <- list(Z = matrix(1, 11, 1),
R = "diagonal and unequal",
D = "unconstrained",
d = c.poly)
fit.seas8 <- MARSS(datmat2, model=mod.list)
Make seasonal effects matrix.
D.est <- coef(fit.seas8, type="matrix")$D
seasonality.8 <- D.est %*% c.poly[,1:12]
rownames(seasonality.8) <- rownames(datmat2)
colnames(seasonality.8) <- month.abb
matplot(t(seasonality.8),type="l",bty="n",xaxt="n",ylab="Effect", col=c(2:3,1,4:11), lwd=3, lty=c(2:3,1,4:11))
axis(1,labels=month.abb, at=1:12,las=2,cex.axis=0.75)
title("3rd order polynomial")
legend("top", legend=rownames(datmat), col=c(2:3,1,4:11), bty="n", lwd=3, cex=0.6, seg.len=4, lty=c(2:3,1,4:11))
There are differences in our conclusions regarding the degree of seasonality (winter-summer percent difference). Percent difference because we are working on logged data. It doesn’t substantially change our conclusions about seasonality.