class: center, middle, inverse, title-slide

# STA 610L: Module 2.5

## Random effects ANOVA (Bayesian estimation II)

### Dr. Olanrewaju Michael Akande

---

## Bayesian random effects ANOVA model

- Recall our hierarchical model can be written as
.block[
.small[
$$
`\begin{split}
y_{ij} | \mu_j, \sigma^2 & \sim \mathcal{N} \left( \mu_j, \sigma^2\right); \ \ \ i = 1,\ldots, n_j\\
\mu_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J,
\end{split}`
$$
]
]

  with priors
.block[
.small[
$$
`\begin{split}
\pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma^2_0\right)\\
\pi(\tau^2) & = \mathcal{IG} \left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right)\\
\pi(\sigma^2) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).\\
\end{split}`
$$
]
]

--

- We can write our own Gibbs sampler for this model (see the end of this module and also the next homework).

--

- However, since we will rely primarily on the .hlight[brms] package for fitting many of our hierarchical models anyway, let's see if we can use it to fit a version of this model to the radon data.

---

## Radon study again

```r
#library(rstan)
#library(brms)
#library(tidybayes)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

#note: there are many ways of specifying priors in brms
#we will touch on some other options soon
#see the help page for "set_prior"
#for now, vague priors under our model specification will do
prior <- c(set_prior("normal(0,5)", class = "Intercept"),
           #set_prior("normal(0,10)", class = "b"),
           set_prior("inv_gamma(0.5,5)", class = "sigma"),
           set_prior("inv_gamma(0.5,5)", class = "sd"))

m1 <- brm(log_radon ~ (1 | countyname), data = Radon,
          family = gaussian(), prior = prior,
          iter = 5000, warmup = 2000, seed = 13)
summary(m1)
```

---

## Radon study again

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log_radon ~ (1 | countyname) 
##    Data: Radon (Number of observations: 919) 
##   Draws: 4 chains, each with iter = 5000; warmup = 2000; thin = 1;
##          total post-warmup draws = 12000
## 
## Group-Level Effects: 
## ~countyname (Number of levels: 85) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.40      0.05     0.31     0.51 1.00     5818     7459
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     1.32      0.06     1.21     1.43 1.00     7103     8310
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.80      0.02     0.76     0.84 1.00    22035     9252
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

---

## Radon study again

We can compare the results to the frequentist estimates.

```r
Model1 <- lmer(log_radon ~ (1 | countyname), data = Radon)
summary(Model1)
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: log_radon ~ (1 | countyname)
##    Data: Radon
## 
## REML criterion at convergence: 2259.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.4661 -0.5734  0.0441  0.6432  3.3516 
## 
## Random effects:
##  Groups     Name        Variance Std.Dev.
##  countyname (Intercept) 0.09581  0.3095  
##  Residual               0.63662  0.7979  
## Number of obs: 919, groups:  countyname, 85
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1.31258    0.04891   26.84
```
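---

## Radon study again

We can also work with the posterior draws behind `summary(m1)` directly, for example, to set the Bayesian variance components next to the REML estimates, or to summarize derived quantities such as the intraclass correlation. A minimal sketch (not part of the original analysis; `sd_countyname__Intercept` and `sigma` are the draw names .hlight[brms] assigns for this fit):

```r
post <- as_draws_df(m1) #post-warmup draws as a data frame

#posterior medians and 95% intervals for tau and sigma
quantile(post$sd_countyname__Intercept, c(.025, .5, .975))
quantile(post$sigma, c(.025, .5, .975))

#posterior for the intraclass correlation tau^2/(tau^2 + sigma^2)
icc <- with(post, sd_countyname__Intercept^2 /
              (sd_countyname__Intercept^2 + sigma^2))
quantile(icc, c(.025, .5, .975))
```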
---

## Radon study again

Quick diagnostics for the Bayesian model.

```r
plot(m1)
```

<img src="2-5-random-effects-anova-bayesian-estimation-II_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Radon study again

We can plot the group means.

```r
m1 %>%
  spread_draws(b_Intercept, r_countyname[countyname,]) %>%
  median_qi(`Group Means` = b_Intercept + r_countyname) %>%
  ggplot(aes(y = countyname, x = `Group Means`,
             xmin = .lower, xmax = .upper)) +
  geom_pointinterval(orientation = "horizontal")
```

---

## Radon study again

<img src="2-5-random-effects-anova-bayesian-estimation-II_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Radon study again

...or just the treatment effects.

```r
m1 %>%
  spread_draws(r_countyname[countyname,]) %>%
  median_qi(`Group Effects` = r_countyname) %>%
  ggplot(aes(y = countyname, x = `Group Effects`,
             xmin = .lower, xmax = .upper)) +
  geom_pointinterval(orientation = "horizontal")
```

---

## Radon study again

<img src="2-5-random-effects-anova-bayesian-estimation-II_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Challenge to validity: heterogeneous means and variances

Recall our model again:
.block[
.small[
$$
`\begin{split}
y_{ij} | \mu_j, \sigma^2 & \sim \mathcal{N} \left( \mu_j, \sigma^2 \right); \ \ \ i = 1,\ldots, n_j\\
\mu_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J,\\
\mu & \sim \mathcal{N}\left(\mu_0, \gamma^2_0\right),\\
\tau^2 & \sim \mathcal{IG} \left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right),\\
\sigma^2 & \sim \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).\\
\end{split}`
$$
]
]

While we are indeed sharing information across groups, we only do so through the group-specific means. Although many people feel that shrinkage can "do no harm", it can be quite detrimental when the shrinkage target is not correctly specified.

---

## Mortality by volume

<img src="img/mortprob.jpg" width="450px" height="400px" style="display: block; margin: auto;" />

---

## Estimated random intercepts by volume

<img src="img/alpha.jpg" width="450px" height="400px" style="display: block; margin: auto;" />

---

## Group-specific variances

How might we specify a model to avoid such problems? We could introduce predictors to model the group means and/or the group variances. For example,
`$$\alpha_j \sim N(\mu_j(z),\tau^2_j(z)),$$`
where `\(z\)` denotes group-level predictors (e.g., hospital volume in the previous plots).

Another potential challenge is that the variance of the response may not be the same for each group anyway; this could be due to a variety of factors. One potential remedy is to allow the error variance to differ across groups.

A natural extension is
.block[
.small[
$$
`\begin{split}
y_{ij} | \mu_j, \sigma^2_j & \sim \mathcal{N} \left( \mu_j, \sigma^2_j\right); \ \ \ i = 1,\ldots, n_j\\
\mu_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J,\\
\sigma^2_1, \ldots, \sigma^2_J | \nu_0, \sigma_0^2 & \sim \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).
\end{split}`
$$
]
]
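---

## Group-specific variances

As an aside, .hlight[brms] can fit a version of this heterogeneous-variance idea through its distributional regression syntax, which gives `sigma` its own hierarchical predictor (on the log scale by default). The sketch below (the fit `m2` is illustrative, not from the original analysis) shrinks county-specific standard deviations toward a common value, though via a hierarchical prior on `\(\log \sigma_j\)` rather than the inverse-gamma formulation we derive next.

```r
#county-specific error sd's: sigma gets its own (log-scale) formula
m2 <- brm(bf(log_radon ~ (1 | countyname),
             sigma ~ (1 | countyname)),
          data = Radon, family = gaussian(),
          iter = 5000, warmup = 2000, seed = 13)
summary(m2)
```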
---

## Posterior inference

- The full posterior is now:
.block[
.small[
$$
`\begin{split}
\pi(\mu_1, \ldots, \mu_J, \sigma^2_1, \ldots, \sigma^2_J, \mu,\tau^2,\nu_0, \sigma_0^2 | Y) & \boldsymbol{\propto} p(y | \mu_1, \ldots, \mu_J, \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \nu_0, \sigma_0^2)\\
& \ \ \ \ \times p(\mu_1, \ldots, \mu_J | \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \nu_0, \sigma_0^2)\\
& \ \ \ \ \times p(\sigma^2_1, \ldots, \sigma^2_J | \mu, \tau^2,\nu_0, \sigma_0^2)\\
& \ \ \ \ \times \pi(\mu, \tau^2,\nu_0, \sigma_0^2)\\
\\
& \boldsymbol{=} p(y | \mu_1, \ldots, \mu_J, \sigma^2_1, \ldots, \sigma^2_J )\\
& \ \ \ \ \times p(\mu_1, \ldots, \mu_J | \mu,\tau^2)\\
& \ \ \ \ \times p(\sigma^2_1, \ldots, \sigma^2_J | \nu_0, \sigma_0^2)\\
& \ \ \ \ \times \pi(\mu) \cdot \pi(\tau^2) \cdot \pi(\nu_0) \cdot \pi(\sigma_0^2)\\
\\
& \boldsymbol{=} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \mu_j, \sigma^2_j ) \right\}\\
& \ \ \ \ \times \left\{ \prod_{j=1}^{J} p(\mu_j | \mu,\tau^2) \right\}\\
& \ \ \ \ \times \left\{ \prod_{j=1}^{J} p(\sigma^2_j | \nu_0, \sigma_0^2) \right\}\\
& \ \ \ \ \times \pi(\mu) \cdot \pi(\tau^2) \cdot \pi(\nu_0) \cdot \pi(\sigma_0^2)\\
\end{split}`
$$
]
]

---

## Full conditionals

- Notice that this new factorization won't affect the full conditionals for `\(\mu\)` and `\(\tau^2\)` from before, since those have nothing to do with all the new `\(\sigma^2_j\)`'s.

--

- That is,
.block[
.small[
$$
`\begin{split}
& \pi(\mu | \cdots \cdots ) = \mathcal{N}\left(\mu_n, \gamma^2_n \right) \ \ \ \ \textrm{where}\\
\\
& \gamma^2_n = \dfrac{1}{\dfrac{J}{\tau^2} + \dfrac{1}{\gamma_0^2} } ; \ \ \ \ \ \ \ \ \mu_n = \gamma^2_n \left[ \dfrac{J}{\tau^2} \bar{\mu} + \dfrac{1}{\gamma_0^2} \mu_0 \right]; \ \ \ \ \ \ \ \ \bar{\mu} = \dfrac{1}{J} \sum\limits_{j=1}^{J} \mu_j,
\end{split}`
$$
]
]

  and
.block[
.small[
$$
`\begin{split}
& \pi(\tau^2 | \cdots \cdots ) = \mathcal{IG} \left(\dfrac{\eta_n}{2}, \dfrac{\eta_n\tau_n^2}{2}\right) \ \ \ \ \textrm{where}\\
\\
& \eta_n = \eta_0 + J ; \ \ \ \ \ \ \ \tau_n^2 = \dfrac{1}{\eta_n} \left[\eta_0\tau_0^2 + \sum\limits_{j=1}^{J} (\mu_j - \mu)^2 \right].\\
\end{split}`
$$
]
]

---

## Full conditionals

- For the full conditional of each `\(\mu_j\)`, we have
.block[
.small[
$$
`\begin{split}
\pi(\mu_j | \mu_{-j}, \mu, \sigma^2_1, \ldots, \sigma^2_J,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \mu_j, \sigma^2_j ) \right\} \cdot p(\mu_j | \mu,\tau^2) \\
\end{split}`
$$
]
]

  with the only change from before being `\(\sigma^2_j\)`.

--

- That is, those terms still include a normal density for `\(\mu_j\)` multiplied by a product of normals in which `\(\mu_j\)` is the mean, again mirroring the previous case, so you can show that
.block[
.small[
$$
`\begin{split}
\pi(\mu_j | \mu_{-j}, \mu, \sigma^2_1, \ldots, \sigma^2_J,\tau^2, Y) & = \mathcal{N}\left(\mu_j^\star, \tau_j^\star \right) \ \ \ \ \textrm{where}\\
\\
\tau_j^\star & = \dfrac{1}{ \dfrac{n_j}{\sigma^2_j} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \mu_j^\star = \tau_j^\star \left[ \dfrac{n_j}{\sigma^2_j} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right]
\end{split}`
$$
]
]
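---

## Full conditionals

Note that `\(\mu_j^\star\)` is a precision-weighted average of `\(\bar{y}_j\)` and `\(\mu\)`, so small groups get pulled strongly toward the global mean while large groups barely move. A quick numerical illustration (the helper function is hypothetical; the plugged-in values roughly match the REML estimates from the radon fit, `\(\hat{\sigma}^2 \approx 0.64\)` and `\(\hat{\tau}^2 \approx 0.10\)`):

```r
#weight on the group sample mean ybar_j in the mu_j full conditional;
#the remaining weight (1 - w) goes to the global mean mu
shrink_weight <- function(n_j, sigma_sq_j, tau_sq){
  (n_j/sigma_sq_j) / (n_j/sigma_sq_j + 1/tau_sq)
}

shrink_weight(n_j = 2,   sigma_sq_j = 0.64, tau_sq = 0.10) #~0.24: strong shrinkage
shrink_weight(n_j = 100, sigma_sq_j = 0.64, tau_sq = 0.10) #~0.94: little shrinkage
```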
---

## How about within-group variances?

- Before we get to the choice of the priors for `\(\nu_0\)` and `\(\sigma_0^2\)`, we have enough to derive the full conditional for each `\(\sigma^2_j\)`. This actually takes a similar form to what we had before we indexed by `\(j\)`, that is,
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2_j | \sigma^2_{-j}, \mu_1, \ldots, \mu_J, \mu, \tau^2, \nu_0, \sigma_0^2,Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \mu_j, \sigma^2_j ) \right\} \cdot \pi(\sigma^2_j | \nu_0, \sigma_0^2)\\
\end{split}`
$$
]
]

--

- This still looks like what we had before, that is, a product of normals and one inverse-gamma, so that
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2_j | \sigma^2_{-j}, \mu_1, \ldots, \mu_J, \mu, \tau^2, \nu_0, \sigma_0^2, Y) & = \mathcal{IG} \left(\dfrac{\nu_{j}^\star}{2}, \dfrac{\nu_{j}^\star\sigma_{j}^{2(\star)}}{2}\right) \ \ \ \ \textrm{where}\\
\\
\nu_{j}^\star = \nu_0 + n_j ; \ \ \ \ \ \ \ \sigma_{j}^{2(\star)} & = \dfrac{1}{\nu_{j}^\star} \left[\nu_0\sigma_0^2 + \sum\limits_{i=1}^{n_j} (y_{ij} - \mu_j)^2 \right].\\
\end{split}`
$$
]
]
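---

## How about within-group variances?

Each of these conjugate updates is a one-line draw in R, using the fact that if `\(X \sim \mathcal{Ga}(a, b)\)` then `\(1/X \sim \mathcal{IG}(a, b)\)` under the parameterization above. A sketch of the `\(\sigma^2_j\)` update (a hypothetical helper; the same draw reappears inside the full Gibbs sampler at the end of the module):

```r
#one draw from each sigma^2_j full conditional;
#ss_j holds the within-group sums of squares sum_i (y_ij - mu_j)^2,
#and n is the vector of group sizes
draw_sigma_sq <- function(ss_j, n, nu_0, sigma_0_sq){
  nu_star <- nu_0 + n #vectorized over the J groups
  1/rgamma(length(n), nu_star/2, (nu_0*sigma_0_sq + ss_j)/2)
}
```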
---

## Remaining hyper-priors

- Now we can get back to priors for `\(\nu_0\)` and `\(\sigma_0^2\)`. We know that a semi-conjugate prior for `\(\sigma_0^2\)` is a gamma distribution. That is, if we set
.block[
.small[
$$
`\begin{split}
\pi(\sigma_0^2) & = \mathcal{Ga} \left(a,b\right),\\
\end{split}`
$$
]
]

  then,
.block[
.small[
$$
`\begin{split}
\pi(\sigma_0^2 | \mu_1, \ldots, \mu_J, \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \nu_0, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\sigma^2_j | \nu_0, \sigma_0^2) \right\} \cdot \pi(\sigma_0^2)\\
& \propto \ \prod_{j=1}^{J} \ \mathcal{IG} \left(\sigma^2_j; \dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) \ \cdot \ \mathcal{Ga} \left(\sigma_0^2 ; a,b\right) \\
\end{split}`
$$
]
]

--

- Recall that
  + `\(\mathcal{Ga}(y; a, b) \equiv \dfrac{b^a}{\Gamma(a)} y^{a - 1} e^{-b y}\)`, and
  + `\(\mathcal{IG}(y; a, b) \equiv \dfrac{b^a}{\Gamma(a)} y^{-(a+1)} e^{-\dfrac{b}{y}}\)`.

---

## Remaining hyper-priors

- So `\(\pi(\sigma_0^2 | \mu_1, \ldots, \mu_J, \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \nu_0, Y)\)`
.block[
.small[
$$
`\begin{split}
& \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\sigma^2_j | \nu_0, \sigma_0^2) \right\} \cdot \pi(\sigma_0^2)\\
& \boldsymbol{\propto} \ \prod_{j=1}^{J} \ \mathcal{IG} \left(\sigma^2_j; \dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) \ \cdot \ \mathcal{Ga} \left(\sigma_0^2 ; a,b\right) \\
& \boldsymbol{=} \left[ \prod_{j=1}^{J} \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} (\sigma^2_j)^{-\left(\dfrac{\nu_0}{2}+1 \right)} e^{-\dfrac{\nu_0\sigma_0^2}{2 (\sigma^2_j)}} \right] \cdot \left[ \dfrac{b^a}{\Gamma(a)} (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
& \boldsymbol{\propto} \left[ \prod_{j=1}^{J} \left( \sigma_0^2 \right)^\left(\dfrac{\nu_0}{2} \right) e^{-\dfrac{\nu_0\sigma_0^2}{2 (\sigma^2_j)}} \right] \cdot \left[ (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
& \boldsymbol{\propto} \left[ \left( \sigma_0^2 \right)^\left(\dfrac{J \nu_0}{2} \right) e^{- \sigma_0^2 \left[ \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \cdot \left[ (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
\end{split}`
$$
]
]

---

## Remaining hyper-priors

- That is, the full conditional is
.block[
.small[
$$
`\begin{split}
\pi(\sigma_0^2 | \cdots \cdots ) & \boldsymbol{\propto} \left[ \left( \sigma_0^2 \right)^\left(\dfrac{J \nu_0}{2} \right) e^{- \sigma_0^2 \left[ \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \cdot \left[ (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
& \boldsymbol{\propto} \left[ \left( \sigma_0^2 \right)^\left( a + \dfrac{J \nu_0}{2} - 1 \right) e^{- \sigma_0^2 \left[ b + \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \\
& \equiv \mathcal{Ga} \left(\sigma_0^2 ; a_n ,b_n \right),
\end{split}`
$$
]
]

  where
.block[
.small[
$$
`\begin{split}
a_n = a + \dfrac{J \nu_0}{2}; \ \ \ \ b_n = b + \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j}.
\end{split}`
$$
]
]
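---

## Remaining hyper-priors

Since this full conditional is just a gamma distribution, its update is a single `rgamma()` call. A sketch mirroring the `\(a_n\)` and `\(b_n\)` above (the helper is hypothetical; the identical draw appears in the Gibbs sampler at the end of the module):

```r
#one draw from the gamma full conditional of sigma_0^2,
#given the current sigma^2_j's and nu_0 (rgamma uses shape and rate)
draw_sigma_0_sq <- function(sigma_sq, nu_0, a, b){
  J <- length(sigma_sq)
  rgamma(1, a + J*nu_0/2, b + (nu_0/2)*sum(1/sigma_sq))
}
```

---

## Remaining hyper-priors

- OK, that leaves us with one parameter to go, i.e., `\(\nu_0\)`. It turns out there is no simple conjugate/semi-conjugate prior for `\(\nu_0\)`.

--

- Given that we know how to do Metropolis/Metropolis-Hastings, we actually have many options here, but to keep this simple, let's follow the same path as what you (hopefully) did for this model in STA 360/601/602.

--

- That is, restrict `\(\nu_0\)` to be an integer (which makes sense when we think of it as being degrees of freedom, which also means it cannot be zero). With the restriction, we need a discrete distribution as the prior with support on `\(\nu_0 = 1, 2, 3, \ldots\)`.

--

- A popular choice is the geometric distribution with pmf `\(p(\nu_0) = (1-p)^{\nu_0-1} p\)`.

--

- However, we can rewrite the kernel as `\(\pi(\nu_0) \propto e^{-\alpha \nu_0}\)`. How did we get here from the geometric pmf and what is `\(\alpha\)`?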
---

## Final full conditional

- With this prior, `\(\pi(\nu_0 | \mu_1, \ldots, \mu_J, \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \sigma_0^2, Y)\)`
.block[
.small[
$$
`\begin{split}
& \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\sigma^2_j | \nu_0, \sigma_0^2) \right\} \cdot \pi(\nu_0)\\
& \propto \ \prod_{j=1}^{J} \ \mathcal{IG} \left(\sigma^2_j; \dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) \ \cdot \ e^{-\alpha \nu_0} \\
& \boldsymbol{=} \left[ \prod_{j=1}^{J} \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \left(\sigma^2_j\right)^{-\left(\dfrac{\nu_0}{2}+1 \right)} e^{-\dfrac{\nu_0\sigma_0^2}{2 (\sigma^2_j)}} \right] \cdot e^{-\alpha \nu_0} \\
& \boldsymbol{\propto} \left[ \left( \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \right)^J \cdot \left(\prod_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right)^{\left(\dfrac{\nu_0}{2}+1 \right)} \cdot e^{- \nu_0 \left[ \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \cdot e^{-\alpha \nu_0} \\
\end{split}`
$$
]
]

---

## Final full conditional

- That is, the full conditional is
.block[
.small[
$$
`\begin{split}
\pi(\nu_0 | \cdots \cdots ) & \boldsymbol{\propto} \left[ \left( \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \right)^J \cdot \left(\prod_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right)^{\left(\dfrac{\nu_0}{2}+1 \right)} \cdot e^{- \nu_0 \left[ \alpha + \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right], \\
\end{split}`
$$
]
]

  which is not the kernel of any standard distribution, and is unnormalized (i.e., it does not integrate to 1 in its current form).

--

- While this looks like a lot, it is relatively easy to compute in R, for a grid of `\(\nu_0\)` values.

--

- Technically, the support is `\(\nu_0 = 1, 2, 3, \ldots\)`, but we can compute the unnormalized distribution across, say, `\(\nu_0 = 1, 2, 3, \ldots, K\)` for some large `\(K\)`, re-normalize, and then sample.

---

## Final full conditional

- One more thing: computing these probabilities on the raw scale can be numerically problematic, particularly because of the products involved. It is a good idea to work on the log scale instead.

--

- That is,
.block[
.small[
$$
`\begin{split}
\pi(\nu_0 | \cdots \cdots ) & \boldsymbol{\propto} \left[ \left( \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \right)^J \cdot \left(\prod_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right)^{\left(\dfrac{\nu_0}{2}+1 \right)} \cdot e^{- \nu_0 \left[ \alpha + \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right]\\
\\
\Rightarrow \ \text{ln} \pi(\nu_0 | \cdots \cdots ) & \boldsymbol{\propto} \left(\dfrac{J\nu_0}{2} \right) \text{ln} \left( \dfrac{\nu_0\sigma_0^2}{2} \right) - J\text{ln}\left[ \Gamma\left(\dfrac{\nu_0}{2} \right) \right] \\
& \ \ \ \ + \left(\dfrac{\nu_0}{2}+1 \right) \left(\sum_{j=1}^{J} \text{ln} \left[\dfrac{1}{\sigma^2_j} \right] \right) \\
& \ \ \ \ - \nu_0 \left[ \alpha + \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right] \\
\end{split}`
$$
]
]
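---

## Final full conditional

In code, this log-scale grid evaluation takes only a few lines. A sketch (the helper is hypothetical, but the computation inside it is exactly the `\(\nu_0\)` update used in the Gibbs sampler below):

```r
#sample nu_0 from its full conditional: evaluate the log kernel on a
#grid, stabilize, exponentiate, and let sample() renormalize
draw_nu_0 <- function(sigma_sq, sigma_0_sq, alpha, nu_0_grid = 1:5000){
  J <- length(sigma_sq)
  log_kernel <- (J*nu_0_grid/2)*log(nu_0_grid*sigma_0_sq/2) -
    J*lgamma(nu_0_grid/2) +
    (nu_0_grid/2 + 1)*sum(log(1/sigma_sq)) -
    nu_0_grid*(alpha + (sigma_0_sq/2)*sum(1/sigma_sq))
  #subtract the max before exponentiating to avoid numerical underflow
  sample(nu_0_grid, 1, prob = exp(log_kernel - max(log_kernel)))
}
```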
---

## Full model

As a recap, the final model is:
.block[
.small[
$$
`\begin{split}
y_{ij} | \mu_j, \sigma^2_j & \sim \mathcal{N} \left( \mu_j, \sigma^2_j \right); \ \ \ i = 1,\ldots, n_j; \ \ \ j = 1, \ldots, J\\
\\
\mu_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J\\
\\
\sigma^2_j | \nu_0, \sigma_0^2 & \sim \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right); \ \ \ j = 1, \ldots, J\\
\\
\mu & \sim \mathcal{N}\left(\mu_0, \gamma^2_0\right)\\
\\
\tau^2 & \sim \mathcal{IG} \left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right)\\
\\
\pi(\nu_0) & \propto e^{-\alpha \nu_0} \\
\\
\sigma_0^2 & \sim \mathcal{Ga} \left(a,b\right).\\
\end{split}`
$$
]
]

---

## Gibbs sampler

```r
#Data summaries
#(the data are assumed to be in a data frame Y, sorted by group, with the
#response in Y[,"mathscore"] and the group label in Y[,"school"],
#as in the updates further below)
J <- length(unique(Y[,"school"])) #number of groups
ybar <- c(by(Y[,"mathscore"],Y[,"school"],mean)) #group sample means
s_j_sq <- c(by(Y[,"mathscore"],Y[,"school"],var)) #group sample variances
n <- c(table(Y[,"school"])) #number of observations in each group

#Hyperparameters for the priors
#(values left for you to choose based on prior information; see homework)
mu_0 <- ...
gamma_0_sq <- ...
eta_0 <- ...
tau_0_sq <- ...
alpha <- ...
a <- ...
b <- ...
#Grid values for sampling nu_0
nu_0_grid <- 1:5000

#Initial values for Gibbs sampler
theta <- ybar #theta vector holds all the mu_j's
sigma_sq <- s_j_sq
mu <- mean(theta)
tau_sq <- var(theta)
nu_0 <- 1
sigma_0_sq <- 100
```

---

## Gibbs sampler

```r
#first set number of iterations and burn-in, then set seed
n_iter <- 10000; burn_in <- 0.3*n_iter
set.seed(1234)

#Set empty matrices to save samples
SIGMA_SQ <- THETA <- matrix(nrow=n_iter, ncol=J)
OTHER_PAR <- matrix(nrow=n_iter, ncol=4)

#Now, to the Gibbs sampler
for(s in 1:(n_iter+burn_in)){
  
  #update the theta vector (all the mu_j's)
  tau_j_star <- 1/(n/sigma_sq + 1/tau_sq)
  mu_j_star <- tau_j_star*(ybar*n/sigma_sq + mu/tau_sq)
  theta <- rnorm(J,mu_j_star,sqrt(tau_j_star))
  
  #update the sigma_sq vector (all the sigma_sq_j's)
  nu_j_star <- nu_0 + n
  theta_long <- rep(theta,n) #expand to observation level; assumes Y is sorted by school
  nu_j_star_sigma_j_sq_star <- nu_0*sigma_0_sq +
    c(by((Y[,"mathscore"] - theta_long)^2,Y[,"school"],sum))
  sigma_sq <- 1/rgamma(J,(nu_j_star/2),(nu_j_star_sigma_j_sq_star/2))
  
  #update mu
  gamma_n_sq <- 1/(J/tau_sq + 1/gamma_0_sq)
  mu_n <- gamma_n_sq*(J*mean(theta)/tau_sq + mu_0/gamma_0_sq)
  mu <- rnorm(1,mu_n,sqrt(gamma_n_sq))
```

---

## Gibbs sampler

```r
  #update tau_sq
  eta_n <- eta_0 + J
  eta_n_tau_n_sq <- eta_0*tau_0_sq + sum((theta-mu)^2)
  tau_sq <- 1/rgamma(1,eta_n/2,eta_n_tau_n_sq/2)
  
  #update sigma_0_sq
  sigma_0_sq <- rgamma(1,(a + J*nu_0/2),(b + nu_0*sum(1/sigma_sq)/2))
  
  #update nu_0 by evaluating its log full conditional over the grid
  log_prob_nu_0 <- (J*nu_0_grid/2)*log(nu_0_grid*sigma_0_sq/2) -
    J*lgamma(nu_0_grid/2) +
    (nu_0_grid/2+1)*sum(log(1/sigma_sq)) -
    nu_0_grid*(alpha + sigma_0_sq*sum(1/sigma_sq)/2)
  nu_0 <- sample(nu_0_grid,1,
                 prob = exp(log_prob_nu_0 - max(log_prob_nu_0)))
  #subtracting the maximum log value before exponentiating is a standard
  #trick: it prevents numerical underflow from very negative logs
  #note that the sample function renormalizes the probabilities internally
  
  #save results only past burn-in
  if(s > burn_in){
    THETA[(s-burn_in),] <- theta
    SIGMA_SQ[(s-burn_in),] <- sigma_sq
    OTHER_PAR[(s-burn_in),] <- c(mu,tau_sq,sigma_0_sq,nu_0)
  }
}
colnames(OTHER_PAR) <- c("mu","tau_sq","sigma_0_sq","nu_0")
```

---

class: center, middle

# What's next?

### Move on to the readings for the next module!