class: center, middle, inverse, title-slide

# STA 610L: Module 2.2

## Random effects ANOVA (estimation)

### Dr. Olanrewaju Michael Akande

---

## Maximum likelihood estimation

Recall our random effects ANOVA model for the bike data. That is,

`$$y_{ij}=\mu+\alpha_j+\varepsilon_{ij},$$`

where `\(\varepsilon_{ij} \overset{iid}{\sim} N(0,\sigma^2)\)` `\(\perp\)` `\(\alpha_j \overset{iid}{\sim} N(0,\tau^2)\)`.

`\(y_{ij}\)` indicates the passing distance between the car and the bike, and the `\(\alpha_j\)` represent the effects of the different distances between the bike and the curb.

Also, recall the general linear mixed effects model representation

`$$Y=X\beta+Zb+\varepsilon,$$`

with `\(\Sigma = \text{Var}(Y) = \tau^2ZZ'+\sigma^2I\)`.

---

## Maximum likelihood estimation

Since our `\(N=nJ\)` outcomes jointly follow a multivariate Gaussian distribution, the likelihood is given by

`$$(2\pi)^{-\frac{N}{2}}|\Sigma|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(y-X\beta)'\Sigma^{-1}(y-X\beta)\right],$$`

which we then need to maximize.

Since we often work with log-likelihoods, write

`\begin{eqnarray*} \ell(y, \beta,\Sigma)&=&-\frac{1}{2}\left[N\log(2\pi) + \log |\Sigma| + (y-X\beta)'\Sigma^{-1}(y-X\beta) \right] \\ &\propto& \log |\Sigma| + (y-X\beta)'\Sigma^{-1}(y-X\beta), \end{eqnarray*}`

which we then minimize (the sign flips when the `\(-\frac{1}{2}\)` factor is dropped) in order to find the MLE.

Peter Hoff's notes cover this in a bit more detail, but we can just do it directly in R, so let's do that.

---

## MLE for bike data

Actually, we can let the `lmer` function do the work for us.

```r
load("data/PsychBikeData.RData")

library(lme4)

fit.ml=lmer(`passing distance` ~ (1 | kerb), REML=FALSE, data = PsychBikeData)
summary(fit.ml)
```

---

## MLE for bike data

.large[

```
## Loading required package: Matrix
```

```
## 
## Attaching package: 'Matrix'
```

```
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
```

```
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: `passing distance` ~ (1 | kerb)
##    Data: PsychBikeData
## 
##      AIC      BIC   logLik deviance df.resid 
##   2028.7   2046.0  -1011.4   2022.7     2352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5113 -0.6674 -0.0948  0.5511  6.3949 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  kerb     (Intercept) 0.009206 0.09595 
##  Residual             0.137203 0.37041 
## Number of obs: 2355, groups:  kerb, 5
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1.54023    0.04363    35.3
```
]

Our ML estimates of `\((\mu,\tau^2,\sigma^2)\)` for the bike data are `\((\widehat{\mu},\widehat{\tau}^2,\widehat{\sigma}^2)=(1.540, 0.009, 0.137)\)`.
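---

## Aside: maximizing the likelihood directly

As a quick check on the `lmer` fit, here is a minimal sketch that maximizes the marginal log-likelihood with `optim()` (the helper name `neg_loglik`, the starting values, and the choice of `BFGS` are arbitrary). Since observations in different `kerb` groups are independent, `\(\Sigma\)` is block diagonal and we can work group by group.

```r
y    <- PsychBikeData$`passing distance`
kerb <- PsychBikeData$kerb

neg_loglik <- function(theta) {
  # theta = (mu, log sigma^2, log tau^2); logs keep the variances positive
  mu   <- theta[1]
  sig2 <- exp(theta[2])
  tau2 <- exp(theta[3])
  ll   <- 0
  for (g in split(y, kerb)) {   # groups are independent blocks
    n_g <- length(g)
    r   <- g - mu
    # each block of Sigma is sigma^2 I + tau^2 J, whose determinant and
    # inverse are available in closed form (no N x N matrix needed)
    quad <- (sum(r^2) - tau2 * sum(r)^2 / (sig2 + n_g * tau2)) / sig2
    ll   <- ll - 0.5 * (n_g * log(2 * pi) + (n_g - 1) * log(sig2) +
                          log(sig2 + n_g * tau2) + quad)
  }
  -ll
}

opt <- optim(c(mean(y), log(var(y)), log(0.01)), neg_loglik, method = "BFGS")
c(mu = opt$par[1], sigma2 = exp(opt$par[2]), tau2 = exp(opt$par[3]))
# should agree (up to optimization error) with lmer's (1.540, 0.137, 0.009)
```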
---

## Restricted maximum likelihood estimation

REML (restricted or residual maximum likelihood) estimation is quite popular for variance component estimation.

Features of REML estimation include the following:

- it is based on a likelihood function that only uses information that does not depend on the fixed effects (we define new outcomes orthogonal to the mean), and
- its estimates are generally less biased than ML estimates (and unbiased in certain special cases).

---

## MLE for one-sample setting

First, recall a one-sample setting in which we wish to estimate the mean `\(\mu\)` and variance `\(\sigma^2\)` using the model

`$$y_{i}=\mu+\varepsilon_{i}, \ \ i=1,\ldots,n$$`

with `\(\varepsilon_i \sim N\left(0,\sigma^2\right)\)`.

Then our (negative) log-likelihood is proportional to `\(n\log\sigma^2 + \frac{\sum(y_i-\mu)^2}{\sigma^2}\)`.

To find the MLEs of `\(\mu\)` and `\(\sigma^2\)`, take derivatives, set them to zero, and solve to obtain `\(\widehat{\mu}=\overline{y}\)` and `\(\widehat{\sigma}^2=\frac{\sum (y_i-\overline{y})^2}{n}\)`.

Of course, we typically don't use the MLE to estimate `\(\sigma^2\)` because of its well-known small-sample bias, instead using the unbiased estimate `\(s^2=\frac{\sum (y_i-\overline{y})^2}{n-1}=\frac{n}{n-1}\widehat{\sigma}^2\)`.

---

## REML for simplest case

REML estimates are often based on a full-rank set of error contrasts -- the basic idea is to retain the information in the data about the variance while eliminating the fixed effects.

The full residuals `\(\varepsilon_i=y_i-\mu\)` contain all the information in the likelihood about the variance parameter `\(\sigma^2\)`.

Because the residuals `\(y_i-\overline{y}\)` are independent of `\(\overline{y}\)` (and their distribution does not involve `\(\mu\)`), we can decompose our log-likelihood into a residual piece and a piece for the sample mean:

`$$\ell(y,\mu,\sigma^2)=\ell(\varepsilon,\mu,\sigma^2)+\ell(\overline{y},\mu,\sigma^2).$$`

We know `\(\widehat{\mu}=\overline{y}\sim N\left(\mu,\frac{\sigma^2}{n}\right)\)`, and so `\(\ell(\overline{y},\mu,\sigma^2)\propto \log \frac{\sigma^2}{n} + \frac{(\overline{y}-\mu)^2}{\frac{\sigma^2}{n}}\)`, which reduces to `\(\log \sigma^2 - \log n\)` once we plug in the MLE `\(\overline{y}\)` for `\(\mu\)`.

A slightly different approach actually integrates `\(\mu\)` out of the original likelihood.

---

## REML for simplest case

Then

`$$\ell(\varepsilon,\mu,\sigma^2) \propto n\log \sigma^2 + \frac{\sum (y_i-\mu)^2}{\sigma^2}-\log \sigma^2+\log n,$$`

which is equivalent (up to an additive constant) to

`$$(n-1)\log \sigma^2 + \frac{\sum (y_i-\mu)^2}{\sigma^2},$$`

which looks just like our ML objective except that the multiplier of `\(\log\sigma^2\)` is `\(n-1\)` instead of `\(n\)`. It's straightforward to show the optimum is `\(\widehat{\sigma}^2_{REML}=\frac{\sum (y_i-\overline{y})^2}{n-1}\)` (with `\(\mu\)` replaced by its estimate `\(\overline{y}\)`), i.e., the usual unbiased estimate `\(s^2\)`.

We can follow a similar approach for the random effects ANOVA model.

Because they are generally less biased than ML estimates, REML estimates are typically the default frequentist estimates provided by many software packages.

---

## REML estimates for the bike data

```r
fit.reml=lmer(`passing distance` ~ (1 | kerb), REML=TRUE, data = PsychBikeData)
summary(fit.reml)
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: `passing distance` ~ (1 | kerb)
##    Data: PsychBikeData
## 
## REML criterion at convergence: 2027
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5132 -0.6647 -0.0940  0.5498  6.3978 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  kerb     (Intercept) 0.01157  0.1076  
##  Residual             0.13720  0.3704  
## Number of obs: 2355, groups:  kerb, 5
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1.54008    0.04876   31.59
```

Our REML estimates for the bike data are `\((\widehat{\mu},\widehat{\tau}^2,\widehat{\sigma}^2)=(1.540, 0.012, 0.137)\)`.

---

## Empirical Bayes

When we have random effects in a model, the standard frequentist estimates of these random quantities are called *empirical Bayes* estimates, regardless of whether we obtain the other parameter estimates using ML or REML.

---

## Empirical Bayes

Recall our group means formulation:

`\begin{eqnarray*} y_{ij}&=&\mu_j+\varepsilon_{ij}\\ \mu_1,\cdots,\mu_J &\overset{iid}{\sim}& N(\mu, \tau^2) \\ \varepsilon_{ij} &\overset{iid}{\sim} & N(0,\sigma^2). \end{eqnarray*}`

Suppose `\((\mu, \tau^2, \sigma^2)\)` are known exactly, and consider estimating `\(\mu_j\)` with an estimator that is a linear function of the group sample mean, `\(\widehat{\mu}_j=a\overline{y}_j+b\)`.

Then one can show that the MSE `\(E[(\mu_j-\widehat{\mu}_j)^2]\)` is minimized if `\(a=\frac{\frac{n_j}{\sigma^2}}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}}\)` and `\(b=(1-a)\mu\)`, so that `\(\widehat{\mu}_j=w_j \overline{y}_j+(1-w_j)\mu\)`, where `\(w_j=\frac{\frac{n_j}{\sigma^2}}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}}\)`.

---

## Empirical Bayes

If we knew `\((\mu, \tau^2,\sigma^2)\)`, this estimate would be the *Bayes estimate*; however, we do not know these parameters and instead estimate them from the data, so that we use

`\(\widehat{\mu}_j=\widehat{w}_j \overline{y}_j+(1-\widehat{w}_j)\widehat{\mu}\)`, where `\(\widehat{w}_j=\frac{\frac{n_j}{\widehat{\sigma}^2}}{\frac{n_j}{\widehat{\sigma}^2}+\frac{1}{\widehat{\tau}^2}}\)`.

This is called an *empirical Bayes estimate*, because our unknown parameters have been replaced by "empirical" estimates from the data.

While this estimate is widely used, it has several unsatisfactory qualities, including a standard variance estimate that understates the uncertainty in `\(\widehat{\mu}_j\)`.

This is great motivation for considering Bayesian approaches when formal comparisons among groups modeled with random effects are desired.

---

## EB Estimates of group means for bike data

```r
table(PsychBikeData$kerb); mean(PsychBikeData$`passing distance`)
```

```
## 
## 0.25  0.5 0.75    1 1.25 
##  670  545  339  469  332
```

```
## [1] 1.563912
```

```r
tapply(PsychBikeData$`passing distance`,PsychBikeData$kerb,mean)
```

```
##     0.25      0.5     0.75        1     1.25 
## 1.698054 1.590473 1.505519 1.490584 1.412813
```

```r
coef(fit.ml)
```

```
## $kerb
##      (Intercept)
## 0.25    1.694619
## 0.5     1.589136
## 0.75    1.506981
## 1       1.492113
## 1.25    1.418287
## 
## attr(,"class")
## [1] "coef.mer"
```

---

## EB Estimates of group means for bike data

```r
tapply(PsychBikeData$`passing distance`,PsychBikeData$kerb,mean)
```

```
##     0.25      0.5     0.75        1     1.25 
## 1.698054 1.590473 1.505519 1.490584 1.412813
```

```r
coef(fit.reml)
```

```
## $kerb
##      (Intercept)
## 0.25    1.695307
## 0.5     1.589401
## 0.75    1.506687
## 1       1.491805
## 1.25    1.417201
## 
## attr(,"class")
## [1] "coef.mer"
```

Here we see only a slight shrinkage back towards the overall mean, due in large part to the large sample sizes within the curb distances.
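---

## EB estimates by hand

To see the weights `\(\widehat{w}_j\)` at work, here is a minimal sketch (the object names are arbitrary) that rebuilds the empirical Bayes group means from the REML variance components; it should reproduce `coef(fit.reml)` from the previous slide.

```r
n_j    <- table(PsychBikeData$kerb)                    # group sizes
ybar_j <- tapply(PsychBikeData$`passing distance`,
                 PsychBikeData$kerb, mean)             # group sample means
mu_hat <- fixef(fit.reml)[1]                           # estimate of mu
vc     <- as.data.frame(VarCorr(fit.reml))
tau2   <- vc$vcov[vc$grp == "kerb"]                    # between-group variance
sig2   <- vc$vcov[vc$grp == "Residual"]                # within-group variance

w_j <- (n_j / sig2) / (n_j / sig2 + 1 / tau2)          # shrinkage weights
w_j * ybar_j + (1 - w_j) * mu_hat                      # compare with coef(fit.reml)
```

The weights are all close to 1 here, which is why the shrinkage towards `\(\widehat{\mu}\)` is so slight.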
---

class: center, middle

# What's next?

### Move on to the readings for the next module!