# 第 73 章 隨機截距模型 random intercept model

## 73.1 隨機截距模型的定義

\begin{aligned} Y_{ij} & = \mu + u_j + \varepsilon_{ij} \\ \text{where } u_j & \stackrel{N.I.D}{\sim} N(0, \sigma_u^2) \\ \varepsilon_{ij} & \stackrel{N.I.D}{\sim} N(0, \sigma_\varepsilon^2) \\ u_j & \text{ are independent from } \varepsilon_{ij} \\ \end{aligned} \tag{73.1}

• $$\mu$$ 是總體均值;
• $$u_j$$ 是一個服從均值 0, 方差 (the population between cluster variance) 爲 $$\sigma_u^2$$ 的正態分布的隨機變量;
• $$\varepsilon_{ij}$$ 是隨機誤差，它也被認爲服從均值爲 0, 方差爲 $$\sigma_\varepsilon^2$$ 的正太分布，且這兩個隨機效應部分之間也是相互獨立的
• 從該模型估算的結果變量 $$Y_{ij}$$ 的方差是 $$\sigma_u^2 + \sigma_\varepsilon^2$$
• 隨機截距模型又被叫做是 方差成分模型 (variance-component model)，或者是單向隨機效應方差模型 (one-way random effects ANOVA model)

$Y_{ij} = \mu + \gamma_j + \varepsilon_{ij}$

• $$\mu$$ 也是總體均值;
• $$\sum_{j=1}^J \gamma_j = 0$$將各組不同截距之和強制爲零的過程;

\begin{aligned} \because Y_{1j} & = \mu + u_j + \varepsilon_{1j} \\ Y_{2j} & = \mu + u_j + \varepsilon_{2j} \\ \therefore \text{Cov}(Y_{1j}, Y_{2j}) & = \text{Cov}(u_j, u_j) + \text{Cov}(u_j, \varepsilon_{2j}) + \text{Cov}(\varepsilon_{1j}, u_j) + \text{Cov}(\varepsilon_{1j}, \varepsilon_{2j}) \\ & = \text{Cov}(u_j, u_j) = V(u_j, u_j)\\ & = \sigma_u^2 \end{aligned}

$\lambda = \frac{\text{Cov}(Y_{1j}, Y_{2j})}{\text{SD}(Y_{1j})\text{SD}(Y_{2j})} = \frac{\sigma_u^2}{\sigma_\varepsilon^2 + \sigma_u^2}$

This is the within-cluster or intra-class correlation, that we will denote $$\lambda$$. Note that it is also the proportion of total variance that is accounted for by the cluster.

## 73.2 隨機截距模型的參數估計

\begin{aligned} \hat\mu & = \bar{Y} \\ \hat\sigma_\varepsilon^2 & = \text{Mean square error, MSE} \\ \hat\sigma_u^2 & = \frac{\text{Model Sum of Squares, MSS}}{Jn} - \frac{\hat\sigma^2_\varepsilon}{n} \end{aligned} \tag{73.2}

\begin{aligned} \widetilde{\sigma}_u^2 & = \frac{\text{MSS}}{(J-1)n}- \frac{\hat\sigma_\varepsilon^2}{n} \\ & = \frac{\text{MSS} - \text{MSE}(J-1)}{(J-1)n} \\ & = \frac{\text{MMS}(J-1) - \text{MSE}(J-1)}{(J-1)n} \\ & = \frac{\text{MMS} - \text{MSE}}{n} \end{aligned}

## 73.3 如何在 R 中進行隨機截距模型的擬合

pefr <- read_dta("../backupfiles/pefr.dta")
# the data are in wide format
head(pefr)
## # A tibble: 6 × 5
##      id   wp1   wp2   wm1   wm2
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1   494   490   512   525
## 2     2   395   397   430   415
## 3     3   516   512   520   508
## 4     4   434   401   428   444
## 5     5   476   470   500   500
## 6     6   557   611   600   625
# transform data into long format
pefr_long <- pefr %>%
gather(key, value, -id) %>%
separate(key, into = c("measurement", "occasion"), sep = 2) %>%
arrange(id, occasion) %>%
pefr_long
## # A tibble: 34 × 4
##       id occasion    wm    wp
##    <dbl> <chr>    <dbl> <dbl>
##  1     1 1          512   494
##  2     1 2          525   490
##  3     2 1          430   395
##  4     2 2          415   397
##  5     3 1          520   516
##  6     3 2          508   512
##  7     4 1          428   434
##  8     4 2          444   401
##  9     5 1          500   476
## 10     5 2          500   470
## # … with 24 more rows

M0 <- lme(fixed = wm ~ 1, random  = ~ 1 | id, data = pefr_long, method = "REML")
summary(M0)
## Linear mixed-effects model fit by REML
##   Data: pefr_long
##         AIC       BIC     logLik
##   366.75843 371.24795 -180.37921
##
## Random effects:
##  Formula: ~1 | id
##         (Intercept)  Residual
## StdDev:   110.39701 19.910835
##
## Fixed effects:  wm ~ 1
##                 Value Std.Error DF   t-value p-value
## (Intercept) 453.91176 26.992068 17 16.816487       0
##
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max
## -2.444435579 -0.335076940  0.037044891  0.350983659  2.377059741
##
## Number of Observations: 34
## Number of Groups: 17
M1 <- lmer(wm ~ (1|id), data = pefr_long, REML = TRUE)
summary(M1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
## Formula: wm ~ (1 | id)
##    Data: pefr_long
##
## REML criterion at convergence: 360.8
##
## Scaled residuals:
##       Min        1Q    Median        3Q       Max
## -2.444436 -0.335077  0.037045  0.350984  2.377060
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  id       (Intercept) 12187.51 110.397
##  Residual               396.44  19.911
## Number of obs: 34, groups:  id, 17
##
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)
## (Intercept)  453.912     26.992  16.000  16.817 1.36e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$\hat\lambda = \frac{\hat\sigma_u^2}{(\hat\sigma_u^2 + \hat\sigma_\varepsilon^2)} = \frac{110.40^2}{110.40^2 + 19.91^2} = 0.97$

## 73.4 隨機截距模型中的統計推斷

### 73.4.1 固定效應部分的推斷

$\hat{\text{SE}}(\hat\mu) = \sqrt{\frac{n\hat\sigma_u^2 + \hat\sigma_\varepsilon^2}{Jn}}$

$\hat{\text{SE}}(\hat\mu^F) = \sqrt{\frac{\hat\sigma_\varepsilon^2}{Jn}}$

$\hat{\text{SE}}(\hat\mu^F) < \hat{\text{SE}}(\hat\mu)$

\begin{aligned} \hat\mu & = \frac{\sum_jw_j\bar{Y}_{\cdot j}}{\sum_j w_j} \\ \text{Where } w_j & = \frac{1}{\sigma_u^2 + \sigma_\varepsilon^2/n_j} \end{aligned}

$z = \frac{\hat\mu}{\hat{\text{SE}}(\hat\mu)}$ 總體均值的 95% 信賴區間的計算式就是:

$\hat\mu \pm z_{0.975}\hat{\text{SE}}(\hat\mu)$

### 73.4.2 隨機效應部分的推斷

M0 <- lme(fixed = wm ~ 1, random  = ~ 1 | id, data = pefr_long, method = "REML")
M0_fixed<- lm(wm ~ 1, data = pefr_long)
anova(M0, M0_fixed)
##          Model df       AIC       BIC     logLik   Test   L.Ratio p-value
## M0           1  3 366.75843 371.24795 -180.37921
## M0_fixed     2  2 411.71916 414.71217 -203.85958 1 vs 2 46.960731  <.0001

• 在坑爹的 STATA 裏面混合效應模型居然還會輸出隨機效應方差的 “標準誤”，該數字請你無視之。
• 當樣本擁有足夠多的樣本量 (其實是第二階層的層數)，極大似然法 (ML) 和限制性極大似然法 (REML) 給出的結果會相當接近。
• 當你比較兩個不是互爲嵌套 (nested) 的模型時，可以使用 AIC/BIC 指標。

## 73.5 Practical Hierarchical 02

### 73.5.1 數據

1. GHQ 數據
該數據包含 12 名學生前後兩次回答 General Health Questionnaire (GHQ) 問卷獲得的數據。該問卷用於測量學生的心理壓力，其變量名和含義如下：
id        Student identifier
GHQ1      General Health Questionnaire score- 1st occasion
GHQ2      General Health Questionnaire score- 2nd occasion
1. Siblings 數據
該數據是來自一項對 3978 名媽媽關於她們 8604 名孩子的出生體重及健康狀況的問卷調查。該數據的變量名和含義如下：
momid     Mother identifier
idx       Baby identifier
mage      Maternal age (years)
meduc     Maternal education
gestat    gestational age (weeks)
birwt     Birth weight (g)
smoke     Maternal smoking (0 = no, 1 = yes)
male      Baby boy (0 = no, 1 = yes)
year      Year of birth
married   Maternal marital status (0 = no, 1 = yes)
hsgrad    Maternal high school education (0 = no, 1 = yes)
black     Maternal race (1 = black, 0 = other)

### 73.5.2 讀入 GHQ 數據，探索其內容，該數據是否是平衡數據 (balanced)？計算每名學生的兩次問卷成績平均分。

ghq <- read_dta("../backupfiles/ghq.dta")
ghq
## # A tibble: 12 × 3
##       id  GHQ1  GHQ2
##    <dbl> <dbl> <dbl>
##  1     1    12    12
##  2     2     8     7
##  3     3    22    24
##  4     4    10    14
##  5     5    10     8
##  6     6     6     4
##  7     7     8     5
##  8     8     4     6
##  9     9    14    14
## 10    10     6     5
## 11    11     2     5
## 12    12    22    16
ghq <- ghq %>%
mutate(mean = (GHQ1 + GHQ2)/2)

# each student has 2 observations (i.e. n_j = n = 2)
# and therefore the data are balanced.
# the overall mean is 10.167 and its SD is 6.073
ghq %>%
summarise(OverallMean = mean(mean), SD = sd(mean))
## # A tibble: 1 × 2
##   OverallMean    SD
##         <dbl> <dbl>
## 1        10.2  6.07

### 73.5.3 把數據從寬 (wide) 改變成長 (long) 的形式

# transform data into long format
ghq_long <- ghq %>%
gather(key, value, -id, -mean) %>%
separate(key, into = c("measurement", "occasion"), sep = 3) %>%
arrange(id, occasion) %>%
ghq_long
## # A tibble: 24 × 4
##       id  mean occasion   GHQ
##    <dbl> <dbl> <chr>    <dbl>
##  1     1  12   1           12
##  2     1  12   2           12
##  3     2   7.5 1            8
##  4     2   7.5 2            7
##  5     3  23   1           22
##  6     3  23   2           24
##  7     4  12   1           10
##  8     4  12   2           14
##  9     5   9   1           10
## 10     5   9   2            8
## # … with 14 more rows
# after reshaping there are 24 records. the summary statistics are
# overall mean sd and min max

ghq_long %>%
summarise(OverallMean = mean(GHQ), SD = sd(GHQ), Min = min(GHQ), Max = max(GHQ))
## # A tibble: 1 × 4
##   OverallMean    SD   Min   Max
##         <dbl> <dbl> <dbl> <dbl>
## 1        10.2  6.10     2    24
# between groups mean sd and min

ghq_long %>%
distinct(id, .keep_all= TRUE) %>%
summarise(Bet_mean = mean(mean), Bet_sd = sd(mean), Bet_min = min(mean), Bet_max = max(mean))
## # A tibble: 1 × 4
##   Bet_mean Bet_sd Bet_min Bet_max
##      <dbl>  <dbl>   <dbl>   <dbl>
## 1     10.2   6.07     3.5      23
# within groups mean sd and min (came from the difference between
# the overall mean and the within difference) observations for
# each group = 2
ghq_long <- ghq_long %>%
mutate(dif_GHQ = mean(GHQ) - (GHQ - mean))

ghq_long %>%
summarise(N = n(),
Wit_mean = mean(dif_GHQ), Wit_sd = sd(dif_GHQ),
Wit_min = min(dif_GHQ), Wit_max = max(dif_GHQ))
## # A tibble: 1 × 5
##       N Wit_mean Wit_sd Wit_min Wit_max
##   <int>    <dbl>  <dbl>   <dbl>   <dbl>
## 1    24     10.2   1.38    7.17    13.2

GHQ 的分佈並不左右對稱。

### 73.5.4 對數據按照 id 分層進行 ANOVA

with(ghq_long, anova(lm(GHQ~factor(id))))
## Analysis of Variance Table
##
## Response: GHQ
##            Df  Sum Sq Mean Sq F value     Pr(>F)
## factor(id) 11 811.333 73.7576 20.1157 4.7782e-06 ***
## Residuals  12  44.000  3.6667
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#library(lme4)
( fit <- lmer(GHQ ~ (1|id), data=ghq_long) )
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: GHQ ~ (1 | id)
##    Data: ghq_long
## REML criterion at convergence: 131.3492
## Random effects:
##  Groups   Name        Std.Dev.
##  id       (Intercept) 5.9199
##  Residual             1.9149
## Number of obs: 24, groups:  id, 12
## Fixed Effects:
## (Intercept)
##      10.167

$$\sigma_u, \sigma_e$$ 的估計值分別是 5.92 (between)， 1.91 (within)。可以計算層間相關係數 (intra-class correlation) $$\hat\lambda = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e} = 0.905$$。且 $$\hat\sigma_u = \sqrt{\frac{73.8 - 3.7}{2}} = 5.92$$，和前一次練習一樣地，這個隨機效應的方差，可以通過方差分析表格來直接手動計算 (當且僅當分層數據是平衡狀態的)。和前面計算的樣本數據比較，樣本層間標準差是高估了的 (sample between variance = 6.073 > 5.92)，相反樣本層內標準差 (within sd) 則是低估了的 (sample within sd = 1.383 < 1.91)。兩個層內標準差的關係是：

$\sqrt{1.383^2\times\frac{23}{12}} = 1.91$

### 73.5.5 用 R 裏的 nlme 包，使用限制性極大似然法 (restricted maximum likelihood, REML) 擬合截距混合效應模型，比較其結果和前文中隨機效應 ANOVA 的結果

summary(nlme::lme(fixed = GHQ ~ 1, random = ~ 1 | id, data = ghq_long, method = "REML"))
## Linear mixed-effects model fit by REML
##   Data: ghq_long
##         AIC       BIC     logLik
##   137.34924 140.75573 -65.674622
##
## Random effects:
##  Formula: ~1 | id
##         (Intercept)  Residual
## StdDev:   5.9199181 1.9148548
##
## Fixed effects:  GHQ ~ 1
##                 Value Std.Error DF   t-value p-value
## (Intercept) 10.166667 1.7530632 12 5.7993727  0.0001
##
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max
## -1.337372043 -0.578482697  0.073557531  0.414059981  1.796024881
##
## Number of Observations: 24
## Number of Groups: 12

### 73.5.6 用極大似然法 (maximum likelihood, ML) method = "ML" 重新擬合前面的混合效應模型，比較結果有什麼不同。

#( fit <- lmer(GHQ ~ (1|id), data=ghq_long, REML = FALSE) ) # same but from lme4 package

summary(lme(fixed = GHQ ~ 1, random = ~ 1 | id, data = ghq_long, method = "ML"))
## Linear mixed-effects model fit by maximum likelihood
##   Data: ghq_long
##         AIC       BIC     logLik
##   140.26571 143.79987 -67.132857
##
## Random effects:
##  Formula: ~1 | id
##         (Intercept)  Residual
## StdDev:   5.6543976 1.9148545
##
## Fixed effects:  GHQ ~ 1
##                 Value Std.Error DF  t-value p-value
## (Intercept) 10.166667 1.7145299 12 5.929711  0.0001
##
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max
## -1.31652454 -0.58359637  0.08024454  0.40422622  1.81687284
##
## Number of Observations: 24
## Number of Groups: 12

### 73.5.7 用簡單線性迴歸擬合一個固定效應模型

Fixed_reg <- lm(GHQ-mean(GHQ) ~ 0 + factor(id), data = ghq_long)
summary(Fixed_reg)
##
## Call:
## lm(formula = GHQ - mean(GHQ) ~ 0 + factor(id), data = ghq_long)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##     -3     -1      0      1      3
##
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)
## factor(id)1    1.8333     1.3540  1.3540 0.2006847
## factor(id)2   -2.6667     1.3540 -1.9695 0.0724256 .
## factor(id)3   12.8333     1.3540  9.4780 6.371e-07 ***
## factor(id)4    1.8333     1.3540  1.3540 0.2006847
## factor(id)5   -1.1667     1.3540 -0.8616 0.4057744
## factor(id)6   -5.1667     1.3540 -3.8158 0.0024580 **
## factor(id)7   -3.6667     1.3540 -2.7080 0.0190252 *
## factor(id)8   -5.1667     1.3540 -3.8158 0.0024580 **
## factor(id)9    3.8333     1.3540  2.8311 0.0151447 *
## factor(id)10  -4.6667     1.3540 -3.4466 0.0048356 **
## factor(id)11  -6.6667     1.3540 -4.9237 0.0003516 ***
## factor(id)12   8.8333     1.3540  6.5238 2.836e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.9149 on 12 degrees of freedom
## Multiple R-squared:  0.94856,    Adjusted R-squared:  0.89712
## F-statistic: 18.439 on 12 and 12 DF,  p-value: 6.8362e-06

### 73.5.8 計算這些隨機截距的均值和標準差

mean(Fixed_reg$coefficients) ## [1] 5.9211895e-16 sd(Fixed_reg$coefficients)
## [1] 6.0727908

### 73.5.9 忽略掉所有的分層和解釋變量擬合 GHQ 的簡單線性迴歸

Fixed_simple <- lm(GHQ ~ 1, data = ghq_long)
summary(Fixed_simple)
##
## Call:
## lm(formula = GHQ ~ 1, data = ghq_long)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.1667 -4.4167 -2.1667  3.8333 13.8333
##
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  10.1667     1.2448  8.1673 3.001e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.0982 on 23 degrees of freedom

### 73.5.10 用分層的穩健法 (三明治標準誤法) 計算簡單線性迴歸時，截距的標準誤差，和簡單線性迴歸時的結果作比較

# sandwich robust method with cluster id

robustReg <- clubSandwich::coef_test(Fixed_simple, vcov = "CR1", cluster = ghq_long$id) rob.std.err <- robustReg$SE
naive.std.err<-summary(Fixed_simple)$coefficients[,2] better.table <- cbind("Estimate" = coef(Fixed_simple), "Naive SE" = naive.std.err, "Pr(>|z|)" = 2 * pt(abs(coef(Fixed_simple)/naive.std.err), df=nrow(ghq_long)-2, lower.tail = FALSE), "LL" = coef(Fixed_simple) - 1.96 * naive.std.err, "UL" = coef(Fixed_simple) + 1.96 * naive.std.err, "Robust SE" = rob.std.err, "Pr(>|z|)" = 2 * pt(abs(coef(Fixed_simple)/rob.std.err), df=nrow(ghq_long)-2, lower.tail = FALSE), "LL" = coef(Fixed_simple) - qt(df=robustReg$df, 0.975) * rob.std.err,
"UL" = coef(Fixed_simple) + qt(df=robustReg\$df, 0.975) * rob.std.err)
rownames(better.table)<-c("Constant")
better.table
##           Estimate  Naive SE      Pr(>|z|)        LL        UL Robust SE      Pr(>|z|)        LL
## Constant 10.166667 1.2447959 4.1792464e-08 7.7268666 12.606467 1.7530637 7.7968698e-06 6.3081995
##                 UL
## Constant 14.025134

### 73.5.11 讀入 siblings 數據。先總結嬰兒的出生體重，思考這個數據中嬰兒出生體重之間是否可能存在關聯性？它的來源是哪裏。用這個數據擬合兩個混合效應模型 (ML, REML)，不加入任何解釋變量。

siblings <- read_dta("../backupfiles/siblings.dta")
Fixed_ml <- lme(fixed = birwt ~ 1, random = ~ 1 | momid, data = siblings, method = "ML")
summary(Fixed_ml)
## Linear mixed-effects model fit by maximum likelihood
##   Data: siblings
##         AIC       BIC     logLik
##   130956.97 130978.15 -65475.486
##
## Random effects:
##  Formula: ~1 | momid
##         (Intercept)  Residual
## StdDev:   368.28657 377.65778
##
## Fixed effects:  birwt ~ 1
##                Value Std.Error   DF   t-value p-value
## (Intercept) 3467.969 7.1380684 4626 485.84138       0
##
## Standardized Within-Group Residuals:
##           Min            Q1           Med            Q3           Max
## -6.2745852674 -0.4860398569  0.0036050019  0.5054348620  4.0506129265
##
## Number of Observations: 8604
## Number of Groups: 3978
Fixed_reml <- lme(fixed = birwt ~ 1, random = ~ 1 | momid, data = siblings, method = "REML")
summary(Fixed_reml)
## Linear mixed-effects model fit by REML
##   Data: siblings
##        AIC       BIC     logLik
##   130951.2 130972.38 -65472.601
##
## Random effects:
##  Formula: ~1 | momid
##         (Intercept)  Residual
## StdDev:   368.35597 377.65768
##
## Fixed effects:  birwt ~ 1
##                 Value Std.Error   DF   t-value p-value
## (Intercept) 3467.9688 7.1385551 4626 485.80822       0
##
## Standardized Within-Group Residuals:
##           Min            Q1           Med            Q3           Max
## -6.2743063896 -0.4860193974  0.0035299789  0.5053550371  4.0503923656
##
## Number of Observations: 8604
## Number of Groups: 3978