# 第 82 章 缺失數據 Missing data 1

## 82.7 Practical 10-Hier

### 82.7.1 數據缺失產生的影響，缺失機制，和多重插補法

1. 數據缺失對結果估計可能造成的影響，包括可能帶來的偏倚 Bias，和對參數估計精確度的影響。
2. 如何分析一個變量的缺失模式。
3. 對單一不完整的觀察變量進行簡單的多重插補。

### 82.7.2 二進制變量的缺失 “class size study”

classp3dta <- read_dta("../backupfiles/classp3.dta")
head(classp3dta)
## # A tibble: 6 x 5
##   uniqueid nmatpre nlitpost   sen sen_m
##      <dbl>   <dbl>    <dbl> <dbl> <dbl>
## 1      706  -1.21    -1.31      1     1
## 2      731  -1.70    -2.41      1     1
## 3     7345  -1.70    -1.43      0     0
## 4     1154  -0.980   -1.99      1     1
## 5     9071  -0.258   -0.255     0    NA
## 6     4875  -0.980   -0.255     0     0
names(classp3dta)
## [1] "uniqueid" "nmatpre"  "nlitpost" "sen"      "sen_m"

Name Details
uniqueid Unique pupil id
nmatpre Pre-reception maths score
nlitpost Post-reception literacy score
sen Special educational needs (1 = yes, 0 = no)
sen_m Special educational needs, with a number of missing values

### 82.7.3 完整數據分析結果

classmissing <- lm(nlitpost ~ nmatpre + sen, data = classp3dta)
summary(classmissing)
##
## Call:
## lm(formula = nlitpost ~ nmatpre + sen, data = classp3dta)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3.00430 -0.52708 -0.01955  0.52983  3.35295
##
## Coefficients:
##              Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)  0.045178   0.012079   3.7403 0.0001859 ***
## nmatpre      0.584829   0.012402  47.1552 < 2.2e-16 ***
## sen         -0.432446   0.042855 -10.0910 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.80298 on 4870 degrees of freedom
## Multiple R-squared:  0.36147,    Adjusted R-squared:  0.36121
## F-statistic: 1378.4 on 2 and 4870 DF,  p-value: < 2.22e-16
jtools::summ(classmissing, digits = 6, confint = TRUE, exp = FALSE)
 Observations 4873 Dependent variable nlitpost Type OLS linear regression
 F(2,4870) 1378.45 R² 0.36147 Adj. R² 0.361208
Est. 2.5% 97.5% t val. p
(Intercept) 0.045178 0.021498 0.068857 3.740335 0.000186
nmatpre 0.584829 0.560515 0.609143 47.155211 0.000000
sen -0.432446 -0.516461 -0.348432 -10.091018 0.000000
Standard errors: OLS

##
## . use "../backupfiles/classp3.dta", cl. describe
##
## Contains data from ../backupfiles/classp3.dta
##   obs:         4,873
##  vars:             5                          31 Jan 2009 21:58
## -------------------------------------------------------------------------------
##               storage   display    value
## variable name   type    format     label      variable label
## -------------------------------------------------------------------------------
## uniqueid        float   %9.0g
## nmatpre         float   %9.0g
## nlitpost        float   %9.0g
## sen             float   %9.0g
## sen_m           float   %9.0g
## -------------------------------------------------------------------------------
## Sorted by:
##
## . regress nlitpost nmatpre sen
##
##       Source |       SS           df       MS      Number of obs   =     4,873
## -------------+----------------------------------   F(2, 4870)      =   1378.45
##        Model |  1777.58785         2  888.793924   Prob > F        =    0.0000
##     Residual |   3140.0696     4,870  .644778153   R-squared       =    0.3615
## -------------+----------------------------------   Adj R-squared   =    0.3612
##        Total |  4917.65745     4,872   1.0093714   Root MSE        =    .80298
##
## ------------------------------------------------------------------------------
##     nlitpost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##      nmatpre |    .584829   .0124022    47.16   0.000     .5605151     .609143
##          sen |  -.4324465   .0428546   -10.09   0.000    -.5164608   -.3484321
##        _cons |   .0451778   .0120786     3.74   0.000     .0214984    .0688573
## ------------------------------------------------------------------------------
##
## .

Variable Original Data analysis Coefficient. (SE) Complete Case analysis Coefficient. (SE) Multiple Imputation Coefficient. (SE)
Pre-reception numeracy 0.58 (0.012)
Special educational needs -0.432 (0.043)

### 82.7.4 去除了缺失值的分析結果 complete case analysis

epiDisplay::tab1(classp3dta$sen, graph = FALSE) ## classp3dta$sen :
##         Frequency Percent Cum. percent
## 0            4462    91.6         91.6
## 1             411     8.4        100.0
##   Total      4873   100.0        100.0
epiDisplay::tab1(classp3dta$sen_m, graph = FALSE) ## classp3dta$sen_m :
##         Frequency   %(NA+)   %(NA-)
## 0            2092     42.9       87
## 1             313      6.4       13
## <NA>         2468     50.6        0
##   Total      4873    100.0      100
classcompelet <- lm(nlitpost ~ nmatpre + sen_m, data = classp3dta)
summary(classcompelet)
##
## Call:
## lm(formula = nlitpost ~ nmatpre + sen_m, data = classp3dta)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -2.658912 -0.491275 -0.028356  0.496535  2.862229
##
## Coefficients:
##              Estimate Std. Error  t value  Pr(>|t|)
## (Intercept) -0.259546   0.017990 -14.4269 < 2.2e-16 ***
## nmatpre      0.434043   0.018684  23.2306 < 2.2e-16 ***
## sen_m       -0.362332   0.046773  -7.7466 1.381e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.74601 on 2402 degrees of freedom
##   (2468 observations deleted due to missingness)
## Multiple R-squared:  0.23565,    Adjusted R-squared:  0.23501
## F-statistic: 370.26 on 2 and 2402 DF,  p-value: < 2.22e-16
jtools::summ(classcompelet, digits = 6, confint = TRUE, exp = FALSE)
 Observations 2405 (2468 missing obs. deleted) Dependent variable nlitpost Type OLS linear regression
 F(2,2402) 370.26 R² 0.235645 Adj. R² 0.235009
Est. 2.5% 97.5% t val. p
(Intercept) -0.259546 -0.294824 -0.224267 -14.426855 0.000000
nmatpre 0.434043 0.397405 0.470682 23.230615 0.000000
sen_m -0.362332 -0.454052 -0.270612 -7.746606 0.000000
Standard errors: OLS

##
## . use "../backupfiles/classp3.dta", cl. regress nlitpost nmatpre sen_m
##
##       Source |       SS           df       MS      Number of obs   =     2,405
## -------------+----------------------------------   F(2, 2402)      =    370.26
##        Model |  412.119502         2  206.059751   Prob > F        =    0.0000
##     Residual |  1336.77695     2,402  .556526625   R-squared       =    0.2356
## -------------+----------------------------------   Adj R-squared   =    0.2350
##        Total |  1748.89645     2,404  .727494366   Root MSE        =    .74601
##
## ------------------------------------------------------------------------------
##     nlitpost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##      nmatpre |   .4340432   .0186841    23.23   0.000     .3974046    .4706818
##        sen_m |   -.362332    .046773    -7.75   0.000    -.4540516   -.2706124
##        _cons |  -.2595457   .0179905   -14.43   0.000    -.2948241   -.2242672
## ------------------------------------------------------------------------------
##
## .

Variable Original Data analysis Coefficient. (SE) Complete Case analysis Coefficient. (SE) Multiple Imputation Coefficient. (SE)
Pre-reception numeracy 0.58 (0.012) 0.43 (0.019)
Special educational needs -0.432 (0.043) -0.362 (0.047)

### 82.7.5 分析 sen_m 的缺失值機制

classp3dta <- classp3dta %>%
mutate(r = !is.na(sen_m))

epiDisplay::tab1(classp3dta$sen_m, graph = FALSE) ## classp3dta$sen_m :
##         Frequency   %(NA+)   %(NA-)
## 0            2092     42.9       87
## 1             313      6.4       13
## <NA>         2468     50.6        0
##   Total      4873    100.0      100
epiDisplay::tab1(classp3dta$r, graph = FALSE) ## classp3dta$r :
##         Frequency Percent Cum. percent
## FALSE        2468    50.6         50.6
## TRUE         2405    49.4        100.0
##   Total      4873   100.0        100.0

classglm <- glm(r ~ nlitpost + nmatpre, data = classp3dta,
summary(classglm)
##
## Call:
## glm(formula = r ~ nlitpost + nmatpre, family = binomial(link = logit),
##     data = classp3dta)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.88467  -0.76311  -0.19894   0.76629   2.69335
##
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)
## (Intercept)  0.019637   0.036027   0.5451   0.5857
## nlitpost    -1.039361   0.050322 -20.6540   <2e-16 ***
## nmatpre     -1.005978   0.051082 -19.6934   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 6754.60  on 4872  degrees of freedom
## Residual deviance: 4692.94  on 4870  degrees of freedom
## AIC: 4698.94
##
## Number of Fisher Scoring iterations: 5
jtools::summ(classglm, digits = 6, confint = TRUE, exp = TRUE)
 Observations 4873 Dependent variable r Type Generalized linear model Family binomial Link logit
 𝛘²(2) 2061.65 Pseudo-R² (Cragg-Uhler) 0.459988 Pseudo-R² (McFadden) 0.305222 AIC 4698.94 BIC 4718.42
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.019831 0.950303 1.094446 0.545072 0.585704
nlitpost 0.353681 0.320462 0.390342 -20.654049 0.000000
nmatpre 0.365687 0.330848 0.404195 -19.693359 0.000000
Standard errors: MLE

##
## . use "../backupfiles/classp3.dta", cl. gen r=(sen_m!=.)
##
## . logistic r nlitpost nmatpre
##
## Logistic regression                             Number of obs     =      4,873
##                                                 LR chi2(2)        =    2061.65
##                                                 Prob > chi2       =     0.0000
## Log likelihood = -2346.4721                     Pseudo R2         =     0.3052
##
## ------------------------------------------------------------------------------
##            r | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##     nlitpost |   .3536805   .0177981   -20.65   0.000     .3204621    .3903423
##      nmatpre |   .3656868   .0186801   -19.69   0.000     .3308477    .4041945
##        _cons |   1.019831   .0367414     0.55   0.586     .9503032    1.094446
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
##
## .

classglm2 <- glm(r ~ nlitpost + nmatpre + sen_m, data = classp3dta,
family = binomial(link = logit))
## Warning: glm.fit: algorithm did not converge
summary(classglm2)
##
## Call:
## glm(formula = r ~ nlitpost + nmatpre + sen_m, family = binomial(link = logit),
##     data = classp3dta)
##
## Deviance Residuals:
##        Min          1Q      Median          3Q         Max
## 2.4086e-06  2.4086e-06  2.4086e-06  2.4086e-06  2.4086e-06
##
## Coefficients:
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  2.6566e+01  8.9525e+03   0.003   0.9976
## nlitpost     5.6292e-10  9.7403e+03   0.000   1.0000
## nmatpre     -2.4086e-10  9.8705e+03   0.000   1.0000
## sen_m       -2.0172e-07  2.2605e+04   0.000   1.0000
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 0.00000e+00  on 2404  degrees of freedom
## Residual deviance: 1.39528e-08  on 2401  degrees of freedom
##   (2468 observations deleted due to missingness)
## AIC: 8
##
## Number of Fisher Scoring iterations: 25
jtools::summ(classglm2, digits = 6, confint = TRUE, exp = TRUE)
## Warning: glm.fit: algorithm did not converge
 Observations 2405 (2468 missing obs. deleted) Dependent variable r Type Generalized linear model Family binomial Link logit
 𝛘²(3) -0 Pseudo-R² (Cragg-Uhler) 0 Pseudo-R² (McFadden) 0 AIC 8 BIC 31.1412
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 344742777888.648010 0.000000 Inf 0.002967 0.997632
nlitpost 1.000000 0.000000 Inf 0.000000 1.000000
nmatpre 1.000000 0.000000 Inf -0.000000 1.000000
sen_m 1.000000 0.000000 Inf -0.000000 1.000000
Standard errors: MLE

classglm3 <- glm(r ~ nlitpost + nmatpre + sen, data = classp3dta,
summary(classglm3)
##
## Call:
## glm(formula = r ~ nlitpost + nmatpre + sen, family = binomial(link = logit),
##     data = classp3dta)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.86373  -0.76374  -0.19817   0.76322   2.69585
##
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)
## (Intercept)  0.025268   0.037471   0.6744   0.5001
## nlitpost    -1.043082   0.050798 -20.5339   <2e-16 ***
## nmatpre     -1.009251   0.051446 -19.6177   <2e-16 ***
## sen         -0.080734   0.146027  -0.5529   0.5804
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 6754.60  on 4872  degrees of freedom
## Residual deviance: 4692.64  on 4869  degrees of freedom
## AIC: 4700.64
##
## Number of Fisher Scoring iterations: 5
jtools::summ(classglm3, digits = 6, confint = TRUE, exp = TRUE)
 Observations 4873 Dependent variable r Type Generalized linear model Family binomial Link logit
 𝛘²(3) 2061.96 Pseudo-R² (Cragg-Uhler) 0.460042 Pseudo-R² (McFadden) 0.305267 AIC 4700.64 BIC 4726.61
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.025590 0.952969 1.103746 0.674355 0.500086
nlitpost 0.352367 0.318975 0.389256 -20.533887 0.000000
nmatpre 0.364492 0.329532 0.403161 -19.617732 0.000000
sen 0.922439 0.692850 1.228106 -0.552873 0.580351
Standard errors: MLE

### 82.7.6 多重插補 multiple imputation

#### 82.7.6.1 在 Stata 裏，重新讀入數據

• 首先我們需要告訴 Stata 我們要開始存儲一些多重插補過後的數據 mi set wide
• 然後，再告訴 Stata，我們要對 sen_m 這個變量進行多重插補 mi register imputed sen_m
• 產生我們需要的多重插補之後的數據，這個過程（雖然都被計算機一步完成）可以描述爲：
• 對於二進制變量，我們需要使用 logit 命令來進行多重插補。
• 先使用去除了缺失值的數據來跑我們需要的模型，然後獲取對應的回歸模型的回歸係數(i.e. log odds ratios)及其標準誤差。
• 假設要進行 $$m = 1, \cdots, M$$ 次多重插補過程，那麼根據回歸係數 (logOR) 本身服從常態分佈的假設，每次多重插補我們需要從回歸係數服從的分佈中採樣一組回歸係數。
• 獲取了 $$m$$ 組回歸係數之後，在每一個多重插補過程中，根據模型中的變量和採集的回歸係數，計算存在缺失值的 sen_m 取值是 1 的概率。
• 根據上步中計算獲得的每個有缺失值的 sen_m 取值可能是 1 的概率。這個概率被當作是一個 Bernoulli (binary) 分佈的概率參數，然後隨機採集 0/1 作爲該缺失值的插補值。
• 另外，實施多重插補過程的命令中 add(10) 是在告訴 Stata 我要進行十次插補，採集十次樣本，即 $$m = 10$$
• 最後一步是使用插補後的數據進行我們希望的回歸模型分析。它的過程簡單來說就是對插補後的十個樣本進行相同的模型估計，然後利用 “Rubin” 法則把十次結果總結成爲一個估計。它和之前的結果相比如何？
##
## . use "../backupfiles/classp3.dta", cl. mi set wide
##
## . mi register imputed sen_m
##
## . mi impute logit sen_m nlitpost nmatpre, add(10) rseed(4921)
##
## Univariate imputation                       Imputations =       10
## Logistic regression                               added =       10
## Imputed: m=1 through m=10                       updated =        0
##
## ------------------------------------------------------------------
##                    |               Observations per m
##                    |----------------------------------------------
##           Variable |   Complete   Incomplete   Imputed |     Total
## -------------------+-----------------------------------+----------
##              sen_m |       2405         2468      2468 |      4873
## ------------------------------------------------------------------
## (complete + incomplete = total; imputed is the minimum across m
##  of the number of filled-in observations.)
##
## . mi estimate: regress nlitpost nmatpre sen_m
##
## Multiple-imputation estimates                   Imputations       =         10
## Linear regression                               Number of obs     =      4,873
##                                                 Average RVI       =     0.1901
##                                                 Largest FMI       =     0.3735
##                                                 Complete DF       =       4870
## DF adjustment:   Small sample                   DF:     min       =      69.45
##                                                         avg       =   2,083.26
##                                                         max       =   3,626.87
## Model F test:       Equal FMI                   F(   2,  238.4)   =    1071.23
## Within VCE type:          OLS                   Prob > F          =     0.0000
##
## ------------------------------------------------------------------------------
##     nlitpost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##      nmatpre |   .5864505   .0126874    46.22   0.000     .5615717    .6113292
##        sen_m |  -.4044246    .053444    -7.57   0.000    -.5110302   -.2978191
##        _cons |   .0428783   .0122477     3.50   0.000     .0188653    .0668913
## ------------------------------------------------------------------------------
##
## .
Variable Original Data analysis Coefficient. (SE) Complete Case analysis Coefficient. (SE) Multiple Imputation Coefficient. (SE)
Pre-reception numeracy 0.58 (0.012) 0.43 (0.019) 0.59 (0.012)
Special educational needs -0.432 (0.043) -0.362 (0.047) -0.404 (0.053)