Chapter 82 Missing data

82.1 Principles for handling missing data (recommendations)

82.2 Missingness mechanisms

82.3 Ad-hoc approaches

82.4 Parametric multiple imputation of a single variable

82.5 Variable selection, number of imputations, and model checking

82.6 Summary

82.7 Practical 10-Hier

82.7.1 The impact of missing data, missingness mechanisms, and multiple imputation

Objectives:

After completing this practical you should understand:

  1. The possible effects of missing data on the estimated results, including the bias that may be introduced and the impact on the precision of the parameter estimates.
  2. How to examine the missingness pattern of a variable.
  3. How to carry out a simple multiple imputation for a single incomplete variable.

82.7.2 Missingness in a binary variable: the “class size study”

For the first exercise we use the “class size” data.

classp3dta <- read_dta("../backupfiles/classp3.dta")
head(classp3dta)
## # A tibble: 6 × 5
##   uniqueid nmatpre nlitpost   sen sen_m
##      <dbl>   <dbl>    <dbl> <dbl> <dbl>
## 1      706  -1.21    -1.31      1     1
## 2      731  -1.70    -2.41      1     1
## 3     7345  -1.70    -1.43      0     0
## 4     1154  -0.980   -1.99      1     1
## 5     9071  -0.258   -0.255     0    NA
## 6     4875  -0.980   -0.255     0     0
names(classp3dta)
## [1] "uniqueid" "nmatpre"  "nlitpost" "sen"      "sen_m"

The name and meaning of each variable are listed below:

Name       Details
uniqueid   Unique pupil id
nmatpre    Pre-reception maths score
nlitpost   Post-reception literacy score
sen        Special educational needs (1 = yes, 0 = no)
sen_m      Special educational needs, with a number of missing values

The score variables nmatpre and nlitpost have both been normalised. Because many pupils obtained identical raw scores, you will see some normalised scores that are exactly the same. We begin by looking at the relationship between nmatpre and nlitpost after adjusting for sen, first using sen as the adjustment variable and then using sen_m. After that we will flag the missing values in sen_m, and finally we will analyse the same model using multiple imputation.
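
Before fitting any models it can help to look at the data. The following is a minimal sketch, assuming ggplot2 is installed, that plots nlitpost against nmatpre with separate least-squares lines by sen:

# Quick visual check: nlitpost against nmatpre, separately by SEN status.
# A minimal sketch assuming ggplot2 is available.
library(ggplot2)

ggplot(classp3dta, aes(x = nmatpre, y = nlitpost, colour = factor(sen))) +
  geom_point(alpha = 0.3) +                 # many tied normalised scores overlap
  geom_smooth(method = "lm", se = FALSE) +  # separate least-squares lines by sen
  labs(colour = "sen")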

82.7.3 Results of the analysis of the full data

classmissing <- lm(nlitpost ~ nmatpre + sen, data = classp3dta)
summary(classmissing)
## 
## Call:
## lm(formula = nlitpost ~ nmatpre + sen, data = classp3dta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.00430 -0.52708 -0.01955  0.52983  3.35295 
## 
## Coefficients:
##              Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)  0.045178   0.012079   3.7403 0.0001859 ***
## nmatpre      0.584829   0.012402  47.1552 < 2.2e-16 ***
## sen         -0.432446   0.042855 -10.0910 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.80298 on 4870 degrees of freedom
## Multiple R-squared:  0.36147,    Adjusted R-squared:  0.36121 
## F-statistic: 1378.4 on 2 and 4870 DF,  p-value: < 2.22e-16
jtools::summ(classmissing, digits = 6, confint = TRUE, exp = FALSE)
Observations 4873
Dependent variable nlitpost
Type OLS linear regression
F(2,4870) 1378.449192
R² 0.361470
Adj. R² 0.361208
Est. 2.5% 97.5% t val. p
(Intercept) 0.045178 0.021498 0.068857 3.740335 0.000186
nmatpre 0.584829 0.560515 0.609143 47.155211 0.000000
sen -0.432446 -0.516461 -0.348432 -10.091018 0.000000
Standard errors: OLS

Results of the full-data analysis in Stata

## 
## . use "../backupfiles/classp3.dta", clear
## 
## . describe
## 
## Contains data from ../backupfiles/classp3.dta
##   obs:         4,873                          
##  vars:             5                          31 Jan 2009 21:58
## -------------------------------------------------------------------------------
##               storage   display    value
## variable name   type    format     label      variable label
## -------------------------------------------------------------------------------
## uniqueid        float   %9.0g                 
## nmatpre         float   %9.0g                 
## nlitpost        float   %9.0g                 
## sen             float   %9.0g                 
## sen_m           float   %9.0g                 
## -------------------------------------------------------------------------------
## Sorted by: 
## 
## . regress nlitpost nmatpre sen
## 
##       Source |       SS           df       MS      Number of obs   =     4,873
## -------------+----------------------------------   F(2, 4870)      =   1378.45
##        Model |  1777.58785         2  888.793924   Prob > F        =    0.0000
##     Residual |   3140.0696     4,870  .644778153   R-squared       =    0.3615
## -------------+----------------------------------   Adj R-squared   =    0.3612
##        Total |  4917.65745     4,872   1.0093714   Root MSE        =    .80298
## 
## ------------------------------------------------------------------------------
##     nlitpost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##      nmatpre |    .584829   .0124022    47.16   0.000     .5605151     .609143
##          sen |  -.4324465   .0428546   -10.09   0.000    -.5164608   -.3484321
##        _cons |   .0451778   .0120786     3.74   0.000     .0214984    .0688573
## ------------------------------------------------------------------------------
## 
## .

The results so far are summarised in the table below.

Variable                     Original data Coef. (SE)   Complete case Coef. (SE)   Multiple imputation Coef. (SE)
Pre-reception numeracy       0.58 (0.012)
Special educational needs    -0.432 (0.043)

82.7.4 Complete case analysis: results after dropping the missing values

Next we use sen_m as the adjustment variable. It is the same variable as sen, but with roughly half of its values missing.

epiDisplay::tab1(classp3dta$sen, graph = FALSE)
## classp3dta$sen : 
##         Frequency Percent Cum. percent
## 0            4462    91.6         91.6
## 1             411     8.4        100.0
##   Total      4873   100.0        100.0
epiDisplay::tab1(classp3dta$sen_m, graph = FALSE)
## classp3dta$sen_m : 
##         Frequency   %(NA+)   %(NA-)
## 0            2092     42.9       87
## 1             313      6.4       13
## <NA>         2468     50.6        0
##   Total      4873    100.0      100
classcompelet <- lm(nlitpost ~ nmatpre + sen_m, data = classp3dta)
summary(classcompelet)
## 
## Call:
## lm(formula = nlitpost ~ nmatpre + sen_m, data = classp3dta)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -2.658912 -0.491275 -0.028356  0.496535  2.862229 
## 
## Coefficients:
##              Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) -0.259546   0.017990 -14.4269 < 2.2e-16 ***
## nmatpre      0.434043   0.018684  23.2306 < 2.2e-16 ***
## sen_m       -0.362332   0.046773  -7.7466 1.381e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.74601 on 2402 degrees of freedom
##   (2468 observations deleted due to missingness)
## Multiple R-squared:  0.23565,    Adjusted R-squared:  0.23501 
## F-statistic: 370.26 on 2 and 2402 DF,  p-value: < 2.22e-16
jtools::summ(classcompelet, digits = 6, confint = TRUE, exp = FALSE)
Observations 2405 (2468 missing obs. deleted)
Dependent variable nlitpost
Type OLS linear regression
F(2,2402) 370.260364
R² 0.235645
Adj. R² 0.235009
Est. 2.5% 97.5% t val. p
(Intercept) -0.259546 -0.294824 -0.224267 -14.426855 0.000000
nmatpre 0.434043 0.397405 0.470682 23.230615 0.000000
sen_m -0.362332 -0.454052 -0.270612 -7.746606 0.000000
Standard errors: OLS

Results of the complete case analysis in Stata

## 
## . use "../backupfiles/classp3.dta", clear
## 
## . regress nlitpost nmatpre sen_m
## 
##       Source |       SS           df       MS      Number of obs   =     2,405
## -------------+----------------------------------   F(2, 2402)      =    370.26
##        Model |  412.119502         2  206.059751   Prob > F        =    0.0000
##     Residual |  1336.77695     2,402  .556526625   R-squared       =    0.2356
## -------------+----------------------------------   Adj R-squared   =    0.2350
##        Total |  1748.89645     2,404  .727494366   Root MSE        =    .74601
## 
## ------------------------------------------------------------------------------
##     nlitpost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##      nmatpre |   .4340432   .0186841    23.23   0.000     .3974046    .4706818
##        sen_m |   -.362332    .046773    -7.75   0.000    -.4540516   -.2706124
##        _cons |  -.2595457   .0179905   -14.43   0.000    -.2948241   -.2242672
## ------------------------------------------------------------------------------
## 
## .

This analysis is based only on the 2,405 subjects with complete data; the other 2,468 subjects cannot enter the model because sen_m is missing. The regression coefficients of the two predictors and their standard errors have both shifted considerably, and precision is clearly reduced. A complete case analysis of this kind can be regarded as unbiased if the missingness of sen_m is independent of the outcome nlitpost conditional on the covariates (nmatpre and sen_m); in practice, however, when a data set contains this many missing values we cannot tell whether the missingness is independent of the outcome and predictor variables we wish to analyse.
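
To see this shift at a glance, the two fits can be stacked side by side. This is a minimal sketch assuming the broom package is installed alongside the tidyverse already in use (dplyr is referenced explicitly to avoid masking issues):

library(broom)

# Coefficients and standard errors of the full-data fit (classmissing)
# and the complete case fit (classcompelet), stacked for comparison.
dplyr::bind_rows(
  tidy(classmissing)  %>% dplyr::mutate(model = "full data (sen)"),
  tidy(classcompelet) %>% dplyr::mutate(model = "complete cases (sen_m)")
) %>%
  dplyr::select(model, term, estimate, std.error)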

The updated results are summarised in the table below.

Variable                     Original data Coef. (SE)   Complete case Coef. (SE)   Multiple imputation Coef. (SE)
Pre-reception numeracy       0.58 (0.012)               0.43 (0.019)
Special educational needs    -0.432 (0.043)             -0.362 (0.047)

82.7.5 Examining the missingness mechanism of sen_m

First, generate an indicator variable r recording whether sen_m is observed (TRUE if observed, FALSE if missing):

classp3dta <- classp3dta %>% 
  mutate(r = !is.na(sen_m))

epiDisplay::tab1(classp3dta$sen_m, graph = FALSE)
## classp3dta$sen_m : 
##         Frequency   %(NA+)   %(NA-)
## 0            2092     42.9       87
## 1             313      6.4       13
## <NA>         2468     50.6        0
##   Total      4873    100.0      100
epiDisplay::tab1(classp3dta$r, graph = FALSE)
## classp3dta$r : 
##         Frequency Percent Cum. percent
## FALSE        2468    50.6         50.6
## TRUE         2405    49.4        100.0
##   Total      4873   100.0        100.0

Next, fit a logistic regression model with r as the outcome and nlitpost and nmatpre as the predictors:

classglm <- glm(r ~ nlitpost + nmatpre, data = classp3dta, 
                family = binomial(link = logit))
summary(classglm)
## 
## Call:
## glm(formula = r ~ nlitpost + nmatpre, family = binomial(link = logit), 
##     data = classp3dta)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.88467  -0.76311  -0.19894   0.76629   2.69335  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  0.019637   0.036027   0.5451   0.5857    
## nlitpost    -1.039361   0.050322 -20.6540   <2e-16 ***
## nmatpre     -1.005978   0.051082 -19.6934   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6754.60  on 4872  degrees of freedom
## Residual deviance: 4692.94  on 4870  degrees of freedom
## AIC: 4698.94
## 
## Number of Fisher Scoring iterations: 5
jtools::summ(classglm, digits = 6, confint = TRUE, exp = TRUE)
Observations 4873
Dependent variable r
Type Generalized linear model
Family binomial
Link logit
𝛘²(2) 2061.653687
Pseudo-R² (Cragg-Uhler) 0.459988
Pseudo-R² (McFadden) 0.305222
AIC 4698.944224
BIC 4718.418619
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.019831 0.950303 1.094446 0.545072 0.585704
nlitpost 0.353681 0.320462 0.390342 -20.654049 0.000000
nmatpre 0.365687 0.330848 0.404195 -19.693359 0.000000
Standard errors: MLE

The same procedure can be carried out in Stata as follows:

## 
## . use "../backupfiles/classp3.dta", clear
## 
## . gen r=(sen_m!=.)
## 
## . logistic r nlitpost nmatpre
## 
## Logistic regression                             Number of obs     =      4,873
##                                                 LR chi2(2)        =    2061.65
##                                                 Prob > chi2       =     0.0000
## Log likelihood = -2346.4721                     Pseudo R2         =     0.3052
## 
## ------------------------------------------------------------------------------
##            r | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##     nlitpost |   .3536805   .0177981   -20.65   0.000     .3204621    .3903423
##      nmatpre |   .3656868   .0186801   -19.69   0.000     .3308477    .4041945
##        _cons |   1.019831   .0367414     0.55   0.586     .9503032    1.094446
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
## 
## .

Can you tell, roughly, what the missingness mechanism of sen_m is?

The logistic regression above shows that whether sen_m is missing is strongly associated with both nlitpost and nmatpre. Judging from the odds ratios, the higher a pupil's scores on these two variables, the lower the probability that their sen value is observed (i.e. the more likely it is to be missing). In other words, the missingness mechanism of this variable cannot be missing completely at random (MCAR).
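
The odds ratios quoted above can be read straight off classglm; for example, the following line reproduces the exponentiated estimates with Wald 95% intervals (confint.default gives Wald intervals, which is also what Stata's logistic output reports):

# Odds ratios and Wald 95% confidence intervals from the missingness model.
exp(cbind(OR = coef(classglm), confint.default(classglm)))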

Next, we add sen_m as a further predictor in this logistic regression model:

classglm2 <- glm(r ~ nlitpost + nmatpre + sen_m, data = classp3dta, 
                family = binomial(link = logit))
## Warning: glm.fit: algorithm did not converge
summary(classglm2)
## 
## Call:
## glm(formula = r ~ nlitpost + nmatpre + sen_m, family = binomial(link = logit), 
##     data = classp3dta)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## 2.4086e-06  2.4086e-06  2.4086e-06  2.4086e-06  2.4086e-06  
## 
## Coefficients:
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  2.6566e+01  8.9525e+03   0.003   0.9976
## nlitpost    -7.6455e-10  9.7403e+03   0.000   1.0000
## nmatpre      7.2348e-10  9.8705e+03   0.000   1.0000
## sen_m        4.8110e-06  2.2605e+04   0.000   1.0000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.00000e+00  on 2404  degrees of freedom
## Residual deviance: 1.39528e-08  on 2401  degrees of freedom
##   (2468 observations deleted due to missingness)
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25
jtools::summ(classglm2, digits = 6, confint = TRUE, exp = TRUE)
## Warning: glm.fit: algorithm did not converge
Observations 2405 (2468 missing obs. deleted)
Dependent variable r
Type Generalized linear model
Family binomial
Link logit
𝛘²(3) -0.000000
Pseudo-R² (Cragg-Uhler) 0.000000
Pseudo-R² (McFadden) 0.000000
AIC 8.000000
BIC 31.141221
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 344742703498.612000 0.000000 Inf 0.002967 0.997632
nlitpost 1.000000 0.000000 Inf -0.000000 1.000000
nmatpre 1.000000 0.000000 Inf 0.000000 1.000000
sen_m 1.000005 0.000000 Inf 0.000000 1.000000
Standard errors: MLE

Can you see why the parameters of this model cannot be estimated?
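
A simple cross-tabulation makes the reason visible: adding sen_m restricts the model to the rows in which sen_m is observed, and in exactly those rows r is always TRUE, so the outcome of the logistic regression has no variation left (a minimal check in R):

# In the rows that can enter the model (sen_m observed), r is identically TRUE,
# so the outcome of the logistic regression is constant and the fit degenerates.
with(classp3dta, table(r, sen_m_missing = is.na(sen_m)))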

Now we replace sen_m with sen:

classglm3 <- glm(r ~ nlitpost + nmatpre + sen, data = classp3dta, 
                family = binomial(link = logit))
summary(classglm3)
## 
## Call:
## glm(formula = r ~ nlitpost + nmatpre + sen, family = binomial(link = logit), 
##     data = classp3dta)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.86373  -0.76374  -0.19817   0.76322   2.69585  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  0.025268   0.037471   0.6744   0.5001    
## nlitpost    -1.043082   0.050798 -20.5339   <2e-16 ***
## nmatpre     -1.009251   0.051446 -19.6177   <2e-16 ***
## sen         -0.080734   0.146027  -0.5529   0.5804    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6754.60  on 4872  degrees of freedom
## Residual deviance: 4692.64  on 4869  degrees of freedom
## AIC: 4700.64
## 
## Number of Fisher Scoring iterations: 5
jtools::summ(classglm3, digits = 6, confint = TRUE, exp = TRUE)
Observations 4873
Dependent variable r
Type Generalized linear model
Family binomial
Link logit
𝛘²(3) 2061.957742
Pseudo-R² (Cragg-Uhler) 0.460042
Pseudo-R² (McFadden) 0.305267
AIC 4700.640169
BIC 4726.606030
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.025590 0.952969 1.103746 0.674355 0.500086
nlitpost 0.352367 0.318975 0.389256 -20.533887 0.000000
nmatpre 0.364492 0.329532 0.403161 -19.617732 0.000000
sen 0.922439 0.692850 1.228106 -0.552873 0.580351
Standard errors: MLE

Here we see that sen is not associated with whether sen_m is missing (no evidence that missingness in sen_m is associated with sen), even after adjusting for nlitpost and nmatpre. This satisfies the definition of a missing at random (MAR) mechanism, so we can regard the missingness of sen_m as MAR conditional on nlitpost and nmatpre. Note, however, that we were only able to demonstrate MAR from this definition because the incomplete variable was created artificially, by deliberately deleting about half of the values of a variable that is in fact fully observed; in reality nobody can carry out such a check, because for a genuinely incomplete variable the complete data are never available. The analysis above also explains why the complete case estimates showed noticeable bias and inflated standard errors: whether sen_m is missing depends on both the outcome of interest, nlitpost, and the predictor nmatpre.

82.7.6 Multiple imputation

82.7.6.1 In Stata, read the data in again

  • First, tell Stata that we are going to store multiply imputed data: mi set wide
  • Then tell Stata that sen_m is the variable to be imputed: mi register imputed sen_m
  • Generate the imputed data sets. Although Stata carries this out in a single step, the procedure can be described as follows (an R sketch of these steps is given after this list):
    • For a binary variable, the imputation uses the logit command.
    • Fit the required model to the complete cases and obtain the regression coefficients (i.e. the log odds ratios) and their standard errors.
    • Suppose we perform \(m = 1, \cdots, M\) imputations. Assuming the coefficients (log ORs) follow a normal distribution, each imputation draws one set of coefficients from that distribution.
    • In each imputation, use the variables in the model together with the drawn coefficients to compute, for every observation with sen_m missing, the probability that sen_m equals 1.
    • Treat this probability as the parameter of a Bernoulli (binary) distribution and draw a 0/1 value at random as the imputed value.
  • The option add(10) in the imputation command tells Stata to perform ten imputations, i.e. to draw ten samples, so \(m = 10\).
  • The final step is to fit the regression model of interest to the imputed data. In short, the same model is estimated in each of the ten imputed data sets, and the ten sets of results are combined into a single estimate using Rubin's rules. How does it compare with the earlier results?
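
The R sketch below writes out a single one of these imputation draws by hand, purely to make the steps concrete. It assumes the MASS package for the multivariate normal draw, and it illustrates the idea rather than reproducing Stata's mi impute logit exactly:

library(MASS)   # for mvrnorm(); note that MASS masks dplyr::select

set.seed(4921)
# 1. Fit the imputation model to the complete cases (glm drops rows with NA).
fit  <- glm(sen_m ~ nlitpost + nmatpre, data = classp3dta,
            family = binomial(link = logit))
# 2. Draw one set of log odds ratios from their approximate normal distribution.
beta <- mvrnorm(1, mu = coef(fit), Sigma = vcov(fit))
# 3. For every row with sen_m missing, compute Pr(sen_m = 1) under the drawn coefficients.
miss <- is.na(classp3dta$sen_m)
X    <- cbind(1, classp3dta$nlitpost[miss], classp3dta$nmatpre[miss])
p1   <- plogis(as.vector(X %*% beta))
# 4. Fill in the missing values with Bernoulli draws; repeating steps 2-4 M times
#    gives the M imputed data sets.
sen_imp       <- classp3dta$sen_m
sen_imp[miss] <- rbinom(sum(miss), size = 1, prob = p1)
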
## 
## . use "../backupfiles/classp3.dta", clear
## 
## . mi set wide
## 
## . mi register imputed sen_m
## 
## . mi impute logit sen_m nlitpost nmatpre, add(10) rseed(4921)
## 
## Univariate imputation                       Imputations =       10
## Logistic regression                               added =       10
## Imputed: m=1 through m=10                       updated =        0
## 
## ------------------------------------------------------------------
##                    |               Observations per m             
##                    |----------------------------------------------
##           Variable |   Complete   Incomplete   Imputed |     Total
## -------------------+-----------------------------------+----------
##              sen_m |       2405         2468      2468 |      4873
## ------------------------------------------------------------------
## (complete + incomplete = total; imputed is the minimum across m
##  of the number of filled-in observations.)
## 
## . mi estimate: regress nlitpost nmatpre sen_m
## 
## Multiple-imputation estimates                   Imputations       =         10
## Linear regression                               Number of obs     =      4,873
##                                                 Average RVI       =     0.1901
##                                                 Largest FMI       =     0.3735
##                                                 Complete DF       =       4870
## DF adjustment:   Small sample                   DF:     min       =      69.45
##                                                         avg       =   2,083.26
##                                                         max       =   3,626.87
## Model F test:       Equal FMI                   F(   2,  238.4)   =    1071.23
## Within VCE type:          OLS                   Prob > F          =     0.0000
## 
## ------------------------------------------------------------------------------
##     nlitpost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##      nmatpre |   .5864505   .0126874    46.22   0.000     .5615717    .6113292
##        sen_m |  -.4044246    .053444    -7.57   0.000    -.5110302   -.2978191
##        _cons |   .0428783   .0122477     3.50   0.000     .0188653    .0668913
## ------------------------------------------------------------------------------
## 
## .

Variable                     Original data Coef. (SE)   Complete case Coef. (SE)   Multiple imputation Coef. (SE)
Pre-reception numeracy       0.58 (0.012)               0.43 (0.019)               0.59 (0.012)
Special educational needs    -0.432 (0.043)             -0.362 (0.047)             -0.404 (0.053)

After multiple imputation, the estimated regression coefficients and their standard errors are very close to those from the original, fully observed data, which is very satisfactory. Note that we only obtain such nearly unbiased estimates because the missing data here were generated by what we believe to be an MAR mechanism; in reality there is no way to verify whether the missingness mechanism really is MAR.
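
For completeness, the same workflow can be sketched in R using the mice package (assumed to be installed). Its "logreg" method imputes a binary variable by essentially the scheme described in the list above, and pool() applies Rubin's rules: the pooled coefficient is the average of the M estimates, and its variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The seed mirrors the Stata rseed, although the numerical results will not match Stata exactly:

library(mice)

# Impute sen_m from nlitpost and nmatpre only, as in the Stata command above.
imp_data <- as.data.frame(classp3dta[, c("nlitpost", "nmatpre", "sen_m")])
imp_data$sen_m <- factor(imp_data$sen_m)   # "logreg" expects a two-level factor

imp <- mice(imp_data, m = 10, method = c("", "", "logreg"),
            seed = 4921, printFlag = FALSE)

# Fit the substantive model in each of the 10 imputed data sets and pool
# the results with Rubin's rules (the sen_m coefficient appears as sen_m1,
# i.e. the effect of sen_m = 1 versus 0).
fits <- with(imp, lm(nlitpost ~ nmatpre + sen_m))
summary(pool(fits))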