第 120 章 邏輯回歸模型1 - 單一暴露變量 Logistic regression 1 - effect of a single exposure

  • 這裏主要使用 Stata 的命令,taboddsmhodds 來進行橫斷面研究的比值 odds 和比值比 odds ratio, OR 的計算。
  • 另外需要能夠理解並熟悉 logistic 命令和 logit 命令計算並比較不同曝露組之間的 OR 和對應的信賴區間。
  • 學會使用指示性變量 indicator variable
  • 學會使用 lrtest 進行似然比檢驗

120.1 Q1-Q2 讀入 mortality.dta 數據

數據讀入 Stata 系統之後,使用 describesummarize 命令來熟悉這個數據。

## 
## . cd "~/Downloads/LSHTMlearningnote/backupfile/Users/chaoimacmini/Downloads/LSHTMlearningnote/backupfiles
## 
## . use mortality 
## 
## . 
## . describe
## 
## Contains data from mortality.dta
##   obs:         4,298                          
##  vars:            28                          23 Oct 2013 11:45
## -------------------------------------------------------------------------------
##               storage   display    value
## variable name   type    format     label      variable label
## -------------------------------------------------------------------------------
## id              int     %8.0g                 Individual ID
## area            float   %9.0g      area       Area
## district        str7    %7s                   District
## vcode           byte    %8.0g                 Village ID
## compound_size   float   %9.0g                 Compound size
## age             byte    %8.0g                 Age in years
## sex             byte    %8.0g      sex        Sex
## ethnic          byte    %8.0g      ethnic     Ethnic
## religion        byte    %11.0g     religion   Religion
## occupation      byte    %27.0g     occupation
##                                               Occupation
## education       byte    %22.0g     education
##                                               Education
## mfpermg         float   %9.0g                 Microfilarial load per mg
## vimp            float   %17.0g     vimp       Visually impaired
## died            float   %9.0g                 Died
## systolic        int     %8.0g                 Systolic BP (mmHg)
## diastolic       int     %8.0g                 Diastolic BP (mmHg)
## pulse           int     %8.0g                 Pulse rate/minute
## weight          int     %8.0g                 Weight (kgs)
## height          int     %8.0g                 Height (cms)
## map             float   %9.0g                 Mean Arterial Pressure
## bmi             float   %9.0g                 BMI
## agegrp          float   %9.0g      agegrp     Age group (years)
## mfgrp           float   %10.0g     mfgrp      Microfilarial load/mg (grouped)
## mfpos           float   %9.0g                 Microfilarial infection
## agebin          float   %9.0g      agebin     Age (2 groups)
## bmigrp          float   %12.0g     bmigrp     BMI (3 groups)
## enter           long    %dD_m_Y               Date of entry to study
## exit            float   %dD_m_Y               Date of exit from study
## -------------------------------------------------------------------------------
## Sorted by: id
## 
## . summarize
## 
##     Variable |        Obs        Mean    Std. Dev.       Min        Max
## -------------+---------------------------------------------------------
##           id |      4,298     5598.28    2643.843          1       9820
##         area |      4,298    .1226152    .3280331          0          1
##     district |          0
##        vcode |      4,298    19.12634    10.61364          1         35
## compound_s~e |      4,298    14.73872    10.52019          1         55
## -------------+---------------------------------------------------------
##          age |      4,298    34.48744    14.64162         15         92
##          sex |      4,298      .52792     .499278          0          1
##       ethnic |      4,298    6.352722    3.045664          1          9
##     religion |      4,297    1.985571    .6300043          1          3
##   occupation |      4,289    3.121707    3.927291          1         20
## -------------+---------------------------------------------------------
##    education |      4,287    2.052718    1.457285          1          6
##      mfpermg |      4,205    10.72746    20.15699          0     198.02
##         vimp |      4,298    .0760819      .26516          0          1
##         died |      4,298    .0318753    .1756885          0          1
##     systolic |      4,263    117.8072    19.26367         68        260
## -------------+---------------------------------------------------------
##    diastolic |      4,250    72.35247     13.4937         20        156
##        pulse |      4,243    78.34056    12.42703         50        146
##       weight |      4,235    52.88524    8.217265         29        102
##       height |      4,235    159.4012    8.244971        135        188
##          map |      4,250    87.49694    13.90433         40   186.6667
## -------------+---------------------------------------------------------
##          bmi |      4,235    20.75805    2.442004   14.00511    40.3465
##       agegrp |      4,298    .6140065    .8067433          0          3
##        mfgrp |      4,205    1.030916     .863685          0          3
##        mfpos |      4,205    .6901308    .4624945          0          1
##       agebin |      4,298    .1165658    .3209396          0          1
## -------------+---------------------------------------------------------
##       bmigrp |      4,235    .8746163    .4304267          0          2
##        enter |      4,298     10709.6    95.48321      10571      10880
##         exit |      4,298    11683.26    159.8555    10750.5      11766
## 
## .

120.2 Q3 結果變量 died

結果變量是死亡 died,編碼爲1 表示受試對象在研究過程中死亡,0則表示對象在研究結束以後仍然存活。用 tab, taboddsmhodds 來簡單分析死亡 diedvimp (視力損傷) 之間的關係。

. tab died

       Died |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      4,161       96.81       96.81
          1 |        137        3.19      100.00
------------+-----------------------------------
      Total |      4,298      100.00

. tab vimp

Visually impaired |      Freq.     Percent        Cum.
------------------+-----------------------------------
           Normal |      3,971       92.39       92.39
Visually impaired |        327        7.61      100.00
------------------+-----------------------------------
            Total |      4,298      100.00


. tabulate died vimp, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |   Visually impaired
      Died |    Normal  Visually  |     Total
-----------+----------------------+----------
         0 |     3,874        287 |     4,161 
           |     97.56      87.77 |     96.81 
-----------+----------------------+----------
         1 |        97         40 |       137 
           |      2.44      12.23 |      3.19 
-----------+----------------------+----------
     Total |     3,971        327 |     4,298 
           |    100.00     100.00 |    100.00 

. tabodds died vimp

--------------------------------------------------------------------------
      vimp  |      Cases     Controls       Odds      [95% Conf. Interval]
------------+-------------------------------------------------------------
     Normal |         97         3874    0.02504        0.02047   0.03063
  Visuall~d |         40          287    0.13937        0.10012   0.19402
--------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(1)  =    93.81
                                  Pr>chi2  =   0.0000

Score test for trend of odds:     chi2(1)  =    93.81
                                  Pr>chi2  =   0.0000

. mhodds died vimp

Maximum likelihood estimate of the odds ratio
Comparing vimp==1 vs. vimp==0

    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       5.566292      93.81        0.0000         3.762437   8.234984
    ----------------------------------------------------------------

120.3 Q4-5 第一個簡單邏輯回歸模型

用邏輯回歸模型來分析一下 vimp 視覺損傷和 died 死亡之間的關係。

. logit died vimp

Iteration 0:   log likelihood = -606.88457  
Iteration 1:   log likelihood = -578.54009  
Iteration 2:   log likelihood = -577.36856  
Iteration 3:   log likelihood = -577.36596  
Iteration 4:   log likelihood = -577.36596  

Logistic regression                             Number of obs     =      4,298
                                                LR chi2(1)        =      59.04
                                                Prob > chi2       =     0.0000
Log likelihood = -577.36596                     Pseudo R2         =     0.0486

------------------------------------------------------------------------------
        died |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        vimp |   1.716729   .1976151     8.69   0.000     1.329411    2.104048
       _cons |  -3.687332   .1027979   -35.87   0.000    -3.888812   -3.485852
------------------------------------------------------------------------------

這裏我們看見 Stata 通過 iteration 迭代法找到極大對數似然 -577.36。右邊的模型信息提示有 4298 名實驗對象,另外模型還進行了一個卡方檢驗 \(\chi^2 = 59.04\) ,這個檢驗的零假設是“模型中沒有任何一個變量和結果變量(死亡)有關聯性 none of the variables in the model are associated with the outcome variable.”。因爲我們只在模型中加入了一個預測變量 vimp 視覺損傷,所以這個零假設也就等同於“視覺損傷與否和死亡之間無相關性。”檢驗的結果很顯然, p<0.0001。Pseudo R2 = 0.0486 其實是一個評價模型擬合度的統計量 goodness of fit statistic,在這裏我們先忽略它。

120.4 Q6 給出 OR 值

. logistic died vimp

Logistic regression                             Number of obs     =      4,298
                                                LR chi2(1)        =      59.04
                                                Prob > chi2       =     0.0000
Log likelihood = -577.36596                     Pseudo R2         =     0.0486

------------------------------------------------------------------------------
        died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        vimp |   5.566292   1.099983     8.69   0.000     3.778816     8.19929
       _cons |   .0250387   .0025739   -35.87   0.000     .0204696    .0306277
------------------------------------------------------------------------------

值得注意的是,

  1. 這時候直接給出的比值比 OR 它下面的 _cons 部分其實並不是比值比 OR,而是基線組的比值 (the odds of outcome in the baseline group)。所以,基線比值在本例中的含義其實就是視力沒有受損的人中死亡的比值 (odds of death amongst those visually unimpaired)。

  2. 此時給出的標準誤差 (standard error) 只是近似值,可以無視。

  3. z 統計量和使用 logit 命令時計算得出的值完全相同。

  4. 信賴區間使用的是對數尺度下計算獲得的信賴區間之後取 exp 獲得的。例如 exp(1.329411) = 3.778816

你還可以用下面代碼獲取相同的結果:

. logit died vimp, or

Iteration 0:   log likelihood = -606.88457  
Iteration 1:   log likelihood = -578.54009  
Iteration 2:   log likelihood = -577.36856  
Iteration 3:   log likelihood = -577.36596  
Iteration 4:   log likelihood = -577.36596  

Logistic regression                             Number of obs     =      4,298
                                                LR chi2(1)        =      59.04
                                                Prob > chi2       =     0.0000
Log likelihood = -577.36596                     Pseudo R2         =     0.0486

------------------------------------------------------------------------------
        died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        vimp |   5.566292   1.099983     8.69   0.000     3.778816     8.19929
       _cons |   .0250387   .0025739   -35.87   0.000     .0204696    .0306277
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

120.5 Q7 彙總成報告用的表格

把目前爲止分析的結果彙總成研究報告使用的表格:

   
   
   
n   

Died (row %)
   
OR (95% CI)   
   
P-value   
   
Visually unimpaired
   
Visually impaired   
   
3971
   
327   

97 (2.4%)

40 (12.2%)
   
1
   
5.57 (3.78, 8.20)   
   
<0.0001   

描述這一分析結果的段落可以寫作:

分析結果發現,視力損傷和死亡的危險度強烈相關 (p < 0.0001)。視力損傷的研究對象相比視力正常的人死亡的比值要高出快要6倍之多 (OR 5.57; 95%CI: 3.78, 8.20)。

120.6 Q8 瞭解微小絲蟲傳染病 (microfilarial infection) 和死亡之間的關係

當曝露變量在表格中是行 (row) 時,我們推薦用行百分比視角來觀察它和結果變量 (列) 之間的關係:

. tab mfgrp died, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

Microfilar |
       ial |
   load/mg |         Died
 (grouped) |         0          1 |     Total
-----------+----------------------+----------
Uninfected |     1,274         29 |     1,303 
           |     97.77       2.23 |    100.00 
-----------+----------------------+----------
       <10 |     1,609         62 |     1,671 
           |     96.29       3.71 |    100.00 
-----------+----------------------+----------
     10-49 |       996         33 |     1,029 
           |     96.79       3.21 |    100.00 
-----------+----------------------+----------
       50+ |       193          9 |       202 
           |     95.54       4.46 |    100.00 
-----------+----------------------+----------
     Total |     4,072        133 |     4,205 
           |     96.84       3.16 |    100.00 

於是我們發現,未被此寄生蟲感染的人羣,整個實驗進行的過程中,死亡率是 2.23%,如果感染絲蟲的感染量較輕 < 10 load/mg,那麼死亡率會升高至 3.71%。

120.7 Q9 使用指示變量 indicator variable

在變量 mfgrp 之前加上 i. 等同於告訴計算機程序說,這個 mfgrp 變量應噶被編碼成爲指示變量或者叫啞變量 (dummy)。

. logit died i.mfgrp

Iteration 0:   log likelihood = -590.21364  
Iteration 1:   log likelihood = -586.93331  
Iteration 2:   log likelihood = -586.86511  
Iteration 3:   log likelihood = -586.86507  

Logistic regression                             Number of obs     =      4,205
                                                LR chi2(3)        =       6.70
                                                Prob > chi2       =     0.0822
Log likelihood = -586.86507                     Pseudo R2         =     0.0057

------------------------------------------------------------------------------
        died |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       mfgrp |
        <10  |   .5263863    .228075     2.31   0.021     .0793676     .973405
      10-49  |   .3753803   .2580207     1.45   0.146     -.130331    .8810917
        50+  |   .7171561    .389307     1.84   0.065    -.0458716    1.480184
             |
       _cons |   -3.78262   .1877968   -20.14   0.000    -4.150695   -3.414545
------------------------------------------------------------------------------

你可以查看這些計算機自動產生的啞變量:

. list mfgrp i.mfgrp in 1/25

     +--------------------------------------------+
     |                  0.      1.      2.      3.|
     |      mfgrp   mfgrp   mfgrp   mfgrp   mfgrp |
     |--------------------------------------------|
  1. | Uninfected       1       0       0       0 |
  2. |        <10       0       1       0       0 |
  3. |        <10       0       1       0       0 |
  4. |      10-49       0       0       1       0 |
  5. |        <10       0       1       0       0 |
     |--------------------------------------------|
  6. |          .       .       .       .       . |
  7. | Uninfected       1       0       0       0 |
  8. |          .       .       .       .       . |
  9. |        <10       0       1       0       0 |
 10. |        <10       0       1       0       0 |
     |--------------------------------------------|
 11. |          .       .       .       .       . |
 12. |        <10       0       1       0       0 |
 13. |        <10       0       1       0       0 |
 14. | Uninfected       1       0       0       0 |
 15. |        <10       0       1       0       0 |
     |--------------------------------------------|
 16. |          .       .       .       .       . |
 17. |      10-49       0       0       1       0 |
 18. | Uninfected       1       0       0       0 |
 19. |      10-49       0       0       1       0 |
 20. | Uninfected       1       0       0       0 |
     |--------------------------------------------|
 21. |      10-49       0       0       1       0 |
 22. |        50+       0       0       0       1 |
 23. |          .       .       .       .       . |
 24. |        <10       0       1       0       0 |
 25. | Uninfected       1       0       0       0 |
     +--------------------------------------------+

自動產生的啞變量的同時,計算機還自動加上的標籤。

120.8 Q10 計算比值比

. logistic died i.mfgrp, base

Logistic regression                             Number of obs     =      4,205
                                                LR chi2(3)        =       6.70
                                                Prob > chi2       =     0.0822
Log likelihood = -586.86507                     Pseudo R2         =     0.0057

------------------------------------------------------------------------------
        died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       mfgrp |
 Uninfected  |          1  (base)
        <10  |   1.692804   .3860862     2.31   0.021     1.082602    2.646942
      10-49  |   1.455545   .3755608     1.45   0.146     .8778048    2.413533
        50+  |   2.048599   .7975339     1.84   0.065     .9551646    4.393753
             |
       _cons |    .022763   .0042748   -20.14   0.000     .0157535    .0328914
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

在命令行最後添加 , base 選項的話,會在計算結果中提示我們哪個組是基線組 (baseline group),便於查看結果。

  1. OR = 1.69 表示感染程度爲 <10 load/mg 的人死亡的比值和基線組(未感染)相比的比值。
  2. OR = 1.46 表示感染程度爲 10-49 load/mg 的人的死亡的比值和基線組(未感染)相比的比值。
  3. OR = 2.05 表示感染程度爲 50+ load/mg 的人的死亡的比值和基線組(未感染)相比的比值。
  4. 一共有三個 Wald 檢驗,和對應的 P 值。他們檢驗的零假設是每一個感染組和未感染組相比死亡的比值比等於1。
  5. 右上角的似然比檢驗 likelihood ratio test 檢驗的是卡方值等於 6.7,自由度是 3 的卡方檢驗,p = 0.0822。這個似然比檢驗的是,同時評價四個關於絲蟲寄生蟲感染程度的啞變量 mfgrp,和結果變量死亡 died 之間的關係。This tests the association between the variable mfgrp and the outcome, death, by simultaneously testing all three parameters in the model.

120.9 Q11 簡單陳述上述分析的結果

數據分析的結果表明,絲蟲寄生蟲感染程度和死亡之間的相關性比較不明顯,證據強度爲弱 (p = 0.08)。在未被感染的人羣中,死亡率是 2.2%,確實比其他組死亡率略低。和未被感染的人羣相比,感染程度是 <10, 10-49, 50+ load/mg 的組的死亡比值比分別是 1.69 (95% CI: 1.08, 2.64), 1.46 (0.88, 2.41), 2.05 (0.96, 4.39)。

120.10 Q12 似然比檢驗用於模型比較

爲了調整正確的比較模型姿勢,我們需要使用相同的人數在兩個模型中,這裏我們打算比較的兩個模型一個是不含有絲蟲感染變量 mfgrp 的模型 (L0),另一個是含有絲蟲感染變量的模型 (L1)。那麼,我們需要先修改 L0 的人數,再進行二者的比較。

. logistic died if mfgrp!=.
. estimates store A

. logistic died i.mfgrp
. estimates store B

. lrtest B A

Likelihood-ratio test                                 LR chi2(3)  =      6.70
(Assumption: A nested in B)                           Prob > chi2 =    0.0822

假如你忘記了一開始調整分析的人數的步驟,直接使用似然比檢驗比較兩個模型的話,電腦會因爲人數不同而報錯:

. lrtest B A
observations differ: 4205 vs. 4298

120.11 Q13 分析年齡組和死亡之間的關係

  1. 先在模型中只放年齡組一個變量
. logistic died i.agegrp, base

Logistic regression                             Number of obs     =      4,298
                                                LR chi2(3)        =     110.78
                                                Prob > chi2       =     0.0000
Log likelihood = -551.49583                     Pseudo R2         =     0.0913

------------------------------------------------------------------------------
        died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      agegrp |
      15-34  |          1  (base)
      35-54  |   2.578425   .6004948     4.07   0.000      1.63349    4.069983
      55-64  |   6.933532   1.895136     7.08   0.000     4.057854    11.84712
        65+  |   14.80207   3.919374    10.18   0.000     8.809203    24.87186
             |
       _cons |   .0133448   .0024127   -23.88   0.000      .009363      .01902
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

只有年齡做唯一的解釋變量時,我們發現年齡和死亡呈強烈的正相關。

  1. 再往模型裏增加一個變量 (視覺損傷)

. logistic died i.vimp i.agegrp, base

Logistic regression                             Number of obs     =      4,298
                                                LR chi2(4)        =     122.05
                                                Prob > chi2       =     0.0000
Log likelihood =  -545.8609                     Pseudo R2         =     0.1006

------------------------------------------------------------------------------------
              died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
              vimp |
           Normal  |          1  (base)
Visually impaired  |   2.202266   .5019659     3.46   0.001     1.408816    3.442592
                   |
            agegrp |
            15-34  |          1  (base)
            35-54  |   2.354987   .5548487     3.64   0.000     1.484023    3.737116
            55-64  |   5.415484   1.556832     5.88   0.000     3.082732    9.513469
              65+  |   9.901309   2.932372     7.74   0.000      5.54116    17.69231
                   |
             _cons |   .0131857   .0023851   -23.93   0.000     .0092498    .0187962
------------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

我們發現,之前獲得的視覺歲上和死亡之間強烈正相關的比值比 5.57 在調整了年齡之後縮水到了 2.20。

120.12 分析要點

  1. 在實施邏輯回歸模型的計算之前,我們需要先對解釋變量和結果變量進行列表計算等初步的瞭解。
  2. 邏輯回歸模型本身在計算的時候是使用對數比值 log-odds,然後通過數學轉換獲得我們需要的比值比 OR。
  3. 每次模型運行時會對每一個變量進行 Wald 檢驗,它檢驗的都是 logOR/SE 然後和標準正(常)態分佈做比較他們是否等於零。
  4. 似然比檢驗法比較模型時,實際上比較的是兩個模型之間的對數似然 log-likelihood。它的零假設是“目標變量和結果變量之間無關 no association between the variable of interest and the outcome”。