第 120 章 邏輯回歸模型1 - 單一暴露變量 Logistic regression 1 - effect of a single exposure
- 這裏主要使用 Stata 的命令,
tabodds
和mhodds
來進行橫斷面研究的比值 odds 和比值比 odds ratio, OR 的計算。 - 另外需要能夠理解並熟悉
logistic
命令和logit
命令計算並比較不同曝露組之間的 OR 和對應的信賴區間。 - 學會使用指示性變量 indicator variable
- 學會使用
lrtest
進行似然比檢驗
120.1 Q1-Q2 讀入 mortality.dta
數據
數據讀入 Stata 系統之後,使用 describe
和 summarize
命令來熟悉這個數據。
##
## . cd "~/Downloads/LSHTMlearningnote/backupfile/Users/chaoimacmini/Downloads/LSHTMlearningnote/backupfiles
##
## . use mortality
##
## .
## . describe
##
## Contains data from mortality.dta
## obs: 4,298
## vars: 28 23 Oct 2013 11:45
## -------------------------------------------------------------------------------
## storage display value
## variable name type format label variable label
## -------------------------------------------------------------------------------
## id int %8.0g Individual ID
## area float %9.0g area Area
## district str7 %7s District
## vcode byte %8.0g Village ID
## compound_size float %9.0g Compound size
## age byte %8.0g Age in years
## sex byte %8.0g sex Sex
## ethnic byte %8.0g ethnic Ethnic
## religion byte %11.0g religion Religion
## occupation byte %27.0g occupation
## Occupation
## education byte %22.0g education
## Education
## mfpermg float %9.0g Microfilarial load per mg
## vimp float %17.0g vimp Visually impaired
## died float %9.0g Died
## systolic int %8.0g Systolic BP (mmHg)
## diastolic int %8.0g Diastolic BP (mmHg)
## pulse int %8.0g Pulse rate/minute
## weight int %8.0g Weight (kgs)
## height int %8.0g Height (cms)
## map float %9.0g Mean Arterial Pressure
## bmi float %9.0g BMI
## agegrp float %9.0g agegrp Age group (years)
## mfgrp float %10.0g mfgrp Microfilarial load/mg (grouped)
## mfpos float %9.0g Microfilarial infection
## agebin float %9.0g agebin Age (2 groups)
## bmigrp float %12.0g bmigrp BMI (3 groups)
## enter long %dD_m_Y Date of entry to study
## exit float %dD_m_Y Date of exit from study
## -------------------------------------------------------------------------------
## Sorted by: id
##
## . summarize
##
## Variable | Obs Mean Std. Dev. Min Max
## -------------+---------------------------------------------------------
## id | 4,298 5598.28 2643.843 1 9820
## area | 4,298 .1226152 .3280331 0 1
## district | 0
## vcode | 4,298 19.12634 10.61364 1 35
## compound_s~e | 4,298 14.73872 10.52019 1 55
## -------------+---------------------------------------------------------
## age | 4,298 34.48744 14.64162 15 92
## sex | 4,298 .52792 .499278 0 1
## ethnic | 4,298 6.352722 3.045664 1 9
## religion | 4,297 1.985571 .6300043 1 3
## occupation | 4,289 3.121707 3.927291 1 20
## -------------+---------------------------------------------------------
## education | 4,287 2.052718 1.457285 1 6
## mfpermg | 4,205 10.72746 20.15699 0 198.02
## vimp | 4,298 .0760819 .26516 0 1
## died | 4,298 .0318753 .1756885 0 1
## systolic | 4,263 117.8072 19.26367 68 260
## -------------+---------------------------------------------------------
## diastolic | 4,250 72.35247 13.4937 20 156
## pulse | 4,243 78.34056 12.42703 50 146
## weight | 4,235 52.88524 8.217265 29 102
## height | 4,235 159.4012 8.244971 135 188
## map | 4,250 87.49694 13.90433 40 186.6667
## -------------+---------------------------------------------------------
## bmi | 4,235 20.75805 2.442004 14.00511 40.3465
## agegrp | 4,298 .6140065 .8067433 0 3
## mfgrp | 4,205 1.030916 .863685 0 3
## mfpos | 4,205 .6901308 .4624945 0 1
## agebin | 4,298 .1165658 .3209396 0 1
## -------------+---------------------------------------------------------
## bmigrp | 4,235 .8746163 .4304267 0 2
## enter | 4,298 10709.6 95.48321 10571 10880
## exit | 4,298 11683.26 159.8555 10750.5 11766
##
## .
120.2 Q3 結果變量 died
結果變量是死亡 died
,編碼爲1 表示受試對象在研究過程中死亡,0則表示對象在研究結束以後仍然存活。用 tab
, tabodds
和 mhodds
來簡單分析死亡 died
和 vimp
(視力損傷) 之間的關係。
. tab died
Died | Freq. Percent Cum.
------------+-----------------------------------
0 | 4,161 96.81 96.81
1 | 137 3.19 100.00
------------+-----------------------------------
Total | 4,298 100.00
. tab vimp
Visually impaired | Freq. Percent Cum.
------------------+-----------------------------------
Normal | 3,971 92.39 92.39
Visually impaired | 327 7.61 100.00
------------------+-----------------------------------
Total | 4,298 100.00
. tabulate died vimp, col
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
| Visually impaired
Died | Normal Visually | Total
-----------+----------------------+----------
0 | 3,874 287 | 4,161
| 97.56 87.77 | 96.81
-----------+----------------------+----------
1 | 97 40 | 137
| 2.44 12.23 | 3.19
-----------+----------------------+----------
Total | 3,971 327 | 4,298
| 100.00 100.00 | 100.00
. tabodds died vimp
--------------------------------------------------------------------------
vimp | Cases Controls Odds [95% Conf. Interval]
------------+-------------------------------------------------------------
Normal | 97 3874 0.02504 0.02047 0.03063
Visuall~d | 40 287 0.13937 0.10012 0.19402
--------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(1) = 93.81
Pr>chi2 = 0.0000
Score test for trend of odds: chi2(1) = 93.81
Pr>chi2 = 0.0000
. mhodds died vimp
Maximum likelihood estimate of the odds ratio
Comparing vimp==1 vs. vimp==0
----------------------------------------------------------------
Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------------------------------------------------------------
5.566292 93.81 0.0000 3.762437 8.234984
----------------------------------------------------------------
120.3 Q4-5 第一個簡單邏輯回歸模型
用邏輯回歸模型來分析一下 vimp
視覺損傷和 died
死亡之間的關係。
. logit died vimp
Iteration 0: log likelihood = -606.88457
Iteration 1: log likelihood = -578.54009
Iteration 2: log likelihood = -577.36856
Iteration 3: log likelihood = -577.36596
Iteration 4: log likelihood = -577.36596
Logistic regression Number of obs = 4,298
LR chi2(1) = 59.04
Prob > chi2 = 0.0000
Log likelihood = -577.36596 Pseudo R2 = 0.0486
------------------------------------------------------------------------------
died | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
vimp | 1.716729 .1976151 8.69 0.000 1.329411 2.104048
_cons | -3.687332 .1027979 -35.87 0.000 -3.888812 -3.485852
------------------------------------------------------------------------------
這裏我們看見 Stata 通過 iteration
迭代法找到極大對數似然 -577.36。右邊的模型信息提示有 4298 名實驗對象,另外模型還進行了一個卡方檢驗 \(\chi^2 = 59.04\) ,這個檢驗的零假設是“模型中沒有任何一個變量和結果變量(死亡)有關聯性 none of the variables in the model are associated with the outcome variable.”。因爲我們只在模型中加入了一個預測變量 vimp
視覺損傷,所以這個零假設也就等同於“視覺損傷與否和死亡之間無相關性。”檢驗的結果很顯然, p<0.0001。Pseudo R2 = 0.0486
其實是一個評價模型擬合度的統計量 goodness of fit statistic
,在這裏我們先忽略它。
120.4 Q6 給出 OR 值
. logistic died vimp
Logistic regression Number of obs = 4,298
LR chi2(1) = 59.04
Prob > chi2 = 0.0000
Log likelihood = -577.36596 Pseudo R2 = 0.0486
------------------------------------------------------------------------------
died | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
vimp | 5.566292 1.099983 8.69 0.000 3.778816 8.19929
_cons | .0250387 .0025739 -35.87 0.000 .0204696 .0306277
------------------------------------------------------------------------------
值得注意的是,
這時候直接給出的比值比 OR 它下面的
_cons
部分其實並不是比值比 OR,而是基線組的比值 (the odds of outcome in the baseline group)。所以,基線比值在本例中的含義其實就是視力沒有受損的人中死亡的比值 (odds of death amongst those visually unimpaired)。此時給出的標準誤差 (standard error) 只是近似值,可以無視。
z
統計量和使用logit
命令時計算得出的值完全相同。信賴區間使用的是對數尺度下計算獲得的信賴區間之後取
exp
獲得的。例如exp(1.329411) = 3.778816
你還可以用下面代碼獲取相同的結果:
. logit died vimp, or
Iteration 0: log likelihood = -606.88457
Iteration 1: log likelihood = -578.54009
Iteration 2: log likelihood = -577.36856
Iteration 3: log likelihood = -577.36596
Iteration 4: log likelihood = -577.36596
Logistic regression Number of obs = 4,298
LR chi2(1) = 59.04
Prob > chi2 = 0.0000
Log likelihood = -577.36596 Pseudo R2 = 0.0486
------------------------------------------------------------------------------
died | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
vimp | 5.566292 1.099983 8.69 0.000 3.778816 8.19929
_cons | .0250387 .0025739 -35.87 0.000 .0204696 .0306277
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
120.5 Q7 彙總成報告用的表格
把目前爲止分析的結果彙總成研究報告使用的表格:
|
n |
Died (row %) |
OR (95% CI) |
P-value |
---|---|---|---|---|
Visually unimpaired Visually impaired |
3971 327 |
97 (2.4%) 40 (12.2%) |
1 5.57 (3.78, 8.20) |
<0.0001 |
描述這一分析結果的段落可以寫作:
分析結果發現,視力損傷和死亡的危險度強烈相關 (p < 0.0001)。視力損傷的研究對象相比視力正常的人死亡的比值要高出快要6倍之多 (OR 5.57; 95%CI: 3.78, 8.20)。
120.6 Q8 瞭解微小絲蟲傳染病 (microfilarial infection) 和死亡之間的關係
當曝露變量在表格中是行 (row) 時,我們推薦用行百分比視角來觀察它和結果變量 (列) 之間的關係:
. tab mfgrp died, row
+----------------+
| Key |
|----------------|
| frequency |
| row percentage |
+----------------+
Microfilar |
ial |
load/mg | Died
(grouped) | 0 1 | Total
-----------+----------------------+----------
Uninfected | 1,274 29 | 1,303
| 97.77 2.23 | 100.00
-----------+----------------------+----------
<10 | 1,609 62 | 1,671
| 96.29 3.71 | 100.00
-----------+----------------------+----------
10-49 | 996 33 | 1,029
| 96.79 3.21 | 100.00
-----------+----------------------+----------
50+ | 193 9 | 202
| 95.54 4.46 | 100.00
-----------+----------------------+----------
Total | 4,072 133 | 4,205
| 96.84 3.16 | 100.00
於是我們發現,未被此寄生蟲感染的人羣,整個實驗進行的過程中,死亡率是 2.23%,如果感染絲蟲的感染量較輕 < 10 load/mg,那麼死亡率會升高至 3.71%。
120.7 Q9 使用指示變量 indicator variable
在變量 mfgrp
之前加上 i.
等同於告訴計算機程序說,這個 mfgrp
變量應噶被編碼成爲指示變量或者叫啞變量 (dummy)。
. logit died i.mfgrp
Iteration 0: log likelihood = -590.21364
Iteration 1: log likelihood = -586.93331
Iteration 2: log likelihood = -586.86511
Iteration 3: log likelihood = -586.86507
Logistic regression Number of obs = 4,205
LR chi2(3) = 6.70
Prob > chi2 = 0.0822
Log likelihood = -586.86507 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
died | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mfgrp |
<10 | .5263863 .228075 2.31 0.021 .0793676 .973405
10-49 | .3753803 .2580207 1.45 0.146 -.130331 .8810917
50+ | .7171561 .389307 1.84 0.065 -.0458716 1.480184
|
_cons | -3.78262 .1877968 -20.14 0.000 -4.150695 -3.414545
------------------------------------------------------------------------------
你可以查看這些計算機自動產生的啞變量:
. list mfgrp i.mfgrp in 1/25
+--------------------------------------------+
| 0. 1. 2. 3.|
| mfgrp mfgrp mfgrp mfgrp mfgrp |
|--------------------------------------------|
1. | Uninfected 1 0 0 0 |
2. | <10 0 1 0 0 |
3. | <10 0 1 0 0 |
4. | 10-49 0 0 1 0 |
5. | <10 0 1 0 0 |
|--------------------------------------------|
6. | . . . . . |
7. | Uninfected 1 0 0 0 |
8. | . . . . . |
9. | <10 0 1 0 0 |
10. | <10 0 1 0 0 |
|--------------------------------------------|
11. | . . . . . |
12. | <10 0 1 0 0 |
13. | <10 0 1 0 0 |
14. | Uninfected 1 0 0 0 |
15. | <10 0 1 0 0 |
|--------------------------------------------|
16. | . . . . . |
17. | 10-49 0 0 1 0 |
18. | Uninfected 1 0 0 0 |
19. | 10-49 0 0 1 0 |
20. | Uninfected 1 0 0 0 |
|--------------------------------------------|
21. | 10-49 0 0 1 0 |
22. | 50+ 0 0 0 1 |
23. | . . . . . |
24. | <10 0 1 0 0 |
25. | Uninfected 1 0 0 0 |
+--------------------------------------------+
自動產生的啞變量的同時,計算機還自動加上的標籤。
120.8 Q10 計算比值比
. logistic died i.mfgrp, base
Logistic regression Number of obs = 4,205
LR chi2(3) = 6.70
Prob > chi2 = 0.0822
Log likelihood = -586.86507 Pseudo R2 = 0.0057
------------------------------------------------------------------------------
died | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mfgrp |
Uninfected | 1 (base)
<10 | 1.692804 .3860862 2.31 0.021 1.082602 2.646942
10-49 | 1.455545 .3755608 1.45 0.146 .8778048 2.413533
50+ | 2.048599 .7975339 1.84 0.065 .9551646 4.393753
|
_cons | .022763 .0042748 -20.14 0.000 .0157535 .0328914
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
在命令行最後添加 , base
選項的話,會在計算結果中提示我們哪個組是基線組 (baseline group),便於查看結果。
- OR = 1.69 表示感染程度爲 <10 load/mg 的人死亡的比值和基線組(未感染)相比的比值。
- OR = 1.46 表示感染程度爲 10-49 load/mg 的人的死亡的比值和基線組(未感染)相比的比值。
- OR = 2.05 表示感染程度爲 50+ load/mg 的人的死亡的比值和基線組(未感染)相比的比值。
- 一共有三個 Wald 檢驗,和對應的 P 值。他們檢驗的零假設是每一個感染組和未感染組相比死亡的比值比等於1。
- 右上角的似然比檢驗 likelihood ratio test 檢驗的是卡方值等於 6.7,自由度是 3 的卡方檢驗,p = 0.0822。這個似然比檢驗的是,同時評價四個關於絲蟲寄生蟲感染程度的啞變量
mfgrp
,和結果變量死亡died
之間的關係。This tests the association between the variablemfgrp
and the outcome, death, by simultaneously testing all three parameters in the model.
120.9 Q11 簡單陳述上述分析的結果
數據分析的結果表明,絲蟲寄生蟲感染程度和死亡之間的相關性比較不明顯,證據強度爲弱 (p = 0.08)。在未被感染的人羣中,死亡率是 2.2%,確實比其他組死亡率略低。和未被感染的人羣相比,感染程度是 <10, 10-49, 50+ load/mg 的組的死亡比值比分別是 1.69 (95% CI: 1.08, 2.64), 1.46 (0.88, 2.41), 2.05 (0.96, 4.39)。
120.10 Q12 似然比檢驗用於模型比較
爲了調整正確的比較模型姿勢,我們需要使用相同的人數在兩個模型中,這裏我們打算比較的兩個模型一個是不含有絲蟲感染變量 mfgrp
的模型 (L0),另一個是含有絲蟲感染變量的模型 (L1)。那麼,我們需要先修改 L0 的人數,再進行二者的比較。
. logistic died if mfgrp!=.
. estimates store A
. logistic died i.mfgrp
. estimates store B
. lrtest B A
Likelihood-ratio test LR chi2(3) = 6.70
(Assumption: A nested in B) Prob > chi2 = 0.0822
假如你忘記了一開始調整分析的人數的步驟,直接使用似然比檢驗比較兩個模型的話,電腦會因爲人數不同而報錯:
. lrtest B A
observations differ: 4205 vs. 4298
120.11 Q13 分析年齡組和死亡之間的關係
- 先在模型中只放年齡組一個變量
. logistic died i.agegrp, base
Logistic regression Number of obs = 4,298
LR chi2(3) = 110.78
Prob > chi2 = 0.0000
Log likelihood = -551.49583 Pseudo R2 = 0.0913
------------------------------------------------------------------------------
died | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
agegrp |
15-34 | 1 (base)
35-54 | 2.578425 .6004948 4.07 0.000 1.63349 4.069983
55-64 | 6.933532 1.895136 7.08 0.000 4.057854 11.84712
65+ | 14.80207 3.919374 10.18 0.000 8.809203 24.87186
|
_cons | .0133448 .0024127 -23.88 0.000 .009363 .01902
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
只有年齡做唯一的解釋變量時,我們發現年齡和死亡呈強烈的正相關。
- 再往模型裏增加一個變量 (視覺損傷)
. logistic died i.vimp i.agegrp, base
Logistic regression Number of obs = 4,298
LR chi2(4) = 122.05
Prob > chi2 = 0.0000
Log likelihood = -545.8609 Pseudo R2 = 0.1006
------------------------------------------------------------------------------------
died | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
vimp |
Normal | 1 (base)
Visually impaired | 2.202266 .5019659 3.46 0.001 1.408816 3.442592
|
agegrp |
15-34 | 1 (base)
35-54 | 2.354987 .5548487 3.64 0.000 1.484023 3.737116
55-64 | 5.415484 1.556832 5.88 0.000 3.082732 9.513469
65+ | 9.901309 2.932372 7.74 0.000 5.54116 17.69231
|
_cons | .0131857 .0023851 -23.93 0.000 .0092498 .0187962
------------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
我們發現,之前獲得的視覺歲上和死亡之間強烈正相關的比值比 5.57 在調整了年齡之後縮水到了 2.20。
120.12 分析要點
- 在實施邏輯回歸模型的計算之前,我們需要先對解釋變量和結果變量進行列表計算等初步的瞭解。
- 邏輯回歸模型本身在計算的時候是使用對數比值 log-odds,然後通過數學轉換獲得我們需要的比值比 OR。
- 每次模型運行時會對每一個變量進行 Wald 檢驗,它檢驗的都是 logOR/SE 然後和標準正(常)態分佈做比較他們是否等於零。
- 似然比檢驗法比較模型時,實際上比較的是兩個模型之間的對數似然 log-likelihood。它的零假設是“目標變量和結果變量之間無關 no association between the variable of interest and the outcome”。