Chapter 118 Analysis of unmatched case-control studies
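This practical uses data from a case-control study conducted in Tanzania; the dataset is mwanza.dta. Type help mwanza to learn more about the study and the variables in the dataset.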
118.1 Q1 Reading in the data
Create two new variables as instructed:
ed2: 1 = no formal education (none/adult only); 2 = at least 1 year of formal education.
age2: 1 = 15-24; 2 = 25-34; 3 = 35+ years.
##
## . cd "~/Downloads/LSHTMlearningnote/backupfile/Users/chaoimacmini/Downloads/LSHTMlearningnote/backupfiles
##
## . use mwanza
##
## .
## . * create a new variable for education
##
## . generate ed2 = ed
##
## . recode ed2 3/4 = 2
## (ed2: 376 changes made)
##
## . label define ed2label 1 "none/adult only" 2 ">=1 years"
##
## . label val ed2 ed2label
##
## . label var ed2 "education"
##
## .
## . * check the recoding worked as wanted
##
## . tabulate ed2 ed
##
## | Education
## education | 1 2 3 4 | Total
## ----------------+--------------------------------------------+----------
## none/adult only | 312 0 0 0 | 312
## >=1 years | 0 75 365 11 | 451
## ----------------+--------------------------------------------+----------
## Total | 312 75 365 11 | 763
##
## .
## .
## . * similarly for age
##
## .
## . generate age2 = age1
##
## . recode age2 2 = 1 3/4 = 2 5/6 = 3
## (age2: 654 changes made)
##
## . label define age2label 1 "15-24" 2 "25-34" 3 "35+"
##
## . label val age2 age2label
##
## . label var age2 "Age"
##
## . tabulate age2 age1
##
## | Age group
## Age | 1 2 3 4 5 | Total
## -----------+-------------------------------------------------------+----------
## 15-24 | 109 165 0 0 0 | 274
## 25-34 | 0 0 123 118 0 | 241
## 35+ | 0 0 0 0 137 | 248
## -----------+-------------------------------------------------------+----------
## Total | 109 165 123 118 137 | 763
##
##
## | Age group
## Age | 6 | Total
## -----------+-----------+----------
## 15-24 | 0 | 274
## 25-34 | 0 | 241
## 35+ | 111 | 248
## -----------+-----------+----------
## Total | 111 | 763
##
## .
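The two recodes above amount to collapsing the original codes through a lookup table. The following is a minimal Python sketch of that idea, not part of the Stata session: the names `ED_TO_ED2`, `AGE1_TO_AGE2`, `recode` and `collapse` are hypothetical, and the counts are copied from the tabulate output above.

```python
# Hypothetical lookup tables mirroring the two Stata recode commands.
ED_TO_ED2 = {1: 1, 2: 2, 3: 2, 4: 2}                     # recode ed2 3/4 = 2
AGE1_TO_AGE2 = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}      # recode age2 2 = 1 3/4 = 2 5/6 = 3


def recode(value, mapping):
    """Return the recoded value; codes absent from the mapping are left unchanged."""
    return mapping.get(value, value)


def collapse(counts, mapping):
    """Sum the frequencies of the original codes that share a new code."""
    out = {}
    for code, n in counts.items():
        new_code = recode(code, mapping)
        out[new_code] = out.get(new_code, 0) + n
    return out


# Frequencies of the original codes, copied from the Total rows above.
ed_counts = {1: 312, 2: 75, 3: 365, 4: 11}
age1_counts = {1: 109, 2: 165, 3: 123, 4: 118, 5: 137, 6: 111}

ed2_counts = collapse(ed_counts, ED_TO_ED2)
age2_counts = collapse(age1_counts, AGE1_TO_AGE2)

print(ed2_counts)
print(age2_counts)
```

The collapsed totals should agree with the Total columns of the two check tables (312 and 451 for ed2; 274, 241 and 248 for age2).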
help mwanza
Case control study of risk factors for HIV in women, Mwanza Tanzania
As part of a prospective study of the impact of STD control on the incidence
of HIV infection in Mwanza, Tanzania, a baseline survey of HIV prevalence was
carried out in 12 communities. All seropositive women (15 years and above)
were revisited and, where possible, interviewed about potential risk factors
for HIV infection using a standard questionnaire. In addition to interviewing
HIV +ve women, a random sample of HIV -ve women were selected from the
population lists prepared during the baseline survey and these women were also
revisited and, where possible, interviewed. No matching of controls with cases
was performed.
idno identity number
comp community 1-12
case 1=case; 0=control
age1 age group: 1=15-19 2=20-24 3=25-29
4=30-34 5=35-44 6=45-54
ed education: 1=none/adult only 2=1-3 years
3=4-6 years 4=7+ years
eth ethnic group: 1=Sukuma 2=Mkara 3=other 9=missing
rel religion: 1=Moslem 2=Catholic 3=Protestant 4=other 9=missing
msta marital status: 1=currently married 2=divorced/widowed
3=never married 9=missing
bld blood transfusion in last 5 years: 1=no 2=yes 9=missing
inj injections in past 1 year: 1=none 2=1 3=2-4 4=5-9 5=10+
9=missing
skin skin incisions or tattoos: 1=no 2=yes 9=missing
fsex age at first sex: 1=<15 2=15-19 3=20+ 4=never 9=missing
npa number of sexual partners ever: 1=0-1 2=2-4 3=5-9 4=10-19
5=20-49 6=50+ 9=missing
pa1 sex partners in last year: 1=none 2=1 3=2 4=3-4
9=missing
usedc ever used a condom: 1=no 2=yes 9=missing
ud genital ulcer or discharge in past year: 1=no 2=yes
9=missing
ark perceived risk of HIV/AIDS: 1=none/slight 2=quite likely
3=very likely/already infected 4=don't know
srk perceived risk of STDs: 1=none/slight 2=quite likely
3=very likely/already infected 4=don't know
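118.2 Q2 Computing the crude odds ratio
With education as the exposure and HIV status as the outcome, compute the crude odds ratio (OR). First tabulate the two variables in a simple \(2 \times 2\) table to get a rough impression of how they might be related: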
## . tabulate case ed2, row
##
## +----------------+
## | Key |
## |----------------|
## | frequency |
## | row percentage |
## +----------------+
##
## Case/contr | education
## ol | none/adul >=1 years | Total
## -----------+----------------------+----------
## 0 | 263 311 | 574
## | 45.82 54.18 | 100.00
## -----------+----------------------+----------
## 1 | 49 140 | 189
## | 25.93 74.07 | 100.00
## -----------+----------------------+----------
## Total | 312 451 | 763
## | 40.89 59.11 | 100.00
##
Next compute the crude OR, taking each education level in turn as the reference (baseline) category:
## . mhodds case ed2, c(1, 2)
##
## Maximum likelihood estimate of the odds ratio
## Comparing ed2==1 vs. ed2==2
##
## ----------------------------------------------------------------
## Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
## ----------------------------------------------------------------
## 0.413878 23.25 0.0000 0.285782 0.599391
## ----------------------------------------------------------------
## . mhodds case ed2, c(2, 1)
##
## Maximum likelihood estimate of the odds ratio
## Comparing ed2==2 vs. ed2==1
##
## ----------------------------------------------------------------
## Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
## ----------------------------------------------------------------
## 2.416169 23.25 0.0000 1.668360 3.499168
## ----------------------------------------------------------------
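The crude odds ratio is simply the cross-product ratio of the \(2 \times 2\) table, so both values above can be checked by hand. A minimal Python sketch, with the counts copied from the tabulate output; the function name `odds_ratio` is made up for illustration:

```python
# Counts copied from "tabulate case ed2" above.
cases_exposed, cases_unexposed = 140, 49          # case == 1: ed2 == 2, ed2 == 1
controls_exposed, controls_unexposed = 311, 263   # case == 0: ed2 == 2, ed2 == 1


def odds_ratio(a, b, c, d):
    """Cross-product ratio (a*d)/(b*c).

    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    return (a * d) / (b * c)


# ed2 == 2 versus ed2 == 1 (no education as the baseline)
or_2_vs_1 = odds_ratio(cases_exposed, cases_unexposed,
                       controls_exposed, controls_unexposed)

# Swapping the baseline inverts the odds ratio.
or_1_vs_2 = 1 / or_2_vs_1

print(round(or_2_vs_1, 6), round(or_1_vs_2, 6))
```

Changing the reference category only inverts the estimate; the chi-squared statistic and the p-value are unaffected, as the two mhodds outputs show.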
Compute the chi-squared statistic \(\chi^2\) for this table and test whether the two binary variables are associated:
## . tab case ed2, chi exact
## Case/contr | education
## ol | none/adul >=1 years | Total
## -----------+----------------------+----------
## 0 | 263 311 | 574
## 1 | 49 140 | 189
## -----------+----------------------+----------
## Total | 312 451 | 763
## Pearson chi2(1) = 23.2789 Pr = 0.000
## Fisher's exact = 0.000
## 1-sided Fisher's exact = 0.000
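Fisher's exact test is not actually needed here, because the smallest expected count among the four cells is \(\frac{312 \times 189}{763} = 77.3 > 5\). Either test leads to the same conclusion: the data provide strong evidence (P < 0.001) against the null hypothesis that the two binary variables are unrelated.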
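The expected counts and the Pearson statistic can be reproduced from the observed table alone. The sketch below is plain Python rather than Stata, uses only the counts printed above, and introduces its variable names purely for illustration:

```python
# Observed counts: rows are case = 0, 1; columns are ed2 = 1, 2.
observed = [[263, 311],
            [49, 140]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
min_expected = min(min(row) for row in expected)

# Pearson chi-squared: sum over cells of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

print(round(min_expected, 1), round(chi2, 4))
```

The smallest expected count belongs to the cell for cases with no education, which is why the large-sample chi-squared test is adequate here.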
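118.3 Q3 Age as a confounder or effect modifier
We want to understand how age affects the association between education and HIV infection. Start by examining the association between education and HIV within each age stratum: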
## . bysort age2: tab case ed2, row
##
## ---------------------------------------------------------------------------------------------------
## -> age2 = 15-24
##
## +----------------+
## | Key |
## |----------------|
## | frequency |
## | row percentage |
## +----------------+
##
## Case/contr | education
## ol | none/adul >=1 years | Total
## -----------+----------------------+----------
## 0 | 37 167 | 204
## | 18.14 81.86 | 100.00
## -----------+----------------------+----------
## 1 | 13 57 | 70
## | 18.57 81.43 | 100.00
## -----------+----------------------+----------
## Total | 50 224 | 274
## | 18.25 81.75 | 100.00
##
## ---------------------------------------------------------------------------------------------------
## -> age2 = 25-34
##
## +----------------+
## | Key |
## |----------------|
## | frequency |
## | row percentage |
## +----------------+
##
## Case/contr | education
## ol | none/adul >=1 years | Total
## -----------+----------------------+----------
## 0 | 79 90 | 169
## | 46.75 53.25 | 100.00
## -----------+----------------------+----------
## 1 | 11 61 | 72
## | 15.28 84.72 | 100.00
## -----------+----------------------+----------
## Total | 90 151 | 241
## | 37.34 62.66 | 100.00
##
## ---------------------------------------------------------------------------------------------------
## -> age2 = 35+
##
## +----------------+
## | Key |
## |----------------|
## | frequency |
## | row percentage |
## +----------------+
##
## Case/contr | education
## ol | none/adul >=1 years | Total
## -----------+----------------------+----------
## 0 | 147 54 | 201
## | 73.13 26.87 | 100.00
## -----------+----------------------+----------
## 1 | 25 22 | 47
## | 53.19 46.81 | 100.00
## -----------+----------------------+----------
## Total | 172 76 | 248
## | 69.35 30.65 | 100.00
The by(age2) option gives the age-adjusted odds ratio for the association between education and HIV infection:
## . mhodds case ed2, by(age2)
##
## Maximum likelihood estimate of the odds ratio
## Comparing ed2==2 vs. ed2==1
## by age2
##
## -------------------------------------------------------------------------------
## age2 | Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
## ----------+--------------------------------------------------------------------
## 15-24 | 0.971442 0.01 0.9354 0.48188 1.95837
## 25-34 | 4.867677 21.28 0.0000 2.31121 10.25188
## 35+ | 2.395556 7.10 0.0077 1.23412 4.65001
## -------------------------------------------------------------------------------
##
## Mantel-Haenszel estimate controlling for age2
## ----------------------------------------------------------------
## Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
## ----------------------------------------------------------------
## 2.330972 19.46 0.0000 1.582460 3.433536
## ----------------------------------------------------------------
##
## Test of homogeneity of ORs (approx): chi2(2) = 10.31
## Pr>chi2 = 0.0058
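After adjusting for age, the OR for HIV comparing women with at least 1 year of formal education against women with no formal education changes only slightly, from 2.42 to 2.33 (95% CI: 1.58, 3.43), so the data still support a strong association between education and HIV infection. Note, however, the test for interaction reported in the last lines of the output: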
## Test of homogeneity of ORs (approx): chi2(2) = 10.31
## Pr>chi2 = 0.0058
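That is, the data also show that the OR describing the education-HIV association differs significantly between age strata (the association between education and HIV infection varies with age group). The stratum-specific ORs bear this out: in the 15-24 group there is no association between education and HIV (OR = 0.97), whereas the ORs in the 25-34 and 35+ groups are 4.87 and 2.40. If you agree that age acts as an effect modifier of the education-HIV association, then the age-adjusted OR should be set aside and the OR for each age stratum reported instead.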
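The stratum-specific and Mantel-Haenszel odds ratios in this section can be recomputed from the three age-specific tables. The sketch below uses the standard Mantel-Haenszel estimator \(\sum_i (a_i d_i / n_i) / \sum_i (b_i c_i / n_i)\); it is plain Python with illustrative names, and it does not attempt the confidence interval or the homogeneity test.

```python
# Per stratum: (exposed cases, unexposed cases, exposed controls, unexposed controls),
# where "exposed" means ed2 == 2. Counts copied from "bysort age2: tab case ed2".
strata = {
    "15-24": (57, 13, 167, 37),
    "25-34": (61, 11, 90, 79),
    "35+": (22, 25, 54, 147),
}

# Stratum-specific odds ratios: a*d / (b*c).
stratum_or = {age: (a * d) / (b * c) for age, (a, b, c, d) in strata.items()}

# Mantel-Haenszel pooled odds ratio: weighted sums over strata.
numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
or_mh = numerator / denominator

for age, value in stratum_or.items():
    print(age, round(value, 6))
print("MH", round(or_mh, 6))
```

The pooled value sits between the stratum estimates, which is exactly why it can hide the contrast between the youngest group and the two older groups.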
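118.4 Q4 Religion (rel) and HIV
The first three questions showed how to assess the effect of age on the association of interest, that between education and HIV infection. We now apply the same approach to religion.
Note that the variable rel contains missing values coded as 9, which must first be recoded to Stata's missing value: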
## . recode rel 9=.
## (rel: 1 changes made)
Now cross-tabulate the outcome against religion, with a chi-squared test:
## . tabulate case rel, chi row
##
## +----------------+
## | Key |
## |----------------|
## | frequency |
## | row percentage |
## +----------------+
##
## Case/contr | Religion
## ol | 1 2 3 4 | Total
## -----------+--------------------------------------------+----------
## 0 | 28 228 150 167 | 573
## | 4.89 39.79 26.18 29.14 | 100.00
## -----------+--------------------------------------------+----------
## 1 | 20 93 55 21 | 189
## | 10.58 49.21 29.10 11.11 | 100.00
## -----------+--------------------------------------------+----------
## Total | 48 321 205 188 | 762
## | 6.30 42.13 26.90 24.67 | 100.00
##
## Pearson chi2(3) = 29.4949 Pr = 0.000
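This initial tabulation suggests that religion and HIV infection are associated. Among the cases, the proportion with rel = 4 (other religions) is markedly lower than among the controls (11.1% versus 29.1%).
The ORs within each religion category, and the OR adjusted for religion, are computed as follows: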
. mhodds case ed2, by(rel) c(2,1)
Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1
by rel
-------------------------------------------------------------------------------
rel | Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------+--------------------------------------------------------------------
1 | 2.022222 1.29 0.2562 0.58471 6.99382
2 | 2.252252 7.69 0.0056 1.24857 4.06278
3 | 1.393519 0.79 0.3745 0.66775 2.90811
4 | 2.019724 2.15 0.1425 0.77414 5.26941
-------------------------------------------------------------------------------
Mantel-Haenszel estimate controlling for rel
----------------------------------------------------------------
Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------------------------------------------------------------
1.914248 10.89 0.0010 1.292931 2.834138
----------------------------------------------------------------
Test of homogeneity of ORs (approx): chi2(3) = 1.03
Pr>chi2 = 0.7931
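The stratum-specific ORs for education and HIV do not vary dramatically across the religion categories; most are around 2 (2.02, 2.25, 1.39, 2.02). The test of homogeneity, which assesses interaction, gives p = 0.7931, so there is no evidence against the null hypothesis of no interaction; that is, these data provide no evidence that the education-HIV association differs by religion. The OR adjusted for religion is 1.91, lower than the unadjusted OR of 2.42. When comparing odds ratios, the different calculations should be based on the same subjects and cases, so the crude OR should be recomputed after excluding the one subject whose religion is unknown: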
. mhodds case ed2 if rel!=., c(2,1)
Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1
----------------------------------------------------------------
Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------------------------------------------------------------
2.423963 23.42 0.0000 1.673565 3.510826
----------------------------------------------------------------
The two results can be summarised in a table:
Variable | Cases | Controls | Crude OR (95% CI) | P | Religion adjusted OR (95% CI) | P |
---|---|---|---|---|---|---|
Education: none/adult only | 49 | 263 | 1 | | 1 | |
Education: ≥1 year | 140 | 311 | 2.42 (1.67, 3.50) | <0.001 | 1.91 (1.29, 2.83) | 0.001 |
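The table can then be described briefly as follows:
Without adjustment for any other variable, the data provide strong evidence of an association between education and HIV infection (P < 0.001). This association may be partly explained by confounding by religion. Even after adjusting for religion, however, education remains significantly associated with HIV infection (P = 0.001): the odds of HIV among women with at least 1 year of formal education are nearly twice those among women with no education or adult education only (OR = 1.91).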
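118.5 Q5 Number of sexual partners
Next, examine whether the number of sexual partners, npa, confounds the association between education and HIV infection.
The OR for education and HIV before adjustment and after adjusting for npa can be computed as follows: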
. recode npa 9=.
(npa: 28 changes made)
. mhodds case ed2
Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1
----------------------------------------------------------------
Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------------------------------------------------------------
2.416169 23.25 0.0000 1.668360 3.499168
----------------------------------------------------------------
.
. mhodds case ed2, by(npa)
Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1
by npa
-------------------------------------------------------------------------------
npa | Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------+--------------------------------------------------------------------
1 | 2.378641 3.28 0.0701 0.90428 6.25683
2 | 2.204661 9.72 0.0018 1.32367 3.67200
3 | 3.111429 6.04 0.0139 1.19822 8.07945
4 | 2.698413 2.39 0.1224 0.72666 10.02046
-------------------------------------------------------------------------------
Mantel-Haenszel estimate controlling for npa
----------------------------------------------------------------
Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------------------------------------------------------------
2.416886 21.08 0.0000 1.637939 3.566272
----------------------------------------------------------------
Test of homogeneity of ORs (approx): chi2(3) = 0.42
Pr>chi2 = 0.9353
Note that when computing the OR unadjusted for npa, Stata did not exclude the subjects with missing npa, so they must be excluded by hand and the crude OR recomputed.
. mhodds case ed2 if npa!=.
Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1
----------------------------------------------------------------
Odds Ratio chi2(1) P>chi2 [95% Conf. Interval]
----------------------------------------------------------------
2.311262 20.32 0.0000 1.588409 3.363072
----------------------------------------------------------------
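So the crude OR computed without excluding the subjects with missing npa is 2.42. Once those subjects are correctly excluded, the crude OR is 2.31, and the OR adjusted for npa is 2.42. Comparing the correct crude OR (2.31) with the adjusted OR (2.42), we would conclude that there is a slight confounding effect by npa on the education-HIV association. Comparing instead with the crude OR computed without excluding the missing values (2.42), we might wrongly conclude that there was no confounding effect at all. When crude and adjusted ORs are compared numerically to assess confounding, both must therefore be computed on the same set of subjects.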
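To make the comparison concrete, the relative change from crude to adjusted OR can be computed for both versions of the crude estimate. This is only an illustrative Python sketch using the three estimates printed above; the helper `relative_change` is hypothetical, and no particular threshold for "important" confounding is implied.

```python
# Estimates copied from the mhodds outputs in this section.
adjusted_or = 2.416886          # Mantel-Haenszel estimate controlling for npa
crude_or_all = 2.416169         # crude OR, subjects with missing npa included
crude_or_complete = 2.311262    # crude OR, subjects with missing npa excluded


def relative_change(crude, adjusted):
    """Percentage change in the OR when moving from crude to adjusted."""
    return (adjusted - crude) / crude * 100


change_complete = relative_change(crude_or_complete, adjusted_or)
change_all = relative_change(crude_or_all, adjusted_or)

print(f"same subjects:      {change_complete:.2f}%")
print(f"different subjects: {change_all:.2f}%")
```

Only the first comparison is based on the same subjects in both estimates; the near-zero change in the second is an artefact of comparing estimates from different samples.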
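118.6 Q6 Dose-response relationship
This question examines the dose-response relationship between npa (number of sexual partners) and HIV infection. In these data npa takes four levels: 1 (0-1); 2 (2-4); 3 (5-9); 4 (10-19). Generate a new variable that scores each level by a representative value (roughly the midpoint of its range), so that npa can be treated as a quantitative variable: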
. recode npa 1=0 2=3 3=7 4=15, gen(npa2)
(735 differences between npa and npa2)
.
. tabodds case npa2, or
---------------------------------------------------------------------------
npa2 | Odds Ratio chi2 P>chi2 [95% Conf. Interval]
-------------+-------------------------------------------------------------
0 | 1.000000 . . . .
3 | 2.128092 10.23 0.0014 1.324948 3.418077
7 | 3.087907 16.71 0.0000 1.746757 5.458785
15 | 8.093567 38.05 0.0000 3.665130 17.872716
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(3) = 39.64
Pr>chi2 = 0.0000
Score test for trend of odds: chi2(1) = 38.65
Pr>chi2 = 0.0000
The score test for trend of odds gives p < 0.001, so the data provide strong evidence against the null hypothesis of no linear trend: a linear trend in the log-odds of HIV across npa2 (i.e. the odds of HIV increasing by a constant factor for each unit increase in npa2) describes the data better than no trend.
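The odds ratios and the trend statistic can be checked by hand from the counts of cases and controls at each npa2 score (27, 92, 40, 24 cases and 173, 277, 83, 19 controls, taken from the cross-tabulation of case by npa2 in this section). The Python sketch below assumes the usual score statistic for trend, \(U^2/V\) with a hypergeometric variance; this is an assumption about the formula rather than a statement of what Stata does internally, although it reproduces the reported value to two decimals.

```python
scores = [0, 3, 7, 15]            # npa2 scores
cases = [27, 92, 40, 24]          # case == 1 at each score
controls = [173, 277, 83, 19]     # case == 0 at each score

# Odds ratio at each score relative to the baseline score 0.
baseline_odds = cases[0] / controls[0]
odds_ratios = [(d / h) / baseline_odds for d, h in zip(cases, controls)]

totals = [d + h for d, h in zip(cases, controls)]
N = sum(totals)                   # all subjects with npa recorded
D = sum(cases)                    # all cases

# Score: observed minus expected sum of case scores under no trend.
U = sum(d * x for d, x in zip(cases, scores)) \
    - D / N * sum(n * x for n, x in zip(totals, scores))

# Variance of U (hypergeometric form, hence the N - 1).
sxx = sum(n * x * x for n, x in zip(totals, scores)) \
    - sum(n * x for n, x in zip(totals, scores)) ** 2 / N
V = D * (N - D) / (N * (N - 1)) * sxx

chi2_trend = U ** 2 / V           # referred to chi-squared on 1 df

print([round(v, 6) for v in odds_ratios])
print(round(chi2_trend, 2))
```

The statistic has a single degree of freedom because it tests only the slope of the log-odds across the scores, not differences between arbitrary categories.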
We can also compute the \(\chi^2\) statistic for the association between npa2 and HIV infection:
. tab case npa2, chi
Case/contr | RECODE of npa (Number of sex partners ever)
ol | 0 3 7 15 | Total
-----------+--------------------------------------------+----------
0 | 173 277 83 19 | 552
1 | 27 92 40 24 | 183
-----------+--------------------------------------------+----------
Total | 200 369 123 43 | 735
Pearson chi2(3) = 39.6969 Pr = 0.000
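Using these two \(\chi^2\) statistics and their degrees of freedom, we can test a different null hypothesis: "the relationship between npa2 and the log-odds of HIV is linear."
The difference between the two \(\chi^2\) statistics is 39.70 - 38.65 = 1.05, and the difference in degrees of freedom is 3 - 1 = 2, so the p-value is: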
. display chiprob(2, 1.05)
.59155536
This gives a p-value of 0.59, so there is no evidence against this null hypothesis, and the linear relationship is an adequate description. (There is no evidence of departure from linear trend between the score of npa2 and log-odds of HIV.)
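For 2 degrees of freedom the upper tail of the chi-squared distribution has the closed form \(P(X > x) = e^{-x/2}\), so this p-value can be verified without statistical software. A minimal Python sketch; the function name `chi2_upper_tail_2df` is made up for illustration and the function is valid for 2 degrees of freedom only:

```python
import math


def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-squared variable with exactly 2 df."""
    return math.exp(-x / 2)


# Difference between the overall chi-squared (3 df) and the trend chi-squared (1 df).
departure_chi2 = 39.70 - 38.65
p_departure = chi2_upper_tail_2df(departure_chi2)

print(round(p_departure, 4))
```

A large p-value here means the three extra degrees of freedom of the categorical comparison add little beyond the single linear trend term.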