第 118 章 (無匹配的)病例對照研究的分析方法 analysis of unmatched case-control studies

本次練習使用的數據是在坦桑尼亞實施的病例對照研究,數據名是 mwanza.dta。你可以使用 help mwanza 來進一步瞭解這個研究和數據的各個變量。

118.1 Q1 數據讀入

按照要求生成兩個變量:

  1. ed2: 1 = 表示未接受過教育; 2 = 接受過1年以上的正規教育。
  2. age2: 1 = 15-24; 2 = 25-34; 3 = 35+ 歲
## 
## . cd "~/Downloads/LSHTMlearningnote/backupfile/Users/chaoimacmini/Downloads/LSHTMlearningnote/backupfiles
## 
## . use mwanza
## 
## . 
## . * create a new variable for education
## 
## . generate ed2 = ed
## 
## . recode ed2 3/4 = 2
## (ed2: 376 changes made)
## 
## . label define ed2label 1 "none/adult only" 2 ">=1 years"
## 
## . label val ed2 ed2label
## 
## . label var ed2 "education"
## 
## . 
## . * check the recoding worked as wanted
## 
## . tabulate ed2 ed
## 
##                 |                  Education
##       education |         1          2          3          4 |     Total
## ----------------+--------------------------------------------+----------
## none/adult only |       312          0          0          0 |       312 
##       >=1 years |         0         75        365         11 |       451 
## ----------------+--------------------------------------------+----------
##           Total |       312         75        365         11 |       763 
## 
## . 
## . 
## . * similarly for age 
## 
## . 
## . generate age2 = age1
## 
## . recode age2 2 = 1 3/4 = 2 5/6 = 3
## (age2: 654 changes made)
## 
## . label define age2label 1 "15-24" 2 "25-34" 3 "35+"
## 
## . label val age2 age2label
## 
## . label var age2 "Age"
## 
## . tabulate age2 age1
## 
##            |                       Age group
##        Age |         1          2          3          4          5 |     Total
## -----------+-------------------------------------------------------+----------
##      15-24 |       109        165          0          0          0 |       274 
##      25-34 |         0          0        123        118          0 |       241 
##        35+ |         0          0          0          0        137 |       248 
## -----------+-------------------------------------------------------+----------
##      Total |       109        165        123        118        137 |       763 
## 
## 
##            | Age group
##        Age |         6 |     Total
## -----------+-----------+----------
##      15-24 |         0 |       274 
##      25-34 |         0 |       241 
##        35+ |       111 |       248 
## -----------+-----------+----------
##      Total |       111 |       763 
## 
## .
help mwanza
Case control study of risk factors for HIV in women, Mwanza Tanzania  
     
As part of a prospective study of the impact of STD control on the incidence
of HIV infection in Mwanza, Tanzania, a baseline survey of HIV prevalence was
carried out in 12 communities. All seropositive women (15 years and above)
were revisited and, where possible) interviewed about potential risk factors
for HIV infection using a standard questionnaire. In addition to interviewing
HIV +ve women, a random sample of HIV -ve women were selected from the
population lists prepared during the baseline survey and these women were also
revisited and, where possible, interviewed. No matching of controls with cases
was performed.

     
   idno         identity number  
   comp         community 1-12  
   case         1=case; 0=control  
   age1         age group: 1=15-19 2=20-24 3=25-29               
                             4=30-34 5=35-44 6=45-54  
   ed           education: 1=none/adult only  2=1-3 years
                             3=4-6 years  4=7+ years
   eth          ethnic group: 1=Sukuma 2=Mkara 3=other 9=missing        
   rel          religion: 1=Moslem 2=Catholic 3=Protestant 4=other 9=missing
   msta         marital status: 1=currently married 2=divorced/widowed
                              3=never married 9=missing
   bld          blood transfusion in last 5 years: 1=no 2=yes 9=missing
   inj          injections in past 1 year: 1=none 2=1 3=2-4 4=5-9 5=10+   
                              9=missing
   skin         skin incisions or tattoos: 1=no 2=yes 9=missing
   fsex         age at first sex: 1=<15 2=15-19 3=20+ 4=never 9=missing
   npa          number of sexual partners ever: 1=0-1 2=2-4 3=5-9 4=10-19 
                              5=20-49 6=50+ 9=missing
   pa1          sex partners in last year: 1=none 2=1 3=2 4=3-4      
                              9=missing
   usedc        ever used a condom: 1=no 2=yes 9=missing
   ud           genital ulcer or discharge in past year: 1=no 2=yes       
                              9=missing
   ark          perceived risk of HIV/AIDS: 1=none/slight 2=quite likely
                              3=very likely/already infected 4=don't know
   srk          perceived risk of STDs:  1=none/slight 2=quite likely
                              3=very likely/already infected 4=don't know

118.2 Q2 計算粗比值比

以受教育程度爲預測變量,HIV患病與否作爲結果變量,計算粗比值比 crude odds ratio (OR)。 先獲取這倆個變量之間簡單的 \(2 \times 2\) 表格,對他們可能有的關係有個大概的印象:

## .  tabulate case ed2, row
## 
## +----------------+
## | Key            |
## |----------------|
## |   frequency    |
## | row percentage |
## +----------------+
## 
## Case/contr |       education
##         ol | none/adul  >=1 years |     Total
## -----------+----------------------+----------
##          0 |       263        311 |       574 
##            |     45.82      54.18 |    100.00 
## -----------+----------------------+----------
##          1 |        49        140 |       189 
##            |     25.93      74.07 |    100.00 
## -----------+----------------------+----------
##      Total |       312        451 |       763 
##            |     40.89      59.11 |    100.00 
## 

進一步計算 Crude OR,分別計算拿不同的教育水平作爲參照 (baseline) 時獲得的粗比值比:

## .  mhodds case ed2, c(1, 2)
## 
## Maximum likelihood estimate of the odds ratio
## Comparing ed2==1 vs. ed2==2
## 
##     ----------------------------------------------------------------
##      Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
##     ----------------------------------------------------------------
##        0.413878      23.25        0.0000         0.285782   0.599391
##     ----------------------------------------------------------------

## .  mhodds case ed2, c(2, 1)
## 
## Maximum likelihood estimate of the odds ratio
## Comparing ed2==2 vs. ed2==1
## 
##     ----------------------------------------------------------------
##      Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
##     ----------------------------------------------------------------
##        2.416169      23.25        0.0000         1.668360   3.499168
##     ----------------------------------------------------------------

計算該表格的卡方值 \(\chi^2\),並檢驗這兩個二進制型變量之間是否存在相關性:


## . tab case ed2, chi exact

## Case/contr |       education
##         ol | none/adul  >=1 years |     Total
## -----------+----------------------+----------
##          0 |       263        311 |       574 
##          1 |        49        140 |       189 
## -----------+----------------------+----------
##      Total |       312        451 |       763 

##           Pearson chi2(1) =  23.2789   Pr = 0.000
##            Fisher's exact =                 0.000
##    1-sided Fisher's exact =                 0.000

其實這裏我們並不需要用到 Fisher’s Exact 檢驗方法,因爲四個空格裏最小的期望值是 \(\frac{312 \times 189}{763} = 77.3 > 5\)。所以無論你用哪個檢驗方法都會得出相同的結論,也就是數據給出的證據反對該命題的零假設,也就是兩個二進制變量之間無關。

118.3 Q3 年齡的混雜或者交互 confounding or effect-mnodifier

我們希望分析數據來理解受教育水平和是否患有HIV之間的關係受到年齡怎樣的影響。我們先通過不同年齡階層內,教育和HIV之間的關係來看:

## . bysort age2: tab case ed2, row
## 
## ---------------------------------------------------------------------------------------------------
## -> age2 = 15-24
## 
## +----------------+
## | Key            |
## |----------------|
## |   frequency    |
## | row percentage |
## +----------------+
## 
## Case/contr |       education
##         ol | none/adul  >=1 years |     Total
## -----------+----------------------+----------
##          0 |        37        167 |       204 
##            |     18.14      81.86 |    100.00 
## -----------+----------------------+----------
##          1 |        13         57 |        70 
##            |     18.57      81.43 |    100.00 
## -----------+----------------------+----------
##      Total |        50        224 |       274 
##            |     18.25      81.75 |    100.00 
## 
## ---------------------------------------------------------------------------------------------------
## -> age2 = 25-34
## 
## +----------------+
## | Key            |
## |----------------|
## |   frequency    |
## | row percentage |
## +----------------+
## 
## Case/contr |       education
##         ol | none/adul  >=1 years |     Total
## -----------+----------------------+----------
##          0 |        79         90 |       169 
##            |     46.75      53.25 |    100.00 
## -----------+----------------------+----------
##          1 |        11         61 |        72 
##            |     15.28      84.72 |    100.00 
## -----------+----------------------+----------
##      Total |        90        151 |       241 
##            |     37.34      62.66 |    100.00 
## 
## ---------------------------------------------------------------------------------------------------
## -> age2 = 35+
## 
## +----------------+
## | Key            |
## |----------------|
## |   frequency    |
## | row percentage |
## +----------------+
## 
## Case/contr |       education
##         ol | none/adul  >=1 years |     Total
## -----------+----------------------+----------
##          0 |       147         54 |       201 
##            |     73.13      26.87 |    100.00 
## -----------+----------------------+----------
##          1 |        25         22 |        47 
##            |     53.19      46.81 |    100.00 
## -----------+----------------------+----------
##      Total |       172         76 |       248 
##            |     69.35      30.65 |    100.00 

我們可以使用 by(age2) 選項來計算年齡調整之後的比值比,評價受教育水平和是否患有HIV之間的關係:


## . mhodds case ed2, by(age2)
## 
## Maximum likelihood estimate of the odds ratio
## Comparing ed2==2 vs. ed2==1
## by age2
## 
## -------------------------------------------------------------------------------
##      age2 | Odds Ratio        chi2(1)         P>chi2       [95% Conf. Interval]
## ----------+--------------------------------------------------------------------
##     15-24 |   0.971442           0.01         0.9354         0.48188    1.95837
##     25-34 |   4.867677          21.28         0.0000         2.31121   10.25188
##       35+ |   2.395556           7.10         0.0077         1.23412    4.65001
## -------------------------------------------------------------------------------
## 
##     Mantel-Haenszel estimate controlling for age2
##     ----------------------------------------------------------------
##      Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
##     ----------------------------------------------------------------
##        2.330972      19.46        0.0000         1.582460   3.433536
##     ----------------------------------------------------------------
## 
## Test of homogeneity of ORs (approx): chi2(2)   =   10.31
##                                      Pr>chi2   =  0.0058

可以清楚的看見,當調整了年齡之後,比較受過1年以上正規教育的人,未受過教育的人患有HIV的比值比 OR 從 2.42 變成了 2.33 (95%CI: 1.58, 3.43)。也就是說數據支持受教育水平和是否患有HIV之間有很強的相關性。但是我們關注到最後一行進行交互作用檢驗部分給出的結果:

## Test of homogeneity of ORs (approx): chi2(2)   =   10.31
##                                      Pr>chi2   =  0.0058

也就是數據同樣發現的一點是,評價教育水平, HIV之間關係的 OR 在不同的年齡分層之間有顯著的不同 (the association between education and HIV infection varies with age group)。而且分層的 OR 值中我們看見教育和HIV患病之間並無關係。如果你也認爲,年齡對教育水平和HIV之間的關係造成的是交互作用的話,那麼我們就必須摒棄年齡調整之後的OR值,轉而應該報告每個年齡層的OR值。

118.4 Q4 宗教信仰 religion rel 和HIV之間的關係

前面三個問題具體展示了我們應該如何分析並且理解“年齡”對我們關心的“教育水平和HIV患病與否之間的關係”這一命題的影響。接下來我們嘗試用類似的方法來分析“宗教信仰”這一變量。

值得注意的是,宗教信仰 rel 這個變量裏存在編碼成 9 的缺失值 (missing value)。

## . recode rel 9=.
## (rel: 1 changes made)

這裏再對結果變量和宗教信仰兩個變量之間製作卡方表格:

## . tabulate case rel, chi row
## 
## +----------------+
## | Key            |
## |----------------|
## |   frequency    |
## | row percentage |
## +----------------+
## 
## Case/contr |                  Religion
##         ol |         1          2          3          4 |     Total
## -----------+--------------------------------------------+----------
##          0 |        28        228        150        167 |       573 
##            |      4.89      39.79      26.18      29.14 |    100.00 
## -----------+--------------------------------------------+----------
##          1 |        20         93         55         21 |       189 
##            |     10.58      49.21      29.10      11.11 |    100.00 
## -----------+--------------------------------------------+----------
##      Total |        48        321        205        188 |       762 
##            |      6.30      42.13      26.90      24.67 |    100.00 
## 
##           Pearson chi2(3) =  29.4949   Pr = 0.000

初步表格總結發現宗教信仰和是否患有HIV應該是存在關聯性的。在病例中,宗教信仰 rel = 4 也就是其他信仰的人明顯比例較低。

計算不同宗教信仰層級內的比值比和調整了宗教信仰之後的比值比的過程如下:

. mhodds case ed2, by(rel) c(2,1)

Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1
by rel

-------------------------------------------------------------------------------
      rel | Odds Ratio        chi2(1)         P>chi2       [95% Conf. Interval]
----------+--------------------------------------------------------------------
        1 |   2.022222           1.29         0.2562         0.58471    6.99382
        2 |   2.252252           7.69         0.0056         1.24857    4.06278
        3 |   1.393519           0.79         0.3745         0.66775    2.90811
        4 |   2.019724           2.15         0.1425         0.77414    5.26941
-------------------------------------------------------------------------------

    Mantel-Haenszel estimate controlling for rel
    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       1.914248      10.89        0.0010         1.292931   2.834138
    ----------------------------------------------------------------

Test of homogeneity of ORs (approx): chi2(3)   =    1.03
                                     Pr>chi2   =  0.7931

可以看見各個宗教信仰層級內評估受教育水平和是否患有HIV的OR值沒有劇烈的變化,基本都在2左右 (2.02, 2.25, 1.39, 2.02)。而且評價交互作用的檢驗同質性結果的 p 值是 0.7931,也就是並無證據反對無交互作用的零假設,也就是說,這一數據無法提供證據證明受教育水平和是否患有HIV之間的關係會由於宗教信仰而有顯著差別。調整了宗教信仰變量之後的比值比變成 1.91,小於未調整宗教信仰時的比值比 2.42。值得注意的是,在比較比值比計算結果的時候,我們應該確保不同計算過程中使用的人數和病例數是相同的,所以這裏計算粗比值比應該把宗教信仰爲未知的那名對象從數據中排除之後重新計算:

. mhodds case ed2 if rel!=., c(2,1)

Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1

    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       2.423963      23.42        0.0000         1.673565   3.510826
    ----------------------------------------------------------------

這裏我們可以爲這兩個結果做一個總結性的表格:

Variable Cases Controls Crude OR
(95% CI)
P Religion adjusted OR
(95% CI)
P
Education
None/adult only
≥1 year

49
140

263
311

1
2.42 (1.67, 3.50)


<0.001

1
1.91 (1.29, 2.83)


0.001

然後對這個表格的描述可以簡單表達爲:

在未進行任何變量調整的情況下,該數據的計算結果提供了很強的關於受教育水平和是否患有HIV這二者之間關係的證據 (P < 0.001)。這一相關性有可能可以部分由宗教信仰對這一關係的混雜效應解釋。但是即使是調整了宗教信仰之後,受教育水平依然和是否患有HIV有顯著的相關性。具體地說,接受過一年以上正規教育的人比未曾接受過任何教育或者只有成人教育的人患有HIV的比值 (Odds) 要高將近兩倍 (OR = 1.91)。

118.5 Q5 性伴侶人數

接下來通過分析來理解 “性伴侶人數 npa” 這個變量是否是受教育水平和HIV患病之間的關係的混雜因子 (confounder)。

受教育水平和HIV患病之間的未調整前OR,和調整 npa 之後的 OR 可以通過下面的代碼計算:

. recode npa 9=.
(npa: 28 changes made)

 
. mhodds case ed2

Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1

    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       2.416169      23.25        0.0000         1.668360   3.499168
    ----------------------------------------------------------------

. 
. mhodds case ed2, by(npa)

Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1
by npa

-------------------------------------------------------------------------------
      npa | Odds Ratio        chi2(1)         P>chi2       [95% Conf. Interval]
----------+--------------------------------------------------------------------
        1 |   2.378641           3.28         0.0701         0.90428    6.25683
        2 |   2.204661           9.72         0.0018         1.32367    3.67200
        3 |   3.111429           6.04         0.0139         1.19822    8.07945
        4 |   2.698413           2.39         0.1224         0.72666   10.02046
-------------------------------------------------------------------------------

    Mantel-Haenszel estimate controlling for npa
    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       2.416886      21.08        0.0000         1.637939   3.566272
    ----------------------------------------------------------------

Test of homogeneity of ORs (approx): chi2(3)   =    0.42
                                     Pr>chi2   =  0.9353

值得注意的是,計算未調整 npa 時的 OR 的過程中,Stata 並未排除掉 npa 裏存在缺失值的對象,所以,我們需要人爲重新把他們排除,再次計算粗比值比。

. mhodds case ed2 if npa!=.

Maximum likelihood estimate of the odds ratio
Comparing ed2==2 vs. ed2==1

    ----------------------------------------------------------------
     Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
    ----------------------------------------------------------------
       2.311262      20.32        0.0000         1.588409   3.363072
    ----------------------------------------------------------------

所以,我們發現,當忘記排除掉含有 npa 缺失值對象時計算的粗比值比 OR = 2.42。如果正確地排除掉含有 npa 缺失值的對象之後,粗比值比 OR = 2.31,調整 npa 之後的比值比 OR = 2.42。所以在比較正確的粗 OR (2.31) 和調整後 OR (2.42),的時候,我們會做出“npa對教育水平和HIV患病之間的關係有微弱的混雜作用 slight confounding effect by npa” 的結論和判斷。但是如果錯誤地去和未排除缺失值時計算的粗 OR (2.42) 做比較的話,我們可能就會得出 “npa 對教育水平和HIV患病之間的關係一點混雜作用都沒有 there was no confounding effect at all”。所以,進行粗比值比和調整後比值比數值上比較從而理解是否有混雜效應時,需要注意的一點是計算時使用的對象(人數)必須保持一致。

118.6 Q6 分析劑量-反應關係 dose-response relationship

這題我們來嘗試分析 npa (性伴侶人數) 和是否患有 HIV 之間的劑量-反應關係。npa 本身有四個分層等級: 1 (none/1); 2 (2-4); 3 (5-9); 4 (10-19)。我們需要計算生成一個新變量,用上述不同分層等級各自的“中位數”來當作 npa 的連續變量:

. recode npa 1=0 2=3 3=7 4=15, gen(npa2)
(735 differences between npa and npa2)

. 
. tabodds case npa2, or

---------------------------------------------------------------------------
        npa2 |  Odds Ratio       chi2       P>chi2     [95% Conf. Interval]
-------------+-------------------------------------------------------------
           0 |    1.000000          .           .              .          .
           3 |    2.128092      10.23       0.0014      1.324948   3.418077
           7 |    3.087907      16.71       0.0000      1.746757   5.458785
          15 |    8.093567      38.05       0.0000      3.665130  17.872716
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(3)  =    39.64
                                  Pr>chi2  =   0.0000

Score test for trend of odds:     chi2(1)  =    38.65
                                  Pr>chi2  =   0.0000

這裏計算獲得的 Score test for trend of odds 的 p值 < 0.001,也就是此次數據分析的結果提供證據使我們認爲使用線性關係 (linear trend) 來解釋 npa2 和 HIV 患病與否的對數比值 log-odds (ie. the odds of HIV increasing by a constant factor for each unit increase in npa2) 之間的關係更優於無線性關係 (零假設)。

我們還可以計算 npa2 和 HIV患病與否之間的 \(\chi^2\):

. tab case npa2, chi

Case/contr | RECODE of npa (Number of sex partners ever)
        ol |         0          3          7         15 |     Total
-----------+--------------------------------------------+----------
         0 |       173        277         83         19 |       552 
         1 |        27         92         40         24 |       183 
-----------+--------------------------------------------+----------
     Total |       200        369        123         43 |       735 

          Pearson chi2(3) =  39.6969   Pr = 0.000

然後利用這兩個 \(\chi^2\) 和各自的自由度,我們可以檢驗另一個零假設: “npa2 和HIV患病與否的對數比值 log-odds 之間的關係是線性的。”

方差之差: 39.70 - 38.65 = 1.05,自由度差:3-1 = 2,所以 p 值是:

. display chiprob(2, 1.05)
.59155536

獲得了一個等於 0.59 的 p 值。所以我們可以認爲無證據拒絕這次的零假設 - 線性關係成立。(There is no evidence of departure from linear trend between the score of npa2 and log-odds of HIV.)