第 65 章 分析策略

65.1 明確分析目的

作爲統計學家,着手分析數據之前,千萬記得,必須要制定一個儘可能詳盡的分析計劃。即使你的分析,可能並不一定受到第三方的監管或者調控,因爲同行評審的專家們,喜歡看到你分析的目的明確,假設檢驗的過程是經過仔細推敲的。同時,也可以避免陷入 “玩弄數據 (data dredging)” 指控的危險。

數據分析的目的,可以分成三大類:

  1. 估計一個或者幾個暴露變量,對結果變量的影響。以此目的的數據分析過程,需要我們有醫學偵探一樣的眼光和見解,從數據中判斷那些需要被調整和控制的混雜因子,從而提高你的分析效率。最常見的例子是分析隨機對照臨牀實驗 (RCT) 中,療效的差異;或者流行病學研究中,分析某種生活習慣,和疾病的發生或者死亡之間的關係。
  2. 在現有的數據庫中,尋找並且建立 “最佳” 模型。以此目的的數據分析,需要我們對模型中的結果變量有極爲深入的瞭解,把與之相關的所有要因,儘可能多的納入你的分析模型。常見的例子如,在某個特定人羣的數據庫中尋找並確定能夠決定自殺這一結果變量的決定性因素,之所以有這樣的目的,背後可能有決策者希望尋找這些決定性因素後採取一些對策從而達到改善現狀的最終目的。所以找到和結果變量相關的因素,是此類研究的重中之重。
  3. 建立預測模型。例如,某項研究的目的是爲了能夠建立一個能夠預測孕期胎兒患有唐氏綜合症的預測模型,用能夠測量的一些指標(如血液指標,或者母親的一些健康指標),通過模型的算法,去計算胎兒患病的概率是多大。這樣的模型,對與診斷醫學有重大意義。所以,此類研究的目的,不是爲了尋找確定和胎兒患病相關的全部要因,而是怎樣才能提高模型預測的準確度,提高診斷的效率,減少錯誤診斷,拯救生命。

當然,上述目的中的 2 和 3 有時候易讓人混淆,因爲我們可能建立最佳模型,除了想要找到和 “自殺” 這一結果相關的所有要因,還可能希望通過該模型做出預測,尋找可能自殺的高危人羣,進行干預。這並不矛盾。

65.2 分析目的 1.1 – 估計 RCT 中治療效果 (treatment effect)

先揀最軟的柿子捏,RCT 的療效比較作爲數據分析的目的時,情況要比其他的目的相對簡單些。RCT 的隨機過程,確保了臨牀試驗不會受到混雜因素的影響。但是我們還會出於爲了提高統計分析效能改善估計的精確度的目的,對參與臨牀試驗的受試者最初測量的一些特徵進行調整。當然,不是所有的數據專家,也不是所有的 RCT 實施者都同意進行這一調整的。如果確定要調整,放入模型中的變量,可能常常是一開始隨機分配時用到的那些用於將受試者分層歸類或者最小化 (minimisation) 的那些變量。

基線值調整 (baseline adjustment),在結果變量爲連續型,同時模型是線性迴歸模型時,能夠顯著提高統計效能 (statistical efficiency),降低估計值的標準誤。理論上,一個基線測量時的連續型變量,如果它和實驗後測量的連續型結果變量之間的 皮爾森相關係數 Pearson correlation coefficient\(r\),那麼如果你用 ANCOVA 模型調整了這個基線值的話,療效差異估計值的標準誤會是沒有調整時的 \(\sqrt{1-r^2}\) 倍 (也就是永遠比不調整時要小,大大提高精確度,縮小療效差異估計值的 95% 信賴區間)。

但是,但是,但是!如果一個 RCT 測量的結果變量是一個二進制變量 (死亡/存活),線性迴歸模型不適用,只能使用邏輯迴歸時,模型中加入和結果變量相關 (和暴露變量無關) 的基線值的做法对分析效能的提高顯得十分有限,相反還会受到邏輯迴歸的不可壓縮性較大的影響 (Section 63.3.2)。

再把之前講邏輯迴歸不可壓縮性時用过的例子拿过来这里解释这个现象:

表 51.1: Non-collapsibility of logit link in GLM (stratified data)
Strata 1 Strata 2
Outcome Drug Placebo Drug Placebo
Success 90 50 50 10
Failure 10 50 50 90
Total 100 100 100 100
Odds Ratios 9 9

上面的數據表示,分層變量 (Strata 1-2) 本身和使用藥物和安慰劑無交互作用,也和藥物使用與臨牀試驗結果之間的關係無關。但是,即使這個分類變量無關,壓縮後的數據計算獲得的比值比和分層時的比值比差異巨大:

表 51.2: Non-collapsibility of logit link in GLM (collapsed data)
Outcome Drug Placebo
Success 140 60
Failure 60 140
Total 200 200
Odds ratio 5.4

實際在 R 裏擬合邏輯迴歸模型的結果如下:

## 
## Call:
## glm(formula = Result ~ Treatment, family = binomial(link = logit), 
##     data = RCT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5518  -0.8446   0.0000   0.8446   1.5518  
## 
## Coefficients:
##                  Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)       0.84730    0.15430  5.4911 3.994e-08 ***
## TreatmentPlacebo -1.69460    0.21822 -7.7656 8.125e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.518  on 399  degrees of freedom
## Residual deviance: 488.691  on 398  degrees of freedom
## AIC: 492.691
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = Result ~ Treatment + Strata, family = binomial(link = logit), 
##     data = RCT)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1460  -1.1774   0.0000   1.1774   2.1460  
## 
## Coefficients:
##                  Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)       2.19722    0.26505  8.2898 < 2.2e-16 ***
## TreatmentPlacebo -2.19722    0.27486 -7.9941 1.306e-15 ***
## Strata2          -2.19722    0.27486 -7.9941 1.306e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.518  on 399  degrees of freedom
## Residual deviance: 407.292  on 397  degrees of freedom
## AIC: 413.292
## 
## Number of Fisher Scoring iterations: 4

從結果的迴歸係數估計和計算的標準誤來看,調整了其他的變量會引起:

  1. 使對數比值比的估計量升高 (這是由於模型的不可壓縮性) \(1.69 \rightarrow 2.19\)
  2. 對數比值比的標準誤估計升高 (非但不能增加估計精確度,反而起到了反作用) \(0.22\rightarrow0.27\)
  3. 對數比值比的統計檢驗量升高 (由於對數比值比的升高比標準誤升高的更多一些) \(7.77\rightarrow7.99\)

事實上,上面的現象在使用邏輯迴歸的時候基本上都會呈現。在經典論文 (L. D. Robinson and Jewell 1991) 中給出了詳細的論證。所以其實使用邏輯迴歸擬合數據的 RCT 臨牀試驗,我們可以推論,當模型中加入第三個僅和結果變量有關的基線共變量 (baseline covariates),如果模型估計的對數比值比在調整前後變化不大 (即,不可壓縮性造成的影響很小),那這樣的調整對於改善分析的統計效能上幾乎也沒有貢獻。(跟使用線性迴歸的 RCT 完全不同!)

由於邏輯迴歸受使用 \(\text{logit}\) 鏈接方程時不可壓縮性的侷限,同時還由於使用 \(\text{log}\) 鏈接方程時獲得的危險度比 (risk ratios) 比比值比 (odds ratios) 更加容易讓人理解,結果變量爲二進制的 RCT 臨牀試驗常常會選用 \(\text{log}\) 鏈接方程的廣義線性迴歸模型 (見 Section 59.3 第 5 條討論)。選用 \(\text{log}\) 鏈接方程的 GLM 最大的問題在於,當模型中加入過多的預測變量時,會導致模型無法收斂 (converge)–無法找到極大似然估計

至於使用泊松迴歸模型的時候,預測變量如果放入不合理,那麼很容易違反泊松分佈的前提 (方差和均值相同)。對於違反了泊松分佈前提,模型變得過度離散 (over-dispersed) 的 GLM,加入適當的基線共變量 (baseline covariates) 則有助於減少模型的過度離散,減小參數估計的標準誤 (使之變得更精確些)。和線性迴歸相同的是,泊松迴歸模型不受不可壓縮性 (non-collapsibility) 的影響。

65.2.1 RCT 數據分析的一些不成熟的小建議

  1. RCT 臨牀試驗通常都有嚴格的數據管理和監控,且統計分析計劃 (statistical analysis plan, SAP) 在任何一個 RCT 都已經是必須條件。除此之外,還要在試驗進行前就制訂所有詳細的計劃,並寫成實驗實施計劃文件,以供參與的所有人及倫理審查委員會等各種第三方機構的監督。所以,RCT 的統計分析計劃必須儘量考慮到所有的可能情況,因爲一旦開始了試驗,分析計劃是很難改動的。
  2. SAP 必須詳細記錄哪些共變量需要被調整,常見的是實驗設計階段用於實施隨機化過程的那些特徵變量。對於連續型結果變量,(還有過度離散的計數型變量),基線共變量的調整許多時候會有助於改善參數估計的精確度,提高統計效能。對於使用邏輯迴歸模型的試驗,調整基線共變量則沒有太多的好處,且調整後的比值比的含義會發生較大的改變,需慎重。
  3. 有些統計學家支持調整基線共變量,認爲這樣做有助於減少萬一隨機化不徹底造成的治療組和對照組之間隨機產生的殘差偏倚 (residual bias),但是你無法提前欲知那些變量可能會產生隨機的殘差偏倚,這樣便無法在事先需要準備的SAP計劃文件中明確到底哪些基線變量需要被調整。
  4. 另有許多研究者喜歡在 RCT 中尋找交互作用的存在,但是他們常常忽略掉的一點是,一個 RCT 本身的檢驗效能是 80%-90%,其用於檢驗交互作用的效能會更低。建議在 RCT 中儘量少 (甚至不建議) 進行任何交互作用的統計檢驗。

65.3 分析目的 1.2 – 估計流行病學研究中暴露變量和結果變量的關係 (exposure effect)

前文討論的關於調整僅僅和結果變量相關 (與暴露變量無關) 的基線共變量的內容,同樣適用與一般的流行病學研究。流行病學研究中另一個 (應該是更加) 重要的點是,混雜因子的排查和調整。

實例:

  • \(Y\) 標記結果變量,如嬰兒的出生體重;
  • \(X_1\) 標記最主要的 (想要分析其與結果變量之間的關係的) 預測變量,如母親孕期高血壓 (是/否);
  • \(X_2, X_3, \cdots, X_Q\) 標記其他非主要預測變量,但是可能是 \(X_1, Y\) 之間關係中重要的潛在混雜因子,如嬰兒的性別/母親孕前體重/嬰兒胎齡等等。

在這個簡單流行病學研究實例中,我們關心的問題包括:

  1. 主要暴露變量–孕期高血壓,和結果變量–嬰兒出生體重二者的未调整前 (粗) 關係 (crude/before adjustment association) 是什麼樣的?
  2. 主要暴露變量和結果變量之間的關係是否被其他因素影響 (例如胎齡)?如果有,那麼調整後的關係會發生怎樣的變化?
  3. 有沒有其他的變量會改變 (modify) 主要暴露變量和結果變量之間的關係?也就是,有沒有那個變量和主要暴露變量有交互作用?
  4. 有沒有其他的變量和主要暴露變量無關,卻可能和結果變量有關係呢?如果存在這樣的變量,模型中調整它在一些情況下可能會改善擬合的結果提高模型的統計效能 (statistical power)。
  5. 收集的變量中,有沒有哪個變量可能是在主要暴露變量和結果變量之間因果關係 (如果存在因果關係的話) 的通路上 (on the causal pathway) 的呢?如果有,這樣的變量應該被認爲是媒介因子 (mediator)。

65.3.1 不成熟的小策略

這是很常見的簡單流行病學數據分析。可以按照 (但不一定非要按照) 下面建議的步驟實施統計分析:

  1. 第一步,分析主要暴露變量和結果變量之間的未調整前 (粗) 關係:
    \[g\{ E(Y|X_1) \} = \alpha + \beta_1 X_1\]
  2. 第二步,逐個分析其餘的變量和主要暴露變量之間的關係,以及這些潛在的混雜因子和結果變量之間的關係。注意,這一步可能耗時較長,但是它並不是決定模型中是否要加入某個或某些非主要暴露變量的步驟,通過這一步過程有助於我們分析和理解,進一步分析中調整前後的參數估計變化
  3. 第三步,建立主要暴露變量和這些潛在混雜因子同時放入模型中的 GLM,逐步放入,一次放入一個 (one at a time) 潛在混雜因子,和上一步分析的三者之間的關係相結合,分析調整該潛在混雜因子前後,主要暴露變量的迴歸係數的參數估計變化的原因。
    \[g\{ E(Y|X_1, X_k) \} = \alpha^* + \beta_1^*X_1 + \beta_kX_k,\; k= 1,\cdots,Q\]

我們來分析這個可以從 Stata 網站上下載的數據

  • 第一步,先看看暴露變量和結果變量之間的關係
lbw <- read_dta("http://www.stata-press.com/data/r12/lbw.dta")
lbw$race <- factor(lbw$race)
lbw$smoke <- factor(lbw$smoke)
lbw$ht <- factor(lbw$ht)
a <- Epi::stat.table(list("Birthweight <2500g" = low, "History of hypertension"=ht), list(count(),percent(low)), data = lbw, margins = TRUE)
# We first tabulate the data
print(a, digits = c(percent = 2))
##  -------------------------------------- 
##               -History of hypertension- 
##  Birthweight         0       1   Total  
##  <2500g                                 
##  -------------------------------------- 
##  0                 125       5     130  
##                  70.62   41.67   68.78  
##                                         
##  1                  52       7      59  
##                  29.38   58.33   31.22  
##                                         
##                                         
##  Total             177      12     189  
##                 100.00  100.00  100.00  
##  --------------------------------------
  • 第二步,分析母親高血壓病史和嬰兒低出生體重之間的調整前 (粗) 關係。
Model0 <- glm(low~ht, data = lbw, family = binomial(link = "logit"))
summary(Model0); epiDisplay::logistic.display(Model0)
## 
## Call:
## glm(formula = low ~ ht, family = binomial(link = "logit"), data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.32323  -0.83407  -0.83407   1.56519   1.56519  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -0.87707    0.16502 -5.3150 1.066e-07 ***
## ht1          1.21354    0.60835  1.9948   0.04606 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 230.650  on 187  degrees of freedom
## AIC: 234.65
## 
## Number of Fisher Scoring iterations: 4
## 
## Logistic regression predicting low 
##  
##            OR(95%CI)          P(Wald's test) P(LR-test)
## ht: 1 vs 0 3.37 (1.02,11.09)  0.046          0.045     
##                                                        
## Log-likelihood = -115.3249
## No. of observations = 189
## AIC value = 234.6499

所以,數據提供了一些證據證明母親的高血壓病史和嬰兒低出生體重之間可能存在正關係,這個調整前的關係是,粗比值比 (crude odds ratio) 爲 3.37 (1.02, 11.09)。

  • 接下來,分析潛在的混雜因子是否和主要暴露變量相關:
# lwt is the last weight of mothers before pregnancy
Model1 <- lm(lwt ~ ht, data = lbw)
summary(Model1); epiDisplay::regress.display(Model1)
## 
## Call:
## lm(formula = lwt ~ ht, data = lbw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -62.5000 -17.9435  -7.9435  10.0565 122.0565 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 127.9435     2.2390 57.1426  < 2e-16 ***
## ht1          29.5565     8.8858  3.3262  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.788 on 187 degrees of freedom
## Multiple R-squared:  0.05586,    Adjusted R-squared:  0.050811 
## F-statistic: 11.064 on 1 and 187 DF,  p-value: 0.0010596
## Linear regression predicting lwt
##  
##            Coeff.(95%CI)        P(t-test) P(F-test)
## ht: 1 vs 0 29.56 (12.03,47.09)  0.001     0.001    
##                                                    
## No. of observations = 189

可見,有高血壓病史的母親,孕前體重較高。再看其與結果變量是否有關係:

Model2 <- glm(low ~ lwt, data = lbw, family = binomial(link = "logit"))
summary(Model2); epiDisplay::logistic.display(Model2)
## 
## Call:
## glm(formula = low ~ lwt, family = binomial(link = "logit"), data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.09482  -0.90217  -0.80197   1.36105   1.98141  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.9957634  0.7852434  1.2681  0.20476  
## lwt         -0.0140371  0.0061685 -2.2756  0.02287 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 228.708  on 187  degrees of freedom
## AIC: 232.708
## 
## Number of Fisher Scoring iterations: 4
## 
## Logistic regression predicting low 
##  
##                  OR(95%CI)      P(Wald's test) P(LR-test)
## lwt (cont. var.) 0.99 (0.97,1)  0.023          0.015     
##                                                          
## Log-likelihood = -114.354
## No. of observations = 189
## AIC value = 232.7081

由此知,母親孕前體重較高的人,有較低的可能生下低出生體重的嬰兒。這兩個單獨的關係,各自看都具有 5% 的統計學意義,但是這 (或者其他變量分析的結果沒有統計學意義時) 並不是決定模型中是否加入母親孕前體重這一潛在的混雜因子的理由。接下來,我們通過模型中加入母親孕前體重這一變量前後模型的參數估計變化來分析:

Model3 <- glm(low ~ ht + lwt, data = lbw, family = binomial(link = "logit"))
summary(Model3);epiDisplay::logistic.display(Model3)
## 
## Call:
## glm(formula = low ~ ht + lwt, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.85912  -0.87274  -0.73845   1.29224   2.17964  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.4478621  0.8208975  1.7638 0.077773 . 
## ht1          1.8544773  0.7008245  2.6461 0.008142 **
## lwt         -0.0186285  0.0065928 -2.8256 0.004720 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 221.165  on 186  degrees of freedom
## AIC: 227.165
## 
## Number of Fisher Scoring iterations: 4
## 
## Logistic regression predicting low 
##  
##                  crude OR(95%CI)    adj. OR(95%CI)     P(Wald's test) P(LR-test)
## ht: 1 vs 0       3.37 (1.02,11.09)  6.39 (1.62,25.23)  0.008          0.006     
##                                                                                 
## lwt (cont. var.) 0.99 (0.97,1)      0.98 (0.97,0.99)   0.005          0.002     
##                                                                                 
## Log-likelihood = -110.5827
## No. of observations = 189
## AIC value = 227.1654

加入了孕前體重的模型給出的母親是否有高血壓病史對嬰兒的低出生體重關係的比值比估計爲 \(6.39\),這很明顯比調整孕前體重前的粗比值比 \((3.37)\) 大了很多。這個比值比估計的變化有兩個原因:

  1. (常被忽略的) 邏輯迴歸模型的不可壓縮性導致的;
  2. 母親孕前體重對高血壓病史和嬰兒的低出生體重之間的關係造成了混雜效應。

上面的分析結果告訴我們,該數據提供了足夠的證據證明母親孕前體重和是否有高血壓病史,在調整了彼此之後,仍然獨立地和嬰兒低出生體重的發生有相關性。這裏,我們可以下結論認爲,模型中加入母親孕前體重作爲混雜因子,是合情合理的。

完成了目前爲止的初步分析和混雜因子的判斷以後,下一階段的分析側重於尋找有沒有任何第三方的預測變量,會對主要暴露變量 \(X_1\) (孕期高血壓) 與結果變量 \(Y\) (嬰兒出生體重過低) 之間的關係產生交互作用。如果數據中的預測變量有多個,那可能導致需要分析潛在的交互作用有許多對,通常建議在遇到多個預測變量之間的複雜關係需要討論的時候,建議不要一股腦全部作交互作用的分析,而是限定一個或者幾個最有可能有交互作用的變量就可以了。否則模型過於複雜,反而不利於理解。一般生物醫學的統計分析中考慮的重要交互作用分析,需要有重要的生物學意義,常見的例子是年齡,性別等。

本節使用的例子中,令人感興趣的是,母親的孕前體重,會不會對妊娠高血壓的有無與嬰兒出生體重過低之間的關係造成交互作用:

Model4 <- glm(low ~ ht*lwt, family = binomial(link = "logit"), data = lbw)
summary(Model4); epiDisplay::logistic.display(Model4)
## 
## Call:
## glm(formula = low ~ ht * lwt, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.77338  -0.87349  -0.74658   1.28246   2.20403  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.5393115  0.9174755  1.6778 0.093392 . 
## ht1          1.2869895  2.5493528  0.5048 0.613678   
## lwt         -0.0193796  0.0074129 -2.6143 0.008941 **
## ht1:lwt      0.0037324  0.0161735  0.2308 0.817490   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 221.113  on 185  degrees of freedom
## AIC: 229.113
## 
## Number of Fisher Scoring iterations: 4
## 
## Logistic regression predicting low 
##  
##                  crude OR(95%CI)    adj. OR(95%CI)          P(Wald's test) P(LR-test)
## ht: 1 vs 0       3.37 (1.02,11.09)  3.62 (0.02,535.73)      0.614          0.605     
##                                                                                      
## lwt (cont. var.) 0.99 (0.97,1)      0.98 (0.97,1)           0.009          < 0.001   
##                                                                                      
## ht1:lwt          -                  1.0037 (0.9724,1.0361)  0.817          0.819     
##                                                                                      
## Log-likelihood = -110.5567
## No. of observations = 189
## AIC value = 229.1133

由於交互作用項結果爲 ht1:lwt 0.003732 0.016173 0.231 0.81749,無足夠的證據證明孕前體重會對妊娠高血壓和嬰兒出生體重過低之間的關係造成交互作用。

如果確認沒有交互作用,建立本例最終模型前的幾個建議:

  1. 最終分析 \(X_1, Y\) 之間關係的模型,需要加入我們逐一甄別之後確認過的混淆因子,此時稱爲模型 1
  2. 對於確認不是 \(X_1, Y\) 之間關係的混淆因子的那些剩餘變量,逐一加入模型 1,比較前後是否模型中各個混淆因子的參數估計是否發生了變化 (有沒有混淆因子的混淆因子?);
  3. 最終模型中的變量,需要包含前兩步確認過的全部混淆因子;
  4. 在報告中把調整前後的參數估計整理成表格。

如果在分析過程中發現了有重要意義的交互作用,那麼除了包含全部的混淆因子之外,你的最終模型中還需加入重要的交互作用項。此時需要報告的參數估計是有交互作用項部分的分層比值比/其他指標。

65.3.2 補充

除了使用二項分佈的邏輯迴歸之外,當結果變量是連續型或者計數型,也就是分析模型使用線性迴歸 (ANCOVA),或者 (可能過度離散的) 泊松迴歸時,爲了提高模型的統計效能,減小參數估計的標準誤,模型可以選擇進一步調整一個或幾個只和結果變量有關的基線變量。此時,在你寫論文或者報告時,必須把這些變量和確認是混雜因子的變量加以區分,因爲加它們進入模型的目的不同。

65.4 分析目的 2 和 3 – 建立預測模型 (predictive models)

建立預測模型的過程,其實就是選擇哪個或者那些變量進入模型的過程。方法有很多,可惜的是,沒有哪種是公認完美的。這裏只介紹兩種最常見,也最常被批評的方法 – 前/後 逐步選擇法 (forward stepwise selection/backward elimination)。強調一下,逐步法本身並不是神奇法術,不同的算法選擇的變量自然會有不同,如果你用了逐步選擇法,選出來的模型變量僅僅只能作爲參考,而不能作爲最終結論。

vitc <- haven::read_dta("../backupfiles/vitC.dta")
vitc$ctakers <- factor(vitc$ctakers)
vitc$sex <- factor(vitc$sex)

stats::step(lm(seruvitc~1,data=vitc[complete.cases(vitc),]),direction="forward",scope=~age+height+weight+sex+cigs+ctakers)
## Start:  AIC=575.43
## seruvitc ~ 1
## 
##           Df Sum of Sq     RSS     AIC
## + ctakers  1   6967.43 42659.6 563.663
## + sex      1   3688.53 45938.5 570.402
## + cigs     1   2470.90 47156.1 572.783
## + height   1   1243.73 48383.3 575.121
## <none>                 49627.0 575.430
## + age      1    273.59 49353.4 576.927
## + weight   1      1.53 49625.5 577.427
## 
## Step:  AIC=563.66
## seruvitc ~ ctakers
## 
##          Df Sum of Sq     RSS     AIC
## + sex     1  2713.526 39946.0 559.683
## + cigs    1  2150.581 40509.0 560.956
## + height  1  1049.000 41610.6 563.398
## <none>                42659.6 563.663
## + age     1   247.944 42411.6 565.133
## + weight  1    18.227 42641.3 565.625
## 
## Step:  AIC=559.68
## seruvitc ~ ctakers + sex
## 
##          Df Sum of Sq     RSS     AIC
## + cigs    1  1527.704 38418.3 558.134
## <none>                39946.0 559.683
## + weight  1   445.296 39500.7 560.663
## + age     1   183.277 39762.8 561.264
## + height  1   131.722 39814.3 561.382
## 
## Step:  AIC=558.13
## seruvitc ~ ctakers + sex + cigs
## 
##          Df Sum of Sq     RSS     AIC
## <none>                38418.3 558.134
## + weight  1  257.1523 38161.2 559.523
## + age     1  205.2889 38213.0 559.647
## + height  1   58.2745 38360.1 559.996
## 
## Call:
## lm(formula = seruvitc ~ ctakers + sex + cigs, data = vitc[complete.cases(vitc), 
##     ])
## 
## Coefficients:
## (Intercept)     ctakers1         sex1         cigs  
##     46.0283      19.8317       9.7642     -12.2550
stats::step(lm(seruvitc~.,data=vitc[complete.cases(vitc),]),direction="backward")
## Start:  AIC=563.38
## seruvitc ~ serial + age + height + cigs + weight + sex + ctakers
## 
##           Df Sum of Sq     RSS     AIC
## - height   1      1.48 37272.5 561.379
## - age      1     64.18 37335.2 561.532
## - weight   1    174.65 37445.7 561.801
## - serial   1    808.36 38079.4 563.328
## <none>                 37271.1 563.375
## - cigs     1    979.26 38250.3 563.735
## - sex      1   1123.50 38394.6 564.078
## - ctakers  1   6407.24 43678.3 575.811
## 
## Step:  AIC=561.38
## seruvitc ~ serial + age + cigs + weight + sex + ctakers
## 
##           Df Sum of Sq     RSS     AIC
## - age      1     65.27 37337.8 559.538
## - weight   1    217.51 37490.0 559.908
## - serial   1    807.51 38080.1 561.329
## <none>                 37272.5 561.379
## - cigs     1    977.97 38250.5 561.736
## - sex      1   2584.37 39856.9 565.479
## - ctakers  1   6442.37 43714.9 573.887
## 
## Step:  AIC=559.54
## seruvitc ~ serial + cigs + weight + sex + ctakers
## 
##           Df Sum of Sq     RSS     AIC
## - weight   1    366.60 37704.4 558.427
## - serial   1    823.37 38161.2 559.523
## <none>                 37337.8 559.538
## - cigs     1    944.21 38282.0 559.811
## - sex      1   2816.28 40154.1 564.155
## - ctakers  1   6462.15 43800.0 572.064
## 
## Step:  AIC=558.43
## seruvitc ~ serial + cigs + sex + ctakers
## 
##           Df Sum of Sq     RSS     AIC
## - serial   1    713.92 38418.3 558.134
## <none>                 37704.4 558.427
## - cigs     1   1156.10 38860.5 559.176
## - sex      1   2451.95 40156.4 562.161
## - ctakers  1   6385.14 44089.5 570.664
## 
## Step:  AIC=558.13
## seruvitc ~ cigs + sex + ctakers
## 
##           Df Sum of Sq     RSS     AIC
## <none>                 38418.3 558.134
## - cigs     1   1527.70 39946.0 559.683
## - sex      1   2090.65 40509.0 560.956
## - ctakers  1   5841.01 44259.3 569.014
## 
## Call:
## lm(formula = seruvitc ~ cigs + sex + ctakers, data = vitc[complete.cases(vitc), 
##     ])
## 
## Coefficients:
## (Intercept)         cigs         sex1     ctakers1  
##     46.0283     -12.2550       9.7642      19.8317

65.5 GLM practical 09

在本次練習中,我們將嘗試使用多個不同的分析策略分析前文中使用過的低體重數據。

65.5.1 Part 1

第一部分,假設我們的主要研究目的是想要知道,在儘可能多的考慮所有潛在的混雜因子之後,患有孕期高血壓的孕婦是否和低出生體重嬰兒的誕生有關。

65.5.1.1 在你熟悉的統計分析軟件中載入該數據,重複前文中做過的分析,即調查母親的體重是否對主要分析的高血壓和低出生體重嬰兒之間關係造成混雜。在模型中加入他們的交互作用項試以分析。

lbw <- read_dta("../backupfiles/lbw.dta")

a <- Epi::stat.table(list("Birthweight <2500g" = low, 
                          "History of hypertension"=ht), 
                     list(count(),percent(low)), data = lbw, margins = TRUE)
# We first tabulate the data
print(a, digits = c(percent = 2))
##  -------------------------------------- 
##               -History of hypertension- 
##  Birthweight         0       1   Total  
##  <2500g                                 
##  -------------------------------------- 
##  0                 125       5     130  
##                  70.62   41.67   68.78  
##                                         
##  1                  52       7      59  
##                  29.38   58.33   31.22  
##                                         
##                                         
##  Total             177      12     189  
##                 100.00  100.00  100.00  
##  --------------------------------------
# get the un-adjusted odds ratios for the association
# between maternal hypertension and low birth weight

M0 <- glm(low ~ ht, data = lbw, family = binomial(link = "logit"))
summary(M0); jtools::summ(M0, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht, family = binomial(link = "logit"), data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.32323  -0.83407  -0.83407   1.56519   1.56519  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -0.87707    0.16502 -5.3150 1.066e-07 ***
## ht           1.21354    0.60835  1.9948   0.04606 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 230.650  on 187  degrees of freedom
## AIC: 234.65
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(1) 4.022134
Pseudo-R² (Cragg-Uhler) 0.029611
Pseudo-R² (McFadden) 0.017139
AIC 234.649863
BIC 241.133357
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.416000 0.301044 0.574852 -5.315016 0.000000
ht 3.365385 1.021427 11.088221 1.994814 0.046063
Standard errors: MLE
# get the association between maternal weight and hypertension

M.linear <- lm(lwt ~ ht, data = lbw)
summary(M.linear); jtools::summ(M.linear, confint = TRUE, digits = 6)
## 
## Call:
## lm(formula = lwt ~ ht, data = lbw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -62.5000 -17.9435  -7.9435  10.0565 122.0565 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 127.9435     2.2390 57.1426  < 2e-16 ***
## ht           29.5565     8.8858  3.3262  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.788 on 187 degrees of freedom
## Multiple R-squared:  0.05586,    Adjusted R-squared:  0.050811 
## F-statistic: 11.064 on 1 and 187 DF,  p-value: 0.0010596
Observations 189
Dependent variable lwt
Type OLS linear regression
F(1,187) 11.063918
0.055860
Adj. R² 0.050811
Est. 2.5% 97.5% t val. p
(Intercept) 127.943503 123.526516 132.360489 57.142604 0.000000
ht 29.556497 12.027125 47.085869 3.326247 0.001060
Standard errors: OLS
# get the association between maternal weight and low birth weight

M1 <- glm(low ~ lwt, data = lbw, family = binomial(link = "logit"))
summary(M1); jtools::summ(M1, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ lwt, family = binomial(link = "logit"), data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.09482  -0.90217  -0.80197   1.36105   1.98141  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.9957634  0.7852434  1.2681  0.20476  
## lwt         -0.0140371  0.0061685 -2.2756  0.02287 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 228.708  on 187  degrees of freedom
## AIC: 232.708
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(1) 5.963938
Pseudo-R² (Cragg-Uhler) 0.043683
Pseudo-R² (McFadden) 0.025414
AIC 232.708058
BIC 239.191552
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.706790 0.580837 12.614062 1.268095 0.204764
lwt 0.986061 0.974211 0.998055 -2.275604 0.022870
Standard errors: MLE
# get the association between maternal hypertension and 
# low birth weight adjusted for maternal weight

M2 <- glm(low ~ lwt + ht, data = lbw, family = binomial(link = "logit"))
summary(M2); jtools::summ(M2, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ lwt + ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.85912  -0.87274  -0.73845   1.29224   2.17964  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.4478621  0.8208975  1.7638 0.077773 . 
## lwt         -0.0186285  0.0065928 -2.8256 0.004720 **
## ht           1.8544773  0.7008245  2.6461 0.008142 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 221.165  on 186  degrees of freedom
## AIC: 227.165
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(2) 13.506547
Pseudo-R² (Cragg-Uhler) 0.096991
Pseudo-R² (McFadden) 0.057555
AIC 227.165449
BIC 236.890690
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 4.254010 0.851235 21.259240 1.763755 0.077773
lwt 0.981544 0.968942 0.994309 -2.825577 0.004720
ht 6.388358 1.617508 25.230866 2.646137 0.008142
Standard errors: MLE
lrtest(M2, M0)
## Likelihood ratio test
## 
## Model 1: low ~ lwt + ht
## Model 2: low ~ ht
##   #Df   LogLik Df   Chisq Pr(>Chisq)   
## 1   3 -110.583                         
## 2   2 -115.325 -1 9.48441  0.0020722 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## we see that odds ratio for the association between
    ## maternal hypertension and low birth weight after 
    ## adjustment for mother's weight. Clear evidence of 
    ## confounding. 

# fit the model with interaction term

M3 <- glm(low ~ lwt*ht, data = lbw, family = binomial(link = "logit"))
summary(M3); jtools::summ(M3, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ lwt * ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.77338  -0.87349  -0.74658   1.28246   2.20403  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.5393115  0.9174755  1.6778 0.093392 . 
## lwt         -0.0193796  0.0074129 -2.6143 0.008941 **
## ht           1.2869895  2.5493528  0.5048 0.613678   
## lwt:ht       0.0037324  0.0161735  0.2308 0.817490   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 221.113  on 185  degrees of freedom
## AIC: 229.113
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(3) 13.558683
Pseudo-R² (Cragg-Uhler) 0.097352
Pseudo-R² (McFadden) 0.057777
AIC 229.113313
BIC 242.080301
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 4.661380 0.771894 28.149527 1.677768 0.093392
lwt 0.980807 0.966660 0.995161 -2.614315 0.008941
ht 3.621866 0.024486 535.729382 0.504830 0.613678
lwt:ht 1.003739 0.972420 1.036067 0.230774 0.817490
Standard errors: MLE
    ## No evidence of an interaction. 

65.5.1.2 把前一步中擬合過的模型的母親體重變量替換爲母親的年齡。你有怎樣的結論?

# get the association between age and hypertension

M.linear <- lm(age ~ ht, data = lbw)
summary(M.linear); jtools::summ(M.linear, confint = TRUE, digits = 6)
## 
## Call:
## lm(formula = age ~ ht, data = lbw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9.25989 -4.25989 -0.25989  2.74011 21.74011 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.25989    0.39929 58.2536   <2e-16 ***
## ht          -0.34322    1.58462 -0.2166   0.8288    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.3122 on 187 degrees of freedom
## Multiple R-squared:  0.00025081, Adjusted R-squared:  -0.0050954 
## F-statistic: 0.046913 on 1 and 187 DF,  p-value: 0.82876
Observations 189
Dependent variable age
Type OLS linear regression
F(1,187) 0.046913
0.000251
Adj. R² -0.005095
Est. 2.5% 97.5% t val. p
(Intercept) 23.259887 22.472202 24.047572 58.253639 0.000000
ht -0.343220 -3.469247 2.782806 -0.216595 0.828760
Standard errors: OLS
M1.1 <- glm(low ~ age, data = lbw, family = binomial(link = "logit"))
summary(M1.1); jtools::summ(M1.1, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ age, family = binomial(link = "logit"), data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.04024  -0.90182  -0.77544   1.41191   1.77999  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.384582   0.732125  0.5253   0.5994
## age         -0.051153   0.031514 -1.6232   0.1045
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 231.912  on 187  degrees of freedom
## AIC: 235.912
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(1) 2.760038
Pseudo-R² (Cragg-Uhler) 0.020387
Pseudo-R² (McFadden) 0.011761
AIC 235.911958
BIC 242.395452
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.469000 0.349813 6.168898 0.525296 0.599378
age 0.950133 0.893223 1.010669 -1.623194 0.104548
Standard errors: MLE
M1.2 <- glm(low ~ age + ht, data = lbw, family = binomial(link = "logit"))
summary(M1.2); jtools::summ(M1.2, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ age + ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.41416  -0.88874  -0.74591   1.40428   1.77486  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.304977   0.741987  0.4110   0.6811  
## age         -0.051504   0.031938 -1.6126   0.1068  
## ht           1.214791   0.612127  1.9845   0.0472 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 227.927  on 186  degrees of freedom
## AIC: 233.927
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(2) 6.745090
Pseudo-R² (Cragg-Uhler) 0.049303
Pseudo-R² (McFadden) 0.028743
AIC 233.926906
BIC 243.652147
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.356594 0.316862 5.808050 0.411028 0.681052
age 0.949800 0.892168 1.011156 -1.612604 0.106831
ht 3.369591 1.015158 11.184607 1.984541 0.047196
Standard errors: MLE
lrtest(M0, M1.2)
## Likelihood ratio test
## 
## Model 1: low ~ ht
## Model 2: low ~ age + ht
##   #Df   LogLik Df   Chisq Pr(>Chisq)  
## 1   2 -115.325                        
## 2   3 -113.963  1 2.72296   0.098915 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M1.3 <- glm(low ~ age*ht, data = lbw, family = binomial(link = "logit"))
summary(M1.3); jtools::summ(M1.3, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ age * ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.52114  -0.89947  -0.72614   1.39913   1.82235  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.560224   0.770466  0.7271  0.46715  
## age         -0.062810   0.033417 -1.8796  0.06017 .
## ht          -4.269768   4.195117 -1.0178  0.30878  
## age:ht       0.242368   0.187151  1.2950  0.19531  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 225.661  on 185  degrees of freedom
## AIC: 233.661
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(3) 9.010600
Pseudo-R² (Cragg-Uhler) 0.065472
Pseudo-R² (McFadden) 0.038397
AIC 233.661396
BIC 246.628384
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.751064 0.386795 7.927273 0.727123 0.467150
age 0.939122 0.879583 1.002690 -1.879560 0.060168
ht 0.013985 0.000004 52.065896 -1.017795 0.308776
age:ht 1.274263 0.882989 1.838919 1.295037 0.195308
Standard errors: MLE

並無證據提示年齡是孕期高血壓和低出生體重之間的關係的混雜因子 (confounder),也沒有證據表明年齡對該關係有交互作用。

65.5.1.3 分析吸菸史是否對孕期高血壓和低出生體重之間的關係有混雜作用 (confounding effect)。注意這裏吸菸史是一個二進制變量。

M2.1 <- glm(ht ~ smoke, data = lbw, family = binomial(link = "logit"))
summary(M2.1); jtools::summ(M2.1, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = ht ~ smoke, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.37406  -0.37406  -0.35440  -0.35440   2.36602  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -2.73622    0.39002 -7.0156 2.289e-12 ***
## smoke        0.11155    0.60548  0.1842    0.8538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 89.3856  on 188  degrees of freedom
## Residual deviance: 89.3519  on 187  degrees of freedom
## AIC: 93.3519
## 
## Number of Fisher Scoring iterations: 5
Observations 189
Dependent variable ht
Type Generalized linear model
Family binomial
Link logit
𝛘²(1) 0.033747
Pseudo-R² (Cragg-Uhler) 0.000474
Pseudo-R² (McFadden) 0.000378
AIC 93.351859
BIC 99.835353
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.064815 0.030178 0.139206 -7.015634 0.000000
smoke 1.118012 0.341241 3.662956 0.184238 0.853827
Standard errors: MLE
    ## here both the exposure of interest and the potential confounder
    ## are confounder are binary, so we need to use logistic regression
    ## (not linear regression) to explore the relationship between them.
    ## In the logistic regression model either of the two variables 
    ## could be the dependent variable.

M2.2 <- glm(low ~ smoke, data = lbw, family = binomial(link = "logit"))
summary(M2.2); jtools::summ(M2.2, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ smoke, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.01968  -0.76235  -0.76235   1.34378   1.65990  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -1.08705    0.21473 -5.0623 4.142e-07 ***
## smoke        0.70406    0.31964  2.2026   0.02762 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 229.805  on 187  degrees of freedom
## AIC: 233.805
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(1) 4.867397
Pseudo-R² (Cragg-Uhler) 0.035754
Pseudo-R² (McFadden) 0.020741
AIC 233.804600
BIC 240.288094
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.337209 0.221370 0.513667 -5.062322 0.000000
smoke 2.021944 1.080660 3.783111 2.202647 0.027620
Standard errors: MLE
M2.3 <- glm(low ~ smoke + ht, data = lbw, family = binomial(link = "logit"))
summary(M2.3); jtools::summ(M2.3, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ smoke + ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.51383  -0.73245  -0.73245   1.38100   1.70116  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -1.17873    0.22319 -5.2813 1.283e-07 ***
## smoke        0.71187    0.32365  2.1995   0.02784 *  
## ht           1.23005    0.61765  1.9915   0.04643 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 225.792  on 186  degrees of freedom
## AIC: 231.792
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(2) 8.880061
Pseudo-R² (Cragg-Uhler) 0.064545
Pseudo-R² (McFadden) 0.037840
AIC 231.791935
BIC 241.517177
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.307668 0.198656 0.476499 -5.281299 0.000000
smoke 2.037802 1.080606 3.842877 2.199486 0.027843
ht 3.421389 1.019665 11.480150 1.991493 0.046427
Standard errors: MLE
lrtest(M2.3, M0)
## Likelihood ratio test
## 
## Model 1: low ~ smoke + ht
## Model 2: low ~ ht
##   #Df   LogLik Df   Chisq Pr(>Chisq)  
## 1   3 -112.896                        
## 2   2 -115.325 -1 4.85793   0.027519 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M2.4 <- glm(low ~ smoke*ht, data = lbw, family = binomial(link = "logit"))
summary(M2.4); jtools::summ(M2.4, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ smoke * ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.35373  -0.72566  -0.72566   1.36987   1.71070  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -1.19996    0.22814 -5.2598 1.442e-07 ***
## smoke        0.75813    0.33600  2.2564   0.02405 *  
## ht           1.48765    0.79711  1.8663   0.06200 .  
## smoke:ht    -0.64035    1.23675 -0.5178   0.60462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 225.527  on 185  degrees of freedom
## AIC: 233.527
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(3) 9.144894
Pseudo-R² (Cragg-Uhler) 0.066424
Pseudo-R² (McFadden) 0.038969
AIC 233.527102
BIC 246.494090
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.301205 0.192605 0.471038 -5.259761 0.000000
smoke 2.134286 1.104716 4.123392 2.256359 0.024048
ht 4.426667 0.928062 21.114292 1.866305 0.061999
smoke:ht 0.527108 0.046685 5.951510 -0.517766 0.604622
Standard errors: MLE

模型結果和似然比檢驗的結果提示孕婦的吸菸史與低出生體重有關。但是調整了吸菸史並不太改變孕期高血壓和低出生體重之間的比值比 OR。這主要是因爲吸菸史和孕期高血壓之間沒有太多關係。所以,孕婦的吸菸史似乎不太需要被認定爲混雜因子,也沒有明顯的交互作用。

65.5.1.4 分析人種 (race) 是否有混雜效應。注意這裏人種變量的分類多於兩個。

M3.1 <- glm(ht ~ as.factor(race), data = lbw, 
            family = binomial(link = "logit"))
summary(M3.1); jtools::summ(M3.1, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = ht ~ as.factor(race), family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.49518  -0.35088  -0.32707  -0.32707   2.43101  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.90142    0.45932 -6.3168 2.67e-10 ***
## as.factor(race)2  0.86454    0.76667  1.1277   0.2595    
## as.factor(race)3  0.14458    0.69054  0.2094   0.8342    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 89.3856  on 188  degrees of freedom
## Residual deviance: 88.1841  on 186  degrees of freedom
## AIC: 94.1841
## 
## Number of Fisher Scoring iterations: 5
Observations 189
Dependent variable ht
Type Generalized linear model
Family binomial
Link logit
𝛘²(2) 1.201470
Pseudo-R² (Cragg-Uhler) 0.016816
Pseudo-R² (McFadden) 0.013441
AIC 94.184136
BIC 103.909377
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.054945 0.022333 0.135177 -6.316796 0.000000
as.factor(race)2 2.373913 0.528291 10.667355 1.127653 0.259466
as.factor(race)3 1.155556 0.298542 4.472772 0.209375 0.834155
Standard errors: MLE
    ## Here the potential confounder is a 3-level categorical
    ## variable so we need to treat this as the predictor 
    ## variable when exploring its relationship with the 
    ## exposure of interest. The exposure of interest is 
    ## binary, so we need to use logistic regression. 

M3.2 <- glm(low ~ as.factor(race), data = lbw, 
            family = binomial(link = "logit"))
summary(M3.2); jtools::summ(M3.2, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ as.factor(race), family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.04885  -0.96646  -0.74012   1.40415   1.69048  
## 
## Coefficients:
##                  Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)      -1.15497    0.23912 -4.8301 1.364e-06 ***
## as.factor(race)2  0.84481    0.46341  1.8230    0.0683 .  
## as.factor(race)3  0.63617    0.34783  1.8290    0.0674 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 229.662  on 186  degrees of freedom
## AIC: 235.662
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(2) 5.010366
Pseudo-R² (Cragg-Uhler) 0.036791
Pseudo-R² (McFadden) 0.021351
AIC 235.661630
BIC 245.386871
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.315068 0.197183 0.503433 -4.830132 0.000001
as.factor(race)2 2.327536 0.938507 5.772384 1.823014 0.068301
as.factor(race)3 1.889234 0.955458 3.735596 1.828968 0.067404
Standard errors: MLE
M3.3 <- glm(low ~ as.factor(race) + ht, data = lbw, 
            family = binomial(link = "logit"))
summary(M3.3); jtools::summ(M3.3, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ as.factor(race) + ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.49520  -0.93846  -0.71605   1.37141   1.72429  
## 
## Coefficients:
##                  Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)      -1.23023    0.24546 -5.0119 5.389e-07 ***
## as.factor(race)2  0.78492    0.47086  1.6670   0.09552 .  
## as.factor(race)3  0.63830    0.35122  1.8174   0.06916 .  
## ht                1.16712    0.61857  1.8868   0.05919 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 226.061  on 185  degrees of freedom
## AIC: 234.061
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(3) 8.610811
Pseudo-R² (Cragg-Uhler) 0.062633
Pseudo-R² (McFadden) 0.036693
AIC 234.061185
BIC 247.028173
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.292226 0.180627 0.472775 -5.011923 0.000001
as.factor(race)2 2.192239 0.871142 5.516795 1.666990 0.095516
as.factor(race)3 1.893265 0.951153 3.768532 1.817376 0.069160
ht 3.212725 0.955750 10.799480 1.886797 0.059188
Standard errors: MLE
lrtest(M3.3, M0)
## Likelihood ratio test
## 
## Model 1: low ~ as.factor(race) + ht
## Model 2: low ~ ht
##   #Df   LogLik Df   Chisq Pr(>Chisq)
## 1   4 -113.031                      
## 2   2 -115.325 -2 4.58868    0.10083
M3.4 <- glm(low ~ as.factor(race)*ht, data = lbw, 
            family = binomial(link = "logit"))
summary(M3.4); jtools::summ(M3.4, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ as.factor(race) * ht, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.66511  -0.92689  -0.72438   1.36987   1.71250  
## 
## Coefficients:
##                     Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)         -1.20397    0.24881 -4.8390 1.305e-06 ***
## as.factor(race)2     0.76214    0.49441  1.5415    0.1232    
## as.factor(race)3     0.58144    0.36297  1.6019    0.1092    
## ht                   0.79851    0.94617  0.8439    0.3987    
## as.factor(race)2:ht  0.33647    1.60555  0.2096    0.8340    
## as.factor(race)3:ht  0.92263    1.51605  0.6086    0.5428    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 225.670  on 183  degrees of freedom
## AIC: 237.67
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(5) 9.001675
Pseudo-R² (Cragg-Uhler) 0.065408
Pseudo-R² (McFadden) 0.038359
AIC 237.670321
BIC 257.120803
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.300000 0.184220 0.488546 -4.838993 0.000001
as.factor(race)2 2.142857 0.813108 5.647269 1.541504 0.123194
as.factor(race)3 1.788618 0.878121 3.643181 1.601891 0.109180
ht 2.222222 0.347861 14.196098 0.843937 0.398705
as.factor(race)2:ht 1.400000 0.060185 32.566497 0.209569 0.834004
as.factor(race)3:ht 2.515909 0.128893 49.108965 0.608577 0.542805
Standard errors: MLE

我們發現一些比較弱的證據似乎認爲不同人種可能和低出生體重有關,儘管這個關係並無統計學意義。調整了人種之後,孕期高血壓的比值比 OR 變小了一點點 (從3.37 降到 3.21)。所以,調整人種恐怕並沒有太多的意義。也應該沒有交互作用。

65.5.1.5 用相似的方法試分析子宮痙攣 (uterine irritability) 是否是混雜因子,有沒有可疑的交互作用?

a <- stat.table(list("History of Hypertension" = ht, "Uterine Irritability" = ui), 
                     list(count(),percent(low)), data = lbw, margins = TRUE)
# We first tabulate the data
# There is a strong observed association here. 
# The Zero in the contingency table means 
# that fitting a logistic regression model 
# will be uninformative, with subjects in one 
# predictor variable category being dropped
print(a, digits = c(percent = 2))
##  --------------------------------------- 
##                --Uterine Irritability--- 
##  History of           0       1   Total  
##  Hypertension                            
##  --------------------------------------- 
##  0                  149      28     177  
##                  100.00  100.00  100.00  
##                                          
##  1                   12       0      12  
##                  100.00      NA  100.00  
##                                          
##                                          
##  Total              161      28     189  
##                  100.00  100.00  100.00  
##  ---------------------------------------
M4.1 <- glm(low ~ ui, data = lbw, 
            family = binomial(link = "logit"))
summary(M4.1); jtools::summ(M4.1, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ui, family = binomial(link = "logit"), data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.17741  -0.80971  -0.80971   1.17741   1.59671  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -0.94693    0.17562 -5.3919 6.973e-08 ***
## ui           0.94693    0.41677  2.2720   0.02308 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 229.596  on 187  degrees of freedom
## AIC: 233.596
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(1) 5.076097
Pseudo-R² (Cragg-Uhler) 0.037267
Pseudo-R² (McFadden) 0.021631
AIC 233.595899
BIC 240.079393
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.387931 0.274957 0.547323 -5.391870 0.000000
ui 2.577778 1.138905 5.834499 2.272045 0.023084
Standard errors: MLE
M4.2 <- glm(low ~ ht + ui, data = lbw, 
            family = binomial(link = "logit"))
summary(M4.2); jtools::summ(M4.2, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.32323  -0.76735  -0.76735   1.17741   1.65309  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -1.07194    0.18795 -5.7034 1.175e-08 ***
## ht           1.40842    0.61496  2.2902   0.02201 *  
## ui           1.07194    0.42212  2.5395   0.01110 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 224.321  on 186  degrees of freedom
## AIC: 230.321
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(2) 10.351369
Pseudo-R² (Cragg-Uhler) 0.074950
Pseudo-R² (McFadden) 0.044110
AIC 230.320627
BIC 240.045868
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.342342 0.236853 0.494815 -5.703384 0.000000
ht 4.089474 1.225204 13.649807 2.290238 0.022008
ui 2.921053 1.277126 6.681056 2.539454 0.011103
Standard errors: MLE
lrtest(M4.2, M0)
## Likelihood ratio test
## 
## Model 1: low ~ ht + ui
## Model 2: low ~ ht
##   #Df   LogLik Df   Chisq Pr(>Chisq)  
## 1   3 -112.160                        
## 2   2 -115.325 -1 6.32924   0.011876 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

有一些證據提示子宮痙攣應該對孕期高血壓和低出生體重之間有一定的混雜因子效應。表示子宮痙攣和孕期高血壓,及其與低出生體重之間關係的比值比 OR 發生了顯著的變化 (OR changed materially from 3.37 to 4.09)。

但是由於沒有一位母親同時有孕期高血壓且有子宮痙攣,我們並沒有辦法分析他們之間是否存在交互作用。

65.5.1.6 根據目前爲止建立過的模型和他們之間體現的可能存在的關係,你會怎樣建立關於有效評價孕期高血壓和低出生體重之間關係的邏輯回歸模型?

我們認爲目前爲止對數據中的變量之間關係的評價,我們應該考慮在最終模型中加入母親的體重,以及子宮痙攣作爲混雜因子 (共變量, covariates)。

M5 <- glm(low ~ ht + ui + lwt, data = lbw, 
            family = binomial(link = "logit"))
summary(M5); jtools::summ(M5, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui + lwt, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.80936  -0.81607  -0.72064   1.15535   2.12396  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.0648676  0.8328755  1.2785 0.201058   
## ht           1.9604955  0.6955433  2.8187 0.004823 **
## ui           0.9300291  0.4336034  2.1449 0.031962 * 
## lwt         -0.0168934  0.0065631 -2.5740 0.010052 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 216.636  on 185  degrees of freedom
## AIC: 224.636
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(3) 18.036428
Pseudo-R² (Cragg-Uhler) 0.127998
Pseudo-R² (McFadden) 0.076858
AIC 224.635568
BIC 237.602556
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.900455 0.566919 14.839219 1.278544 0.201058
ht 7.102846 1.817125 27.763864 2.818654 0.004823
ui 2.534583 1.083484 5.929122 2.144884 0.031962
lwt 0.983248 0.970682 0.995978 -2.574023 0.010052
Standard errors: MLE

此時,模型估計的孕期高血壓和低出生體重之間關係的比值比爲 7.10。接下來,重新把另外幾個數據中的變量一一逐個加入上面的 M5 模型中去,觀察孕期高血壓和低出生體重之間關係的比值比是否有本質變化。

M5.1 <- glm(low ~ ht + ui + lwt + age, data = lbw, 
            family = binomial(link = "logit"))
summary(M5.1); jtools::summ(M5.1, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui + lwt + age, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79612  -0.83888  -0.67688   1.08739   2.14183  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.7468696  1.0568007  1.6530 0.098335 . 
## ht           1.9266741  0.6955792  2.7699 0.005608 **
## ui           0.9165658  0.4334661  2.1145 0.034472 * 
## lwt         -0.0159071  0.0066016 -2.4096 0.015971 * 
## age         -0.0350727  0.0333641 -1.0512 0.293162   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 215.501  on 184  degrees of freedom
## AIC: 225.501
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(4) 19.170815
Pseudo-R² (Cragg-Uhler) 0.135648
Pseudo-R² (McFadden) 0.081692
AIC 225.501181
BIC 241.709917
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 5.736617 0.722945 45.520417 1.652979 0.098335
ht 6.866635 1.756572 26.842441 2.769885 0.005608
ui 2.500688 1.069282 5.848257 2.114504 0.034472
lwt 0.984219 0.971566 0.997036 -2.409568 0.015971
age 0.965535 0.904417 1.030784 -1.051211 0.293162
Standard errors: MLE
M5.2 <- glm(low ~ ht + ui + lwt + as.factor(smoke), data = lbw, 
            family = binomial(link = "logit"))
summary(M5.2); jtools::summ(M5.2, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui + lwt + as.factor(smoke), family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.66266  -0.80587  -0.66468   1.08091   1.94286  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        0.7184771  0.8490130  0.8462 0.397413   
## ht                 1.9210194  0.6825223  2.8146 0.004884 **
## ui                 0.8963519  0.4429079  2.0238 0.042992 * 
## lwt               -0.0162771  0.0065445 -2.4871 0.012877 * 
## as.factor(smoke)1  0.6528692  0.3356572  1.9450 0.051769 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 212.850  on 184  degrees of freedom
## AIC: 222.85
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(4) 21.822445
Pseudo-R² (Cragg-Uhler) 0.153350
Pseudo-R² (McFadden) 0.092991
AIC 222.849551
BIC 239.058286
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.051307 0.388463 10.832079 0.846250 0.397413
ht 6.827915 1.791943 26.016698 2.814589 0.004884
ui 2.450646 1.028672 5.838275 2.023788 0.042992
lwt 0.983855 0.971315 0.996556 -2.487140 0.012877
as.factor(smoke)1 1.921045 0.995006 3.708937 1.945047 0.051769
Standard errors: MLE
M5.3 <- glm(low ~ ht + ui + lwt + as.factor(race), data = lbw, 
            family = binomial(link = "logit"))
summary(M5.3); jtools::summ(M5.3, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui + lwt + as.factor(race), family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.86476  -0.84236  -0.63790   1.09558   2.30619  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       0.9487466  0.8939306  1.0613 0.288544   
## ht                1.9338144  0.7063595  2.7377 0.006187 **
## ui                0.9561224  0.4338223  2.2039 0.027528 * 
## lwt              -0.0186076  0.0068965 -2.6981 0.006973 **
## as.factor(race)2  1.0991313  0.4993768  2.2010 0.027736 * 
## as.factor(race)3  0.4305449  0.3703007  1.1627 0.244955   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 211.534  on 183  degrees of freedom
## AIC: 223.534
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(5) 23.138325
Pseudo-R² (Cragg-Uhler) 0.162043
Pseudo-R² (McFadden) 0.098599
AIC 223.533671
BIC 242.984153
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.582471 0.447837 14.891912 1.061320 0.288544
ht 6.915840 1.732171 27.612088 2.737720 0.006187
ui 2.601589 1.111651 6.088480 2.203950 0.027528
lwt 0.981564 0.968386 0.994922 -2.698133 0.006973
as.factor(race)2 3.001558 1.127915 7.987613 2.201006 0.027736
as.factor(race)3 1.538095 0.744359 3.178219 1.162690 0.244955
Standard errors: MLE

可見,除了母親體重,和子宮痙攣兩個變量之外,增加任何一個變量並不再實質改變孕期高血壓和低出生體重之間關係的比值比。

65.5.2 Part 2

65.5.2.1 想象另外一個研究者的分析目的是要使用相同的數據來建立一個預測式的模型。預測一名孕婦會產下一位低出生體重胎兒的危險度。

65.5.2.2 使用下面的Stata代碼,試用逐步選擇法尋找模型:

## 
## . use "../backupfiles/lbw.dta"
## (Hosmer & Lemeshow data)
## 
## . 
## . * the brackets (i.race) ensure that the race indicator variables are consider
## > ed
## 
## . * as a group, not individually
## 
## . 
## . xi: stepwise, forward pe(0.05): logistic low i.ht lwt age (i.race) i.smoke i.
## > ui
## i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
## i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
## i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
## i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
##                       begin with empty model
## p = 0.0229 <  0.0500  adding  lwt
## p = 0.0081 <  0.0500  adding  _Iht_1
## p = 0.0320 <  0.0500  adding  _Iui_1
## 
## Logistic regression                             Number of obs     =        189
##                                                 LR chi2(3)        =      18.04
##                                                 Prob > chi2       =     0.0004
## Log likelihood = -108.31778                     Pseudo R2         =     0.0769
## 
## ------------------------------------------------------------------------------
##          low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          lwt |   .9832484   .0064531    -2.57   0.010     .9706816     .995978
##       _Iht_1 |   7.102846   4.940338     2.82   0.005     1.817125    27.76387
##       _Iui_1 |   2.534583   1.099004     2.14   0.032     1.083484    5.929122
##        _cons |   2.900455   2.415719     1.28   0.201     .5669188    14.83923
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
## 
## . 
## .

逐步篩選的過程是先加入和母親的體重,其次是孕期高血壓,最後保留了子宮痙攣作爲了低出生體重的預測變量。這和我們在第一部分最終保留的,模型完全一樣。

## 
## . use "../backupfiles/lbw.dta"
## (Hosmer & Lemeshow data)
## 
## . 
## . xi: stepwise, forward pe(0.1): logistic low i.ht lwt age (i.race) i.smoke i.u
## > i
## i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
## i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
## i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
## i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
##                       begin with empty model
## p = 0.0229 <  0.1000  adding  lwt
## p = 0.0081 <  0.1000  adding  _Iht_1
## p = 0.0320 <  0.1000  adding  _Iui_1
## p = 0.0518 <  0.1000  adding  _Ismoke_1
## p = 0.0173 <  0.1000  adding  _Irace_2 _Irace_3
## 
## Logistic regression                             Number of obs     =        189
##                                                 LR chi2(6)        =      30.43
##                                                 Prob > chi2       =     0.0000
## Log likelihood = -102.11978                     Pseudo R2         =     0.1297
## 
## ------------------------------------------------------------------------------
##          low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          lwt |   .9834361   .0066887    -2.46   0.014     .9704134    .9966336
##       _Iht_1 |   6.490237   4.483259     2.71   0.007     1.676009    25.13302
##       _Iui_1 |   2.471801   1.106213     2.02   0.043     1.028189    5.942297
##    _Ismoke_1 |   2.817403   1.105908     2.64   0.008     1.305356    6.080917
##     _Irace_2 |   3.758631   1.959795     2.54   0.011     1.352705    10.44375
##     _Irace_3 |   2.526023   1.087054     2.15   0.031      1.08675    5.871446
##        _cons |   1.054066   .9884219     0.06   0.955     .1677556    6.623063
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
## 
## . 
## .

這裏設定了 pe(0.1) 之後,我們發現又增加了吸菸習慣,和種族兩個變量。

## 
## . use "../backupfiles/lbw.dta"
## (Hosmer & Lemeshow data)
## 
## . 
## . xi: stepwise, pe(0.1): logistic low i.ht lwt age (i.race) i.smoke i.ui
## i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
## i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
## i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
## i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
##                       begin with empty model
## p = 0.0229 <  0.1000  adding  lwt
## p = 0.0081 <  0.1000  adding  _Iht_1
## p = 0.0320 <  0.1000  adding  _Iui_1
## p = 0.0518 <  0.1000  adding  _Ismoke_1
## p = 0.0173 <  0.1000  adding  _Irace_2 _Irace_3
## 
## Logistic regression                             Number of obs     =        189
##                                                 LR chi2(6)        =      30.43
##                                                 Prob > chi2       =     0.0000
## Log likelihood = -102.11978                     Pseudo R2         =     0.1297
## 
## ------------------------------------------------------------------------------
##          low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          lwt |   .9834361   .0066887    -2.46   0.014     .9704134    .9966336
##       _Iht_1 |   6.490237   4.483259     2.71   0.007     1.676009    25.13302
##       _Iui_1 |   2.471801   1.106213     2.02   0.043     1.028189    5.942297
##    _Ismoke_1 |   2.817403   1.105908     2.64   0.008     1.305356    6.080917
##     _Irace_2 |   3.758631   1.959795     2.54   0.011     1.352705    10.44375
##     _Irace_3 |   2.526023   1.087054     2.15   0.031      1.08675    5.871446
##        _cons |   1.054066   .9884219     0.06   0.955     .1677556    6.623063
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
## 
## . 
## .

這時,只有年量被移出了最終模型,獲得了和逐步加入變量法 (forward) 相同的模型。但是實際上在很多現實數據中,這種情況很少見。

## 
## . use "../backupfiles/lbw.dta"
## (Hosmer & Lemeshow data)
## 
## . 
## . xi: stepwise, pe(0.05): logistic low i.ht lwt age (i.race) i.smoke i.ui
## i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)
## i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
## i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
## i.ui              _Iui_0-1            (naturally coded; _Iui_0 omitted)
##                       begin with empty model
## p = 0.0229 <  0.0500  adding  lwt
## p = 0.0081 <  0.0500  adding  _Iht_1
## p = 0.0320 <  0.0500  adding  _Iui_1
## 
## Logistic regression                             Number of obs     =        189
##                                                 LR chi2(3)        =      18.04
##                                                 Prob > chi2       =     0.0004
## Log likelihood = -108.31778                     Pseudo R2         =     0.0769
## 
## ------------------------------------------------------------------------------
##          low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##          lwt |   .9832484   .0064531    -2.57   0.010     .9706816     .995978
##       _Iht_1 |   7.102846   4.940338     2.82   0.005     1.817125    27.76387
##       _Iui_1 |   2.534583   1.099004     2.14   0.032     1.083484    5.929122
##        _cons |   2.900455   2.415719     1.28   0.201     .5669188    14.83923
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline odds.
## 
## .

這時,逐步刪除法即使使用了更加嚴格的p值標準,留下的變量和逐步增加法的模型也不同。所以,分析者採用逐步刪除或者逐步增加變量的方法,可能獲得截然不同的最終保留模型。

65.5.2.3 建立一個包含全部變量的模型。解釋其提示的孕期高血壓和低出生體重之間的關係。

M6 <- glm(low ~ ht + ui + lwt + as.factor(race) + 
              age + as.factor(smoke), data = lbw, 
            family = binomial(link = "logit"))
summary(M6); jtools::summ(M6, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui + lwt + as.factor(race) + age + as.factor(smoke), 
##     family = binomial(link = "logit"), data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.73162  -0.83298  -0.53446   0.98682   2.16710  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        0.4342545  1.1918299  0.3644 0.715590   
## ht                 1.8564821  0.6887108  2.6956 0.007026 **
## ui                 0.8953379  0.4484747  1.9964 0.045890 * 
## lwt               -0.0162548  0.0068565 -2.3707 0.017755 * 
## as.factor(race)2   1.2800691  0.5266377  2.4306 0.015072 * 
## as.factor(race)3   0.9022946  0.4343165  2.0775 0.037755 * 
## age               -0.0182848  0.0353519 -0.5172 0.605000   
## as.factor(smoke)1  1.0275430  0.3938993  2.6086 0.009090 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 203.970  on 181  degrees of freedom
## AIC: 219.97
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(7) 30.701874
Pseudo-R² (Cragg-Uhler) 0.210853
Pseudo-R² (McFadden) 0.130829
AIC 219.970122
BIC 245.904098
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 1.543812 0.149316 15.961799 0.364359 0.715590
ht 6.401179 1.659695 24.688325 2.695590 0.007026
ui 2.448163 1.016478 5.896342 1.996407 0.045890
lwt 0.983877 0.970743 0.997188 -2.370697 0.017755
as.factor(race)2 3.596888 1.281304 10.097220 2.430645 0.015072
as.factor(race)3 2.465253 1.052375 5.775006 2.077505 0.037755
age 0.981881 0.916152 1.052327 -0.517224 0.605000
as.factor(smoke)1 2.794192 1.291126 6.047055 2.608644 0.009090
Standard errors: MLE

放入了全部的變量之後,調整後的孕期高血壓和低出生體重之間關係的比值比變成了 6.40 ,它的95%信賴區間是 (1.66, 24.69)。這個結果和逐步變量篩選法,以及第一部分中逐個分析變量之間關係之後建立起來的模型所估算的比值比結果其實十分接近。在彙報這些結果的時候,無論分析者選用了哪種方法,最終獲得接近的答案,會讓讀者相信結果是穩定可靠的。分析者也應該在論文中提示使用不同的策略分析選擇變量和模型之後獲得的比值比都很接近。當然,有些人會樂意展示所有分析策略的過程和結果。另外值得注意的是,保留不同變量的模型所估計的各自的比值比之間嚴格來說是不同的條件比值比,由於邏輯回歸的不可壓縮型,他們不適宜直接拿來比較。

65.5.3 Part 3

已知這個數據中收集到的數據還包含 1)懷孕初期(最初三個月),孕婦訪問醫生的次數;2)該母親是否曾經有過流產史。根據這一實情,回答下面的問題:

65.5.3.1 你是否同一把上述這兩個中的一個或者兩者全部都放入前面兩個部分建立的回歸模型中做預測變量?在怎樣的條件下,你會認爲可能增加這兩個變量是可以考慮的?

關於懷孕初期訪問醫生次數這一變量,它很可能其實是處在孕期高血壓和低出生體重兩者因果關係通路上的變量。所以,增加它作爲預測變量可能引起不小的爭議。

調整母親的流產史同樣也是一件容易引起爭議的事。如果說,研究的目的是要預測某母親會產下低出生體重胎兒的概率,那麼增加這個變量可能會被認可。但是如果說研究目的是爲了分析孕期高血壓和低出生體重胎兒的出生之間可能存在的病因學關係,那你應該拒絕增加調整這個變量。這是由於母親的流產史可能和某些基因上的或者環境上的因素有關,間接或者直接的成爲母親的孕期高血壓(有或無),及低出生體重胎兒(有或無)的原因。

65.5.3.2 如果你認爲應該增加這兩個變量中的一個或者全部,請建立包含你認爲應該增加的變量的模型,觀察結果的變化。

儘管增加這兩個變量可能引起一些爭議,但是實際上增加了這兩個變量中的任何一個對孕期高血壓和低出生體重之間的關係影響都幾乎可以忽略不計。

M7 <- glm(low ~ ht + ui + lwt + ftv, data = lbw, 
            family = binomial(link = "logit"))
summary(M7); jtools::summ(M7, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui + lwt + ftv, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.81697  -0.82243  -0.70736   1.14260   2.10438  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.0878869  0.8387712  1.2970  0.19463   
## ht           1.9429161  0.6969828  2.7876  0.00531 **
## ui           0.9231848  0.4344255  2.1251  0.03358 * 
## lwt         -0.0167706  0.0065898 -2.5449  0.01093 * 
## ftv         -0.0482029  0.1697755 -0.2839  0.77647   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 216.554  on 184  degrees of freedom
## AIC: 226.554
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(4) 18.117740
Pseudo-R² (Cragg-Uhler) 0.128548
Pseudo-R² (McFadden) 0.077205
AIC 226.554256
BIC 242.762991
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.967996 0.573456 15.361253 1.297001 0.194631
ht 6.979073 1.780430 27.357134 2.787610 0.005310
ui 2.517295 1.074361 5.898176 2.125070 0.033581
lwt 0.983369 0.970750 0.996153 -2.544936 0.010930
ftv 0.952940 0.683207 1.329165 -0.283921 0.776471
Standard errors: MLE
M8 <- glm(low ~ ht + ui + lwt + ptl, data = lbw, 
            family = binomial(link = "logit"))
summary(M8); jtools::summ(M8, digits = 6, confint = TRUE, exp = TRUE)
## 
## Call:
## glm(formula = low ~ ht + ui + lwt + ptl, family = binomial(link = "logit"), 
##     data = lbw)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.05197  -0.79228  -0.68950   1.10836   2.14262  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.8291144  0.8483560  0.9773 0.328411   
## ht           1.9435809  0.7005454  2.7744 0.005531 **
## ui           0.7732965  0.4458041  1.7346 0.082810 . 
## lwt         -0.0158861  0.0066317 -2.3955 0.016599 * 
## ptl          0.6273847  0.3372620  1.8602 0.062853 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.672  on 188  degrees of freedom
## Residual deviance: 213.036  on 184  degrees of freedom
## AIC: 223.036
## 
## Number of Fisher Scoring iterations: 4
Observations 189
Dependent variable low
Type Generalized linear model
Family binomial
Link logit
𝛘²(4) 21.636480
Pseudo-R² (Cragg-Uhler) 0.152117
Pseudo-R² (McFadden) 0.092199
AIC 223.035516
BIC 239.244251
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 2.291289 0.434468 12.083750 0.977319 0.328411
ht 6.983714 1.769217 27.567142 2.774383 0.005531
ui 2.166898 0.904418 5.191676 1.734610 0.082810
lwt 0.984239 0.971529 0.997116 -2.395465 0.016599
ptl 1.872706 0.966923 3.627001 1.860229 0.062853
Standard errors: MLE

References

Robinson, Laurence D, and Nicholas P Jewell. 1991. “Some Surprising Results about Covariate Adjustment in Logistic Regression Models.” International Statistical Review, 227–40.