第 61 章計數型因變量 Count outcomes

計數型變量在臨牀醫學/流行病學研究中也十分常見，下面是一些例子：

某個呼吸科診所的患者中，每個人在過去一個月中哮喘發作的次數；
癲癇患者在過去一年中癲癇發作次數；
接受腦部 CT 掃描的患者中，每個人被診斷出顱內腫瘤個數。

最早的泊松模型可以追溯到普魯士騎兵連中被馬蹄踢死士兵的人數模型。

61.1 泊松 GLM

一個計數型的隨機變量，只能取大於等於零的正整數，\(0,1,\cdots\)。泊松隨機變量可以理解爲產生於發生在一段時間內的事件次數。泊松模型可以用於計數型數據的迴歸模型的構建：

\[ \begin{aligned} Y &\sim \text{Po}(\mu) \\ \text{P} (Y = y) & = \frac{\mu^y e^{-\mu}}{y!} \end{aligned} \]

所以，一個泊松迴歸，默認的前提是因變量 \(Y\) 服從一個以預測變量 \(x_1, \cdots, x_p\) 爲條件的泊松分佈。其標準鏈接方程是 \(\theta=\text{log}(\mu)\)。

\[ \begin{aligned} Y_i & \sim \text{Po}(\mu_i) \\ \text{log}(\mu_i) & = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \end{aligned} \]

觀測對象 1，用模型中全部的預測變量 \(\mathbf{x_1}=(x_{11},\cdots,x_{1p})\) 計算獲得的擬合值，和另一個觀測對象 0 的擬合值之比爲：

\[ \begin{aligned} & \frac{\text{exp}(\alpha + \beta_1 x_{11} + \cdots + \beta_p x_{1p})}{\text{exp}(\alpha + \beta_1 x_{01} + \cdots + \beta_p x_{0p})} \\ = & \exp(\beta_1(x_{11}-x_{01}) + \cdots + \beta_p(x_{1p} - x_{0p})) \end{aligned} \]

其中，

線性預測方程 linear predictor 中的截距 \(\alpha\) 的含義是，當所有的預測變量均等於零 \(\mathbf{x_1} = 0\) 時，因變量 \(Y\) 的均值之對數。
\(\beta_1\) 的含義是，其餘預測變量保持不變時，預測變量 \(x_1\) 每增加一個單位時，因變量變化量的對數。
迴歸係數的指數 (自然底數) 大小，可以被理解爲是率比 (rate ratio) (詳見下一章率的 GLM)。

61.2 泊松迴歸實例

下列數據來自 UCLA 的統計學網站。數據內容是某高中全部學生，獲獎的次數。預測變量包括，1) 獲獎種類 “一般 General”，“學術類 Academic”，“技能類 Vocational”；和所有學生期末數學考試分數。

p <- read_csv("../backupfiles/poisson_sim.csv")

## Rows: 200 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): id, num_awards, prog, math
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

p <- within(p, {
  prog <- factor(prog, levels=1:3, labels=c("General", "Academic",
                                                     "Vocational"))
  id <- factor(id)
})
summary(p)

##        id        num_awards           prog          math       
##  1      :  1   Min.   :0.00   General   : 45   Min.   :33.000  
##  2      :  1   1st Qu.:0.00   Academic  :105   1st Qu.:45.000  
##  3      :  1   Median :0.00   Vocational: 50   Median :52.000  
##  4      :  1   Mean   :0.63                    Mean   :52.645  
##  5      :  1   3rd Qu.:1.00                    3rd Qu.:59.000  
##  6      :  1   Max.   :6.00                    Max.   :75.000  
##  (Other):194

下面的代碼擬合因變量爲獲獎次數，預測變量爲獲獎種類 (分類) 和數學成績 (連續) 的泊松分佈，泊松分佈默認的鏈接方程就是 \(\text{log}\)，所以你可以像第一行那樣把鏈接方程部分省略。結果也是一樣的。

m1 <- glm(num_awards ~ prog, family="poisson", data=p)
m2 <- glm(num_awards ~ prog, family=poisson(link = log), data=p)
summary(m1); summary(m2)

## 
## Call:
## glm(formula = num_awards ~ prog, family = "poisson", data = p)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.41421  -0.69282  -0.63246   0.00000   3.39133  
## 
## Coefficients:
##                Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)    -1.60944    0.33333 -4.8283 1.377e-06 ***
## progAcademic    1.60944    0.34733  4.6338 3.590e-06 ***
## progVocational  0.18232    0.44096  0.4135    0.6793    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.672  on 199  degrees of freedom
## Residual deviance: 234.460  on 197  degrees of freedom
## AIC: 416.515
## 
## Number of Fisher Scoring iterations: 6

## 
## Call:
## glm(formula = num_awards ~ prog, family = poisson(link = log), 
##     data = p)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.41421  -0.69282  -0.63246   0.00000   3.39133  
## 
## Coefficients:
##                Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)    -1.60944    0.33333 -4.8283 1.377e-06 ***
## progAcademic    1.60944    0.34733  4.6338 3.590e-06 ***
## progVocational  0.18232    0.44096  0.4135    0.6793    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.672  on 199  degrees of freedom
## Residual deviance: 234.460  on 197  degrees of freedom
## AIC: 416.515
## 
## Number of Fisher Scoring iterations: 6

輸出結果的迴歸係數部分，

該學校學生獲得學術類獎項的平均次數和獲得一般獎項的平均次數的比值是 \(\text{exp}(1.6094) = 4.999\)，所以獲得的學術類獎平均次數要高於一般獎次數 \(390\%\)；
獲得技能類獎的平均次數和一般獎平均次數的比值是 \(\text{exp}(0.1823) = 1.199\)，也就是高出了 \(19.9\%\)；
該校學生獲得一般類獎的次數平均每人是 \(\text{exp}(-1.6094) = 0.20\) 次；
該校學生獲得學術獎的次數平均每人是 \(\text{exp}(-1.6094 + 1.6094) = 1\) 次；(一人一次夠流弊)
該校學生獲得技能類獎的次數平均每人是 \(\text{exp}(-1.6094 + 0.182) = 0.24\) 次。

看來該校師生很重視學術。

當然也可以用下面定義的函數來幫助我們計算上面這些數值，及其信賴區間。

glm.RR <- function(GLM.RESULT, digits = 2) {

    if (GLM.RESULT$family$family == "binomial") {
        LABEL <- "OR"
    } else if (GLM.RESULT$family$family == "poisson") {
        LABEL <- "RR"
    } else {
        stop("Not logistic or Poisson model")
    }

    COEF      <- stats::coef(GLM.RESULT)
    CONFINT   <- stats::confint(GLM.RESULT)
    TABLE     <- cbind(coef=COEF, CONFINT)
    TABLE.EXP <- round(exp(TABLE), digits)

    colnames(TABLE.EXP)[1] <- LABEL

    TABLE.EXP
}

glm.RR(m1)

##                 RR 2.5 % 97.5 %
## (Intercept)    0.2  0.10   0.36
## progAcademic   5.0  2.68  10.63
## progVocational 1.2  0.51   2.94

61.3 過度離散 overdispersion

泊松分佈的前提條件之一是，方差和均值相等。這是一個非常強的假設，很多計數型數據其實是無法滿足這個條件的。許多時候 (包括上面的例子也是) 方差要大於或者小於均值：

epiDisplay::summ(p$num_awards[p$prog == "Academic"], graph = FALSE)

##  obs. mean   median  s.d.   min.   max.  
##  105  1      1       1.279  0      6

epiDisplay::summ(p$num_awards[p$prog == "General"], graph = FALSE)

##  obs. mean   median  s.d.   min.   max.  
##  45   0.2    0       0.405  0      1

epiDisplay::summ(p$num_awards[p$prog == "Vocational"], graph = FALSE)

##  obs. mean   median  s.d.   min.   max.  
##  50   0.24   0       0.517  0      2

試想一下，實際的數據中其實是經常出現這樣的違反泊松分佈前提的計數型數據的。例如某兩個觀測對象，如果他們二者的線性預測方程給出相等的結果 (他們各自的預測變量可以完全不同)，會被認爲服從相同均值，相同方差的泊松分佈，這顯然是不合理的。例如本章用到的學校學生獲獎的例子，有的學生成績好，那麼獲得學術類獎的平均次數 (及其方差) 自然和成績排在後面的學生不同，強制這樣的兩個學生服從相同均值，相同方差的泊松分佈顯然是不合情理的。手工好的學生，可能更傾向於獲得更多得技能類獎。實際情況下，還有許許多多其他的未知因素會影響學生獲獎的次數，例如家庭教育背景的不同，有些學生鋼琴獲獎多，因爲他每天都去練習彈鋼琴等等，這些都是沒有被收集到的數據。

真實情況應該是這樣的，當有其他的我們不知道的因素存在時，這些因素會導致某些人的均值高於其他人。如果對象 \(i\) 的因變量 \(Y_i\) 服從均值爲 \(\mu_i\) 的泊松分佈，那麼對於所有的 \(\mu_i\)，其均值 (overall mean) 是 \(\mu\)，方差 (overall variance) 是 \(\sigma^2\)。這是一個典型的隨機效應模型 random effect model，我們會在後面的 hierarchical data analysis 再深入討論，但是這裏的重點是，每個觀測對象自己的均值 \(\mu_i\)，是我們在普通泊松迴歸中忽略掉的隨機共變量 (the effects of omitted covariates)。

所以樣本數據來自的人羣如果共同均值 (或者叫邊際效應均值，marginal mean) 爲 \(\mu\)：

\[ E(Y_i) = E(E(Y_i | \mu_i)) = E(\mu_i) = \mu \]

和共同方差 (邊際效應方差) ，需要用到總體方差法則 (Law of total variance) 概念：

\[ \begin{aligned} \text{Var}(Y_i) & = E(\text{Var}(Y_i | \mu_i)) + \text{Var}(E(Y_i | \mu_i)) \\ & = E(\mu_i) + \text{Var}(\mu_i) \\ & = \mu + \sigma^2 \end{aligned} \]

61.3.1 過度離散怎麼查？

如果，我們的泊松回歸模型中的共變量全部都是分類型變量，我們可以把觀測值 \(Y\) 對每一個分類變量分別作簡單的數據總結，觀察其均值和方差是否可以認為大致相同。但是許多時候模型中不會只有分類型變量。

R 輸出的結果中的模型偏差 deviance，可以用來初步判斷整體模型的擬合優度。如果模型偏差除以殘差獲得的殘差偏差 (residual deviance) 足夠小，說明擬合的模型跟數據本身比較接近，也就是模型和數據擬合程度較好，反之則提示模型本身具有較高的過度離散 overdispersion。另外，模型偏差由於在個人數據 (individual data) 情況下不適用 (因為模型偏差值就不再服從卡方分佈了)，下面的檢驗結果僅僅只能作為極為微弱的參考證據。此時應該推薦使用 Pearson 的模型擬合檢驗。如果 Pearson 統計量，除以殘差的自由度獲得的值遠大於 1，就提示存在過度離散。

with(m1, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

##      res.deviance  df           p
## [1,]    234.45997 197 0.034961171

Goodness of fit 檢驗結果提示本模型可能存在過度離散，數據擬合度不理想。值得注意的是如果樣本很大時，模型偏差的檢驗統計量將不再服從卡方分佈，應用的時候一定要慎重。

61.3.2 負二項式分佈模型 negative binomial model

如果普通泊松迴歸模型擬合數據時，發現數據本身有過度離散的嫌疑，那麼建議使用負二項式分佈模型來重新擬合數據。負二項式分佈模型其實是泊松分佈的擴展版本，即考慮了個體的方差和均值的隨機效應 subject-specific random effect。如果設每個觀測對象的隨機效應部分爲 \(a_i\)，預測變量爲向量 \(\mathbf{x_i} = (x_{i1}, \cdots, x_{ip})\)，那麼因變量 \(Y_i\) 服從均值爲 \(\text{exp}(\beta^T\mathbf{x_i}+a_i)\) 泊松分佈。在負二項式分佈中，個體的隨機效應部分的自然底數的指數 \(e^{a_i}\) 其實是服從均值爲 1，方差爲 \(\alpha\) 的伽馬分佈 (gamma distribution)。\(\alpha\) 越大，说明过度离散越明显。

接下來用相同的數據，使用負二項式分佈模型在 R 裏作模型的擬合，你就會看到差別：

R 裏擬合負二項式分佈模型的函數 glm.nb 在基本包 MASS 裏。

m1 <- glm.nb(num_awards ~ prog, data = p)
m2 <- glm(num_awards ~ prog, family=poisson(link = log), data=p)
summary(m1)

## 
## Call:
## glm.nb(formula = num_awards ~ prog, data = p, init.theta = 1.72267107, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.25581  -0.67036  -0.61517   0.00000   2.32349  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.60944    0.35215 -4.5703 4.87e-06 ***
## progAcademic    1.60944    0.37291  4.3159 1.59e-05 ***
## progVocational  0.18232    0.46793  0.3896   0.6968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.7227) family taken to be 1)
## 
##     Null deviance: 211.264  on 199  degrees of freedom
## Residual deviance: 171.066  on 197  degrees of freedom
## AIC: 406.532
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.723 
##           Std. Err.:  0.717 
## 
##  2 x log-likelihood:  -398.532

summary(m2)

## 
## Call:
## glm(formula = num_awards ~ prog, family = poisson(link = log), 
##     data = p)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.41421  -0.69282  -0.63246   0.00000   3.39133  
## 
## Coefficients:
##                Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)    -1.60944    0.33333 -4.8283 1.377e-06 ***
## progAcademic    1.60944    0.34733  4.6338 3.590e-06 ***
## progVocational  0.18232    0.44096  0.4135    0.6793    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.672  on 199  degrees of freedom
## Residual deviance: 234.460  on 197  degrees of freedom
## AIC: 416.515
## 
## Number of Fisher Scoring iterations: 6

仔細比較普通泊松分佈迴歸和負二項式分佈迴歸的輸出結果，你會發現

迴歸係數的計算是完全相同的 (由於我們只放了一個簡單的分類型變量作爲預測變量，一般來說泊松迴歸和負二項式分佈迴歸計算的迴歸係數會有些許不同)；
另外一個變化是標準誤的估計量在負二項式分佈模型中明顯變大了，這就是我們放寬了前提條件，允許模型考慮個體的隨機效應的體現。如果泊松模型被數據本身的過度離散影響顯著，那麼泊松迴歸計算獲得的參數標準無是偏低的；
負二項式分佈迴歸的結果最底下出現的 Theta: 1.723 部分，它的倒數是前面提到的個體的隨機效應部分 \(a_i\) 服從的伽馬分佈的方差 \(\alpha\)。它是關鍵的離散程度參數 (dispersion parameter)。在 STATA 裏，如果用 nbreg 擬合負二項式分佈迴歸的模型，輸出的結果最底下會有 \(\alpha\) 值的報告，注意它和 R 輸出的 Theta 結果互爲倒數。另外，STATA 的輸出結果還會對 \(\alpha = 0\) 直接進行檢驗。在 R 裏面則需要給兩個模型分別進行擬合優度檢驗，多數情況下你會發現負二項式分佈迴歸的模型更加擬合數據：

with(m1, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

##      res.deviance  df          p
## [1,]    171.06608 197 0.90901874

with(m2, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

##      res.deviance  df           p
## [1,]    234.45997 197 0.034961171

另一種獲取沒有被低估的迴歸係數的標準誤的方法來自穩健統計學手段。在 R 裏，擬合完普通泊松迴歸以後，用 sandwich 包裏的 vcovHC() 命令進行穩健的參數誤差估計 (具體說是夾心方差矩陣估計 sandwich estimator of variance)：

m2 <- glm(num_awards ~ prog, family=poisson(link = log), data=p)
cov.m2 <- vcovHC(m2, type = "HC0")
std.err <- sqrt(diag(cov.m2))
robust.est <- cbind(Estimate= coef(m2), "Robust SE" = std.err,
"Pr(>|z|)" = 2 * pnorm(abs(coef(m2)/std.err), lower.tail=FALSE),
LL = coef(m1) - 1.96 * std.err,
UL = coef(m1) + 1.96 * std.err)
robust.est

##                   Estimate  Robust SE      Pr(>|z|)          LL         UL
## (Intercept)    -1.60943791 0.29814240 6.7305731e-08 -2.19379701 -1.0250788
## progAcademic    1.60943791 0.32296809 6.2517920e-07  0.97642045  2.2424554
## progVocational  0.18232156 0.42426407 6.6738767e-01 -0.64923602  1.0138791

61.4 GLM Practical 05

在這次練習中，我們來探索幾個不同的計數型數據的模型，進一步探討如何處理過度離散的方法。數據來自Stata的網站，記錄的是美國亞利桑那州醫院的住院時長數據。使用如下代碼下載該數據:

# medpar <- read_dta("http://www.stata-press.com/data/hh3/medpar.dta")
medpar <- read_dta(file = "../backupfiles/medpar.dta")

我們主要使用的數據是下面這幾列：

Variable	Description
`los`	length of hospital stay, in days
`age`	Age group
`type1`	Binary variable indicating elective admission
`type2`	Binary variable indicating urgent admission
`type3`	Binary variable indicating emergency admission

分析住院時間長短和年齡及其他共變量之間關係之前，先瞭解一下 los 本身的特徵。首先，計算 los 的平均值，及其 Wald 法的 95% 信賴區間。在怎樣的前提條件下，這個信賴區間可以被認爲有效 (valid)？你認爲這些前提條件在這裏得到滿足了嗎？

psych::describe(medpar$los)

##    vars    n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1495 9.85 8.83      8    8.61 5.93   1 116   115 3.65    26.16 0.23

t.test(medpar$los)$conf.int

## [1]  9.4060722 10.3022890
## attr(,"conf.level")
## [1] 0.95

這裏默認的前提條件是，住院時長 (days) 服從正態分佈。即使住院時長這一數據可能並不100% 服從正態分佈，但是如果樣本量足夠大，那麼該 95% 信賴區間依然可以被認爲近似可以涵蓋95%的可重複實驗的次數。這一結論依據的是中心極限定理。在這裏，住院時長的數據其實分佈的非常的偏，並非正態分佈。但是我們可以認爲由於樣本量接近1500，可以認爲計算獲得的95%信賴區間是漸進有效的。

接下來我們使用 glm 命令來估計 los 的邊際均值 (marginal mean)，不加任何預測變量。根據你擬合的泊松回歸模型獲得的結果，請思考 los 的模型估計 95% 信賴區間是多少。和前一步簡單的估計相比較，他們是否相似或者有怎樣的不同，原因是什麼。你認爲哪種估計更加有意義？

mA <- glm(los ~ 1, family=poisson(link = log), data=medpar)
jtools::summ(mA, digits = 6, confint = TRUE)

Observations	1495
Dependent variable	los
Type	Generalized linear model
Family	poisson
Link	log

𝛘²(0)	-0.000000
Pseudo-R² (Cragg-Uhler)	0.000000
Pseudo-R² (McFadden)	0.000000
AIC	14618.283648
BIC	14623.593529

	Est.	2.5%	97.5%	z val.	p
(Intercept)	2.287896	2.271748	2.304044	277.694425	0.000000
Standard errors: MLE

根據這個模型的結果，住院時長的均值可以被估計爲 \(\exp(2.287896) = 9.854\)。但是其95%信賴區間的估計是：

下限爲，\(\exp(2.287896 - 1.96\times0.008239) = 9.696\)

上限爲，\(\exp(2.287896 + 1.96\times0.008239) = 10.014\)

我們發現，均值的點估計，和第一步簡單歸納時的結果一致，都是 9.854 天。但是模型估計的95%信賴區間 (9.696, 10.014) 相比 Wald 法的 95% 信賴區間 (9.406, 10.302) 更加精確 (範圍更窄)。當然可以理解爲，當數據本身分佈較偏 (skew) 時，使用泊松模型分析獲得的結果更加可靠且更加有效 (more efficient)。在這個數據中，模型估計的信賴區間和wald法信賴區間之間的差別更加可能是由於住院時長數據本身的過度離散問題導致的。在R裏我們獲得的結果只有殘差離差量 (residual deviance): Residual deviance: 8901.1 on 1494 degrees of freedom。如果你用的是 Stata，還可以獲得 Pearson 統計量，以及他們除以自由度以後獲得的數字都顯著地大於1。這說明其實病人住院時長這樣的數據很大程度上有相當大的差異，因爲每個病人各自住院的時間長度更加取決於他們本身患病的程度。這樣的數據不會是通過泊松分佈可以簡單擬合的，因爲泊松分佈的均值和方差是嚴格相等的。

## 
## . use http://www.stata-press.com/data/hh3/medpar, clear
## 
## . glm los, family(Poisson) 
## 
## Iteration 0:   log likelihood = -7354.5568  
## Iteration 1:   log likelihood = -7308.2078  
## Iteration 2:   log likelihood = -7308.1418  
## Iteration 3:   log likelihood = -7308.1418  
## 
## Generalized linear models                         Number of obs   =      1,495
## Optimization     : ML                             Residual df     =      1,494
##                                                   Scale parameter =          1
## Deviance         =  8901.134077                   (1/df) Deviance =   5.957921
## Pearson          =  11828.70662                   (1/df) Pearson  =   7.917474
## 
## Variance function: V(u) = u                       [Poisson]
## Link function    : g(u) = ln(u)                   [Log]
## 
##                                                   AIC             =   9.778116
## Log likelihood   = -7308.141824                   BIC             =  -2019.829
## 
## ------------------------------------------------------------------------------
##              |                 OIM
##          los |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##        _cons |   2.287896   .0082389   277.69   0.000     2.271748    2.304044
## ------------------------------------------------------------------------------
## 
## .

也就是下面這兩行所提示的內容。

## Deviance         =  8901.134077                   (1/df) Deviance =   5.957921
## Pearson          =  11828.70662                   (1/df) Pearson  =   7.917474

下面我們在泊松回歸模型中加入不同的入院類型的啞變量，看他們是否和患者住院時長有關。嘗試用醫學文獻的文章寫法描述這個模型的結果。

mB <- glm(los ~ type2 + type3, family=poisson(link = log), data=medpar)
jtools::summ(mB, digits = 6, confint = TRUE, exp = TRUE)

Observations	1495
Dependent variable	los
Type	Generalized linear model
Family	poisson
Link	log

𝛘²(2)	717.506464
Pseudo-R² (Cragg-Uhler)	0.381200
Pseudo-R² (McFadden)	0.049090
AIC	13904.777184
BIC	13920.706828

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	8.830688	8.659413	9.005350	217.975720	0.000000
type2	1.267877	1.216985	1.320897	11.354985	0.000000
type3	2.065477	1.963233	2.173046	28.003160	0.000000
Standard errors: MLE

記得模型中省略掉了 type1 因爲它被當作參考組 (reference group)。

下面的文獻描述可以用於參考：

There is strong evidence (p < 0.0001) that the length of hospital admission is related to the type of admission. The mean length of stay for elective admission is 8.83 days (95% CI 8.66 to 9.01 days). Urgent admissions are associated with stays that are on average 26.8% (95% CI 21.7% to 32.1%) longer than those resulting from elective admissions. Emergency admissions result in stays that are on average 2.06 (95% CI 1.96 to 2.17) times as long as those from elective admissions.

值得注意的是，這些結果所依據的泊松回歸模型的前提條件很可能因爲數據本身存在過度離散 (overdispersion) 的問題而無法得到滿足。

在上述模型中如果加入年齡作爲解釋變量，試分析年齡是否可以認爲是住院時長的獨立解釋變量 (獨立於住院形態 type of admission)。

mC <- glm(los ~ as.factor(age) + type2 + type3, 
          family=poisson(link = log), data=medpar)
jtools::summ(mC, digits = 6, confint = TRUE, exp = TRUE)

Observations	1495
Dependent variable	los
Type	Generalized linear model
Family	poisson
Link	log

𝛘²(10)	735.948666
Pseudo-R² (Cragg-Uhler)	0.388787
Pseudo-R² (McFadden)	0.050351
AIC	13902.334982
BIC	13960.743678

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	11.329126	9.015810	14.236002	20.830249	0.000000
as.factor(age)2	0.775566	0.608964	0.987746	-2.059888	0.039409
as.factor(age)3	0.817641	0.647619	1.032300	-1.692696	0.090513
as.factor(age)4	0.791049	0.628003	0.996425	-1.990375	0.046550
as.factor(age)5	0.765953	0.608122	0.964748	-2.264798	0.023525
as.factor(age)6	0.795170	0.631446	1.001345	-1.948541	0.051350
as.factor(age)7	0.759360	0.601769	0.958220	-2.319574	0.020364
as.factor(age)8	0.723916	0.570628	0.918383	-2.661289	0.007784
as.factor(age)9	0.731655	0.571365	0.936912	-2.476470	0.013269
type2	1.265918	1.214954	1.319021	11.246930	0.000000
type3	2.064878	1.962417	2.172690	27.922756	0.000000
Standard errors: MLE

lrtest(mC, mB)

## Likelihood ratio test
## 
## Model 1: los ~ as.factor(age) + type2 + type3
## Model 2: los ~ type2 + type3
##   #Df   LogLik Df   Chisq Pr(>Chisq)  
## 1  11 -6940.17                        
## 2   3 -6949.39 -8 18.4422   0.018145 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

加入和年齡組作爲預測變量的模型結果如上所示。和沒有加年齡的模型相比較的似然比檢驗 (likelihood ratio test) 結果顯示，如果泊松回歸模型前提得到滿足，那麼有證據證明 (p = 0.018)，在調整了住院形態之後，年齡依然是住院時長獨立的預測變量。

重新對加入年齡的泊松回歸模型加入穩健統計標準誤 (robust standard error) 的估計，獲得新的穩健信賴區間估計。與非穩健信賴區間相比較，你能得出怎樣的結論？

jtools::summ(mC, digits = 6, confint = TRUE, exp = TRUE, 
             robust = "HC1")

Observations	1495
Dependent variable	los
Type	Generalized linear model
Family	poisson
Link	log

𝛘²(10)	735.948666
Pseudo-R² (Cragg-Uhler)	0.388787
Pseudo-R² (McFadden)	0.050351
AIC	13902.334982
BIC	13960.743678

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	11.329126	9.015810	14.236002	20.830249	0.000000
as.factor(age)2	0.775566	0.608964	0.987746	-2.059888	0.039409
as.factor(age)3	0.817641	0.647619	1.032300	-1.692696	0.090513
as.factor(age)4	0.791049	0.628003	0.996425	-1.990375	0.046550
as.factor(age)5	0.765953	0.608122	0.964748	-2.264798	0.023525
as.factor(age)6	0.795170	0.631446	1.001345	-1.948541	0.051350
as.factor(age)7	0.759360	0.601769	0.958220	-2.319574	0.020364
as.factor(age)8	0.723916	0.570628	0.918383	-2.661289	0.007784
as.factor(age)9	0.731655	0.571365	0.936912	-2.476470	0.013269
type2	1.265918	1.214954	1.319021	11.246930	0.000000
type3	2.064878	1.962417	2.172690	27.922756	0.000000
Standard errors: Robust, type = HC1

可以看到加入了 robust 選項之後，並不會改變每個變量的回歸係數的點估計 (point estimates)。但是，可以發現每個變量的回歸係數對應的標準誤發生了較大的變化 – 信賴區間的範圍都無一例外地變大了。由於使用 robust = "HC1" 選項，這裏的標準誤估計不再依據泊松模型的前提條件 – 泊松分佈的特徵。在 Stata 中擬合相同的模型，我們可以獲得是否有過度離散的指標型數據結果，也就是殘差離差量 (residual deviance) 和 Pearson 統計量，以及他們對各自的自由度之比：

## 
## . use http://www.stata-press.com/data/hh3/medpar, clear
## 
## . glm los type2 type3 i.age, family(Poisson) robust eform
## 
## Iteration 0:   log pseudolikelihood = -7023.6853  
## Iteration 1:   log pseudolikelihood = -6940.3577  
## Iteration 2:   log pseudolikelihood = -6940.1675  
## Iteration 3:   log pseudolikelihood = -6940.1675  
## 
## Generalized linear models                         Number of obs   =      1,495
## Optimization     : ML                             Residual df     =      1,484
##                                                   Scale parameter =          1
## Deviance         =   8165.18541                   (1/df) Deviance =   5.502147
## Pearson          =  9346.752373                   (1/df) Pearson  =   6.298351
## 
## Variance function: V(u) = u                       [Poisson]
## Link function    : g(u) = ln(u)                   [Log]
## 
##                                                   AIC             =   9.299221
## Log pseudolikelihood = -6940.167491               BIC             =  -2682.679
## 
## ------------------------------------------------------------------------------
##              |               Robust
##          los |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##        type2 |   1.265918    .067136     4.45   0.000     1.140942    1.404585
##        type3 |   2.064878   .2416633     6.20   0.000     1.641625    2.597257
##              |
##          age |
##           2  |   .7755658   .1792692    -1.10   0.272     .4930223    1.220031
##           3  |   .8176414    .173137    -0.95   0.342     .5399076    1.238244
##           4  |   .7910486   .1623995    -1.14   0.254     .5289985     1.18291
##           5  |   .7659533   .1574871    -1.30   0.195     .5119027    1.146086
##           6  |   .7951697   .1646582    -1.11   0.268     .5299061    1.193221
##           7  |   .7593596   .1589473    -1.32   0.188     .5038207    1.144508
##           8  |   .7239164   .1535639    -1.52   0.128     .4776652    1.097118
##           9  |   .7316548   .1652243    -1.38   0.166     .4699868    1.139008
##              |
##        _cons |   11.32913   2.271633    12.11   0.000     7.647503    16.78314
## ------------------------------------------------------------------------------
## Note: _cons estimates baseline incidence rate.
## 
## .

可以看到殘差離差量 (residual deviance) 和 Pearson 統計量與各自的自由度之比均顯著大於1。提示該數據有相當程度的過度離散，也就是泊松回歸模型的前提泊松分佈無法得到滿足。而使用了 robust = "HC1" (in R) 或者 robust (in Stata) 的選項之後，就不再需要這一前提假設。我們也看到穩健信賴區間相較於沒有使用該選項時要不那麼精確，也就是區間範圍都變寬了。

當考慮了過度離散的模型被採納後 robust (in Stata)，我們無法再使用似然比檢驗法檢驗年齡是否是住院時長的獨立預測變量。但是我們可以使用 Stata 裏特有的模型擬合之後的 test 命令來實施聯合 Wald 檢驗 (joint Wald test) 年齡是否在穩健泊松模型下依然是住院時長的獨立預測變量。試解讀該檢驗結果和之前未考慮過度離散現象時使用的模型的年齡的獨立預測變量檢驗之間有何不同，原因是什麼呢？

## 
## . use http://www.stata-press.com/data/hh3/medpar, clear
## 
## . quietly: glm los type2 type3 i.age, family(Poisson) robust eform
## 
## . test 2.age = 3.age = 4.age = 5.age = 6.age = 7.age = 8.age = 9.age = 0
## 
##  ( 1)  [los]2.age - [los]3.age = 0
##  ( 2)  [los]2.age - [los]4.age = 0
##  ( 3)  [los]2.age - [los]5.age = 0
##  ( 4)  [los]2.age - [los]6.age = 0
##  ( 5)  [los]2.age - [los]7.age = 0
##  ( 6)  [los]2.age - [los]8.age = 0
##  ( 7)  [los]2.age - [los]9.age = 0
##  ( 8)  [los]2.age = 0
## 
##            chi2(  8) =    4.09
##          Prob > chi2 =    0.8493
## 
## .

這個聯合 Wald 檢驗的結果是 p = 0.8493。這和之前未考慮數據過度離散時使用的模型時進行的對年齡這一變量的似然比模型檢驗結果大相徑庭 (p = 0.018)。之所以結果相差如此之大，我們相信主要是因爲之前忽略數據過度離散問題的模型其實是錯誤的。而考慮了數據過度離散特徵之後，可以認爲使用了穩健泊松回歸模型之後的聯合 Wald 檢驗結果才是真的值得相信的。

請使用負二項回歸模型 (negative binomial regression model) 來擬合上述模型，先擬合一個不加任何預測變量的負二項回歸模型。請解釋模型結果的下半部分出現的似然比檢驗的意義，和無預測變量的泊松回歸模型結果做一下對比。

在 Stata 裏：

## 
## . use http://www.stata-press.com/data/hh3/medpar, clear
## 
## . nbreg los
## 
## Fitting Poisson model:
## 
## Iteration 0:   log likelihood = -7308.1418  
## Iteration 1:   log likelihood = -7308.1418  
## 
## Fitting constant-only model:
## 
## Iteration 0:   log likelihood = -4988.8172  
## Iteration 1:   log likelihood = -4857.3372  
## Iteration 2:   log likelihood =  -4856.494  
## Iteration 3:   log likelihood =  -4856.494  
## 
## Fitting full model:
## 
## Iteration 0:   log likelihood =  -4856.494  
## Iteration 1:   log likelihood =  -4856.494  
## 
## Negative binomial regression                    Number of obs     =      1,495
##                                                 LR chi2(0)        =       0.00
## Dispersion     = mean                           Prob > chi2       =          .
## Log likelihood =  -4856.494                     Pseudo R2         =     0.0000
## 
## ------------------------------------------------------------------------------
##          los |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##        _cons |   2.287896   .0198946   115.00   0.000     2.248903    2.326888
## -------------+----------------------------------------------------------------
##     /lnalpha |  -.7128727   .0431289                     -.7974038   -.6283417
## -------------+----------------------------------------------------------------
##        alpha |   .4902339   .0211432                       .450497    .5334758
## ------------------------------------------------------------------------------
## LR test of alpha=0: chibar2(01) = 4903.30              Prob >= chibar2 = 0.000
## 
## .

在 R 裏：

mD <- glm.nb(los ~ 1, data=medpar)
jtools::summ(mD, digits = 6, confint = TRUE, exp = TRUE)

## Error in glm.control(...) : 
##   unused argument (family = list("Negative Binomial(2.0398)", "log", function (mu) 
## log(mu), function (eta) 
## pmax(exp(eta), .Machine$double.eps), function (mu) 
## mu + mu^2/.Theta, function (y, mu, wt) 
## 2 * wt * (y * log(pmax(1, y)/mu) - (y + .Theta) * log((y + .Theta)/(mu + .Theta))), function (y, n, mu, wt, dev) 
## {
##     term <- (y + .Theta) * log(mu + .Theta) - y * log(mu) + lgamma(y + 1) - .Theta * log(.Theta) + lgamma(.Theta) - lgamma(.Theta + y)
##     2 * sum(term * wt)
## }, function (eta) 
## pmax(exp(eta), .Machine$double.eps), expression({
##     if (any(y < 0)) stop("negative values not allowed for the negative binomial family")
##     n <- rep(1, nobs)
##     mustart <- y + (y == 0)/6
## }), function (mu) 
## all(mu > 0), function (eta) 
## TRUE, function (object, nsim) 
## {
##     ftd <- fitted(object)
##     rnegbin(nsim * length(ftd), ftd, .Theta)
## }))

## Warning: Something went wrong when calculating the pseudo R-squared. Returning NA instead.

Observations	1495
Dependent variable	los
Type	Generalized linear model
Family	Negative Binomial(2.0398)
Link	log

𝛘²(NA)	NA
Pseudo-R² (Cragg-Uhler)	NA
Pseudo-R² (McFadden)	NA
AIC	9716.988008
BIC	9727.607771

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	9.854181	9.477334	10.246011	115.000849	0.000000
Standard errors: MLE

summary(mD)

## 
## Call:
## glm.nb(formula = los ~ 1, data = medpar, init.theta = 2.03984274, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.92829  -0.98645  -0.26025   0.38038   5.49923  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) 2.287896   0.019895     115 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.0398) family taken to be 1)
## 
##     Null deviance: 1571.48  on 1494  degrees of freedom
## Residual deviance: 1571.48  on 1494  degrees of freedom
## AIC: 9716.99
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.0398 
##           Std. Err.:  0.0880 
## 
##  2 x log-likelihood:  -9712.9880

可以看到在 R 裏最下方出現的是 Theta 也就是評價過度離散程度的指標，他和 Stata 的輸出報告中的 alpha 互爲倒數。不同的是 Stata 的報告中還對 alpha = 0 做了檢驗。檢驗結果提示住院時長這個數據並不服從泊松分佈。也就是實際的住院時長的數據比泊松分佈時的結果的方差要大的多。 (there is more variability than would be expected from a Poisson variable) 其實，簡單對 los 分析一下就知道它的樣本均值和樣本方差分別是：（他們相差巨大，不符合泊松分佈的特徵）

mean(medpar$los)

## [1] 9.8541806

var(medpar$los)

## [1] 78.020222

如果你有興趣，請擬合一個只有住院形態一個預測變量的負二項回歸模型。使用其輸出的結果來計算一下不同住院形態下的平均住院時長。比較模型預測的平均住院時長和觀測到的不同住院形態下的實際住院時長的平均值之間的差別。

在 Stata 裏：

## 
## . use http://www.stata-press.com/data/hh3/medpar, clear
## 
## . nbreg los type2 type3 
## 
## Fitting Poisson model:
## 
## Iteration 0:   log likelihood = -6949.6915  
## Iteration 1:   log likelihood = -6949.3886  
## Iteration 2:   log likelihood = -6949.3886  
## 
## Fitting constant-only model:
## 
## Iteration 0:   log likelihood = -4988.8172  
## Iteration 1:   log likelihood = -4857.3372  
## Iteration 2:   log likelihood =  -4856.494  
## Iteration 3:   log likelihood =  -4856.494  
## 
## Fitting full model:
## 
## Iteration 0:   log likelihood = -4802.8473  
## Iteration 1:   log likelihood = -4800.2234  
## Iteration 2:   log likelihood = -4800.2189  
## Iteration 3:   log likelihood = -4800.2189  
## 
## Negative binomial regression                    Number of obs     =      1,495
##                                                 LR chi2(2)        =     112.55
## Dispersion     = mean                           Prob > chi2       =     0.0000
## Log likelihood = -4800.2189                     Pseudo R2         =     0.0116
## 
## ------------------------------------------------------------------------------
##          los |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
## -------------+----------------------------------------------------------------
##        type2 |   .2373439   .0502166     4.73   0.000     .1389211    .3357666
##        type3 |   .7253612   .0757014     9.58   0.000     .5769893    .8737332
##        _cons |   2.178233   .0222433    97.93   0.000     2.134637    2.221829
## -------------+----------------------------------------------------------------
##     /lnalpha |  -.8033559   .0443828                     -.8903446   -.7163671
## -------------+----------------------------------------------------------------
##        alpha |   .4478236   .0198757                      .4105142    .4885238
## ------------------------------------------------------------------------------
## LR test of alpha=0: chibar2(01) = 4298.34              Prob >= chibar2 = 0.000
## 
## .

在 R 裏：

mE <- glm.nb(los ~ type2 + type3, data=medpar)
jtools::summ(mE, digits = 6, confint = TRUE)

## Error in glm.control(...) : 
##   unused argument (family = list("Negative Binomial(2.233)", "log", function (mu) 
## log(mu), function (eta) 
## pmax(exp(eta), .Machine$double.eps), function (mu) 
## mu + mu^2/.Theta, function (y, mu, wt) 
## 2 * wt * (y * log(pmax(1, y)/mu) - (y + .Theta) * log((y + .Theta)/(mu + .Theta))), function (y, n, mu, wt, dev) 
## {
##     term <- (y + .Theta) * log(mu + .Theta) - y * log(mu) + lgamma(y + 1) - .Theta * log(.Theta) + lgamma(.Theta) - lgamma(.Theta + y)
##     2 * sum(term * wt)
## }, function (eta) 
## pmax(exp(eta), .Machine$double.eps), expression({
##     if (any(y < 0)) stop("negative values not allowed for the negative binomial family")
##     n <- rep(1, nobs)
##     mustart <- y + (y == 0)/6
## }), function (mu) 
## all(mu > 0), function (eta) 
## TRUE, function (object, nsim) 
## {
##     ftd <- fitted(object)
##     rnegbin(nsim * length(ftd), ftd, .Theta)
## }))

## Warning: Something went wrong when calculating the pseudo R-squared. Returning NA instead.

Observations	1495
Dependent variable	los
Type	Generalized linear model
Family	Negative Binomial(2.233)
Link	log

𝛘²(NA)	NA
Pseudo-R² (Cragg-Uhler)	NA
Pseudo-R² (McFadden)	NA
AIC	9608.437779
BIC	9629.677305

	Est.	2.5%	97.5%	z val.	p
(Intercept)	2.178233	2.134637	2.221829	97.927402	0.000000
type2	0.237344	0.138921	0.335767	4.726402	0.000002
type3	0.725361	0.576989	0.873733	9.581877	0.000000
Standard errors: MLE

summary(mE)

## 
## Call:
## glm.nb(formula = los ~ type2 + type3, data = medpar, init.theta = 2.233022138, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.47528  -0.90422  -0.28563   0.43558   3.86701  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) 2.178233   0.022243 97.9274 < 2.2e-16 ***
## type2       0.237344   0.050217  4.7264 2.285e-06 ***
## type3       0.725361   0.075701  9.5819 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.233) family taken to be 1)
## 
##     Null deviance: 1685.08  on 1494  degrees of freedom
## Residual deviance: 1568.12  on 1492  degrees of freedom
## AIC: 9608.44
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.2330 
##           Std. Err.:  0.0991 
## 
##  2 x log-likelihood:  -9600.4380

從輸出的分析報告結果來看，模型估計的不同住院形態 (elective, urgent, and emergency) 的平均住院時長分別是 \(\exp(2.1782) = 8.83\) 天，\(\exp(2.1782 + 0.2373) = 11.20\) 天，\(\exp(2.1782 + 0.7253) = 18.24\)天。要估算模型的預測方差，可以通過手工計算。已知如果期待值是 \(\mu\)，那麼其方差是 \(\mu(1 + \alpha \mu)\)。那麼依據這個公式，就可以估算不同住院形態 (elective, urgent, and emergency) 的住院時長的方差分別是 \(8.83 \times (1 + 0.4478 \times 8.83) = 43.74\) days\(^2\)，\(11.2 \times (1 + 0.4478 \times 11.2) = 67.3\) days\(^2\)，\(18.24 \times (1 + 0.4478 \times 18.24) = 167.2\) days\(^2\)。

和觀測值相比較：

var(medpar$los[medpar$type1 == 1])

## [1] 41.680046

var(medpar$los[medpar$type2 == 1])

## [1] 77.878016

var(medpar$los[medpar$type3 == 1])

## [1] 424.87884

可見，對於前兩種住院形態來說，模型推測的方差和觀測方差還是比較接近的。但是對於第三種住院形態 (emergency)，觀測方差遠遠大於模型推測方差。當然，這可能由於因爲緊急住院的患者人數較低 (n = 96)。也就是說，樣本量太低的組使用該模型推測的方差準確度就比較低。另一種原因值得考慮的就是，負二項回歸模型中使用的 Gamma 隨機效應分佈可能並不適用與本數據 (對於不同住院形態的住院時長可能需要不同的 alpha，而不是強迫他們都是相同的)。另外一個被忽視的是，住院時長這一數據其實是只能取正值的數據。然而泊松分佈和負二項分佈本身其實是允許有 0 這樣的數字的。如此一來，或許我們應該把住院時長定義爲允許 0 存在的數字。也就是把住院時長的數據減去1。(number of days from the day of admission)

第 61 章 計數型因變量 Count outcomes