Christophe Lalanne
December 2, 2014
Statistics is the grammar of science. Karl Pearson (1857-1936)
To describe the association between two categorical variables, with I and J levels, we can build a two-way table of counts (or frequencies). Note that we generally need to know the marginal distribution of each variable (row and column totals).
A basic question we may ask is whether the two variables are related in some way, which amounts to testing the null hypothesis that the two variables are independent.
There are several tests available, depending on the kind of data and the type of hypothesis we are interested in (Bishop, Fienberg, and Holland, 2007).
In the general case, a count or frequency table can be built as shown below. The $n_{i\cdot}$ ($i=1,\dots,I$) and $n_{\cdot j}$ ($j=1,\dots,J$) represent the marginal distributions of variables A and B. Note that if $I=J$, we call it a square table.
Pearson's chi-square test is used to compare observed data to expected data, that is, data occurring under the null hypothesis of no association between the two variables. Expected counts are given by the product of the margins divided by the total sample size, $E_{ij} = n_{i\cdot}\,n_{\cdot j}/N$ (if A and B are independent, $P(A \text{ and } B) = P(A) \times P(B)$).
Up to a constant, the chi-square statistic is the sum of squared differences between observed and expected data after some normalization:
$$\Phi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(p_{ij} - p_{i\cdot}\,p_{\cdot j})^2}{p_{i\cdot}\,p_{\cdot j}}.$$
Then, $\chi^2 = N\Phi^2$ is called the 'chi-squared' statistic, and it asymptotically follows a $\chi^2$ distribution with $(I-1)(J-1)$ degrees of freedom.
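As a sanity check, both the expected counts and the statistic can be computed by hand. Below is a minimal sketch using a small hypothetical table of counts; chisq.test() returns the same value (no continuity correction is applied for tables larger than 2-by-2).

tab <- matrix(c(20, 30, 10,
                15, 10, 15), nrow = 2, byrow = TRUE)  ## hypothetical counts
N <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / N     ## E_ij = n_i. * n_.j / N
chi2 <- sum((tab - expected)^2 / expected)            ## N * Phi^2
chi2
pchisq(chi2, df = (nrow(tab) - 1) * (ncol(tab) - 1), lower.tail = FALSE)
chisq.test(tab)$statistic                             ## same value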
How can we define a general measure of association between two categorical variables in the 2-by-2 case?
Let us consider a two-way table (two binary variables):
The odds-ratio is defined as
$$\alpha = \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}},$$
or, equivalently,
$$\alpha = \frac{p_{11}/p_{12}}{p_{21}/p_{22}},$$
when row margins (e.g., exposure) are fixed (e.g., $p_{11}/p_{12}$ = odds of disease when exposed).
This is a general measure of association, which is invariant when rows and columns are interchanged (transposing the table leaves $\alpha$ unchanged).
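A minimal sketch of both the definition and the invariance, using a hypothetical 2-by-2 table of counts (rows: exposure, columns: disease):

tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  ## alpha = n11 n22 / (n12 n21)
tt <- t(tab)                                       ## interchange rows and columns
(tt[1, 1] * tt[2, 2]) / (tt[1, 2] * tt[2, 1])      ## unchanged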
With binary variables, the previous formula for the 'chi-squared' statistic simplifies to
$$\Phi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(p_{ij} - p_{i\cdot}\,p_{\cdot j})^2}{p_{i\cdot}\,p_{\cdot j}} = \frac{(p_{11}\,p_{22} - p_{21}\,p_{12})^2}{p_{1\cdot}\,p_{2\cdot}\,p_{\cdot 1}\,p_{\cdot 2}},$$
which shows that it amounts to a simple correlation between two numerically-scored variables.
Indeed, considering 0/1 scores, we have
$$\rho = \frac{p_{11}\,p_{22} - p_{21}\,p_{12}}{\sqrt{p_{1\cdot}\,p_{2\cdot}\,p_{\cdot 1}\,p_{\cdot 2}}} = \frac{p_{22} - p_{2\cdot}\,p_{\cdot 2}}{\sqrt{p_{1\cdot}\,p_{2\cdot}\,p_{\cdot 1}\,p_{\cdot 2}}},$$
hence, $r^2 = \chi^2/N$.
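This identity is easy to verify numerically. The sketch below expands a hypothetical 2-by-2 table into two 0/1-scored variables and compares the squared correlation with $\chi^2/N$ (the continuity correction must be turned off):

tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)
N <- sum(tab)
## as.vector() unrolls the table column-major: (1,1), (2,1), (1,2), (2,2)
x <- rep(c(0, 1, 0, 1), times = as.vector(tab))  ## row scores
y <- rep(c(0, 0, 1, 1), times = as.vector(tab))  ## column scores
cor(x, y)^2
chisq.test(tab, correct = FALSE)$statistic / N   ## same value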
Caffeine consumption and marital status (Dalgaard, 2008).
coffee <- matrix(c(652, 1537, 598, 242, 36, 46, 38, 21, 218, 327, 106, 67),
                 nrow = 3, byrow = TRUE)
dimnames(coffee) <- list("marital status" = c("Married", "Prev. married", "Single"),
                         consumption = c("0", "1-150", "151-300", ">300"))
coffee
## consumption ## marital status 0 1-150 151-300 >300 ## Married 652 1537 598 242 ## Prev. married 36 46 38 21 ## Single 218 327 106 67
round(prop.table(coffee, margin = 1), digits = 2) ## by row
##                consumption
## marital status    0 1-150 151-300 >300
##   Married       0.22  0.51    0.20 0.08
##   Prev. married 0.26  0.33    0.27 0.15
##   Single        0.30  0.46    0.15 0.09
round(prop.table(coffee, margin = 2), digits = 2) ## by column
##                consumption
## marital status    0 1-150 151-300 >300
##   Married       0.72  0.80    0.81 0.73
##   Prev. married 0.04  0.02    0.05 0.06
##   Single        0.24  0.17    0.14 0.20
- Bar chart: barchart(table(...))
- Dotplot: dotplot(count ~ A, groups = B)
library(reshape2)
coffee.df <- melt(coffee, varnames = c("Status", "Caffeine"))
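For instance, with the lattice package (a minimal sketch; the panel and grouping choices are just one option):

library(lattice)
barchart(value ~ Caffeine | Status, data = coffee.df)  ## one panel per status
dotplot(value ~ Caffeine, groups = Status, data = coffee.df,
        auto.key = TRUE)                               ## grouped dotplot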
chsq <- chisq.test(coffee)
chsq
## 
##  Pearson's Chi-squared test
## 
## data:  coffee
## X-squared = 51.6556, df = 6, p-value = 2.187e-09
chsq$residuals
##                consumption
## marital status           0     1-150    151-300       >300
##   Married       -2.0262275  1.269954  0.8291261 -0.9411871
##   Prev. married  0.5484102 -2.795611  2.1380815  2.6109594
##   Single         3.9187205 -1.369542 -2.6504574  0.7761028
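Cells with |residual| > 2 depart notably from independence (here, most markedly single persons drinking no coffee). A quick visual check, using base R's mosaicplot(), which shades cells according to the Pearson residuals:

mosaicplot(coffee, shade = TRUE)  ## shading reflects sign and size of residuals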
The theory of generalized linear models encompasses a unified approach to regression models where a single response variable is assumed to follow a probability distribution function from the exponential family (Nelder and Wedderburn, 1972). This includes the following distributions: Gaussian, Binomial, Poisson, Gamma, Inverse Gaussian, Geometric, and Negative Binomial. The idea is to relax some of the assumptions of the linear model such that the relationship between the response and the predictors remains linear. You may recall that in the case of linear regression, we usually relate the predictors to the expected value of the outcome like so:
$$E(y \mid x) = f(x; \beta),$$
or, using matrix notation,
$$E(y \mid x) = X\beta.$$
How can this be achieved with a logistic regression, where individual responses are binary and follow a Bernoulli distribution, $\mathcal{B}(1; p)$? Moreover, a standard regression model could predict individual probabilities outside the $[0;1]$ interval.
Some transformations, like $p' = \arcsin\sqrt{p}$, have been proposed to allow the use of ANOVA with binary data (Zar, 1998). However, it is fairly easy to fit a logistic regression instead; see also Dixon (2008).
Considering the logit transformation of the probability of the event under consideration,
$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},$$
the logistic regression model is comparable to the linear case, i.e. it is additive in its effect terms. In the simplest case (one predictor plus an intercept term), we have:
$$g(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x.$$
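A small numerical sketch of the link and its inverse, with hypothetical values for $\beta_0$ and $\beta_1$:

b0 <- 0.38; b1 <- -0.05        ## hypothetical coefficients
x <- 25
eta <- b0 + b1 * x             ## linear predictor, g(x)
exp(eta) / (1 + exp(eta))      ## pi(x), by hand
plogis(eta)                    ## identical, using R's inverse logit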
Prognostic study of risk factors associated with low infant birth weight (Hosmer and Lemeshow, 1989).
data(birthwt, package = "MASS")
yesno <- c("No", "Yes")
birthwt <- within(birthwt, {
  race <- factor(race, labels = c("White", "Black", "Other"))
  smoke <- factor(smoke, labels = yesno)
  ui <- factor(ui, labels = yesno)
  ht <- factor(ht, labels = yesno)
})
- low: indicator of low birth weight, i.e. whether the child's weight (bwt) is below 2.5 kg.
- ui: presence of uterine irritability (mother).
- age: mother's age (when enrolled in the study).

library(Hmisc)
summary(low ~ age + ui, data = birthwt)
## low    N=189
## 
## +-------+-------+---+---------+
## |       |       |N  |low      |
## +-------+-------+---+---------+
## |age    |[14,20)| 51|0.2941176|
## |       |[20,24)| 56|0.3571429|
## |       |[24,27)| 36|0.4166667|
## |       |[27,45]| 46|0.1956522|
## +-------+-------+---+---------+
## |ui     |No     |161|0.2795031|
## |       |Yes    | 28|0.5000000|
## +-------+-------+---+---------+
## |Overall|       |189|0.3121693|
## +-------+-------+---+---------+
m <- glm(low ~ age, data = birthwt, family = binomial("logit"))
summary(m)
## 
## Call:
## glm(formula = low ~ age, family = binomial("logit"), data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0402  -0.9018  -0.7754   1.4119   1.7800  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.38458    0.73212   0.525    0.599
## age         -0.05115    0.03151  -1.623    0.105
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 231.91  on 187  degrees of freedom
## AIC: 235.91
## 
## Number of Fisher Scoring iterations: 4
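The age effect can also be tested through the drop in deviance from the null model (a likelihood-ratio test; the deviances are read off the summary above):

anova(m, test = "Chisq")                             ## LR test of age
pchisq(234.67 - 231.91, df = 1, lower.tail = FALSE)  ## same test, by hand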
set.seed(154)
idx <- sample(1:nrow(birthwt), 5)
birthwt$low[idx]
## [1] 1 1 0 0 1
predict(m)[idx]
##         15         11         91        210         49 
## -0.8942416 -1.3546181 -0.6896299 -1.3034652 -0.5361710
predict(m, type = "response")[idx]
##        15        11        91       210        49 
## 0.2902353 0.2051164 0.3341154 0.2135824 0.3690787
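By default, predict() returns values on the link (log-odds) scale; type = "response" applies the inverse logit. A quick sketch confirming the two are consistent:

all.equal(plogis(predict(m)[idx]), predict(m, type = "response")[idx])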
m2 <- glm(low ~ ui + age, data = birthwt, family = binomial("logit"))
summary(m2)
## 
## Call:
## glm(formula = low ~ ui + age, family = binomial("logit"), data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2646  -0.8615  -0.7631   1.2740   1.8288  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.14876    0.74972   0.198   0.8427  
## uiYes        0.90780    0.41955   2.164   0.0305  
## age         -0.04744    0.03197  -1.484   0.1378  
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 227.30  on 186  degrees of freedom
## AIC: 233.3
## 
## Number of Fisher Scoring iterations: 4
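Since m is nested in m2, the contribution of ui can be assessed with a likelihood-ratio test (a sketch; the deviance drops by about 4.6 on 1 df):

anova(m, m2, test = "Chisq")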
How to interpret the effects of age and ui?

- For a 1-year increase in age, with ui = 0 (No) held fixed, the log-odds decrease by 0.047; the adjusted odds-ratio equals 0.954. The corresponding OR for a 5-year increase is 0.789.
- In the presence of uterine irritability (ui = 1), the log-odds increase by 0.908 when age is held constant; the adjusted odds-ratio equals 2.479.

exp(coef(m2)["uiYes"])
##    uiYes 
## 2.478858
exp(confint(m2)["uiYes",])
##    2.5 %   97.5 % 
## 1.083421 5.683810
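The adjusted odds-ratios for age quoted above can be recovered the same way (sketch):

exp(coef(m2)["age"])      ## OR for a 1-year increase, about 0.954
exp(5 * coef(m2)["age"])  ## OR for a 5-year increase, about 0.789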
Yet another way to compute the odds-ratio:
m3 <- glm(low ~ ui, data = birthwt, family = binomial("logit"))
exp(coef(m3)["uiYes"])
##    uiYes 
## 2.577778
library(vcd)
oddsratio(xtabs(~ low + ui, data = birthwt), log = FALSE)
## [1] 2.577778
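The same value follows directly from the cross-product ratio of the observed 2-by-2 table (sketch):

tab <- xtabs(~ low + ui, data = birthwt)
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  ## cross-product ratio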
[1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis. 2nd ed. Springer, 2007.
[2] P. Dalgaard. Introductory Statistics with R. 2nd ed. Springer, 2008.
[3] P. Dixon. "Models of accuracy in repeated-measures designs". In: Journal of Memory and Language 59.4 (2008), pp. 447-456.
[4] D. Hosmer and S. Lemeshow. Applied Logistic Regression. New York: Wiley, 1989.
[5] J. Nelder and R. Wedderburn. "Generalized linear models". In: Journal of the Royal Statistical Society, Series A 135 (1972), pp. 370-384.
[6] J. Zar. Biostatistical Analysis. 4th ed. Pearson, Prentice Hall, 1998.