
Categorical data and logistic regression

Christophe Lalanne

December 2, 2014

Synopsis

Statistics is the grammar of science. Karl Pearson (1857-1936)




contingency and square tables • chi-square test • odds-ratio • logistic regression

Analysis of categorical data

Association between two categorical variables

To describe the association between two categorical variables, with I and J levels, we can build a two-way table of counts (or frequencies). Note that we generally need to know the marginal distribution of each variable (row and column totals).

A basic question that we may ask is whether the two variables are related in some way, which amounts to testing the null hypothesis that the two variables are independent.

There are several tests available, depending on the kind of data and the type of hypothesis we are interested in (Bishop, Fienberg, and Holland, 2007).

Two-way cross-classification table

In the general case, a count or frequency table can be built as shown below. The $n_{i\cdot}$ ($i=1,\dots,I$) and $n_{\cdot j}$ ($j=1,\dots,J$) represent the marginal distributions of variables A and B. Note that if $I=J$, we call it a square table.
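As a minimal sketch (with made-up variables A and B), such a table and its margins can be built in R with table() and addmargins():

## Two made-up categorical variables observed on N = 100 subjects
set.seed(101)
A <- sample(c("a1", "a2"), size = 100, replace = TRUE)
B <- sample(c("b1", "b2", "b3"), size = 100, replace = TRUE)
tab <- table(A, B)
addmargins(tab)  ## append row and column totals (the marginal distributions)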

The chi-square test

Pearson's chi-square test is used to compare observed data to expected data, that is, data occurring under the null hypothesis of no association between the two variables. Expected frequencies are given by the product of the marginal proportions (if A and B are independent, P(A and B) = P(A) x P(B)); equivalently, expected counts are the products of row and column totals divided by N.

Up to a constant, the chi-square statistic is the sum of squared differences between observed and expected data after some normalization:

$$\Phi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(p_{ij} - p_{i\cdot}\,p_{\cdot j})^2}{p_{i\cdot}\,p_{\cdot j}}.$$

Then, $\chi^2 = N\Phi^2$ is called the 'chi-squared' statistic, and it follows a $\chi^2$ distribution with $(I-1)(J-1)$ degrees of freedom.
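As a sketch of the computation (on a made-up 2 x 3 table of counts), the statistic can be obtained by hand and checked against chisq.test():

## Hand computation of N * Phi^2 on a made-up table of counts
tab <- matrix(c(20, 30, 10, 15, 15, 10), nrow = 2, byrow = TRUE)
N <- sum(tab)
p <- tab / N                          ## cell proportions p_ij
e <- outer(rowSums(p), colSums(p))    ## p_i. * p_.j, expected under independence
N * sum((p - e)^2 / e)                ## the chi-squared statistic
chisq.test(tab)$statistic             ## check (Yates' correction applies only to 2 x 2 tables)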

Measures of association

In the simplest setting, the 2-by-2 case, how can we define the association between two categorical variables?

  1. as a function of the cross-product of all cells (odds-ratio)
  2. as a correlation coefficient (chi-square test)

Let us consider a two-way table crossing two binary variables, with cell probabilities $p_{ij}$ and marginal probabilities $p_{i\cdot}$ (rows) and $p_{\cdot j}$ (columns):

          B=1    B=2    Total
  A=1     p11    p12    p1.
  A=2     p21    p22    p2.
  Total   p.1    p.2    1

The odds-ratio

The odds-ratio is defined as $$\alpha = \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}},$$

or, equivalently, $$\alpha = \frac{p_{11}/p_{12}}{p_{21}/p_{22}},$$ when row margins (e.g., exposure) are fixed (e.g., $p_{11}/p_{12}$ is the odds of disease when exposed).

This is a general measure of association, which is invariant after permutation of rows and columns.
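As a minimal sketch (the counts are made up), the odds-ratio can be computed directly from a 2 x 2 table:

## Odds-ratio from a made-up 2 x 2 table of counts
tab <- matrix(c(10, 20, 15, 40), nrow = 2, byrow = TRUE)
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  ## cross-product ratio
(tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])  ## ratio of row-wise odds (same value)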

The chi-square

With binary variables, the previous formula for the 'chi-squared' statistic simplifies to $$\Phi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(p_{ij} - p_{i\cdot}\,p_{\cdot j})^2}{p_{i\cdot}\,p_{\cdot j}} = \frac{(p_{11}p_{22} - p_{21}p_{12})^2}{p_{1\cdot}\,p_{2\cdot}\,p_{\cdot 1}\,p_{\cdot 2}},$$

which shows that it amounts to a simple correlation between two numerically-scored variables.

Indeed, considering 0/1 scores, we have

$$\rho = \frac{p_{11}p_{22} - p_{21}p_{12}}{\sqrt{p_{1\cdot}\,p_{2\cdot}\,p_{\cdot 1}\,p_{\cdot 2}}} = \frac{p_{22} - p_{2\cdot}\,p_{\cdot 2}}{\sqrt{p_{1\cdot}\,p_{2\cdot}\,p_{\cdot 1}\,p_{\cdot 2}}},$$

hence $r^2 = \chi^2/N$.
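This identity is easy to check numerically; a small simulation sketch (made-up data, Yates' correction disabled so the two quantities match exactly):

## Check that r^2 = chi^2 / N for two 0/1-scored variables
set.seed(17)
x <- rbinom(200, size = 1, prob = 0.4)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + x))   ## y weakly related to x
r2 <- cor(x, y)^2
chi2 <- chisq.test(table(x, y), correct = FALSE)$statistic
c(r2 = r2, chi2.over.N = unname(chi2) / 200)          ## the two values agree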

Illustration

Caffeine consumption and marital status (Dalgaard, 2008).

coffee <- matrix(c(652,1537,598,242,36,46,38,21,
                   218,327,106,67), nrow = 3, byrow = TRUE)
dimnames(coffee) <- list("marital status" = c("Married", "Prev. married", "Single"), 
                         consumption = c("0", "1-150", "151-300", ">300"))
coffee
##                consumption
## marital status    0 1-150 151-300 >300
##   Married       652  1537     598  242
##   Prev. married  36    46      38   21
##   Single        218   327     106   67

round(prop.table(coffee, margin = 1), digits = 2)  ## by row
##                consumption
## marital status     0 1-150 151-300 >300
##   Married       0.22  0.51    0.20 0.08
##   Prev. married 0.26  0.33    0.27 0.15
##   Single        0.30  0.46    0.15 0.09
round(prop.table(coffee, margin = 2), digits = 2)  ## by column
##                consumption
## marital status     0 1-150 151-300 >300
##   Married       0.72  0.80    0.81 0.73
##   Prev. married 0.04  0.02    0.05 0.06
##   Single        0.24  0.17    0.14 0.20

Bar chart

To plot the data with lattice, it is convenient to first reshape the table into long format:

library(lattice)
library(reshape2)
coffee.df <- melt(coffee, varnames = c("Status", "Caffeine"))
barchart(value ~ Caffeine, data = coffee.df, groups = Status, auto.key = TRUE)

Dotplot

dotplot(value ~ Caffeine, data = coffee.df, groups = Status, auto.key = TRUE)

chsq <- chisq.test(coffee)
chsq
## 
##  Pearson's Chi-squared test
## 
## data:  coffee
## X-squared = 51.6556, df = 6, p-value = 2.187e-09
chsq$residuals
##                consumption
## marital status           0     1-150    151-300       >300
##   Married       -2.0262275  1.269954  0.8291261 -0.9411871
##   Prev. married  0.5484102 -2.795611  2.1380815  2.6109594
##   Single         3.9187205 -1.369542 -2.6504574  0.7761028

Logistic regression

Generalized linear models

The theory of Generalized Linear Models offers a unified approach to regression models where a single response variable is assumed to follow a probability distribution function from the exponential family (Nelder and Wedderburn, 1972). This includes the following PDFs: Gaussian, Binomial, Poisson, Gamma, Inverse Gaussian, Geometric, and Negative Binomial. The idea is to 'relax' some of the assumptions of the linear model while keeping the model linear in its effect terms, through a suitable transformation (the link function) of the expected response. You may recall that in the case of linear regression, we usually relate the predictors to the expected value of the outcome like so:

$$E(y \mid x) = f(x; \beta),$$

or, using matrix notation,

$$E(y \mid x) = X\beta.$$

From linear to logistic regression

How can this be achieved with a logistic regression, where individual responses are binary and follow a Bernoulli distribution, $\mathcal{B}(1;\pi)$? Moreover, a standard linear regression model could predict individual probabilities outside the $[0;1]$ interval.
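A quick sketch of the second problem, using the birthwt data introduced in the next section (ages 15 and 45 lie within the observed range; 80 is a deliberate extrapolation):

## A linear model on a binary outcome can extrapolate outside [0;1]
data(birthwt, package = "MASS")
fit <- lm(low ~ age, data = birthwt)
predict(fit, newdata = data.frame(age = c(15, 45, 80)))  ## the last value may fall below 0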

Some transformations, like $p' = \arcsin\sqrt{p}$, have been proposed to allow the use of ANOVA with binary data (Zar, 1998). However, it is fairly easy to apply a logistic regression instead; see also (Dixon, 2008).

Considering the logit transformation of the probability of the event under consideration, $$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},$$ the logistic regression model is comparable to the linear case, i.e. it is additive in its effect terms. In the simplest case (one predictor plus an intercept term), we have:

$$g(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x.$$
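In R, the logit and inverse-logit transformations are available as qlogis() and plogis(); a minimal sketch with made-up coefficients:

## made-up coefficients for illustration
b0 <- 0.5; b1 <- -0.05
x <- seq(10, 45, by = 5)
eta <- b0 + b1 * x            ## linear predictor g(x), on the log-odds scale
pix <- plogis(eta)            ## pi(x) = exp(eta) / (1 + exp(eta)), always in [0;1]
all.equal(qlogis(pix), eta)   ## qlogis() is the logit, the inverse of plogis()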

Illustration

Prognostic study of risk factors associated with low infant birth weight (Hosmer and Lemeshow, 1989).

data(birthwt, package = "MASS")
yesno <- c("No","Yes")
birthwt <- within(birthwt, {
  race <- factor(race, labels = c("White","Black","Other"))
  smoke <- factor(smoke, labels = yesno)
  ui <- factor(ui, labels = yesno)
  ht <- factor(ht, labels = yesno)
})
  • low: indicator of low birth weight, i.e. whether the child's weight (bwt) is below 2.5 kg.
  • ui: presence of uterine irritability (mother).
  • age: mother's age (when enrolled in the study).

library(Hmisc)
summary(low ~ age + ui, data = birthwt)
## low    N=189
## 
## +-------+-------+---+---------+
## |       |       |N  |low      |
## +-------+-------+---+---------+
## |age    |[14,20)| 51|0.2941176|
## |       |[20,24)| 56|0.3571429|
## |       |[24,27)| 36|0.4166667|
## |       |[27,45]| 46|0.1956522|
## +-------+-------+---+---------+
## |ui     |No     |161|0.2795031|
## |       |Yes    | 28|0.5000000|
## +-------+-------+---+---------+
## |Overall|       |189|0.3121693|
## +-------+-------+---+---------+

m <- glm(low ~ age, data = birthwt, family = binomial("logit"))
summary(m)
## 
## Call:
## glm(formula = low ~ age, family = binomial("logit"), data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0402  -0.9018  -0.7754   1.4119   1.7800  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.38458    0.73212   0.525    0.599
## age         -0.05115    0.03151  -1.623    0.105
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 231.91  on 187  degrees of freedom
## AIC: 235.91
## 
## Number of Fisher Scoring iterations: 4
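The drop in deviance from the null to the fitted model (234.67 - 231.91 = 2.76 on 1 df) provides a likelihood-ratio test of the age effect; one way to obtain it:

anova(m, test = "Chisq")                                          ## deviance (LR) test for age
pchisq(m$null.deviance - m$deviance, df = 1, lower.tail = FALSE)  ## same p-value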

Fitted values

set.seed(154)
idx <- sample(1:nrow(birthwt), 5)
birthwt$low[idx]
## [1] 1 1 0 0 1
predict(m)[idx]
##         15         11         91        210         49 
## -0.8942416 -1.3546181 -0.6896299 -1.3034652 -0.5361710
predict(m, type = "response")[idx]
##        15        11        91       210        49 
## 0.2902353 0.2051164 0.3341154 0.2135824 0.3690787
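Note that predict(m) returns fitted values on the link (log-odds) scale, whereas type = "response" returns probabilities; the two are related by the inverse logit:

## response-scale predictions are the inverse logit of the link-scale ones
all.equal(plogis(predict(m)), predict(m, type = "response"))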

More than one predictor

m2 <- glm(low ~ ui + age, data = birthwt, family = binomial("logit"))
summary(m2)
## 
## Call:
## glm(formula = low ~ ui + age, family = binomial("logit"), data = birthwt)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2646  -0.8615  -0.7631   1.2740   1.8288  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.14876    0.74972   0.198   0.8427
## uiYes        0.90780    0.41955   2.164   0.0305
## age         -0.04744    0.03197  -1.484   0.1378
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 227.30  on 186  degrees of freedom
## AIC: 233.3
## 
## Number of Fisher Scoring iterations: 4

How do we interpret the effects of age and ui?

  • When age increases by one unit and ui = 0 (No), the log-odds decrease by 0.047; the adjusted odds-ratio equals exp(-0.0474) ≈ 0.954. The corresponding OR for a 5-year increase in age is exp(5 × (-0.0474)) ≈ 0.789 (computed below).
  • In the presence of uterine irritability (ui = 1), with age held constant, the log-odds increase by 0.908; the adjusted odds-ratio equals exp(0.908) ≈ 2.479.
exp(coef(m2)["uiYes"])
##    uiYes 
## 2.478858
exp(confint(m2)["uiYes",])
##    2.5 %   97.5 % 
## 1.083421 5.683810
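The OR quoted above for a 5-year increase in age is obtained by rescaling the coefficient before exponentiating:

exp(5 * coef(m2)["age"])  ## adjusted OR for a 5-year increase in age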

Yet another way to compute the odds-ratio:

m3 <- glm(low ~ ui, data = birthwt, family = binomial("logit"))
exp(coef(m3)["uiYes"])
##    uiYes 
## 2.577778
library(vcd)
oddsratio(xtabs(~ low + ui, data = birthwt), log = FALSE)
## [1] 2.577778

References

[1] Y. Bishop, S. Fienberg and P. Holland. Discrete Multivariate Analysis. 2nd ed. Springer, 2007.

[2] P. Dalgaard. Introductory Statistics with R. 2nd ed. Springer, 2008.

[3] P. Dixon. "Models of accuracy in repeated-measures designs". In: Journal of Memory and Language 59.4 (2008), pp. 447-456.

[4] D. Hosmer and S. Lemeshow. Applied Logistic Regression. New York: Wiley, 1989.

[5] J. Nelder and R. Wedderburn. "Generalized linear models". In: Journal of the Royal Statistical Society, Series A 135 (1972), pp. 370-384.

[6] J. Zar. Biostatistical Analysis. 4th ed. Pearson, Prentice Hall, 1998.