I always found Dave Garson’s tutorial on Reliability Analysis very interesting. However, all illustrations are with SPSS. Here is a friendly R version of some of these notes, especially for computing intraclass correlation.
They are different versions of the intraclass correlation coefficient (ICC), that reflect distinct ways of accounting for raters or items variance in overall variance, following Shrout and Fleiss (1979) cases 1 to 3 in Table 1):
They are called ICC(1,1), ICC(2,1), and ICC(3,1) when we have only one rating for each rater x item combination, and ICC(1,k), ICC(2,k), and ICC(3,k) when the reported rater x item cell is an average.
There are many ways to get ICC estimates, including psy (icc
), psych (ICC
), or irr (icc
). I will use the latter one.
I fetched the data from here , which looks like a general repository for SPSS example dataset that I don’t have. Let’s go with the tvsurvey.sav
dataset which D. Garson used when describing ICC. Quoting his words:
Some 906 respondents were asked if they would watch a particular show for….
 Any reason
 No other popular shows on at that time
 Critics still give the show good reviews
 Other people still watch the show
 The original screenwriters stay
 The original directors stay
 The original cast stays
where answers were coded 0=No or 1=Yes.
I happened to load the data and recode them as binary variable as follows:
> library(foreign)
> tv < read.spss("tvsurvey.sav", to.data.frame=T)
> head(tv)
ANY BORED CRITICS PEERS WRITERS DIRECTOR CAST
1 N0 NO NO NO YES YES YES
2 YES YES NO YES YES YES YES
3 YES YES YES YES YES YES YES
4 YES YES YES YES YES YES YES
5 YES YES YES YES YES YES YES
6 YES YES YES YES YES YES YES
> dim(tv)
[1] 906 7
> tv.num < sapply(tv, function(x) as.numeric(x)1)
Then, computing ICC for “Singles Measures” is just a matter of running:
> library(irr)
> icc(t(tv.num))
Single Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 7
Raters = 906
ICC(1) = 0.143
FTest, H0: r0 = 0 ; H1: r0 > 0
F(6,6335) = 152 , p = 8.02e181
95%Confidence Interval for ICC Population Values:
0.064 < ICC < 0.448
The default settings correspond to the socalled “consistency” ICC, that is ICC computed from a oneway ANOVA where raters are considered as random effects (henece, representative of a larger pool of raters). But D. Garson also showed ICC for “Average measures”, which means that we consider the unit of analysis to be the mean of several ratings. This is readily obtained using the following instruction:
> icc(t(tv.num), unit="average")
Average Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 7
Raters = 906
ICC(906) = 0.993
FTest, H0: r0 = 0 ; H1: r0 > 0
F(6,6335) = 152 , p = 8.02e181
95%Confidence Interval for ICC Population Values:
0.984 < ICC < 0.999
However, in this particular case, it doesn’t apply (or it is not obvious from the description of the data that ratings can be regarded as average ratings).
Dave Garson mentioned the use of Tukey nonadditivity test for checking the assumption of no raters by items interaction. I must admit I never used it in this situation. I didn’t find any references in Dunn’s textbook on the Design and Analysis of Reliability Studies (Oxford Univ. Press, 1989). A little googling didn’t reveal any recent application of this test in the context of reliability studies, but I did find a paper that discusses its validity [1]. It should be noted that we are in the case of a single rating/item, that is there’re no repetitions in the ratings which prevent from estimating an interaction effect from the ANOVA table (at least, the interaction term is confoudned with the error term). Moreover, it only applies to the case 2 and 3 above, where we can form a twoway model (raters x items).
Hopefully, this test is available in the agricolae package as nonadditivity()
. I tested it against Stata nonadd command and got the same results:
. findit nonadd
. use https://stats.idre.ucla.edu/stat/stata/faq/rb4a, clear
(RB Example  Kirk, 1982)
. anova y a s
Number of obs = 32 Rsquared = 0.8790
Root MSE = 1.16496 Adj Rsquared = 0.8214
Source  Partial SS df MS F Prob > F

Model  207 10 20.7 15.25 0.0000

a  194.5 3 64.8333333 47.77 0.0000
s  12.5 7 1.78571429 1.32 0.2914

Residual  28.5 21 1.35714286

Total  235.5 31 7.59677419
. nonadd y a s
Tukey's test of nonadditivity (Ho: model is additive)
SS nonadd = 8.0218509 df = 1
F (1,20) = 7.8345468 Pr > F: .01108091
In R, using the file rb4a.dat, we have:
> rb4a < read.table("rb4a.dat", header=T, colClasses=c("numeric","factor","factor"))
> summary(aov(y ~ ., rb4a))
Df Sum Sq Mean Sq F value Pr(>F)
a 3 194.5 64.833 47.7719 1.478e09 ***
s 7 12.5 1.786 1.3158 0.2914
Residuals 21 28.5 1.357

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> lm0 < lm(y ~ ., data=rb4a)
> df.lm0 < df.residual(lm0)
> MSE.lm0 < deviance(lm0)/df.lm0
> res < with(rb4a, nonadditivity(y, a, s, df.lm0, MSE.lm0))
Tukey's test of nonadditivity
y
P : 49.375
Q : 303.9062
Analysis of Variance Table
Response: residual
Df Sum Sq Mean Sq F value Pr(>F)
Nonadditivity 1 8.0219 8.0219 7.8345 0.01108 *
Residuals 20 20.4781 1.0239

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It is important not to forget to treat a
and s
as factors. Let’s see it in action:
> library(reshape)
> tv.num.t.long < melt(t(tv.num))
> tv.num.t.long$X2 < factor(tv.num.t.long$X2)
> library(agricolae)
> lm0 < lm(value ~ X1 + X2, data=tv.num.t.long)
> df.lm0 < df.residual(lm0)
> MSE.lm0 < deviance(lm0)/df.lm0
> res < with(tv.num.t.long, nonadditivity(value, X1, X2, df.lm0, MSE.lm0))
Tukey's test of nonadditivity
value
P : 76.04803
Q : 89.38298
Analysis of Variance Table
Response: residual
Df Sum Sq Mean Sq F value Pr(>F)
Nonadditivity 1 64.70 64.703 845.95 < 2.2e16 ***
Residuals 5429 415.24 0.076

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Ok, so we see that there’s a significant linear item by linear rater interaction.
As an example of random blocks without interaction, let’s consider data from Table 2 in Lahey et al’s paper, which comes from Shrout & Fleiss’s original paper in 1979:
> tab2 < matrix(c(9,6,8,7,10,6,
2,1,4,1,5,2,
5,3,6,2,6,4,
8,2,8,6,9,7), nc=4)
> tab2.df.long < melt(tab2, varnames=c("Target","Judge"))
> tab2.df.long$Target < factor(tab2.df.long$Target)
> tab2.df.long$Judge < factor(tab2.df.long$Judge)
> lm0 < lm(value ~ ., data=tab2.df.long)
> df.lm0 < df.residual(lm0)
> MSE.lm0 < deviance(lm0)/df.lm0
> res < with(tab2.df.long, nonadditivity(value, Target, Judge, df.lm0, MSE.lm0))
Tukey's test of nonadditivity
value
P : 15.58681
Q : 912.9951
Analysis of Variance Table
Response: residual
Df Sum Sq Mean Sq F value Pr(>F)
Nonadditivity 1 0.2661 0.2661 0.2479 0.6263
Residuals 14 15.0256 1.0733
Tukey’s test here confirms that there’s no interaction. To find the corresponding ICC(3,1) given by Shrout and Fleiss (1979), we need to use a twoway model:
> icc(tab2, model="twoway")
Single Score Intraclass Correlation
Model: twoway
Type : consistency
Subjects = 6
Raters = 4
ICC(C,1) = 0.715
FTest, H0: r0 = 0 ; H1: r0 > 0
F(5,15) = 11 , p = 0.000135
95%Confidence Interval for ICC Population Values:
0.342 < ICC < 0.946
The second example discussed by Lahey et al. include an interaction effect (the last column from Table 2 was changed to negatively correlate with the others). Let’s go with it:
> tab4 < matrix(c(9,6,8,7,10,6,
2,1,4,1,5,2,
5,3,6,2,6,4,
3,9,3,5,2,4), nc=4, dimnames=list(1:6, 1:4))
> tab4.df.long < melt(tab4, varnames=c("Target","Judge"))
> tab4.df.long$Target < factor(tab4.df.long$Target)
> tab4.df.long$Judge < factor(tab4.df.long$Judge)
> lm0 < lm(value ~ ., data=tab4.df.long)
> df.lm0 < df.residual(lm0)
> MSE.lm0 < deviance(lm0)/df.lm0
> res < with(tab4.df.long, nonadditivity(value, Target, Judge, df.lm0, MSE.lm0))
Tukey's test of nonadditivity
value
P : 2.246528
Q : 155.9048
Analysis of Variance Table
Response: residual
Df Sum Sq Mean Sq F value Pr(>F)
Nonadditivity 1 0.032 0.0324 0.0075 0.9321
Residuals 14 60.259 4.3042
There’s no effect of Target, as can be seen in the ANOVA Table:
> anova(lm0)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
Target 5 11.208 2.2417 0.5577 0.730692
Judge 3 83.458 27.8194 6.9212 0.003802 **
Residuals 15 60.292 4.0194

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To fully reproduce Lahey et al’s Tabel 5, the missing MS (within targets) might be obtained as follows:
> unlist(summary(aov(value ~ Target, data=tab4.df.long)))["Mean Sq2"]
Mean Sq2
7.986111
(By the way, there is a typo in Lahey et al’s Table 5: the corresponding DF for the between targets effect should read 5, not 3.)
Now, what’s interesting is that the ICC is negative, but Tukey’s test failed to recover the significant interaction, while clearly Rater 4 is at variance with others’ ratings:
> round(cor(tab4), 2)
[,1] [,2] [,3] [,4]
[1,] 1.00 0.75 0.73 0.75
[2,] 0.75 1.00 0.89 0.73
[3,] 0.73 0.89 1.00 0.72
[4,] 0.75 0.73 0.72 1.00
This might also be shown visually, where Target have been reordered by their mean rating:
> library(lattice)
> dotplot(tab4[order(apply(tab4, 1, mean)),], type="b")
In conclusion, the characteristic root test of the interaction might provide an interesting alternative to Tukey’s test when there’s no main effect [2,3]. I’m not aware of any available R implementation. I should write one. Other relevant papers are listed in the References below.
For the second dataset, the SAV file I used was GSS93_subset.sav
. and I happened lo load it into R as follows:
> gss < read.spss("GSS93 subset.sav", to.data.frame=T)
> items.idx < 10:13
> gss.it < gss[,items.idx]
> gss.it.num < sapply(gss.it, function(x) as.numeric(x))
Then, to compute Cronbach’s alpha and itemscale statistics, I used:
> library(psych)
> alpha(gss.it.num)
This gives:
raw_alpha std.alpha G6(smc) average_r mean sd
0.7 0.79 0.82 0.49 4.7 1.4
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r
EDUC 0.68 0.69 0.62 0.43
DEGREE 0.53 0.71 0.63 0.45
PADEG 0.66 0.78 0.78 0.54
MADEG 0.68 0.78 0.79 0.54
Item statistics
n r r.cor r.drop mean sd
EDUC 1496 0.84 0.84 0.71 13.0 3.07
DEGREE 1496 0.83 0.82 0.81 2.4 1.18
PADEG 1207 0.74 0.59 0.48 1.9 1.19
MADEG 1352 0.73 0.58 0.48 1.8 0.94
This is in close agreement with the results reported by Dave Garson. Now, fitting an ANOVA model and applying the Tukey’s test yields:
> gss.it.num.long < melt(gss.it.num)
> gss.it.num.long$X1 < factor(gss.it.num.long$X1)
> lm0 < lm(value ~ ., data=gss.it.num.long)
> anova(lm0)
However, since there are missing responses, let’s do the analysis by considering complete cases only. Testing for nonadditivity then yields the following results:
Tukey's test of nonadditivity
value
P : 51572.02
Q : 734728.4
Analysis of Variance Table
Response: residual
Df Sum Sq Mean Sq F value Pr(>F)
Nonadditivity 1 3619.9 3619.9 3793.8 < 2.2e16 ***
Residuals 3368 3213.7 1.0

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Again, this is pretty close to what Dave Garson reported on his website, with the right degrees of freedom. The syntax used for obtaining the above result was:
> gss.it.num.long.noNA < melt(na.omit(gss.it.num))
> gss.it.num.long.noNA$X1 < factor(gss.it.num.long.noNA$X1)
> lm0 < lm(value ~ ., data=gss.it.num.long.noNA)
> df.lm0 < df.residual(lm0)
> MSE.lm0 < deviance(lm0)/df.lm0
> res < with(gss.it.num.long.noNA, nonadditivity(value, X1, X2, df.lm0, MSE.lm0))