Inter-Rater Agreement Tutorial


The possible values of the kappa statistic range from -1 to 1, with 1 indicating perfect agreement, 0 indicating agreement no better than chance, and -1 indicating perfect disagreement. Landis and Koch (1977) provide guidelines for interpreting kappa values: values from 0.0 to 0.20 indicate slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement, 0.61 to 0.80 substantial agreement, and 0.81 to 1.0 almost perfect agreement. The use of these qualitative cutoffs is debated, however, and Krippendorff (1980) offers a more conservative interpretation, suggesting that conclusions be discounted for variables with values below 0.67, drawn tentatively for values between 0.67 and 0.80, and drawn definitively only for values above 0.80. In practice, however, kappa coefficients below Krippendorff's conservative cutoffs are often retained in research studies, and Krippendorff proposes these cutoffs on the basis of his own work in content analysis while acknowledging that acceptable IRR estimates vary with the study methods and the research question.

SPSS and R require the data for each variable of interest to be structured with a separate column for each coder, as shown in Table 3 for the depression variable. If each coder rated additional variables, the dataset would contain additional columns for each coder (e.g., Rater1_Anxiety, Rater2_Anxiety, etc.), and kappa would have to be computed separately for each variable. Datasets formatted with different coders' ratings listed in a single column can be restructured using the VARSTOCASES command in SPSS (see the tutorial by Lacroix and Giguère, 2006) or the reshape function in R.

Intraclass correlation coefficients (ICCs) are among the most commonly used statistics for assessing IRR for ordinal, interval, and ratio variables. ICCs are suitable for studies with two or more coders and may be used when all subjects in a study are rated by multiple coders, or when only a subset of subjects is rated by multiple coders and the rest are rated by a single coder. ICCs are appropriate for fully crossed designs or when a new set of coders is randomly selected for each participant.
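As a brief illustration of this layout, the following R sketch reshapes a small, hypothetical long-format data frame (one row per rater per subject; the subject IDs, rater labels, and depression ratings are invented for illustration, not taken from the tutorial's dataset) into one column per coder and then computes Cohen's kappa. The use of the irr package here is an assumption for the sake of the example, not a prescription of the tutorial.

    # Hypothetical long-format data: one row per rater per subject.
    ratings_long <- data.frame(
      Subject    = rep(1:4, each = 2),
      Rater      = rep(c("Rater1", "Rater2"), times = 4),
      Depression = c(1, 1, 0, 1, 1, 1, 0, 0)
    )

    # Reshape to wide format so each coder's ratings occupy a separate column,
    # mirroring the layout described for Table 3.
    ratings_wide <- reshape(ratings_long,
                            idvar     = "Subject",
                            timevar   = "Rater",
                            direction = "wide")

    # Cohen's kappa for the two coders; kappa2() expects a two-column
    # matrix or data frame of ratings, one column per coder.
    library(irr)
    kappa2(ratings_wide[, c("Depression.Rater1", "Depression.Rater2")])

The same wide layout, one row per subject and one column per coder, is also what ICC computations described below expect.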

Unlike Cohen's (1960) kappa, which quantifies IRR on an all-or-nothing basis, ICCs incorporate the magnitude of disagreements into the IRR estimate, with larger disagreements yielding lower ICCs than smaller disagreements. Cohen (1968) proposed an alternative weighted kappa that allows researchers to differentially penalize disagreements according to their magnitude. Cohen's weighted kappa is typically used for categorical data with an ordinal structure, for example a rating system that categorizes the presence of a particular attribute as high, medium, or low. In this case, a subject rated high by one coder and low by another should lower the IRR estimate more than a subject rated high by one coder and medium by another. Norman and Streiner (2008) show that using a weighted kappa with quadratic weights for ordinal scales is identical to a two-way mixed, single-measures ICC, and that the two may be substituted interchangeably.
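To make this correspondence concrete, here is a minimal R sketch, again assuming the irr package and invented ordinal ratings (coded 1 = low, 2 = medium, 3 = high), that computes a quadratically weighted kappa and a two-way, single-measures, consistency ICC side by side; under quadratic weighting the two estimates should agree closely.

    library(irr)

    # Hypothetical ordinal ratings from two coders (1 = low, 2 = medium, 3 = high).
    ordinal_ratings <- data.frame(
      Rater1 = c(1, 2, 3, 2, 1, 3, 2, 3),
      Rater2 = c(1, 3, 3, 2, 2, 3, 1, 3)
    )

    # Cohen's weighted kappa with squared (quadratic) weights.
    kappa2(ordinal_ratings, weight = "squared")

    # Two-way, single-measures, consistency ICC.
    icc(ordinal_ratings, model = "twoway", type = "consistency", unit = "single")

The choice of a consistency (rather than absolute-agreement) ICC in this sketch reflects the equivalence reported by Norman and Streiner (2008); which ICC variant is appropriate for a given study depends on the design considerations discussed in the surrounding text.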