The Influence of Examiner and Observer Level of Experience on the Inter-Rater Reliability of ADOS Item and Algorithm Scores and Diagnostic Outcomes

Thursday, May 17, 2012
Sheraton Hall (Sheraton Centre Toronto)
1:00 PM
G. Pasco1, K. Hudry2, S. Chandler3, T. Charman3 and the BASIS Team4, (1)Centre for Research in Autism & Education, Institute of Education, London, United Kingdom, (2)La Trobe University, Bundoora, Australia, (3)Centre for Research in Autism and Education, Institute of Education, London, United Kingdom, (4)British Autism Study of Infant Siblings, London, United Kingdom
Background: Studies using observational measures such as the ADOS regularly involve two researchers independently scoring assessments in order to reach a best-estimate consensus code and to report inter-rater reliability (IRR). Reports of IRR rarely address the relative levels of experience of the raters involved.

Objectives: To investigate the IRR of individual ADOS items, diagnostic algorithm totals and diagnostic classifications with reference to the relative experience of the ADOS examiner and observer.

Methods: 90 children participating in a high-risk sibling study were assessed at 3 years of age (mean 38 months, SD 3.3) using ADOS module 2. All assessments were scored by the examiner and an observer, who then agreed a consensus score for each item. Of the 8 ADOS-trained researchers, 3 were classified as having a relatively high level of experience with the ADOS (e.g. ADOS trainer/10 years of administering research-standard ADOS) while 5 had relatively low levels of experience. Reliability was calculated for individual item scores, diagnostic algorithm totals and diagnostic classifications, on the basis of the combination of examiner and observer experience.

Results: IRR for the 28 module 2 items was calculated using percentage agreement between the two raters. The mean percentage agreements for the High – High (N=17), High – Low (N=40) and Low – High (N=31) conditions were 87.5, 85.6 and 87.0, respectively. IRR for diagnostic algorithm totals was calculated using intra-class correlation (ICC) coefficients. For the three conditions the ICCs were .93, .86 and .96, respectively (all p<.001). To investigate the influence of the observer rating on the agreed consensus scores, the examiner diagnostic algorithm total scores were compared with the consensus total scores. The mean differences were 0.8 (t=-1.97, ns), 0.7 (t=-2.10, p<.05) and 2.1 (t=-6.87, p<.001), respectively; the consensus totals were higher in all conditions. The IRR of diagnostic classifications was assessed by comparing the outcomes based on the examiner scoring (i.e. non-spectrum, or above cut-offs for autism spectrum or autism) with those based on the agreed consensus scoring. For the three conditions the chi-squared tests were all significant (p<.01) and the kappas were all ≥.63. The numbers of participants moving to a more “severe” category and to a less “severe” category in each condition were 1 and 2, 3 and 2, and 6 and 0, respectively.
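To illustrate the kinds of agreement statistics reported above, the following is a minimal sketch (not the study's actual analysis code) of how percentage agreement, Cohen's kappa and a paired comparison of examiner versus consensus totals might be computed in Python; all scores shown are hypothetical, and the ICC analysis is not reproduced here.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import cohen_kappa_score

# Hypothetical item scores from an examiner and an independent observer
examiner_items = np.array([0, 1, 2, 0, 1, 1, 2, 0])
observer_items = np.array([0, 1, 2, 1, 1, 1, 2, 0])

# Percentage agreement across items
pct_agreement = 100 * np.mean(examiner_items == observer_items)

# Cohen's kappa, e.g. for agreement on categorical classifications
kappa = cohen_kappa_score(examiner_items, observer_items)

# Paired t-test comparing examiner algorithm totals with consensus totals
examiner_totals = np.array([7, 10, 4, 12])     # hypothetical totals
consensus_totals = np.array([8, 11, 4, 14])
t_stat, p_value = ttest_rel(examiner_totals, consensus_totals)

print(pct_agreement, kappa, t_stat, p_value)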

Conclusions: Overall reliability for item scores, algorithm totals and diagnostic outcomes was high across all combinations of examiner and observer experience. Consensus algorithm totals were higher than the examiners' original scores, with a resulting tendency for diagnostic outcomes to shift to more “severe” categories, particularly where a low-experience examiner was paired with a high-experience observer. This suggests that those with less experience of the ADOS may tend to “under-score” participants when administering the assessment. Results demonstrate that researchers can be trained to achieve acceptable levels of scoring reliability, even with a complex assessment such as the ADOS, but that involvement of expert practitioners is needed to maintain reliability to research standards.
