Objectives: To investigate the inter-rater reliability (IRR) of individual ADOS items, diagnostic algorithm totals and diagnostic classifications, with reference to the relative experience of the ADOS examiner and observer.
Methods: 90 children participating in a high-risk sibling study were assessed at 3 years of age (mean 38 months, SD 3.3) using ADOS Module 2. All assessments were scored by the examiner and an observer, who then agreed a consensus score for each item. Of the 8 ADOS-trained researchers, 3 were classified as having relatively high experience with the ADOS (e.g. an ADOS trainer or 10 years of administering research-standard ADOS assessments) and 5 as having relatively low experience. Reliability was calculated for individual item scores, diagnostic algorithm totals and diagnostic classifications, according to the combination of examiner and observer experience.
Results: IRR for the 28 Module 2 items was calculated using percentage agreement between the two raters. Mean percentage agreements for the High–High (N=17), High–Low (N=40) and Low–High (N=31) examiner–observer conditions were 87.5, 85.6 and 87.0, respectively. IRR for diagnostic algorithm totals was calculated using intra-class correlation coefficients (ICCs); for the three conditions the ICCs were .93, .86 and .96, respectively (all p<.001). To investigate the influence of the observer rating on the agreed consensus scores, the examiner diagnostic algorithm totals were compared with the consensus totals. The mean differences were 0.8 (t=-1.97, n.s.), 0.7 (t=-2.10, p<.05) and 2.1 (t=-6.87, p<.001), respectively; the consensus totals were higher in all conditions. The IRR of diagnostic classifications was assessed by comparing the outcomes based on the examiner scoring (i.e. non-spectrum, or above cut-offs for autism spectrum or autism) with those based on the agreed consensus scoring. For the three conditions, all chi-squared tests were significant (p<.01) and all kappas were ≥.63. The numbers of participants moving to a more “severe” and to a less “severe” category in each condition were 1 and 2, 3 and 2, and 6 and 0, respectively.
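The two reliability metrics reported above can be sketched in code. This is a minimal illustration with hypothetical data, not the authors' analysis: it assumes exact item-level agreement for the percentage statistic, and an ICC(3,1) consistency form (two-way mixed model, single rater) for the algorithm totals; the abstract does not state which ICC form was used.

```python
import numpy as np

def percent_agreement(rater_a, rater_b):
    """Exact agreement between two raters across items, as a percentage."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return 100.0 * np.mean(a == b)

def icc_consistency(scores):
    """ICC(3,1): two-way mixed model, consistency, single rater.
    `scores` is an (n_subjects, k_raters) array, e.g. algorithm totals."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_rows - ss_cols                # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Hypothetical example: item scores from two raters, and algorithm totals
# for four children scored by examiner (col 0) and observer (col 1).
items_a = [1, 2, 0, 0, 1]
items_b = [1, 2, 1, 0, 1]
totals = [[10, 12], [8, 9], [14, 13], [6, 7]]
print(percent_agreement(items_a, items_b))  # → 80.0
print(icc_consistency(totals))
```

In practice this would typically be delegated to a statistics package (e.g. a dedicated ICC routine), which also returns the significance tests and confidence intervals reported in the abstract.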
Conclusions: The overall reliability of item scores, algorithm totals and diagnostic outcomes was high across all combinations of examiner and observer experience. Consensus algorithm totals were higher than the original examiner scores, with a resulting tendency for diagnostic outcomes to shift to more “severe” categories, particularly where the examiner had low experience and the observer high experience, suggesting that those with less experience of the ADOS may tend to “under-score” participants when administering the assessment. The results demonstrate that researchers can be trained to achieve acceptable levels of scoring reliability, even with a complex assessment such as the ADOS, but that involvement from expert practitioners is needed to maintain reliability at research standards.