Use of Behavior Imaging to Assess Inter-Rater Reliability in a Multi-Site Pharmaceutical Trial
Objectives: The objective of this study was to document rater reliability outcomes for two interview protocols, the Autism Diagnostic Observation Schedule (ADOS) and the Social Communication Interaction Test (SCIT), both of which were administered and scored in a multi-site pharmaceutical trial. Particular attention was paid to identifying items in each diagnostic test that contributed to inconsistencies between raters’ scores and the “Gold Standard” (GS). This feedback enabled targeted training of the raters to improve inter-rater reliability.
Methods: This experimental study was conducted in collaboration with four national medical research institutions. Reliability checks were performed at the beginning of the study. Raters at the four sites viewed pre-recorded interviews and scored their observations through the Behavior Imaging (BI) online platform. After each rater completed the scoring, the results were compared to the GS through an automatically generated report, which identified discrepancies in the scoring and provided recommendations for improving the rater’s accuracy. Raters were required to repeat their assessments until their scores matched the GS.
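To illustrate the comparison step, the minimal Python sketch below builds a per-item discrepancy report of the kind described above; the function name, item labels, and scores are hypothetical, as the abstract does not describe the BI platform’s internals.

    # Minimal sketch of the automated rater-vs-GS comparison step.
    # Item labels and scores are hypothetical, for illustration only.
    def discrepancy_report(rater_scores, gold_standard):
        """Return the items on which the rater differs from the GS."""
        return {item: rater_scores[item] - gs_score
                for item, gs_score in gold_standard.items()
                if rater_scores[item] != gs_score}

    gs = {"A1": 2, "A2": 1, "B1": 0}      # hypothetical GS item scores
    rater = {"A1": 2, "A2": 2, "B1": 0}   # hypothetical rater item scores
    print(discrepancy_report(rater, gs))  # {'A2': 1}

A rater would re-score the recorded interview until this report came back empty, mirroring the requirement that scores match the GS.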
Results: All raters completed reliability checks for the ADOS and the SCIT. Table 1 shows training examples of rater assessments using the SCIT protocol. A rater was considered not reliable if any domain score differed from the GS by more than 1 point or if the total score differed by more than 10%. The summaries show that Rater #1 matched the Gold Standard on domains A, B, E, and F, while domains C and D each differed from the GS by one point. Because no domain differed by more than 1 point and the total score was identical to the GS, Rater #1 was considered “Reliable”. Rater #4 matched the Gold Standard on domains A and F, but domain D was off by 2 points and domain E was off by 1 point, and the total score differed from the GS by 25%; Rater #4 was therefore not considered “Reliable”.
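This decision rule can be made concrete with a short Python sketch; the domain scores below are hypothetical values chosen only to be consistent with the differences reported for Raters #1 and #4.

    # Reliability rule from Table 1: a rater is reliable only if every
    # domain score is within 1 point of the GS and the total score is
    # within 10%. All scores below are hypothetical, not trial data.
    def is_reliable(rater, gs, domain_tol=1, total_tol=0.10):
        if any(abs(rater[d] - gs[d]) > domain_tol for d in gs):
            return False
        gs_total = sum(gs.values())
        return abs(sum(rater.values()) - gs_total) <= total_tol * gs_total

    gs     = {"A": 3, "B": 2, "C": 4, "D": 3, "E": 2, "F": 2}  # total 16
    rater1 = {"A": 3, "B": 2, "C": 3, "D": 4, "E": 2, "F": 2}  # C, D off by 1; total 16
    rater4 = {"A": 3, "B": 2, "C": 3, "D": 1, "E": 1, "F": 2}  # D off by 2; total 12, 25% below GS

    print(is_reliable(rater1, gs))  # True  -> “Reliable”
    print(is_reliable(rater4, gs))  # False -> not “Reliable”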
Conclusions: The study demonstrated the practicality of using BI technology to document inter-rater reliability throughout a pharmacological trial, to conduct effective interviewer-observer training, and to maintain assessment reliability on an ongoing basis.