Use of Behavior Imaging to Assess Inter-Rater Reliability in a Multi-Site Pharmaceutical Trial

Friday, May 15, 2015: 10:00 AM-1:30 PM
Imperial Ballroom (Grand America Hotel)
R. M. Oberleitner1, U. Reischl2 and K. G. Gazieva1, (1)Behavior Imaging Solutions, Boise, ID, (2)Department of Community and Environmental Health, Boise State University, Boise, ID
Background: Multi-site pharmacological trials offer numerous methodological advantages over single-site trials: they can enhance external validity, provide greater statistical power, and are often more representative of the patient population [5,7,8]. However, multi-site trials are sometimes associated with disadvantages such as reduced data quality, greater heterogeneity of results, and poorer interview quality [3,4]. While some of these issues may be controlled statistically, the systematic error introduced by interviewer errors across multiple sites cannot be addressed statistically [3]. The need to reduce such errors has been highlighted in a number of studies [1,3,5,8]. Approaches used to enhance inter-rater reliability include measures such as rater training and co-rating of video-recorded interviews [1]. A Behavior Imaging (BI) technology was used for this purpose in a multi-site pharmaceutical trial. This technology facilitated the training of raters and provided automated assessment of inter-rater reliability. The system was based on a store-and-forward telehealth method.

Objectives: The objective of the study was to document rater reliability outcomes for two interview protocols administered and scored in the multi-site pharmaceutical trial: the Autism Diagnostic Observation Schedule (ADOS) and the Social Communication Interaction Test (SCIT). Attention was placed on identifying items in both diagnostic tests that contributed to inconsistencies between rater scores and the “Gold Standard” (GS). This feedback allowed targeted training of the raters to improve inter-rater reliability.

Methods: This experimental study was conducted in collaboration with four national medical research institutions. Reliability checks were performed at the beginning of the study. Raters at the four sites viewed pre-recorded interviews and scored their observations through the BI online platform. After each rater completed the scoring, the results were compared to the GS through an automatically generated report. The report identified discrepancies in the scoring and provided recommendations for improving the rater’s accuracy. Raters were required to repeat their assessments until their scores matched the GS.
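
The automated comparison step can be illustrated with a brief sketch; the data structures, item labels, and scores below are illustrative assumptions rather than the actual BI platform interface.

def discrepancy_report(rater_scores, gold_standard):
    """Compare a rater's item-level scores to the Gold Standard and
    list every item on which the two disagree."""
    report = []
    for item, gs_value in gold_standard.items():
        rater_value = rater_scores.get(item)
        if rater_value != gs_value:
            report.append(f"Item {item}: rater scored {rater_value}, Gold Standard is {gs_value}")
    return report

# Hypothetical usage: a rater repeats the assessment until the report is empty.
gs = {"A1": 2, "A2": 1, "B1": 0}
rater = {"A1": 2, "A2": 2, "B1": 0}
for line in discrepancy_report(rater, gs):
    print(line)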

Results: All raters completed reliability checks for the ADOS and the SCIT. Table 1 depicts training examples of rater assessments using the SCIT protocol. A rater was considered not reliable if any domain score differed from the GS by more than 1 point or if the total score differed by more than 10%. The summaries show that Rater #1 matched the Gold Standard in domains A, B, E, and F, while domains C and D each differed from the GS by one point. None of the differences exceeded 1 point and the total score was the same as that of the GS; Rater #1 was therefore considered “Reliable”. Rater #4 met the Gold Standard for domains A and F, while domain D was off by 2 points and domain E by 1 point. Rater #4’s total score differed from the GS by 25%, and Rater #4 was therefore not considered “Reliable”.
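
The reliability rule applied in these examples can be summarized in a short sketch; the domain labels and scores below are hypothetical and chosen only to mirror the pattern described for Rater #4.

def is_reliable(rater_domains, gs_domains):
    """A rater is "Reliable" only if no domain differs from the Gold Standard
    by more than 1 point and the total score differs by no more than 10%."""
    # Per-domain check: each domain may differ by at most 1 point.
    if any(abs(rater_domains[d] - gs_domains[d]) > 1 for d in gs_domains):
        return False
    # Total-score check: the totals may differ by at most 10% of the GS total.
    gs_total = sum(gs_domains.values())
    rater_total = sum(rater_domains.values())
    return abs(rater_total - gs_total) <= 0.10 * gs_total

# Hypothetical scores: domain D is off by 2 points and the total by 25%,
# so the check fails on both criteria.
gs = {"A": 4, "B": 4, "C": 4, "D": 4, "E": 2, "F": 2}
rater = {"A": 4, "B": 5, "C": 5, "D": 6, "E": 3, "F": 2}
print(is_reliable(rater, gs))  # False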

Conclusions: The study demonstrated the practicality of using BI technology for documenting inter-rater reliability throughout the duration of a pharmacological study, conducting effective interviewer-observer training, and performing ongoing maintenance of assessment reliability.