International Meeting for Autism Research (May 7 - 9, 2009): Automated Identification of Stress and Focus Assignment

Friday, May 8, 2009
Boulevard (Chicago Hilton)
E. T. Prud'hommeaux, Center for Spoken Language Understanding, Oregon Health & Science University, Beaverton, OR
J. P. H. van Santen, Center for Spoken Language Understanding, Oregon Health & Science University, Beaverton, OR
L. M. Black, Center for Spoken Language Understanding, Oregon Health & Science University, Beaverton, OR
Background: Evaluation of expressive prosodic ability plays an important role in the diagnosis of neurodevelopmental disorders such as autism spectrum disorder (ASD). Existing methods for assessing prosodic performance require that judgments be made at the time of examination. Such real-time subjective judgments are typically not verified, since verification by one or more additional listeners is time-consuming and costly. Accurate automated analysis of prosody could increase both efficiency and accuracy in clinical evaluations of prosodic ability.

Objectives: The goals of this study are 1) to determine the reliability of real-time judgments of stress and focus assignment, and 2) to determine whether our complex automated measures of the acoustic features associated with stress and focus are comparable to consensus listener judgments and real-time clinical assessments.

Methods: Responses for the following three tasks were scored by clinicians during examination, by six naïve listeners in a web-based perceptual experiment, and with automated objective methods:

(i) Lexical Stress (repeat a disyllabic nonsense word with initial or final stress)
(ii) Emphatic Stress (repeat a four-word sentence with emphasis on one word; adapted from Shriberg et al. 2001, 2006)
(iii) Focus (correct an inaccurate description of a picture by emphasizing the correct word; adapted from PEPS-C (Peppé & McCann 2003))

During examination, clinicians immediately assessed each response for each stimulus as either correct or incorrect, thereby producing real-time scores.

In the perceptual experiment, six judges listened to recordings of “minimal pairs” of responses for each of the three tasks. Each pair came from a single speaker and had the same content but different target prosody. The judges were asked to identify the intended meaning of each of the two utterances (e.g., of two recordings, which one was meant to be “BLUE cow” rather than “blue COW”).
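The abstract does not state how the six judges' responses were pooled into a consensus score. As a minimal sketch, assuming each judge's response is recorded as a boolean (True when the judge matched the recording to the speaker's intended target), one plausible pooling is the fraction of judges who agreed, with a separate majority label that reports ties explicitly. The function names and the vote data below are hypothetical.

```python
from statistics import mean


def consensus_score(votes):
    """Fraction of judges who picked the intended reading (0.0 to 1.0).

    Assumption: this averaging rule is illustrative; the abstract does
    not specify how consensus was computed.
    """
    if not votes:
        raise ValueError("need at least one vote")
    return mean(1.0 if v else 0.0 for v in votes)


def majority_label(votes):
    """Return True/False by strict majority, or None on a tie.

    A tie is possible with an even number of judges, such as six.
    """
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    if yes == no:
        return None
    return yes > no


# Hypothetical votes from six judges for one minimal pair.
votes = [True, True, False, True, True, True]
print(consensus_score(votes))
print(majority_label(votes))
```

Keeping the fractional score alongside the majority label preserves how strongly the listeners agreed, which matters when the scores are later correlated with other measures.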

For the automated analysis, pitch and energy trajectories and phoneme duration information were extracted from recordings of the children's responses and analyzed using an innovative “dynamic difference” measure that captures the difference in the pitch and amplitude dynamics of the two recordings in a minimal pair. Measures of melody, timing, and intensity were combined using multiple linear regression to create a single complex score for each utterance.
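The combination step can be sketched as an ordinary least-squares fit. This is only an illustration under stated assumptions: the abstract does not define the “dynamic difference” measure or name the regression target, so the sketch assumes three already-computed per-pair feature values (melody, timing, intensity) and a listener-derived target score. All values are synthetic, generated from known weights so the fit can be checked, and the helper names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(features, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def combined_score(weights, feature_row):
    """Single score: intercept plus the weighted sum of the features."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_row))


# Synthetic (melody, timing, intensity) values; NOT study data.
features = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
    (1.0, 1.0, 1.0),
    (2.0, 1.0, 0.0),
]
# Targets built as 0.1 + 0.5*melody + 0.3*timing + 0.2*intensity.
targets = [0.1, 0.6, 0.4, 0.3, 1.1, 1.4]

weights = fit_linear_regression(features, targets)
print([round(w, 6) for w in weights])
print(round(combined_score(weights, (1.0, 1.0, 1.0)), 6))
```

Solving the normal equations directly keeps the sketch dependency-free; for real acoustic data a numerically sturdier least-squares routine from a numerical library would be the safer choice.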

Results: For all three tasks, the combined objective measures correlated with the consensus scores at least as well as the judges correlated with one another and with the consensus scores. These correlations were also substantially higher than those between the real-time scores and the consensus scores. A per-speaker analysis yielded similar results: the objective measures correlated with the consensus scores as well as the individual judges did, and substantially better than the real-time scores did.
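A comparison of this kind rests on correlating each scoring method with the consensus scores. The abstract does not say which correlation coefficient was used; the sketch below assumes Pearson's r, and the score lists are invented placeholders that carry no information about the study's findings.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder per-utterance scores; NOT study data.
consensus = [0.2, 0.8, 0.5, 1.0, 0.3]
scores_by_method = {
    "method_a": [0.1, 0.7, 0.6, 0.9, 0.4],  # hypothetical continuous scorer
    "method_b": [0, 1, 1, 1, 0],            # hypothetical correct/incorrect scorer
}

for name, scores in scores_by_method.items():
    print(name, round(pearson(scores, consensus), 3))
```

The same function applies to the per-speaker analysis once each method's scores are aggregated by speaker before correlating.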

Conclusions: The automated digital measures of stress and focus assignment were shown to be comparable in reliability to consensus subjective scores and superior to real-time clinical judgments on both a per-utterance and a per-speaker basis. Including automated objective measures of prosody alongside traditional real-time judgments could enhance both accuracy and reliability in clinical assessments of prosodic ability.