Paralinguistic Event Detection in Children's Speech
Paralinguistic cues convey a great deal of information about the affective state of the speaker. From a diagnostic perspective, cues such as laughter and crying have been found to be important markers in the very early detection of autism spectrum disorders (ASD) in children. Children with ASD have been found to produce a higher proportion of voiced laughter than typically developing children (Hudenko et al., 2009). Infants at risk for ASD have also been found to produce cries with a higher and more variable fundamental frequency (pitch) than low-risk infants (Sheinkopf et al., 2012). The ability to automatically detect such paralinguistic events in a clinical setting would be of great benefit in helping to identify children who are at risk of ASD at an early age.
Objectives:
To use acoustic features and machine learning algorithms to build a speech-based detector that can automatically determine when speech, laughter, and crying occur in an audio recording.
Methods:
Our data are drawn from a larger dataset of over 140 sessions in which toddlers, 15-30 months of age, interacted with an examiner in a brief play session that included rolling a ball back and forth, looking at pictures in a book, and gentle tickling (Rehg et al., 2013). The toddlers wore a lapel microphone throughout the interaction, and the sessions were coded to identify the onsets and offsets of child speech (vocalizations or verbalizations), laughter, and whining/crying. We used audio data from 35 children, comprising 483 speech segments, 49 laughter segments, and 58 whining/crying segments. We used the audio feature extraction tool openSMILE to extract 988 spectral and prosodic features from these segments. The features most useful for discriminating speech, laughter, and whining/crying were selected using the information gain criterion with forward selection.
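The snippet below is a minimal sketch of this extraction-and-selection step, not the exact configuration used in the study. It assumes the Python opensmile wrapper with the emobase functional feature set (which yields 988 features per segment), uses scikit-learn's mutual-information score as a stand-in for the information gain criterion, and the segment list and labels are hypothetical placeholders.

```python
import opensmile
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# emobase functionals: 988 spectral and prosodic features per segment
# (assumed here to correspond to the 988-feature set named in the Methods).
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Hypothetical coded segments: (wav file, onset in s, offset in s, label).
# In the study there were 590 coded segments from 35 children.
segments = [
    ("child_01.wav", 12.4, 13.1, "speech"),
    ("child_01.wav", 45.0, 46.2, "laughter"),
    ("child_02.wav", 30.7, 32.0, "whine_cry"),
]

rows, labels = [], []
for path, onset, offset, label in segments:
    feats = smile.process_file(path, start=onset, end=offset)  # 1 x 988 DataFrame
    rows.append(feats.iloc[0])
    labels.append(label)

X = pd.DataFrame(rows).reset_index(drop=True)

# Rank features by mutual information with the class label (an information-gain-style
# criterion); a forward-selection wrapper (e.g. scikit-learn's
# SequentialFeatureSelector) could then add features greedily from this ranking.
scores = mutual_info_classif(X.values, labels, random_state=0)
ranked = X.columns[scores.argsort()[::-1]]
print(ranked[:20])  # top-ranked candidate features
```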
Results:
We developed three binary classifiers to discriminate between speech and laughter, speech and whining/crying, and laughter and whining/crying. Classification was performed with a support vector machine (SVM) with a quadratic kernel (degree = 2), evaluated using 10-fold cross-validation. Accuracy was 80.6% for discriminating speech from laughter, 84.5% for speech versus whining/crying, and 86.9% for whining/crying versus laughter. The features common to the three classifiers were pitch, line spectral pair frequencies (the resonant frequencies of the vocal tract when the glottis is fully open and fully closed), and mel-frequency cepstral coefficients (a representation of the power spectrum mapped to a psychoacoustic scale).
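As a rough illustration of this classification step, the sketch below fits a degree-2 polynomial-kernel SVM with 10-fold cross-validation in scikit-learn. The feature matrix X and binary labels y are placeholders standing in for the selected openSMILE features of one class pair (e.g., speech vs. laughter); the feature scaling and random seed are choices of this sketch rather than details reported in the abstract.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: replace with the selected acoustic features (X) and
# binary labels (y) for one pair of classes, e.g. speech vs. laughter.
rng = np.random.default_rng(0)
X = rng.normal(size=(532, 20))      # 483 speech + 49 laughter segments, 20 selected features (assumed)
y = np.array([0] * 483 + [1] * 49)  # 0 = speech, 1 = laughter

# Quadratic-kernel SVM (degree-2 polynomial), as reported in the Results.
clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2, C=1.0))

# 10-fold cross-validated accuracy; stratification keeps the class ratio per fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```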
Conclusions:
We have demonstrated reasonable discrimination between speech and paralinguistic cues, as well as between different paralinguistic cues. The results are significantly better than chance (50%) and may lead to a clinical setup in which large amounts of speech data from at-risk infants could be parsed automatically.