Objectives: This research reports an analysis of the automatically generated phone-level vocalization composition of children from three groups (TD: typically developing; LD: language-delayed but without ASD; AD: diagnosed with ASD). Pattern-recognition and machine-learning approaches were applied to the vocalization data to build a fully automatic model for identifying children at risk for ASD.
Methods: A lightweight recorder worn by the child captures his/her vocalizations and the surrounding environmental sound over a 16-hour day. Child vocalizations in the recording are automatically identified using speech signal processing and recognition software, and are then decomposed into phone-like units by applying an adult phone model and a set of child vocalization clusters. The frequency of each unit is calculated, yielding 63 features based on the child clusters and 50 features based on the adult phone model. The frequency features are analyzed using Linear Discriminant Analysis (LDA) as well as other machine-learning and statistical methods. The posterior probability that the vocalizations in a recording were produced by a child with ASD is estimated from the statistics of the LDA-transformed (or otherwise transformed) features. For a child with multiple recordings, the child-level probability is the geometric average of the per-recording probabilities. At-risk identification is performed by comparing this probability to a threshold.
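To make the decision pipeline concrete, the following is a minimal sketch, assuming Python with NumPy and scikit-learn, of the steps outlined above: LDA on per-recording phone-frequency features, geometric averaging of per-recording posteriors, and thresholding. The feature matrix, labels, and threshold value are synthetic placeholders, not the study's data, software, or tuned operating point.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic placeholders: one row per recording, 113 phone-frequency features
# (63 child-cluster + 50 adult-phone-model), binary labels (1 = ASD, 0 = non-ASD).
rng = np.random.default_rng(0)
X_train = rng.random((200, 113))
y_train = rng.integers(0, 2, size=200)

# LDA's class-conditional Gaussian model provides per-recording posteriors.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

def child_level_probability(recording_features):
    """Geometric average of per-recording ASD posteriors for one child."""
    p = lda.predict_proba(recording_features)[:, 1]            # P(ASD | recording)
    return float(np.exp(np.mean(np.log(np.clip(p, 1e-12, 1.0)))))

# Flag a child as at risk when the aggregated probability exceeds a threshold
# (0.5 here is an arbitrary illustration, not the study's operating point).
child_recordings = rng.random((5, 113))                        # e.g., 5 recordings
at_risk = child_level_probability(child_recordings) > 0.5
```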
Results: Two data sets were examined. Set-1 includes 76 TD children (712 recordings), 30 LD children (290 recordings), and 34 AD children (225 recordings). An independent Set-2 includes 30 TD children (90 recordings), 12 LD children (36 recordings), and 45 AD children (132 recordings). All AD children were formally diagnosed with ASD. Three detection tasks were tested: 1) AD versus TD; 2) AD versus LD; and 3) AD versus TD+LD. Performance was evaluated via leave-one-out cross-validation, a standard protocol in pattern-recognition and machine-learning research. For the LDA method with leave-one-CHILD-out cross-validation, the equal error rates (EERs) for the three tasks, in the order listed above, are as follows. At the recording level, the EERs for Set-1 are 11.5%, 15.5%, and 12.6%; with Set-2 included, they are 11.7%, 16.3%, and 12.6%. At the child level (aggregating the multiple recordings of each child in Set-1), the EERs are 9.2%, 10.0%, and 9.4%; with Set-2 included, they are 8.9%, 12.7%, and 10.8%. Analyses using other modeling methods yielded similar results.
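As an illustration only, here is a brief sketch, again assuming Python with NumPy and scikit-learn, of how leave-one-CHILD-out cross-validation and the equal-error-rate metric described above can be computed; the function and variable names (loco_cv_eer, child_ids, etc.) are hypothetical and not part of the study's software.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve
from sklearn.model_selection import LeaveOneGroupOut

def equal_error_rate(labels, scores):
    """EER: the operating point where false-positive and false-negative rates match."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

def loco_cv_eer(X, y, child_ids):
    """Leave-one-CHILD-out CV: all recordings of one child form the held-out fold."""
    scores, labels = [], []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=child_ids):
        lda = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
        scores.extend(lda.predict_proba(X[test_idx])[:, 1])    # P(ASD | recording)
        labels.extend(y[test_idx])
    return equal_error_rate(np.array(labels), np.array(scores))
```

In this sketch the returned value corresponds to a recording-level EER; a child-level EER would be obtained the same way with one score per child, formed by geometric averaging of that child's held-out recording posteriors as in the Methods sketch above.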
Conclusions: Results on independent data sets and with different modeling methods consistently show that child vocalization composition contains rich information for identifying children at risk for ASD. We discuss the possibility of improving performance by additionally modeling other child vocal behaviors captured in the audio recordings, and the potential of these results and this methodology for early ASD screening.