Objectives: The goals of this study are to apply principles of natural language processing (NLP) to 1) identify neologisms and patterns of overly formal word use; 2) compare the relative prevalence of these features in spontaneous language samples from children with Autism Spectrum Disorders (ASD) and children with Typical Development (TD); and 3) determine whether these features can be used to distinguish the two groups.
Methods: The Autism Diagnostic Observation Schedule (ADOS) was administered to children aged 4 to 8 with TD and with ASD. (Module 3 was administered to the majority of subjects; Module 2 was used only for those subjects whose expressive language age equivalency was less than 4.0.) The two groups were roughly matched on several measures of utterance complexity and acceptability, computed with standard NLP methods. The entire ADOS for each child was recorded and digitized for analysis. The subjects' utterances from the following ADOS activities were transcribed from these audio recordings: Make-Believe Play, Joint Interactive Play, Description of a Picture, Telling a Story From a Book, and Conversation and Reporting.
Relative frequencies of occurrence of single words and word sequences were generated from two corpora: 1) the Wall Street Journal training corpus of the Penn Treebank, and 2) the Child Language Data Exchange System (CHILDES) database of child speech. For each child, we determined the relative frequency, in each of the two corpora, of every word the child produced. Words whose relative frequency is zero (i.e., those that do not occur in a given corpus, known as out-of-vocabulary words, or OOVs) are likely to be neologisms.
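As an illustration of the measure described above, the following is a minimal sketch of a single-word relative-frequency table and a per-child OOV rate. It is not the study's implementation: the function names, the lowercasing step, the choice to count OOVs over word tokens rather than word types, and the toy sentences are all assumptions made for the example.

```python
from collections import Counter


def relative_frequencies(corpus_tokens):
    """Map each lowercased word to its share of all tokens in the corpus."""
    counts = Counter(token.lower() for token in corpus_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def oov_rate(child_tokens, corpus_freqs):
    """Fraction of a child's word tokens whose corpus relative frequency is zero."""
    tokens = [token.lower() for token in child_tokens]
    if not tokens:
        return 0.0
    oov = [t for t in tokens if corpus_freqs.get(t, 0.0) == 0.0]
    return len(oov) / len(tokens)


# Toy data, invented for illustration only (not from WSJ or CHILDES).
reference = "the dog ran to the park and the dog sat".split()
child = "the dog flibbered to the park".split()

freqs = relative_frequencies(reference)
rate = oov_rate(child, freqs)
```

In this toy case only "flibbered" is absent from the reference text, so it is the one word that counts toward the child's OOV rate. In practice the same child transcript would be scored once against each reference corpus.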
Results: The average relative frequencies of the words, based on either corpus, were not significantly different between the two groups. However, neologism use, as measured by OOV rate, was significantly higher in the ASD group than in the TD group, whether measured against the Wall Street Journal corpus or the CHILDES corpus. Very low-frequency words from the Wall Street Journal corpus were also used significantly more often in ASD speech. This difference was not observed for very low-frequency words from the CHILDES corpus, which suggests that ASD speech is characterized not only by neologisms but also by the use of very infrequent formal words.
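The low-frequency comparison can be sketched in the same spirit. The abstract does not state how "very low-frequency" was defined, so the cutoff below (`max_count`, a hypothetical parameter) is an assumption, as are the function name and the toy text standing in for a formal reference corpus. Words that never occur in the corpus are excluded here, since those are counted separately as OOVs.

```python
from collections import Counter


def low_frequency_rate(child_tokens, corpus_counts, max_count=1):
    """Share of a child's tokens that occur in the corpus at most max_count times.

    Tokens absent from the corpus (OOVs) are not counted as low-frequency.
    max_count is a hypothetical cutoff chosen for illustration.
    """
    tokens = [t.lower() for t in child_tokens]
    if not tokens:
        return 0.0
    rare = [t for t in tokens if 0 < corpus_counts.get(t, 0) <= max_count]
    return len(rare) / len(tokens)


# Toy stand-in for a formal reference corpus, invented for illustration only.
formal_counts = Counter(
    "the market is nevertheless stable and the outlook is stable".split()
)
child = "the dog is nevertheless happy".split()

rare_rate = low_frequency_rate(child, formal_counts)
```

Here "nevertheless" occurs once in the toy reference text and is the only token counted as low-frequency; "dog" and "happy" are OOV and are left out of the numerator.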
Conclusions: Neologistic and formal word use, which are both characteristic of ASD speech, can be identified automatically using natural language processing techniques. Incorporating automated analysis of speech could enhance the coding of these behaviors and reveal word distribution properties that might go unrecognized during examination.