A Machine Learning-Based Approach to Detecting Autism Spectrum Disorder from Unstructured and Semi-Structured Medical Records

Thursday, May 12, 2016: 5:30 PM-7:00 PM
Hall A (Baltimore Convention Center)
T. Smith1, J. Yuan2 and J. Luo2, (1)601 Elmwood Ave, Box 671, University of Rochester Medical Center, Rochester, NY, (2)Computer Science, University of Rochester, Rochester, NY

Currently, no laboratory test for ASD exists, and the process of diagnosing the disorder is highly complex and labor intensive, requiring extensive expertise. As a result, few centers offer ASD diagnostic evaluations, and these centers have lengthy waiting lists.


We tested the feasibility and potential utility of a novel method for identifying children who may have ASD: natural language processing (NLP) with machine learning. This method involves developing computer algorithms to process and understand human communication. Specifically, we sought (1) to extract unstructured and semi-structured information from medical records and (2) to create an algorithm that analyzes records obtained prior to the initial diagnostic evaluation and accurately predicts which children do or do not receive an ASD diagnosis when evaluated by an expert clinician.


We examined medical records from 199 children, aged 2-5 years (56 who were later diagnosed with ASD, 143 with other developmental concerns). The records included (1) the referral form from the primary care physician, (2) intake questionnaires completed by the child’s parent and teacher, (3) school reports (when available), and (4) phone intakes by clinic social workers. Diagnosis was ascertained from the clinician’s evaluation report. Medical forms were saved on a HIPAA-compliant server, de-skewed (rotated to a right angle), and de-identified (by automatically blanking areas containing personal information). Optical character recognition software was then used to extract handwritten and typed information from records. The following models were used to identify lexical features in the records: (1) Bag-of-Words (BoW, occurrence of a word in a document), (2) N-Gram (occurrence of a phrase in a document), (3) Term Frequency-Inverse Document Frequency (Tf-idf, a statistical measure used to evaluate how important a word is to a document), (4) Latent Dirichlet Allocation (LDA, a measure of the probability that a word occurs within a topic), and (5) Distributed Representation (Doc2Vec, a representation of a document's meaning as a pattern of values spread across the many dimensions of a learned vector). Finally, using lexical features obtained from records, we employed support vector machine algorithms to classify each child as possibly having ASD or not.
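The first three lexical feature models can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the tokenizer, the particular Tf-idf weighting (term count divided by document length, times the log of the inverse document frequency), the function names, and the three toy records are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real OCR output would need more cleaning."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return contiguous n-token phrases joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(docs):
    """Binary occurrence vectors (1 = word appears in the document) over a sorted vocabulary."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    vectors = [[1 if term in toks else 0 for term in vocab] for toks in token_sets]
    return vocab, vectors

def tf_idf(docs):
    """Assumed variant: tf = count / document length, idf = log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy records, invented for illustration (not study data).
records = [
    "limited eye contact and delayed speech",
    "delayed speech and frequent tantrums",
    "frequent ear infections",
]

vocab, bow = bag_of_words(records)
bigrams = ngrams(tokenize(records[0]), 2)
weights = tf_idf(records)
```

In this toy corpus, a word that appears in only one record ("eye") receives a higher Tf-idf weight than a word shared by two records ("speech"). Vectors of this kind are what a support vector machine would then take as input.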


We successfully extracted information and identified lexical features from all medical records. With 150 lexical features, accuracy of classification was 66.3% for BoW, 67.8% for N-Gram, 66.8% for Tf-idf, 78.4% for LDA, and 83.4% for Doc2Vec. Positive predictive value was 40.4% for BoW, 43.1% for N-Gram, 41.4% for Tf-idf, 58.0% for LDA, and 64.6% for Doc2Vec. Sensitivity was 41.1% for BoW, 44.6% for N-Gram, 42.9% for Tf-idf, 83.9% for LDA, and 91.1% for Doc2Vec.
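The three reported metrics are related through the confusion matrix: accuracy = (TP + TN) / N, positive predictive value = TP / (TP + FP), and sensitivity = TP / (TP + FN). The abstract does not report the underlying counts, so the sketch below only searches for integer counts that reproduce the published percentages, given 56 children with ASD and 143 without. The function name and the rounding tolerance are assumptions, and any counts it returns are inferred rather than reported.

```python
def consistent_counts(n_pos, n_neg, accuracy, ppv, sensitivity, tol=0.05):
    """Return every (tp, fp, fn, tn) whose metrics, in percent, lie within tol of the targets."""
    targets = (accuracy, ppv, sensitivity)
    matches = []
    for tp in range(n_pos + 1):
        for fp in range(n_neg + 1):
            if tp + fp == 0:
                continue  # PPV is undefined when no child is classified as positive
            fn, tn = n_pos - tp, n_neg - fp
            metrics = (
                100 * (tp + tn) / (n_pos + n_neg),  # accuracy
                100 * tp / (tp + fp),               # positive predictive value
                100 * tp / n_pos,                   # sensitivity
            )
            if all(abs(m - t) <= tol for m, t in zip(metrics, targets)):
                matches.append((tp, fp, fn, tn))
    return matches

# Reported Doc2Vec figures: accuracy 83.4%, PPV 64.6%, sensitivity 91.1%.
doc2vec_counts = consistent_counts(56, 143, accuracy=83.4, ppv=64.6, sensitivity=91.1)
```

Under these assumptions, the Doc2Vec figures are matched by a single set of counts: 51 true positives, 28 false positives, 5 false negatives, and 115 true negatives.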


This study demonstrates the feasibility of extracting information and identifying lexical features from unstructured and semi-structured medical records. The most successful classification system, based on Doc2Vec, showed promising levels of accuracy, positive predictive value, and sensitivity. Analyses that involve a larger dataset are needed to improve the classification rate. With further development, the proposed framework could simplify and shorten the process of diagnosing ASD.