The Association Between Trimester Specific Daily Average PM2.5 and Autism Spectrum Disorder: A Large-Scale Multi-Source Linked Analysis

Friday, May 13, 2016: 5:30 PM-7:00 PM
Hall A (Baltimore Convention Center)
N. Connolly and K. A. Bowers, Cincinnati Children's Hospital Medical Center, Cincinnati, OH
Background: Data integration makes it possible to draw on the vast amounts of data available in various sources in order to glean new insights into the etiology of ASD. We present the results of a large-scale data integration analysis that combines electronic medical records (EMR) from a premier ASD diagnosis and treatment center with Ohio state birth records and with an environmental toxin exposure dataset released by the Environmental Protection Agency (EPA).

Objectives: Our objective was to assess the applicability of big data analytics for integrating multiple data streams and conducting epidemiologic analyses, using multiple environmental exposures as a model.

Methods: To produce our integrated dataset, we first queried the Epic-hosted EMR of Cincinnati Children’s Hospital Medical Center (CCHMC) to identify all patients with a diagnosis of ASD. We included all patients whose EMR contained a 299.* ICD-9 diagnostic code recorded between 2009 and 2014 by CCHMC's Division of Developmental Disabilities and Behavioral Pediatrics (DDBP). In addition, we employed natural language processing (NLP) techniques to glean clinical concepts (including diagnoses) from free-text office visit notes. We manually reviewed approximately 100 clinical notes to ascertain the agreement between the NLP-extracted assessment of ASD status in the clinical notes and the presence or absence of the corresponding ICD-9 code in the encounter diagnosis list.
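
The abstract does not state how agreement between the NLP output and the ICD-9 codes was quantified. As a minimal sketch, assuming each reviewed note yields one boolean NLP flag and one list of encounter diagnosis codes, percent agreement and Cohen's kappa could be computed as follows. The function names and the choice of kappa are illustrative assumptions, not the authors' method.

```python
def has_asd_code(codes):
    """True if any code falls under ICD-9 299.* (assumed dotted string format)."""
    return any(code.strip().startswith("299.") for code in codes)


def agreement(nlp_flags, icd_flags):
    """Return (percent agreement, Cohen's kappa) for two paired boolean lists."""
    if len(nlp_flags) != len(icd_flags) or not nlp_flags:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(nlp_flags)
    # Fraction of notes where the two sources give the same ASD status.
    observed = sum(a == b for a, b in zip(nlp_flags, icd_flags)) / n
    # Agreement expected by chance from the marginal positive rates.
    p_nlp = sum(nlp_flags) / n
    p_icd = sum(icd_flags) / n
    expected = p_nlp * p_icd + (1 - p_nlp) * (1 - p_icd)
    if expected == 1.0:
        # Both sources are constant and identical; kappa is undefined,
        # so report perfect agreement by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical example: four reviewed notes.
nlp = [True, True, False, False]
icd = [has_asd_code(c) for c in (["299.00"], ["314.01"], [], ["V20.2"])]
percent, kappa = agreement(nlp, icd)
```

Kappa is included alongside raw agreement because, in a cohort where most notes share one status, raw agreement alone can look high by chance.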

We then matched the EMR data with Ohio state birth records in order to 1) identify the pre-birth residence of mothers who gave birth to offspring with ASD; and 2) gain access to a large number of locale- and age-matched controls unaffected by ASD. To link the data sources, we wrote custom software (in the Perl scripting language) that matched patients by birthdate and by first and last names, allowing for minor misspellings.
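
The linkage step described above can be sketched as follows. This is an illustration in Python rather than the authors' Perl software, and it assumes an exact match on birthdate plus a string-similarity threshold on each name; the record fields (`id`, `dob`, `first`, `last`), the 0.85 threshold, and the use of `difflib` are all hypothetical choices.

```python
from datetime import date
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and padding."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def link_records(emr_records, birth_records, threshold=0.85):
    """Map each EMR record id to the best-matching birth record id, or None."""
    # Block on birthdate so names are only compared within the same day.
    by_dob = {}
    for rec in birth_records:
        by_dob.setdefault(rec["dob"], []).append(rec)

    links = {}
    for emr in emr_records:
        best_id, best_score = None, 0.0
        for cand in by_dob.get(emr["dob"], []):
            # Both first and last name must clear the threshold.
            score = min(
                name_similarity(emr["first"], cand["first"]),
                name_similarity(emr["last"], cand["last"]),
            )
            if score >= threshold and score > best_score:
                best_id, best_score = cand["id"], score
        links[emr["id"]] = best_id
    return links


# Hypothetical records: "Katherin" is a minor misspelling of "Katherine".
emr = [
    {"id": "E1", "dob": date(2005, 3, 14), "first": "Katherine", "last": "Johnson"},
    {"id": "E2", "dob": date(2006, 1, 2), "first": "Sam", "last": "Lee"},
]
births = [
    {"id": "B1", "dob": date(2005, 3, 14), "first": "Katherin", "last": "Johnson"},
    {"id": "B2", "dob": date(2005, 3, 14), "first": "Michael", "last": "Smith"},
]
links = link_records(emr, births)
```

Blocking on birthdate keeps the number of name comparisons small, which matters when the birth-record file covers an entire state.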

Having geocoded the addresses where mothers of ASD cases and controls resided immediately prior to giving birth, we estimated prenatal exposure to environmental factors by linking addresses with two public datasets released by the EPA. The first dataset estimates green space coverage within 400 m of an address; the second uses a Bayesian space-time fusion model [1-3] to estimate PM2.5 (daily average) and O3 (daily 8-hr maximum) on a 12 km x 12 km grid for the conterminous United States for the years 2001-2008.
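
To illustrate how daily grid-cell estimates might be turned into trimester-specific exposures, the sketch below averages daily PM2.5 values over three pregnancy windows. The abstract does not define the trimester boundaries or the source of gestational age, so the 13- and 26-week cut points, the default 40-week gestation, and the function names are assumptions made for illustration.

```python
from datetime import date, timedelta


def trimester_windows(birth_date, gestational_weeks=40):
    """Return three half-open [start, end) date windows, one per trimester."""
    pregnancy_start = birth_date - timedelta(weeks=gestational_weeks)
    second = pregnancy_start + timedelta(weeks=13)  # assumed cut point
    third = pregnancy_start + timedelta(weeks=26)   # assumed cut point
    return [(pregnancy_start, second), (second, third), (third, birth_date)]


def trimester_means(daily_pm25, birth_date, gestational_weeks=40):
    """Average the daily values in each trimester; None if no data in a window.

    daily_pm25 maps a date to the estimate for the mother's grid cell.
    """
    means = []
    for start, end in trimester_windows(birth_date, gestational_weeks):
        values = []
        day = start
        while day < end:
            if day in daily_pm25:
                values.append(daily_pm25[day])
            day += timedelta(days=1)
        means.append(sum(values) / len(values) if values else None)
    return means


# Hypothetical series: constant 10, 20, 30 ug/m3 in successive trimesters.
birth = date(2005, 10, 7)
start = birth - timedelta(weeks=40)
series = {
    start + timedelta(days=i): (10.0 if i < 91 else 20.0 if i < 182 else 30.0)
    for i in range(280)
}
means = trimester_means(series, birth)
```

The same function would apply to the O3 series by passing the daily 8-hr maximum values instead.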

Using logistic regression to control for birth year and additional covariates, we will determine the association between ASD and green space, as well as the trimester-specific associations of ASD with PM2.5 and O3.
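
One way to write such a model for PM2.5 is shown below. This is only a plausible specification: the abstract does not say whether the three trimester averages enter the model jointly or separately, how birth year is coded, or which additional covariates are used. Here $\mathrm{PM}^{(t)}_{i}$ denotes the trimester-$t$ average for child $i$ and $\mathbf{z}_i$ stands for the unspecified covariates; all symbols are assumptions introduced for illustration.

```latex
\operatorname{logit}\,\Pr(\mathrm{ASD}_i = 1)
  = \beta_0
  + \sum_{t=1}^{3} \beta_t \,\mathrm{PM}^{(t)}_{i}
  + \gamma \,\mathrm{year}_i
  + \boldsymbol{\delta}^{\top} \mathbf{z}_i
```

Under this form, $\exp(\beta_t)$ is the odds ratio for ASD per unit increase in the trimester-$t$ average PM2.5, holding the other terms fixed.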

Results: We found that NLP techniques successfully abstracted clinical concepts from free text, and that Ohio state birth records could be integrated with the EMR with high accuracy. However, because environmental monitors are sparse in our geographical area, extrapolating environmental data to a wide range of geocoded locations is associated with uncertainty.

Conclusions: Using novel EMR data extraction methods and data linkage across multiple sources and databases, we will efficiently evaluate the association between ASD and prenatal exposure to PM2.5 and O3, as well as access to green space.

See more of: Epidemiology