Building a Data Warehouse Describing the Autism Research Community in a New Way: Extract, Load and Transform (ELT)

Friday, May 18, 2012
Sheraton Hall (Sheraton Centre Toronto)
9:00 AM
C. Tirrell1, M. Peddle1, S. B. Johnson2, C. D. Walentas3, O. McGettrick1, B. Lawlor1, H. Agnew1, D. Voccola1 and L. Rozenblit1, (1)Prometheus Research, LLC, New Haven, CT, (2)Public Health, Weill Cornell Medical College, New York, NY, (3)Self, New York, NY
Background: Understanding the scope of research activities in a scientific community requires leveraging data from multiple public and private sources. However, data from different sources, such as PubMed and NIH RePORTER, may be difficult to link because the data are organized differently, are of inconsistent or poor quality, or lack appropriate identifiers. The standard solution to these challenges is an Extract, Transform and Load (ETL) process, which pulls data out of the original source, changes its structure and content, then stores it in the target database. However, ETL processes tend to be brittle, expensive, and difficult to run. We needed a data-integration solution that would remain flexible and inexpensive, and would accommodate an expanding list of ad hoc data sources to support decisions about scientific funding.
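To make the conventional ordering concrete, here is a minimal Python sketch of classic ETL; the source format, field names, and transformation are hypothetical. Because the transformation sits between extraction and loading, any change in the source's structure breaks the pipeline before the data ever reach the warehouse.

    # Classic ETL sketch: transform in transit, then load (all names hypothetical).

    def extract():
        # Stand-in for pulling raw records from a source such as PubMed
        return [{"PMID": "123", "AuthorList": "Smith J; Doe A"}]

    def transform(records):
        # Restructuring happens before anything is stored; if the source
        # renames 'AuthorList' or changes its delimiter, the load never runs.
        return [
            {"pmid": r["PMID"], "authors": r["AuthorList"].split("; ")}
            for r in records
        ]

    def load(warehouse, rows):
        warehouse.setdefault("publication", []).extend(rows)

    warehouse = {}
    load(warehouse, transform(extract()))  # transform precedes load
    print(warehouse["publication"])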

Objectives: We set out to build a process that could inexpensively integrate data about autism research projects from multiple sources into a unified data warehouse. Our initial sources included a grant management system (proposalCENTRAL), PubMed, and NIH RePORTER, but our long-term goal was to integrate additional sources at linear cost. Unified data would yield insights about grant applicants’ public and private funding history, publication trajectories, and collaborations. This data warehouse would enable science officers to discover new researchers who should be targeted for funding or who could serve as reviewers on grant applications.

Methods: We altered the usual Extract, Transform and Load (ETL) sequence to Extract, Load and Transform (ELT). Data were first extracted from the source system and immediately loaded, without any transformation, into a “loading zone” of the data warehouse. Modular transformation rules were written in a high-level query language (HTSQL) and stored using our system’s saved-query mechanism, where they remained available for automated testing. To improve transparency and testability, complex transformations were performed in multiple stages that produced intermediate data models. HTSQL proved sufficiently powerful to handle all desired transformations, allowing us to use a single technology for the entire ELT process. Uniquely identifying scientists and institutions across data sources posed an additional set of challenges. Disambiguation and reconciliation algorithms were developed to handle the majority of cases, with a subset of low-certainty cases requiring manual inspection.
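For illustration, the following Python sketch shows the ELT ordering with staged transformations performed inside the warehouse. The actual system expressed its transformations as HTSQL saved queries; here SQLite SQL stands in, and every table and field name is hypothetical. (The author-splitting step relies on SQLite’s built-in JSON table function and a naive delimiter, so it assumes names contain no quote characters.)

    import sqlite3

    db = sqlite3.connect(":memory:")

    # 1. Extract + Load: source rows land verbatim in a "loading zone" table.
    db.execute("CREATE TABLE lz_pubmed (pmid TEXT, author_list TEXT)")
    db.executemany(
        "INSERT INTO lz_pubmed VALUES (?, ?)",
        [("123", "Smith J; Doe A"), ("456", "Doe A")],
    )

    # 2. Transform, stage 1: an intermediate model normalizes authorship,
    #    splitting the delimited author list into one row per author.
    db.execute("""
        CREATE TABLE stage_authorship AS
        SELECT pmid, j.value AS author
        FROM lz_pubmed,
             json_each('["' || REPLACE(author_list, '; ', '","') || '"]') AS j
    """)

    # 3. Transform, stage 2: the final model aggregates per author. Each
    #    stage is an ordinary table that can be inspected and tested alone.
    db.execute("""
        CREATE TABLE model_author AS
        SELECT author, COUNT(*) AS n_publications
        FROM stage_authorship
        GROUP BY author
    """)

    print(db.execute("SELECT * FROM model_author ORDER BY author").fetchall())
    # -> [('Doe A', 2), ('Smith J', 1)]

The abstract does not describe the disambiguation and reconciliation algorithms themselves; the sketch below illustrates only the triage pattern it reports, in which high-certainty pairs are linked automatically and a low-certainty band is queued for manual inspection. The similarity measure and thresholds are hypothetical.

    from difflib import SequenceMatcher

    # Generic reconciliation triage (scoring and thresholds are hypothetical).

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def reconcile(name_a, name_b, accept=0.90, review=0.70):
        score = similarity(name_a, name_b)
        if score >= accept:
            return "match"          # link records automatically
        if score >= review:
            return "manual review"  # low certainty: route to a human
        return "no match"

    for a, b in [("Jane Q. Researcher", "Researcher, Jane Q"),
                 ("Jane Q. Researcher", "John R. Scientist")]:
        print(a, "|", b, "->", reconcile(a, b))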

Results: The ELT approach proved successful in combining the initial three data sources (proposalCENTRAL, PubMed, NIH RePORTER) into a unified data model. It also demonstrated flexibility in adapting to changes in the structure of the grant management data. We believe it is sufficiently flexible to incorporate numerous other data sources in the future.

Conclusions: The innovative ELT approach was effective largely because of an enabling technology: HTSQL, a high-level query language. The unification of the first three data sources created a valuable resource for making decisions about research funding. Automatic disambiguation of scientists and institutions, and reconciliation of objects across data sources, remain major challenges and will require additional work.
