Objectives: We set out to build a process that could inexpensively integrate data about autism research projects from multiple sources into a unified data warehouse. Our initial sources included a grant management system (proposalCENTRAL), PubMed, and NIH RePORTER, but our long-term goal was to integrate additional sources at linear cost. Unified data would yield insights about grant applicants’ public and private funding history, publication trajectories, and collaborations. This data warehouse would enable science officers to discover new researchers who should be targeted for funding or could serve as reviewers on grant applications.
Methods: We altered the usual Extract, Transform and Load (ETL) sequence to Extract, Load and Transform (ELT). Data were first extracted from the source system and immediately loaded, without any transformation, into a “loading zone” of the data warehouse. Modular transformation rules were written in a high-level query language (HTSQL) and saved using the saved-query mechanism of our system, where they remained available for automated testing. To improve transparency and testability, complex transformations were done in multiple stages that produced intermediate data models. We found that HTSQL was sufficiently powerful to handle all desired transformations, allowing us to use a single technology for the ELT process. Uniquely identifying scientists and institutions across data sources posed an additional set of challenges. Disambiguation and reconciliation algorithms were developed to handle the majority of cases, with a subset of low-certainty cases requiring manual inspection.
Results: The ELT approach proved successful in combining the initial three data sources (proposalCentral, PubMed, NIH RePORTER) into a unified data model. It also demonstrated flexibility in adapting to changes in the data structure of grant management data. We believe it is sufficiently flexible to incorporate numerous other data sources in the future.
Conclusions: The innovative ELT approach was effective largely because of an enabling technology, HTSQL, a high-level query language. The unification of the first three data sources created a valuable resource for making decisions about research funding. Automatic disambiguation of scientists and institutions, and reconciliation of objects across data sources remains a major challenge and will require additional work.