International Meeting for Autism Research: Meta-Search: Automatic Indexing of Meta-Data and Data Can Dramatically Improve Variable Discovery In Very Large Autism Data Sets Like the Simons Simplex Collection (SSC)

Friday, May 13, 2011
Elizabeth Ballroom E-F and Lirenta Foyer Level 2 (Manchester Grand Hyatt)
9:00 AM
L. Rozenblit (1), A. Voronoy (1), M. Peddle (1), D. Voccola (1), C. C. Evans (1) and S. B. Johnson (2); (1) Prometheus Research, LLC, New Haven, CT; (2) Biomedical Informatics, Columbia University, New York, NY
Background: The sheer size of large autism data sets, such as the SSC, NDAR, AGRE, or IAN, poses a serious barrier to their use. The SSC, for example, includes nearly 6000 phenotype variables, and identifying those relevant to a research project can be a challenge. Recent approaches to this problem have focused on developing ontologies. However, ontologies take many years to develop, and they require the user to invest in learning a new, often complex, categorization scheme before getting started.

Objectives: We set out to develop “meta-search”, a lightweight approach for quickly identifying variables of interest through intelligent automated indexing of both data and meta-data in a relational database. From the perspective of a researcher using the system to discover variables, the tool should present a “Google-like” search interface. The researcher should be able to type in search terms, drawing from their own conceptual scheme, and get back a list of variables that match their interests. The output should describe each variable in enough detail for the researcher to judge its relevance and refine the search, and results should be sorted by relevance. Importantly, the tool must work in the absence of any manual tagging of variables with keywords, but should support the addition of manual tags. The tool should also support future integration with external ontology efforts, so that a researcher who uses an ontology term in a search gets the expected results.

Methods: We used an agile software development methodology, iterating in 2-week cycles for 3 months. Each iteration incorporated feedback from test users familiar with the SSC data set. The system uses data in SFARI Base (a data management system developed by Prometheus Research that stores SSC data) to automatically populate an SQLite database, building for each variable (1) a structured search index, and (2) a configurable “column report” that provides useful information about the variable. We developed a Google-like GUI for entering arbitrary search terms, and used the existing full-text search mechanism provided by SQLite to locate keywords in the structured search index. For each match, meta-search returns the content of the “column report”, with matches sorted by relevance.
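
The indexing and search steps described above can be illustrated with a minimal sketch. It assumes an SQLite build with the FTS5 extension, which may differ from the full-text module the system actually used, and the index schema, table names, and variable descriptions below are invented for illustration rather than taken from the SSC or SFARI Base.

```python
import sqlite3


def build_index(conn, variables):
    """Create a full-text index with one row per variable (hypothetical schema)."""
    conn.execute(
        "CREATE VIRTUAL TABLE variable_index USING fts5("
        "table_name, column_name, description, tags)"
    )
    conn.executemany(
        "INSERT INTO variable_index VALUES (?, ?, ?, ?)", variables
    )


def search(conn, terms, limit=10):
    """Return (table, column, description) rows, best match first.

    `terms` is passed to FTS5 as a query string, so characters that are
    special in FTS5 query syntax would need escaping in a real interface.
    """
    return conn.execute(
        "SELECT table_name, column_name, description "
        "FROM variable_index WHERE variable_index MATCH ? "
        "ORDER BY rank LIMIT ?",  # `rank` is FTS5's built-in relevance order
        (terms, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Invented example variables; tags may be empty when nothing was tagged by hand.
build_index(conn, [
    ("sleep_survey", "night_wakings", "Number of night wakings per week", "sleep"),
    ("medical_history", "seizure_onset_age", "Age in months at first seizure", ""),
    ("sleep_survey", "bedtime", "Usual bedtime on weekdays", "sleep"),
])

for row in search(conn, "sleep"):
    print(row)
```

Because table names, column names, descriptions, and tags all sit in one index, a search term can match a variable even when no manual tag was ever added, which is the property the objectives call for.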

Results: Testing with pilot users suggests that meta-search delivers intuitive and useful results with the SSC. The content of the column report is configurable, and currently includes column names, table names, data type, examples of actual data stored in the column, manual keyword tags (if any), and column statistics. Researchers can use the output of the system to further explore each variable or to build more complex queries that return multiple variables.
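
As a rough illustration of what a column report might contain, the sketch below derives the data type, a few example values, and simple counts for one column of an SQLite table. The function name `column_report`, the choice of statistics, and the sample table are assumptions made for this example; the abstract does not specify how the actual report is computed.

```python
import sqlite3


def column_report(conn, table, column, n_examples=3):
    """Summarize one column: declared type, sample values, simple counts."""
    # Identifiers cannot be bound as SQL parameters, so check them against
    # the schema before interpolating them into the queries below.
    tables = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        raise ValueError(f"unknown table {table!r}")
    types = {r[1]: r[2] for r in conn.execute(f'PRAGMA table_info("{table}")')}
    if column not in types:
        raise ValueError(f"unknown column {table}.{column}")

    examples = [r[0] for r in conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL LIMIT ?', (n_examples,))]
    rows, non_null, distinct = conn.execute(
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}") '
        f'FROM "{table}"').fetchone()
    return {
        "table_name": table,
        "column_name": column,
        "data_type": types[column],
        "examples": examples,
        "rows": rows,
        "non_null": non_null,
        "distinct": distinct,
    }


conn = sqlite3.connect(":memory:")
# Invented sample data, not drawn from any real data set.
conn.execute("CREATE TABLE sleep_survey (family_id TEXT, night_wakings INTEGER)")
conn.executemany(
    "INSERT INTO sleep_survey VALUES (?, ?)",
    [("f1", 2), ("f2", None), ("f3", 5), ("f4", 2)],
)
report = column_report(conn, "sleep_survey", "night_wakings")
print(report)
```

A report like this could be stored alongside the search index so that each hit is returned with enough context to judge relevance without opening the underlying table.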

Conclusions: Meta-search can run on top of any relational database, is accessible via the web, and anticipates future integration with ontology efforts. If successful, this system can be deployed at low cost on top of other large research data sources such as NDAR, AGRE, or IAN. Meta-search is a promising addition to the set of tools that help autism researchers make sense of very large data sets.