Objectives: As part of the development of the National Database of Autism Research (NDAR) (http://ndar.nih.gov/), which will maintain and share data gathered by standardized clinical assessments in NIH studies, we are developing an ontology, or formal knowledge model, to standardize and catalog the definitions of phenotypes. The resulting ontology will allow NDAR users to query high-level concepts, such as age of first spoken word, without needing to understand the low-level representation of how such data is stored.
Methods: To define the scope and content of the ontology, we first undertook a requirements analysis to gather the range of concepts, relationships, and abstractions used in autism research. We searched the PubMed database using the key words “(ADI R or ADOS or Vineland) and (genes or genetics) and autism.” We then created a list of those phenotype definitions that were used as eligibility criteria or in analyses in an original research study. From this list we identified a unique set of definitions, and then determined whether each phenotype definition could be encoded using the Semantic Web ontology and rule language standards, OWL and SWRL, respectively.
Results: We found 43 published research papers as of March 1, 2008, and selected 26 of these as relevant based on the inclusion criteria that a study enrolled subjects with a diagnosis of autism and was published in the English language. Excluding criteria used for diagnosis of autism or autism spectrum disorder, we found 75 uniquely defined concepts used as candidate phenotypes (mean of 4 concepts per paper). Nearly two-thirds of the concepts (63%) were based on a one-to-one mapping with a single item on an assessment instrument, sometimes using a cutoff score. Approximately one quarter (24%) were defined as abstractions that were the sum of several items, where such a score was not already an instrument item. The remaining concepts (13%) were not defined clearly enough by the authors to be mapped precisely to a discrete set of items. Among the 75 concepts, several were used for analysis across multiple papers. Some concepts were slight variations of each other; for example, different cutoff scores were used to define presence or absence of a savant-skill phenotype. Of the 65 phenotype descriptions that we could map precisely to instrument items, we were able to encode all of them using the OWL and SWRL formalisms.
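Illustration: The two mappable kinds of definition reported above (a single instrument item, optionally with a cutoff, and a sum of several items) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the OWL and SWRL encoding used for NDAR: the class name Phenotype, the item identifiers, the scores, and the cutoff values are all hypothetical, and the convention that a phenotype is present when the score is at or above the cutoff is assumed.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Hypothetical representation: instrument item identifier -> score for one subject.
Scores = Mapping[str, int]


@dataclass(frozen=True)
class Phenotype:
    """A phenotype defined over one or more instrument items (illustrative only)."""
    name: str
    items: Tuple[str, ...]          # one item = one-to-one mapping; several = summed abstraction
    cutoff: Optional[int] = None    # assumed convention: present when value >= cutoff

    def value(self, scores: Scores) -> int:
        # Sum the scores of the items that make up this definition.
        return sum(scores[item] for item in self.items)

    def present(self, scores: Scores) -> bool:
        # Only definitions with a cutoff yield a presence/absence answer.
        if self.cutoff is None:
            raise ValueError(f"{self.name} has no cutoff score")
        return self.value(scores) >= self.cutoff


# Two slight variations of the same concept, differing only in the cutoff (made-up values).
savant_strict = Phenotype("savant skill, cutoff 2", ("savant_item",), cutoff=2)
savant_lenient = Phenotype("savant skill, cutoff 1", ("savant_item",), cutoff=1)

# An abstraction defined as the sum of several items (made-up item names).
social_total = Phenotype("social total", ("social_1", "social_2", "social_3"))

# Made-up scores for a single subject.
subject = {"savant_item": 1, "social_1": 1, "social_2": 3, "social_3": 0}

print(savant_strict.present(subject))   # differs from the lenient variant for this subject
print(savant_lenient.present(subject))
print(social_total.value(subject))
```

Keeping the cutoff as a separate field makes variant definitions of one concept directly comparable, which is the kind of bookkeeping a shared catalog of phenotype definitions would need.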
Conclusions: The results of our analysis of phenotype concepts in the autism research literature indicate the need for a well-defined set of clinical phenotypes. The ontology-based framework that we are developing for NDAR can provide a common, standardized core set of concepts and relationships to unify diverse clinical, behavioral, and genetic data on autism; allow investigators to share, query and integrate stored data using common terms; and serve as a computational catalog that tracks the evolution of candidate phenotypes.