Previous studies have established a strong influence of genomic variation in the etiology of Autism Spectrum Disorders (ASD). While associated genomic regions have been identified, the estimated effect sizes for these regions are small, and the combined evidence from numerous genetic analyses does not account for the high heritability of the disorder. Disorders diagnosed within the autism spectrum are heterogeneous in their phenotypic presentation. This wide variability in clinical manifestation is a potential explanation for the difficulty of identifying common genetic variation associated with ASD.
Objectives:
We used multivariate clustering to explore ASD phenotype data in an attempt to uncover sub-groups of individuals who are more genetically similar. Our hypothesis is that sub-grouping individuals on the basis of behavioral and clinical exam information will greatly increase our power to detect genes influencing risk for ASD.
Methods:
For cluster analyses, we included Autism Diagnostic Interview-Revised (ADI-R) scores, Autism Diagnostic Observation Schedule (ADOS) scores, Vineland Adaptive Behavior Scale (VABS) scores and head circumference measures for 1,689 affected individuals from Caucasian families in the Autism Genetic Resource Exchange (AGRE) dataset. Weights were assigned to each measure so that the ADI-R, ADOS, VABS and head circumference contributed equally to the cluster analyses. Seven clustering methods were evaluated for internal validity and cluster stability while partitioning the dataset into 2 to 12 clusters. Kruskal-Wallis equality-of-populations rank tests were then performed on untransformed scores to determine whether score distributions differed between clusters. Clusters were further validated by permuting phenotype data across individuals 1,000 times, clustering the permuted data, and calculating the Hubert-Arabie Adjusted Rand Index to compare the clustering of the real data with that of the permuted data. Intra-cluster family structure was evaluated by calculating the odds of individuals being assigned to the same cluster given a familial relationship.
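As an illustration of the validation step, the following is a minimal sketch of the Hubert-Arabie Adjusted Rand Index, written with the Python standard library only. It is not the code used in the study. The function names `adjusted_rand_index` and `mean_ari_vs_shuffled` are hypothetical, and the second function shuffles cluster labels as a simplified stand-in for re-clustering permuted phenotype data, which is an assumption made to keep the example short.

```python
from collections import Counter
from math import comb
import random


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie Adjusted Rand Index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions agree trivially

    # Pairs of items placed together in both partitions (contingency-table cells).
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately (row and column sums).
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2           # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (together_in_both - expected) / (maximum - expected)


def mean_ari_vs_shuffled(labels, n_permutations=1000, seed=0):
    """Average ARI between the labels and randomly shuffled copies of them.

    Assumption: shuffling labels stands in for clustering permuted phenotype
    data; a real analysis would re-run the clustering on each permuted dataset.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        total += adjusted_rand_index(labels, shuffled)
    return total / n_permutations
```

An index of 1 indicates identical partitions, while values near 0 indicate agreement no better than chance, which is the range expected when a partition is compared with partitions of permuted data.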
Results:
The best validity and stability scores were obtained with agglomerative clustering and the dataset partitioned into two groups, one cluster of 1,136 individuals and a second cluster of 550 individuals. The agglomerative coefficient was 0.78, indicating strong clustering structure in the dataset. The average Hubert-Arabie Adjusted Rand Index comparing real-data results with permuted-data results was 0.0012; that is, partitions of the permuted data showed essentially no agreement with the partition of the real data, consistent with non-random structure in the real data. Kruskal-Wallis results showed that all input variables differed significantly (p<0.0001) between the resulting clusters. Examination of the variables indicates that individuals with more severe scores on most variables were placed in the same cluster. The odds ratio for family structure within clusters was approximately 1.42 (p<0.0001), suggesting that related individuals are more likely to cluster together than unrelated individuals.
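The abstract does not state how the family-structure odds ratio was computed. One plausible reading, shown here only as a hedged sketch, is a pairwise 2x2 table: every pair of individuals is classified as related or unrelated and as sharing or not sharing a cluster. The function name `co_clustering_odds_ratio` and the pairwise formulation are assumptions for illustration.

```python
from itertools import combinations


def co_clustering_odds_ratio(family_ids, cluster_ids):
    """Odds ratio of sharing a cluster for related versus unrelated pairs.

    Assumption: two individuals are 'related' when they share a family ID.
    """
    if len(family_ids) != len(cluster_ids):
        raise ValueError("both lists must have the same length")

    # 2x2 table counts over all unordered pairs of individuals.
    related_same = related_diff = unrelated_same = unrelated_diff = 0
    for i, j in combinations(range(len(family_ids)), 2):
        related = family_ids[i] == family_ids[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if related and same_cluster:
            related_same += 1
        elif related:
            related_diff += 1
        elif same_cluster:
            unrelated_same += 1
        else:
            unrelated_diff += 1

    if related_diff == 0 or unrelated_same == 0:
        raise ValueError("odds ratio is undefined when a table cell is zero")
    return (related_same * unrelated_diff) / (related_diff * unrelated_same)


# Toy example with three sibling pairs (hypothetical data).
families = ["f1", "f1", "f2", "f2", "f3", "f3"]
clusters = [0, 0, 1, 1, 0, 1]
print(co_clustering_odds_ratio(families, clusters))  # prints 4.0
```

Pairs that share an individual are not independent, so a p-value for such an odds ratio would need a method that accounts for that dependence, for example a permutation test.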
Conclusions:
This approach to ASD gene discovery allows effective evaluation of a broad array of data, enabling more complete phenotype definitions for ASD datasets. The finding that related individuals are more likely to be assigned to the same cluster when clustering on phenotype data suggests that the clinical variability of ASD is related to underlying genetic variability. Our results suggest that more effective methods of phenotype definition will increase power to detect genetic factors influencing risk for ASD.