16089
Integration of Copy Number and Exome Sequence Data in a Queryable Database for the Investigation of ASDs
Objectives: To develop a pilot queryable and scalable database for effective integration of whole exome sequence data and CNV data in order to identify additional genes and variants of interest in ASD.
Methods: A pilot database was developed using available data for 24 probands (23 male and 1 female). Exome sequence variants were called with GATK and TrioCaller, exported in Variant Call Format (VCF), and annotated with ANNOVAR. The VCF contained information such as position, phased genotype, raw read depth, and multiple measures of deleteriousness and conservation. CNV calls were generated by CNVision from Illumina 1M array data and included position, copy number, ancestry, gender, and the calling algorithm used for detection (PennCNV, GNOSIS, and/or QuantiSNP). A MySQL database with four relational tables containing all variant and proband data was developed. The database was queried for rare, hemizygous, deleterious variants by selecting SNVs that had a minor allele frequency (MAF) below 5%, had a possibly deleterious LJB_PolyPhen2 score, and lay within a region of CNV deletion.
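The query described above can be sketched as an interval join between an SNV table and a CNV table. This is a minimal illustration rather than the schema actually used: the table and column names, the toy rows, the PolyPhen2 cutoff of 0.5, and the use of copy number 1 to mark a single-copy deletion are all assumptions, and SQLite stands in for MySQL so the example runs without a server.

```python
import sqlite3

# Hypothetical two-table schema (the abstract does not give the real one).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE snv (
    proband_id TEXT, chrom TEXT, pos INTEGER, rsid TEXT,
    maf REAL, polyphen2 REAL
);
CREATE TABLE cnv (
    proband_id TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER,
    copy_number INTEGER
);
""")

# Made-up toy rows, one per filtering case.
conn.executemany("INSERT INTO snv VALUES (?,?,?,?,?,?)", [
    ("P1", "5", 1500, "snv_a", 0.01, 0.95),  # rare, deleterious, in a deletion
    ("P1", "5", 9000, "snv_b", 0.01, 0.95),  # outside any CNV
    ("P2", "5", 1500, "snv_c", 0.20, 0.95),  # too common
    ("P2", "5", 1600, "snv_d", 0.01, 0.10),  # low PolyPhen2 score
    ("P3", "5", 1500, "snv_e", 0.01, 0.95),  # in a duplication, not a deletion
])
conn.executemany("INSERT INTO cnv VALUES (?,?,?,?,?)", [
    ("P1", "5", 1000, 2000, 1),
    ("P2", "5", 1000, 2000, 1),
    ("P3", "5", 1000, 2000, 3),
])

def rare_hemizygous_deleterious(conn, max_maf=0.05, min_polyphen=0.5):
    """Return (proband_id, rsid) pairs for rare, putatively deleterious SNVs
    located inside a single-copy deletion in the same proband."""
    query = """
        SELECT DISTINCT s.proband_id, s.rsid
        FROM snv AS s
        JOIN cnv AS c
          ON c.proband_id = s.proband_id
         AND c.chrom = s.chrom
         AND s.pos BETWEEN c.start_pos AND c.end_pos
        WHERE c.copy_number = 1      -- assumed encoding of a one-copy deletion
          AND s.maf < ?              -- rarity threshold (5% in the abstract)
          AND s.polyphen2 >= ?       -- assumed deleteriousness cutoff
        ORDER BY s.proband_id, s.rsid
    """
    return conn.execute(query, (max_maf, min_polyphen)).fetchall()

hits = rare_hemizygous_deleterious(conn)
print(hits)
```

Keeping the thresholds as query parameters means the same join can be reused with different MAF or deleteriousness cutoffs without changing the SQL.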
Results: Eight rare, hemizygous, putatively deleterious SNVs were found among the 24 probands. Three of the eight (rs59056023, rs17844333, rs115218749) lay in CNV deletion regions within the gene PCDHA9, which encodes protocadherin alpha 9. Further queries showed that nine of the 24 individuals (37.5%) had at least one region of deletion within the PCDHA9 gene.
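A follow-up query of the kind mentioned above, counting probands with at least one deletion overlapping a gene, might look like the following sketch. The schema, the toy rows, and the gene interval (chromosome "5", positions 1000 to 2000) are hypothetical placeholders, not real PCDHA9 coordinates, and SQLite again stands in for MySQL.

```python
import sqlite3

# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE proband (proband_id TEXT PRIMARY KEY);
CREATE TABLE cnv (
    proband_id TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER,
    copy_number INTEGER
);
""")
conn.executemany("INSERT INTO proband VALUES (?)",
                 [("P1",), ("P2",), ("P3",), ("P4",)])
conn.executemany("INSERT INTO cnv VALUES (?,?,?,?,?)", [
    ("P1", "5", 900, 1100, 1),    # deletion overlapping the gene start
    ("P1", "5", 1500, 1600, 1),   # second deletion, same proband: counted once
    ("P2", "5", 3000, 4000, 1),   # deletion outside the gene
    ("P3", "5", 1200, 1300, 3),   # duplication inside the gene
    ("P4", "7", 1200, 1300, 1),   # matching positions, different chromosome
])

def deletion_carrier_fraction(conn, chrom, gene_start, gene_end):
    """Count probands with >= 1 deletion overlapping [gene_start, gene_end]."""
    carriers = conn.execute(
        """
        SELECT COUNT(DISTINCT proband_id)
        FROM cnv
        WHERE chrom = ?
          AND copy_number < 2                   -- assumed: below 2 copies = deletion
          AND start_pos <= ? AND end_pos >= ?   -- standard interval-overlap test
        """,
        (chrom, gene_end, gene_start),
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM proband").fetchone()[0]
    return carriers, total, carriers / total

carriers, total, fraction = deletion_carrier_fraction(conn, "5", 1000, 2000)
print(f"{carriers} of {total} probands ({fraction:.1%})")
```

Using `COUNT(DISTINCT proband_id)` ensures that a proband with several deletions in the gene contributes only once; applied to the reported figures, the same ratio is 9 / 24 = 37.5%.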
Conclusions: Through this investigation of rare, hemizygous, deleterious variants in ASD probands, we present an efficient approach for integrating different data types and demonstrate that it is simple yet robust, even with a small sample. Database queries are easily customized across multiple parameters, including deleteriousness, copy number, conservation, exonic function, and MAF. Our initial application of the methodology suggests a need for further investigation of rare SNVs within CNV regions. Further attention to PCDHA9 is warranted, especially given its predicted role in establishing and maintaining neuronal cell-cell connections. Future directions include other applications of this database framework and the integration of additional types of genomic and phenotypic data for ASD investigation.