Modern-day studies exploring the determinants of disease require increased power to detect small effects, which in turn demands large sample sizes. A sufficient sample size is often obtained by pooling datasets, invariably held at different locations, into a single master database. However, ethico-legal and data-ownership issues often hamper the pooling of datasets into a single resource. Database federation techniques offer a viable solution to this problem by permitting access to datasets held at different locations through a single database interface, without the need for pooling.
To create a computational infrastructure based around database federation and to develop a secure, web-based analysis interface that facilitates querying of the federated datasets.
Federation is implemented through MySQL. Each data-contributing site holds a harmonised dataset in a Local iCARE Database (LID), stored on either a physical or virtual server at that site. A central Master iCARE Database (MID) contains federated tables that hold no data themselves but point to the data held at each site. The connections between the MID and each LID are maintained through secure SSH tunnels. The iCARE Web-based Analysis Portal (iWAP) is implemented using Perl CGI and Ajax; it is secured using SSL server and client certificates and further protected by user authentication and session cookies. Popular analysis packages (R, Stata, SAS) are available to users, who interact with them by submitting the relevant syntax in a text field provided when an analysis is initiated.
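The MID-to-LID arrangement described above can be illustrated with a minimal sketch. It assumes the federated tables use MySQL's FEDERATED storage engine and that each SSH tunnel forwards a local port on the MID to the MySQL server at a site; the abstract states neither detail explicitly. The function name, table, columns, user, port and database names are all hypothetical.

```python
def federated_table_ddl(table, columns, user, local_port, remote_db, remote_table):
    """Build a CREATE TABLE statement for a MySQL FEDERATED table.

    The table stores no rows on the master server; it points at
    remote_table in remote_db, reached through an SSH tunnel that is
    assumed to listen on 127.0.0.1:local_port.
    """
    # Column definitions must match the remote table's definition.
    cols = ",\n  ".join(f"`{name}` {sqltype}" for name, sqltype in columns)
    # FEDERATED connection string: mysql://user@host:port/db_name/tbl_name
    connection = f"mysql://{user}@127.0.0.1:{local_port}/{remote_db}/{remote_table}"
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n)\n"
        f"ENGINE=FEDERATED\n"
        f"CONNECTION='{connection}';"
    )


if __name__ == "__main__":
    # Hypothetical site, schema and tunnel port, for illustration only.
    print(
        federated_table_ddl(
            table="site_a_participants",
            columns=[("participant_id", "INT NOT NULL"), ("age", "INT")],
            user="icare_reader",
            local_port=3307,
            remote_db="lid",
            remote_table="participants",
        )
    )
```

Under these assumptions, one such table would be defined on the MID per site and harmonised table, each pointing at a different tunnel port, so that a query against the MID is forwarded to the corresponding LID.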
Before any analysis, separate projects, each examining specific variables from specific sites, are preconfigured so that only authorised users of the system can access them. Within a single project, users have access only to these pre-defined variables and resources. Through the simple iWAP interface, analysis runs for each defined project can be initiated and the resulting files viewed. When a run is submitted, the iWAP queries the MID, which sequentially queries each LID and retrieves the data for the requested variables. The retrieved data are not stored on the MID or iWAP, nor committed to disk in any way on these servers. Instead, named pipes (FIFOs) are used to pipe the data to the requested analysis package. Only run-specific output files are stored on the server, and these can be viewed through the provided file manager. Data retrieval times are adequate: a simple analysis of ~3 million records split between 5 sites completed from start to finish in around 4 minutes (with the MID based in Australia and the other sites in Europe). All analyses are logged, including user syntax and data retrieval times.
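The named-pipe step can be sketched as follows. This is not the iWAP implementation (which is written in Perl), only a minimal Python illustration of the same idea on a POSIX system: rows are written into a FIFO and read by a consumer, so the data never become a regular file on disk. The function names are hypothetical, and `count_records` merely stands in for an analysis package such as R reading the pipe as if it were a CSV file.

```python
import csv
import os
import tempfile
import threading


def stream_rows_through_fifo(rows, consume):
    """Write rows as CSV into a named pipe and pass the pipe's path to consume()."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "run_data.csv")
    os.mkfifo(fifo_path)  # creates a FIFO special file, not a regular file

    def writer():
        # Opening for writing blocks until the consumer opens the read end.
        with open(fifo_path, "w", newline="") as pipe:
            csv.writer(pipe).writerows(rows)

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        return consume(fifo_path)
    finally:
        thread.join()
        os.remove(fifo_path)
        os.rmdir(workdir)


def count_records(path):
    """Stand-in for an analysis package: read the header, then count data rows."""
    with open(path, newline="") as pipe:
        reader = csv.reader(pipe)
        header = next(reader)
        return header, sum(1 for _ in reader)


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [("id", "site", "age"), (1, "A", 40), (2, "B", 55)]
    print(stream_rows_through_fifo(rows, count_records))
```

Error handling is deliberately omitted: if the consumer fails before opening the pipe, the writer thread stays blocked, so a production version would need a timeout or a non-blocking open.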
We have created a secure, easily accessible, user-friendly framework for pooled data analysis, built on open-source technology. Its flexibility allows the system to grow easily if new groups wish to add their datasets. In addition, it allows us to adopt other technologies for pooled analysis, such as DataSHIELD (Wolfson et al.), which we aim to implement within the existing framework in early 2011.