SciDB backending of SummarizedExperiments and multi-assay containers, proof of concept

Tim Triche, Jr edited this page Mar 26, 2015 · 3 revisions

As promised/threatened previously, I have started creating a reproducible example of using the SciDB database by spinning up a Docker SciDB instance (https://github.com/albhasan/docker_scidb). For SciDB itself, see http://www.paradigm4.com/, the R client on CRAN (http://cran.r-project.org/web/packages/scidb/index.html), and the much more active https://github.com/Paradigm4/SciDBR repository, where bwlewis is committing new code as I write this.

SciDB essentially takes the idea of a BigMatrix and stuffs it into a database built around array indexing. See http://db.csail.mit.edu/nedbday13/slides/lewis.pdf for some more (albeit not super up-to-date) background. As far as I can tell, their quant clients keep the project healthy and wealthy.

Right now I am at the stage of linking together (see https://docs.docker.com/userguide/dockerlinks/) the SciDB Docker image (https://github.com/albhasan/docker_scidb) and Dan Tenenbaum's bioconductor/devel_core Docker image (https://github.com/Bioconductor/bioc_docker; the full list of images is at http://bioconductor.org/help/docker/#the-full-list).

The obvious use case IMHO is TCGA pan-cancer clustering jointly on DNA methylation and transcript expression, which acquits itself rather well under Tibshirani's guidance (Gevaert 2015, http://genomebiology.com/2015/16/1/17). There are a number of straightforward notions (direct integration of CNVs, FEM, etc.) that could further improve and extend Gevaert's results. Such an analysis might serve as the basis for generic joint analyses if one could trivially swap the sample inclusion criteria in and out without running out of local RAM. Hence the idea of using a SciDB backend.

Hopefully the decoupling of assay backend storage from sample- or time-point-specific covariates can allow for more thoughtful reanalyses that aren't constrained primarily by enormous memory or network bandwidth requirements. I prefer to use a SummarizedExperiment for just about everything, and have previously encapsulated a BigMatrix assay into an SE, so if the semantics are as clean as bwlewis claims, this could be a useful way forward, especially when doing joint private/public meta-analyses.
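To make the decoupling concrete, here is a minimal sketch of the idea, written in Python purely for illustration (the actual PoC is meant to be R/Bioconductor). Every name in it (`AssayBackend`, `BackedExperiment`, `subset_samples`, `assay`) is hypothetical: this is neither the SciDB client API nor the SummarizedExperiment API. An in-memory list stands in for the remote array store, and a counter records how many cells are actually pulled across.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class AssayBackend:
    """Hypothetical stand-in for a remote array store (features x samples)."""

    def __init__(self, rows: List[List[float]]):
        self._rows = rows
        self.cells_fetched = 0  # how many values have been pulled so far

    def fetch(self, feature_idx: List[int], sample_idx: List[int]) -> List[List[float]]:
        """Return only the requested sub-block, as a real backend would."""
        self.cells_fetched += len(feature_idx) * len(sample_idx)
        return [[self._rows[i][j] for j in sample_idx] for i in feature_idx]


@dataclass
class BackedExperiment:
    """Covariates live in memory; assay values stay in the backend."""

    features: List[str]
    samples: List[str]
    col_data: Dict[str, dict]  # sample name -> covariates
    backend: AssayBackend
    col_index: Optional[List[int]] = None  # backend column of each sample

    def __post_init__(self):
        if self.col_index is None:
            self.col_index = list(range(len(self.samples)))

    def subset_samples(self, keep: Callable[[dict], bool]) -> "BackedExperiment":
        """Change the inclusion criteria without touching any assay data."""
        pairs = [(s, j) for s, j in zip(self.samples, self.col_index)
                 if keep(self.col_data[s])]
        return BackedExperiment(
            features=self.features,
            samples=[s for s, _ in pairs],
            col_data={s: self.col_data[s] for s, _ in pairs},
            backend=self.backend,
            col_index=[j for _, j in pairs],
        )

    def assay(self, feature_names: Optional[List[str]] = None) -> List[List[float]]:
        """Fetch values only for the current samples and requested features."""
        names = self.features if feature_names is None else feature_names
        rows = [self.features.index(f) for f in names]
        return self.backend.fetch(rows, self.col_index)


# Toy data: two features, four samples, made-up values and covariates.
backend = AssayBackend([[0.1, 0.9, 0.5, 0.3],
                        [0.8, 0.2, 0.4, 0.7]])
se = BackedExperiment(
    features=["f1", "f2"],
    samples=["s1", "s2", "s3", "s4"],
    col_data={"s1": {"cohort": "public"}, "s2": {"cohort": "private"},
              "s3": {"cohort": "public"}, "s4": {"cohort": "private"}},
    backend=backend,
)

public = se.subset_samples(lambda cov: cov["cohort"] == "public")
fetched_before = backend.cells_fetched  # subsetting alone fetched nothing
block = public.assay(["f2"])            # only the 1 x 2 block is pulled
```

The point of the design is that `subset_samples` only reshuffles the small in-memory covariate table and a column index, so trying different inclusion criteria costs nothing until `assay` is called, and even then only the requested block moves.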

The initial proof of concept (PoC) is a Dockerfile that spins up a SciDB-backended SummarizedExperiment.

The substantial PoC is a SciDB-backended multiAssay object that is sufficient to reproduce Gevaert 2015.

I (tjt) will update this page with more details and/or links to Dockerfiles/code milestones as I produce them. I'm putting the initial bits into https://github.com/ttriche/biocMultiSciDB for the time being.

--tjt, 3/26/2015