
How do I cite the GMS?

The GMS has been accepted for publication in PLoS Computational Biology. Until the final publication details are available, please cite it as follows:

Griffith M, Griffith OL, Smith SM, Ramu A, Callaway MB, Brummett AM, Kiwala M, Coffmann AC, Regier AA, Oberkfell BJ, Sanderson GE, Mooney TP, Nutter N, Belter EA, Du F, Long RL, Abbott TE, Ferguson I, Morton D, Burnett M, Weible JV, Peck JB, Dukes A, McMichael JF, Lolofie JT, Derickson BR, Hundall J, Skidmore ZL, Ainscough BJ, Dees ND, Schierding WS, Kandoth C, Kim K, Lu C, Harris CC, Maher N, Maher CA, Magrini VJ, Abbott BS, Chen K, Clark E, Das I, Fan X, Hawkins AE, Hepler TG, Wylie TN, Leonard S, Schroeder WE, Shi X, Carmichael KL, Weil M, Wohldstadter RW, Stiehr G, McLellan MD, Pohl CS, Miller CA, Koboldt DC, Walker JR, Eldred JM, Larson DE, Dooling DJ, Ding L, Mardis ER, Wilson RK. 2015. Genome Modeling System: A Knowledge Management Platform for Genomics. PLoS Computational Biology. Accepted.

*Griffith M, Griffith OL and Smith SM contributed equally to this work and are listed in alphabetical order.

The GMS installation is time-consuming. What is the fastest and easiest way for me to take a tour of the system?

You can download a pre-configured virtual machine with the GMS and demonstration data already installed. Please refer to the Quick VM Tour Tutorial for details.

I think I have discovered a bug. How should I report it?

This is an open source project. Bug reports and other contributions are welcome. To report a bug, please open a GitHub issue here: https://github.com/genome/gms/issues

The GMS uses BAM files to store raw data, even as input to alignment. What if I only have FASTQ files?

We store all read data, even unaligned reads, in BAM format to save space. If a particular step absolutely requires FASTQ files, the GMS alignment API will generate them, use them, and then discard them. If you only have FASTQ files for your input data, these should work once imported into the GMS. Alternatively, you can convert them to BAM format with Picard FastqToSam. If the GMS has generated a BAM file and you want to go back to FASTQ, use Picard SamToFastq.
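
For example, a minimal conversion sketch using Picard (the jar location, file names, sample name, and read-group name below are placeholders to replace with your own):

```
# Convert paired-end FASTQ files into an unaligned BAM suitable for import
java -jar picard.jar FastqToSam \
    FASTQ=sample_R1.fastq.gz \
    FASTQ2=sample_R2.fastq.gz \
    OUTPUT=sample.unaligned.bam \
    SAMPLE_NAME=my_sample \
    READ_GROUP_NAME=lane1

# Convert a GMS-generated BAM back into paired-end FASTQ files
java -jar picard.jar SamToFastq \
    INPUT=sample.bam \
    FASTQ=sample_R1.fastq \
    SECOND_END_FASTQ=sample_R2.fastq
```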

The GMS manuscript is focused on somatic analysis. What about germline analysis of single samples?

A typical tumor analysis involves running the reference-alignment pipeline for the normal and tumor samples and then using those results as inputs to the somatic-variation pipeline. If you have samples that you would like to analyze for germline variants, simply define one reference-alignment model per sample and use its results directly. The reference-alignment build result directories contain annotated VCFs of variants.
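
As a rough illustration (treat the option names and values below as placeholders; the exact options depend on your GMS version, so consult genome model define reference-alignment --help):

```
# Define a single reference-alignment model for one sample -- subject,
# processing profile, and reference names are illustrative placeholders
genome model define reference-alignment \
    --subject="MY_SAMPLE_NAME" \
    --processing-profile="MY_REFALIGN_PROFILE" \
    --reference-sequence-build="GRCh37-lite-build37"

# Start a build of that model, substituting the model ID or name reported above
genome model build start "MODEL_ID_OR_NAME"
```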

Can I just download the HCC1395 cell line data?

We provide a complete set of high-quality, 2x100 bp Illumina sequence data for HCC1395 (tumor) and HCC1395/BL (matched normal). This consists of whole genome, whole exome, and whole transcriptome data for both cell lines, with 12 lanes of HiSeq 2000 data in total. Yes, you are welcome to use this data for your own purposes; please cite the GMS manuscript if you do. Details on how to download the data can be found here: HCC1395 WGS, Exome, and RNA-Seq Data

How do I get my own data into the GMS?

Please refer to the data import tutorial for detailed examples of how to enter your raw data and the necessary metadata into the system.

Are the pipelines you make available really the ones that produce the data for TGI publications and consortium projects?

Yes. As part of our commitment to the scientific community, and as our interpretation of fulfilling our grant obligations as a large-scale sequencing center, we make both our data and the means of producing it available. It is our strongly held belief that this kind of transparency should be the cornerstone of publicly funded research.

I just want to do exome analysis. Can I do that?

Yes. You can process exome or WGS data alone through the reference-alignment, somatic-variation, and clin-seq pipelines. Similarly, you can process RNA-seq data alone through the rna-seq, differential-expression, and clin-seq pipelines. Once you have your samples and instrument data imported, you can also use genome model clin-seq advise. This tool will attempt to guide you through the process of defining and running all of the pipelines suitable for the data you have.
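
For example (shown only as a sketch; the required options depend on your data, so check the built-in help first):

```
# List the options that clin-seq advise accepts in your installation
genome model clin-seq advise --help

# A hypothetical invocation for one individual -- the option name below is an
# assumption, so use whatever the help output actually lists
genome model clin-seq advise --individual="MY_INDIVIDUAL_NAME"
```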

What minimum compute resources do you recommend?

The Install Documentation has a detailed description of resource requirements. Please note that genome analysis is computationally intensive and involves very large data files. Even the demonstration analysis with down-sampled data will require considerable CPU, memory, and storage. That said, we have been able to get the demonstration analysis to succeed on a 2013 MacBook Pro laptop (OS X 10.9.2) within a VirtualBox virtual machine allocated 3 CPUs, 12 GB of memory, and ~250 GB of disk space. We recommend testing with more resources than this if possible.
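
If you use VirtualBox, an allocation matching that test setup can be applied from the host's command line (the VM name gms-demo is a placeholder):

```
# Give the VM 3 CPUs and 12 GB of RAM, matching the test setup described above
VBoxManage modifyvm "gms-demo" --cpus 3 --memory 12288

# Confirm the settings took effect
VBoxManage showvminfo "gms-demo" | grep -E "Memory size|Number of CPUs"
```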

How does the GMS compare to other genome analysis software frameworks like GATK, Picard, and Firehose?

The GMS is a pipeline and data management framework, while GATK and Picard are lower-level tool collections used as parts of pipelines. Picard and GATK are somewhat like software libraries turned inside-out: their components are launched as sub-commands rather than called directly. Several GMS pipelines use core Picard functionality, and GATK can be used in GMS pipelines as well by users with a license. Firehose is used for analyses that parallel GMS pipelines not discussed in the publication, such as mutational significance.

How does the GMS compare to Galaxy?

Galaxy allows users to graphically define simple workflows. Even the simplest GMS pipelines are difficult to express in Galaxy, however, because they dynamically generate workflow details based on metadata associated with the reads and the sample. Because Galaxy is an open-source project, we monitor it carefully for opportunities to integrate in the future.

How does the GMS compare to SeqWare?

SeqWare is another excellent open-source project for processing NGS data. It is based on Red Hat/CentOS rather than Ubuntu, and its components are written in Java rather than scripting languages. The aim of the GMS is for bioinformatics staff to be able to extend pipelines and prototype new ones without necessarily requiring a rewrite by an experienced engineer, though such a rewrite is still often a good choice. Your choice of system may depend on your preferred software environment and on whether you specifically want to run the TGI pipelines. Data produced by either system should be usable in both.

Can I use LSF instead of OpenLava?

Yes! Nothing in the GMS depends on proprietary software, but Platform LSF (IBM) is used internally at The Genome Institute, and we recommend it if your cluster gets large.

Is the GMS Secure?

The GMS runs on Linux and can be made as secure as anything on that platform, including most leading sites on the internet. If installed on an existing system, it is as secure as that system. You must configure Linux security yourself, so if you are not familiar with security standards, consult someone who is.

Can I use SGE, PBS, or (insert my favorite job scheduler) instead of OpenLava/LSF?

TGI is in the process of transitioning to a revamped workflow system, based on Petri nets, which will also be open-sourced independently. It should support a variety of job schedulers. If you are impatient and want to fork us on GitHub, we would love to accept a patch.

What languages can I write components in?

A component in the GMS can be written in any language. The “glue” layer is in Perl, primarily because it is broadly accessible to the bioinformatics community. Many tools are a one-page Perl module with a hash of metadata, and just enough code to shell out to Java, C, Python, Scala, Ruby, or R.

Why aren’t the tool modules in XML or YAML?

When the wrapper around a tool really is just data, a hash in the associated Perl module is just as simple as XML, and probably simpler. When a small amount of logic is required to wrap a tool, embedding code in XML or YAML becomes more cumbersome and doesn’t provide much value in return. We find that these tool wrappers are often much more complicated than one might initially expect.

You have a web interface for process monitoring. Why use a command-line interface for running tools/builds?

A bioinformatician who cannot use the command line is unlikely to be able to perform leading-edge analysis; the goal of the GMS is to empower the people who can. Other tools aim to let staff with lower skill levels perform simpler tasks, but that is not the target audience of the GMS.

Why do you use UUIDs as IDs in the database, instead of integers like most databases do?

A UUID ensures that two labs can install the GMS separately and later transfer data between installations without ID collisions. It also avoids an extra round trip to the database just to obtain a new ID.
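
As a quick illustration, an ID can be minted entirely on the client side (uuidgen ships with most Linux distributions and with macOS):

```
# Generate a collision-resistant ID locally, with no round trip to the database
uuidgen
# Example output (format only, not a real GMS ID): 3f1c2a9e-7b44-4d8f-9a21-0c5e8d6b7f10
```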

I want to manually pre-process my reads before analysis. What is the best thing to do?

Write a read preprocessor and add it to the pipeline, then create a processing profile that uses it. The more processing you keep inside the system, rather than uploading reads that you simply have to remember are preprocessed duplicates of others, the clearer your data management will be.

Why can’t I edit processing profiles?

To “change” a processing profile, you create a new one; the old and new profiles will have different IDs, and the old one can be deleted once no models in the system use it. Only the name is editable. The same principle holds for builds of genome models: they can be replaced, but not changed. Immutability, popularized by the recent resurgence of functional programming languages, is not limited to those languages; it is key to concurrency and scaling, as well as to trustworthy results.

What is “shortcutting”?

A premise of the GMS is that the same work should never be done twice with the same inputs, the same tools, and the same parameters. When two attempts to do the same work are initiated, one does the work, and both link to the results and proceed. This means that if you re-model a genome in a slightly different fashion, the only work actually repeated is the steps that absolutely require it. To make this possible, all steps in our pipelines control for stochastic effects by, for example, seeding random number generators and preventing external input.
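
The idea can be illustrated with a toy sketch (this is not the GMS implementation, just the general input-keyed caching pattern; the tool, parameters, and paths are placeholders):

```
# Toy sketch of shortcutting: key each result on a digest of the tool,
# its parameters, and its input files, then reuse prior results when found.
tool="bwa-0.5.9"                          # illustrative tool/version
params="-t 4"                             # illustrative parameters
input_md5=$(md5sum input.bam | cut -d' ' -f1)

key=$(echo "$tool $params $input_md5" | md5sum | cut -d' ' -f1)
result_dir="/gms/results/$key"            # placeholder results root

if [ -d "$result_dir" ]; then
    echo "Shortcutting: reusing existing result at $result_dir"
else
    echo "No prior result: run the tool, then register outputs under $result_dir"
fi
```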

Can I use a different database besides PostgreSQL?

TGI originally ran on Oracle but migrated to PostgreSQL as part of this release. We recommend using PostgreSQL, since the intensive data access happens mostly outside of the RDBMS layer anyway.

Why don’t you put alignments and variants into PostgreSQL?

BAM files, along with bgzip-compressed, tabix-indexed VCF files, are high-performance storage formats purpose-built for the kind of data these pipelines use. You will see at least an order-of-magnitude drop in performance by putting this kind of “big data” into an RDBMS layer. Even for data that does fit in an RDBMS (EnsEMBL annotation data, for instance), exporting it to a flat file allows the tools to scale.
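
For example, region queries against a compressed VCF are fast precisely because of this purpose-built indexing (bgzip and tabix come with htslib/samtools; the file name and region below are placeholders):

```
# Compress and index a VCF with bgzip and tabix
bgzip variants.vcf                      # produces variants.vcf.gz
tabix -p vcf variants.vcf.gz            # produces variants.vcf.gz.tbi

# Retrieve only the records overlapping a region -- no database required
tabix variants.vcf.gz 17:7571720-7590868
```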

Why don’t you use Hadoop/HDFS?

A large number of the leading tools used in genome analysis operate on regular files. Supporting both would require one copy of the data in HDFS and another on a regular filesystem, and HDFS itself benefits greatly from further replication within that system. Because storage cost is significant, we have yet to reach a point where the value of having the data in Hadoop is worth the cost. This may change in the future, particularly for cross-genome analysis of final variant calls.

What is the best way to share data between GMS systems?

This can be handled by a standard Unix administrator. Make the data you intend to share mountable, then export the metadata from the database with “genome model export metadata” and send it to the intended recipient. You can revoke access to the low-level data with standard permission changes.
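
A rough sketch of what that can look like (the NFS export path, host name, and the arguments to the export command are placeholders; check genome model export metadata --help for the options your version accepts):

```
# 1. Make the build data mountable for the recipient (example: an NFS export)
echo "/gms/fs  recipient.example.edu(ro,no_subtree_check)" >> /etc/exports
exportfs -ra

# 2. Export the corresponding metadata from the database and send it along
#    (model identifier and output file are illustrative placeholders)
genome model export metadata "MODEL_ID_OR_NAME" > model-metadata.dump
```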

How is the code organized at a high level?

The gms repository is just the installer. Most pipeline code is in genome/gms-core, with thousands of modules and over a million lines of source. The low-level workflow infrastructure is in genome/tgi-workflow. The component layer and ORM are in genome/UR. The web interface is in genome/genome_rails_prod. Other applications are packaged under the genome or genome-vendor organizations.
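
If you want to browse the layout directly, the repositories named above can be cloned individually (URLs follow the standard GitHub pattern for the genome organization):

```
# Installer, pipeline code, workflow infrastructure, and component layer/ORM
git clone https://github.com/genome/gms.git
git clone https://github.com/genome/gms-core.git
git clone https://github.com/genome/tgi-workflow.git
git clone https://github.com/genome/UR.git
```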

Has the GMS been tested on hardware outside WashU?

Yes. We have tested installation of the GMS on other platforms, such as Docker images and OpenStack, which are portable. This is still a work in progress; documentation about these platforms can be found here.

I found something that I think could be done better. What should I do?

Fork us on GitHub and send us a pull request. We would love your input.

I would like to collaborate with the Genome Institute on a cancer genome analysis

Please feel free to contact the corresponding authors of the GMS paper to discuss possible collaborations.

How was this work funded?

The development of the Genome Modeling System was funded by an NHGRI Large Scale Sequencing and Analysis Center grant (U54 HG003079). Additional funding to make this system usable by the community was provided by the NHGRI Genome Sequencing Informatics Tools (GS-IT) Program (U01 HG006517).