This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Quick VM Tour

Malachi Griffith edited this page Aug 25, 2015 · 31 revisions

Introduction

The simplest way to get a quick sense of what the GMS is all about is to try loading a virtual machine where the GMS has already been installed and configured. When the GMS virtual machine loads you will be logged in as the user genome (with a password that is also genome). All installation and configuration steps will be complete and demonstration data will be in place. The system can be immediately tested by running genotype-microarray, reference-alignment, somatic-variation, rna-seq, differential-expression, and clin-seq pipelines on this demonstration data.

The virtual machine is a self-contained sandbox. The idea is for you to take a short tour of the GMS, execute some simple GMS commands, view some features in the GMS web-viewer, etc. When you are done, you can remove the virtual machine and your system will be completely unaffected by the test.

Please note that the pre-configured GMS is meant for simple demonstration purposes only. If you wish to use the system in earnest for large-scale analysis you will want to identify appropriate hardware and adopt one of the installation methods described in the Install Manual.

Finally, to keep this tutorial simple, many details are left out. These details can be found throughout the GMS manuscript and elsewhere in the GMS wiki. Examples include the Installation Guide, the Location and Description of the HCC1395 Data, the FAQ page, the Guide to Importing your own Data, the Reference Manual for useful Genome Commands, and the Beginners Guide to Demonstration analysis.

System Requirements

Genome analysis is computationally intense and involves very large data files. Even the following demonstration analysis with down-sampled data will require considerable CPU, memory, and storage. That being said, we have been able to get the demonstration analysis to succeed on a 2013 MacBook Pro laptop (OS X 10.9.2) within a VirtualBox virtual machine that was allocated 3 CPUs, 12 GB of memory, and ~250 GB of disk space. We recommend testing with resources greater than this if possible.
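Before you start, you might check that the host has enough free disk space. The following is a minimal sketch, not part of the GMS: the helper names `free_gb` and `has_room` are hypothetical, and the 198 GB threshold is an assumption that simply adds the ~48 GB archive and the ~150 GB unpacked folder described in Steps 2 and 3. Adjust it for your own situation.

```shell
# Assumed threshold: ~48 GB archive plus ~150 GB unpacked (figures from this guide).
REQUIRED_GB=198

# Print the free space, in whole GB, on the filesystem holding directory $1.
free_gb() {
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed if directory $1 has at least $2 GB free.
has_room() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

if has_room . "$REQUIRED_GB"; then
    echo "OK: at least ${REQUIRED_GB} GB free in the current directory"
else
    echo "WARNING: only $(free_gb .) GB free here; about ${REQUIRED_GB} GB is needed"
fi
```

Run it from the directory where you plan to download and unpack the image. The archive can be deleted after unpacking, so the peak requirement only applies while both copies exist.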

Steps

Step 1. Install VirtualBox

The virtual machine was created with VirtualBox version 4.3.8. VirtualBox is open-source and freely available for the Mac, Linux, and Windows platforms. You should install VirtualBox version 4.3.8 or later. Download and install VirtualBox for your system here:
https://www.virtualbox.org/wiki/Downloads

Step 2. Download a Pre-configured GMS VirtualMachine Image

The pre-configured virtual machine image contains the GMS installation, a fully functional Ubuntu 12.04 Precise operating system, annotation databases, reference genome sequences, example data and much more. The pre-configured virtual machines are available here:
https://xfer.genome.wustl.edu/gxfer1/project/gms/vms/

The image files are large (~48 GB) and will take some time to download. You should therefore use a download agent that will allow the download to resume if it is interrupted. For example, at a terminal you could use wget:

wget -c --no-check-certificate https://xfer.genome.wustl.edu/gxfer1/project/gms/vms/GMS_VM_V1.tar.gz

Or, using curl instead:

curl -C - -O https://xfer.genome.wustl.edu/gxfer1/project/gms/vms/GMS_VM_V1.tar.gz
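Because the download may be interrupted and resumed several times, you might confirm that the archive is complete before unpacking it. This sketch uses `gzip -t`, which checks the integrity of the compressed stream without writing anything to disk. The helper name `archive_ok` is hypothetical, and on a file of this size the check may itself take a while.

```shell
# Succeed only if the gzip stream in file $1 is complete and uncorrupted.
archive_ok() {
    gzip -t "$1" 2>/dev/null
}

ARCHIVE=GMS_VM_V1.tar.gz
if [ -f "$ARCHIVE" ]; then
    if archive_ok "$ARCHIVE"; then
        echo "$ARCHIVE passed the gzip integrity test"
    else
        echo "$ARCHIVE looks truncated or corrupt; re-run the download to resume it"
    fi
fi
```

If the test fails, re-running the same wget or curl command should resume the download rather than start over.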

Step 3. Unpack the Image

Use your favorite decompression software to unpack the virtual machine. For example, in a Mac or Linux terminal you could use:

tar -zxvf GMS_VM_V1.tar.gz

On a Mac you may also be able to simply double-click the archive file. This will unpack to a folder of ~150 GB and may take some time (~30 minutes) to complete.

Step 4. Import the Image

Open VirtualBox and add the GMS virtual machine by selecting the GMS .vbox file as follows.

Within VirtualBox, use the Machine -> Add option:

Find the GMS .vbox file and open it:
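If you prefer the command line, the same import could be sketched with `VBoxManage registervm`, which registers an existing .vbox file with VirtualBox. The helper `find_vbox` and the `UNPACK_DIR` variable are hypothetical names used only for illustration; point `UNPACK_DIR` at wherever you unpacked the archive.

```shell
# Print the path of the first .vbox file found under directory $1 (empty if none).
find_vbox() {
    find "$1" -type f -name '*.vbox' 2>/dev/null | head -n 1
}

# UNPACK_DIR is a hypothetical location; it defaults to the current directory.
UNPACK_DIR=${UNPACK_DIR:-.}
VBOX_FILE=$(find_vbox "$UNPACK_DIR")

if [ -n "$VBOX_FILE" ] && command -v VBoxManage >/dev/null 2>&1; then
    # Pass an absolute path so the result does not depend on the working directory.
    ABS_PATH="$(cd "$(dirname "$VBOX_FILE")" && pwd)/$(basename "$VBOX_FILE")"
    VBoxManage registervm "$ABS_PATH"
    VBoxManage list vms    # the GMS machine should now appear in this list
else
    echo "No .vbox file found under $UNPACK_DIR, or VBoxManage is not on the PATH"
fi
```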

Step 5. Configure system resources to be used for the virtual machine

Depending on the resources available on your system you may want to adjust resource usage. For example, you might adjust the base memory, video memory, CPUs, and network connection type. To adjust each of these and more, select the machine GMS_VM_V1 and press the Settings button at the top left of the VirtualBox interface. As a general rule you might allocate 50-75% of system resources to your virtual machine. For example: 3 of 4 CPUs, 8 of 12 GB base memory, and 64 of 128 MB video memory. If you have more memory and CPUs available, we recommend using them, but make sure you leave enough resources for your host to continue normal operation.

General settings:

Number of processors:

Base memory:

Video memory:

Network (set to NAT by default, but Bridged Adapter may be faster):
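The 50-75% rule of thumb above can be worked through at a terminal. This is a sketch only: `suggest_alloc` and `vm_settings_cmd` are hypothetical helpers, and the command is printed rather than executed so you can review it first. `VBoxManage modifyvm` with `--cpus`, `--memory`, and `--vram` is the command-line counterpart of the Settings dialog, and it expects the virtual machine to be powered off.

```shell
# Integer share of a resource: suggest_alloc TOTAL PERCENT
suggest_alloc() {
    echo $(( $1 * $2 / 100 ))
}

# Print (rather than run) a VBoxManage command for the given CPU count,
# base memory in MB, and video memory in MB.
vm_settings_cmd() {
    echo "VBoxManage modifyvm GMS_VM_V1 --cpus $1 --memory $2 --vram $3"
}

# Example host from this guide: 4 CPUs and 12 GB (12288 MB) of memory.
CPUS=$(suggest_alloc 4 75)         # 3
MEM_MB=$(suggest_alloc 12288 67)   # 8232
vm_settings_cmd "$CPUS" "$MEM_MB" 64
```

Substitute the totals for your own host, then copy the printed command into a terminal once you are happy with the values.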

Step 6. Start the GMS system

Select the machine GMS_VM_V1 and press the Start -> button at the top left of the VirtualBox interface. The machine will boot and you will be automatically logged in as the user genome. If you are ever prompted for a password, remember that both the username and password for the system are genome. When the machine boots, you may be prompted with some messages about keyboard and mouse settings. You can safely dismiss these.

Logging into the GMS

Step 7. Open the GMS web-viewer and explore demonstration models, processing-profiles, instrument-data, etc.

Open the Firefox browser by clicking the orange and blue icon on the left. Firefox has been pre-configured to open the GMS web-viewer, GitHub wiki, and GitHub source code pages in three separate tabs.

Step 8. Perform some initial sanity checks of the system

Open a Terminal window by clicking the black icon on the left. Then execute the following commands to test various basic components of the system:

lsid                      # You should see the openlava cluster identification
lsload                    # You should see a report of available resources
bjobs                     # You should not have any unfinished jobs yet
bsub 'sleep 60'           # You should be able to submit a job to openlava (run bjobs again to see it)
bhosts                    # You should see one host
bqueues                   # You should see four queues
genome disk group list    # You should see four disk groups
genome disk volume list   # You should see at least one volume for your local drive
genome sys gateway list   # You should see two gateways, one for your new home system and one for the test data "GMS1"
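If you would rather run the non-interactive checks in one pass, a small wrapper along these lines could report which commands exit cleanly. `run_checks` is a hypothetical helper, not a GMS command. It only looks at exit status, so it will not confirm the expected counts (one host, four queues, and so on); you would still read the output of the individual commands for that.

```shell
# Run each command line given as an argument; print OK or FAIL per command
# and return non-zero if any of them failed.
run_checks() {
    failed=0
    for cmd in "$@"; do
        if sh -c "$cmd" >/dev/null 2>&1; then
            echo "OK    $cmd"
        else
            echo "FAIL  $cmd"
            failed=1
        fi
    done
    return "$failed"
}

# Only attempt the checks where the GMS tools are actually installed.
if command -v genome >/dev/null 2>&1; then
    run_checks lsid lsload bhosts bqueues \
        "genome disk group list" \
        "genome disk volume list" \
        "genome sys gateway list"
fi
```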

Step 9. Perform some basic queries of the database

# list the metadata that is already present in the database:
genome taxon list
genome individual list
genome sample list
genome library list
genome instrument-data list solexa

# list the pre-defined models (no results yet ... you will launch these and generate results):
genome model list

# view the processing profiles (pipeline descriptions) associated with those models:
genome processing-profile view --processing-profile='Default Reference Alignment'
genome processing-profile view --processing-profile='Default Somatic Variation Exome'
genome processing-profile view --processing-profile='Default Somatic Variation WGS'
genome processing-profile view --processing-profile='Default Ovation V2 RNA-seq'
genome processing-profile view --processing-profile='cuffcompare/cuffdiff 2.0.2 protein_coding only'

Step 10. Start some test builds and monitor their progress

Open a Terminal window by clicking the black icon on the left. Then execute the following command to view models that have already been defined in the system for demonstration purposes:

genome model list

Start the genotype-microarray builds for tumor and normal as follows:

genome model build start 'hcc1395-normal-snparray'
genome model build start 'hcc1395-tumor-snparray'

You can monitor progress of ongoing analysis runs in several ways. For example, you can load the GMS web-viewer and go to the builds tab. Or you can view the status of all builds in a Terminal using the command genome model build list. Or you can view a much more detailed status of a running build using the following command for the build of interest (replacing '$build_id' with your own build ID):

genome model build view '$build_id'
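If you want a terminal to wait until a build finishes, a polling loop could be sketched as follows. Several things here are assumptions rather than documented behaviour: the helper names `extract_status` and `wait_for_model`, the use of `status` as a `--show` field, and the exact status words being matched. Check the output of genome model build list on your own system and adjust the pattern if it differs.

```shell
# Read `genome model build list` output on stdin and print the first status
# word found, lower-cased. The status names matched here are assumptions.
extract_status() {
    tr '[:upper:]' '[:lower:]' |
        grep -o -E 'succeeded|failed|running|scheduled|abandoned|unstartable' |
        head -n 1
}

# Poll the status reported for model $1 every $2 seconds until it finishes.
# Returns 0 on success and 1 on a failed, abandoned, or unstartable build.
wait_for_model() {
    while :; do
        status=$(genome model build list --filter model.name="$1" --show status | extract_status)
        echo "$(date '+%H:%M:%S') $1: ${status:-unknown}"
        case "$status" in
            succeeded) return 0 ;;
            failed|abandoned|unstartable) return 1 ;;
        esac
        sleep "$2"
    done
}

# Example (inside the VM): wait_for_model hcc1395-normal-snparray 60
```

Note that the sketch looks only at the first status it finds, so a model with several builds (for example an abandoned build plus a newer one) may need a narrower filter. An unrecognized status keeps the loop running, so stop it with Ctrl-C if needed.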

You can find the genotype-microarray results files as follows:

genome model build list --filter model.name='hcc1395-normal-snparray' --show data_directory

Once the genotype-microarray builds are done, launch the reference-alignment builds for the exome data as follows (you may want to do one at a time if you are running on a small machine like a laptop):

genome model build start 'hcc1395-normal-refalign-exome'
genome model build start 'hcc1395-tumor-refalign-exome'

Once again you can view the progress of this build as follows:

genome model build view '$build_id'

As above, you can find the results files for the reference-alignment pipeline, including BAM files and germline variants in VCF format, as follows:

genome model build list --filter model.name='hcc1395-normal-refalign-exome' --show model.name,data_directory
genome model build list --filter model.name='hcc1395-tumor-refalign-exome' --show model.name,data_directory

To get the final, merged, sorted, duplicate-marked BAM from the tumor exome alignment, you can use the following method:

genome model list --filter name='hcc1395-tumor-refalign-exome' --show id,name,last_complete_build.merged_alignment_result.bam_path

Step 11. Dealing with Failed Builds

If you run genome model build view '$build_id' or genome model build list and find that a build has a status of "failed", you may have had a sporadic failure due to a disk access problem or insufficient memory. You can use genome model build abandon '$build_id' to abandon the build. You can then start a new build of the model using genome model build start '$model_id'. Steps that completed successfully should shortcut automatically and the analysis should continue on past the failed step in the previous run attempt.

For more tips on troubleshooting failed builds, see The Beginner's Guide to the Demonstration Analysis.
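The abandon-then-restart sequence described in this step could be wrapped in a small function so the two commands always run together. `rebuild` is a hypothetical helper; it uses only the genome model build abandon and genome model build start commands shown above, and it assumes the abandon command runs without asking for confirmation.

```shell
# Abandon failed build $1, then start a fresh build of model $2.
# The new build is only started if the abandon step succeeds.
rebuild() {
    genome model build abandon "$1" &&
        genome model build start "$2"
}

# Example (inside the VM), using your own build ID and model name or ID:
#   rebuild "$build_id" 'hcc1395-normal-snparray'
```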

Step 12. Completing the Demonstration Analysis

Next, run the whole-genome reference-alignment builds:

genome model build start "name='hcc1395-normal-refalign-wgs'"
genome model build start "name='hcc1395-tumor-refalign-wgs'"

While those are building, you can run the RNA-Seq models:

genome model build start "name='hcc1395-normal-rnaseq'"
genome model build start "name='hcc1395-tumor-rnaseq'"

To build the WGS somatic and exome somatic models, wait until the ref-align models above complete, and then run:

genome model build start "name='hcc1395-somatic-exome'"
genome model build start "name='hcc1395-somatic-wgs'"

To build the differential expression models, wait until the rna-seq models above complete, and then run:

genome model build start "name='hcc1395-differential-expression'"

When all of the above complete, the MedSeq (clin-seq) pipeline can be run:

genome model build start "name='hcc1395-clinseq'"
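The ordering in this step can be summarized as three stages, each of which should finish before the next begins. The sketch below encodes that order; `stage_models` and `start_stage` are hypothetical helpers, and grouping the somatic and differential-expression models into a single second stage is a simplification (each really only depends on its own inputs).

```shell
# Print the model names that belong to stage $1 of the demonstration analysis.
stage_models() {
    case "$1" in
        1) echo "hcc1395-normal-refalign-wgs hcc1395-tumor-refalign-wgs hcc1395-normal-rnaseq hcc1395-tumor-rnaseq" ;;
        2) echo "hcc1395-somatic-exome hcc1395-somatic-wgs hcc1395-differential-expression" ;;
        3) echo "hcc1395-clinseq" ;;
    esac
}

# Start a build for every model in stage $1.
start_stage() {
    for model in $(stage_models "$1"); do
        genome model build start "name='$model'"
    done
}

# Example (inside the VM): run `start_stage 1`, wait for those builds to
# succeed, then `start_stage 2`, and finally `start_stage 3`.
```

On a small machine such as a laptop you may prefer to start the models in a stage one at a time rather than all together.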

Step 13. Exploring the Results

For a detailed description of results you might refer to the Location and Description of Results Files.

Step 14. More Advanced Examples

For many more examples, refer to the Reference Manual for useful Genome Commands.
