This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Beginner's Guide to Data Import

Avinash Ramu edited this page Mar 19, 2015 · 39 revisions

This guide describes how to import instrument data into GMS, using the downsampled TST1 dataset (TST1ds) as an example. Before continuing, you must have an installed version of the GMS (see Installation-Types-Overview for options) and have already run the ./setup/prime-system.pl command as documented in the installation instructions, or be using the Pre-configured Virtual Machine.

NOTE: If you followed the guides above (and depending on the options chosen), a pre-imported version of the TST1 data will already be in place. If you wish, you may proceed directly to the Beginner's Guide to the Demonstration Analysis. This tutorial walks you through the import process yourself to illustrate how you would bring your own data into the system. At the end of this tutorial you will have two sets of data for TST1 (HCC1395) available for analysis. The individuals, samples, libraries, and instrument data imported here have the suffix "ds" to distinguish them from the pre-imported data.

You can either run the scripted version of these steps from the gms repo, ~/gms/downsampled-demo-data/example-import-data.sh, or follow the instructions in the rest of this guide, which describe the commands that example-import-data.sh executes. Even if you choose not to run example-import-data.sh, it provides a complete example of the commands needed to import data into GMS.

To simplify this process, a new tool is in development that will replace this series of commands with a single command that takes a spreadsheet-style file as input.

Download the TST1ds Dataset

example-import-data.sh lines 31-36:

INSTRUMENT_DATA_DIRECTORY='.'

echo Downloading downsampled instrument data to $INSTRUMENT_DATA_DIRECTORY
wget --no-check-certificate --no-directories --recursive --continue --no-parent --accept='*.bam' \
  --directory-prefix "$INSTRUMENT_DATA_DIRECTORY" \
  https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/bams/hcc1395_1tenth_percent/
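
Once the download finishes, it can be worth confirming that BAM files actually arrived before moving on. The following is a minimal sketch and is not part of example-import-data.sh; the function name count_bams is made up for illustration.

```shell
# Hypothetical helper (not part of example-import-data.sh):
# count the non-empty *.bam files directly inside a directory.
count_bams() {
    # -size +0 skips zero-byte files left behind by an interrupted download
    find "$1" -maxdepth 1 -type f -name '*.bam' -size +0 | wc -l | tr -d ' '
}

INSTRUMENT_DATA_DIRECTORY="${INSTRUMENT_DATA_DIRECTORY:-.}"
echo "Found $(count_bams "$INSTRUMENT_DATA_DIRECTORY") BAM file(s) in $INSTRUMENT_DATA_DIRECTORY"
```

If the count is lower than expected, rerunning the wget command should pick up where it left off, since it is invoked with --continue.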

Import the TST1ds Dataset

In order to import an instrument data file, there must exist a record in GMS of the library which was sequenced, the sample from which the library was derived, and the individual from which the sample was collected. So, in order to import a set of instrument data files, an individual, one or more samples, and one or more libraries must first exist or be created in GMS.

Set Up the Individual

example-import-data.sh lines 48-54:

INDIVIDUAL='H_NJ-HCC1395ds'
genome individual create                                                        \
    --name="$INDIVIDUAL"                                                        \
    --upn='HCC1395ds'                                                           \
    --common-name="TST1ds"                                                      \
    --gender=female                                                             \
    --taxon="name=human"

The value of the --name argument is arbitrary, but it should be something representative of the individual because you'll reference this name later when adding samples for this individual to GMS.

Set Up the Samples

example-import-data.sh lines 78-85:

SAMPLE_TUMOR='H_NJ-HCC1395ds-HCC1395'
genome sample create                                                            \
    --extraction-type='genomic dna'                                             \
    --source="name=$INDIVIDUAL"                                                 \
    --name=$SAMPLE_TUMOR                                                        \
    --common-name='tumor'                                                       \
    --extraction-label='HCC1395'                                                \
    --tissue-desc='epithelial'

The --source parameter references the individual created in the previous step. Notice how the value of the --source parameter here matches the --name parameter given to genome individual create. The argument to the --extraction-type parameter can be either rna or genomic dna.

Repeat the genome sample create command for each sample in your dataset. The TST1ds dataset has four samples. See example-import-data.sh lines 87-112.
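
When adapting this step to your own dataset, one way to avoid copy-and-paste mistakes is to keep the per-sample values in a small table and generate the commands from it. The sketch below only prints the commands rather than running them. Only the first row is taken from this guide; the other three rows, including the 'normal' common name, are placeholders assumed for illustration, and the function name print_sample_commands is hypothetical. The real values for TST1ds are in example-import-data.sh lines 87-112.

```shell
# Sketch only: prints one `genome sample create` command per sample
# instead of running it. Only the first row comes from this guide; the
# other three rows are PLACEHOLDERS (real values: example-import-data.sh
# lines 87-112).
INDIVIDUAL='H_NJ-HCC1395ds'

# Columns: name|common-name|extraction-type|extraction-label|tissue-desc
SAMPLE_TABLE='H_NJ-HCC1395ds-HCC1395|tumor|genomic dna|HCC1395|epithelial
PLACEHOLDER-normal-dna|normal|genomic dna|PLACEHOLDER|PLACEHOLDER
PLACEHOLDER-tumor-rna|tumor|rna|PLACEHOLDER|PLACEHOLDER
PLACEHOLDER-normal-rna|normal|rna|PLACEHOLDER|PLACEHOLDER'

print_sample_commands() {
    printf '%s\n' "$SAMPLE_TABLE" |
    while IFS='|' read -r name common_name extraction_type label tissue; do
        # Each value is wrapped in single quotes so the printed line
        # can be reviewed and then pasted into a shell.
        echo "genome sample create" \
            "--extraction-type='$extraction_type'" \
            "--source='name=$INDIVIDUAL'" \
            "--name='$name'" \
            "--common-name='$common_name'" \
            "--extraction-label='$label'" \
            "--tissue-desc='$tissue'"
    done
}

print_sample_commands
```

Printing first and executing later makes it easy to review every command before any record is created in GMS.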

Set Up the Libraries

example-import-data.sh lines 135-141:

LIBRARY_TUMOR_1='H_NJ-HCC1395ds-HCC1395-lig2-lib1'
genome library create                                                           \
  --name="$LIBRARY_TUMOR_1"                                                     \
  --sample="$SAMPLE_TUMOR"                                                      \
  --protocol='Illumina Library Construction'                                    \
  --original-insert-size='271'                                                  \
  --library-insert-size='390'

The genome library create command creates a record of a library in GMS. The library links to the sample from which it was created, so it is important that the argument to --sample here matches the argument to --name given to the genome sample create command above, to indicate which sample record in GMS this library record should link to.

When the extraction type of a library's sample is rna, the --transcript-strand parameter is also specified; it does not appear in the example above because that library's sample is genomic dna. The --transcript-strand parameter accepts one of three values: 'unstranded', 'firststrand', or 'secondstrand'. This parameter should not be used for samples with extraction type genomic dna.
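
As an illustration of the RNA case, the sketch below checks a strand value against the three values listed above and then prints (rather than runs) a genome library create command that includes --transcript-strand. The sample and library names are placeholders, the helper valid_transcript_strand is hypothetical, and the other options shown in the genomic dna example are omitted here for brevity.

```shell
# Hypothetical helper: accept only the three --transcript-strand values
# named in this guide.
valid_transcript_strand() {
    case "$1" in
        unstranded|firststrand|secondstrand) return 0 ;;
        *) return 1 ;;
    esac
}

# PLACEHOLDER names; substitute an RNA sample and library from your dataset.
SAMPLE_RNA='PLACEHOLDER-tumor-rna'
LIBRARY_RNA='PLACEHOLDER-tumor-rna-lib1'
STRAND='unstranded'

if valid_transcript_strand "$STRAND"; then
    # Printed, not executed, so the sketch is safe to run anywhere.
    echo "genome library create" \
        "--name='$LIBRARY_RNA'" \
        "--sample='$SAMPLE_RNA'" \
        "--transcript-strand='$STRAND'"
else
    echo "Unsupported --transcript-strand value: $STRAND" >&2
fi
```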

If your library is a capture library, create it just as you would a library for genomic dna. When you import the instrument data, you will have the opportunity to specify a capture set during import.

Repeat the genome library create command for each library in your dataset. The TST1ds dataset has ten libraries in all. See example-import-data.sh lines 143-223 for the code to create the other nine libraries.

Import the Instrument Data

example-import-data.sh lines 249-254:

INSTRUMENT_DATA_DIRECTORY='.'
LIBRARY_TUMOR_1='H_NJ-HCC1395ds-HCC1395-lig2-lib1'
genome instrument-data import basic                                             \
    --description='tumor wgs 1'                                                 \
    --import-source-name='TST1ds'                                               \
    --instrument-data-properties='clusters=188429464'                           \
    --source-files="$INSTRUMENT_DATA_DIRECTORY/gerald_D1VCPACXX_1.bam"          \
    --library="$LIBRARY_TUMOR_1"

The genome instrument-data import basic command imports sequence reads. This step depends on a library already having been created, as in the previous step, and it requires that the argument to --library match the argument to --name given to genome library create. This step imports reads from a BAM file, which in this example is named gerald_D1VCPACXX_1.bam and is located in the current working directory.

Repeat the genome instrument-data import basic command for each instrument data file in your dataset. The TST1ds dataset has twelve instrument data BAM files. See example-import-data.sh lines 256-345 for the code to import the remaining files.

Working With Instrument Data in FASTQ Format

When adapting this import procedure to import your instrument data, you may need to convert the data you have from FASTQ format to BAM format. The FastqToSam tool, which is part of Picard, is designed for this purpose. For more information on how to use FastqToSam, refer to the Picard command-line tools documentation for FastqToSam.

A command to convert your FASTQ data to BAM format using Picard will look something like this:

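# FastqToSam (not SamToFastq) converts FASTQ to BAM.
# SAMPLE_NAME is required by FastqToSam; the value below is a placeholder.
java -Xmx2g -jar picard-tools-1.118/FastqToSam.jar                             \
    FASTQ=D1VCPACXX_lane6_Read1.fastq FASTQ2=D1VCPACXX_lane6_Read2.fastq       \
    OUTPUT=D1VCPACXX_lane6.bam                                                 \
    SAMPLE_NAME=your_sample_name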
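
Before converting paired FASTQ files, a quick sanity check is to confirm that both files hold the same number of records. The sketch below is one way to do that, assuming uncompressed FASTQ with exactly four lines per record; the function names fastq_records and check_fastq_pair are made up for illustration and are not part of Picard or GMS.

```shell
# Hypothetical pre-flight check: paired FASTQ files should hold the same
# number of records. Assumes uncompressed FASTQ, four lines per record.
fastq_records() {
    echo $(( $(wc -l < "$1") / 4 ))
}

check_fastq_pair() {
    r1=$(fastq_records "$1")
    r2=$(fastq_records "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $1 has $r1 records, $2 has $r2" >&2
        return 1
    fi
}

# Example call, using the file names from the conversion command:
# check_fastq_pair D1VCPACXX_lane6_Read1.fastq D1VCPACXX_lane6_Read2.fastq
```

A mismatch usually means one of the files is truncated, in which case the conversion should not be attempted until the files are re-transferred.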

Defining Models and Running Builds

To make use of the instrument data, genome models must be defined which use the instrument data, and builds must be executed for those models. Once data has been imported, the genome model clin-seq advise command can be used to guide you through the process of defining models and running builds on those models.

First, see example usage by typing:

genome model clin-seq advise --help

Next, let's see what samples and instrument data are available for the TST1ds individual:

genome model clin-seq advise --allow-imported --individual='common_name=TST1ds'

The above command will show details about the TST1ds individual, default processing profiles and model inputs, and available samples. You should see four samples available (tumor DNA, normal DNA, tumor RNA, and normal RNA). Run the clin-seq advise command again and provide all four sample ids as follows:

genome model clin-seq advise --allow-imported --individual='common_name=TST1ds' --samples='id in [??,??,??,??]'

NOTE: Replace ?? with sample ids
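
To avoid typing the bracketed list by hand, the --samples expression can be assembled from a list of ids. This is a small sketch; the helper name samples_expression is hypothetical and the ids 1001 to 1004 are made-up placeholders, so substitute the sample ids reported on your system. The final command is printed rather than executed.

```shell
# Hypothetical helper: join sample ids into the 'id in [a,b,c,d]'
# expression used by --samples. Runs in a subshell so the IFS change
# does not leak into the caller.
samples_expression() (
    IFS=','
    echo "id in [$*]"
)

# PLACEHOLDER ids; replace with the ids shown by clin-seq advise.
EXPR=$(samples_expression 1001 1002 1003 1004)

echo "genome model clin-seq advise --allow-imported" \
    "--individual='common_name=TST1ds'" \
    "--samples='$EXPR'"
```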

NOTE: Depending on the resources available to your system or virtual machine, you may not be able to start all builds recommended by clin-seq advise simultaneously. See the Beginner's Guide to the Demonstration Analysis or Quick-VM-Tour for more details. You can repeat the clin-seq advise command as many times as you wish; each time it reports the current status of models and builds and how to proceed with the next steps until the analysis is complete.

An example list of model define commands is here. These commands vary depending on instrument-data IDs, which are not assigned deterministically, so you will have to replace the IDs with the ones on your system. The first two commands in the gist show how to obtain the IDs.
