This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Auxiliary Data Import

Rob Fugina edited this page Apr 5, 2016 · 20 revisions

These instructions provide examples of importing certain auxiliary annotation data used by the GMS, such as reference genomes, feature sets, and transcript annotations.

These instructions assume that you have followed the installation instructions up to and including the prime-system.pl command. If you do not require the demonstration dataset, you can prime the system without downloading it by passing the --data=none option to prime-system.pl.

Importing a New Human Reference Genome

A new human reference genome can be imported by defining a new imported-reference-sequence model using the genome model define imported-reference-sequence command. Several parameters must be specified when defining this model:

  • the file system path of the reference fasta file
  • the processing profile ID (typically 1990904)
  • the name of the reference's species (eg, human)
  • the reference version
  • a prefix which classifies the source of the reference
  • a name for the assembly of the reference (typically includes the prefix and version)
  • a build name which acts as a local name for the reference
  • a URL which identifies the original source of the reference

To import the soft-masked primary assembly from Ensembl, first download the compressed reference fasta from Ensembl's ftp server. Then calculate its checksum and compare the result to the entry for that file in Ensembl's CHECKSUMS file (optional, but recommended). Decompress the downloaded file using the gunzip utility, and then run the genome model define imported-reference-sequence command. Defining the imported reference sequence model automatically starts a build of this model to import the reference.

$ URI='ftp://ftp.ensembl.org/pub/release-76/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz'
$ wget $URI
$ sum Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
$ genome model define imported-reference-sequence                    \
  --fasta-file=$PWD/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa   \
  --processing-profile-id=1990904                                    \
  --species-name=human                                               \
  --version=38                                                       \
  --prefix=GRC                                                       \
  --assembly-name=GRCh38-ensembl-primary                             \
  --build-name=GRCh38-ensembl-primary                                \
  --sequence-uri=$URI
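
The checksum comparison described above can be scripted rather than done by eye. The following is a minimal sketch, not part of the GMS: verify_sum is a hypothetical helper name, and it assumes that each line of the CHECKSUMS file has the form "checksum blocks filename", matching the output of the sum utility.

```shell
# Hypothetical helper: compare the `sum` of a file with its entry in a CHECKSUMS file.
# Assumption: each CHECKSUMS line looks like "<checksum> <blocks> <file name>".
verify_sum() {
    file=$1
    checksums=$2
    name=$(basename "$file")
    # Checksum and block count computed locally (leading zeros normalized away).
    actual=$(sum < "$file" | awk '{ print $1 + 0, $2 + 0 }')
    # Checksum and block count published for this file name.
    expected=$(awk -v n="$name" '$3 == n { print $1 + 0, $2 + 0 }' "$checksums")
    if [ -n "$expected" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $name"
    else
        echo "MISMATCH: $name (expected '$expected', got '$actual')" >&2
        return 1
    fi
}
```

It would be run on the compressed file, before gunzip, for example as verify_sum Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz CHECKSUMS, assuming the CHECKSUMS file has been downloaded from the same ftp directory as the fasta.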

Creating a Modified Reference based on an Existing Reference Genome

An existing reference genome may be used as the basis for a new reference genome that combines the existing reference with the sequences in a new fasta file. To create the modified reference, you need the build ID of the existing reference and the path to the new fasta file.

Existing reference genomes can be listed along with their build IDs:

$ genome model build list --filter="model.type_name='imported reference sequence'" 

Once the build ID of the existing reference and the path to the new fasta are both known, a new model can be defined. Defining this model automatically starts a build of this model, so there is no need to separately start a build.

The following command is an example of appending the sequences given in a file named ERCC92.fa to the GRCh37-lite reference.

$ genome model define imported-reference-sequence  \
  --append-to=106942997                            \
  --fasta-file=/home/ubuntu/ERCC92.fa              \
  --use-default-sequence-uri                       \
  --species-name=human                             \
  --version=37_ERCC
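
Before appending, it may be worth checking which sequence names the new fasta file would add. The sketch below is an illustration only: fasta_names is a hypothetical helper, and it assumes that the sequence name is the first word after the ">" on each header line.

```shell
# Hypothetical helper: print the sequence name from every header line of a fasta file.
# Assumption: the name is the first whitespace-delimited word after '>'.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}
```

For example, fasta_names /home/ubuntu/ERCC92.fa | sort | uniq -d would print any names that occur more than once in the file to be appended. Whether the GMS itself rejects duplicate names is not stated here.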

Importing a Variation List

import-dbsnp-build

A list of variants may be imported into the GMS from outside sources using genome model imported-variation-list. You may import variants directly from dbSNP with the import-dbsnp-build sub-command. To import a variation list, the build id for a reference sequence using the same coordinates as the variation list must be supplied.

Notice the use of http:// at the beginning of the vcf file url. The ftp protocol is not supported.

$ genome model imported-variation-list import-dbsnp-build \
  --version 141 --reference-sequence-build 106942997      \
  --vcf-file-url 'http://ftp.ncbi.nih.gov/snp/organisms/human_9606_b141_GRCh37p13/VCF/00-All.vcf.gz'
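
Because only http:// URLs are accepted, a link copied from an ftp listing has to be rewritten first. A minimal sketch follows; to_http_url is a hypothetical helper, and it assumes that the server offers the same path over HTTP, as ftp.ncbi.nih.gov does in the example above.

```shell
# Hypothetical helper: rewrite an ftp:// URL as http://, leaving other URLs unchanged.
# Assumption: the server serves the same path over HTTP.
to_http_url() {
    case $1 in
        ftp://*) printf 'http://%s\n' "${1#ftp://}" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}
```

The result can then be passed to --vcf-file-url.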

A copy of dbSNP version 141 for the GRCh37-lite reference is imported into the GMS during prime-system.pl and assigned the build ID 127786607.

import-variants

Other variants in vcf or bed format may be imported with the import-variants sub-command.

$ genome model imported-variation-list import-variants  \
  --format vcf                                          \
  --input-path $PWD/clinvar_20140807.sorted.vcf         \
  --source dbsnp-clinvar                                \
  --version 20140807                                    \
  --description 'dbsnp clinvar 20140807'                \
  --reference-sequence-build 106942997                  \
  --variant-type snv
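
The input file name in this example suggests that the vcf was sorted before import. This page does not say whether sorting is required or which order is expected, so the following is only a sketch under that assumption. sort_vcf is a hypothetical helper that keeps the header lines first and sorts records by chromosome name (lexicographically) and then by position (numerically).

```shell
# Hypothetical helper: write a coordinate-sorted copy of a VCF to standard output.
# Assumption: lexicographic chromosome order is acceptable; the required order is not documented here.
sort_vcf() {
    # Header lines (starting with '#') are kept first, in their original order.
    grep '^#' "$1"
    # Records are sorted by chromosome name, then numerically by position.
    grep -v '^#' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

For example, sort_vcf clinvar_20140807.vcf > clinvar_20140807.sorted.vcf, where the unsorted file name is illustrative.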

Importing a New Version of Ensembl

Ensembl is distributed as a set of files which can be easily imported into MySQL. The GMS requires Ensembl annotation to be loaded into a MySQL database before it can be imported into the GMS. MySQL is not set up by default when installing the GMS, so it must be set up before importing Ensembl annotation for the first time.

The following example is based on Ensembl release 76 for human.

Set Up MySQL

Install MySQL using apt-get. You will be prompted to provide a "root" password for MySQL.

$ sudo apt-get install -y mysql-client mysql-server

Use the mysql_setpermission script to create a new user and database.

$ mysql_setpermission --user=root --password

The mysql_setpermission command is interactive. First, choose menu option (2) to create a new database named homo_sapiens_core_76_38 and a user named mse that can connect from host localhost. Be sure to confirm your choices by answering yes when prompted for confirmation. A database and user now exist with basic permissions. Next, choose menu option (6) to give the mse user full permissions on homo_sapiens_core_76_38 from host localhost. Again, be sure to confirm your choices when prompted. You may then exit mysql_setpermission by choosing 0 from the main menu.

The final step in setting up MySQL for Ensembl annotation importing is to let the GMS know where the MySQL server is. To do that, modify the /etc/genome.conf file. Using an editor such as vim, open /etc/genome.conf and set the following variables to the values that match how you configured MySQL:

export GENOME_DB_ENSEMBL_HOST='localhost'
export GENOME_DB_ENSEMBL_USER='mse'
export GENOME_DB_ENSEMBL_PORT='3306'

After modifying the /etc/genome.conf file, be sure to log out and log back in so that the new configuration may take effect in your environment.
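
After logging back in, a quick check can confirm that the settings reached your environment. This is a minimal sketch: check_ensembl_env is a hypothetical helper, and it assumes that /etc/genome.conf is loaded into the login environment, as the step above implies.

```shell
# Hypothetical helper: report which of the Ensembl database variables are unset or empty.
# Assumption: /etc/genome.conf is loaded into the environment at login.
check_ensembl_env() {
    missing=0
    for var in GENOME_DB_ENSEMBL_HOST GENOME_DB_ENSEMBL_USER GENOME_DB_ENSEMBL_PORT; do
        # Look up the value of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "not set: $var" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_ensembl_env prints nothing and returns success when all three variables are set.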

Download Ensembl and Import into MySQL

$ cd
$ wget -r ftp://ftp.ensembl.org/pub/current_mysql/homo_sapiens_core_76_38/
$ cd ftp.ensembl.org/pub/current_mysql/homo_sapiens_core_76_38/
$ gunzip *.gz
$ mysql -u mse homo_sapiens_core_76_38 < homo_sapiens_core_76_38.sql
$ mysqlimport -u mse --fields_escaped_by=$'\t' homo_sapiens_core_76_38 -L *.txt
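
The database name appears in the download URL, the directory name, and both MySQL commands, so a slip in the release number in any one of them breaks the sequence. One way to reduce that risk is to derive the name once and reuse it. In this sketch, ensembl_db_name and DB are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: build an Ensembl core database name from species, release, and assembly.
ensembl_db_name() {
    printf '%s_core_%s_%s\n' "$1" "$2" "$3"
}

DB=$(ensembl_db_name homo_sapiens 76 38)
echo "$DB"   # prints homo_sapiens_core_76_38
```

$DB could then stand in for the literal database name in the wget, cd, mysql, and mysqlimport commands above.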

Disable Tiering

Support for tiering is being deprecated in the GMS, but for now your reference genome will need a "tiering directory". Create the following empty directory structure:

$ mkdir -p $HOME/dummy-tiering-directory/rmsk
$ mkdir -p $HOME/dummy-tiering-directory/cpg_islands
$ mkdir -p $HOME/dummy-tiering-directory/conserved_regions
$ mkdir -p $HOME/dummy-tiering-directory/regulatory_regions
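
The same structure can be created in a location of your choice with a small function. This is a sketch only: make_dummy_tiering_dir is a hypothetical helper, and it prints the absolute path of the directory it created so that the path can be copied into the Perl module.

```shell
# Hypothetical helper: create the empty tiering layout under a chosen base directory.
make_dummy_tiering_dir() {
    base=$1
    for sub in rmsk cpg_islands conserved_regions regulatory_regions; do
        mkdir -p "$base/$sub"
    done
    # Print the absolute path of the base directory.
    (cd "$base" && pwd)
}
```

For example, make_dummy_tiering_dir "$HOME/dummy-tiering-directory" creates the four directories listed above.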

Modify the else block in the get_or_create_ucsc_tiering_directory subroutine definition in /opt/gms/$GENOME_SYS_ID/sw/genome/lib/perl/Genome/Model/Build/ReferenceSequence.pm to point to the new tiering directory. Be sure to use the actual location of the tiering directory that you just created instead of /home/ubuntu/, which may not be correct on your system.

@@ -156,7 +156,7 @@ sub get_or_create_ucsc_tiering_directory {
     }
     else {
         $self->status_message("UCSC Tiering Directory not currently available for this species: ".$self->species_name);
-        return;
+        return '/home/ubuntu/dummy-tiering-directory';
     }
 }

Define the Imported Annotation Model

Use the following command to define and build the Imported Annotation model which will import the Ensembl Annotation from MySQL into the GMS. Replace $REFERENCE_BUILD_ID with the build id of your GRCh38 reference genome.

$ genome model define imported-annotation            \
    --processing-profile=2070042                     \
    --model-name='NCBI-human.ensembl'                \
    --reference-sequence-build=$REFERENCE_BUILD_ID   \
    --version=76_38                                  \
    --build-name=NCBI-human.ensembl/76_38            \
    --species-name=human                             \
    --annotation-import-version 2

Defining the imported annotation model also starts a build of that model. Once the build completes successfully, you can use the imported annotation build as an input to other models, such as a reference alignment model.

If you had to modify the /etc/genome.conf file earlier, don't forget to log out and log back into the GMS before defining the imported annotation model, otherwise GMS may not be able to connect to your MySQL database.
