Merge pull request #88 from databio/dev
v0.7.0
nsheff committed Oct 21, 2019
2 parents 8dd06f2 + 91389e4 commit edb9482
Showing 32 changed files with 1,545 additions and 553 deletions.
9 changes: 9 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,9 @@
Copyright 2019 Nathan Sheffield

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
21 changes: 10 additions & 11 deletions containers/Dockerfile_refgenie
@@ -10,30 +10,29 @@ MAINTAINER Nathan Sheffield <nathan@code.databio.org>
RUN apt-get install -y python python-pip
RUN apt-get install -y curl

# HTSLIB
# htslib 1.9 (tabix cmd)
RUN apt-get install -y libz-dev libncurses-dev
RUN wget -O ~/htslib.tar.bz2 https://github.com/samtools/htslib/releases/download/1.3.2/htslib-1.3.2.tar.bz2
RUN tar -xf ~/htslib.tar.bz2
RUN cd /htslib-1.3.2 && ./configure && make && make install
RUN wget -O ~/htslib.tar.gz https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && tar xjf ~/htslib.tar.gz && cd htslib-1.9 && ./configure && make && make install
ENV PATH="/htslib-1.9:${PATH}"

# install samtools
# install samtools 1.3.1
RUN wget -O ~/samtools.bz2 https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
RUN tar -xf ~/samtools.bz2
RUN cd /samtools-1.3.1 && make
ENV PATH="/samtools-1.3.1:${PATH}"

# bowtie2 and deps
# bowtie2 2.3.0 and deps
RUN apt-get install -y libtbb-dev # bowtie2 dependencies
RUN wget -O ~/bowtie.zip "https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.3.0/bowtie2-2.3.0-linux-x86_64.zip?r=https%3A%2F%2Fsourceforge.net%2Fprojects%2Fbowtie-bio%2Ffiles%2Fbowtie2%2F2.3.0%2F&ts=1485465820&use_mirror=kent"
RUN unzip ~/bowtie.zip
ENV PATH="/bowtie2-2.3.0:${PATH}"

# Bismark Methylation caller
# Bismark Methylation caller 0.17.0
RUN wget -O ~/bismark.zip https://github.com/FelixKrueger/Bismark/archive/0.17.0.zip
RUN unzip ~/bismark.zip
ENV PATH="/Bismark-0.17.0:${PATH}"

# HISAT2
# hisat2 2.0.5
RUN wget -O ~/hisat.zip ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.5-source.zip
RUN unzip ~/hisat.zip
RUN cd hisat2-2.0.5 && make
@@ -54,12 +53,12 @@ MAINTAINER Nathan Sheffield <nathan@code.databio.org>
ADD includes/twoBitToFa bin/twoBitToFa
RUN apt-get install -y libpng-dev

# bwa
# bwa 0.7.17
RUN wget -O ~/bwa-0.7.17.tar.bz2 https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2
RUN tar -xf ~/bwa-0.7.17.tar.bz2
RUN cd /bwa-0.7.17 && make
ENV PATH="/bwa-0.7.17:${PATH}"

# STAR 2.7.1a
RUN wget -O ~/STAR.tar.gz https://github.com/alexdobin/STAR/archive/2.7.1a.tar.gz && tar -xf ~/STAR.tar.gz && cd STAR-2.7.1a/source && make STAR
ENV PATH="/STAR-2.7.1a/source:${PATH}"
39 changes: 20 additions & 19 deletions docs/README.md
@@ -6,23 +6,25 @@

## What is refgenie?

Refgenie is full-service reference genome manager that organizes storage, access, and transfer of reference genomes. It provides command-line and python interfaces to download pre-built reference genome "assets" like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another.
Refgenie manages storage, access, and transfer of reference genome resources. It provides command-line and Python interfaces to *download* pre-built reference genome "assets", like indexes used by bioinformatics tools. It can also *build* assets for custom genome assemblies. Refgenie provides programmatic access to a standard genome folder structure, so software can swap from one genome to another.

## What makes refgenie better?

1. **It provides a command-line interface to download individual resources**. Think of it as `GitHub` for reference genomes. You just type `refgenie pull -g hg38 -a bwa_index`.
1. **It provides a command-line interface to download individual resources**. Think of it as `GitHub` for reference genomes. You just type `refgenie pull hg38/bwa_index`.

2. **It's scripted**. In case you need resources *not* on the server, such as for a custom genome, you can `build` your own: `refgenie build -g custom_genome -a bowtie2_index`.
2. **It's scripted**. In case you need resources *not* on the server, such as for a custom genome, you can `build` your own: `refgenie build custom_genome/bowtie2_index`.

3. **It simplifies finding local asset locations**. When you need a path to an asset, you can `seek` it, making your pipelines portable across computing environments: `refgenie seek -g hg38 -a salmon_index`.
3. **It simplifies finding local asset locations**. When you need a path to an asset, you can `seek` it, making your pipelines portable across computing environments: `refgenie seek hg38/salmon_index`.

4. **It includes a python API**. For tool developers, you use `cfg = refgenie.RefGenConf("genomes.yaml")` to get a python object with paths to any genome asset, *e.g.*, `cfg.get_asset("hg38", "kallisto_index")`.
4. **It includes a Python API**. Tool developers can use `cfg = refgenie.RefGenConf("genomes.yaml")` to get a Python object with paths to any genome asset, *e.g.*, `cfg.get_asset("hg38", "kallisto_index")`.


## Quick example

### Install and initialize

Refgenie keeps track of what's available using a configuration file initialized by `refgenie init`:

```console
pip install --user refgenie
export REFGENIE='genome_config.yaml'
@@ -31,53 +33,52 @@ refgenie init -c $REFGENIE

### Download indexes and assets for a remote reference genome

First, view available remote assets:
Use `refgenie pull` to download pre-built assets from a remote server. View available remote assets with `listr`:

```console
refgenie listr
```

Response:
```console
Querying available assets from server: http://refgenomes.databio.org/assets
Remote genomes: hg19, hg19_cdna, hg38, hg38_cdna
Querying available assets from server: http://refgenomes.databio.org/v2/assets
Remote genomes: mouse_chrM2x, rCRSd
Remote assets:
hg19: bismark_bt1_index; bismark_bt2_index; bowtie2_index; bwa_index; fasta; hisat2_index
hg19_cdna: bowtie2_index; hisat2_index; kallisto_index; salmon_index
hg38: bismark_bt1_index; bismark_bt2_index; bowtie2_index; bwa_index; fasta; hisat2_index
hg38_cdna: bowtie2_index; hisat2_index; kallisto_index; salmon_index
mouse_chrM2x/ bowtie2_index:default, fasta.chrom_sizes:default, fasta.fai:default, fasta:default
rCRSd/ bowtie2_index:default, fasta.chrom_sizes:default, fasta.chrom_sizes:test, fasta.fai:default, fasta.fai:test, fasta:default, fasta:test
```

Next, pull one:

```console
refgenie pull --genome hg38 --asset bowtie2_index
refgenie pull rCRSd/bowtie2_index
```

Response:
```console
Starting pull for 'hg38/bowtie2_index'
'hg38/bowtie2_index' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive ...
'rCRSd/bowtie2_index:default' archive size: 116.8KB
Downloading URL: http://staging.refgenomes.databio.org/v2/asset/rCRSd/bowtie2_index/archive ...
```

See [further reading on downloading assets](pull.md).

### Build your own indexes and assets for a custom reference genome

Refgenie assets are scripted, so if what you need is not available remotely, you can `build` it locally:


```console
refgenie build --genome mygenome --asset bwa_index --fasta mygenome.fa.gz
refgenie build mygenome/bwa_index --fasta mygenome.fa.gz
```

See [further reading on building assets](build.md).

### Retrieve paths to refgenie-managed assets

Once you've populated your refgenie with a few assets, it's easy to get paths to them:
Once you've populated your refgenie with a few assets, use `seek` to retrieve their local file paths:

```console
refgenie seek --genome mm10 --asset bowtie2_index
refgenie seek mm10/bowtie2_index
```

This will return the path to the particular asset of interest, regardless of your computing environment. This gives you an ultra-portable asset manager! See [further reading on retrieving asset paths](seek.md).
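For pipelines written in Python, one way to use `seek` without hard-coding paths is to call the command and capture what it prints. The sketch below is a minimal illustration, not part of refgenie: the helper names `seek_command` and `seek_path` are hypothetical, and it assumes that `refgenie seek` writes the path to standard output and that the configuration file is discoverable (for example through the `REFGENIE` variable exported above).

```python
import subprocess
from typing import List


def seek_command(genome: str, asset: str) -> List[str]:
    """Build the argument list for `refgenie seek genome/asset`."""
    return ["refgenie", "seek", f"{genome}/{asset}"]


def seek_path(genome: str, asset: str) -> str:
    """Run `refgenie seek` and return the path it prints.

    Assumption: the path is written to stdout on a single line.
    """
    result = subprocess.run(
        seek_command(genome, asset),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if refgenie exits non-zero
    )
    return result.stdout.strip()
```

A pipeline would then call something like `seek_path("mm10", "bowtie2_index")` wherever it needs the index location, so the same script runs unchanged on any machine where refgenie is installed and configured.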
44 changes: 44 additions & 0 deletions docs/asset_registry_paths.md
@@ -0,0 +1,44 @@
# Asset registry paths

Each asset is defined by four components:

1. genome name
2. asset name
3. tag name
4. seek key

All `refgenie` commands require a genome name, and most also require an asset name. Tags and seek keys are specified only when needed; otherwise, sensible defaults apply.

The most convenient way to provide this information on the command line is with an *asset registry path*, which takes this form:

```console
genome/asset.seek_key:tag
```

For example, `hg38/fasta.fai:default`. Yes, that's a lot of typing if you want to be explicit, but `refgenie` makes usage of asset registry paths easy with a system of defaults, such that all the commands below return the same path:

```console
$ refgenie seek rCRSd/fasta
path/to/genomes/archive/rCRSd/fasta/default/rCRSd.fa

$ refgenie seek rCRSd/fasta.fasta
path/to/genomes/archive/rCRSd/fasta/default/rCRSd.fa

$ refgenie seek rCRSd/fasta.fasta:default
path/to/genomes/archive/rCRSd/fasta/default/rCRSd.fa
```

How did it work?

- **default tag** is determined by the `default_tag` pointer in the config
- **seek_key** defaults to the name of the asset
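
The defaulting rules above can be sketched in a few lines of Python. This is an illustration only, not refgenie's own parser: the names `RegistryPath` and `parse_registry_path` are hypothetical, and the default tag is passed in as a plain argument here, whereas refgenie reads it from the `default_tag` pointer in the config.

```python
from typing import NamedTuple


class RegistryPath(NamedTuple):
    genome: str
    asset: str
    seek_key: str
    tag: str


def parse_registry_path(path: str, default_tag: str = "default") -> RegistryPath:
    """Split 'genome/asset.seek_key:tag' into parts, filling in defaults."""
    genome, sep, rest = path.partition("/")
    if not sep or not genome or not rest:
        raise ValueError(f"expected 'genome/asset[.seek_key][:tag]', got {path!r}")
    # The tag, if present, follows the colon.
    rest, _, tag = rest.partition(":")
    # The seek key, if present, follows the first dot after the asset name.
    asset, _, seek_key = rest.partition(".")
    if not asset:
        raise ValueError(f"missing asset name in {path!r}")
    # A missing seek key falls back to the asset name; a missing tag to the default.
    return RegistryPath(genome, asset, seek_key or asset, tag or default_tag)
```

With this sketch, `rCRSd/fasta`, `rCRSd/fasta.fasta`, and `rCRSd/fasta.fasta:default` all parse to the same four components, which mirrors why the three `seek` commands above return the same path.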

## Using arguments instead of registry paths

Alternatively, you can specify all of these namespace components as command line arguments:

```console
refgenie seek -g rCRSd -a fasta -t default
```

One advantage of this approach is that it allows you to refer to multiple assets belonging to the same genome.
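
Going the other way, separate components can be joined back into registry paths. The helper below is a hypothetical sketch (the name `to_registry_path` is not part of refgenie); it simply follows the `genome/asset.seek_key:tag` form described above and omits any component that was not given.

```python
from typing import Optional


def to_registry_path(
    genome: str,
    asset: str,
    seek_key: Optional[str] = None,
    tag: Optional[str] = None,
) -> str:
    """Join components into 'genome/asset[.seek_key][:tag]'."""
    path = f"{genome}/{asset}"
    if seek_key:
        path += f".{seek_key}"
    if tag:
        path += f":{tag}"
    return path


# Several assets of one genome, built from a single genome name.
paths = [to_registry_path("rCRSd", asset) for asset in ("fasta", "bowtie2_index")]
```

Keeping the genome separate from the asset names, as the `-g`/`-a` arguments do, is what makes it convenient to loop over several assets of the same genome.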
58 changes: 20 additions & 38 deletions docs/autodoc_build/refgenconf.md
@@ -7,35 +7,27 @@ document.addEventListener('DOMContentLoaded', (event) => {
</script>

<style>
h3 .lucidoc{
h3 .content {
padding-left: 22px;
text-indent: -15px;
}
h3 .hljs .lucidoc{
h3 .hljs .content {
padding-left: 20px;
margin-left: 0px;
text-indent: -15px;
margin-bottom: 0px;
}
h4 .lucidoc, table .lucidoc, p .lucidoc, li .lucidoc { margin-left: 30px; }
h4 .lucidoc {
h4 .content, table .content, p .content, li .content { margin-left: 30px; }
h4 .content {
font-style: italic;
font-size: 1em;
margin-bottom: 0px;
}

</style>
<div class='lucidoc'>

# Package `refgenconf` Documentation

## <a name="MissingGenomeError"></a> Class `MissingGenomeError`
Error type for request of unknown genome/assembly.


## <a name="GenomeConfigFormatError"></a> Class `GenomeConfigFormatError`
Exception for invalid genome config file format.

# Package `refgenconf` Documentation

## <a name="RefGenConf"></a> Class `RefGenConf`
A sort of oracle of available reference genome assembly assets
@@ -59,7 +51,7 @@ Map each assembly name to a list of available asset names.


```python
def assets_str(self, offset_text=' ', asset_sep=', ', genome_assets_delim=': ', order=None)
def assets_str(self, offset_text=' ', asset_sep='; ', genome_assets_delim=': ', order=None)
```

Create a block of text representing genome-to-asset mapping.
@@ -127,7 +119,7 @@ Get as single string this configuration's reference genome assembly IDs.


```python
def get_asset(self, genome_name, asset_name, strict_exists=True, check_exist=<function RefGenConf.<lambda> at 0x7f7dc4d4f9d8>)
def get_asset(self, genome_name, asset_name, strict_exists=True, check_exist=<function RefGenConf.<lambda> at 0x7fe8466d32f0>)
```

Get an asset for a particular assembly.
@@ -207,7 +199,7 @@ List locally available reference genome IDs and assets by ID.


```python
def list_remote(self, get_url=<function RefGenConf.<lambda> at 0x7f7dc4d4fc80>, order=None)
def list_remote(self, get_url=<function RefGenConf.<lambda> at 0x7fe8466d3598>, order=None)
```

List genomes and assets available remotely.
@@ -225,7 +217,7 @@ List genomes and assets available remotely.


```python
def pull_asset(self, genome, assets, genome_config, unpack=True, force=None, get_json_url=<function RefGenConf.<lambda> at 0x7f7dc4d4fd90>, get_main_url=None, build_signal_handler=<function _handle_sigint at 0x7f7dc527bea0>)
def pull_asset(self, genome, assets, genome_config, unpack=True, force=None, get_json_url=<function RefGenConf.<lambda> at 0x7fe8466d36a8>, get_main_url=None, build_signal_handler=<function _handle_sigint at 0x7fe8466a2950>)
```

Download and possibly unpack one or more assets for a given ref gen.
@@ -255,7 +247,7 @@ Download and possibly unpack one or more assets for a given ref gen.


```python
def update_assets(self, genome, asset=None, data=None)
def update_genomes(self, genome, asset=None, data=None)
```

Updates the genomes in RefGenConf object at any level. If a requested genome-asset mapping is missing, it will be created.
@@ -273,22 +265,20 @@ Updates the genomes in RefGenConf object at any level. If a requested genome-ass



```python
def update_genomes(self, genome, data=None)
```

Updates the genomes in RefGenConf object at any level. If a requested genome is missing, it will be added
#### Parameters:
## <a name="GenomeConfigFormatError"></a> Class `GenomeConfigFormatError`
Exception for invalid genome config file format.

- `genome` (`str`): genome to be added/updated
- `data` (`Mapping`): data to be added/updated

## <a name="MissingAssetError"></a> Class `MissingAssetError`
Error type for request of an unavailable genome asset.

#### Returns:

- `RefGenConf`: updated object
## <a name="MissingConfigDataError"></a> Class `MissingConfigDataError`
Missing required configuration instance items


## <a name="MissingGenomeError"></a> Class `MissingGenomeError`
Error type for request of unknown genome/assembly.


## <a name="RefgenconfError"></a> Class `RefgenconfError`
@@ -299,14 +289,6 @@ Base exception type for this package
Use of environment variable that isn't bound to a value.


## <a name="MissingAssetError"></a> Class `MissingAssetError`
Error type for request of an unavailable genome asset.


## <a name="MissingConfigDataError"></a> Class `MissingConfigDataError`
Missing required configuration instance items


```python
def select_genome_config(filename, conf_env_vars=None, **kwargs)
```
@@ -325,7 +307,7 @@ Get path to genome configuration file.



</div>


*Version Information: `refgenconf` v0.3.0, generated by `lucidoc` v0.4.0*

*Version Information: `refgenconf` v0.2.0, generated by `lucidoc` v0.4.1*
