Merge pull request #28 from databio/dev
v0.4.0
nsheff committed Jun 14, 2019
2 parents 096965c + ea81e65 commit d61c6d9
Showing 27 changed files with 1,773 additions and 218 deletions.
50 changes: 31 additions & 19 deletions docs/README.md
@@ -1,44 +1,56 @@
# <img src="img/refgenie_logo.svg" class="img-header"> genome index manager

# <img src="img/refgenie_logo.svg" class="img-header"> reference genome manager

[![PEP compatible](http://pepkit.github.io/img/PEP-compatible-green.svg)](http://pepkit.github.io)


## What is refgenie?

Refgenie creates a standardized folder structure for reference genome files and indexes. You can download pre-built genomes or build your own for any fasta file.
Refgenie is a full-service reference genome manager. It provides command-line and Python interfaces to download pre-built reference genome "assets" like indexes used by different bioinformatics tools. It can also build assets for custom genome assemblies, and it facilitates systematic organization of, and access to, local genome "assets."

## What makes refgenie better?

Refgenie provides a **standard folder structure** for reference genome indexes, so that alignment tools can easily swap from one genome to another. Most importantly, Refgenie is **scripted** so that users can create their own assembly index packages from whatever genome source they like.
Refgenie provides programmatic access to a standard genome folder structure, so that software can easily swap from one genome to another. There are other similar projects, but Refgenie has a few advantages:

## Installing
1. **It provides a command-line interface to download individual resources**. Think of it as `GitHub` for reference genomes. You just type `refgenie pull -g hg38 -a kallisto`.

If you just want to use pre-built refgenie assemblies, just head over to the [download page](download.md); you don't even need to install refgenie. If you want to index your own genomes, then you'll need to install refgenie plus your genome indexers of choice. Install refgenie from [GitHub releases](https://github.com/databio/refgenie/releases) or from PyPI with `pip`:
2. **It's scripted**. In case you need resources *not* on the server, such as for a custom genome, refgenie provides a `build` function to create your own: `refgenie build -i custom.fa.gz -a bowtie2`.

3. **It includes a Python API**. Tool developers can use `cfg = refgenie.RefGenConf("genomes.yaml")` to get a Python object with paths to any genome asset, *e.g.*, `cfg.hg38.kallisto`.

```console
pip install --user refgenie
```
4. When a new asset is downloaded, Refgenie can automatically update a local configuration file that acts as a sort of filesystem oracle for locally available genome assets. It's aware of the path to each resource that's been downloaded or otherwise declared.

## Quick example

### Downloading indexes and assets for a reference genome

Update with:

```console
pip install --user --upgrade refgenie
refgenie pull --genome hg38 --asset bowtie2
```

After that, you'll need to [install the genome indexers](install.md) -- but first, you can confirm that `refgenie` is functioning:
Response:
```console
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
...
```

## Quick start
Pull many assets at once:
```console
refgenie pull --genome mm10 --asset kallisto TSS_enrichment mappability
```

See if your install worked by invoking `refgenie` from the command line:
See [further reading on downloading assets](download.md).

```
refgenie -h
```
### Building your own indexes and assets for a reference genome

If the `refgenie` executable is not automatically in your `$PATH`, add the following line to your `.bashrc` or `.profile` (or `.bash_profile` on macOS):

```console
export PATH=~/.local/bin:$PATH
refgenie build --input hg38.fa.gz --asset kallisto
```

Next, [install the genome indexers](install.md).
See [further reading on building assets](build.md).

If you want to read more about the motivation behind refgenie and the software engineering that makes refgenie work, proceed next to the [overview](overview.md).
241 changes: 241 additions & 0 deletions docs/autodoc_build/refgenconf.md
@@ -0,0 +1,241 @@
# Package refgenconf Documentation

## Class MissingConfigDataError
Missing required configuration instance items


## Class MissingGenomeError
Error type for request of unknown genome/assembly.


## Class UnboundEnvironmentVariablesError
Use of environment variable that isn't bound to a value.


## Class MissingAssetError
Error type for request of an unavailable genome asset.


## Class RefGenConf
A sort of oracle of available reference genome assembly assets


### assets\_dict
Map each assembly name to a list of available asset names.
```python
def assets_dict(self)
```

#### Returns:

`Mapping[str, Iterable[str]]`: mapping from assembly name to collection of available asset names.




### assets\_str
Create a block of text representing genome-to-asset mapping.
```python
def assets_str(self, offset_text=' ', asset_sep='; ', genome_assets_delim=': ')
```

#### Parameters:

- `offset_text` -- `str`: text that begins each line of the text representation that's produced
- `asset_sep` -- `str`: the delimiter between names of types of assets, within each genome line
- `genome_assets_delim` -- `str`: the delimiter to place between reference genome assembly name and its list of asset names


#### Returns:

`str`: text representing genome-to-asset mapping
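
To make the role of the three delimiter parameters concrete, here is a minimal sketch that formats a plain dictionary in the way the description above suggests. `format_assets` and the sample mapping are hypothetical stand-ins, and sorting the names is an assumption; this is not refgenconf's actual implementation.

```python
def format_assets(assets_by_genome, offset_text=" ", asset_sep="; ",
                  genome_assets_delim=": "):
    """Render a genome-to-assets mapping as one line of text per genome.

    Hypothetical stand-in for illustration only; not refgenconf's code.
    """
    lines = []
    for genome in sorted(assets_by_genome):
        # Join the asset names for this genome with the asset separator.
        assets = asset_sep.join(sorted(assets_by_genome[genome]))
        # Prefix each line with the offset, then genome, delimiter, assets.
        lines.append(offset_text + genome + genome_assets_delim + assets)
    return "\n".join(lines)


# Sample data; the genome and asset names are only examples.
example = {"hg38": ["bowtie2", "kallisto"], "mm10": ["kallisto"]}
text = format_assets(example)
print(text)
```

Changing `asset_sep` to `", "` or `offset_text` to a tab would alter only the layout, not the mapping itself.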




### genomes\_list
Get a list of this configuration's reference genome assembly IDs.
```python
def genomes_list(self)
```

#### Returns:

`Iterable[str]`: list of this configuration's reference genome assembly IDs




### genomes\_str
Get this configuration's reference genome assembly IDs as a single string.
```python
def genomes_str(self)
```

#### Returns:

`str`: single string that lists this configuration's known reference genome assembly IDs




### get\_asset
Get an asset for a particular assembly.
```python
def get_asset(self, genome_name, asset_name, strict_exists=True, check_exist=<function RefGenConf.<lambda> at 0x7fd52059c0d0>)
```

#### Parameters:

- `genome_name` -- `str`: name of a reference genome assembly of interest
- `asset_name` -- `str`: name of the particular asset to fetch
- `strict_exists` -- `bool | NoneType`: how to handle case in which path doesn't exist; True to raise IOError, False to raise RuntimeWarning, and None to do nothing at all
- `check_exist` -- `function(callable) -> bool`: how to check for asset/path existence


#### Returns:

`str`: path to the asset


#### Raises:

- `TypeError`: if the existence check is not a one-arg function
- `refgenconf.MissingGenomeError`: if the named assembly isn't known to this configuration instance
- `refgenconf.MissingAssetError`: if the named assembly is known to this configuration instance, but the requested asset is unknown
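
The three-way `strict_exists` policy can be sketched in isolation as follows. `check_path` is a hypothetical helper, not part of refgenconf. It assumes that "raise RuntimeWarning" means emitting a warning through the `warnings` module, and it only checks that `check_exist` is callable rather than verifying that it takes exactly one argument.

```python
import os
import warnings


def check_path(path, strict_exists=True, check_exist=os.path.exists):
    """Apply the existence policy described for `strict_exists`.

    Hypothetical helper for illustration; not refgenconf's implementation.
    """
    if not callable(check_exist):
        raise TypeError("check_exist must be a one-argument callable")
    # None means "do not check at all"; an existing path always passes.
    if strict_exists is None or check_exist(path):
        return path
    message = "Asset path does not exist: {}".format(path)
    if strict_exists:
        raise IOError(message)
    # strict_exists is False: warn, but still hand back the path.
    warnings.warn(message, RuntimeWarning)
    return path
```

Passing a custom `check_exist`, such as `os.path.isdir`, would tighten the check for assets that are expected to be folders.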




### list\_assets\_by\_genome
List types/names of assets that are available for one--or all--genomes.
```python
def list_assets_by_genome(self, genome=None)
```

#### Parameters:

- `genome` -- `str | NoneType`: reference genome assembly ID, optional; if omitted, the full mapping from genome to asset names


#### Returns:

`Iterable[str] | Mapping[str, Iterable[str]]`: collection of asset type names available for particular reference assembly if one is provided, else the full mapping between assembly ID and collection of available asset type names




### list\_genomes\_by\_asset
List assemblies for which a particular asset is available.
```python
def list_genomes_by_asset(self, asset=None)
```

#### Parameters:

- `asset` -- `str | NoneType`: name of type of asset of interest, optional


#### Returns:

`Iterable[str] | Mapping[str, Iterable[str]]`: collection of assemblies for which the given asset is available; if asset argument is omitted, the full mapping from name of asset type to collection of assembly names for which the asset key is available will be returned.
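
The relationship between the two listings (assets by genome, genomes by asset) amounts to inverting a mapping. A minimal sketch on a plain dictionary, where `invert_assets` and the sample data are hypothetical and not taken from refgenconf:

```python
def invert_assets(assets_by_genome):
    """Turn a genome -> assets mapping into an asset -> genomes mapping."""
    genomes_by_asset = {}
    for genome, assets in assets_by_genome.items():
        for asset in assets:
            # Create the list for this asset on first sight, then append.
            genomes_by_asset.setdefault(asset, []).append(genome)
    return genomes_by_asset


# Sample data; the genome and asset names are only examples.
assets_by_genome = {"hg38": ["bowtie2", "kallisto"], "mm10": ["kallisto"]}
genomes_by_asset = invert_assets(assets_by_genome)
print(sorted(genomes_by_asset["kallisto"]))
```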




### list\_remote
List genomes and assets available remotely.
```python
def list_remote(self, get_url=<function RefGenConf.<lambda> at 0x7fd52059c2f0>)
```

#### Parameters:

- `get_url` -- `function(refgenconf.RefGenConf) -> str`: how to determine URL request, given RefGenConf instance


#### Returns:

`str, str`: text reps of remotely available genomes and assets




### pull\_asset
Download and possibly unpack one or more assets for a given ref gen.
```python
def pull_asset(self, genome, assets, genome_config, unpack=True, get_json_url=<function RefGenConf.<lambda> at 0x7fd52059c400>, get_main_url=None)
```

#### Parameters:

- `genome` -- `str`: name of a reference genome assembly of interest
- `assets` -- `str`: name(s) of particular asset(s) to fetch
- `genome_config` -- `str`: path to genome configuration file to update
- `unpack` -- `bool`: whether to unpack a tarball
- `get_json_url` -- `function(str, str, str) -> str`: how to build URL from genome server URL base, genome, and asset
- `get_main_url` -- `function(str) -> str`: how to get archive URL from main URL


#### Returns:

`Iterable[(str, str | NoneType)]`: collection of pairs of asset name and folder name (key-value pair with which genome config file is updated) if pull succeeds, else asset key and a null value.


#### Raises:

- `TypeError`: if the assets argument is neither string nor other Iterable
- `refgenconf.UnboundEnvironmentVariablesError`: if genome folder path contains any env. var. that's unbound
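
The two error conditions above can be illustrated with a short sketch. Both helpers are hypothetical and only approximate the checks described; in particular, the regular expression used to spot `$VAR` and `${VAR}` references is an assumption, not refgenconf's own logic.

```python
import os
import re


def normalize_assets(assets):
    """Accept one asset name or an iterable of names; return a list."""
    if isinstance(assets, str):
        return [assets]
    try:
        return list(assets)
    except TypeError:
        raise TypeError("assets must be a string or an iterable of strings")


def unbound_env_vars(path):
    """Return names of $VAR / ${VAR} references in path that have no value."""
    names = re.findall(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?", path)
    return [name for name in names if name not in os.environ]
```

A caller could raise its own error whenever `unbound_env_vars(folder)` returns a non-empty list, before attempting any download.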




### update\_genomes
Updates the genomes in a RefGenConf object at any level. If a requested genome-asset mapping is missing, it will be created.
```python
def update_genomes(self, genome, asset=None, data=None)
```

#### Parameters:

- `genome` -- `str`: genome to be added/updated
- `asset` -- `str`: asset to be added/updated
- `data` -- `Mapping`: data to be added/updated


#### Returns:

`RefGenConf`: updated object
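
The "create if missing" behavior at each level can be pictured with nested dictionaries. This sketch assumes a plain `dict` layout of genome, then asset, then data; `update_nested` and the `"path"` key are hypothetical and do not describe how a RefGenConf object is actually stored.

```python
def update_nested(config, genome, asset=None, data=None):
    """Create or update nested genome/asset entries in a plain dict.

    Hypothetical stand-in; the real config object's layout is assumed.
    """
    # Create the genome level if it is not there yet.
    genome_entry = config.setdefault(genome, {})
    if asset is not None:
        # Create the asset level if needed, then merge in any data.
        asset_entry = genome_entry.setdefault(asset, {})
        if data:
            asset_entry.update(data)
    return config


config = {}
update_nested(config, "hg38")
update_nested(config, "hg38", "bowtie2", {"path": "bowtie2_index"})
print(config)
```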




## Class RefgenconfError
Base exception type for this package


## Class GenomeConfigFormatError
Exception for invalid genome config file format.


### select\_genome\_config
Get path to genome configuration file.
```python
def select_genome_config(filename, conf_env_vars=None)
```

#### Parameters:

- `filename` -- `str`: name/path of genome configuration file
- `conf_env_vars` -- `Iterable[str]`: names of environment variables to consider; basically, a prioritized search list


#### Returns:

`str`: path to genome configuration file
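
One way to read "a prioritized search list" is sketched below: use an explicit filename when one is given, otherwise take the first listed environment variable that has a value. `pick_config_path`, the `environ` argument, and the variable names are hypothetical, and the precedence of the explicit filename is an assumption rather than documented behavior.

```python
import os


def pick_config_path(filename=None, conf_env_vars=None, environ=None):
    """Return filename if given, else the first set variable's value.

    Hypothetical helper for illustration; not refgenconf's implementation.
    """
    environ = os.environ if environ is None else environ
    if filename:
        return os.path.expanduser(filename)
    # Walk the variable names in priority order and stop at the first hit.
    for name in conf_env_vars or []:
        value = environ.get(name)
        if value:
            return value
    return None


# A fake environment keeps the example independent of the real shell.
env = {"REFGENIE": "/data/genomes.yaml"}
print(pick_config_path(None, ["MISSING_VAR", "REFGENIE"], env))
```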





**Version Information**: `refgenconf` v0.1.1, generated by `lucidoc` v0.4dev
40 changes: 28 additions & 12 deletions docs/build.md
@@ -1,29 +1,45 @@
# Building genome indexes with Refgenie
# Building genome indexes with refgenie

Indexing your own reference genome is as easy as 1-2-3:
Once you've [installed refgenie](install.md), you can use `refgenie pull` to [download pre-built assets](download.md) without installing any additional software. If you want to build your own, you'll also need to install the building software for the asset you want to build. You have two choices for getting that software: you can either [install building software natively](#install_building_software_natively) or use a [docker image](#docker).

1. Install refgenie
2. [Install genome indexers](/install) or the [docker image](#docker).
3. Run refgenie with: `refgenie -i INPUT_FILE.fa`. (INPUT_FILE is a fasta file of your reference genome, and can be either a local file or a URL)
Once you're set up with all the additional software, you simply run `refgenie build`, passing it any necessary input files called for by the asset recipe. Further documentation on building specific assets is forthcoming.

## Customizing indexes
## Install building software natively

Refgenie expects to find in your `PATH` any tools needed for building a desired asset. You'll need to follow the installation instructions for each of these tools individually. You can find some basic ideas for how to install them programmatically in the [dockerfile](https://github.com/databio/refgenie/blob/dev/containers/Dockerfile_refgenie).
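
Before running a build, it can help to confirm that the required tools are actually visible on `PATH`. The sketch below uses the standard library's `shutil.which`; `missing_tools` is a hypothetical helper, and the executable names are illustrative guesses at what a given asset recipe might call, not a list taken from refgenie.

```python
import shutil


def missing_tools(tool_names):
    """Return the tools from tool_names that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Example executable names only; adjust to the assets you plan to build.
needed = ["bowtie2-build", "hisat2-build", "kallisto"]
missing = missing_tools(needed)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All build tools found on PATH")
```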

Refgenie currently builds indexes for bowtie2, hisat2, bismark (for DNA methylation), and others. You can find the complete list in the [config file](https://github.com/databio/refgenie/blob/dev/refgenie/refgenie.yaml). These are all optional; you only have to build indexes for ones you intend to use. You can also add more later. If you don't pass along a configuration file to `refgenie`, it will simply use this one, building these indexes. If you want to toggle some of them, just duplicate and edit the config file and pass it to refgenie like this:

Refgenie knows how to build indexes for bowtie2, hisat2, bismark, and other common tools. You can find the complete list in the [config file](https://github.com/databio/refgenie/blob/dev/refgenie/refgenie.yaml). These are all optional; you only have to build indexes for ones you intend to use. You can also add more later. If you don't pass along a configuration file to `refgenie`, it will simply use that one, building those indexes. If you want to choose a subset, copy the config file, edit it as desired, and pass it to `refgenie` like this:
```
wget https://raw.githubusercontent.com/databio/refgenie/master/refgenie/refgenie.yaml
refgenie -c refgenie.yaml
refgenie build -c refgenie.yaml
```

## Adding GTFs
### Adding GTFs

Refgenie also allows you to add information in the form of a GTF file, which provides gene annotation.


#### Optional
## Docker

If you don't want to install all those indexers (and I don't blame you), then you may be interested in my docker image on DockerHub (nsheff/refgenie) that has all of these packages pre-installed, so you can run the complete indexer without worrying about paths and packages. Just clone this repo and run it with the `-d` flag. For example:

```
~/code/refgenie/refgenie/refgenie.py --input rn6.fa --outfolder $HOME -d
```

* Set an environment shell variable called `GENOMES` to point to where you want your references saved.
### Building the container

You can build the docker container yourself like this:

```
git clone https://github.com/databio/refgenie.git
cd refgenie/containers
make refgenie
```

### Pulling the container

```
docker pull nsheff/refgenie
```
