Merge pull request #28 from databio/dev

v0.4.0
refgenie · Jun 14, 2019 · d61c6d9
2 parents 096965c + ea81e65
commit d61c6d9
Showing 27 changed files with 1,773 additions and 218 deletions.
diff --git a/docs/README.md b/docs/README.md
@@ -1,44 +1,56 @@
-# <img src="img/refgenie_logo.svg" class="img-header"> genome index manager
+
+# <img src="img/refgenie_logo.svg" class="img-header"> reference genome manager
 
 [![PEP compatible](http://pepkit.github.io/img/PEP-compatible-green.svg)](http://pepkit.github.io)
 
+
 ## What is refgenie?
 
-Refgenie creates a standardized folder structure for reference genome files and indexes. You can download pre-built genomes or build your own for any fasta file.
+Refgenie is a full-service reference genome manager. It provides command-line and Python interfaces to download pre-built reference genome "assets" like indexes used by different bioinformatics tools. It can also build assets for custom genome assemblies, and it facilitates systematic organization of, and access to, local genome "assets."
 
 ## What makes refgenie better?
 
-Refgenie provides a **standard folder structure** for reference genome indexes, so that alignment tools can easily swap from one genome to another. Most importantly, Refgenie is **scripted** so that users can create their own assembly index packages from whatever genome source they like.
+Refgenie provides programmatic access to a standard genome folder structure, so that software can easily swap from one genome to another. There are other similar projects, but Refgenie has a few advantages:
 
-## Installing
+1. **It provides a command-line interface to download individual resources**. Think of it as `GitHub` for reference genomes. You just type `refgenie pull -g hg38 -a kallisto`.
 
-If you just want to use pre-built refgenie assemblies, just head over to the [download page](download.md); you don't even need to install refgenie. If you want to index your own genomes, then you'll need to install refgenie plus your genome indexers of choice. Install refgenie from [GitHub releases](https://github.com/databio/refgenie/releases) or from PyPI with `pip`:
+2. **It's scripted**. In case you need resources *not* on the server, such as for a custom genome, refgenie provides a `build` function to create your own: `refgenie build -i custom.fa.gz -a bowtie2`.
 
+3. **It includes a Python API**. Tool developers can use `cfg = refgenie.RefGenConf("genomes.yaml")` to get a Python object with paths to any genome asset, *e.g.*, `cfg.hg38.kallisto`.
 
-```console
-pip install --user refgenie
-```
+4. When a new asset is downloaded, Refgenie can automatically update a local configuration file that acts as a sort of filesystem oracle for locally available genome assets. It's aware of the path to each resource that's been downloaded or otherwise declared.
+
+## Quick example
+
+### Downloading indexes and assets for a reference genome
 
-Update with:
 
 ```console
-pip install --user --upgrade refgenie
+refgenie pull --genome hg38 --asset bowtie2
 ```
 
-After that, you'll need to [install the genome indexers](install.md) -- but first, you can confirm that `refgenie` is functioning:
+Response:
+```console
+Starting pull for 'hg38/bowtie2'
+'hg38/bowtie2' archive size: 3.5GB
+Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
+...
+```
 
-## Quick start
+Pull many assets at once:
+```console
+refgenie pull --genome mm10 --asset kallisto TSS_enrichment mappability
+```
 
-See if your install worked by invoking `refgenie` from the command line:
+See [further reading on downloading assets](download.md).
 
-```
-refgenie -h
-```
+### Building your own indexes and assets for a reference genome
 
-If the `refgenie` executable in not automatically in your `$PATH`, add the following line to your `.bashrc` or `.profile` (or `.bash_profile` on MACOS):
 
 ```console
-export PATH=~/.local/bin:$PATH
+refgenie build --input hg38.fa.gz --asset kallisto
 ```
 
-Next, [install the genome indexers](install.md).
+See [further reading on building assets](build.md).
+
+If you want to read more about the motivation behind refgenie and the software engineering that makes refgenie work, proceed next to the [overview](overview.md).
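
The README hunk above describes a Python object that exposes asset paths as attributes (`cfg.hg38.kallisto`). A minimal sketch of how attribute-style access over a nested genome-to-asset mapping can work is shown below. The `AttrMap` class and the example path are hypothetical stand-ins for illustration only; this is not refgenie's or refgenconf's actual implementation.

```python
from collections.abc import Mapping


class AttrMap(Mapping):
    """Hypothetical read-only mapping whose keys are also readable as attributes."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        value = self._data[key]
        # Wrap nested mappings so chained access like cfg.hg38.kallisto works.
        return AttrMap(value) if isinstance(value, Mapping) else value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


# Made-up path; a real configuration would be read from a file such as genomes.yaml.
cfg = AttrMap({"hg38": {"kallisto": "/genomes/hg38/kallisto"}})
print(cfg.hg38.kallisto)
```

Key-style access (`cfg["hg38"]["kallisto"]`) returns the same value, which is useful when a genome or asset name is not a valid Python identifier.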
diff --git a/docs/autodoc_build/refgenconf.md b/docs/autodoc_build/refgenconf.md
@@ -0,0 +1,241 @@
+# Package refgenconf Documentation
+
+## Class MissingConfigDataError
+Missing required configuration instance items
+
+
+## Class MissingGenomeError
+Error type for request of unknown genome/assembly.
+
+
+## Class UnboundEnvironmentVariablesError
+Use of environment variable that isn't bound to a value.
+
+
+## Class MissingAssetError
+Error type for request of an unavailable genome asset.
+
+
+## Class RefGenConf
+A sort of oracle of available reference genome assembly assets
+
+
+### assets\_dict
+Map each assembly name to a list of available asset names.
+```python
+def assets_dict(self)
+```
+
+#### Returns:
+
+`Mapping[str, Iterable[str]]`:  mapping from assembly name to collection of available asset names.
+
+
+
+
+### assets\_str
+Create a block of text representing genome-to-asset mapping.
+```python
+def assets_str(self, offset_text='  ', asset_sep='; ', genome_assets_delim=': ')
+```
+
+#### Parameters:
+
+- `offset_text` -- `str`:  text that begins each line of the text representation that's produced
+- `asset_sep` -- `str`:  the delimiter between names of types of assets, within each genome line
+- `genome_assets_delim` -- `str`:  the delimiter to place between reference genome assembly name and its list of asset names
+
+
+#### Returns:
+
+`str`:  text representing genome-to-asset mapping
+
+
+
+
+### genomes\_list
+Get a list of this configuration's reference genome assembly IDs.
+```python
+def genomes_list(self)
+```
+
+#### Returns:
+
+`Iterable[str]`:  list of this configuration's reference genome assembly IDs
+
+
+
+
+### genomes\_str
+Get as a single string this configuration's reference genome assembly IDs.
+```python
+def genomes_str(self)
+```
+
+#### Returns:
+
+`str`:  single string that lists this configuration's known reference genome assembly IDs
+
+
+
+
+### get\_asset
+Get an asset for a particular assembly.
+```python
+def get_asset(self, genome_name, asset_name, strict_exists=True, check_exist=<function RefGenConf.<lambda> at 0x7fd52059c0d0>)
+```
+
+#### Parameters:
+
+- `genome_name` -- `str`:  name of a reference genome assembly of interest
+- `asset_name` -- `str`:  name of the particular asset to fetch
+- `strict_exists` -- `bool | NoneType`:  how to handle case in which path doesn't exist; True to raise IOError, False to raise RuntimeWarning, and None to do nothing at all
+- `check_exist` -- `function(callable) -> bool`:  how to check for asset/path existence
+
+
+#### Returns:
+
+`str`:  path to the asset
+
+
+#### Raises:
+
+- `TypeError`:  if the existence check is not a one-arg function
+- `refgenconf.MissingGenomeError`:  if the named assembly isn't known to this configuration instance
+- `refgenconf.MissingAssetError`:  if the named assembly is known to this configuration instance, but the requested asset is unknown
+
+
+
+
+### list\_assets\_by\_genome
+List types/names of assets that are available for one--or all--genomes.
+```python
+def list_assets_by_genome(self, genome=None)
+```
+
+#### Parameters:
+
+- `genome` -- `str | NoneType`:  reference genome assembly ID, optional; if omitted, the full mapping from genome to asset names is returned
+
+
+#### Returns:
+
+`Iterable[str] | Mapping[str, Iterable[str]]`:  collection of asset type names available for particular reference assembly if one is provided, else the full mapping between assembly ID and collection of available asset type names
+
+
+
+
+### list\_genomes\_by\_asset
+List assemblies for which a particular asset is available.
+```python
+def list_genomes_by_asset(self, asset=None)
+```
+
+#### Parameters:
+
+- `asset` -- `str | NoneType`:  name of type of asset of interest, optional
+
+
+#### Returns:
+
+`Iterable[str] | Mapping[str, Iterable[str]]`:  collection of assemblies for which the given asset is available; if asset argument is omitted, the full mapping from name of asset type to collection of assembly names for which the asset key is available will be returned.
+
+
+
+
+### list\_remote
+List genomes and assets available remotely.
+```python
+def list_remote(self, get_url=<function RefGenConf.<lambda> at 0x7fd52059c2f0>)
+```
+
+#### Parameters:
+
+- `get_url` -- `function(refgenconf.RefGenConf) -> str`:  how to determine URL request, given RefGenConf instance
+
+
+#### Returns:
+
+`str, str`:  text reps of remotely available genomes and assets
+
+
+
+
+### pull\_asset
+Download and possibly unpack one or more assets for a given ref gen.
+```python
+def pull_asset(self, genome, assets, genome_config, unpack=True, get_json_url=<function RefGenConf.<lambda> at 0x7fd52059c400>, get_main_url=None)
+```
+
+#### Parameters:
+
+- `genome` -- `str`:  name of a reference genome assembly of interest
+- `assets` -- `str`:  name(s) of particular asset(s) to fetch
+- `genome_config` -- `str`:  path to genome configuration file to update
+- `unpack` -- `bool`:  whether to unpack a tarball
+- `get_json_url` -- `function(str, str, str) -> str`:  how to build URL from genome server URL base, genome, and asset
+- `get_main_url` -- `function(str) -> str`:  how to get archive URL from main URL
+
+
+#### Returns:
+
+`Iterable[(str, str | NoneType)]`:  collection of pairs of asset name and folder name (key-value pair with which genome config file is updated) if pull succeeds, else asset key and a null value.
+
+
+#### Raises:
+
+- `TypeError`:  if the assets argument is neither string nor other Iterable
+- `refgenconf.UnboundEnvironmentVariablesError`:  if genome folder path contains any env. var. that's unbound
+
+
+
+
+### update\_genomes
+Updates the genomes in RefGenConf object at any level. If a requested genome-asset mapping is missing, it will be created.
+```python
+def update_genomes(self, genome, asset=None, data=None)
+```
+
+#### Parameters:
+
+- `genome` -- `str`:  genome to be added/updated
+- `asset` -- `str`:  asset to be added/updated
+- `data` -- `Mapping`:  data to be added/updated
+
+
+#### Returns:
+
+`RefGenConf`:  updated object
+
+
+
+
+## Class RefgenconfError
+Base exception type for this package
+
+
+## Class GenomeConfigFormatError
+Exception for invalid genome config file format.
+
+
+### select\_genome\_config
+Get path to genome configuration file.
+```python
+def select_genome_config(filename, conf_env_vars=None)
+```
+
+#### Parameters:
+
+- `filename` -- `str`:  name/path of genome configuration file
+- `conf_env_vars` -- `Iterable[str]`:  names of environment variables to consider; basically, a prioritized search list
+
+
+#### Returns:
+
+`str`:  path to genome configuration file
+
+
+
+
+
+**Version Information**: `refgenconf` v0.1.1, generated by `lucidoc` v0.4dev
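
The `get_asset` entry above documents a three-way `strict_exists` switch: `True` raises `IOError`, `False` gives a `RuntimeWarning`, and `None` skips the check. A minimal sketch of that control flow follows. The helper name `resolve_asset_path` is hypothetical, the default existence check (`os.path.exists`) is an assumption, and "raise RuntimeWarning" is interpreted here as emitting a warning through the `warnings` module; the actual refgenconf code may differ on all three points.

```python
import os
import warnings


def resolve_asset_path(path, strict_exists=True, check_exist=os.path.exists):
    """Return path, reacting to a missing path according to strict_exists.

    Hypothetical helper that mirrors the documented get_asset parameters.
    """
    if not callable(check_exist):
        # The docs require a one-argument function; this sketch only checks callability.
        raise TypeError("check_exist must be a callable taking one argument")
    if strict_exists is None or check_exist(path):
        # Either checking is disabled or the path passed the check.
        return path
    message = "Asset path does not exist: {}".format(path)
    if strict_exists:
        raise IOError(message)
    # strict_exists is False: warn, but still hand back the path.
    warnings.warn(message, RuntimeWarning)
    return path


# Usage with a stand-in check that treats every path as missing.
never_exists = lambda p: False
print(resolve_asset_path("/no/such/asset", strict_exists=None, check_exist=never_exists))
```

Making the existence check injectable keeps the function testable without touching the filesystem, which is presumably why the documented signature exposes `check_exist` as a parameter.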
diff --git a/docs/build.md b/docs/build.md
@@ -1,29 +1,45 @@
-# Building genome indexes with Refgenie
+# Building genome indexes with refgenie
 
-Indexing your own reference genome is as easy as 1-2-3:
+Once you've [installed refgenie](install.md), you can use `refgenie pull` to [download pre-built assets](download.md) without installing any additional software. If you want to build your own, you'll also need to install the building software for the asset you want to build. You have two choices to get that software: you can either [install building software natively](#install_building_software_natively), or use a [docker image](#docker).
 
-1. Install refgenie
-2. [Install genome indexers](/install) or the [docker image](#docker).
-3. Run refgenie with: `refgenie -i INPUT_FILE.fa`. (INPUT_FILE is a fasta file of your reference genome, and can be either a local file or a URL)
+Once you're set up with all the additional software, you simply run `refgenie build`, passing it any necessary input files called for by the asset recipe. Further documentation on building specific assets is forthcoming.
 
-## Customizing indexes
+## Install building software natively
 
+Refgenie expects to find in your `PATH` any tools needed for building a desired asset. You'll need to follow the installation instructions for each of these tools individually. You can find some basic ideas for how to install them programmatically in the [dockerfile](https://github.com/databio/refgenie/blob/dev/containers/Dockerfile_refgenie).
 
-Refgenie currently builds indexes for bowtie2, hisat2, bismark (for DNA methylation), and others. You can find the complete list in the [config file](https://github.com/databio/refgenie/blob/dev/refgenie/refgenie.yaml). These are all optional; you only have to build indexes for ones you intend to use. You can also add more later. If you don't pass along a configuration file to `refgenie`, it will simply use this one, building these indexes. If you want to toggle some of them, you may choose which indexes you want to include by toggling them. Just duplicate and edit the config file and pass it to refenie like this:
-
+Refgenie knows how to build indexes for bowtie2, hisat2, bismark, and other common tools. You can find the complete list in the [config file](https://github.com/databio/refgenie/blob/dev/refgenie/refgenie.yaml). These are all optional; you only have to build indexes for ones you intend to use. You can also add more later. If you don't pass along a configuration file to `refgenie`, it will simply use that one, building those indexes. If you want to choose a subset, copy the config file, edit it as desired, and pass it to `refgenie` like this:
 ```
 wget https://raw.githubusercontent.com/databio/refgenie/master/refgenie/refgenie.yaml
-refgenie -c refgenie.yaml
+refgenie build -c refgenie.yaml
 ```
 
-## Adding GTFs
+### Adding GTFs
 
 Refgenie also allows you to add information in the form of a GTF file, which provides gene annotation.
 
 
-#### Optional
+## Docker
+
+If you don't want to install all those indexers (and I don't blame you), then you may be interested in my docker image on DockerHub (nsheff/refgenie) that has all of these packages pre-installed, so you can run the complete indexer without worrying about paths and packages. Just clone this repo and run it with the `-d` flag. For example:
+
+```
+~/code/refgenie/refgenie/refgenie.py --input rn6.fa --outfolder $HOME -d
+```
 
-* Set an environment shell variable called `GENOMES` to point to where you want your references saved.
+### Building the container
 
+You can build the docker container yourself like this:
 
+```
+git clone https://github.com/databio/refgenie.git
+cd refgenie/containers
+make refgenie
+```
+
+### Pulling the container
+
+```
+docker pull nsheff/refgenie
+```