Merge pull request #70 from databio/dev
0.4.4
nsheff committed Jul 1, 2019
2 parents 9756bcf + be2849a commit ba44830
Showing 11 changed files with 159 additions and 48 deletions.
21 changes: 15 additions & 6 deletions docs/README.md
@@ -34,20 +34,19 @@ refgenie listr
### Downloading indexes and assets for a reference genome

```console
refgenie pull --genome hg38 --asset bowtie2
refgenie pull --genome hg38 --asset bowtie2_index
```

Response:
```console
Starting pull for 'hg38/bowtie2'
'hg38/bowtie2' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive
...
Starting pull for 'hg38/bowtie2_index'
'hg38/bowtie2_index' archive size: 3.5GB
Downloading URL: http://refgenomes.databio.org/asset/hg38/bowtie2/archive ...
```

Pull many assets at once:
```console
refgenie pull --genome mm10 --asset kallisto TSS_enrichment mappability
refgenie pull --genome mm10 --asset bowtie2_index hisat2_index
```

See [further reading on downloading assets](download.md).
@@ -61,4 +60,14 @@ refgenie build --genome hg38 --asset kallisto_index --fasta hg38.fa.gz

See [further reading on building assets](build.md).

### Retrieving paths to refgenie-managed assets

Once you've populated your refgenie with a few assets, it's easy to get paths to them:

```console
refgenie seek --genome mm10 --asset bowtie2_index
```

This will return the path to the particular asset of interest, regardless of your computing environment. This gives you an ultra-portable asset manager!
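
For scripts that shell out rather than import the Python API, the `seek` call above can be wrapped in a few lines. This is a minimal sketch, assuming `refgenie` is on your `PATH` and a genome configuration is already set up; the helper names `seek_command` and `seek_path` are hypothetical and not part of refgenie.

```python
import subprocess


def seek_command(genome, asset, config=None):
    """Build the argument list for `refgenie seek` (hypothetical helper)."""
    cmd = ["refgenie", "seek", "--genome", genome, "--asset", asset]
    if config is not None:
        # -c points refgenie at a specific genome configuration file
        cmd += ["-c", config]
    return cmd


def seek_path(genome, asset, config=None):
    """Run `refgenie seek` and return what it prints, stripped of whitespace."""
    result = subprocess.run(
        seek_command(genome, asset, config),
        check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Calling `seek_path("mm10", "bowtie2_index")` would then hand your script the same path the command line prints, provided the asset has been pulled or built.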

If you want to read more about the motivation behind refgenie and the software engineering that makes refgenie work, proceed next to the [overview](overview.md).
46 changes: 23 additions & 23 deletions docs/autodoc_build/refgenconf.md
@@ -1,5 +1,25 @@
# Package refgenconf Documentation

## Class GenomeConfigFormatError
Exception for invalid genome config file format.


## Class MissingAssetError
Error type for request of an unavailable genome asset.


## Class UnboundEnvironmentVariablesError
Use of environment variable that isn't bound to a value.


## Class RefgenconfError
Base exception type for this package


## Class MissingGenomeError
Error type for request of unknown genome/assembly.


## Class RefGenConf
A sort of oracle of available reference genome assembly assets

@@ -97,7 +117,7 @@ def genomes_str(self, order=None)
### get\_asset
Get an asset for a particular assembly.
```python
def get_asset(self, genome_name, asset_name, strict_exists=True, check_exist=<function RefGenConf.<lambda> at 0x7f9b5c8f9378>)
def get_asset(self, genome_name, asset_name, strict_exists=True, check_exist=<function RefGenConf.<lambda> at 0x7f56390c3378>)
```

#### Parameters:
@@ -181,7 +201,7 @@ def list_local(self, order=None)
### list\_remote
List genomes and assets available remotely.
```python
def list_remote(self, get_url=<function RefGenConf.<lambda> at 0x7f9b5c8f9620>, order=None)
def list_remote(self, get_url=<function RefGenConf.<lambda> at 0x7f56390c3620>, order=None)
```

#### Parameters:
@@ -200,7 +220,7 @@ def list_remote(self, get_url=<function RefGenConf.<lambda> at 0x7f9b5c8f9620>,
### pull\_asset
Download and possibly unpack one or more assets for a given ref gen.
```python
def pull_asset(self, genome, assets, genome_config, unpack=True, force=None, get_json_url=<function RefGenConf.<lambda> at 0x7f9b5c8f9730>, get_main_url=None, build_signal_handler=<function _handle_sigint at 0x7f9b5ce178c8>)
def pull_asset(self, genome, assets, genome_config, unpack=True, force=None, get_json_url=<function RefGenConf.<lambda> at 0x7f56390c3730>, get_main_url=None, build_signal_handler=<function _handle_sigint at 0x7f56395e78c8>)
```

#### Parameters:
@@ -248,30 +268,10 @@ def update_genomes(self, genome, asset=None, data=None)



## Class MissingGenomeError
Error type for request of unknown genome/assembly.


## Class MissingConfigDataError
Missing required configuration instance items


## Class UnboundEnvironmentVariablesError
Use of environment variable that isn't bound to a value.


## Class RefgenconfError
Base exception type for this package


## Class GenomeConfigFormatError
Exception for invalid genome config file format.


## Class MissingAssetError
Error type for request of an unavailable genome asset.


### select\_genome\_config
Get path to genome configuration file.
```python
2 changes: 1 addition & 1 deletion docs/build.md
@@ -2,7 +2,7 @@

Once you've [installed refgenie](install.md), you can use `refgenie pull` to [download pre-built assets](download.md) without installing any additional software. However, you may need to use the `build` function for genomes or assets that are not available on the server.

If you want to build assets, you'll need to install the building software for the asset you want to build. You have two choices to get that software, you can either [install building software natively](#install_building_software_natively), or use a [docker image](#docker). Once you're set up with all the additional software, you simply run `refgenie build`, passing it any necessary input arguments called for by the asset recipe.
If you want to build assets, you'll need to install the building software for the asset you want to build. You have two choices to get that software: you can either install building software natively, or use a docker image. Once you're set up, you simply run `refgenie build`, passing it any necessary input arguments called for by the asset recipe.

## Install building software natively

6 changes: 5 additions & 1 deletion docs/changelog.md
@@ -2,9 +2,13 @@

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.

## [0.4.4] - 2019-07-01
### Added
- `add` subcommand

## [0.4.3] - 2019-06-21
### Changed
- Re-envisioned the build process, so that individual assets are built
- Build process now builds individual assets

## [0.4.2] - 2019-06-18
### Added
33 changes: 33 additions & 0 deletions docs/igenomes.md
@@ -0,0 +1,33 @@
# Using refgenie with iGenomes

If you're already using iGenomes, it's easy to configure refgenie to use your existing folder structure. [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is a project that provides sequences and annotation files for commonly analyzed organisms. Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.

Initialize a refgenie config file if you don't have one you want to use for your iGenomes assets:

```console
export REFGENIE='genome_config.yaml'
refgenie init -c $REFGENIE
```

And then add individual assets you want refgenie to track with `refgenie add`:

```console
refgenie add -g GENOME -a ASSET -p RELATIVE_PATH
```

So, for example,

```console
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Caenorhabditis_elegans/UCSC/ce10/Caenorhabditis_elegans_UCSC_ce10.tar.gz
tar -xf Caenorhabditis_elegans_UCSC_ce10.tar.gz
refgenie init -c igenome_config.yaml
refgenie add -c igenome_config.yaml -g ce10 --asset bowtie2_index --path Caenorhabditis_elegans/UCSC/ce10/Sequence/Bowtie2Index
```
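
Judging from `refgenie_add` in this commit, the value given to `--path` is passed through Python's `str.format` with the variables `genome`, `asset`, and `asset_outfolder` before it is written to the config, so a path may contain placeholders such as `{genome}`. The sketch below mirrors that logic in isolation; the folder `/genomes` and the template string are made-up illustrations, not values refgenie ships with.

```python
import os


def get_asset_vars(genome, asset_key, outfolder, specific_args=None):
    """Collect the variables available for populating an asset path."""
    asset_vars = {"genome": genome,
                  "asset": asset_key,
                  "asset_outfolder": os.path.join(outfolder, asset_key)}
    if specific_args:
        asset_vars.update(specific_args)
    return asset_vars


# Illustrative values only; the real outfolder is derived from the config's genome folder
asset_vars = get_asset_vars("ce10", "bowtie2_index", os.path.join("/genomes", "ce10"))

# A plain path has no placeholders, so formatting leaves it unchanged
plain = "Caenorhabditis_elegans/UCSC/ce10/Sequence/Bowtie2Index".format(**asset_vars)

# A templated path gets {genome} and {asset} filled in
templated = "{genome}/Sequence/{asset}".format(**asset_vars)
print(plain)
print(templated)
```

One consequence of this design is that a path containing literal braces that do not name one of these variables would make `str.format` raise a `KeyError`.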

Now we can `seek` any of those assets:

```console
refgenie seek -c igenome_config.yaml -g ce10 --asset bowtie2_index
```

This way, refgenie tracks your existing iGenomes assets, so you can move away from the fixed iGenomes folder structure and transition to refgenie-managed paths.
8 changes: 8 additions & 0 deletions docs/install.md
@@ -33,3 +33,11 @@ Now you can use the `list` command to show local assets (which will be empty at
refgenie list
refgenie listr
```

# Seeking assets

Use the `seek` command to get paths to local assets you have already built or pulled:

```console
refgenie seek -g GENOME -a ASSET
```
16 changes: 15 additions & 1 deletion docs/refgenconf.md
@@ -24,4 +24,18 @@ Use this to show all available remote assets:
rgc.list_remote()
```

See the complete [refgenie python API](/autodoc_build/refgenconf) for details.
In a tool, you're probably most interested in using refgenie to locate reference genome assets, for which you want to use the `get_asset` function. For example:

```python
# identify genome
genome = "hg38"

# get the local path to bowtie2 indexes:
bt2idx = rgc.get_asset(genome, "bowtie2_index")

# run bowtie2...
```

This enables you to write Python software that works in any computing environment without passing around brittle, environment-specific file paths.
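
A tool may also want to fail with a readable message when an asset is unavailable. The sketch below wraps `get_asset` in a hypothetical helper, `require_asset`, which accepts any object exposing `get_asset(genome, asset)`. The `_FakeConf` class is a stand-in so the example runs without refgenconf or a real config file; it is not part of the package. With a real `RefGenConf` you would pass the package's own exception types (the API docs list `MissingGenomeError` and `MissingAssetError`) instead of the `KeyError` default assumed here.

```python
import sys


class _FakeConf:
    """Stand-in for RefGenConf so this sketch runs without refgenconf installed."""

    def __init__(self, assets):
        self._assets = assets  # maps (genome, asset) -> path

    def get_asset(self, genome_name, asset_name):
        return self._assets[(genome_name, asset_name)]


def require_asset(rgc, genome, asset, missing_errors=(KeyError,)):
    """Return the asset path, or exit with a readable message if lookup fails."""
    try:
        return rgc.get_asset(genome, asset)
    except missing_errors:
        sys.exit("Asset '{}/{}' not found; try `refgenie pull` or `refgenie build`."
                 .format(genome, asset))


rgc = _FakeConf({("hg38", "bowtie2_index"): "/genomes/hg38/bowtie2_index"})
bt2idx = require_asset(rgc, "hg38", "bowtie2_index")
print(bt2idx)
```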

See the complete [refgenie python API](/autodoc_build/refgenconf) for more details.
2 changes: 1 addition & 1 deletion docs/refgenieserver.md
@@ -4,6 +4,6 @@

Typically, you'll only need to run refgenie either from the command-line or via the Python API. These clients can interact with the refgenie server to pull down genome assets. But what if you want to build your own server?

Though we don't anticipate many people wanting to run their own servers, there are a few use cases where this can make sense. First, perhaps you want a private, local server running on your internal network. This could speed up access to refgenie assets. Another reason is that you may want to make some particular assets available to the community. Building on the refgenie infrastructure will simplify distribution for your, and make it so that your users can download your resource through a familiar interface.
Though we don't anticipate many people wanting to run their own servers, there are a few use cases where this can make sense. First, perhaps you want a private, local server running on your internal network. This could speed up access to refgenie assets. Another reason is that you may want to make some particular assets available to the community. Building on the refgenie infrastructure will simplify distribution and make it so that your users can download your resource through a familiar interface.

The software that runs refgenie server is [available on GitHub](http://github.com/databio/refgenieserver). There, you will find detailed instructions on how to run it yourself.
10 changes: 9 additions & 1 deletion mkdocs.yml
@@ -13,8 +13,9 @@ nav:
- Download pre-built assets: download.md
- Build genome assets: build.md
- Use refgenie from python: refgenconf.md
- Use refgenie with iGenomes: igenomes.md
- Use external assets: external_assets.md
- Run my own server: refgenieserver.md
- Run my own asset server: refgenieserver.md
- Reference:
- Genome configuration file: genome_config.md
- Glossary: glossary.md
@@ -33,3 +34,10 @@ plugins:
autodoc_package: "refgenconf"
no_top_level: true
- search


navbar:
left:
- text: Refgenomes server
icon: fa-server
href: http://refgenomes.databio.org
2 changes: 1 addition & 1 deletion refgenie/_version.py
@@ -1 +1 @@
__version__ = "0.4.3"
__version__ = "0.4.4"
61 changes: 48 additions & 13 deletions refgenie/refgenie.py
@@ -26,6 +26,7 @@
LIST_LOCAL_CMD = "list"
LIST_REMOTE_CMD = "listr"
GET_ASSET_CMD = "seek"
INSERT_CMD = "add"


BUILD_SPECIFIC_ARGS = ('fasta', 'gtf', 'context')
@@ -75,8 +76,9 @@ def add_subparser(cmd, description):
LIST_LOCAL_CMD: "List available local genomes.",
LIST_REMOTE_CMD: "List available genomes and assets on server.",
PULL_CMD: "Download assets.",
BUILD_CMD: "Build genome assets",
GET_ASSET_CMD: "Get the path to a local asset"
BUILD_CMD: "Build genome assets.",
GET_ASSET_CMD: "Get the path to a local asset.",
INSERT_CMD: "Insert a local asset into the configuration file."
}

sps = {}
@@ -108,7 +110,7 @@ def add_subparser(cmd, description):
help='Override the default path to genomes folder, which is the '
'genome_folder attribute in the genome configuration file.')

for cmd in [PULL_CMD, GET_ASSET_CMD, BUILD_CMD]:
for cmd in [PULL_CMD, GET_ASSET_CMD, BUILD_CMD, INSERT_CMD]:
sps[cmd].add_argument(
"-g", "--genome", required=True,
help="Reference assembly ID, e.g. mm10")
@@ -120,6 +122,10 @@ def add_subparser(cmd, description):
"-u", "--no-untar", action="store_true",
help="Do not extract tarballs.")

sps[INSERT_CMD].add_argument(
"-p", "--path", required=True,
help="Relative path to asset")

# Finally, arguments to the build command to give the files needed to do
# the building. These should eventually move to a more flexible system that
# doesn't require them to be hard-coded here in order to be recognized
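
The argument wiring for the new `add` subcommand registered in this hunk can be reproduced in isolation. The following is a simplified sketch, not the parser from `refgenie.py`: the `nargs="+"` on `--asset` is an assumption inferred from `main()` treating `args.asset` as a list, the `--asset` help text is invented, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Minimal parser with only an `add` subcommand (simplified sketch)."""
    parser = argparse.ArgumentParser(prog="refgenie")
    subparsers = parser.add_subparsers(dest="command")
    add = subparsers.add_parser(
        "add", help="Insert a local asset into the configuration file.")
    add.add_argument("-g", "--genome", required=True,
                     help="Reference assembly ID, e.g. mm10")
    # Assumed: assets are collected into a list, as main() expects
    add.add_argument("-a", "--asset", required=True, nargs="+",
                     help="Name of one or more assets")
    add.add_argument("-p", "--path", required=True,
                     help="Relative path to asset")
    return parser


args = build_parser().parse_args(
    ["add", "-g", "ce10", "-a", "bowtie2_index", "-p", "Sequence/Bowtie2Index"])
if len(args.asset) > 1:
    raise NotImplementedError("Can only add 1 asset at a time")
asset = args.asset[0]  # recast from list to str, as main() does
print(args.command, args.genome, asset, args.path)
```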
@@ -177,6 +183,26 @@ def default_config_file():
return os.path.join(os.path.dirname(__file__), "refgenie.yaml")


def get_asset_vars(genome, asset_key, outfolder, specific_args=None):
    """
    Gives a dict with variables used to populate an asset path.
    """
    asset_outfolder = os.path.join(outfolder, asset_key)
    asset_vars = {"genome": genome,
                  "asset": asset_key,
                  "asset_outfolder": asset_outfolder}
    if specific_args:
        asset_vars.update(specific_args)
    return asset_vars


def refgenie_add(rgc, args):
    """
    Adds a local asset to the genome configuration file.
    """
    outfolder = os.path.abspath(os.path.join(rgc.genome_folder, args.genome))
    asset_vars = get_asset_vars(args.genome, args.asset, outfolder)
    # Expand any {genome}/{asset}/{asset_outfolder} placeholders in the path
    rgc.update_genomes(args.genome, args.asset, {"path": args.path.format(**asset_vars)})
    # Write the updated refgenie genome configuration
    rgc.write()

def refgenie_build(rgc, args):
"""
Runs the refgenie build recipe.
@@ -225,7 +251,7 @@ def path_data(root, c):



def build_asset(genome, asset_key, asset_build_package, specific_args):
def build_asset(genome, asset_key, asset_build_package, outfolder, specific_args):
"""
Builds assets with pypiper and updates a genome config file.
@@ -239,12 +265,7 @@ def build_asset(genome, asset_key, asset_build_package, specific_args):
assets.
"""
_LOGGER.debug("Asset build package: " + str(asset_build_package))

asset_outfolder = os.path.join(outfolder, asset_key)
asset_vars = {"genome": genome,
"asset": asset_key,
"asset_outfolder": asset_outfolder}
asset_vars.update(specific_args)
asset_vars = get_asset_vars(genome, asset_key, outfolder, specific_args)


print(str([x.format(**asset_vars) for x in asset_build_package["command_list"]]))
@@ -282,18 +303,24 @@ def build_asset(genome, asset_key, asset_build_package, specific_args):
volumes = outfolder
pm.get_container("nsheff/refgenie", volumes)


for asset_key in args.asset:
if asset_key in asset_build_packages.keys():
asset_build_package = asset_build_packages[asset_key]
_LOGGER.debug(specific_args)
required_inputs = ", ".join(asset_build_package["required_inputs"])
_LOGGER.info("Inputs required to build '{}': {}".format(asset_key, required_inputs))
for required_input in asset_build_package["required_inputs"]:
if not specific_args[required_input]:
raise ValueError("Argument '{}' is required to build asset '{}', but not provided".format(required_input, asset_key))

for required_asset in asset_build_package["required_assets"]:
if not rgc.get_asset(args.genome, required_asset):
raise ValueError("Asset '{}' is required to build asset '{}', but not provided".format(required_input, asset_key))
build_asset(args.genome, asset_key, asset_build_package, specific_args)
try:
if not rgc.get_asset(args.genome, required_asset):
raise ValueError("Asset '{}' is required to build asset '{}', but not provided".format(required_asset, asset_key))
except refgenconf.exceptions.MissingGenomeError:
raise ValueError("Asset '{}' is required to build asset '{}', but not provided".format(required_asset, asset_key))
build_asset(args.genome, asset_key, asset_build_package, outfolder, specific_args)
else:
_LOGGER.warning("Recipe does not exist for asset '{}'".format(asset_key))

@@ -515,6 +542,14 @@ def main():
print(" ".join([rgc.get_asset(args.genome, asset) for asset in args.asset]))
return

elif args.command == INSERT_CMD:
if len(args.asset) > 1:
raise NotImplementedError("Can only add 1 asset at a time")
else:
# recast from list to str
args.asset = args.asset[0]
refgenie_add(rgc, args)

elif args.command == PULL_CMD:
outdir = rgc[CFG_FOLDER_KEY]
if not os.path.exists(outdir):