
Automating asset building at scale #254

Open
nsheff opened this issue Jun 3, 2021 · 29 comments
@nsheff
Contributor

nsheff commented Jun 3, 2021

Related to refgenie/refgenomes.databio.org#3, #251

Motivation

We would like a system that's capable of building lots of assets concurrently. This would be useful for creating new servers or adding lots of assets to existing servers.

The refgenie build function was originally designed to be used by small groups to add local assets to those pulled from a server. But we eventually realized that we can use refgenie build to create the very assets that would be served by refgenieserver -- which is super nice. This works well enough that we used refgenie build to produce all the files we're currently serving, but as the number of assets we want to serve increases, this has led to several challenges.

Problem 1: Concurrency

The config file is required for building an asset with refgenie build. Even if the computing all happens in a shared environment, hundreds of tasks writing to the config file at the same time is not a sustainable load for our homebrew file-locking mechanism. Furthermore, if the compute tasks are ephemeral, it's not possible at all. How do we get the refgenie config file into each compute job? They can't each write and then push it independently; they would merge-conflict.

Thus, there needs to be some central area outside the compute jobs that retains the metadata and that they can communicate with (like a third-party database). Could we use a database for the config file?
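One way to sidestep the file-locking problem is for each build job to write its metadata as a row in a central database rather than rewriting a shared YAML file. A minimal sketch (not refgenie's actual schema -- the table layout, file name, and digest values here are hypothetical) using SQLite, which already serializes concurrent writers:

```python
import sqlite3

def record_asset(db_path, genome, asset, tag, digest):
    """Record one built asset as a single row in a central SQLite file.

    Each build job inserts its own row instead of rewriting a shared
    YAML config, so no job ever holds a lock on the whole config.
    """
    conn = sqlite3.connect(db_path, timeout=30)  # wait for the write lock
    with conn:  # commits (or rolls back) the transaction
        conn.execute(
            "CREATE TABLE IF NOT EXISTS assets ("
            "genome TEXT, asset TEXT, tag TEXT, digest TEXT, "
            "PRIMARY KEY (genome, asset, tag))"
        )
        conn.execute(
            "INSERT OR REPLACE INTO assets VALUES (?, ?, ?, ?)",
            (genome, asset, tag, digest),
        )
    conn.close()

# Two independent "build jobs" recording their results:
record_asset("refgenie.db", "rCRSd", "fasta", "default", "abc123")
record_asset("refgenie.db", "rCRSd", "bwa_index", "default", "def456")
```

A real server-backed database would scale further, but even this shape removes the single contended config file.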

Problem 2: Dependencies

Some of these jobs depend on other jobs, because some assets use existing assets as input. For example, you can't build a bowtie2_index asset until you have a fasta asset built. The degree of dependency is quite small; it's not a complicated dependency graph. But there is definitely some need to manage jobs; we can't just submit them all at once. In the past we've managed this by submitting them by asset type: first build all the fasta assets, then build all the bowtie2_index assets. This is essentially manual dependency management.
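That "fasta first, then indexes" ordering is just a topological sort of a small dependency graph. A minimal sketch using Python's stdlib (the asset names are real refgenie asset types, but this particular dependency map is illustrative):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency map: each asset lists the assets it needs as input.
deps = {
    "fasta": [],
    "fasta_txome": ["fasta"],
    "bowtie2_index": ["fasta"],
    "salmon_index": ["fasta_txome"],
}

# static_order() emits every asset after all of its dependencies,
# which is exactly the manual "fasta first, then indexes" ordering.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Workflow engines do this ordering for you, but for a graph this shallow the sort itself is trivial.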

sub-Problem 2.1: Required assets aren't built yet

You can only submit a job once its dependencies have been built.

sub-Problem 2.2: Required assets exist...but aren't on this ephemeral compute environment

Even if the dependencies are built in correct order, the current system would fail on ephemeral compute, because the job for the child asset won't have access to the parent asset. To make them ephemeral, they'd need to be able to retrieve what they need from some central source instead of expecting to find it locally.

Goal

We would like to redesign refgenie build, or design a new build system, so that it can:

  • work with ephemeral compute
  • manage asset dependencies
  • work with high concurrency

Solution brainstorming

Process serially

If you just run jobs serially, then you're fine. But that would increase wall-clock time (not compute time) dramatically.

Use a database for the config file

Moving the config file into a database sounds nice.

Dependency management

We've been using looper to submit jobs, and managing dependencies manually. The fact is, looper is non-dependency-aware -- by design. So it's not the right tool for this job.

One thought is that a job script could do a refgenie pull if a parent asset doesn't exist.

@nsheff
Contributor Author

nsheff commented Jun 3, 2021

Some notes on the concurrency issue:

Can we make each asset build independent? So when you build an asset, you record the information in a separate YAML file for that asset. Then there could be a separate "agglomerate" step that refreshes the config file. This could be a 'parallel' mode used only for batch computing?
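The per-asset-file idea could look something like the following sketch: each build job writes its own small record, and a separate "agglomerate" step folds them into one config. JSON stands in for YAML here to keep the example stdlib-only, and the directory layout and field names are hypothetical:

```python
import glob
import json
import os

def write_asset_record(outdir, genome, asset, metadata):
    # "Map" step: each build job writes its own small file; no shared lock.
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, f"{genome}__{asset}.json")
    with open(path, "w") as f:
        json.dump({"genome": genome, "asset": asset, **metadata}, f)

def agglomerate(outdir):
    # "Reduce" step: fold every per-asset record into a single config dict.
    config = {}
    for path in sorted(glob.glob(os.path.join(outdir, "*.json"))):
        with open(path) as f:
            rec = json.load(f)
        config.setdefault(rec["genome"], {})[rec["asset"]] = rec
    return config

write_asset_record("build_records", "rCRSd", "fasta", {"digest": "abc"})
write_asset_record("build_records", "rCRSd", "bwa_index", {"digest": "def"})
config = agglomerate("build_records")
```

Because each job touches only its own file, no locking is needed until the single agglomerate step runs.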

@pinin4fjords

I think that the one-file-per-asset solution (or maybe per assembly, to keep dependency handling simpler), with a final index or agglomerate step, may be sufficient, and would keep things simple for e.g. cloud deployments.

There are other desirable things to store in an easy-to-query way IMO (software versions, parameters, genome releases etc), so whatever solution is chosen should probably be able to accommodate those, maybe that's a DB.

Lots of solutions out there to model dependencies, I tend to alternate between Nextflow and Snakemake, but obviously you have CWL/ WDL + Cromwell etc too if you're so inclined. These would also allow you to abstract from compute and make things runnable in Grid (LSF, SGE etc) or cloud.

@pinin4fjords

pinin4fjords commented Jun 3, 2021

As a first step, just allowing a merge of arbitrary refgenie databases would allow workarounds. For our use case we could assemble the assets associated with each assembly in separate databases initially (respecting dependencies), then combine them as a final step. That would greatly reduce the number of processes writing to single files.

@pinin4fjords

pinin4fjords commented Jun 3, 2021

A bit tangential, but just to stress the point: I do think structured metadata is crucial, so you can do queries like:

"Hi Refgenie, give me a Salmon index of the Ensembl 104 release of Human genome assembly GRCh38 combined with ERCC spike-ins, made with Salmon version 1.2 and a kmer length of 45"

The killer application for Refgenie is to stop people from having to run costly indexing processes all the time, but if I can't easily see (and query, like above) the software and parameters used to produce the indices, then I can't trust the indices I find on http://refgenomes.databio.org/, for example.

The most awesome thing possible (and this is a bit of a pipe dream requiring funding and infrastructure) would be if I could submit a 'recipe' somewhere, like I do a Bioconda recipe, specifying input assets as either URIs or pre-existing assets (with associated metadata) plus software parameters, and automatically have some build system provide the genome resources to the community via widely used channels. Done properly, that would be the nicest, most visible way to make things available to the community.

@nsheff
Contributor Author

nsheff commented Jun 3, 2021

Yeah -- nicely articulated. This is exactly my vision! I've written this in my grant applications, but so far, haven't been able to convince enough reviewers that it's a good idea.

@pinin4fjords

Ahh- glad to hear that's the way you're thinking anyway, hopefully we can find a way of making something similar happen.

@pcm32

pcm32 commented Jun 14, 2021

I like the idea of one file per asset while we wait for a larger overhaul with an RDBMS. This is a solution many tools use for config files (the typical /etc/<your-tool>/conf.d/ that can contain many files, each watched independently), leading to independent locks.

@stolarczyk
Contributor

That's what we've done. The "MapReduce" framework is now supported in the development version of refgenie and is documented here: http://refgenie.databio.org/en/dev/build/#build-assets-concurrently.

We will release this early this week. In the meantime, if you wish to test it (which would be great), you'll need to install the following packages:

pip install git+git://github.com/databio/yacman@dev#egg=yacman
pip install git+git://github.com/refgenie/refgenconf@dev#egg=refgenconf
pip install git+git://github.com/refgenie/refgenie@dev#egg=refgenie

And then you should be able to do the following, for example:

# rm -r data alias r.yml
export REFGENIE=$(pwd)/r.yml
refgenie init -c $REFGENIE
refgenie pull rCRSd/fasta
refgenie build rCRSd/bwa_index --map
refgenie build rCRSd/hisat2_index --map
refgenie build rCRSd/bowtie2_index --map
refgenie build rCRSd/star_index --map
refgenie build --reduce

Feedback is welcome!

@pinin4fjords

Thanks @stolarczyk , I'm testing this now and will get back to you.

@pinin4fjords

pinin4fjords commented Jun 17, 2021

I'm assuming (and some initial testing seems to confirm) that dependencies of the indexing operations (fasta, fasta_txome) do need to exist in the central database before the indexing operations run. So the builds for those entities need to be done without --map at low concurrency, or else done en masse with a --reduce operation run before the indexing.

Assuming that's correct, maybe it could be made clear in the docs.

To be clear, what I'm doing is:

  1. Build all fasta genome assets without concurrency limits, with --map.
  2. Run a reduce operation when all the genomes are built
  3. Build all genome-dependent assets (fasta_txome cDNAs, hisat indices etc) without concurrency limits, with --map
  4. Run a reduce operation when all the fasta_txome cDNAs are built.
  5. Build all cDNA-dependent operations (salmon, kallisto indices etc) without concurrency limits, with --map
  6. Run a reduce for each assembly once all its assets are complete (making sure to only run one reduce at a time)
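The phased strategy above amounts to grouping assets into dependency "waves", with one reduce between waves. A sketch of that grouping (the asset names and dependency map are illustrative, and the refgenie invocations appear only as comments):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency map: asset -> assets it is built from.
deps = {
    "fasta": [],
    "fasta_txome": ["fasta"],
    "hisat2_index": ["fasta"],
    "bowtie2_index": ["fasta"],
    "salmon_index": ["fasta_txome"],
    "kallisto_index": ["fasta_txome"],
}

def waves(graph):
    """Yield batches of assets whose dependencies are all satisfied."""
    ts = TopologicalSorter(graph)
    ts.prepare()
    while ts.is_active():
        ready = ts.get_ready()
        yield ready
        ts.done(*ready)

for wave in waves(deps):
    # For each wave: submit `refgenie build <genome>/<asset> --map` jobs
    # concurrently, wait for all of them, then run a single
    # `refgenie build --reduce` before starting the next wave.
    print("map concurrently:", sorted(wave), "-> then one reduce")
```

With this dependency map the waves come out as {fasta}, then {fasta_txome, hisat2_index, bowtie2_index}, then {salmon_index, kallisto_index}, matching steps 1-6 above.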

@pinin4fjords

pinin4fjords commented Jun 17, 2021

Step 4 above is giving me the familiar error:

> refgenie build --reduce -c /path/to/genome_config.yaml
Running the reduce procedure. No assets will be built.
Reducing 21 configs                                            0% -:--:--Traceback (most recent call last):
  File "/path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/attmap/ordattmap.py", line 45, in __getitem__
    return super(OrdAttMap, self).__getitem__(item)
KeyError: 'assets'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/bin/refgenie", line 8, in <module>
    sys.exit(main())
  File "/path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenie/cli.py", line 142, in main
    preserve_map_configs=args.preserve_map_configs,
  File "/path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenie/refgenie.py", line 175, in refgenie_build_reduce
    tag_data = map_rgc[CFG_GENOMES_KEY][matched_genome][CFG_ASSETS_KEY][
  File "/path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/attmap/pathex_attmap.py", line 56, in __getitem__
    v = super(PathExAttMap, self).__getitem__(item)
  File "/path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/attmap/ordattmap.py", line 47, in __getitem__
    return AttMap.__getitem__(self, item)
  File "/path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/attmap/attmap.py", line 32, in __getitem__
    return self.__dict__[item]
KeyError: 'assets'

The --reduce operation at 4. is occurring at the same time as non-txome --map operations in 3. Is that a problem? Maybe the reduce is not correctly ignoring incomplete map operations?

@stolarczyk
Contributor

stolarczyk commented Jun 17, 2021

I'm assuming (and some initial testing seems to confirm) that dependencies of the indexing operations (fasta, fasta_txome) do need to exist in the central database before the indexing operations. So, the builds for those entities need to be done without --map at low concurrency, or else done en masse and a '--reduce' operation run before the indexing.

Exactly. The results of --map builds are isolated, so unless --reduce is run, refgenie is not "aware" of the assets and related metadata.

Regarding the build strategy, I would do this as follows:

  1. Build all top-level assets
  2. Wait until jobs are completed, reduce
  3. Build all derived assets (assets that are built from top-level assets)
  4. Wait until jobs are completed, reduce

In your case, I'm not sure why you decided to build the fasta_txome assets with the genome-derived ones.

And you're right -- to be safe, run only one --reduce at a time, since you could otherwise run into the problem that this is trying to solve.

@pinin4fjords

Okay, I'll clarify, apologies for any misplaced assumptions in the following.

All of fasta_txome, hisat2_index, bowtie2_index, etc. just need the reference genome to exist, and can be run alongside one another in my step 3 after a reduce for all the genome builds.

The salmon_index and kallisto_index further need the fasta_txome, so I build all of those in parallel before reducing and running the Salmon and Kallisto builds.

I'm not running parallel reduces, but since the hisat indexing at the genome level takes a while, those jobs are still going at the point I'm reducing for all the fasta_txome assets. So I am currently running a single reduce alongside many maps, which seems like it should work, but doesn't.

@stolarczyk
Contributor

Run a reduce for each assembly

That's not possible. If you use the --reduce flag, genome/asset specifications are disregarded; the reduce step is run globally.
This needs to be either clarified in the docs, or maybe we should add separate commands: refgenie build-map and refgenie build-reduce?

@pinin4fjords

Don't worry about the 'each assembly' bit, I haven't got that far anyway. I only meant that I can run a reduce once I know I have all assets in hand for a given assembly, which I can track in the workflow.

@pinin4fjords

pinin4fjords commented Jun 17, 2021

Though of course if I can't run a single reduce alongside running maps that won't work either.

That's the bottom line here: I understand that I can't run multiple reduce operations, and I'm not. But I would like to be able to run reduces while there are still maps ongoing.

@stolarczyk
Contributor

All of fasta_txome, hisatw_index, bowtie_index etc just need the reference genome to exist

That's what I was getting at in my previous comment. fasta_txome doesn't require fasta to exist; it's a top-level asset. For example:

~ refgenie build hg38/fasta_txome --requirements

'fasta_txome' recipe requirements: 
- files:
	fasta (gzipped fasta file)

Therefore, I'd build it with other top-level assets.

I am currently running a single reduce alongside many maps, which seems like it should work, but doesn't.

I'm afraid this could lead to problems. The reduce step simply looks for "map configs" in asset directories and tries to incorporate their contents into the "master config". So if any map builds are running, the resulting (possibly incomplete) configs will be detected by the reduce step. This is likely what you saw here: #254 (comment)

@stolarczyk
Contributor

So this boils down to two rules for running refgenie build --reduce:

  1. Run only one refgenie build --reduce at a time
  2. Run only when there are no refgenie build --map jobs in progress
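A workflow could defend against breaking rule 2 with a simple guard that refuses to reduce while any map job is in flight. This sketch assumes a hypothetical marker-file convention; refgenie itself doesn't drop such markers, so this is purely workflow-side logic:

```python
import glob
import os

def safe_to_reduce(workdir):
    """Return True only when no `--map` build is still in flight.

    Assumes a hypothetical convention: each map job drops an
    '<asset>.map_in_progress' marker when it starts and removes it
    when it finishes. refgenie itself doesn't do this; it's purely
    workflow-side defensive logic.
    """
    return not glob.glob(os.path.join(workdir, "*.map_in_progress"))

os.makedirs("work", exist_ok=True)
open("work/hisat2_index.map_in_progress", "w").close()
assert not safe_to_reduce("work")  # a map job is still running: don't reduce
os.remove("work/hisat2_index.map_in_progress")
assert safe_to_reduce("work")      # all maps done: one reduce is now safe
```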

@pinin4fjords

Ahh, I'd thought that each assembly needed a /fasta in place as a starting point (to establish the SHA digest) before any /fasta_txomes.

Anyway, now things are clear I can code to defend against the main issue. But would it not be possible for the reduce process to actually check for completion of the mapped processes?

@stolarczyk
Contributor

Ahh, I'd thought that each assembly needed a /fasta in place as a starting point (to establish the SHA digest)

oh, wait you're right :)

But you can build the other top-level assets right after reducing the fasta assets:

  1. Build fasta assets to establish genome namespaces
  2. Wait until jobs are completed, reduce
  3. Build all other top-level assets (fasta_txome, gencode_gtf etc.)
  4. Wait until jobs are completed, reduce
  5. Build all derived assets (assets that are built from top-level assets)
  6. Wait until jobs are completed, reduce

Sorry for the confusion.

@stolarczyk
Contributor

But would it not be possible for the reduce process to actually check for completion of the mapped processes?

It would be possible, I just wasn't sure if that's worth the effort.

@pinin4fjords

But would it not be possible for the reduce process to actually check for completion of the mapped processes?

It would be possible, I just wasn't sure if that's worth the effort.

It would certainly be VERY helpful for us.

e.g. I'd like to be able to 'release' the assets for a species as soon as they're built, not have to wait for the builds associated with 100s of other assemblies. Plus it simplifies the workflow logic if I don't have to add logic to wait for everything to stop before I can run a reduce.

@pinin4fjords

That said, thanks for the quick work on getting things this far.

@pinin4fjords

I can confirm that this works fine now. If we could get that release when you're ready so the new Conda package etc become available that'd help with our production.

@stolarczyk
Contributor

It would certainly be VERY helpful for us.

Running reduce while assets are being built should be possible now on dev.

@pinin4fjords

That's really awesome- thank you.

@pinin4fjords

I confirm that this works in our test case- thanks again.

@stolarczyk
Contributor

I can confirm that this works fine now. If we could get that release when you're ready so the new Conda package etc become available that'd help with our production.

great! we can start working on the release if there are no immediate feature requests/bugs

@pinin4fjords

pinin4fjords commented Jun 22, 2021

Thanks @stolarczyk. I've just finished attempting to build all our assets. The issues I encountered seem mostly due to assets not from Ensembl, namely:

  • Files not passing the post-processing (as described in Ability to disable secondary behaviours for builds #259)
  • Files with the same SHA being tagged as different assemblies, and causing clashes when building in their mapped processes. I'll have to figure some way round this at some point with aliases, but not your problem.

So all good I think from a release point of view.
