Automating asset building at scale #254
Some notes on the concurrency issue: Can we make each asset build independent? So when you build an asset, you record the information in a separate yaml file for that asset. Then, there's a separate "agglomerate" step that refreshes the config file? This could be a 'parallel' mode, which is only for batch computing?
I think that the one-file-per-asset solution (or maybe per assembly, to keep dependency handling simpler), with a final index or agglomerate step, may be sufficient, and would keep things simple for e.g. cloud deployments.

There are other desirable things to store in an easy-to-query way IMO (software versions, parameters, genome releases etc), so whatever solution is chosen should probably be able to accommodate those; maybe that's a DB.

Lots of solutions out there to model dependencies; I tend to alternate between Nextflow and Snakemake, but obviously you have CWL/WDL + Cromwell etc too if you're so inclined. These would also allow you to abstract from the compute and make things runnable on a grid (LSF, SGE etc) or in the cloud.
As a first step, just allowing a merge of arbitrary refgenie databases would allow workarounds. For our use case, we could assemble the assets associated with each assembly in separate databases initially (respecting dependencies), then combine them as a final step. That would greatly reduce the number of processes writing to single files.
A bit tangential, but just to stress the point: I do think structured metadata is crucial, so you can do queries like: "Hi Refgenie, give me a Salmon index of the Ensembl 104 release of human genome assembly GRCh38 combined with ERCC spike-ins, made with Salmon version 1.2 and a k-mer length of 45."

The killer application for Refgenie is to stop people having to run costly indexing processes all the time, but if I can't easily see (and query, like above) the software and parameters used to produce the indices, then I can't trust the indices I find on http://refgenomes.databio.org/, for example.

The most awesome thing possible (and this is a bit of a pipe dream requiring funding and infrastructure) would be if I could submit a 'recipe' somewhere, like I do a Bioconda recipe, specifying input assets (either URIs or pre-existing assets, with associated metadata) and software parameters, and automatically have some build system provide the genome resources to the community via widely used channels. Done properly, that would be the nicest, most visible way to make things available to the community.
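A hypothetical sketch of what such a queryable metadata record could look like (the `AssetRecord` schema and `query` helper are invented for illustration here, not refgenie's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class AssetRecord:
    """Invented schema: the provenance fields you'd want to query on."""
    genome: str
    asset_type: str
    tool: str
    tool_version: str
    params: dict = field(default_factory=dict)
    inputs: tuple = ()   # e.g. source releases, spike-in sets

def query(records, **criteria):
    """Return records whose metadata match every given criterion exactly."""
    return [r for r in records
            if all(getattr(r, k) == v for k, v in criteria.items())]
```

With records like these, the "Salmon 1.2, k-mer 45" question above becomes a one-line filter instead of guesswork.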
Yeah -- nicely articulated. This is exactly my vision! I've written this in my grant applications, but so far, haven't been able to convince enough reviewers that it's a good idea. |
Ahh- glad to hear that's the way you're thinking anyway, hopefully we can find a way of making something similar happen. |
I like the idea of one file per asset while we wait for a larger overhaul with an RDBMS. This is a solution that many tools use with config files.
That's what we've done. The "MapReduce" framework is now supported in the development version of refgenie and is documented here: http://refgenie.databio.org/en/dev/build/#build-assets-concurrently. We will release this early this week. In the meantime, if you wish to test it (which would be great), you'll need to install the following packages:
And then you should be able to do the following, for example:
feedback is welcome!
Thanks @stolarczyk , I'm testing this now and will get back to you. |
I'm assuming (and some initial testing seems to confirm) that dependencies of the indexing operations (fasta, fasta_txome) do need to exist in the central database before the indexing operations. So, the builds for those entities need to be done without --map at low concurrency, or else done en masse and a '--reduce' operation run before the indexing. Assuming that's correct, maybe it could be made clear in the docs. To be clear, what I'm doing is:
Step 4 above is giving me the familiar error:
The `--reduce` operation at 4. is occurring at the same time as non-txome `--map` operations in 3. Is that a problem? Maybe the reduce is not correctly ignoring incomplete map operations?
Exactly.

Regarding the build strategy, I would do this as follows:
In your case, I'm not sure why you decided to build

And you're right -- to be safe, run only one
Okay, I'll clarify; apologies for any misplaced assumptions in the following. All of fasta_txome, hisat2_index, bowtie_index etc just need the reference genome to exist, but can be run alongside one another in my step 3, after a reduce for all the genome builds. The salmon_index and kallisto_index further need the fasta_txome, so I build all of those in parallel before reducing and doing the Salmon and Kallisto builds. I'm not running parallel reduces, but since the hisat indexing at the genome level takes a while, those builds are still going on at the point I'm reducing for all the fasta_txome assets. So I am currently running a single reduce alongside many maps, which seems like it should work, but doesn't.
That's not possible. If you use
Don't worry about the 'each assembly' bit, I haven't got that far anyway. I only meant that I can run a reduce once I know I have all assets in hand for a given assembly, which I can track in the workflow. |
Though of course if I can't run a single reduce alongside running maps that won't work either. That's the bottom line here: I understand that I can't run multiple reduce operations, and I'm not. But I would like to be able to run reduces while there are still maps ongoing. |
That's what I was getting at in my previous comment.
Therefore, I'd build it with other top-level assets.
I'm afraid this could lead to problems. The reduce step simply looks for "map configs" in asset directories and tries to incorporate their contents into the "master config". So if any map builds are running, the resulting (possibly incomplete) configs will be detected by the reduce step. This is likely what you saw here: #254 (comment)
So this boils down to two rules for running
Ahh, I'd thought that each assembly needed a /fasta in place as a starting point (to establish the SHA digest), before any /fasta_txomes. Anyway, now things are clear I can code to defend against the main issue. But would it not be possible for the reduce process to actually check for completion of the mapped processes? |
oh, wait you're right :) But build other top-level ones right after reducing after
sorry for the confusion
It would be possible, I just wasn't sure if that's worth the effort.
It would certainly be VERY helpful for us. e.g. I'd like to be able to 'release' the assets for a species as soon as they're built, not have to wait for the builds associated with 100s of other assemblies. Plus it simplifies the workflow logic if I don't have to add logic to wait for everything to stop before I can run a reduce. |
That said, thanks for the quick work on getting things this far. |
I can confirm that this works fine now. If we could get that release when you're ready so the new Conda package etc become available that'd help with our production. |
running reduce while assets are built should be possible now on dev. |
That's really awesome- thank you. |
I confirm that this works in our test case- thanks again. |
great! we can start working on the release if there are no immediate feature requests/bugs |
Thanks @stolarczyk. I've just finished attempting to build all our assets. The issues I encountered seem mostly due to assets not from Ensembl, namely:
So all good I think from a release point of view. |
Related to refgenie/refgenomes.databio.org#3, #251
Motivation
We would like a system that's capable of building lots of assets concurrently. This would be useful for creating new servers or adding lots of assets to existing servers.
The `refgenie build` function was originally designed to be used by small groups to add local assets alongside ones that are `pull`ed from a server. But we eventually realized that we can use `refgenie build` to create the very assets that would be served by `refgenieserver` -- which is super nice. This works well enough that we used `refgenie build` to produce all the files we're currently serving, but as the scale of assets we want to serve increases, this has led to several challenges.

Problem 1: Concurrency
The config file is required for building an asset with `refgenie build`. Even if the computing all happens in a shared environment, hundreds of tasks writing to the config file at the same time is not really a sustainable load for our homebrew file-locking mechanism. Furthermore, if the compute tasks are ephemeral, then it's not possible at all: how do we get the refgenie config file into each compute job? They can't each write and then push it independently; they would hit merge conflicts.

Thus, there needs to be some central store for the metadata outside the compute jobs, one that they can all communicate with (like a third-party database). Could we use a database for the config file?
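For illustration, a transactional store sidesteps the file-locking problem entirely, because the database serializes concurrent writers. A minimal sketch using SQLite as a stand-in for a real third-party database (the table layout is invented):

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the central asset store -- invented schema."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS assets
                    (genome TEXT, asset TEXT, digest TEXT,
                     PRIMARY KEY (genome, asset))""")
    return conn

def record_asset(conn, genome, asset, digest):
    # Each build job runs one small transaction; the database, not a
    # homebrew file lock, arbitrates between concurrent writers.
    with conn:
        conn.execute("INSERT OR REPLACE INTO assets VALUES (?, ?, ?)",
                     (genome, asset, digest))
```

Ephemeral jobs would point at a shared server-backed database rather than an in-memory one, but the write path stays this simple.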
Problem 2: Dependencies
Some of these jobs depend on other jobs, because some assets use existing assets as input. For example, you can't build a bowtie2_index asset until you have a fasta asset built. Now, the degree of dependency is quite small; it's not a super complicated dependency graph. But there is definitely some need to manage jobs; we can't just submit them all. In the past we've managed this by just submitting them by asset type... So, first build all the fasta assets, then we can build all the bowtie2_index assets. This is essentially manual dependency management.
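That manual by-asset-type ordering amounts to a topological sort of a small dependency graph; a sketch of automating it (the `DEPS` map is illustrative, not the real recipe graph):

```python
from graphlib import TopologicalSorter

# Illustrative recipe dependency map: each asset type lists the asset
# types it needs as inputs (mirrors "fasta before bowtie2_index").
DEPS = {
    "fasta": [],
    "fasta_txome": ["fasta"],
    "bowtie2_index": ["fasta"],
    "salmon_index": ["fasta_txome"],
}

def build_waves(deps):
    """Group asset types into waves; every job in a wave can be
    submitted concurrently once the previous wave has finished."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = list(ts.get_ready())
        waves.append(sorted(ready))
        ts.done(*ready)
    return waves
```

A scheduler driving submissions from these waves replaces the "first all fasta, then all bowtie2_index" convention with the same ordering computed automatically.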
sub-Problem 2.1: Required assets aren't built yet
You have to only submit the job when its dependencies are built.
sub-Problem 2.2: Required assets exist...but aren't on this ephemeral compute environment
Even if the dependencies are built in correct order, the current system would fail on ephemeral compute, because the job for the child asset won't have access to the parent asset. To make them ephemeral, they'd need to be able to retrieve what they need from some central source instead of expecting to find it locally.
Goal
We would like to redesign `refgenie build`, or design a new build system, that is able to:

Solution brainstorming
Process serially
If you just run jobs serially, then you're fine. But that will increase wall clock time (not compute time) dramatically.
Use a database for the config file
Putting the config file into a database sounds nice.
Dependency management
We've been using looper to submit jobs, and managing dependencies manually. The fact is, looper is non-dependency-aware -- by design. So it's not the right tool for this job.
One thought is that a job script could do a `refgenie pull` if a parent asset doesn't exist locally.
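That pull-if-missing idea might look like this in a job script (a sketch; `have_locally` stands in for whatever local-config lookup is available, and the exact `refgenie pull` invocation is an assumption):

```python
import subprocess

def ensure_asset(genome, asset, have_locally, pull=None):
    """Before a child build starts, fetch a parent asset from the central
    server if it isn't already present on this (possibly ephemeral) node."""
    if (genome, asset) in have_locally:
        return "local"
    if pull is None:
        # assumed CLI form; inject a different callable for testing
        pull = lambda g, a: subprocess.run(
            ["refgenie", "pull", f"{g}/{a}"], check=True)
    pull(genome, asset)
    return "pulled"
```

Injecting the `pull` callable keeps the decision logic testable without the real CLI installed.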