Limitations of single config file #251

Hi!

Just wanted to flag the limitations posed by a single config file. We have a use case where we may want to create hundreds of assets at a time, and we have the ability to do that in parallel via a compute cluster, but I have to throttle that right down because of issues caused by multiple concurrent writes to the config.

Would a one-file-per-asset system maybe be better, perhaps with a separate indexing process?

Comments
Hi! Yes, that's a valid point, which we recognized some time ago. We successfully run, I'd say, ~100 builds in parallel, and this is accommodated by the config file locking feature -- no asset metadata is lost. But hundreds of concurrent builds/writes may keep the file locked for a while, which wastes CPU time. Here are two related issues proposing solutions that could resolve this one:
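A minimal sketch of the pattern described here -- an exclusive lock held across the whole read-modify-write cycle of the shared config, so parallel builds serialize on one file. This assumes POSIX fcntl advisory locks and is illustrative only, not refgenie's actual locking code:

```python
import fcntl
import yaml  # PyYAML

def update_config(path, mutate):
    """Apply `mutate` to the shared config under an exclusive lock."""
    with open(path, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)   # blocks until all other writers finish
        cfg = yaml.safe_load(fh) or {}
        mutate(cfg)                      # e.g. record one asset's metadata
        fh.seek(0)
        fh.truncate()
        yaml.safe_dump(cfg, fh)
    # lock released when the file handle closes

# e.g.: update_config("genome_config.yaml",
#                     lambda c: c.setdefault("genomes", {}).update(...))
```

With hundreds of writers, each one re-parses and rewrites everyone else's metadata while holding the lock, which is the wasted CPU time mentioned above.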
Thanks for the response -- I wondered about the DB backing too, so glad to see there's an issue for that.
Having done some more testing, I wanted to reiterate that this is actually pretty problematic at scale, at least in our compute environment. For example, I've noticed that occasionally, while generating lots of genome assets, the creation of child FASTAs (e.g. cDNAs) wipes out the parent (genome) asset. Maybe this is because, for a brief time while the config file is being rewritten (which is constantly happening in this case), the assets seem to disappear? I managed to catch that in action:
As you can see, there was an asset there one minute, then it was gone, then it was back (and I wasn't rebuilding that specific one).
This is a good catch. We worked through many of these issues when doing this locally, and seemed to have gotten everything working, but it looks like something has snuck through.
That shouldn't be the case -- the file is locked and should be complete whenever it's read, since it will only be read when unlocked. But, obviously, there's a bug here.
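One general way a reader can still observe a half-written file is an in-place truncate-and-rewrite, which advisory locks only protect against for processes that also take the lock. The usual fix is an atomic replace: write a temp file, then rename over the original. A sketch of that technique, not a claim about refgenie's current writer:

```python
import os
import tempfile
import yaml

def write_config_atomically(path, cfg):
    """Readers see either the old or the new file, never a partial one."""
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".cfg.tmp")  # same filesystem
    try:
        with os.fdopen(fd, "w") as fh:
            yaml.safe_dump(cfg, fh)
            fh.flush()
            os.fsync(fh.fileno())       # data on disk before the swap
        os.replace(tmp, path)           # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```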
Yes, we didn't really design the system to be building many assets simultaneously; originally we had envisioned refgenie as being used by private individuals or small groups to pull and build some assets. It was a side bonus that we can use refgenie also to build the very assets that would be served on the other end, by refgenieserver -- which is super nice, but it is kind of abusing what the original refgenie client was intended to do. We got it working enough that we built our files successfully, but you're scaling it up even more and it looks like you're uncovering some additional issues. We can probably solve this, but the long-term solution is to back the metadata not with a file, but with a robust database. That solves not just this problem, but other problems as well.
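To make the database suggestion concrete: with a database backing, each build commits one small transaction instead of rewriting the whole shared file, and the concurrency problem largely evaporates. A minimal sketch using SQLite, with an invented schema (the actual proposal lives in the linked issue, not here):

```python
import sqlite3

con = sqlite3.connect("refgenie_meta.db", timeout=30)
con.execute("""CREATE TABLE IF NOT EXISTS assets (
    genome_digest TEXT,
    asset         TEXT,
    tag           TEXT,
    asset_digest  TEXT,
    path          TEXT,
    PRIMARY KEY (genome_digest, asset, tag))""")

def register_asset(genome_digest, asset, tag, asset_digest, path):
    with con:  # one short transaction per asset; safe under concurrent builds
        con.execute("INSERT OR REPLACE INTO assets VALUES (?, ?, ?, ?, ?)",
                    (genome_digest, asset, tag, asset_digest, path))
```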
@pinin4fjords I'll post some more thoughts on this soon. A question, though -- how are you parallelizing the jobs? Are you using looper or some other way to parallelize them? And are you using ephemeral compute, or some kind of local cluster with a shared file system?
Thanks @nsheff. I'm parallelising using a Nextflow workflow (see https://github.com/ebi-gene-expression-group/isl_refs_to_refgenie), pointing at our LSF cluster.
@pinin4fjords Ok, this is great! Please take a look and comment on #254. I mention there a few potential issues with building at scale like this. One issue is the asset dependencies. Does this Nextflow workflow handle the asset dependency issue? As in, it knows that it can't build a bowtie2_index asset until the fasta asset is complete? How do you encode the dependency logic? Another issue is locating the prerequisite assets, which would be a problem on ephemeral compute -- but it seems like you avoid this issue because your jobs can all communicate with a central filesystem, correct? Another issue is the high-concurrency, centralized-config one -- that's what you bring up here. So if the first 2 issues are solved, then the only thing we'd need to solve for this to work is the high-concurrency issue, right?
@nsheff for the dependencies: yes, of course -- that's pretty much the whole point of composing this as a workflow :-). The workflow structure encodes that, so e.g. the outputs of the reference genome here are passed (sometimes through some slightly fiddly logic) to dependent processes. Yes, we don't have the file system issue, since we have common storage volumes. But it would make sense to future-proof Refgenie so that that is not a requirement, to make cloud-based usage easier. Thanks for the linked issue, I'll comment further there.
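The dependency logic being described amounts to a small DAG over asset types. As a neutral sketch (asset names here are examples, not refgenie's full recipe list), the ordering constraint the workflow encodes looks like:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# each asset maps to the assets that must exist before it can be built
deps = {
    "fasta": set(),                    # built first, from the raw FASTA
    "fasta_txome": set(),              # transcriptome fasta
    "bowtie2_index": {"fasta"},
    "hisat2_index": {"fasta"},
    "salmon_index": {"fasta_txome"},
}
print(list(TopologicalSorter(deps).static_order()))
# one valid order: fasta, fasta_txome, bowtie2_index, hisat2_index, salmon_index
```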
Just a note: if I were writing the above workflow again I'd probably use Snakemake, which would allow some of that fiddly Nextflow logic to be removed.
Just an addition to this issue: when I retry an asset build job after random failures, I add the '-R' flag (I'd found that locks from failed builds sometimes prevent the retry from working without it). Maybe that's what's allowing the config to get corrupted; I'll try alternate solutions. Edit: nope, that's not it -- it didn't help to stop using that option.
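For context on why unconditional lock removal on retry is risky: if two jobs both decide to clear the lock and proceed, both can rewrite the config concurrently. A safer pattern is to clear a lock only when its owner is provably gone. A hypothetical sketch (invented helper, not refgenie's actual -R behavior):

```python
import os

def recover_stale_lock(lock_path):
    """Delete the lock only if the recorded owner process no longer exists."""
    try:
        owner_pid = int(open(lock_path).read().strip())
    except (FileNotFoundError, ValueError):
        return                          # no lock, or unreadable contents
    try:
        os.kill(owner_pid, 0)           # signal 0 = existence check only
    except ProcessLookupError:
        os.unlink(lock_path)            # owner crashed; lock is stale
    else:
        raise RuntimeError(f"lock {lock_path} held by live pid {owner_pid}")
```

On a shared cluster the lock's owner may sit on another host, where a PID check alone proves nothing; this only illustrates the stale-vs-live distinction.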
If it helps, here's some more illustration. The issue impacts more than just the species with the problem. For example, my workflow was trying to build the base carrot genome assembly and I got this error:
That hash actually refers to the barley genome from the config:
... which, if I'm interpreting the config right, has been superseded as barley top-dog by a cDNA file:
So that barley error corrupts the file and prevents anything else from loading.
Are those 2 sections in the config at the same time, meaning 2 different genome hashes have the same alias? If you're using a cDNA fasta file, are you putting that in under the fasta asset? In the current system, you can only have 1 asset of each "type" per genome. You may need the cDNA to have a separate alias if you need to build lots of assets under it.
@stolarczyk is the config file unlocked between when the genome is added and when the alias is added? |
@pinin4fjords one thing you could do that could help us track this down is running your builds with increased verbosity.
@nsheff I'm building and indexing the cDNAs as instructed in #250, so both under 'fasta', but it does seem to work on a small scale, as per my example at https://github.com/ebi-gene-expression-group/isl_refs_to_refgenie. Of course I can switch to 'fasta_txome' if you think that will be better. And yep -- I'll bump the verbosity.
@nsheff and yes -- those were two sections from the same config.
@nsheff the issue with fasta_txome is that I'm going to have a lot of them (Ensembl versions, biotype sets, etc.), so if I can only have one of those per assembly, that's not going to work. Also, I don't see fasta_txome documented at http://refgenie.databio.org/en/latest/available_assets/. Alternatively, if I make every transcriptome a top-level 'genome', then I lose the grouping under the assembly and the nice explicit link with that assembly (which, as I say, does seem to work on a small scale).
Here's an illustration of three fasta type assets (1 genome, 2 cDNA) under one genome identifier:
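As a hypothetical sketch of the shape being described -- digests, tags, and paths all invented here -- the genome's fasta asset would carry one primary-assembly entry and two cDNA entries:

```python
import yaml

genomes = {
    "511fb1178275e7d529560d53b949dba4": {   # fake genome digest
        "aliases": ["daucus_carota"],
        "assets": {
            "fasta": {
                "default_tag": "primary_assembly",
                "tags": {
                    "primary_assembly": {"asset_path": "fasta"},
                    "cdna_default":     {"asset_path": "fasta"},
                    "cdna_ensembl99":   {"asset_path": "fasta"},
                },
            },
        },
    },
}
print(yaml.safe_dump({"genomes": genomes}, sort_keys=False))
```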
As you can see, that seems to behave okay, and I was able to index the different cDNA fastas specifically, using the instructions you provided.
That's interesting. I thought that refgenie couldn't accept multiple fasta assets under 1 genome, since the fasta asset is strictly tied to the hashed genome identifier (1-to-1). It looks like you've been able to do that, though, so I need to think about that more. Maybe we never tried it and we may not be checking that correctly -- or maybe it is allowable, @stolarczyk correct me if I'm wrong here. But I wonder if this is the cause of some of your alias issues, in that we hadn't imagined it working this way.
But you can work with multiple fasta_txome assets under one genome, giving each one its own tag.
Well, they'd still be grouped under the primary assembly, right?
Sorry about that; it should be identical to the fasta recipe. But altogether, this is one of the reasons that I think your use case is going to require the expanded recipe descriptions proposed in #198. It looks like you're getting it sort-of working, which is great, but I think it's time to just solve that issue.
@nsheff aha, gotcha -- thanks for clarifying fasta_txome, I'll try that now.
It's not allowable, but it seems like it's technically possible -- I added a check so that the genome is not reinitialized, but building another fasta asset doesn't fail. In hindsight, that was a mistake, because we rely on the 1:1 namespace:fasta relationship in other parts of the codebase. You should see a warning to that effect at the top of your "extra fasta" build logs, @pinin4fjords.
@stolarczyk yep, I did see that, but as things worked anyway I wasn't over-worried. If this really is naughty then things should probably exit at that point.
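A hypothetical fail-fast version of that check, in the spirit of the suggestion above (names invented; the real check only warns, as described):

```python
class GenomeAlreadyInitializedError(Exception):
    pass

def assert_fasta_buildable(cfg, genome_digest):
    """Refuse a second fasta-type build under an already-initialized genome."""
    assets = cfg.get("genomes", {}).get(genome_digest, {}).get("assets", {})
    if "fasta" in assets:
        raise GenomeAlreadyInitializedError(
            f"genome {genome_digest} already has a fasta asset; "
            "the namespace:fasta relationship must stay 1:1")
```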
Using fasta_txome seems to be helping. I'm still getting a lot of the errors as reported in #253, but the config file seems to be maintaining consistency. |