Limitations of single config file #251

Open
pinin4fjords opened this issue Jun 1, 2021 · 23 comments

Comments

@pinin4fjords

Hi!

Just wanted to flag the limitations posed by a single config file. We have a use case where we may want to create 100s of assets at a time, and we have the ability to do that in parallel via a compute cluster, but I have to throttle that right down because of issues caused by multiple concurrent writes to the config.

Would a one-file-per-asset system maybe be better, perhaps with a separate indexing process?
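
To make the one-file-per-asset idea concrete, here is a minimal sketch in Python. It is not refgenie code: the directory layout, the JSON format, and the function names `write_asset_record` and `build_index` are all assumptions made for illustration. Each build job writes only its own file, and a separate step merges the files into an index.

```python
import json
import os
import tempfile
from pathlib import Path


def write_asset_record(meta_dir: Path, genome: str, asset: str, tag: str, record: dict) -> Path:
    """Write one asset's metadata to its own file (hypothetical layout)."""
    target_dir = meta_dir / genome / asset
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{tag}.json"
    # Write to a temporary file in the same directory, then rename it into
    # place, so the indexer never picks up a half-written record.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"genome": genome, "asset": asset, "tag": tag, **record}, handle)
    os.replace(tmp_name, target)
    return target


def build_index(meta_dir: Path) -> dict:
    """Separate indexing step: merge all per-asset files into one nested dict."""
    index: dict = {}
    for path in sorted(meta_dir.rglob("*.json")):
        rec = json.loads(path.read_text())
        index.setdefault(rec["genome"], {}).setdefault(rec["asset"], {})[rec["tag"]] = rec
    return index
```

Because no two jobs write the same file, no shared lock is needed while building; only the indexing step reads everything, and it can be rerun at any time.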

@stolarczyk
Contributor

stolarczyk commented Jun 1, 2021

Hi! Yes, that's a valid point, and one we recognized some time ago. We successfully run, I'd say, ~100 builds in parallel, and the config file locking feature accommodates this -- no asset metadata is lost. But 100s of builds/concurrent writes may keep the file locked for a while, which wastes CPU time.
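
For readers unfamiliar with lock-file schemes, the general pattern can be sketched as below. This is not how yacman or refgenconf implement locking; the name `exclusive_lock` and its parameters are made up for illustration, and exclusive file creation may not be reliable on every network file system. It does show why many concurrent writers spend most of their time waiting.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str, timeout: float = 10.0, poll: float = 0.05):
    """Hold an exclusive lock represented by the existence of lock_path."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process can create the lock file at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)  # every waiting job idles here
    try:
        os.close(fd)
        yield
    finally:
        # Release the lock even if the protected block raised.
        os.remove(lock_path)
```

A job would wrap its read-modify-write of the config in `with exclusive_lock("config.lock"):`. A job that crashes while holding the lock leaves the file behind, which is the kind of stale lock that later needs manual removal.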

These are two related issues, proposing solutions that could resolve this one:

@pinin4fjords
Author

Thanks for the response. I wondered about the DB backing too, so I'm glad to see there's an issue for that.

@pinin4fjords
Author

Having done some more testing, I wanted to reiterate that this is actually pretty problematic at scale, at least in our compute environment. For example, I've noticed that occasionally, while generating lots of genome assets, the creation of child FASTAs (e.g. cDNAs) wipes out the parent (genome) asset.

Maybe this is because, for a brief time while the config file is being re-written (which is constantly happening in this case), the assets seem to disappear?

I managed to catch that in action:

>  refgenie seek pythium_aphanidermatum--pag1_scaffolds_v1/fasta:genome
/path/to/references/refgenie/alias/pythium_aphanidermatum--pag1_scaffolds_v1/fasta/genome/pythium_aphanidermatum--pag1_scaffolds_v1.fa
>  refgenie seek pythium_aphanidermatum--pag1_scaffolds_v1/fasta:genome
/path/to/references/refgenie/alias/pythium_aphanidermatum--pag1_scaffolds_v1/fasta/genome/pythium_aphanidermatum--pag1_scaffolds_v1.fa
>  refgenie seek pythium_aphanidermatum--pag1_scaffolds_v1/fasta:genome
Traceback (most recent call last):
  File "> /path/to/conda/envs/refgenie/lib/python3.9/site-packages/refgenconf/refgenconf.py", line 725, in seek
    genome_digest = self.get_genome_alias_digest(genome_name, fallback=True)
  File "> /path/to/conda/envs/refgenie/lib/python3.9/site-packages/refgenconf/refgenconf.py", line 1705, in get_genome_alias_digest
    return self[CFG_GENOMES_KEY].get_key(alias=alias)
  File "> /path/to/conda/envs/refgenie/lib/python3.9/site-packages/yacman/alias.py", line 223, in get_key
    raise UndefinedAliasError("No key defined for: {}".format(alias))
yacman.exceptions.UndefinedAliasError: No key defined for: pythium_aphanidermatum--pag1_scaffolds_v1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "> /path/to/conda/envs/refgenie/bin/refgenie", line 10, in <module>
    sys.exit(main())
  File "> /path/to/conda/envs/refgenie/lib/python3.9/site-packages/refgenie/cli.py", line 159, in main
    rgc.seek(
  File "> /path/to/conda/envs/refgenie/lib/python3.9/site-packages/refgenconf/refgenconf.py", line 727, in seek
    raise MissingGenomeError(f"Your genomes do not include '{genome_name}'")
refgenconf.exceptions.MissingGenomeError: Your genomes do not include 'pythium_aphanidermatum--pag1_scaffolds_v1'
>  refgenie seek pythium_aphanidermatum--pag1_scaffolds_v1/fasta:genome
/path/to/references/refgenie/alias/pythium_aphanidermatum--pag1_scaffolds_v1/fasta/genome/pythium_aphanidermatum--pag1_scaffolds_v1.fa

As you can see, there was an asset there one minute, then it was gone, then it was back (and I wasn't rebuilding that specific one).

@nsheff
Contributor

nsheff commented Jun 3, 2021

This is a good catch. We worked through many of these issues when doing this locally, and seemed to have gotten everything working, but it looks like something has snuck through.

> Maybe this is because, for a brief time while the config file is being re-written (which is constantly happening in this case), the assets seem to disappear?

That shouldn't be the case -- the file is locked and should be complete whenever it's read, since it will only be read when unlocked. But, obviously, there's a bug here.
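
One generic way entries can vanish even when every individual write is complete is a lost update: a job writes back a snapshot it loaded earlier and overwrites whatever other jobs added in the meantime. Whether that is what happens in refgenconf is not established in this thread. The sketch below only illustrates the failure mode with an in-memory stand-in (`SharedStore` and both update functions are hypothetical).

```python
import copy


class SharedStore:
    """Stand-in for the config file on disk (hypothetical, in-memory)."""

    def __init__(self):
        self._data = {"genomes": {}}

    def read(self) -> dict:
        return copy.deepcopy(self._data)

    def write(self, data: dict) -> None:
        self._data = copy.deepcopy(data)


def stale_update(store, snapshot, genome):
    """Write back a snapshot taken earlier: anything added since is dropped."""
    snapshot["genomes"][genome] = {"assets": {}}
    store.write(snapshot)


def fresh_update(store, genome):
    """Re-read immediately before modifying (to be done while holding the lock)."""
    current = store.read()
    current["genomes"][genome] = {"assets": {}}
    store.write(current)


store = SharedStore()
snap_a = store.read()  # job A loads the config
snap_b = store.read()  # job B loads the config
stale_update(store, snap_a, "barley")
stale_update(store, snap_b, "carrot")  # overwrites job A's entry
lost = sorted(store.read()["genomes"])

store = SharedStore()
fresh_update(store, "barley")
fresh_update(store, "carrot")
kept = sorted(store.read()["genomes"])
```

With stale snapshots only job B's genome survives; when each job re-reads inside the locked section, both entries are kept.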

> I wanted to re-iterate that this is actually pretty problematic at scale, at least in our compute environment.

Yes, we didn't really design the system to be building many assets simultaneously; originally we had envisioned refgenie as being used by private individuals or small groups to pull and build some assets. It was a side bonus that we can also use refgenie to build the very assets that would be served on the other end by refgenieserver -- which is super nice, but it is kind of abusing what the original refgenie client was intended to do. We got it working well enough to build our files successfully, but you're scaling it up even more, and it looks like that is uncovering some additional issues.

We can probably solve this, but the long-term solution to this is that we need to back the metadata not by a file, but by a robust database. This solves not just this problem, but other problems as well.
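
As a rough illustration of what database-backed metadata could look like, here is a sketch using SQLite from the Python standard library. The schema, table names, and helper functions are assumptions, not a proposed refgenie design. SQLite is used only because it needs no server; on a shared network file system a client-server database would likely be the safer choice.

```python
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    """Open (and if needed create) a hypothetical asset metadata database."""
    conn = sqlite3.connect(path, timeout=30)  # wait up to 30 s if the database is busy
    conn.execute(
        """CREATE TABLE IF NOT EXISTS assets (
               genome_digest TEXT NOT NULL,
               asset TEXT NOT NULL,
               tag TEXT NOT NULL,
               asset_digest TEXT,
               PRIMARY KEY (genome_digest, asset, tag)
           )"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS aliases (
               alias TEXT PRIMARY KEY,
               genome_digest TEXT NOT NULL
           )"""
    )
    conn.commit()
    return conn


def add_asset(conn, genome_digest, asset, tag, asset_digest):
    # "with conn" runs the statement in a transaction:
    # commit on success, rollback on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO assets VALUES (?, ?, ?, ?)",
            (genome_digest, asset, tag, asset_digest),
        )


def add_alias(conn, alias, genome_digest):
    # The PRIMARY KEY on alias makes a second digest for the same alias fail.
    with conn:
        conn.execute("INSERT INTO aliases VALUES (?, ?)", (alias, genome_digest))
```

Each job touches only its own rows inside a transaction, so concurrent builds no longer rewrite one shared document, and constraints such as "one alias maps to one genome digest" are enforced by the database itself.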

@nsheff
Contributor

nsheff commented Jun 3, 2021

@pinin4fjords I'll post some more thoughts on this soon. A question, though -- how are you parallelizing the jobs? Are you using looper or some other way to parallelize them? And are you using ephemeral compute, or some kind of local cluster with a shared file system?

@pinin4fjords
Author

Thanks @nsheff. I'm parallelising using a Nextflow workflow (see https://github.com/ebi-gene-expression-group/isl_refs_to_refgenie), pointing at our LSF cluster.

@nsheff
Contributor

nsheff commented Jun 3, 2021

@pinin4fjords Ok, this is great! Please take a look and comment on #254. I mention there a few potential issues with building at scale like this.

One issue is the asset dependencies. Does this nextflow workflow handle the asset dependencies issue? As in, it knows that it can't build a bowtie2_index asset until the fasta asset is complete? How do you encode the dependency logic?

Another issue is locating the prerequisite assets, which would be a problem on ephemeral compute -- but it seems like you avoid this issue because your jobs can all communicate with a central filesystem, correct?

Another issue is the high concurrency, centralized config one -- that's what you bring up here.

So if the first 2 issues are solved, then the only thing we'd need to solve for this to work is the high concurrency issue, right?

@pinin4fjords
Author

@nsheff for the dependencies: yes of course, that's pretty much the whole point of composing this as a workflow :-). The workflow structure encodes that, so e.g. the outputs of the reference genome here are passed (sometimes through some slightly fiddly logic) to dependent processes.
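
Outside a workflow manager, the same dependency logic can be written down as a graph and sorted. The sketch below uses Python's standard `graphlib` (Python 3.9+). The dependency map is illustrative only: `txome_index` is a made-up asset name, and the edges are assumptions rather than refgenie's actual recipe requirements.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Map each asset to the assets that must exist before it can be built.
dependencies = {
    "fasta": set(),
    "bowtie2_index": {"fasta"},
    "fasta_txome": {"fasta"},
    "txome_index": {"fasta_txome"},  # hypothetical asset name
}

# static_order() yields every asset after all of its prerequisites
# and raises graphlib.CycleError if the graph contains a cycle.
build_order = list(TopologicalSorter(dependencies).static_order())
```

A workflow engine does the same thing implicitly: a process that consumes the fasta output cannot start until that output exists.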

Yes, we don't have the file system issue since we have common storage volumes. But it would make sense to future-proof Refgenie such that that is not a requirement, to make cloud-based usage easier.

Thanks for the linked issue, I'll comment further there.

@pinin4fjords
Author

Just a note: if I were writing the above workflow again I'd probably use Snakemake, which would allow some of that fiddly Nextflow logic to be removed.

@pinin4fjords
Author

pinin4fjords commented Jun 4, 2021

Just an addition to this issue: when I retry an asset build job after random failures, I add the '-R' flag (I'd found that locks from failed builds sometimes prevent the retry from working without it). Maybe that's what's allowing the config to get corrupted; I'll try alternative solutions.

Edit: nope, that's not it, it didn't help to stop using that option.

@pinin4fjords
Author

If it helps, here's some more illustration.

The issue impacts more than the species with the problem. For example, my workflow was trying to build the base carrot genome assembly and I got the error:

Recipe validated successfully against a schema: > /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenie/schemas/recipe_schema.yaml
Building 'daucus_carota--ASM162521v1/fasta:genome' using 'fasta' recipe
Traceback (most recent call last):
  File "> /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/bin/refgenie", line 10, in <module>
    sys.exit(main())
  File "> /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenie/cli.py", line 147, in main
    refgenie_build(gencfg, asset_list[0]["genome"], asset_list, recipe_name, args)
  File "> /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenie/refgenie.py", line 364, in refgenie_build
    genome in rgc.genomes_list()
  File "> /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 671, in genomes_list
    for x in list(self[CFG_GENOMES_KEY].keys())
  File "> /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 671, in <listcomp>
    for x in list(self[CFG_GENOMES_KEY].keys())
  File "> /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/refgenconf/refgenconf.py", line 1729, in get_genome_alias
    res = self[CFG_GENOMES_KEY].get_aliases(key=digest)
  File "> /path/to/conda/envs/refgenie-4f835c0534e73b8fd87dac6c443854f1/lib/python3.6/site-packages/yacman/alias.py", line 200, in get_aliases
    raise UndefinedAliasError("No alias defined for: {}".format(key))
yacman.exceptions.UndefinedAliasError: No alias defined for: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7

That hash actually refers to the barley genome from the config:

  0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7:
    assets:
      fasta:
        asset_description: DNA sequences in the FASTA format, indexed FASTA (produced with samtools index) and chromosome sizes file
        tags:
          genome:
            asset_path: fasta
            asset_digest: b01bf38f8866b602178a6806d06dfada
            seek_keys:
              fasta: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.fa
              fai: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.fa.fai
              chrom_sizes: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.chrom.sizes
              dir: .
            asset_parents: []
          genome--spikes_ercc:
            asset_path: fasta
            asset_digest: e69cb92bbe3f7f24c4b52bdc1e3de6fc
            seek_keys:
              fasta: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.fa
              fai: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.fa.fai
              chrom_sizes: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.chrom.sizes
              dir: .
            asset_parents: []
          cdna_plants51:
            asset_path: fasta
            asset_digest: b179682cf209e6725d5c365c55f22703
            seek_keys:
              fasta: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.fa
              fai: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.fa.fai
              chrom_sizes: 0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7.chrom.sizes
              dir: .
            asset_parents: []
        default_tag: genome
    aliases:
     - hordeum_vulgare--IBSC_v2

... which if I'm interpreting the config right has been superseded as barley top-dog by a cDNA file:

  afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468:
    assets:
      fasta:
        asset_description: DNA sequences in the FASTA format, indexed FASTA (produced with samtools index) and chromosome sizes file
        tags:
          cdna_newest:
            asset_path: fasta
            asset_digest: b179682cf209e6725d5c365c55f22703
            seek_keys:
              fasta: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.fa
              fai: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.fa.fai
              chrom_sizes: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.chrom.sizes
              dir: .
            asset_parents: []
        default_tag: cdna_newest
      ensembl_gtf:
        asset_description: Ensembl GTF, TSS, and gene body annotation
        tags:
          plants51:
            asset_path: ensembl_gtf
            asset_digest: 031e0348ece313acb42e0913a2d0ed80
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
          newest:
            asset_path: ensembl_gtf
            asset_digest: 031e0348ece313acb42e0913a2d0ed80
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
          plants43:
            asset_path: ensembl_gtf
            asset_digest: 3eecf9c9f82f58a6d608ed128f6c84b2
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
          current:
            asset_path: ensembl_gtf
            asset_digest: 3eecf9c9f82f58a6d608ed128f6c84b2
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
          plants51--spikes_ercc:
            asset_path: ensembl_gtf
            asset_digest: 2a77d6e26f9cd616574238b4d12ef812
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
          newest--spikes_ercc:
            asset_path: ensembl_gtf
            asset_digest: 2a77d6e26f9cd616574238b4d12ef812
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
          plants43--spikes_ercc:
            asset_path: ensembl_gtf
            asset_digest: 864f24b6700741ef8bd13b74bb2dc70b
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
          current--spikes_ercc:
            asset_path: ensembl_gtf
            asset_digest: 864f24b6700741ef8bd13b74bb2dc70b
            seek_keys:
              ensembl_gtf: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468.gtf.gz
              ensembl_tss: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_TSS.bed
              ensembl_gene_body: afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468_ensembl_gene_body.bed
              dir: .
            asset_parents: []
        default_tag: plants51
    aliases:
     - hordeum_vulgare--IBSC_v2

So that barley error corrupts the file and prevents anything else loading.

@nsheff
Contributor

nsheff commented Jun 4, 2021

Are those 2 sections in the config at the same time, meaning 2 different genome hashes have the same alias?

If you're using a cdna fasta file, are you putting that in under the fasta_txome asset, or under the fasta asset?

In the current system, you can only have 1 asset of each "type" per genome.

You may need the cDNA to have a separate alias if you need to build lots of assets under it.
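
The first question can be checked mechanically once the config is loaded into a dictionary. A minimal sketch follows, assuming the two sections quoted earlier sit side by side in a mapping of genome digests, each with an `aliases` list. The function name `find_shared_aliases` is hypothetical and the entries are trimmed to the relevant field.

```python
from collections import defaultdict


def find_shared_aliases(genomes: dict) -> dict:
    """Return {alias: [digests]} for aliases claimed by more than one genome digest."""
    owners = defaultdict(list)
    for digest, entry in genomes.items():
        for alias in entry.get("aliases", []):
            owners[alias].append(digest)
    return {alias: digests for alias, digests in owners.items() if len(digests) > 1}


# Trimmed version of the two config sections quoted in this thread.
genomes = {
    "0c56d4ccb9a9496bfd7d6a645f83737dd59aaa73dfe9cef7": {"aliases": ["hordeum_vulgare--IBSC_v2"]},
    "afa7e7c27cfa0aba54b274bf60d3faabdbf9eff3f5aa0468": {"aliases": ["hordeum_vulgare--IBSC_v2"]},
}
clashes = find_shared_aliases(genomes)
```

For these two sections, `clashes` reports `hordeum_vulgare--IBSC_v2` as claimed by both digests.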

@nsheff
Contributor

nsheff commented Jun 4, 2021

@stolarczyk is the config file unlocked between when the genome is added and when the alias is added?

@nsheff
Contributor

nsheff commented Jun 4, 2021

@pinin4fjords one thing you could do that could help us track this down is using refgenie --verbosity 5 build ... which will add lots of debug output

@pinin4fjords
Author

@nsheff I'm building and indexing the cDNAs as instructed in #250, so both under 'fasta', but it does seem to work on a small scale, as per my example at https://github.com/ebi-gene-expression-group/isl_refs_to_refgenie. Of course I can switch to 'fasta_txome' if you think that will be better.

And yep- I'll bump the verbosity

@pinin4fjords
Author

@nsheff and yes- those were two sections from the same config

@pinin4fjords
Author

@nsheff the issue with fasta_txome is that I'm going to have a lot of them (Ensembl versions, biotype sets etc), so if I can only have one of those per assembly that's not going to work. Also I don't see fasta_txome documented at http://refgenie.databio.org/en/latest/available_assets/.

Alternatively if I make every transcriptome a top-level 'genome' then I lose the grouping under the assembly and the nice explicit link with that assembly (which, as I say, does seem to work on a small scale).

@pinin4fjords
Author

Here's an illustration of three fasta type assets (1 genome, 2 cDNA) under one genome identifier:

> tail $(refgenie seek pristionchus_fissidentatus--Pristionchus_fissidentatus_genome/fasta:genome)

CCAAAAATATTAATACTAAA
>scaffold54591 length=200
AAAAAAAATGGAGTAATAGGAATCCTAGGGTAGCTGGTAGCTAATTGAATTTTAATTATT
GATTGTGAGTGTTTTTCTATTCTTTTGTGAGGGGAATTCTAAATGCGTAACGCATGGTTA
CTTCCTGACAATGATATTATTGCAAATGGGCGAAATTCGAAAATTATGGAAGAATCGGAA
AGCGGAAGCATTCAACATGT
>scaffold56872 length=135
AATGGTTAGTTCCAGGTTGGCGGGCCGGTCGAAATAACACAGAAAATGGTTGCAACCATA
GGGCCACCCTAGTCTATTTGTATTACGTATGACGCATTCTATAAAGGGCTGTGTCTGAGG
CCAGACGCCTGTTGG


> tail $(refgenie seek pristionchus_fissidentatus--Pristionchus_fissidentatus_genome/fasta:cdna_current)

TATTCCACATAA
>fissidentatus-sn_msk-S99-9.40-mRNA-1 gene=fissidentatus-sn_msk-S99-9.40-mRNA-1
ATGACCGTGAGTCGCACCAGCATTGGCACATGGCAAGAAGGATGTCGTCCACTCCCCATC
AGTCGTAGATCCGATTTTGGGCACGAGAATGTGTGCCAGACCCGCATTCTGCTGAGGGCT
CACGAACGGAATCGACTCCTTCGCGCACAGAAGCACGGGCGTACTCATGGTGCCGACGAT
GCCGAACTCGTACCAGCCGGGCGTCTTCGCGTTGGGCGAGGCGGCGCCAGCGCCGCCATT
CGGCGCCTGGTCCTGATCCGCGGCGGGCGCCTGATCCGCGCCGGGCGCCTCGTTCGCGCT
GGAACCGGGCTGATCGCCGTCGAGCTGATTCCGGACGCCGCTGGAGCGTGCTTCCTCCTG
GGCGGCGATTCGGGCACTTCGACGAGGTTCGGGCGGCGAATTGGGGTCGACCAGAAGATC
GTTCATTACGAGTCTAGAAGAAGTAAGTGA

> tail $(refgenie seek pristionchus_fissidentatus--Pristionchus_fissidentatus_genome/fasta:cdna_current--spikes_ercc)

TTTTAGATGTCTATGTTATGCTTCCTTCCTGTGTTCCAGCTACAAACTTAGAAACAAGTGGAGCTGAGATTACAGCAGAG
AATATTGAAGAACTCATTCTTTAGATAATGTCTTAGGTTAAAAAAAAAAAAAAAAAAAAAAAA
>DQ854994 gene:ERCC-00171
CTGGAGATTGTCTCGTACGGTTAAGAGCCTCCGCCCGTCTCTGGGACTATGGACGGGCACGCTCATATCAGGCTATATTT
GGTCCGGGTTATTATCGTCGCGGTTACCGTAATACTTCAGATCAGTTAAGTAGGGCCATATGCCTCGGGAATAAGCTGAC
GGTGACAAGGTTTCCCCCTAATCGAGACGCTGCAATAACACAGGGGCATACAGTAACCAGGCAAGAGTTCAATCGCTTAG
TTTCGTGGCGGGATTTGAGGAAAACTGCGACTGTTCTTTAACCAAACATCCGTGCGATTCGTGCCACTCGTAGACGGCAT
CTCACAGTCACTGAAGGCTATTAAAGAGTTAGCACCCACCATTGGATGAAGCCCAGGATAAGTGACCCCCCCGGACCTTG
GAGTTTCATGCTAATCAAAGAAGAGCTAATCCGACGTAAAGTTGCGGCGTTGATTACGCAGGATTGCGACCAAAGAACGA
GAAAAAAAAAAAAAAAAAAAAAAAA

As you can see that seems to behave okay, and I was able to index the different cDNA fastas specifically, using the instructions you provided.

@nsheff
Contributor

nsheff commented Jun 4, 2021

That's interesting. I thought that refgenie couldn't accept multiple fasta assets under 1 genome, since the fasta asset is strictly tied to the hashed genome identifier (1-to-1). It looks like you've been able to do that, though, so I need to think about that more. Maybe we never tried it and we may not be checking that correctly -- or maybe it is allowable; @stolarczyk, correct me if I'm wrong here. But I wonder if this is the cause of some of your alias issues, in that we hadn't imagined it working this way.

> the issue with fasta_txome is that I'm going to have a lot of them (Ensembl versions, biotype sets etc), so if I can only have one of those per assembly that's not going to work.

But you can work with multiple fasta_txome assets under a single genome, because these don't lock the genome identifier. So, I would have envisioned you'd have 1 primary fasta file under fasta, and then you put all your cdna ones under fasta_txome.

> Alternatively if I make every transcriptome a top-level 'genome' then I lose the grouping under the assembly and the nice explicit link with that assembly (which, as I say, does seem to work on a small scale).

Well, they'd still be grouped under the primary assembly, right?

> Also I don't see fasta_txome documented at http://refgenie.databio.org/en/latest/available_assets/.

Sorry about that; it should be identical to fasta in terms of building. It just has a different name, which avoids the initialization component of the fasta asset.

But altogether, this is one of the reasons that I think your use case is going to require the expanded recipe descriptions proposed in #198. It looks like you're getting it sort-of working, which is great, but I think it's time to just solve that issue.

@pinin4fjords
Author

@nsheff aha gotcha- thanks for clarifying fasta_txome, I'll try that now.

@stolarczyk
Contributor

> Maybe we never tried it and we may not be checking that correctly -- or maybe it is allowable, @stolarczyk correct me if I'm wrong here.

It's not allowable, but it seems like it's technically possible -- I added a check so that the genome is not reinitialized, but building another fasta asset doesn't fail. In hindsight, that was a mistake because we rely on the 1:1 namespace:fasta relationship in other parts of the codebase, like refgenie compare.

You should see something like the line below at the top of your "extra fasta" build logs, @pinin4fjords:

'xxx' genome is already initialized with other fasta asset (xxx/fasta:tag)

@pinin4fjords
Author

@stolarczyk yep, I did, but as things worked anyway I wasn't over-worried. If this really is naughty then things should probably exit at that point.
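
A fail-fast version of that check might look like the sketch below. It is not refgenie or refgenconf code: the exception class, the function `register_fasta`, and the dictionary layout are all hypothetical, and the message text simply mirrors the log line quoted above.

```python
class GenomeAlreadyInitializedError(Exception):
    """Hypothetical error type; not part of refgenie or refgenconf."""


def register_fasta(genomes: dict, genome: str, tag: str) -> None:
    """Record a fasta asset, refusing a second fasta tag for the same genome."""
    tags = genomes.setdefault(genome, {}).setdefault("fasta", {})
    if tags and tag not in tags:
        existing = next(iter(tags))
        # Stop here instead of logging a message and carrying on.
        raise GenomeAlreadyInitializedError(
            f"'{genome}' genome is already initialized with other fasta asset "
            f"({genome}/fasta:{existing})"
        )
    tags[tag] = {}  # first registration, or a rebuild of the same tag
```

Raising instead of logging keeps the 1:1 relationship between a genome and its fasta asset intact, at the cost of breaking workflows that currently rely on the permissive behaviour.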

@pinin4fjords
Author

Using fasta_txome seems to be helping. I'm still getting a lot of the errors as reported in #253, but the config file seems to be maintaining consistency.

pinin4fjords added a commit to ebi-gene-expression-group/isl_refs_to_refgenie that referenced this issue Jun 4, 2021