Database-backed config #284

Open
Tracked by #283
nsheff opened this issue Feb 22, 2024 · 0 comments

Can we use pipestat as the metadata store for refgenie?

advantages:

  • one abstraction would work for both file-based and db-based storage
  • can benefit from work/testing/features already in pipestat, and updates will benefit that project
  • pipelines are actually reporting results, which fits naturally with pipestat.

disadvantages:

  • does the pipestat framework fit refgenie's needs?
  • refgenie relationships may be a bit more complex than the typical pipestat use case

Solution idea 1: one table of genomes

The record_identifier corresponds to the genome digest.
Each asset under a genome would then be a separate result, stored as an object type. Pipestat can accommodate a JSON field, so this is feasible.

record_id: genome_digest
result:
    - result_id: asset_digest
      value: {asset_dict}
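
A minimal sketch of this layout, using a plain Python dict as a stand-in for the pipestat backend. No pipestat calls are made here; the function names, digests, and asset fields are hypothetical placeholders chosen only for illustration.

```python
# Stand-in for a single "genomes" table: record_id -> {result_id -> value}.
# All digests and asset fields below are hypothetical placeholders.
genomes = {}


def report_asset(genome_digest, asset_digest, asset_dict):
    """Store one asset as a JSON-like object result under its genome record."""
    genomes.setdefault(genome_digest, {})[asset_digest] = asset_dict


def assets_of(genome_digest):
    """Return all assets reported for a genome (empty dict if unknown)."""
    return genomes.get(genome_digest, {})


report_asset("genome_abc", "asset_123", {"asset_class": "fasta"})
report_asset("genome_abc", "asset_456", {"asset_class": "bowtie2_index"})
```

With this shape, listing a genome's assets is a single record lookup, but finding an asset by its digest alone would mean scanning every genome record.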

Solution idea 2: one table per asset class

Since an asset class is essentially a schema, each asset class could correspond to a table (or pipestat namespace). For a file backend (the current use case), this would mean each asset class has its own file.

record_id: asset_digest
result:
    - result_id: asset_attr1
      value: val1
    - result_id: asset_attr2
      value: val2
    - ...

For the file backend, it would be better if each genome had its own file instead.
Also, these asset records would lack a genome identifier.
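
To make the trade-off concrete, here is a rough sketch with one plain dict per asset class standing in for the per-class tables. The class names, digests, and the `find_asset` helper are assumptions for illustration, not pipestat or refgenie API.

```python
# One stand-in table (namespace) per asset class; record_id is the asset digest.
# Class names, digests, and attributes are hypothetical placeholders.
tables = {
    "fasta": {"asset_123": {"asset_attr1": "val1", "asset_attr2": "val2"}},
    "bowtie2_index": {"asset_456": {"asset_attr1": "val1"}},
}


def find_asset(asset_digest):
    """Scan every asset-class table; return (asset_class, attributes) or None."""
    for asset_class, table in tables.items():
        if asset_digest in table:
            return asset_class, table[asset_digest]
    return None
```

Nothing in the returned record says which genome the asset belongs to, which is the gap noted above.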

Solution idea 3: combination

RefGenConf holds 3 pipestat manager objects; or, 1 multi-object with 3 namespaces; or 1 with 2+X namespaces, where X is the number of asset classes known by this instance.

So,

  1. genomes, assets, aliases; OR,
  2. genomes, aliases, asset1, asset2, asset3, ...

genome_table:
    record_id: genome_digest
    result:
        result_id: asset_class_id
        value: asset_digest
asset_table:
    record_id: asset_digest
    result:
        - result_id: asset_attr1
          value: val1
        - result_id: asset_attr2
          value: val2
alias_table:
    record_id: alias
    result:
        result_id: "genome_digest"
        value: genome_digest
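
A rough sketch of the first variant (genomes, assets, aliases), again with plain dicts standing in for the three pipestat namespaces. The `ThreeTableStore` class and its methods are hypothetical; the point is only to show which writes each operation would touch.

```python
from dataclasses import dataclass, field


@dataclass
class ThreeTableStore:
    """Plain-dict stand-in for three namespaces: genomes, assets, aliases."""

    genomes: dict = field(default_factory=dict)  # genome_digest -> {asset_class_id: asset_digest}
    assets: dict = field(default_factory=dict)   # asset_digest -> {attr: value}
    aliases: dict = field(default_factory=dict)  # alias -> {"genome_digest": genome_digest}

    def add_genome(self, genome_digest, alias):
        # Register the genome record and point the alias at it.
        self.genomes.setdefault(genome_digest, {})
        self.aliases[alias] = {"genome_digest": genome_digest}

    def add_asset(self, genome_digest, asset_class_id, asset_digest, attrs):
        # Two writes are needed: the asset record and the genome's pointer to it.
        self.assets[asset_digest] = dict(attrs)
        self.genomes.setdefault(genome_digest, {})[asset_class_id] = asset_digest


# Digests and attribute names are made-up placeholders.
store = ThreeTableStore()
store.add_genome("genome_abc", alias="hg38")
store.add_asset("genome_abc", "bowtie2", "asset_456", {"asset_attr1": "val1"})
```

Because adding an asset touches two tables, a real backend would need to keep those writes consistent with each other.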

An asset class corresponds to a schema, so you could have a separate table for each asset; or, you could have one table for all assets but use a JSON column for the content and thereby make it schemaless.

This is incomplete -- how would you do these things?

adding a genome: genomes_psm.report({id: digest, ...}) # namespace: genomes

seek operation: look up an asset by registry path:
psm.retrieve(namespace="assets", id="hg38/bowtie2")

I would need to write custom indexes and custom joins...
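
As an illustration of why joins come up, here is a sketch of seek over the three-table layout, with plain dicts standing in for the namespaces. The `seek` function, the table contents, and the two-part path parsing are assumptions; no pipestat API is used.

```python
# Stand-in tables matching the genomes/assets/aliases layout; values are made up.
aliases = {"hg38": {"genome_digest": "genome_abc"}}
genomes = {"genome_abc": {"bowtie2": "asset_456"}}
assets = {"asset_456": {"asset_attr1": "val1"}}


def seek(registry_path):
    """Resolve 'alias/asset_class' to an asset record via three lookups."""
    alias, asset_class_id = registry_path.split("/", 1)
    genome_digest = aliases[alias]["genome_digest"]         # alias -> genome
    asset_digest = genomes[genome_digest][asset_class_id]   # genome -> asset digest
    return assets[asset_digest]                              # asset digest -> attributes


result = seek("hg38/bowtie2")
```

Each lookup here is a key access on a different table, so a database backend would express the same chain as a join across the three tables, and a missing alias or asset class raises `KeyError` in this sketch.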
