Database-backed config #284

nsheff · 2024-02-22T14:40:24Z

Can we use pipestat as the metadata store for refgenie?

advantages:

one abstraction would work for both file-based and db-based storage
can benefit from work/testing/features already in pipestat, and updates will benefit that project
pipelines are actually reporting results, which fits naturally with pipestat.

disadvantages:

does pipestat framework fit the refgenie system needs?
refgenie relationships may be a bit more complex than typical pipestat use case

Solution idea 1: one table of genomes

record_identifier corresponds to the genome digest.
Each asset under a genome would then be a specific result, stored as an object type. Pipestat can accommodate a JSON field, so we could do this.

record_id: genome_digest
result:
    - result_id: asset_digest
      value: {asset_dict}
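
To make the shape of idea 1 concrete, here is a minimal sketch using plain Python dicts as a stand-in for the table. It does not use the pipestat API; the digests, asset contents, and the helper names `list_assets` and `find_genomes_with_asset` are made up for illustration.

```python
# Plain-dict stand-in for a single "genomes" table (not the pipestat API).
# All digests and asset contents below are placeholder values.
genomes = {
    "genome_digest_1": {  # record_id: genome digest
        # result_id: asset digest -> value: asset dict (a JSON-like object)
        "asset_digest_a": {"asset_class": "fasta"},
        "asset_digest_b": {"asset_class": "bowtie2_index"},
    },
}


def list_assets(table, genome_digest):
    """Return the asset digests reported under one genome record."""
    return sorted(table[genome_digest])


def find_genomes_with_asset(table, asset_digest):
    """Reverse lookup; with this layout it has to scan every genome record."""
    return [g for g, results in table.items() if asset_digest in results]
```

Listing the assets of a genome is a single record lookup, but going from an asset digest back to its genome means scanning all records, which is one place where the relationships get more complex than a typical pipestat use case.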

Solution idea 2: one table per asset class

Since an asset class is essentially a schema, each asset class could correspond to a table (or, pipestat namespace). This would mean for a file_backend (the current use case), each asset class would have its own file.

record_id: asset_digest
result:
    - result_id: asset_attr1
      value: val1
    - result_id: asset_attr2
      value: val2
    - ...

With a file backend, it would be better if each genome had its own file instead.
These assets would also lack a genome identifier.
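
A rough sketch of idea 2, again with plain dicts rather than pipestat: each asset class gets its own table and acts as a schema for the records in it. The class names, attribute names, and the `report_asset` helper are hypothetical.

```python
# Hypothetical asset classes; each one lists the attributes its records carry.
asset_class_schemas = {
    "fasta": {"fasta_path", "fai_path"},
    "bowtie2_index": {"index_prefix"},
}

# One table (namespace) per asset class, keyed by asset digest.
tables = {name: {} for name in asset_class_schemas}


def report_asset(asset_class, asset_digest, attrs):
    """Store an asset record after checking it against its class schema."""
    expected = asset_class_schemas[asset_class]
    if set(attrs) != expected:
        raise ValueError(f"{asset_class} expects attributes {sorted(expected)}")
    tables[asset_class][asset_digest] = dict(attrs)


report_asset("bowtie2_index", "asset_digest_b", {"index_prefix": "genome"})
```

Nothing in a stored record says which genome it belongs to, so the genome link would have to live somewhere else.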

Solution idea 3: combination

RefGenConf holds 3 pipestat manager objects; or, 1 multi-object with 3 namespaces; or 1 with 2+X namespaces, where X is the number of asset classes known by this instance.

So,

genomes, assets, aliases; OR,
genomes, aliases, asset1, asset2, asset3, ...

genome_table:
    record_id: genome_digest
    result:
        - result_id: asset_class_id
          value: asset_digest
asset_table:
    record_id: asset_digest
    result:
        - result_id: asset_attr1
          value: val1
        - result_id: asset_attr2
          value: val2
alias_table:
    record_id: alias
    result:
        - result_id: "genome_digest"
          value: genome_digest

An asset class corresponds to a schema, so you could have a separate table for each asset; or, you could have one table for all assets but use a JSON column for the content and thereby make it schemaless.
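
The single-table, JSON-column variant could look roughly like this. The sketch uses Python's built-in `sqlite3` and `json` modules instead of pipestat, and the table name, column names, and attribute values are assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical single "assets" table; column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assets ("
    "asset_digest TEXT PRIMARY KEY, "
    "asset_class TEXT NOT NULL, "
    "content TEXT NOT NULL)"  # JSON-encoded attributes, no fixed schema
)


def report_asset(asset_digest, asset_class, attrs):
    """Store an asset of any class; its attributes go into the JSON column."""
    conn.execute(
        "INSERT INTO assets VALUES (?, ?, ?)",
        (asset_digest, asset_class, json.dumps(attrs)),
    )


def retrieve_asset(asset_digest):
    """Return (asset_class, attrs) for one asset digest, or None if absent."""
    row = conn.execute(
        "SELECT asset_class, content FROM assets WHERE asset_digest = ?",
        (asset_digest,),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


# Two assets of different classes share one table despite different attributes.
report_asset("asset_digest_a", "fasta", {"fasta_path": "genome.fa"})
report_asset("asset_digest_b", "bowtie2_index", {"index_prefix": "genome", "threads": 4})
```

The trade-off is that the database no longer enforces the asset class schema, so any validation would have to happen in application code before reporting.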

This is incomplete -- how would you do these things?

adding a genome: genomes_psm.report({id: digest, ...}) # namespace: genomes

seek operation: looks up an asset by registry path:
psm.retrieve(namespace="assets", id="hg38/bowtie2")

I would need to write custom indexes and custom joins...
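
To illustrate the kind of custom join a seek would need under idea 3, here is a sketch with three plain dicts standing in for the alias, genome, and asset tables. It is not pipestat code; the `seek` helper, the digests, and the asset contents are hypothetical, and it assumes a registry path of the form `alias/asset_class`.

```python
# Plain-dict stand-ins for the three tables (placeholder digests and values).
alias_table = {"hg38": {"genome_digest": "genome_digest_1"}}
genome_table = {
    "genome_digest_1": {  # result_id: asset class -> value: asset digest
        "fasta": "asset_digest_a",
        "bowtie2": "asset_digest_b",
    },
}
asset_table = {
    "asset_digest_a": {"path": "fasta/genome.fa"},
    "asset_digest_b": {"path": "bowtie2/genome"},
}


def seek(registry_path):
    """Resolve 'alias/asset_class' to an asset record via three lookups."""
    alias, asset_class = registry_path.split("/", 1)
    genome_digest = alias_table[alias]["genome_digest"]       # alias -> genome
    asset_digest = genome_table[genome_digest][asset_class]   # genome -> asset
    return asset_table[asset_digest]                          # asset -> record
```

A single `retrieve` on the assets namespace cannot do this by itself, because the asset table is keyed by digest and the registry path is keyed by alias and asset class; the two intermediate lookups are the join that would have to be written by hand.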

nsheff added the brainstorming label Feb 22, 2024

nsheff mentioned this issue Feb 22, 2024: Refgenie roadmap #283 (Open, 21 tasks)
