Can we use pipestat as the metadata store for refgenie?
advantages:
one abstraction would work for both file-based and db-based storage
can benefit from work/testing/features already in pipestat, and updates will benefit that project
pipelines are actually reporting results, which fits naturally with pipestat.
disadvantages:
does the pipestat framework fit refgenie's needs?
refgenie relationships may be a bit more complex than typical pipestat use case
Solution idea 1: one table of genomes
record_identifier corresponds to the genome digest.
Each asset under a genome would then be a specific result, stored as an object type. Pipestat can accommodate a JSON field, so this is feasible.
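To make idea 1 concrete, here is a rough sketch of the layout using plain dicts rather than pipestat itself. The digest, asset names, and fields are all made up for illustration; the point is only that one record per genome digest, with each asset as a JSON-style object, round-trips through JSON cleanly.

```python
import json

# All digests, asset names, and fields below are hypothetical.
genomes = {
    "digest_abc123": {  # record identifier = genome digest
        "fasta": {"asset_class": "fasta", "path": "fasta/default"},
        "bowtie2_index": {
            "asset_class": "bowtie2_index",
            "path": "bowtie2_index/default",
        },
    }
}


def get_asset(store, genome_digest, asset_name):
    """Return one asset (a JSON-style object) stored under a genome record."""
    return store[genome_digest][asset_name]


# The whole table survives a JSON round trip, which is what a JSON
# column or a file backend would need.
restored = json.loads(json.dumps(genomes))
print(get_asset(restored, "digest_abc123", "bowtie2_index")["path"])
```

With this shape, everything about a genome lives in one record, but the asset objects themselves are not validated against any per-class schema.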
Solution idea 2: one table per asset class
Since an asset class is essentially a schema, each asset class could correspond to a table (or, pipestat namespace). This would mean for a file_backend (the current use case), each asset class would have its own file.
It would be better if each genome had its own file instead.
These assets would lack a genome identifier
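A small sketch of the problem with idea 2, again with plain dicts standing in for tables. The "genome" field on each row is an assumption: it is exactly the identifier these per-class tables would otherwise lack, and it is what lets you regroup the rows into a per-genome view.

```python
from collections import defaultdict

# One "table" per asset class. Digests and paths are hypothetical, and the
# "genome" field is an assumed addition so rows can be tied back to a genome.
asset_class_tables = {
    "fasta": [
        {"genome": "digest_abc123", "path": "fasta/default"},
        {"genome": "digest_def456", "path": "fasta/default"},
    ],
    "bowtie2_index": [
        {"genome": "digest_abc123", "path": "bowtie2_index/default"},
    ],
}


def group_by_genome(tables):
    """Regroup per-asset-class rows into one mapping per genome."""
    by_genome = defaultdict(dict)
    for asset_class, rows in tables.items():
        for row in rows:
            by_genome[row["genome"]][asset_class] = row["path"]
    return dict(by_genome)


per_genome = group_by_genome(asset_class_tables)
print(sorted(per_genome))
```

Listing the assets of a single genome means scanning every asset-class table, which is one reason a per-genome file looks more natural for the file backend.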
Solution idea 3: combination
RefGenConf holds 3 pipestat manager objects; or, 1 multi-object with 3 namespaces; or 1 with 2+X namespaces, where X is the number of asset classes known by this instance.
An asset class corresponds to a schema, so you could have a separate table for each asset; or, you could have one table for all assets but use a JSON column for the content and thereby make it schemaless.
This is incomplete -- how would you do these things?
adding a genome: genomes_psm.report({id: digest, ...}) # namespace: genomes
seek operation (looks up an asset by registry path): psm.retrieve(namespace="assets", id="hg38/bowtie2")
I would need to write custom indexes and custom joins...
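To illustrate why seek needs a custom index and join, here is a sketch with two dict "namespaces" in place of pipestat managers. It assumes genome records carry a list of aliases and that assets are keyed by (genome digest, asset name); neither is settled above, and none of these names are pipestat API.

```python
# Hypothetical stand-ins for the "genomes" and "assets" namespaces.
genomes_ns = {
    "digest_abc123": {"aliases": ["hg38"]},
}
assets_ns = {
    ("digest_abc123", "bowtie2"): {
        "asset_class": "bowtie2_index",
        "path": "bowtie2/default",
    },
}


def build_alias_index(genomes):
    """Custom index: map each genome alias to its digest."""
    index = {}
    for digest, record in genomes.items():
        for alias in record["aliases"]:
            index[alias] = digest
    return index


def seek(registry_path, alias_index, assets):
    """Resolve 'alias/asset' to a path by joining the two namespaces."""
    alias, asset_name = registry_path.split("/", 1)
    digest = alias_index[alias]  # lookup in the genomes namespace
    return assets[(digest, asset_name)]["path"]  # lookup in the assets namespace


alias_index = build_alias_index(genomes_ns)
print(seek("hg38/bowtie2", alias_index, assets_ns))
```

A single retrieve call keyed on "hg38/bowtie2" only works if the assets namespace is keyed by alias; once records are keyed by digest, the alias lookup and the join have to live somewhere, and in this sketch that is refgenie code rather than the store.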