Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representing Hierarchies in the Catalog DB #684

Open
genematx opened this issue Mar 8, 2024 · 1 comment
Open

Representing Hierarchies in the Catalog DB #684

genematx opened this issue Mar 8, 2024 · 1 comment
Labels

Comments

@genematx
Copy link
Contributor

genematx commented Mar 8, 2024

We had a very interesting conversation with @danielballan yesterday about the structure of the catalog.db database, specifically the representation of the hierarchy of nodes.

The rows of the nodes table store metadata of assets that often (always?) are in a hierarchical relationship with each other (e.g. a tree of folders). Currently, this structure is encoded by storing lists of ancestors of each node. Since several nodes may have same uids (e.g. names of subfolders in separate paths of the tree), the lists of ancestors serve as a second part of the composite key to ensure uniqueness (please see attached). While having lists of ancestors readily available may improve the query performance, it also raises the question of the efficiency of data storage and the complexity of the key.

20240307_092937

A possible solution could be:

  1. Ensure that uid column uniquely identifies each node in the table.
  2. Use these uids to capture the tree structure in an external linked "closure" table. The idea is to keep track of every "ancestor-descendant" pair (not just "parent-child") along with the corresponding distance (depth) between them. Assuming that each row is it's own descendant and ancestor at depth 0, this leads to fast rules for data queries and updates without the need to keep lists of references.

For small (shallow) hierarchies the befits of using closure tables could be minimal, but it should be quite noticeable with larger datasets and would help us to future-proof the database schema. Please share your thoughts and comments!
@danielballan, @dylanmcreynolds, @Kezzsim

@genematx genematx added the SQL label Mar 8, 2024
@danielballan
Copy link
Member

The closure table looks like very elegant pattern that applies well to our use case. I like that triggers can be used to keep the closure table in sync with the "main" table (in our case, the nodes table) without the application getting directly involved.

I think we can use the id as a unique ID for each node.

# This id is internal, never exposed to the client.
id = Column(Integer, primary_key=True, index=True, autoincrement=True)

The key column, which is sometimes a BlueskyRun <uid> but is in general is any string, is akin to name in the example in that article. The key / name can repeat; we just need a unique constraint on (key, parent). (Currently we have a unique constraint on (key, ancestors).)

I think makes sense to sketch this at a level we can benchmark it against realistic data at a realistic scale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants