Integration with GCRcatalogs #106

Open
stuartmcalpine opened this issue Mar 29, 2024 · 3 comments

@stuartmcalpine
Collaborator

Can we replace aspects of GCRcatalogs with the registry?

For example, people would register the GCRcatalogs configuration files into the registry rather than hard coding them into the GCRcatalogs code.

Can we scrape these config files in a useful way to make ingesting them a bit easier?

@stuartmcalpine stuartmcalpine added the enhancement New feature or request label Mar 29, 2024
@stuartmcalpine stuartmcalpine self-assigned this Mar 29, 2024
@JoanneBogart
Collaborator

I'm starting to formulate a design for this. But first I'd like to see changes in the way the production dataset is created and handled. It's a separate issue, but related in that dealing with these legacy datasets will be our first serious use of production.
I think we need another db account, say production_rw, which is used to create the production db and is the only account to have write privileges there. reg_reader - and perhaps also reg_writer - should be given read access at creation time, just as reg_reader is given read privileges now.

At NERSC, production datasets will be under the existing shared area,
/global/cfs/cdirs/lsst/shared
root_dir for those accessing only the production db would be /global/cfs/cdirs/lsst/shared. I guess we could make a symlink
/global/cfs/cdirs/lsst/shared/production --> /global/cfs/cdirs/lsst/shared
For a non-production database, if it has default root_dir = /something/root_dir there needs to be a symlink
/something/root_dir/production --> /global/cfs/cdirs/lsst/shared/
(This is particularly clumsy because there already is a /global/cfs/cdirs/lsst/production. And although there isn't currently a /global/cfs/cdirs/lsst/shared/production, it doesn't seem quite right for us to usurp this path just for the registry. Maybe we should change the owner_type name -- or at least the corresponding subdirectory name -- from production to, e.g., dataregistry_production to avoid name conflicts.)
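To make the symlink idea concrete, here is a minimal sketch that runs in a scratch directory instead of on the real NERSC filesystem. The helper name link_production is hypothetical, and the subdirectory name dataregistry_production is only the alternative floated above, not something that has been decided.

```python
import os
import tempfile


def link_production(root_dir: str, shared_dir: str, name: str = "production") -> str:
    """Create root_dir/<name> as a symlink to shared_dir; return the link target.

    Hypothetical helper, not part of the registry code.
    """
    link_path = os.path.join(root_dir, name)
    if os.path.islink(link_path):
        # An existing link is left alone; just report where it points.
        return os.readlink(link_path)
    # Raises FileExistsError if a real file or directory already uses the name.
    os.symlink(shared_dir, link_path, target_is_directory=True)
    return os.readlink(link_path)


# Demonstration with stand-in directories.
with tempfile.TemporaryDirectory() as scratch:
    shared = os.path.join(scratch, "shared")      # stand-in for /global/cfs/cdirs/lsst/shared
    root_dir = os.path.join(scratch, "root_dir")  # stand-in for /something/root_dir
    os.makedirs(shared)
    os.makedirs(root_dir)
    target = link_production(root_dir, shared, name="dataregistry_production")
    # The link should resolve to the shared area.
    demo_ok = os.path.samefile(os.path.join(root_dir, "dataregistry_production"), shared)
```

Using a distinct name such as dataregistry_production would keep the link from colliding with the existing /global/cfs/cdirs/lsst/production directory.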

Assuming something like the above has been done, adding a registry entry for an existing dataset known to GCRCatalogs is straightforward for "simple" (explained below) catalogs: call dataset.register as usual with old_location set to None, name equal to the GCRCatalogs name (basename of its config file, not including ".yaml"), and access_API set to GCRCatalogs. The value of access_API_configuration for existing datasets is deducible. Values for some other parameters (including at least relative_path and description) might be retrievable from the config, but there is no guarantee, since hardly anything about the format of a GCRCatalogs config file is fixed.
Since GCRCatalogs has no uniform way to specify dataset version, I think we (that is, a function called something like dataset.register_gcr_catalog) can by default set the version string to, e.g., '1.0.0' but allow the caller to override it.
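A rough sketch of what such a wrapper might assemble before handing off to dataset.register. Everything here is an assumption for illustration: the function name gcr_registration_kwargs is hypothetical, the keyword names simply mirror the parameter names used in this comment, and the real dataset.register signature has not been checked. The sketch only builds a dict and does not call the registry.

```python
import os


def gcr_registration_kwargs(config_path, relative_path, access_api_configuration,
                            version="1.0.0", description=None):
    """Build registration arguments for a "simple" GCRCatalogs catalog.

    Hypothetical helper: it only collects values, it does not register anything.
    """
    base = os.path.basename(config_path)
    # The registry name is the config basename without the ".yaml" suffix.
    name = base[:-len(".yaml")] if base.endswith(".yaml") else base
    return {
        "name": name,
        "version": version,              # default '1.0.0', caller may override
        "relative_path": relative_path,  # must be supplied; not always in the config
        "old_location": None,            # the data already lives in the shared area
        "access_API": "GCRCatalogs",
        "access_API_configuration": access_api_configuration,
        "description": description,
    }


# Example with a made-up config file name and path.
example_kwargs = gcr_registration_kwargs(
    "/some/config/dir/example_catalog.yaml",
    relative_path="example/catalog/dir",
    access_api_configuration="(contents or location of the config, to be decided)",
)
```

Passing relative_path explicitly reflects the point above that it cannot be reliably extracted from every config.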
By a "simple" catalog I mean one whose config has a value corresponding to relative_path. There are other kinds of catalogs: catalogs based on another config (e.g., only including a subset of the data), catalogs which are aliases for some other catalogs, and catalogs which are composites, essentially joining two or more simple catalogs. For the aliases we can probably just use our dataset_alias table. A catalog "based on" another catalog should be tractable. I'm not sure about the composites. Unfortunately for us they're quite useful so we'll have to come up with something.
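When scraping the existing configs, a first pass could sort them into these four kinds. The sketch below works on an already-parsed config (a plain dict). The key names alias, based_on and catalogs are assumptions about how GCRCatalogs configs mark these cases and should be checked against the actual config files before relying on them.

```python
def classify_config(config: dict) -> str:
    """Return 'alias', 'based_on', 'composite' or 'simple' for a parsed config.

    The keys tested here are assumed, not verified against GCRCatalogs.
    """
    if "alias" in config:
        return "alias"        # candidate for the dataset_alias table
    if "based_on" in config:
        return "based_on"     # derived from another config
    if "catalogs" in config:
        return "composite"    # joins two or more other catalogs
    return "simple"           # expected to carry something like relative_path


# Made-up configs, one per kind.
examples = {
    "a": {"alias": "b"},
    "b": {"based_on": "c", "healpix_pixels": [1, 2]},
    "c": {"subclass_name": "some.Reader", "base_dir": "/some/dir"},
    "d": {"catalogs": [{"catalog_name": "c"}, {"catalog_name": "b"}]},
}
kinds = {name: classify_config(cfg) for name, cfg in examples.items()}
```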
Ideally GCRCatalogs.load_catalog(catalog_name) should then be able to look up the registry entry, retrieve the contents of access_API_configuration and go on its merry way.
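The lookup step could look roughly like this. Plain dicts stand in for the registry's dataset and dataset_alias tables, and the function name lookup_access_config is hypothetical; a real implementation would query the registry instead.

```python
def lookup_access_config(datasets: dict, aliases: dict, catalog_name: str):
    """Resolve aliases, then return access_API_configuration for the entry.

    datasets and aliases are in-memory stand-ins for registry tables.
    """
    name = catalog_name
    seen = set()
    while name in aliases:
        if name in seen:
            raise ValueError(f"alias loop detected at {name!r}")
        seen.add(name)
        name = aliases[name]
    if name not in datasets:
        raise KeyError(f"no registry entry named {name!r}")
    entry = datasets[name]
    if entry["access_API"] != "GCRCatalogs":
        raise ValueError(f"{name!r} is not a GCRCatalogs dataset")
    return entry["access_API_configuration"]


# Made-up registry contents.
datasets = {
    "example_catalog": {
        "access_API": "GCRCatalogs",
        "access_API_configuration": "example configuration",
    },
}
aliases = {"example_latest": "example_catalog"}
found = lookup_access_config(datasets, aliases, "example_latest")
```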

@JoanneBogart
Collaborator

Noting that, as @stuartmcalpine suggested, some notion of "collection", similar to the concept in Rucio, may help with composite catalogs.

@JoanneBogart
Collaborator

@yymao your thoughts on this issue would be most welcome!
