Object level metadata from JSON schema #3758

Open
birnbera opened this issue Oct 5, 2023 · 0 comments

This relates to #2963, but I wanted to create a separate issue because it is a very different method of updating metadata in packages. I'm posting it here as an interesting option for other users and as something to consider for inclusion as a Quilt feature in future releases.

When creating packages, it is usually straightforward to add package-level metadata. Adding metadata to the individual objects, however, can be challenging. In our case, we already store some metadata in the path to our files, such as sample IDs and several other types of entity IDs depending on the use case. Since Quilt already includes logic to validate individual entries in a package manifest, I found a way to use that same schema to infer metadata for objects based on their path.

When Quilt performs entry validation in a workflow it generates a list of Python dictionaries, with the keys logical_key, size, and meta:

def get_pkg_entries_for_validation(self, pkg):
    # TODO: this should be validated without fully populating array.
    empty_dict = {}

    def reuse_empty_dict(meta):
        # Reuse the same empty dict for entries without meta
        # to reduce memory usage.
        return empty_dict if meta == {} else meta

    return [
        {
            'logical_key': lk,
            'size': e.size,
            "meta": reuse_empty_dict(e.meta),
        }
        for lk, e in pkg.walk()
    ]

The meta key refers to the user_meta subkey of the object's metadata. If you create a JSON schema that matches a logical_key using a regex pattern, it is possible to include named capture groups, e.g.:

{
    "type": "object",
    "properties": {
        "logical_key": {
            "type": "string",
            "pattern": "^samtools/(?P<sampleId>[^/]+)/[^/]+\\.txt$"
        }
    }
}
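To make the named capture concrete, here is a minimal sketch using only Python's `re` module with the same pattern as the schema above. The logical keys and the `captures` helper are made up for illustration; they are not part of Quilt.

```python
import re

# Same pattern as in the schema above; JSON's "\\." becomes "\." once parsed.
PATTERN = r"^samtools/(?P<sampleId>[^/]+)/[^/]+\.txt$"


def captures(logical_key):
    """Return the named groups for a logical key, or None if it does not match."""
    m = re.search(PATTERN, logical_key)
    return m.groupdict() if m else None


# Hypothetical logical keys, for illustration only.
print(captures("samtools/S001/stats.txt"))         # {'sampleId': 'S001'}
print(captures("samtools/S001/nested/stats.txt"))  # None (extra path segment)
```

The dictionary returned by `groupdict()` is exactly what ends up being copied into an entry's `meta` in the validator below.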

Normally, named captures have no effect during validation; they serve only as documentation. However, it is possible to extend a built-in jsonschema validator with additional logic. In our case, we have extended the "properties" validator so that it assigns the captured values to the entry's meta dictionary before proceeding with normal validation. This is the code used to do this:

import re

from jsonschema import Draft7Validator, validators


def extend_with_meta_assignment(validator_class):
    validate_properties = validator_class.VALIDATORS["properties"]

    def set_meta_from_pattern(validator, properties, instance, schema):
        if not validator.is_type(instance, "object"):
            return

        if "logical_key" in properties and "meta" in properties:
            lkey_subschema = properties["logical_key"]
            meta_subschema = properties["meta"]

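            if validator.is_valid(instance.get("logical_key"), lkey_subschema):
                if not validator.is_valid(instance.get("meta"), meta_subschema):
                    meta = instance.setdefault("meta", {})
                    # logical_key already validated against this subschema,
                    # so the pattern (if there is one) is expected to match.
                    pattern = lkey_subschema.get("pattern")
                    m = re.search(pattern, instance["logical_key"]) if pattern else None
                    if m:
                        # Copy every named capture group into the entry's meta
                        for prop, entity_id in m.groupdict().items():
                            meta[prop] = entity_id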

        # Descend and process as normal
        for error in validate_properties(
            validator,
            properties,
            instance,
            schema,
        ):
            yield error

    return validators.extend(
        validator_class,
        {"properties": set_meta_from_pattern},
    )


MetadataAssignmentValidator = extend_with_meta_assignment(Draft7Validator)

After validation with MetadataAssignmentValidator, the object that was passed in has updated meta fields based on the named captures in the pattern. This object can be used to update each PackageEntry before building/pushing the package.
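As a rough sketch of that last step, the helper below copies the inferred metadata back onto the package entries. It assumes `pkg` behaves like a quilt3 `Package` (its `walk()` yields `(logical_key, entry)` pairs and each entry has `set_meta()`), and that `validated_entries` is the list of dicts that was passed through `MetadataAssignmentValidator`. The name `apply_inferred_meta` is hypothetical.

```python
def apply_inferred_meta(pkg, validated_entries):
    """Copy each validated entry's meta dict onto the matching package entry.

    Returns the logical keys that were updated.
    """
    meta_by_key = {e["logical_key"]: e.get("meta") for e in validated_entries}
    updated = []
    for logical_key, entry in pkg.walk():
        meta = meta_by_key.get(logical_key)
        if meta:  # skip entries with no (or empty) inferred metadata
            # set_meta replaces the entry's user metadata with this dict
            entry.set_meta(meta)
            updated.append(logical_key)
    return updated
```

Because `set_meta` replaces the existing user metadata, the dicts in `validated_entries` should start out as copies of each entry's current `meta` so that nothing already present is lost.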

There are a couple of things to watch out for:

  1. Be careful that an entry cannot match more than one subschema. The oneOf keyword is useful here:

     "type": "array",
     "items": {
         "oneOf": [ {...} ]
     }

  2. Directly using the get_pkg_entries_for_validation function from the code above would be a mistake, because it saves memory by reusing a single empty dictionary whenever a package entry has no metadata. Every such entry's meta would then be a reference to the same object, so the fields captured for one entry could show up on all of them.
  3. This only works for Python-style regular expressions. JS named captures use a different syntax ((?<name>...) rather than (?P<name>...)), so if you want to maintain a single set of entry schemas for both validation and setting metadata, Quilt has to continue using a Python JSON schema implementation.