Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate references #30

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

validate references #30

wants to merge 10 commits into from

Conversation

tomreitz
Copy link
Collaborator

@tomreitz tomreitz commented May 8, 2024

This PR implements reference validation - a new feature whereby lightbeam validate will attempt to resolve Reference properties within a payload by first looking up the natural key values in local JSONL files for the referenced resource ("local references"), and if none are found, then doing a GET against the Ed-Fi API with the same natural key values ("remote references"). If a given reference is neither local nor remote, the payload will fail validation.

Because this feature can be slow (see comment 1 below), by default it is disabled. This is achieved by adding a new optional structure in lightbeam.yaml as follows:

validate:
  methods:
    - schema # checks that payloads conform to the Swagger definitions from the API
    - descriptors # checks that descriptor values are either locally-defined or exist in the remote API
    - uniqueness # checks that local payloads are unique by the required property values
    - references # NEW checks that references resolve

(In addition to the above, one more basic validation method is done before any of these: that the payload is valid JSON. This validation method cannnot be disabled because all the others require a valid JSON payload.) Default validate.methods (if unspecified by the user in lightbeam.yaml are ["schema", "descriptors", "uniqueness"].

I've tested the functionality and it works, however I'm leaving this PR as a draft for several reasons:

  1. The feature is slow - due to the serial nature of lookups, it took 3+ minutes to validate the schoolReferences and studentReferences in 954 studentSchoolAssociation payloads. I have a few ideas for how to improve performance, but I'm unsure how much effort to invest in this optimization work.

  2. Because of 1 above, it seems prudent to try to allow reference validation to fail quickly, when some user-specifiable threshold is reached. I propose a configuration structure (in lightbeam.yaml) like this:

validate:
  references:
    max_failures: 10 # stop testing after 10 failed payloads ("fail fast")

An open question is whether a default value of 10 max_failures is reasonable.

Another open question is whether we should also "succeed fast": if

  1. An unresolved issue is how to handle "typed" resources. For example, in studentEducationOrganizationAssociation the educationOrganizationReference property may be
{
  "educationOrganizationId": 255901,
  "link": {
    "rel": "LocalEducationAgency",
    "href": "/ed-fi/localEducationAgencies/5ec280c188db4f0bae9ef60e2ae5c231"
  }
}

or

{
  "educationOrganizationId": 255901044,
  "link": {
    "rel": "School",
    "href": "/ed-fi/school/7381540c0eff4b778d0399ce8b397c9a"
  }
}

Currently, this reference validation implementation determines what resource is being referenced based on the reference property name, i.e., schoolReference must be a reference to a school. For educationOrganizationReference, in principle one could derive the resource from link.rel which more specifically denotes the type of educationOrganization (localEducationAgency, school, etc.). However lightbeam JSON payloads created with earthmover often omit the link property and instead simply look like

{
  "educationOrganizationId": 255901044
}

from where it is impossible to determine what type of educationOrganization is being referenced. Possible solutions might be

  • having lightbeam know internally which Ed-Fi resources are members of what "type" and iteratively looks for the referenced values across all of them - so first look for schools with schoolId=255901, then for localEducationAgencies with localEducationAgencyId=255901, etc.
  • requiring link.rel (at least) to be present in JSONL payloads when using reference validation

I welcome feedback on the above questions and this PR in general.

@tomreitz
Copy link
Collaborator Author

tomreitz commented May 10, 2024

Per the team discussion yesterday, I've made further changes to this branch, including:

  • performance improvements on local references checks: load relevant properties of relevant ref'd endpoints from local files into an in-memory cache once, rather than looping over and reading files with every payload
  • performance improvements on remote reference checks: (asynchronously) GETting remote references in batches and caching responses
  • implement both "fail fast" and "succeed fast" features

Details including configuration options can be found in the updated README.

I've tested this, it works and is significantly faster (same processing of ~960 stuSchoolAssns that took 3+ mins before making the above performance changes now takes <1.5 mins). Marking the PR ready for review.

@tomreitz tomreitz marked this pull request as ready for review May 10, 2024 19:32
@tomreitz tomreitz changed the title inital commit with a working implementation validate references May 13, 2024
# to comparatively datasets (sections, schools, students).
self.load_local_reference_data(endpoint)
# create a structure which remote reference lookups can populate to prevent repeated lookups for the same thing
self.remote_reference_cache = {}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This resets the cache for each endpoint, perhaps we only want to do it once outside the loop (at line 35) since multiple endpoints may reference the same (cacheable) resources?

@tomreitz tomreitz requested a review from jalvord1 May 16, 2024 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant