-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
validate references #30
base: main
Are you sure you want to change the base?
Conversation
Per the team discussion yesterday, I've made further changes to this branch, including:
Details including configuration options can be found in the updated README. I've tested this, it works and is significantly faster (same processing of ~960 stuSchoolAssns that took 3+ mins before making the above performance changes now takes <1.5 mins). Marking the PR ready for review. |
# to comparatively datasets (sections, schools, students). | ||
self.load_local_reference_data(endpoint) | ||
# create a structure which remote reference lookups can populate to prevent repeated lookups for the same thing | ||
self.remote_reference_cache = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This resets the cache for each endpoint, perhaps we only want to do it once outside the loop (at line 35) since multiple endpoints may reference the same (cacheable) resources?
This PR implements reference validation - a new feature whereby
lightbeam validate
will attempt to resolve Reference properties within a payload by first looking up the natural key values in local JSONL files for the referenced resource ("local references"), and if none are found, then doing a GET against the Ed-Fi API with the same natural key values ("remote references"). If a given reference is neither local nor remote, the payload will fail validation.Because this feature can be slow (see comment 1 below), by default it is disabled. This is achieved by adding a new optional structure in
lightbeam.yaml
as follows:(In addition to the above, one more basic validation method is done before any of these: that the payload is valid JSON. This validation method cannnot be disabled because all the others require a valid JSON payload.) Default
validate.methods
(if unspecified by the user inlightbeam.yaml
are["schema", "descriptors", "uniqueness"]
.I've tested the functionality and it works, however I'm leaving this PR as a draft for several reasons:
The feature is slow - due to the serial nature of lookups, it took 3+ minutes to validate the
schoolReferences
andstudentReferences
in 954studentSchoolAssociation
payloads. I have a few ideas for how to improve performance, but I'm unsure how much effort to invest in this optimization work.Because of 1 above, it seems prudent to try to allow reference validation to fail quickly, when some user-specifiable threshold is reached. I propose a configuration structure (in
lightbeam.yaml
) like this:An open question is whether a default value of
10
max_failures
is reasonable.Another open question is whether we should also "succeed fast": if
studentEducationOrganizationAssociation
theeducationOrganizationReference
property may beor
Currently, this reference validation implementation determines what resource is being referenced based on the reference property name, i.e.,
schoolReference
must be a reference to aschool
. ForeducationOrganizationReference
, in principle one could derive the resource fromlink.rel
which more specifically denotes the type ofeducationOrganization
(localEducationAgency
,school
, etc.). Howeverlightbeam
JSON payloads created with earthmover often omit thelink
property and instead simply look likefrom where it is impossible to determine what type of
educationOrganization
is being referenced. Possible solutions might belightbeam
know internally which Ed-Fi resources are members of what "type" and iteratively looks for the referenced values across all of them - so first look for schools withschoolId=255901
, then for localEducationAgencies withlocalEducationAgencyId=255901
, etc.link.rel
(at least) to be present in JSONL payloads when using reference validationI welcome feedback on the above questions and this PR in general.