
Rewrite #373 (Draft)

wants to merge 88 commits into master
Conversation

@Stranger6667 (Owner) commented May 22, 2022

Overview

This is a work-in-progress PR that brings in the jsonschema-rewrite crate, which takes a different approach to the library's internal structure.

At a certain stage, it will be ready for review, but feel free to share your thoughts here.

Eventually, this PR will include all features provided by the current version, but at the moment it is more of a proof of concept.

Porting some optimizations from the current version will be needed for proper benchmarking, but first, I'd like to focus on correctness.

Motivation

Validation performance. Building a "compiled" JSON Schema representation is a lesser concern, as it is supposed to be created once per schema and then reused. However, once the implementation is stable, I'd like to improve it too.

My personal use case is filtering randomly generated data, leaving only samples that match some schema. It is a CPU-bound task, and I'd like it to be as fast as possible (it also affects my AWS bills). Most of the data is generated by construction, but some parts are hard to implement that way, hence resorting to filtering.

Lessons learned

Over the last few years, I have spent quite some time optimizing the current implementation, and at some point I realized that it is fundamentally limited by the internal structure I chose at the very beginning.

From my perspective, the most annoying aspects of the current approach are:

  1. Fragmented data layout. Most of the data blocks required for validation live in separate allocations, which is not cache-friendly: validation frequently reads from scattered memory locations.
  2. Allocations during validate. In applicator keywords, we almost always allocate a new vector for child validation errors. In other words, error iteration is not truly lazy: generally, validation runs first, and only then do we iterate over the collected errors.
  3. Too much code generation for validate. We use flat_map excessively, which produces a lot of code in the resulting binary (according to llvm-lines). This affects compilation time (and presumably runtime performance too).
  4. Locking during reference resolving. We use an RwLock for reference resolving during validation, which adds overhead. I removed it with some unsafe code (it was a single-threaded environment, so it was OK), and that gave a ~10-12% improvement on the Open API JSON Schema benchmark.
  5. Async resolving is complex to implement. Because resolving lives deep in the internals, making it async requires significant effort.
  6. Relying too much on serde_json. This forces users to convert schemas & inputs to serde_json::Value, which is a huge performance hit for bindings like Python's, where data transformation can eat up to 85% of total validation time.
  7. Annotations affect simpler use cases. Currently, annotation data lives together with validators, and we need a few extra condition checks during validation, which makes annotations a non-zero-cost abstraction. Historically it was a quite small performance regression, and great work was put into optimizing it. I think we can regain that regression by storing annotation data separately and using it only in apply calls.

Ideas

Here are ideas on how each issue could be approached.

Fragmented data layout & annotations

Store input schemas in as few allocations as possible. My current idea is:

  • Store keywords & their data in a single allocation
  • Store edges between keywords in another allocation

Applicator keywords may have some edges, e.g. properties might store indexes into the edges vector, and those edges point back into the keywords vector. This should be enough to traverse keywords in the right order and extract the input components needed for specific keywords.

Keywords may need additional data, e.g. required needs a list of properties. This data may live in a shared arena too - I'd like to avoid nested allocations completely. Eventually, it would be nice to make it work in no-std environments, where the user provides the memory in which the "compiled" schema version stores its data. Not sure how this will play with custom keywords, where users may have arbitrary data, but it should be solvable.
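A minimal sketch of how such a flat layout could look. All names here (`Keyword`, `Edge`, `Schema`) are illustrative, not the crate's actual types:

```rust
use std::ops::Range;

// Illustrative flat layout: keywords, edges, and keyword payloads each
// live in a single allocation instead of many nested ones.
#[derive(Debug)]
enum Keyword {
    // `required` keeps a range into the shared string arena
    // instead of owning its own Vec<String>.
    Required { properties: Range<usize> },
    // Applicators keep a range into the shared edges vector.
    Properties { edges: Range<usize> },
}

#[derive(Debug)]
struct Edge {
    label: usize,   // index into `strings` (e.g. a property name)
    keyword: usize, // index back into `keywords`
}

struct Schema {
    keywords: Vec<Keyword>, // all keywords, one allocation
    edges: Vec<Edge>,       // all edges between keywords, one allocation
    strings: Vec<String>,   // shared arena for keyword payloads
}
```

Traversal then means following index ranges between the two vectors rather than chasing pointers across separate allocations.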

Similarly, annotations could be stored in a separate container to avoid loading them during is_valid / validate, but this should be benchmarked first.

Allocations during validate & excessive code generation

Here we may implement the Iterator trait so that it traverses the schema & input and stores its state on manually-handled stacks. E.g., with the layout above we can have two vectors that store the traversal position in keywords & edges. It gets a bit tricky with applicator keywords that may need to run validation on all their children first, but we can still store this state in a struct that implements Iterator. This way, the only allocations should be the ones that store the iterator state.

It also implies that we won't need flat_map as much - return types become simpler. E.g., instead of always returning an iterator, keywords can return Result<(), ValidationError> / Option<ValidationError>, which is then used in the Iterator trait implementation.
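A toy sketch of the idea, assuming keywords return `Option<ValidationError>`; the names and the integer-only instance are purely illustrative:

```rust
// Toy error type; the real one carries path and schema information.
#[derive(Debug, PartialEq)]
struct ValidationError(&'static str);

// A "keyword" here is just a check over an integer instance.
type Keyword = fn(i64) -> Option<ValidationError>;

struct ErrorIter<'k> {
    keywords: &'k [Keyword],
    instance: i64,
    // Manually-handled traversal state; a real implementation would also
    // keep a stack of positions for applicator keywords.
    position: usize,
}

impl<'k> Iterator for ErrorIter<'k> {
    type Item = ValidationError;

    fn next(&mut self) -> Option<ValidationError> {
        // Errors are produced on demand: no intermediate Vec is allocated.
        while self.position < self.keywords.len() {
            let keyword = self.keywords[self.position];
            self.position += 1;
            if let Some(error) = keyword(self.instance) {
                return Some(error);
            }
        }
        None
    }
}
```

With this shape, validate can hand back the iterator immediately, and is_valid can stop at the first error without collecting the rest.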

Locking during reference resolving & async

Resolve everything upfront, then there will be no need to store anything extra during validation, hence no locking is needed.

The layout above also allows us to handle recursive references much easier and gives us an opportunity to optimize even more by inlining trivial schemas (like ajv does), so we avoid some jumps across the schema at all.
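For illustration, inlining a trivial reference would replace the $ref with its target subschema at compilation time, so validation never jumps across the schema. A hypothetical before/after:

```json
{
  "definitions": { "positive": { "minimum": 0 } },
  "properties": { "count": { "$ref": "#/definitions/positive" } }
}
```

After inlining, the compiled representation behaves as if the schema were:

```json
{
  "properties": { "count": { "minimum": 0 } }
}
```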

If this ends up being just a single function, it is easier to make it async:

```rust
let compiled = jsonschema::build_async(&schema).await?;
```

Not sure about the most ergonomic interface here, but separating resolving definitely makes async easier to work with.

Input layout & serde_json

This one could be done later, but the idea is to create a public trait, e.g. Json, and make this library use it in its public API. This trait would make it possible to work with any input as if it were a JSON value: checking types, extracting values, iterating, etc. By default, we could implement it only for serde_json::Value. However, I am not sure it would be completely zero-cost even for serde_json::Value.
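A minimal sketch of what such a trait could look like; the trait surface and the `ToyValue` type are hypothetical (in the library, the default implementation would target `serde_json::Value`):

```rust
// Hypothetical input-abstraction trait; only the shape is suggested here.
trait Json {
    fn is_string(&self) -> bool;
    fn as_str(&self) -> Option<&str>;
}

// Stand-in for any concrete JSON representation (e.g. a Python object
// wrapper in bindings, avoiding conversion to serde_json::Value).
enum ToyValue {
    String(String),
    Number(f64),
}

impl Json for ToyValue {
    fn is_string(&self) -> bool {
        matches!(self, ToyValue::String(_))
    }

    fn as_str(&self) -> Option<&str> {
        match self {
            ToyValue::String(s) => Some(s),
            _ => None,
        }
    }
}

// Validators written against the trait work with any input representation.
fn check_type<J: Json>(instance: &J) -> bool {
    instance.is_string()
}
```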

Progress

The current state contains:

  • Reference resolving at the compilation phase + avoiding locks
  • Building graphs with the new layout
  • PoC for is_valid
  • All keywords
  • Multi-draft support
  • Custom resolvers
  • Misc customization options (documents, format checks, etc)
  • Error iterator
  • Additional data inside keywords
  • Async resolving
  • Annotations & formats
  • Optimizations from the current version
  • Generic input type

@codecov bot commented May 22, 2022

Codecov Report

Merging #373 (74e0bd6) into master (44b7e54) will decrease coverage by 0.22%.
The diff coverage is n/a.

```diff
@@            Coverage Diff             @@
##           master     #373      +/-   ##
==========================================
- Coverage   78.87%   78.66%   -0.22%
==========================================
  Files          54       54
  Lines        4488     4481       -7
==========================================
- Hits         3540     3525      -15
- Misses        948      956       +8
```

| Impacted Files | Coverage Δ |
| --- | --- |
| jsonschema/src/keywords/type_.rs | 76.70% <0.00%> (-3.98%) ⬇️ |
| jsonschema/src/keywords/legacy/type_draft_4.rs | 68.47% <0.00%> (-1.09%) ⬇️ |
| jsonschema/src/resolver.rs | 78.99% <0.00%> (-0.85%) ⬇️ |
| jsonschema/src/compilation/options.rs | 84.77% <0.00%> (-0.38%) ⬇️ |
| jsonschema/src/schemas.rs | 89.62% <0.00%> (-0.10%) ⬇️ |
| jsonschema/src/keywords/helpers.rs | 62.50% <0.00%> (+1.27%) ⬆️ |


@Stranger6667 Stranger6667 mentioned this pull request May 25, 2022
@Stranger6667 Stranger6667 added the "help wanted" label Jun 26, 2022
@jp707049

Hi @Stranger6667 ,

I'm new to Rust and came across this crate to validate some JSON files that I have for Kubernetes.

  • Will this PR also enable an external reference resolver? I have multiple JSON files that contain $ref pointers linking to other files.

@Stranger6667 (Owner) commented Nov 13, 2022

Hi @jp707049

Reference resolving is available outside of this PR; this PR only changes WHEN the resolving step happens. Here is the relevant README section.

@Stranger6667 Stranger6667 changed the title from "chore: Initial incomplete prototype" to "Rewrite" Nov 13, 2022
@jp707049

Hi @Stranger6667 ,
Thanks for the README link. However, in my case I don't use HTTP. Here is my use case:

deployment.json - This file contains a $ref that specifies the path to another JSON file in the same directory.

```json
"spec": {
  "$ref": "_definitions.json#/definitions/io.k8s.api.apps.v1.DeploymentSpec",
  "description": "Specification of the desired behavior of the Deployment."
}
```

_definitions.json

```json
{
  "definitions": {
    "io.k8s.api.apps.v1.DeploymentSpec": {
      "description": "DeploymentSpec is the specification of the desired behavior of the Deployment.",
      "properties": {
        "minReadySeconds": {
          "description": "Minimum number of seconds for which a newly created pod should be ready without any of its container crashing, for it to be considered available. Defaults to 0 (pod will be considered available as soon as it is ready)",
          "format": "int32",
          "type": "integer"
        },
        "paused": {
          "description": "Indicates that the deployment is paused.",
          "type": "boolean"
        },
        "progressDeadlineSeconds": {
          "description": "The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.",
          "format": "int32",
          "type": "integer"
        },
        "replicas": {
          "description": "Number of desired pods. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1.",
          "format": "int32",
          "type": "integer"
        },
        "revisionHistoryLimit": {
          "description": "The number of old ReplicaSets to retain to allow rollback. This is a pointer to distinguish between explicit zero and not specified. Defaults to 10.",
          "format": "int32",
          "type": "integer"
        },
        "selector": {
          "$ref": "#/definitions/io.k8s.apimachinery.pkg.apis.meta.v1.LabelSelector",
          "description": "Label selector for pods. Existing ReplicaSets whose pods are selected by this will be the ones affected by this deployment. It must match the pod template's labels."
        },
        "strategy": {
          "$ref": "#/definitions/io.k8s.api.apps.v1.DeploymentStrategy",
          "description": "The deployment strategy to use to replace existing pods with new ones.",
          "x-kubernetes-patch-strategy": "retainKeys"
        },
        "template": {
          "$ref": "#/definitions/io.k8s.api.core.v1.PodTemplateSpec",
          "description": "Template describes the pods that will be created."
        }
      },
      "required": ["selector", "template"],
      "type": "object"
    }
  }
}
```

By default, the URL scheme is json-schema, and I get the following error message:

```
Validation error: failed to resolve json-schema:///_definitions.json: cannot resolve relative external schema without root schema ID
```

Note:

  • Both files are in the schemas/v1.22.0/*.json folder.
  • There is no $id in the JSON files. What should I put as $id if it's required?
  • I have not set any features related to the latest drafts for this crate.


```rust
/// Unique identifier of a node in a graph.
#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)]
pub(crate) struct NodeId(usize);
```
@Stranger6667 (Owner):
Could be NonZeroUsize - the 0th node should not be addressable
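A sketch of what the suggested change could look like; with NonZeroUsize, Option<NodeId> occupies the same space as NodeId itself (niche optimization), and id 0 is rejected at construction time:

```rust
use std::num::NonZeroUsize;

// Sketch only: mirrors the derives of the original NodeId, but wraps
// NonZeroUsize so the 0th node is unrepresentable.
#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)]
struct NodeId(NonZeroUsize);

impl NodeId {
    fn new(id: usize) -> Option<NodeId> {
        // Returns None for 0, enforcing the "not addressable" invariant.
        NonZeroUsize::new(id).map(NodeId)
    }
}
```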

```rust
#[derive(Debug, Copy, Clone)]
pub(crate) struct Node<'s> {
    pub(crate) value: &'s Value,
    kind: JsonSchemaNodeKind,
```
@Stranger6667 (Owner):
This could be named like TraversalScope

```rust
    schema: &'s Value,
    root: &'s Resolver,
    resolvers: &'s HashMap<&str, Resolver>,
) -> Result<CompressedRangeGraph> {
```
@Stranger6667 (Owner):
Maybe store node ids of parent / children / siblings right here? Then we could avoid having a separate vec for edges, and there would be no need to maintain a stack during validation (each node knows where to go after validation is done at its level):

```rust
struct Node {
    parent: Option<NonZeroUsize>,
    children: Option<Range<NonZeroUsize>>,
    data: Validator,
}
```
