
Rewrite #373 (Draft)

wants to merge 88 commits into master
Conversation

@Stranger6667 (Owner) commented May 22, 2022

Overview

This is a work-in-progress PR that brings in the jsonschema-rewrite crate, which takes a different approach to the library's internal structure.

At a certain stage, it will be ready for review, but feel free to share your thoughts here.

Eventually, this PR will include all features provided by the current version, but at the moment it is more of a proof of concept.

Porting some optimizations from the current version will be needed for proper benchmarking, but first, I'd like to focus on correctness.

Motivation

Validation performance. Building a "compiled" JSON Schema representation is a lesser concern, as it is supposed to be created once per schema and then reused. However, once the implementation is stable, I'd like to improve it too.

My personal use case is filtering randomly generated data, leaving only samples that match some schema. It is a CPU-bound task, and I'd like it to be as fast as possible (it also affects my AWS bills). Most of the data is generated by construction, but some parts are hard to implement that way, hence resorting to filtering.

Lessons learned

Over the last few years, I have spent quite some time optimizing the current implementation, and at some point I realized that it is fundamentally limited by the internal structure I chose at the very beginning.

From my perspective, the most annoying aspects of the current approach are:

  1. Fragmented data layout. Most of the data blocks required for validation live in separate allocations, which is not cache-friendly: validation frequently reads from scattered memory locations.
  2. Allocations during validate. In applicator keywords, we almost always allocate a new vector for child validation errors. In other words, error iteration is not truly lazy: generally, validation runs first, and only then do we iterate over the collected errors.
  3. Too much code generation for validate. We use flat_map excessively, which produces a lot of code in the resulting binary (according to llvm-lines). This affects compilation time (and presumably runtime performance too).
  4. Locking during reference resolving. We use an RwLock for reference resolving during validation, which adds overhead. I removed it with some unsafe code (it was a single-threaded environment, so it was OK), and that gave a ~10-12% improvement on the Open API JSON Schema benchmark.
  5. Async resolving is complex to implement. Because resolving lives deep in the internals, making it async requires significant effort.
  6. Relying too much on serde_json. This forces users to convert schemas & inputs to serde_json::Value, which is a huge performance hit for bindings like Python's, where data transformation can eat up to 85% of total validation time.
  7. Annotations affect simpler use cases. Currently, annotation data lives together with validators, and we need a few extra condition checks during validation, which makes annotations a non-zero-cost abstraction. Historically it was a quite small performance regression, and great work was put into optimizing it. I think we can regain that regression by storing annotation data separately and using it only in apply calls.

Ideas

Here are ideas on how each issue could be approached.

Fragmented data layout & annotations

Store input schemas in as few allocations as possible. My current idea is:

  • Store keywords & their data in a single allocation
  • Store edges between keywords in another allocation

Applicator keywords may have some edges, e.g. properties might store indexes into the edges vector, and those edges point back into the keywords vector. This should be enough to traverse keywords in the right order and extract the input components needed for specific keywords.

Keywords may need additional data, e.g. required needs a list of properties. This data may live in a shared arena too - I'd like to avoid nested allocations completely. Eventually, it would be nice to make it work in no-std environments, where the user provides the memory in which the "compiled" schema version stores its data. Not sure how this will play with custom keywords, where users may have arbitrary data, but it should be solvable.
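A minimal sketch of how such a flat layout could look. All names here (`Keyword`, `Edge`, `Schema`) are illustrative, not the crate's actual types:

```rust
use std::ops::Range;

// Illustrative flat layout: keywords, edges, and keyword payloads each
// live in a single allocation instead of many nested ones.
#[derive(Debug)]
enum Keyword {
    // `required` keeps a range into the shared string arena
    // instead of owning its own Vec<String>.
    Required { properties: Range<usize> },
    // Applicators keep a range into the shared edges vector.
    Properties { edges: Range<usize> },
}

#[derive(Debug)]
struct Edge {
    label: usize,   // index into `strings` (e.g. a property name)
    keyword: usize, // index back into `keywords`
}

struct Schema {
    keywords: Vec<Keyword>, // all keywords, one allocation
    edges: Vec<Edge>,       // all edges between keywords, one allocation
    strings: Vec<String>,   // shared arena for keyword payloads
}
```

Traversal then means following index ranges between the two vectors rather than chasing pointers across separate allocations.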

Similarly, annotations could be stored in a separate container to avoid loading them during is_valid / validate, but this should be benchmarked first.

Allocations during validate & excessive code generation

Here we may implement the Iterator trait so that it traverses the schema & input and stores its state on manually-handled stacks. E.g., with the layout above we can have two vectors that store the traversal position in keywords & edges. It gets a bit tricky with applicator keywords that may need to run validation on all their children first, but we can still store this state in a struct that implements Iterator. This way, the only allocations should be the ones that store the iterator state.

It also implies that we won't need flat_map as much - return types become simpler. E.g., instead of always returning an iterator, keywords can return Result<(), ValidationError> / Option<ValidationError>, which is then used in the Iterator trait implementation.
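A toy sketch of the idea, assuming keywords return `Option<ValidationError>`; the names and the integer-only instance are purely illustrative:

```rust
// Toy error type; the real one carries path and schema information.
#[derive(Debug, PartialEq)]
struct ValidationError(&'static str);

// A "keyword" here is just a check over an integer instance.
type Keyword = fn(i64) -> Option<ValidationError>;

struct ErrorIter<'k> {
    keywords: &'k [Keyword],
    instance: i64,
    // Manually-handled traversal state; a real implementation would also
    // keep a stack of positions for applicator keywords.
    position: usize,
}

impl<'k> Iterator for ErrorIter<'k> {
    type Item = ValidationError;

    fn next(&mut self) -> Option<ValidationError> {
        // Errors are produced on demand: no intermediate Vec is allocated.
        while self.position < self.keywords.len() {
            let keyword = self.keywords[self.position];
            self.position += 1;
            if let Some(error) = keyword(self.instance) {
                return Some(error);
            }
        }
        None
    }
}
```

With this shape, validate can hand back the iterator immediately, and is_valid can stop at the first error without collecting the rest.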

Locking during reference resolving & async

Resolve everything upfront, then there will be no need to store anything extra during validation, hence no locking is needed.

The layout above also allows us to handle recursive references much easier and gives us an opportunity to optimize even more by inlining trivial schemas (like ajv does), so we avoid some jumps across the schema at all.
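For illustration, inlining a trivial reference would replace the $ref with its target subschema at compilation time, so validation never jumps across the schema. A hypothetical before/after:

```json
{
  "definitions": { "positive": { "minimum": 0 } },
  "properties": { "count": { "$ref": "#/definitions/positive" } }
}
```

After inlining, the compiled representation behaves as if the schema were:

```json
{
  "properties": { "count": { "minimum": 0 } }
}
```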

If this ends up being just a single function, it is easier to make it async:

```rust
let compiled = jsonschema::build_async(&schema).await?;
```

Not sure about the most ergonomic interface here, but separating resolving definitely makes async easier to work with.

Input layout & serde_json

This one could be done later, but the idea is to create a public trait, e.g. Json, and make this library use it in its public API. This trait would make it possible to work with any input as if it were a JSON value: checking types, extracting values, iterating, etc. By default, we could implement it only for serde_json::Value. However, I am not sure it would be completely zero-cost even for serde_json::Value.
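A minimal sketch of what such a trait could look like; the trait surface and the `ToyValue` type are hypothetical (in the library, the default implementation would target `serde_json::Value`):

```rust
// Hypothetical input-abstraction trait; only the shape is suggested here.
trait Json {
    fn is_string(&self) -> bool;
    fn as_str(&self) -> Option<&str>;
}

// Stand-in for any concrete JSON representation (e.g. a Python object
// wrapper in bindings, avoiding conversion to serde_json::Value).
enum ToyValue {
    String(String),
    Number(f64),
}

impl Json for ToyValue {
    fn is_string(&self) -> bool {
        matches!(self, ToyValue::String(_))
    }

    fn as_str(&self) -> Option<&str> {
        match self {
            ToyValue::String(s) => Some(s),
            _ => None,
        }
    }
}

// Validators written against the trait work with any input representation.
fn check_type<J: Json>(instance: &J) -> bool {
    instance.is_string()
}
```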

Progress

The current state contains:

  • Reference resolving at the compilation phase + avoiding locks
  • Building graphs with the new layout
  • PoC for is_valid
  • All keywords
  • Multi-draft support
  • Custom resolvers
  • Misc customization options (documents, format checks, etc)
  • Error iterator
  • Additional data inside keywords
  • Async resolving
  • Annotations & formats
  • Optimizations from the current version
  • Generic input type

@codecov bot commented May 22, 2022

Codecov Report

Merging #373 (74e0bd6) into master (44b7e54) will decrease coverage by 0.22%.
The diff coverage is n/a.

```diff
@@            Coverage Diff             @@
##           master     #373      +/-   ##
==========================================
- Coverage   78.87%   78.66%   -0.22%
==========================================
  Files          54       54
  Lines        4488     4481       -7
==========================================
- Hits         3540     3525      -15
- Misses        948      956       +8
```

| Impacted Files | Coverage Δ |
| --- | --- |
| jsonschema/src/keywords/type_.rs | 76.70% <0.00%> (-3.98%) ⬇️ |
| jsonschema/src/keywords/legacy/type_draft_4.rs | 68.47% <0.00%> (-1.09%) ⬇️ |
| jsonschema/src/resolver.rs | 78.99% <0.00%> (-0.85%) ⬇️ |
| jsonschema/src/compilation/options.rs | 84.77% <0.00%> (-0.38%) ⬇️ |
| jsonschema/src/schemas.rs | 89.62% <0.00%> (-0.10%) ⬇️ |
| jsonschema/src/keywords/helpers.rs | 62.50% <0.00%> (+1.27%) ⬆️ |


@Stranger6667 Stranger6667 mentioned this pull request May 25, 2022
@Stranger6667 Stranger6667 added the "help wanted" label Jun 26, 2022
@jp707049

Hi @Stranger6667 ,

I'm new to Rust and came across this crate to validate some JSON files that I have for Kubernetes.

  • Will this PR also enable an external reference resolver? I have multiple JSON files that contain $ref pointers linking to other files.

@Stranger6667 (Owner) commented Nov 13, 2022

Hi @jp707049

Reference resolving is available outside of this PR; this PR only changes WHEN the resolving step happens. Here is the relevant README section.

@Stranger6667 Stranger6667 changed the title from "chore: Initial incomplete prototype" to "Rewrite" Nov 13, 2022
@jp707049

Hi @Stranger6667 ,
Thanks for the README link. However, in my case I don't use HTTP. Here is my use case:

deployment.json - This file contains a $ref that specifies the path to another JSON file in the same directory.

```json
"spec": {
  "$ref": "_definitions.json#/definitions/io.k8s.api.apps.v1.DeploymentSpec",
  "description": "Specification of the desired behavior of the Deployment."
}
```

_definitions.json

```json
{
  "definitions": {
    "io.k8s.api.apps.v1.DeploymentSpec": {
      "description": "DeploymentSpec is the specification of the desired behavior of the Deployment.",
      "properties": {
        "minReadySeconds": {
          "description": "Minimum number of seconds for which a newly created pod should be ready without any of its container crashing, for it to be considered available. Defaults to 0 (pod will be considered available as soon as it is ready)",
          "format": "int32",
          "type": "integer"
        },
        "paused": {
          "description": "Indicates that the deployment is paused.",
          "type": "boolean"
        },
        "progressDeadlineSeconds": {
          "description": "The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.",
          "format": "int32",
          "type": "integer"
        },
        "replicas": {
          "description": "Number of desired pods. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1.",
          "format": "int32",
          "type": "integer"
        },
        "revisionHistoryLimit": {
          "description": "The number of old ReplicaSets to retain to allow rollback. This is a pointer to distinguish between explicit zero and not specified. Defaults to 10.",
          "format": "int32",
          "type": "integer"
        },
        "selector": {
          "$ref": "#/definitions/io.k8s.apimachinery.pkg.apis.meta.v1.LabelSelector",
          "description": "Label selector for pods. Existing ReplicaSets whose pods are selected by this will be the ones affected by this deployment. It must match the pod template's labels."
        },
        "strategy": {
          "$ref": "#/definitions/io.k8s.api.apps.v1.DeploymentStrategy",
          "description": "The deployment strategy to use to replace existing pods with new ones.",
          "x-kubernetes-patch-strategy": "retainKeys"
        },
        "template": {
          "$ref": "#/definitions/io.k8s.api.core.v1.PodTemplateSpec",
          "description": "Template describes the pods that will be created."
        }
      },
      "required": ["selector", "template"],
      "type": "object"
    }
  }
}
```

By default, the URL scheme is json-schema, and I get the following error message:

```
Validation error: failed to resolve json-schema:///_definitions.json: cannot resolve relative external schema without root schema ID
```

Note:

  • Both files are in the schemas/v1.22.0/*.json folder.
  • There is no $id in the JSON files. What should I put as $id if it's required?
  • I have not set any features related to the latest drafts for this crate.


```rust
/// Unique identifier of a node in a graph.
#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)]
pub(crate) struct NodeId(usize);
```
@Stranger6667 (Owner):
Could be NonZeroUsize - the 0th node should not be addressable
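A sketch of what the suggested change could look like; with NonZeroUsize, Option<NodeId> occupies the same space as NodeId itself (niche optimization), and id 0 is rejected at construction time:

```rust
use std::num::NonZeroUsize;

// Sketch only: mirrors the derives of the original NodeId, but wraps
// NonZeroUsize so the 0th node is unrepresentable.
#[derive(Debug, Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Hash)]
struct NodeId(NonZeroUsize);

impl NodeId {
    fn new(id: usize) -> Option<NodeId> {
        // Returns None for 0, enforcing the "not addressable" invariant.
        NonZeroUsize::new(id).map(NodeId)
    }
}
```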

```rust
#[derive(Debug, Copy, Clone)]
pub(crate) struct Node<'s> {
    pub(crate) value: &'s Value,
    kind: JsonSchemaNodeKind,
```
@Stranger6667 (Owner):
This could be named like TraversalScope

```rust
    schema: &'s Value,
    root: &'s Resolver,
    resolvers: &'s HashMap<&str, Resolver>,
) -> Result<CompressedRangeGraph> {
```
@Stranger6667 (Owner):
Maybe store node ids of parent / children / siblings right here? Then we could avoid having a separate vec for edges, and there would be no need to maintain a stack during validation (each node knows where to go after validation is done at its level):

```rust
struct Node {
    parent: Option<NonZeroUsize>,
    children: Option<Range<NonZeroUsize>>,
    data: Validator,
}
```
