
Combining with JSON Schema specification for validation? #1052

Open
devinbost opened this issue Aug 18, 2022 · 30 comments

Comments

@devinbost

devinbost commented Aug 18, 2022

EDIT: For anyone first reading this, please see this comment that introduces the business case: #1052 (comment)

I don't know if this is the right place to start this discussion, but I've noticed that there is some overlap between the JSON side of CloudEvents and the features of json-schema, specifically around validation. In large part, they seem to have been developed to solve different problems, but I'm wondering if I could get folks from both sides to start a discussion on the feasibility of allowing json-schema to be used within events that conform to the CloudEvents specification. If we could write bindings for CloudEvents and then use json-schema to perform validation, that would open doors for us. But maybe there are other solutions in the industry that would solve this problem.
I'd like to hear some thoughts on this.

@devinbost devinbost changed the title Combining with json-schema? Combining with json-schema for validation? Aug 18, 2022
@handrews

I see that the JSON Schema being used to validate the JSON format is a draft-07 schema. Just in case folks aren't aware, the more recent draft 2020-12 (core, validation), which has also been adopted by OpenAPI 3.1, has the concept of extension vocabularies.

If JSON Schema covers some but not all of your needs, extension vocabularies could be a way to bridge the gap. OpenAPI 3.1 has an extension vocabulary for this purpose. I mention OpenAPI both to bring up their use of an extension vocabulary, and to note that their adoption means that there will be a larger demand driving tooling support for 2020-12 than for any other JSON Schema draft since draft-04.
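For illustration, here is a minimal sketch of how a 2020-12 meta-schema declares an extension vocabulary (the example.com vocabulary URI is a placeholder, not a real vocabulary):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/meta/cloudevents",
  "$vocabulary": {
    "https://json-schema.org/draft/2020-12/vocab/core": true,
    "https://json-schema.org/draft/2020-12/vocab/applicator": true,
    "https://json-schema.org/draft/2020-12/vocab/validation": true,
    "https://example.com/vocab/cloudevents": false
  },
  "$dynamicAnchor": "meta",
  "allOf": [{ "$ref": "https://json-schema.org/draft/2020-12/schema" }]
}

Schemas that declare "$schema": "https://example.com/meta/cloudevents" can then use the extension keywords; the false value marks the vocabulary as optional for implementations that don't understand it.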

@devinbost devinbost changed the title Combining with json-schema for validation? Combining with JSON Schema specification for validation? Aug 19, 2022
@devinbost
Author

I found some change logs between draft-07 and 2020-12 that make it easier to see what has changed:

Draft-07 to Draft 2019-09
Draft 2019-09 to Draft 2020-12

@gregsdennis

There's also a tool that's being developed to perform this transition.

https://github.com/sourcemeta/alterschema

@duglin
Collaborator

duglin commented Sep 1, 2022

ping @n3wscott @jskeet @clemensv for thoughts

@jskeet
Contributor

jskeet commented Sep 1, 2022

I'd like to know more about what's being proposed. I would expect the dataschema attribute to at least potentially refer to a JSON schema for the payload. But is this issue more about a JSON schema for the CloudEvent data surrounding the payload?

As an aside, I would prefer not to end up with anything in the 1.0 spec which is still draft - is there any notion of JSON schema ever actually getting to "1.0"?

@gregsdennis

In regard to a schema for cloud events, the $dynamic* keywords of draft 2020-12 provide a good pattern. I've recently written about it here.

is there any notion of JSON schema ever actually getting to "1.0"?

We (JSON Schema) have decided to move away from publishing through the IETF and are pursuing other publication methods. As such, we're exploring what that looks like.

That said, draft 7 is widely used in production environments, and implementations are steadily adding support for draft 2020-12 (the latest).

We're getting close to a "1.0," but we're not quite there yet. Even so, this shouldn't hinder adoption.

@handrews

handrews commented Sep 1, 2022

@jskeet You can read the discussion of our standards approach, and do note that we are still working with the IETF HTTPAPI working group to register the JSON Schema media types. It is the rest of the specification for which we are looking at different publication approaches.

@devinbost
Author

What is the intent behind the dataschema field of CloudEvents? Is it just to validate the payload?

I'm interested in a mechanism to distinguish between (1) validation of the payload, and (2) validation of the envelope.
For example, if a producing application is expected to provide a URN with a strict naming convention in the source field, we'd like a standard way to validate the content before an invalid URN blows up a downstream function after the handoff. JSON Schema 2020-12 provides a nice way to do this, at least when the entire event is JSON.
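As a minimal sketch of such an envelope check (the $id and URN pattern here are invented for illustration):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schema/envelope-constraints",
  "type": "object",
  "required": ["specversion", "id", "source", "type"],
  "properties": {
    "specversion": { "const": "1.0" },
    "source": {
      "type": "string",
      "pattern": "^urn:example:[a-z0-9.-]+:[a-z0-9.-]+$"
    }
  }
}

A producer's event could then be rejected at the handoff if its source doesn't match the agreed URN convention, instead of failing somewhere downstream.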

Another area where I can see this functionality being helpful is when certain CloudEvents extensions are required for a particular implementation.

If this envelope validation needs to be implemented with an extension of CloudEvents, we lose a bit of interoperability.

@devinbost
Author

devinbost commented Sep 8, 2022

Also, @jskeet I saw some comments that versioning was out of scope of CloudEvents, so please let me know if this falls into that category, but I'd like a way to distinguish between version of the envelope (which is currently done nicely by specversion) and version of the payload (which is currently missing). For example, if a producer wants to evolve a contract and starts emitting both the prior version (for backward compatibility) and the new version of an event, I want consumers to be able to filter to only the version they're interested in without needing to deserialize the entire body/payload and without needing to cut a new path/route for every new event version.
I saw a note that the version MAY be included in the type, but if we don't have a way to validate the content of the envelope (which is why I think this is relevant to this thread), then it's up to the implementation to define how (or whether) that version exists in type, which leaves consumers without guarantees.

I could potentially put this payload version into some kind of metadata context object that would exist in every schema, but then we need to enforce that it exists in every payload across every protocol, which gets complicated. So, it seems like that field would fit better in the envelope.
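As a rough sketch of what I mean (the dataversion attribute name is hypothetical, not something in the spec today):

// event carrying both the envelope version and a payload version
{
  "specversion": "1.0",
  "id": "1234",
  "source": "urn:example:orders",
  "type": "com.example.order.created",
  "dataversion": "2.1",
  "data": { ... }
}

// envelope schema fragment pinning the attribute down
{
  "required": ["dataversion"],
  "properties": {
    "dataversion": { "type": "string", "pattern": "^[0-9]+\\.[0-9]+$" }
  }
}

Consumers could then filter on dataversion without deserializing data at all.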

@jskeet
Contributor

jskeet commented Sep 8, 2022

@devinbost: We have some guidance on that in https://github.com/cloudevents/spec/blob/main/cloudevents/primer.md#versioning-of-cloudevents

EDIT: Whoops - just seen that's the note you referred to.

@devinbost
Author

devinbost commented Sep 8, 2022

@jskeet I went back through that doc more carefully and noticed some commentary about how the version could be included in the URI of the dataschema. That could work, but it again brings up the challenge with not being able to validate the content in the envelope. For example, we'd want to be able to ensure that the version identifier matches a regular expression that provides a minor and major version number to distinguish backward-compatible from backward-incompatible changes. (e.g. "v1" would be invalid but "1.2" would be valid.) Also, at the very least, we'd need to ensure that producers are providing it. (A producer's messages should be rejected if they're not providing the version in the schema URI.)
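As a sketch, an envelope schema could enforce that convention directly (the URI layout here is invented):

{
  "required": ["dataschema"],
  "properties": {
    "dataschema": {
      "type": "string",
      "format": "uri",
      "pattern": "/[0-9]+\\.[0-9]+$"
    }
  }
}

Under this sketch, "https://example.com/schemas/order/1.2" would pass, while a URI ending in "/v1" would be rejected.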

@jskeet
Contributor

jskeet commented Sep 8, 2022

@devinbost: I think it's really up to providers, to be honest. I'm not entirely sure what the request/proposal is here - and I may well not be the best person to comment on that request/proposal anyway. (It's also unclear whether the "we" in your messages is a general internet "we" or a specific organization with specific needs - if you could clarify that, it would be useful.)

@sasha-tkachev
Contributor

Maybe we can create an extension attribute called ceschema: a URI that describes the schema of all the attributes of an event, given that they are serialized into a JSON dict.
In combination with dataschema, this MAY solve the issue at hand.

@sasha-tkachev
Contributor

We talked on the call about the idea of creating such a ceschema attribute, and it seems it does not actually solve a particular problem, because a consumer that receives an event should already know how to process certain attributes.
Moreover, it MAY create a lot of issues - for example, the case where an event is propagated across the system and new extensions are added to the existing ones.

My personal opinion: this idea is very complex and not worth it.

@gregsdennis

an event is propagated across the system and new extensions are added to the existing ones.

Can you elaborate on this? I'm not familiar with extensions as they relate to CE.

@sasha-tkachev
Contributor

sasha-tkachev commented Sep 15, 2022

@gregsdennis each intermediary MAY add or remove optional attributes because nothing prohibits it from doing so.
For example, I as an intermediary might decide that each event passing through me gets a new attribute named myattr with the value "hello world".
If ceschema existed, I would need to edit it as well, which is very complicated.

@gregsdennis

gregsdennis commented Sep 16, 2022

I think it's actually pretty simple to achieve.

So let's say the schema uses the $dynamic* approach I mentioned:

// generic cloud event (unknown content)
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://cloudevents.io/schema/envelope",
  "$defs": {
    "data": {
      "$dynamicAnchor": "data",
      "not": true
    }
  },
  "type": "object",
  "properties": {
    "data": { "$dynamicRef": "#data" },
    ...
  }
}

// cloud event for "person updated"
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://myserver.com/schema/person",
  "$defs": {
    "person": {
      "$dynamicAnchor": "data",
      "properties": {
        "firstName": { "type": "string" },
        ...
      }
    }
  },
  "allOf": [
    { "$ref": "https://cloudevents.io/schema/envelope" }
  ]
}

NOTE I changed the above from the blog post to put the $ref in an allOf. The reason will become clear later.

Someone receiving a "person updated" event could validate the entire event (envelope and payload) using https://myserver.com/schema/person. The particular point here is that there is no additionalProperties restricting extensions.

So let's say you're an intermediary, and you want to add "myAttr": "hello world" to the payload. Do it. It'll still pass validation (assuming person doesn't already define myAttr in a contradictory way). The schema doesn't need to change at all.
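To make that concrete, here's a sketch of a "person updated" event (field values invented) that would validate against https://myserver.com/schema/person, intermediary addition included:

{
  "specversion": "1.0",
  "id": "abc-123",
  "source": "https://myserver.com/people",
  "type": "com.myserver.person.updated",
  "data": {
    "firstName": "Ada",
    "myAttr": "hello world"
  }
}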

Even if you wanted to add myAttr to the schema, all you'd need to do is add an entry in the allOf that validates the payload contains a myAttr property that's a string. So the intermediary would update the person schema to:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://myserver.com/schema/person",
  "$defs": {
    "person": {
      "$dynamicAnchor": "data",
      "properties": {
        "firstName": { "type": "string" },
        ...
      }
    }
  },
  "allOf": [
    { "$ref": "https://cloudevents.io/schema/envelope" },
    {
      "properties": {
        "data": {
          "properties": {
            "myAttr": { "type": "string" }
          }
        }
      }
    }
  ]
}

Each intermediary can add to the allOf to facilitate its own requirements.

NOTE You should be aware that changing the schema in-flight should probably require a new $id since that URI is associated with the base event, not the updated one.

@devinbost
Author

devinbost commented Sep 16, 2022

@sasha-tkachev I apologize for missing the call. I'll be there at the next one.

if a consumer receives an event it should already know how to process certain attributes.

This is exactly the scenario I'm concerned about. What we were observing is actually the opposite. Let me explain with the following scenarios.

SCENARIO 1: An event contract is changed by a producer upstream. Since there's no way to enforce validation on the producers, it triggers an entire cascade of breaks across multiple consumer flows. Consider this flow:
P1 -----> A1 ------> A2 -------> A3
  \-----> B1 ------> B2 -------> B3
  \-----> C1 ------> C2
  \-----> D1
The team that owns P1 also owns A1-A3. It makes a breaking change after updating the code in A1-A3 to support the change.
However, it is unaware of consumers B, C, and D in different departments. The breaking change to P1 breaks the contracts for B1, C1, and D1.
As soon as those are fixed, B2 and C2 start blowing up and must be fixed.
This breaks into two sub-scenarios:

SCENARIO 1.1: Producer P1 made a change in unversioned code or forgot to increment the version, so there was no way for consumers to know some events had a different contract. This is more of an issue on the producer side, so I won't say much here other than that an additional validation layer could have caught the mistake.

SCENARIO 1.2: Consumers didn't know that they needed to validate messages against the version in the URI or didn't know how to parse the URI to extract the version to make a check. Only when their code starts throwing exceptions do they inspect incoming messages and notice there has been a version change. Not all companies have analytics that allow them to observe which downstream teams are consuming which versions of which upstream events, so it can be hard (or even impossible) for producers to know which teams they need to communicate with to ensure downstream teams can handle breaking changes. Someone could say that implementations should be obvious, but that's not always true.

SCENARIO 2: We also saw cases where a break occurred due to an intermediate dependency. For example, a change in P1 results in a change to the behavior of B1, but the exception is thrown farther down the flow in B3 in code owned by a different team. Now there's a communication problem as the owners of B3 struggle to find the root cause since -- as far as they're concerned -- the messages' contracts shouldn't have changed upstream, right? (These are the cases that have caused some of the most severe production outages since it took significant time for teams to track down the cause.)

SCENARIO 3: P1 cuts a new minor version. Teams owning apps B, C, and D knew about a coming change but incorrectly didn't think they would be impacted. Whether or not they would be impacted is a concern that should be addressed through validation, like a JSON Schema, tied to the version. If a message with a new version validates successfully against the schema for the prior version, then consumers can trust that their implementations will still successfully process the message. So, we have a mechanism to check against backward compatibility if we use JSON Schema to validate the envelope. The schema in this case can inform consumers if a message is still valid or not, based on what their expectations are. (Keep in mind that consumers can have stronger validation via JSON Schema if needed based on features they're implementing, and standardizing on JSON Schema makes that validation easier to maintain in general.)
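To sketch the compatibility check from SCENARIO 3 (schemas invented): if 1.1 only adds an optional field relative to 1.0, then 1.1 events still pass the 1.0 schema, and a 1.0 consumer can keep processing them.

// v1.0 payload schema
{
  "type": "object",
  "required": ["orderId"],
  "properties": {
    "orderId": { "type": "string" }
  }
}

// v1.1 payload schema - adds an optional field, so 1.1 events still validate against v1.0
{
  "type": "object",
  "required": ["orderId"],
  "properties": {
    "orderId": { "type": "string" },
    "couponCode": { "type": "string" }
  }
}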

CONCRETE EXAMPLE 1:
P1 wants to change the path of a URN that's used by analytics downstream when analyzing site traffic. A new version of the schema can inform consumers exactly how they should handle the incoming envelope from the new version and whether the change is backward-compatible or not. This can also happen with tracking IDs and other similar values.
CONCRETE EXAMPLE 2:
A header used for routing receives a new value. However, some consumers aren't aware of the new value and send the message instead to a default path, which breaks tracking logic, or worse, causes inaccuracies in production reports.
CONCRETE EXAMPLE 3:
A change is made to a tracking ID. We saw a case where this data invalidated a report used by execs when reporting to the board of directors, so they needed to scramble at the last moment to try to reconstruct the data before it could be released... It turned out that the data had been wrong for an entire year. I won't get into that scenario further or say which company it involved, but this kind of thing is more common than you'd like to believe when events are processed by analytical flows. I saw a similar case at a different company, but in that case, they didn't catch the issue for several years, and it resulted in very difficult communication with customers since people's lives were impacted by the issue...

ADDITIONAL BUSINESS CASES:

  1. Aside from preventing production incidents, we were seeing chains of 7-9 apps that cut across multiple teams' and departments' boundaries. This makes change management hard in general, so having a way to validate the entire message is very important in these situations to ensure everyone is on the same page.
  2. When there's no way to hold producers accountable to a contract, they can make changes and not even care if some downstream consumers are impacted, since at that point they have different priorities (different departments). Though that's more of a business-culture problem, it's difficult for consumers to handle this kind of upstream attitude when they don't have a clear way of validating the entirety of the messages they receive.
  3. Producers don't always know what kinds of changes will break consumers. Validation with JSON Schema can protect consumers by enforcing message processing semantics on a clear data contract. For example, if the producer wants to cut a new version, such as a change to the path of a URN that's used by analytics downstream when analyzing site traffic, a new version of the schema can inform consumers exactly how they should handle the incoming envelope from the new version and allow them to easily check if their existing code is compatible.

I'm sure I can think of more cases if I search my memory, but hopefully, this is a good start.

  • If it's confusing when I say "we," just interpret "we" to mean "me and people I work with," which may sometimes be in a consulting capacity (not necessarily always my day job).

@devinbost
Author

devinbost commented Sep 16, 2022

One other important case I forgot to mention:

SCENARIO 4: P1 needs to cut a new version but wants to remain backward-compatible, so it starts emitting both new and old versions of events. Consumers need a way to filter (by version) to only the messages of interest and perform different validation depending on the version they're interested in, but let's assume they can do this by version information in the URI or type. One advantage here for supporting JSON Schema validation of the envelope is support for automatic upgrades; if a new version passes existing validation for the consumer in question, then the consumer can automatically switch to consuming the new event version.
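As a sketch (type and dataschema values invented), the producer would emit both versions side by side:

// prior version, kept for existing consumers
{ "specversion": "1.0", "type": "com.example.order.created.v1", "dataschema": "https://example.com/schemas/order/1.0", ... }

// new version, emitted in parallel
{ "specversion": "1.0", "type": "com.example.order.created.v2", "dataschema": "https://example.com/schemas/order/2.0", ... }

A consumer filters on type (or on the version embedded in dataschema) and, for the auto-upgrade case, tries validating the v2 event against the v1 schema it already trusts.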

@sasha-tkachev
Contributor

sasha-tkachev commented Sep 16, 2022

@gregsdennis So you are saying that the schema MAY validate only some of the attributes, correct?
And I as an intermediary MAY add new attributes to the event as long as the schema stays valid.
This seems logical.
I also thought about removing attributes, but as long as the schema stays valid I think this SHOULD be OK.
And in the worst case you MAY change the value of the attrschema attribute to point to a new schema.
It is still pretty complex, though, and I think that creating a single schema for both the attributes and the data is still not necessary.

However, the explanation @devinbost gave is very detailed, and I have changed my opinion on the usability of such an extension.

In addition to the definition of attrschema, we SHOULD define a canonical attribute representation format.
The reason is that we need a concrete format to run the schema validation against.

@devinbost
Author

devinbost commented Sep 21, 2022

Hi @sasha-tkachev , regarding your comment:

In addition to the definition of attrschema, we SHOULD define a canonical attribute representation format.
The reason is that we need a concrete format to run the schema validation against.

what do you mean by "canonical attributes representation format"?

I suppose there's an open question of how an implementation should interpret the URI provided in attrschema. JSON Schema seems the obvious choice for JSON messages, but due to a lack of standards for schema validation of non-JSON types, consumers may need more information to know how to interpret the attrschema URI if the event is non-JSON. Is that what your comment is about?

Also, I reviewed the recording from the last working group meeting. I can clarify some points raised in that meeting.

@jskeet I apologize if I was confusing/unclear at the start of this thread! Do my examples above (and in the comment below that one) help you understand my intent for this? I hope they make the intent clearer.

@JemDay
There may indeed be technologies that wish to conform to the spec without using the CloudEvents SDK, but I don't think the intent of this attribute would be primarily for internal use of events. I think my examples above make that clearer.

@duglin
Without having a standards-based way of validating envelopes, implementations actually do hard-code validation logic since they lack any kind of common "language" for validation. A lack of support for standards-based validation in the spec also complicates reusability since consumers tend to bake their own validation approaches. So, enabling standards-based validation of the entire event provides a mechanism to stop that kind of anti-pattern.

@clemensv , regarding your concern that subsequent events may need to add attributes: if I'm understanding the question correctly, whatever app is responsible for adding those attributes should update the schema to ensure those attributes are supported, or at least ensure that the event it passes downstream validates against the attrschema it provides (if it provides one), just as it would be responsible for ensuring the dataschema is valid for any data it produces (if a dataschema is provided). I do think it should be optional, but if I were a consumer, I'd want it to be available for any event I needed to consume. There could be a question of whether it adds value within flows owned by a single person, but I think this kind of validation is very useful between domains or when there's a handoff, especially between teams or technologies. What do you think? Are there cases or consequences I'm not considering? It sounds like you and @sasha-tkachev were thinking of a specific scenario regarding downstream apps/functions adding attributes, so I want to make sure I'm considering those use cases.

@deissnerk
Contributor

Thanks @devinbost for the detailed explanation of the use cases. I have the impression that this is not so much about a schema for a specific event type, though. To me, the examples with URN formats or IDs look more like constraints defined as an additional contract or convention within or between organizations.

If we introduced a more general concept around this idea of constraints (others talk about contracts, conventions or characteristics), there could be pointers (URIs) to JSON Schema for those who prefer this. For others, a constraint could just point to a GitHub page or a test description that explains what additional constraints are in place. Something like "our events always have trace IDs", or "subject for us is always a URN following format xyz". (See the sketch below.)

If you add this as an extension attribute to each event, there is the challenge that collections are not supported in CloudEvents attributes. I could also imagine having this kind of information only in the catalog/discovery service.
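For the JSON Schema flavor of such a constraint, a sketch (the traceid attribute name and URN prefix are illustrative only):

{
  "required": ["subject", "traceid"],
  "properties": {
    "subject": { "type": "string", "pattern": "^urn:xyz:" },
    "traceid": { "type": "string", "minLength": 1 }
  }
}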

@sasha-tkachev
Contributor

@deissnerk We talked about the implementation in the call two weeks ago.
The way we are headed right now is defining an attrschema URI attribute which points to a schema in a format defined in the discovery spec.

The constraint idea is interesting, but I think a simple schema for each attribute is enough.

@deissnerk
Contributor

@sasha-tkachev I think we have to discuss a bit more in the next call. Relying on the schema format from the discovery spec sounds good to me. If we specify the attribute in a way that it can also point to an actual event definition in the discovery service, we might both get what we want. Perhaps the notion of constraints/contracts/characteristics is something we can pick up there. An intermediary can then even enrich the discovery information if needed.

Sorry for being a bit late to the discussion. I had to leave the call two weeks ago very early because of an unplanned, urgent matter.

@sasha-tkachev
Contributor

@deissnerk @devinbost
Here is my current proposal for the attrschema extension
#1106

@duglin
Collaborator

duglin commented Feb 16, 2023

@sasha-tkachev (or anyone else) what's the status of this issue?

@sasha-tkachev
Contributor

My proposal was rejected

@github-actions

This issue is stale because it has been open for 30 days with no activity. Mark it as fresh by updating it, e.g., by adding the comment /remove-lifecycle stale.

@duglin
Collaborator

duglin commented May 5, 2023

is this one still under discussion or should we close it?

@github-actions

github-actions bot commented Jun 6, 2023

This issue is stale because it has been open for 30 days with no activity. Mark it as fresh by updating it, e.g., by adding the comment /remove-lifecycle stale.
