schema.org defaults/coercions #3201
Replies: 9 comments 3 replies
-
I think what you are dealing with are people who don't want to learn the structure part of structured data; they only want to provide the data. Since you are only interested in JSON-LD, I think it would be best to publish a set of script builders that make it easier for the general public to create the minimal valid scripting for the various types. The script builders could live on the Rich Snippets guidance pages. There are 32 types on the Rich Snippets guidance pages.
-
Thanks for the reply!
The problem is:
-
Most SEO "news" sites deal primarily with JSON-LD examples. I'd guess that there are a lot of TL;DR types out there who don't understand that some data needs additional declared "Types".
I have not used the markup helper for years, but I just looked again, and the markup helper will supply incomplete scripts.
Yes, that will happen. Fortunately, the internet is not forever, contrary to popular belief.

There is an interesting discussion here: Yoast is developing a more object-oriented approach to JSON-LD. Basically, they are using the @id on declarations (Organization, Place, etc.) and then that script can be used as a portion of a larger script (Event) by calling it with the @id or with https://schema.org/isPartOf. An overview here... For example, the logo in the scripting is declared as an ImageObject and also an image.

I'm not sure if the above example will work across domains, but what if the authoritative website could provide always-up-to-date JSON-LD snippets (Place) so that other data providers could re-use them? Say we have a musical artist who wants to provide data for her (in-person) event. In the instance below, the "location" data could be provided by the venue, the "offer" information provided by the ticketing agency, and the organizer / performer data provided from another page on the artist's website.
If the venue or the ticketing agency has to make changes (pricing / sold out), these would (also) be updated within the artist's JSON-LD on her website. Again, I'm not sure if this would work across domains, but instead of everybody having to retype data, we should have a way to get it from the authoritative website and let them keep it up to date and correct.
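To sketch what that reuse could look like (the names and @id URLs below are hypothetical, invented for illustration), the artist's Event could reference entities published and maintained elsewhere via @id node references:

```json
{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Spring Tour Kickoff",
  "startDate": "2023-05-01T20:00",
  "location": { "@id": "https://example-venue.com/#place" },
  "offers": { "@id": "https://example-tickets.com/spring-kickoff/#offer" },
  "performer": { "@id": "https://artist-example.com/about/#artist" },
  "organizer": { "@id": "https://artist-example.com/about/#artist" }
}
```

Each @id would resolve to a full node (Place, Offer, MusicGroup) kept current by its authoritative publisher, so the artist's page never has to restate pricing or venue details.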
-
That is sort of an orthogonal conversation. Remote id resolution is a really interesting idea, and I think in theory it's really cool. I'm not positive it's practical (because it sort of assumes somewhat static content and very rational actors in general), but I think we should start a new discussion if we want to go further. In a lot of the cases I'm talking about, we might not even have an entity as the target at all to go resolve externally.

Internally, in my discussions with Dan, I'm mostly focusing this issue on the problem with the ambiguity of the primitive value (a predicate that leads to an object OR a string/url). If I have an entity, especially if it's typed, there is much less confusion. Even Thing->name sort of implies a lot of semantics: that there is an entity that has some descriptive string. But if I have just a Text at the end of a predicate, I could have anything there. I could even have a completely separate encoding of the structure ("I have a hex-encoded image there").

In multi-typed ranges, there isn't even a discussion of what the string should look like. So it's the ultimate example of both providers and consumers not being able to deal with it. Even a note in the documentation that said, "For Text strings, we expect the name of the Brand to be encoded," or something, would be better than where schema.org is. Text strings that could be anything are essentially unparseable by any real standard or agreement. If we could solve/elaborate on the semantics of that, it would go a long way towards my problem of how we interpret these things without assuming something schema.org doesn't explicitly say.
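For instance (a hypothetical illustration, not taken from any real site), a consumer receiving the first shape below has no agreed-upon semantics for the bare string, while the second makes the intent explicit:

```json
[
  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Widget A",
    "brand": "Acme"
  },
  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Widget B",
    "brand": { "@type": "Brand", "name": "Acme" }
  }
]
```

In the first case, "Acme" could be a brand name, an identifier, a URL, or anything else; in the second, the typed object carries the semantics with it.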
-
What is the "rich snippets testing tool"? We killed that thing a long time ago. We're several incarnations past that :) But I'm not even talking about "improperly typed" data necessarily. I mean: how does someone interpret the semantics of a primitive string on a lot of these edges? http://schema.org/location can be a Place, a PostalAddress, or a raw string. What is the raw string? It's even more unusable than a Thing with a name. At least that carries some restriction that it's a single entity with a descriptive name.
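Concretely, all three of these shapes are valid per the declared range of location (a hypothetical example; the venue names are invented), yet the first gives a consumer nothing to anchor on:

```json
[
  {
    "@type": "Event",
    "name": "Show 1",
    "location": "Shoreline Amphitheatre"
  },
  {
    "@type": "Event",
    "name": "Show 2",
    "location": {
      "@type": "PostalAddress",
      "addressLocality": "Mountain View",
      "addressRegion": "CA"
    }
  },
  {
    "@type": "Event",
    "name": "Show 3",
    "location": {
      "@type": "Place",
      "name": "Shoreline Amphitheatre",
      "address": {
        "@type": "PostalAddress",
        "addressLocality": "Mountain View",
        "addressRegion": "CA"
      }
    }
  }
]
```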
-
I guess it is called the Rich Results Test now. Looking at http://schema.org/location, it is pretty bizarre that a naked string is acceptable there. Maybe for things / locations that are really broad, like "Pacific Ocean", but even so, place.name would be a more informative way to go about that. If you want to check the field as a string, you might see if you get a hit as a match for a Wikidata item; conversely, if you think it is a one-line address, you might try querying Google Places to see if you get a match. But then what do you do? Store the data (as is) from the field, or store the data from what you think might be the authoritative database -- guessing the original provider's intent.
-
I am continuing to talk with Dan about ways to add semantics to the raw primitive values so they can be used more effectively. If anyone has interest in making these more specified (either by something soft like docs, or something more strict like Text => SomeType -> name), please feel free to chime in. Otherwise, I will try to find some way to document what Google does during ingestion without making it look like a recommendation (since we usually would prefer a more semantic object notation).
-
Giving this a bump. At this point, it seems that creating a new Type of "undefined" might be useful for this. Not only for ingesting the data, but also for consuming it.
-
I don't like the bare string component. I prefer an object, but a general undefined object doesn't seem to make sense. Is there an object for a place that is a planet? Something might be on Mars or the moon. I'm a fan of adding more descriptive comments when a property is accepted, and requiring at least 2 examples. This sort of policy would really help new adopters and beef up the current documentation.
On Thu, Nov 17, 2022 at 10:32 PM, WeaverStever wrote:
> Giving this a bump, at this point, it seems that creating a new Type of "undefined" might be useful for this.
--
All the best,
-Hugh
-
At Google, we do a lot to "fix" markup we find on the web so it complies with schema.org more fully and is easier to specify. This is evident in the schema.org validator tool that we currently host (http://validator.schema.org). For a simple example:
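A representative snippet of the kind being described (hypothetical; the headline and author are invented, with author given as bare Text, which the range of author does not allow):

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article",
  "author": "Jane Doe"
}
```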
Go try that at validator.schema.org and you'll see what I mean. http://schema.org/author cannot be typed to Text currently, so we try to turn it into a typed object. Which type is ambiguous, so we end up creating a http://schema.org/Thing object. Historically, the reason is that schema.org tried not to enforce its typing system too strongly, and so strings were always implicitly allowed. Additionally, annotating complex structures in Microdata and RDFa is error-prone. But this in turn means that all the consuming code, if it wants to be most permissive, needs to handle all variations of typed inputs.
At Google, we either don't truly fix this, like with http://schema.org/author currently (since we're converting it to an invalid type), and then complain on a per-feature basis in our Google-specific Rich Results Test; OR we decide that it's pretty unambiguous which typed object is being suggested and internally force it. You can see this if you try this example:
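A representative snippet of the kind being described (hypothetical; event name and venue are invented, with location given as a bare string, which its declared range does allow):

```json
{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Example event",
  "startDate": "2023-05-01",
  "location": "Shoreline Amphitheatre, Mountain View, CA"
}
```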
What happened there? Now we unambiguously decided that the location was a Place, even though there are a number of types it could be. Note that in this case, the location here is actually valid markup, because the range of location includes Text. But our code is built to handle objects there, so we need to force it to an object for downstream systems to interpret it.
This is annoying. I don't like doing this voodoo coercion magic, but it's the only way we can give a consistent experience to internal teams and in particular handle range expansions in schema.org. And a lot of the time, allowing primitive types on predicates in schema.org is a very useful convenience.
The crux of the issue is that there is no good way for us to specify these "default" behaviors in a standardized way. We could probably push on the JSON-LD standard to add some @context magic that would make it work, but then we have to leave Microdata and RDFa out of that and there's still very valid reasons to support those standards on the web. And they are the standards that benefit the most from primitive shorthands.
What I want is something that says http://schema.org/target -> http://schema.org/URL => http://schema.org/target -> http://schema.org/EntryPoint -> http://schema.org/urlTemplate, or something to that effect. That's in essence what we do internally ourselves. (And this recent range expansion in schema.org 15 is what prompted this post.)
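A minimal sketch of what such a per-property coercion table could look like (the property names come from schema.org, but the table contents, function, and default target types are assumptions for illustration, not Google's actual code):

```python
# Sketch: wrap primitive values into the typed object that downstream
# code expects, keyed by the property they appear on. Illustrative only.

COERCIONS = {
    # property -> (wrapper @type, property on the wrapper holding the primitive)
    "author":   ("Thing", "name"),          # ambiguous type, so fall back to Thing
    "location": ("Place", "name"),          # assume a Place with a descriptive name
    "target":   ("EntryPoint", "urlTemplate"),  # URL => EntryPoint.urlTemplate
}

def coerce(prop, value):
    """Wrap a primitive value in the default typed object for `prop`.

    Objects (dicts) pass through unchanged; strings are wrapped per the
    COERCIONS table; properties with no registered default stay as-is.
    """
    if isinstance(value, dict) or prop not in COERCIONS:
        return value
    wrapper_type, inner_prop = COERCIONS[prop]
    return {"@type": wrapper_type, inner_prop: value}
```

A standardized version of this table, published alongside the vocabulary itself, is essentially what the open question below is asking for.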
I guess my open question is whether there is value in encoding this into schema.org itself, so there is a common set of inference/defaults for interpreting primitive values in cases where the target of a predicate can also be a more expressive object. This type of thing would make it much easier to support simpler markup, which is one of the main goals of schema.org, while still allowing ranges to expand to more expressive objects. Our fallback option is just to publicly document these coercions we do, but that does not feel like it solves the problem long term. Or, I guess, another option that I don't like would be to drop all these coercions/defaults completely and just not consume things that are under-specified, for the sake of correctness.