Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON serialisation format is woefully underspecified #321

Open
gouttegd opened this issue Sep 19, 2023 · 17 comments · Fixed by mapping-commons/sssom-py#480
Open

JSON serialisation format is woefully underspecified #321

gouttegd opened this issue Sep 19, 2023 · 17 comments · Fixed by mapping-commons/sssom-py#480

Comments

@gouttegd
Copy link
Contributor

The spec basically says, “the JSON serialisation format is whatever LinkML’s dumper classes produce”. That may be enough for people working in languages well supported by LinkML (so, Python and… what else?), but for the others, this is completely inadequate.

As a result, I’m trying to figure out what the JSON serialisation should look like and what sssom-py (as the reference implementation) expects, but it is not helped by the fact that even a JSON file produced by sssom convert (so, presumably, correct… right?) is not accepted as it is by sssom parse!

Consider the following minimal TSV file:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  ORCID: "https://orcid.org/"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#creator_id:
#  - "ORCID:0000-0002-6095-8718"
subject_id	subject_label	predicate_id	object_id	mapping_justification
FBbt:00000001	organism	semapv:crossSpeciesExactMatch	UBERON:0000468	semapv:ManualMappingCuration

It is accepted without any issue by sssom convert to produce a JSON file:

$ sssom convert -O json -o sample1.json sample1.sssom.tsv

And yet sssom parse -I json barfs when trying to read the resulting JSON:

$ sssom parse -C merged -I json -o sample2.sssom.tsv sample1.json
WARNING:root:ORCID is used in the SSSOM mapping set but it does not exist in the prefix map
Traceback (most recent call last):
  File "/Users/dpg44/Library/Python/3.10/bin/sssom", line 8, in <module>
    sys.exit(main())
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/sssom/cli.py", line 215, in parse
    parse_file(
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/sssom/io.py", line 112, in parse_file
    doc.clean_prefix_map(strict=strict_clean_prefixes)
  File "/Users/dpg44/Library/Python/3.10/lib/python/site-packages/sssom/util.py", line 185, in clean_prefix_map
    raise ValueError(
ValueError: ['ORCID'] are used in the SSSOM mapping set but it does not exist in the prefix map

The JSON file does contain a declaration for the JSON prefix in its JSON-LD context, though:

{
  "mapping_set_id": "https://w3id.org/sssom/mappings/afe3deeb-25c4-4e6e-a612-6db26fa2f692",
  "license": "https://w3id.org/sssom/license/unspecified",
  "mappings": [
    {
      "subject_id": "FBbt:00000001",
      "predicate_id": "semapv:crossSpeciesExactMatch",
      "object_id": "UBERON:0000468",
      "mapping_justification": "semapv:ManualMappingCuration",
      "subject_label": "organism"
    }
  ],
  "creator_id": [
    "ORCID:0000-0002-6095-8718"
  ],
  "@type": "MappingSet",
  "@context": {
    "dcterms": "http://purl.org/dc/terms/",
[… TRUNCATED FOR BREVITY …]
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "ORCID": "https://orcid.org/",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_"
  }
}

And yet those prefix declarations seem completely ignored. The only way I got sssom parse to accept that JSON file was by specifying an external metadata file containing the Curie map (option -m).

So, I have the following questions about the JSON format:

1) Are (or should) CURIEs be allowed in that format? Obviously sssom convert behaves as if they should (since it does not write identifiers in their expanded form). But one could argue that, given the inherent verbosity of the JSON output (compared to the TSV format), the rationale for using CURIEs in that format is tenuous.

The spec does hint at the fact that CURIEs may only be intended to be used in the TSV format, based on this bit (emphasis mine):

Two serialisations for SSSOM mappings are provided in this document, aimed at different communities: an RDF/OWL serialisation using IRIs that is aimed at the Knowledge Graph/Semantic Web community, and a TSV serialisation using CURIE syntax which is aimed at the wider bioinformatics community

and on the fact that the notion of Curie map (which is indissociable of the use of CURIEs) is only ever mentioned in the section about the TSV format.

So, maybe JSON output should only use expanded identifiers? In which case this is a problem of sssom convert not doing the expansion it should perform when writing JSON.

If CURIEs are intended to be used in JSON output, then:

2) How should the Curie map be stored when using JSON output? I thought it was in the JSON-LD context (that’s apparently what sssom convert thinks, too), but as illustrated above sssom parse does not expect it there (it completely ignores whatever is in that context).

Is sssom parse wrong to ignore the JSON-LD context, or is intended that the Curie map should always be provided in a separate metadata file? Either way, the spec should clearly state what is expected here.

Finally:

3) How important is the JSON format for SSSOM? I can’t help but feel that the JSON format is completely neglected by the spec, so I am seriously considering giving up on trying to support it in SSSOM-Java, unless there is a compelling reason to do so.

@matentzn
Copy link
Collaborator

Closely related: #102 #241

Thank you @gouttegd, very, very welcome nudge. We really need to take a closer look at the other serialisations. IMO RDF is in an even worse state right night, not that it makes this situation much better.

Parsing issue:

the fact the sssom-py does not parse the JSON again is IMO a sssom-py issue. The problem is that the context is somehow passed wrongly to the conversion function. @hrshdhgd can you create an isolated sssom-py issue that ensures that TSV -> JSON -> TSV always works, and add a few tests that involve custom prefixes?

  1. Are (or should) CURIEs be allowed in that format? Obviously sssom convert behaves as if they should (since it does not write identifiers in their expanded form). But one could argue that, given the inherent verbosity of the JSON output (compared to the TSV format), the rationale for using CURIEs in that format is tenuous.

I think CURIES should be at the very least allowed in that format. I am a bit on the fence of making them required though. If a JSON-LD parser can parse both a document with expanded IRIs and CURIEs into the same internal representation, we should probably permit both. The question is if JSON-LD is really the format most people would be looking for, rather then some more classic, compact "normal" JSON. In any case, what we call JSON right now is JSON-LD, and therefore should be treated as such - this wont change.

  1. How should the Curie map be stored when using JSON output? I thought it was in the JSON-LD context (that’s apparently what sssom convert thinks, too), but as illustrated above sssom parse does not expect it there (it completely ignores whatever is in that context).

I thought so too.. @hrshdhgd can you look into this?

  1. How important is the JSON format for SSSOM? I can’t help but feel that the JSON format is completely neglected by the spec, so I am seriously considering giving up on trying to support it in SSSOM-Java, unless there is a compelling reason to do so.

It has not been at the forefront of importance so far, but it is becoming more important now that we are writing our first APIs, cc @anitacaron @udp

@gouttegd
Copy link
Contributor Author

the fact the sssom-py does not parse the JSON again is IMO a sssom-py issue.

Sure, but is it an issue of sssom convert producing an invalid file or of sssom parse being unable to parse an otherwise valid file? The spec should provide the authoritative answer to that question (by saying what a valid file should look like), and the fact that it does not is (for me, as someone trying to implement the spec) the real issue here.

add a few tests that involve custom prefixes?

Emphasis on custom prefixes. The problem is made worse by the fact that sssom-py uses a built-in table of prefixes whether the user asks for it or not, and this can completely hide the issue as long as you are only using the prefixes that sssom-py expects.

For example, if you replace ORCID:0000-0002-6095-8718 by orcid:0000-0002-6095-8717 in the provided sample, everything works fine even if you don’t provide any Curie map at all (because orcid, UBERON, and FBbt are all known to sssom-py, whereas ORCID is not) and therefore you would never notice the problem.

@hrshdhgd
Copy link
Contributor

So in this specific case, how do we handle case-sensitiveness of prefixes?

  • Allow upper-case, lower-case and mixed-case prefixes? (make prefix determination case-insensitive)
  • Convert casing to the appropriate one and proceed?
  • Throw an error if prefix not recognized (as it is currently doing)?
    • Or just show a warning and adopt one of the two solutions asked above.

I understand standardizing prefixes is a huge ask (and an issue greater than us ) but what would be a practical solution?

@gouttegd
Copy link
Contributor Author

gouttegd commented Sep 25, 2023

I am not sure I understand your question. The problem here has nothing to do with case-sensitiveness.

In the provided example, the prefix is declared as ORCID and is used as ORCID. The problem is that the prefix declaration is ignored by sssom parse, regardless of its case.

But to answer your question, I think prefixes should be treated as case-sensitive and if there is a discrepancy between the Curie map and the actual Curies used in the mapping set (e.g., the Curie map contains a declaration for orcid but the Curies in the mapping set are of the form ORCID:...), that should be considered an error (use of an undeclared prefix). SSSOM tools should not, in my opinion, perform any transformation on the prefixes (such as case conversion) just to try to make sense of a file with a bogus Curie map.

@hrshdhgd
Copy link
Contributor

Thanks for clarifying Damien! So in that case, should ORCID: be flagged as an invalid prefix during sssom convert itself since the correct form (I'm guessing) is orcid?

@gouttegd
Copy link
Contributor Author

gouttegd commented Sep 25, 2023

orcid is not the “correct” form of the prefix, anymore than “ORCID”. There is no such thing as a “correct form“ for a prefix. A prefix is correct if it is declared and if is used in a way that matches its declaration – which is the case, again, in the provided example: I declare ORCID in the Curie map:

#curie_map:
#  ORCID: "https://orcid.org/"

and I use it as such in the rest of the metadata:

#creator_id:
#  - "ORCID:0000-0002-6095-8718"

This is 100% correct. And this works just as expected as long as long as I am only using the TSV format. But once the mapping set is converted into JSON (using sssom convert), sssom parse ignores the Curie map.

And I still don’t know whether this is because

  • sssom convert does not store the Curie map in the JSON file as it should (it stores it in the JSON-LD context, maybe that’s not where it should be stored?), or
  • sssom parse does not look for the Curie map where it should.

And the reason I don’t know that is because the spec does not specify the expected behaviour. This is what should be fixed. The spec MUST say explicitly where and how the Curie map should be stored in the JSON serialisation. Only then will it be possible to fix the implementation.

@hrshdhgd
Copy link
Contributor

Aah! I see your point. Apologies for not grasping this from the original issue.

@matentzn
Copy link
Collaborator

matentzn commented Oct 9, 2023

I would like to revisit this issue immediately after the big sssom-py refactor now. @hrshdhgd can you resurface this in one of our next meetings?

@gouttegd
Copy link
Contributor Author

gouttegd commented Jan 13, 2024

Re-opening because the root issue has not been addressed at all. The root issue is not the round-tripping bug in SSSOM-Py. It’s the complete absence of an actual specification for the JSON serialisation format.

I don’t understand how is that so difficult to grasp. I‘ve created the issue on the repo holding the specification, not the Python implementation. The title of the issue explicitly says: “JSON serialisation format is woefully underspecified”. I’ve explained in the initial comment that I have been trying to understand what the format is supposed to look like, and that doing so is basically impossible given the current state of the spec. I’ve asked precise questions about the format, none of which have been answered. I’ve re-stated again in later comments that “the spec should provide the authoritative answer to that question (by saying what a valid file should look like), and the fact that it does not is (for me, as someone trying to implement the spec) the real issue here” and that “the reason I don’t know that is because the spec does not specify the expected behaviour – this is what should be fixed.”

How can I be more clear that this issue is a about a specification problem, not a bug in SSSOM-Py?

The round-tripping bug in SSSOM-Py only made the problem worse by making it impossible to infer the JSON format by reverse-engineering SSSOM-Py. Glad that this is now seemingly fixed, but the problem remains that if I want to support the JSON format in SSSOM-Java, I still need to reverse-engineer the reference implementation, which – do I even need to say it ? – is very bad for something aiming to be a standard exchange format!

Right now, I feel that the JSON format is solely an internal format that is only supposed to be used by SSSOM-Py, and not intended to be interoperable. If that’s the case, then fine, but then the spec does not even need to mention that format at all.

@gouttegd gouttegd reopened this Jan 13, 2024
@matentzn
Copy link
Collaborator

In my view we should move to properly define the JSON format, as the main alternative to TSV.

The starting point should be

https://github.com/mapping-commons/sssom/blob/master/project/jsonschema/sssom_schema.schema.json

which is the JSON schema, but I don't know enough about JSON-LD to suggest any concrete steps to a full documentation of the format. @gouttegd do you have a suggestion on how we should start documenting the format properly? What exactly is missing in between the linkml docs, the json schema export we need to document carefully? Do you have a suggestion on how to document?

@gouttegd
Copy link
Contributor Author

The first thing would be to define the expectations of the format.

Funnily, the JSON generated by sssom convert changed completely between SSSOM-Py 0.4.0 and SSSOM-Py 0.4.2 (this is, in fact, not funny at all): the output that was generated by version 0.4.0 was an ersatz of JSON-LD which was not, actually, valid JSON-LD at all (trying to parse it through a JSON-LD-compliant processor yielded nothing).

Now SSSOM-Py 0.4.3 generates what seems to be fully valid JSON-LD. I suppose this is the intention?

Then first questions the spec MUST answer:

  1. Can all features of JSON-LD be used?

For example, according to the JSON-LD spec it should be possible to do funny things like this:

{
  "mapping_set_id": "http://purl.obolibrary.org/obo/fbbt/fbbt-mappings.sssom.tsv",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "mappings": [
    {
      "subject_id": "FBbt:00000001",
      "predicate_id": "semapv:crossSpeciesExactMatch",
      "object_id": "UBERON:0000468",
      "mapping_justification": "semapv:ManualMappingCuration"
    },
    {
      "subject_id": "FBbt:00000002",
      "predicate_id": "semapv:crossSpeciesExactMatch",
      "object_id": "UB:6000002",
      "mapping_justification": "semapv:ManualMappingCuration"
      "@context": {
        "UB": "http://purl.obolibrary.org/obo/UBERON_"
      }
    }
  ],
  "@type": "MappingSet",
  "@context": {
    "@vocab": "https://w3id.org/sssom/",
    /* Fragments of context skipped for brevity */
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_"
  }
}

where the second mapping uses an embedded context to define another prefix (UB) for Uberon terms. Should that be allowed, and should SSSOM implementations support that?

I’m not suggesting that using embedded contexts to allow redefining prefixes throughout a mapping set would be a good idea – it would be, in fact, a very bad idea in my opinion. But the JSON-LD spec allows it, for better or for worse. If the SSSOM JSON is in fact built on top of JSON-LD, then people would expect to be able to whatever the JSON-LD spec allows.

Likewise, it should also be possible (again, according to the JSON-LD spec), to do this:

{
  "mapping_set_id": "http://purl.obolibrary.org/obo/fbbt/fbbt-mappings.sssom.tsv",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "mappings": [
    {
      "my_subject": "FBbt:00000001",
      "my_predicate": "semapv:crossSpeciesExactMatch",
      "my_object": "UBERON:0000468",
      "my_justification": "semapv:ManualMappingCuration"
    }
  ],
  "@type": "MappingSet",
  "@context": {
    "@vocab": "https://w3id.org/sssom/",
    /* Fragments of context skipped for brevity */
    "my_subject": {
       "@type": "rdfs:Resource",
       "@id": "owl:annotatedSource"
    },
    "my_object": {
       "@type": "rdfs:Resource",
       "@id": "owl:annotatedTarget"
    },
    "my_predicate": {
       "@type": "rdfs:Resource",
       "@id": "owl:annotatedProperty"
    },
    "my_justification": {
       "@type": "rdfs:Resource",
       "@id": "sssom:mapping_justification"
    },
  }
}

where I use custom keys (my_subject, my_object, etc.) which are then mapped to the correct properties in the JSON-LD context. After JSON-LD expansion, this would yield exactly the same dictionary as what the expansion of the output of sssom convert would generate.

Is that supposed to be allowed? Again, not saying this is necessarily a good idea, but JSON-LD does allow that.

  1. Is JSON-LD processing supposed to be required to parse and use a SSSOM JSON file?

(This is tightly related to question 1, because if all features of JSON-LD are supposed to be allowed, then implementations have no choice: they MUST send the file through JSON-LD expansion before they can do anything with it.)

In other words: Is the JSON format strictly built on top of JSON-LD, or is it merely a custom JSON format that happens to re-use some (not all) features of JSON-LD?

It is important to note that after JSON-LD expansion, the JSON Schema you linked becomes useless, because the keys in the expanded dictionary no longer match what the schema is describing (for example, a tuple subject_id=FBbt:00000001 before expansion becomes http://www.w3.org/2002/07/owl#annotatedSource=FBbt:00000001 after expansion). This means that implementations must validate against the provided JSON Schema first, and then perform JSON-LD expansion (not the other way round). The spec must clarify this.

Other questions I can think of:

  1. As stated from the beginning: the spec must explicitly state:

3.a. Whether identifiers are expected to be written in full-length form only, CURIEfied form only, or if both forms are allowed.

3.b. Where and how the CURIE map should be stored (if CURIEfied forms are allowed; of course this is moot if we don’t allow CURIEfied identifiers). Based from the output of sssom convert the answer is “in the JSON-LD context”. It is very important to specify that explicitly because this is frankly counter-intuitive, for at least two reasons:

  • JSON-LD expansion will not use the prefixes declared there to automatically expand CURIEs in the mapping set, because none of the fields that are expected to contain CURIEs are declared in the context as being of type @id – so, from a JSON-LD point of view, those prefix declarations are useless;
  • In many cases, the prefixes that would be used in a SSSOM mapping set would not be valid JSON-LD prefixes. For example:
{
  "@context": {
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_"
  }
}

is not a valid prefix for compacting IRIs, because it ends with an underscore (valid prefixes intended to be used for compacting IRIs must end with one of those characters: :, /, ?, #, [, ], or @).

Using the JSON-LD context to store prefix declarations that will not be used by JSON-LD processing and with prefixes that in many cases will not even be valid seems very weird to me. If that is really how the spec wants the Curie map to be stored, it must be very explicit about it – and I would argue that it must explain the rationale behind that choice.

@gouttegd
Copy link
Contributor Author

gouttegd commented Jan 15, 2024

  1. The spec must also explicitly state what is expected from the JSON-LD context (what should a writer puts in there?). I cannot rely on the output of sssom convert to infer that, because I suspect part of what sssom convert generates is, frankly, garbage.

Sorry, but what the hell is that:

    "object_type": {
      "@context": {
        "@vocab": "@null",
        "text": "skos:notation",
        "description": "skos:prefLabel",
        "meaning": "@id"
      }
    },

@matentzn
Copy link
Collaborator

Ok, all of these are good questions, and I know the answers to exactly none of them.

I will see that I schedule a working meeting about this soon.

@cmungall
Copy link
Contributor

My suggestion is to have separate json and jsonld exports.

Anything regarding prefixes will be a separate concern from the json export. The json export will be isomorphic to the TSV, the cell values will essentially be the values in the json (I am not providing a spec here, this is just the flavor). All of the things mentioned in this issue that relate to prefix expansion will no be relevant to the json export. I can commit to providing specification (in the form of a stable json-schema), associated docs, and tooling in the short term.

jsonld is a more complicated beast to specify, for many of the reasons alluding to in this issue. I also explain the reasons for this here: https://linkml.io/linkml/howtos/using-jsonld.html. tl;dr it's not clear what a spec for a jsonld of a model would look like in the general case, so I don't want to commit to that. However, we can commit in the short term to providing a separate jsonld serialization that will effectively be identical to the json serialization plus injected context. But I would rather separate the discussion of these, as discussion of jsonld always brings in separate concerns that are irrelevant to most json consumers.

@matentzn
Copy link
Collaborator

Ok, I am not opposed to that (keeping JSON-LD and JSON side by side, and separate). Does this mean we should focus our standardisation effort on JSON (rather than JSON-LD)?

@gouttegd
Copy link
Contributor Author

I have no opinion on that (well, not quite: I am not that fond of JSON-LD personally, but I can keep that aside), but please note that right now what SSSOM-Py calls “JSON” is really JSON-LD (sssom convert --output-format json yields a JSON-LD file, not a “plain” JSON file) – as you noted yourself at the beginning of this discussion:

In any case, what we call JSON right now is JSON-LD, and therefore should be treated as such - this won’t change

What are people using the JSON output for, and do they actually need JSON-LD over plain JSON?

@anitacaron
Copy link
Contributor

I need the JSON-LD for the SSSOM-API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants