Considerations on interoperability and evolution #325

Open
gouttegd opened this issue Oct 24, 2023 · 17 comments

Comments

@gouttegd
Contributor

What are the plans, if any, to ensure the SSSOM TSV format remains interoperable

  • across versions (a file produced with an implementation conforming to version X can be used by an implementation conforming to version X+1),
  • across implementations (a file produced by implementation X — e.g. sssom-py — can be used by another implementation Y — e.g. sssom-java)?

Currently, this interoperability goal is explicitly not sought after, as indicated both by the lower-than-one major version number (which, if we assume the project is using semantic versioning, means that anything can change at any time) and by the explicit warning on the top-level README: Note that SSSOM is currently under development and subject to change.

With the SSSOM format being seemingly more and more used in the wild, I believe it is time to consider committing to some form of long-term stability of the format, and/or design ways to make the format evolve while preserving some basic interoperability across versions and implementations.

The following is a random set of ideas that could be explored. Feel free to discuss them, refute them, and add more.

Defining a “core” set of metadata that will never change

We could select a handful of the most important mapping metadata and promote them to a “core set” that would be guaranteed never to change in any future evolution. This is similar to the “minimal spec” idea, though it could probably include slightly more metadata slots than the four mentioned in that ticket.

The principle here is that potential users could be confident that, no matter how the format evolves over time, as long as they only use the “core set” their files would always remain exploitable by any version of any conforming implementation. For users who need metadata from the “non-core set”, the situation would be the same as it is now for the entire standard: they would need to watch carefully the evolution of the standard to avoid being surprised by a breaking change.

For implementations, this would mean that they should only be strict when parsing the “core” metadata (at least by default — of course they can choose to allow users to specify a different behaviour). If they encounter a metadata slot they don’t recognise (because it’s an addition from a newer version of the spec), or a slot whose format has changed (e.g. because of a change such as the one envisioned here), they may log a warning but should not fail altogether to parse the file.
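
To make that concrete, a lenient parser could do something like the following (a minimal sketch in Python; the slot lists are placeholders, not an actual implementation):

import logging

# Hypothetical core set; the real selection would be fixed by the specification.
CORE_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}
# Slots this particular implementation happens to know about.
KNOWN_SLOTS = CORE_SLOTS | {"predicate_modifier", "mapping_date", "subject_type"}

def check_columns(column_names):
    """Warn (but do not fail) on unrecognised non-core slots."""
    for name in column_names:
        if name not in KNOWN_SLOTS:
            logging.warning(
                "Unrecognised slot '%s' (possibly from a newer spec version); ignoring it", name
            )

check_columns(["subject_id", "predicate_id", "object_id",
               "mapping_justification", "similarity_threshold"])  # warns about the last column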

Allow each set to declare its own “must-understand” slots

This can be seen as a variation of the “core set” idea. Here, instead of having a fixed list of “core” metadata slots, the creators of a mapping set could define their own list of the slots they consider as critical.

For example, considering the following set (and assuming that the similarity_threshold slot, proposed here, and the mapping_chain_intermediate slot, proposed here, have been added to a later version of the spec):

# must-understand:
#   - mapping_chain_intermediate
subject_id	predicate_id	predicate_modifier	object_id	mapping_justification	similarity_threshold	mapping_chain_intermediate
EXA:1234	skos:exactMatch	Not	EXB:5678	semapv:ManualMappingCuration	0.8	EXC:4321

An implementation trying to read that file would first check the list of the “must-understand” slots for any slot that it does not recognise, and should flatly reject the file if it does contain such a slot.

So, an implementation up-to-date with the latest version of the spec (and thus, which supports both mapping_chain_intermediate and similarity_threshold) would parse the file without any issue. An implementation that for whatever reason does not recognise mapping_chain_intermediate (maybe because it has not been updated to catch up with the latest version of the spec yet) should immediately fail with an error.
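
A sketch of that check, assuming the “must-understand” list has already been extracted from the comment block (names are made up for illustration; this is not sssom-py or sssom-java code):

def check_must_understand(must_understand, supported_slots):
    """Flatly reject the set if it declares a critical slot this implementation does not know."""
    unsupported = [slot for slot in must_understand if slot not in supported_slots]
    if unsupported:
        raise ValueError(f"Mapping set requires unsupported slot(s): {', '.join(unsupported)}")

# An implementation that does not know mapping_chain_intermediate yet:
check_must_understand(
    ["mapping_chain_intermediate"],
    supported_slots={"subject_id", "predicate_id", "object_id",
                     "mapping_justification", "similarity_threshold"},
)  # raises ValueError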

Adding a slot for the spec version

We could add a simple sssom_version metadata slot at the mapping set level, to indicate the version of the spec this set conforms to. Ideally that slot would be required to be the very first slot listed in the metadata block, so that a parser could figure out immediately whether the file it is trying to read uses a version it supports.

It would be up to the implementations to decide whether they want to support several versions at the same time or not.

Versioning the format in addition to the spec

Regardless of whether we add a versioning slot, it could be useful to introduce a version number for the file format, distinct from the version number of the specification. Not all changes to the specification have an impact on the file format, so tracking the evolution of the spec separately from the evolution of the format would make sense.

@matentzn
Collaborator

This is of great concern to me as well, and related to #189.

I do not know the best answer to any of your questions, and I am happy to yield to advice from more experienced people in standards management. Here is my rough sense:

Allow each set to declare its own “must-understand” slots

Seems too complicated to implement in practice.

Adding a slot for the spec version

I think this is a very good idea. Even if you can't open a specific (older) version of a file, this could at least give a hint as to which tool version (sssom-py, sssom-java) could deal with it. I would support such a slot, potentially even as "mandatory".

Core slots

This is my cautious weak position:

  1. I think the idea is good. We have a notion of core properties in sssom-py, which are all the required properties plus predicate_modifier. We could stabilise these right now. But how do we communicate this?
  2. We should probably agree that file format breaking changes should always require a new major version upgrade, while format changes that do not affect parsing (metadata etc) could be dealt with by minor version updates. This means that all file formats are intrinsically tied to a major release of SSSOM, say 1.0, 2.0 etc.

@gouttegd
Contributor Author

gouttegd commented Oct 25, 2023

Adding a slot for the spec version

[…] I would support such a slot potentially even as "mandatory".

Or optional with a default value of 1.0. Meaning that if a SSSOM metadata block does not start with

#sssom_version: x.y

an implementation should assume the version is 1.0; this would avoid making all the already existing SSSOM files incompatible with the standard as soon as we introduce this new slot.
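
In other words (a minimal sketch, assuming the metadata block has already been read into a dictionary):

def read_sssom_version(metadata: dict) -> str:
    # A metadata block that does not declare the slot is assumed to be version 1.0.
    return metadata.get("sssom_version", "1.0")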

Core slots

But how do we communicate this?

Well, by adding a paragraph to that effect to the specification. How else would you want to do that?

We should probably agree that file format breaking changes should always require a new major version upgrade, while format changes that do not affect parsing (metadata etc) could be dealt with by minor version updates.

Yes. Either that or, as also suggested, we separate the version number of the specification from the version number of the format. If we change the spec without changing the format, we bump the spec version number but leave the format version number alone.

For example, the change proposed in option C in this issue changes the way the mapping_date slot should be interpreted, but it has no effect on how it should be represented on file (and therefore on how a file should be parsed or written) – it’s a specification change but not a format change.

Another nice possibility, but that I believe would be incompatible with the way LinkML is used, would be to separate (and version separately) the data model from the file format. Currently, the two are completely intertwined, because the file format is basically a direct serialisation of the data model. This is annoying, because it means that any breaking change to the data model will systematically result in a breaking change to the file format, and it does not need to be that way.

For example, consider #323. The proposed change is a data model change. It’s also a file format change because the following TSV file:

subject_id predicate_id     object_id subject_type
EXA:1234   skos:exactMatch  EXB:5678  owl class

which is currently valid, would become invalid after the change (the value of subject_type should be owl:Class).

If the file format was specified separately from the data model, it would be possible to specify that the on-disk file format accepts both the old syntax (owl class) and the new one (owl:Class), even though the data model only expects an identifier (it would be the role of the parser to recognise the old syntax and silently translate it to an identifier). Therefore there would be no need to break existing files just because of a slight change to the data model.
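
For instance, such a compatibility layer in a parser could boil down to something like this (a sketch assuming #323 is adopted; the translation table is purely illustrative):

# Old free-text entity types still accepted on disk, translated to the identifiers
# the data model expects (illustrative values only).
LEGACY_ENTITY_TYPES = {
    "owl class": "owl:Class",
    "skos concept": "skos:Concept",
}

def normalise_entity_type(value: str) -> str:
    """Silently translate a pre-change value; pass through anything already valid."""
    return LEGACY_ENTITY_TYPES.get(value, value)

assert normalise_entity_type("owl class") == "owl:Class"
assert normalise_entity_type("owl:Class") == "owl:Class"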

@matentzn
Collaborator

I think we are on the same page regarding where we are going with this. We just need to substantiate this understanding a little:

Currently, the two are completely intertwined, because the file format is basically a direct serialisation of the data model.

As far as I understand, this is only really true for the YAML serialisation, which we have not promoted much at all; I think the "mapping" from the data model to the other serialisations is done via transformation procedures. So there is actually a double risk: not only can changes to the data model directly affect the JSON and RDF serialisations, but so can changes to the transformation procedures. So I now think we should probably do as you say:

  1. Define serialisation versions
  2. Tie them to specific versions of linkml transform (this has the ugly ugly ugly consequence that we need to pin any specific version of sssom-py to a specific version of linkml-runtime) and sssom-schema
  3. Add unit tests for each release when updating linkml (deliberately) to ensure the serialisation does not change.

Ugh.. You are shining some torchlight into the underdark @gouttegd..

@gouttegd
Contributor Author

gouttegd commented Oct 26, 2023

As far as I understand, this is only really true for the YAML serialisation

It is also partially the case for the SSSOM-TSV format, with sssom-py.

When writing a SSSOM-TSV file, sssom-py writes the metadata block by basically dumping the metadata dictionary as it is (yaml.safe_dump(meta)) and writes the TSV part by basically dumping the Pandas data frame (msdf.df.to_csv()). What’s written is a direct serialisation of the SSSOM data model, so there’s no room for the serialisation format and the data model to diverge.

The good news is that when reading though, sssom-py first loads the metadata block and the TSV rows as they are, and then compares them with what it expects from the SSSOM schema – meaning that here it would be possible for the serialisation format to diverge from the data model, because the parser could perform any necessary “translation” (for example, replace owl class by owl:Class, if we take again the example of #323).

I think the "mapping" from the data model to their serialisations is done via transformation procedures.

Unless I missed something, I don’t think so, at least not for SSSOM-TSV and not in the current version of sssom-py.

@gouttegd
Contributor Author

gouttegd commented Oct 26, 2023

Tie them to specific versions of linkml transform

Please don’t. The serialisation format should be specified on its own, without any reference to any LinkML runtime stuff.

Reference to LinkML is precisely the problem with the JSON serialisation format (#321). The “specification” just says, “the JSON format is whatever LinkML’s serialisation procedures do.” In effect this makes this format impossible to implement in any language that does not have decent LinkML support (so, any language other than Python¹).


¹Yeah, I know, “LinkML is definitely not only for Python developers”… Believe that if you want. As someone who has been implementing two LinkML-defined projects in Java, I can tell you the support for other languages in LinkML is symbolic at most. Make the SSSOM serialisation format entirely dependent on the LinkML runtime and the only result will be that nobody will ever try to support SSSOM in any other language than Python — which would mean for example no more SSSOM plugin for ROBOT.

@matentzn
Collaborator

matentzn commented Nov 1, 2023

Unless I missed something, I don’t think so, at least not for SSSOM-TSV and not in the current version of sssom-py.

SSSOM-TSV is indeed different; for all other serialisations, though, it is true.

As always, all you say is sane and we should go this way. What is the best way to document the file format independently of the schema, though? I thought this was precisely what the LinkML JSON Schema export was for.

Before we start with anything, we should probably build a library of examples and, similar to ROBOT, build a test suite that ensures that, with every release, we do not break serialisation expectations.

@gouttegd
Contributor Author

gouttegd commented Nov 1, 2023

As always, all you say is sane and we should go this way.

Err, which way? My mind is not set on any way in particular (sorry if I gave the impression it was), I was merely throwing ideas. ^^'

What is the best way to document the file format independently of the schema though?

The documentation/specification does not need to be completely “independent of the schema”. What it must be (and that’s the only aspect on which I do have a firm opinion) is independent of the LinkML runtime – that is, it must not do what the description of the JSON format is doing. The SSSOM-TSV format is currently described independently of the runtime, and it should remain so.

In fact, the way the SSSOM-TSV format is currently specified is pretty good, I think. It’s a textual description that refers to the LinkML schema when needed (e.g., to avoid manually listing all the metadata fields or the meaning of some enumeration values).

What needs to be decided is if we want to allow the SSSOM-TSV format to deviate from the schema for backwards-compatibility reasons, so that the format (or rather, its parsers) can act as a compatibility layer that shields users from breaking changes in the schema.

If we want to do that, it can be done simply by adding a section in the spec that would say something like

Compatibility with previous versions

In addition to the columns described in the metadata table mentioned above, a compliant parser SHOULD be ready to accept the following columns from previous versions of the specification:

match_type

This column was used to indicate how a mapping had been asserted in versions prior to 0.9.1. If it is present, the parser SHOULD silently convert it to a mapping_justification slot using the following table:

  • Lexical -> semapv:LexicalMatching
  • HumanCurated -> semapv:ManualMappingCuration
    [etc.]

match_term_type

This column was used to describe the type of entities being matched. It was replaced in SSSOM 0.9.1 by two distinct columns subject_type and object_type. If it is present, the parser SHOULD convert it to those two columns using the following table:

  • ClassMatch -> owl class
  • ConceptMatch -> skos concept

Of course, this is only an example. These changes occurred before version 1.0, so it would be fine not to guarantee backwards compatibility in this case. But this illustrates how we could do it if we need to introduce breaking changes after version 1.0.
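
For what it’s worth, the match_type rule above would amount to something as simple as this in a parser (hypothetical helper, using only the values listed above):

# Conversion table from the draft section above (truncated, as in the example).
MATCH_TYPE_TO_JUSTIFICATION = {
    "Lexical": "semapv:LexicalMatching",
    "HumanCurated": "semapv:ManualMappingCuration",
    # [etc.]
}

def upgrade_match_type(row: dict) -> dict:
    """Silently convert the obsolete match_type column to mapping_justification."""
    if "match_type" in row:
        row["mapping_justification"] = MATCH_TYPE_TO_JUSTIFICATION[row.pop("match_type")]
    return row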

@gouttegd
Contributor Author

gouttegd commented Nov 1, 2023

What needs to be decided is if we want to allow the SSSOM-TSV format to deviate from the schema for backwards-compatibility reasons

If we do not want to do that (that is, we want the SSSOM-TSV format to always follow closely the schema, as it currently does), then I think that adding a version number to the metadata becomes necessary.

@gouttegd
Contributor Author

gouttegd commented Nov 2, 2023

Trying to summarise the options:

  1. Backwards compatibility is out of scope for the specification.

We decide that the spec simply does not care about backwards compatibility. The spec is only ever about the latest (current) version. We update the schema as we see fit. At most, we promise, once version 1.0 is out, to break things as rarely as possible. But backwards compatibility is entirely left at the discretion of the implementations: some may only support the latest version of the spec, others may try to do their best to support previous versions as well.

  2. Backwards compatibility is out of scope, but with a version number.

Basically same as option 1), but we do add a sssom_version field to the dataset metadata so that implementations can at least immediately know whether the file they are trying to read is using a version of the model that they support (instead of figuring that out the hard way, for example upon encountering a metadata slot that they do not recognise). It is again left at the discretion of the implementations whether they want to support older versions and how they do so.

Both options 1 and 2 do not require the SSSOM-TSV format to deviate from the data model.
Under any of these options, having a “core set” of metadata slots that would be guaranteed never to change would be nice.

  3. Backwards compatibility is a spec feature

The spec explicitly a) allows, b) encourages, or c) mandates (pick one option) implementations to support older versions. The SSSOM-TSV format is allowed to deviate from the latest data model. Every new version of the spec includes some backwards compatibility notes describing what has changed, what old metadata slots are no longer present in the data model but may still be found in SSSOM-TSV files, and how to deal with such slots (e.g. how to convert them to the current data model).

A sssom_version metadata slot, as in option 2, would also be nice here, though not indispensable.

I am inclined towards option 3b, personally.

@matentzn
Collaborator

matentzn commented Nov 2, 2023

Err, which way? My mind is not set on any way in particular (sorry if I gave the impression it was), I was merely throwing ideas. ^^'

Sorry I should have referred to your quote on not tying to a specific linkml runtime version. :D

What needs to be decided is if we want to allow the SSSOM-TSV format to deviate from the schema for backwards-compatibility reasons, so that the format (or rather, its parsers) can act as a compatibility layer that shields users from breaking changes in the schema.
If we want do to that, it can be done simply by adding a section in the spec that would say something like

I think we could do something like that. Of course, this approach will also have certain limits (for example, if a free field (not an enum) changes from string to EntityRef datatype, and you don't know all the possible values like in the match_type case).

I think we can go this way.

If we do not want to do that (that is, we want the SSSOM-TSV format to always follow closely the schema, as it currently does), then I think that adding a version number to the metadata becomes necessary

I think we should do that anyway, making this an optional property that defaults to the latest version.

Trying to summarise the options:

Great summary!

Can you play the following through: what exactly would happen if an uncontrolled field (no fixed values) is changed from type string to type EntityRef? This happened twice in the past, with subject_source and mapping_justification (aka match_type). Are you saying that when writing a file where a value is still a string, the reference writers

  1. Will try to repair the value silently if not ambiguous
  2. Will be forced to write the value as it was (maybe printing a warning)

I am not against encouraging implementors to support older versions, but I fear that this will create a big communication overhead (with resources that have their own parsers). Ever more small variants of the file format will make the parser requirements ever more complex. Validation methods will become stranger too, as the parsers required for validation are very much tied to the schema at the moment (linkml validate). I am personally more inclined towards (2), so far.

@gouttegd
Contributor Author

gouttegd commented Nov 2, 2023

this approach will also have certain limits (for example, if the free field (not enum) changes from string to entityref datatype, and you don't know all the possible values like in the match_type case))

Yes. Depending on the breaking changes we introduce, there will not always be a straightforward way to ensure backwards compatibility.

what exactly would happen if an uncontrolled field (no fixed values) is changed from type string to type EntityRef?

Whatever we decide should happen in that case will happen, nothing more, nothing less. :)

There is no pre-defined answer. Every time we introduce a breaking change, we’ll have to consider what could/should be done.

That may depend on many things, such as:

  • Is the field that we are breaking an “important” field? For example, when match_type was replaced by mapping_justification, this touched a very important field. For such a field, we should probably go to greater lengths to ensure compatibility than for a more “minor” field.
  • Do we have any idea of how widely the field has been used in the wild?
  • For an uncontrolled field, is there a way we could get a glimpse of how it has been used (what type of values people put in it)?
  • When the field has been used, what has it been used for?

What should probably be done to get answers to at least some of these questions is to advertise beforehand our intention to change the field (e.g. on a GitHub ticket, with a call for comments on Slack and any other channel you can think of).

And then, we decide: whether we go ahead with the change, and if so, how we suggest / recommend / mandate that implementations deal with it. Options may include:

  1. Don’t even try to deal with it. It’s only a minor field, just ignore it. Print a warning if you want, but then forget it and process the rest of the mapping set normally.
  2. Don’t even try to deal with it. It’s a field that is too important for you to try to guess how to convert it to the new model. Just abort immediately if you come across the old field. Leave it to the user to amend her mapping set with the new field. Point her to this very helpful page we have prepared in which we explain the issue.
  3. Use the table below that lists some commonly used values for this field along with the EntityRef that should now be used instead. If the mapping set you’re parsing contains values that are not in this list, use the special value semapv:IHaveNoIdeaWhatThisIsAbout, print a warning, but keep parsing the mapping set (roughly as sketched after this list).
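
Option 3 would essentially amount to something like this (a sketch; the lookup table and the fallback value are placeholders):

import logging

# Hypothetical table of commonly used free-text values and their EntityRef replacements.
COMMON_VALUES = {
    "lexical match": "semapv:LexicalMatching",
    "manually curated": "semapv:ManualMappingCuration",
}
FALLBACK = "semapv:IHaveNoIdeaWhatThisIsAbout"  # tongue-in-cheek, as above

def convert_free_text(value: str) -> str:
    """Map a known free-text value to an EntityRef, or fall back with a warning."""
    entity_ref = COMMON_VALUES.get(value)
    if entity_ref is None:
        logging.warning("Unrecognised value '%s'; falling back to %s", value, FALLBACK)
        entity_ref = FALLBACK
    return entity_ref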

Validation methods will become stranger too, as the parsers required for validation are very much tied to the schema at the moment (linkml validate)

That is not a problem as long as validation happens after any compatibility preprocessing.

For example, if a mapping set is still using match_type, the compatibility layer in the parser will convert that to mapping_justification, and then the validator will only ever see mapping_justification. The validator need not be changed in any way; it can remain aware only of the latest version of the schema.
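
To make the ordering concrete (a minimal sketch with stand-in helpers, not actual sssom-py code; the value conversion is elided):

def upgrade_legacy_slots(row: dict) -> dict:
    """Compatibility layer: rewrite obsolete slots before validation."""
    if "match_type" in row:
        row["mapping_justification"] = row.pop("match_type")
    return row

def validate(row: dict) -> None:
    """Stand-in validator that only knows the latest schema."""
    assert "mapping_justification" in row, "missing required slot"

row = {"subject_id": "EXA:1234", "match_type": "semapv:LexicalMatching"}
validate(upgrade_legacy_slots(row))  # passes; the validator never sees match_type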

I am personally more inclined towards (2)

The main difference between (2) and (3a) is that, while in both cases backwards compatibility is entirely optional, in (3a) at least the spec provides information about how to do it.

Basically, option (2) is: “This document only describes the latest version of the spec. You MUST support that version as described here. It’s entirely up to you whether you want to support older versions, but if you do, it’s up to you to dig up older versions of this document, figure out what has changed, and come up with a way to deal with those changes. Have fun!”

Option (3a) is: “This document describes the latest version of the spec. You MUST support that version as described here. You MAY support older versions if you want, it’s up to you. If you want to do that, please refer to the COMPATIBILITY WITH OLDER VERSIONS section below for details of what has changed in the last version and hints on how to deal with older versions.”

One issue with option (2) is that if you let each implementation come up with its own solution, well… they will each come up with their own solutions, which may very well be completely different from one implementation to the next. One implementation will silently ignore that old field, another one will reject it outright, yet another one will silently convert it in one way, and yet another will convert it in another slightly incompatible way.

@gouttegd
Contributor Author

gouttegd commented Nov 2, 2023

More on the choice between options (3a), (3b), (3c) (allow / recommend / mandate support for older versions):

First, option (3c) should generally be avoided, in my opinion. Mandating that implementations MUST support older versions would be too much of a burden. And it would also have the side-effect of removing any incentive for producers of SSSOM mapping sets to update to the latest version. “My software writes out mapping sets that use the match_type field; why should I bother to update it to make it use the mapping_justification field, since the spec guarantees that all implementations MUST still support match_type?”

Second and more importantly: support for older versions does not need to be an “all-or-nothing” choice. The spec can offer the choice between MAY / SHOULD / MUST for every single compatibility issue.

For example, we may say that “implementations SHOULD support the old match_type field; they MAY support the old match_term_type field” – because we deem match_type to be very important, and match_term_type not so much.

@gouttegd
Contributor Author

gouttegd commented Nov 2, 2023

Support for older versions does not need to be an “all-or-nothing” choice. The spec can offer the choice between MAY / SHOULD / MUST for every single compatibility issue.

This also means that “supporting older versions” does not need to mean “supporting all older versions”. We can fine-tune the expected level of support for each version.

For example, if we make a SSSOM version 1.1, it would be perfectly reasonable, in my opinion, to ask that implementations compliant with that version SHOULD (possibly even MUST) also support version 1.0, because presumably there wouldn’t be many differences between 1.0 and 1.1.

Later, when we get to SSSOM 2.0, it would also be perfectly reasonable to say that implementations compliant with that version MAY support any 1.x version (meaning that they can do it if they want, but they are not expected to).

@matentzn
Collaborator

matentzn commented Nov 3, 2023

I cannot tell you enough how much I enjoy this discussion!

Option (3a) is: “This document describes the latest version of the spec. You MUST support that version as described here. You MAY support older versions if you want, it’s up to you. If you want to do that, please refer to the COMPATIBILITY WITH OLDER VERSIONS section below for details of what has changed in the last version and hints on how to deal with older versions.”

I am sold! Would you be interested to try and draft the basic guidelines for enacting your proposal? I think we should do a SSSOM 1.0 before the end of the year, which also explicitly introduces your (3a) suggestion. I do not think we should try (and I don't think you are suggesting this) to figure out the dozens of breaking changes that happened in the past. But it would be good if the guidelines were in place, and documented explicitly as part of the "spec", before we publish 1.0.

@gouttegd
Contributor Author

gouttegd commented Nov 3, 2023

what exactly would happen if an uncontrolled field (no fixed values) is changed from type string to type EntityRef?

Another option that I did not include yesterday, but that I think could be the best option if the field is really important: phased deprecation.

That is, let’s say that in SSSOM version X we have a foo uncontrolled field that we wish to change to EntityRef with a controlled vocabulary; that field is too important to ignore but alas it is also impossible to define an automated way of mapping the thousands of free text values in use in the wild to the controlled vocabulary.

What we can do then:

  • In SSSOM version X+1, we leave foo unchanged and we introduce a new field called foo_id that uses the controlled vocabulary. Both fields are present simultaneously in the schema, so all implementations MUST support both. For consumers, if both fields are present in a mapping set, implementations MUST ignore foo and only use foo_id; if only foo is present, implementations MUST accept it as before but MUST print a warning that foo is going to be deprecated in favour of foo_id. For producers, they MUST produce mapping sets with foo_id and MUST NOT use foo anymore.
  • In SSSOM version X+2, we remove foo from the model. Implementations MUST support foo_id, but support for the uncontrolled foo is now optional. If an implementation does decide to support foo, it should treat it as in version X+1: accept it but print a (stronger) warning that “foo is now obsolete, you should really update your mapping sets to use foo_id, you know”.

We could even do a 3-phase deprecation, depending on how important the field is: in X+2, support for foo is optional but recommended (implementations SHOULD support foo); in X+3 and later, support for foo is fully optional (implementations MAY support foo).

Of course it’s complicated and cumbersome, but it may be the only way to ensure a smooth transition if/when we need an important breaking change post-1.0. (All the more reason to avoid such breaking changes whenever possible!)
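
A minimal sketch of the consumer-side behaviour at version X+1, using the hypothetical foo / foo_id slots from above:

import logging

def resolve_foo(metadata: dict):
    """Version X+1 rules: prefer foo_id, accept foo with a deprecation warning."""
    if "foo_id" in metadata:
        return metadata["foo_id"]  # if both are present, foo is ignored
    if "foo" in metadata:
        logging.warning("'foo' is deprecated and will be removed; use 'foo_id' instead")
        return metadata["foo"]
    return None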

@gouttegd
Contributor Author

gouttegd commented Nov 3, 2023

Would you be interested to try and draft the basic guidelines to enacting your proposal?

I opened the ticket in the first place, so I’d better be interested! :D

I do not think we should try (and I don't think you are suggesting this) to figure out the dozens of breaking changes that happened in the past.

Agreed. I have used the past breaking changes as examples above because they were there, but I am not suggesting that SSSOM 1.0 should recommend supporting features from pre-1.0 versions, let alone all features from pre-1.0 versions.

I do think that if we go with any of the (3) options, SSSOM 1.0 should document how to support at least one or two old fields that don’t exist anymore (e.g. match_type), because it would give implementers a good idea of what they could expect for possible future breaking changes.

Such support would be entirely optional though (MAY). For example, sssom-java does still support match_type, but sssom-py does not (it flatly rejects a mapping set that uses match_type instead of mapping_justification), and I think it’s perfectly fine.

@matentzn
Collaborator

matentzn commented Nov 6, 2023

Wow.. I can't remember a more vigilant defender of adopters than you. Ok, make a proposal as a PR on the spec whenever you find the energy and time, and we'll hash out the details. I like the idea of phased deprecation, but it will necessitate divorcing the TSV implementation entirely from LinkML, which is in any case necessary as per our discussion on serialisation versioning.
