Distinguish core mappings from metadata and introduce core_mapping_id #361

Open
matentzn opened this issue Apr 11, 2024 · 2 comments

@matentzn
Collaborator

I accidentally derailed #359 to a discussion about identifying the core mapping, so I am moving this here now.

Adding mapping_id as a required slot is out of the question I think, because identifier management is too much churn for most users that just want to add a quick table with mappings to their repo.

Let's start with discussing the why, then the how.

Because of the format, let's have the discussion here: #360

@gouttegd
Contributor

gouttegd commented Apr 11, 2024

Not a fan of splitting the discussion across multiple places, but oh well.

Regardless of the “why” people might need a way to refer to a core mapping (curious to see the use cases people will bring in #360), I want to state again that we don’t need a core_mapping_id field!

An identifier for a triple (or a quadruple, if we want to include the predicate modifier) can always be derived on the fly from the triple itself, there is no need to explicitly store it. Let the spec define a standard derivation algorithm, let SSSOM-Py and SSSOM-Java provide helper methods to perform the derivation, but let’s not clutter the format with a field that would merely duplicate what is already contained in other fields.

In fact the more I think about it, the more strongly I object to the creation of a core_mapping_id field. Even an optional one.

I’ve quickly mentioned it in the original discussion, but I will expand more here on one of the reasons I object to such a field: It will make editing a set needlessly more difficult.

Let’s say that I am creating this mapping record about the core mapping {FBbt:1234, skos:exactMatch, CL:5678} (psst, see what I did there? I just referred to a core mapping, and I didn’t need an identifier to do that):

subject_id   predicate_id      object_id   mapping_justification
FBbt:1234    skos:exactMatch   CL:5678     semapv:LexicalMatching

And let’s say we decide that core mapping identifiers should be derived from the core mapping by concatenating the elements of the triple and hashing them with MD5, so the identifier for the core mapping above would be 53cc280ef2220b850e2b92ef48d45d19.
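To make the derivation idea concrete, here is a minimal Python sketch of such a helper. The function name `derive_core_mapping_id`, the use of CURIEs as input, and the empty separator are all assumptions made for illustration; the thread does not specify the exact concatenation, so this sketch is not claimed to reproduce the example hash above.

```python
import hashlib


def derive_core_mapping_id(subject_id, predicate_id, object_id, sep=""):
    """Hypothetical helper: derive an identifier from a core mapping triple.

    Assumption: the three elements are joined with `sep` (empty by default)
    and the UTF-8 bytes of the result are hashed with MD5.
    """
    joined = sep.join((subject_id, predicate_id, object_id))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# The identifier is recomputed on demand, so nothing needs to be stored.
core_id = derive_core_mapping_id("FBbt:1234", "skos:exactMatch", "CL:5678")
print(core_id)
```

Because the value depends only on the triple, any tool can recompute it whenever it is needed, which is the argument for not storing it in the file.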

There’s no way I am going to derive the identifier myself (my interest in cryptography does not go far enough for me to know how to compute a hash in my head), so I’m going to need some tool to post-process the file in order to add the identifier:

core_mapping_id                    subject_id   predicate_id      object_id   mapping_justification
53cc280ef2220b850e2b92ef48d45d19   FBbt:1234    skos:exactMatch   CL:5678     semapv:LexicalMatching

And that, already, makes a core_mapping_id field a very bad idea. Sure, it is always a good idea anyway to run a tool like sssom validate on your file after creating it. But the core_mapping_id field makes going through such a tool absolutely necessary. We go from a format that is sold as being easy to manipulate with common tools (standard spreadsheet software or even a basic text editor) to a format that requires highly specific tooling for the basic task of filling an identifier field – for nothing.

OK, then maybe the derivation algorithm does not need to include a hashing step? Wouldn’t change much. Let’s say the core mapping identifier is generated instead by representing the triple as a canonical S-expression: (7:subject40:http://purl.obolibrary.org/obo/FBbt_1234)(9:predicate46:http://www.w3.org/2004/02/skos/core#exactMatch)(6:object38:http://purl.obolibrary.org/obo/CL_5678). Sure, it’s not difficult to create such an expression even by hand. But it’s still cumbersome, and you probably can’t expect everyone to get it right, so having a post-editing ID-generating step would probably still be required.
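As an illustration of why hand-writing this is error-prone, the expression above can be generated with a few lines of Python. This is only a sketch: the function names `canonical_atom` and `core_mapping_sexp` are hypothetical, and it assumes the inputs are already expanded IRIs and that each length prefix counts UTF-8 bytes.

```python
def canonical_atom(value):
    """Encode one string as a length-prefixed atom, e.g. '7:subject'."""
    # Assumption: the prefix is the byte length of the UTF-8 encoding.
    return f"{len(value.encode('utf-8'))}:{value}"


def core_mapping_sexp(subject_iri, predicate_iri, object_iri):
    """Hypothetical helper: serialise a core mapping as three (name value) lists."""
    fields = (
        ("subject", subject_iri),
        ("predicate", predicate_iri),
        ("object", object_iri),
    )
    return "".join(
        f"({canonical_atom(name)}{canonical_atom(value)})" for name, value in fields
    )


sexp = core_mapping_sexp(
    "http://purl.obolibrary.org/obo/FBbt_1234",
    "http://www.w3.org/2004/02/skos/core#exactMatch",
    "http://purl.obolibrary.org/obo/CL_5678",
)
print(sexp)
```

Every length prefix has to match its value exactly, which is easy for a program and tedious for a person editing a TSV file by hand.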

@joeflack4
Contributor

joeflack4 commented Apr 11, 2024

Thanks for moving this over here. In addition to the "core vs record mapping" axis discussed in #359, there is also the axis of "globally unique vs not": mapping/record_id (type: string) vs mapping/record_guid (hashed). We could choose to add one new identifier field, several, or none.
