Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies Between Code, Documentation, and Schema #534

Open
ptth222 opened this issue Feb 16, 2024 · 5 comments
Open

Inconsistencies Between Code, Documentation, and Schema #534

ptth222 opened this issue Feb 16, 2024 · 5 comments
Assignees

Comments

@ptth222
Copy link

ptth222 commented Feb 16, 2024

I went through the code in the "model" folder, the JSON Schemas in "resources/schemas/isa_model_version_1_0_schemas", the JSON Schemas on https://isa-specs.readthedocs.io/en/latest/isajson.html, and the specification on https://isa-specs.readthedocs.io/en/latest/isatab.html to try and find inconsistencies. My impetus to do this was after trying to figure out the correct way to fix one of the validation functions and realizing that I don't have a source of truth for what should be valid.

I am going to list off some of what I found and categorize them a little bit.

Hard Inconsistencies

  1. The JSON Schemas actually used by the code in "resources/schemas/isa_model_version_1_0_schemas" are different from what is in the documentation at https://isa-specs.readthedocs.io/en/latest/isajson.html. The differences are slight, but the documentation should match what is actually in use.
  2. In the "to_dict" method for the Process class, "comments" for parameterValues are not added to the dictionary created for them. Assuming "to_dict" is how you are supposed to create a JSON version from an Investigation instance, and since there is no where else in the JSON representation that these can appear, this should probably change to include the comments. I did not check if these comments appear in Tab.
  3. The "to_dict" method for the Protocol class just puts an empty list for "components". The Protocol class also has "components" as a list of OntologyAnnotations, but this does not match what is in the JSON Schema in the "resources/schemas/isa_model_version_1_0_schemas" folder.
  4. The Material class doesn't have "derivesFrom" anywhere, this does not match what is in the JSON Schema.
  5. The Characteristic class has the "category" as an OntologyAnnotation, but this does not match the JSON Schema. The StudyAssayMixin class also has "characteristic_categories" as OntologyAnnotations which does not match the JSON Schema, but the "categories_to_dict" method puts them in the form that matches JSON Schema. The Characteristic class's "to_dict" method does not put them in the form that matched the JSON Schema though.
  6. The FactorValue class does not put "comments" in the dict created by the "to_dict" method, but the class is Commentable.

Minor Things I Noticed

  1. The schemas in "resources/schemas/isa_model_version_1_0_schemas" include an "organization_schema.json" file that isn't used by any of the schemas in that folder.
  2. The JSON-LD attributes aren't able to be set using "from_dict" for most classes, like Person and Publication. You can set "@id" for OntologyAnnotation, but not the other LD attributes. These are just some examples.

Questions

  1. Why do some nodes/classes not inherit from Identifiable? For instance, the Publication class doesn't. It makes it so you can't set an "id" and multiple calls to "to_dict" will generate a new "@id" property each time, so it isn't persistent.
  2. There are multiple folders for JSON Schemas in "resources/schemas". In particular, folder "isa_model_version_2_0_schemas" seems like it might be a version 2 based on the name. Are there plans to revamp the JSON representation of ISA? The schemas in "isa_model_version_2_0_schemas" are quite different from what are in "isa_model_version_1_0_schemas".

Suggestions or Possible Improvements

  1. For processes it might be a good idea to allow the "performer" attribute to be a person_schema in addition to a string, so people can put an "@id" there if they want. I actually assumed this had to be a person_schema because that would match the pattern used in so many other attributes.

The Hard Inconsistencies between the model code and JSON Schemas really needs addressed, and most importantly, which one is the ground truth? Does the model code need to change or the JSON Schemas?

 

Deprecated Columns?

This next bit I want to do separately because I really need to understand it to fix the validation function that started this whole thing. The function in question is "load_table_checks" in "isatab/validate/rules/rules_40xx.py". It is printing an unexpected column error for a "Characteristic" column after an "Extract Name" column. That is to say the function is saying that it is not valid to have a "Characteristic" column after an "Extract Name" column. I am quite confident that this is not correct. When I went to double check the truth of this I ran into what seems like contradictory information from my point of view between some of the sources mentioned above. That's why I did a deeper dive into things to try and find a decent list of issues that we can hopefully get resolved so there isn't conflicting information and code.

The main thing I want to ask about now is in the specification here: https://isa-specs.readthedocs.io/en/latest/isatab.html#assay-table-file. Under "2.3.7 Assay Table file" where it talks about "Extract Name" and "Labeled Extract Name" it says they "MAY be qualified with Characteristics, Material Type and Description". The "Characteristics" is consistent with the JSON Schema and model code, but I don't think "Material Type" and "Description" are relevant anymore. At the very least I cannot see what "Material Type" and "Description" would translate to in the JSON or model. I think maybe "Material Type" would just be a characteristic, and "Description" would just be a comment.

I have looked over some of the conversion code and loading code, and I can find where there is some special handling for "Material Type" in the "load_table" function in the "isatab/load/core.py" file, and the first version of the tab2json converter, "ISATab2ISAjson_v1", also has special handling for "Material Type", but I can't find anything for "Description". It should be noted that this "load_table" function is only used for the ISA-Tab validator, the converter, new and old, load tables in a different way. The way the new converter loads tables does not have any special handling for "Material Type" or "Description". All of the special handling for "Material Type" makes it into a characteristic.

Similarly, "Label" has the same issues as "Material Type" and "Description". To me it looks like it should just be a characteristic. I can find special handling for "Label" in the old tab2json converter and in the new converter (ProcessSequenceFactory). Both turn it into a characteristic.

Can we decide exactly what should be done with these 3 columns, "Label", "Material Type", and "Description"? The easiest thing is to just drop them and remove their mention in the documentation, but if you want to stay backward compatible then they need some special handling. "Label" and "Material Type" already have some special handling precedent that simply turn them into characteristics, which I think is fine, but needs to be consistent everywhere in the code. The 3 different loading paradigms in the code all do things differently. "Description" has no precedent for special handling that I can find, so it might be okay to just drop this and remove all mentions. I do need to have an answer to this before I can fix the "load_table_checks" function though.

This last part may need to be it's own issue, but it is related to everything. If you would rather have it separated, then that's fine.

@proccaserra proccaserra added this to To do in ISA Tools Development via automation Feb 19, 2024
@proccaserra
Copy link
Member

I went through the code in the "model" folder, the JSON Schemas in "resources/schemas/isa_model_version_1_0_schemas", the JSON Schemas on https://isa-specs.readthedocs.io/en/latest/isajson.html, and the specification on https://isa-specs.readthedocs.io/en/latest/isatab.html to try and find inconsistencies. My impetus to do this was after trying to figure out the correct way to fix one of the validation functions and realizing that I don't have a source of truth for what should be valid.

I am going to list off some of what I found and categorize them a little bit.

Hard Inconsistencies

  1. The JSON Schemas actually used by the code in "resources/schemas/isa_model_version_1_0_schemas" are different from what is in the documentation at https://isa-specs.readthedocs.io/en/latest/isajson.html. The differences are slight, but the documentation should match what is actually in use.

+1

  1. In the "to_dict" method for the Process class, "comments" for parameterValues are not added to the dictionary created for them. Assuming "to_dict" is how you are supposed to create a JSON version from an Investigation instance, and since there is no where else in the JSON representation that these can appear, this should probably change to include the comments. I did not check if these comments appear in Tab.

@ptth222 this should be fixed when we merge 'issue-511' into 'develop'.

  1. The "to_dict" method for the Protocol class just puts an empty list for "components". The Protocol class also has "components" as a list of OntologyAnnotations, but this does not match what is in the JSON Schema in the "resources/schemas/isa_model_version_1_0_schemas" folder.

@ptth222 the "components" were left out from the implementation as hardly used / inconsistently used. "Parameters" provide a similar capability. Yet, you are right about the undocumented discrepancy between schema and implementation.

  1. The Material class doesn't have "derivesFrom" anywhere, this does not match what is in the JSON Schema.

Yes. Currently 'derivesFrom' is only available from Sample.py, and not from Material.py., which is inconsistant with the current schemas (1.0).

  1. The Characteristic class has the "category" as an OntologyAnnotation, but this does not match the JSON Schema. The StudyAssayMixin class also has "characteristic_categories" as OntologyAnnotations which does not match the JSON Schema, but the "categories_to_dict" method puts them in the form that matches JSON Schema. The Characteristic class's "to_dict" method does not put them in the form that matched the JSON Schema though.

@ptth222 , the schema 'material_attribute_schema.json' is the one you are looking for and it is used by the Study_schema.json and assay_schema.json. Agreed the naming should have been consistent.
AI: @proccaserra and @terazus to cross-check 'to_dict()' method but it is known that creating a 'Characteristic' does not associated to 'characteristic_categories' neither at Study or Assay levels. This needs to be done manually (something we discussed when developing the jupyter notebook testing the round-trip extensively.

  1. The FactorValue class does not put "comments" in the dict created by the "to_dict" method, but the class is Commentable.

known issue: @ptth222 this will be addressed by merging 'issue-511' into 'develop'

Minor Things I Noticed

  1. The schemas in "resources/schemas/isa_model_version_1_0_schemas" include an "organization_schema.json" file that isn't used by any of the schemas in that folder.

correct.

  1. The JSON-LD attributes aren't able to be set using "from_dict" for most classes, like Person and Publication. You can set "@id" for OntologyAnnotation, but not the other LD attributes. These are just some examples.

known issue: @ptth222 this will be addressed by merging 'issue-511' into 'develop'

Questions

  1. Why do some nodes/classes not inherit from Identifiable? For instance, the Publication class doesn't. It makes it so you can't set an "id" and multiple calls to "to_dict" will generate a new "@id" property each time, so it isn't persistent.

Indeed. considered minor but if this is really becoming a nuisance, a fix will be issued. @terazus , what do you think?

  1. There are multiple folders for JSON Schemas in "resources/schemas". In particular, folder "isa_model_version_2_0_schemas" seems like it might be a version 2 based on the name. Are there plans to revamp the JSON representation of ISA? The schemas in "isa_model_version_2_0_schemas" are quite different from what are in "isa_model_version_1_0_schemas".

There are indeed plans to revamp but as the changes would likely break background compatibility, we have held back.

Suggestions or Possible Improvements

  1. For processes it might be a good idea to allow the "performer" attribute to be a person_schema in addition to a string, so people can put an "@id" there if they want. I actually assumed this had to be a person_schema because that would match the pattern used in so many other attributes.

@ptth222 +1, we discussed this with @terazus . could also be an Organization (CRO use case). We did not implement to avoid complexity and also possible privacy issue of 'performer' (e.g. animal studies).

The Hard Inconsistencies between the model code and JSON Schemas really needs addressed, and most importantly, which one is the ground truth? Does the model code need to change or the JSON Schemas?

there is no arguing on this one.

 

Deprecated Columns?

This next bit I want to do separately because I really need to understand it to fix the validation function that started this whole thing. The function in question is "load_table_checks" in "isatab/validate/rules/rules_40xx.py". It is printing an unexpected column error for a "Characteristic" column after an "Extract Name" column. That is to say the function is saying that it is not valid to have a "Characteristic" column after an "Extract Name" column. I am quite confident that this is not correct. When I went to double check the truth of this I ran into what seems like contradictory information from my point of view between some of the sources mentioned above. That's why I did a deeper dive into things to try and find a decent list of issues that we can hopefully get resolved so there isn't conflicting information and code.

@ptth222 +1 we need to check. 'Extract Name' , 'Labeled Extract Name' are Material and should get "Characteristics".
In fact, we have also requests to make "Data Nodes" (e.g. Raw Data File) the bearer of such annotations.
Easy to implement on the ISA-JSON front, more involved on the TAB front.

The main thing I want to ask about now is in the specification here: https://isa-specs.readthedocs.io/en/latest/isatab.html#assay-table-file. Under "2.3.7 Assay Table file" where it talks about "Extract Name" and "Labeled Extract Name" it says they "MAY be qualified with Characteristics, Material Type and Description". The "Characteristics" is consistent with the JSON Schema and model code, but I don't think "Material Type" and "Description" are relevant anymore. At the very least I cannot see what "Material Type" and "Description" would translate to in the JSON or model. I think maybe "Material Type" would just be a characteristic, and "Description" would just be a comment.

This is what happened de-factor-> both 'Material Type' and 'Description' recast as 'Characteristics'.

I have looked over some of the conversion code and loading code, and I can find where there is some special handling for "Material Type" in the "load_table" function in the "isatab/load/core.py" file, and the first version of the tab2json converter, "ISATab2ISAjson_v1", also has special handling for "Material Type", but I can't find anything for "Description". It should be noted that this "load_table" function is only used for the ISA-Tab validator, the converter, new and old, load tables in a different way. The way the new converter loads tables does not have any special handling for "Material Type" or "Description". All of the special handling for "Material Type" makes it into a characteristic.

+1

Similarly, "Label" has the same issues as "Material Type" and "Description". To me it looks like it should just be a characteristic. I can find special handling for "Label" in the old tab2json converter and in the new converter (ProcessSequenceFactory). Both turn it into a characteristic.

Can we decide exactly what should be done with these 3 columns, "Label", "Material Type", and "Description"? The easiest thing is to just drop them and remove their mention in the documentation, but if you want to stay backward compatible then they need some special handling. "Label" and "Material Type" already have some special handling precedent that simply turn them into characteristics, which I think is fine, but needs to be consistent everywhere in the code. The 3 different loading paradigms in the code all do things differently. "Description" has no precedent for special handling that I can find, so it might be okay to just drop this and remove all mentions. I do need to have an answer to this before I can fix the "load_table_checks" function though.

Agreed about clarifying that aspect .

This last part may need to be it's own issue, but it is related to everything. If you would rather have it separated, then that's fine.

+1, I think it is best to have this as a separate issue but as we go over the merge review for issue-511 (which covers more than the initial issue), we will double check this.

Thanks for all the detailed reporting.

@ptth222
Copy link
Author

ptth222 commented Feb 19, 2024

@proccaserra Thank you for the detailed response. I apologize for double posting about some of this. I'm not familiar with all of the issues and was not aware of #511. I have been trying to pay attention to the develop branch and keep my fork up to date. I will take a look at this again once #511 is merged.

@proccaserra
Copy link
Member

@ptth222 , no apologies needed at all. We were hoping to push the corrections sooner , thus avoiding you that pain. I am the one to apologise for these inconsistencies. Your input is very much appreciated

@ptth222
Copy link
Author

ptth222 commented Feb 21, 2024

A small update. The JSON Schema for data_schema has the following types listed for data files:
"Raw Data File",
"Derived Data File",
"Image File",
"Acquisition Parameter Data File",
"Derived Spectral Data File",
"Protein Assignment File",
"Raw Spectral Data File",
"Peptide Assignment File",
"Array Data File",
"Derived Array Data File",
"Post Translational Modification Assignment File",
"Derived Array Data Matrix File",
"Free Induction Decay Data File",
"Metabolite Assignment File",
"Array Data Matrix File"

3 of these don't have a class in datafile.py in the model:
"Image File",
"Array Data Matrix File",
"Metabolite Assignment File"

@ptth222
Copy link
Author

ptth222 commented Feb 21, 2024

I have found another inconsistency with the "Assay Nodes". The terminology gets kind of confusing here, so I am going to try and explain. In the specification, https://isa-specs.readthedocs.io/en/latest/isatab.html, in the "Assay Table file" section, it discusses the use of the column names "Assay Name", "Data Transformation Name", and "Normalization Name". In the code, in "isatab\defaults.py", these column names and some others are listed in a constant "_LABELS_ASSAY_NODES". The first thing that is rather confusing is the naming. Having a column, "Assay Name", inside the assay table file, which already has a name, is confusing. If you look in the code for ProcessSequenceFactory all these columns do is end up giving a process a name, so a column name like "Process Name" would make much more sense. It's also confusing why there are so many name variations on a column that just serves to name a process.

What is inconsistent between the code and the specification is that the specification indicates that these name columns MUST be used, but that is not the case in the code. There are no validations that require them, and the ProcessSequenceFactory just looks for 1 to be present and uses it to set the process name if 1 and only 1 exist for the process. Essentially, they are treated as qualifying columns for a process in the code.

It looks to me like there is some history with these columns that I'm not aware of and their use has changed over time. One of the example assay table files, a_proteome.txt, seems to use them more like a synonym for "Protocol REF". The "preprocess" function in ProcessSequenceFactory.py also seems to support this because it will insert "Protocol REF" columns in front of these name columns if it finds one of these columns unconnected to a "Protocol REF" column.

What is difficult now is how to validate these columns. If they should be qualifying columns for "Protocol REF", then reporting an error when they are out of position is reasonable. If they could be a synonym for "Protocol REF", then you shouldn't report an error. These are mutually exclusive, and I need to know which behavior should be correct for the "load_table_checks" validating function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants