Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISA Validation Configurations Too Restrictive #520

Open
ptth222 opened this issue Jan 19, 2024 · 5 comments
Open

ISA Validation Configurations Too Restrictive #520

ptth222 opened this issue Jan 19, 2024 · 5 comments

Comments

@ptth222
Copy link

ptth222 commented Jan 19, 2024

isajson.validate() has a "config_dir" parameter that defaults to using some JSON configuration files for various technology and measurement types, but if you don't have one of those types, errors are reported. Is the intention to force users to add configurations to validate new experiment types? In general these seem like extra validations that aren't necessary to validate whether something fits the ISA format since they validate specific word choice and protocol order, but there is no way to indicate to the function not to do these checks. My expectation would be that setting config_dir to None would disable this highly specific validation.

I don't see any documentation on needing to create these configuration files or how they are structured, anything at all on them. Am I looking in the wrong place?

For isajson.validate() it looks fairly straight forward to change the code so it would not run these validations if config_dir was None. isatab.validate() has the same problem though, and it would be much more difficult to try and offer a way to not do them. This makes it seem to me that it was intended to force the user to create configuration files, but then the lack of documentation on it contradicts that idea. What is the intention with these configuration files for users with new experiment types?

@proccaserra
Copy link
Member

@ptth222 you are right about the missing documentation about this step, something we are working at addressing.
The thing to know is that if no config_dir is supplied, the validation will run against the 'default' isa configuration, which provided initial support for a number of NGS based assays, some types of MS and NMR based assays.
The validation relies on a set of simplified json documents (derived from the ISA configuration xml files) with hard coded requirements for 'measurement type', 'technology type' and 'protocol type' used in the process sequences.

It turns out that the match of those strings and any departure from the expected value will cause a validation error, which does not return a very helpful message.

If you have a validation failure, it is either because there is mismatch on those values, or an entirely new assay type has been defined locally but is absent from the ISA configuration supplied to the ISA validator or from the default one.

To your point about the purpose of this behaviour, it was somehow to use the ISA configuration to "enforce" so called "Minimal Information Requirement" defined by a community for a technology domain. For instance, the ISA configurations supporting NGS applications are aligned to INSDC SRA specification. Degrading an ISA configuration to simplify it would result in failure to convert to SRA xml down the line.
So yes, the intent, even though not properly documented, was to encourage ISA users :

  • to reuse existing officially vetted configuration for supported technologies for which ISA provides conversion to Repository specific format (e.g., SRA)
  • to propose / submit to the dedicated GitHub repository: https://github.com/ISA-tools/isa-configurations to augment the portfolio of ISA compatible assay extension for specific technology.
    This is for instance what I am currently doing working with EMBL-EBI metabolights
    I can provide more details about what is usually needed:
  • (ontology selection / contribution (e.g. to OBI for protocol type, measurement type or technology type,
  • ISA workflow creation
  • raw data file check
  • sets of annotation requirements (type of parameters and instrument settings, workflow referencing)

Coordination the development and maintenance of all these requires resources but if there are needs in your area, please let us know and let's have a follow-up (zoom call)

atb

@ptth222
Copy link
Author

ptth222 commented Jan 23, 2024

Thank you for the reply. I think I understand a little better. We have something similar built into MESSES, but it is strictly optional. If the user wants to validate more specific requirements then there are ways to do it, but we they have to opt into it. We don't have any defaults built into the package, but we do provide examples that users can start from or go by.

I like that you have put additional validation capabilities in, but I would have preferred them to be more optional and targeted. I am going to enumerate some suggestions here:

  1. Only have 1 set of configuration files for both the JSON and Tab validations. Having JSON configs for one and XML for another that essentially have the same information is needlessly doubling work.
  2. Rather than have a directory of configurations and trying to match technology and measurement type pairs to find a configuration to use, use proper names and add an option so the user can specify which configuration they are trying to match. For instance, if a user was trying to validate for a specific repository or experiment type they could do validate(config="cellcount_flowcytometry"), or something similar.
  3. Some way to easily not do the validations involving the configs. This would be a little complicated to normalize across JSON and Tab so that they have similar interfaces for doing this. The JSON validation is solely controlled by the "config_dir" parameter, but the Tab validation has "config_dir" and "rules". There are rules that use configs and rules that don't, so classifying them into those 2 categories would be helpful, and then some way to easily tell the validation not to run the rules that use configs would be nice.
  4. An easier way for a user to add to the configurations. Currently, a user would have to either add their own configuration to the site-package directly, or copy the configurations out of the package and add their own configuration to the copy. This isn't very robust. A new parameter such as "user_config_dir" that would combine the user configurations and defaults would be nice.

I don't think any of these harm your intent, and would make things easier for the user. I wouldn't mind attempting some of this in a PR, but I would prefer to have some confirmation that these are okay and hash out any specifics if necessary. These changes wouldn't be as small as most of the PRs I have submitted so far.

@proccaserra
Copy link
Member

@ptth222 all great suggestions/observations and thank you for offering contribution. best of organise a call to agree on scope. much appreciated

@ptth222
Copy link
Author

ptth222 commented Jan 28, 2024

Hunter's schedule is a little hectic right now, but we could meet at either of these times this week:
Tuesday Jan 30th 10:00 AM EST
Wednesday Jan 31st 12:00 PM EST

These are in our time, Eastern Standard. I'm not sure what that translates to for you. If those aren't great, maybe you can suggest a time or we might have to try for the week after.

@proccaserra
Copy link
Member

@ptth222 apologies for the belated reply. unfortunately, Jan 30 & 31 are no go for us. i'll follow with a doodle now.
atb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants