Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test Data Validation Changes #558

Open
ptth222 opened this issue Mar 29, 2024 · 1 comment
Open

Test Data Validation Changes #558

ptth222 opened this issue Mar 29, 2024 · 1 comment

Comments

@ptth222
Copy link

ptth222 commented Mar 29, 2024

A few of the PRs I still have open and am working on have all ran into this issue, and I need some input from you guys to resolve it.

In tests/validators/test_validate_test_data.py there are tests like:

def test_validate_testdata_bii_s_3_isatab(self):
        test_case = 'BII-S-3'
        with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
            report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
                                     log_level=self._reporting_level)
            if len(report['errors']) > 0:
                self.fail("Error found when validating ISA tab: {}".format(report['errors']))

In #551 though many of them changed to:

def test_validate_testdata_bii_s_3_isatab(self):
        test_case = 'BII-S-3'
        with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
            report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
                                     log_level=self._reporting_level)
            if len(report['errors']) > 0:
                # self.fail("Error found when validating ISA tab: {}".format(report['errors']))
                self.assertTrue("Error found when validating ISA tab: {}".format(report['errors']))

I don't think this was a wise change. The tests changed from testing that the data had no errors, to testing that it does have errors, and doesn't specify the errors. I think a better way to handle this is either to fix the errors in that data or explicitly look for the errors in that data, so you can know if changes are adding or subtracting errors in the future. The reason this is causing me issues for my PRs is because I am trying to fix some of the validations so they are inline with other parts of the code and examples, but the code and examples don't align with each other.

If we could hash out a few things about what is correct I can change the validation and test data to be correct and restore these tests to how they used to be.

Here is an example, tests/data/tab/BII-I-1/a_proteome.txt:

Sample Name Protocol REF Extract Name Protocol REF Labeled Extract Name Label Term Source REF Term Accession Number MS Assay Name Comment[PRIDE Accession] Comment[PRIDE Processed Data Accession] Raw Spectral Data File Normalization Name Protein Assignment File Peptide Assignment File Post Translational Modification Assignment File Data Transformation Name Derived Spectral Data File Factor Value[limiting nutrient] Term Source REF Term Accession Number Factor Value[rate] Unit Term Source REF Term Accession Number
S-0.1-aliquot11 protein extraction S-0.1 ITRAQ labeling JC_S-0.1 iTRAQ reagent 117 8761 8761 8761 spectrum.mzdata norm1 proteins.csv peptides.csv ptms.csv datatransformation1 PRIDE_Exp_Complete_Ac_8761.xml sulphur 0.1 l/hr
C-0.1-aliquot11 protein extraction C-0.1 ITRAQ labeling JC_C-0.1 iTRAQ reagent 116 8761 8761 8761 spectrum.mzdata norm1 proteins.csv peptides.csv ptms.csv datatransformation1 PRIDE_Exp_Complete_Ac_8761.xml carbon 0.1 l/hr
N-0.1-aliquot11 protein extraction N-0.1 ITRAQ labeling JC_N-0.1 iTRAQ reagent 115 8761 8761 8761 spectrum.mzdata norm1 proteins.csv peptides.csv ptms.csv datatransformation1 PRIDE_Exp_Complete_Ac_8761.xml nitrogen 0.1 l/hr
S-0.1-aliquot11 protein extraction S-0.1 ITRAQ labeling Pool1 iTRAQ reagent 114 8761 8761 8761 spectrum.mzdata norm1 proteins.csv peptides.csv ptms.csv datatransformation1 PRIDE_Exp_Complete_Ac_8761.xml l/hr
C-0.1-aliquot11 protein extraction C-0.1 ITRAQ labeling Pool1 iTRAQ reagent 114 8761 8761 8761 spectrum.mzdata norm1 proteins.csv peptides.csv ptms.csv datatransformation1 PRIDE_Exp_Complete_Ac_8761.xml l/hr
N-0.1-aliquot11 protein extraction N-0.1 ITRAQ labeling Pool1 iTRAQ reagent 114 8761 8761 8761 spectrum.mzdata norm1 proteins.csv peptides.csv ptms.csv datatransformation1 PRIDE_Exp_Complete_Ac_8761.xml l/hr
C-0.2-aliquot11 protein extraction C-0.2 ITRAQ labeling JC_C-0.2 iTRAQ reagent 117 8762 8762 8762 spectrum.mzdata norm2 proteins.csv peptides.csv ptms.csv datatransformation2 PRIDE_Exp_Complete_Ac_8762.xml carbon 0.2 l/hr
N-0.2-aliquot11 protein extraction N-0.2 ITRAQ labeling JC_N-0.2 iTRAQ reagent 116 8762 8762 8762 spectrum.mzdata norm2 proteins.csv peptides.csv ptms.csv datatransformation2 PRIDE_Exp_Complete_Ac_8762.xml nitrogen 0.2 l/hr
P-0.1-aliquot11 protein extraction P-0.1 ITRAQ labeling JC_P-0.1 iTRAQ reagent 115 8762 8762 8762 spectrum.mzdata norm2 proteins.csv peptides.csv ptms.csv datatransformation2 PRIDE_Exp_Complete_Ac_8762.xml phosphorus 0.1 l/hr
C-0.2-aliquot11 protein extraction C-0.2 ITRAQ labeling Pool2 iTRAQ reagent 114 8762 8762 8762 spectrum.mzdata norm2 proteins.csv peptides.csv ptms.csv datatransformation2 PRIDE_Exp_Complete_Ac_8762.xml l/hr
N-0.2-aliquot11 protein extraction N-0.2 ITRAQ labeling Pool2 iTRAQ reagent 114 8762 8762 8762 spectrum.mzdata norm2 proteins.csv peptides.csv ptms.csv datatransformation2 PRIDE_Exp_Complete_Ac_8762.xml l/hr
P-0.1-aliquot11 protein extraction P-0.1 ITRAQ labeling Pool2 iTRAQ reagent 114 8762 8762 8762 spectrum.mzdata norm2 proteins.csv peptides.csv ptms.csv datatransformation2 PRIDE_Exp_Complete_Ac_8762.xml l/hr
P-0.2-aliquot11 protein extraction P-0.2 ITRAQ labeling JC_P-0.2 iTRAQ reagent 116 8763 8763 8763 spectrum.mzdata norm3 proteins.csv peptides.csv ptms.csv datatransformation3 PRIDE_Exp_Complete_Ac_8763.xml phosphorus 0.2 l/hr
S-0.2-aliquot11 protein extraction S-0.2 ITRAQ labeling JC_S-0.2 iTRAQ reagent 115 8763 8763 8763 spectrum.mzdata norm3 proteins.csv peptides.csv ptms.csv datatransformation3 PRIDE_Exp_Complete_Ac_8763.xml sulphur 0.2 l/hr
P-0.2-aliquot11 protein extraction P-0.2 ITRAQ labeling Pool3 iTRAQ reagent 117 8763 8763 8763 spectrum.mzdata norm3 proteins.csv peptides.csv ptms.csv datatransformation3 PRIDE_Exp_Complete_Ac_8763.xml l/hr
S-0.2-aliquot11 protein extraction S-0.2 ITRAQ labeling Pool3 iTRAQ reagent 117 8763 8763 8763 spectrum.mzdata norm3 proteins.csv peptides.csv ptms.csv datatransformation3 PRIDE_Exp_Complete_Ac_8763.xml l/hr
P-0.2-aliquot11 protein extraction P-0.2 ITRAQ labeling Pool3 iTRAQ reagent 114 8763 8763 8763 spectrum.mzdata norm3 proteins.csv peptides.csv ptms.csv datatransformation3 PRIDE_Exp_Complete_Ac_8763.xml l/hr
S-0.2-aliquot11 protein extraction S-0.2 ITRAQ labeling Pool3 iTRAQ reagent 114 8763 8763 8763 spectrum.mzdata norm3 proteins.csv peptides.csv ptms.csv datatransformation3 PRIDE_Exp_Complete_Ac_8763.xml l/hr

This has a few issues:

  1. "Protocol REF" columns appear to be missing in 3 places, before each of the "Name" columns. The preprocess function in ProcessSequenceFactory.py warns about this.
  2. The "Factor Value" columns are at the end instead of being after "Sample Name".
  3. There are completely empty "Term Source REF" and "Term Accession Number" columns.

Questions about these issues:

  1. Although missing "Protocol REF" columns will not cause the ProcessSequenceFactory to error, it adds the columns with "unknown" for the value, I think this should still be at least a warning in validation. I don't fully understand what is going on here. Was there a time where the "Name" columns was all that was necessary and you didn't need to specify the protocol? It would be nice to have some consensus about this and make the code and examples align with that.
  2. The ProcessSequenceFactory handles "Factor Values" in a unique way. It splits columns into object groups and then loops over the object groups to parse them into model objects. For things like "Characteristics" and "Parameters" it only looks for them within the object group, but for "Factor Values" it will look for them within the entire data frame. This behavior assumes or suggests that there can only be 1 "Sample Name" column since it adds all factors in the entire file every time it finds a "Sample Name" column. This is correct for assays, but can studies only have 1 "Sample Name" column? This ProcessSequenceFactory also suggests that "Factor Value" columns can be anywhere, but this makes validation more difficult. I would recommend requiring "Factor Values" to be after the "Sample Name" column just like "Characteristics" to make things more consistent.
  3. Completely empty "Term Source REF" and "Term Accession Number" columns doesn't cause any errors, but it kind of suggests that at one time they were required maybe? If we end up editing the data I would like to know whether we could just remove these or not.

I have also come across some other issues while investigating this that aren't illustrated in the example.

  1. Can there be multiple "Name" columns, such as "Assay Name"? If so then they need to be matched a little differently in some of the code. Currently, they are matched using == instead of startswith like "Protocol REF" columns are.
  2. Is order supposed to matter for "Term Source REF" and "Term Accession Number"? As well as "Unit", "Term Source REF", "Term Accession Number"? They do in the code, get_value used by ProcessSequenceFactory requires a specific order, but is it supposed to be that way? Also all of the columns must be present, you can't just have a "Unit" column for instance. I have written some preliminary code that changes both of these conditions. It would make it so if just "Unit" is present it still gets added and if they are in any order it still gets created.

If it is easier to meet and discuss this I am available.

@proccaserra
Copy link
Member

TODO:
change the test from:
self.assertTrue("Error found when validating ISA tab: {}".format(report['errors']))
to:
self.fail("Error found when validating ISA tab: {}".format(report['errors']))

and fix the test files in response to errors detected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants