Multiple Data Files of The Same Type Will Only Have 1 Name in Assay Conversion #509

Open
ptth222 opened this issue Nov 1, 2023 · 2 comments

ptth222 commented Nov 1, 2023

In a JSON to Tab conversion, if an assay contains 2 data files of the same type, only the last file's name will appear in both columns. For example, if you have Raw Data Files 'data_file1' and 'data_file2', only 'data_file2' will appear in both Raw Data File columns (assuming data_file2 is later in the process sequence).

Example to reproduce:

import json

from isatools.convert import json2isatab

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/json/BII-I-1/BII-I-1.json', 'r') as jsonFile:
    isa_example = json.load(jsonFile)
    
## Delete process sequence for transcriptome and replace it.
del isa_example["studies"][0]["assays"][2]["processSequence"]

protocol1 = {
    "@id": "#protocol/protocol1",
    "name": "protocol1",
}
protocol2 = {
    "@id": "#protocol/protocol2",
    "name": "protocol2",
}
protocol3 = {
    "@id": "#protocol/protocol3",
    "name": "protocol3",
}
isa_example["studies"][0]["protocols"].append(protocol1)
isa_example["studies"][0]["protocols"].append(protocol2)
isa_example["studies"][0]["protocols"].append(protocol3)


data_file1 = {
    "@id": "#data/data_file1",
    "name": "data_file1",
    "type": "Raw Data File"
}
data_file2 = {
    "@id": "#data/data_file2",
    "name": "data_file2",
    "type": "Raw Data File"
}
data_file3 = {
    "@id": "#data/data_file3",
    "name": "data_file3",
    "type": "Raw Data File"
}
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file1)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file2)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file3)

data_file4 = {
    "@id": "#data/data_file4",
    "name": "data_file4",
    "type": "Raw Data File"
}
data_file5 = {
    "@id": "#data/data_file5",
    "name": "data_file5",
    "type": "Raw Data File"
}
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file4)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file5)


new_process = [
    {
        "@id": "#process/protocol1",
        "executesProtocol": {"@id": "#protocol/protocol1"},
        "inputs": [{"@id": "#sample/sample-C-0.07-aliquot1"}],
        "outputs": [{"@id": "#data/data_file1"}],
        "nextProcess": {"@id": "#process/protocol2"}
    },
    {
        "@id": "#process/protocol2",
        "executesProtocol": {"@id": "#protocol/protocol2"},
        "inputs": [{"@id": "#data/data_file1"}],
        "outputs": [{"@id": "#data/data_file2"}],
        "previousProcess": {"@id": "#process/protocol1"},
        "nextProcess": {"@id": "#process/protocol3"}
    },
    {
        "@id": "#process/protocol3",
        "executesProtocol": {"@id": "#protocol/protocol3"},
        "inputs": [{"@id": "#data/data_file2"}],
        "outputs": [{"@id": "#data/data_file3"}],
        "previousProcess": {"@id": "#process/protocol2"},
    },
    {
        "@id": "#process/protocol1_1",
        "executesProtocol": {"@id": "#protocol/protocol1"},
        "inputs": [{"@id": "#sample/sample-C-0.07-aliquot2"}],
        "outputs": [{"@id": "#data/data_file4"}],
        "nextProcess": {"@id": "#process/protocol3_1"}
    },
    {
        "@id": "#process/protocol3_1",
        "executesProtocol": {"@id": "#protocol/protocol3"},
        "inputs": [{"@id": "#data/data_file4"}],
        "outputs": [{"@id": "#data/data_file5"}],
        "previousProcess": {"@id": "#process/protocol1_1"},
    }
]
isa_example["studies"][0]["assays"][2]["processSequence"] = new_process


with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json', 'w') as out_fp:
     json.dump(isa_example, out_fp, indent=2)

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json') as file_pointer:
    json2isatab.convert(file_pointer, 'C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing/', validate_first=False)

The above example modifies the "BII-I-1" example: I delete the transcriptome processSequence and replace it with a simpler one.

The issue appears to be in the isatools/isatab/dump/write.py file, in the write_assay_table_files function. It is similar to issue #500, where multiple data file type column names are not being tracked. I have adjusted the code so it tracks the names, and the file names now appear as expected. I created a PR, #510.
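To illustrate the kind of tracking involved (a hypothetical sketch, not the actual write.py code; `track_columns` is an illustrative name), repeated headers can be kept distinct by giving later occurrences a numeric suffix internally, similar to how pandas disambiguates duplicate column labels:

```python
# Hypothetical sketch: keep repeated data-file headers distinct so each
# file gets its own column instead of the last name overwriting the rest.
from collections import defaultdict

def track_columns(headers):
    """Return internal keys that keep repeated headers distinct."""
    counts = defaultdict(int)
    keys = []
    for header in headers:
        n = counts[header]
        # First occurrence keeps the plain header; repeats get ".1", ".2", ...
        keys.append(header if n == 0 else f"{header}.{n}")
        counts[header] += 1
    return keys

print(track_columns(["Sample Name", "Raw Data File", "Raw Data File"]))
# ['Sample Name', 'Raw Data File', 'Raw Data File.1']
```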

ptth222 added a commit to ptth222/isa-api that referenced this issue Nov 1, 2023

proccaserra commented Nov 16, 2023

@ptth222 Thank you for the PR.
However, it would only really work if the ISA-Tab reader and specification allowed it.

The following would be the expected way to represent more than one output of a 'data acquisition' event.

Assay Name | Raw Data File     | Protocol REF  | Data Transformation Name | Derived Data File
A1         | fwd_read.fastq.gz | normalization | DT1                      | deseq.tsv
A1         | rev_read.fastq.gz | normalization | DT1                      | deseq.tsv

What the PR does is generate the following output:

Assay Name | Raw Data File     | Raw Data File     | Protocol REF  | Data Transformation Name | Derived Data File
A1         | fwd_read.fastq.gz | rev_read.fastq.gz | normalization | DT1                      | deseq.tsv

This is not allowed and would require changing the ISA-Tab load component.

We now need to check the initial behavior and why only the last output file is kept. This will require adding new tests to the testing suite and possibly amending the parser.
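The expected long-format representation above can be sketched as follows (assumed names, not the isatools API): one row per raw data file, with the shared columns repeated, rather than extra "Raw Data File" columns added side by side.

```python
# Sketch of the long-format expansion described above: each raw data
# file becomes its own row, and the columns shared between the rows
# (protocol, transformation, derived file) are simply repeated.
def expand_rows(assay_name, raw_files, shared):
    return [{"Assay Name": assay_name, "Raw Data File": f, **shared}
            for f in raw_files]

rows = expand_rows(
    "A1",
    ["fwd_read.fastq.gz", "rev_read.fastq.gz"],
    {"Protocol REF": "normalization",
     "Data Transformation Name": "DT1",
     "Derived Data File": "deseq.tsv"},
)
for row in rows:
    print(row)
```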


ptth222 commented Dec 4, 2023

I made new commits to #510 to address what you said. I hope it is better.

I also discovered another issue while making these changes.

There are some inconsistencies between validation and the ProcessSequenceFactory that does the parsing. There is a defaults.py file in the isatab module that has a list of acceptable column headers, and these are imported for use in the ProcessSequenceFactory, but not in the validation. The validation often uses its own sets of column headers for each rule instead of pulling from defaults or some other unified source. I discovered this because the column name "Derived Data File" was causing a validation error that wouldn't let the conversion continue. This was in the load_table_checks function in the rules_40xx.py file, and I added "Derived Data File" to the list in that function. It might be worthwhile to unify the code so it pulls column headers from one place.
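The unification could look something like the following (a hypothetical sketch; the header names are real ISA-Tab headers, but the module layout and function name are illustrative, not the actual isatools defaults.py contents):

```python
# Hypothetical single source of truth for data-file column headers.
# Both the ProcessSequenceFactory and the validation rules would import
# and consult this one set instead of keeping their own lists.
DATA_FILE_HEADERS = frozenset({
    "Raw Data File",
    "Derived Data File",
    "Derived Spectral Data File",
    "Array Data File",
})

def is_known_data_file_header(header):
    # One shared check means the parser and validator cannot drift apart.
    return header in DATA_FILE_HEADERS

print(is_known_data_file_header("Derived Data File"))  # True
```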

ptth222 added a commit to ptth222/isa-api that referenced this issue Mar 17, 2024
Testing that the changes fix what was raised in ISA-tools#509.