Multiple Data Files of The Same Type Will Only Have 1 Name in Assay Conversion #509

Open
ptth222 opened this issue Nov 1, 2023 · 2 comments

ptth222 commented Nov 1, 2023

In a JSON to Tab conversion, if an assay contains 2 data files of the same type, only the last file's name will appear in both columns. For example, if you have Raw Data Files 'data_file1' and 'data_file2', only 'data_file2' will appear in both Raw Data File columns (assuming data_file2 is later in the process sequence).

Example to reproduce:

import json

from isatools.convert import json2isatab

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/json/BII-I-1/BII-I-1.json', 'r') as jsonFile:
    isa_example = json.load(jsonFile)
    
## Delete process sequence for transcriptome and replace it.
del isa_example["studies"][0]["assays"][2]["processSequence"]

protocol1 = {
    "@id": "#protocol/protocol1",
    "name": "protocol1",
}
protocol2 = {
    "@id": "#protocol/protocol2",
    "name": "protocol2",
}
protocol3 = {
    "@id": "#protocol/protocol3",
    "name": "protocol3",
}
isa_example["studies"][0]["protocols"].append(protocol1)
isa_example["studies"][0]["protocols"].append(protocol2)
isa_example["studies"][0]["protocols"].append(protocol3)


data_file1 = {
    "@id": "#data/data_file1",
    "name": "data_file1",
    "type": "Raw Data File"
}
data_file2 = {
    "@id": "#data/data_file2",
    "name": "data_file2",
    "type": "Raw Data File"
}
data_file3 = {
    "@id": "#data/data_file3",
    "name": "data_file3",
    "type": "Raw Data File"
}
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file1)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file2)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file3)

data_file4 = {
    "@id": "#data/data_file4",
    "name": "data_file4",
    "type": "Raw Data File"
}
data_file5 = {
    "@id": "#data/data_file5",
    "name": "data_file5",
    "type": "Raw Data File"
}
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file4)
isa_example["studies"][0]["assays"][2]["dataFiles"].append(data_file5)


new_process = [
    {
        "@id": "#process/protocol1",
        "executesProtocol": {"@id": "#protocol/protocol1"},
        "inputs": [{"@id": "#sample/sample-C-0.07-aliquot1"}],
        "outputs": [{"@id": "#data/data_file1"}],
        "nextProcess": {"@id": "#process/protocol2"}
    },
    {
        "@id": "#process/protocol2",
        "executesProtocol": {"@id": "#protocol/protocol2"},
        "inputs": [{"@id": "#data/data_file1"}],
        "outputs": [{"@id": "#data/data_file2"}],
        "previousProcess": {"@id": "#process/protocol1"},
        "nextProcess": {"@id": "#process/protocol3"}
    },
    {
        "@id": "#process/protocol3",
        "executesProtocol": {"@id": "#protocol/protocol3"},
        "inputs": [{"@id": "#data/data_file2"}],
        "outputs": [{"@id": "#data/data_file3"}],
        "previousProcess": {"@id": "#process/protocol2"},
    },
    {
        "@id": "#process/protocol1_1",
        "executesProtocol": {"@id": "#protocol/protocol1"},
        "inputs": [{"@id": "#sample/sample-C-0.07-aliquot2"}],
        "outputs": [{"@id": "#data/data_file4"}],
        "nextProcess": {"@id": "#process/protocol3_1"}
    },
    {
        "@id": "#process/protocol3_1",
        "executesProtocol": {"@id": "#protocol/protocol3"},
        "inputs": [{"@id": "#data/data_file4"}],
        "outputs": [{"@id": "#data/data_file5"}],
        "previousProcess": {"@id": "#process/protocol1_1"},
    }
]
isa_example["studies"][0]["assays"][2]["processSequence"] = new_process


with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json', 'w') as out_fp:
     json.dump(isa_example, out_fp, indent=2)

with open('C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing.json') as file_pointer:
    json2isatab.convert(file_pointer, 'C:/Users/Sparda/Desktop/Moseley Lab/Code/MESSES/isadatasets/BII-I-1_testing/', validate_first=False)

The above example modifies the "BII-I-1" example: I delete the transcriptome processSequence and replace it with a simpler one.

The issue appears to be in the isatools/isatab/dump/write.py file, in the write_assay_table_files function. It is similar to issue #500, where multiple data file type column names are not being tracked. I have adjusted the code so it tracks the names, and the file names now appear as expected. I created a PR, #510.
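To illustrate the kind of tracking involved (a hypothetical sketch, not the actual write.py code; `track_columns` is an illustrative name), repeated headers can be kept distinct by giving later occurrences a numeric suffix internally, similar to how pandas disambiguates duplicate column labels:

```python
# Hypothetical sketch: keep repeated data-file headers distinct so each
# file gets its own column instead of the last name overwriting the rest.
from collections import defaultdict

def track_columns(headers):
    """Return internal keys that keep repeated headers distinct."""
    counts = defaultdict(int)
    keys = []
    for header in headers:
        n = counts[header]
        # First occurrence keeps the plain header; repeats get ".1", ".2", ...
        keys.append(header if n == 0 else f"{header}.{n}")
        counts[header] += 1
    return keys

print(track_columns(["Sample Name", "Raw Data File", "Raw Data File"]))
# ['Sample Name', 'Raw Data File', 'Raw Data File.1']
```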

ptth222 added a commit to ptth222/isa-api that referenced this issue Nov 1, 2023

proccaserra commented Nov 16, 2023

@ptth222 Thank you for the PR.
However, it would only really work if the ISA-Tab reader and specification allowed it.

The following would be the expected way to represent more than one output of a 'data acquisition' event.

Assay Name | Raw Data File     | Protocol REF  | Data Transformation Name | Derived Data File
A1         | fwd_read.fastq.gz | normalization | DT1                      | deseq.tsv
A1         | rev_read.fastq.gz | normalization | DT1                      | deseq.tsv

What the PR does is generate the following output:

Assay Name | Raw Data File     | Raw Data File     | Protocol REF  | Data Transformation Name | Derived Data File
A1         | fwd_read.fastq.gz | rev_read.fastq.gz | normalization | DT1                      | deseq.tsv

This is not allowed and would require changing the ISA-Tab load component.

We now need to check the initial behavior and why only the last output file is kept. This will require adding new tests to the testing suite and possibly amending the parser.
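The expected long-format representation above can be sketched as follows (assumed names, not the isatools API): one row per raw data file, with the shared columns repeated, rather than extra "Raw Data File" columns added side by side.

```python
# Sketch of the long-format expansion described above: each raw data
# file becomes its own row, and the columns shared between the rows
# (protocol, transformation, derived file) are simply repeated.
def expand_rows(assay_name, raw_files, shared):
    return [{"Assay Name": assay_name, "Raw Data File": f, **shared}
            for f in raw_files]

rows = expand_rows(
    "A1",
    ["fwd_read.fastq.gz", "rev_read.fastq.gz"],
    {"Protocol REF": "normalization",
     "Data Transformation Name": "DT1",
     "Derived Data File": "deseq.tsv"},
)
for row in rows:
    print(row)
```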


ptth222 commented Dec 4, 2023

I made new commits to #510 to address what you said. I hope it is better.

I also discovered another issue while making these changes.

There are some inconsistencies between validation and the ProcessSequenceFactory that does the parsing. There is a defaults.py file in the isatab module that has a list of acceptable column headers, and these are imported for use in the ProcessSequenceFactory, but not in the validation. The validation often uses its own sets of column headers for each rule instead of pulling from defaults or some other unified source. I discovered this because the column name "Derived Data File" was causing a validation error that wouldn't let the conversion continue. This was in the load_table_checks function in the rules_40xx.py file, and I added "Derived Data File" to the list in that function. It might be worthwhile to unify the code so it pulls column headers from one place.
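The unification could look something like the following (a hypothetical sketch; the header names are real ISA-Tab headers, but the module layout and function name are illustrative, not the actual isatools defaults.py contents):

```python
# Hypothetical single source of truth for data-file column headers.
# Both the ProcessSequenceFactory and the validation rules would import
# and consult this one set instead of keeping their own lists.
DATA_FILE_HEADERS = frozenset({
    "Raw Data File",
    "Derived Data File",
    "Derived Spectral Data File",
    "Array Data File",
})

def is_known_data_file_header(header):
    # One shared check means the parser and validator cannot drift apart.
    return header in DATA_FILE_HEADERS

print(is_known_data_file_header("Derived Data File"))  # True
```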

ptth222 added a commit to ptth222/isa-api that referenced this issue Mar 17, 2024
Testing that the changes fix what was raised in ISA-tools#509.