New module: Hostile #2501

Merged
merged 17 commits into from May 2, 2024

Conversation

SumeetTiwari07
Contributor

@SumeetTiwari07 SumeetTiwari07 commented Apr 23, 2024

  • This comment contains a description of changes (with reason)
  • There is example tool output for tools in the https://github.com/MultiQC/test-data repository or attached to this PR
  • Code is tested and works locally (including with --strict flag)
  • docs/modulename.md is created
  • Everything that can be represented with a plot instead of a table is a plot
  • Report sections have a description and help text (with self.add_section)
  • There aren't any huge tables with > 6 columns (explain reasoning if so)
  • Each table column has a different colour scale to its neighbour, which relates to the data (e.g. if high numbers are bad, they're red)
  • Module does not do any significant computational work

Contributor Author

@SumeetTiwari07 SumeetTiwari07 left a comment

""

@vladsavelyev
Member

Thanks a lot for the contribution!

Left a few comments.

You also need to add calls to self.ignore_samples (see https://multiqc.info/docs/development/modules/#filtering-by-parsed-sample-names) and self.add_software_version (see https://multiqc.info/docs/development/modules/#saving-version-information):

        ...
        for f in self.find_log_files("hostile", filehandles=True):
            self.parse_logs(f)

        self.add_software_version(...)

        self.parse_data = self.ignore_samples(self.parse_data)

        if len(self.parse_data) == 0:
            raise ModuleNoSamplesFound

        log.info(f"Found {len(self.parse_data)} reports")

        self.write_data_file(self.parse_data, "multiqc_hostile")
        ...

@vladsavelyev
Member

Please also create a PR in https://github.com/MultiQC/test-data with test examples.

@SumeetTiwari07
Contributor Author

SumeetTiwari07 commented Apr 26, 2024

Thank you for reviewing the code and all the suggestions.
I made the recommended changes and updated the code as suggested (069e1c6, 51f61da).

I also added the test data (MultiQC/test-data#319).

data = {}

for f_name, values in self.parse_data.items():
    s_name = values[0]["fastq1_in_name"].split(".")[0]

Member

self.parse_data is already assumed to be indexed by sample name (self.ignore_samples relies on that assumption), so we can't create a different sample name here. It would be better to move this line into parse_logs and to clean the name with self.clean_s_name() instead of calling split manually (the function knows which extensions to remove).
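
To illustrate why a manual split is fragile, here is a minimal standalone sketch. The helper strip_known_suffixes and its suffix list are hypothetical stand-ins written for this example; they are not MultiQC's clean_s_name(), which applies its own configured list of extensions.

```python
# Hypothetical stand-in for illustration only. In a real module the cleaning
# would be done by self.clean_s_name(); this suffix list is an assumption.
KNOWN_SUFFIXES = (".gz", ".fastq", ".fq", ".clean")


def strip_known_suffixes(name: str) -> str:
    """Repeatedly strip trailing suffixes that appear in KNOWN_SUFFIXES."""
    changed = True
    while changed:
        changed = False
        for suffix in KNOWN_SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                changed = True
    return name


fastq_name = "patient.01.fastq.gz"  # made-up file name containing an extra dot

# Splitting on the first dot drops part of the sample name.
naive = fastq_name.split(".")[0]

# Stripping only known extensions keeps the full sample name.
cleaned = strip_known_suffixes(fastq_name)
```

With this made-up file name, the naive split yields "patient" while suffix stripping yields "patient.01", so two samples named patient.01 and patient.02 would collide under the naive approach.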

Contributor Author

for f_name, values in self.parse_data.items():
    database = os.path.basename(values[0]["index"])
    data[f_name] = {"Cleaned reads": values[0]["reads_out"], "Host reads": values[0]["reads_removed"]}

Removed that line.

self.add_section(
    name="Reads Filtering",
    anchor="hostile-reads",
    description=f"This plot shows the number of cleaned reads vs host-reads per sample (database index: {database}).",

Member

database is assigned inside the loop, separately for each sample, but this line is outside the loop, so it only sees the value from the last sample. We need to decide how to handle the case where samples have different databases.
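
One possible way to handle this, sketched outside of MultiQC: collect the set of database names across all samples and build the description from that set. The function describe_databases and the sample data below are hypothetical, made up for this example.

```python
import os

# Hypothetical parsed data keyed by sample name; paths and counts are made up.
parsed = {
    "SAMPLE-A1": {"index": "/refs/human-t2t-hla", "reads_out": 10, "reads_removed": 2},
    "SAMPLE-B2": {"index": "/refs/human-t2t-hla", "reads_out": 7, "reads_removed": 1},
}


def describe_databases(parsed_data: dict) -> str:
    """Summarise every database seen across samples, not just the last one."""
    databases = sorted({os.path.basename(v["index"]) for v in parsed_data.values()})
    if len(databases) == 1:
        return f"database index: {databases[0]}"
    return "database indexes: " + ", ".join(databases)


description_suffix = describe_databases(parsed)
```

This keeps the description correct whether the report covers one database or several, without relying on a loop variable that leaks out of the loop.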

Contributor Author

Hostile does not allow screening against multiple databases in a single run, so a batch of sequences will always be screened against the same database. However, I modified the code so that the database name is included in the category names.

        for f_name, values in self.parse_data.items():
            database = os.path.basename(values[0]["index"])
            data[f_name] = {f"Cleaned reads (DB: {database})": values[0]["reads_out"], f"Host reads (DB: {database})": values[0]["reads_removed"]}

        ## categories
        all_categories = [inner_key for outer_dict in data.values() for inner_key in outer_dict.keys()]
        cats = list(set(all_categories))
        #cats = ["Cleaned reads", "Host reads"]
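
A side note on the category list above: list(set(...)) does not guarantee a stable order for strings across interpreter runs, so the bar order could change between reports. A minimal sketch of an order-preserving alternative, using hypothetical data shaped like the dict built in the loop:

```python
# Hypothetical per-sample data; the counts are made up for illustration.
data = {
    "SAMPLE-A1": {
        "Cleaned reads (DB: human-t2t-hla)": 900,
        "Host reads (DB: human-t2t-hla)": 100,
    },
    "SAMPLE-B2": {
        "Cleaned reads (DB: human-t2t-hla)": 450,
        "Host reads (DB: human-t2t-hla)": 50,
    },
}

# dict.fromkeys keeps the first occurrence of each key, in insertion order,
# so the categories are de-duplicated without being shuffled.
cats = list(dict.fromkeys(key for sample in data.values() for key in sample))
```

Here cats always lists the cleaned-reads category before the host-reads category, matching the order in which they were inserted.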

Contributor Author

Does it make sense?

Contributor Author

Suggested change
description=f"This plot shows the number of cleaned reads vs host-reads per sample (database index: {database}).",
description=f"This plot shows the number of cleaned reads vs host-reads per sample (DB: Database).",

log.warning(f"Could not parse JSON file {json_file['f']}")
return

if len(parse_data) > 0:

Member

Can there be more than one entry in the JSON file? How should that be handled? Please add such an example to the test-data repo.

Contributor Author

No, there won't be more than one entry in the JSON file.
I deleted the if len(parse_data) > 0: check from the code.

Suggested change
if len(parse_data) > 0:
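
Even if Hostile is expected to write a single entry per file, a parser can still guard against other shapes. The following is a minimal standalone sketch, not the code merged in this PR; the function name parse_hostile_json and its source parameter are hypothetical.

```python
import json
import logging

log = logging.getLogger(__name__)


def parse_hostile_json(text, source="<unknown>"):
    """Return the single report entry as a dict, or None if it is unusable."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError:
        log.warning(f"Could not parse JSON file {source}")
        return None

    # The report shown in this thread is a JSON list holding one object.
    if not isinstance(entries, list) or len(entries) == 0:
        log.warning(f"No entries found in {source}")
        return None

    if len(entries) > 1:
        # Assumption: if this ever happens, keep the first entry and warn.
        log.warning(f"{source} has {len(entries)} entries, using the first")

    return entries[0]
```

Returning None for unusable input lets the caller skip the file, and the warning makes an unexpected multi-entry file visible instead of silently ignoring the extra entries.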

for f_name, values in self.parse_data.items():
    s_name = values[0]["fastq1_in_name"].split(".")[0]
    database = os.path.basename(values[0]["index"])
    data[s_name] = {"Cleaned reads": values[0]["reads_out"], "Host reads": values[0]["reads_removed"]}

Member

Should reads_out and reads_removed sum to reads_in? They don't appear to in the test data:

cat test_data/data/modules/hostile/hostile.SAMPLE-A1.json
[
    {
        "version": "1.1.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "SAMPLE-A1.fastq.gz",
        "fastq1_in_path": "SAMPLE-A1.fastq.gz",
        "fastq1_out_name": "SAMPLE-A1.clean.fastq.gz",
        "fastq1_out_path": "SAMPLE-A1.clean.fastq.gz",
        "reads_in": 241805,
        "reads_out": 234757,
        "reads_removed": 70484,
        "reads_removed_proportion": 0.2915
    }
]

Bar plots assume that categories do not overlap.
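
Plugging the numbers from the JSON above into a quick check makes the mismatch concrete. This is only an illustrative sketch; the helper counts_are_consistent is a hypothetical name, not part of the module.

```python
# Counts copied from the hostile.SAMPLE-A1.json example in this thread.
entry = {
    "reads_in": 241805,
    "reads_out": 234757,
    "reads_removed": 70484,
    "reads_removed_proportion": 0.2915,
}


def counts_are_consistent(e: dict) -> bool:
    """True when kept and removed reads add up to the input reads."""
    return e["reads_out"] + e["reads_removed"] == e["reads_in"]


# reads_out value that would make the two categories non-overlapping
implied_reads_out = entry["reads_in"] - entry["reads_removed"]

# Proportion recomputed from the counts, rounded like the reported value
recomputed_proportion = round(entry["reads_removed"] / entry["reads_in"], 4)
```

For this entry the check fails: reads_out plus reads_removed is 305241, not 241805. The reported proportion of 0.2915 does match reads_removed / reads_in, which suggests that reads_out is the inconsistent field; a value of 171321 would make the counts add up.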

Contributor Author

Yes, they should. I will update the counts in the reports.

Contributor Author

Pull request raised to change the counts.

Member

We should use real outputs from the tool in the test data. Can you run Hostile to generate some?

@vladsavelyev vladsavelyev added the waiting: response Waiting for more information from user label May 1, 2024
@vladsavelyev vladsavelyev self-requested a review May 2, 2024 10:42
Member

@vladsavelyev vladsavelyev left a comment

I pushed some updates, and with the test example MultiQC/test-data@2eee448, we are good to go - merging this PR now.

@vladsavelyev vladsavelyev added this to the MultiQC v1.22: Pydantic milestone May 2, 2024
@vladsavelyev vladsavelyev changed the title New Module: Hostile New module: Hostile May 2, 2024
@SumeetTiwari07
Contributor Author

I pushed some updates, and with the test example MultiQC/test-data@2eee448, we are good to go - merging this PR now.

Thank you so much. I also added some more logs from a metagenomic sampling.

@vladsavelyev vladsavelyev merged commit ba3522b into MultiQC:main May 2, 2024
4 checks passed
Labels
module: new waiting: response Waiting for more information from user

2 participants