New module: Hostile #2501
Thanks a lot for the contribution! I left a few comments. Also, the following needs to be added ...
for f in self.find_log_files("hostile", filehandles=True):
    self.parse_logs(f)
    self.add_software_version(...)
self.parse_data = self.ignore_samples(self.parse_data)
if len(self.parse_data) == 0:
    raise ModuleNoSamplesFound
log.info(f"Found {len(self.parse_data)} reports")
self.write_data_file(self.parse_data, "multiqc_hostile")
...
Please also create a PR in https://github.com/MultiQC/test-data with test examples.
Thank you for reviewing the code and all the suggestions.
multiqc/modules/hostile/hostile.py
Outdated
data = {}

for f_name, values in self.parse_data.items():
    s_name = values[0]["fastq1_in_name"].split(".")[0]
self.parse_data is already assumed to be indexed by sample names (i.e. self.ignore_samples takes that assumption), so we can't create a different sample name here. Better to move this line into parse_logs, and clean the name with self.clean_s_name() instead of manually calling split (the function knows about the extensions to remove).
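To illustrate why a manual split is fragile, here is a minimal standalone sketch. The helper strip_known_extensions and its extension list are hypothetical and written only for this example; they are not MultiQC's clean_s_name() implementation, which should be used in the module itself.

```python
# Hypothetical helper for illustration only; NOT MultiQC's clean_s_name().
# The extension list below is an assumption made for this example.
KNOWN_EXTENSIONS = (".gz", ".fastq", ".fq", ".clean")


def strip_known_extensions(file_name: str) -> str:
    """Remove known trailing extensions, keeping dots inside the sample name."""
    name = file_name
    stripped = True
    while stripped:
        stripped = False
        for ext in KNOWN_EXTENSIONS:
            # Never strip the whole name away
            if name.endswith(ext) and len(name) > len(ext):
                name = name[: -len(ext)]
                stripped = True
    return name


# A sample name that itself contains a dot
naive = "SAMPLE.A1.fastq.gz".split(".")[0]  # cuts at the first dot
cleaned = strip_known_extensions("SAMPLE.A1.fastq.gz")  # removes only extensions
print(naive, cleaned)
```

With a dotted sample name, the manual split keeps only the text before the first dot, while an extension-aware cleaner keeps the full name.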
for f_name, values in self.parse_data.items():
    database = os.path.basename(values[0]["index"])
    data[f_name] = {"Cleaned reads": values[0]["reads_out"], "Host reads": values[0]["reads_removed"]}
Removed that line.
multiqc/modules/hostile/hostile.py
Outdated
self.add_section(
    name="Reads Filtering",
    anchor="hostile-reads",
    description=f"This plot shows the number of cleaned reads vs host-reads per sample (database index: {database}).",
database is initialized inside the loop for each sample separately, and this line is outside of the loop. We need to decide how to handle the situation where samples have different databases.
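One possible way to handle this, sketched under assumptions: collect the distinct database names across all samples before building the description. The function name describe_databases is hypothetical, and the input shape (sample name mapped to a one-element list of entries with an "index" key) is taken from the snippets in this thread.

```python
import os


def describe_databases(parse_data: dict) -> str:
    """Summarise the database index (or indices) used across all samples."""
    # A set collapses samples that share the same database
    databases = sorted(
        {os.path.basename(entries[0]["index"]) for entries in parse_data.values()}
    )
    if not databases:
        return "database index: unknown"
    if len(databases) == 1:
        return f"database index: {databases[0]}"
    return "database indices: " + ", ".join(databases)


# Example input in the assumed shape
example = {
    "SAMPLE-A1": [{"index": "human-t2t-hla"}],
    "SAMPLE-A2": [{"index": "/refs/human-t2t-hla"}],
}
print(describe_databases(example))
```

This keeps the description correct whether all samples share one database or not, without referring to a loop variable outside its loop.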
Since Hostile does not allow screening against multiple databases in a single run, a batch of sequences will always be screened against the same database. However, I modified the code so that the database name is included in the category labels.
for f_name, values in self.parse_data.items():
    database = os.path.basename(values[0]["index"])
    data[f_name] = {f"Cleaned reads (DB: {database})": values[0]["reads_out"], f"Host reads (DB: {database})": values[0]["reads_removed"]}
# categories
all_categories = [inner_key for outer_dict in data.values() for inner_key in outer_dict.keys()]
cats = list(set(all_categories))
# cats = ["Cleaned reads", "Host reads"]
Does it make sense?
- description=f"This plot shows the number of cleaned reads vs host-reads per sample (database index: {database}).",
+ description=f"This plot shows the number of cleaned reads vs host-reads per sample (DB: Database).",
multiqc/modules/hostile/hostile.py
Outdated
    log.warning(f"Could not parse JSON file {json_file['f']}")
    return

if len(parse_data) > 0:
Can there be more than one entry in the JSON file? How should that be handled? Please add such an example to the test-data repo.
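If multiple entries did occur, one way to handle them is sketched below. This is an assumption-laden illustration, not the module's code: parse_hostile_json is a hypothetical name, and keying entries by fastq1_in_name is a guess based on the fields visible in the test data.

```python
import json
import logging

log = logging.getLogger(__name__)


def parse_hostile_json(text: str, file_name: str) -> dict:
    """Return {name: entry} for every entry found in a Hostile JSON report."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError:
        log.warning(f"Could not parse JSON file {file_name}")
        return {}
    # Tolerate a single object as well as a list of objects
    if not isinstance(entries, list):
        entries = [entries]
    parsed = {}
    for entry in entries:
        # Assumption: fastq1_in_name identifies the sample; fall back to the file name
        name = entry.get("fastq1_in_name", file_name)
        if name in parsed:
            log.warning(f"Duplicate entry for {name} in {file_name}, overwriting")
        parsed[name] = entry
    return parsed


two_entries = '[{"fastq1_in_name": "A.fastq.gz"}, {"fastq1_in_name": "B.fastq.gz"}]'
result = parse_hostile_json(two_entries, "hostile.json")
```

Each entry becomes its own record, so a file with several entries would not silently lose data. The names would still need cleaning before being used as sample names.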
No, there won't be more than one entry in the JSON file.
Deleted from the code:
- if len(parse_data) > 0:
multiqc/modules/hostile/hostile.py
Outdated
for f_name, values in self.parse_data.items():
    s_name = values[0]["fastq1_in_name"].split(".")[0]
    database = os.path.basename(values[0]["index"])
    data[s_name] = {"Cleaned reads": values[0]["reads_out"], "Host reads": values[0]["reads_removed"]}
Should reads_out and reads_removed sum up to reads_in? It doesn't look like they do in the test data:
cat test_data/data/modules/hostile/hostile.SAMPLE-A1.json
[
    {
        "version": "1.1.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla",
        "options": [],
        "fastq1_in_name": "SAMPLE-A1.fastq.gz",
        "fastq1_in_path": "SAMPLE-A1.fastq.gz",
        "fastq1_out_name": "SAMPLE-A1.clean.fastq.gz",
        "fastq1_out_path": "SAMPLE-A1.clean.fastq.gz",
        "reads_in": 241805,
        "reads_out": 234757,
        "reads_removed": 70484,
        "reads_removed_proportion": 0.2915
    }
]
Bar plots assume that categories do not overlap.
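The mismatch can be checked with a few lines of arithmetic on the values from the JSON above. The function name reads_are_consistent is hypothetical and only serves this example.

```python
def reads_are_consistent(entry: dict) -> bool:
    """True when kept and removed reads add up to the input reads."""
    return entry["reads_out"] + entry["reads_removed"] == entry["reads_in"]


# Values copied from hostile.SAMPLE-A1.json in the test data
entry = {"reads_in": 241805, "reads_out": 234757, "reads_removed": 70484}

# How many reads are counted twice if both categories are stacked
excess = entry["reads_out"] + entry["reads_removed"] - entry["reads_in"]
print(reads_are_consistent(entry), excess)  # prints: False 63436
```

Stacking the two categories for this sample would give a bar of 305241 reads, 63436 more than reads_in, which is why the test data needs to come from a real Hostile run.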
Yes, it should. I will update the count in the reports.
We should use real outputs from the tool in the test data. Can you run Hostile to generate some?
I pushed some updates, and with the test example MultiQC/test-data@2eee448, we are good to go - merging this PR now.
Thank you so much. I also added some more logs from a metagenomic sampling.