New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new module: sincei #1946
base: main
Are you sure you want to change the base?
new module: sincei #1946
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the contribution!
First, a minor comment, I'd try to refactor to not to assume the order for columns (i.e. indexing them with numbers like cols[0]
, cols[1]
). Since the header is available, I'd rather use csv
to parse the file reader = csv.DictReader(open(filename), delimiter=",")
and index with the column name (e.g. cols['Cell_ID']
, row["# Reads"]
, etc), to make it more robust to future changes of columns and their order. And probably use same column name in MultiQC table whenever reasonable.
A bigger concern is that I feel like using cells as samples is not what MultiQC was designed for. The generated beeswarm plot already hangs my browser even on the test data (I know the beeswarm is really inefficient at the first place and the problem will probably go away with a future new plotting library; but not until we add more data). MultiQC is designed to work with summarised per-sample metrics, and I don't think cells can be counted as samples, as there are way too many of them, and feels too raw to be fed into MultiQC, I think.
It would be good if more summarised per-samples stats can be produced here, e.g. with some different tool downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @vivekbhr! As @vladsavelyev, I think that we need to rethink what data goes into the report here. The docs have a section called Don’t add everything which I think applies. MultiQC should only summarise, but here we're really just reporting on raw data.
The TSV files have a column called sample
(eg. sortChIC-BM-SL1-k4me1-1
) so maybe we could summarise stats for all cells under a given sample identifier?
Thanks a lot for the review and comments @vladsavelyev and @ewels . I understand what you say, I experienced the slow response on the output report myself when using bigger data, so summarizing per-sample stats, such as a sample-wise range and median value for each field could be a better way. Do you think Its OK to do these calculations from the output .tsv file in the multiqc module, or are you saying that this output should already be reported by sincei? I think calculating sample-wise range and median should not be a big overhead for the multiqc module. |
I think it should be totally alright to summarise it in MultiQC! |
@vladsavelyev @ewels I've finally found time to complete the changes as requested. The multiqc module now reports sample summaries (median values), instead of per-cell summary. Also, the column headers are used instead of integer positions. Other minor comments are also resolved. I hope the changes are satisfactory now. Best, |
@vladsavelyev @ewels I have made all the requested changes, but the merging is still blocked with "changes requested". Please let me know if there are still unresolved issues. Thanks! |
Thanks @vivekbhr! We'll get back to this to review the updated code as soon as we can. |
@ewels @vladsavelyev Did you find some time at the end to review the updates? |
Added sincei module to summarise read and count-level single-cell QC metrics from sincei
CHANGELOG.md
has been updated--lint
flag)docs/README.md
is updated with link to belowdocs/modulename.md
is createdself.add_section
)