Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: update datavzrd to wrapper v3.10.2 datavzrd.smk #92

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

dlaehnemann
Copy link
Collaborator

this then uses datavzrd=2.36.10

@@ -93,6 +93,9 @@ enrichment:
# the species specified by resources -> ref -> species above
pathway_database: "panther"

report:
offer_excel: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
offer_excel: true
offer_excel: false

Depending on the size of the input data this might cause extremely long runtimes as the excel writer in rust requires to read all of the input data into memory before one can write it out:

https://github.com/datavzrd/datavzrd/blob/556feb7b6a39745b42900d27fbcad26ee050e158/src/render/portable/mod.rs#L1944-L1981

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for checking the template and for the clarification!

Did you already check for better options for xlsx file writing? Have you for example seen this one:
https://docs.rs/simple-xlsx-writer/latest/simple_xlsx_writer/

At least it doesn't read everything into memory. But not sure, if it will be any quicker and if the interface works for how you parse stuff in datavzrd...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is literally the one we are using right now 😄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Maybe switch to something else, then... 🤔

But yeah, I think this switch of the default here is good enough for now. I'll add a comment warning of the performance hit when activating this...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Maybe switch to something else, then... 🤔

I am not sure there is something that would allow something like flushing the writer every thousand records for example. But it'd great to have something like that for sure. Might even have to do with excel itself maybe?

But yeah, I think this switch of the default here is good enough for now. I'll add a comment warning of the performance hit when activating this...

Yeah a warning when manually activated sounds good!

based on this very approximate statement, but started with a lower estimate here: pachterlab/sleuth#139 (comment)

so we should increase this if we see this failing with out of memory on any real datasets
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants