Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elembio bases2fastq #2044

Open
wants to merge 60 commits into
base: main
Choose a base branch
from
Open

Elembio bases2fastq #2044

wants to merge 60 commits into from

Conversation

ewels
Copy link
Member

@ewels ewels commented Sep 14, 2023

We don't have permission to push directly to the source fork in #1990, so creating this branch + PR to continue work so that we can push minor changes directly to be able to finish the merge.

Please see #1990 for discussion and context.

yuhe.cheng62 and others added 30 commits August 15, 2023 13:28
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
@ewels
Copy link
Member Author

ewels commented Sep 15, 2023

Set scale: False and everything will work as you'd expect.

@vladsavelyev
Copy link
Member

vladsavelyev commented Sep 15, 2023

On more issue is that when run on all test inputs, it reports that no samples are found:

[2023-09-15 11:33:31] multiqc.modules.bases2fastq.bases2fastq            [ERROR  ]  More than one read length configurations are found in the dataset:150+150,75+75
[2023-09-15 11:33:31] multiqc.modules.bases2fastq.bases2fastq            [ERROR  ]  These runs have 150+150 read length configuration:DVT-0008__2d47,DVT-0013__04fd,DVT-0015__1fa3
[2023-09-15 11:33:31] multiqc.modules.bases2fastq.bases2fastq            [ERROR  ]  These runs have 75+75 read length configuration:DVT-0023__38a4,DVT-0019__85ac,DVT-0018__e1b7,DVT-0020__f923,DVT-0031__0e11,DVT-0030__8a89,DVT-0025__2b41,DVT-0024__17cf,DVT-0021__3f2c,DVT-0022__08c4
[2023-09-15 11:33:31] multiqc                                            [DEBUG  ]  No samples found: bases2fastq

Probably in this case it should raise an exception to be more explicit about the errors.

UPD: one for a separate PR (related to #2047)

@ewels ewels added this to the MultiQC v1.16 milestone Sep 15, 2023
Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@ewels ewels added the awaits-review Awaiting final review and merge. label Sep 18, 2023
@ewels
Copy link
Member Author

ewels commented Sep 21, 2023

@vladsavelyev - getting the same thing here as we just fixed in bclconvert, where the module exits with no results if there are mixtures of runs:

multiqc -f .

  /// MultiQC 🔍 | v1.16.dev0 (de0264c)

|           multiqc | Search path : /Users/ewels/GitHub/MultiQC/TestData/data/modules/bases2fastq
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 303/303
|       bases2fastq | Found 13 total RunStats.json
|       bases2fastq | More than one read length configurations are found in the dataset:150+150,75+75
|       bases2fastq | These runs have 150+150 read length configuration:DVT-0008__2d47,DVT-0013__04fd,DVT-0015__1fa3
|       bases2fastq | These runs have 75+75 read length configuration:DVT-0023__38a4,DVT-0019__85ac,DVT-0018__e1b7,DVT-0020__f923,DVT-0031__0e11,DVT-0030__8a89,DVT-0025__2b41,DVT-0024__17cf,DVT-0021__3f2c,DVT-0022__08c4
|           multiqc | No analysis results found. Cleaning up..
|           multiqc | MultiQC complete

Do you think we can refactor to be able to still generate a report successfully in this case?

@ewels ewels removed the awaits-review Awaiting final review and merge. label Sep 21, 2023
contents_re: ".*NumPolonies.*"
num_lines: 100
bases2fastq/group:
fn: "*_b2fgroup.csv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to have any test data for the group file, do we?
https://github.com/ewels/MultiQC_TestData/tree/master/data/modules/bases2fastq

Would it be possible to add some? @blajoie

Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did one last pass through and added a few changes:

  • Worked around the use of pandas which we don't install along with MultiQC,
  • Got rid of UserWarning. We want to have only one UserWarning per module, which indicates that no data is found. Otherwise, a module should ideally do its best to generate a report for whatever data it found. It doesn't seem like it depends on the read lengths to be identical across all samples, so we can get rid of the UserWarning as well.

What we are missing is some test data for the following:

bases2fastq/project:
  fn: "*_RunStats.json"
  contents_re: ".*NumPolonies.*"
  num_lines: 100
bases2fastq/group:
  fn: "*_b2fgroup.csv"

It would be nice to add it, otherwise a fair share of the module code is not tested at all.

@ewels
Copy link
Member Author

ewels commented Sep 22, 2023

@blajoie / @YuheCheng62 - we are close to getting the module merged now on the code side. However, the size of the test data is presenting some fairly significant problems.

For starters, it's bloated the MultiQC test data a lot, making all the CI checks run quite a lot slower. It's 141MB, which is 30% of the size of the entire test data repo - for one module out of 130 😓

Also the module runs very slowly. Probably the most comparable MultiQC module is bclconvert, and that runs in 0.34 seconds. But bases2fastq takes 20.56 seconds. That is really too slow, when one of the golden rules of MultiQC modules is the "Module should not not do any significant computational work". This is bad for simple stuff like CI tests, but also bad for people running MultiQC at scale.

@blajoie - we spoke on Slack about potentially being able to generate a different flavour of outputs, which are not huge files containing everything in one place. But instead having a lightweight *.bases2fastq-metrics.json file which would be appropriate for MutliQC. This would be:

  • Faster to parse
  • Smaller for the CI
  • Not require user-config overrides to push up the maximum log filesize

Is this a possibility in the near-ish future? I think that the report filesize may be a blocker for merging into MultiQC as it stands..

@ewels ewels added the waiting: changes Issue / PR is on hold, waiting for requested changes label Sep 22, 2023
@ewels ewels modified the milestones: MultiQC v1.16, MultiQC v1.17 Sep 22, 2023
@ewels ewels added the waiting: response Waiting for more information from user label Sep 26, 2023
@ewels ewels modified the milestones: MultiQC v1.17, MultiQC v1.18 Oct 16, 2023
@blajoie
Copy link

blajoie commented Nov 6, 2023

Sorry all - had to pivot to some other high priority work. Hoping to come back to this soon to finish this off!

Appreciate your patience on this ! :)

@vladsavelyev vladsavelyev removed this from the MultiQC v1.18 milestone Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting: changes Issue / PR is on hold, waiting for requested changes waiting: response Waiting for more information from user
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants