Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New module : Glimpse #2492

Open
wants to merge 23 commits into
base: main
Choose a base branch
from
Open

New module : Glimpse #2492

wants to merge 23 commits into from

Conversation

LouisLeNezet
Copy link

@LouisLeNezet LouisLeNezet commented Apr 16, 2024

Here is my first contribution to MultiQC to add support for the GLIMPSE2_concordance analysis log.

  • This comment contains a description of changes (with reason)
  • There is example tool output for tools in the https://github.com/MultiQC/test-data repository or attached to this PR
  • Code is tested and works locally (including with --strict flag)
  • docs/modulename.md is created
  • Everything that can be represented with a plot instead of a table is a plot
  • Report sections have a description and help text (with self.add_section)
  • There aren't any huge tables with > 6 columns (explain reasoning if so)
  • Each table column has a different colour scale to its neighbour, which relates to the data (e.g. if high numbers are bad, they're red)
  • Module does not do any significant computational work

@ewels
Copy link
Member

ewels commented Apr 29, 2024

@LouisLeNezet - I had a quick first pass over the code without thinking about what it's doing, just trying to get the CI tests to pass. I had to do a bit of reformatting and fixing some imports / some broken bits here and there. Please take a look and sanity check that it still works as expected.

Once the tests are passing we'll come back to this and start the proper review ASAP.

@ewels
Copy link
Member

ewels commented Apr 29, 2024

Ok still some linting not working. Please check the tests and also the docs. Specifically:

@LouisLeNezet
Copy link
Author

Hi @ewels,

I'm sorry, I didn't meant for you to review it yet.
The tests data wasn't yet added to the repo so I couldn't check if it was working as expected.
Could you add the test data for me to work on this PR and make it work ?
Thanks a lot for all your troubleshooting, I'll look into it (first time contributing to this wonderfull MultiQC).

Thanks again,
Louis

@ewels
Copy link
Member

ewels commented Apr 29, 2024

Ah ok, no worries. For next time, please reference the test data PR so that it's obvious that it's there. If you're still working on a PR you can mark it as a draft and we'll know that it's not ready for review yet.

@ewels ewels marked this pull request as draft April 29, 2024 07:22
@LouisLeNezet LouisLeNezet marked this pull request as ready for review May 2, 2024 15:24
@LouisLeNezet
Copy link
Author

LouisLeNezet commented May 2, 2024

Hi @ewels,

I've normally sort out the differents missing informations.
There isn't anymore formatting error.
On the other hand, multiqc doesn't seems to detect the module.
Even with a log.info("Test") in the _init_() of the glimpse.py nothing happens.
I've followed the nice documentation you've sent, but I do not see what I've missed.
Could you check ?
Also for such troubleshooting, is it better to ask on the MultiQC forum ?

Thanks a lot !

@LouisLeNezet LouisLeNezet requested a review from ewels May 3, 2024 07:37
@LouisLeNezet LouisLeNezet changed the title Add support for Glimpse New module : Glimpse May 3, 2024
@ewels
Copy link
Member

ewels commented May 3, 2024

The forum is more for usage questions. Anything to do with active pull-requests or reporting bugs / feature requests should happen on GitHub please 🙏🏻

@LouisLeNezet
Copy link
Author

LouisLeNezet commented May 3, 2024

Hi @ewels,
I think the problem come from the extension that I'm looking for.
The files given by Glimpse_concordance are compressed txt files.
When I set the pattern recognition to "*" in the yml and look at the file present in glimpse/glimpse2.0.0 in the multiqc data, only the README.md is showed.
I guess that at some point the txt.gz are removed from multiqc.
What can I do ?

If I work unzip the files they are recognized and I can parse them.
Another problematic I'm facing is that the data are separated for *.error.spl.txt.gz by variants and I wanted to allow the user to check for SNP, InDels or Both.
I thought that parsing the data in a nested dictionnary would be a good solution to parse everything in one go.
But it doesn't seems to work quite well after when I need to plot the data.
What's the recommended way to go for such data ?

{ Sample1: {
    SNP : { var1: val1, ...},
    InDels: {},
    Both: {}
}, Sample2: {}
} 

@LouisLeNezet
Copy link
Author

I've improved the Glimpse module but I'm still facing two problems:

  • The gz extension seems to not be recognized by multiqc. Is is something wanted ?
  • The "xlog" parameter does not seem to works either on the linegraph. Is there another parameter to set for it to work ?
    Thanks again 😄

@vladsavelyev
Copy link
Member

vladsavelyev commented May 6, 2024

*.txt.gz files are ignored by MultiQC by dfault: https://github.com/MultiQC/MultiQC/blob/main/multiqc/config_defaults.yaml#L355

@LouisLeNezet, I assume *.txt.gz is the extension that Glimpse sets to its output files automatically?

@ewels, do we have an approach for such situations? I can't think of anything from the top of my head rather than remove .txt.gz from config defaults, which is not a good idea.

@vladsavelyev vladsavelyev added this to the MultiQC v1.23 milestone May 6, 2024
@ewels
Copy link
Member

ewels commented May 7, 2024

Hi @LouisLeNezet,

As @vladsavelyev says, .txt.gz files are usually used for large data outputs. This kind of file isn't usually suitable for MultiQC, we want to be able to scale to a million files and opening / extracting a million large files is slow.

We can probably make it work with the .gz files but only as a last resort. Is Glimpse generating these directly? Does it have any other formats? How large can these files get with real data?

@LouisLeNezet
Copy link
Author

Hi @ewels ,

Is Glimpse generating these directly? Does it have any other formats?

Unfortunately the Glimpse tool set the file to .txt.gz format and there is no option to choose something else.
I've open a issue on their repository for that.

How large can these files get with real data?

I think only the .r2_sites.txt.gz will become big on real data. Moreover this file is optional (concordance rate for each SNP) and I don't intend to add support for this file in MultiQC.

I could set the search pattern to only .txt extensions and ask the user to decompress the files before running multiqc ?
What do you think ?

@LouisLeNezet
Copy link
Author

Need the new data PR327.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants