Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output variability between runs/systems? #286

Open
jfy133 opened this issue Feb 16, 2024 · 19 comments
Open

Output variability between runs/systems? #286

jfy133 opened this issue Feb 16, 2024 · 19 comments

Comments

@jfy133
Copy link

jfy133 commented Feb 16, 2024

We are in the process in updating nf-core/taxprofiler to include ganon 2.0

During the establishment of CI tests we found there is slight variability in the output (where ganon 1 did not).

More specifically, @sofstam identified that while ganon v2 output compositionally was OK on our local machines, when the tests ran again but on GitHub Actions, the output appeared to not add up to 100% (99.64%)

Is this to be expected with the new update?

@sofstam can provide more info if necessary :)

@pirovc
Copy link
Owner

pirovc commented Feb 16, 2024

Thanks for adding ganon2 to the pipeline!

Is the difference in results between machines happening with the same data? It should indeed always add up to 100%. With access to the files and commands used I can gladly check what is the issue.

@jfy133
Copy link
Author

jfy133 commented Feb 16, 2024

Yes it was the same data I believe @sofstam can you confirm (and share)?

Note the compositionality issue was detected with taxpasta, so there is also that to check, but that literally just adds up the selected column

@sofstam
Copy link

sofstam commented Feb 16, 2024

Yes the data are the same. Tried to upload them but the file is too big. Can be found here.

ganon_GHA.zip Results by running ganon 2 through GHA
ganon_local.zip Results by running ganon 2 on my local machine

@pirovc
Copy link
Owner

pirovc commented Feb 16, 2024

Could you also send the .tax file of the database used?

@sofstam
Copy link

sofstam commented Feb 16, 2024

This is the db used.

test-db-ganon.zip

@pirovc
Copy link
Owner

pirovc commented Feb 19, 2024

I did some investigation. Using the reads 2612_pe against the provided database using the default parameters (v2.1.0): ganon classify -d test-db-ganon -p ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz -o 2612_pe.ganon

I get the following output (2612_pe.ganon.rep):

H1                   NC_012920.1   59  59  0  sequence  NC_012920.1
H1                   NC_001136.10  1   1   0  sequence  NC_001136.10
#total_classified    60                                 
#total_unclassified  632000 

From the files you sent me, ganon_local/db1/2612_pe.ganon.rep looks like:

H1                   NC_012920.1   118  59  0  sequence  NC_012920.1
H1                   NC_001136.10  2    1   0  sequence  NC_001136.10
#total_classified    120                                  
#total_unclassified  1264000 

The report above has inconsistencies, therefore creating the issues reported here. I have some clues for what is happening (probably repeated headers) but I need to replicate it to be able to identify it and fix it. Could you please confirm if this data is correct and tell me is the exact command that taxprofiler is running?

@sofstam
Copy link

sofstam commented Feb 20, 2024

Thank you for looking into that. Yes, the data is correct.

Regarding the command, I clone this branch and then I run:

nextflow run main.nf -profile hasta,dev_prio,test,singularity --run_ganon --outdir test_ganon_output

@pirovc
Copy link
Owner

pirovc commented Feb 20, 2024

Thanks. I actually meant what is the ganon command generated by the taxprofiler.

@jfy133
Copy link
Author

jfy133 commented Feb 20, 2024

@sofstam probably the .command.shfile in the work directory should suffice

@sofstam
Copy link

sofstam commented Feb 20, 2024

Sorry, I misread your comment above.

Both commands:

#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    classify \
    --db-prefix ${dbprefix%%.*ibf} \
     \
    --threads 2 \
    --output-prefix 2611_se_db1.ganon \
    --single-reads ERX5474930_ERR5766174_1.fa.gz         2>&1 | tee 2611_se_db1.ganon.log

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_CLASSIFY":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    report \
    --input 2611_se_db1.ganon.rep \
    --output-prefix 2611_se_db1.ganon_report \
    --db-prefix ${dbprefix%%.*ibf} \
    --report-type reads --ranks default --top-percentile 0 --min-count 0 --max-count 0

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_REPORT":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
END_VERSIONS

@pirovc
Copy link
Owner

pirovc commented Feb 20, 2024

Can you also send the command used for the specific run 2612_pe?

@sofstam
Copy link

sofstam commented Feb 20, 2024

#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    classify \
    --db-prefix ${dbprefix%%.*ibf} \
     \
    --threads 2 \
    --output-prefix 2612_pe_ERR5766176_db1.ganon \
    --paired-reads ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz         2>&1 | tee 2612_pe_ERR5766176_db1.ganon.log

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_CLASSIFY":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
END_VERSIONS
#!/bin/bash -euo pipefail
dbprefix=$(find -L . -name '*.*ibf' | sed 's/\.h\?ibf$//')

ganon \
    report \
    --input 2612_pe_ERR5766176_B_db1.ganon.rep \
    --output-prefix 2612_pe_ERR5766176_B_db1.ganon_report \
    --db-prefix ${dbprefix%%.*ibf} \
    --report-type reads --ranks default --top-percentile 0 --min-count 0 --max-count 0

cat <<-END_VERSIONS > versions.yml
"NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:GANON_REPORT":
    ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
END_VERSIONS

@pirovc
Copy link
Owner

pirovc commented Feb 21, 2024

Thanks for the commands.

Yes the data are the same. Tried to upload them but the file is too big. Can be found here.

ganon_GHA.zip Results by running ganon 2 through GHA ganon_local.zip Results by running ganon 2 on my local machine

From the files you sent, 2 have problems, one in the local (2612_pe_db1.ganon.rep) and the other in the GHA (2612_se_db1.ganon.rep). This means it's not a machine specific issue. Only by looking at the .rep file after ganon classify is possible to detect errors, the sum of columns 4 and 5 should match the #total_classified value reported at the end.

For the local (2612_pe_db1.ganon.rep) I could "replicate" the outcome by providing the input files twice ganon classify --db-prefix test-db-ganon --threads 2 --output-prefix 2612_pe_db1.ganon --paired-reads ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz ERX5474932_ERR5766176_1.fastq.gz ERX5474932_ERR5766176_2.fastq.gz. This runs fine but triggers a know bug in the reassignment process, where repeated headers are counted only once.

For the GHA (2612_se_db1.ganon.rep) I have no clue. The number of reads do not match the supposed input as well.

Before I go into any further debugging, can you double check if the fastq files used are as expected?

@sofstam
Copy link

sofstam commented Feb 21, 2024

Thanks for your reply.

We only updated the module and we did not change anything in taxprofiler. Is this what you mean if the fastq files are as expected?

@pirovc
Copy link
Owner

pirovc commented Feb 22, 2024

I mean to double check that all the inputs are correct, for example, the files are not corrupted or somehow duplicated.

If not, I would need some more information to try understand what is happening. Could you re-run the 2 problematic files I mentioned above (just ganon classify is enough) with --verbose and send me the logs?

@sofstam
Copy link

sofstam commented Feb 28, 2024

@pirovc
Copy link
Owner

pirovc commented Feb 28, 2024

Thanks, the logs are looking alright. Can I also have the .rep files generated from those runs? (2612_pe_ERR5766176_db1.ganon.rep and 2612_se_ERR5766180_db1.ganon.rep)

@sofstam
Copy link

sofstam commented Feb 28, 2024

Had to rename to .txt in order to upload them here.

2612_pe_ERR5766176_db1.ganon.txt

2612_se_ERR5766180_db1.ganon.txt

@pirovc
Copy link
Owner

pirovc commented Feb 28, 2024

Thanks again. I could not detect any issue in those files and reports generated with them will add up to 100%. Note that they are different from the ones you sent before.

The only way for me to try detect a possible bug would be to have the .rep and .log (with --verbose) from the files with issues, which are (based on those zips):

  • ganon_local/db1/2612_pe_db1.ganon.rep
  • ganon_GHA/ganon_GHA/db1/2612_se_db1.ganon.rep

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants