Attempting to add duplicate row #573

Closed
liu-congcong opened this issue Mar 8, 2024 · 4 comments
Labels
error (Help required for a GTDB-Tk error), next version (Upcoming feature/fix in staging branch)

Comments

@liu-congcong

Hello,

I annotated some bins with GTDB-Tk v2.3.2 and consistently hit the error below.
Is there a parameter I can set to skip it?

==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on gtdbtk-classify/classify/intermediate_results/pplacer/tree_1/user_msa_file.fasta.
==> Step 2 of 9: Pre-masking sequences.
[2024-03-08 10:52:54] INFO: Calculating RED values based on reference tree.
[2024-03-08 10:53:00] TASK: Traversing tree to determine classification method.
[2024-03-08 10:53:00] INFO: Completed 1 genome in 0.00 seconds (1,468.59 genomes/second).
[2024-03-08 10:53:00] TASK: Calculating average nucleotide identity using FastANI (v1.33).
[2024-03-08 10:53:07] INFO: Completed 36 comparisons in 6.71 seconds (5.37 comparisons/second).
[2024-03-08 10:53:08] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2024-03-08 10:53:09] ERROR: Attempting to add duplicate row: X01
[2024-03-08 10:53:09] ERROR: Controlled exit resulting from an unrecoverable error or warning.
liu-congcong added the error label on Mar 8, 2024
@liu-congcong
Author

gtdbtk

@liu-congcong
Author

To continue, I modified ClassifySummaryFile.add_row() to report the gid when the error occurs.
Hopefully this bug will be fixed in a later release.
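
The exact change is not shown in this thread. As a rough illustration only, a workaround of this kind could look like the sketch below. `SummaryFileSketch` is a hypothetical stand-in, not the real GTDB-Tk `ClassifySummaryFile`, and it assumes the table is keyed by genome ID and that keeping the first row is acceptable.

```python
import logging

logger = logging.getLogger("summary_sketch")


class SummaryFileSketch:
    """Hypothetical stand-in for a summary table keyed by genome ID (gid)."""

    def __init__(self):
        self.rows = {}        # gid -> row data
        self.duplicates = []  # gids that were added more than once

    def add_row(self, gid, row):
        """Add a row; report a duplicate gid instead of aborting the run."""
        if gid in self.rows:
            # Assumption: the first row is kept and the duplicate is only reported.
            logger.warning("Duplicate row skipped: %s", gid)
            self.duplicates.append(gid)
            return False
        self.rows[gid] = row
        return True
```

With this sketch, a second `add_row("X01", ...)` call returns `False` and records `X01` in `duplicates` rather than stopping the run. It hides the symptom only; the genome is still being reported by two code paths.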

@pchaumeil
Collaborator

Hello,
Thanks for the report.
Could you please send us a small input dataset that triggers this error so we can reproduce it on our end?
We will also need the command lines you are running.

Thanks,
Pierre

@liu-congcong
Author

Hi,

The fasta file is:

X1.fasta.gz

The commands are:

FASTAPATH=/path/to/fastas/
THREADS=100
PPLACERTHREADS=2
GTDBMASH=/path/to/gtdb.msh
GTDBTK=gtdbtk

${GTDBTK} identify --extension fasta --cpus ${THREADS} --genome_dir ${FASTAPATH} --out_dir gtdbtk-identify
${GTDBTK} align --cpus ${THREADS} --identify_dir gtdbtk-identify --out_dir gtdbtk-align
${GTDBTK} classify --extension fasta --cpus ${THREADS} --pplacer_cpus ${PPLACERTHREADS} --genome_dir ${FASTAPATH} --align_dir gtdbtk-align --mash_db ${GTDBMASH} --out_dir gtdbtk-classify

And the environment is:

[2024-03-09 01:23:18] INFO: GTDB-Tk v2.3.2
[2024-03-09 01:23:18] INFO: gtdbtk check_install
[2024-03-09 01:23:18] INFO: Using GTDB-Tk reference data version r214: /path/to/gtdb-214
[2024-03-09 01:23:18] INFO: Running install verification
[2024-03-09 01:23:18] INFO: Checking that all third-party software are on the system path:
[2024-03-09 01:23:18] INFO: |-- FastTree OK
[2024-03-09 01:23:18] INFO: |-- FastTreeMP OK
[2024-03-09 01:23:18] INFO: |-- fastANI OK
[2024-03-09 01:23:18] INFO: |-- guppy OK
[2024-03-09 01:23:18] INFO: |-- hmmalign OK
[2024-03-09 01:23:18] INFO: |-- hmmsearch OK
[2024-03-09 01:23:18] INFO: |-- mash OK
[2024-03-09 01:23:18] INFO: |-- pplacer OK
[2024-03-09 01:23:18] INFO: |-- prodigal OK
[2024-03-09 01:23:18] INFO: Checking integrity of reference package: /path/to/gtdb-214
[2024-03-09 01:23:19] INFO: |-- pplacer OK
[2024-03-09 01:23:19] INFO: |-- masks OK
[2024-03-09 01:23:20] INFO: |-- markers OK
[2024-03-09 01:23:20] INFO: |-- radii OK
[2024-03-09 01:23:36] INFO: |-- msa OK
[2024-03-09 01:23:36] INFO: |-- metadata OK
[2024-03-09 01:23:36] INFO: |-- taxonomy OK

Best,
Cong-Cong

pchaumeil added a commit that referenced this issue Apr 8, 2024
In some cases, when the three classify workflow steps (identify, align, classify) are run independently, a genome may be filtered out in the alignment step.
However, it is still present in the ANI screening of the classify step and can have an ANI > 95% (this can happen with partial genomes, where the alignment fraction (AF) can still be high).
GTDB-Tk would then try to report it twice in the summary file and return an error. Instead, we now report it as classified by ANI,
but with a warning from the alignment step (MSA < 10%).
skani should reduce the number of such cases, as it keeps AF low for partial genomes.
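
The behaviour described in this commit message can be sketched roughly as follows. This is not the code from the commit: the function name, the dictionary layout, and the field names (`method`, `reference`, `warnings`) are all hypothetical and only illustrate the idea of reporting each genome once and attaching the alignment warning to the ANI result.

```python
def merge_classifications(ani_hits, filtered_in_align):
    """Build one summary row per genome ID.

    ani_hits:          gid -> reference genome matched by the ANI screen
    filtered_in_align: gid -> reason the genome was dropped from the MSA
    (Both inputs are hypothetical structures used only for this sketch.)
    """
    summary = {}

    # Genomes classified by ANI get a single row. If the same genome was
    # also filtered out during alignment, keep the ANI result and attach
    # the alignment reason as a warning instead of adding a second row.
    for gid, reference in ani_hits.items():
        summary[gid] = {
            "method": "ani_screen",
            "reference": reference,
            "warnings": filtered_in_align.get(gid),
        }

    # Genomes that were only filtered out (no ANI hit) are reported as
    # unclassified, with the filter reason.
    for gid, reason in filtered_in_align.items():
        if gid not in summary:
            summary[gid] = {"method": None, "reference": None, "warnings": reason}

    return summary
```

For a genome such as `X01` that appears in both inputs, the result holds exactly one row, marked as classified by ANI and carrying the alignment warning.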
pchaumeil added a commit that referenced this issue Apr 10, 2024
pchaumeil added the next version label on Apr 10, 2024