add pyani dnadiff subcommand #424

Open
kiepczi opened this issue Mar 19, 2024 · 5 comments

Comments

@kiepczi
Collaborator

kiepczi commented Mar 19, 2024

Summary:

Add a subcommand that uses the dnadiff approach to calculate ANI %ID and coverage.

Description:

Different methods can report slightly different values for the same comparison. In previous meetings with @widdowquinn, we agreed that it would be a good idea to add a dnadiff subcommand that replicates dnadiff's values for ANI %ID and coverage.

We previously attempted to replicate the values of AlignedBases and AverageIdentity given in the .report file returned by dnadiff. However, we were unable to do so solely by parsing the delta files due to differences in how they are processed by different programs (e.g., show-coords). One way of doing this would be to run dnadiff to obtain all necessary files (.coords and .rdiff), and calculate the values from them.
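As a rough illustration of what "calculate the values from them" could look like for the identity figure, here is a minimal sketch. It assumes a headerless, tab-delimited coords table (as produced by show-coords with `-T -H`) whose first seven columns are S1, E1, S2, E2, LEN1, LEN2 and %IDY, and it assumes that the average is weighted by the combined reference and query alignment lengths. The function name `weighted_identity` is hypothetical, and the weighting is my assumption rather than a confirmed description of what dnadiff does.

```python
import csv
from io import StringIO
from typing import TextIO


def weighted_identity(handle: TextIO) -> float:
    """Return the length-weighted mean %identity over all alignment rows.

    Assumes tab-delimited rows with LEN1, LEN2 and %IDY in columns 5-7
    (0-based indices 4, 5, 6). The weighting scheme is an assumption.
    """
    total_weight = 0
    total_identity = 0.0
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 7:
            continue  # skip blank or malformed lines
        len_ref, len_qry, pct_idy = int(row[4]), int(row[5]), float(row[6])
        weight = len_ref + len_qry
        total_weight += weight
        total_identity += pct_idy * weight
    if total_weight == 0:
        raise ValueError("no alignment rows found")
    return total_identity / total_weight


# Two made-up alignments: 1000 bp at 99% identity and 500 bp at 90% identity.
example = StringIO(
    "1\t1000\t1\t1000\t1000\t1000\t99.00\n"
    "2001\t2500\t3001\t3500\t500\t500\t90.00\n"
)
print(weighted_identity(example))
```

The example rows are invented for illustration only and are not taken from any of the test sets.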

@kiepczi
Collaborator Author

kiepczi commented Mar 20, 2024

I have been working on the dnadiff branch.

Our main issue is that dnadiff returns 9 files for a single pairwise comparison. We do not want to generate that many files, especially when they are not needed. After my investigation, I managed to replicate the values for AlignedBases and AvgIdentity by running only scripts/programs that generate what we need.

To replicate the results, we need to generate 4 command lines:

  1. nucmer to generate alignments with the --maxmatch parameter.
  2. delta-filter to generate many-to-many (M-to-M) alignments by passing the -m option.
    NOTE: Both commands 1 and 2 are different from the ones currently generated in the pyani anim subcommand.
  3. show-coords to generate the .mcoords file needed to calculate AlignedBases and AvgIdentity.
  4. show-diff to generate the .rdiff files needed for the AlignedBases calculation.
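
The four steps above can be sketched as command-line builders. This is only an outline under stated assumptions: `--maxmatch` and `-m` come from the list above, while the show-coords flags (`-rclTH`) and show-diff flags (`-rH`) are my assumption about what dnadiff itself passes and should be checked against the dnadiff script. The names `Command`, `build_commands` and `run_commands`, and the `A_vs_B` output naming, are hypothetical and not part of pyani.

```python
import subprocess
from pathlib import Path
from typing import List, NamedTuple, Optional


class Command(NamedTuple):
    argv: List[str]          # program and arguments
    stdout: Optional[Path]   # file that receives stdout, or None


def build_commands(reference: Path, query: Path, outdir: Path) -> List[Command]:
    """Build the four commands for one pairwise comparison (nothing is run)."""
    prefix = outdir / f"{reference.stem}_vs_{query.stem}"
    delta = Path(f"{prefix}.delta")      # written by nucmer via -p
    mdelta = Path(f"{prefix}.mdelta")
    mcoords = Path(f"{prefix}.mcoords")
    rdiff = Path(f"{prefix}.rdiff")
    return [
        # 1. All maximal matches, not only unique anchors.
        Command(["nucmer", "--maxmatch", "-p", str(prefix),
                 str(reference), str(query)], None),
        # 2. Many-to-many filtering; delta-filter writes to stdout.
        Command(["delta-filter", "-m", str(delta)], mdelta),
        # 3. Coordinates table (flags are an assumption, see above).
        Command(["show-coords", "-rclTH", str(mdelta)], mcoords),
        # 4. Reference-side differences (flags are an assumption, see above).
        Command(["show-diff", "-rH", str(mdelta)], rdiff),
    ]


def run_commands(commands: List[Command]) -> None:
    """Run each command in order, redirecting stdout to a file where needed."""
    for cmd in commands:
        if cmd.stdout is None:
            subprocess.run(cmd.argv, check=True)
        else:
            with open(cmd.stdout, "w") as handle:
                subprocess.run(cmd.argv, stdout=handle, check=True)
```

Calling `build_commands(Path("A.fna"), Path("B.fna"), Path("out"))` returns the four commands without executing anything; `run_commands` would need MUMmer installed and on the PATH. Only four output files are produced per comparison, rather than the nine that dnadiff writes.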

I have used the exact process implemented by dnadiff and successfully replicated the numbers for three separate test sets. You can find all the scripts and data here.

@widdowquinn widdowquinn added the enhancement something we'd like pyani to do that it doesn't already label Mar 20, 2024
@widdowquinn widdowquinn added this to To do in pyani via automation Mar 20, 2024
@widdowquinn widdowquinn added this to the 0.3.1 milestone Mar 20, 2024
@widdowquinn
Owner

Can we move the working branch for this to issue_424 to be consistent, please?

@widdowquinn
Owner

See note in #422 regarding implementation with show-coords/show-diff.

@widdowquinn
Owner

pyani dnadiff is something probably best reserved for the ground-up rebuild in pyani-plus - let's keep development for that project, not for this v0.3.

@kiepczi
Collaborator Author

kiepczi commented Mar 26, 2024

> Can we move the working branch for this to issue_424 to be consistent, please?

As requested, the working branch for this issue was moved to issue_424.
