Use --noextend in NUCmer as a rule #342

Open
baileythegreen opened this issue Sep 30, 2021 · 28 comments

Labels: bug (something isn't working how it should), method (the issue relates to how results are calculated), VERSION_3 (issues relating to version 0.3.x of pyani)

@baileythegreen

Summary:

Make --noextend the default NUCmer behaviour for v0.3, and implement an --extend option for pyani anim that allows us to use pyani compare to see what the damage is likely to have been under the assumption above.

Description:

Issue #340 pointed out that there can be overlapping alignments under the current NUCmer job submission, which is not ideal.

baileythegreen added the bug, method, and VERSION_3 labels Sep 30, 2021
@baileythegreen

Thoughts on implementation:

  • --noextend should be the default; it should not be an error to submit pyani anim … --noextend
  • it should also be possible to submit pyani anim … --extend
  • pyani anim … --noextend --extend is nonsensical and should error

a mutually-exclusive argument group seems like the best way to do this.
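
Something like this would do it (a minimal sketch only; the option and dest names are illustrative, not necessarily what ends up in the pyani parser):

    # Sketch: --noextend is the default; --extend and --noextend are
    # mutually exclusive; restating the default is not an error.
    import argparse

    parser = argparse.ArgumentParser(prog="pyani anim")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--extend", dest="extend", action="store_true",
                       help="extend alignments (MUMmer's own default)")
    group.add_argument("--noextend", dest="extend", action="store_false",
                       help="do not extend alignments (proposed pyani default)")
    parser.set_defaults(extend=False)

    args = parser.parse_args(["--noextend"])  # fine: restates the default
    # parser.parse_args(["--noextend", "--extend"])  # argparse errors out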

@baileythegreen

baileythegreen commented Sep 30, 2021

See first attempt on branch issue_342. (I shouldn't have done this at midnight, but I did.)

I've not yet updated any tests, but I have run this locally, and the CLI is working as I would expect; i.e., the default behaviour is --noextend; the points I mention above are met; and it correctly recognises the uniqueness of new runs that incorporate this option.

@baileythegreen

This is going to require some more serious work on the tests (which I will do at more respectable hours):

! pytest -v
=============================================================== test session starts ================================================================
platform darwin -- Python 3.8.3, pytest-5.3.5, py-1.10.0, pluggy-0.13.1 -- /Users/baileythegreen/Software/miniconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/baileythegreen/Software/pyani, inifile: pytest.ini
plugins: dash-1.14.0, cov-2.11.1, ordering-0.6
collected 100 items                                                                                                                                

tests/test_anib.py::test_get_version_no_exe PASSED                                                                                           [  1%]
tests/test_anib.py::test_get_version_exe_not_executable PASSED                                                                               [  2%]
tests/test_anib.py::test_get_version_exe_no_version PASSED                                                                                   [  3%]
tests/test_anib.py::test_blastall_dbjobdict PASSED                                                                                           [  4%]
tests/test_anib.py::test_blastall_graph PASSED                                                                                               [  5%]
tests/test_anib.py::test_blastall_multiple PASSED                                                                                            [  6%]
tests/test_anib.py::test_blastall_single PASSED                                                                                              [  7%]
tests/test_anib.py::test_blastn_dbjobdict PASSED                                                                                             [  8%]
tests/test_anib.py::test_blastn_graph PASSED                                                                                                 [  9%]
tests/test_anib.py::test_blastn_multiple PASSED                                                                                              [ 10%]
tests/test_anib.py::test_blastn_single PASSED                                                                                                [ 11%]
tests/test_anib.py::test_formatdb_multiple PASSED                                                                                            [ 12%]
tests/test_anib.py::test_formatdb_single PASSED                                                                                              [ 13%]
tests/test_anib.py::test_fragment_files PASSED                                                                                               [ 14%]
tests/test_anib.py::test_makeblastdb_multiple PASSED                                                                                         [ 15%]
tests/test_anib.py::test_makeblastdb_single PASSED                                                                                           [ 16%]
tests/test_anib.py::test_parse_legacy_blastdir PASSED                                                                                        [ 17%]
tests/test_anib.py::test_parse_blastdir PASSED                                                                                               [ 18%]
tests/test_anib.py::test_parse_blasttab PASSED                                                                                               [ 19%]
tests/test_anib.py::test_parse_legacy_blasttab PASSED                                                                                        [ 20%]
tests/test_aniblastall.py::test_get_version_missing_exe PASSED                                                                               [ 21%]
tests/test_aniblastall.py::test_get_version_not_executable PASSED                                                                            [ 22%]
tests/test_aniblastall.py::test_get_version_no_version PASSED                                                                                [ 23%]
tests/test_aniblastall.py::test_get_version_os_incompatible PASSED                                                                           [ 24%]
tests/test_anim.py::test_get_version_no_exe PASSED                                                                                           [ 25%]
tests/test_anim.py::test_get_version_exe_not_executable PASSED                                                                               [ 26%]
tests/test_anim.py::test_get_version_exe_no_version PASSED                                                                                   [ 27%]
tests/test_anim.py::test_deltadir_parsing PASSED                                                                                             [ 28%]
tests/test_anim.py::test_deltafile_parsing PASSED                                                                                            [ 29%]
tests/test_anim.py::test_maxmatch_single FAILED                                                                                              [ 30%]
tests/test_anim.py::test_mummer_multiple FAILED                                                                                              [ 31%]
tests/test_anim.py::test_mummer_single FAILED                                                                                                [ 32%]
tests/test_anim.py::test_mummer_job_generation PASSED                                                                                        [ 33%]
tests/test_cli_parsing.py::test_createdb PASSED                                                                                              [ 34%]
tests/test_cli_parsing.py::test_download_single_genome PASSED                                                                                [ 35%]
tests/test_concordance.py::test_anim_concordance FAILED                                                                                      [ 36%]
tests/test_concordance.py::test_anib_concordance PASSED                                                                                      [ 37%]
tests/test_concordance.py::test_aniblastall_concordance PASSED                                                                               [ 38%]
tests/test_concordance.py::test_tetra_concordance PASSED                                                                                     [ 39%]
tests/test_dependencies.py::test_import_biopython PASSED                                                                                     [ 40%]
tests/test_dependencies.py::test_import_matplotlib PASSED                                                                                    [ 41%]
tests/test_dependencies.py::test_import_numpy PASSED                                                                                         [ 42%]
tests/test_dependencies.py::test_import_pandas PASSED                                                                                        [ 43%]
tests/test_dependencies.py::test_import_scipy PASSED                                                                                         [ 44%]
tests/test_dependencies.py::test_blastn_available PASSED                                                                                     [ 45%]
tests/test_dependencies.py::test_run_blastall XPASS                                                                                          [ 46%]
tests/test_dependencies.py::test_run_nucmer PASSED                                                                                           [ 47%]
tests/test_graphics.py::test_png_mpl PASSED                                                                                                  [ 48%]
tests/test_graphics.py::test_svg_mpl PASSED                                                                                                  [ 49%]
tests/test_graphics.py::test_pdf_mpl PASSED                                                                                                  [ 50%]
tests/test_graphics.py::test_png_seaborn PASSED                                                                                              [ 51%]
tests/test_graphics.py::test_svg_seaborn PASSED                                                                                              [ 52%]
tests/test_graphics.py::test_pdf_seaborn PASSED                                                                                              [ 53%]
tests/test_jobs.py::test_create_job PASSED                                                                                                   [ 54%]
tests/test_jobs.py::test_create_job_with_command PASSED                                                                                      [ 55%]
tests/test_jobs.py::test_add_dependency PASSED                                                                                               [ 56%]
tests/test_jobs.py::test_remove_dependency PASSED                                                                                            [ 57%]
tests/test_jobs.py::test_create_jobgroup PASSED                                                                                              [ 58%]
tests/test_jobs.py::test_1d_jobgroup PASSED                                                                                                  [ 59%]
tests/test_jobs.py::test_2d_jobgroup PASSED                                                                                                  [ 60%]
tests/test_jobs.py::test_add_group_dependency PASSED                                                                                         [ 61%]
tests/test_jobs.py::test_remove_group_dependency PASSED                                                                                      [ 62%]
tests/test_legacy_scripts.py::test_legacy_anim_sns PASSED                                                                                    [ 63%]
tests/test_legacy_scripts.py::test_legacy_anim_mpl PASSED                                                                                    [ 64%]
tests/test_legacy_scripts.py::test_legacy_anib_sns PASSED                                                                                    [ 65%]
tests/test_legacy_scripts.py::test_legacy_anib_mpl PASSED                                                                                    [ 66%]
tests/test_legacy_scripts.py::test_legacy_tetra_sns PASSED                                                                                   [ 67%]
tests/test_legacy_scripts.py::test_legacy_tetra_mpl PASSED                                                                                   [ 68%]
tests/test_legacy_scripts.py::test_legacy_genome_downloads PASSED                                                                            [ 69%]
tests/test_multiprocessing.py::test_multiprocessing_run PASSED                                                                               [ 70%]
tests/test_multiprocessing.py::test_cmdsets PASSED                                                                                           [ 71%]
tests/test_multiprocessing.py::test_dependency_graph_run PASSED                                                                              [ 72%]
tests/test_parsing.py::test_anim_delta PASSED                                                                                                [ 73%]
tests/test_subcmd_01_download.py::test_download_dry_run PASSED                                                                               [ 74%]
tests/test_subcmd_01_download.py::test_download_c_blochmannia PASSED                                                                         [ 75%]
tests/test_subcmd_01_download.py::test_download_kraken PASSED                                                                                [ 76%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_index PASSED                                                                        [ 77%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_missing_index PASSED                                                                [ 78%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_createdb PASSED                                                               [ 79%]
tests/test_subcmd_04_anim.py::TestANImSubcommand::test_anim FAILED                                                                           [ 80%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes PASSED                                                                    [ 81%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes_runs PASSED                                                               [ 82%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_matrices FAILED                                                                   [ 83%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_results PASSED                                                                    [ 84%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs PASSED                                                                       [ 85%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs_genomes PASSED                                                               [ 86%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_jpg FAILED                                                                   [ 87%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_pdf FAILED                                                                   [ 88%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_png FAILED                                                                   [ 89%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_svg FAILED                                                                   [ 90%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_jpg FAILED                                                               [ 91%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_pdf FAILED                                                               [ 92%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_png FAILED                                                               [ 93%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_svg FAILED                                                               [ 94%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_defaults FAILED                                                      [ 95%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_no_thresh FAILED                                                     [ 96%]
tests/test_subcmd_08_anib.py::TestANIbsubcommand::test_anib XFAIL                                                                            [ 97%]
tests/test_tetra.py::test_tetraclean PASSED                                                                                                  [ 98%]
tests/test_tetra.py::test_zscore PASSED                                                                                                      [ 99%]
tests/test_tetra.py::test_correlations PASSED                                                                                                [100%]

@widdowquinn

Thoughts on implementation:

  • --noextend should be the default; it should not be an error to submit pyani anim … --noextend
  • it should also be possible to submit pyani anim … --extend
  • pyani anim … --noextend --extend is nonsensical and should error

a mutually-exclusive argument group seems like the best way to do this.

I'm not sure about this. Where there's a two-state choice like "extend" or "noextend", and one choice is the default, I wouldn't expect reselecting the default to be accommodated at all. Here, for instance:

  • pyani anim --extend would override the default behaviour
  • pyani anim --noextend would raise an error saying that the option isn't recognised

and this avoids the problem of the mutually exclusive specification of both options.

I think that's neater, and more closely matches my expectation as a user.

@widdowquinn

widdowquinn commented Oct 1, 2021

This is going to require some more serious work on the tests

This is going to require a bit of investigation and thought. There's a tension between:

  • retaining backwards compatibility (always desirable)
  • implementing what we believe to be a more reliable approach

Here's my initial idea:

  1. The Rossello-Mora paper states that they used default options for MUMmer/NUCmer. This is, IIRC, what JSpecies also does/did. The JSpecies output comprises our consistency test target. So, if we want to maintain the cross-compatibility for regression testing, we will possibly need to run that consistency test with --extend as an option (if it does not remain the default).

  2. In v0.2 I made the decision to use the --mum option to ensure we didn't double-count alignment anchors. I (wrongly) assumed that this would avoid the problem of double-counting aligned regions. Even with the --mum option in place, we still maintained consistency with JSpecies, and that was enough for me at the time. I provided the --maxmatch option to restore default MUMmer behaviour (a query region could match two or more reference regions) for those who wanted it. But that was where I departed from the Rossello-Mora/JSpecies description/implementation - with minor/acceptable consequences.

  3. Donovan's observation indicates that even with the --mum option we get nonsensical results with at least one pair of virus genomes. The circumstance is - biology being what it is - undoubtedly more general than that. The cause is double-counting of regions where MUMmer alignments overlap. We can avoid the worst of it in his example by using --noextend, but implementing this more widely is certain to cause differences in ANIm output. Your tests show that this difference is enough to break backwards agreement with JSpecies/earlier ANIm implementations in pyani. We need a way to approach this. My current thoughts are:

3a. Change the top subcommand priority from pyani classify to pyani compare: this will give us a framework for evaluating the consequences of the --extend/--noextend change.

3b. Implement the --noextend (rather than --extend) behaviour as an option, in the first instance. We won't need to rewrite tests yet and this will allow us to spot regression failures. As we go, we can distinguish between extend and noextend tests (so we can make the swap later more easily if default behaviour changes).

3c. Evaluate --noextend against --extend. My expectation was that using --noextend would not make a massive difference to ANI, but the failing tests suggest this is not the case. However, without more information we don't know whether ANI and coverage are stable while aligned length etc. falls outside tolerance. That's why we need pyani compare. If we do not have stable ANI/coverage results, we will have to make an informed choice about whether we do, in fact, want to change the default ANIm behaviour.

  4. If we do not, in the end, change default behaviour to --noextend, we should still make this an option for the user. It is clearly helpful for small genomes (e.g. viruses) where even a small overlap can give nonsensical results for coverage, and skew ANI.

  5. We might consider a further pyani-specific constraint on MUMmer/NUCmer results after parsing, and when calculating output measures. Even with --mum --noextend on Donovan's example genomes there was still an overlap of alignment anchors, resulting in double-counting. However, relative to total genome size the overlap is small. If we want to use ANIm as a true metric, we can't have this double-counting. One way to overcome this would be to make an interval graph and use information about overlaps in our ANI/coverage calculations to only single-count aligning regions. I think pybedtools has an efficient implementation of interval graphs that we could use, here.
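
To make that single-counting idea concrete, here is a minimal sketch of interval merging in plain Python (standing in for whatever pybedtools would give us; the function name and data shapes are illustrative only):

    # Single-count aligned regions by merging overlapping intervals.
    # Intervals are (start, end) alignment coordinates on one sequence;
    # this is a sketch, not pyani's actual data structure.
    def merged_alignment_length(intervals):
        """Total bases covered, counting overlapping regions once."""
        total = 0
        current_start, current_end = None, None
        for start, end in sorted(intervals):
            if current_end is None or start > current_end:
                if current_end is not None:  # bank the finished interval
                    total += current_end - current_start
                current_start, current_end = start, end
            else:  # overlap: extend instead of double-counting
                current_end = max(current_end, end)
        if current_end is not None:
            total += current_end - current_start
        return total

    # Two alignments overlapping by 10 bases count as 150 aligned bases, not 160
    assert merged_alignment_length([(0, 100), (90, 150)]) == 150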

What are your thoughts?

(which I will do at more respectable hours):

I'd definitely support that change!

@baileythegreen

I was writing this before you sent the long message, explaining why I implemented both flags:

It's quite simple to specify mutually-exclusive options, in practice: https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser.add_mutually_exclusive_group.

With this implementation, you can run pyani anim without specifying either, and it will use --noextend; but I think not allowing the use of the --noextend flag, as you suggest, invites confusion.

--extend is the default in MUMmer, and we're intending to have the reverse be pyani's default. Someone accustomed to MUMmer's interface might be frustrated by this change in both default and CLI.

Additionally, while it makes sense for --extend to map to True and --noextend to map to False in my head, this may not be true for others, and I have at times been annoyed with flags that claim to be Boolean, but don't then accept a Boolean value as an argument, for instance.
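
For instance (a contrived sketch of that gotcha, not pyani's actual parser):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--extend", action="store_true")

    parser.parse_args(["--extend"])  # works: extend=True
    # parser.parse_args(["--extend", "False"])  # error: unrecognized arguments: False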

The help information needs modification if we have both, though.

This is all a very front-end debate, so whether we allow both flags really only affects the parser. The rest of the changes are already made. (Though, your long message portends more extensive modifications.)

@widdowquinn

With this implementation, you can run pyani anim without specifying either, and it will use --noextend; but I think not allowing the use of the --noextend flag, as you suggest, invites confusion.

I'm not convinced of that. I see it more as a --verbose-like toggle. If you don't use the argument, you get standard output; if you do use it, you get more information. There's not usually a --notverbose option to reinforce the idea that you haven't changed a setting.

--extend is the default in MUMmer, and we're intending to have the reverse be pyani's default. Someone accustomed to MUMmer's interface might be frustrated by this change in both default and CLI.

One of the major user-level differences between legacy BLAST and BLAST+ was the change in default BLASTN algorithm, which really does modify the output you get significantly in many cases. You can still select the original algorithm choice as an option. Sometimes it's OK just to make things "better" - I think what's important is that what is done is clear, and an option to revert is there for the user, if they want it.

I don't think that having a different default in pyani than you get in MUMmer is necessarily bad. We already do this with the --mum argument.

I don't think many users are strongly aware of the precise options we have for MUMmer and, in any case, if --noextend really borks the output for the most common use-cases, we might not be well-advised to make it the default, after all.

Additionally, while it makes sense for --extend to map to True and --noextend to map to False in my head, this may not be true for others, and I have at times been annoyed with flags that claim to be Boolean, but don't then accept a Boolean value as an argument, for instance.

I see that, but I think we're in a better position here than, for instance, tools with both a --quiet and --verbose option. This, at least, is a toggle, of sorts - there are only two settings.

(Though, your long message portends more extensive modifications.)

It's the curse of thinking about things…

@baileythegreen

When I saw how many tests it broke, I figured you'd be concerned about backwards compatibility. I think prioritising pyani compare once this has stabilised a bit is a good way to figure out the extent of the (potential) issue.

As for the balance between that and a more reliable approach: I don't think there's much virtue in maintaining backwards-compatibility of something that is unreliable. Now, calling pyani anim unreliable is probably a gross overstatement, and is not what I'm saying. But that is the opinion I hold for an extreme case, and I think it would take a really compelling argument and only a small degree of trivial unreliability for me to see backwards compatibility as the more important aspect. I haven't really looked into it, yet, but I'm fairly certain we aren't going to be able to do both. At least, I can't see how we would.

3b. Implement the --noextend (rather than --extend) behaviour as an option, in the first instance. We won't need to rewrite tests yet and this will allow us to spot regression failures. As we go, we can distinguish between extend and noextend tests (so we can make the swap later more easily if default behaviour changes).

This is easily done; it's just a matter of changing the default value assigned in the parser.
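
With the mutually-exclusive group sketched earlier, that flip is a one-line change (illustrative dest name, as before):

    # Make MUMmer's extending behaviour the pyani default instead;
    # --noextend then becomes the conscious opt-out.
    parser.set_defaults(extend=True)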

We might consider a further pyani-specific constraint on MUMmer/NUCmer results after parsing, and when calculating output measures. Even with --mum --noextend on Donovan's example genomes there was still an overlap of alignment anchors, resulting in double-counting. However, relative to total genome size the overlap is small. If we want to use ANIm as a true metric, we can't have this double-counting. One way to overcome this would be to make an interval graph and use information about overlaps in our ANI/coverage calculations to only single-count aligning regions. I think pybedtools has an efficient implementation of interval graphs that we could use, here.

This sounds like it might be useful, but I think it's probably a future debate, for now. At least, once I've been able to look more into it/what it would require.

@baileythegreen

I see it more as a --verbose-like toggle. If you don't use the argument, you get standard output; if you do use it, you get more information. There's not usually a --notverbose option to reinforce the idea that you haven't changed a setting.

But in this case, MUMmer actually does have the equivalent of --verbose and --notverbose. I think that complicates it in my head.

Idk. When I was adding it last night, just adding --extend and saying the default was for it to be 'off' didn't really seem to communicate very well that one state is selected with --extend and the other with --noextend.

@baileythegreen

Some of my overthinking of this may also be complicated by the fact I chose to name the internal variable for this extend, and therefore --extend.

@widdowquinn

As for the balance between that and a more reliable approach: I don't think there's much virtue in maintaining backwards-compatibility of something that is unreliable. Now, calling pyani anim unreliable is probably a gross overstatement, and is not what I'm saying. But that is the opinion I hold for an extreme case, and I think it would take a really compelling argument and only a small degree of trivial unreliability for me to see backwards compatibility as the more important aspect.

I agree. I think that the idea I'm trying to get across is this, for moving to --noextend as the default:

Pros:

  • switching from default behaviour to --noextend corrects an implausible output for an example pair of genomes
  • using --noextend also seems, conceptually, to be closer to the kind of alignment/comparison I see being most meaningful for ANIm

Cons:

  • we don't know how much of a difference this change would induce in results we previously considered "stable" and consistent with both ANIb and previous implementations of ANIm. Do old "good" results with this option now become implausible?
  • --noextend still does not eliminate overlapping alignments and it is possible that an alternative genome comparison could still generate implausible results

The pros are good pros. We can investigate and evaluate the cons, and we absolutely need to before pushing out the change.

I haven't really looked into it, yet, but I'm fairly certain we aren't going to be able to do both. At least, I can't see how we would.

My motivation for proceeding step-wise and retaining backwards compatibility is to ensure that we don't have accidental code regressions as we implement the new option. Even if we implement MUMmer's --noextend as the default, having the option to use --extend when testing means (i) we can confirm that we can reproduce old results with a different choice of settings, and (ii) we didn't break something accidentally.

One way to overcome this would be to make an interval graph and use information about overlaps in our ANI/coverage calculations to only single-count aligning regions.

This sounds like it might be useful, but I think it's probably a future debate, for now. At least, once I've been able to look more into it/what it would require.

I think there's a possible outcome where --noextend breaks useful and meaningful output so significantly that it's unusable (we'll discover if this is true when we test it). If that is the case, then I think the interval graph approach might fix Donovan's example, improve ANIm in the way we've been discussing, be a step forward in the use of ANIm as a true metric, and become the actual default. But we've some work to do before we know.

@widdowquinn

Some of my overthinking of this may also be complicated by the fact I chose to name the internal variable for this extend, and therefore --extend.

Language constrains thought… as I think we've discussed offline ;)

@widdowquinn

But in this case, MUMmer actually does have the equivalent of --verbose and --notverbose. I think that complicates it in my head.

Yes. I think that's not the best design choice in MUMmer.

@baileythegreen

Some of my overthinking of this may also be complicated by the fact I chose to name the internal variable for this extend, and therefore --extend.

Language constrains thought… as I think we've discussed offline ;)

Calling it anything else is just going to introduce more arbitrary language into the equation…. That doesn't seem helpful.

@baileythegreen

Looking at our ANIm parser, we also have this:

    parser.add_argument(
        "--nofilter",
        dest="nofilter",
        action="store_true",
        default=False,
        help="do not use delta-filter for 1:1 NUCmer matches",
    )

which I now Have Thoughts About.

@widdowquinn

Calling it anything else is just going to introduce more arbitrary language into the equation…. That doesn't seem helpful.

I don't have the same sense about it, I think. Where we want to force --extend to be a conscious choice for the user (MUMmer's --noextend as the default), your choice is exactly what I'd do.

Where we want to force --noextend to be a conscious choice for the user, I'd use the internal variable noextend. I internalise this as the True/False status of the option with that name.

As well as the --nofilter option you mention, there's --noclobber elsewhere. In both cases, the user is explicitly telling the code not to do something it would otherwise do, and it's a binary choice.

@widdowquinn

which I now Have Thoughts About.

That may become redundant, depending on the way the --noextend/interval graph stuff turns out.

@baileythegreen

Calling it anything else is just going to introduce more arbitrary language into the equation…. That doesn't seem helpful.

I don't have the same sense about it, I think. Where we want to force --extend to be a conscious choice for the user (MUMmer's --noextend as the default), your choice is exactly what I'd do.

Where we want to force --noextend to be a conscious choice for the user, I'd use the internal variable noextend. I internalise this as the True/False status of the option with that name.

As well as the --nofilter option you mention, there's --noclobber elsewhere. In both cases, the user is explicitly telling the code not to do something it would otherwise do, and it's a binary choice.

It's the double negatives. I think they impede clarity of thought in things like this. That's why I don't like them in binary variable names.

@widdowquinn

It's the double negatives. I think they impede clarity of thought in things like this. That's why I don't like them in binary variable names.

I see. I take your point.

tl;dr

I'd like to retain --nofilter as the command-line option, as it makes explicit the idea of turning off a default with only two settings.

I'm happy for you to swap the internal representation of the choice from nofilter to filter (with the corresponding inversion of logic), for the sake of greater readability.

why I think that

I think the way I've learned to accommodate this kind of thing is to consider who/what is "understanding" the code, and at what level. There are tradeoffs.

For a command-line option, I think it's clearest to the user if the option describes what the option does intuitively. For instance, --nofilter and --filter False are probably equally understandable to the user. The choice is then whether you want an interface that tends towards single-option (--verbose --force --noclobber --nofilter) or key-value (--verbose True --force True --clobber False --filter False). Both are valid.

I guess I made the decision early, perhaps without thinking too much about it, that I preferred the single-option approach for toggles, which led towards the first set of options. I think it's concise, and easier to parse (practically - we don't need to check argument validity, for example).

Having made that decision, the readability burden is then on the coder.

Both options:

    parser.add_argument(
        "--nofilter",
        dest="nofilter",
        action="store_true",
        help="do not use delta-filter for 1:1 NUCmer matches",
    )

leading to a variable named nofilter, and

    parser.add_argument(
        "--nofilter",
        dest="filter",
        action="store_false",
        help="do not use delta-filter for 1:1 NUCmer matches",
    )

leading to a variable named filter, are reasonable (though filter() is a Python built-in, so a variable with the same name should be avoided; this is trivial to fix). I see the tradeoff as:

  • nofilter as a variable retains direct equivalence with the command-line option. If I read this as a label and not as English syntax, it is easy for me to understand the consistency of "user chose --nofilter so nofilter is True", but it reads as a double-negative for you.
  • filter as a variable represents the positive action of filtering when it is True. It is easy for you to understand that the code means that filtering should be applied when filter is True, but the name link with --nofilter is not exact.

Both are valid and correct (as far as the computer cares…), but being the sole developer so long meant I didn't have to think about both perspectives. My code reading tends to see variable names as labels, and I think you're picking up on the syntax to a greater extent.

baileythegreen added a commit that referenced this issue Oct 1, 2021
@baileythegreen

Looking at the details of the failed tests right now, and it may be that many of the failures are down to the change in the submitted NUCmer commands.

I'm going to create a second branch locally, so I can test both default behaviours—fixing the tests so they pass, unless it's due to actual differences in results—and then we can decide what to do with that.

@baileythegreen

Good news!

On my local noextend_342 branch (which assumes --noextend is the default), the only test that fails is concordance.

This may still mean there is a lot to think about, but using that option doesn't inherently break a bunch of other things.

! pytest -v
=============================================================== test session starts ================================================================
platform darwin -- Python 3.8.3, pytest-5.3.5, py-1.10.0, pluggy-0.13.1 -- /Users/baileythegreen/Software/miniconda3/bin/python
cachedir: .pytest_cache
rootdir: /Users/baileythegreen/Software/pyani, inifile: pytest.ini
plugins: dash-1.14.0, cov-2.11.1, ordering-0.6
collected 100 items                                                                                                                                

tests/test_anib.py::test_get_version_no_exe PASSED                                                                                           [  1%]
tests/test_anib.py::test_get_version_exe_not_executable PASSED                                                                               [  2%]
tests/test_anib.py::test_get_version_exe_no_version PASSED                                                                                   [  3%]
tests/test_anib.py::test_blastall_dbjobdict PASSED                                                                                           [  4%]
tests/test_anib.py::test_blastall_graph PASSED                                                                                               [  5%]
tests/test_anib.py::test_blastall_multiple PASSED                                                                                            [  6%]
tests/test_anib.py::test_blastall_single PASSED                                                                                              [  7%]
tests/test_anib.py::test_blastn_dbjobdict PASSED                                                                                             [  8%]
tests/test_anib.py::test_blastn_graph PASSED                                                                                                 [  9%]
tests/test_anib.py::test_blastn_multiple PASSED                                                                                              [ 10%]
tests/test_anib.py::test_blastn_single PASSED                                                                                                [ 11%]
tests/test_anib.py::test_formatdb_multiple PASSED                                                                                            [ 12%]
tests/test_anib.py::test_formatdb_single PASSED                                                                                              [ 13%]
tests/test_anib.py::test_fragment_files PASSED                                                                                               [ 14%]
tests/test_anib.py::test_makeblastdb_multiple PASSED                                                                                         [ 15%]
tests/test_anib.py::test_makeblastdb_single PASSED                                                                                           [ 16%]
tests/test_anib.py::test_parse_legacy_blastdir PASSED                                                                                        [ 17%]
tests/test_anib.py::test_parse_blastdir PASSED                                                                                               [ 18%]
tests/test_anib.py::test_parse_blasttab PASSED                                                                                               [ 19%]
tests/test_anib.py::test_parse_legacy_blasttab PASSED                                                                                        [ 20%]
tests/test_aniblastall.py::test_get_version_missing_exe PASSED                                                                               [ 21%]
tests/test_aniblastall.py::test_get_version_not_executable PASSED                                                                            [ 22%]
tests/test_aniblastall.py::test_get_version_no_version PASSED                                                                                [ 23%]
tests/test_aniblastall.py::test_get_version_os_incompatible PASSED                                                                           [ 24%]
tests/test_anim.py::test_get_version_no_exe PASSED                                                                                           [ 25%]
tests/test_anim.py::test_get_version_exe_not_executable PASSED                                                                               [ 26%]
tests/test_anim.py::test_get_version_exe_no_version PASSED                                                                                   [ 27%]
tests/test_anim.py::test_deltadir_parsing PASSED                                                                                             [ 28%]
tests/test_anim.py::test_deltafile_parsing PASSED                                                                                            [ 29%]
tests/test_anim.py::test_maxmatch_single PASSED                                                                                              [ 30%]
tests/test_anim.py::test_mummer_multiple PASSED                                                                                              [ 31%]
tests/test_anim.py::test_mummer_single PASSED                                                                                                [ 32%]
tests/test_anim.py::test_mummer_job_generation PASSED                                                                                        [ 33%]
tests/test_cli_parsing.py::test_createdb PASSED                                                                                              [ 34%]
tests/test_cli_parsing.py::test_download_single_genome PASSED                                                                                [ 35%]
tests/test_concordance.py::test_anim_concordance FAILED                                                                                      [ 36%]
tests/test_concordance.py::test_anib_concordance PASSED                                                                                      [ 37%]
tests/test_concordance.py::test_aniblastall_concordance PASSED                                                                               [ 38%]
tests/test_concordance.py::test_tetra_concordance PASSED                                                                                     [ 39%]
tests/test_dependencies.py::test_import_biopython PASSED                                                                                     [ 40%]
tests/test_dependencies.py::test_import_matplotlib PASSED                                                                                    [ 41%]
tests/test_dependencies.py::test_import_numpy PASSED                                                                                         [ 42%]
tests/test_dependencies.py::test_import_pandas PASSED                                                                                        [ 43%]
tests/test_dependencies.py::test_import_scipy PASSED                                                                                         [ 44%]
tests/test_dependencies.py::test_blastn_available PASSED                                                                                     [ 45%]
tests/test_dependencies.py::test_run_blastall XPASS                                                                                          [ 46%]
tests/test_dependencies.py::test_run_nucmer PASSED                                                                                           [ 47%]
tests/test_graphics.py::test_png_mpl PASSED                                                                                                  [ 48%]
tests/test_graphics.py::test_svg_mpl PASSED                                                                                                  [ 49%]
tests/test_graphics.py::test_pdf_mpl PASSED                                                                                                  [ 50%]
tests/test_graphics.py::test_png_seaborn PASSED                                                                                              [ 51%]
tests/test_graphics.py::test_svg_seaborn PASSED                                                                                              [ 52%]
tests/test_graphics.py::test_pdf_seaborn PASSED                                                                                              [ 53%]
tests/test_jobs.py::test_create_job PASSED                                                                                                   [ 54%]
tests/test_jobs.py::test_create_job_with_command PASSED                                                                                      [ 55%]
tests/test_jobs.py::test_add_dependency PASSED                                                                                               [ 56%]
tests/test_jobs.py::test_remove_dependency PASSED                                                                                            [ 57%]
tests/test_jobs.py::test_create_jobgroup PASSED                                                                                              [ 58%]
tests/test_jobs.py::test_1d_jobgroup PASSED                                                                                                  [ 59%]
tests/test_jobs.py::test_2d_jobgroup PASSED                                                                                                  [ 60%]
tests/test_jobs.py::test_add_group_dependency PASSED                                                                                         [ 61%]
tests/test_jobs.py::test_remove_group_dependency PASSED                                                                                      [ 62%]
tests/test_legacy_scripts.py::test_legacy_anim_sns PASSED                                                                                    [ 63%]
tests/test_legacy_scripts.py::test_legacy_anim_mpl PASSED                                                                                    [ 64%]
tests/test_legacy_scripts.py::test_legacy_anib_sns PASSED                                                                                    [ 65%]
tests/test_legacy_scripts.py::test_legacy_anib_mpl PASSED                                                                                    [ 66%]
tests/test_legacy_scripts.py::test_legacy_tetra_sns PASSED                                                                                   [ 67%]
tests/test_legacy_scripts.py::test_legacy_tetra_mpl PASSED                                                                                   [ 68%]
tests/test_legacy_scripts.py::test_legacy_genome_downloads PASSED                                                                            [ 69%]
tests/test_multiprocessing.py::test_multiprocessing_run PASSED                                                                               [ 70%]
tests/test_multiprocessing.py::test_cmdsets PASSED                                                                                           [ 71%]
tests/test_multiprocessing.py::test_dependency_graph_run PASSED                                                                              [ 72%]
tests/test_parsing.py::test_anim_delta PASSED                                                                                                [ 73%]
tests/test_subcmd_01_download.py::test_download_dry_run PASSED                                                                               [ 74%]
tests/test_subcmd_01_download.py::test_download_c_blochmannia PASSED                                                                         [ 75%]
tests/test_subcmd_01_download.py::test_download_kraken PASSED                                                                                [ 76%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_index PASSED                                                                        [ 77%]
tests/test_subcmd_02_index.py::TestIndexSubcommand::test_missing_index PASSED                                                                [ 78%]
tests/test_subcmd_03_createdb.py::TestCreatedbSubcommand::test_createdb PASSED                                                               [ 79%]
tests/test_subcmd_04_anim.py::TestANImSubcommand::test_anim PASSED                                                                           [ 80%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes PASSED                                                                    [ 81%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_genomes_runs PASSED                                                               [ 82%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_matrices PASSED                                                                   [ 83%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_results PASSED                                                                    [ 84%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs PASSED                                                                       [ 85%]
tests/test_subcmd_05_report.py::TestReportSubcommand::test_runs_genomes PASSED                                                               [ 86%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_jpg PASSED                                                                   [ 87%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_pdf PASSED                                                                   [ 88%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_png PASSED                                                                   [ 89%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_mpl_svg PASSED                                                                   [ 90%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_jpg PASSED                                                               [ 91%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_pdf PASSED                                                               [ 92%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_png PASSED                                                               [ 93%]
tests/test_subcmd_06_plot.py::TestPlotSubcommand::test_plot_seaborn_svg PASSED                                                               [ 94%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_defaults PASSED                                                      [ 95%]
tests/test_subcmd_07_classify.py::TestClassifySubcommand::test_classify_no_thresh PASSED                                                     [ 96%]
tests/test_subcmd_08_anib.py::TestANIbsubcommand::test_anib XFAIL                                                                            [ 97%]
tests/test_tetra.py::test_tetraclean PASSED                                                                                                  [ 98%]
tests/test_tetra.py::test_zscore PASSED                                                                                                      [ 99%]
tests/test_tetra.py::test_correlations PASSED                                                                                                [100%]

===================================================================== FAILURES =====================================================================
______________________________________________________________ test_anim_concordance _______________________________________________________________

paths_concordance_fna = [PosixPath('/Users/baileythegreen/Software/pyani/tests/fixtures/concordance/GCF_002243555.1_ASM224355v1_genomic.fna'),...'), PosixPath('/Users/baileythegreen/Software/pyani/tests/fixtures/concordance/GCF_000011325.1_ASM1132v1_genomic.fna')]
path_concordance_jspecies = PosixPath('/Users/baileythegreen/Software/pyani/tests/fixtures/concordance/jspecies_output.tab'), tolerance_anim = 0.1
tmp_path = PosixPath('/private/var/folders/hg/wysfx1957s59pq2fbyss21lm0000gn/T/pytest-of-baileythegreen/pytest-86/test_anim_concordance0')

    @pytest.mark.skip_if_exe_missing("nucmer")
    def test_anim_concordance(
        paths_concordance_fna, path_concordance_jspecies, tolerance_anim, tmp_path
    ):
        """Check ANIm results are concordant with JSpecies."""
        # Perform ANIm on the input directory contents
        # We have to separate nucmer/delta-filter command generation
        # because Travis-CI doesn't play nicely with changes we made
        # for local SGE/OGE integration.
        # This might be avoidable with a scheduler flag passed to
        # jobgroup generation in the anim.py module. That's a TODO.
        ncmds, fcmds = anim.generate_nucmer_commands(paths_concordance_fna, tmp_path)
        (tmp_path / "nucmer_output").mkdir(exist_ok=True, parents=True)
        run_mp.multiprocessing_run(ncmds)
    
        # delta-filter commands need to be treated with care for
        # Travis-CI. Our cluster won't take redirection or semicolon
        # separation in individual commands, but the wrapper we wrote
        # for this (delta_filter_wrapper.py) can't be called under
        # Travis-CI. So we must deconstruct the commands below
        dfcmds = [
            " > ".join([" ".join(fcmd.split()[1:-1]), fcmd.split()[-1]]) for fcmd in fcmds
        ]
        run_mp.multiprocessing_run(dfcmds)
    
        orglengths = pyani_files.get_sequence_lengths(paths_concordance_fna)
    
        results = anim.process_deltadir(tmp_path / "nucmer_output", orglengths)
        result_pid = results.percentage_identity
        result_pid.to_csv(tmp_path / "pyani_anim.tab", sep="\t")
    
        # Compare JSpecies output to results
        result_pid = (result_pid.sort_index(axis=0).sort_index(axis=1) * 100.0).values
        tgt_pid = parse_jspecies(path_concordance_jspecies)["ANIm"].values
    
>       assert result_pid - tgt_pid == pytest.approx(0, abs=tolerance_anim)
E       assert array([[0.0, ... dtype=object) == 0 ± 1.0e-01
E         -array([[0.0, 5.585841092489133, 0.4892132128268969],\n
E         -       [5.605841092489129, 0.0, 5.644002047082907],\n
E         -       [0.4892132128268969, 5.624002047082911, 0.0]], dtype=object)
E         +0 ± 1.0e-01

tests/test_concordance.py:203: AssertionError
---------------------------------------------------------------- Captured log call -----------------------------------------------------------------
DEBUG    pyani.anim:anim.py:249 GCF_000011325.1_ASM1132v1_genomic_vs_GCF_000227605.2_ASM22760v2_genomic
DEBUG    pyani.anim:anim.py:249 GCF_000011325.1_ASM1132v1_genomic_vs_GCF_002243555.1_ASM224355v1_genomic
DEBUG    pyani.anim:anim.py:249 GCF_000227605.2_ASM22760v2_genomic_vs_GCF_002243555.1_ASM224355v1_genomic
INFO     pyani.anim:anim.py:368 /private/var/folders/hg/wysfx1957s59pq2fbyss21lm0000gn/T/pytest-of-baileythegreen/pytest-86/test_anim_concordance0/nucmer_output has 3 files to load
================================================================= warnings summary =================================================================
tests/test_anib.py:54
  /Users/baileythegreen/Software/pyani/tests/test_anib.py:54: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
    from pandas.util.testing import assert_frame_equal

tests/test_legacy_scripts.py::test_legacy_anim_sns
tests/test_legacy_scripts.py::test_legacy_anim_sns
  /Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/seaborn/matrix.py:619: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
    linkage = hierarchy.linkage(self.array, method=self.method,

-- Docs: https://docs.pytest.org/en/latest/warnings.html
==================================== 1 failed, 97 passed, 1 xfailed, 1 xpassed, 3 warnings in 518.67s (0:08:38) ====================================

@widdowquinn

🎉🎉🎉🎉🎉🎉🎉🎉

That looks like (mostly) good news!

@baileythegreen

baileythegreen commented Oct 1, 2021

Yeah, the concordance test is the only one we'll have to worry about.

Outputs of the concordance test

assert result_pid - tgt_pid == pytest.approx(0, abs=tolerance_anim)

on my two branches:

# noextend_342, using `--noextend`
# this fails
Result {result_pid - tgt_pid}:
[[0.0                 5.585841092489133  0.4892132128268969]
 [5.605841092489129                 0.0   5.644002047082907]
 [0.4892132128268969  5.624002047082911                 0.0]]
Expected {pytest.approx(0, abs=tolerance_anim)}:
0 ± 1.0e-01


# issue_342, using `--extend`
# this passes
Result {result_pid - tgt_pid}:
[[                0.0  0.07518815694456293  0.06984331581938363]
 [0.09518815694455895                  0.0  0.08911322006224509]
 [0.06984331581938363  0.06911322006224907                  0.0]]
Expected {pytest.approx(0, abs=tolerance_anim)}:
0 ± 1.0e-01

@baileythegreen

I am currently annoyed that I had to actually break the (working) test in order to get pytest to show me the output of the sys.stdout.write() calls I added. Ridiculous.

@baileythegreen

I've pushed both branches, so you can look at them. They follow different philosophies on CLI and negative variable naming, but you shouldn't need to know about any of that to look at the concordance.

@baileythegreen

@widdowquinn, I wonder if you ever had a chance to look over these two branches or decide what we want to do here? I don't know if subsequent findings related to Issue #340 changed our mind on anything here.

I had forgotten about the mention of pybedtools in the early part of this issue's discussion. I have been looking into whether using that library might be a preferable way forward with the nucmer file parsing, and I will post what I've found in Issue #340, where I feel it is more directly relevant.

@baileythegreen

I have been trying to rebase the two branches relevant to this issue (issue_342 and noextend_342), run anim on each, and run pyani compare on the results.

I am having trouble with the rebasing part. In addition to the anim concordance test, which I expected to fail, I am seeing failures for these two tests:

tests/test_legacy_scripts.py::test_legacy_anim_sns FAILED
tests/test_legacy_scripts.py::test_legacy_anim_mpl FAILED 

and I am having trouble figuring out why. It looks like no files exist in the deltadir when process_deltadir() is called in calculate_anim(), but it is not clear to me why that should be.

I may not be able to finish this before I go on holiday, but I will dig a bit more today.

baileythegreen added this to the 0.3.0 milestone May 6, 2022
@baileythegreen

Both issue_342 and noextend_342 have been successfully rebased now.

The last significant bit of this discussion can be found in these two comments:
#342 (comment) — showing general test results and
#342 (comment) — showing detailed results for concordance tests

The question has also been raised as to whether this issue (and these branches) are still relevant, given the discussion in #340 and the extent of the issues with the current calculation of ANI and coverage from mummer results.
