
Perceptual hash #150

Closed
wants to merge 20 commits into from

Conversation

bjlittle
Contributor

@bjlittle bjlittle commented May 19, 2022

The Why...

Some background context...

Please refer to issue #146.

This PR extends the capability of pytest-mpl to support perceptual image hashes.

In a nutshell, cryptographic hashing algorithms, e.g. sha256 et al., only support strict binary image identity. By their very nature, they can't offer any meaningful way of determining the "similarity" of non-identical images. As a result, they're of limited use when comparing images, as a single differing pixel changes the hash completely.

In a similar vein, the rms + tolerance approach can also be brittle, albeit certainly better than using a cryptographic hash. Distilling the richness of detail and structure inherent within an image down to a single-valued statistic results in an incredibly lossy metric, and hence can also lead to a fragile and over-sensitive workflow.
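
To make the contrast concrete, here's a minimal illustrative sketch (not code from this PR) comparing a sha256 digest with an imagehash phash for two images that differ by a single pixel:

```python
import hashlib

import imagehash
from PIL import Image

# Two images that differ by exactly one pixel.
img_a = Image.new("RGB", (64, 64), "white")
img_b = img_a.copy()
img_b.putpixel((0, 0), (254, 254, 254))

# Cryptographic hash: any difference at all yields a completely different digest.
identical = (hashlib.sha256(img_a.tobytes()).hexdigest()
             == hashlib.sha256(img_b.tobytes()).hexdigest())
print(identical)  # False

# Perceptual hash: near-identical images produce near-identical hashes, so the
# Hamming distance between them is small and can be compared against a tolerance.
distance = imagehash.phash(img_a, hash_size=16) - imagehash.phash(img_b, hash_size=16)
print(distance)  # e.g. 0, comfortably within a Hamming tolerance of 2
```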

The real world up-shot of all this woe is a tangible drain on precious developer resource/effort/time in order to check graphic test failures... and TBH we've all got better things to do with our day 😉

Over on SciTools/iris we've been using perceptual image hashes for a few years now, and it's pretty much resolved the pain and fear induced by the sha and rms approaches. It's way more robust, and through controlling the hash-size and high frequency factor it's allowed us to find a comfortable and reasonably reliable sweet-spot for graphical testing.

Therefore, we're really keen to make perceptual hashing available in pytest-mpl, so that the greater matplotlib community can benefit (as well as ourselves).

The What...

A high-level summary... This PR is two-fold:
  • it introduces a simple and hopefully lightweight framework that supports easily configurable hashing algorithms, or kernels, which can be used to create associated hashing libraries (see the sketch below for the general idea)
  • it provides the default sha256 cryptographic hashing kernel (which pytest-mpl currently implements), but also introduces a phash perceptual hashing kernel (non-cryptographic) backed by https://github.com/JohannesBuchner/imagehash
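
To give a rough feel for the shape of the kernel framework, here's an illustrative sketch with hypothetical class names (the real classes live in pytest_mpl/kernels.py and differ in detail):

```python
import hashlib

import imagehash
from PIL import Image


class KernelSHA256:
    """Hypothetical kernel: strict binary identity via a cryptographic digest."""

    name = "sha256"

    def generate_hash(self, image_path):
        with open(image_path, "rb") as fh:
            return hashlib.sha256(fh.read()).hexdigest()

    def equivalent_hash(self, actual, expected):
        return actual == expected


class KernelPHash:
    """Hypothetical kernel: perceptual similarity via imagehash.phash."""

    name = "phash"

    def __init__(self, hash_size=16, high_freq_factor=4, hamming_tolerance=2):
        self.hash_size = hash_size
        self.high_freq_factor = high_freq_factor
        self.hamming_tolerance = hamming_tolerance

    def generate_hash(self, image_path):
        phash = imagehash.phash(Image.open(image_path),
                                hash_size=self.hash_size,
                                highfreq_factor=self.high_freq_factor)
        return str(phash)

    def equivalent_hash(self, actual, expected):
        # The Hamming distance between the two hashes must be within tolerance.
        distance = imagehash.hex_to_hash(actual) - imagehash.hex_to_hash(expected)
        return distance <= self.hamming_tolerance
```

The idea is that the plugin selects the configured kernel at runtime and delegates hash generation and comparison to it.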

I've attempted to keep the PR minimal by making my changes as additive as possible. However, I have rationalised some sections of the code-base (hopefully nothing too controversial) and applied some minor consistency changes (which were only due to fresh eyes on unfamiliar code, so happy to be shot down btw).

This PR is currently a work-in-progress, so all feedback is welcomed 👍

I'm super keen for pytest-mpl devs to get a feel for what I'm proposing early on, so that we can shape it appropriately. But from a purely selfish perspective, I'd love to see this feature banked, as it potentially opens the door to extensible and more robust image testing.

In terms of work still TODO, I figure I still need to (at least) do the following:

  • Testing
  • refactor after PR default hash savefig format #160
  • Pull-thru changes and/or check report summary generation:
    • json
    • html
    • basic-html
  • Documentation
    • Check doc-strings
    • Update README.rst

BTW, if you guys buy into this PR, then I may need help or advice along the way to service the above TODOs. However, I'd like to try my best to address them myself (as a learning opportunity), but I'll quickly shout if I get stuck 😉

API Changes...

What's New... In terms of public API, I've extended it as follows:
  • CLI options:

    • --mpl-kernel : str, either of [sha256, phash], default is sha256 to maintain existing default hashing behaviour
    • --mpl-hash-size : int, default is 16, giving a hash of 16**2 = 256 bits, only applies to phash
    • --mpl-hamming-tolerance : int, default is 2 bits, only applies to phash
    • --mpl-high-freq-factor : int, default is 4, only applies to phash
  • INI options:

    • mpl-kernel : str
    • mpl-hash-size : int
    • mpl-hamming-tolerance : int
    • mpl-high-freq-factor : int
  • mpl_image_compare marker kwargs:

    • hamming_tolerance : int

Note that my choice of defaults is heavily biased towards what we settled on for SciTools/iris.
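
For example, a test module opting into the phash kernel might look something like this (the option and kwarg names are from this PR; the exact invocation syntax is illustrative):

```python
import matplotlib.pyplot as plt
import pytest

# Illustrative invocation, e.g.:
#   pytest --mpl --mpl-kernel=phash --mpl-hash-size=16 \
#          --mpl-high-freq-factor=4 --mpl-hamming-tolerance=2


@pytest.mark.mpl_image_compare(hamming_tolerance=4)  # per-test override of the tolerance
def test_line_plot():
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [4, 5, 6])
    return fig
```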

https://github.com/JohannesBuchner/imagehash offers several other hashing algorithms, but I've only opted to implement support for imagehash.phash. Given the initial kernel framework in this PR, the potential is there to extend support to any of those hashing algorithms, or indeed any other novel approach from the community... well, that's the grand hope and intention here.

Closes #146

@bjlittle
Contributor Author

Ping @Cadair @astrofrog @dopplershift

f'{kernel_name!r} does not match configured runtime '
f'kernel {self.kernel.name!r}.')
pytest.fail(emsg)

Contributor Author

@bjlittle bjlittle May 19, 2022


It's important to have the name of the kernel that created the hash library as metadata within the hash library.

This will avoid situations where the configured runtime kernel is different to the kernel that generated the hashes in the library.

Note that the choice of META_HASH_LIBRARY_KERNEL name, i.e. pytest-mpl-kernel, intentionally can't parse as a Python identifier, which means it's safe to inject this key into the dictionary, as it's simply not possible for a user of pytest-mpl to decorate a test with such a name... well, that's my thinking anyways 😉
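
For illustration, a phash hash library would then look something like this (the test entries and hash values are made up; only the pytest-mpl-kernel key name comes from this PR):

```python
import json

hash_library = {
    # Injected metadata: not a valid Python identifier, so it can never
    # collide with a decorated test name.
    "pytest-mpl-kernel": "phash",
    # Made-up test entries and hashes.
    "test_module.test_line_plot": "983b1f0ec4d2a671...",
    "test_module.test_scatter": "0aa55ff30c96e1b8...",
}
print(json.dumps(hash_library, indent=2))
```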

Member


Should we also record the high frequency factor used to generate the hash library? (We can probably determine the hash size used from the hashes themselves.) That way, similar to what I was saying above about the kernel name, we can dynamically set the default hash size and high frequency factor to the values used when generating the hash library (only if these have not been explicitly configured by the user as the CLI or ini option).

What do you think of this approach of getting the default hash settings from the hash library?
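
For reference, recovering the hash size from a stored hex hash is straightforward; an illustrative sketch, not code from this PR:

```python
import math


def hash_size_from_hex(hex_hash: str) -> int:
    # A phash with hash_size N has N * N bits, i.e. N * N / 4 hex characters,
    # so N can be recovered from the string length alone.
    return int(math.sqrt(len(hex_hash) * 4))


assert hash_size_from_hex("f" * 64) == 16  # a 256-bit hash implies hash_size 16
```

The high frequency factor, on the other hand, can't be recovered from the hash itself, so it would need to be stored in the library alongside the kernel name.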

Contributor Author


Done. See 5958c51 👍

@bjlittle bjlittle marked this pull request as draft May 19, 2022 16:04
pytest_mpl/plugin.py (outdated, resolved)
@Cadair
Contributor

Cadair commented May 19, 2022

Thanks for the PR @bjlittle it looks great. It will probably be next week before I have time to go through it properly.

@bjlittle
Contributor Author

@Cadair Okay, so I appear to have opened a big old can of worms testing-wise, by:

  • updating the summary dictionary to contain the three new necessary entries for kernel, hamming_distance and hamming_tolerance
  • tweaking and changing various status and summary messages

I guess at this point it would be good to know whether my suggested changes for the status messages et al are acceptable.

Once I've accommodated any review alterations from yourself, I'm guessing that I'll then need to recreate/respin the expected tests.subtests.summaries.*.json files used by tests.subtests.test_subtests, which mostly all seem pretty upset with me atm 😉

I may need some guidance on how to go about doing that... unless it's captured in some dev docs somewhere?

@ConorMacBride
Member

Hi @bjlittle, the tests in tests/subtests/ are very tedious to update (which I realised in #145 after adding only three new tests). All the tests/subtests/ tests were added in #134 so we can more fully test how the plugin is running internally and verify the outputs under a range of different configurations. In a separate issue/PR this should be discussed/improved, so I would recommend just disabling all the tests in tests/subtests/ for this PR. I'll think of a way to make them easier to update, and at the very least I'll add more documentation!

value = marker.kwargs.get(self.option)
if value is not None:
    # Override with the decorator marker value.
    self.hamming_tolerance = int(value)
Contributor Author


I guess this behaviour is relevant to the discussion over on #154

@bjlittle bjlittle mentioned this pull request May 21, 2022
@ConorMacBride ConorMacBride mentioned this pull request May 27, 2022
import matplotlib.ft2font
import pytest
from packaging.version import Version

from .helpers import assert_existence, diff_summary, patch_summary

# TODO: Remove this skip, see issue https://github.com/matplotlib/pytest-mpl/issues/159
pytest.skip(reason="temporarily disabled sub-tests", allow_module_level=True)

Contributor Author

@bjlittle bjlittle May 28, 2022


@ConorMacBride I'll remove this and rebase when #163 is merged, then re-spin the test data 👍

Member


Thanks! Out of curiosity, I tried that and it generates the baseline data correctly. The only modification that's needed to the baseline generation functions is #163 (comment)

Member

@ConorMacBride ConorMacBride left a comment


This is brilliant! 🎉 I've gone through it all and left some comments. Otherwise, I think this feature is ready. Updating the HTML could be a separate PR if that would be easier for reviewing? (As is, the HTML output will work, but it won't say which hash kernel is used.)

Also (as a separate PR), I think phash should be made the default. As it's more tolerant of differences between matplotlib versions, it will make pytest-mpl easier to use, which would encourage people to use it. (And those who want the stricter SHA hashing can configure that.) As mentioned in other issues, the whole pytest-mpl API should be reviewed to decide what other breaking changes should be included in v1.0.0.

pytest_mpl/kernels.py (resolved)
pytest_mpl/kernels.py (resolved)
setup.cfg (resolved)
tests/test_pytest_mpl.py (resolved)
pytest_mpl/plugin.py (outdated, resolved)

@bjlittle
Contributor Author

bjlittle commented May 29, 2022

@ConorMacBride Awesome! I'll address your review comments asap.

But I might take you up on your suggestion for splitting this PR up into a couple of follow-on PRs. Might be a wise strategy, given that it'll probably get noisy after reinstating the subtests (with a #163 rebase) 👍

@bjlittle
Contributor Author

bjlittle commented Jun 3, 2022

@ConorMacBride Okay, so I've addressed your review comments... and generalised the approach to serializing metadata for the kernel within the hash library and results json. This should extend nicely when we come to add more kernel support for other algorithms.

I guess at this point we're now blocked on this PR until #163 is banked.

That said, if you guys are looking for another pytest-mpl core dev, then I'm more than willing to help out and contribute. I'm completely invested in pytest-mpl now, given this PR 👍

@bjlittle bjlittle changed the title WIP: Perceptual hash Perceptual hash Jun 4, 2022
@bjlittle bjlittle marked this pull request as ready for review June 4, 2022 07:00
@dopplershift
Contributor

I don't know that we have a procedure in place, but I'm happy to add @bjlittle. I'd appreciate having a 👍 from someone else though.

@ConorMacBride
Member

ConorMacBride commented Aug 22, 2022

Hi @bjlittle, do you think you'll have time to complete this PR? I think it just needs the disabled summary tests to be re-enabled, the branch rebased, and the new configuration documented.

(I can do the rebase if you like? You probably know best for the documentation.)

I think it's well tested and the extra summary tests can be added in a separate PR. Full integration with the HTML summary can also be separate (it should work, but will not currently specify the type of hash for example).

@bjlittle
Contributor Author

bjlittle commented Sep 7, 2022

Hey @ConorMacBride,

Major apologies for being AWOL. I've recently been maxed out on other project work and have just come back from a much-needed long summer vacation.

I'm aiming to spin back up on this PR and catch-up on all the recent awesome changes on pytest-mpl.

Super keen to get this one banked 👍

@ConorMacBride
Member

Hi @bjlittle, would you like me to rebase this for you? I'm still very keen to get this feature in!

@bjlittle
Contributor Author

bjlittle commented Apr 5, 2023

@ConorMacBride Hey, thanks for the nudge, again.

I've been heavily distracted by another major project, but excuses aside, let's get this banked 👍

I'm going to work on this next week to get this PR across the line. I may ping you for some support to expedite it, if I get stuck. Is that alright?

ConorMacBride added a commit to astrofrog/pytest-mpl that referenced this pull request May 29, 2023
@bjlittle bjlittle mentioned this pull request Oct 17, 2023
@bjlittle bjlittle self-assigned this Nov 6, 2023
@bjlittle
Contributor Author

bjlittle commented Nov 20, 2023

@ConorMacBride I'm closing this PR and replacing it with another

@bjlittle bjlittle closed this Nov 20, 2023