Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default hash savefig format #160

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

bjlittle
Copy link
Contributor

@bjlittle bjlittle commented May 21, 2022

This PR addresses issue #152, by ensuring that a default format='png' is injected to the savefig kwargs when generating a hash.

Closes #152

@bjlittle
Copy link
Contributor Author

Okay, wowzers... can anyone explain the test failure behaviour of py310 with mpl35 across all platforms ? 😕

@ConorMacBride
Copy link
Member

Very strange. It seems that passing format='png' to savefig changes the hashes. I wonder if a default has changed in 3.5?

@ConorMacBride
Copy link
Member

The hashes are different on main also (testing py310-test-mpl35 on macOS locally) so not related to this PR. Matplotlib 3.5.2 was released 19 days ago so my guess is that something has changed there.

@ConorMacBride
Copy link
Member

Yes, the tests pass on MPL 3.5.1. Can you pin this to 3.5.1 rather than 3.5.*?

mpl35: matplotlib==3.5.*

Hopefully this is something that the perceptual hashes can solve, as we really should be testing against the latest bug fixes from matplotlib!

@bjlittle
Copy link
Contributor Author

Awesome detective work 🕵️

@bjlittle
Copy link
Contributor Author

bjlittle commented May 21, 2022

@ConorMacBride If the sha hashes are sensitive to some change from mpl>3.5.1, what does that mean for default hash testing? Switch to phash?

What are the implications to the community if we bank this new default format behaviour in this PR? Isn't this a breaking change in behaviour?

@@ -26,7 +26,7 @@ deps =
mpl32: matplotlib==3.2.*
mpl33: matplotlib==3.3.*
mpl34: matplotlib==3.4.*
mpl35: matplotlib==3.5.*
mpl35: matplotlib==3.5.1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ConorMacBride We should make an issue to unpin this, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need to change this.

@ConorMacBride
Copy link
Member

@ConorMacBride If the sha hashes are sensitive to some change from mpl>3.5.1, what does that mean for default hash testing? Switch to phash?

sha hashes are definitely too sensitive. It would be great if such a constrained testing environment wasn't necessary as that would make the package much easier to use. I would need to experiment more with phash to get an understanding about the sensitivity. Although it would be up to the maintainers (astrofrog, Cadair and dopplershift) to decide.

I presume the default hash size, high frequency factor and hamming tolerance can be tuned to a value that minimises, what an average user would consider, a test passing incorrectly? (I would expect most of the phash examples here should result in a failed hash comparison.)

I suspect phash may not be appropriate for tests that need to be particularly colour sensitive? All images are converted to 8-bit grayscale so, for example, this image would be similar to an image where all the values have been multiplied by -1 (based on what the bwr colormap looks like in grayscale).

What are the implications to the community if we bank this new default format behaviour in this PR? Isn't this a breaking change in behaviour?

From reading the savefig documentation, the only way I can see this being a breaking change is if someone has a different value set in savefig.format rcparams or they are using a different backend (by passing a backend kwarg to the pytest-mpl decorator, which is an undocumented feature).

backend = compare.kwargs.get('backend', 'agg')

I've added some code below. In the default case, both hashes are the same so no breaking change. However, if a custom backend has been specified, passing format="png" seems to force Agg which would be a breaking change. However, svg changes hashes every time (due to random ids referencing objects), and ps has similar behaviour. So it might be unlikely that someone is changing the backend while using hash comparison.

So, if the user asks for a different backend, setting format="png" would have the effect of ignoring that when doing hash comparison. For phash, the image data needs to be in a format that pillow/PIL can read although I think it should work for more than just png?

To not interfere with the requested backend, I think we should only set format="png" if a backend kwarg isn't set. Or maybe we should not be setting format="png" as that's the default unless the user has requested something different?

Click for code
import io
import hashlib
import matplotlib
import matplotlib.pyplot as plt


def _hash_file(in_stream):
    """
    Hashes an already opened file.
    """
    in_stream.seek(0)
    buf = in_stream.read()
    hasher = hashlib.sha256()
    hasher.update(buf)
    return hasher.hexdigest()


def hash(**kwargs):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot([1, 3, 2, 4])
    imgdata = io.BytesIO()
    fig.savefig(imgdata, **kwargs)
    plt.close()
    return _hash_file(imgdata)


# Default (pytest-mpl sets agg)
matplotlib.use("agg")
print(hash())
# b7d7754ed51a78949e119e3900b72b4123fcc3674fd6a5750b9a93f7091293ab
print(hash(format="png"))
# b7d7754ed51a78949e119e3900b72b4123fcc3674fd6a5750b9a93f7091293ab

# User passes backend="svg" to pytest-mpl decorator
metadata = {"Creator": None, "Date": None, "Format": None, "Type": None}
matplotlib.use("svg")
print(hash(metadata=metadata))
# 81910e47aa39455d96c41893f73793c65b5405af14af7919dbe8a7172e004fcf
print(hash(metadata=metadata))
# 0bf13ec69c253c6a8d3b6b004aa4855e343e5f40a79106e87fee72c208f7de89
print(hash(format="png"))
# b7d7754ed51a78949e119e3900b72b4123fcc3674fd6a5750b9a93f7091293ab

@bjlittle
Copy link
Contributor Author

sha hashes are definitely too sensitive. It would be great if such a constrained testing environment wasn't necessary as that would make the package much easier to use. I would need to experiment more with phash to get an understanding about the sensitivity. Although it would be up to the maintainers (astrofrog, Cadair and dopplershift) to decide.

Agreed.

@bjlittle
Copy link
Contributor Author

bjlittle commented May 23, 2022

I presume the default hash size, high frequency factor and hamming tolerance can be tuned to a value that minimises, what an average user would consider, a test passing incorrectly? (I would expect most of the phash examples here should result in a failed hash comparison.)

That's pretty much my experience so far. We've opted for a hash_size=16 (which gives a 256-bit hash) highfreq_factor=4 (imagehash.phash default) and a hamming distance tolerance of <=2 for a pass with phash, and that suits our needs.

@bjlittle
Copy link
Contributor Author

bjlittle commented May 23, 2022

I suspect phash may not be appropriate for tests that need to be particularly colour sensitive? All images are converted to 8-bit grayscale so, for example, this image would be similar to an image where all the values have been multiplied by -1 (based on what the bwr colormap looks like in grayscale).

I've re-ran and updated a notebook that I wrote to initially investigate and play with imagehash, which is available here.

To create a conda environment to run it, simply conda create -n imagehash -c conda-forge jupyter iris imagehash, and download the notebook gist.

Interestingly, changing the colormap on a plot is detected by phash. But yeah, play around and get a feel for what it's offering 👍

@bjlittle
Copy link
Contributor Author

Overall, I wait for you guys to come to a consensus and advise on what to do next here with this PR.

@Cadair
Copy link
Contributor

Cadair commented May 23, 2022

I think matplotlib puts metadata into the png including the version iirc. I think we can turn that off.

See the metadata keyword argument here: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html and more info on png here: https://matplotlib.org/stable/api/backend_agg_api.html#matplotlib.backends.backend_agg.FigureCanvasAgg.print_png

@ConorMacBride
Copy link
Member

Given how effective removing the metadata version number was in #163 I think we should definitely remove it by default and make it a breaking change (for v1.0.0 😉).

This was referenced May 28, 2022
Copy link
Member

@ConorMacBride ConorMacBride left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to set this for the reasons in the second half of #160 (comment). Should we just close this PR?

@@ -26,7 +26,7 @@ deps =
mpl32: matplotlib==3.2.*
mpl33: matplotlib==3.3.*
mpl34: matplotlib==3.4.*
mpl35: matplotlib==3.5.*
mpl35: matplotlib==3.5.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need to change this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

default savefig format
3 participants