
Rethink the testing mechanism for images #963

Closed
seisman opened this issue Feb 24, 2021 · 21 comments
Labels
maintenance Boring but important stuff for the core devs

Comments

@seisman
Member

seisman commented Feb 24, 2021

If you're unclear about how PyGMT tests images, please read the "Testing plots" section in the contributing guides first.


In short, for image-based tests, we need to specify a baseline/reference image. Whenever we change the code, we generate a new "test" image and compare it with the baseline image. If the two images differ, we know the changes break the tests. The most important thing is to ensure that the baseline images are correct.

Currently, we have two different methods to generate baseline images and compare test images against them:

  1. using the decorator @pytest.mark.mpl_image_compare
  2. using the decorator @check_figures_equal()

The @pytest.mark.mpl_image_compare method is the most straightforward way to do image testing. With this decorator, we generate baseline images, check their correctness, and store them in the repository (https://github.com/GenericMappingTools/pygmt/tree/master/pygmt/tests/baseline).
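
For reference, a typical test with this decorator looks roughly like the sketch below (a minimal illustration, not a test from the actual suite; pytest-mpl compares the figure returned by the test function against a same-named PNG in the baseline directory when pytest is run with the --mpl flag):

import pytest

from pygmt import Figure


@pytest.mark.mpl_image_compare
def test_basemap():
    """
    Create a simple basemap plot; the result is compared against
    pygmt/tests/baseline/test_basemap.png.
    """
    fig = Figure()
    fig.basemap(region=[0, 10, 0, 10], projection="X10c/10c", frame="af")
    return fig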

Pros:

  1. We can visually check the baseline images to make sure they are correct

Cons:

  1. We have to store static PNG images in the repository, so the repository size grows quickly.

To avoid storing many large static images in the repository, we (mainly @weiji14 and @seisman) had some discussions (in #451, #522) and developed the @check_figures_equal decorator (#555, #590, #600).

Below is an example test using the @check_figures_equal() decorator:

@check_figures_equal()
def test_basemap_polar():
    """
    Create a polar basemap plot.
    """
    fig_ref, fig_test = Figure(), Figure()
    # Use single-character arguments for the reference image
    fig_ref.basemap(R="0/360/0/1000", J="P6i", B="afg")
    fig_test.basemap(region=[0, 360, 0, 1000], projection="P6i", frame="afg")
    return fig_ref, fig_test

In this example, the baseline/reference image fig_ref is generated using basemap(R="0/360/0/1000", J="P6i", B="afg"), while the test image fig_test is generated using basemap(region=[0, 360, 0, 1000], projection="P6i", frame="afg"). We can't see what the baseline image looks like, but we're reasonably confident that it is correct, because the basemap wrapper is very simple.
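
Under the hood, the decorator essentially rasterizes both figures and compares the two PNGs pixel by pixel. A simplified sketch of the idea (not PyGMT's actual implementation, which also handles pytest integration and cleanup) using matplotlib's compare_images helper:

import functools
from tempfile import TemporaryDirectory

from matplotlib.testing.compare import compare_images


def check_figures_equal(tol=0.0):
    """
    Simplified sketch: run the wrapped test, save the reference and test
    figures it returns to PNG, and fail if the images differ beyond tol.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            fig_ref, fig_test = func(*args, **kwargs)
            with TemporaryDirectory() as tmpdir:
                ref_png = f"{tmpdir}/ref.png"
                test_png = f"{tmpdir}/test.png"
                fig_ref.savefig(ref_png)
                fig_test.savefig(test_png)
                # compare_images returns None when the images match
                err = compare_images(ref_png, test_png, tol=tol)
                assert err is None, err

        return wrapper

    return decorator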

Pros:

  1. We don't need to store static images in the repository, which keeps the repository size small

Cons:

  1. For each test, we have to generate two images (baseline and test images), which doubles the execution time
  2. We can't visually check the correctness of the baseline images
  3. If we decide to disallow single-character parameters (i.e., J="X10c/10c" would be invalid) as proposed in Disallow single character arguments #262 (also related to Fail for invalid input arguments #256), then most of the code for generating reference images becomes invalid.

For some complicated wrappers, we can't even easily tell whether the reference image is correct. For example:

@check_figures_equal()
def test_subplot_direct():
    """
    Plot map elements to subplot directly using the panel parameter.
    """
    fig_ref, fig_test = Figure(), Figure()
    with fig_ref.subplot(nrows=2, ncols=1, Fs="3c/3c"):
        fig_ref.basemap(region=[0, 3, 0, 3], frame="af", panel=0)
        fig_ref.basemap(region=[0, 3, 0, 3], frame="af", panel=1)
    with fig_test.subplot(nrows=2, ncols=1, subsize=("3c", "3c")):
        fig_test.basemap(region=[0, 3, 0, 3], frame="af", panel=[0, 0])
        fig_test.basemap(region=[0, 3, 0, 3], frame="af", panel=[1, 0])
    return fig_ref, fig_test

In this test, we expect that the baseline image has a 2-row-by-1-column subplot layout. However, if we make a silly mistake in Figure.subplot, resulting in a 1-row-by-2-column layout, the test still passes, because both the baseline and test images have the same "wrong" layout. Then the test is useless to us.


Almost every plotting library has to decide whether to store static images in its repository. There are similar discussions in the upstream GMT project (GenericMappingTools/gmt#3470) and the matplotlib project (matplotlib/matplotlib#16447).


As we have more active developers now, I think we should rethink how we want to test PyGMT.

@seisman seisman added the question Further information is requested label Feb 24, 2021
@willschlitzer
Contributor

I'm a fan of switching to @check_figures_equal(), although I understand the difficulties associated with having our testing functions make so many plots. I mostly like the idea of not keeping a large collection of reference plots in the repository, especially when the plotting output changes, like in GMT 6.2.

That being said, I have begun to realize that we're often not testing very much when we just pass the same parameter to both the single-letter and long-form PyGMT arguments and check that the plots turn out the same. As we've discussed in #771, the thing we should be most concerned about is the "Python" parts of the function, not the aliases that pass arguments to the GMT API. I think we can reduce the testing workload by consolidating the tests that test nothing more than aliases (which includes many of the tests I've written) and focusing on the Python parts (my recent example is testing the Python parts of solar to make sure the terminator and datetime inputs produce the same result as a GMT string passed to the T parameter).

@seisman
Member Author

seisman commented Feb 25, 2021

the thing we should be most concerned about is the "Python" parts of the function, not the aliases that pass arguments to the GMT API. I think we can reduce the testing workload by consolidating the tests that test nothing more than aliases and focusing on the Python parts.

Yes, I agree. This is what we must do.

Still, the biggest challenge is how to make sure that the baseline images are correct.


We had some discussions in #451.

One solution is storing static images in a separate repository (e.g. pygmt_baseline_image), but it is too complicated when we want to add a new test.

Should we start using submodules? I.e. split the pygmt/tests/baseline folder into a separate repository (see https://docs.github.com/en/github/using-git/splitting-a-subfolder-out-into-a-new-repository).

I've been reading up on git submodules/subtrees/git-lfs and there doesn't seem to be an easy way to do this, there will be a learning curve in any case. Matplotlib currently has a big PR at matplotlib/matplotlib#17557 to move their baseline images into a separate place, and I really do not want myself or anyone to handle that in X years.

It's too complicated. When we add a test, we have to open two separate PRs in two repositories, one for the baseline images and one for the tests. How can the tests PR know it should get the new baseline images in the corresponding branch?

Yeah, and I don't think it will be friendly for new contributors either. Surely there must be a better way to store the images, or test them


Another solution is generating baseline images by directly calling Session.call_module(). It's almost testing the equivalence of a Python script (using PyGMT syntax) and a bash script (using GMT CLI). This is the method I prefer.

For other non-grd tests, perhaps we could generate the reference images by directly passing arguments to GMT modules. For example,

fig_test.basemap(region=[0, 10, 0, 10], projection='X10c/10c', frame=['xaf', 'yaf', 'WSen'])

should be identical to the reference image generated by:

lib.call_module("basemap", "-R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen")
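
Put together, such a test could look like the sketch below (assuming the existing check_figures_equal helper, imported here from pygmt.helpers.testing; note that a Figure activates itself on creation, so the raw call_module calls must happen before fig_test is created):

from pygmt import Figure
from pygmt.clib import Session
from pygmt.helpers.testing import check_figures_equal


@check_figures_equal()
def test_basemap_against_call_module():
    """
    The basemap wrapper should produce the same image as the raw GMT
    command line passed through Session.call_module().
    """
    # Reference image: raw GMT CLI arguments
    fig_ref = Figure()
    with Session() as lib:
        lib.call_module("basemap", "-R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen")
    # Test image: long-form PyGMT parameters
    fig_test = Figure()
    fig_test.basemap(
        region=[0, 10, 0, 10], projection="X10c/10c", frame=["xaf", "yaf", "WSen"]
    )
    return fig_ref, fig_test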

@willschlitzer
Contributor

willschlitzer commented Feb 25, 2021

Another solution is generating baseline images by directly calling Session.call_module(). It's almost testing the equivalence of a Python script (using PyGMT syntax) and a bash script (using GMT CLI). This is the method I prefer.

For other non-grd tests, perhaps we could generate the reference images by directly passing arguments to GMT modules. For example,

fig_test.basemap(region=[0, 10, 0, 10], projection='X10c/10c', frame=['xaf', 'yaf', 'WSen'])

should be identical to the reference image generated by:

lib.call_module("basemap", "-R0/10/0/10 -JX10c/10c -Bxaf -Byaf -BWSen")

I like this idea. I think it more effectively tests that the aliases and Python functions line up with the expected outcome from GMT, as opposed to checking whether passing the same arguments to PyGMT twice produces different results. We assume that if the "correct" inputs are sent to GMT, the figure will turn out as expected, much like a reference image. The downside is that it requires contributors to learn GMT commands, but I don't think that is too advanced for someone wrapping a new module.

Why wouldn't this also be applicable for grd tests? Since we use standard GMT-hosted grids, wouldn't we be able to add @earth_relief in the string passed to lib.call_module()?

How would this work with @check_figures_equal()? The GMT 6.2 release seems like an ideal time to make this switch, since we will have to either update the tests or update the reference images.

@seisman
Member Author

seisman commented Feb 25, 2021

We assume that if the "correct" inputs are sent to GMT, the figure will turn out as expected, much like a reference image.

Yes, that sounds like a reasonable and valid assumption.

Why wouldn't this also be applicable for grd tests? Since we use standard GMT-hosted grids, wouldn't we be able to add @earth_relief in the string passed to lib.call_module()?

It can also be applied to grid tests.

How would this work with @check_figures_equal()?

@check_figures_equal()
def test_grd2cpt(grid):
    """
    Test creating a CPT with grd2cpt to create a CPT based off a grid input and
    plot it with a color bar.
    """
    fig_ref, fig_test = Figure(), Figure()
    # Use single-character arguments for the reference image
    fig_ref.basemap(B="a", J="W0/15c", R="d")
    grd2cpt(grid="@earth_relief_01d")
    fig_ref.colorbar(B="a2000")
    fig_test.basemap(frame="a", projection="W0/15c", region="d")
    grd2cpt(grid=grid)
    fig_test.colorbar(frame="a2000")
    return fig_ref, fig_test

Taking this test (written by you) as an example: I think I mentioned before that the test may still pass even if grd2cpt() doesn't work as expected. The test can be rewritten as:

from pygmt.clib import Session


@check_figures_equal()
def test_grd2cpt(grid):
    """
    Test creating a CPT with grd2cpt to create a CPT based off a grid input and
    plot it with a color bar.
    """
    # reference image
    fig_ref = Figure()
    with Session() as lib:
        lib.call_module("basemap", "-Ba -JW0/15c -Rd")
        lib.call_module("grd2cpt", "@earth_relief_01d")
        lib.call_module("colorbar", "-Ba2000")

    # test image
    fig_test = Figure()
    fig_test.basemap(frame="a", projection="W0/15c", region="d")
    grd2cpt(grid=grid)
    fig_test.colorbar(frame="a2000")

    return fig_ref, fig_test

@willschlitzer
Contributor

@GenericMappingTools/python-contributors Does anyone have opinions on this? I'm in support of @seisman's example using with Session() as lib.

@core-man
Member

As a new PyGMT user and an old GMT user, the testing method using with Session() as lib seems better to me.

If the right figure cannot be generated by PyGMT, there are two possible reasons: (1) bugs in PyGMT, or (2) bugs in GMT. We should fix the former in PyGMT and report the latter upstream.

But if the PyGMT project plans to develop more functions that do not exist in GMT, this testing mechanism will not work.

@seisman
Member Author

seisman commented Feb 28, 2021

But if the PyGMT project plans to develop more functions that do not exist in GMT, this testing mechanism will not work.

Any new function in PyGMT would rely on GMT, so we can always find equivalent GMT command lines. For example, for the new Figure.hlines() (#923) function, we can call the low-level GMT plot module to generate the reference images.
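
For illustration, a test for the proposed Figure.hlines() could look something like this sketch (the hlines signature is hypothetical, since #923 is still a proposal; the reference image uses the existing Figure.plot wrapper around GMT's plot module):

from pygmt import Figure
from pygmt.helpers.testing import check_figures_equal


@check_figures_equal()
def test_hlines():
    """
    A horizontal line drawn by the hypothetical Figure.hlines() should
    match one drawn via the existing wrapper of GMT's "plot" module.
    """
    fig_ref, fig_test = Figure(), Figure()
    fig_ref.basemap(region=[0, 10, 0, 10], projection="X10c/10c", frame="af")
    fig_ref.plot(x=[0, 10], y=[5, 5], pen="1p,black")
    fig_test.basemap(region=[0, 10, 0, 10], projection="X10c/10c", frame="af")
    fig_test.hlines(y=5, pen="1p,black")  # hypothetical signature
    return fig_ref, fig_test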

@michaelgrund
Member

But if the PyGMT project plans to develop more functions that do not exist in GMT, this testing mechanism will not work.

Any new function in PyGMT would rely on GMT, so we can always find equivalent GMT command lines. For example, for the new Figure.hlines() (#923) function, we can call the low-level GMT plot module to generate the reference images.

You're right; so far I haven't made use of this approach in the tests to compare images as you suggested, @seisman. Hopefully I'll have time to work on it this coming weekend.

@willschlitzer
Contributor

@seisman Should we begin rewriting the tests, or should we wait until the GMT 6.2 release? I'm assuming we want to prioritize rewriting the tests that use @pytest.mark.mpl_image_compare, but ideally also update the @check_figures_equal tests to use with Session() as lib?

@weiji14
Member

weiji14 commented Mar 2, 2021

Not to throw a spanner into the works, but do we want to reconsider using @pytest.mark.mpl_image_compare? I've mentioned before at #451 (comment) the idea of storing the PNG images offsite while committing the 'hash' of each image here in the pygmt repo, using tools like git-lfs.

There are also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using GitHub Actions (e.g. a bot checks whether images differ from the baseline, and we can do /yes-bot or /no-bot to update the image). We might be able to host them on GitHub Artifacts or the like.

Alternatively, I wonder if storing SVG instead of PNG would make things lighter?

And yes, all of this should be done closer to the GMT 6.2.0 release. We have the GMT dev tests set up on CI for exactly this, and we should be able to fix most tests before the actual GMT 6.2.0 package is out on conda-forge.

@seisman
Member Author

seisman commented Mar 3, 2021

Not to throw a spanner into the works, but do we want to reconsider using @pytest.mark.mpl_image_compare? I've mentioned before at #451 (comment) the idea of storing the PNG images offsite while committing the 'hash' of each image here in the pygmt repo, using tools like git-lfs.

I agree that @pytest.mark.mpl_image_compare is the most accurate way to compare reference and test images. As you said, if we can store images offsite, @pytest.mark.mpl_image_compare is the best solution.

There are also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using GitHub Actions (e.g. a bot checks whether images differ from the baseline, and we can do /yes-bot or /no-bot to update the image). We might be able to host them on GitHub Artifacts or the like.

Not sure if it really works for us. How can we download and update baseline images if we want to run tests locally?

Alternatively, I wonder if storing SVG instead of PNG would make things lighter?

Unfortunately, GMT doesn't support SVG anymore (recent Ghostscript versions dropped SVG support). Even if GMT could save figures in SVG format, I doubt it would help much. The GMT project stores PS files (ASCII) in its repository, and the repository size still grows quickly, because raster images (especially figures generated by grdimage) are embedded as binary data in the ASCII PS files. (Not sure if I've explained it clearly, but you can plot an image using grdimage and open the output in a text editor.)

@weiji14
Member

weiji14 commented Mar 3, 2021

There are also solutions like dvc (basically 'git' but for data) that are maturing quite nicely and might be worth considering, especially if we can automate things using GitHub Actions (e.g. a bot checks whether images differ from the baseline, and we can do /yes-bot or /no-bot to update the image). We might be able to host them on GitHub Artifacts or the like.

Not sure if it really works for us. How can we download and update baseline images if we want to run tests locally?

The hashes of the images will be stored in a .dvc file. To download/update the PNG images locally, use dvc pull (similar to git pull). Adding files is done through dvc add (similar to git add). In fact, most of the dvc commands mirror git (see also https://realpython.com/python-data-version-control/), so the learning curve shouldn't be too steep (hopefully). They also have a Python API at https://dvc.org/doc/api-reference that we could plug into.
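
For example, fetching a dvc-tracked baseline image programmatically might look roughly like this (the path and repo URL are illustrative):

import dvc.api

# Read a dvc-tracked baseline PNG from the remote storage backing the repo
image_bytes = dvc.api.read(
    path="pygmt/tests/baseline/test_basemap.png",
    repo="https://github.com/GenericMappingTools/pygmt",
    mode="rb",
)
with open("test_basemap.png", "wb") as f:
    f.write(image_bytes)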

I'll probably need to open up a demo PR to illustrate how things would work, but things we'll need to do are:

  1. Think about where to store the PNG images in the cloud.
  2. Change our testing workflow (for plotting images) back to @pytest.mark.mpl_image_compare

@seisman
Member Author

seisman commented Mar 3, 2021

I'll probably need to open up a demo PR to illustrate how things would work

Yes, that would be better.

  1. Think about where to store the PNG images in the cloud.

If it works, can we just store the PNG images in another github repository?

@weiji14
Member

weiji14 commented Mar 5, 2021

  1. Think about where to store the PNG images in the cloud.

If it works, can we just store the PNG images in another github repository?

I will need to have a think about where to store things as I create that PR. I probably won't have time to do this until v0.4.0 though.

Edit: I've just mirrored the PyGMT repo at https://dagshub.com/GenericMappingTools/pygmt. DAGsHub is a web platform for data version control (see their FAQ). Give me a few days or weeks and I'll try to get a pipeline of some sort set up for us to start uploading images!

@weiji14 weiji14 added this to To do in Release v0.4.x via automation Mar 5, 2021
@weiji14 weiji14 added this to the 0.4.0 milestone Mar 5, 2021
@weiji14
Member

weiji14 commented Mar 18, 2021

Ok, #1036 has been merged, which sets up data version control (dvc) for the PyGMT repo. The new dvc-based workflow addresses the main con of using @pytest.mark.mpl_image_compare (storing large images in the repo) by storing them on https://dagshub.com/GenericMappingTools/pygmt instead (everyone, please create an account there using your GitHub login).

We will slowly migrate the tests from @check_figures_equal to @pytest.mark.mpl_image_compare. Instructions are documented at https://github.com/GenericMappingTools/pygmt/blob/master/CONTRIBUTING.md#using-data-version-control-dvc-to-manage-test-images. The test migration will proceed as follows over the next few weeks/months:

  1. After PR #1036 (Initialize data version control for managing test images), change the recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare
  2. Bump the minimum GMT version to 6.2.0
  3. Fix all the test images that have changed, storing the new test images with dvc on DAGsHub
  4. Optional: migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (prioritizing the slow tests reported in #835 (Show test execution times in pytest) and #840 (Improve some tests to speed up the CI))
  5. Optional: fully deprecate @check_figures_equal(), removing it from the codebase and from CONTRIBUTING.md, and also close matplotlib/pytest-mpl#95 (Directly check if two figures returned by a function are equal)?
  6. Write new tests for new functionality using @pytest.mark.mpl_image_compare only

Originally posted by @weiji14 in #1036 (comment)

  1. Update the install instructions, because pygmt.test() won't work for users.
  2. Maybe add the baseline images as a release asset when making releases.

Originally posted by @seisman in #1036 (comment)

I'd encourage everyone to use it for their open PRs when creating test images, and feel free to ask questions if anything is unclear!

@maxrjones
Member

Wow, @weiji14 and @seisman this looks fantastic! Great job! I'm excited to try it out and find out whether it could be a solution for core-gmt's testing woes as well 😄

@seisman
Member Author

seisman commented Mar 18, 2021

@weiji14 Perhaps you could open a separate issue or several issues with a list of TODOs so that people who want to help have a better idea of what to do.

  1. After PR #1036 (Initialize data version control for managing test images), change the recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare
  2. Bump the minimum GMT version to 6.2.0
  3. Fix all the test images that have changed, storing the new test images with dvc on DAGsHub
  4. Optional: migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (prioritizing the slow tests reported in #835 (Show test execution times in pytest) and #840 (Improve some tests to speed up the CI))

Question: Do we want to do the migration before GMT 6.2.0 or after? I prefer doing it before v6.2.0, although that means more work for us. After bumping GMT to 6.2.0, most tests will fail due to the changes in GMT 6.2.0, but I feel it's also a good opportunity for us to learn about the GMT changes and find potential bugs by comparing the images generated by GMT 6.1.1 and 6.2.0.

FYI, one week ago, I built the PyGMT documentation using GMT 6.2.0, and found several issues with the GMT dev version (GenericMappingTools/gmt#4955), and they were all fixed in less than one week!

  1. Optional: fully deprecate @check_figures_equal(), removing it from the codebase and from CONTRIBUTING.md, and also close matplotlib/pytest-mpl#95 (Directly check if two figures returned by a function are equal)?

I think we may still need @check_figures_equal() in some cases, especially for grid plotting functions. You may remember that we found some upstream bugs by checking whether the images generated from a file and from an xarray.DataArray are the same. Upstream bugs can sometimes cause tiny differences between these two images that are difficult to identify by visually checking baseline images.
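
As a concrete illustration, that kind of grid test looks roughly like the sketch below (the grdimage arguments are chosen for illustration):

from pygmt import Figure
from pygmt.datasets import load_earth_relief
from pygmt.helpers.testing import check_figures_equal


@check_figures_equal()
def test_grdimage_file_vs_dataarray():
    """
    Plotting a remote grid file and the equivalent xarray.DataArray
    should give identical images; tiny mismatches between the two have
    exposed upstream bugs before.
    """
    fig_ref, fig_test = Figure(), Figure()
    fig_ref.grdimage("@earth_relief_01d", projection="W0/15c", frame="af")
    grid = load_earth_relief(resolution="01d")
    fig_test.grdimage(grid, projection="W0/15c", frame="af")
    return fig_ref, fig_test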

@maxrjones
Member

@weiji14, could I please get added on DAGshub?

@weiji14
Member

weiji14 commented Mar 20, 2021

@weiji14, could I please get added on DAGshub?

Ok, done.

@seisman
Member Author

seisman commented Apr 11, 2021

After this PR #1036, change recommended way of testing from @check_figures_equal() to @pytest.mark.mpl_image_compare

Done in #1108.

Bump minimum GMT version to 6.2.0

Waiting for the GMT v6.2.0 release.

Fix all the test images that have changed, storing the new test images with dvc on DAGsHub

Tracked by issue #1131.

Optional - Migrate @check_figures_equal tests to @pytest.mark.mpl_image_compare (can prioritize the slow tests as reported in #835/#840)

Tracked by issue #1131.

Optional - Fully deprecate @check_figures_equal(), removing it from codebase and documentation in CONTRIBUTING.md, also close matplotlib/pytest-mpl#95?

I think we still need it when testing grids.

Write new tests for new functionality using @pytest.mark.mpl_image_compare only

Yes, it's already documented in the contributing guides.

Update the install instructions, because pygmt.test() won't work for users.

I just opened issue #1200 for discussions.

Maybe add the baseline images as a release asset when making releases.

I just opened issue #1201 for discussions.

@seisman seisman added maintenance Boring but important stuff for the core devs and removed question Further information is requested labels Apr 11, 2021
@seisman
Member Author

seisman commented Apr 11, 2021

I think we can close the issue.

@seisman seisman closed this as completed Apr 11, 2021
Release v0.4.x automation moved this from To do to Done Apr 11, 2021