
Data.equals: add unit test & migrate to Dask #254

Merged

Conversation

@sadielbartholomew (Member)
Fix #251 & address the Data.equals method for #182.

@sadielbartholomew sadielbartholomew self-assigned this Aug 27, 2021
@sadielbartholomew sadielbartholomew marked this pull request as ready for review September 1, 2021 02:33
cf/data/data.py (four resolved review threads)
@sadielbartholomew (Member Author) commented Sep 3, 2021

Hi @davidhassell, regarding #254 (comment) and the (intermediate, towards this PR) daskification of _numpy_allclose (which I think we should rename, as per our comments in that thread): I have been investigating approaches, and the following is a summary of my thoughts so far.

On _numpy_allclose

Currently (pre-migration):

  1. Checks whether a and b are MaskedArray objects.
  2. If neither is a MaskedArray: np.allclose(a, b) -> done.
  3. Else, if both are MaskedArray: compare the masks; if they differ -> False, done.
  4. Otherwise, only one is a MaskedArray: check whether it has a non-trivial mask (i.e. not all False); if so, the masks differ, so False -> done.
  5. Otherwise, at least one is a MaskedArray but the masks agree, so run np.ma.allclose to compare the data -> done. (A minimal sketch of this logic follows the list.)
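
For reference, a minimal sketch of that pre-migration logic (a reconstruction for illustration, not the verbatim method):

import numpy as np

def _numpy_allclose_sketch(a, b, rtol=1e-05, atol=1e-08):
    # Reconstruction for illustration only, not the verbatim cf method
    a_is_masked = isinstance(a, np.ma.MaskedArray)
    b_is_masked = isinstance(b, np.ma.MaskedArray)
    if not a_is_masked and not b_is_masked:
        # (2) neither is masked: a plain comparison suffices
        return np.allclose(a, b, rtol=rtol, atol=atol)
    if a_is_masked and b_is_masked:
        # (3) both masked: unequal masks mean unequal arrays
        if not np.array_equal(np.ma.getmaskarray(a), np.ma.getmaskarray(b)):
            return False
    elif (a if a_is_masked else b).mask.any():
        # (4) only one is masked: any True in its mask means the masks differ
        return False
    # (5) the masks agree, so compare the data
    return bool(np.ma.allclose(a, b, rtol=rtol, atol=atol))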

To migrate:

I am not yet sure whether this suggestion is the best approach, particularly as I am still getting to grips with the possibilities Dask provides and which are more efficient or advisable in which cases, but it is at least a starting point...

  1. Check the type of the underlying array of a and b (this is type(dx._meta), reported in the repr under the name chunktype) to determine whether either is numpy.MaskedArray rather than numpy.ndarray (those are the only two options here, right, since the method is only applied to numpy arrays, I guess?).
  2. If one is a MaskedArray and the other an ndarray, the result is False -> done.
  3. If both are numpy.ndarray, just use da.allclose -> done.
  4. If both are MaskedArray, it's a bit more complicated, but something along the lines of the following does seem to work, and the visualize graph (included below) seems promising, though of course at both the equal and isclose steps it has to wait on both parallel task routes, which is not ideal. I guess we would want to do the following blockwise, though that is not incorporated into the demo code:
In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: x = np.ma.array([1, 2, 3], mask=[0, 1, 0])

In [4]: y = np.ma.array([999, 2, 3], mask=[1, 0, 0])

In [5]: dx = da.from_array(x)

In [6]: dy = da.from_array(y)

In [7]: dx_mask = da.ma.getmaskarray(dx)

In [8]: dy_mask = da.ma.getmaskarray(dy)

In [9]: dx_data = da.ma.getdata(dx)

In [10]: dy_data = da.ma.getdata(dy)

In [11]: mask_comparison = da.equal(dx_mask, dy_mask)

In [13]: data_comparison = da.isclose(dx_data, dy_data)

In [14]: result = da.all(da.logical_and(mask_comparison, data_comparison))

In [15]: result.visualize(filename="test007.png")
Out[15]: <IPython.core.display.Image object>

In [16]: result.compute()
Out[16]: False

In [17]: def compare(f, g):
    ...:     mask_comparison = da.equal(da.ma.getmaskarray(f), da.ma.getmaskarray(g))
    ...:     data_comparison = da.isclose(da.ma.getdata(f), da.ma.getdata(g))
    ...:     result = da.all(da.logical_and(mask_comparison, data_comparison))
    ...:     return result
    ...: 

In [19]: compare(np.ma.array([1, 2, 3], mask=[0, 1, 0]), np.ma.array([1, 2, 3],
    ...: mask=[0, 1, 0])).compute()
Out[19]: True

In [20]: compare(np.ma.array([1, 2, 3], mask=[0, 1, 0]), np.ma.array([1, 2, 3],
    ...: mask=[0, 1, 1])).compute()
Out[20]: False

In [21]: compare(np.ma.array([1, 2, 3], mask=[0, 1, 0]), np.ma.array([5, 2, 3],
    ...: mask=[0, 1, 0])).compute()
Out[21]: False

In [22]: compare(np.ma.array([1, 2, 3]), np.ma.array([5, 2, 3])).compute()
Out[22]: False

In [23]: compare(np.ma.array([1, 2, 3]), np.ma.array([1, 2, 3])).compute()
Out[23]: True

As the outputs indicate, the process given in (4) could also apply to the general case, i.e. any combination of ndarray and MaskedArray, but given the need to process both a mask array and a data array it surely won't be as efficient, even under a Dask task workflow.

I should add that Dask doesn't (yet) have a da.ma.allclose method that could be used instead of something like the above.

Task graph from the above process for step (4)

[task graph image: test007.png]
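
For steps (1)-(3), a minimal sketch of the chunktype dispatch might look like the following (illustrative only: is_masked_chunktype is a made-up helper name, and _meta is the private attribute behind the chunktype shown in the repr):

import dask.array as da
import numpy as np

def is_masked_chunktype(d):
    # The 'chunktype' in a dask array's repr is the type of its meta array
    return isinstance(d._meta, np.ma.MaskedArray)

dx = da.from_array(np.ma.array([1, 2, 3], mask=[0, 1, 0]))
dy = da.from_array(np.array([1, 2, 3]))

if is_masked_chunktype(dx) != is_masked_chunktype(dy):
    result = False                # step (2): one masked, one not
elif not is_masked_chunktype(dx):
    result = da.allclose(dx, dy)  # step (3): both plain ndarray
else:
    pass                          # step (4): both masked; see the demo above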

@davidhassell added the "dask" label (Relating to the use of Dask) on Sep 30, 2021
@sadielbartholomew (Member Author)

(FYI @davidhassell, as you may have seen from notifications, I have pushed some new commits here, but please don't look at this until I tag you again to let you know it is ready - I have prepared review aids that I will share but I am thinking over some aspects before I finalise everything and open this for review.)

@sadielbartholomew (Member Author)

Hi again @davidhassell, all ready now. Depending on your thoughts on the approach to take regarding the laziness of cf.Data (see the questions and comments I have added to our developer notes in 7a01cbe), I may want to revert the penultimate commit, which converts the overall approach to Case 1 (as covered in those notes), whereas previously I was abiding by Case 2.

Please also see below for notes conveying the context of the PR and our discussions; they took place a while back now, so it is good to write down our thoughts.

Note also that I will push a commit in a moment to add a test for the new _da_ma_allclose internal method, but otherwise I am done here, bar any changes relating to the discussion of the approach to laziness, as above.


Notes for context

After our September discussion we realised we should amend the task graph from the one suggested in #254 (comment) to the following, where I have drawn a red box to outline the key part, i.e. what differs from the previous graph:

[annotated task graph image: 20211109_001606.png]

and I have implemented that. The overall logic corresponds to this interactive example, in case you wish to play around with it:

In [1]: import dask.array as da
   ...: import numpy as np
   ...: x = np.ma.array([1, 2, 3], mask=[1, 0, 0])
   ...: y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
   ...: dx = da.from_array(x)
   ...: dy = da.from_array(y)

In [2]: dx_mask = da.ma.getmaskarray(dx)
   ...: dy_mask = da.ma.getmaskarray(dy)
   ...: mask_comparison = da.equal(dx_mask, dy_mask)

In [3]: def da_ma_allclose(x, y, masked_equal=True, rtol=1e-05, atol=1e-08):
   ...:     x = da.asanyarray(x)
   ...:     y = da.asanyarray(y)
   ...:     return da.map_blocks(
   ...:         np.ma.allclose, x, y, masked_equal=masked_equal, rtol=rtol,
   ...:         atol=atol,
   ...:     )
   ...: 

In [4]: data_comparison = da_ma_allclose(dx, dy, masked_equal=False)

In [5]: result = da.all(da.logical_and(mask_comparison, data_comparison))

In [6]: result.visualize(filename="cf_equals_test_003.png")
Out[6]: <IPython.core.display.Image object>

In [7]: result.compute()
Out[7]: False

Notice the graph from the above is as desired:

[task graph image: cf_equals_test_003.png]

@davidhassell (Collaborator)

Hi Sadie - I think the logic is sound, but unfortunately it doesn't work for me if dx and dy have different chunk sizes:

dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))

which is a limitation/feature of map_blocks, I think.
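
A possible workaround would be to rechunk one operand to match the other before calling map_blocks - a minimal sketch, though it inserts extra rechunk tasks into the graph:

import dask.array as da
import numpy as np

x = np.ma.array([1, 2, 3], mask=[1, 0, 0])
y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))

# Align dy's chunking with dx's so that map_blocks sees paired blocks
dy = dy.rechunk(dx.chunks)
assert dx.chunks == dy.chunks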

After some messing about I came up with:

import dask.array as da
import numpy as np

def allclose(a_blocks, b_blocks, rtol=1e-05, atol=1e-08):
    # blockwise below contracts every axis, so each argument arrives as
    # a (nested) list of blocks; for these 1-d examples that is a flat
    # list, and blockwise's default chunk alignment means zip pairs the
    # blocks up directly
    result = True
    for a, b in zip(a_blocks, b_blocks):
        result &= np.ma.allclose(a, b, rtol=rtol, atol=atol, masked_equal=True)

    return result

# -----------------------------------------
# Case 1
# -----------------------------------------
x = np.ma.array([1,   2, 3], mask=[1, 0, 0])
y = np.ma.array([999, 2, 3], mask=[1, 0, 0])
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))

dx_mask = da.ma.getmaskarray(dx)
dy_mask = da.ma.getmaskarray(dy)
mask_comparison = da.allclose(dx_mask, dy_mask)

axes = tuple(range(dx.ndim))

data_comparison = da.blockwise(
    allclose, '', dx, axes, dy, axes, dtype=bool,
    rtol=1e-05, atol=1e-08
)

result = mask_comparison & data_comparison

print('CASE 1')
print('mask equal?', mask_comparison.compute())
print('data equal?', data_comparison.compute())
print('     equal?', result.compute())

# -----------------------------------------
# Case 2
# -----------------------------------------
x = np.ma.array([1,   999, 3], mask=[0, 1, 0])   # Different
y = np.ma.array([999,   2, 3], mask=[1, 0, 0])
dx = da.from_array(x, chunks=(1, 2))
dy = da.from_array(y, chunks=(3,))

dx_mask = da.ma.getmaskarray(dx)
dy_mask = da.ma.getmaskarray(dy)
mask_comparison = da.allclose(dx_mask, dy_mask)

axes = tuple(range(dx.ndim))

data_comparison = da.blockwise(
    allclose, '', dx, axes, dy, axes, dtype=bool,
    rtol=1e-05, atol=1e-08
)

result = mask_comparison & data_comparison

print('\nCASE 2')
print('mask equal?', mask_comparison.compute())
print('data equal?', data_comparison.compute())
print('     equal?', result.compute())

which produces, as required:

CASE 1
mask equal? True
data equal? True
     equal? True

CASE 2
mask equal? False
data equal? True
     equal? False

This assumes that dx and dy have the same shape, but we'll already have checked that and short-circuited if they haven't (see the sketch below).
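
For completeness, that earlier short circuit might look something like this hypothetical guard (equals_sketch is a made-up name; the real check belongs in Data.equals):

def equals_sketch(dx, dy):
    # Hypothetical guard: unequal shapes short circuit immediately,
    # before any comparison graph is built
    if dx.shape != dy.shape:
        return False
    ...  # proceed to the mask and data comparisons above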

What do you think?

@sadielbartholomew (Member Author)

Hi @davidhassell, thanks for your detailed (re-)review, as ever. Good spot regarding the chunk sizes - I hoped Dask would handle all of that, but I guess with our custom Dask method we need to be careful. First of all I will add a test for the chunk sizes akin to your examples, and a proper unit test for the new allclose method. Then I'll try out your suggested solution - thanks for investigating and sharing it.

I'll start on all of that after my dinner. I am also about to push up further developer notes, as quite a lot of other important aspects were discussed in relation to this PR!

@davidhassell (Collaborator) left a comment

Hi Sadie - I think we're pretty much there, excellent work.

I have "commented" rather than "requested changes", as we might need to sign off on a couple of things (perhaps the string data type question?).

Thanks,
David

cf/data/utils.py (resolved review thread):
# top-level dtype in the NumPy dtype hierarchy; see the
# 'Hierarchy of type objects' figure under:
# https://numpy.org/doc/stable/reference/arrays.scalars.html#scalars
return np.issubdtype(dtype, np.number) or np.issubdtype(dtype, np.bool_)
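
For context, a quick demonstration of what that check accepts and rejects (is_numeric_or_bool is a made-up wrapper around the excerpt above):

import numpy as np

def is_numeric_or_bool(dtype):
    # Mirrors the excerpt: numeric and boolean dtypes pass, others do not
    return np.issubdtype(dtype, np.number) or np.issubdtype(dtype, np.bool_)

print(is_numeric_or_bool(np.dtype("float32")))        # True
print(is_numeric_or_bool(np.dtype(bool)))             # True
print(is_numeric_or_bool(np.dtype("U5")))             # False (string)
print(is_numeric_or_bool(np.dtype("datetime64[s]")))  # False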
@davidhassell (Collaborator):
I think this conversation is now resolved in the light of subsequent conversations, and the agreement to introduce new keywords (such as equal_nan).

cf/test/test_Data.py (resolved review thread):
d2 = cf.Data(a.astype(np.float32), "m") # different datatype to d
self.assertTrue(d2.equals(d2))
with self.assertLogs(level=cf.log_level().value) as catch:
self.assertFalse(d2.equals(d))
@davidhassell (Collaborator):

OK - I agree that d2.equals(d) is False!

Further resolved review threads on cf/test/test_Data.py, cf/data/dask_utils.py and cf/test/test_Data_utils.py.
@sadielbartholomew (Member Author)

Thanks once again for the feedback, @davidhassell. All should be resolved now. Note that, for the issue outlined in the thread here, I have adjusted the test so that it tests for the agreed behaviour, but commented out the line with the assertion in question, which currently fails, with a note that the correct behaviour will be added via cfdm:

https://github.com/NCAS-CMS/cf-python/pull/254/files#diff-bc5ca24d93eb1dce7c459430c3decfc73dd85fded3face2985e76b9565cf6881R284-R287

So it isn't resolved in the sense of being implemented, but it is ready in terms of this PR: I will open a follow-up issue on cfdm to cover it.

@davidhassell (Collaborator)

Just had a thought - it'd be good to set all of the dask arrays in all of the tests to have multiple dask chunks:

d = cf.Data(a, "m", chunks=(2, 2))

@sadielbartholomew (Member Author)

Just had a thought - it'd be good to set all of the dask arrays in all of the tests to have multiple dask chunks:

Right-o, good idea. One moment and I'll push a commit doing that.

@sadielbartholomew (Member Author) commented Jan 20, 2022

@davidhassell since my last comment I've:

  • added dask chunks on all cf.Data calls in test_Data_equals, and I will add chunks to the other Data-module tests as I go along with daskification (and can throw chunks into the other already-ready tests in a lone commit to push post-merge);
  • adjusted the usage of the daskified decorator to reduce the verbosity of the daskification monitoring messages;
  • resolved most of the linting failures, even though many of them were unrelated to this PR (e.g. residual from the recent merge of master or from your initial LAMA-to-Dask work). However, I couldn't fully resolve the linting to get flake8 to pass via pre-commit run --all-files because I wasn't sure about the nature of two undefined variables (see detail below): please could you clarify so I can rectify them and the linting tests will pass?

Overall, I think we are ready to merge (ideally after we address the more general linting failures so the CI jobs can run cleanly).

Outstanding linting failures

What are/were is_small and is_very_small? Can they go or does some logic need to be (re-)added?

$ pre-commit run --all-files
Check python ast.........................................................Passed
Debug Statements (Python)................................................Passed
Fix End of Files.........................................................Passed
Trim Trailing Whitespace.................................................Passed
black....................................................................Passed
docformatter.............................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

cf/data/creation.py:137:12: F821 undefined name 'is_small'
cf/data/creation.py:196:12: F821 undefined name 'is_very_small'
cf/data/creation.py:234:12: F821 undefined name 'is_small'
cf/data/creation.py:240:12: F821 undefined name 'is_small'
cf/data/creation.py:266:20: F821 undefined name 'is_small'

isort (python)...........................................................Passed
isort (cython).......................................(no files to check)Skipped
isort (pyi)..........................................(no files to check)Skipped

@davidhassell (Collaborator)

Hi Sadie,

What are/were is_small and is_very_small? Can they go or does some logic need to be (re-)added?

I think "were"! and the easiest thing is probably just to comment out those lines - that won't stop any (relevant) units tests from passing, and they'll get wiped anyway when #297 is merged.

@sadielbartholomew (Member Author) commented Jan 21, 2022

I think "were"!

Sorry, I realise now I had worded that badly - obviously they have gone, as they are not defined; by 'were' I meant to ask whether the logic has gone AWOL and, if so, how to re-add it 🙂

and the easiest thing is probably just to comment out those lines - that won't stop any (relevant) units tests from passing, and they'll get wiped anyway when #297 is merged.

Aha, sure - I didn't realise #297 would touch them. I'll remove those lines and then we might be ready to merge...

@sadielbartholomew (Member Author) commented Jan 21, 2022

OK, just going to trigger the CI jobs via open-close; they should now pass, as everything is good locally, including the linting...

@davidhassell (Collaborator) left a comment

All done!
