
add pandas uncertainty array #184

Open
wants to merge 7 commits into master

Conversation

andrewgsavage
Contributor

@MichaelTiemannOSC here's where I got to with creating a pandas ExtensionArray.

pandas backports taken from
https://github.com/googleapis/python-db-dtypes-pandas/blob/main/db_dtypes/pandas_backports.py

@andrewgsavage
Contributor Author

Ideally, this pandas PR would need to be revived:
pandas-dev/pandas#45544

The backport of NDArrayMixin is a bit outdated, causing some tests to fail on newer pandas versions.

@newville
Member

newville commented Jan 2, 2024

@andrewgsavage Thanks - this looks interesting. I admit to not knowing much about Pandas development, but I might suggest that the goal here could be to target Pandas 2+ only (or even 2.1+), or perhaps 1.5+. Would that change what needed to be backported?

Perhaps you can give some idea of which versions of Pandas are targeted and/or supported here?

@andrewgsavage
Contributor Author

I was using the same approach as python-db-dtypes-pandas, which uses the backported NDArrayMixin and NDArrayBackedExtensionArray to reduce the code needed for the ExtensionArray. The version I'd copied in the backport file is from a few pandas versions ago.

The NDArrayMixin and NDArrayBackedExtensionArray are part of pandas' private namespace pandas.core.arrays._mixins. I swapped to using that, but it is liable to change or move location in future pandas versions.
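
For reference, the private import in question looks like this (the exact path is internal to pandas and may move between releases):

from pandas.core.arrays._mixins import NDArrayBackedExtensionArray  # private pandas API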

I'd only expect to support the current pandas version.

@newville
Member

newville commented Jan 2, 2024

@andrewgsavage Thanks -- I guess that I would be a bit hesitant about expecting to add pandas as a required dependency for uncertainties. The code here also seems to use quite a bit of pandas API code, implying that the people developing and maintaining uncertainties would need to be familiar with the pandas API and follow its development.

I wonder if there might be a simpler way. That is, if uncertainties used https://numpy.org/neps/nep-0018-array-function-protocol.html (see #47) to create a UArray with __array_function__() methods, then maybe it would be easier for other nd-array-like projects (xarray, pint, pandas) to wrap "nd-array of values with uncertainties". It looks like this has been discussed off and on for a few years now.

I have to admit, I do not have any experience trying to subclass ndarray or using the __array_function__ protocol, but I would guess that if uncertainties put in the effort to "use numpy __array_function__" to create a modern UArray, then pandas, xarray, ... could use that in their code.

Does that seem reasonable?

@andrewgsavage
Contributor Author

pandas would be an optional dependency, or the code in this PR can be moved to a separate package if that's preferred. We went with the second option for pint-pandas. If you go with a separate package, you can make me a maintainer for it, as I'm following pandas' development for pint-pandas anyway.

A UArray with __array_function__ is a different topic from this. I'm not familiar with subclassing either. The __array_function__ protocol allows a module to define how numpy functions behave. This is quite a bit of work; each function requires implementing and testing, and there are over 200 functions! Many functions behave in similar ways, so code can be shared, but you still end up with a lot of code to write. E.g. for pint, the bulk of the __array_function__ and __array_ufunc__ code is here: https://github.com/hgrecco/pint/blob/master/pint/facets/numpy/numpy_func.py
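
For anyone unfamiliar with the protocol, here is a minimal sketch of the pattern pint uses (the UArray class and implements decorator here are hypothetical, not code from this PR): a registry maps numpy functions to per-type implementations, and numpy dispatches to them via __array_function__.

import numpy as np
from uncertainties import ufloat

HANDLED_FUNCTIONS = {}  # maps numpy functions to UArray implementations


class UArray:
    """Hypothetical wrapper around an object-dtype array of ufloats."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    def __array_function__(self, func, types, args, kwargs):
        # numpy calls this when e.g. np.sum(uarr) is invoked (NEP 18)
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)


def implements(numpy_function):
    """Register a UArray implementation of a numpy function."""

    def decorator(func):
        HANDLED_FUNCTIONS[numpy_function] = func
        return func

    return decorator


@implements(np.sum)
def _sum(uarr):
    # ufloats propagate uncertainty through ordinary addition
    return np.add.reduce(uarr._data)


# np.sum(UArray([ufloat(1, 0.1), ufloat(2, 0.2)])) -> 3.00+/-0.22

Each of the ~200 dispatchable functions needs an entry like _sum, which is where the bulk of the work lies.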

For interoperating with pandas, you'd still need code for an ExtensionArray similar to what's in this PR. Here the UncertaintyArray is storing the ufloats in an object-dtype numpy array, which could be changed to a UArray in the future. That is to say, this PR does not preclude making a UArray.
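
To give a flavour of what that ExtensionArray code involves, here is a heavily simplified sketch using pandas' public extension API (illustrative only, not the implementation in this PR, which implements many more methods):

import numpy as np
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    register_extension_dtype,
    take,
)
from uncertainties import UFloat, ufloat


@register_extension_dtype
class UncertaintyDtype(ExtensionDtype):
    name = "uncertainty"  # enables pd.Series(..., dtype="uncertainty")
    type = UFloat
    na_value = np.nan

    @classmethod
    def construct_array_type(cls):
        return UncertaintyArray


class UncertaintyArray(ExtensionArray):
    """Stores ufloats in an object-dtype numpy array."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    @property
    def dtype(self):
        return UncertaintyDtype()

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(list(scalars))

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        result = self._data[item]
        if isinstance(item, (int, np.integer)):
            return result  # scalar access returns a single ufloat
        return type(self)(result)  # slices and masks return a new array

    def __len__(self):
        return len(self._data)

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return np.array(
            [x is None or (isinstance(x, float) and np.isnan(x)) for x in self._data],
            dtype=bool,
        )

    def take(self, indices, allow_fill=False, fill_value=None):
        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value
        data = take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        return type(self)(data)

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([ua._data for ua in to_concat]))


# e.g. pd.Series([ufloat(1.0, 0.1), ufloat(2.0, 0.2)], dtype="uncertainty")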

FYI, pandas and xarray have both created APIs to allow other modules like pint-pandas and pint-xarray to be created. These sit outside of pandas or xarray, so the core module maintainers have less to deal with.

I've mostly put this together because @MichaelTiemannOSC has been using uncertainties with pint-pandas and wanted to add code to support uncertainties there, which I thought was better suited to living in an uncertainties module; it would also benefit more people by allowing them to use uncertainties with pandas without needing the unit support from pint-pandas.

I do need to run this through linting and such before it's properly reviewed! If we go with a separate module for uncertainties-pandas, it might be easier to set up the CI for that there, since this is using pytest but uncertainties is not at the moment.

@newville
Member

newville commented Jan 4, 2024

@andrewgsavage OK, thanks. Even as an "optional dependency" (is it fair to ask "Well, is it optional or is it a dependency"?), having that code here implies that the authors/maintainers of uncertainties will assume the responsibility for understanding and maintaining this code.

I would guess that it might be more in keeping with the basic goals of this package to support an object for "ndarray of values with uncertainties" by subclassing ndarray and/or using the __array_function__ protocol. There is kind of such an object already; it just doesn't use that more modern (and presumably more interoperable) interface. Yes, it would be some work to convert, but it might make it easier for those interested in making a "Pandas Series of values with uncertainties" or a "pint object of values with uncertainties", or xarray, dask, etc.


MichaelTiemannOSC left a comment


It's exciting to see this work becoming aligned with mainstream pandas (and other developments). My changes were just enough to make uncertainties work as wrappers within the NumPy / Pandas worlds. This is clearly next-level.

"sum",
"max",
"min",
"mean",


mean is a non-trivial calculation in the world of uncertainties. I don't think uncertainties does it. I've created code that does it:

# https://stackoverflow.com/a/74137209/1291237
# "ITR" is the surrounding project's namespace, assumed to re-export
# uncertainties' UFloat, ufloat, nominal_values, std_devs, and isnan.
def umean(unquantified_data):
    """
    Assuming Gaussian statistics, uncertainties stem from Gaussian parent distributions. In such a case,
    it is standard to weight the measurements (nominal values) by the inverse variance.

    Following the pattern of np.mean, this function is really nan_mean, meaning it calculates based on non-NaN values.
    If there are none, it returns np.nan, just like np.mean does with an empty array.

    This function uses error propagation to get an uncertainty of the weighted average.
    :param unquantified_data: A set of uncertainty values
    :return: The weighted mean of the values, with a freshly calculated error term
    """
    arr = np.array(
        [v if isinstance(v, ITR.UFloat) else ITR.ufloat(v, 0) for v in unquantified_data if not ITR.isnan(v)]
    )
    N = len(arr)
    if N == 0:
        return np.nan
    if N == 1:
        return arr[0]
    nominals = ITR.nominal_values(arr)
    if any(ITR.std_devs(arr) == 0):
        # We cannot mix and match "perfect" measurements with uncertainties
        # Instead compute the mean and return the "standard error" as the uncertainty
        # e.g. ITR.umean([100, 200]) = 150 +/- 50
        w_mean = sum(nominals) / N
        w_std = np.std(nominals) / np.sqrt(N - 1)
    else:
        # Compute the "uncertainty of the weighted mean", which apparently
        # means ignoring whether or not there are large uncertainties
        # that should be created by elements that disagree
        # e.g. ITR.umean([100+/-1, 200+/-1]) = 150.0+/-0.7 (!)
        w_sigma = 1 / sum([1 / (v.s**2) for v in arr])  # variance of the weighted mean
        w_mean = sum([v.n / (v.s**2) for v in arr]) * w_sigma
        w_std = w_sigma * np.sqrt(sum([1 / (v.s**2) for v in arr]))  # == sqrt(w_sigma)
    result = ITR.ufloat(w_mean, w_std)
    return result
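
For illustration, the two branches behave like this (assuming ITR re-exports uncertainties' helpers as above):

umean([100, 200])                        # -> 150.0+/-50.0 (standard-error branch)
umean([ufloat(100, 1), ufloat(200, 1)])  # -> 150.0+/-0.7 (inverse-variance branch)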

Contributor Author


This is the sort of thing that __array_function__ would help with, so I could do np.mean(UArray) without needing to understand the uncertainty logic.

"max",
"min",
"mean",
# "prod",


No reason we cannot reduce with prod.
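
For what it's worth, prod already reduces element-wise over an object-dtype array of ufloats, with the uncertainty propagated through the multiplications:

import numpy as np
from uncertainties import ufloat

np.prod(np.array([ufloat(2, 0.1), ufloat(3, 0.2)], dtype=object))  # -> 6.0+/-0.5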


def _validate_scalar(self, value):
"""
Validate and convert a scalar value to datetime64[ns] for storage in


This comment should not reference datetime64[ns]

@andrewgsavage
Contributor Author

Just looked at this and realised I'd need to add pandas to the testing matrix, which would slow tests down somewhat.
@lmfit/uncertainties-admins, @MichaelTiemannOSC, any thoughts as to whether this should be its own standalone module? I'm leaning towards a standalone module.

@MichaelTiemannOSC

> Just looked at this and realised I'd need to add pandas to the testing matrix, which would slow tests down somewhat. @lmfit/uncertainties-admins, @MichaelTiemannOSC, any thoughts as to whether this should be its own standalone module? I'm leaning towards a standalone module.

I would agree, as it would more than double the normal test time (double, because all the tests would have to run with or without uncertainties, and more than double because uncertain magnitudes are slower than float64 magnitudes).

@wshanks
Collaborator

wshanks commented Mar 10, 2024

The tests are very fast right now, so test time is not too bad (though I haven't looked at how much time the new tests take). I would say to weigh how much the code would be coupled to uncertainties (sorry, I haven't looked at it closely yet). Will it need coordinated releases with uncertainties to add new features, or can it easily remain independent? How convenient would it be to have everything in one package for a user? Already there is numpy support included in an optional way.

@newville
Member

It would probably be best to test with and without pandas. I don't think testing runtime is currently much of a concern.

But, as elsewhere, this does not seem as high a priority as getting a release with cleaned-up tests and code base. Can this wait?
