BUG: unstack with sort=False fails when used with the level parameter… #56357

renanffernando · 2023-12-06T05:20:00Z

With sort=False, the previous implementation of unstack assumed that the data would appear in the new DataFrame in the same order as in the old one, but this is not always true. In addition, the code used the sorted labels to map the NaN values, which gave a wrong result when sort=False.

To fix the first problem, I assign a 'code id' to each label so that the labels behave as if they were already sorted. The resulting indexer then maps the old data into the new frame correctly. The second issue is solved simply by using the old labels.
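
As an illustration of the 'code id' idea only (the helper name `first_appearance_codes` is hypothetical and is not taken from this PR), the following sketch relabels integer codes in order of first appearance, so that the relabelled codes look as if the labels were already sorted:

```python
import numpy as np


def first_appearance_codes(codes: np.ndarray) -> np.ndarray:
    """Relabel codes so the first distinct value becomes 0, the next 1, ...

    Hypothetical helper for illustration; not the code from this PR.
    """
    mapping = {}
    out = np.empty(len(codes), dtype=codes.dtype)
    for i, code in enumerate(codes.tolist()):
        # setdefault assigns the next free id the first time a code is seen
        out[i] = mapping.setdefault(code, len(mapping))
    return out


old_codes = np.array([2, 0, 2, 1, 0])
new_codes = first_appearance_codes(old_codes)
print(new_codes)  # [0 1 0 2 1]
```

Codes that are already in first-appearance order (for example `[0, 1, 2]`) are left unchanged by this relabelling.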

closes BUG: unstack with sort=False fails when used with the level parameter #54987
closes BUG: sort argument in unstack gives wrong results #55516
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

rhshadrach · 2023-12-18T22:08:12Z

Thanks for the PR @renanffernando. I only took a quick look, but I think there may be more performant ways of carrying out the changes you have here. Running the ASVs as-is but adding sort=False to the unstack benchmarks, I'm seeing this:

       before           after         ratio
     [14cf864c]       [3b175962]
       <main>        <franco-unstack> 
+         913±2μs      10.9±0.02ms    11.93  reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+         932±9μs      10.9±0.02ms    11.72  reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+     1.71±0.01ms      11.7±0.05ms     6.87  reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
+        1.70±0ms      11.7±0.05ms     6.85  reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+     1.47±0.01ms      10.0±0.02ms     6.79  reshape.SimpleReshape.time_unstack
+     2.04±0.01ms       12.2±0.1ms     5.98  reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+     2.03±0.01ms      12.1±0.08ms     5.94  reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+     5.22±0.05ms      15.2±0.04ms     2.92  reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+     5.21±0.06ms      15.2±0.06ms     2.91  reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+      23.3±0.6ms       44.5±0.1ms     1.91  reshape.Unstack.time_full_product('int')
+         925±2μs      1.69±0.01ms     1.82  reshape.SparseIndex.time_unstack
+      16.6±0.3ms       25.4±0.3ms     1.53  reshape.Unstack.time_full_product('category')
+      17.7±0.2ms       26.9±0.3ms     1.52  reshape.Unstack.time_without_last_row('category')
+      52.5±0.3ms       74.7±0.2ms     1.42  reshape.Unstack.time_without_last_row('int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Perhaps some of the performance can be clawed back - I'll be taking a more detailed look in the next few days.

renanffernando · 2023-12-21T03:51:08Z

I did not know about these performance tests @rhshadrach.

I did some experiments, and the cause of this poor performance is the creation of the codes for the new labels. I tested some alternatives, but none of them performed well. Anyway, I updated the mapping to the best version I found.

Do you know a better alternative to do this mapping?

rhshadrach · 2024-01-02T22:33:36Z

> Do you know a better alternative to do this mapping?

It sounds like you're looking for factorize with sort=False: https://pandas.pydata.org/docs/reference/api/pandas.factorize.html
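
For reference, a small sketch of what `pandas.factorize` does with `sort=False` (its default) versus `sort=True`. The input array is made up for illustration and is not taken from the PR:

```python
import numpy as np
import pandas as pd

values = np.array([2, 0, 2, 1, 0])

# sort=False (the default): uniques are kept in order of first appearance
codes_unsorted, uniques_unsorted = pd.factorize(values, sort=False)

# sort=True: uniques are sorted, and codes refer to the sorted uniques
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

# In both cases, indexing the uniques with the codes recovers the input
assert (uniques_unsorted[codes_unsorted] == values).all()
assert (uniques_sorted[codes_sorted] == values).all()
```

With `sort=False` the returned codes are exactly the "first appearance" ids, which is the mapping being asked about.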

renanffernando · 2024-01-24T18:35:43Z

I changed the code to use the factorize function as you recommended @rhshadrach, and the performance regression is much smaller now. However, some benchmarks are still slower by a factor of about two, as shown below. Do you think that is OK?

Change	Before [`46c8da3`] <v2.3.0.dev0~122>	After [`3bd4fdf`]	Ratio	Benchmark (Parameter)
+	26.9±0.1ms	57.9±3ms	2.16	reshape.Unstack.time_full_product('int')
+	2.35±0.01ms	4.61±0.9ms	1.96	reshape.SimpleReshape.time_stack
+	877±9μs	1.58±0.2ms	1.8	reshape.SparseIndex.time_unstack
+	65.0±1ms	108±10ms	1.67	reshape.Unstack.time_without_last_row('int')
+	1.42±0.02ms	2.18±0.4ms	1.53	reshape.SimpleReshape.time_unstack
+	1.96±0.01ms	2.96±0.5ms	1.51	reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+	1.97±0.03ms	2.87±0.4ms	1.45	reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+	874±4μs	1.23±0.05ms	1.41	reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+	866±6μs	1.15±0.03ms	1.33	reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+	5.16±0.1ms	6.71±2ms	1.3	reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+	7.95±0.07ms	10.2±4ms	1.29	reshape.ReshapeMaskedArrayDtype.time_stack('Int64')
+	1.71±0.09ms	2.18±0.2ms	1.27	reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+	3.33±0.02ms	3.95±0.3ms	1.19	reshape.Cut.time_cut_datetime(4)
+	16.5±0.3ms	19.2±0.4ms	1.17	reshape.Unstack.time_full_product('category')
+	7.94±0.06ms	9.21±0.6ms	1.16	reshape.ReshapeMaskedArrayDtype.time_stack('Float64')
+	40.1±0.06ms	45.8±2ms	1.14	reshape.Crosstab.time_crosstab_normalize_margins
+	4.17±0.02ms	4.74±0.2ms	1.14	reshape.Cut.time_cut_datetime(10)
+	105±2ms	120±6ms	1.14	reshape.Pivot.time_reshape_pivot_time_series
+	31.1±0.2ms	34.9±2ms	1.12	reshape.PivotTable.time_pivot_table_margins_only_column
+	3.78±0.02ms	4.25±0.2ms	1.12	reshape.ReshapeExtensionDtype.time_stack('Period[s]')
+	5.21±0.1ms	5.79±0.5ms	1.11	reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+	27.7±0.2ms	30.6±0.8ms	1.1	reshape.PivotTable.time_pivot_table_agg

rhshadrach

Great update! I think the performance penalty we're seeing looks good for fixing the behavior. Also needs a note in the 3.0.0 whatsnew under the Reshaping section (once you merge main).

rhshadrach · 2024-01-28T15:12:29Z

pandas/core/reshape/reshape.py

+
+        if not self.sort:
+            # Create new codes considering that labels are already sorted
+            codes = [np.array(factorize(code)[0], dtype=code.dtype) for code in codes]


I think you don't need to wrap with np.array(...) - is that right?
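
One way to check this, as a sketch: `pandas.factorize` already returns its codes as a NumPy array, so the wrapping only matters if the original dtype of `code` has to be kept. The input array below is made up, and whether the dtype matters for the surrounding code in `reshape.py` is not something this snippet can answer.

```python
import numpy as np
import pandas as pd

code = np.array([2, 0, 2, 1, 0], dtype=np.int8)

new_code = pd.factorize(code)[0]
print(type(new_code), new_code.dtype)  # an ndarray with an integer dtype

# If the original dtype should be preserved, astype is one alternative
# to np.array(..., dtype=code.dtype)
same_dtype = new_code.astype(code.dtype, copy=False)
print(same_dtype.dtype)  # int8
```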

rhshadrach · 2024-01-28T15:17:12Z

pandas/core/reshape/reshape.py

+    def sorted_labels(self) -> list[np.ndarray]:
        if self.sort:
-            indexer, _ = self._indexer_and_to_sort
+            return self.labels


This seems a bit confusing to me: what was sorted_labels has become labels, and the code for sorted_labels returns the same result as labels when sort=True. If there are good reasons behind these names, maybe add a short docstring to make that clear? Otherwise, perhaps a renaming is in order.

github-actions · 2024-02-28T00:05:19Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

rhshadrach · 2024-02-28T22:32:11Z

@renanffernando - are you interested in continuing here? If not, I plan to finish this up.

rhshadrach · 2024-03-31T13:28:07Z

With the latest commit, I'm seeing:

asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b "^reshape"
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

rhshadrach · 2024-05-02T22:39:25Z

pandas/tests/reshape/test_pivot.py

@@ -2703,16 +2703,3 @@ def test_pivot_table_with_margins_and_numeric_column_names(self):
            index=Index(["a", "b", "All"], name=0),
        )
        tm.assert_frame_equal(result, expected)
-
-    @pytest.mark.parametrize("m", [1, 10])
-    def test_unstack_shares_memory(self, m):


This test was added in #57487. To fix this bug, we now need to unconditionally do a take in _make_sorted_values. I don't think the fact that it shared memory in the m=1 case was important - I think that was the only reason this test was added.

cc @phofl

phofl · 2024-05-16

I would like to keep the test; it still ensures correct behavior. You can remove the shares-memory assertion, but everything else should still pass. A failure would imply a legitimate bug.

rhshadrach · 2024-05-21

Sure thing - done. Renamed since shares_memory was no longer accurate.
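
To illustrate why an unconditional take invalidates the shared-memory assertion, here is a sketch that uses NumPy's `take` as a stand-in (an assumption; the helper actually used inside `_make_sorted_values` may differ). `take` returns a new array even when the indexer is the identity, whereas basic slicing returns a view:

```python
import numpy as np

values = np.arange(6).reshape(3, 2)
identity = np.arange(len(values))

taken = values.take(identity, axis=0)  # same contents, new buffer
view = values[:]                       # basic slicing returns a view

print(np.shares_memory(values, taken))  # False
print(np.shares_memory(values, view))   # True
```

The contents of `taken` still equal `values`, which is why the rest of the test can stay in place once the memory-sharing assertion is dropped.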

mroeschke

Looks OK. Just needs a whatsnew

rhshadrach · 2024-05-16T20:17:13Z

@mroeschke - good to merge?

phofl · 2024-05-16T20:38:18Z

One comment about the test: it should stay. Just removing the assertion on shared memory is the way to go.

mroeschke · 2024-05-21T16:13:12Z

Thanks @renanffernando and @rhshadrach

renanffernando marked this pull request as draft December 6, 2023 05:20

renanffernando force-pushed the franco-unstack branch 3 times, most recently from e057c9f to 3b17596 Compare December 8, 2023 23:03

renanffernando marked this pull request as ready for review December 8, 2023 23:48

renanffernando changed the title ~~[WIP] BUG: unstack with sort=False fails when used with the level parameter…~~ BUG: unstack with sort=False fails when used with the level parameter… Dec 8, 2023

mroeschke requested a review from rhshadrach December 18, 2023 19:27

rhshadrach added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Dec 18, 2023

renanffernando force-pushed the franco-unstack branch from 3b17596 to dcb44c4 Compare December 21, 2023 03:45

renanffernando force-pushed the franco-unstack branch 3 times, most recently from 5d39ff9 to c27929d Compare December 21, 2023 16:43

BUG: unstack with sort=False fails when used with the level parameter (…

3bd4fdf

…pandas-dev#54987) Assign new codes to labels when sort=False. This is done so that the data appears to be already sorted, fixing the bug.

renanffernando force-pushed the franco-unstack branch from c27929d to 3bd4fdf Compare January 24, 2024 18:30

rhshadrach added this to the 3.0 milestone Jan 28, 2024

rhshadrach requested changes Jan 28, 2024

github-actions bot added the Stale label Feb 28, 2024

rhshadrach removed the Stale label Feb 28, 2024

rhshadrach self-assigned this Feb 28, 2024

rhshadrach added 2 commits March 31, 2024 07:58

Merge remote-tracking branch 'upstream/main' into franco-unstack

895b8f8

Minor refactor and cleanup

86f7017

Merge branch 'main' of https://github.com/pandas-dev/pandas into fran…

61078aa

…co-unstack

Cleanup & remove test

345eb4f

rhshadrach reviewed May 2, 2024

rhshadrach requested a review from mroeschke May 2, 2024 22:40

mroeschke reviewed May 3, 2024

rhshadrach added 2 commits May 4, 2024 07:58

whatsnew

cff156b

Merge branch 'main' of https://github.com/pandas-dev/pandas into fran…

6edef4f

…co-unstack

mroeschke approved these changes May 4, 2024

rhshadrach and others added 2 commits May 8, 2024 18:36

Merge branch 'main' into franco-unstack

387a550

Merge branch 'main' into franco-unstack

14f5da6

rhshadrach added 2 commits May 20, 2024 21:32

Merge branch 'main' of https://github.com/pandas-dev/pandas into fran…

091b6e1

…co-unstack

Revert test removal

436fc8f

mroeschke merged commit b991274 into pandas-dev:main May 21, 2024
47 checks passed
