Skip to content
This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

Series combine #821

Open
wants to merge 21 commits into
base: master
Choose a base branch
from
Open

Series combine #821

wants to merge 21 commits into from

Conversation

Rubtsowa
Copy link
Contributor

No description provided.

examples/series/series_combine.py Outdated Show resolved Hide resolved
examples/series/series_combine.py Outdated Show resolved Hide resolved

len_val = max(len(self), len(other))
result = numpy.empty(len_val, self._data.dtype)
for ind in range(len_val):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we parallel the method based on chunks?

if fill_value is None:
fill_value = numpy.nan

len_val = max(len(self), len(other))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if all indexes are different? I think we should use sdc_join_series_indexes to find len of result series


len_val = max(len(self), len(other))
result = numpy.empty(len_val, self._data.dtype)
for ind in range(len_val):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is case for non-indexes series. Also, it should rewrite with prange

Copy link
Contributor

@densmirn densmirn May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ usage of chunks to predict scalability



@njit
def series_copy():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong name function

Comment on lines 4930 to 4931
if fill_value is None:
fill_value = numpy.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make fill_value type undefined at compile time. You can probably use the same approach as in operators:

_fill_value = numpy.nan if fill_value_is_none == True else fill_value # noqa

fill_value = numpy.nan

len_val = max(len(self), len(other))
result = numpy.empty(len_val, self._data.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually wrong, result dtype should be common dtype for result dtype of func(a, b) where a,b are series values and dtype of _fill_value. Provided tests do not cover this, but e.g. this (where fill_value is float and series are integers) won't pass:

    def test_series_combine_integer_new(self):
        def test_impl(S1, S2):
            return S1.combine(S2, lambda a, b: 2 * a + b, 16.2)
        hpat_func = self.jit(test_impl)

        S1 = pd.Series([1, 2, 3, 4, 5])
        S2 = pd.Series([6, 21, 3, 5])
        result = hpat_func(S1, S2)
        result_ref = test_impl(S1, S2)
        print(f"DEBUG: result:\n{result},\nresult_ref:\n{result_ref}")
        pd.testing.assert_series_equal(result, result_ref)

Copy link
Contributor

@kozlov-alexey kozlov-alexey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add deducing result dtype and add parallelization.

@@ -2825,7 +2817,7 @@ def test_impl(S1, S2):
S2 = pd.Series([1, 2, 3, 4, 5])
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))

@skip_numba_jit
@unittest.expectedFailure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment why the test is skipped.
@unittest.expectedFailure # ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rubtsowa No need to skip the test if that's how impl is intended to work. Use check_dtype=False in assert_series_equal and add a comment just before this check to refer to SDC Limitation.

Comment on lines 4959 to 4969
if self_indexes[j] == -1:
val_self = _fill_value
else:
ind_self = self_indexes[j]
val_self = self[ind_self]._data[0]

if other_indexes[j] == -1:
val_other = _fill_value
else:
ind_other = other_indexes[j]
val_other = other[ind_other]._data[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self_indexes[j] == -1:
val_self = _fill_value
else:
ind_self = self_indexes[j]
val_self = self[ind_self]._data[0]
if other_indexes[j] == -1:
val_other = _fill_value
else:
ind_other = other_indexes[j]
val_other = other[ind_other]._data[0]
self_idx = self_indexes[j]
if self_idx == -1:
val_self = _fill_value
else:
val_self = self[self_idx]._data[0]
other_idx = other_indexes[j]
if other_idx == -1:
val_other = _fill_value
else:
val_other = other[other_idx]._data[0]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or

Suggested change
if self_indexes[j] == -1:
val_self = _fill_value
else:
ind_self = self_indexes[j]
val_self = self[ind_self]._data[0]
if other_indexes[j] == -1:
val_other = _fill_value
else:
ind_other = other_indexes[j]
val_other = other[ind_other]._data[0]
self_idx = self_indexes[j]
val_self = _fill_value if self_idx == -1 else self[self_idx]._data[0]
other_idx = other_indexes[j]
val_other = _fill_value if other_idx == -1 else other[other_idx]._data[0]

def sdc_pandas_series_combine_impl(self, other, func, fill_value=None):

if fill_value is None:
fill_value = numpy.nan
if fill_value is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if fill_value is not None:
_fill_value = numpy.nan if fill_value is None else fill_value:

if self_idx == -1:
val_self = _fill_value
else:
val_self = self[self_idx]._data[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self_idx (and other_idx) is position in the Series, not the index, so instead of using getitem on a Series, that performs index lookup and returns a Series, so that you have to take _data[0] from it, you can just write:

Suggested change
val_self = self[self_idx]._data[0]
val_self = self._data[self_idx]

Copy link
Contributor

@kozlov-alexey kozlov-alexey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks OK, but I would reorganize and extend existing tests a bit.


Limitations
-----------
- Only supports the case when data in series of the same type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not correct - impl handles all cases. For the next line we need exact definition of difference to pandas, e.g:

Suggested change
- Only supports the case when data in series of the same type.
- Resulting series dtype may be wider than in pandas due to type-stability requirements and depends on fill_value dtype and result of series indexes alignment.

@@ -2771,7 +2770,6 @@ def test_impl(S1, S2):
S2 = pd.Series([6.0, 21., 3.6, 5.])
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))

@skip_numba_jit
def test_series_combine_float3264(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has incorrect code, which should be corrected probably:

        S1 = pd.Series([np.float64(1), np.float64(2),
                        np.float64(3), np.float64(4), np.float64(5)])
        S2 = pd.Series([np.float32(1), np.float32(2),
                        np.float32(3), np.float32(4), np.float32(5)]) 

S2.dtype will be float64 on Win, not float32. Moreover, series dtype should be specified this way:

        S1 = pd.Series([1, 2, 3, 4, 5], dtype=np.int64)
        S2 = pd.Series([1, 2, 3, 4, 5], dtype=np.int32)

Comment on lines 4952 to 4962
chunks = parallel_chunks(len_val)
for i in prange(len(chunks)):
chunk = chunks[i]
for j in range(chunk.start, chunk.stop):
self_idx = self_indexes[j]
val_self = _fill_value if self_idx == -1 else self._data[self_idx]

other_idx = other_indexes[j]
val_other = _fill_value if other_idx == -1 else other._data[other_idx]

result[j] = func(val_self, val_other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not?

Suggested change
chunks = parallel_chunks(len_val)
for i in prange(len(chunks)):
chunk = chunks[i]
for j in range(chunk.start, chunk.stop):
self_idx = self_indexes[j]
val_self = _fill_value if self_idx == -1 else self._data[self_idx]
other_idx = other_indexes[j]
val_other = _fill_value if other_idx == -1 else other._data[other_idx]
result[j] = func(val_self, val_other)
result = numpy.empty(len_val, res_dtype)
for i in prange(len(result)):
self_idx, other_idx = self_indexes[i], other_indexes[i]
val_self = _fill_value if self_idx == -1 else self._data[self_idx]
val_other = _fill_value if other_idx == -1 else other._data[other_idx]
result[i] = func(val_self, val_other)

def test_series_combine_assert1(self):
def test_impl(S1, S2):
return S1.combine(S2, lambda a, b: 2 * a + b)
hpat_func = self.jit(test_impl)

S1 = pd.Series([1, 2, 3])
S2 = pd.Series([6., 21., 3., 5.])
with self.assertRaises(AssertionError):
hpat_func(S1, S2)
pd.testing.assert_series_equal(hpat_func(S1, S2), test_impl(S1, S2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment for tests: not all combinations of input series dtypes and fill_value are tested e.g. the one I mentioned before - where float fill_value is assigned to otherwise int series. There are no tests with series with non-default indexes (we refer to samelen, but it's not fully correct - series may have same len, but not same indexes), and no tests for checking func impact on result dtype, so it's hard to see from such tests what's really tested and what is not. So the suggestion is to organize tests in a different manner:

  1. product of diff series dtypes (default int, int64, float64),
    same series indexes (but not same series sizes),
    fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
    Covers: test_series_combine_value_samelen
  2. product of diff series dtypes (default int, int64, float64),
    same series indexes (but not same series sizes),
    with fill_value is omitted
    Covers: test_series_combine_float3264, test_series_combine_integer_samelen, test_series_combine_samelen, test_series_combine_different_types
  3. product of diff series dtypes (default int, int64, float64),
    series indexes that align with and without -1 in indexers
    fill_value is specified and of different dtypes (None, np.nan, 4, 4.2)
    Covers: test_series_combine_integer, test_series_combine_value
  4. product of diff series dtypes (default int, int64, float64),
    series indexes that align with and without -1 in indexers
    fill_value is omitted
    Covers: test_series_combine, test_series_combine_assert1, test_series_combine_assert2, test_series_combine_different_types

New test:
5. (for testing func changes dtype properly)
product of diff series dtypes (default int, int64, float64),
same series indexes (but not same series sizes),
fill_value = 0
with diff functions (chaning and not chaning res dtype e.g. preserving int domain, e.g. ** and + and not, e.g. /)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, test 1 can look like this (it can also be split into two: one when we use check_dtype=False and one when we don't):

    def test_series_combine_same_index_fill_value(self):
        def test_impl(S1, S2):
            return S1.combine(S2, lambda a, b: 2 * a + b)
        hpat_func = self.jit(test_impl)

        n = 11
        np.random.seed(0)
        A = np.random.randint(-100, 100, n)
        B = np.arange(n) * 2 + 1
        series_index = 1 + np.arange(n)

        series_dtypes = [None, np.int64, np.float64]
        fill_values = [None, np.nan, 4, 4.2]
        for dtype1, dtype2, fill_value in product(series_dtypes, series_dtypes, fill_values):
            S1 = pd.Series(A, index=series_index, dtype=dtype1)
            S2 = pd.Series(B, index=series_index, dtype=dtype2)
            with self.subTest(S1_dtype=dtype1, S2_dtype=dtype2, fill_value=fill_value):
                result = hpat_func(S1, S2)
                result_ref = test_impl(S1, S2)
                # check_dtype=False due to difference to pandas in some cases
                pd.testing.assert_series_equal(result, result_ref, check_dtype=False)

@pep8speaks
Copy link

pep8speaks commented Jun 1, 2020

Hello @Rubtsowa! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-01 13:52:44 UTC

Comment on lines +4929 to +4931
- Indixes should be strictly ascending, as inside the function
they are sorted in ascending order and the answer becomes
different from the result of the pandas.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rubtsowa What? This sounds like a bug...

Copy link
Contributor Author

@Rubtsowa Rubtsowa Jun 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kozlov-alexey This 'bug' in function sdc_join_series_indexes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rubtsowa Then it should be fixed (please create a JIRA case with a reproducer) before this is merged. Adapting to a bug is no way. @AlexanderKalistratov correct?

Copy link
Contributor

@PokhodenkoSA PokhodenkoSA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very old PR. It is better to close it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants