This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

[BUG] python and sdc-compiled functions generate different output with same input #996

Closed
dlee992 opened this issue Nov 24, 2021 · 3 comments · May be fixed by #1001
dlee992 commented Nov 24, 2021

Reporting a bug

In [25]: num_columns = 20
    ...: features = [f'col{i}' for i in range(num_columns)]
    ...: df = pd.DataFrame(np.random.rand(5, num_columns), columns=features)
    ...: target_col = 'col0'

In [26]: df
Out[26]:
       col0      col1      col2      col3      col4      col5  ...    
0  0.847436  0.116855  0.782481  0.485027  0.027340  0.328801  ...  
1  0.482504  0.845380  0.753603  0.535273  0.243581  0.861275  ...  
2  0.190646  0.539439  0.901377  0.770925  0.908361  0.454777  ...  
3  0.355888  0.451189  0.672876  0.745438  0.576982  0.907190  ...  
4  0.535901  0.394481  0.118837  0.199040  0.557401  0.653302  ...  

[5 rows x 20 columns]

In [27]: def _modified_pipeline(df, target_col):
    ...:     samples = df[df['col1'] >= 0.2]
    ...:     p_sum = (samples[target_col] >= 0.5).sum()
    ...:     r_sum = (samples[target_col] <= 0.5).sum()
    ...:     cnt = len(samples)
    ...:     return p_sum, r_sum, cnt
    ...:

In [28]: from numba import njit
    ...: @njit
    ...: def jit_modified_pipeline(df, target_col):
    ...:     samples = df[df['col1'] >= 0.2]
    ...:     p_sum = (samples[target_col] >= 0.5).sum()
    ...:     r_sum = (samples[target_col] <= 0.5).sum()
    ...:     cnt = len(samples)
    ...:     return p_sum, r_sum, cnt
    ...:

In [29]: _modified_pipeline(df, target_col)
Out[29]: (1, 3, 4)

In [30]: jit_modified_pipeline(df, target_col)
<ipython-input-28-bbc0261853d0>:5: NumbaPerformanceWarning:
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.
.....
Out[30]: (1, 2, 3)

As you can see, the Python and SDC-compiled functions produce different outputs for the same input.

Python 3.7.9 & numba 0.52.0 & sdc 0.38.0 & pandas 1.2.0

@kozlov-alexey
Contributor

@dlee992 Hi, thank you for the report! This is a bug in the definition of the memory layout for the SeriesType. Unfortunately, strided arrays/series are not covered well by our tests, but we will fix it shortly. You can work around it by creating the DF from a dictionary built from the column names and the transposed array (so that the native array layout of each column is contiguous), i.e.

df = pd.DataFrame(dict(zip(features, np.random.rand(5, num_columns).transpose())))
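
For illustration (not from the thread), the layout difference between the two constructions can be checked directly: a DF built from a single 2D array keeps strided column views, while the dict-based workaround stores each column contiguously. A minimal sketch, using the same variable names as the report:

import numpy as np
import pandas as pd

num_columns = 20
features = [f'col{i}' for i in range(num_columns)]
data = np.random.rand(5, num_columns)

# Original construction: the DF keeps one 2D block, so each column is a
# strided view into it rather than a contiguous 1D array.
df_orig = pd.DataFrame(data, columns=features)
print(df_orig['col1'].to_numpy().flags['C_CONTIGUOUS'])   # typically False

# Workaround from above: the data is copied column by column, so each
# column ends up contiguous.
df_fixed = pd.DataFrame(dict(zip(features, data.transpose())))
print(df_fixed['col1'].to_numpy().flags['C_CONTIGUOUS'])  # expected True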

@kozlov-alexey kozlov-alexey self-assigned this Nov 25, 2021
@kozlov-alexey kozlov-alexey added this to the gold milestone Nov 25, 2021
@dlee992
Author

dlee992 commented Nov 26, 2021

@kozlov-alexey, thanks, the workaround makes sense. I tested a bit more, as below:

def foo1(df):
    choose_col = 'col1'
    filter_series = df[choose_col].apply(lambda x: 0 if x < 0.5 else 1)
    filtered_sum = (df[target_col] * filter_series).sum()
    return filtered_sum

def foo2(df):
    lst  = ['col1', 'col2', 'col3']
    for cho_col in lst:
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()

foo1 can be compiled and executed with @njit, but filtered_sum shows an accuracy drift compared with the pure-Python result, and the drift grows for dataframes with more rows. I am not sure whether this drift is expected (e.g. due to fastmath or other numerical optimizations) or is just a bug. foo2, on the other hand, can't be compiled at all. Why does this happen?

# foo1 without and with njit, df has 10 rows
- [4.207398879779037, 10.0]
?                  ^

+ [4.207398879779036, 10.0]
?                  ^

# foo1 without and with njit, df has 10_000_000 rows
- [2501705.7589422013, 10000000.0]
?                 ^^^

+ [2501705.7589421924, 10000000.0]
?                ++ ^

# foo2 compilation error
  File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/dispatcher.py", line 482, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/dispatcher.py", line 423, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot request literal type.

File "test.py", line 57:
def _modified_pipeline(df):
    <source elided>
    for cho_col in lst:
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        ^

During: typing of intrinsic-call at /Users/.../sdc/tests/tests_ant/test_ant_9.py (57)

Tested on the newest master branch of SDC.

@kozlov-alexey
Contributor

@dlee992, The second error is a current limitation of SDC (which is mostly based on Numba, a JIT compiler with static typing): iteration over heterogeneous collections using normal Python syntax is generally forbidden. The reason is simple: in your example the DF could have columns of different types, so the variable filtered_sum would need to have a different type on different iterations of the loop. Specifically for this, Numba provides the literal_unroll feature, which allows such code to be compiled with minimal changes, e.g.

from numba import njit, literal_unroll

@njit
def foo2_error(df):
    lst  = ('col1', 'col2', 'col3')      # a tuple of column names instead of list
    results = []
    for cho_col in literal_unroll(lst):  # literal_unroll used
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()
        results.append(filtered_sum)

    return results
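
(Note that lst has to be a tuple of constant strings rather than a list: literal_unroll unrolls the loop at compile time, so each df[cho_col] access is typed with a literal column name.)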

The above should work. Regarding the first problem, I think this deviation in precision is somewhat expected, since parallelization of sum in SDC means the values are added in a different order. As far as I can see, when operating on a sorted sequence with parallel=False, summation via an explicit loop gives exactly the same result for the compiled and pure-Python versions:

# on sorted data with sum via explicit loop:
arr_result:     2500270.1616518456   # numba jitted explicit loop, parallel=False
arr_result_ref: 2500270.1616518456   # python explicit loop
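
As an aside, here is a minimal sketch (not from the thread) of the kind of comparison described above: an explicit-loop sum, once in pure Python and once under njit without parallelization, performs the additions in the same order and so is expected to match bit for bit:

import numpy as np
from numba import njit

@njit(parallel=False)
def loop_sum_jit(arr):
    # Sequential accumulation, same order of additions as the Python loop below.
    acc = 0.0
    for x in arr:
        acc += x
    return acc

def loop_sum_py(arr):
    acc = 0.0
    for x in arr:
        acc += x
    return acc

arr = np.sort(np.random.rand(10_000_000))
print(loop_sum_jit(arr) == loop_sum_py(arr))   # expected True: same summation order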

But thank you for pointing this out; we will dig deeper to see if something can be improved.

kozlov-alexey added a commit to kozlov-alexey/sdc that referenced this issue Dec 21, 2021
Details: definition of underlying data type of Series was
done from PyObject dtype only and didn't take into account
layout of original array, as a result 'C' layout was always
inferred, where the original array might have other layout,
breaking iteration over such Series (DF columns).

Fixes IntelPython#996.
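
For illustration (not part of the commit or the thread), the layout distinction this fix is about can be seen with numba.typeof: a strided 1D slice of a 2D array is typed with layout 'A', while a contiguous array gets layout 'C'. A minimal sketch:

import numpy as np
import numba

a = np.random.rand(5, 20)
col = a[:, 0]                                   # strided slice: stride is 20 * 8 bytes
print(numba.typeof(col))                        # array(float64, 1d, A)
print(numba.typeof(np.ascontiguousarray(col)))  # array(float64, 1d, C)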
@dlee992 dlee992 closed this as completed May 19, 2022