This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

[BUG] python and sdc-compiled functions generate different output with same input #996

Closed
dlee992 opened this issue Nov 24, 2021 · 3 comments · May be fixed by #1001
dlee992 commented Nov 24, 2021

Reporting a bug

In [25]: num_columns = 20
    ...: features = [f'col{i}' for i in range(num_columns)]
    ...: df = pd.DataFrame(np.random.rand(5, num_columns), columns=features)
    ...: target_col = 'col0'

In [26]: df
Out[26]:
       col0      col1      col2      col3      col4      col5  ...    
0  0.847436  0.116855  0.782481  0.485027  0.027340  0.328801  ...  
1  0.482504  0.845380  0.753603  0.535273  0.243581  0.861275  ...  
2  0.190646  0.539439  0.901377  0.770925  0.908361  0.454777  ...  
3  0.355888  0.451189  0.672876  0.745438  0.576982  0.907190  ...  
4  0.535901  0.394481  0.118837  0.199040  0.557401  0.653302  ...  

[5 rows x 20 columns]

In [27]: def _modified_pipeline(df, target_col):
    ...:     samples = df[df['col1'] >= 0.2]
    ...:     p_sum = (samples[target_col] >= 0.5).sum()
    ...:     r_sum = (samples[target_col] <= 0.5).sum()
    ...:     cnt = len(samples)
    ...:     return p_sum, r_sum, cnt
    ...:

In [28]: from numba import njit
    ...: @njit
    ...: def jit_modified_pipeline(df, target_col):
    ...:     samples = df[df['col1'] >= 0.2]
    ...:     p_sum = (samples[target_col] >= 0.5).sum()
    ...:     r_sum = (samples[target_col] <= 0.5).sum()
    ...:     cnt = len(samples)
    ...:     return p_sum, r_sum, cnt
    ...:

In [29]: _modified_pipeline(df, target_col)
Out[29]: (1, 3, 4)

In [30]: jit_modified_pipeline(df, target_col)
<ipython-input-28-bbc0261853d0>:5: NumbaPerformanceWarning:
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.
.....
Out[30]: (1, 2, 3)

As you can see, the Python and SDC-compiled functions produce different outputs for the same input.

Python 3.7.9 & numba 0.52.0 & sdc 0.38.0 & pandas 1.2.0

@kozlov-alexey
Contributor

@dlee992 Hi, thank you for the report! This is a bug in the definition of the memory layout for the SeriesType. Unfortunately, strided arrays/series are not covered well by our tests, but we will fix it shortly. You can work around it by creating the DF from a dictionary built from the column names and the transposed array (so that the native array layout of each column is contiguous), i.e.

df = pd.DataFrame(dict(zip(features, np.random.rand(5, num_columns).transpose())))
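
For illustration (not from the thread), the layout difference between the two constructions can be checked directly: a DF built from a single 2D array keeps strided column views, while the dict-based workaround stores each column contiguously. A minimal sketch, using the same variable names as the report:

import numpy as np
import pandas as pd

num_columns = 20
features = [f'col{i}' for i in range(num_columns)]
data = np.random.rand(5, num_columns)

# Original construction: the DF keeps one 2D block, so each column is a
# strided view into it rather than a contiguous 1D array.
df_orig = pd.DataFrame(data, columns=features)
print(df_orig['col1'].to_numpy().flags['C_CONTIGUOUS'])   # typically False

# Workaround from above: the data is copied column by column, so each
# column ends up contiguous.
df_fixed = pd.DataFrame(dict(zip(features, data.transpose())))
print(df_fixed['col1'].to_numpy().flags['C_CONTIGUOUS'])  # expected True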

@kozlov-alexey kozlov-alexey self-assigned this Nov 25, 2021
@kozlov-alexey kozlov-alexey added this to the gold milestone Nov 25, 2021
@dlee992
Author

dlee992 commented Nov 26, 2021

@kozlov-alexey, thanks, the workaround makes sense. I tested a bit more, as below:

def foo1(df):
    choose_col = 'col1'
    filter_series = df[choose_col].apply(lambda x: 0 if x < 0.5 else 1)
    filtered_sum = (df[target_col] * filter_series).sum()
    return filtered_sum

def foo2(df):
    lst  = ['col1', 'col2', 'col3']
    for cho_col in lst:
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()

foo1 can be compiled and executed with @njit, but filtered_sum shows an accuracy drift compared with the pure-Python result, and the drift grows for dataframes with more rows. I am not sure whether this drift is expected (e.g. due to fastmath or other numerical optimizations) or is just a bug. foo2, on the other hand, can't be compiled at all. Why does this happen?

# foo1 without and with njit, df has 10 rows
- [4.207398879779037, 10.0]
?                  ^

+ [4.207398879779036, 10.0]
?                  ^

# foo1 without and with njit, df has 10_000_000 rows
- [2501705.7589422013, 10000000.0]
?                 ^^^

+ [2501705.7589421924, 10000000.0]
?                ++ ^

# foo2 compilation error
  File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/dispatcher.py", line 482, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/usr/local/anaconda3/envs/.../lib/python3.7/site-packages/numba/core/dispatcher.py", line 423, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot request literal type.

File "test.py", line 57:
def _modified_pipeline(df):
    <source elided>
    for cho_col in lst:
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        ^

During: typing of intrinsic-call at /Users/.../sdc/tests/tests_ant/test_ant_9.py (57)

Tested on the newest master branch of SDC.

@kozlov-alexey
Contributor

@dlee992, The second error is a current limitation of SDC (which is mostly based on Numba, a JIT compiler with static typing): iteration over heterogeneous collections using normal Python syntax is generally forbidden. The reason is simple: in your example the DF could have columns of different types, so the variable filtered_sum would need to have a different type on different iterations of the loop. Specifically for this, Numba provides the literal_unroll feature, which allows such code to be compiled with minimal changes, e.g.

from numba import njit, literal_unroll

@njit
def foo2_error(df):
    lst  = ('col1', 'col2', 'col3')      # a tuple of column names instead of list
    results = []
    for cho_col in literal_unroll(lst):  # literal_unroll used
        filter_series = df[cho_col].apply(lambda x: 0 if x < 0.5 else 1)
        filtered_sum = (df[target_col] * filter_series).sum()
        results.append(filtered_sum)

    return results
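
(Note that lst has to be a tuple of constant strings rather than a list: literal_unroll unrolls the loop at compile time, so each df[cho_col] access is typed with a literal column name.)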

The above should work. Regarding the first problem, I think this deviation in precision is somewhat expected, since parallelization of sum in SDC means the values are added in a different order. As far as I can see, when operating on a sorted sequence with parallel=False, summation via an explicit loop gives exactly the same result for the compiled and pure-Python versions:

# on sorted data with sum via explicit loop:
arr_result:     2500270.1616518456   # numba jitted explicit loop, parallel=False
arr_result_ref: 2500270.1616518456   # python explicit loop
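
As an aside, here is a minimal sketch (not from the thread) of the kind of comparison described above: an explicit-loop sum, once in pure Python and once under njit without parallelization, performs the additions in the same order and so is expected to match bit for bit:

import numpy as np
from numba import njit

@njit(parallel=False)
def loop_sum_jit(arr):
    # Sequential accumulation, same order of additions as the Python loop below.
    acc = 0.0
    for x in arr:
        acc += x
    return acc

def loop_sum_py(arr):
    acc = 0.0
    for x in arr:
        acc += x
    return acc

arr = np.sort(np.random.rand(10_000_000))
print(loop_sum_jit(arr) == loop_sum_py(arr))   # expected True: same summation order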

But thank you for pointing this out; we will dig deeper to see if something can be improved.

kozlov-alexey added a commit to kozlov-alexey/sdc that referenced this issue Dec 21, 2021
Details: definition of underlying data type of Series was
done from PyObject dtype only and didn't take into account
layout of original array, as a result 'C' layout was always
inferred, where the original array might have other layout,
breaking iteration over such Series (DF columns).

Fixes IntelPython#996.
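
For illustration (not part of the commit or the thread), the layout distinction this fix is about can be seen with numba.typeof: a strided 1D slice of a 2D array is typed with layout 'A', while a contiguous array gets layout 'C'. A minimal sketch:

import numpy as np
import numba

a = np.random.rand(5, 20)
col = a[:, 0]                                   # strided slice: stride is 20 * 8 bytes
print(numba.typeof(col))                        # array(float64, 1d, A)
print(numba.typeof(np.ascontiguousarray(col)))  # array(float64, 1d, C)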
@dlee992 dlee992 closed this as completed May 19, 2022