Implement & test handling of NaNs in preprocessing with handle_nans decorator #130

henrifroese · 2020-07-30T14:24:07Z

- Decorator used everywhere it's necessary in preprocessing.py with appropriate fill values
- test_nan.py added with tests for the preprocessing module

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>

jbesomi · 2020-07-30T17:05:51Z

Thank you for the PR Henri!

I believe we are not dealing with NaN in the correct "Pandas" way.

The rule should be:
If the input Pandas Series has NaN at indexes X, the output Pandas Series should have NaN at the same indexes X.

Preprocessing:
This can be "easily" achieved in all preprocessing functions, with the probable exception of phrases. This PR is therefore not necessary.

Example: remove_diacritics yields

>>> s = pd.Series(["héllo", pd.NA])
>>> hero.remove_diacritics(s)
0. hello
1. (empty space)

Whereas a Pandas user expects:

0. hello
1. pd.NA
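
A minimal sketch of the rule above, assuming a mask-based approach: only the non-missing entries are transformed, so missing values stay at the same indexes. The name remove_diacritics_keep_na and its body are hypothetical illustrations, not the actual Texthero implementation.

```python
import unicodedata

import pandas as pd


def remove_diacritics_keep_na(s: pd.Series) -> pd.Series:
    """Hypothetical NaN-preserving variant (not the library's code)."""

    def strip(text: str) -> str:
        # Decompose characters, then drop the combining marks (accents).
        nfkd = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in nfkd if not unicodedata.combining(ch))

    out = s.copy()
    not_na = s.notna()
    # Only touch the entries that are not missing; NA rows are left as they are.
    out[not_na] = s[not_na].map(strip)
    return out


s = pd.Series(["héllo", pd.NA])
out = remove_diacritics_keep_na(s)
```

With this shape, the missing-value positions of the output match those of the input by construction, which is what the rule asks for.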

Representation:

Here we need to deal with such NA values. The problem arises when we pass NA values to scikit-learn functions, right?
The first question is:

Is there a way to pass NA to scikit-learn functions? If yes, we don't even need handle_nan, just test_na.

If that's not possible, your initial idea was to use a decorator as it might be useful. It keeps the code simple and does two things:

Checks and warns if s has NA
Fills NA

Personally, I'm not a fan of the decorator in this case, as it hides what is happening. Instead, we can simply have a function: s = _handle_nan(s). We might want to give it a more explicit name, such as warn_and_fill_nans. We need the same number of lines, but we gain in clarity: we know exactly at which line the changes are made, we have a better idea of what handle_nan means, and we also get some extra information about the input and output types.
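
To make the comparison concrete, here is a rough sketch of the explicit-function variant. The name warn_and_fill_nans comes from the suggestion above; its signature, the warning text, and the count_tokens caller are assumptions made only for illustration.

```python
import warnings

import numpy as np
import pandas as pd


def warn_and_fill_nans(s: pd.Series, fill_value="") -> pd.Series:
    """Warn if s contains missing values, then return s with them filled."""
    if s.isna().any():
        warnings.warn(
            "Input Series contains missing values; they are replaced "
            f"by {fill_value!r} before processing.",
            UserWarning,
        )
        return s.fillna(fill_value)
    return s


def count_tokens(s: pd.Series) -> pd.Series:
    """Hypothetical caller, standing in for a representation function."""
    # The fill happens on a visible line instead of inside a decorator.
    s = warn_and_fill_nans(s, fill_value="")
    return s.str.split().str.len()


counts = count_tokens(pd.Series(["a b", np.nan]))
```

The call site costs one line, the same as a decorator, but the fill value and the point where the Series changes are both visible in the function body.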

My proposal is:

With preprocessing functions: if NA, return NA
With representation functions:
1. Check whether we can pass NaN directly to scikit-learn
2. If not, we fillna(), compute, and then either fill([], np.nan) or use your previous solution with the copy (we will need to evaluate which is faster)
Add test_na to verify that the previously mentioned "rule" is respected
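
The representation part of the proposal (fill, compute, restore) could look roughly like the sketch below. To stay self-contained it uses a toy stand-in instead of a scikit-learn vectorizer; toy_vectorize, represent_keep_na, and assert_na_preserved are hypothetical names, and the last one only illustrates what a test_na check might assert.

```python
import numpy as np
import pandas as pd


def toy_vectorize(s: pd.Series) -> pd.Series:
    """Stand-in for a real vectorizer: [character count, token count]."""
    return s.map(lambda text: [len(text), len(text.split())])


def represent_keep_na(s: pd.Series) -> pd.Series:
    na_mask = s.isna()                  # remember where the input was missing
    filled = s.fillna("")               # the computation cannot take missing values
    computed = toy_vectorize(filled)
    # Put NaN back at exactly the positions that were missing in the input.
    return computed.where(~na_mask)


def assert_na_preserved(s_in: pd.Series, s_out: pd.Series) -> None:
    """The 'rule': missing values sit at the same indexes in input and output."""
    assert s_out.index.equals(s_in.index)
    assert s_out.isna().equals(s_in.isna())


s = pd.Series(["hello", np.nan, "a b"])
vectors = represent_keep_na(s)
assert_na_preserved(s, vectors)
```

Whether restoring through a mask like this or working on a copy is faster is left open here, as in the proposal.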

As you already spent quite a large amount of time, if you wish I can take care of this.

P.S. I'm sorry this is taking so long. It's surely my fault for not having explained this task properly and for not having seen the issues initially.

jbesomi · 2020-07-30T17:16:28Z

I updated the description of issue #86.
Also, this task is not our first priority yet, probably #43 is ...

mk2510 · 2020-08-01T15:11:23Z

Hi @jbesomi, I think we will leave it until Monday and have a quick talk about it then 🚀

jbesomi · 2020-08-03T18:57:08Z

Henri and Max, can you please just confirm you want me to take over this?

henrifroese · 2020-08-04T15:12:54Z

Yes that would be great 👌

Test handling of NaNs in Preprocessing with handle_nans decorator.

b338622


vercel bot deployed to Preview July 30, 2020 14:24

jbesomi marked this pull request as draft July 30, 2020 17:09

jbesomi mentioned this pull request Jul 30, 2020

All function to deal with np.nan #86

Open

henrifroese added the bug label Aug 3, 2020

ryangawei mentioned this pull request Sep 8, 2020

Infer test cases from input HeroSeries in test_indexes #179

Open
