Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement & test handling of NaNs in preprocessing with handle_nans decorator #130

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

henrifroese
Copy link
Collaborator

@henrifroese henrifroese commented Jul 30, 2020

  • Decorator used everywhere it's necessary in preprocessing.py with appropriate fill values
  • test_nan.py added with tests for the preprocessing module

- Decorator used everywhere it's necessary in preprocessing.py with appropriate fill values
- test_nan.py added with test's for the preprocessing module

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
@jbesomi
Copy link
Owner

jbesomi commented Jul 30, 2020

Thank you for the PR Henri!

I believe we are not dealing with nan in the correct "Pandas" way.

The rule should be:
If input Pandas Series has some nan at indexes X, output Pandas Series should have the same nan at indexes X.

Preprocessing:
This can be "easily" reached in all preprocessing functions, probably at the exception of phrases. This PR is not necessary then.

Example: remove_diacritics yield

>>> s = pd.Series(["héllo", pd.Na])
>>> hero.remove_diacritics(s)
0. hello
1. (empty space)

Where the Pandas-user expects:

0. hello
1. pd.Na

Representation:

Here we need to deal with such NA values. The problems arises when we pass NA values to scikit learn functions, right?
The first question is:

  1. Is there a way we can pass NA to scikit learn function? If yes we don't even need handle_nan, rather just test_na

If that's not possible, your initial idea was to use a decorator as it might be useful. It keeps the code simple and does two things:

  1. Check and warns if s has NA
  2. Fill NA

Personally, I'm not a fan of the decorator in this case as the processes are a bit hidden. Instead, we can simply have a function: s=_handle_nan(s). We might want to call it more explicitly warn_and_fill_nans or something similar. We need the same number of lines but we gain in clearness as we know exactly at which line the changes are made, we have a better idea of what handle_nan means and also we get some extra information regarding the input and output types.

My proposal is:

  1. With preprocessing function: if na return na`
  2. With representation function:
    1. Check if we cannot pass nan directly to scikitlearn
    2. If not we fillna(), compute and then either fill([], np.nan) or use your previous solution with the copy (we will need to evaluate the faster one)
  3. Add test_na to verify the previously mentioned "rule" is respected

As you already spent quite a large amount of time, if you wish I can take care of this.

P.s I'm sorry this is taking that long. It's surely my fault for not having explained this task properly as well as not having seen the issues initially.

@jbesomi jbesomi marked this pull request as draft July 30, 2020 17:09
@jbesomi
Copy link
Owner

jbesomi commented Jul 30, 2020

Update the description of issue #86
Also, this task is not our first priority yet, probably #43 is ...

@mk2510
Copy link
Collaborator

mk2510 commented Aug 1, 2020

Hi @jbesomi I think we will leave it till Monday and have a quick talk about it then 🚀

@henrifroese henrifroese added the bug Something isn't working label Aug 3, 2020
@jbesomi
Copy link
Owner

jbesomi commented Aug 3, 2020

Henri and Max, can you please just confirm you want me to take over this?

@henrifroese
Copy link
Collaborator Author

Yes that would be great 👌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants