Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suprising behaviour of Impute.replace #137

Open
ueliwechsler opened this issue Aug 17, 2023 · 2 comments
Open

Suprising behaviour of Impute.replace #137

ueliwechsler opened this issue Aug 17, 2023 · 2 comments

Comments

@ueliwechsler
Copy link

ueliwechsler commented Aug 17, 2023

I was struggle a bit with the Impute.replace functionality, as at first I did not realize that replacing 0 is not the same as replacing 0.0.

See the following minimal example:

julia> using Impute
julai> using DataFrames
julia> df_test = DataFrame(a = [missing, 1.2, 2.0, 3.5])
julia> transform!(df_test, "a" => Impute.replace(values=0) => "a")
4×1 DataFrame
 Row │ a         
     │ Float64?  
─────┼───────────
   1missing   
   21.2
   32.0
   43.54×1 DataFrame
julia> transform!(df_test, "a" => Impute.replace(values=0.0) => "a")
 Row │ a        
     │ Float64? 
─────┼──────────
   10.0
   21.2
   32.0
   43.5

I know it is written in the docs

If the input data is of a different type then the no replacement will be performed.

But it does not feel intuitive to me to not replace the value (and not provide a warning), especially if it could be directly converted into the type of column values.

Is there something that could be done to not "silently fail" if the value is not replaced?

@rofinn
Copy link
Member

rofinn commented Aug 18, 2023

This is an intentional design choice in order support passing tuples of values.

# From https://github.com/invenia/Impute.jl/blob/master/test/imputors/replace.jl#L23
imp = Replace(; values=(DateTime(0), -9999, NaN, ""))
df_table = DataFrame(
    :time => [missing, DateTime(2020, 02, 02), DateTime(2121, 12, 12)],
    :loc => [12, -5, missing],
    :val => [1.5, missing, 3.0],
    :desc => ["foo", "bar", missing],
)
df_expected = DataFrame(
    :time => [DateTime(0), DateTime(2020, 02, 02), DateTime(2121, 12, 12)],
    :loc => [12, -5, -9999],
    :val => [1.5, NaN, 3.0],
    :desc => ["foo", "bar", ""],
)

Simply adding a type conversion would break this functionality and adding a warning would be very noisy. My only thought is that you could probably use typejoin and either iteratively/recursively call supertype to identify the closest match. Though I don't think the added complexity and overhead would be worth it.

@ueliwechsler
Copy link
Author

Thank you for the explanation. My example was only considering replacing the values for one column. In the more general context of replacing values for more than one column, the design makes sense to me.

Would you mind me creating a PR to update the documentation such that the example shows a case where the values are not replaced and maybe add a caveat box to highlight the behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants