feature request: allow skipmissing column types #3398

Open
adienes opened this issue Nov 5, 2023 · 3 comments

@adienes

adienes commented Nov 5, 2023

I understand the rationale for the very elaborate missing logic: it forces the user to be explicit about how to handle missing values and potentially avoids sneaky statistical bugs.

However, for "quick and dirty" tasks, when just trying to make sense of some data, it quickly becomes cumbersome to keep wrapping things in skipmissing, dropmissing, and so on.

I would love some way to tag columns (or the whole table) as skipmissing'ed so that all future transformations automatically insert a skipmissing. Maybe something like transform(df, All() .=> skipmissing) or skipmissing!(df).
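For context, here is a minimal sketch of the wrapping that is needed today. The toy table and its column names `a` and `b` are made up for illustration; the calls themselves (`skipmissing`, `dropmissing`, `combine`) are the existing Base and DataFrames.jl functions mentioned above.

```julia
using DataFrames, Statistics

# Toy data, invented for this illustration.
df = DataFrame(a = [1, missing, 3], b = [4.0, 5.0, missing])

# Reductions propagate missing by default.
plain = mean(df.a)

# So each reduction needs an explicit wrapper.
mean_a = mean(skipmissing(df.a))

# The same wrapper has to be repeated inside the transformation mini-language.
means = combine(df, [:a, :b] .=> (mean ∘ skipmissing))

# Alternatively, drop every row that has a missing value first.
complete = dropmissing(df)
```

None of this is hard, but it has to be repeated at every call site, which is the friction described above.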

@bkamins
Member

bkamins commented Nov 5, 2023

I understand your concern and share it. There is a wide difference between "production code" and "data discovery" workflows.

What you ask for is doable already with metadata. However, I think a better solution is to have a set of functions that provide an alternative set of behaviors. This is what https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/missing/#Functions-which-skip-missing-values does. The question, though, is how to reach common agreement on the approach across the package ecosystem.
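One way the metadata route might look, as a rough sketch only. It assumes DataFrames.jl 1.4 or later (where `colmetadata!` and `colmetadata` are available). The key name "skipmissing" and the helper `tagged` are hypothetical conventions invented here; DataFrames.jl itself attaches no meaning to them.

```julia
using DataFrames, Statistics

# Toy data, invented for this illustration.
df = DataFrame(col1 = [5, 1, missing], col2 = [6, missing, 31])

# Tag one column. The key "skipmissing" is this sketch's own convention.
colmetadata!(df, :col2, "skipmissing", true; style = :note)

# Hypothetical helper: honor the tag when reading a column.
function tagged(df::AbstractDataFrame, col::Symbol)
    v = df[!, col]
    # Fall back to `false` when the column carries no such tag.
    return colmetadata(df, col, "skipmissing", false) ? skipmissing(v) : v
end

tagged_mean = mean(tagged(df, :col2))    # missing values are skipped
untagged_mean = mean(tagged(df, :col1))  # untagged column still propagates missing
```

The tag only has an effect in code that goes through the helper, so every function that should respect it has to opt in. That is part of why a shared set of missing-skipping functions may be the simpler design.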

@mkitti

mkitti commented Nov 28, 2023

From https://discourse.julialang.org/t/why-are-missing-values-not-ignored-by-default/106756/115?u=mkitti , it does not appear that hard to do. I'm not clear if this should be part of DataFrames.jl though.

julia> using CSV, DataFrames, Statistics

julia> struct SkipMissingDataFrame
           parent::DataFrame
       end

julia> Base.parent(smdf::SkipMissingDataFrame) = getfield(smdf, :parent)

julia> Base.getproperty(smdf::SkipMissingDataFrame, sym::Symbol) = skipmissing(Base.getproperty(parent(smdf), sym))

julia> write("blah.csv","""
       "col1", "col2"
       "5", "6"
       "1", "2"
       "30", "31"
       "22", "23"
       "NA"
       "50"
       """)
65

julia> df = CSV.read("blah.csv", DataFrame; silencewarnings=true);
julia> smdf = SkipMissingDataFrame(df)
SkipMissingDataFrame(6×2 DataFrame
 Row │ col1     col2    
     │ String3  Int64?  
─────┼──────────────────
   1 │ 5              6
   2 │ 1              2
   3 │ 30            31
   4 │ 22            23
   5 │ NA       missing 
   6 │ 50       missing )

julia> smdf.col2 |> mean
15.5

julia> smdf.col2 |> x->Iterators.filter(>(10),x) |> mean
27.0

@nalimilan
Member

This example is indeed simple, but as soon as you want to support operations on data frames, you have to reimplement all of the DataFrames.jl API. It's doable but quite some code.

This also creates new issues: df.col3 = 2 .* df.col2 wouldn't work anymore.
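A small sketch of one reason that assignment breaks, as I read it. It uses only Base and copies the `col2` values from the example above. Once a column is handed back as a `skipmissing` view, anything computed from it has lost row alignment with the table.

```julia
# Values of col2 from the example table above.
col2 = [6, 2, 31, 23, missing, missing]

# What the wrapper's getproperty hands back for col2.
skipped = skipmissing(col2)

# Materializing the lazy view drops the missing entries...
doubled = 2 .* collect(skipped)

# ...so the result has fewer elements than the table has rows.
lengths = (length(doubled), length(col2))

# On the plain column, broadcasting keeps one value per row and propagates missing.
aligned = 2 .* col2
```

In addition, the wrapper struct above defines only `getproperty`, so assigning a new column through it would need a `setproperty!` method of its own.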

I tend to think that this would be better handled with improved macros in DataFramesMeta.

See also #2314.

4 participants