feature request: allow `skipmissing` column types #3398

adienes · 2023-11-05T18:07:09Z

I understand the rationale for the very elaborate missing logic in that it forces the user to be explicit about how to handle missing values and potentially avoids sneaky statistical bugs

however

for "quick and dirty" tasks just trying to make sense of some data, it quickly becomes cumbersome to constantly be wrapping things in skipmissing or dropmissing etc. etc.

I would love some way to tag columns (or the whole table) as `skipmissing`'ed so that all future transformations will automatically insert a `skipmissing`. maybe like `transform(df, All() .=> skipmissing)` or `skipmissing!(df)` or such
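For context, here is a minimal sketch of the explicit wrapping the request is about. The data frame and its column names are made up for illustration; the calls themselves (`skipmissing`, `mean`, `groupby`, `combine`) are existing Base, Statistics, and DataFrames.jl functions.

```julia
using DataFrames, Statistics

# Toy data (made up for illustration): one missing value in :x.
df = DataFrame(g = [1, 1, 2, 2], x = [1.0, missing, 3.0, 5.0])

# A single missing value propagates through the reduction.
plain = mean(df.x)                      # missing

# Current approach: wrap explicitly every time.
skipped = mean(skipmissing(df.x))       # 3.0

# The same wrapping is needed inside combine/transform calls.
by_group = combine(groupby(df, :g), :x => mean ∘ skipmissing => :x_mean)
```

Each reduction needs its own `skipmissing`, which is the repetition a column-level or table-level tag would remove.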


bkamins · 2023-11-05T22:56:07Z

I understand your concern and share it. There is a wide difference between "production code" and "data discovery" workflows.

What you ask for is doable already with metadata. However, I think a better solution is to have a set of functions that provide an alternative set of behaviors. This is what https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/missing/#Functions-which-skip-missing-values does. The question, though, is how to reach a common agreement on how to approach this across the package ecosystem.
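A rough sketch of both ideas, under stated assumptions: the metadata key `"skipmissing"` and the helpers `tagged_col` and `smean` are hypothetical names invented here, not part of DataFrames.jl or InMemoryDatasets.jl. Only `colmetadata!` and `colmetadata` are existing DataFrames.jl functions (available since the metadata API was added in 1.4).

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, missing, 3], b = [4, 5, 6])

# Tag column :a using column-level metadata.
# The key "skipmissing" is an arbitrary choice for this sketch.
colmetadata!(df, :a, "skipmissing", true; style = :note)

# Hypothetical helper: return the column wrapped in skipmissing if it is tagged,
# otherwise return the column unchanged.
function tagged_col(df::AbstractDataFrame, col)
    v = df[!, col]
    return colmetadata(df, col, "skipmissing", false) ? skipmissing(v) : v
end

mean_a = mean(tagged_col(df, :a))   # 2.0, the missing value is skipped
mean_b = mean(tagged_col(df, :b))   # 5.0, untagged column is used as-is

# Alternative: a separate function with skip-by-default behavior.
# `smean` is a hypothetical name, not an existing function.
smean(x) = mean(skipmissing(x))
smean_a = smean(df.a)               # 2.0
```

With metadata, the opt-in lives on the data and every consumer has to check it; with alternative functions, the opt-in lives at the call site, and the open question is where such functions should be defined.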

mkitti · 2023-11-28T19:27:09Z

From https://discourse.julialang.org/t/why-are-missing-values-not-ignored-by-default/106756/115?u=mkitti , it does not appear that hard to do. I'm not clear if this should be part of DataFrames.jl though.

julia> using CSV, DataFrames, Statistics

julia> struct SkipMissingDataFrame
           parent::DataFrame
       end

julia> Base.parent(smdf::SkipMissingDataFrame) = getfield(smdf, :parent)

julia> Base.getproperty(smdf::SkipMissingDataFrame, sym::Symbol) = skipmissing(Base.getproperty(parent(smdf), sym))

julia> write("blah.csv","""
       "col1", "col2"
       "5", "6"
       "1", "2"
       "30", "31"
       "22", "23"
       "NA"
       "50"
       """)
65

julia> df = CSV.read("blah.csv", DataFrame; silencewarnings=true);

julia> smdf = SkipMissingDataFrame(df)
SkipMissingDataFrame(6×2 DataFrame
 Row │ col1     col2    
     │ String3  Int64?  
─────┼──────────────────
   1 │ 5              6
   2 │ 1              2
   3 │ 30            31
   4 │ 22            23
   5 │ NA       missing 
   6 │ 50       missing )

julia> smdf.col2 |> mean
15.5

julia> smdf.col2 |> x->Iterators.filter(>(10),x) |> mean
27.0

nalimilan · 2023-11-28T20:21:41Z

This example is indeed simple, but as soon as you want to support operations on data frames, you have to reimplement all of the DataFrames.jl API. It's doable but quite some code.

This also creates new issues: `df.col3 = 2 .* df.col2` wouldn't work anymore.
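One way to read this point (an interpretation, not something spelled out in the thread): once missing values are skipped, the result no longer has one entry per row, so it cannot be stored back as a column. A minimal sketch using plain vectors, with the values of `col2` copied from the example above:

```julia
# Values of col2 from the example above, as a plain vector.
col2 = [6, 2, 31, 23, missing, missing]

# Skipping missings first gives a shorter result.
doubled = 2 .* collect(skipmissing(col2))   # [12, 4, 62, 46]

n_rows    = length(col2)      # 6
n_doubled = length(doubled)   # 4, so it does not line up with the rows anymore

# Broadcasting over the original column keeps row alignment,
# with missing propagated instead of skipped.
aligned = 2 .* col2           # [12, 4, 62, 46, missing, missing]
```

Reductions such as `mean` are fine with a skipped view, but row-wise operations need the original column so that each result stays attached to its row.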

I tend to think that this would be better handled with improved macros in DataFramesMeta.
