zscore with missing values #898

Open
bkamins opened this issue Oct 2, 2023 · 5 comments

Comments

@bkamins
Contributor

bkamins commented Oct 2, 2023

@nalimilan do you think we could make zscore work with vectors containing missing values? The issue is that skipmissing does not help in this context: it drops the missing entries, so the result no longer lines up with the original vector.
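
To make the problem concrete, here is a minimal sketch that uses only the Statistics standard library. The vector `a` is illustrative data, not taken from the original report. Standardizing after dropping the missing values gives a shorter vector, so the scores no longer correspond to the original positions.

```julia
using Statistics

a = [1, 2, missing, 3]            # illustrative data (an assumption)

vals = collect(skipmissing(a))    # the missing entry is dropped here
z = (vals .- mean(vals)) ./ std(vals)

# z is shorter than a, so z[3] is the score of a[4], not of a[3]
println((length(z), length(a)))
```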

@pdeffebach

This seems like a nice use for my stalled `spreadmissings` PR.

@aplavin
Contributor

aplavin commented Oct 2, 2023

Currently, zscore restricts its argument to arrays of real numbers (AbstractArray{<:Real}). If that weren't the case, the following perfectly general solution would work:

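# like zscore() but with no type restrictions:
julia> using StatsBase  # provides mean_and_std

julia> function myzscore(x)
           μ, σ = mean_and_std(x)
           # `v` avoids shadowing the argument `x` inside the closure
           map(v -> (v - μ) / σ, x)
       end

julia> a = [1, 2, missing, 3]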

julia> using Accessors

julia> @modify(myzscore, skipmissing(a))
4-element Vector{Union{Missing, Float64}}:
 -1.0
  0.0
   missing
  1.0

It reads like "modify skipmissing(a) by myzscore", or more verbosely "apply myzscore to skipmissing(a), write the result back to skipmissing(a), and return the modified copy of a".

@modify is a neat API, useful whenever you want to modify part of an object and get the modified copy back.
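
For readers who don't know Accessors, the "apply, write back, return a copy" behavior described above can be sketched by hand. This is only an illustration of the semantics, not how Accessors implements it: `modify_nonmissing` and `simple_zscore` are hypothetical names introduced here, and only the Statistics standard library is used.

```julia
using Statistics

# Hypothetical helper: apply `f` to the non-missing values of `a`,
# write the results back at the same positions, and return a new vector.
function modify_nonmissing(f, a::AbstractVector)
    keep = findall(!ismissing, a)             # positions of non-missing entries
    new_vals = f(collect(skipmissing(a)))     # e.g. their z-scores
    out = Vector{Union{Missing, eltype(new_vals)}}(missing, length(a))
    out[keep] = new_vals                      # other positions stay missing
    return out
end

# Plain z-score with no type restrictions (sample standard deviation)
simple_zscore(x) = (x .- mean(x)) ./ std(x)

a = [1, 2, missing, 3]
result = modify_nonmissing(simple_zscore, a)
```

The input vector `a` is left untouched; only the returned copy carries the standardized values.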

Bonus: this is not limited to missing; arbitrary skip predicates work:

julia> a = [1, 2, NaN, 3]

julia> using Skipper

julia> @modify(myzscore, a |> skip(isnan))
4-element Vector{Float64}:
  -1.0
   0.0
 NaN
   1.0
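
The same idea carries over to an arbitrary predicate. Below is a rough hand-written sketch of what the call above amounts to, again with hypothetical helper names (`modify_skipping`, `simple_zscore`) and no dependency on Skipper or Accessors. It is meant to show the semantics only.

```julia
using Statistics

# Hypothetical helper: apply `f` to the elements for which `pred` is false,
# leave the skipped elements as they are, and return a new vector.
function modify_skipping(f, pred, a::AbstractVector)
    keep = findall(!pred, a)                  # positions that are not skipped
    new_vals = f(a[keep])
    T = promote_type(eltype(a), eltype(new_vals))
    out = copyto!(similar(a, T), a)           # copy, so `a` is not mutated
    out[keep] = new_vals
    return out
end

simple_zscore(x) = (x .- mean(x)) ./ std(x)

a = [1, 2, NaN, 3]
result = modify_skipping(simple_zscore, isnan, a)
```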

@Mattriks

Mattriks commented Oct 3, 2023

If I wanted the column means of a data frame with missing values, I could do mapcols(mean∘skipmissing, df). So I would like something similar for z-scores, such as mapcols(zscore∘spreadmissings, df).
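
One way to picture the requested behavior is a wrapper that turns a function on complete vectors into one that passes missing values through. The sketch below is an assumption about the desired semantics, not the API of the stalled PR: `spread_over_missing` and `simple_zscore` are hypothetical names, and a NamedTuple of vectors stands in for a data frame so that the example runs with the standard library alone.

```julia
using Statistics

# Hypothetical wrapper: returns a function that applies `f` to the
# non-missing values of a column and puts `missing` back where it was.
function spread_over_missing(f)
    return function (col::AbstractVector)
        keep = findall(!ismissing, col)
        vals = f(collect(skipmissing(col)))
        out = Vector{Union{Missing, eltype(vals)}}(missing, length(col))
        out[keep] = vals
        return out
    end
end

simple_zscore(x) = (x .- mean(x)) ./ std(x)

# A NamedTuple of columns stands in for a data frame here
cols = (a = [1, 2, missing, 3], b = [1, missing, 1, 2])
standardized = map(spread_over_missing(simple_zscore), cols)
```

With DataFrames.jl, a wrapper like this could presumably be passed to mapcols in the same way as mean∘skipmissing.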

@nalimilan
Member

Yes, that's typically the kind of case spreadmissings was supposed to handle. Maybe it's time to do something about it. ;-)

@aplavin
Contributor

aplavin commented Oct 4, 2023

Thanks to the composability of Accessors optics, it is already possible to normalize all columns of a table, for some supported table types:

using StructArrays, Accessors
tbl = StructArray(a=[1, 2, missing, 3], b=[1, missing, 1, 2])
@modify(myzscore, StructArrays.components(tbl) |> Elements() |> skipmissing)

If there's interest, this could be generalized to the Tables.jl interface, with more table types encouraged to support it, so that one could write

@modify(myzscore, columns(tbl) |> Elements() |> skipmissing)

and it would work for any table.

Still, this is mostly just to show how powerful the existing composable interfaces in Julia are, not to discourage anyone from implementing spreadmissings. I understand that many specialized functions, each solving one well-defined problem, are often easier to explain to new users than a few general functions that compose in different ways to solve different tasks.

5 participants