Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider allowing default in quantile and median #132

Open
bkamins opened this issue Nov 28, 2022 · 30 comments
Open

Consider allowing default in quantile and median #132

bkamins opened this issue Nov 28, 2022 · 30 comments

Comments

@bkamins
Copy link
Contributor

bkamins commented Nov 28, 2022

In generic code the fact that the following operations error:

julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

julia> quantile(Int[], 0.5)
ERROR: ArgumentError: empty data vector

is inconvenient (as the only work-around is try-catch). Maybe we could allow passing default kwarg that would be returned when the passed collection is empty?

@aplavin
Copy link
Contributor

aplavin commented Nov 29, 2022

Would make sense for mean as well. Currently, empty mean returns NaN for numbers and throws for non-numbers.

Btw, sum lets you specify the default instead of zero, but the argument is called init there.

@bkamins
Copy link
Contributor Author

bkamins commented Nov 29, 2022

Currently, empty mean returns NaN for numbers and throws for non-numbers.

Yes, but at least it does not throw an error in standard case (but indeed we could allow for providing the other value of default as NaN, while correct is highly non-standard for Julia data processing)

called init there

The reason is that sum is a reduction, while e.g. median is not a reduction, so that is why init is not a best name in this case. But indeed - I was not sure what would be the best name of the kwarg.

@aplavin
Copy link
Contributor

aplavin commented Nov 29, 2022

Yes, but at least it does not throw an error in standard case

Would be weird to have default in median, but not in mean. And I also wanted this functionality a couple of times!

NaN, while correct is highly non-standard for Julia data processing

Dunno, I like NaNs the majority of time, it's the only way to signal "there's no number here" in a type-stable way. I mostly wanted to use default for values like 0 or something similar, when I need an actual number.

that is why init is not a best name in this case

Yeah, init doesn't seem to fit well for mean/median, I agree. Just pointed out what exists elsewhere.

@pdeffebach
Copy link

FWIW, R returns NULL which is roughly equivalent to nothing.

@aplavin
Copy link
Contributor

aplavin commented Nov 29, 2022

For mean, NaN seems the only right default choice, considering that mean(x) = sum(x) / length(x) and 0.0 / 0 is NaN.

However, if these functions all returned nothing for empty inputs, setting a default would be a straightforward something(median(x), 0.0). It's already common for Julia functions to return nothing indicating "no result" in some sense, eg findfirst-like functions.

nothing is also great in that functions deliberately don't propagate it further. So when a user gets nothing and forgets to handle it, he gets an exception very soon.

@bkamins
Copy link
Contributor Author

bkamins commented Nov 29, 2022

These are all good ideas. But then the question is how to convert it to API in a non-breaking way?
We probably need a kwarg that would allow to switch behavior.

@aplavin
Copy link
Contributor

aplavin commented Nov 29, 2022

Maybe, median/quantile can simply return nothing instead of throwing? Seems nonbreaking, and allows for easy default selection through something(...).

@aplavin
Copy link
Contributor

aplavin commented Nov 29, 2022

Another idea is a general wrapper function so that the interface looks like onempty(median, xs, default=0.0).

@bkamins
Copy link
Contributor Author

bkamins commented Nov 29, 2022

Seems nonbreaking, and allows for easy default selection through something(...).

It would be non-breaking, but I feared proposing it for such a common function 😄.

a general wrapper function so that the interface looks like onempty(median, xs, default=0.0).

This general wrapper function would require special casing for each function passed as its first argument to be efficient.

The point is that isempty(skipmissing(x)) is O(n) operation in general:

julia> x = missings(Int, 10^8);

julia> @time isempty(skipmissing(x))
  0.077613 seconds (1 allocation: 16 bytes)
true

But maybe it is good enough?

@aplavin
Copy link
Contributor

aplavin commented Nov 29, 2022

The point is that isempty(skipmissing(x)) is O(n) operation in general

Good point! I mostly considered applying functions to an actual materialized array, not to skipmissing.

An advantage of onempty vs either median(; default) or something(median(), default) is that it makes clear when the default value is used - exactly when the input is empty. The other two variants read more ambiguous regarding when the default gets returned.

OTOH, advantages of something(median(), ...) is that it doesn't involve creating any extra functions, and is trivially efficient wrt isempty(...) checks.

@bkamins
Copy link
Contributor Author

bkamins commented Nov 29, 2022

@nalimilan - what do you think?

@nalimilan
Copy link
Member

I don't think there's a precedent of a function that returns nothing when the input is empty, right? So adding a default keyword argument seems simpler. It's too bad we can't call it init for consistency with sum.

@aplavin
Copy link
Contributor

aplavin commented Nov 30, 2022

Some Julia functions throw an exception when no result is available (eg first), some do return nothing (eg findfirst).
Btw, first-or-default and only-or-default functions would also useful to have...

@bkamins
Copy link
Contributor Author

bkamins commented Nov 30, 2022

In general where possible I personally prefer special value-based behavior rather than error-based. For median and quantile returning nothing is safe.

However, I understand the concern of @nalimilan that it requires a careful design so that we have one common solution rather than many special cases. E.g. nothing as a special value works for mean, median, quantile, and e.g. std, but it is not a good value for sum as:

julia> isnothing(sum([nothing]))
true

@aplavin
Copy link
Contributor

aplavin commented Nov 30, 2022

It's unfortunate, but is a fact, that functions in Julia tend not to really consider that nothing can be contained within a collection. For example, findfirst returns nothing when it didn't find anything, or when it found the result, and its index is nothing:

# can distinguish these two:
julia> findfirst(x -> true, Dict(1 => 2))
1
julia> findfirst(x -> false, Dict(1 => 2))

# but not these two:
julia> findfirst(x -> true, Dict(nothing => 2))

julia> findfirst(x -> false, Dict(nothing => 2))

Potentially, median could reasonably be extended to non-numbers in the future, and return the only element for length-1 collections. But I personally don't see it as a major issue...

@mreichMPI-BGC
Copy link

Following this https://discourse.julialang.org/t/summarizing-grouped-dataframe-where-a-group-is-entirely-missing/90920/20?u=mreichmpi-bgc my 5 cents from a user perspective (and with a Tidyverse background as the most recent):

I like the general wrapper on_empty . And I think the model of purrr in Tidyverse of having wrappers for capturing side effects is quite convenient in data analysis, e.g. safely, possibly and quietly etc. Cf. Capture side effects. — safely • purrr.

In general I believe the user base will expand if such function are added, even at the expense of performance or other design aspects.

@bkamins
Copy link
Contributor Author

bkamins commented Dec 17, 2022

I was thinking what is best and came to the conclusion that we could define somewhere:

withempty(f, default) = x -> isempty(x) ? default : f(x)

(name is tentative)

With skipmissing the performance will not be optimal, but I think most of the time users will not complain (you would need to have a vector of all missing values with millions of such elements to notice the problem.

In this post I have shown an example usage.

@aplavin
Copy link
Contributor

aplavin commented Dec 17, 2022

I also like this solution, but wonder if it's possible to generalize in the following sense, while keeping simple and intuitive.

Often, one wants to compute mean/median/std/whatever only when at least several values are available. If you are interested in the distribution location, mean of a single sample isn't of much use - even more so for scale measurements such as std.
From time to time, I find myself doing things like length(x) >= 4 ? mean(x) : NaN. Conceptually, this seems very similar to the ifempty check discussed here.

Maybe, there's some clean solution to both?.. Anything to borrow from R/others?

@bkamins
Copy link
Contributor Author

bkamins commented Dec 17, 2022

We could define:

withcondition(f; default=missing, condition=isempty) = x -> condition(x) ? default : f(x)

(I proposed adding defaults, but we could skip them).

Then, by default we would have "return missing if empty, but one could write withcondition(median, default=NaN, condition=x->length(x) < 5) to get what you want.

@nalimilan
Copy link
Member

That sounds overkilll to me. The point of such a function is to be simple and convenient, as users can easily define other functions if needed. Actually we could almost have emptymissing(f) = x -> isempty(x) ? missing : f(x) to make things even simpler for the common case (and add that to Missings.jl).

@bkamins
Copy link
Contributor Author

bkamins commented Dec 17, 2022

I would be also OK with such a solution as it likely covers most of the use cases.

@jariji
Copy link

jariji commented Dec 17, 2022

After reading @bkamins blog post I came here to suggest

whenever(predicate, f, default) = x -> predicate(x) ? default : f(x)

It works for isempty, isnothing, ismissing, isnan. I like the positional version more than the kwargs version suggested above. I'm not committed to the name at the moment.

@nalimilan
Copy link
Member

whenever(predicate, f, default) is only 3 characters shorter than x -> predicate(x) ? default : f(x) so I don't really see the point of adding such a general function.

@bkamins
Copy link
Contributor Author

bkamins commented Dec 17, 2022

I don't really see the point of adding such a general function.

There are two points that can be considered:

  1. you actually need to write (x -> predicate(x) ? default : f(x)) in practice, and these extra ( and ) create visual noise and are easy to forget.
  2. x -> predicate(x) ? default : f(x) will be re-compiled every time it is written, while whenever(predicate, f, default) is compiled once

@jariji
Copy link

jariji commented Dec 17, 2022

BTW, I'm not totally sold on the argument order being f, default. whenever(predicate, value, function) seems closer to the familiar if-then-else ordering. This ambiguity is the downside of positionals.

Tangential, but I don't really like that the predicate functions we have are positive in the degenerate case rather than the good case. If there were better words for nonmissing/nonempty/nonnan that could be nicer than the current system.

@mreichMPI-BGC
Copy link

We could define:
withcondition(f; default=missing, condition=isempty) = x -> condition(x) ? default : f(x)
(I proposed adding defaults, but we could skip them).

I really like that most. I like, when there are reasonable, intuitive defaults that cover most of the cases, but then allow the user to decide to make things as complicated as she/he wants.

If would probably like to see these three functions (at least):

catchCondition(f; condition=isempty,  default=missing) = x -> condition(x) ? default : f(x)
catchAny(f; condition=isempty,  default=missing) = x -> condition(x) ? default : try f(x) catch e  default end
catchMissing(f; catchFun=catchCondition) = x -> catchFun(f)(skipmissing(x))

The last one would just allow avoiding typing skipmissing everytime. Just for convenience, but I think important for easy usage.

I would like to be able to pass the kwargs from catchMissing also to the other functions, but somehow catchMissing(f; catchFun=catchCondition, kwargs...) did not work (sorry too new to Julia...).

Tangential, but I don't really like that the predicate functions we have are positive in the degenerate case rather than the good case. If there were better words for nonmissing/nonempty/nonnan that could be nicer than the current system.

Agree - that's why I'd suggest having "catch" in the name, which makes it more intuitive logically IMO.

whenever(predicate, f, default) = x -> predicate(x) ? default : f(x)
It works for isempty, isnothing, ismissing, isnan. I like the positional version more than the kwargs version suggested above. I'm not committed to the name at the moment.

I really like the keyword version with defaults much more. Also I think for the user it is very important to have a consistent design, e.g. here always have f at the beginning. This is what makes tidyverse so attractive I believe.

@aplavin
Copy link
Contributor

aplavin commented Dec 18, 2022

Another idea for a general and convenient interface, that encompasses both isempty and length checks:

iflength(median, 0 => NaN)(x)  # NaN if x is empty
iflength(median, 0:3 => NaN)(x)  # NaN if length(x) is <= 3

Also, no matter what predicates are incorporated, I believe

default=missing

shouldn't be included by default. If anything, the default value should be nothing as common in Julia elsewhere: at least, nothing immediately throws on almost any operation instead of silently propagating, and cannot be forgotten.
Or (even better, I think) the default argument shouldn't have a default value ( :) ) at all, so that users always choose explicitly. Two common propagating not-a-value placeholders are NaN and missing.

@bkamins
Copy link
Contributor Author

bkamins commented Dec 19, 2022

Apart from the general solution discussed here, in JuliaData/Missings.jl#143 I propose adding emptymissing as @nalimilan since it covers the most common problematic case.

@nalimilan
Copy link
Member

Thanks @bkamins. Regarding the discussion about more general suggestions, note that this issue is currently in Statistics.jl, but it wouldn't be super logical to add such a general function in this package IMO. It's important to think about where it would go if that function was to be added.

@bkamins
Copy link
Contributor Author

bkamins commented Dec 24, 2022

It's important to think about where it would go if that function was to be added.

I agree. That is why I thought adding emptymissing in Missings.jl was a clean choice.

I thought, that once a consensus is reached about the API, a decision can be made where such a function should live (maybe even Base Julia if it is general enough)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants