Return basic statistical functionality to Base #27374

Closed
ararslan opened this issue Jun 1, 2018 · 46 comments
Labels
domain:maths Mathematical functions

@ararslan
Member

ararslan commented Jun 1, 2018

It seems incredibly unfortunate to me, and indeed almost actively user-hostile, to remove basic functionality such as std from Base. @sbromberger summarized it well in Slack: "if the function is generally understood by a layperson, it should not be removed." You don't have to be a statistics expert to understand what a standard deviation is. The collection of basic statistics functionality in Base in previous versions of Julia was fairly lean and I think struck the perfect balance of "here are the basics, now load a package if you want to get more advanced." Indeed, in the release-0.6 branch, base/statistics.jl is a couple hundred lines of code, including whitespace and extensive documentation. Who was it hurting to have in Base?

@ararslan ararslan added the domain:maths Mathematical functions label Jun 1, 2018
@sbromberger
Contributor

sbromberger commented Jun 1, 2018

I note that mean and median are still in Base. What are the current criteria for moving things out?

The collection of basic statistics functionality in Base in previous versions of Julia was fairly lean and I think struck the perfect balance of "here are the basics, now load a package if you want to get more advanced."

This, 100%. I will fight the "npm-ification" of Julia as long as I'm a participating member of the community. Having things ready to go out of the box was a huge selling point for me.

Moving large chunks of "basic" functionality out of Base to other packages (especially third-party packages; I'm willing to consider stdlib packages like Random and Iterators special cases, though I haven't really made up my mind as to whether this is a good thing yet) also makes the library ecosystem that much more fragile – while Pkg.add("StatsBase") isn't so much of a problem for an end user, it introduces a dependence fragility on any libraries that need a function like std now, which then propagates to other libraries / end users.

@pablosanjose
Contributor

pablosanjose commented Jun 1, 2018

FWIW I agree.

Is there an official decision of where precisely the line is drawn between Base, stdlib and third-party packages?

I think this is a crucial definition that can totally affect the character and feel of a language, from Julia-the-language to Julia-the-platform. With the current strategy, the development of Julia-the-language is fast and nimble, but at the potential cost of a fragile/fragmented Julia-the-platform, as nicely pointed out above.

@fredrikekre
Member

fredrikekre commented Jun 1, 2018

if the function is generally understood by a layperson, it should not be removed.

How about matrix multiplication? Or solving a linear system? These are things a layperson understands even better than std! Should we move LinearAlgebra back too?

Moving large chunks of "basic" functionality out of Base to other packages (especially third-party packages; I'm willing to consider stdlib packages like Random and Iterators special cases, though I haven't really made up my mind as to whether this is a good thing yet) also makes the library ecosystem that much more fragile – while Pkg.add("StatsBase") isn't so much of a problem for an end user, it introduces a dependence fragility on any libraries that need a function like std now, which then propagates to other libraries / end users.

I thought the plan was that StatsBase will be a stdlib (perhaps named Statistics?). It just felt unnecessary to move StatsBase into this repo, only to move it out soon again.

@sbromberger
Contributor

sbromberger commented Jun 1, 2018

How about matrix multiplication?

Matrix multiplication is in stdlib, which is "close enough to Base" right now, I guess. I'm still not really sold on "small base, large stdlib" but as long as the functionality is being distributed and made available via a standard Julia install, I guess it's better than going third-party.

I thought the plan was that StatsBase will be a stdlib (perhaps named Statistics?).

That would perhaps change things somewhat, but currently, it's not that way, and in order to use std in 0.7, we now have to rely on a third-party package. That doesn't seem right.

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Jun 1, 2018

Yes, the plan is to move StatsBase to stdlib/Statistics. I had proposed moving these functions into a new stdlib/Statistics module which would ship with Julia and then move the appropriate parts of StatsBase later but people preferred doing it this way.

@sbromberger
Contributor

@StefanKarpinski until that's done, can we please keep std (and any others) in Base?

@StefanKarpinski
Sponsor Member

No, because we can't move things out of Base later but we can move things into stdlib later.

@sbromberger
Contributor

sbromberger commented Jun 1, 2018

So until then, the burden falls on the library developers to change where they source their functions from? That seems ... wrong, somehow.

This wasn't an issue for Random, or any of the others (disregard Iterators for the moment). Why is it an issue now?

@ararslan
Member Author

ararslan commented Jun 1, 2018

Why do these need to be removed from Base at all?

@StefanKarpinski
Sponsor Member

We already had this discussion and came to a pretty clear consensus at the time, I'm not inclined to rehash the whole thing.

@ararslan
Member Author

ararslan commented Jun 1, 2018

Seems like there's fairly broad support now for undoing it.

ararslan added a commit that referenced this issue Jun 1, 2018
Revert "move cor, cov, std, stdm, var, varm and linreg to StatsBase (#27152)"
This reverts commit 746d08f.
Fixes #27374
@ararslan
Member Author

ararslan commented Jun 1, 2018

PR open to revert the change. #27375

@catawbasam
Contributor

+1 for "moving these functions into a new stdlib/Statistics module which would ship with Julia" now.

@JeffBezanson JeffBezanson added the status:triage This should be discussed on a triage call label Jun 13, 2018
@ararslan
Member Author

Here's what I would suggest:

  • Return standard deviation to Base as stddev
  • Return variance to Base as variance
  • Leave the rest where they currently are in StatsBase

@StefanKarpinski
Sponsor Member

Yes, I'm 100% on board with that. Can you make a PR?

@ararslan
Member Author

Yep, can do.

@KristofferC KristofferC added this to the 0.7 milestone Jun 14, 2018
@KristofferC KristofferC removed the status:triage This should be discussed on a triage call label Jun 14, 2018
@JeffBezanson
Sponsor Member

So before, the story was "trust me, std is a super special case that absolutely must be in Base". Now I see we're just casually extending that to var as well. W H A T E V E R

@ararslan
Member Author

...they're basically the same thing though, so it would be weird to split them

@andreasnoack
Member

I agree with @ararslan here. It's either both or none.

@Nosferican
Contributor

From a developer/maintainer perspective, it is just a mess to have code all over the place due to layman definitions (which are really more field-based than anything). The original base/statistics had the relevant code all together, and doing these splits will make it hard to look up code (a huge component in transparency for reproducibility, security, etc.). I would be very happy with taking these out of Base and just having a robust stdlib/Statistics. Personally, I can't wait for most stdlibs to move out, but other compromises could include loading certain stdlibs by default or offering a very easy setting to accomplish this (e.g., R provides several ways to do this), or having easily customizable distributions (choose the packages or ecosystems you want and get a custom download/image). Not sure if it has to be managed by JuliaLang; it could be third-party as long as there is an easy way to accomplish this.

@ViralBShah
Member

Are we still going to try to do something here, with the beta tagged? This issue is still on the milestone, so perhaps it is OK to try to make this change. I am personally OK with stddev and variance.

@ararslan
Member Author

Since they're already removed from Base, it would be non-breaking to put them back, regardless of the name, which means it can happen any time. That said, if we're going to do this, I think we should try to do it for 0.7. I haven't moved forward with this change as there doesn't appear to be broad agreement over the names stddev and variance.

@Nosferican
Contributor

What's the issue in having a dependency? Especially one that is just math constants (extremely light)? If you were to have it in Base it would be loaded with every session for everyone regardless of whether they need it or use it. If you don't want to have it, you could just copy it in your project, but why not reuse good, maintained code (in this simple case it would really not matter)? How often do people just want plain vanilla standard deviation? Usually you would want to do something else in addition such as LinearAlgebra or DataFrames at the very least or maybe plotting which would require other packages.

@sbromberger
Contributor

γ is not in Base. It is in a package just like std. You would be doing,

This is a bit disingenuous. It's true that it's not in Base, but it is in stdlib, which does not require any additional packages to be installed in order to use it. This is in contrast to std, which as of now requires the explicit installation of StatsBase from an external repository.

@sbromberger
Contributor

sbromberger commented Jun 26, 2018

How often do people just want plain vanilla standard deviation?

Are you suggesting that this doesn't happen? Because this is precisely why the original issue was opened.

Usually you would want to do something else in addition such as LinearAlgebra or DataFrames at the very least or maybe plotting which would require other packages.

Not in LightGraphs' case.

@ViralBShah
Member

@sbromberger I think it would be good to mention or at least refer to your personal reasons for not wanting external repositories, in order to provide the full background to folks reading this.

Also, for other discussion in this thread, it doesn't help bringing up examples of other stuff and comparing if feature x is more widely used than std. If something is better maintained elsewhere, or is not reasonably commonly used functionality, please make an independent case for it and we can figure out the right place for it.

@ararslan I do agree we should do it for 0.7 and stddev and variance are consistent with our general naming conventions. It's not a good idea to take up 3 letter names.

@sbromberger
Contributor

sbromberger commented Jun 27, 2018

I think it would be good to mention or at least refer to your personal reasons for not wanting external repositories, in order to provide the full background to folks reading this.

Sorry, @ViralBShah. I'm all talked out about the larger issue – most of the core team is probably sick of hearing from me by now – and we've worked around this one by removing our use of std. I'd rather not rehash the specific situation I'm in, but I hope you'll indulge me anyway.

Two things I'd like to put out there for consideration:

  1. The argument that "it's common enough and small enough that everyone's going to have it installed anyway" is dangerous for two reasons: first, it makes the edge cases that don't / can't follow this convention that much more difficult to satisfy AND more prevalent, and second, it increases the overall fragility of the ecosystem: when everyone's depending on a common set of third-party utilities, then you've got what is essentially a "core install" made up of things that can change at the whim of devs who might not share the same commitment to stability or multi-use (see below) as the language core team. (This is not to disparage other devs; it's just the way it is.) My opinion is that if code is "common enough" that everyone should have it installed anyway, then it should at a minimum be in stdlib.

  2. It would be great if we could give some thought to separating data structures from functions that commonly operate on those data structures. As an example, there are lots of things one can do with sparse matrices other than perform linear algebra on them. Moving the data structures into a package that is domain specific (like linear algebra) ignores these use cases at the expense of added complexity for those applications that don't treat the structures the way that others do. Continuing the example, in a language like Julia, it's easy and natural for people to place an emphasis on linear algebra. There are those of us who see the language as more than just a fast way to perform LA operations, and the talk of moving sparse matrices to a package primarily focused on linear algebra really makes it feel like we're second-class citizens in this language. At least when sparse matrices are in Base or stdlib, I can feel some assurance that someone will understand the non-traditional use cases, as it's more likely that there's someone on the team whose primary interest is not linear algebra.

    Finally (I promise), it seems as if the bulk of the development work is not on the data structures themselves; rather, it's on optimizing functions that operate on them. That is, SparseMatrixCSC has not significantly changed in at least 3 years. The argument that we need to move things out to third party packages to improve our ability to make quick changes falls flat here when we're talking about the data structure.
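
To make the point about non-linear-algebra uses of sparse matrices concrete, here is a minimal sketch that treats a sparse matrix purely as a graph adjacency structure. It assumes Julia 1.x, where `SparseArrays` is a standard library; the helper name `out_neighbors` and the edge list are hypothetical and chosen only for illustration (this is not how LightGraphs stores graphs).

```julia
using SparseArrays  # standard library in Julia 1.x, no Pkg.add needed

# Directed edges of a small example graph: 1→2, 1→3, 2→3, 3→1
src = [1, 1, 2, 3]
dst = [2, 3, 3, 1]

# Store edge src→dst at A[dst, src], so column j lists the out-neighbors of j.
A = sparse(dst, src, ones(Int, length(src)), 3, 3)

# Hypothetical helper: read the out-neighbors of vertex j directly from the
# compressed-column storage, with no linear algebra involved.
out_neighbors(A, j) = rowvals(A)[nzrange(A, j)]

println(out_neighbors(A, 1))
println(out_neighbors(A, 3))
```

Nothing here multiplies or factorizes anything; the matrix is used only as a compact container, which is the kind of use case the comment above argues a linear-algebra-focused package might overlook.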

Also, for other discussion in this thread, it doesn't help bringing up examples of other stuff and comparing if feature x is more widely used than std.

I didn't intend to do this, and I apologize if I did.

Thanks for the opportunity to weigh in, and I apologize both for not directly answering your question, and for the length of my (non-)response.

@JeffBezanson
Sponsor Member

I'd like to second Viral's point that it doesn't help to point to other things in Base like eulergamma. It's quite likely a bunch of other stuff should be removed too! When the stdlib directory was first created, initially around 2 functions moved out of Base. One might have argued "What? Are you saying these are the two least important functions in Base that must be removed, while everything else gets to stay?" No, of course not --- pretty soon stdlib had ~30 packages.

I'll also point out that my original proposal was to move the functions to a stdlib package. In light of @sbromberger 's situation we should reconsider that --- we didn't think stdlib vs. external StatsBase was such a big difference, but apparently in some environments it is.

I like the point about separating data structures and functions; julia's design makes that especially easy and natural (though apparently some people call it type piracy :) ).

@Datseris

I'd like to second Viral's point that it doesn't help to point to other things in Base like eulergamma.

You are right, I see the flaw in my example. Of course I didn't state it as the "absolute argument against the change", but only to get a point across.

@Sacha0
Member

Sacha0 commented Jun 27, 2018

That is, SparseMatrixCSC has not significantly changed in at least 3 years. The argument that we need to move things out to third party packages to improve our ability to make quick changes falls flat here when we're talking about the data structure.

To note, several parties have long had in mind an overhaul of the sparse data structures. That those data structures have not changed in the last three years reflects lack of developer bandwidth rather than lack of utility in being able to rapidly iterate when developer bandwidth exists :). Best!

@ViralBShah
Member

ViralBShah commented Jun 28, 2018

There is quite a bit of sparse matrix experimentation outside of Base that people have done, and having something in Base has also deterred others from trying alternate ideas (because it would be so difficult to get anyone to consider using them).

But I don't want to make this about sparse matrices. :-)

@JeffBezanson
Sponsor Member

Resolved by #27834

@alanedelman
Contributor

Not a tutorial goes by where mean, median, and std aren't missed. The argument is not only that they're known by a layperson (a fifth grader?); they also show up for teachers analyzing their class grade data (which is way more common than general statistical analysis).

Can we please, please, please get these three functions in base?

@alanedelman
Contributor

What would it take to get these three back into base?

@andreasnoack
Member

The conclusion here was to put these in the standard library Statistics. I think it's unlikely that they'll return to Base.

@alanedelman
Contributor

What would it take to move these very common functions back to Base?

@Nosferican
Contributor

I think the answer is more about willpower and making a case compelling enough to convince the core team to revert the decision. That would probably be a bigger hurdle than the technical aspect. I side with the latest determination of moving those to the stdlib. To revisit the decision, I would say it would require a documented, stable pattern of issues (e.g., as with the soft-scope decision).

@quinnj
Member

quinnj commented Jul 21, 2021

I don't think I've seen a compelling argument to move these functions from Statistics stdlib back to Base. For users, it's literally typing using Statistics and then using the functions. This is standard in every language I'm aware of out there (adding imports for non-core functionality you're using).

On the other hand, I think it'd be a better use of developer time/effort to work on allowing stdlibs to upgrade separately from the rest of Base release process. Then we can work on consolidating functionality from StatsBase.jl and other "core" statistics packages to the Statistics stdlib and provide even more useful stats functionality in a very easy/stable/"blessed" way to users.
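
For illustration, a minimal sketch of the workflow described above, assuming a Julia 1.x install where `mean`, `median`, `std`, and `var` all live in the `Statistics` standard library. The variable names and the data are made up for the example.

```julia
using Statistics  # ships with Julia 1.x; no package installation required

scores = [70, 80, 90]  # hypothetical class grades

m  = mean(scores)    # arithmetic mean
md = median(scores)  # middle value
s  = std(scores)     # sample standard deviation (divides by n - 1)
v  = var(scores)     # sample variance (divides by n - 1)

println((m, md, s, v))
```

The only extra step relative to having these functions in Base is the single `using Statistics` line, which is the trade-off the thread is debating.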

@catawbasam
Contributor

catawbasam commented Jul 21, 2021 via email

@alanedelman
Contributor

My argument is based primarily on around 35 years of teaching students. While the placement of functions could be based on usage patterns, technical categorization, or frequency, psychology plays the biggest role. Around the world, mean and median are learned at a very, very young age. Later, students take tests, scores take on an excessive importance to them, and they will hear about std. Are these technically statistics? Sure. But by that logic, let's take +, -, etc. out of Base and require
using BasisArithmetic

My argument is that students learn these three sometimes before they even learn the word "statistics"
(well mean and median anyway) and I've seen student after student just get angry over these three functions.

(I'm also not a big fan of argument from authority -- but perhaps argument from extensive consumer experience
is more compelling, in this case the consumers are the students I teach.)

@oscardssmith
Member

We could always re-triage this...

@alanedelman
Contributor

keep me posted, just got another round of complaints from students especially for mean and std

@oscardssmith oscardssmith added the status:triage This should be discussed on a triage call label Apr 1, 2022
@LilithHafner
Member

I'm neutral on this, but I think the current status is "not planned" rather than "completed"

@LilithHafner LilithHafner closed this as not planned Mar 2, 2023
@LilithHafner LilithHafner removed the status:triage This should be discussed on a triage call label Oct 27, 2023