
Version 1.0 of scikit-learn #14386

Closed
MartinThoma opened this issue Jul 17, 2019 · 43 comments

@MartinThoma
Contributor

I just realized (by looking at 0ver.org) that scikit-learn is also still at version 0.x. I could not find any discussion about version 1.0 in the issues.

I would like to understand the reasoning / see if there is any other channel where this topic is discussed.

Why it matters

Semantic Versioning is widespread. Even people who are new to Python typically know (parts of) semantic versioning. Having software at a 0.x version makes it feel brittle and prone to breaking changes.

scikit-learn does not use any of the Development Status :: trove classifiers (setup.py, list of trove classifiers). Although I guess anybody working with Python has heard of scikit-learn, it might be hard for a newcomer to quickly judge the maturity of the project.
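
For reference, here is a hypothetical setup.py fragment (not scikit-learn's actual metadata; the package name is made up) showing what such a Development Status trove classifier looks like:

```python
from setuptools import setup

setup(
    name="example-package",  # hypothetical package, purely for illustration
    version="1.0.0",
    classifiers=[
        # A Development Status classifier signals project maturity on PyPI.
        "Development Status :: 5 - Production/Stable",
        "Programming Language :: Python :: 3",
    ],
)
```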

An alternative is calendar based versioning.

Why scikit-learn should be 1.0

  • Widespread: 35,895 stars on GitHub
  • Maturity:
    • First release in 2010
    • Releases so far: 29
    • A lot of software relies on it (according to GitHub: 61,504 repositories!)
    • 17,910 articles have cited the version 0.8 publication

The Process to get to 1.0

scipy handled this really nicely. I guess some of the developers there also keep an eye on scikit-learn, so I hope to get more details.

From my perspective, it looked as if the scipy community made the following steps to get to 1.0:

  • Code changes (see 1.0.0 Milestone of scipy):
    • Are there key features missing?
    • Are there important interface changes that should be done?
    • Are there any other issues that need to be solved before 1.0?
  • Add a community governance document (scipy)
  • Write a version 1.0 paper (scipy) - this might be a nice reward for a couple of contributors, if they are in academia. Lasagne (a deep learning library) did a simpler version of it (the Lasagne software publication), which is still nice so people can cite what they used. scikit-learn did that a while ago as well. There is also a nice TensorFlow whitepaper.
@amueller
Member

amueller commented Jul 17, 2019

There's a milestone:
https://github.com/scikit-learn/scikit-learn/issues?q=is%3

Personally, I think #7242 and #10603 need to be fixed.
Right now it's not possible to train a pipeline with preprocessing and logistic regression on the Titanic dataset and figure out what the coefficients mean. This is work in progress, and we have already made strides. Once we have support for feature names, I think we'll be at a reasonable point.
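
As a rough sketch of the workflow described above (assuming a recent scikit-learn, 1.2 or later, where get_feature_names_out() is available; the column selection is just an example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
num_cols, cat_cols = ["age", "fare"], ["sex", "embarked", "pclass"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([("prep", preprocess), ("logreg", LogisticRegression(max_iter=1000))])
model.fit(X[num_cols + cat_cols], y)

# Feature-name support lets each coefficient be tied back to an input column.
names = model[:-1].get_feature_names_out()
coefs = pd.Series(model[-1].coef_.ravel(), index=names).sort_values()
print(coefs)
```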

I know some other people, including @adrinjalali and @GaelVaroquaux feel strongly about #4497 and #4143. As you can see from the numbers, these issues are quite old. There is no consensus yet on how to address these.
These also relate to being able to undersample and oversample for imbalanced data, which scikit-learn doesn't support.

We have delayed 1.0 to allow a breaking change to fix these issues. Whether this is (still) a good strategy is debatable.

We very recently introduced a governance document, a roadmap, and an enhancement proposal formalism.

These have actually allowed us to discuss some of the longstanding issues in a more productive way. We could decide to postpone some of the issues, make a polished 1.0 and then address them in 2.0.
Or we could keep working on them and release 1.0 once we have addressed or punted on them.
It is helpful, I think, to think about a timeline for 1.0 and what we want from it.

There are actually two separate things we might desire from a 1.0: stable interfaces and reliable implementations. So far most of our discussion has been around having the right interfaces, but there are also issues with our implementations: in LatentDirichletAllocation, in much of the cross_decomposition module, and in some of the Bayesian linear models, and there are pretty annoying issues with respect to convergence and solver choices in LogisticRegression and LinearSVC.

I would at least like to resolve the issues in LogisticRegression and LinearSVC before we do a 1.0.
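
For illustration, a generic example of the kind of convergence problem mentioned above (not taken from any specific linked issue): with the default solver and iteration budget, LogisticRegression on unscaled data may fail to converge and only emit a ConvergenceWarning.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)  # features live on very different scales

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    LogisticRegression().fit(X, y)  # default lbfgs solver, max_iter=100

print([str(w.message) for w in caught])  # typically reports a ConvergenceWarning
```

Scaling the inputs first (e.g. with StandardScaler) or increasing max_iter avoids the warning.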

I'm not sure if writing a 1.0 paper is helpful, but it's something to consider.

@MartinThoma
Contributor Author

Cool, I missed the 1.0 milestone - let's see if I can contribute :-)

It's awesome to see that this is already in progress. scikit-learn is a project that helped me a lot during my studies / career; I will try to find some time to give something back.

I'm not sure if writing a 1.0 paper is helpful, but it's something to consider.

Personally, I would consider this the "cherry on top": very nice to have and very rewarding, but probably less useful than many (all?) of the other things on the issue list. It is also something that can be done at any point in time.

I'm not sure if this "issue" should be closed then. Maybe it is a good way to channel comments / suggestions?

@amueller
Member

One of the issues with adding additional papers is that it becomes less clear for users what to cite, and it splits our citation count.
On the other hand, it allows new contributors to share in the citations (@jnothman and I are not in the published journal version of the previous paper).
These are somewhat tangential issues though.

Having an issue to discuss 1.0 is not a bad idea, so I think it's fine to leave this open as a central place for discussion.

@amueller
Member

amueller commented Dec 4, 2019

Since this came up again today: I'm a bit torn between wanting to have something I'm really happy with and getting a 1.0 out the door.

I don't think the wish-list items will be done for the next release (currently called 0.22), and there's maybe a slight chance they will be done for the one after that.

If we want 1.0 to be stable in some sense, then we would really need to prioritize those issues, which we haven't done so far (from what I can tell).

@jnothman
Member

jnothman commented Dec 4, 2019

I think I have come to agree that we should just do 1.0 and if we want to make any big changes that should be 2.0.

We've certainly got enough content and enough quality assurance tools to suggest that we can be 1.0. If we're aiming for 1.0 we should work out what we want to include, focusing, I think, more on consistency than features. 1.0 for instance might be a good opportunity to improve some parameter name/definition consistency, scale (and sample weight) invariance in parameter definitions, etc.

FWIW, some of the changes around sample props may be best done with backwards incompatibility. The change to NamedArray may also introduce backwards incompatibility that would deserve a major release. But, indeed, there would be no great harm if that major release was 1.x to 2.x rather than 0.x to 1.x.

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 5, 2019 via email

@ahowe42

ahowe42 commented Dec 5, 2019

Looking over the issues mentioned by @amueller in July, I wouldn't be concerned about 7242: ensuring that the columns used for training / testing / inference are consistent is pretty basic. Regarding 10603, that is a valid point, and I think it should be in place for a 1.0 release. Issue 4497 seems more like something that should not hold up a 1.0 release, while I do think 4143 is important enough that I'd like to see it in 1.0.

With the prevalence of pandas, I do have to say that named features are probably important enough that we should make sure they're in a 1.0 release.

@NicolasHug
Member

Another feature I'd personally like to see before 1.0 is native support for categorical data (in tree models, or at least some of them), which is sort of a prerequisite for @amueller's #10603. I'd also like us to make an informed decision on randomness handling (scikit-learn/enhancement_proposals#24).

I agree with most of what has been said and I'm very happy to start considering 1.0 right now.

Let's bring up the 1.0 topic during the next meeting so we can start figuring out what could / should be in there

@agramfort
Member

agramfort commented Dec 5, 2019 via email

@qinhanmin2014
Member

+1 to release 1.0 ASAP, two questions:
(1) Is it acceptable to have experimental features in 1.0? (I guess we have to do so)
(2) We mention things like "XXX is deprecated in 0.22 and will be removed in 0.24", so are we promising that there will be a 0.24?
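
As context for (2), here is a hypothetical sketch (the function and parameter names are made up, not real scikit-learn code) of the rolling deprecation pattern behind such messages: the old argument keeps working for two minor releases while emitting a warning, and is then removed.

```python
import warnings


def make_estimator(new_param=None, old_param="deprecated"):
    # "deprecated" acts as a sentinel so we can tell whether the caller
    # actually passed the old argument.
    if old_param != "deprecated":
        warnings.warn(
            "`old_param` was deprecated in 0.22 and will be removed in 0.24; "
            "use `new_param` instead.",
            FutureWarning,
        )
        new_param = old_param
    return new_param


make_estimator(old_param=3)  # still works in 0.22 and 0.23, but warns until removal
```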

@NicolasHug
Member

(1) ideally these would be stable by then IMO
(2) There will probably be 2 major releases between the time we decide on 1.0 and the time we release it so that might not be a problem

@jnothman
Member

jnothman commented Dec 6, 2019 via email

@adrinjalali added this to To do in Meeting Issues Jan 6, 2020
@VarIr
Contributor

VarIr commented Jan 19, 2020

I would like to second the proposal for a version 1.0 paper, as publications are still an essential cornerstone of the academic world.

As a PhD student considering an academic career and a non-core developer of scikit-learn, my contributions currently work like this:

  1. Stumble upon some issue that must be solved for my own projects building upon scikit-learn
  2. Fix the code for my project during working hours
  3. Create PR outside working hours, because there are always so many other tasks, and those for which I can get academic credit have precedence. In the end, I want to contribute, so I do this in my free time.

If there was a clear commitment to a publication, I would have leverage in discussions with my supervisor/faculty about allocating more time towards contributing to scikit-learn. I imagine other contributors are in similar situations.

One of the issue with adding additional papers is that it gets less clear for users what to cite and it splits our citation count.

I think these issues can be addressed. In my field (computational biology), papers about public resources are often updated every few years, i.e. there might be "The XY database in 2017", "in 2019", etc. One typically cites the latest iteration/highest version, which could be easily provided at https://scikit-learn.org/stable/about.html#citing-scikit-learn.
Aggregating two (later on, a handful of) numbers into a global scikit-learn citation count should be doable as well.
In addition, a number of academic metrics only take into account publications from the last five years, which already excludes the JMLR paper.

@rth
Member

rth commented Aug 21, 2020

I would propose to release 0.25 as 1.0 and be done with it. One year after this discussion was started, we have moved forward on some of those major points, but I can't say we are close to resolving all of them either. We can very well do a 2.0 for them if needed; otherwise the risk is that we never release 1.0 (or at least not in the next few years), which is not great.

Semantic versioning specifies that version 1.0 should happen once software is used in production and has a stable API. We have had that for a very long time, and waiting for 1.0 until everything is solved can slightly hurt scikit-learn's image with users who implicitly expect it to follow semantic versioning.

@alfaro96
Member

I think that we should release 0.25 to be consistent with the deprecations of 0.23 (e.g. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_export.py#L138-L143). Nevertheless, I would be happy to see 0.26 released as 1.0.

WDYT?

@adrinjalali
Member

@alfaro96 there are always rolling deprecations, and if we go with that, we can never release version 1.0 :D

@rth I'd be happy to go with version 1.0, especially if that allows us to be backward incompatible w.r.t. sample props (cc @jnothman). The solution we have right now is mostly backward compatible, except for some edge cases; it'd be nice if we could just not be backward compatible.

@NicolasHug
Member

Maybe we should start to make a list of backward-incompatible stuff that should be reasonably easy to get in within 0.25 (maybe stretch to 0.26). On top of sample props, #15880 would be a good candidate IMHO.

@rth let's discuss this during the next meeting? https://hackmd.io/AuqfmgwvTf-bFz60yjVG1g

@NicolasHug
Member

Another one is fixing the loss name inconsistencies (#18248).

@thomasjpfan
Member

Coming up with a plotting API for scoring and metrics interfaces: #15880

@lorentzenchr
Member

I would propose to release 0.25 as 1.0 and be done with it. [...] otherwise the risk is that we will never release 1.0 [...]

I'm fine¹ with releasing one of the next 0.2X, X≥5, as 1.0.
How do we proceed from there, as before or with SemVer? What will be the next major release, 2.0 or 1.1?

¹ Emotionally, just "fine" is an understatement 🎉

@NicolasHug
Member

Aren't we already using SemVer? I think the next major release would be 2.0 (with breaking changes), and we would be releasing 1.1, 1.2, etc. every 6 months.

@jnothman
Member

jnothman commented Aug 25, 2020 via email

@rth
Member

rth commented Aug 25, 2020

since minor releases are not backwards compatible with all minor releases from the same major version series.

There is an exception for 0.X versions though, where "anything can change at any time". So technically we could release 1.0 and claim to follow SemVer, though then we would have to increment the major version for any breaking change. On one hand it's a shame that the scientific Python ecosystem doesn't follow SemVer.

Personal story from today: I had a package at v3.X installed, and a dependency required version v2.0.0 (v2.0 to v2.8 exist); what's the latest version of the package that would work with the dependency? Unknown. v2.8 doesn't work, so I had to bisect all the way down to v2.3. Without SemVer it's hard to know what works with what.
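
As a small illustration of what SemVer buys in that situation (using the packaging library; the version numbers are just the ones from the story above): under SemVer a whole major series can be declared compatible, whereas without such guarantees only an exact, bisected pin is really safe.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Under SemVer, "any 2.x" is a meaningful constraint: 2.8 promises
# backward compatibility with code written against 2.0.
semver_range = SpecifierSet(">=2.0,<3")
print(Version("2.8") in semver_range)  # True

# Without SemVer guarantees, only the exact version found by bisecting works.
exact_pin = SpecifierSet("==2.3")
print(Version("2.8") in exact_pin)  # False
```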

Though, on the other side I have a hard time seeing how we don't end up at version e.g. 16.0.0 with SemVer after a few years, since each release has some breaking changes, even if they are preceded by a deprecation window. Maybe it's more suitable for smaller libraries, not sure. To be clear, I'm not proposing to follow SemVer, just wondering about it.

@amueller
Member

My preference would be to release what would have been 0.25 as a standard release, just called 1.0 (keeping standard deprecations), while reserving the right to make breaking changes in future major versions. So 1.0 wouldn't be very special, but we would establish that 2.0 could have breaking changes.

Breaking changes that might be interesting in the future

  • make pipeline clone
  • allow .fit.transform != fit_transform (not technically an incompatible change)

@MartinThoma
Contributor Author

Though, on the other side I have a hard time seeing how we don't end up at version e.g. 16.0.0 with SemVer after a few years

Is this a problem? I know a couple of projects with very high version numbers ... starting with those that use calendar versioning (CalVer) 😄

@GaelVaroquaux
Member

GaelVaroquaux commented Aug 31, 2020 via email

@agramfort
Member

agramfort commented Aug 31, 2020 via email

@NicolasHug
Member

I understand most of us seem to favor a backward-compatible 1.0, but these 2 points are still unclear to me:

  1. If we're going to release 1.0, why not take the opportunity to introduce breaking changes? It seems to me that this is what major versions are for?
  2. Why allow 2.0 to have breaking changes, but not 1.0? What's so special about 2.0 that 1.0 doesn't have?

Sorry, I guess some of you made these points during today's meeting, but I couldn't follow everybody's POV.

@lorentzenchr
Member

IIUC, we have technically been breaking strict backward compatibility in every minor release so far (after a deprecation cycle). With 1.0.0 we want to signal the maturity of scikit-learn; it will still have some (deprecation-cycled) breaking changes.
After pandas, Arrow, dplyr and XGBoost released their 1.0.0 in 2020, I'm in favour of releasing either 0.24 (still in 2020?) or, more likely, 0.25 as 1.0.0.

@MartinThoma
Contributor Author

The main point of version 1.0 for me is to show that scikit-learn is production-ready. I would like to send this signal soon, and keeping the number of changes small makes it more likely that this happens soon :-)

@cmarmo
Member

cmarmo commented Sep 1, 2020

I understand most of us seem to favor a backward-compatible 1.0, but these 2 points are still unclear to me:

1. If we're going to release 1.0, why not take the opportunity to introduce breaking changes? It seems to me that this is what major versions are for?

2. Why allow 2.0 to have breaking changes, but not 1.0? What's so special about 2.0 that 1.0 doesn't have?

Sorry, I guess some of you made these points during today's meeting, but I couldn't follow everybody's POV.

@NicolasHug, all, I think it's time to bring this issue to a pragmatic discussion: there is a milestone for version 1.0; how many of the issues listed there (and those you might want to add to the list) will break backward compatibility? Are they supposed to be solved for 1.0 (~0.25 in an ideal world)? If the work is doable for 0.25, let's break things; if not, let's keep the 'mess' for 2.0... scikit-learn needs a 1.0... :)
I think this was the point raised during the meeting (... I know, I still have to sum up the notes, doing that now... sorry): I would be happy to create the label 'break compatibility', but I don't have rights for that... ;)

@ogrisel
Member

ogrisel commented Sep 1, 2020

I have created a new Breaking Change label to identify problematic issues that could not easily be managed by the usual rolling deprecation cycle.

Please use it liberally to tag issues that would require breaking backward compatibility harder than usual; it will let us better decide whether switching to a new versioning scheme (without the leading "0.") would also require us to evolve our current deprecation policy.

@NicolasHug
Member

@cmarmo I understand that the timing constraints would not allow us to cram everything from the milestone into 1.0. Indeed we need to be pragmatic, and my initial suggestion in #14386 (comment) was the following, but perhaps I did a poor job of explaining it:

Maybe we should start to make a list of backward-incompatible stuff that should be reasonably easy to get in within 0.25

Typically, the random state issue that was mentioned during the meeting does not qualify as reasonably easy to get in. But most of the other things mentioned here in this thread do (pipeline clones, sample props (trusting @adrinjalali on this), the loss name unification...)

I'm still interested in @amueller's, @GaelVaroquaux's and @ogrisel's answers to my questions above (#14386 (comment)). To clarify: I'm not trying to push for breaking things at all costs, and I'm fine with not breaking backward compatibility. But I'd like to understand the reasoning behind this decision; I don't understand it so far, and that makes me think there's something obvious I'm missing.

@ogrisel
Member

ogrisel commented Sep 18, 2020

I understand most of us seem to favor a backward-compatible 1.0, but these 2 points are still unclear to me:

If we're going to release 1.0, why not take the opportunity to introduce breaking changes? It seems to me that this is what major versions are for?

I would rather avoid any breaking change if we can. Always following our deprecation cycle with warnings is nicer to our users.

Why allow 2.0 to have breaking changes, but not 1.0? What's so special about 2.0 that 1.0 doesn't have?

I would rather not have any breaking change in 2.0 either.

To me, the point of dropping the leading 0 in our version number is mostly psychological / a matter of communication; it would not necessarily lead us to implement SemVer if the rolling deprecation policy can be preserved (just shifting what we consider major / bugfix releases by one digit to the left).

@NicolasHug
Member

Thanks @ogrisel, at least I understand your POV because it's consistent. But I still need to understand @amueller and @GaelVaroquaux 's reasoning then ;)

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 18, 2020 via email

@NicolasHug
Member

@GaelVaroquaux so would you be OK with introducing a few minor breaking changes in 1.0, provided that the next breaking changes would happen many years from now (if ever)?

I have the impression that during the last meeting, people understood my point of view as "let's make breaking changes often" but that's absolutely not what I'm advocating for.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 18, 2020 via email

@cmarmo
Member

cmarmo commented Sep 18, 2020

Also, I'd like to be convinced that there is no way to avoid a smooth deprecation

As long as no list of breaking features is available, no evaluation is possible: @ogrisel kindly provided a suitable label, but it hasn't been used so far.

@NicolasHug
Member

There are a bunch of comments with such issues in this thread already. #14386 (comment) #14386 (comment) #14386 (comment) #14386 (comment)
The label will be useful once we're sure we want to go this way, but so far most of them are only candidates (and no decision has been made yet).

@rth
Member

rth commented Sep 21, 2020

BTW, there is some relevant discussion about versioning models for numpy in numpy/numpy#10156, and there is also NEP 23, which formalizes some of it.

I would rather avoid any breaking change if we can.

Technically we have breaking changes in each release, at least for some infrequent use cases (e.g. removal of a parameter) after a deprecation cycle. I don't think that would change after 1.0 unless we want to significantly change our deprecation policy? Though I agree that we should try to minimize the number of breaking changes as much as possible, including for v1.0.

@GaelVaroquaux
Member

GaelVaroquaux commented Sep 21, 2020 via email

@lorentzenchr
Member

Version 1.0 was released recently. Along the way, it was good to have this issue for discussion and for raising concerns and questions. A big thank you to everyone who participated!

Meeting Issues automation moved this from To do to Done Oct 14, 2021