
GroupedSplitter #2809

Open · wants to merge 27 commits into master
Conversation

@YSaxon commented Sep 16, 2020

See here and here in the DevChat in Forums.

I'm very much unsure whether I've added the feature in all the right places. I'm pretty much expecting (and hoping) to be told what more I need to do to make this a valid pull request.

@YSaxon YSaxon requested a review from jph00 as a code owner September 16, 2020 22:13
@review-notebook-app: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

@YSaxon (Author) commented Sep 18, 2020

See here for a notebook with the basic steps broken down.

@YSaxon (Author) commented Sep 18, 2020

Note that it is guaranteed always to return a validation set. However, in certain edge cases, such as when there is only a single group, or when the validation fraction is set abnormally high (e.g. 0.80 with two groups of equal size), it may mark every item as part of the validation set. If you'd like, I could make it raise an error in that case.

It is not guaranteed to always return the very best possible validation set (i.e. the one with the count closest to the requested fraction). This seems to be an NP-hard problem (see the discussions on Wikipedia and Stack Overflow). Not to mention, what you'd really want is a random pick from among the various possible sets that are reasonably close to the best set you could make, which would probably involve several NP-hard problems plus a judgment call.

In practice, what we could do is raise a warning if the size of the returned set deviates significantly (maybe by more than ±0.05?) from the size requested. Let me know if that's something you'd like me to implement.
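The greedy idea the comments above describe could be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual code; the function and parameter names are hypothetical. Note how the single-group edge case mentioned above naturally sends everything to the validation set.

```python
import random

def group_preserving_split(items, item2group, valid_pct=0.2, seed=None):
    # Greedy sketch of a group-preserving split: shuffle the groups, then
    # move whole groups into the validation set until it holds at least
    # valid_pct of the items. Every group lands wholly on one side, so
    # a single group (or an abnormally high valid_pct) can consume everything.
    rng = random.Random(seed)
    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(item2group(item), []).append(i)
    keys = list(groups)
    rng.shuffle(keys)
    target = valid_pct * len(items)
    valid = []
    for k in keys:
        if len(valid) >= target:
            break
        valid.extend(groups[k])
    valid_set = set(valid)
    train = [i for i in range(len(items)) if i not in valid_set]
    return train, valid
```

Because whole groups are moved at a time, the achieved fraction only approximates `valid_pct`, which is exactly the deviation the proposed warning would flag.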

@jph00 (Member) commented Sep 28, 2020

Many thanks. Sorry it took me a while to review.

It's a little hard for me to review it right now, and will also be hard for people to use it, because it has no docs, and the tests are quite long and complex. Could you please write some explanations and simple examples, along with motivation for the feature, in the notebook? The extra details you included above in the PR could also be added to the notebook. In general, try to provide all the information that someone might want in order to understand what this does, whether they might want to use it, and how to use it.

Also, please resolve the conflict that's been introduced since you submitted this.

@YSaxon (Author) commented Sep 30, 2020

My first post on the dev forum, which this pull request is based on, giving my justification and an example:

Idea, which I already implemented for myself inside a notebook, but I wonder if it might have broader appeal:

A split_with_grouping(group_from_filepath_re, pct, seed) function that would allow you to split a dataset randomly by percentage, like split_by_rand_pct, but without splitting up groups as defined by a regex.

For example, I’m using the 50States10k dataset of US Google Street View images (https://arxiv.org/pdf/1810.03077.pdf; a smaller version is available here). It has a folder for each state, and within it a file for each cardinal direction at each of many randomly selected points, labeled by some kind of hash, for instance:

50States10k/Alabama/2007_-NPWPMrYipeYcLsiZqKRyw_0.jpg
50States10k/Alabama/2007_-NPWPMrYipeYcLsiZqKRyw_180.jpg

50States10k/Alabama/2009_3BS7oprV5tjwg-M4dA1nLA_270.jpg

50States10k/South Dakota/2011_iloPUAZx7Vw59X-qJB2OQw_90.jpg

Now if you simply use split_by_rand_pct, you will wind up with an unfair validation set: for most validation images, you will have trained on images from other cardinal directions of the same point. You want instead to validate with street views from locations the model has never seen at all.

You could make a CSV file and split the images manually, but that sounds like a major pain.

So instead, why not have a function that takes a regex which can identify that, for instance, the first two examples above (and two others) are all part of the same group and need to be collectively assigned to either the training or the validation set?

(In this case, what I used was `re.match(r'\d{4}([\w-]+)\d+', Path(x).stem).group(1)`, which for the example above spits out `-NPWPMrYipeYcLsiZqKRyw`.)
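As a side note, since `\w` also matches underscores, the regex as quoted actually captures the delimiting underscores as part of the group (`_-NPWPMrYipeYcLsiZqKRyw_`). A hypothetical variant that anchors the underscores explicitly yields exactly the token shown:

```python
import re
from pathlib import Path

def group_key(path):
    # Hypothetical helper: pull the point hash out of filenames shaped like
    # 2007_-NPWPMrYipeYcLsiZqKRyw_0.jpg (year, hash, cardinal direction).
    return re.match(r'\d{4}_([\w-]+)_\d+', Path(path).stem).group(1)
```

All four cardinal-direction images of a point then share one key, so the splitter can keep them together.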


My second post:

Hi, just wanted to follow up about this idea. I suspect it got lost amidst the upgrade to v2.

To generalize a bit (and update for v2), you could have a splitter function in the new DataBlock API (GroupPreservingSplitter? SegregatedSplitter?) that takes a function (item → group identifier) and a percentage, and splits into training and validation sets without splitting up groups (as identified by the function).

Edit: I went looking for an existing implementation of the underlying algorithm (to avoid reinventing the wheel), and this is the only one I could find. It’s definitely more polished than what I had written for myself, but not substantially different.

@YSaxon (Author) commented Sep 30, 2020

@racheltho’s blog post https://www.fast.ai/2017/11/13/validation-sets/, under the subheading “New people, new boats, new…”, is another good example.

@YSaxon (Author) commented Sep 30, 2020

The whole idea of GroupedSplitter is to let fastai automatically ensure that pictures of the same people, the same boats, etc. end up wholly in either the training set or the validation set, and not split between the two.

@YSaxon (Author) commented Sep 30, 2020

I'll add docs and fix the merge conflicts when I get a chance. It might be a little while, as I'm pretty busy with a few projects at the moment.

@YSaxon (Author) commented Sep 30, 2020

I also already wrote a slightly more complicated version, which picks the most accurate split (relative to the requested split percentage) out of n tries (defaulting to 5), and warns if the final returned split is off by more than ±5 percentage points from the percentage requested. I'll have to push that or otherwise link it to this pull request.
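The size-check part of that idea could look something like this. A sketch only: the function name and the ±5-point tolerance are illustrative, not the PR's exact code.

```python
import warnings

def check_split_size(n_valid, n_total, valid_pct, tol=0.05):
    # Warn when the achieved validation fraction deviates from the
    # requested one by more than `tol` (here 5 percentage points),
    # which can happen because whole groups are moved at a time.
    actual = n_valid / n_total
    if abs(actual - valid_pct) > tol:
        warnings.warn(
            f"requested valid_pct={valid_pct:.2f} but the split gave {actual:.2f}")
    return actual
```

The best-of-n-tries part would then just run the splitter n times and keep the candidate whose `actual` is closest to `valid_pct`.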

@YSaxon changed the title from GroupSplitter to GroupedSplitter on Sep 30, 2020
@jph00 (Member) commented Sep 30, 2020

Thanks @YSaxon - I look forward to seeing it. :) No hurry though!

@YSaxon (Author) commented Oct 5, 2020

Had a couple of minutes, so I fixed the conflict.

Still need to do the rest.

@jph00 (Member) commented Oct 6, 2020

Please at-mention me when done, so I don't miss it :)

@YSaxon (Author) commented Oct 16, 2020

Will do
(I haven’t forgotten about it)

@jph00 (Member) commented Nov 4, 2020

@YSaxon, are you still planning to handle the remaining issues? Or would you prefer I close this PR?

@jph00 (Member) left a review comment

Please write some explanations and simple examples, along with motivation for the feature, in the notebook. The extra details you included above in the PR could also be added to the notebook. In general, try to provide all the information that someone might want in order to understand what this does, whether they might want to use it, and how to use it. The tests should be as simple and clear as possible.

@YSaxon (Author) commented Nov 5, 2020

Should that documentation go in 05_data.transforms.ipynb? Or somewhere else?

@YSaxon (Author) commented Nov 13, 2020

I'm not sure I understand the error message this is failing with.

@YSaxon (Author) commented Nov 13, 2020

More generally, I thought it might be a good idea to demonstrate the use of this splitter with a real dataset, but I also realize that might slow down the tests. Is there any way to mark those cells so they are not run automatically?

@YSaxon (Author) commented Nov 13, 2020

In general, is this the sort of documentation you wanted? Or is it too verbose? If so, is there anywhere else better suited to a more verbose explanation and example? Maybe in the tutorial section?

@YSaxon (Author) commented Nov 13, 2020

Also, do you like the dual usage of groupkey (if items is a list, groupkey should be a callable; if items is a DataFrame, it should be a column name)? In any other language I would overload the function instead, like so, but it's not clear to me whether that's even possible in Python, let alone the Pythonic way.

    def GroupedSplitter(item2group: Callable, valid_pct=0.2, seed=None, n_tries=3):
        def _inner(o):
            assert not isinstance(o, pd.DataFrame)
            ...

    def GroupedSplitter(group_col: str, valid_pct=0.2, seed=None, n_tries=3):
        def _inner(o):
            assert isinstance(o, pd.DataFrame)
            ...
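One Pythonic alternative to overloading is to dispatch on the groupkey argument itself rather than on the items. A sketch (the helper name is hypothetical, not the PR's code):

```python
from typing import Callable, Union

def resolve_groups(items, groupkey: Union[str, Callable]):
    # Python has no built-in overloading, so the usual idiom is a runtime
    # check. A callable is applied per item; a string is treated as a
    # column name and looked up on the (DataFrame-like) items.
    if callable(groupkey):
        return [groupkey(x) for x in items]
    return list(items[groupkey])
```

`functools.singledispatch` is the stdlib take on overloading, but it dispatches only on the first positional argument's type, so a plain `callable`/`isinstance` check tends to read more naturally here.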

@YSaxon (Author) commented Nov 13, 2020

@jph00

@jph00 jph00 closed this Nov 23, 2020
@jph00 jph00 reopened this Nov 23, 2020
@YSaxon (Author) commented Nov 25, 2020

@jph00 Please take a look

@hamelsmu (Member) commented Apr 9, 2021

@jph00 I fixed the sync issues in this PR as well; it is ready for review.

@YSaxon (Author) commented Aug 19, 2022

@jph00 I just came across this old pull request of mine. Did you ever have a chance to review it?

@jph00 (Member) commented Aug 20, 2022

@YSaxon No, I apologise, I didn't see @hamelsmu's message last April that he'd fixed the CI issues so it was ready to review.

Have you taken a look at the various options here?: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection . Is this PR doing something different to those classes?
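For comparison, scikit-learn's GroupShuffleSplit does appear to cover a similar need: a random split that keeps each group entirely on one side, though it returns index arrays rather than implementing fastai's splitter interface.

```python
from sklearn.model_selection import GroupShuffleSplit

# Eight items in four groups of two; no group may straddle the split.
groups = ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd']
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(gss.split(list(range(len(groups))), groups=groups))
```

Note that GroupShuffleSplit's `test_size` is interpreted in terms of groups, not items, so like the PR it only approximates the requested item fraction when group sizes vary.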
