
GroupedSplitter #2809

Open · wants to merge 27 commits into master
Conversation

@YSaxon commented Sep 16, 2020

See here and here in the DevChat in Forums.

I'm very much unsure whether I've added the feature in all the right places. I'm pretty much expecting (and hoping) to be told what more I need to do to make this a valid pull request.

@YSaxon YSaxon requested a review from jph00 as a code owner September 16, 2020 22:13
@review-notebook-app: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

@YSaxon (Author) commented Sep 18, 2020

See here for a notebook with the basic steps broken down.

@YSaxon (Author) commented Sep 18, 2020

Note that it is guaranteed always to return a validation set. However, in certain edge cases, such as when there is only a single group, or when the validation fraction is set abnormally high (e.g. 0.80 with two groups of equal size), it may mark every item as part of the validation set. If you'd like, I could make it raise an error in that case.

It is not guaranteed to always return the very best possible validation set (i.e. the one with the count closest to the requested fraction). This seems to be an NP-hard problem (see the discussions on Wikipedia and Stack Overflow). Not to mention, what you'd really want is a random pick from among the various possible sets that are reasonably close to the best set you could make, which would probably involve several NP-hard problems plus a judgment call.

In practice, what we could do is raise a warning if the size of the returned set deviates significantly (maybe by more than ±0.05?) from the size requested. Let me know if that's something you'd like me to implement.
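The greedy idea the comments above describe could be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual code; the function and parameter names are hypothetical. Note how the single-group edge case mentioned above naturally sends everything to the validation set.

```python
import random

def group_preserving_split(items, item2group, valid_pct=0.2, seed=None):
    # Greedy sketch of a group-preserving split: shuffle the groups, then
    # move whole groups into the validation set until it holds at least
    # valid_pct of the items. Every group lands wholly on one side, so
    # a single group (or an abnormally high valid_pct) can consume everything.
    rng = random.Random(seed)
    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(item2group(item), []).append(i)
    keys = list(groups)
    rng.shuffle(keys)
    target = valid_pct * len(items)
    valid = []
    for k in keys:
        if len(valid) >= target:
            break
        valid.extend(groups[k])
    valid_set = set(valid)
    train = [i for i in range(len(items)) if i not in valid_set]
    return train, valid
```

Because whole groups are moved at a time, the achieved fraction only approximates `valid_pct`, which is exactly the deviation the proposed warning would flag.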

@jph00 (Member) commented Sep 28, 2020

Many thanks. Sorry it took me a while to review.

It's a little hard for me to review it right now, and will also be hard for people to use it, because it has no docs, and the tests are quite long and complex. Could you please write some explanations and simple examples, along with motivation for the feature, in the notebook? The extra details you included above in the PR could also be added to the notebook. In general, try to provide all the information that someone might want in order to understand what this does, whether they might want to use it, and how to use it.

Also, please resolve the conflict that's been introduced since you submitted this.

@YSaxon (Author) commented Sep 30, 2020

My first post on the dev forum, which this pull request is based on, giving my justification and an example:

Idea, which I already implemented for myself inside a notebook, but I wonder if it might have broader appeal:

A split_with_grouping(group_from_filepath_re, pct, seed) function that would allow you to split a dataset randomly by percentage, like split_by_rand_pct, but without splitting up groups as defined by a regex.

For example, I’m using the 50States10k dataset of US Google Street View images (https://arxiv.org/pdf/1810.03077.pdf; a smaller version is available here). It has a folder for each state, and within it a file for each cardinal direction at each of many randomly selected points, labeled by some kind of hash, for instance:

50States10k/Alabama/2007_-NPWPMrYipeYcLsiZqKRyw_0.jpg
50States10k/Alabama/2007_-NPWPMrYipeYcLsiZqKRyw_180.jpg

50States10k/Alabama/2009_3BS7oprV5tjwg-M4dA1nLA_270.jpg

50States10k/South Dakota/2011_iloPUAZx7Vw59X-qJB2OQw_90.jpg

Now if you simply use split_by_rand_pct, you will wind up with an unfair validation set: for most validation images, you will have trained on images from other cardinal directions of the same point. You want instead to validate with street views from locations the model has never seen at all.

You could make a CSV file and split the images manually, but that sounds like a major pain.

So instead, why not have a function that takes a regex which can identify that, for instance, the first two examples above (and two others) are all part of the same group and need to be collectively assigned to either the training or the validation set?

(In this case, what I used was `re.match(r'\d{4}([\w-]+)\d+', Path(x).stem).group(1)`, which for the example above spits out `-NPWPMrYipeYcLsiZqKRyw`.)
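As a side note, since `\w` also matches underscores, the regex as quoted actually captures the delimiting underscores as part of the group (`_-NPWPMrYipeYcLsiZqKRyw_`). A hypothetical variant that anchors the underscores explicitly yields exactly the token shown:

```python
import re
from pathlib import Path

def group_key(path):
    # Hypothetical helper: pull the point hash out of filenames shaped like
    # 2007_-NPWPMrYipeYcLsiZqKRyw_0.jpg (year, hash, cardinal direction).
    return re.match(r'\d{4}_([\w-]+)_\d+', Path(path).stem).group(1)
```

All four cardinal-direction images of a point then share one key, so the splitter can keep them together.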


My second post:

Hi, just wanted to follow up about this idea. I suspect it got lost amidst the upgrade to v2.

To generalize a bit (and update for v2), you could have a splitter function in the new DataBlock API (GroupPreservingSplitter? SegregatedSplitter?) that takes a function (item → group identifier) and a percentage, and splits into training and validation sets without splitting up groups (as identified by the function).

Edit: I went looking for an existing implementation of the underlying algorithm (to avoid reinventing the wheel), and this is the only one I could find. It’s definitely more polished than what I had written for myself, but not substantially different.

@YSaxon (Author) commented Sep 30, 2020

@racheltho’s blog post https://www.fast.ai/2017/11/13/validation-sets/, under the subheading “New people, new boats, new…”, is another good example.

@YSaxon (Author) commented Sep 30, 2020

The whole idea of GroupedSplitter is to let fastai automatically ensure that pictures of the same people, the same boats, etc. end up wholly in either the training set or the validation set, and not split between the two.

@YSaxon (Author) commented Sep 30, 2020

I'll add docs and fix the merge conflicts when I get a chance. It might be a little while, as I'm pretty busy with a few projects at the moment.

@YSaxon (Author) commented Sep 30, 2020

I also already wrote a slightly more complicated version, which picks the most accurate split (relative to the requested split percentage) out of n tries (defaulting to 5), and warns if the final returned split is off by more than ±5 percentage points from the percentage requested. I'll have to push that or otherwise link it to this pull request.
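The size-check part of that idea could look something like this. A sketch only: the function name and the ±5-point tolerance are illustrative, not the PR's exact code.

```python
import warnings

def check_split_size(n_valid, n_total, valid_pct, tol=0.05):
    # Warn when the achieved validation fraction deviates from the
    # requested one by more than `tol` (here 5 percentage points),
    # which can happen because whole groups are moved at a time.
    actual = n_valid / n_total
    if abs(actual - valid_pct) > tol:
        warnings.warn(
            f"requested valid_pct={valid_pct:.2f} but the split gave {actual:.2f}")
    return actual
```

The best-of-n-tries part would then just run the splitter n times and keep the candidate whose `actual` is closest to `valid_pct`.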

@YSaxon changed the title from GroupSplitter to GroupedSplitter on Sep 30, 2020
@jph00 (Member) commented Sep 30, 2020

Thanks @YSaxon - I look forward to seeing it. :) No hurry though!

@YSaxon (Author) commented Oct 5, 2020

Had a couple of minutes, so I fixed the conflict.

Still need to do the rest.

@jph00 (Member) commented Oct 6, 2020

Please at-mention me when done, so I don't miss it :)

@YSaxon (Author) commented Oct 16, 2020

Will do
(I haven’t forgotten about it)

@jph00 (Member) commented Nov 4, 2020

@YSaxon, are you still planning to handle the remaining issues? Or would you prefer I close this PR?

@jph00 (Member) left a review comment

Please write some explanations and simple examples, along with motivation for the feature, in the notebook. The extra details you included above in the PR could also be added to the notebook. In general, try to provide all the information that someone might want in order to understand what this does, whether they might want to use it, and how to use it. The tests should be as simple and clear as possible.

@YSaxon (Author) commented Nov 5, 2020

Should that documentation go in 05_data.transforms.ipynb? Or somewhere else?

@YSaxon (Author) commented Nov 13, 2020

I'm not sure I understand the error message this is failing with.

@YSaxon (Author) commented Nov 13, 2020

More generally, I thought it might be a good idea to demonstrate the use of this splitter with a real dataset, but I also realize that might slow down the tests. Is there any way to mark those cells so they are not run automatically?

@YSaxon (Author) commented Nov 13, 2020

In general, is this the sort of documentation you wanted? Or is it too verbose? If so, is there anywhere else better suited to a more verbose explanation and example? Maybe in the tutorial section?

@YSaxon (Author) commented Nov 13, 2020

Also, do you like the dual usage of groupkey (if items is a list, groupkey should be a callable; if items is a DataFrame, it should be a column name)? In any other language I would overload the function instead, like so, but it's not clear to me whether that's even possible in Python, let alone the Pythonic way.

    def GroupedSplitter(item2group: Callable, valid_pct=0.2, seed=None, n_tries=3):
        def _inner(o):
            assert not isinstance(o, pd.DataFrame)
            ...

    def GroupedSplitter(group_col: str, valid_pct=0.2, seed=None, n_tries=3):
        def _inner(o):
            assert isinstance(o, pd.DataFrame)
            ...
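One Pythonic alternative to overloading is to dispatch on the groupkey argument itself rather than on the items. A sketch (the helper name is hypothetical, not the PR's code):

```python
from typing import Callable, Union

def resolve_groups(items, groupkey: Union[str, Callable]):
    # Python has no built-in overloading, so the usual idiom is a runtime
    # check. A callable is applied per item; a string is treated as a
    # column name and looked up on the (DataFrame-like) items.
    if callable(groupkey):
        return [groupkey(x) for x in items]
    return list(items[groupkey])
```

`functools.singledispatch` is the stdlib take on overloading, but it dispatches only on the first positional argument's type, so a plain `callable`/`isinstance` check tends to read more naturally here.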

@YSaxon (Author) commented Nov 13, 2020

@jph00

@jph00 jph00 closed this Nov 23, 2020
@jph00 jph00 reopened this Nov 23, 2020
@YSaxon (Author) commented Nov 25, 2020

@jph00 Please take a look

@hamelsmu (Member) commented Apr 9, 2021

@jph00 I fixed the sync issues in this PR as well; it is ready for review.

@YSaxon (Author) commented Aug 19, 2022

@jph00 I just came across this old pull request of mine. Did you ever have a chance to review it?

@jph00 (Member) commented Aug 20, 2022

@YSaxon No, I apologise, I didn't see @hamelsmu's message last April that he'd fixed the CI issues so it was ready to review.

Have you taken a look at the various options here?: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection . Is this PR doing something different to those classes?
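For comparison, scikit-learn's GroupShuffleSplit does appear to cover a similar need: a random split that keeps each group entirely on one side, though it returns index arrays rather than implementing fastai's splitter interface.

```python
from sklearn.model_selection import GroupShuffleSplit

# Eight items in four groups of two; no group may straddle the split.
groups = ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd']
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(gss.split(list(range(len(groups))), groups=groups))
```

Note that GroupShuffleSplit's `test_size` is interpreted in terms of groups, not items, so like the PR it only approximates the requested item fraction when group sizes vary.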
