GroupedSplitter #2809
base: master
Conversation
Force-pushed from 6d80186 to 2c6fef5
See here for a notebook with the basic steps broken down.
Note that it is guaranteed to always return a validation set. But in various edge cases, like where there is only a single group, or if the validation fraction is set abnormally high (e.g. 0.80 with two groups of equal size), it is possible that it will mark all items as part of the validation set. If you'd like, I could make it raise an error in that case. It is not guaranteed to always return the very best possible validation set (i.e. the one with the count closest to the fraction requested); that appears to be an NP-hard problem (see Wikipedia and Stack Overflow). Not to mention, what you'd really want is a random pick from among the various possible sets that are reasonably close to the best set you could possibly make, which would probably involve several NP-hard problems plus a judgement call. In practice, what we could maybe do is raise a warning if the size of the returned set deviates significantly (maybe ±0.05?) from the size requested. Let me know if that's something you'd like me to implement.
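The greedy scheme described here can be sketched roughly like this (a minimal illustration with my own names, not the actual code in this PR):

```python
import random
from collections import defaultdict

def grouped_split(items, item2group, valid_pct=0.2, seed=None):
    """Sketch: shuffle the groups, then move whole groups into the
    validation set until it reaches roughly valid_pct of all items."""
    groups = defaultdict(list)
    for i, item in enumerate(items):
        groups[item2group(item)].append(i)
    keys = sorted(groups)            # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(keys)
    target = valid_pct * len(items)
    valid = []
    for k in keys:
        if len(valid) >= target:
            break
        valid.extend(groups[k])      # a whole group goes to validation, never part of one
    valid_set = set(valid)
    train = [i for i in range(len(items)) if i not in valid_set]
    return train, valid
```

Because whole groups are moved at once, the validation fraction can overshoot the target, which is exactly the edge case discussed above.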
Many thanks. Sorry it took me a while to review. It's a little hard for me to review it right now, and will also be hard for people to use it, because it has no docs, and the tests are quite long and complex. Could you please write some explanations and simple examples, along with motivation for the feature, in the notebook? The extra details you included above in the PR could also be added to the notebook. In general, try to provide all the information that someone might want in order to understand what this does, whether they might want to use it, and how to use it. Also, please resolve the conflict that's been introduced since you submitted this.
My first post on the dev forum, which this pull request was based on, giving my justification and an example:

Idea, which I already implemented for myself inside a notebook, but I wonder if it might have broader appeal: a split_with_grouping(group_from_filepath_re, pct, seed) function that would allow you to split a dataset randomly by percentage, like split_by_rand_pct, but without splitting up groups as defined by a regex. For example, I'm using the 50States10k dataset of US Google StreetView images (https://arxiv.org/pdf/1810.03077.pdf; a smaller dataset is available here). This has folders for each State, and then files for each cardinal direction for each of many randomly selected points labeled by some kind of hash, so for instance: 50States10k/Alabama/2007_-NPWPMrYipeYcLsiZqKRyw_0.jpg. You could make a csv file and split the images manually, but that sounds like a major pain. So instead, why not just have a function that can take a regex which can identify that, for instance, the top two examples (and two others) are all part of the same group and need to be collectively assigned to either the training or validation set. (In this case, what I used was:

My second post:

Hi, just wanted to follow up about this idea. I suspect it got lost amidst the upgrade to v2. To generalize a bit (and update for v2), you could have a splitter function in the new datablock API (GroupPreservingSplitter? SegregatedSplitter?) which takes a function (item -> group identifier) and a percentage, and splits into training and validation sets without splitting up groups (as identified by the function).

Edit: I went looking for an implementation of the underlying algorithm (to avoid reinventing the wheel), and this is the only one I could find. It's definitely more polished than what I had written for myself, but not substantially different.
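To make the regex idea concrete, here is a hedged sketch of a group-key function for filenames shaped like the example above (the actual regex used in the post is elided, so this pattern is only my guess at the filename layout):

```python
import re

# Hypothetical pattern for paths like
#   50States10k/Alabama/2007_-NPWPMrYipeYcLsiZqKRyw_0.jpg
# The middle year+hash identifies the StreetView point; the trailing
# digit is the cardinal direction, so it is excluded from the group key.
GROUP_RE = re.compile(r"([^/]+)_\d+\.jpg$")

def group_from_filepath(path):
    """Return the group key shared by the four directional shots of one point."""
    m = GROUP_RE.search(path)
    return m.group(1) if m else path
```

All four cardinal-direction shots of a point then map to the same key, so a group-aware splitter keeps them together.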
@racheltho's blog post https://www.fast.ai/2017/11/13/validation-sets/ (see the subheading "New people, new boats, new…") is another good example.
The whole idea of GroupedSplitter is to allow fastai to automatically ensure that the pictures of the same people, same boats etc, end up wholly in either the training set or the validation set, and not split between sets. |
I'll add docs and fix the merge conflicts when I get a chance. It might be a little bit, as I'm pretty busy with a few projects at the moment. |
I did also already write a slightly more complicated version which will pick the most accurate split (compared to the requested split percentage) out of n tries (defaulting to 5), and also warn if the final returned split is off by more than ±5 percentage points from the percentage requested. I'll have to push or otherwise link that to this pull request.
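That best-of-n retry logic could look roughly like this (a sketch with hypothetical names `best_of_n` and `split_fn`; the actual version mentioned above may differ):

```python
import warnings

def best_of_n(split_fn, items, valid_pct=0.2, seed=0, n_tries=5):
    """Call split_fn(items, seed) n_tries times with different seeds and
    keep the (train, valid) split whose validation fraction is closest
    to valid_pct; warn if even the best try is off by more than 0.05."""
    best, best_err = None, float("inf")
    for t in range(n_tries):
        train, valid = split_fn(items, seed + t)
        err = abs(len(valid) / len(items) - valid_pct)
        if err < best_err:
            best, best_err = (train, valid), err
    if best_err > 0.05:
        warnings.warn(f"best split is off target by {best_err:.2%}")
    return best
```

Retrying with different seeds sidesteps the NP-hard optimal-partition problem: it just takes the closest of several random group assignments.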
Thanks @YSaxon - I look forward to seeing it. :) No hurry though! |
Had a couple of minutes, so I fixed the conflict. Still need to do the rest.
Please at-mention me when done, so I don't miss it :) |
Will do |
@YSaxon, are you still planning to handle the remaining issues? Or would you prefer I closed this PR?
Please write some explanations and simple examples, along with motivation for the feature, in the notebook. The extra details you included above in the PR could also be added to the notebook. In general, try to provide all the information that someone might want in order to understand what this does, whether they might want to use it, and how to use it. The tests should be as simple and clear as possible.
Should that documentation go in 05_data.transforms.ipynb? Or somewhere else?
Force-pushed from c477ee5 to fa8bba1
I'm not sure I understand the error message this is failing with.

More generally, I thought it might be a good idea to demonstrate the use of this splitter with a real dataset, but I also realize that it might slow down tests. Is there any way to mark those cells so they are not automatically run?

In general, is this the sort of documentation you wanted? Or is it too verbose? If so, is there anywhere else better suited to a more verbose explanation and example? Maybe in the tutorial section?
Also, do you like the dual usage of groupkey (dispatching on whether it's a Callable or a str)? The two signatures are `def GroupedSplitter(item2group: Callable, valid_pct=0.2, seed=None, n_tries=3)` and `def GroupedSplitter(group_col: str, valid_pct=0.2, seed=None, n_tries=3)`.
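The dual usage could be resolved with a small type dispatch, sketched here with a hypothetical helper name (not code from this PR):

```python
def resolve_groupkey(groupkey):
    """Dispatch on type: a callable maps item -> group id directly,
    while a str is treated as a column/field name to look up."""
    if callable(groupkey):
        return groupkey
    if isinstance(groupkey, str):
        return lambda item: item[groupkey]
    raise TypeError("groupkey must be a callable or a str column name")
```

With this, the splitter body only ever sees an item -> group-id function, whichever form the caller passed.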
@jph00 Please take a look |
@jph00 I fixed the sync issues in this PR as well; it's ready for review.
@jph00 I just came across this old pull request of mine. Did you ever have a chance to review it? |
@YSaxon No, I apologise, I didn't see @hamelsmu's message last April that he'd fixed the CI issues so it was ready to review. Have you taken a look at the various options here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection ? Is this PR doing something different to those classes?
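For reference, scikit-learn's `GroupShuffleSplit` does cover the core use case of group-preserving random splits, though it works on index arrays rather than fastai's splitter protocol:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Ten items belonging to four groups (e.g. four StreetView points)
X = np.arange(10).reshape(-1, 1)
groups = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])

# One shuffled split that keeps each group wholly on one side
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(gss.split(X, groups=groups))
```

Wrapping something like this as a fastai splitter (a function returning the two index lists) would be one way to avoid reimplementing the algorithm.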
See here and here in the DevChat in the forums.
I'm very much unsure if I've added the feature in all the right places. I'm pretty much expecting (and hoping) to be told what more I need to do to make this a valid pull request.