
Feature/pubfig83 #46

Open -- wants to merge 23 commits into master
Conversation

yamins81 (Contributor) commented Mar 1, 2012

standardized splits and tests -- @jaberg, @npinto, your comments would be great

Comments on how things are done are in the test_pubfig83 file.

yamins81 (Contributor, Author) commented Mar 1, 2012

@zstone, your comments would be great as well

p = rng.permutation(len(remainder))
if 'Train%d' % _ind not in splits:
    splits['Train%d' % _ind] = []
splits['Train%d' % _ind].extend(remainder[p[:80]].copy())
Collaborator

Any reason to have hardcoded values here (e.g. 90, 10, 80)?

Contributor Author

I originally preferred the variable ntest/ntrain approach, like you suggest. On the other hand, it might be good to have "standard" splits -- I've occasionally heard @zstone and @davidcox talking about standard 90/10 splits. Of course, we COULD make them variables and just set defaults. I think it depends on how the creators of this dataset intended it to be used. @npinto, you and @zstone worked on this together (?), so if you think it's fine to give people guidance suggesting that non-standard-sized splits are "OK", that seems fine to me.
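
To make that concrete, the variables-with-defaults version might look roughly like this (make_splits, ntrain, ntest, and nsplits are placeholder names, with defaults that just mirror the current hardcoded values):

import numpy as np

def make_splits(names, nsplits=10, ntrain=80, ntest=10, seed=0):
    # names: per-subject image identifiers; returns a dict keyed by
    # 'Train0'/'Test0', ..., like the splits dict the current code builds
    names = np.asarray(names)
    rng = np.random.RandomState(seed)
    splits = {}
    for _ind in range(nsplits):
        p = rng.permutation(len(names))
        splits['Test%d' % _ind] = list(names[p[:ntest]])
        splits['Train%d' % _ind] = list(names[p[ntest:ntest + ntrain]])
    return splits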


I support the creation of a set of standard splits of any dataset, but I would favor canonizing a minimal, language-agnostic materialization of said standard splits rather than defining them implicitly by code in a specific language that must be run correctly in a specific environment to regenerate the right splits. Concretely, I would advocate looking for the simplest-possible JSON format that would completely specify, say, ten 90/10 splits with reference to relative image paths within a canonical archive of a dataset.
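
For example, a file as small as this could pin the splits down completely (the field names and paths below are purely illustrative, not a schema proposal; shown here as the Python that would write it):

import json

standard_splits = {
    "dataset": "pubfig83",
    "splits": {
        "Train0": ["person_a/img_0001.jpg", "person_a/img_0002.jpg"],
        "Test0": ["person_a/img_0091.jpg"],
    },
}

with open("pubfig83_splits.json", "w") as f:
    json.dump(standard_splits, f, indent=2, sort_keys=True)

Any implementation, in any language, could then load that file and resolve the relative paths against its own copy of the canonical archive.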

That's just my opinion, though, and it may not fit the way scikit-data already works!

Contributor Author

@zstone I'm not sure it's obvious how to use JSON inside scikit-data in a natural way. However, could you comment on the split scheme currently proposed in the code? I just added support for non-standard splits. I put the split-related data in the dataset __init__, because we might want the split data to travel with the dataset instance. The current code takes the view that to change splits you have to instantiate a new object (or, I guess, monkey around inside the code; the various attributes should probably be private). What do you think?
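
Concretely, the interface is roughly the following (a sketch only -- the class and attribute names are illustrative, and _generate_splits just stands in for the permutation logic quoted above):

import numpy as np

class PubFig83(object):
    # Split-related parameters travel with the instance; to get different
    # splits you construct a new instance rather than mutating this one.
    def __init__(self, ntrain=80, ntest=10, nsplits=10, seed=0):
        self._ntrain = ntrain
        self._ntest = ntest
        self._nsplits = nsplits
        self._rng = np.random.RandomState(seed)
        self.splits = self._generate_splits()

    def _generate_splits(self):
        # placeholder: the real version permutes the per-subject image
        # lists, along the lines of the snippet quoted earlier
        return dict(('Train%d' % i, []) for i in range(self._nsplits))

So PubFig83() would give the standard splits, PubFig83(ntrain=70, ntest=20) would give non-standard ones, and getting the standard splits back means constructing a fresh instance.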

jaberg (Owner) commented Mar 14, 2013

@yamins81 Coming back to this after forever -- is this the branch of pubfig83 that went into our ICML 2013 paper? If not, could you point me to that code? I'd like to merge it in (and I'm sorry for not doing it a long time ago!)
