[Area Page] Data Selection #30

codyaustun · 2021-07-21T03:08:58Z

Data selection methods, such as active learning and core-set selection, are useful and important tools for machine learning on large datasets. Major AI/ML conferences such as NeurIPS and ICML have consistently featured workshops and tutorials on these topics:

SubSetML: Subset Selection in Machine Learning: From Theory to Practice, ICML 2021
Workshop on Dataset Curation and Security, NeurIPS 2020
Active Learning From Theory to Practice, ICML 2019

what story you might tell about the topic's importance to data-centric AI
Large-scale unlabeled datasets can contain millions or billions of examples covering a wide variety of underlying concepts. Yet, these massive datasets often skew towards a relatively small number of common concepts, for example ‘cats’, ‘dogs’, and ‘people’. Rare concepts, such as ‘harbor seals’, tend to only appear in a small fraction of the data (usually less than 1%). However, performance on these rare concepts is critical in many settings. For example, harmful or malicious content may comprise only a small percentage of user-generated content, but it can have a disproportionate impact on the overall user experience. Similarly, when debugging model behavior for safety-critical applications like autonomous vehicles, or when dealing with representational biases in models, obtaining data that captures rare concepts allows machine learning practitioners to combat blind spots in model performance. Even a simple task, such as stop sign detection by an autonomous vehicle, can be difficult due to the diversity of real-world data. Stop signs may appear in a variety of conditions (e.g., on a wall or held by a person), can be heavily occluded, or have modifiers (e.g., “Except Right Turn”). Large-scale datasets are essential but not sufficient; finding the relevant examples for these long-tail tasks is challenging. Data selection methods, such as active learning, active search, and core-set selection, have the potential to automate the process of identifying these rare, high-value data points. (See "Similarity Search for Efficient Active Learning and Search of Rare Concepts" for more detail.)

whether this topic is related to other areas in data-centric AI, and why existing discussions may not be sufficient
All of the other areas focus on how we process data, not on which data we should process.

what subtopics, resources and related work you may discuss in the area page

Active learning
Active search
Core-set selection
Cooperative learning
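
To make one of these subtopics concrete, here is a minimal sketch of core-set selection using a greedy k-center (farthest-point) heuristic, which is one common way to frame the problem and not necessarily what the area page will present. The function name `k_center_greedy`, its parameters, and the toy 2-D points are all hypothetical and chosen only for illustration; in practice the points would be learned embeddings rather than raw coordinates.

```python
import math


def k_center_greedy(points, budget, first_index=0):
    """Pick `budget` indices so that every point is close to some picked point.

    Greedy farthest-point heuristic (illustrative sketch): repeatedly add
    the point that is farthest from everything selected so far.
    """
    if not 0 < budget <= len(points):
        raise ValueError("budget must be between 1 and len(points)")

    selected = [first_index]
    # Distance from each point to its nearest selected point.
    min_dist = [math.dist(p, points[first_index]) for p in points]

    while len(selected) < budget:
        # The farthest point is the one least covered by the current selection.
        farthest = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(farthest)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[farthest]))

    return selected


# Toy data (made up): a dense "common" cluster near the origin and one rare outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
selected = k_center_greedy(points, budget=2)
print(selected)  # [0, 4]
```

With a budget of two, the sketch keeps one point from the common cluster and the lone outlier, which is the kind of behavior that makes coverage-based selection interesting for rare concepts. Random sampling would pick the outlier far less reliably on skewed data.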

krandiash · 2021-07-22T20:42:58Z

This is great, would love to have you add edits to the area page. We also have some other work around data distillation and other framing around active learning that we want to discuss, so I can take over and do edits once you have some boilerplate text committed.

codyaustun · 2021-07-23T14:17:20Z

Sounds good! I'll draft something up next week.

Also, I might be able to rope in a few of my active learning collaborators and folks from the SubSetML workshop to pitch in, so we can do a broad overview.

codyaustun · 2021-09-02T22:33:20Z

@krandiash I haven't forgotten about this, but August was busier than I expected with NeurIPS. I am aiming to knock out the area page over the long weekend. Does that work for you?

krandiash · 2021-09-03T05:55:50Z

That totally works: a first draft is good, and we can get other folks in that community to build it out further. Thanks a lot!

codyaustun · 2021-09-10T02:59:43Z

I filled in the main README with #50, but I still need to complete the area page.

codyaustun added the proposal label Jul 21, 2021

codyaustun mentioned this issue Jul 22, 2021

Add stub for Data Selection #31

Merged

codyaustun mentioned this issue Sep 10, 2021

Update Data Selection stub on the main README #50

Merged
