rungalileo/bulk-labeling

bulk-labeling

A tool for quickly adding labels to unlabeled datasets

Running on streamlit!

How to use

We can walk through a simple example of going from an unlabeled dataset to some usable labels in just a few minutes.

First, go to the Streamlit app above, or run it locally.

Then upload a CSV file with your text. The only requirement is that the file has a text column. Any other columns can be used for coloring the embedding plot. If you don't have a file of your own, you can use the conv-intent dataset from this repo!
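Before uploading, it can help to confirm that the file actually has a text column. Here is a minimal sketch using only the Python standard library; the helper name `has_text_column` and the sample rows are made up for illustration and are not part of this repo.

```python
import csv
import io

def has_text_column(csv_file, column="text"):
    """Return True if the CSV header contains the given column name."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, so fall back to an empty list
    return column in (reader.fieldnames or [])

# Hypothetical sample data; a real file would be opened with open(path, newline="")
sample = io.StringIO("text,source\nwill it rain tomorrow,chat\nbook a table for two,chat\n")
print(has_text_column(sample))
```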

image

Once the embeddings have been processed, you'll see your dataframe on the left and the embeddings on the right. The dataframe view comes with an extra text_length column that you can sort by or use to color the embeddings plot (in case text length is useful to you).

You can filter with the text search (regex coming soon!) or by lasso-selecting embedding clusters in the chart. You can also color the chart and resize the points using the menu on the left.
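To make the text_length column concrete, here is a rough sketch of how such a column could be derived and sorted on. It assumes the length is a plain character count, which the app may or may not use; `add_text_length` is a hypothetical helper, not a function from this repo.

```python
def add_text_length(rows):
    """Return copies of the rows with a text_length field (character count)."""
    return [{**row, "text_length": len(row["text"])} for row in rows]

# Made-up example rows
rows = [{"text": "will it rain tomorrow"}, {"text": "book a table"}]

# Sort longest first, as you might by clicking the column header in the table
longest_first = sorted(add_text_length(rows), key=lambda r: r["text_length"], reverse=True)
print(longest_first[0]["text"])
```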
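Conceptually, the text search narrows the table to rows that contain the query. A minimal sketch, assuming a case-insensitive substring match (the app's actual matching rules may differ); `filter_by_text` is a hypothetical name.

```python
def filter_by_text(rows, query):
    """Keep rows whose text contains the query, ignoring case."""
    needle = query.lower()
    return [row for row in rows if needle in row["text"].lower()]

# Made-up example rows
rows = [
    {"text": "Will it rain tomorrow?"},
    {"text": "Book a table for two"},
    {"text": "I need to book a flight"},
]
print(len(filter_by_text(rows, "book")))
```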

image

Since we see some clear clusters already, let's start by investigating them. We can see one cluster with a lot of references to weather. Let's select this cluster.

Screen.Recording.2022-10-04.at.4.31.31.PM.mov

Having confirmed that this cluster is about weather, we can register a new label "weather" and assign it to our samples.

Screen.Recording.2022-10-04.at.4.33.19.PM.mov

The UI will reset automatically. Let's look at another one. This cluster has a lot of references to bookings and reservations. Let's select that one.

Screen.Recording.2022-10-04.at.4.34.45.PM.mov

We can use the Streamlit table's built-in text search (by clicking on the table, then CMD+F) to see how many references to "book" there are. Unlike the text search filter, this won't actually filter the selection.

Screen.Recording.2022-10-04.at.4.37.30.PM.mov

Loads of samples have "book" in them, but we can be a bit more generic and call this "reservations". Let's register a new label "reservations" and label these samples.
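The register-then-assign flow boils down to mapping each selected sample to a label. A small sketch of that idea, assuming samples are identified by row index; `assign_label` and the indices are illustrative and do not reflect the app's internal data model.

```python
def assign_label(labels, selected_ids, label):
    """Record `label` for every selected sample id, overwriting earlier labels."""
    for sample_id in selected_ids:
        labels[sample_id] = label
    return labels

labels = {}
assign_label(labels, [0, 3, 7], "weather")       # first cluster
assign_label(labels, [1, 2], "reservations")     # second cluster
print(labels)
```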

Screen.Recording.2022-10-04.at.4.39.00.PM.mov

We can inspect our labeled samples in the label-viewer page.

image

image

Once we are ready, we simply click "Export assigned labels" and then click the "Download" button.
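If you want to picture what an export might contain, here is a sketch that writes only the labeled rows to CSV. The text/label column layout and the `export_labels` helper are assumptions for illustration; the file the app produces may be structured differently.

```python
import csv
import io

def export_labels(rows, labels, out_file):
    """Write only the labeled rows as CSV with text and label columns."""
    writer = csv.DictWriter(out_file, fieldnames=["text", "label"])
    writer.writeheader()
    for idx, row in enumerate(rows):
        if idx in labels:
            writer.writerow({"text": row["text"], "label": labels[idx]})

# Made-up rows; the third one is left unlabeled and is skipped on export
rows = [{"text": "will it rain"}, {"text": "book a table"}, {"text": "hello"}]
buffer = io.StringIO()
export_labels(rows, {0: "weather", 1: "reservations"}, buffer)
print(buffer.getvalue())
```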

export.mov

We just labeled N samples in a few minutes!

There are some pretty funny "mistakes" in the embeddings (samples that are semantically closer to other categories but contain words that trigger weather/reservations), and these should be kept in mind! The embeddings aren't perfect: we use a smaller model (paraphrase-MiniLM-L3-v2) to compute embeddings at a reasonable speed. But it's a good start! Feel free to run this locally and use a better model.
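If you do run locally, swapping the encoder could look roughly like the sketch below. It assumes the embeddings come from the sentence-transformers library (paraphrase-MiniLM-L3-v2 is published there), which this README does not state explicitly; `embed_texts` is a hypothetical helper, and all-mpnet-base-v2 is just one example of a larger model.

```python
def embed_texts(texts, model_name="paraphrase-MiniLM-L3-v2"):
    """Encode a list of strings into one embedding vector per string."""
    # Imported lazily so the module loads even before the package is installed
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the weights on first use
    return model.encode(texts, show_progress_bar=False)

# Example usage (requires `pip install sentence-transformers` and network access):
# small = embed_texts(["will it rain tomorrow", "book a table for two"])
# large = embed_texts(["will it rain tomorrow"], model_name="all-mpnet-base-v2")
```

Larger models tend to separate clusters more cleanly but take longer to encode, so the trade-off depends on your dataset size and hardware.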

image

Run locally

If you have a local GPU, want to try different encoder models, or don't want to upload your data, you can run the app locally.

  1. Create a virtual environment (I recommend pyenv):
       pyenv install $(cat .python-version)
       python -m venv .venv
       source .venv/bin/activate
       # Check that it worked
       which python pip
  2. Install reqs: pip install -r requirements.txt && pyenv rehash
  3. Run the app: streamlit run app.py
