Text clustering: HDBSCAN is probably all you need

Goal

Group similar items in a text dataset to pinpoint core themes and their distribution.

- Clusters cover the main topics/subtopics in the dataset
- Clusters are backed by accurate, LLM-generated summaries

Background

We employ HDBSCAN for probabilistic clustering. The algorithm has several advantages:

- Don’t be wrong: Clusters can have varying densities, don’t need to be globular, and won’t include noise points
- Intuitive parameters: Choosing a minimum cluster size is straightforward, and the number of clusters k does not need to be specified (HDBSCAN selects k from the data)
- Stability: HDBSCAN is stable over runs and subsampling, and is fairly stable over parameter choices
- Performance: When implemented well, HDBSCAN can be very efficient; the current implementation has similar performance to fastcluster’s agglomerative clustering

See the HDBSCAN docs pages on comparing clustering algorithms and on how HDBSCAN works for more information.

Citations

Datasets
- fka/awesome-chatgpt-prompts
- gustavosta/stable-diffusion-prompts
Embedding models
- sentence-transformers/all-mpnet-base-v2

Experiments

1. Visualizing core themes in fka/awesome-chatgpt-prompts

These figures correspond to experiments/02_09_2023_16_54_32

Figure 1. HDBSCAN splits the 153 text-to-text prompts from fka/awesome-chatgpt-prompts into two clusters: Cluster 1 with 44 prompts (orange) and Cluster 2 with 105 prompts (blue). The 4 remaining prompts (gray) were filtered out as outliers/noise.

Figure 2. The most persistent prompts in each leaf cluster are known as "exemplars". These represent the "hearts" around which the final cluster formed. See the HDBSCAN docs explanation of soft clustering for supporting information and functions.

Figure 3. A second round of clustering around the exemplars identifies sub-topics in the dataset. The prompts in each sub-cluster then serve as retrieved context for the LLM theme summarization calls below.

Figure 4. Visualizing the "Computer Programming and Software Development" theme, which covers 13% of the dataset. The summary was generated by gpt-3.5-turbo-16k. The visualization was created with jsoncrack.com/editor.
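The step from sub-clusters to an LLM summary can be pictured as assembling the member prompts into a single context string. The sketch below is purely illustrative: the function name `build_theme_prompt`, the instruction wording, and the placeholder texts are all hypothetical and are not taken from this repository. No LLM call is made.

```python
def build_theme_prompt(sub_clusters, max_examples=3):
    """Assemble a summarization prompt from sub-cluster members (hypothetical format)."""
    sections = []
    for sub_id, texts in sorted(sub_clusters.items()):
        # Cap the examples per sub-cluster so the context stays short.
        examples = "\n".join(f"- {text}" for text in texts[:max_examples])
        sections.append(f"Sub-cluster {sub_id}:\n{examples}")
    context = "\n\n".join(sections)
    return (
        "Summarize the common theme of the prompts below in one short phrase.\n\n"
        f"{context}"
    )


# Placeholder texts for illustration only (not rows from the dataset).
sub_clusters = {
    0: ["placeholder prompt A", "placeholder prompt B"],
    1: ["placeholder prompt C", "placeholder prompt D",
        "placeholder prompt E", "placeholder prompt F"],
}
theme_prompt = build_theme_prompt(sub_clusters, max_examples=3)
print(theme_prompt)
```

The resulting string would then be sent to whichever chat model produces the summary.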

2. Drift detection for gustavosta/stable-diffusion-prompts

These figures correspond to experiments/04_09_2023_03_02_25

HDBSCAN splits the 73,718 text-to-image prompts from gustavosta/stable-diffusion-prompts into 78 clusters covering 25,019 prompts (34% of the dataset). The remaining 48,699 (66%) were filtered out as outliers/noise. The 5 largest clusters cover about 9.5% of the dataset; these are the segments we examine for drift below.
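Figures like these can be derived directly from the label array that the clusterer returns. The following is a minimal sketch assuming the usual convention that label -1 marks noise; the helper name `cluster_coverage` and the toy labels are illustrative assumptions, not the experiment's code or output.

```python
from collections import Counter


def cluster_coverage(labels, top_n=5):
    """Summarize how much of a dataset is clustered versus labeled as noise (-1)."""
    total = len(labels)
    sizes = Counter(label for label in labels if label != -1)
    clustered = sum(sizes.values())
    largest = sizes.most_common(top_n)
    return {
        "n_clusters": len(sizes),
        "clustered_share": clustered / total,
        "noise_share": (total - clustered) / total,
        "top_share": sum(count for _, count in largest) / total,
        "largest": largest,
    }


# Toy labels for illustration only (not the experiment's output).
toy_labels = [0] * 5 + [1] * 3 + [2] * 2 + [-1] * 10
stats = cluster_coverage(toy_labels, top_n=2)
print(stats)
```

Applied to the real label array, the same function would give the cluster count, the clustered and noise shares, and the share covered by the largest clusters.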

cluster id	theme
56	Portraits and artistic depictions of female anime characters, beautiful women, and fashionable young women
13	Symmetrical portraits of people, characters, and sci-fi figures
61	Futuristic sci-fi spaceship concept art
50	Portraits of famous actresses as characters in various roles, outfits, and styles
74	Surreal, cinematic, and futuristic digital art

cluster id	train count (73.7k rows)	test count (8.19k rows)	drift (% change in share, train to test)
56	2530 (3.43%)	310 (3.79%)	10.50
13	1343 (1.82%)	149 (1.82%)	0.00
61	1287 (1.75%)	131 (1.60%)	-8.57
50	1055 (1.43%)	135 (1.65%)	15.38
74	749 (1.02%)	109 (1.33%)	30.39

Tables 1 & 2. Claude-2 theme summaries for the 5 largest clusters (top) and drift detection for the same clusters (bottom).
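The drift column appears to be the relative change in each cluster's share of the data between the train and test splits. This is an inference from the reported numbers, not a statement of the code that produced them; the function names below are illustrative. Recomputing from the rounded percentages in the table:

```python
def share_percent(count, total):
    """Share of the dataset held by one cluster, in percent."""
    return 100.0 * count / total


def relative_change_percent(train_pct, test_pct):
    """Relative change of the test share with respect to the train share, in percent."""
    return 100.0 * (test_pct - train_pct) / train_pct


# (cluster id, train share %, test share %) as reported in the table above.
rows = [
    (56, 3.43, 3.79),
    (13, 1.82, 1.82),
    (61, 1.75, 1.60),
    (50, 1.43, 1.65),
    (74, 1.02, 1.33),
]

drift = {
    cluster_id: round(relative_change_percent(train_pct, test_pct), 2)
    for cluster_id, train_pct, test_pct in rows
}
print(drift)
```

A positive value means the cluster takes up a larger share of the test split than of the train split. The train shares themselves follow from the counts, for example `share_percent(2530, 73718)` for cluster 56.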

Prompt: "Beautiful painting of an Aspen forest at sunset, digital art, award winning illustration, golden hour, smooth, sharp lines, concept art, trending on artstation"
Model: Runway Gen-2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Beautiful landscape paintings and matte art (cluster id: 75)

Prompt: "Futuristic batman, brush strokes, oil painting, greg rutkowski"
Model: Midjourney V5.2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Art and portraits of Batman characters (cluster id: 41)

Prompt: "Futuristic Porsche designed by Apple, a detailed matte painting by Kitagawa Utamaro, cgsociety, octane render, highly detailed, matte painting, concept art, sci-fi"
Model: Midjourney V5.2 (accessed by Daniel Furman on Sep 4, 2023)
Theme: Futuristic and fantasy vehicle concept art (cluster id: 52)

Figure 5. A sample of 3 text-to-image generations from various models for prompts in the gustavosta/stable-diffusion-prompts dataset (alongside their cluster ids).

daniel-furman/awesome-chatgpt-prompts-clustering
