Summarization, clustering, and characterization of text categories using LLMs

Darveivoldavara/clustering_and_naming_categories

Clustering and defining text categories

The examples presented here demonstrate how LLMs can be used for:

  • Extracting the brief essence from texts
  • Clustering texts into categories based on their content
  • Forming descriptions and characteristics of categories

Objective

Businesses can use the results, for instance, to identify the most common inquiries that clients and company employees make to customer service centers or technical support.


Tools used

GPT-3.5 and GPT-4 were used, with the choice depending on the volume of texts, the complexity of the task, and the final processing cost.

Additionally, on large datasets, KMeans was used for clustering, with RuBERT tiny 2 generating the text embeddings.
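
To illustrate the clustering step, here is a minimal sketch of the k-means idea using only the Python standard library. It is not the project's code: the project mentions KMeans (presumably a library implementation), and the toy 2-D vectors below are hypothetical stand-ins for real RuBERT tiny 2 embeddings. The function name kmeans and its parameters are assumptions made for this example.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster equal-length vectors into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2-D vectors standing in for real text embeddings (hypothetical data).
toy_embeddings = [
    (0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
    (5.0, 5.1), (5.1, 5.0), (4.9, 5.0),
]
labels, centroids = kmeans(toy_embeddings, k=2)
print(labels)
```

With real data, each text would first be turned into an embedding vector, and the resulting cluster labels would then group texts into candidate categories for the LLM to name and describe.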


Building a Q&A file from Telegram messages

OpenAI API key setup

To get image descriptions from your chat, first set your OpenAI API key as an environment variable on your OS. Run the following script in your command line and specify your API key:

bash setup_openai_key.sh
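
Before running anything that calls the API, it can help to confirm that the key is actually visible to Python. The sketch below assumes the variable is named OPENAI_API_KEY, which is the name the official OpenAI Python client reads by default; what setup_openai_key.sh actually sets is not shown here. The helper name get_openai_api_key is hypothetical.

```python
import os


def get_openai_api_key(env=os.environ):
    """Return the API key from the environment, or raise a clear error."""
    # OPENAI_API_KEY is an assumed variable name (see the note above).
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; run setup_openai_key.sh or export it manually."
        )
    return key
```

Calling get_openai_api_key() at the start of a script fails early with a readable message instead of an authentication error partway through processing.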

Telegram message history export

To export your chat history from Telegram, open the chat, click the three dots at the top right corner, and select "Export chat history". Set "Format" to JSON and adjust the other parameters as needed. Set "Path" to the root of this project, so that you end up with a folder named source containing the chat data.
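
As a rough sketch of what the exported data looks like in code, the snippet below reads a Telegram JSON export and returns sender and text pairs. It assumes the usual Telegram Desktop layout: a top-level "messages" list whose entries carry "type", "from", and "text", where "text" is either a plain string or a list mixing strings and entity dictionaries. The function names are hypothetical, and this is not necessarily how qa_extract.py reads the file.

```python
import json
from pathlib import Path


def flatten_text(text):
    """Join Telegram's text field, which is a string or a list of strings and dicts."""
    if isinstance(text, str):
        return text
    parts = []
    for piece in text:
        # Formatted fragments (links, bold, etc.) are dicts with their own "text".
        parts.append(piece if isinstance(piece, str) else piece.get("text", ""))
    return "".join(parts)


def load_messages(export_path):
    """Return (sender, text) pairs for non-empty regular messages in the export."""
    data = json.loads(Path(export_path).read_text(encoding="utf-8"))
    result = []
    for msg in data.get("messages", []):
        if msg.get("type") != "message":
            continue  # skip service entries such as joins or pins
        text = flatten_text(msg.get("text", ""))
        if text.strip():
            result.append((msg.get("from"), text))
    return result
```

For example, load_messages("source/result.json") would read the export, assuming the file keeps Telegram Desktop's default name result.json.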

Retrieving Q&A file

Then, you can run qa_extract.py:

python3 qa_extract.py

and the resulting qa.json file will appear in the data folder.
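
To check the result quickly, a small sketch like the one below reports the top-level type and number of entries in the output file. The schema of qa.json is not described here, so the code makes no assumptions about its fields; the helper name describe_json is hypothetical.

```python
import json
from pathlib import Path


def describe_json(path):
    """Return the top-level type name and entry count of a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Lists and dicts report their length; any other value counts as one entry.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size
```

For example, describe_json("data/qa.json") shows whether the file holds a list or a dictionary and how many entries were extracted.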