How to add new category? #19

Open
b0tm1nd opened this issue Jan 11, 2024 · 3 comments

Comments


b0tm1nd commented Jan 11, 2024

What is the procedure for adding a new category?

Author

b0tm1nd commented Jan 12, 2024

From what I understand, we need a new dataset in .jsonl format with text and labels.
Could you share the datasets this was trained on, especially for not_news?
From reading about the Telegram contest, I see that for Russian content they mostly used the lenta.ru archive.
But what about Ukrainian?
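
To make the expected layout concrete, here is a minimal sketch of such a .jsonl dataset: one JSON object per line, holding a text and a list of labels. The key names `text` and `labels` and the sample records are assumptions for illustration; the actual schema of categories_train.jsonl is not confirmed here.

```python
import json

# Hypothetical records: the "text" and "labels" keys are assumed,
# not confirmed against the repository's categories_train.jsonl.
records = [
    {"text": "Parliament passed the budget bill.", "labels": ["politics", "economy"]},
    {"text": "Subscribe to our channel for discounts!", "labels": ["not_news"]},
]


def dump_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def load_jsonl(data):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in data.splitlines() if line.strip()]


jsonl_text = dump_jsonl(records)
restored = load_jsonl(jsonl_text)
```

`ensure_ascii=False` keeps Cyrillic text readable in the file instead of escaping it.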

@NyanNyanovich
Owner

Author

b0tm1nd commented Jan 15, 2024

@NyanNyanovich Thanks, I had already found train_clf.py and tried training it with a single category, but then the classifier failed in send.sh, probably because "not_news" was missing.

I took a dataset from a Ukrainian news website that tags its articles, kept only the entries related to corruption, and got about 700 entries, which I merged with categories_train.jsonl.

After training, I started getting much worse results: many war/politics items now trigger corruption and end up as "unknown".
I found that the median text length in the added dataset is 1000+ characters, while in yours it is about 450.
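
The length comparison above can be reproduced with a short script along these lines. This is a sketch under assumptions: the `text` key is assumed, length is measured in characters, and `median_text_length` is a hypothetical helper, not part of the repository.

```python
import json
from statistics import median


def median_text_length(jsonl_lines, text_key="text"):
    """Return the median character length of the text field.

    jsonl_lines is any iterable of JSON Lines strings (e.g. an open file);
    text_key is an assumed field name.
    """
    lengths = [
        len(json.loads(line)[text_key])
        for line in jsonl_lines
        if line.strip()  # ignore blank lines
    ]
    return median(lengths)


# Tiny in-memory example instead of a real dataset file.
sample = ['{"text": "abcd"}', '{"text": "ab"}', '{"text": "abcdef"}']
sample_median = median_text_length(sample)
```

With real data, the same function can be called on an open file handle, for example `median_text_length(open("categories_train.jsonl", encoding="utf-8"))`.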

So I have a few questions about building a dataset for the new category:

  1. Does a smaller article size improve accuracy?
  2. Will multiple labels for the new category (like ["corruption", "war"] or ["corruption", "politics"]) increase accuracy?
  3. What was your strategy (or was it random?) for selecting news for your training dataset:

Labels sorted by Count:
politics: 1200 occurrences
war: 1062 occurrences
economy: 760 occurrences
incident: 699 occurrences
not_news: 451 occurrences
entertainment: 426 occurrences
tech: 418 occurrences
sports: 324 occurrences
science: 138 occurrences
other: 37 occurrences

  4. What other hints would you suggest?
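
For reference, per-label counts like the ones listed above can be computed with a sketch like the following. It assumes each record stores its labels as a list under a `labels` key; `count_labels` is a hypothetical helper and the sample records are made up.

```python
import json
from collections import Counter


def count_labels(jsonl_lines, labels_key="labels"):
    """Count how often each label occurs across all records.

    A record carrying several labels contributes one occurrence to each.
    labels_key is an assumed field name.
    """
    counts = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # ignore blank lines
        counts.update(json.loads(line)[labels_key])
    return counts


# Made-up records, only to show the shape of the result.
sample = [
    '{"text": "a", "labels": ["politics", "war"]}',
    '{"text": "b", "labels": ["politics"]}',
    '{"text": "c", "labels": ["corruption", "politics"]}',
]
label_counts = count_labels(sample)

for label, n in label_counts.most_common():
    print(f"{label}: {n} occurrences")
```

Running this before and after merging a new category makes it easy to see how unbalanced the combined dataset has become.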
