
Topic Classification #48

Open
esthicodes opened this issue Oct 29, 2022 · 2 comments

@esthicodes
Owner

Input: News Headline

Output: Classification of News Category
Example: datasets for text classification (TC) are used to categorize natural-language texts by content, such as news articles by topic or book reviews by positive or negative sentiment. Language detection, customer-feedback organization, and fraud detection also commonly rely on TC.

Automation with machine learning models.

Category classification, for news, is a multi-label text classification problem. The goal is to assign one or more categories to a news article. A standard technique in multi-label text classification is to use a set of binary classifiers.
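As a rough illustration of the "set of binary classifiers" idea, here is a minimal sketch using scikit-learn's `OneVsRestClassifier`, which trains one binary classifier per category. The headlines and label sets are invented for the example, and the choice of TF-IDF features with logistic regression is an assumption, not something fixed by this issue.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy headlines and label sets, invented purely for illustration.
headlines = [
    "Tech firm shares jump after strong earnings",
    "Parliament votes on new budget",
    "Striker scores twice in cup final",
    "Government unveils tax plan for tech startups",
    "Football club posts record annual profit",
    "Chip maker launches faster processor",
]
labels = [
    {"tech", "business"},
    {"politics"},
    {"sport"},
    {"politics", "tech", "business"},
    {"sport", "business"},
    {"tech"},
]

# Turn each label set into a 0/1 row with one column per category.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)  # shape: (n_samples, n_categories)

# One binary logistic-regression classifier is fitted per category column.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(headlines, y)

predicted = model.predict(["Minister meets football club owners"])
# Map the 0/1 row back to category names (may be empty on such a tiny dataset).
print(binarizer.inverse_transform(predicted))
```

Each column of `y` is an independent yes/no question, so a headline can receive zero, one, or several categories.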

@esthicodes esthicodes self-assigned this Oct 29, 2022
@esthicodes
Owner Author

AutoCrawler

Google, Naver multiprocess image crawler (High Quality & Speed & Customizable)

How to use

  1. Install Chrome

  2. pip install -r requirements.txt

  3. Write search keywords in keywords.txt

  4. Run "main.py"

  5. Files will be downloaded to the 'download' directory.

Arguments

usage:

python3 main.py [--skip true] [--threads 4] [--google true] [--naver true] [--full false] [--face false] [--no_gui auto] [--limit 0]
--skip true        Skips keyword if downloaded directory already exists. This is needed when re-downloading.
--threads 4        Number of threads to download.
--google true      Download from google.com (boolean)
--naver true       Download from naver.com (boolean)
--full false       Download full resolution image instead of thumbnails (slow)
--face false       Face search mode
--no_gui auto      No GUI (headless) mode. Speeds up full-resolution mode, but is unstable in thumbnail mode.
                   Default: "auto" - false if full=false, true if full=true
                   (can be used on a Docker/Linux system)
                   
--limit 0          Maximum count of images to download per site. (0: infinite)
--proxy-list ''    The comma separated proxy list like: "socks://127.0.0.1:1080,http://127.0.0.1:1081".
                   Every thread will randomly choose one from the list.
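
To make the `--proxy-list` behaviour concrete, here is a small sketch of parsing a comma-separated proxy string and picking one entry at random, as each thread is described as doing. The helper names `parse_proxy_list` and `pick_proxy` are hypothetical and this is not AutoCrawler's actual code.

```python
import random


def parse_proxy_list(proxy_arg):
    """Split a comma-separated proxy string into a clean list (hypothetical helper)."""
    return [p.strip() for p in proxy_arg.split(",") if p.strip()]


def pick_proxy(proxies):
    """Return one proxy chosen at random, or None if the list is empty."""
    return random.choice(proxies) if proxies else None


proxies = parse_proxy_list("socks://127.0.0.1:1080,http://127.0.0.1:1081")
chosen = pick_proxy(proxies)
print(chosen)  # one of the two proxies above
```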

Full Resolution Mode

You can download full-resolution JPG, GIF, and PNG images by specifying --full true

Data Imbalance Detection

Detects data imbalance based on number of files.

When crawling ends, a message shows which directories contain fewer than 50% of the average number of files.

I recommend removing those directories and re-downloading.
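
The check described above can be sketched as follows. This is a minimal illustration under the assumption that each keyword has its own sub-directory under the download folder; the function name `find_underfilled_dirs` is hypothetical and this is not AutoCrawler's own implementation.

```python
import tempfile
from pathlib import Path


def find_underfilled_dirs(download_root, ratio=0.5):
    """Return {dir_name: file_count} for dirs with fewer than ratio * average files."""
    root = Path(download_root)
    counts = {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in root.iterdir()
        if d.is_dir()
    }
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)
    return {name: n for name, n in counts.items() if n < ratio * average}


# Demo on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for keyword, n_files in {"cat": 10, "dog": 8, "bird": 2}.items():
        folder = Path(tmp) / keyword
        folder.mkdir()
        for i in range(n_files):
            (folder / f"{i}.jpg").touch()
    underfilled = find_underfilled_dirs(tmp)

print(underfilled)  # {'bird': 2}
```

Here the average is 20 / 3 files per directory, so only `bird` falls under the 50% threshold.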

Remote crawling through SSH on your server

sudo apt-get install xvfb      # virtual display
sudo apt-get install screen    # lets you close the SSH terminal while the crawler runs
screen -S s1
Xvfb :99 -ac & DISPLAY=:99 python3 main.py

Customize

You can make your own crawler by changing collect_links.py

Issues

Because the Google site changes frequently, please open an issue if the crawler stops working.

@esthicodes
Owner Author

Text Classification of News Articles

  • Text Classification
  • Category classification, for news, is a multi-label text classification problem. The goal is to assign one or more categories to a news article. A standard technique in multi-label text classification is to use a set of binary classifiers.
  1. Know about Data
    For the task of news classification with machine learning, I have collected a dataset from Kaggle, which contains news articles including their headlines and categories.

Data Fields

Article Id – Unique ID assigned to each record
Article – Text of the headline and article
Category – Category of the article (tech, business, sport, entertainment, politics)

3. Data Cleaning and Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.
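
A quick way to check data quality before modeling is to count missing values and duplicate rows. The sketch below uses a small invented DataFrame; the column names `ArticleId`, `Text`, and `Category` are assumptions modeled on the fields and code in this comment.

```python
import pandas as pd

# Invented sample with one missing text and one duplicated row.
sample = pd.DataFrame({
    "ArticleId": [1, 2, 3, 3],
    "Text": ["shares rise", None, "team wins cup", "team wins cup"],
    "Category": ["business", "tech", "sport", "sport"],
})

missing_per_column = sample.isnull().sum()         # missing values in each column
duplicate_rows = int(sample.duplicated().sum())    # fully identical rows

# Drop rows without text, then drop exact duplicates.
cleaned = sample.dropna(subset=["Text"]).drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("duplicates:", duplicate_rows, "rows kept:", len(cleaned))
```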

4. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

5. Import Dataset
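
The loading step is not shown in this comment, so the following is only a sketch of how `dataset` might be created with `pandas.read_csv`. The file name mentioned in the code comment and the column names (`ArticleId`, `Text`, `Category`) are assumptions; an in-memory CSV stands in for the Kaggle file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV. With a real file this would be something like
# pd.read_csv("news_articles.csv"), where the file name is hypothetical.
sample_csv = io.StringIO(
    "ArticleId,Text,Category\n"
    "1,shares rise as profits beat forecasts,business\n"
    "2,new phone launches with faster chip,tech\n"
    "3,striker scores late goal in cup final,sport\n"
)
dataset = pd.read_csv(sample_csv)
print(dataset.head())
```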

6. Shape of Dataset

dataset.shape

7. Check Information of Columns of Dataset

dataset.info()

7. Count Values of Categories

dataset['Category'].value_counts()

8. Convert Categories Name into Numerical Index

# Associate category names with a numerical index, saved in the new column CategoryId
target_category = dataset['Category'].unique()
print(target_category)


# Convert categories to integer IDs
dataset['CategoryId'] = dataset['Category'].factorize()[0]
dataset.head()


9. Show Category’s Name w.r.t Category ID

The following code lists each news category name alongside its unique category ID.

# Create a new DataFrame "category" that holds only the unique categories, sorted by CategoryId
category = dataset[['Category', 'CategoryId']].drop_duplicates().sort_values('CategoryId')
category


Exploratory Data Analysis (EDA)

In data mining, Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used for seeing what the data can tell us before the modeling task. It is not easy to look at a column of numbers or a whole spreadsheet and determine important characteristics of the data. It may be tedious, boring, and/or overwhelming to derive insights by looking at plain numbers. Exploratory data analysis techniques have been devised as an aid in this situation.

Visualizing Data

The code below plots the number of news articles per category in our dataset.

dataset.groupby('Category').CategoryId.value_counts().plot(kind = "bar", color = ["pink", "orange", "red", "yellow", "blue"])
plt.xlabel("Category of data")
plt.title("Number of articles per category")
plt.show()


fig = plt.figure(figsize = (5,5))
colors = ["skyblue"]
# The IDs below follow the order in which factorize() first saw each category
business = dataset[dataset['CategoryId'] == 0]
tech = dataset[dataset['CategoryId'] == 1]
politics = dataset[dataset['CategoryId'] == 2]
sport = dataset[dataset['CategoryId'] == 3]
entertainment = dataset[dataset['CategoryId'] == 4]
count = [business['CategoryId'].count(), tech['CategoryId'].count(), politics['CategoryId'].count(), sport['CategoryId'].count(), entertainment['CategoryId'].count()]
pie = plt.pie(count, labels = ['business', 'tech', 'politics', 'sport', 'entertainment'],
              autopct = "%1.1f%%", shadow = True, colors = colors, startangle = 45,
              explode = (0.05, 0.05, 0.05, 0.05, 0.05))


10. Visualizing Category Related Words

Here we use the word cloud module to show the category-related words.

Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.

from wordcloud import WordCloud

stop = set(stopwords.words('english'))

# Select the text of each category by its CategoryId
business = dataset[dataset['CategoryId'] == 0]
business = business['Text']
tech = dataset[dataset['CategoryId'] == 1]
tech = tech['Text']
politics = dataset[dataset['CategoryId'] == 2]
politics = politics['Text']
sport = dataset[dataset['CategoryId'] == 3]
sport = sport['Text']
entertainment = dataset[dataset['CategoryId'] == 4]
entertainment = entertainment['Text']

def wordcloud_draw(dataset, color = 'white'):
    # Join all articles into one string and drop the filler words 'news' and 'text'
    words = ' '.join(dataset)
    cleaned_word = ' '.join([word for word in words.split()
                             if (word != 'news' and word != 'text')])
    wordcloud = WordCloud(stopwords = stop,
                          background_color = color,
                          width = 2500, height = 2500).generate(cleaned_word)
    plt.figure(1, figsize = (10,7))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

print("business related words:")
wordcloud_draw(business, 'white')

print("tech related words:")
wordcloud_draw(tech, 'white')

print("politics related words:")
wordcloud_draw(politics, 'white')

print("sport related words:")
wordcloud_draw(sport, 'white')

print("entertainment related words:")
wordcloud_draw(entertainment, 'white')

Show Text Column of Dataset
Show Category Column of Dataset
Remove All Tags
Remove Special Characters
Convert Everything in Lower Case
Remove all Stopwords
Lemmatizing the Words
After Cleaning Text our Dataset
Declared Dependent and Independent Value
Create and Fit Bag of Words Model
Train Test and Split the Dataset
Create Empty List
Create, Fit and Predict all ML Model
Logistic Regression
Multinomial Naive Bayes

Support Vector Machine

Decision Tree

KNN

Gaussian Naive Bayes

  • Create Dataframe of Model, Accuracy, Precision, Recall, and F1

Best Model to Perform Accuracy Score

Fit & predict best ML Model

Predict News Article
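
The remaining steps are listed above without code, so here is a compact sketch of how they could fit together: cleaning, bag of words, train/test split, fitting one model, and predicting a new headline. Everything in it is an assumption for illustration: the toy headlines, the tiny hand-picked stopword set (standing in for the NLTK list), and the choice of logistic regression as the single model. Lemmatization is left out because it needs the NLTK WordNet download.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny hand-picked stopword set (assumption; NLTK's English list is much larger).
STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "to", "for", "is", "as", "with", "after"}


def clean_text(text):
    """Remove tags and non-letter characters, lowercase, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML-like tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # remove special characters and digits
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)


# Toy (headline, category) pairs, invented for the example.
samples = [
    ("Shares rise as bank profits beat forecasts", "business"),
    ("Oil prices fall after supply report", "business"),
    ("Retail sales grow in the third quarter", "business"),
    ("Company announces merger with rival firm", "business"),
    ("Striker scores late goal in cup final", "sport"),
    ("Tennis champion wins fifth title", "sport"),
    ("Coach praises team after league victory", "sport"),
    ("Runner breaks national marathon record", "sport"),
    ("New phone launches with faster chip", "tech"),
    ("Software update fixes security flaw", "tech"),
    ("Startup builds <b>robot</b> for warehouses", "tech"),
    ("Search engine adds voice feature", "tech"),
]
texts = [clean_text(text) for text, _ in samples]
labels = [category for _, category in samples]

# Split first, then fit the bag-of-words vocabulary on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_bow, y_train)
predictions = model.predict(X_test_bow)
accuracy = accuracy_score(y_test, predictions)

# Predict the category of an unseen headline.
new_headline = clean_text("Bank shares fall after profit warning")
predicted_category = model.predict(vectorizer.transform([new_headline]))[0]
print("accuracy:", accuracy, "predicted:", predicted_category)
```

Fitting the vectorizer after the split keeps test-set vocabulary out of training, which differs slightly from the order in the list above. The other models named there (Multinomial Naive Bayes, SVM, decision tree, KNN) could be swapped in for `LogisticRegression` and compared on the same split.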
