Automatic classification of consumer goods from text and images

Problem

The aim of this project is to automate the task of assigning an item's category, based on a photo and description of the item for sale, submitted by a seller to an e-commerce marketplace.

The volume of items is currently very small, and each item's category is assigned manually by the seller, which makes the categories unreliable.

The automation of this task is necessary:

To improve the user experience of sellers (facilitate the posting of new articles)
To improve the user experience of buyers (facilitate the search for products)
To scale up to millions of items.

Motivation

This is project 6 for the Master in Data Science (in French, BAC+5) from OpenClassrooms.

The project demonstrates the feasibility of automatically grouping same-category products:

pre-processing product descriptions and images
extraction of features, from the processed data or its embedding within a model
dimension reduction techniques
clustering, confirmed by similarity between real categories and clusters.
visualization of clusters of products

Requirements

To run the notebooks, the dataset must be placed in a DATA_FOLDER ('data/raw'). Python libraries are listed in requirements.txt. Each notebook also includes a list of its own requirements, and a procedure for pip install of any missing libraries.

Data: A first dataset (~330 MB) of 1050 articles, each with a photo and an associated description: the link to download

Python libraries :

numpy, pandas, matplotlib, seaborn, scikit-learn, tensorflow, yellowbrick
text : nltk, gensim, transformers, tensorflow_hub, tensorflow_text, wordcloud
images : pillow, opencv-contrib-python, tensorflow, plotly, kaleido, pydot, graphviz

Files

Notes : Files are in French. As requested for the project, the jupyter notebooks have not been "cleaned up" : the focus is the practice of techniques for pre-processing, setting up, tuning, visualising and evaluating text/image machine learning and deep learning algorithms.

Custom functions created in this project for data pre-processing, statistical analysis and data visualization are encapsulated within each notebook, to avoid importing and versioning custom libraries. If GitHub takes too long to render a notebook, open https://nbviewer.org/ and paste the notebook's GitHub URL.

P6_01_text_nlp.ipynb: Text classification techniques.
P6_02_image_classification.ipynb: Image classification, plus classification when combined with text features.
P6_03_support.pdf: Presentation and conclusion

Summary of approaches

Note : The quality of pre-processing of images and text descriptions has a huge impact on the performance of the models

Unsupervised, semi-supervised and supervised classification techniques were used for product categorization

based only on product text descriptions
based only on product images
combining features extracted from both text and images

Text classification (Natural Language Processing) was undertaken using:

Bag of Words (BoW): word count and TF-IDF vectorization, with n-grams
Topic Modelling using Latent Dirichlet Allocation
Word Embedding using Word2Vec pre-trained models
Word Embedding using deep learning (contextual embeddings from pretrained deep learning models: BERT (HuggingFace transformers), Universal Sentence Encoder)
Keras (supervised) word embedding: train-test split, demonstrating overfitting of the training data.

Image classification (Computer Vision) was performed:

image feature extraction : bag of visual features (SIFT, ORB) ; Visual feature vectors
supervised training on simple Convolution Neural Networks (CNN)
semi-supervised on VGG16 pretrained (ImageNet - 1000 features)
unsupervised (on VGG16 features minus 2 layers)
supervised transfer learning, with fine-tuning
regularization through the use of image augmentation and dropout layers

Combined text and image features were used to improve the final product categorization

classification	description
unsupervised	K-means clustering after feature selection and dimension reduction, selecting the number of categories which provides the most distinct clusters
semi-supervised	the number of clusters was fixed (K=7)
supervised	the (labelled) categorised data was split into train, test and validation sets, to learn the features of each category.

Supervised classification was conducted using neural network models:

shallow neural networks were created to quickly test the impact of pre-processing of text and images, and regularization mechanisms
deep neural networks were trained on the best pre-processing models

Visualising the work undertaken:


Text classification (NLP)

Text pre-processing

Cleaning ("stop phrases", tokenization, stemming, stop words identified by low IDF)
Lemmatization (removes context, so excluded from sentence embedding models)

Extraction of text features:

Bag of Words : word count and TF-IDF vectorization (Term Frequency–Inverse Document Frequency)
- Tuning with use of n-grams and regex patterns

Topic Modelling : Latent Dirichlet Allocation (LDA)

To identify the most suitable category names, (semi-supervised) topic modelling was applied to the Bag-of-Words features. The TF-IDF vectorization provided a good correlation between the discovered topics and the existing 7 categories.

Topic visualization with word clouds

Dimensionality Reduction using PCA and t-SNE

The extracted features (for example, word frequencies) were reduced by principal component analysis (PCA), keeping 99% of explained variance, before applying t-distributed stochastic neighbor embedding (t-SNE) to reduce to two dimensions.
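
A minimal sketch of this two-stage reduction with scikit-learn follows. The feature matrix is synthetic, and the t-SNE settings (`perplexity`, `init`) are assumptions rather than the values used in the notebooks.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for extracted features: 200 items, 50 features,
# generated from 5 latent factors plus a little noise
latent = rng.normal(size=(200, 5))
features = latent @ rng.normal(size=(5, 50)) + 0.05 * rng.normal(size=(200, 50))

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.99, svd_solver="full")
reduced = pca.fit_transform(features)
print(pca.n_components_, pca.explained_variance_ratio_.sum())

# t-SNE then projects the PCA output to two dimensions for plotting
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(reduced)
print(embedding.shape)
```

Running PCA first removes low-variance directions and makes the t-SNE step cheaper, which is the usual reason for chaining the two.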

Optimal number of categories (unsupervised clustering using Kmeans)

K-means clustering was applied to identify the clusters, with the number of clusters ranging from 4 to 12.

Automatic classification works best when the categories are clearly separated.

elbow of distortion score
high silhouette score
low davies-bouldin score

Unsupervised classification produced most clearly separated clusters with 7 categories.
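
The selection of the number of clusters can be sketched as a loop over K that records the three criteria listed above. The data here is synthetic (7 well-separated blobs, an assumption made so the example has a clear answer); the notebooks use yellowbrick for the elbow plot, whereas this sketch only uses scikit-learn metrics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic 2-D data: 7 blobs placed on a circle so they are clearly separated
angles = np.linspace(0, 2 * np.pi, 7, endpoint=False)
centers = 10 * np.c_[np.cos(angles), np.sin(angles)]
X, _ = make_blobs(n_samples=700, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(4, 13):  # K from 4 to 12 inclusive
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "distortion": km.inertia_,                              # look for the elbow
        "silhouette": silhouette_score(X, km.labels_),          # higher is better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),  # lower is better
    }

best_k = max(scores, key=lambda k: scores[k]["silhouette"])
for k, s in scores.items():
    print(k, round(s["distortion"], 1), round(s["silhouette"], 3), round(s["davies_bouldin"], 3))
print("best K by silhouette:", best_k)
```

On real product features the three criteria may disagree, in which case they are best read together rather than relying on a single score.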

Evaluation of semi-supervised clustering (k=7)

The performance of each model was evaluated by the multicategory confusion matrix, from which we can calculate, for each category:

precision
recall
accuracy

These can be summarised in the classification report, and visualised in a Sankey Diagram

Adjusted Rand Index (ARI)

measure of similarity between predicted and actual categories
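
The sketch below shows how ARI and an accuracy figure can both be computed for a clustering. ARI does not depend on how cluster ids are numbered, but accuracy does, so the clusters are first matched to categories. The matching method used here (maximising the diagonal of the confusion matrix with `scipy.optimize.linear_sum_assignment`) is an assumption, not necessarily the procedure used in the notebooks, and `match_clusters` is a hypothetical helper. Labels and cluster ids are assumed to be integers 0..K-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix


def match_clusters(y_true, y_pred):
    """Relabel cluster ids so that they agree as much as possible with y_true."""
    cm = confusion_matrix(y_true, y_pred)    # rows: categories, columns: clusters
    rows, cols = linear_sum_assignment(-cm)  # negate to maximise matched counts
    mapping = dict(zip(cols, rows))          # cluster id -> category id
    return np.array([mapping[c] for c in y_pred])


# Toy example: 3 categories, cluster ids permuted and one item misplaced
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

ari = adjusted_rand_score(y_true, y_pred)  # invariant to cluster numbering
aligned = match_clusters(y_true, y_pred)
accuracy = accuracy_score(y_true, aligned)

print("ARI:", round(ari, 3))
print("accuracy after matching:", round(accuracy, 3))
print(confusion_matrix(y_true, aligned))
```

The aligned labels can also be passed to `sklearn.metrics.classification_report` to obtain per-category precision and recall.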

Word Embedding (Word2Vec)

Word embedding using Word2Vec is based on skipgrams: words found close together sequentially tend to be closely related, and so will have similar feature vectors. Clustering of word vectors (after dimensionality reduction by PCA and t-SNE) gives the following most frequent words, coloured by cluster:

Sentence embedding using BERT and the Universal Sentence Encoder (USE)

BERT (Bidirectional Encoder Representations from Transformers) and USE models were tested by supplying unlemmatized descriptions to the pretrained models.

Despite being deep learning models, and taking time to process the embedded words, the results were less impressive than the simpler text models.

This may be because the product descriptions are mostly not sentences, but are often generated from key-value pairs of product characteristics. Using skipgrams, keys such as {color, length, width, height, quantity, ...} may add noise rather than context. By contrast, these words have little weight in TF-IDF vectorization.

Supervised text classification

TensorFlow was used to test supervised classification (data split: 80% train, 20% test), improving the results to close to 90% accuracy on the test set after 10 epochs. However, these models overfit to the 7 categories, and are unlikely to be useful for new product categories.

Conclusion on Text classification

Based solely on text descriptions, clustering on TF-IDF features gave the best similarity with the labelled categories.

Image Classification (Computer Vision)

Pre-processing of images

Images were adjusted for

exposure
equalization of histogram
noise filters
colour/greyscale
resize
normalization of values to between -1 and 1
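
Two of the steps above, histogram equalization and normalization to [-1, 1], can be sketched in plain NumPy. This is only an illustration on a synthetic low-contrast image; the notebooks list pillow and opencv among their requirements, and the helper names `equalize_histogram` and `normalize` are hypothetical.

```python
import numpy as np


def equalize_histogram(gray):
    """Histogram equalization for a uint8 greyscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest level present
    denom = max(cdf[-1] - cdf_min, 1)  # avoid division by zero on flat images
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[gray]                   # apply the lookup table pixel by pixel


def normalize(img):
    """Scale uint8 pixel values from [0, 255] to [-1, 1]."""
    return img.astype(np.float32) / 127.5 - 1.0


rng = np.random.default_rng(0)
# Synthetic low-contrast image: grey levels restricted to [60, 120)
gray = rng.integers(60, 120, size=(32, 32), dtype=np.uint8)

equalized = equalize_histogram(gray)
scaled = normalize(equalized)

print(gray.min(), gray.max())            # narrow input range
print(equalized.min(), equalized.max())  # stretched to the full range
print(scaled.min(), scaled.max())        # rescaled for the network input
```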

Image feature extraction : bag of visual features (SIFT, ORB)

SIFT (Scale-Invariant Feature Transform)
ORB (Oriented FAST and Rotated BRIEF)

Clustering of products after dimension reduction via PCA/t-SNE was not very clear.

Convolution Neural Networks for Image feature extraction

A simple convolution neural network composed of 2 convolution layers (with maxpooling), a dropout layer for regularization, a flattening layer and 2 dense layers was used to quickly test pre-processing pipelines, and evaluate the effect of regularization (~1 million parameters, training times of a few seconds).
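
A network of this shape can be sketched in Keras as below. The layer order follows the description above, but the input size, filter counts, dense width and dropout rate are assumptions chosen to land near the stated order of magnitude of parameters; they are not taken from the notebooks.

```python
from tensorflow.keras import layers, models


def build_simple_cnn(input_shape=(64, 64, 3), n_classes=7, dropout=0.5):
    """Two conv + max-pooling blocks, dropout, flatten, two dense layers."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(dropout),                        # regularization
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),  # one probability per category
    ])


model = build_simple_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

Most of the parameters sit in the first dense layer after flattening, so the input size and pooling choices drive the model size far more than the convolution filters do.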

For this particular problem, better results were obtained from CNN deep learning models, pretrained on millions of images

The deep learning convolution neural network VGG-16 model (2014) was used in this project. However, it can easily be replaced by other models such as ResNet (2015), Inception-V3 (2015), or EfficientNet (2019) for example.

TensorFlow provides these deep learning models, pre-trained for 1000 categories using the ImageNet dataset (14 million labelled images).

Semi-supervised classification using CNN pre-trained deep learning model

The VGG-16 pretrained model (ImageNet weights) was used to predict the probability of each image belonging to each of the 1000 ImageNet categories. These 1000 features were reduced in dimension by PCA followed by t-SNE, using the same procedure as for text classification. The result was an ARI score of 0.38, corresponding to an accuracy of around 60%.

Unsupervised classification using CNN pre-trained deep learning model

To improve classification, the last two layers were removed, leaving 4096 underlying features instead of the 1000 categories. Applying dimension reduction and K-means clustering resulted in an ARI score of 0.53, equivalent to an accuracy of around 70%

Simple supervised classification

a simple convolution network was trained on the images.
overfitting was observed, so image augmentation and a dropout layer were added

Transfer Learning (supervised classification)

the dense layers were removed and replaced with a flattening layer and new dense layers, along with a final softmax function to choose between the 7 categories.
the convolution layers were kept and their pre-trained weights were frozen to avoid losing the pretrained image features
fine tuning was applied by adjusting only the weights in the new dense and softmax layers, whilst freezing the pre-trained weights in the convolutional layer
categorical crossentropy was used as the loss function
the Adam optimization algorithm (an extension of stochastic gradient descent) was used for fast optimization
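
The steps above can be sketched in Keras as follows. The size of the new dense layer and the dropout rate are assumptions, and `build_transfer_model` is a hypothetical helper. The example passes `weights=None` so that it runs without downloading the ImageNet weights; real use would keep the default `weights="imagenet"`.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16


def build_transfer_model(n_classes=7, input_shape=(224, 224, 3), weights="imagenet"):
    # Convolution base only: include_top=False drops the original dense layers
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze the pretrained convolution weights

    inputs = layers.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)  # new dense layer (assumed width)
    x = layers.Dropout(0.5)(x)                   # regularization (assumed rate)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


# weights=None skips the download for this demonstration only
model = build_transfer_model(weights=None)
print(len(model.trainable_weights))  # only the new dense layers remain trainable
```

Because the base is frozen, `model.fit` would update only the kernels and biases of the two new dense layers.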

Summary of image classification results

feature extraction using SIFT and ORB was not very successful
unsupervised classification using features after removing the last 2 layers of VGG-16 gave the best results for classification based solely on the images

type	model	ARI score
semi-supervised	SIFT	0.05
semi-supervised	ORB	0.04
unsupervised	VGG-16 pretrained (1000 features)	0.38
unsupervised	VGG-16 pretrained, last 2 layers removed (4096 features)	0.53
supervised	Transfer Learning	0.45
supervised	Transfer Learning after fine tuning	0.50

Text and Image features combined for unsupervised classification

The best results were obtained by combining the best text features with the best image features, resulting in an accuracy of 84%
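
How the two feature sets are combined is not detailed here. One simple possibility, shown below as an assumption rather than the project's actual method, is to standardise each block of features and concatenate them column-wise before clustering. The feature matrices are synthetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: one row per product in both matrices
text_features = rng.normal(size=(100, 20))                    # e.g. reduced text features
image_features = rng.normal(loc=5, scale=10, size=(100, 40))  # e.g. reduced image features

# Standardise each block so that neither modality dominates through its scale
text_scaled = StandardScaler().fit_transform(text_features)
image_scaled = StandardScaler().fit_transform(image_features)

# Column-wise concatenation: one combined feature vector per product
combined = np.hstack([text_scaled, image_scaled])
print(combined.shape)
```

The combined matrix can then go through the same dimension reduction and K-means steps as the single-modality features.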

This can probably be improved by using more recent deep learning models for word embedding and image transfer learning.

Conclusion

The images are visualised on the t-SNE axes for the final model (text and image features combined):

Possible Improvements

Remove as much noise as possible in the text descriptions (example: stop phrases) and in the images (equalization and noise filters) :
- pre-processing has a major impact on the performance
Replace VGG-16 with more recent (pre-trained) deep learning models that are faster, more efficient and more accurate
Try out different dense layer and regularization mechanisms
Adjust the learning rate during fine-tuning of the transfer learning model
Add the text features extracted by TF-IDF as inputs to the deep learning model, alongside the image features extracted by convolution layers, before fine-tuning the weights of the final dense layers and the softmax layer.
Alternatively, fine tune a Keras word embedding model, then extract the features from one of the final layers as input to the image dense layers

Features (keywords)

text classification, natural language processing (NLP)

text pre-processing : stop phrases, tokenization, stopword, lemmatization
text feature extraction : bag of words (Count, TF-IDF vectorization, n-grams, regex patterns)
topic modelling : LDA – Latent Dirichlet Allocation
topic visualization : wordClouds
word vectors : Word2Vec, skip-grams
word embedding : contextual embeddings, deep learning, BERT, HuggingFace transformers, Universal Sentence Encoder
Keras word embedding : train-test split, overfitting, variance-bias, regularization, validation set

image classification, computer vision (CV)

image pre-processing : resize, colour/greyscale, exposure, equalization, noise filters, squarify, normalization
image feature extraction : bag of visual features (SIFT, ORB), visual feature vectors
convolution neural networks (CNN) : VGG16 pretrained (ImageNet) – semi-supervised (1000 features), unsupervised (features minus 2 layers)
supervised image classification : transfer learning, fine-tuning
deep learning : pooling layers, dense layers, activation layers (reLu, softmax)
regularization : image augmentation, dropout layers

dimensionality reduction

PCA, t-SNE

K-means clustering

silhouette score, distortion, intra/inter cluster variance, davies-bouldin score,
cluster similarity, adjusted rand index (ARI), multiclass confusion matrix,
precision, recall, f1-score, classification report, sankey diagrams

Skills acquired

Preprocess text data to obtain a usable dataset for Natural Language Processing
Unsupervised text classification and topic modelling techniques
Preprocess image data to obtain a usable dataset for Computer Vision
Implement dimension reduction techniques
Represent large-scale data graphically


mrcreasey/oc-ds-p6-nlp-computer-vision
