Clustering Challenge Project ML Intern Test 2024

Overview

This repository contains the code and documentation for a clustering challenge involving a dataset of URLs, textual content, and high-dimensional embedding vectors. The objective of this project is to explore and understand the inherent groupings within the dataset using unsupervised machine learning techniques.

Approach to solve this challenge:

Load the dataset: Read the provided dataset.parquet file to examine its structure and contents.
Data Preprocessing: Ensure that the numerical data is in the correct format for analysis, handling any missing or malformed data.
Exploratory Data Analysis (EDA): Perform EDA to understand the distribution of the data and possibly reduce dimensionality for visualization.
Clustering: Apply unsupervised machine learning algorithms such as K-means or DBSCAN to segment the data.
Evaluation: Evaluate the clusters using metrics like silhouette score or Davies-Bouldin index to assess the performance of the algorithm.
Interpretation: Use the URL and text contents as supplementary information to understand the context of the clusters formed.
Further Steps: Based on the initial findings, decide on further steps to refine the model, such as feature engineering, using different algorithms, or incorporating the supplementary data into the model in some form.
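The preprocessing step above could look roughly like the sketch below. It is not taken from the notebook: the column name `embedding`, the helper name `embeddings_to_matrix`, and the rule of keeping only vectors of the most common length are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np
import pandas as pd


def embeddings_to_matrix(df, column="embedding"):
    """Return (clean_df, matrix), dropping rows with missing or malformed vectors.

    `column` is a hypothetical name; adjust it to the real schema.
    """
    def parse(value):
        # A missing cell (None/NaN) or a scalar/string is treated as malformed.
        if value is None or isinstance(value, (str, bytes, float)):
            return None
        try:
            vec = np.asarray(value, dtype=np.float64)
        except (TypeError, ValueError):
            return None
        # Keep only non-empty, one-dimensional vectors with finite entries.
        if vec.ndim != 1 or vec.size == 0 or not np.all(np.isfinite(vec)):
            return None
        return vec

    parsed = [parse(v) for v in df[column]]
    lengths = [len(v) for v in parsed if v is not None]
    if not lengths:
        raise ValueError("no usable embedding vectors found")

    # Assumption: the most common vector length is the correct dimensionality.
    dim = Counter(lengths).most_common(1)[0][0]
    keep = np.array([v is not None and len(v) == dim for v in parsed])
    matrix = np.vstack([v for v, k in zip(parsed, keep) if k])
    return df.loc[keep].reset_index(drop=True), matrix
```

With the real data this would typically be called as `clean_df, X = embeddings_to_matrix(pd.read_parquet("dataset.parquet"))`, provided a parquet engine such as pyarrow is installed.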

Project Structure

The project is organized as follows:

data/: Directory containing the dataset files, including the original dataset.csv and dataset.parquet, as well as cluster-specific data files.
Sample cluster files: Generated in Step 6.
MachineLearning_Test.ipynb: The notebook to open in order to run the code.
Make sure to install all the required libraries (listed in requirements.txt) before running the notebook.
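For illustration, per-cluster sample files like the ones in this repository could be produced along the following lines. This is a sketch under assumptions, not the notebook's code: the label column name `cluster`, the sample size, and the helper name `sample_per_cluster` are hypothetical.

```python
import pandas as pd


def sample_per_cluster(df, label_col="cluster", n=5, seed=0):
    """Return {cluster_id: DataFrame} with up to n random rows per cluster.

    `label_col` is a hypothetical column holding the assigned cluster label.
    """
    samples = {}
    for cluster_id, group in df.groupby(label_col):
        # Small clusters contribute all of their rows.
        samples[cluster_id] = group.sample(n=min(n, len(group)), random_state=seed)
    return samples
```

Each sample could then be written out with something like `samples[cid].to_csv(f"cluster_{cid}_sample.csv", index=False)`.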

Key Steps and Methodology

The project includes the following key steps and methodologies:

Data Loading and Preprocessing: The dataset was loaded from dataset.parquet, and data preprocessing steps were applied to handle missing values and ensure data consistency.
Dimensionality Reduction: Incremental PCA was used to reduce the dimensionality of the high-dimensional embedding vectors while retaining meaningful information.
Text Vectorization: TF-IDF vectorization, combined with PCA, was applied to the text content of each URL to convert the text into numerical features.
Clustering: Mini-Batch K-Means clustering was employed to group similar URLs together based on the reduced data.
Evaluation: The silhouette score was used as an evaluation metric to assess the quality of the clustering results.
Visualization: UMAP was used for dimensionality reduction and visualization to gain insights into the cluster structure.
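The steps above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the notebook's actual code: the function names, the component counts, the batch size, and the number of clusters are placeholder assumptions, and how the text features were combined with the embeddings is not stated here, so the two feature sets are kept separate.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def reduce_embeddings(embeddings, n_components=8, batch_size=100):
    """Project embeddings to n_components dimensions, fitting PCA in mini-batches."""
    ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
    return ipca.fit_transform(embeddings)


def text_features(texts, max_features=500, n_components=2):
    """TF-IDF vectors compressed with PCA.

    The dense conversion is only reasonable for small samples (assumption).
    """
    tfidf = TfidfVectorizer(max_features=max_features).fit_transform(texts)
    ipca = IncrementalPCA(n_components=n_components)
    return ipca.fit_transform(tfidf.toarray())


def cluster_and_score(features, n_clusters=10, seed=0):
    """Cluster with Mini-Batch K-Means and return (labels, silhouette score)."""
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    labels = km.fit_predict(features)
    return labels, silhouette_score(features, labels)
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. On a large dataset, passing `sample_size` to `silhouette_score` keeps the evaluation affordable.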

Results

The clustering experiments resulted in moderate cluster separation, as indicated by the silhouette score.
Visualizations generated using UMAP provided a 2D representation of the data, offering insights into the clustering effectiveness.

Future Work

Further parameter tuning and experimentation with alternative clustering algorithms could improve cluster quality.
Exploring advanced text vectorization techniques and feature engineering may enhance clustering results.
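As one possible starting point for the parameter tuning mentioned above, the number of clusters can be scanned and scored with both the silhouette score and the Davies-Bouldin index. The helper name `scan_cluster_counts` and the candidate values are assumptions for illustration, not part of the existing notebook.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score


def scan_cluster_counts(features, k_values, seed=0):
    """Fit Mini-Batch K-Means for each k (k >= 2) and collect both metrics."""
    results = []
    for k in k_values:
        km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=5)
        labels = km.fit_predict(features)
        results.append({
            "k": k,
            # Higher is better, range [-1, 1].
            "silhouette": silhouette_score(features, labels),
            # Lower is better, minimum 0.
            "davies_bouldin": davies_bouldin_score(features, labels),
        })
    return results
```

A candidate k can then be chosen with, for example, `max(results, key=lambda r: r["silhouette"])`, ideally cross-checked against the Davies-Bouldin index and a visual inspection of the clusters.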
