
Code for the AP project for the class 'Advanced Python for NLP' on Russian noun clustering.


Dimensionality reduction on Russian noun embedding clusters

by Anna Stein

Project description:

Dimensionality reduction techniques are commonly used to reduce the dimensions of data before clustering, since high-dimensional data usually leads to worse clustering results. However, in the process of reducing the data, some information may be lost, which can degrade the performance of the clustering algorithms. This project investigates the effect of Principal Component Analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) on the performance of the clustering algorithms k-means and DBSCAN. Results show that DBSCAN may be more susceptible to the number of dimensions of the input data when it is produced by t-SNE. No such effect is found for k-means. Overall, the number of PCA components (at 95% and 99% explained variance) does not affect the performance of the clustering algorithms as strongly as the dimensionality reduction by t-SNE does. Further, more detailed research is needed to confirm these findings.

See repository structure
.
├── LICENSE
├── README.md
├── REQUIREMENTS.txt
├── src
├── scripts
├── data
├── notebooks
└── figures

Getting the code

Either clone the git repository:

git clone git@github.com:ansost/NoClu.git

Or download a zip archive.

Requirements

See requirements.txt for a full list of requirements. The fastest way to install the requirements is using pip and a virtual environment (like venv).

Make sure to substitute <name_of_venv> with an actual name for your environment.

python3 -m venv <name_of_venv>
source <name_of_venv>/bin/activate
pip install -r requirements.txt

Software implementation

All source code used to generate the results and figures in this paper is in the src/ and scripts/ folders. The calculations and figure generation are run as Python scripts with Python 3.8.10.

This repository uses pre-commit hooks. Learn more about them and how to install/use them here: https://pre-commit.com/.

Two optional scripts for producing plots are run in Jupyter notebooks.

Data

The primary data source is a pre-trained fastText model with word embeddings for Russian noun cases. More information on the data used can be found in the preprocessing script (scripts/preprocess.py).

Since the model is very large and currently stored in git large file storage, please contact the author if you would like to use it.

Preprocessing

Filter syncretic forms from the word embeddings and extract the vectors for the non-syncretic forms. Also, gather the gold labels.

python3 preprocess.py
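For readers unfamiliar with the term: syncretic forms are surface forms that are ambiguous between several cases, so they cannot serve as gold-labeled data points. The filtering idea can be illustrated with a minimal, self-contained sketch; the form/case pairs and the helper name below are invented for illustration, and the actual logic lives in scripts/preprocess.py:

```python
from collections import defaultdict

def filter_syncretic(form_case_pairs):
    """Keep only forms that map to exactly one case (non-syncretic).

    form_case_pairs: iterable of (form, case) tuples.
    Returns a dict {form: case} with syncretic forms removed.
    """
    cases = defaultdict(set)
    for form, case in form_case_pairs:
        cases[form].add(case)
    return {form: next(iter(c)) for form, c in cases.items() if len(c) == 1}

# Toy sample (case labels simplified for the example):
pairs = [
    ("stola", "gen_sg"),
    ("stola", "gen_sg"),
    ("knigi", "gen_sg"),   # syncretic: also nominative plural
    ("knigi", "nom_pl"),
    ("knige", "dat_sg"),
]
print(filter_syncretic(pairs))  # {'stola': 'gen_sg', 'knige': 'dat_sg'}
```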

Dimensionality Reduction

Note that you must navigate to the scripts/ folder to run this script and the ones in the following sections.

Use just PCA or PCA followed by t-SNE to reduce the dimensions of the vectors. See the docstring of the script and the top of the config file (data/config_files/npclu.py) for more information on the input parameters.

Note that all computations involving t-SNE may take a long time to run (1h+).

python3 reduce.py
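The reduction pipeline can be approximated with scikit-learn. The following is an illustrative sketch on random placeholder vectors, not the actual reduce.py; the variance threshold and t-SNE parameters are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 300))  # stand-in for the fastText vectors

# PCA keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95, random_state=0)
reduced = pca.fit_transform(vectors)

# Optionally follow with t-SNE down to 2 dimensions (slow on real data)
embedded = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(reduced)
print(reduced.shape, embedded.shape)
```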

Clustering and Evaluation

Cluster the low-dimensional data using k-means and DBSCAN. Evaluate the results using a maximum-flow minimum-cost algorithm implemented in networkx.

python3 execute.py

Results are saved in data/clustering_output/ and data/result.csv. An overview is printed in the command line.
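One way such a max-flow min-cost evaluation can be set up (a sketch on toy data, not the repository's execute.py; the graph construction and scoring are assumptions): predicted clusters and gold labels form the two sides of a bipartite graph, edge costs are negative overlaps, and the min-cost solution picks the one-to-one matching that explains the most points.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in for the reduced noun vectors and their gold case labels.
X, gold = make_blobs(n_samples=90, centers=3, random_state=0)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Bipartite graph: source -> predicted clusters -> gold labels -> sink.
G = nx.DiGraph()
for p in np.unique(pred):
    G.add_edge("s", f"p{p}", capacity=1, weight=0)
for g in np.unique(gold):
    G.add_edge(f"g{g}", "t", capacity=1, weight=0)
for p in np.unique(pred):
    for g in np.unique(gold):
        overlap = int(np.sum((pred == p) & (gold == g)))
        G.add_edge(f"p{p}", f"g{g}", capacity=1, weight=-overlap)

flow = nx.max_flow_min_cost(G, "s", "t")
matched = sum(
    -G[u][v]["weight"]
    for u, targets in flow.items()
    for v, f in targets.items()
    if f and u.startswith("p") and v.startswith("g")
)
print("matched points:", matched, "of", len(gold))
```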

Plotting

These are optional scripts and notebooks that can be used to reproduce the plots in the report. They are not needed to run the clustering and evaluation.

Run the following script to plot the results of the clustering and evaluation as a bar and line plot. The absolute value of the cost is displayed on the y-axis, while the number of clusters (for k-means) or the epsilon value (for DBSCAN) is displayed on the x-axis. The script output is saved in the 'figures/' folder.

python3 plot_results.py

The two notebooks in the notebooks/ folder can be used to plot elbow plots to find the optimal number of clusters for k-means and the optimal epsilon value for DBSCAN. Additionally, there is a notebook for finding the optimal number of components for PCA.
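The elbow heuristic behind those notebooks can be sketched without plotting: compute the k-means inertia (within-cluster sum of squares) over a range of k and look for the point where the curve stops dropping sharply. This is a toy example with assumed parameters, not the notebooks themselves:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Inertia for each candidate k; the "elbow" in these values suggests
# a good number of clusters (here, around k=3 for 3 generated blobs).
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 8)
}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```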

License

All source code is made available under a BSD 3-clause license. You can freely use and modify the code without warranty if you provide attribution to the authors. See LICENSE.md for the full license text. The project report and slide presentation content are not open source. The author reserves the rights to the content.


If you are having problems with anything regarding this repository, please write me an email: anna.stein@hhu.de