wmdecompose

wmdecompose is a Python package for leveraging the inherent interpretability of Word Mover's Distance (WMD), which to date has not been enabled in any of the existing implementations of the framework.

The main idea behind wmdecompose is that the WMD between a pair of documents or two sets of documents can be decomposed into word-level distances. This way, we gain an understanding of not just the overall semantic distance between two sets of documents, but also of the distance that individual words contribute to the total WMD. Furthermore, the package allows users to cluster words, so that word-level distances can be understood through coherent and interpretable wholes.

The package was originally published as part of the paper "WMDecompose: A Framework for Leveraging the InterpretableProperties of Word Mover’s Distance in Sociocultural Analysis" as part of the SIGHUM workshop of the 2021 EMNLP conference. Notebooks and scripts for replicating the paper can be found in the paper subfolder of this repo.

Installation

To install wmdecompose, pull this repository from github:

git clone https://github.com/maybemkl/wmdecompose.git

Then, make sure you are in the root folder of the local copy of the repository and run the following command, preferably inside a virtual environment:

pip install .

Examples

Using the well-known review dataset from Yelp, filtered to reviews written after 2017 in Atlanta or Portland and with one or five stars, wmdecompose gives the following words as top contributors to the overall WMD when moving from positive to negative reviews (left column) and vice versa (right column):

Clustering the word-level distances using t-SNE and K-means, we get the following cluster-level distances, with cluster keywords organized by intra-cluster word-level distances for positive to negative (top) and negative to positive (bottom):

For more examples, see the wmdecompose paper.

Usage

For detailed code examples on how to use the package, see the notebooks for the wmdecompose paper.

Citation

When using the wmdecompose package, please cite the original paper:

@inproceedings{Brunila2021WMDecompose,
    title = {WMDecompose: A Framework for Leveraging the Interpretable Properties of Word Mover’s Distance in Sociocultural Analysis},
    booktitle = {Proceedings of the {Fifth} {Workshop} on {Computational} {Linguistics} for {Cultural Heritage}, {Social Sciences}, {Humanities} and {Literature}},
    publisher = {Association for Computational Linguistics},
    author = {Brunila, Mikael and LaViolette, Jack},
    month = nov,
    year = {2021}
}

Name		Name	Last commit message	Last commit date
Latest commit History 259 Commits
dist		dist
paper		paper
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dist

dist

paper

paper

src

src

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

pyproject.toml

pyproject.toml

requirements.txt

requirements.txt

setup.cfg

setup.cfg

setup.py

setup.py

Repository files navigation

wmdecompose

Installation

Examples

Usage

Citation

About

Releases

Packages

Contributors 2

Languages

License

maybemkl/wmdecompose

Folders and files

Latest commit

History

Repository files navigation

wmdecompose

Installation

Examples

Usage

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages