LINSPECTOR WEB

LINSPECTOR (Language Inspector) is a multilingual inspector for analyzing word embeddings in a web-based application. Our goal is to provide researchers with an easily accessible tool to gain quick insights into their word embeddings, especially outside of the English language. To do this, we employ simple classification tasks called probing tasks for a diverse set of languages.

linspector.ukp.informatik.tu-darmstadt.de

Citation

Please use the following citation:

@inproceedings{eichler-etal-2019-linspector,
    title = "{LINSPECTOR} {WEB}: A Multilingual Probing Suite for Word Representations",
    author = {Eichler, Max  and
      {\c{S}}ahin, G{\"o}zde G{\"u}l  and
      Gurevych, Iryna},
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D19-3022",
    doi = "10.18653/v1/D19-3022",
    pages = "127--132",
}

Abstract: We present LINSPECTOR WEB, an open source multilingual inspector to analyze word representations. Our system provides researchers working in low-resource settings with an easily accessible web based probing tool to gain quick insights into their word embeddings especially outside of the English language. To do this we employ 16 simple linguistic probing tasks such as gender, case marking, and tense for a diverse set of 28 languages. We support probing of static word embeddings along with pretrained AllenNLP models that are commonly used for NLP downstream tasks such as named entity recognition, natural language inference and dependency parsing. The results are visualized in a polar chart and also provided as a table. LINSPECTOR WEB is available as an offline tool or at https://linspector.ukp.informatik.tu-darmstadt.de.

Contact Person: Gözde Gül Şahin, sahin@ukp.informatik.tu-darmstadt.de

https://www.ukp.tu-darmstadt.de

https://www.tu-darmstadt.de

Overview

  • inspector/ is a Django application structured similarly to the official documentation.
  • inspector/nn/ contains an AllenNLP based evaluation suite.

Installation

LINSPECTOR is hosted at linspector.ukp.informatik.tu-darmstadt.de, but you can also run a local copy.

  1. Clone this repository.

     git clone https://github.com/UKPLab/linspector-web.git
    
  2. Create a virtual environment using Python 3.6.x.

     pip install virtualenv
     cd linspector-web/
     # As of Python 3.7.3 there is a bug using Eventlet with pathlib
     virtualenv linspectorenv -p python3.6
     source linspectorenv/bin/activate
    
  3. Install requirements.

     pip install -r requirements.txt
    
  4. Run migrations and load fixtures.

     ./manage.py migrate
     ./manage.py loaddata languages probing_tasks
    
  5. Download Bootstrap (4.3) Sass files to inspector/static/inspector/bootstrap/scss/.

  6. Compile Sass file.

     npm install sass postcss-cli autoprefixer
     sass --no-source-map inspector/static/inspector/scss/custom.scss inspector/static/inspector/custom.css
     npx postcss inspector/static/inspector/custom.css --use autoprefixer --replace
    
  7. Install a Celery-supported broker; we use RabbitMQ with Eventlet as the execution pool.

  8. Add training data to media/intrinsic_data/ (static probing tasks) and media/intrinsic_context_data (contextual probing tasks); see Intrinsic Data below.

  9. Start the server (activate virtualenv for Celery and Django).

     rabbitmq-server
     celery -A linspector worker -l info -P eventlet
     ./manage.py runserver
    
  10. Open localhost:8000 in your browser.

Notes

When using Celery without Eventlet as the execution pool, there can be issues running AllenNLP.

In a production environment you might encounter performance and/or stability issues with SQLite. We recommend PostgreSQL.
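
For orientation, here is a minimal sketch of pointing Django at PostgreSQL instead (the database name, user, and password are placeholders, the settings path is assumed to be linspector/settings.py, and the psycopg2 driver has to be installed separately):

# linspector/settings.py (sketch; adjust to your deployment)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'linspector',     # placeholder database name
        'USER': 'linspector',     # placeholder user
        'PASSWORD': 'change-me',  # placeholder password
        'HOST': 'localhost',
        'PORT': '5432',
    }
}

After switching the backend, rerun ./manage.py migrate and ./manage.py loaddata languages probing_tasks against the new database.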

Probing Tasks

Probing tasks are simple classification tasks that aim to gain insights into the information encoded in embeddings. Our work focuses on word embeddings but should be extendable to other embedding types.

See Conneau et al. (2018) or Şahin et al. (2019) to learn more.
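
To make the idea concrete, here is a minimal, self-contained sketch of a probing task: a linear classifier trained on word vectors to predict a label such as Case Marking. It uses scikit-learn purely for illustration and is not the AllenNLP based evaluation suite in inspector/nn/.

# Minimal probing-task sketch (illustration only, not the inspector/nn/ suite).
# Assumes `embedding` maps tokens to vectors and `train` / `test` hold (token, label) pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe(embedding, train, test):
    """Train a linear probe on word vectors and report test accuracy."""
    X_train = np.stack([embedding[token] for token, _ in train])
    y_train = [label for _, label in train]
    X_test = np.stack([embedding[token] for token, _ in test])
    y_test = [label for _, label in test]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)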

Intrinsic Data

In the training data, each line consists of a token and a label separated by whitespace, e.g. Klammeraffe Noun.

We are using intrinsic data provided by Şahin et al. (2019) with a modified folder structure:

  • Probing tasks are title cased without spaces
  • Use ISO 639-1 codes for languages (or ISO 639-2 codes if there is no ISO 639-1 code)
  • Folder names have to match database entries except for the spaces
  • media/intrinsic_data/CaseMarking/de/ > train.txt, dev.txt, test.txt
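
A hypothetical reader for this file format could look like the sketch below (not part of the repository; shown only to illustrate the expected layout):

# Hypothetical reader for intrinsic data files (illustration only).
from pathlib import Path

def read_intrinsic(path):
    """Yield (token, label) pairs from a train.txt / dev.txt / test.txt file."""
    for line in Path(path).read_text(encoding='utf-8').splitlines():
        if line.strip():
            token, label = line.split()  # token and label separated by whitespace
            yield token, label

# e.g. pairs = list(read_intrinsic('media/intrinsic_data/CaseMarking/de/train.txt'))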

Additionally, we renamed some tasks to be more descriptive:

  • Case to Case Marking
  • Tag Count to Morphological Feature Count
  • Odd Feat to Odd Morphological Feature
  • Pseudo to Pseudoword
  • Same Feat to Shared Morphological Feature
  • Character Bin to Word Length
  • Part of Speech to POS

We also deleted Character Count; it was replaced by Word Length.

We have attached our fixtures under inspector/fixtures/.

Contrastive Tasks

Odd Morphological Feature and Shared Morphological Feature are contrastive tasks that predict a single odd or shared morphological feature between two tokens.

In the training data, each line consists of two tokens and a label, all separated by whitespace, e.g. ruckelte getoastet Tense.

A boolean flag contrastive has to be set in the database for each contrastive task.
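
As an illustration of how such a pair might be turned into classifier input, one simple option is to read the two tokens and concatenate their word vectors; the sketch below shows that idea and is not necessarily the architecture used by the evaluation suite.

# Hypothetical handling of contrastive data (illustration only).
import numpy as np

def read_contrastive(path):
    """Yield (token_a, token_b, label) triples, e.g. ('ruckelte', 'getoastet', 'Tense')."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                token_a, token_b, label = line.split()
                yield token_a, token_b, label

def featurize(embedding, token_a, token_b):
    """One simple choice: concatenate the two word vectors as classifier input."""
    return np.concatenate([embedding[token_a], embedding[token_b]])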

Commands

There is a command-line script experiment to probe static embedding files, which is designed to work with the intrinsic evaluation of Şahin et al. (2019).

To probe a single embeddings file per language run:

# Folders structured like media/experiment/ar/embeddings.vec
./manage.py experiment ar hy cs fr hu

To probe different embedding types run:

# Folders structured like media/experiment/ar/fasttext/embeddings.vec
./manage.py experiment ar hy cs fr hu --types bpe fasttext w2v

To probe different embedding dimensions run:

# media/experiment/ar/fasttext/100/embeddings.vec
./manage.py experiment ar hy cs fr hu --types bpe fasttext w2v --dims 50 100 200 300

The metrics are written to a CSV file per language, e.g. media/experiment/ar.csv.

Run ./manage.py help experiment for additional arguments.
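
If you want to collect the per-language CSV files into one table, a small pandas sketch like the following may help (the column layout depends on the metrics the script writes, so this only concatenates the files and tags each row with its language code):

# Hypothetical aggregation of per-language result files (illustration only).
import os
from glob import glob
import pandas as pd

frames = []
for path in sorted(glob('media/experiment/*.csv')):
    df = pd.read_csv(path)
    df['language'] = os.path.splitext(os.path.basename(path))[0]  # e.g. 'ar'
    frames.append(df)
results = pd.concat(frames, ignore_index=True)
print(results.head())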
