finalfusion-python

Introduction

finalfusion is a Python package for reading, writing and using finalfusion embeddings. It also supports other commonly used embedding formats such as fastText, GloVe and word2vec.

The Python package supports the same types of embeddings as the finalfusion-rust crate:

- Vocabulary:
  - No subwords
  - Subwords
- Embedding matrix:
  - Array
  - Memory-mapped
  - Quantized
- Norms
- Metadata

Installation

The finalfusion module is available on PyPI for Linux, macOS and Windows. You can use pip to install the module:

$ pip install --upgrade finalfusion
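
One way to check that the installation succeeded is to ask Python for the installed version. The snippet below is a minimal sketch that uses only the standard library. It assumes the distribution is registered under the name finalfusion, as in the pip command above; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("finalfusion"))
```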

Installing from source

Building from source depends on Cython. If you install the package using pip, you don't need to explicitly install the dependency since it is specified in pyproject.toml.

$ git clone https://github.com/finalfusion/finalfusion-python
$ cd finalfusion-python
$ pip install .

If you want to build wheels from source, the wheel package needs to be installed. You can then build wheels with:

$ python setup.py bdist_wheel

The built wheels are placed in the dist directory.

Package Usage

Basic usage

import finalfusion
# loading from different formats
w2v_embeds = finalfusion.load_word2vec("/path/to/w2v.bin")
text_embeds = finalfusion.load_text("/path/to/embeds.txt")
text_dims_embeds = finalfusion.load_text_dims("/path/to/embeds.dims.txt")
fasttext_embeds = finalfusion.load_fasttext("/path/to/fasttext.bin")
fifu_embeds = finalfusion.load_finalfusion("/path/to/embeddings.fifu")

# serialization to formats works similarly
finalfusion.compat.write_word2vec("to_word2vec.bin", fifu_embeds)

# embedding lookup
embedding = fifu_embeds["Test"]

# reading an embedding into a buffer
import numpy as np
buffer = np.zeros(fifu_embeds.storage.shape[1], dtype=np.float32)
fifu_embeds.embedding("Test", out=buffer)

# similarity and analogy query
sim_query = fifu_embeds.word_similarity("Test")
analogy_query = fifu_embeds.analogy("A", "B", "C")

# accessing the vocab and printing the first 10 words
vocab = fifu_embeds.vocab
print(vocab.words[:10])

# SubwordVocabs give access to the subword indexer:
subword_indexer = vocab.subword_indexer
print(subword_indexer.subword_indices("Test", with_ngrams=True))

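# accessing the storage and calculating its dot product with an embedding
# storage has shape (rows, dims) and embedding has shape (dims,), so the storage goes on the left
res = fifu_embeds.storage.dot(embedding)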

# printing metadata
print(fifu_embeds.metadata)
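
The subword indexer used above works on character n-grams of a word. As a rough illustration of what such n-grams look like, here is a minimal sketch in plain Python. It assumes fastText-style bracketing with "<" and ">" and n-gram lengths from 3 to 6. The function char_ngrams and its defaults are hypothetical and not part of the finalfusion API, and the order in which a real indexer returns n-grams may differ.

```python
def char_ngrams(word, min_n=3, max_n=6, bow="<", eow=">"):
    """Return all character n-grams of the bracketed word, shortest first."""
    bracketed = f"{bow}{word}{eow}"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the bracketed word.
        for start in range(len(bracketed) - n + 1):
            ngrams.append(bracketed[start:start + n])
    return ngrams


print(char_ngrams("Test", min_n=3, max_n=4))
# → ['<Te', 'Tes', 'est', 'st>', '<Tes', 'Test', 'est>']
```

Each of these n-grams is then mapped to a row in the embedding matrix by the indexer.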

Beyond Embeddings

# load only a vocab from a finalfusion file
from finalfusion import load_vocab
vocab = load_vocab("/path/to/finalfusion_file.fifu")

# serialize vocab to single file
vocab.write("/path/to/vocab_file.fifu.voc")

# more specific loading functions exist
from finalfusion.vocab import load_finalfusion_bucket_vocab
fifu_bucket_vocab = load_finalfusion_bucket_vocab("/path/to/vocab_file.fifu.voc")

The package supports loading and writing all finalfusion chunks this way. Such single-chunk files are only supported by the Python package; reading them with other implementations, e.g. finalfusion-rust, will fail.

Scripts

finalfusion also includes a conversion script ffp-convert to convert between the supported formats.

# convert from fastText format to finalfusion
$ ffp-convert -f fasttext fasttext.bin -t finalfusion embeddings.fifu

ffp-bucket-to-explicit can be used to convert bucket embeddings to embeddings with an explicit ngram lookup.

# convert finalfusion bucket embeddings to explicit
$ ffp-bucket-to-explicit -f finalfusion embeddings.fifu explicit.fifu
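
To make the difference between the two kinds of lookup concrete, the following toy sketch contrasts a hash-based bucket lookup with a table-based explicit lookup. It only illustrates the two mechanisms and does not show how ffp-bucket-to-explicit assigns indices. The functions bucket_index and build_explicit_table are hypothetical, and zlib.crc32 merely stands in for whatever hash function a real indexer uses.

```python
import zlib


def bucket_index(ngram, n_buckets):
    """Hash an n-gram into one of n_buckets slots (stand-in hash function)."""
    return zlib.crc32(ngram.encode("utf-8")) % n_buckets


def build_explicit_table(ngrams):
    """Store an index for every listed n-gram, in first-seen order."""
    table = {}
    for ngram in ngrams:
        table.setdefault(ngram, len(table))
    return table


ngrams = ["<Te", "Tes", "est", "st>"]

# Bucket lookup: every string gets an index, but distinct n-grams may collide.
buckets = {ngram: bucket_index(ngram, n_buckets=2) for ngram in ngrams}

# Explicit lookup: only n-grams stored in the table have an index.
explicit = build_explicit_table(ngrams)
print(explicit.get("est"), explicit.get("xyz"))
```

With two buckets and four n-grams, at least two n-grams must share a bucket, while the explicit table returns None for an n-gram it has never seen.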

ffp-select generates a new embedding file from existing embeddings and a word list. Using ffp-select on embeddings with a simple vocab results in a subset of the original embeddings. With subword embeddings, vectors for words in the word list that are missing from the vocabulary are computed from their subwords and added to the new embeddings. The new vocabulary covers only the words in the word list, so the resulting embeddings can no longer provide representations for out-of-vocabulary (OOV) words.

$ ffp-select large-embeddings.fifu subset-embeddings.fifu words.txt
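
The selection behaviour described above can be sketched with plain dictionaries. This is a toy model and not the ffp-select implementation: the functions select and mean_vector are hypothetical, the vectors are made up, unknown words are built by averaging character bigram vectors (an assumption of this sketch), and words without any vector are skipped.

```python
def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def select(word_vectors, word_list, subword_vectors=None, n=2):
    """Build a word -> vector table restricted to the words in word_list."""
    selected = {}
    for word in word_list:
        if word in word_vectors:
            # Known word: copy its vector unchanged.
            selected[word] = word_vectors[word]
        elif subword_vectors is not None:
            # Unknown word: average the vectors of its known character n-grams.
            ngrams = [word[i:i + n] for i in range(len(word) - n + 1)]
            known = [subword_vectors[g] for g in ngrams if g in subword_vectors]
            if known:
                selected[word] = mean_vector(known)
    return selected


word_vectors = {"tea": [1.0, 0.0], "ten": [0.0, 1.0]}
subword_vectors = {"te": [2.0, 4.0], "es": [4.0, 0.0]}

simple = select(word_vectors, ["tea", "tes"])
subword = select(word_vectors, ["tea", "tes"], subword_vectors)
print(simple)   # → {'tea': [1.0, 0.0]}
print(subword)  # → {'tea': [1.0, 0.0], 'tes': [3.0, 2.0]}
```

In both cases the result is a fixed table, which mirrors why the selected embeddings cannot produce vectors for words outside the word list.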

Finally, the package comes with ffp-similar and ffp-analogy for similarity and analogy queries.

# get the 5 nearest neighbours of "Tübingen"
$ echo Tübingen | ffp-similar embeddings.fifu
# get the top 5 answers for: "Tübingen" is to "Stuttgart" as "Heidelberg" is to ...
$ echo Tübingen Stuttgart Heidelberg | ffp-analogy embeddings.fifu
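
For readers unfamiliar with these queries, the sketch below shows the usual idea behind them on made-up two-dimensional vectors: similarity ranks words by cosine similarity to a query vector, and an analogy "a is to b as c is to ?" ranks words by similarity to b - a + c. The functions and data are hypothetical. Excluding the query words from the results is an assumption of this sketch, and the actual scripts may differ in details.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, exclude=(), k=5):
    """Return up to k words whose vectors are closest to the query vector."""
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


def analogy(vectors, a, b, c, k=5):
    """Answer 'a is to b as c is to ?' by ranking against b - a + c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(vectors, target, exclude={a, b, c}, k=k)


# Made-up toy vectors, for illustration only.
vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [2.0, 0.0],
           "d": [2.0, 1.0], "e": [-1.0, -1.0]}

print(most_similar(vectors, vectors["a"], exclude={"a"}, k=1))  # → ['c']
print(analogy(vectors, "a", "b", "c", k=1))                     # → ['d']
```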
