LANN (Lame Approximate Neighbour Search) is a Python re-implementation of Annoy mainly for Jeffrey04's learning (hence the 'L' in name)/experimental purpose. It does not bring anything new to the table, and is not meant for production use for now (probably wouldn't work with large scale real-life data). Like Annoy, the library can be used to search for the nearest points for a given query point in a vector space.
It does not generate multiple trees to improve precision and recall for now. Also it does not store points in the tree. In order to use this library for searching, the points needs to be index-able, preferably in a dictionary-like structure.
- Python 3.4+
from lann import points_add, point_convert, forest_build, forest_query_neighbourhood, search
from uuid import uuid4
from random import gauss
size, dim = 25000, 5
print('generating points')
pmeta, points = points_add([[gauss(0, 1) for __ in range(dim)] for _ in range(size)],
dim,
'list',
[uuid4() for _ in range(size)])
print('generating query')
query = point_convert((1/3., 1/3.), 'list')
print('generating forest')
fmeta, forest = forest_build(points, pmeta, 25, leaf_max=5, n_jobs=4)
print('search')
idx, distance = search(query, points, pmeta, forest, fmeta, 1)[0]
points
: an array of pointsdimension
: the dimension ofpoints
ptype
: points are eitherlist
(a list of numeric values), orgensim
for gensim-like corpusidentifiers
(optional, defaulted toNone
): an array of identifiers if applicable, otherwise a list of uuid4 is assigned to each of the point- Returns: a pair of dictionaries where the first item stores the
dimension
and the second storesidentifier
as key, and point as value
vector
: the vector to be convertedptype
: points are eitherlist
(a list of numeric values), orgensim
for gensim-like corpus- Returns: point recognized by lann
points
: output ofpoints_add
tree_count
: number of trees to buildleaf_max
(optional, defaulted to5
): maximum number of points to be stored in a leaf noden_jobs
(optional, defaulted to1
): maximum number of processes to spawnbatch_size
(optional, defaulted to10000
): only usable whenn_jobs
> 1, defines how many nodes to build in a batch- Returns: a pair of dictionaries where the first stores the forest
count
,leaf_max
androots
(ids for root nodes), and the second storesid
as key and node details as value
query
: the query point (converted bypoint_convert
)points
: second output ofpoints_add
pmeta
: first output ofpoints_add
forest
: second output offorest_build
fmeta
: first output offorest_build
n
: number of points to returnthreshold
(optional, defaulted toNone
to use a heuristic value): at0
, only one leaf node per tree is returned, the number of leaf nodes returned (as well as the number of candidate points) increases as the threshold value increases.n_jobs
(optional, defaulted to1
): maximum number of threads to spawn- Returns: a list of tuple, where each of them consists of a point identifier, and the corresponding distance score for the point.
- Unit tests
- Further optimization (Refactor + Cython/C/C++/Golang?)
- Proper API