[MNT] Update similarity search with new base classes #1243

baraline · 2024-02-24T16:25:02Z

Describe the issue

As new base classes were introduced by #996 , we should update the similarity search module to use these classes.
We are still in the context of a single query matching a series or a collection. The case of a collection of queries will come later.

Suggest a potential alternative/fix

Redisgn the API to use BaseSeriesEstimators as a base case and warp around it for collections.

Additional context

No response

TonyBagnall · 2024-02-24T16:37:27Z

definitely, let me know if you want any input. BaseSeriesEstimator is still experimental, so we can adapt to your use case if necessary.

baraline · 2024-02-24T17:25:22Z

In my mind there is 4 main cases :

single query to single series
single query to collection of series
collection of queries to single series
collection of queries to collection of series

All collection of series warp around the single case. In the case of top K matches, you simply update while iterating on the collection or compute them independently in parallel for all single series before isolating the top K of the collection.

I'll try it out and keep you updated, but as the module is more aimed toward data analysis, I think the BaseSeriesEstimator looks fine as it is for now.

baraline · 2024-04-11T13:07:01Z

Pasting the message I made on slack here to keep track of it.

Hi everyone, regarding PR #1310 and the update of the similarity search module, and to prepare for issue #1311 and the future additions, I would like your inputs on the following proposition:

Module structure :

- aeon/
|---- similarity_search/
|-------- BaseSimilaritySearch.py
|-------- query_search/
|------------ BaseQuerySearch.py
|-------- series_search/
|------------ BaseSeriesSearch.py
|-------- index_search/
|------------ BaseIndexSearch.py

Query search : Given a query Q and a series/collection X, evaluate the similarity between Q and each admissible candidate in X.
Series search : Given a length parameter (for now, we add techniques that don't require it later) do a query search for all admissible queries in a series/collection X. In the naive case, this is simply a broadcasting of query search, but more optimized algorithms exists for this case (e.g. STUMP/STOMP for Euclidean distance)
Index search : Given a series/collection X and a length parameter (again, for now), build an indexing (e.g.what the Faiss library does) of all admissible candidates in X. Then, this indexing can be used as an estimator to answer query search tasks. This is generally used when you have a frozen historical set and want fast answers for new queries / when the inputs do not fit in memory.

Expected data and internal input conversion :

We could accept series/collection in numpy and series in pd.Series data as input, but we would ideally convert all of it to numpy collection, and make use of the axis argument we introduced in other modules to avoid the channel problem:

For query search, we would implement heavy computation numba functions in a series case and loop over it with the collection. This can for example allow passing down (between series of a collection X) best-so-far values when doing early abandon or pruning with lower bounds.

The same reasoning apply to series search.
The indexing case generally work with out-of-memory data, and require updates after loading new parts of the dataset, which are generally made of collections/big series.

Interaction with other modules:

We would still use the distance module for the naive search cases, which are the one without speed-ups (which can lead to exact or approximative results). It would also be nice to offer some visualisation through the visualisation module.
As for the reverse, which is which aeon estimator could benefit from the similarity search module speed-ups, it is still to be explored.

Documentation and notebooks:

I would like to continue to have 3 type of notebooks for similarity search :

A benchmark notebook, which would show the effect of different speed-up option, and would help us determine which one to put as "default" speed-ups for a given distance function.
An example notebook, which shows how to use the estimators on some toy cases and how to visualize the results
A more theoretical one, which would explain the maths behind the different speed-ups and the base tasks.

Additionally, should we do wrappers around stumpy for Euclidean distance cases and focus on other distances ? Or try to have our own implementation to see if we end up with similar results as stumpy ?(I would ofc split the creation of each sub-modules into different PRs). Looking forward to your inputs ! If some other people want to work on this, I can create smaller issues to pick-up.

baraline added API design API design & software architecture maintenance Continuous integration, unit testing & package distribution similarity search Similarity search package labels Feb 24, 2024

baraline self-assigned this Feb 24, 2024

baraline linked a pull request Mar 13, 2024 that will close this issue

[MNT] Update similarity search with new base classes #1310

Closed

3 tasks

baraline mentioned this issue Mar 13, 2024

[MNT] Update similarity search with new base classes #1310

Closed

3 tasks

baraline linked a pull request May 9, 2024 that will close this issue

[MNT] Update similarity search with new base classes : Query Search #1508

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MNT] Update similarity search with new base classes #1243

[MNT] Update similarity search with new base classes #1243

baraline commented Feb 24, 2024

TonyBagnall commented Feb 24, 2024

baraline commented Feb 24, 2024 •

edited

baraline commented Apr 11, 2024

[MNT] Update similarity search with new base classes #1243

[MNT] Update similarity search with new base classes #1243

Comments

baraline commented Feb 24, 2024

Describe the issue

Suggest a potential alternative/fix

Additional context

TonyBagnall commented Feb 24, 2024

baraline commented Feb 24, 2024 • edited

baraline commented Apr 11, 2024

Module structure :

Expected data and internal input conversion :

Interaction with other modules:

Documentation and notebooks:

baraline commented Feb 24, 2024 •

edited