Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MNT] Update similarity search with new base classes #1243

Open
baraline opened this issue Feb 24, 2024 · 3 comments · May be fixed by #1508
Open

[MNT] Update similarity search with new base classes #1243

baraline opened this issue Feb 24, 2024 · 3 comments · May be fixed by #1508
Assignees
Labels
API design API design & software architecture maintenance Continuous integration, unit testing & package distribution similarity search Similarity search package

Comments

@baraline
Copy link
Contributor

Describe the issue

As new base classes were introduced by #996 , we should update the similarity search module to use these classes.
We are still in the context of a single query matching a series or a collection. The case of a collection of queries will come later.

Suggest a potential alternative/fix

Redisgn the API to use BaseSeriesEstimators as a base case and warp around it for collections.

Additional context

No response

@baraline baraline added API design API design & software architecture maintenance Continuous integration, unit testing & package distribution similarity search Similarity search package labels Feb 24, 2024
@baraline baraline self-assigned this Feb 24, 2024
@TonyBagnall
Copy link
Contributor

definitely, let me know if you want any input. BaseSeriesEstimator is still experimental, so we can adapt to your use case if necessary.

@baraline
Copy link
Contributor Author

baraline commented Feb 24, 2024

In my mind there is 4 main cases :

  • single query to single series
  • single query to collection of series
  • collection of queries to single series
  • collection of queries to collection of series

All collection of series warp around the single case. In the case of top K matches, you simply update while iterating on the collection or compute them independently in parallel for all single series before isolating the top K of the collection.

I'll try it out and keep you updated, but as the module is more aimed toward data analysis, I think the BaseSeriesEstimator looks fine as it is for now.

@baraline
Copy link
Contributor Author

Pasting the message I made on slack here to keep track of it.

Hi everyone, regarding PR #1310 and the update of the similarity search module, and to prepare for issue #1311 and the future additions, I would like your inputs on the following proposition:

Module structure :

- aeon/
|---- similarity_search/
|-------- BaseSimilaritySearch.py
|-------- query_search/
|------------ BaseQuerySearch.py
|-------- series_search/
|------------ BaseSeriesSearch.py
|-------- index_search/
|------------ BaseIndexSearch.py
  • Query search : Given a query Q and a series/collection X, evaluate the similarity between Q and each admissible candidate in X.
  • Series search : Given a length parameter (for now, we add techniques that don't require it later) do a query search for all admissible queries in a series/collection X. In the naive case, this is simply a broadcasting of query search, but more optimized algorithms exists for this case (e.g. STUMP/STOMP for Euclidean distance)
  • Index search : Given a series/collection X and a length parameter (again, for now), build an indexing (e.g.what the Faiss library does) of all admissible candidates in X. Then, this indexing can be used as an estimator to answer query search tasks. This is generally used when you have a frozen historical set and want fast answers for new queries / when the inputs do not fit in memory.

Expected data and internal input conversion :

We could accept series/collection in numpy and series in pd.Series data as input, but we would ideally convert all of it to numpy collection, and make use of the axis argument we introduced in other modules to avoid the channel problem:

For query search, we would implement heavy computation numba functions in a series case and loop over it with the collection. This can for example allow passing down (between series of a collection X) best-so-far values when doing early abandon or pruning with lower bounds.

  • The same reasoning apply to series search.
  • The indexing case generally work with out-of-memory data, and require updates after loading new parts of the dataset, which are generally made of collections/big series.

Interaction with other modules:

We would still use the distance module for the naive search cases, which are the one without speed-ups (which can lead to exact or approximative results). It would also be nice to offer some visualisation through the visualisation module.
As for the reverse, which is which aeon estimator could benefit from the similarity search module speed-ups, it is still to be explored.

Documentation and notebooks:

I would like to continue to have 3 type of notebooks for similarity search :

  • A benchmark notebook, which would show the effect of different speed-up option, and would help us determine which one to put as "default" speed-ups for a given distance function.
  • An example notebook, which shows how to use the estimators on some toy cases and how to visualize the results
  • A more theoretical one, which would explain the maths behind the different speed-ups and the base tasks.

Additionally, should we do wrappers around stumpy for Euclidean distance cases and focus on other distances ? Or try to have our own implementation to see if we end up with similar results as stumpy ?(I would ofc split the creation of each sub-modules into different PRs). Looking forward to your inputs ! If some other people want to work on this, I can create smaller issues to pick-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design API design & software architecture maintenance Continuous integration, unit testing & package distribution similarity search Similarity search package
Projects
None yet
2 participants