sktime interface #98

Open
mloning opened this issue Nov 20, 2020 · 2 comments
@mloning

mloning commented Nov 20, 2020

Hi everyone,

Your package looks really good and we're thinking about interfacing your package in sktime to make use of the time series distance functions you provide, and potentially also the clustering algorithms (see sktime/sktime#501).

  • Is there anything we should keep in mind when adding dtaidistance as a dependency?
  • Have you played around with integrating it with sklearn (I see that you largely follow their interface)?
  • In clustering, what input type do you expect for the time series data in fit and predict? A 2D NumPy array, assuming multiple instances of equal-length univariate data?

cc @chrisholder

@wannesm
Owner

wannesm commented Nov 22, 2020

Hi @mloning ,

Sounds interesting and exciting. We started using sktime for some research tasks and teaching assignments and are satisfied users. So we are willing to provide updates and features to make this process easier. Our current focus is on our own projects (i.e. fast C-versions), and while we try to make it reusable, some functionality might be missing because we didn't yet need it (e.g. we don't have a predict method for clustering although that's easy to implement).

W.r.t. your questions:

  • dtaidistance as a dependency: I don't think there is anything special. We have two important optional dependencies, Cython and NumPy, which you already require (optionally). For clustering we rely on SciPy and PyClustering for some methods (both optional and checked dynamically). We also selected a permissive license to make it easy to collaborate.
  • We often have sklearn or one of our own ML algorithms in the same pipeline, but they typically communicate only through simple data structures. For example, we use DTW to compute distances to prototypes and use the results as features. Or we feed the distances to SciPy, PyClustering or our own clustering algorithms (e.g. for fleet-based anomaly detection). Currently we use agglomerative clustering (using our own implementation or SciPy), medoid clustering (using PyClustering) and, since this month, DBA-k-means (our own implementation). We do indeed follow the sklearn interface since everybody is already familiar with this API (also inspired by the sklearn HMM module, which has since been moved out). Did you have anything in mind other than clustering for classification?
  • The expected input is quite flexible because our use cases also produce series with different lengths and we support embedded devices. We try to be smart about it and detect the input format (numpy, array.array, list, ...). If it's an equal-length numpy matrix, we use the internal data structure directly. If it's a list of series, we keep track of individual pointers to the series. We thus avoid copy operations where we can. The series can be univariate or multivariate; both are supported for our use cases (e.g. for sports monitoring).
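
To make the "distances to prototypes as features" idea concrete, here is a minimal sketch. It is not dtaidistance's implementation or API: `dtw_distance`, `prototype_features` and `nearest_prototype` are hypothetical helper names, and the pure-Python DTW (squared point cost, square root of the accumulated cost, no window) is only an assumed stand-in for the library's fast C version. It takes plain lists, so series of different lengths work without padding.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW between two univariate series (hypothetical stand-in).

    Uses a squared point-wise cost and returns the square root of the
    accumulated cost along the best warping path.
    """
    n, m = len(a), len(b)
    # prev[j] holds the best accumulated cost for the previous row of the DP table.
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


def prototype_features(series, prototypes):
    """One feature row per series: its DTW distance to each prototype."""
    return [[dtw_distance(s, p) for p in prototypes] for s in series]


def nearest_prototype(features):
    """Index of the closest prototype per row (a simple predict-style step)."""
    return [min(range(len(row)), key=row.__getitem__) for row in features]


if __name__ == "__main__":
    # Series of unequal length, passed as a plain list of lists.
    series = [[0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0], [5, 5, 5]]
    prototypes = [[0, 1, 2, 1, 0], [5, 5, 5, 5]]
    features = prototype_features(series, prototypes)
    print(features)
    print(nearest_prototype(features))
```

The feature rows form a regular 2D table even when the input series do not, so they can be handed to any sklearn-style estimator; the nearest-prototype step is one simple way a predict method for prototype-based clustering could look.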

@mloning
Author

mloning commented Nov 23, 2020

Thanks @wannesm, sounds good! @chrisholder will start working on it over the next few weeks, we'll report back if we have more questions!
