Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for cdist #175

Open
bilelomrani1 opened this issue Aug 8, 2022 · 7 comments
Open

Support for cdist #175

bilelomrani1 opened this issue Aug 8, 2022 · 7 comments

Comments

@bilelomrani1
Copy link

Currently it is possible to compute a distance matrix between all pairs of time series within a collection with dtaidistance.dtw_ndim.distance_matrix, but it is not possible to (efficiently) compute distances between each pair of two collections of inputs, something akin to scipy.spatial.distance.cdist but for DTW.

This can be really useful to efficiently compute distances between multiple time series and a collection of prototypes.

@wannesm
Copy link
Owner

wannesm commented Aug 11, 2022

Indeed, that specific setting was not considered when designing this toolbox. While it is a good suggestion, this would require quite some refactoring. You probably already thought of this, but our goto solution in case of prototypes would be to parallelize over the prototypes manually (see e.g. the k-means implementation).

Two simpler alternatives could be to use either numpy concat or to make the toolbox think it sees one list. And then use the block argument. Both are compatible with the fast C implementation.

Assume two sets of series you want to compare:

nb = 3
size = 10
np.random.seed(seed=3980)
s1 = np.random.random((nb,size))
s2 = np.random.random((nb,size))

First alternative (with the disadvantage that it requires copying all data to one data structure):

s = np.vstack((s1, s2))
m = dtaidistance.dtw.distance_matrix(s, block=((0,nb), (nb,2*nb)),
                                     compact=False, use_c=True)

Second alternative:

class MyList(list):
    def __init__(self, s1, s2):
        self.s1 = s1
        self.s2 = s2
    def __len__(self):
        return len(self.s1) + len(self.s2)
    def __getitem__(self, idx):
        if idx < len(self.s1):
            return self.s1[idx]
        idx -= len(self.s1)
        return self.s2[idx]
s = MyList(s1, s2)
m = dtaidistance.dtw.distance_matrix(s, block=((0,nb), (nb,2*nb)),
                                     compact=False, use_c=True)

The overhead of the second solution should not be too bad. The additional memory requirements are a lot less than storing the actual data. If this second solution works well, we could consider adding this with some wrappers to make it easy to use different series. But some benchmarking would be necessary first.

@bilelomrani1
Copy link
Author

Thank you for this extensive answer, adding a wrapper sounds like a good idea. I'm leaving the issue opened just in case, but feel free to close it if you want. Thank you for your help.

@jenghub
Copy link

jenghub commented Sep 12, 2022

How would one properly format a matrix of (lat, long) coordinates for the dtw.distance_matrix_fast with block? Let's say, I have a bunch of same-length series in a numpy matrix of size (num_trajectories, time_steps, 2). Unsure of how to adapt the above examples. Thanks!

@wannesm
Copy link
Owner

wannesm commented Sep 12, 2022

In this case you can probably use this functionality directly (assuming the series you want to use are consecutive):
https://dtaidistance.readthedocs.io/en/latest/usage/dtw.html#dtw-between-multiple-time-series-limited-to-block

In case the series you want to compare are scattered in your numpy matrix, you need to reorder. Suppose you want to compare series 0,2,4 with 1,3,4, this would be something like:

series_reordered = series[[0,2,4,1,3,5]]
dtw_ndim.distance_matrix_fast(series_reordered, block=((0,3),(3,6)))

Unrelated to DTW, using Euclidean distance (used within DTW) for lat/lon might give suboptimal results. For example, one degree latitude is about 111km, but on degree longitude ranges from being 111km to 0km depending where you are on earth. If you observe strange results, consider re-projecting to a two-dimensional plane (e.g. Mercator): https://geopandas.org/en/stable/docs/user_guide/projections.html

@jenghub
Copy link

jenghub commented Sep 12, 2022

Thanks for the advice! Just to confirm, would I have to reshape a my array of shape (examples, time_step, 2) to (examples, time_step * 2)? The initial 3d matrix returns zeroes, and the second 2d matrix does return valid numbers when I use dtw.distance_matrix_fast with equal-sized arrays. I just want to make sure the lat, long (in whatever projection) is being properly ordered/used by the distance function.

@wannesm
Copy link
Owner

wannesm commented Sep 13, 2022

You should not reshape, but use the multi-dimensional version available in the dtw_ndim subpackage (thus not the one dimensional version in dtw). See also https://dtaidistance.readthedocs.io/en/latest/usage/dtw.html#multi-dimensionsal-dtw

@jenghub
Copy link

jenghub commented Sep 13, 2022

Gotcha. Have done that, but was wondering if there was to have the matrix function work with 2d. All good. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants