Support for cdist #175

bilelomrani1 · 2022-08-08T19:28:30Z

Currently it is possible to compute a distance matrix between all pairs of time series within a collection with dtaidistance.dtw_ndim.distance_matrix, but it is not possible to (efficiently) compute distances between each pair of two collections of inputs, something akin to scipy.spatial.distance.cdist but for DTW.

This can be really useful to efficiently compute distances between multiple time series and a collection of prototypes.

The text was updated successfully, but these errors were encountered:

wannesm · 2022-08-11T22:24:34Z

Indeed, that specific setting was not considered when designing this toolbox. While it is a good suggestion, this would require quite some refactoring. You probably already thought of this, but our goto solution in case of prototypes would be to parallelize over the prototypes manually (see e.g. the k-means implementation).

Two simpler alternatives could be to use either numpy concat or to make the toolbox think it sees one list. And then use the block argument. Both are compatible with the fast C implementation.

Assume two sets of series you want to compare:

nb = 3
size = 10
np.random.seed(seed=3980)
s1 = np.random.random((nb,size))
s2 = np.random.random((nb,size))

First alternative (with the disadvantage that it requires copying all data to one data structure):

s = np.vstack((s1, s2))
m = dtaidistance.dtw.distance_matrix(s, block=((0,nb), (nb,2*nb)),
                                     compact=False, use_c=True)

Second alternative:

class MyList(list):
    def __init__(self, s1, s2):
        self.s1 = s1
        self.s2 = s2
    def __len__(self):
        return len(self.s1) + len(self.s2)
    def __getitem__(self, idx):
        if idx < len(self.s1):
            return self.s1[idx]
        idx -= len(self.s1)
        return self.s2[idx]
s = MyList(s1, s2)
m = dtaidistance.dtw.distance_matrix(s, block=((0,nb), (nb,2*nb)),
                                     compact=False, use_c=True)

The overhead of the second solution should not be too bad. The additional memory requirements are a lot less than storing the actual data. If this second solution works well, we could consider adding this with some wrappers to make it easy to use different series. But some benchmarking would be necessary first.

bilelomrani1 · 2022-08-12T11:38:47Z

Thank you for this extensive answer, adding a wrapper sounds like a good idea. I'm leaving the issue opened just in case, but feel free to close it if you want. Thank you for your help.

jenghub · 2022-09-12T15:58:44Z

How would one properly format a matrix of (lat, long) coordinates for the dtw.distance_matrix_fast with block? Let's say, I have a bunch of same-length series in a numpy matrix of size (num_trajectories, time_steps, 2). Unsure of how to adapt the above examples. Thanks!

wannesm · 2022-09-12T18:49:59Z

In this case you can probably use this functionality directly (assuming the series you want to use are consecutive):
https://dtaidistance.readthedocs.io/en/latest/usage/dtw.html#dtw-between-multiple-time-series-limited-to-block

In case the series you want to compare are scattered in your numpy matrix, you need to reorder. Suppose you want to compare series 0,2,4 with 1,3,4, this would be something like:

series_reordered = series[[0,2,4,1,3,5]]
dtw_ndim.distance_matrix_fast(series_reordered, block=((0,3),(3,6)))

Unrelated to DTW, using Euclidean distance (used within DTW) for lat/lon might give suboptimal results. For example, one degree latitude is about 111km, but on degree longitude ranges from being 111km to 0km depending where you are on earth. If you observe strange results, consider re-projecting to a two-dimensional plane (e.g. Mercator): https://geopandas.org/en/stable/docs/user_guide/projections.html

jenghub · 2022-09-12T21:14:02Z

Thanks for the advice! Just to confirm, would I have to reshape a my array of shape (examples, time_step, 2) to (examples, time_step * 2)? The initial 3d matrix returns zeroes, and the second 2d matrix does return valid numbers when I use dtw.distance_matrix_fast with equal-sized arrays. I just want to make sure the lat, long (in whatever projection) is being properly ordered/used by the distance function.

wannesm · 2022-09-13T12:41:08Z

You should not reshape, but use the multi-dimensional version available in the dtw_ndim subpackage (thus not the one dimensional version in dtw). See also https://dtaidistance.readthedocs.io/en/latest/usage/dtw.html#multi-dimensionsal-dtw

jenghub · 2022-09-13T15:01:45Z

Gotcha. Have done that, but was wondering if there was to have the matrix function work with 2d. All good. Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for cdist #175

Support for cdist #175

bilelomrani1 commented Aug 8, 2022

wannesm commented Aug 11, 2022

bilelomrani1 commented Aug 12, 2022

jenghub commented Sep 12, 2022

wannesm commented Sep 12, 2022

jenghub commented Sep 12, 2022

wannesm commented Sep 13, 2022

jenghub commented Sep 13, 2022

Support for cdist #175

Support for cdist #175

Comments

bilelomrani1 commented Aug 8, 2022

wannesm commented Aug 11, 2022

bilelomrani1 commented Aug 12, 2022

jenghub commented Sep 12, 2022

wannesm commented Sep 12, 2022

jenghub commented Sep 12, 2022

wannesm commented Sep 13, 2022

jenghub commented Sep 13, 2022