Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question on whether it is possible to optimize many Recurrence Plots production #88

Open
aleksejs-fomins opened this issue Oct 23, 2020 · 8 comments

Comments

@aleksejs-fomins
Copy link

So recurrence plot pyts.image.RecurrencePlot works really nicely, but can be a bit slow. I want to write a function that would make a movie for threshold='point' and percentage in range 0 to 100. I can do it trivially by repeating the plot independently 100 times, but I presume that (theoretically) some part of the work can be re-used if percentage is changed but data stays the same. Can this be done?

@johannfaouzi
Copy link
Owner

johannfaouzi commented Oct 24, 2020

Yes, sure! You just have to compute the images without using a threshold (threshold=None), then compute the thresholds yourself and binarize the recurrence plots for each threshold.

Here is an example:

import numpy as np
from pyts.image import RecurrencePlot

# Create a toy data set
rng = np.random.RandomState(42)
n_samples, n_timestamps = 200, 48
X = rng.randn(n_samples, n_timestamps)

# Compute the recurrence plots with no threshold
X_rp = RecurrencePlot(threshold=None).transform(X)

# Compute the thresholds (each percentile here)
thresholds = np.percentile(
    np.reshape(X_rp, (n_samples, -1)),
    np.arange(1, 100), axis=1
)

X_rp_thresholds = np.empty((thresholds.size, *X_rp.shape))

# Binarize the recurrence plots for each threshold
for i, thresh in enumerate(thresholds):
    X_rp_thresholds[i] = X_rp <= thresh[:, None, None]

You can show all the images on one figure like this:

import matplotlib.pyplot as plt

# Choose the index of the time series
idx = 10

# Plot the recurrence plot for each threshold
plt.figure(figsize=(5, 20))
for i in range(thresholds.shape[0]):
    plt.subplot(20, 5, i + 1)
    plt.imshow(X_rp_thresholds[i, idx], cmap='binary', origin='lower')
    plt.axis('off')
plt.show()

To make a video, here are a few ressources (found with a quick research):

But maybe you can do the video yourself without python by saving each image?

@aleksejs-fomins
Copy link
Author

Ah, this is really cool, much simpler than I expected. I don't really need help with making a video, I have already made a tiny library using opencv for that. Thanks so much for your feedback

@aleksejs-fomins
Copy link
Author

aleksejs-fomins commented Oct 26, 2020

This seems to be a very heavy operation to perform. The following line in your code

X_rp_thresholds = np.empty((thresholds.size, *X_rp.shape))

seems to require 68Gb of space.

@aleksejs-fomins
Copy link
Author

Wait, I'm confused. Why does RecurrencePlot.transform return a 3D array? Does it calculate the recurrence plots for each dimension independently? I was expecting that it would calculate distances directly in N-dimensional space and return a 2D array [timesteps x timesteps]. Can this be done?

@johannfaouzi
Copy link
Owner

The library is designed from a machine learning point of view, so it expects a data set, that is a set of time series. Moreover, it is mostly designed to deal with univariate time series, although it can also handle multivariate time series.

If I guess it right, you only have one time series but a multivariate one? So you have a 2D-array with the first axis being the dimension (channel) and the second axis being time? And you want to compute only one recurrence plot for all the dimensions (instead of one for each dimension independently)?

Here is a detailed example:

import numpy as np
from numpy.lib.stride_tricks import as_strided
from pyts.image import RecurrencePlot
from scipy.spatial.distance import squareform


# Create a multivariate time series with 12 dimensions and 480 timepoints
rng = np.random.RandomState(42)
n_dimensions, n_timestamps = 12, 480
x = rng.randn(n_dimensions, n_timestamps)

# Define a function that will compute all the trajectories and save them in a 3D-array
def compute_trajectories(x, trajectory_size, time_delay):
    n_dimensions, n_timestamps = x.shape
    shape_new = (n_dimensions,
                 n_timestamps - (trajectory_size - 1) * time_delay,
                 trajectory_size)
    s0, s1 = x.strides
    strides_new = (s0, s1, time_delay * s1)
    return as_strided(x, shape=shape_new, strides=strides_new)

# Compute all the trajectories
# The number of trajectories is equal to n_timestamps - (trajectory_size - 1) * time_delay
trajectory_size = 4
time_delay = 2
x_traj = compute_trajectories(x, trajectory_size, time_delay)

# Compute the distance between each trajectory using the Euclidean distance
# This is the unthresholded recurrence plot
# We sum over axis 0 (dimensions) and axis 3 (trajectory_size)
# The output is a 2D-array with shape (n_trajectories, n_trajectories)
x_dist = np.sqrt(
    np.sum((x_traj[:, None, :, :] - x_traj[:, :, None, :]) ** 2,
           axis=(0, 3))
)

# Compute the thresholds (each percentile here)
thresholds = np.percentile(x_dist.ravel(), np.arange(1, 100))

# Note that the matrix is symmetric and that the diagonale is full of zeros
# because the distance between two identical trajectories is null.
# You can also compute the thresholds using only the upper (or lower) triangular entries
thresholds_alt = np.percentile(squareform(x_dist), np.arange(1, 100))

# Binarize the recurrence plots for each threshold
x_rp_thresholds = np.empty((thresholds_alt.size, *x_dist.shape))
for i, thresh in enumerate(thresholds):
    x_rp_thresholds[i] = x_dist <= thresh

The documentation explains how trajectories are computed, if it's not clear. In your case, I guess that you assume trajectory_size=1 (and time_delay is ignored in that case since it's irrelevant).

@aleksejs-fomins
Copy link
Author

Ah yes, I just did more or less what you wrote, works very nicely. I can provide an example with interactive jupyter plot, where I can slide over percentile with a slider, very convenient. Two further questions, if I may:

  1. On top of having multidimensional data, I actually do have repetitions of the experiment. So following this procedure, I can construct a 2D binary matrix for each repetition of the experiment (given exact threshold value). What I have tried now is to simply average the binary matrices over the repetitions to get a 2D float matrix. It looks cool and seems informative, but I just wonder if that is what people actually do. I could just have 2 parameters, that is, 2 sliders, but then it becomes harder to explore all possibilities.
  2. In practice, do people do any pre-processing to the data so that the recurrence plot is more sensible? One thing I have considered is z-scoring each dimension independently, so that the distances are not biased by dimensions with very high relative activity. The other thing I have heard is that euclidean norm has a fail case when used as a classifier. Namely, if the dataset has a variance along the axis that is orthogonal to the distance, then the distance may be very small relative to the domain size, but nevertheless be highly informative. As a solution, the Mahalanobis distance was proposed. Have you heard anything of this?

@johannfaouzi
Copy link
Owner

  1. I've seen people propose multiplication instead of addition, they call it joint recurrence plot (https://abdn.pure.elsevier.com/en/publications/multivariate-recurrence-plots), but it was not proposed for time series classification in particular.

  2. To be honest, I'm not sure that many people use recurrence plots for time series classification. I've seen a few papers (this one, that one, that one), and the literature for multivariate time series classification is much smaller than the one for univariate time series classification. The scale / variance of each dimension is important for Euclidean distance as it corresponds to the weight of this dimension in the Euclidean distance. When you have computed all the trajectories, you can compute their pairwise distances using sklearn.metrics.pairwise_distances (you just need to reshape the 3D-array into a 2D-array first, and do the inverse operation afterwards) and they provide many more distances. The Mahalanobis distance makes use of the covariance matrix between both time series, so it may be useful indeed. Nonetheless, at the hand, you still need a classifier to classify your time series based on its recurrence plot. Maybe you can use an ensemble of classifiers, where each recurrence plot is computed with a different distance?

Hope this helps you a bit.

@aleksejs-fomins
Copy link
Author

Thanks a lot for your suggestions! This is all extremely helpful to me! I will go through the publications you suggest. The goal really is to build a toolbox for neuroscience applications within our research group. I really liked the recurrence plots because they enable to see by eye what is going on to get intuition. The real problem is that there is a very complicated, possibly chaotic system at hand, and while there are many deep learning methods proposed to decode what is going on, the only real way to debug and understand HOW the neuronal populations encode data is to project the multivariate phase-space onto some simple metrics to filter out possible hypotheses

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants