samples with different length #104

Open
thunderbug1 opened this issue Aug 9, 2021 · 2 comments

Comments

@thunderbug1

If I understand the WEASEL+MUSE algorithm correctly, it should be possible to use it with samples of different lengths.
This is currently not possible with the API of the WEASELMUSE class, which expects a 3D array of shape (n_samples, n_features, n_timestamps): a numpy array forces every sample to have the same shape.

I tried padding the time series of all samples with nan values up to the length of the longest sample, but the input checks reject nan values.
Is there a way to use samples of different lengths?

@johannfaouzi
Owner

Hi,

Sorry for the late reply. Variable-length data sets are unfortunately not supported at the moment.

Regarding WEASEL+MUSE, you can achieve this with the following process:

  1. Create a data set for each unique length value (in each data set, the time series should have the same length)
  2. Transform each data set using a separate instance of WEASELMUSE (set chi2_threshold to a very low positive value so that no feature selection is performed at this stage)
  3. Concatenate the transformed data sets (the pandas package is handy for this)
  4. Perform feature selection on the concatenated data set
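
Step 1 can be sketched as follows for data that arrives as a list of series with different lengths. This is only an illustration: `group_by_length` is a hypothetical helper (it is not part of pyts), and it assumes each series is a 2D array of shape (n_features, n_timestamps).

```python
from collections import defaultdict

import numpy as np


def group_by_length(series_list):
    """Group multivariate series by their number of timestamps.

    Hypothetical helper, not part of pyts. Each item of `series_list` is
    assumed to be a 2D array of shape (n_features, n_timestamps_i).
    Returns a dict mapping each length to a pair (sample_indices, X), where
    X has shape (n_samples_with_that_length, n_features, length).
    """
    buckets = defaultdict(list)
    for idx, series in enumerate(series_list):
        buckets[series.shape[1]].append(idx)
    return {
        length: (np.array(indices), np.stack([series_list[i] for i in indices]))
        for length, indices in sorted(buckets.items())
    }


# Toy data: 5 series with 3 features and lengths 80, 100, 80, 90, 100
rng = np.random.RandomState(0)
series_list = [rng.randn(3, n) for n in (80, 100, 80, 90, 100)]
groups = group_by_length(series_list)
for length, (indices, X) in groups.items():
    print(length, indices, X.shape)
```

Each 3D array can then be passed to its own WEASELMUSE instance, and the stored sample indices make it possible to look up the matching labels for each group.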

The main downside of this approach is its high memory (RAM) usage, because feature selection is only performed at the last step. A possible solution, which would lead to the same results, is to loop over the window sizes: instead of passing a list of k window sizes to the window_sizes parameter, you write a for loop over the window sizes and pass a single window size at each iteration.
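
Why the loop gives the same selection can be sketched with plain NumPy: the chi2 statistic is computed for each feature independently, so filtering one block of features at a time keeps exactly the columns that filtering the full concatenated matrix would keep. In this sketch, `chi2_scores` is a stand-in written for illustration (it is meant to follow the same per-feature statistic as `sklearn.feature_selection.chi2`), `select_blockwise` is a hypothetical helper, and the random count matrices play the role of the features produced for each window size.

```python
import numpy as np


def chi2_scores(X, y):
    """Chi-squared statistic per column of a non-negative count matrix X."""
    classes, y_idx = np.unique(y, return_inverse=True)
    Y = np.eye(len(classes))[y_idx]      # one-hot labels, (n_samples, n_classes)
    observed = Y.T @ X                   # per-class totals of each feature
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        return ((observed - expected) ** 2 / expected).sum(axis=0)


def select_blockwise(blocks, y, threshold):
    """Keep, block by block, the columns whose score exceeds `threshold`."""
    kept = []
    for X_block in blocks:               # e.g. one block per window size
        mask = chi2_scores(X_block, y) > threshold  # NaN compares as False
        kept.append(X_block[:, mask])
    return np.hstack(kept)


# Toy data: 40 samples, 4 classes, 3 blocks of 6 count features each
rng = np.random.RandomState(0)
y = np.repeat([0, 1, 2, 3], 10)
blocks = [
    rng.poisson(1.0 + y[:, None] * rng.rand(1, 6)).astype(float)
    for _ in range(3)
]
threshold = 2.0

# Selection on the full matrix (all blocks held in memory at once)
X_all = np.hstack(blocks)
selected_at_once = X_all[:, chi2_scores(X_all, y) > threshold]

# Selection one block at a time
selected_blockwise = select_blockwise(iter(blocks), y, threshold)

print(np.array_equal(selected_at_once, selected_blockwise))
```

In real use, the blocks would come from a generator that transforms the data for one window size at a time, so that only the selected columns of earlier blocks stay in memory.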

Here is an example without the aforementioned optimization (I can modify it to show the optimization if needed):

import numpy as np
from pyts.datasets import load_basic_motions
from pyts.multivariate.transformation import WEASELMUSE
import pandas as pd
from sklearn.feature_selection import chi2

#######################
####### D A T A #######
#######################

# Toy dataset
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)
# X_train.shape = X_test.shape = (40, 6, 100)

# Sample 4 distinct random lengths in the interval [80, 100]
rng = np.random.RandomState(42)
lengths = 80 + rng.choice(21, size=4, replace=False)

# Assign 10 time series to each length
lengths_samples_train_idx = rng.permutation(40).reshape((4, 10))
lengths_samples_test_idx = rng.permutation(40).reshape((4, 10))


#######################
# P A R A M E T E R S #
#######################

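# WEASEL+MUSE parameters
# (chi2_threshold is set to a tiny positive value so that no feature is dropped yet)
weasel_muse_params = {'word_size': 2, 'n_bins': 2, 'window_sizes': [12, 36],
                      'chi2_threshold': 1e-80}
# One transformer per unique length
transformer_list = [WEASELMUSE(**weasel_muse_params) for _ in range(4)]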


#######################
### T R A I N I N G ###
#######################

X_weasel_train = []
# Fit one transformer per length on the time series truncated to that length
for samples_idx, length, transformer in zip(
        lengths_samples_train_idx, lengths, transformer_list):
    X_weasel_train.append(transformer.fit_transform(
        X_train[samples_idx, :, :length], y_train[samples_idx]))

# Concatenate the arrays into a DataFrame (columns are aligned on the words)
# and fill NA values with 0
df_weasel_train = pd.concat([
    pd.DataFrame.sparse.from_spmatrix(
        X, index=samples_idx, columns=np.vectorize(transformer.vocabulary_.get)(np.arange(X.shape[1]))
    )
    for X, samples_idx, transformer in zip(X_weasel_train, lengths_samples_train_idx, transformer_list)
]).fillna(0.)

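# Perform feature selection using chi2 test
# (the rows of df_weasel_train follow the order of the length groups, so the
# labels are reordered with the DataFrame index to match)
chi2_threshold = 2.
chi2_statistics, _ = chi2(df_weasel_train, y_train[df_weasel_train.index.to_numpy()])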
features_idx_to_keep = np.where(chi2_statistics > chi2_threshold)[0]
features_to_keep = df_weasel_train.columns[features_idx_to_keep]
df_weasel_train = df_weasel_train[features_to_keep]


#######################
## I N F E R E N C E ##
#######################

X_weasel_test = []
# Reuse the transformer fitted for each length (no refitting on test data)
for samples_idx, length, transformer in zip(
        lengths_samples_test_idx, lengths, transformer_list):
    X_weasel_test.append(transformer.transform(X_test[samples_idx, :, :length]))

# Concatenate the arrays into a DataFrame (columns are aligned on the words)
# and fill NA values with 0
df_weasel_test = pd.concat([
    pd.DataFrame.sparse.from_spmatrix(
        X, index=samples_idx, columns=np.vectorize(transformer.vocabulary_.get)(np.arange(X.shape[1]))
    )
    for X, samples_idx, transformer in zip(X_weasel_test, lengths_samples_test_idx, transformer_list)
]).fillna(0.)[features_to_keep]
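
To make the concatenation step more concrete, here is a minimal NumPy sketch of what aligning the transformed data sets on their vocabularies amounts to: each transformer learns its own set of words, and a word that is absent from one data set gets a count of 0 there. `align_on_vocabulary` is a hypothetical helper and the words are made up for the illustration. Unlike pd.concat, which keeps columns in order of first appearance, this sketch sorts the columns alphabetically.

```python
import numpy as np


def align_on_vocabulary(matrices, vocabularies):
    """Stack count matrices that have different column vocabularies.

    Hypothetical helper. `matrices[k]` has shape (n_samples_k, n_words_k)
    and `vocabularies[k]` lists the word attached to each of its columns.
    Words missing from a matrix get a count of 0 in the result.
    """
    all_words = sorted(set().union(*vocabularies))
    column_of = {word: j for j, word in enumerate(all_words)}
    n_rows = sum(X.shape[0] for X in matrices)
    X_all = np.zeros((n_rows, len(all_words)))
    row = 0
    for X, words in zip(matrices, vocabularies):
        cols = [column_of[w] for w in words]
        X_all[row:row + X.shape[0], cols] = X
        row += X.shape[0]
    return X_all, all_words


# Two transformed data sets with partially overlapping (made-up) words
X_a = np.array([[1, 2], [0, 1]])
words_a = ["ab", "ba"]
X_b = np.array([[3, 1]])
words_b = ["ba", "bb"]

X_all, all_words = align_on_vocabulary([X_a, X_b], [words_a, words_b])
print(all_words)
print(X_all)
```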

Let me know if this helps you.

@thunderbug1
Author

Oh wow, thanks for the extensive example.
I wouldn't have considered using separate instances of WEASELMUSE, but it makes sense.
I will give it a try.
