
PCA, KDiscordODetector & Telemanon don't make predictions for all datapoints #99

Open
Jeroenvanwely opened this issue Jun 9, 2023 · 0 comments


I noticed that PCA, KDiscordODetector & Telemanon don't make predictions for all data points provided. The issue appears after training a model with .fit(X) and then calling .predict(Y) for evaluation. Say we want to run the following code:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X # Training data
Y # Eval data
y_true # Eval true labels
model # Either PCA, KDiscordODetector or Telemanon

model.fit(X) # Train model on X
y_pred = model.predict(Y) # Make model prediction on Y

# Analyse evaluation results
accuracy_score(y_true, y_pred)
confusion_matrix(y_true, y_pred)
classification_report(y_true, y_pred)

The last three lines won't run because y_pred is always shorter than y_true. That is because these methods use the function get_sub_matrices(X, window_size, step, return_numpy, flatten, flatten_order) (found in utility.py), which returns a numpy array of shape (valid_len, window_size*n_sequences) where each row is a flattened sub-matrix (a copy of the function is included below). The function cuts the data into sub-matrices based on the window_size and step parameters; if the last points in the data are not enough to form another sub-matrix, they are simply not included in the prediction. Therefore, when analysing the evaluation results, you have to change the example code above to:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X # Training data
Y # Eval data
y_true # Eval true labels
model # Either PCA, KDiscordODetector or Telemanon

model.fit(X) # Train model on X
y_pred = model.predict(Y) # Make model prediction on Y

# Analyse evaluation results (trim y_true to the number of predictions)
accuracy_score(y_true[:len(y_pred)], y_pred)
confusion_matrix(y_true[:len(y_pred)], y_pred)
classification_report(y_true[:len(y_pred)], y_pred)
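For what it's worth, the length mismatch is predictable, so the number of labels to trim can also be derived from window_size and step instead of from len(y_pred). A minimal numpy-only sketch, assuming hypothetical values window_size=5 and step=1 (use whatever values the detector was actually constructed with):

import numpy as np

# Hypothetical settings; use the window_size/step the detector was built with.
window_size, step = 5, 1
n_eval = 100                               # len(Y) == len(y_true)

# Same formula as get_sub_sequences_length() in utility.py (quoted below).
valid_len = int(np.floor((n_eval - window_size) / step)) + 1   # 96

y_true = np.zeros(n_eval, dtype=int)       # stand-in labels
y_pred = np.zeros(valid_len, dtype=int)    # length the detectors actually return

# The trailing n_eval - valid_len labels have no matching prediction.
print(n_eval - valid_len)                  # 4
assert len(y_true[:len(y_pred)]) == len(y_pred)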

Here is the code where the sub_matrices are produced:

import numpy as np
from sklearn.utils import check_array  # assumed module-level imports used by this excerpt


def get_sub_matrices(X, window_size, step=1, return_numpy=True, flatten=True,
                     flatten_order='F'):
    """Chop a multivariate time series into sub sequences (matrices).

    Parameters
    ----------
    X : numpy array of shape (n_samples,)
        The input samples.

    window_size : int
        The moving window size.

    step : int, optional (default=1)
        The displacement for moving window.
    
    return_numpy : bool, optional (default=True)
        If True, return the data format in 3d numpy array.

    flatten : bool, optional (default=True)
        If True, flatten the returned array in 2d.
        
    flatten_order : str, optional (default='F')
        Decide the order of the flatten for multivariate sequences.
        'C' means to flatten in row-major (C-style) order.
        'F' means to flatten in column-major (Fortran-style) order.
        'A' means to flatten in column-major order if a is Fortran contiguous in memory,
        row-major order otherwise. 'K' means to flatten a in the order the elements occur in memory.
        The default is 'F'.

    Returns
    -------
    X_sub : numpy array of shape (valid_len, window_size*n_sequences)
        The numpy matrix where each row stands for a flattened submatrix.
    """
    X = check_array(X).astype(np.float)
    n_samples, n_sequences = X.shape[0], X.shape[1]

    # get the valid length
    valid_len = get_sub_sequences_length(n_samples, window_size, step)

    X_sub = []
    X_left_inds = []
    X_right_inds = []

    # exclude the edge
    steps = list(range(0, n_samples, step))
    steps = steps[:valid_len]

    # print(n_samples, n_sequences)
    for idx, i in enumerate(steps):
        X_sub.append(X[i: i + window_size, :])
        X_left_inds.append(i)
        X_right_inds.append(i + window_size)

    X_sub = np.asarray(X_sub)

    if return_numpy:
        if flatten:
            temp_array = np.zeros([valid_len, window_size * n_sequences])
            if flatten_order == 'C':
                for i in range(valid_len):
                    temp_array[i, :] = X_sub[i, :, :].flatten(order='C')

            else:
                for i in range(valid_len):
                    temp_array[i, :] = X_sub[i, :, :].flatten(order='F')
            return temp_array, np.asarray(X_left_inds), np.asarray(
                X_right_inds)

        else:
            return np.asarray(X_sub), np.asarray(X_left_inds), np.asarray(
                X_right_inds)
    else:
        return X_sub, np.asarray(X_left_inds), np.asarray(X_right_inds)


def get_sub_sequences_length(n_samples, window_size, step):
    """Pseudo chop a univariate time series into sub sequences. Return valid
    length only.

    Parameters
    ----------
    X : numpy array of shape (n_samples,)
        The input samples.

    window_size : int
        The moving window size.

    step : int, optional (default=1)
        The displacement for moving window.

    Returns
    -------
    valid_len : int
        The number of subsequences.
        
    """
    # if X.shape[0] == 1:
    #     n_samples = X.shape[1]
    # elif X.shape[1] == 1:
    #     n_samples = X.shape[0]
    # else:
    #     raise ValueError("X is not a univarite series. The shape is {shape}.".format(shape=X.shape))

    # valid_len = n_samples - window_size + 1
    # valida_len = int_down(n_samples-window_size)/step + 1 
    valid_len = int(np.floor((n_samples - window_size) / step)) + 1
    return valid_len
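
To make the truncation concrete, here is a small worked example of the formula above (the numbers are illustrative, not taken from the issue):

import numpy as np

def valid_length(n_samples, window_size, step):
    # Same computation as get_sub_sequences_length() above.
    return int(np.floor((n_samples - window_size) / step)) + 1

n_samples, window_size, step = 10, 5, 2
n_windows = valid_length(n_samples, window_size, step)     # 3
starts = list(range(0, n_samples, step))[:n_windows]       # [0, 2, 4]

# The three windows cover index ranges [0, 5), [2, 7) and [4, 9):
# sample index 9 is never part of a window, so no prediction covers it.
print(n_windows, starts, [(s, s + window_size) for s in starts])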