This package provides a time series tokenizer for attention-based classifiers, built from two layers of Locality-Sensitive Hashing (LSH) and a final embedding layer trained with a triplet contrastive loss.

This work targets Time Series Classification tasks: given a family of multivariate time series instances, each labeled with a class from a discrete set, the goal is to predict the class of new instances. The notation used throughout is summarized below.
Symbol | Name | Type |
---|---|---|
 | Random process / Time series family | |
$Y_i$ | Instance / Time series instance | |
 | Time series value domain | |
 | Time series time index | |
 | Class labels | |
$y(t)$ | A sample of the instance $i$ at time $t$ | |
 | Number of attributes/variables per sample | |
 | Number of instances | |
$i$ | Instance index | |
 | Number of samples, per instance | |
$t$ | Sample time index | |
$k$ | Number of class labels | |
$W$ | Length of the sliding window (number of samples) | Hyperparameter |
 | Length of the increment of the sliding window (number of samples) | Hyperparameter |
$H_S$ | Number of sample-level LSH functions | Hyperparameter |
$H_W$ | Number of patch-level LSH functions | Hyperparameter |
$N_T$ | Number of generated tokens | |
$j$ | Data window and token index | |
$E$ | Token embedding size | Hyperparameter |
$\tau(j)$ | Token | |
$M_T$ | Number of Transformer layers at the Classifier | Hyperparameter |
$M_H$ | Number of Attention Heads for each Transformer layer at the Classifier | Hyperparameter |
$M_F$ | Number of Hidden Units for each Transformer layer at the Classifier | Hyperparameter |
The tokenizer works by splitting the time series into overlapping patches using a sliding window, parametrized by the window length $W$ and the increment between consecutive windows, both hyperparameters. A time series instance $Y_i$ is thus turned into $N_T$ overlapping windows. The tokenization model is a function that maps each window $j$ to an embedded token $\tau(j) \in \mathbb{R}^E$, and is composed of the following layers:
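As an illustration of the windowing step, here is a minimal sketch assuming one multivariate instance stored as a NumPy array of shape `(T, n)`; the window length `W` and increment `S` are the hyperparameters from the table above (function and variable names are illustrative, not the package's API):

```python
import numpy as np

def sliding_windows(y, W, S):
    """Split a multivariate series y of shape (T, n) into overlapping
    patches of length W, advancing S samples per step."""
    T = y.shape[0]
    starts = range(0, T - W + 1, S)
    return np.stack([y[j:j + W] for j in starts])

# A toy instance with T = 10 samples and n = 2 attributes.
y = np.arange(20).reshape(10, 2)
patches = sliding_windows(y, W=4, S=2)
print(patches.shape)  # (4, 4, 2): N_T = 4 windows of W = 4 samples each
```

With $T$ samples, this yields $N_T = \lfloor (T - W) / S \rfloor + 1$ windows.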
- Sample-level hashing: a set of $H_S$ LSHs based on randomized projections, applied to each sample $y(t) \in Y$, producing an output $h_s(t) \in \mathbb{N}^{H_S}$.
- Patch-level hashing: a set of $H_W$ LSHs based on randomized projections, applied to each window of sample hashes $h_s(j), \ldots, h_s(j+W)$, producing an output token $h_p(j) \in \mathbb{N}^{H_W}$.
- Layer normalization: each token $h_p(j)$ is normalized, such that $h_p(j) \sim \mathcal{N}(0, 0.1)$.
- Contrastive layer: a linear layer with $H_W$ inputs and $E$ outputs that transforms the token $h_p(j)$ into the embedded token $\tau(j)$.
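The two hashing levels can be sketched as follows. This assumes an E2LSH-style random-projection hash family, $h(x) = \lfloor (a \cdot x + b) / r \rfloor$; the package's exact hash construction is not specified here, so treat the details as illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lsh(n_hashes, dim, r=1.0):
    """Build n_hashes random-projection LSH functions over R^dim.
    Each hash is floor((a . x + b) / r), an E2LSH-style construction
    (an assumption; outputs may be negative integers in this sketch)."""
    A = rng.normal(size=(n_hashes, dim))
    b = rng.uniform(0, r, size=n_hashes)
    return lambda x: np.floor((A @ x + b) / r).astype(int)

n, W, H_S, H_W = 3, 4, 8, 16
sample_lsh = make_lsh(H_S, n)        # sample-level: one code per sample
patch_lsh = make_lsh(H_W, W * H_S)   # patch-level: one token per window

y = rng.normal(size=(10, n))                          # one instance, T = 10
hs = np.stack([sample_lsh(y[t]) for t in range(10)])  # (10, H_S)
hp = patch_lsh(hs[0:W].ravel())                       # token for window j = 0
print(hp.shape)  # (16,) == (H_W,)
```

The sample-level codes are hashed again per window, so nearby windows with similar sample codes tend to collide into similar tokens.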
The LSH layers are sampled once at model creation and are not trainable; only the final linear layer is trained, using a contrastive metric learning approach: the same model embeds samples of equal or different classes, and the distances between the resulting embeddings drive the weight updates, minimizing the distance between intra-class embeddings and maximizing the distance between inter-class embeddings. This research adopts the Triplet Loss as the contrastive error: given a random anchor sample, a positive sample of the same class, and a negative sample of a different class, the loss penalizes triplets in which the anchor-positive distance is not smaller than the anchor-negative distance by at least a margin.
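The triplet criterion above can be written compactly; this is a minimal NumPy sketch with Euclidean distance and an illustrative margin (the package's actual distance and margin choices are not stated here):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the anchor toward a same-class (positive)
    embedding and push it away from a different-class (negative) one.
    Zero loss once d(a, p) + margin <= d(a, n)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # same class, close: small d_pos
n = np.array([2.0, 0.0])  # other class, far: large d_neg
print(triplet_loss(a, p, n))  # 0.0 -- this triplet already satisfies the margin
```

Only triplets that violate the margin contribute gradient, so training focuses on the hard cases near class boundaries.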
An Attention Classifier was proposed to assess the performance of the tokenizer, composed of the following layers:

- Tokenization and Embedding: the Contrastive-LSH model transforms an input time series instance $Y_i$ into a set of $N_T$ tokens $[\tau_0, \ldots, \tau_{N_T}]$.
- Positional Embedding: a simple nn.Embedding layer with $N_T$ vectors, which are added to the token embeddings to represent the temporal position of each token in the input sequence. The vectors are initialized with a linear sequence between -0.2 and 0.2, and later adjusted by the training procedure.
- Transformers: a sequence of $M_T$ Transformer layers, each with $M_H$ attention heads and $M_F$ linear units in its feed-forward layer.
- Classification: a simple linear layer with $N_T \times E$ inputs and $k$ outputs, followed by a Log-Softmax.
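To make the dimensions concrete, here is a shape-only sketch of the embedding and classification stages (the Transformer stack is elided; broadcasting one linspace value per position over the embedding dimension is an interpretation of the initialization described above, and the zero classifier weights are placeholders for learned parameters):

```python
import numpy as np

N_T, E, k = 8, 16, 3  # tokens, embedding size, number of classes

tokens = np.random.default_rng(1).normal(size=(N_T, E))

# Positional embedding: N_T vectors initialized with a linear
# sequence in [-0.2, 0.2], one value per position, broadcast over E.
pos = np.linspace(-0.2, 0.2, N_T)[:, None] * np.ones((1, E))
x = tokens + pos                       # (N_T, E), input to the Transformers

# Classification head: flatten to N_T * E features, project to k logits.
W_cls = np.zeros((N_T * E, k))         # placeholder for learned weights
logits = x.ravel() @ W_cls             # (k,)
log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
print(log_probs.shape)  # (3,)
```

Flattening all $N_T$ token embeddings into one $N_T \times E$ vector means the head sees every position at once, at the cost of fixing $N_T$ at model creation.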
The classifier is trained with a Negative Log-Likelihood loss, which, combined with the Log-Softmax output layer, is equivalent to the cross-entropy.