# Listen, attend and spell

A minimal TensorFlow 2.0 implementation of Listen, Attend and Spell (https://arxiv.org/abs/1508.01211). For a better understanding of how the model's variables are named, please see the paper above.

Done:

- The model architecture looks right to me. If you find an error in the code, please don't hesitate to open an issue 😊

ToDo:

- Implement data handling for easier training of the model.
- Train on LibriSpeech 100h.
- Implement SpecAugment features (previous SOTA on LibriSpeech, https://arxiv.org/abs/1904.08779); a sketch of the masking step is given after this list.
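For reference, SpecAugment mostly amounts to masking random frequency bands and time ranges of the input spectrogram (the paper also includes a time-warping step, omitted here). Below is a minimal NumPy sketch of the two masking steps applied to a `(timesteps, f_1)` mel spectrogram; the function name and the mask-size defaults are illustrative, not part of this repository:

```python
import numpy as np

def spec_augment(spec, F=27, T=40, n_freq_masks=1, n_time_masks=1):
    """Apply SpecAugment-style frequency and time masking to a
    (timesteps, f_1) mel spectrogram. Mask sizes are illustrative."""
    spec = spec.copy()
    timesteps, n_mels = spec.shape
    for _ in range(n_freq_masks):
        f = np.random.randint(0, F + 1)                # mask width in mel channels
        f0 = np.random.randint(0, max(1, n_mels - f))  # mask start channel
        spec[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):
        t = np.random.randint(0, T + 1)                   # mask width in timesteps
        t0 = np.random.randint(0, max(1, timesteps - t))  # mask start timestep
        spec[t0:t0 + t, :] = 0.0
    return spec
```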

## Usage

The file `model.py` contains the model architecture. Example usage below.

"""
def LAS(dim, f_1, no_tokens):
  dim: Number of hidden neurons for most LSTM's.
  f_1: pBLSTM takes (Batch, timesteps, f_1) as input, f_1 is number of features of the mel spectrogram 
       per timestep. Timestep is the width of the spectrogram.
  No_tokens: Number of unique tokens for input and output vector.
"""

model = LAS(256, 256, 16)
model.compile(loss="mse", optimizer="adam")

# x_1 should have shape (Batch-size, timesteps, f_1)
x_1 = np.random.random((1, 550, 256))

# x_2 should have shape (Batch-size, no_prev_tokens, No_tokens). The token vector should be one-hot encoded.
x_2 = np.zeros((1,12,16))
for n in range(12):
  x_2[0, n, np.random.randint(1, 16)] = 1

# By passing x_1 and x_2 the model will predict the 12th token 
# given by the spectogram and the prev predicted tokens

model.predict([x_1, x_2])
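At inference time the model can be run autoregressively: start from a start-of-sequence token and repeatedly append the most likely next token. A minimal greedy-decoding sketch, assuming `model.predict` returns a single next-token distribution of shape `(batch, no_tokens)` and that token index 0 is reserved as a start symbol (`sos_id` and `max_len` are illustrative names, not part of `model.py`):

```python
no_tokens, sos_id, max_len = 16, 0, 30  # assumed values for illustration

tokens = [sos_id]
for _ in range(max_len):
    # Re-encode the tokens predicted so far as one-hot vectors.
    x_2 = np.zeros((1, len(tokens), no_tokens))
    x_2[0, np.arange(len(tokens)), tokens] = 1
    probs = model.predict([x_1, x_2])
    tokens.append(int(np.argmax(probs)))  # greedy choice of next token
```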