project rev

A proof-of-concept audio-interactive personalized chatbot based on Ted Mosby, a character from the renowned TV show "How I Met Your Mother"


About The Project


This project builds a chatbot that simulates a given persona, real or fictional, through an audio-interactive interface: users talk to it with their voice, and the bot replies in a voice that resembles, to some extent, that of the simulated person.


Some of the modules described below were implemented for educational purposes only. Although their designs are close to the state of the art (e.g. the speech recognizer is inspired by the DeepSpeech 2 architecture), limited resources meant they were not trained or tuned to give the best possible results, so they are best replaced with stronger alternatives if top-notch results and configurability are the goal. For instance, the generator module is fully functional and produces good results, but it lacks conveniences that Hugging Face's model.generate() offers, such as repetition, length, and diversity penalties.

Project Block Diagram

Project Architecture

A simple graphical interface wraps the modules illustrated above so that the user interacts with a single front end.


project rev demo

Speech Recognizer

A module that combines deep neural networks with signal analysis and processing techniques to convert the user's speech audio to text.

Flow Architecture

Speech Recognizer


The model was trained on Mel spectrogram features from the 360-hour "clean" training set of the LibriSpeech ASR corpus for 40 epochs, split into two 15-hour sessions on an NVIDIA V100 instance. It reached a word error rate (WER) of 0.2097601 (about 21.0%) and a character error rate (CER) of 0.06480708 (about 6.5%).

Example Runs

  • Run #1:

    • Input: "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour, fat and sauce"

    • Prediction: "he hoped there would be sto for dinner turnips and carats and bruised potatoes and fat mutton pieces to be laitled out in thick pepperd flowr fattaind sauce"

  • Run #2:

    • Input: "Also, a popular contrivance whereby love making may be suspended but not stopped during the picnic season"

    • Prediction: "also a popularcandrivans wher by love making maybe suspended but not stopped during the picnic xeason"
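WER and CER are conventionally computed as the Levenshtein edit distance between the reference and the prediction, divided by the reference length, at word level and character level respectively. The sketch below shows that computation in plain Python. The `normalize` helper (lowercasing and stripping punctuation) is an assumption, since the README does not say how transcripts were normalized before scoring, so the printed numbers need not match the project's own evaluation.

```python
import re


def normalize(text):
    """Lowercase and keep only letters, apostrophes and spaces (assumed scheme)."""
    return " ".join(re.sub(r"[^a-z' ]", " ", text.lower()).split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    ref_chars = normalize(reference)
    return edit_distance(ref_chars, normalize(hypothesis)) / len(ref_chars)


# Run #2 from the example runs above
reference = ("Also, a popular contrivance whereby love making may be "
             "suspended but not stopped during the picnic season")
prediction = ("also a popularcandrivans wher by love making maybe "
              "suspended but not stopped during the picnic xeason")
print(f"WER = {wer(reference, prediction):.3f}, CER = {cer(reference, prediction):.3f}")
```

Because a single merged or split word counts as one or more word-level errors but only a few character-level errors, WER is typically much higher than CER, which matches the gap between the two reported figures.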

Language Model

A purely statistical n-gram model, trained on the Tweets Blogs News dataset, is used to filter out semantic and syntactic errors introduced unintentionally by the previous stage, the speech recognition phase.

Flow Architecture

Language Model


Using perplexity as the performance metric, the module reached a perplexity of 6.96 on a 2-million-sentence dataset.
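Perplexity is the exponential of the average negative log-probability that the model assigns to each token of an evaluation text, so lower is better. A minimal sketch for a bigram model follows. The add-one smoothing, the `<bos>`/`<eos>` boundary markers, and all function names are illustrative assumptions, not the project's actual implementation.

```python
import math
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and bigrams, padding each sentence with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<bos>"] + sentence.split() + ["<eos>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one (Laplace) smoothing, an assumed choice."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def perplexity(sentences, unigrams, bigrams):
    """exp of the average negative log-probability per predicted token."""
    log_prob, n_tokens = 0.0, 0
    for sentence in sentences:
        tokens = ["<bos>"] + sentence.split() + ["<eos>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_prob(prev, word, unigrams, bigrams))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


train = ["i read a book", "i read a paper", "how is it going"]
unigrams, bigrams = count_ngrams(train)
print(perplexity(["i read a book"], unigrams, bigrams))   # word order seen in training
print(perplexity(["book a read i"], unigrams, bigrams))   # scrambled word order
```

The scrambled sentence uses the same words but only unseen bigrams, so it receives a higher perplexity than the sentence seen in training.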

Example Runs

  • Run #1:

    • Input: "hellow ted how is it goink"

    • Output: "hello ted how is it going"

  • Run #2:

    • Input: "i red a book"

    • Output: "i read a book"

  • Run #3:

    • Input: "thee logik of hogan winnink the wordl titls at the end mmade no sense"

    • Output: "the logic of hogan winnin the woudl title at the end made no sense"
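The runs above show two kinds of fixes: non-words replaced by known words ("goink" to "going") and valid but out-of-context words replaced by likelier ones ("red" to "read"). One way an n-gram model can do both is to generate candidate words within one edit of each input word and keep the candidate that best fits the preceding word. The sketch below illustrates that idea only; the candidate generation, the greedy left-to-right scoring, and every name in it are assumptions, since the module's actual flow is not described in the text.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def count_ngrams(sentences):
    """Count unigrams and bigrams, with a start marker before each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<bos>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def edits1(word):
    """All strings one deletion, substitution or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + replaces + inserts)


def correct(sentence, unigrams, bigrams):
    """Greedy left-to-right correction using bigram counts, then unigram counts."""
    prev, output = "<bos>", []
    for word in sentence.split():
        known = {c for c in (edits1(word) | {word}) if c in unigrams}
        candidates = known or {word}  # keep unknown words that have no known neighbour
        best = max(candidates,
                   key=lambda c: (bigrams[(prev, c)], unigrams[c], c == word, c))
        output.append(best)
        prev = best
    return " ".join(output)


corpus = ["i read a book", "hello ted how is it going", "the red car"]
unigrams, bigrams = count_ngrams(corpus)
print(correct("i red a book", unigrams, bigrams))
print(correct("hellow ted how is it goink", unigrams, bigrams))
```

A greedy pass like this cannot recover from an early wrong choice and only reaches words one edit away, which is consistent with partially corrected outputs such as Run #3.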

Core Module

A transformer-based module that takes the user's latest utterance as input and outputs logits (scores) over the vocabulary, one per word, indicating how likely each word is to come next in the given sequence.


The dataset used to fine-tune a pre-trained BlenderBot transformer model consists of the script of roughly the first six seasons of "How I Met Your Mother". It was scraped from this website and processed with this script to follow the format shown below.

Dataset format:

<bot> Kids, I'm going to tell you an incredible story. The story of how I met your mother
<s> Are we being punished for something?
<bot> No
<s> Yeah, is this going to take a while?
<bot> Yes. Twenty-five years ago, before I was dad, I had this whole other life.
<bot> It was way back in 2005. I was twenty-seven just starting to make it as an architect and living in New York with my friend Marshall, my best...

The <bot> tag marks lines spoken by the character whose responses the core module should learn to predict, while the <s> tag marks lines spoken by any other character. This makes it possible to train the core module on other characters, or even on real-life conversations of a particular person, provided the conversation data is supplied in the format above.
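To make the format concrete, the sketch below parses tagged lines and builds (input, response) training pairs. The pairing rule used here, matching each `<bot>` line with the `<s>` line immediately before it and skipping `<bot>` lines that have no such predecessor, is an assumption for illustration; the project's own preprocessing may build its examples differently. The function names are hypothetical.

```python
TAGS = ("<bot>", "<s>")


def parse_dialogue(lines):
    """Return (tag, text) tuples for every line that starts with a known tag."""
    turns = []
    for line in lines:
        line = line.strip()
        for tag in TAGS:
            if line.startswith(tag):
                turns.append((tag, line[len(tag):].strip()))
                break
    return turns


def build_pairs(turns):
    """Pair each <bot> turn with the <s> turn right before it, if there is one."""
    pairs = []
    for (prev_tag, prev_text), (tag, text) in zip(turns, turns[1:]):
        if prev_tag == "<s>" and tag == "<bot>":
            pairs.append((prev_text, text))
    return pairs


sample = """\
<bot> Kids, I'm going to tell you an incredible story. The story of how I met your mother
<s> Are we being punished for something?
<bot> No
<s> Yeah, is this going to take a while?
<bot> Yes. Twenty-five years ago, before I was dad, I had this whole other life.
""".splitlines()

for user_line, bot_line in build_pairs(parse_dialogue(sample)):
    print(f"input: {user_line!r} -> target: {bot_line!r}")
```

Swapping in another character or a real person's conversations only requires re-tagging the data; the parsing itself does not depend on who `<bot>` is.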

Example Runs

  • Run #1:

    • User: "What do you think of Stella"

    • Model: "I love her so much. She's the best thing that has ever happened to me. In a nutshell, I think I'm in love with her."

  • Run #2:

    • User: "tell me more about the Olive theory"

    • Model: "The olive theory is based on my friends Marshal and Lilly. He hates olives, she loves them"

  • Run #3:

    • User: "are we playing laser tag tomorrow night?"

    • Model: "Yeah, sure. Laser tag, it's a good idea. We'll do it."


The generator module is solely responsible for producing the simulated persona's response from the logits (scores) supplied by the core module.

It can be configured to use any of the following decoding methods:

  • Greedy search
  • Top-K sampling
  • Top-p (nucleus) sampling
  • Best search (experimental implementation, not practical to use at all)
  • Beam search
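Three of the listed methods differ only in how a single next token is picked from the logits, which the following sketch illustrates in plain Python: greedy search takes the highest-scoring token, top-k sampling draws from the k most probable tokens, and top-p (nucleus) sampling draws from the smallest set of top tokens whose cumulative probability reaches p. All names here are hypothetical and the code is not taken from the project's generator; beam search and the experimental search method are omitted because they track multiple partial sequences rather than one step.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(dist, rng=random):
    """Draw one token index from a {index: probability} distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


vocab = ["yes", "no", "maybe", "never"]   # toy vocabulary
logits = [2.0, 1.0, 0.5, -1.0]            # toy scores for the next word
probs = softmax(logits)
rng = random.Random(0)                    # seeded for repeatable sampling

print("greedy:", vocab[greedy(logits)])
print("top-k :", vocab[sample(top_k_filter(probs, k=2), rng)])
print("top-p :", vocab[sample(top_p_filter(probs, p=0.9), rng)])
```

Greedy search is deterministic and tends to repeat itself, while the two sampling methods trade some coherence for variety; k and p control how much of the tail of the distribution is allowed into the draw.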

Speech Synthesizer

misbah4064/Real-Time-Voice-Cloning is used to convert the core module's output text to speech in a voice similar to that of the simulated person.

Possible Improvements

  • Use a larger dataset for the speech recognizer (e.g. Mozilla's Common Voice)
  • Train a bigger instance of the speech recognizer model for many more epochs
  • Improve the speech recognizer's ability to recognize voices not present in the dataset
  • Use a bigger corpus such as OpenSubtitles' OPUS corpus for training the n-gram language model or, better yet, use a neural network architecture instead of a probabilistic model for improved results
  • Clean the core module dataset and extend it to all 9 seasons
  • Use a better and bigger pre-trained model for core module fine-tuning (e.g. BlenderBot 2.0)
  • Address the core module's occasional factual errors by pairing the transformer-based model with some kind of knowledge base or long-term memory

Built With


Getting Started


  • Set up Python using this link

  • Download and install FFmpeg

  • Install the packages listed in requirements.txt with the following line, which installs them one at a time so that a single failing package does not abort the rest:

    cat requirements.txt | xargs -n 1 pip install

  • Create a file in the project directory with the following content:

    COMET_API_KEY = "zJemHQ8mJtC2Cgv6bxUcsBxxd"
    FLASK_SECRET_KEY = "8b3HefzzLm2qYEce#"
  • To use Comet while training, set COMET_API_KEY to a valid API key, which can be obtained free of charge from here

  • Create a models directory in the project directory. It should contain the saved trained instances and dictionaries needed to run the project. You can either train each module, which automatically saves the required files in /models, or download those files from here.

  • The /models directory structure should look similar to the following for the project to run successfully:

    ├── core-model-3160-1.8512854990749796
    │   ├── config.json
    │   └── pytorch_model.bin
    ├── speech-synthesizer
    │   ├── synthesizer
    │   │   ├── checkpoint
    │   │   ├──
    │   │   ├── tacotron_model.ckpt-278000.index
    │   │   └── tacotron_model.ckpt-278000.meta
    │   ├──
    │   ├── TED_VOICE_SAMPLE.wav
    │   └──
    ├── bigrams_tuples
    ├── names
    ├── trigrams_tuples
    └── unigrams_tuples


  • Each module can be configured using its own file. However, the Flask web app which serves as the 'wrapping' module for the project is configured through

  • Perhaps the most interesting parameter there is APP_MODE. It has 3 possible values:

    • APP_MODE = "TEXT_CHAT_MODE" would render a simple text chat interface that essentially serves as a modular test for the core module.

      Text Chat Interface

    • APP_MODE = "VOICE_CHAT_MODE" would render a voice chat interface where all modules are loaded and used for inference. WARNING: this mode requires a relatively powerful machine with at least 16 GB of memory, so please run it with caution.

      Voice Chat Interface

    • Because the previous mode is resource-intensive and the speech recognizer's results depend on the loaded saved instance, a lighter mode is also implemented. APP_MODE = "VOICE_CHAT_LITE_MODE" has the same interface as the previous mode, but it skips loading both the speech recognizer module and the language model and instead uses the SpeechRecognition interface of the browser's Web Speech API.

  • Finally, run the following line and launch the Flask app by going to using a web browser:

