
Long-form audio transcription with Hugging Face Transformers

Excerpt from the Hugging Face documentation:

Automatic speech recognition (ASR) converts a speech signal to text. It is an example of a sequence-to-sequence task, going from a sequence of audio inputs to textual outputs. Voice assistants like Siri and Alexa utilize ASR models to assist users.

This guide is a quick, step-by-step tutorial on building an end-to-end Automatic Speech Recognition (ASR) solution that handles long-form audio, e.g. podcasts, videos, and audiobooks.

Getting started

Video tutorial on YouTube

Reconocimiento automático de voz con Python y HuggingFace en segundos (+ Repo) ("Automatic speech recognition with Python and HuggingFace in seconds (+ repo)")

Requirements

  • SageMaker Studio Lab account. See this explainer video to learn more.
  • Python>=3.7
  • PyTorch>=1.10
  • Hugging Face Transformers
  • Several audio processing libraries (see environment.yml, or the pip-based sketch below)
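
If you prefer pip over the provided Conda environment, a minimal setup cell might look like the sketch below; the package list is inferred from the imports used throughout this tutorial, while the exact versions are pinned in environment.yml:

# Hypothetical notebook cell: install the libraries this tutorial imports
!pip install "torch>=1.10" transformers librosa pydub youtube-dl youtube-transcript-api matplotlib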

Step-by-step tutorial

Clone repo and install dependencies

Follow the steps shown in example_w_HuggingFace.ipynb. Click on Copy to project in the top right corner. This opens the Studio Lab web interface and asks whether you want to clone the entire repo or just the notebook. Clone the entire repo, and click Yes when asked about building the Conda environment automatically. You will then be running in a Python environment with Streamlit and Gradio already installed, along with the other required libraries.

Download audio directly from YouTube

We first set the ID of the YouTube video we want to transcribe and the name of the output audio file, then download the audio track with youtube-dl:

from os.path import exists as path_exists

YouTubeID = 'YOUR_YOUTUBE_ID'
OutputFile = 'YOUR_AUDIO_FILE.m4a'

# download the best-quality m4a audio track, skipping if the file already exists
if not path_exists(OutputFile):
    !youtube-dl -o $OutputFile $YouTubeID --extract-audio --restrict-filenames -f 'bestaudio[ext=m4a]'
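
To verify the download before processing, you can inspect the file with pydub's ffprobe wrapper; this quick sanity check is not part of the original notebook:

from pydub.utils import mediainfo

# ffprobe-backed metadata for the downloaded file
info = mediainfo(OutputFile)
print(info['sample_rate'], info['channels'], info['duration'])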

Process using Librosa/HuggingFace

import librosa
from transformers import pipeline

# Spanish ASR model cited at the bottom of this README
model_name = 'jonatasgrosman/wav2vec2-xls-r-1b-spanish'
pipe = pipeline(model=model_name)

# load the audio at the 16 kHz sampling rate the model expects
speech, sample_rate = librosa.load(OutputFile, sr=16000)
transcript = pipe(speech, chunk_length_s=10, stride_length_s=(4, 2))
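
Here chunk_length_s splits the audio into 10-second windows, and stride_length_s gives each window 4 seconds of left context and 2 seconds of right context; the overlap is used during inference but dropped when the chunk outputs are stitched together, so words are not cut at chunk boundaries. The pipeline returns a dict, so to persist the text for the comparison step at the end (a minimal sketch, assuming the transcripts/ folder used below):

from pathlib import Path

# save the text so it can be diffed against YouTube's transcript later
Path('transcripts').mkdir(exist_ok=True)
with open('transcripts/transcribed_speech_hf_pipelines.txt', 'w') as f:
    f.write(transcript['text'])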

Split long audio into smaller chunks with PyDub

import array
import numpy as np
import pydub
from pydub import AudioSegment
from pydub.utils import get_array_type

def audio_resampler(sound, sample_rate=16000):
    # resample and keep only the left channel as mono
    sound = sound.set_frame_rate(sample_rate)
    left = sound.split_to_mono()[0]
    # reinterpret the raw bytes with an array type matching the bit depth
    bit_depth = left.sample_width * 8
    array_type = get_array_type(bit_depth)
    numeric_array = np.array(array.array(array_type, left._data))

    return np.asarray(numeric_array, dtype=np.double), sample_rate

pydub_speech = pydub.AudioSegment.from_file(OutputFile)
speech, sample_rate = audio_resampler(pydub_speech)
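
As a quick sanity check (a small sketch, not in the original notebook), you can confirm how much audio was loaded:

# duration in seconds = number of samples / sampling rate
print(f'{len(speech) / sample_rate:.1f} s of audio at {sample_rate} Hz')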

transcript = ''
n_chunks = int(len(speech) / sample_rate / 30)  # number of ~30-second chunks
for chunk in np.array_split(speech, n_chunks)[:2]:  # take only the first minute
    output = pipe(chunk)
    transcript = transcript + ' ' + output['text']
    print(output)

transcript = transcript.strip()

Split long audio into chunks based on detected silence

import librosa
from librosa import display
import matplotlib.pyplot as plt

speech, sample_rate = librosa.load(OutputFile, sr=16000)

# treat anything quieter than 50 dB below the peak as silence
non_mute_sections_in_speech = librosa.effects.split(speech, top_db=50)
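
The matplotlib and librosa.display imports above suggest plotting the result; here is a minimal sketch that shades the detected non-silent intervals on the waveform (assuming librosa >= 0.9, where waveplot became waveshow):

fig, ax = plt.subplots(figsize=(14, 4))
librosa.display.waveshow(speech, sr=sample_rate, ax=ax)
for start, end in non_mute_sections_in_speech:
    # librosa.effects.split returns sample indices; divide by the rate to get seconds
    ax.axvspan(start / sample_rate, end / sample_rate, color='green', alpha=0.2)
ax.set(xlabel='time (s)', title='Non-silent sections (top_db=50)')
plt.show()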

transcript = ''
for chunk in non_mute_sections_in_speech[:6]:  # only the first six detected sections
    speech_chunk = speech[chunk[0]:chunk[1]]
    output = pipe(speech_chunk)
    transcript = transcript + ' ' + output['text']
    print(output)

transcript = transcript.strip()

Compare results: open-source vs. Google-generated transcripts

from youtube_transcript_api import YouTubeTranscriptApi

# fetch the Spanish transcript YouTube generated for the same video
yt_segments = YouTubeTranscriptApi.get_transcript(YouTubeID, languages=['es'])
transcript_from_YouTube = ' '.join([i['text'] for i in yt_segments])
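
To feed the comparison below, save this transcript next to the pipeline output (a minimal sketch; the file name matches the one read in the next cell):

# save with the file name the comparison cell expects
with open('transcripts/transcribed_speech_generated_by_youtube.txt', 'w') as f:
    f.write(transcript_from_YouTube)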

from utils import *
from IPython.display import HTML, display

base = "transcripts/transcribed_speech_hf_pipelines.txt"
compare = "transcripts/transcribed_speech_generated_by_youtube.txt"

# diff only the first 1,000 characters of each transcript
a = open(base, 'r').readlines()[0][:1000]
b = open(compare, 'r').readlines()[0][:1000]
print(f'Original: {base} / Compare: {compare}')
display(HTML(html_diffs(a, b)))
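
The HTML diff gives a qualitative view; for a single number you could also compute the word error rate with jiwer (an assumption: jiwer is not used in this repo and would need to be installed separately):

from jiwer import wer

# word error rate of the open-source transcript, treating YouTube's as the reference
error_rate = wer(open(compare).read(), open(base).read())
print(f'WER vs. YouTube transcript: {error_rate:.2%}')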

Figure: highlighted diff between the open-source and the YouTube-generated transcripts

Keep reading

Citations

@misc{grosman2022wav2vec2-xls-r-1b-spanish,
  title={XLS-R Wav2Vec2 Spanish by Jonatas Grosman},
  author={Grosman, Jonatas},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-spanish}},
  year={2022}
}

Disclaimer

  • The content provided in this repository is for demonstration purposes only and is not meant for production. Use your own discretion when applying it.
  • The ideas and opinions outlined in these examples are my own and do not represent the opinions of AWS.