Transcript Search

This code was designed to make a large number of OpenAI's whisper transcripts easily searchable. So one can find a particular passage or term occurs and at what time in the transcript. However, it should work with any folder of .VTT files: non just OpenAI transcripts of podcasts.

I used Whisper OpenAI to transcribe the Accidental Tech Podcast, and I deployed a live search engine web site front end here, powered by this module (specifically SearchTranscripts).

This module has two classes:

LoadTranscripts: This creates a sqlite database and FTS5 virtual table from a folder of transcript files (.vtt or .json files). It creates longer chunks of text (about 300 words each) from the short transcript segments in the original file in order to make the text blocks searchable. It preserves the individual transcript segments in a separate database.
SearchTranscripts: This is a python class that uses the Sqlite database to return a pandas dataframe of the top results for the search query.

Once the sqlite database is created with LoadTranscripts, you can access that database via any sqlite interface you like such as datasette, dbeaver, the command line, SQL alchemy, etc. The SearchTranscripts class is meant to be a simple and convenient way to access the data from python, using the built-in sqlite3 module and pandas, but it is somewhat limited.

Installation:

Clone and cd into the repo's main directory, then run:

pip install .

Usage:


from search_transcripts import LoadTranscripts, SearchTranscripts

l = LoadTranscripts('transcripts') ## will create main.db and bm25.pickle


s = SearchTranscripts()

## Returns a pandas dataframe of the top scoring transcript sections, across all transcripts.

s.search('starship enterprise')

##find the exact phrase

s.search('"starship enterprise"')

JSON transcripts?

So, before I realized Whisper would create a standard .VTT file, I was using the python API directly. It generates a list of python dictionaries. Saving that as JSON seemed logical at the time. I find the JSON much more easily machine readable than .VTT, and can easily be converted to VTT, so I still support this somewhat quirky format. It looks like so:

    [
           {
        "start": 606.1800000000001,
        "end": 610.74,
        "text": " It's important to have a goal to work toward and accomplish rather than just randomly learning and half building things"
    },
    {
        "start": 610.74,
        "end": 613.0600000000001,
        "text": " Having a specific thing you want to build is a good substitute"
    },
    {
        "start": 613.38,
        "end": 619.78,
        "text": " Keep making things until you've made something you're proud enough a proud of enough to show off in an interview by the time you've built a few"
    },
    {
        "start": 619.78,
        "end": 624.26,
        "text": " Things you'll start developing the taste you need to make that determination of what's quote unquote good enough"
    },


    ]

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
src/search_transcripts		src/search_transcripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__about__.py		__about__.py
helper.py		helper.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src/search_transcripts

src/search_transcripts

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

about.py

about.py

helper.py

helper.py

pyproject.toml

pyproject.toml

Repository files navigation

Transcript Search

Installation:

Usage:

JSON transcripts?

About

Releases

Packages

Languages

License

astrowonk/search_transcripts

Folders and files

Latest commit

History

Repository files navigation

Transcript Search

Installation:

Usage:

JSON transcripts?

About

Topics

Resources

License

Stars

Watchers

Forks

Languages