Chairum Corpus

A corpus of publicly available speeches by Mexican president Andres Manuel Lopez Obrador (AMLO). Data is currently sourced exclusively from YouTube. For some videos the automatically generated subtitles used as the transcription source could not be retrieved; a future iteration will add a mechanism to transcribe those videos to text.

Image source: https://twitter.com/marianojuarez/status/1148739501604450304

There is currently no interface or API for querying the data (planned for future iterations), but the files are easy to search with a text editor, for example Visual Studio.

Data

The data is available as a CSV file: https://www.kaggle.com/datasets/ivansabik/andres-manuel-lopez-obrador-amlo-speeches

Individual files in JSON format are also provided under the data folder. Additionally, a script is provided to generate a file in CSV format with all records. Sample record:

{
    "video_id": "_uNpYoBHukM",
    "video_thumbnail_url": "https://i.ytimg.com/vi/_uNpYoBHukM/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLBiA5GPXPQfIJ7UxkMLQKQY9gKhhQ",
    "video_url": "https://www.youtube.com/watch?v=_uNpYoBHukM",
    "video_title": "M\u00e9xico garantiza derecho de asilo a solicitantes de Nicaragua. Conferencia presidente AMLO",
    "video_length_seconds": 10097,
    "transcription_with_timestamps": [
        {
            "text": "el INE no se toca",
            "start": 1803.179,
            "duration": 5.761
        },
        {
            "text": "pero tambi\u00e9n",
            "start": 1806.6,
            "duration": 5.959
        },
        {
            "text": "Garc\u00eda Luna no se toca",
            "start": 1808.94,
            "duration": 3.619
        },
        {
            "text": "y en el fondo es",
            "start": 1812.779,
            "duration": 3.081
        },
        {
            "text": "el r\u00e9gimen",
            "start": 1816.159,
            "duration": 6.781
        },
        {
            "text": "corrupto y conservador no se toca",
            "start": 1818.26,
            "duration": 4.68
        },
        {
            "text": "para eso es pero es bueno",
            "start": 1826.039,
            "duration": 4.941
        }
    ],
    "transcription_text": " el INE no se toca pero tambi\u00e9n Garc\u00eda Luna no se toca y en el fondo es el r\u00e9gimen corrupto y conservador no se toca para eso es pero es bueno",
    "transcription_source": "YouTube auto-generated captions",
    "playlist_id": "PLRnlRGar-_296KTsVL0R6MEbpwJzD8ppA",
    "playlist_title": "Conferencias de prensa matutinas",
    "published_time_text": "Streamed 6 months ago",
    "retrieved_time": "2023-09-07 20:16:50.123990"
}
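
Because each record keeps per-caption timestamps, a phrase match can be turned into a link that opens the video at that moment. The following is a minimal sketch and not part of the repository: `find_phrase`, `timestamped_url` and `search_folder` are hypothetical helper names, and it assumes every record follows the sample layout above and is stored as one `*.json` file per video under `data`.

```python
import json
from pathlib import Path


def timestamped_url(video_url, start_seconds):
    """Build a YouTube link that starts playback at the given second."""
    return f"{video_url}&t={int(start_seconds)}s"


def find_phrase(record, phrase):
    """Yield (start, text) for every caption segment containing the phrase."""
    needle = phrase.casefold()
    for segment in record.get("transcription_with_timestamps", []):
        if needle in segment["text"].casefold():
            yield segment["start"], segment["text"]


def search_folder(folder, phrase):
    """Search every JSON record in a folder (assumed: one record per file)."""
    for path in sorted(Path(folder).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        for start, text in find_phrase(record, phrase):
            yield timestamped_url(record["video_url"], start), text
```

Calling `search_folder("data", "no se toca")` would yield one link per matching caption. Matching is done per caption segment, so a phrase split across two segments is missed; searching `transcription_text` instead finds such phrases but loses the timestamp.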

When transcriptions cannot be retrieved from YouTube, the metadata for those videos is stored under the failed folder so that an alternative mechanism for retrieving or generating them can be used in future iterations.

How to run?

Install requirements:

pip3 install -r requirements.txt

Get a YouTube API token and store it in the YOUTUBE_V3_API_KEY environment variable:

export YOUTUBE_V3_API_KEY={YOUR_TOKEN}

Run:

python process.py && python transcribe.py

To generate a single CSV file for the dataset run:

python generate_csv.py
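
For readers who want a different column selection than the provided script produces, the flattening step can be sketched as follows. This is not the code in `generate_csv.py`: the function name `records_to_csv`, the chosen columns and the output path are assumptions for illustration, based only on the sample record shown in the Data section.

```python
import csv
import json
from pathlib import Path

# Columns taken from the sample record; the nested timestamp list is left out
# because it does not fit a flat row.
COLUMNS = [
    "video_id",
    "video_url",
    "video_title",
    "video_length_seconds",
    "transcription_text",
    "transcription_source",
    "playlist_id",
    "playlist_title",
    "published_time_text",
    "retrieved_time",
]


def records_to_csv(data_folder, output_path):
    """Write one CSV row per JSON record and return the number of rows written."""
    rows = 0
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not listed in COLUMNS;
        # keys missing from a record are written as empty strings.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for path in sorted(Path(data_folder).glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            writer.writerow(record)
            rows += 1
    return rows
```

For example, `records_to_csv("data", "speeches.csv")` would write one row per file found under `data`; the output file name here is arbitrary.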

Future work

Add persistence (db backend)
Add API
Gracefully handle phonetic coincidences (Krauze, Krause, Kraus, Krauz) using something like Metaphone or Beider-Morse
Add simple app to search and query the data
Add new field with transcribed text without stop words
Exclude videos of speeches where AMLO is not the main speaker (or does not appear at all)
Exclude videos which are not from a speech or conference
Filter out or annotate parts of videos where the speaker is not AMLO. Even better, add a new field with the speaker, although this could be quite challenging and would require manual work and curation
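
As a rough idea of what the stop-word item above could look like, here is a minimal sketch. It is only an illustration under stated assumptions: `remove_stop_words` is a hypothetical helper, and `SPANISH_STOP_WORDS` is a tiny hand-picked subset, whereas a real implementation would use a complete Spanish stop-word list.

```python
import re

# Tiny illustrative subset (assumption); a complete Spanish list is much longer.
SPANISH_STOP_WORDS = {
    "el", "la", "los", "las", "de", "en", "y", "que",
    "no", "se", "es", "pero", "para", "eso",
}


def remove_stop_words(text):
    """Lowercase the text, split it into words and drop the stop words."""
    words = re.findall(r"\w+", text.casefold())
    return " ".join(word for word in words if word not in SPANISH_STOP_WORDS)
```

The result could be stored in a new field next to `transcription_text`. Note that lowercasing also affects proper names, which may or may not be desirable for later searches.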
