Collection of text corpora for publicly available speeches from Mexican president Andres Manuel Lopez Obrador (AMLO) sourced from YouTube. The dataset includes his daily morning conferences (conferencias mañaneras) 😴🪿

ivansabik/chairum-corpus

Chairum Corpus

Mexico is fine

A corpus of publicly available speeches by Mexican president Andres Manuel Lopez Obrador. Data is currently sourced exclusively from YouTube. For some videos, the automatically generated subtitles used as the transcription source could not be retrieved; future iterations will add a mechanism to transcribe those videos into text.

Image source: https://twitter.com/marianojuarez/status/1148739501604450304

There is currently no interface or API for querying the data (planned for future iterations), but searching is simple with a text editor, for example Visual Studio:

Search locally
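The same kind of local search can also be scripted. The following is a minimal sketch, not part of the repository: it assumes one JSON file per video sits directly under `data/` and that each file has the `video_title`, `video_url` and `transcription_text` fields shown in the sample record in the Data section. The names `normalize` and `search_corpus` are hypothetical.

```python
import json
import unicodedata
from pathlib import Path


def normalize(text):
    """Lowercase and strip accents so 'regimen' also matches 'régimen'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def search_corpus(phrase, data_dir="data"):
    """Yield (video_title, video_url) for each record whose text contains phrase."""
    needle = normalize(phrase)
    # Assumption: JSON records live directly under data_dir, one file per video.
    for path in sorted(Path(data_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)
        # Records with a missing or empty transcription are skipped, not raised on.
        if needle in normalize(record.get("transcription_text") or ""):
            yield record.get("video_title"), record.get("video_url")
```

Under those assumptions, `list(search_corpus("el ine no se toca"))` run from the repository root would return the title and URL of every video whose transcription contains that phrase, regardless of case or accents.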

Data

The data is available as a CSV file: https://www.kaggle.com/datasets/ivansabik/andres-manuel-lopez-obrador-amlo-speeches

Individual files in JSON format are also provided under the data folder. Additionally, a script is provided to generate a file in CSV format with all records. Sample record:

{
    "video_id": "_uNpYoBHukM",
    "video_thumbnail_url": "https://i.ytimg.com/vi/_uNpYoBHukM/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLBiA5GPXPQfIJ7UxkMLQKQY9gKhhQ",
    "video_url": "https://www.youtube.com/watch?v=_uNpYoBHukM",
    "video_title": "M\u00e9xico garantiza derecho de asilo a solicitantes de Nicaragua. Conferencia presidente AMLO",
    "video_length_seconds": 10097,
    "transcription_with_timestamps": [
        {
            "text": "el INE no se toca",
            "start": 1803.179,
            "duration": 5.761
        },
        {
            "text": "pero tambi\u00e9n",
            "start": 1806.6,
            "duration": 5.959
        },
        {
            "text": "Garc\u00eda Luna no se toca",
            "start": 1808.94,
            "duration": 3.619
        },
        {
            "text": "y en el fondo es",
            "start": 1812.779,
            "duration": 3.081
        },
        {
            "text": "el r\u00e9gimen",
            "start": 1816.159,
            "duration": 6.781
        },
        {
            "text": "corrupto y conservador no se toca",
            "start": 1818.26,
            "duration": 4.68
        },
        {
            "text": "para eso es pero es bueno",
            "start": 1826.039,
            "duration": 4.941
        }
    ],
    "transcription_text": " el INE no se toca pero tambi\u00e9n Garc\u00eda Luna no se toca y en el fondo es el r\u00e9gimen corrupto y conservador no se toca para eso es pero es bueno",
    "transcription_source": "YouTube auto-generated captions",
    "playlist_id": "PLRnlRGar-_296KTsVL0R6MEbpwJzD8ppA",
    "playlist_title": "Conferencias de prensa matutinas",
    "published_time_text": "Streamed 6 months ago",
    "retrieved_time": "2023-09-07 20:16:50.123990"
}
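Because every caption segment carries a `start` offset in seconds, a match can be turned into a link that opens the video at that moment. This is a rough sketch based only on the fields in the sample record above; `timestamped_links` is a hypothetical helper, not something the repository provides.

```python
def timestamped_links(record, phrase):
    """Return (start_seconds, text, url) for caption segments containing phrase."""
    needle = phrase.lower()
    hits = []
    for segment in record.get("transcription_with_timestamps") or []:
        if needle in segment["text"].lower():
            start = int(segment["start"])  # truncate to whole seconds
            # The "t" query parameter asks YouTube to start playback at that second.
            url = f"{record['video_url']}&t={start}s"
            hits.append((start, segment["text"], url))
    return hits


# A trimmed copy of the sample record, keeping only the fields used here.
record = {
    "video_url": "https://www.youtube.com/watch?v=_uNpYoBHukM",
    "transcription_with_timestamps": [
        {"text": "el INE no se toca", "start": 1803.179, "duration": 5.761},
        {"text": "pero también", "start": 1806.6, "duration": 5.959},
        {"text": "García Luna no se toca", "start": 1808.94, "duration": 3.619},
    ],
}

for start, text, url in timestamped_links(record, "no se toca"):
    print(start, text, url)
```

Matching is done per segment, so a phrase that straddles two caption segments will not be found this way; searching `transcription_text` first and then narrowing down by segment is one way around that.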

Whenever it's not possible to retrieve the transcriptions from YouTube, metadata for the videos is stored under failed so that an alternative mechanism for retrieving or generating them can be used in future iterations.

How to run?

  1. Install requirements:
pip3 install -r requirements.txt
  2. Get a YouTube API token and set an environment variable with this value:
export YOUTUBE_V3_API_KEY={YOUR_TOKEN}
  3. Run:
python process.py && python transcribe.py
  4. To generate a single CSV file for the dataset run:
python generate_csv.py
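For readers who want a custom CSV layout, the flattening step can be sketched as below. This is an illustration only and is not the code of `generate_csv.py`: the column list is an assumption taken from the scalar fields of the sample record, the function name `json_dir_to_csv` is hypothetical, and the files are assumed to sit directly under the given folder.

```python
import csv
import json
from pathlib import Path

# Scalar fields from the sample record. The nested timestamp list is left out
# because it does not fit in a single CSV cell without extra encoding.
COLUMNS = [
    "video_id", "video_url", "video_title", "video_length_seconds",
    "transcription_text", "transcription_source",
    "playlist_id", "playlist_title", "published_time_text", "retrieved_time",
]


def json_dir_to_csv(data_dir, out_path):
    """Write one CSV row per JSON record found in data_dir; return the row count."""
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" drops keys that are not listed in COLUMNS;
        # keys missing from a record are written as empty cells.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for path in sorted(Path(data_dir).glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            writer.writerow(record)
            rows += 1
    return rows
```

For example, `json_dir_to_csv("data", "corpus.csv")` would write one row per record, with `corpus.csv` being an arbitrary output name.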

Future work

  • Add persistence (db backend)
  • Add API
    • Handle phonetic coincidences gracefully (Krauze, Krause, Kraus, Krauz) using something like Metaphone or Beider-Morse
  • Add simple app to search and query the data
  • Add new field with transcribed text without stop words
  • Exclude videos from speeches where main speaker is not AMLO (or does not include him)
  • Exclude videos which are not from a speech or conference
  • Filter out or annotate parts of videos where the speaker is not AMLO. Even better, add a new field with the speaker, though this could be quite challenging and would require manual work and curation
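
To illustrate the phonetic matching item above, here is a small sketch that groups spelling variants under one key. It deliberately uses American Soundex rather than Metaphone or Beider-Morse, because Soundex fits in a few lines of standard-library Python; it is coarser and English-oriented, so treat it as a placeholder for the idea rather than a recommendation. The `soundex` function is written for this example.

```python
import unicodedata

# Letter-to-digit table of American Soundex; vowels, h, w and y have no code.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        _CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code for name ('' if no letters)."""
    # Decompose accented letters and keep only the plain a-z characters.
    decomposed = unicodedata.normalize("NFD", name.lower())
    letters = [ch for ch in decomposed if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # 'h' and 'w' do not separate letters with the same code; vowels do.
        if ch not in "hw":
            previous = digit
    return (code + "000")[:4]


for variant in ("Krauze", "Krause", "Kraus", "Krauz"):
    print(variant, soundex(variant))  # each variant prints the code K620
```

Indexing transcribed surnames by such a key would let a query for one spelling retrieve the others, at the cost of some false matches between unrelated names that share a code.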