linagora-labs/FREDSum

FREDSum: A Dialogue Summarization Corpus for French Political Debates

Overview

This repository contains the FREDSum dataset, a comprehensive collection of transcripts and metadata from various political and public debates in France. The dataset aims to provide researchers, linguists, and data scientists with a rich source of debate content for analysis and natural language processing tasks.

Further details are provided in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates" (see Acknowledgement below). While we continue to improve the dataset, the version of the transcripts and summaries used in the FREDSum paper can be found in the release v0.1-emnlp-2023.

The dataset can also be found on Hugging Face and Ortolang.

Data from the French National Assembly and French Senate that were used to continue the pretraining of the BARThez language model, as described in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates", are available at FREDSum Parliament. (The portion from the National Assembly can also be found on Hugging Face.)

Dataset Description

The dataset includes transcripts from a range of French political debates and discussions. Each transcript is provided along with abstractive and extractive summaries.

Structure

The dataset is organized as follows:

  • transcripts: contains the debate transcripts.

    • Each file is named in the format Speakers--Partie_X_Theme.txt. The naming structure is consistent across folders, so that files can be matched by name.
  • summary_extractive: contains two subfolders, one for each of the two annotators who produced the extractive summaries.

  • summary_abstractive: contains three subfolders, one for each of three types of abstractive summary:

      1. Contains summaries that aim to preserve the original wording as much as possible (more extractive) and to limit co-reference resolution by using proper names instead of pronouns.
      2. Contains summaries that aim to preserve the original wording as much as possible (more extractive) while allowing for co-reference.
      3. Contains summaries that have been written freely (more abstractive).
  • summary_abstractive_prediction: contains abstractive summaries generated by three models:

    Please note that there are no predicted summaries for the debate 'Destaing_Mitterand_2'.

  • community: contains abstractive communities for abstractive summaries 1 and 3. In an abstractive community, a single sentence from the abstractive summary is paired with the set of sentences from the corresponding extractive community that support it.

  • FREDSum_test.json: a list of the names of the test files.
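
The naming convention and the test list described above can be used to match a transcript with its summaries and to separate the test files from the rest. The following is a minimal sketch and not part of the repository: the helpers `parse_name`, `split_test` and `load_test_names` are hypothetical, the regular expression assumes that the X in Partie_X is a number and that the speaker part contains no "--Partie_", and FREDSum_test.json is assumed to hold a flat JSON list of names.

```python
import json
import re

# Assumed pattern for "Speakers--Partie_X_Theme"; X is assumed to be numeric.
NAME_RE = re.compile(r"^(?P<speakers>.+?)--Partie_(?P<part>\d+)_(?P<theme>.+)$")


def _stem(name):
    """Drop a trailing ".txt" so names compare equal with or without it."""
    return name[:-4] if name.endswith(".txt") else name


def parse_name(filename):
    """Split a file name into speakers, part number and theme."""
    match = NAME_RE.match(_stem(filename))
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "speakers": match["speakers"],
        "part": int(match["part"]),
        "theme": match["theme"],
    }


def split_test(names, test_names):
    """Partition names into (test, rest) by comparing extension-free names."""
    test_stems = {_stem(n) for n in test_names}
    test = [n for n in names if _stem(n) in test_stems]
    rest = [n for n in names if _stem(n) not in test_stems]
    return test, rest


def load_test_names(path="FREDSum_test.json"):
    """Read the test list; assumes a flat JSON list of strings."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Because the naming structure is shared across folders, the same extension-free name could in principle be used to look up the matching file under transcripts, summary_extractive and summary_abstractive.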

Versions

  • v0.1-emnlp-2023: the original transcripts and summaries used in the paper "FREDSum: A Dialogue Summarization Corpus for French Political Debates"
  • the current version contains:
    • standardized speaker tags (e.g. "GA" -> "Gabriel Attal")
    • corrected text for abstractive summaries (1-3)
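
To illustrate what standardizing a speaker tag involves, the sketch below expands an abbreviated tag at the start of a transcript line. It is not the script used to produce the current version: the "tag : utterance" line layout, the `SPEAKERS` mapping and the `expand_tag` function are assumptions made for this example, and only the "GA" -> "Gabriel Attal" pair is taken from the list above.

```python
# Hypothetical mapping; only the "GA" entry comes from the README.
SPEAKERS = {"GA": "Gabriel Attal"}


def expand_tag(line, speakers=SPEAKERS, sep=" : "):
    """Replace a leading abbreviated speaker tag with the full name.

    Assumes each line looks like "<tag><sep><utterance>". Lines without
    the separator, or with an unknown tag, are returned unchanged.
    """
    tag, found, utterance = line.partition(sep)
    if not found or tag not in speakers:
        return line
    return speakers[tag] + sep + utterance


print(expand_tag("GA : Bonsoir."))  # prints "Gabriel Attal : Bonsoir."
```

Leaving unknown tags untouched keeps the function safe to run over a whole transcript, since lines that do not follow the assumed layout pass through as they are.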

Acknowledgement

This corpus was created as a part of the SUMM-RE (ANR-20-CE23-0017) and CORTEX2 (Horizon Europe CL4-2021-HUMAN-01-25) research projects.

If you use this dataset, please cite the following article:
