BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions

This repository contains BasqueParl, a bilingual corpus for political discourse analysis. It covers transcriptions from the Parliament of the Basque Autonomous Community for eight years and two legislative terms (2012-2020), and its main characteristic is the presence of Basque-Spanish code-switching speeches.

Download dataset

https://huggingface.co/datasets/HiTZ/basqueparl

Data Description

For instance, the following unprocessed speech mixes Basque text with Spanish fragments, such as "le voy a contestar" and the long passage beginning "ver el vaso medio vacío":

Bai, zure baimenarekin hemendik.

Ba zure desioak, Guanche andrea, gureak ere badira. Harritu nau eta ez nau harritu hitza berriro hartzeak, zeren hitz egiten nengoen bitartean esan diozu albokoari le voy a contestar. Le voy a contestar, ondo iruditzen, zure eskubidean zaude, baino beno, ez dut uste inongo astakeriarik esan dudanik.

Gauzak egiten dira eta uste dut nik, nik ere eskubidea dudala Gobernuak eta beste erakundeek egiten dutena esateko. Zeren beti ver el vaso medio vacío o medio lleno, pues cambia un poco la perspectiva y vernos siempre en modo Gobierno, creo que no es nada objetivo. Se hacen cosas, se harán cosas y esta vez creo que me deberían reconocer que de la iniciativa primera a lo que hemos acordado, no nos hemos dejado nada o creo que casi nada. Entonces, bueno, sólo quería aclarar eso eta eskerrak berriro.

Eta ziur egon emakumea dokumentu horietan ez bada agertzen hitzetan, zeren uste dut hori ez dela garrantzitsuena, bai politiketan egongo dela eta dagoela.

Eskerrik asko.

The specificities of the BasqueParl corpus are:

  • 14 M words of bilingual parliamentary transcriptions
  • Speech paragraphs as units
  • Metadata such as date and speaker's name, year of birth, gender and party for each paragraph
  • Language of each paragraph (either Basque or Spanish)
  • Lemmas and named entities of each paragraph, with and without stopwords

Data Fields

The BasqueParl corpus is distributed as a Tab Separated Values (TSV) file. Each unit (a speech paragraph) has the following fields:

  • "date": Date corresponding to the speech, e.g. 2020-02-07
  • "speech_id": Number that identifies the speech within its date, e.g. 3
  • "text_id": Number that identifies the paragraph within its speech, e.g. 3
  • "speaker": Family names of the speaker, including their position if any, e.g. Tejeria Otermin LEHENDAKARIA
  • "birth": Year of birth of the speaker, e.g. 1971
  • "gender": Gender of the speaker, either E (emakumea) for female or G (gizona) for male
  • "party": Political group of the speaker, e.g. EAJ
  • "language": Language assigned to a paragraph, either eu for Basque or es for Spanish
  • "text": Paragraph of the speech text
  • "lemmas": Lemmatized paragraph
  • "lemmas_stw": Lemmatized paragraph without stopwords
  • "entities": Named entities extracted from the paragraph
  • "entities_stw": Named entities extracted from the paragraph without stopwords
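As a minimal sketch of how these fields might be read, the snippet below parses TSV content with Python's standard `csv` module. It assumes the file starts with a header row that uses the field names listed above; that is an assumption, not something this README confirms. The `read_basqueparl` helper and the one-row `SAMPLE` are hypothetical: the sample only reuses the example values from the field list, with the lemma and entity columns left empty.

```python
import csv
import io

# Field names as listed in this README; the presence of a header row
# with exactly these names is an assumption.
FIELDS = [
    "date", "speech_id", "text_id", "speaker", "birth", "gender", "party",
    "language", "text", "lemmas", "lemmas_stw", "entities", "entities_stw",
]

# Illustrative one-row sample built from the example values above.
# The lemma and entity columns are left empty on purpose.
SAMPLE = (
    "\t".join(FIELDS) + "\n"
    + "\t".join([
        "2020-02-07", "3", "3", "Tejeria Otermin LEHENDAKARIA", "1971",
        "E", "EAJ", "eu", "Eskerrik asko.", "", "", "", "",
    ]) + "\n"
)


def read_basqueparl(handle):
    """Yield one dict per paragraph from a TSV file-like object."""
    # QUOTE_NONE keeps quotation marks inside speeches as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Identifiers are numbers within their date / speech.
        row["speech_id"] = int(row["speech_id"])
        row["text_id"] = int(row["text_id"])
        yield row


rows = list(read_basqueparl(io.StringIO(SAMPLE)))
print(rows[0]["speaker"], rows[0]["language"])
```

To read a downloaded copy, pass an open file instead of the `StringIO` object, for example `open(path, encoding="utf-8", newline="")`, where `path` points to wherever the TSV was saved.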

Methodological Information

Lemmas and named entities of each paragraph have been extracted with these state-of-the-art Flair lemmatization and NER models:

Language detection was performed by means of langdetect.
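Because each paragraph carries a single language label, one way to locate code-switching speeches is to group paragraphs by speech and keep the speeches whose paragraphs include both labels. The sketch below does that in plain Python. The `code_switching_speeches` helper and the toy `paragraphs` list are hypothetical and only illustrate the idea; it is not necessarily how the corpus authors identified such speeches, and it cannot see switches that happen inside a single paragraph.

```python
from collections import defaultdict


def code_switching_speeches(rows):
    """Return (date, speech_id) keys of speeches labelled with both eu and es."""
    languages = defaultdict(set)
    for row in rows:
        # speech_id is only unique within a date, so the key needs both.
        languages[(row["date"], row["speech_id"])].add(row["language"])
    return sorted(
        key for key, langs in languages.items() if {"eu", "es"} <= langs
    )


# Toy rows for illustration only; real rows carry all the corpus fields.
paragraphs = [
    {"date": "2020-02-07", "speech_id": 3, "language": "eu"},
    {"date": "2020-02-07", "speech_id": 3, "language": "es"},
    {"date": "2020-02-07", "speech_id": 4, "language": "eu"},
]

mixed = code_switching_speeches(paragraphs)
print(mixed)
```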

If you use this resource please cite the following paper:

Contact Details

Rodrigo Agerri
HiTZ Center - Ixa, University of the Basque Country UPV/EHU
https://ragerri.github.io/
