Investigating Antisemitic Bias in German Parliamentary Proceedings

Explore ethno-religious biases in German parliamentary proceedings from 1867 to 2020, based on word embeddings trained from scratch on nine slices of the proceedings, each corresponding to one of the following historical periods.

	Kaiserreich I	Kaiserreich II	Weimar	NS	CDU I	SPD I	CDU II	SPD II	CDU III
start	1867	1890	1918	1933	1949	1969	1982	1998	2005
end	1890	1918	1933	1942	1969	1982	1998	2005	2020

The four target sets to investigate bias are:

Judaism
Christianity
Catholicism
Protestantism

Based on those target sets, four explicit bias specifications of the form B_E = (T₁, T₂, A₁, A₂), as defined by Lauscher et al., are tested for the presence of antisemitic bias, namely:

B_E1 = (T_Christian, T_Jewish, A_pos, A_neg)
B_E2 = (T_Protestant, T_Catholic, A_pos, A_neg)
B_E3 = (T_Protestant, T_Jewish, A_pos, A_neg)
B_E4 = (T_Catholic, T_Jewish, A_pos, A_neg)

A_pos and A_neg are placeholders for the opposing attribute sets of an antisemitic stream s in S. S covers six antisemitic streams: religious, economic, patriotic, racial, conspiratorial, and ethical. Each stream is semantically linked to antisemitic tendencies that subsume commonly held stereotypes about Jews. The constituent terms of all target and attribute sets can be found in ./data/specifications.
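To make the combination of specifications and streams concrete, the sketch below enumerates every pairing. It assumes that each of the four specifications is tested with the attribute sets of each of the six streams; the identifiers such as A_pos_religious are hypothetical labels, and the actual term lists live in ./data/specifications.

```python
from itertools import product

# Target-set pairs (T1, T2) of the four specifications listed above.
TARGET_PAIRS = {
    "B_E1": ("T_Christian", "T_Jewish"),
    "B_E2": ("T_Protestant", "T_Catholic"),
    "B_E3": ("T_Protestant", "T_Jewish"),
    "B_E4": ("T_Catholic", "T_Jewish"),
}

# The six streams named in the text; the labels are illustrative only.
STREAMS = ["religious", "economic", "patriotic",
           "racial", "conspiratorial", "ethical"]


def enumerate_tests():
    """Return one (spec, T1, T2, A_pos, A_neg) tuple per spec/stream pair."""
    tests = []
    for (spec, (t1, t2)), stream in product(TARGET_PAIRS.items(), STREAMS):
        tests.append((spec, t1, t2, f"A_pos_{stream}", f"A_neg_{stream}"))
    return tests


for test in enumerate_tests()[:3]:
    print(test)
```

Under this assumption, a single historical slice yields 4 x 6 = 24 test configurations.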

All trained word2vec models used for the bias experiments in the thesis are provided in the ./models folder, with the corresponding vocab in the ./vocab folder. Hence, the results of all bias experiments should be reproducible. The temporally aligned TWEC models for plotting semantic shifts are provided under ./models/aligned and ./models/aligned_brd respectively. The PPMI matrices for running label propagation are to be found in the ./matrices folder and their vocab in ./ppmi_vocab.

The folders ./data/reichstag and ./data/bundestag contain the preprocessed proceedings corresponding to each historic slice.

Originally, four collections of OCR-ed and sentence-tokenized, but otherwise unprocessed, parliamentary proceedings (1895.corr.seg, 1918.corr.seg, 1933.corr.seg, 1942.corr.seg) were available. See 1942.corr.seg in the ./data folder for a sample of the original data.

Follow the steps below to preprocess and vectorize the data, and then to apply the bias evaluation methods to the representation of each historical slice.

Reichstag

Extract Reichstag proceedings from each of the original files (e.g. '1942.corr.seg'):

python extract_proceedings.py [file] (output will be saved in ./data/protocols_1942)

After the protocols from all 4 original files are extracted, either create historically aligned slices or slices that are balanced in number of documents per slice:

python create_corpora.py -s [historic|balanced] (historic output will be saved in ./data/kaiserreich_1, ./data/kaiserreich_2, ./data/weimar, ./data/ns, ./data/cdu_1, etc.)

Process the resulting slices by applying a cascade of text processing functions on each extracted protocol; specify protocol_type as either RT (Reichstagsprotokolle) or BRD (Bundestagsprotokolle) as the processing steps slightly vary between Reichstag and Bundestag proceedings:

python process_protocols.py [slice_folder] [protocol_type] (output will be saved in ./data/kaiserreich_1_processed, etc.)

Train a word2vec embedding space on the processed slices (the model will be stored in ./models, the vocab in ./data/vocab):

bash train_embeddings.sh --model_architecture [word2vec] --protocols [the processed slice, e.g. ./data/reichstag/kaiserreich_1_processed]
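The four Reichstag steps can be chained in a small driver. The following is a dry-run sketch that only builds and prints the commands; it does not execute anything. The function name is hypothetical, the script names and flags are taken from the steps above (assuming the processing script is invoked as process_protocols.py), and the slice folder layout ./data/[slice] is an assumption that may need adjusting to your checkout.

```python
import shlex


def reichstag_commands(original_file, slice_name, strategy="historic"):
    """Build the four Reichstag commands in order, without running them."""
    if strategy not in ("historic", "balanced"):
        raise ValueError("strategy must be 'historic' or 'balanced'")
    # Assumed layout: slices are written to ./data/<slice_name>.
    slice_folder = f"./data/{slice_name}"
    return [
        ["python", "extract_proceedings.py", original_file],
        ["python", "create_corpora.py", "-s", strategy],
        # RT selects the Reichstag variant of the processing cascade.
        ["python", "process_protocols.py", slice_folder, "RT"],
        ["bash", "train_embeddings.sh",
         "--model_architecture", "word2vec",
         "--protocols", f"{slice_folder}_processed"],
    ]


# Print the commands for inspection instead of executing them.
for argv in reichstag_commands("1942.corr.seg", "ns"):
    print(shlex.join(argv))
```

To actually run the steps, each argument list could be passed to subprocess.run with check=True, so that a failing step stops the chain.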

Bundestag

Bundestag protocols are already provided in a convenient format, with a separate folder for each legislative period and each protocol in its own text file. The extraction and slicing steps described for the Reichstag (steps 1 and 2) are therefore omitted, and the folders containing the protocols of each legislative period can be pre-processed directly. A sample of the original Bundestag protocols is provided in the folder ./data/slice_7.

Evaluation

A range of bias evaluation tests can be run on the trained word2vec embedding spaces. All bash scripts for the bias experiments are contained in the ./evaluation folder.

Explicit bias tests under DEBIE:

WEAT --> bash run_weat.sh.
BAT / ECT --> bash run_bat_ect.sh (specify test type as either --BAT/ECT).
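For orientation, the WEAT effect size for a specification (T₁, T₂, A₁, A₂) can be sketched in a few lines. This is a from-scratch illustration of the standard formula, not the code behind run_weat.sh: the function names are hypothetical, the vectors are toy values rather than trained embeddings, and the sample standard deviation is used, which is an assumption since implementations differ on this point.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(t, a_pos, a_neg):
    """s(t, A1, A2): mean cosine to A1 minus mean cosine to A2."""
    return (mean(cosine(t, a) for a in a_pos)
            - mean(cosine(t, a) for a in a_neg))


def weat_effect_size(t1, t2, a_pos, a_neg):
    """Difference of mean associations, scaled by their pooled spread."""
    s1 = [association(t, a_pos, a_neg) for t in t1]
    s2 = [association(t, a_pos, a_neg) for t in t2]
    return (mean(s1) - mean(s2)) / stdev(s1 + s2)


# Toy 2-d vectors for illustration only.
toy_t1 = [[1.0, 0.1], [0.9, 0.2]]
toy_t2 = [[0.1, 1.0], [0.2, 0.9]]
toy_a_pos = [[1.0, 0.0]]
toy_a_neg = [[0.0, 1.0]]
print(weat_effect_size(toy_t1, toy_t2, toy_a_pos, toy_a_neg))
```

A positive value means T₁ is closer to A₁ than T₂ is; swapping the two target sets flips the sign.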

Implicit bias tests under DEBIE:

K-Means --> bash run_kmeans.sh.
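The idea behind the clustering test is that if the vectors of two target sets can be separated cleanly by k-means, the embedding space encodes the distinction implicitly. The sketch below illustrates this with a from-scratch two-cluster Lloyd's algorithm; it is not the implementation used by run_kmeans.sh, the function names are hypothetical, and the deterministic initialisation is a simplification chosen to keep the example reproducible.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(points, iterations=20):
    """Lloyd's algorithm with k=2, initialised with the first and last point."""
    centers = [list(points[0]), list(points[-1])]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        # Update step: recompute each centre from its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster runs empty
                centers[k] = centroid(members)
    return labels


def clustering_accuracy(t1_vectors, t2_vectors):
    """Share of target vectors whose cluster matches their target set."""
    labels = two_means(t1_vectors + t2_vectors)
    gold = [0] * len(t1_vectors) + [1] * len(t2_vectors)
    agree = sum(l == g for l, g in zip(labels, gold)) / len(gold)
    # Cluster ids are arbitrary, so score the better of the two mappings.
    return max(agree, 1 - agree)


# Toy vectors for illustration only.
print(clustering_accuracy([[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]))
```

An accuracy near 0.5 suggests the two target sets are not separable by clustering, while values near 1.0 suggest a strong implicit distinction.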

Semantic evaluation:

Simlex --> bash run_simlex.sh.

Subspace Projections:

To compute the subspace projections onto the Christian-Jewish bias subspace, run the script ripa.sh. To plot the projections, provide a semantic domain with the --sem_domain argument and set plot_projections to True.
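As a rough illustration of what a projection onto a bias direction computes, the sketch below derives a unit direction from a single pair of vectors and takes the inner product of a word vector with it. This is a simplified single-pair variant and an assumption about the setup; ripa.sh may build the subspace from several term pairs. The function names and the toy vectors are hypothetical.

```python
import math


def bias_direction(x, y):
    """Unit vector pointing from y to x."""
    diff = [a - b for a, b in zip(x, y)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ripa(word_vec, direction):
    """Inner product of a word vector with the unit bias direction."""
    return sum(w * d for w, d in zip(word_vec, direction))


# Toy 2-d vectors standing in for one term from each target set.
direction = bias_direction([1.0, 0.0], [0.0, 1.0])
print(ripa([1.0, 1.0], direction))   # equidistant word: projection of zero
print(ripa([1.0, 0.0], direction) > 0)
```

The sign of the projection indicates which end of the direction a word leans towards, and the magnitude indicates how strongly.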

The evaluation scripts on the provided USB stick should work with the pre-set arguments, to reproduce the results of each bias test.

Harmonic Function Label Propagation

For HFLP, input representations based on Positive Pointwise Mutual Information (PPMI) are employed. The following command creates a PPMI matrix for the processed slice fed as an argument:

bash ppmi.sh --protocols [the processed slice, e.g. ./data/reichstag/kaiserreich_1_processed] (the matrix is stored in ./matrices/ppmi_kaiserreich_1.npz, the index in ./ppmi_vocab/kaiserreich_1.json)
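For readers unfamiliar with PPMI, the following sketch computes it from window-based co-occurrence counts on a toy corpus. It is not the code behind ppmi.sh: the function name, the window size, and the dictionary output (instead of a sparse .npz matrix) are all assumptions made to keep the example self-contained.

```python
import math
from collections import Counter


def ppmi(sentences, window=2):
    """Return {(word, context): PPMI} from symmetric window co-occurrences."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    # Marginal counts over the co-occurrence table.
    total = sum(pair_counts.values())
    word_counts, ctx_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_counts[w] += n
        ctx_counts[c] += n

    scores = {}
    for (w, c), n in pair_counts.items():
        # PMI = log( P(w, c) / (P(w) * P(c)) )
        pmi = math.log((n * total) / (word_counts[w] * ctx_counts[c]))
        if pmi > 0:  # "positive" PMI: keep only values above zero
            scores[(w, c)] = pmi
    return scores


toy_corpus = [["a", "b"], ["a", "b"], ["c", "d"]]
print(ppmi(toy_corpus))
```

Pairs that never co-occur, or that co-occur less often than chance would predict, are simply absent from the result, which corresponds to a zero entry in a sparse matrix.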

Afterwards, the label propagation algorithm can be executed on the PPMI matrix by running the script propagate.sh. Provide a semantic domain to propagate from with the --semantic_domain argument.
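The harmonic-function idea can be sketched as an iterative update in which every unlabelled node takes the weighted average of its neighbours' scores while the seed nodes stay fixed. This is an illustrative sketch, not the code behind propagate.sh: the function name and graph format are hypothetical, and it assumes the edge weights would be derived from the PPMI representations and the seeds from the chosen semantic domain.

```python
def propagate(weights, seed_labels, iterations=100):
    """Harmonic label propagation on a weighted graph.

    weights: {node: {neighbour: weight}}, assumed symmetric.
    seed_labels: {node: score} for labelled nodes; these stay fixed.
    """
    scores = {node: seed_labels.get(node, 0.0) for node in weights}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in weights.items():
            if node in seed_labels:
                updated[node] = seed_labels[node]  # clamp the seeds
                continue
            total = sum(neighbours.values())
            if total == 0:
                updated[node] = scores[node]  # isolated node: unchanged
                continue
            # Weighted average of the neighbours' current scores.
            updated[node] = sum(w * scores[n]
                                for n, w in neighbours.items()) / total
        scores = updated
    return scores


# Toy chain graph a - b - c with seeds at both ends.
toy_graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
print(propagate(toy_graph, {"a": 1.0, "c": 0.0}))
```

For large vocabularies, the same fixed point is usually obtained by solving the corresponding linear system on sparse matrices rather than by iterating over dictionaries.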

Visualize Semantic Shift

First, temporally aligned embedding spaces must be trained. For this, change to the ./twec folder and run:

python train_temporal_embeddings.py [protocol_type]

You might have to change the provided folder arguments of the train_slice function inside the train_temporal_embeddings.py script to the paths where your processed data lies.

Now, semantic shifts can be plotted with the script:

python closest_over_time_with_anns.py -w [word1 word2 etc.] -n [number of neighbors] --protocol_type [RT/BRD] --model_folder [folder in which TWEC models reside]

Publication Plots

To reproduce the "Corpus Statistics" and "average RIPA over slices" plots, use the Jupyter notebook reproduce_plots.ipynb.


umanlp/crosstemporal_bias
