After reading Dmitry Paranyushkin's Towards Data Science blog post *Measuring discourse bias using text network analysis*, I wanted to see whether his thesis could be tested with a separate Python implementation, rather than through the partial built-in version of his ideas in his web-based open source tool InfraNodus. After contacting Dmitry, I decided to code a library according to his instructions.
We ended up coding it together, evaluated it on the PAN SemEval 2019 Hyperpartisan News Detection dataset, and published the results here. The code for what we ended up calling DiscourseDiversity is hosted on GitLab.
The labeled dataset is from https://pan.webis.de/semeval19/semeval19-web/. Access to the actual dataset can be requested at https://zenodo.org/record/1489920.
The DataFrames produced by and used in this code can be downloaded from osf.io in case you don't want to generate them yourself by running the code, which took a couple of days on a single core in the Jupyter notebooks.
To use the code as is, create a new folder named `data` and uncompress the dataset files you obtained from the link above into it. Your project folder should now look like this:
```
discoursebias_evaluation_pan_semeval_2019/
├── 1a_create_paragraphed_texts_in_pandas_dataframe_from_xml_bypublisher.ipynb
├── 1b_create_dataframe_unparagraphed_texts_from_validation_bypublisher_xml_file.ipynb
├── 1c_compare_paragraphed_vs_non-paragraphed_validation_bypublisher_texts.ipynb
├── 2_collect_ground_truth_partisanship_annotations_to_dataframe.ipynb
├── 3_compare_biasIndex_score_with_hyperpartisan_scoring.ipynb
├── data
│   ├── article.xsd
│   ├── articles-training-byarticle-20181122.xml
│   ├── articles-training-bypublisher-20181122.xml
│   ├── ground-truth-training-byarticle-20181122.xml
│   ├── ground-truth-training-bypublisher-20181122.xml
│   ├── ground-truth-validation-bypublisher-20181122.xml
│   └── ground-truth.xsd
├── LICENSE
├── README.md
├── all_bias_index_classes_results.png
└── only_biased_and_dispersed_bias_index_scores_result.png
```
Make sure you are in the root folder of this repo and start the notebook server by running `jupyter notebook` in your terminal.
You only need to run the notebooks numbered 1a, 2, and 3, since 1b and 1c were only used to verify that keeping or dropping paragraphs when parsing the XML files didn't affect the end result.
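As a rough illustration of what the first notebook does, here is a minimal sketch (not the notebooks' actual code) of parsing one of the PAN SemEval 2019 article XML files into a pandas DataFrame. The element and attribute names (`article`, `id`, `title`) are assumptions based on the dataset's published `article.xsd` schema:

```python
import xml.etree.ElementTree as ET
import pandas as pd

def articles_to_dataframe(xml_path):
    """Collect article id, title, and full text into a DataFrame."""
    rows = []
    # iterparse keeps memory usage low on the large bypublisher file
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "article":
            text = " ".join(elem.itertext()).strip()
            rows.append({"id": elem.get("id"),
                         "title": elem.get("title"),
                         "text": text})
            elem.clear()  # free the parsed subtree
    return pd.DataFrame(rows)

# df = articles_to_dataframe("data/articles-training-bypublisher-20181122.xml")
```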
Assuming that either "Dispersed" or "Diversified" is equivalent to hyperpartisan = false, while "Focused" or "Biased" is equivalent to hyperpartisan = true:

![All bias index classes results](all_bias_index_classes_results.png)
Assuming that "Dispersed" means hyperpartisan = false and "Biased" means hyperpartisan = true:

![Only biased and dispersed bias index scores result](only_biased_and_dispersed_bias_index_scores_result.png)
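The two mappings above can be sketched as follows. This is a hypothetical helper for illustration, not part of DiscourseDiversity's API; the class names come from the blog post:

```python
def bias_class_to_hyperpartisan(bias_class, strict=False):
    """Map a bias-index class name to the hyperpartisan boolean.

    strict=False: all four classes are mapped (first assumption above).
    strict=True:  only "Dispersed" and "Biased" are mapped (second
                  assumption); other classes return None, i.e. they are
                  excluded from the comparison.
    """
    if bias_class == "Biased" or (not strict and bias_class == "Focused"):
        return True
    if bias_class == "Dispersed" or (not strict and bias_class == "Diversified"):
        return False
    return None  # class excluded under the strict assumption
```

Under the strict assumption, articles classified as "Focused" or "Diversified" would simply be dropped before comparing against the ground truth.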
A 2018 overview of the problem of measuring media bias at the crossroads of the social sciences and computer science can be found in *Automated identification of media bias in news articles: an interdisciplinary literature review*.