After reading Dmitry Paranyushkin's Towards Data Science blog post *Measuring discourse bias using text network analysis*, I wanted to see whether his thesis could be tested with a separate Python implementation, rather than through the partial built-in version of his ideas in his web-based open source tool InfraNodus. After contacting Dmitry, I decided to code a library according to his instructions.
We ended up coding it together, evaluated it on the PAN SemEval 2019 Hyperpartisan News Detection dataset, and published the results here. The code for what we ended up calling DiscourseDiversity is hosted on GitLab.
The labeled dataset is from https://pan.webis.de/semeval19/semeval19-web/. Access to the actual dataset can be requested at https://zenodo.org/record/1489920.
The DataFrames produced by and used in this code can be downloaded from osf.io in case you don't want to generate them yourself by running the code, which took a couple of days on a single core in the Jupyter notebooks.
To use the code as is, create a new folder named `data` and uncompress the dataset files you obtained from the link above into it. Your project folder should now look like this:
```
discoursebias_evaluation_pan_semeval_2019/
├── 1a_create_paragraphed_texts_in_pandas_dataframe_from_xml_bypublisher.ipynb
├── 1b_create_dataframe_unparagraphed_texts_from_validation_bypublisher_xml_file.ipynb
├── 1c_compare_paragraphed_vs_non-paragraphed_validation_bypublisher_texts.ipynb
├── 2_collect_ground_truth_partisanship_annotations_to_dataframe.ipynb
├── 3_compare_biasIndex_score_with_hyperpartisan_scoring.ipynb
├── data
│   ├── article.xsd
│   ├── articles-training-byarticle-20181122.xml
│   ├── articles-training-bypublisher-20181122.xml
│   ├── ground-truth-training-byarticle-20181122.xml
│   ├── ground-truth-training-bypublisher-20181122.xml
│   ├── ground-truth-validation-bypublisher-20181122.xml
│   └── ground-truth.xsd
├── LICENSE
├── README.md
├── all_bias_index_classes_results.png
└── only_biased_and_dispersed_bias_index_scores_result.png
```
Make sure you are in the root folder of this repo and start the notebook server by running `jupyter notebook` in your terminal.
You only need to run the notebooks numbered 1a, 2, and 3, since 1b and 1c were only used to verify that keeping or dropping paragraphs when parsing the XML files didn't affect the end result.
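As a rough illustration of what the first notebook does, here is a minimal sketch (not the notebooks' actual code) of parsing one of the PAN SemEval 2019 article XML files into a pandas DataFrame. The element and attribute names (`article`, `id`, `title`) are assumptions based on the dataset's published `article.xsd` schema:

```python
import xml.etree.ElementTree as ET
import pandas as pd

def articles_to_dataframe(xml_path):
    """Collect article id, title, and full text into a DataFrame."""
    rows = []
    # iterparse keeps memory usage low on the large bypublisher file
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "article":
            text = " ".join(elem.itertext()).strip()
            rows.append({"id": elem.get("id"),
                         "title": elem.get("title"),
                         "text": text})
            elem.clear()  # free the parsed subtree
    return pd.DataFrame(rows)

# df = articles_to_dataframe("data/articles-training-bypublisher-20181122.xml")
```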
Assuming that either "Dispersed" or "Diversified" is equivalent to hyperpartisan = false, while "Focused" or "Biased" is equivalent to hyperpartisan = true:

![All bias index classes results](all_bias_index_classes_results.png)
Assuming that "Dispersed" means hyperpartisan = false and "Biased" means hyperpartisan = true:

![Only biased and dispersed bias index scores result](only_biased_and_dispersed_bias_index_scores_result.png)
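The two mappings above can be sketched as follows. This is a hypothetical helper for illustration, not part of DiscourseDiversity's API; the class names come from the blog post:

```python
def bias_class_to_hyperpartisan(bias_class, strict=False):
    """Map a bias-index class name to the hyperpartisan boolean.

    strict=False: all four classes are mapped (first assumption above).
    strict=True:  only "Dispersed" and "Biased" are mapped (second
                  assumption); other classes return None, i.e. they are
                  excluded from the comparison.
    """
    if bias_class == "Biased" or (not strict and bias_class == "Focused"):
        return True
    if bias_class == "Dispersed" or (not strict and bias_class == "Diversified"):
        return False
    return None  # class excluded under the strict assumption
```

Under the strict assumption, articles classified as "Focused" or "Diversified" would simply be dropped before comparing against the ground truth.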
A 2018 overview of the problem of measuring media bias at the crossroads of the social sciences and computer science can be found in *Automated identification of media bias in news articles: an interdisciplinary literature review*.