Assignment 3 for HCDS
This is an assignment organised within the framework of the FUB:HCDS course on the topic "Bias".
The goal of this project is to compute prediction scores and to analyze them afterwards.
The project is structured in three parts:
Two data sources are used in this project: (1) Wikipedia articles on politicians and (2) world population data. Both original datasets can be found in the data_raw folder.
(1) Wikipedia articles - The Wikipedia articles can be found here. The file page_data.csv is the data source used in this project and can be found in the data_raw folder. This data includes unnecessary entries and needs to be processed. The dataset is published by Os Keyes under the CC BY 4.0 license.
(2) Population data - The preprocessed population data, named export_2019.csv, is provided by the instructor in the _data folder. The dataset is drawn from the world population data sheet published by the PRB (Population Reference Bureau). This dataset can also be found in the data_raw folder. The population data is used directly, without any further processing.
Step 1. Clean the database (filter out unwanted data entries)
Unwanted entries (those containing "Template", which are not Wikipedia articles) are filtered out.
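The filtering step can be sketched with pandas as follows. The column names (`page`, `country`, `rev_id`) and the inline sample rows are illustrative assumptions, not the full dataset:

```python
import pandas as pd

# Illustrative sample of the raw article data; real data comes from
# data_raw/page_data.csv (column names page, country, rev_id assumed).
politician_data = pd.DataFrame({
    "page": ["Template:ZambiaProvincialMinisters", "Bir I of Kanem", "Gwendolyn Garcia"],
    "country": ["Zambia", "Chad", "Philippines"],
    "rev_id": [235107991, 355319463, 745137657],
})

# Drop rows whose page title starts with "Template:" -- these are
# maintenance pages, not actual Wikipedia articles.
cleaned = politician_data[~politician_data["page"].str.startswith("Template:")]
```

In practice `politician_data` would be loaded with `pd.read_csv("data_raw/page_data.csv")` instead of the inline sample.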
Step 2. ORES request
In the second step, the articles found on the Wikipedia page are rated for quality through the ORES ("Objective Revision Evaluation Service") API. This API is backed by a machine learning system and assigns each article to one of six quality categories -- the two best of which are "FA" (featured article) and "GA" (good article). These two categories are relevant for our analysis.
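A request against the ORES v3 scores endpoint can be sketched as below. The helper functions are hypothetical, and the `sample` response is a hand-written dict shaped like the examples in the ORES documentation, not a real API result:

```python
from urllib.parse import quote

def build_ores_url(project, revids, model="wp10"):
    """Build a request URL for the ORES v3 scores endpoint.

    project, revids and model mirror the API parameters; multiple
    revision IDs are separated by "|" in a single request.
    """
    rev_param = "|".join(str(r) for r in revids)
    return (f"https://ores.wikimedia.org/v3/scores/{project}/"
            f"?models={model}&revids={quote(rev_param)}")

def extract_prediction(response, project, rev_id, model="wp10"):
    """Pull the predicted quality class out of an ORES response dict."""
    return response[project]["scores"][str(rev_id)][model]["score"]["prediction"]

# Hand-written sample response for illustration only.
sample = {"enwiki": {"scores": {"355319463": {"wp10": {"score": {"prediction": "GA"}}}}}}
```

In the notebook the URL would be fetched with `requests.get(...)` and the JSON body passed to `extract_prediction`.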
Step 3. Assign quality assessment to each article
Each article (the original dataset is read as a DataFrame and renamed to politician_data) is rated through the ORES REST API and given a quality category. The cleaned and processed database should contain the essential information for each article (including country, revision_id, prediction, etc.).
The data is retrieved in multiple chunks, since the retrieval takes a few hours. The results can be found in data clean/chunks/. These chunks are later used to create a single CSV.
Articles which cannot be processed are also extracted and stored in result/countries_no_match.csv.
In this step, the cleaned dataset is combined with the population data, generating a complete dataset in result/politician_by_country.csv.
The data has the following structure:
article_name | country | region | revision_id | article_quality | population |
---|---|---|---|---|---|
xxxxxx | xxxxxxx | xxxx | xxxxxxxxx | xxx | xxx |
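The combination step amounts to a merge on the country name, sketched below with illustrative rows; the column names follow the table above, while the sample values are assumptions:

```python
import pandas as pd

# Illustrative ORES-scored articles (sample values only).
articles = pd.DataFrame({
    "article_name": ["Bir I of Kanem", "Gwendolyn Garcia"],
    "country": ["Chad", "Philippines"],
    "revision_id": [355319463, 745137657],
    "article_quality": ["Stub", "GA"],
})

# Illustrative population data (sample values only).
population = pd.DataFrame({
    "country": ["Chad", "Philippines"],
    "region": ["AFRICA", "ASIA"],
    "population": [15_400_000, 108_000_000],
})

# Keep only countries present in both sources.
politician_by_country = articles.merge(population, on="country", how="inner")
# politician_by_country.to_csv("result/politician_by_country.csv", index=False)
```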
The aim of this project is to compare the number of articles (with and without good quality) in proportion to the national/regional population. Six comparison tables are therefore generated to show the results.
They are:
Table 1&2: Top/Bottom 10 countries by coverage
Table 3&4: Top/Bottom 10 countries by relative quality
Table 5&6: Regions by coverage (total articles/good-quality articles)
All 6 tables can be found in the notebook.
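The per-country metrics behind these tables can be sketched as below. The data is a tiny illustrative frame, and the metric definitions (articles per capita for coverage, FA/GA share for relative quality) follow the description above:

```python
import pandas as pd

# Tiny illustrative merged dataset.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "population": [1000, 1000, 5000, 5000, 5000],
    "article_quality": ["GA", "Stub", "FA", "C", "Start"],
})

grouped = df.groupby("country").agg(
    articles=("article_quality", "size"),
    good_articles=("article_quality", lambda q: q.isin(["FA", "GA"]).sum()),
    population=("population", "first"),
)
# Coverage: articles per capita; relative quality: share of FA/GA articles.
grouped["coverage"] = grouped["articles"] / grouped["population"]
grouped["relative_quality"] = grouped["good_articles"] / grouped["articles"]

top10_coverage = grouped.sort_values("coverage", ascending=False).head(10)
```

The regional tables (5&6) follow the same pattern with `groupby("region")` on the merged dataset.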
The ORES REST API takes three parameters: project, revid and model (Documentation, Usage).
Poetry for dependency management, Jupyter Notebook, Python 3.8
The project is licensed under the MIT license.