Assignment 3 for HCDS
This is an assignment organised within the framework of the FUB:HCDS course on the topic "Bias".
The goal of this project is to compute prediction scores and to analyze them afterwards.
The project is structured in three parts:
Two data sources are used in this project: (1) Wikipedia articles on politicians and (2) world population data. Both original datasets can be found in the data_raw folder.
(1) Wikipedia articles - The Wikipedia articles can be found here. The file page_data.csv is the data source used in this project and can be found in the data_raw folder. This data includes unnecessary entries and needs to be processed. The dataset is published by Os Keyes under the CC BY 4.0 license.
(2) Population data - The preprocessed population data, named export_2019.csv, is provided by the instructor in the _data folder. The dataset is drawn from the world population data sheet published by the PRB (Population Reference Bureau). This dataset can also be found in the data_raw folder. The population data is used directly, without any further processing.
Step 1. Clean the database (filter out unwanted data entries)
Unwanted entries (those containing "Template", which are not Wikipedia articles) are filtered out.
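The filtering step can be sketched with pandas as follows. The column names (`page`, `country`, `rev_id`) and the inline sample rows are illustrative assumptions, not the full dataset:

```python
import pandas as pd

# Illustrative sample of the raw article data; real data comes from
# data_raw/page_data.csv (column names page, country, rev_id assumed).
politician_data = pd.DataFrame({
    "page": ["Template:ZambiaProvincialMinisters", "Bir I of Kanem", "Gwendolyn Garcia"],
    "country": ["Zambia", "Chad", "Philippines"],
    "rev_id": [235107991, 355319463, 745137657],
})

# Drop rows whose page title starts with "Template:" -- these are
# maintenance pages, not actual Wikipedia articles.
cleaned = politician_data[~politician_data["page"].str.startswith("Template:")]
```

In practice `politician_data` would be loaded with `pd.read_csv("data_raw/page_data.csv")` instead of the inline sample.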
Step 2. ORES request
In the second step, the articles found on the Wikipedia page are rated for quality through the ORES ("Objective Revision Evaluation Service") API. This API is backed by a machine learning system and assigns each article to one of six quality categories -- the two best of which are "FA" (featured article) and "GA" (good article). These two categories are relevant for our analysis.
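A request against the ORES v3 scores endpoint can be sketched as below. The helper functions are hypothetical, and the `sample` response is a hand-written dict shaped like the examples in the ORES documentation, not a real API result:

```python
from urllib.parse import quote

def build_ores_url(project, revids, model="wp10"):
    """Build a request URL for the ORES v3 scores endpoint.

    project, revids and model mirror the API parameters; multiple
    revision IDs are separated by "|" in a single request.
    """
    rev_param = "|".join(str(r) for r in revids)
    return (f"https://ores.wikimedia.org/v3/scores/{project}/"
            f"?models={model}&revids={quote(rev_param)}")

def extract_prediction(response, project, rev_id, model="wp10"):
    """Pull the predicted quality class out of an ORES response dict."""
    return response[project]["scores"][str(rev_id)][model]["score"]["prediction"]

# Hand-written sample response for illustration only.
sample = {"enwiki": {"scores": {"355319463": {"wp10": {"score": {"prediction": "GA"}}}}}}
```

In the notebook the URL would be fetched with `requests.get(...)` and the JSON body passed to `extract_prediction`.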
Step 3. Assign quality assessment to each article
Each article (the original dataset is read as a DataFrame and renamed to politician_data) is rated through the ORES REST API and given a quality category. The cleaned and processed database should contain the essential information for each article (including country, revision_id, prediction, etc.).
The data is retrieved in multiple chunks, since the retrieval takes a few hours. The results can be found in data clean/chunks/. These chunks are later used to create a single CSV.
Articles which cannot be processed are also extracted and stored in result/countries_no_match.csv.
In this step, the cleaned dataset is combined with the population data, generating a complete dataset in result/politician_by_country.csv.
The data has the following structure:
article_name | country | region | revision_id | article_quality | population |
---|---|---|---|---|---|
xxxxxx | xxxxxxx | xxxx | xxxxxxxxx | xxx | xxx |
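The combination step amounts to a merge on the country name, sketched below with illustrative rows; the column names follow the table above, while the sample values are assumptions:

```python
import pandas as pd

# Illustrative ORES-scored articles (sample values only).
articles = pd.DataFrame({
    "article_name": ["Bir I of Kanem", "Gwendolyn Garcia"],
    "country": ["Chad", "Philippines"],
    "revision_id": [355319463, 745137657],
    "article_quality": ["Stub", "GA"],
})

# Illustrative population data (sample values only).
population = pd.DataFrame({
    "country": ["Chad", "Philippines"],
    "region": ["AFRICA", "ASIA"],
    "population": [15_400_000, 108_000_000],
})

# Keep only countries present in both sources.
politician_by_country = articles.merge(population, on="country", how="inner")
# politician_by_country.to_csv("result/politician_by_country.csv", index=False)
```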
The aim of this project is to compare the number of articles (with and without good quality) in proportion to the national/regional population. Six comparison tables are therefore generated to show the results.
They are:
Table 1&2: Top/Bottom 10 countries by coverage
Table 3&4: Top/Bottom 10 countries by relative quality
Table 5&6: Regions by coverage (total articles/good-quality articles)
All 6 tables can be found in the notebook.
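The per-country metrics behind these tables can be sketched as below. The data is a tiny illustrative frame, and the metric definitions (articles per capita for coverage, FA/GA share for relative quality) follow the description above:

```python
import pandas as pd

# Tiny illustrative merged dataset.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "population": [1000, 1000, 5000, 5000, 5000],
    "article_quality": ["GA", "Stub", "FA", "C", "Start"],
})

grouped = df.groupby("country").agg(
    articles=("article_quality", "size"),
    good_articles=("article_quality", lambda q: q.isin(["FA", "GA"]).sum()),
    population=("population", "first"),
)
# Coverage: articles per capita; relative quality: share of FA/GA articles.
grouped["coverage"] = grouped["articles"] / grouped["population"]
grouped["relative_quality"] = grouped["good_articles"] / grouped["articles"]

top10_coverage = grouped.sort_values("coverage", ascending=False).head(10)
```

The regional tables (5&6) follow the same pattern with `groupby("region")` on the merged dataset.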
The ORES REST API takes three parameters: project, revid and model (Documentation, Usage).
Poetry for dependency management, Jupyter Notebook, Python 3.8
The project is licensed under the MIT license.