
Big Data and AI Engineering bootcamp 2nd capstone project. Using Big Data Tools to predict the probability of university enrollment for Egypt's High School students. 🏫 📚 🔬

RghdE/CapstoneTwo_EducationalLandscape

📚 Educational Landscape Project

As part of the Big Data and AI Engineering Onsite Bootcamp, we were asked to address a problem in the MENA market that Big Data and AI tools can solve. The project must have a measurable impact and solve a real-world problem using MENA datasets.

Table of Contents
  1. Project Overview
  2. Business Objective
  3. Dataset Overview
  4. Preprocessing Overview
  5. Visualization
  6. Modeling Results
  7. Contributing Members Contact
  8. Acknowledgments

Project Overview

This is the overview of the project's structure and files for easier navigation.

├── README.md
├── Deserts_Dashboard.pdf
├── presentation.pdf
├── Notebooks
│   ├── CapstoneProject2_Preprocessing_Notebook.ipynb
│   ├── CapstoneProject2_EDA_Notebook.zip
│   └── CapstoneProject2_ML_Notebook.ipynb
└── Datasets (too big to be uploaded)
    ├── High_School_Public_Results_2022_EG_both_attempts.csv (original dataset) 
    └── capstone_project2_preprocessed_eng.csv (output of the preprocessing notebook: used for the EDA, Dashboard, and Machine Learning models)

(back to top)

Business Objective

The goal of this project is to predict whether a student can enroll in one of Egypt's public institutions, based on their current major and a few extracted attributes. The data is a third-year secondary school dataset that was web scraped from Egypt's standardized test results.

Methods Used

  • Preprocessing
  • Feature Engineering
  • Feature Selection
  • Labeling and classifying the data
  • Exploratory Data Analysis
  • Data Visualization
  • Machine Learning
  • Oversampling
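
The README does not say which oversampling technique was used. As a minimal sketch, assuming plain random oversampling (duplicating minority-class rows until every class matches the largest one), the idea looks like this in plain Python. The function name `random_oversample` and the toy data are illustrative assumptions, not taken from the project notebooks.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are balanced.

    Assumption: simple random oversampling; the project may have used
    a different technique.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # All original rows belonging to this class.
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: four "not enrolled" (0) rows and one "enrolled" (1) row.
rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling is usually applied to the training split only, so that duplicated rows do not end up in the evaluation data.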

Technologies

  • Python, Jupyter
  • Pandas
  • Plotly
  • Power BI
  • Pig (Big Data tool)
  • PySpark Machine Learning (Big Data tool)

(back to top)

Dataset Overview

This dataset contains the public exam results of Egyptian students. It includes each student's desk identifier number for the exam (unique across all of Egypt), their gender and school name, the administration and city their school belongs to, and the number of test attempts they made. For each attempt, it lists every course available to the student's branch and the score achieved in each. Most courses count toward the total score, with three exceptions: religion, national education, and economics and statistics. The dataset has 45 features and 683k records, all from a single year, 2022.
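
To make the total-score rule concrete, here is a small sketch that sums a student's course scores while skipping the three courses that do not count. The course keys (`religion`, `national_education`, `economics_statistics`, and so on) and the scores are hypothetical; the real column names in the dataset may differ.

```python
# Hypothetical course keys; the dataset's actual column names may differ.
NOT_COUNTED = {"religion", "national_education", "economics_statistics"}


def total_score(scores):
    """Sum the scores of counted courses, ignoring missing (None) entries."""
    return sum(
        score
        for course, score in scores.items()
        if course not in NOT_COUNTED and score is not None
    )


# Toy record for one attempt (made-up numbers).
student = {
    "arabic": 60.0,
    "english": 40.5,
    "physics": 50.0,
    "religion": 20.0,            # excluded from the total
    "national_education": 15.0,  # excluded from the total
    "chemistry": None,           # course not taken by this student
}
print(total_score(student))  # 150.5
```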

Dataset link

The main challenge is that the features most predictive of our target are the grades themselves, which cannot be used as model inputs because they would cause data leakage. To build a solid prediction, we need to extract additional features from the existing columns, such as the school name.
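
The following sketch illustrates the idea of dropping grade columns and deriving features from the school name instead. Everything specific here is an assumption for illustration: the grade column names, the keywords searched for in the school name, and the resulting feature names are hypothetical and may not match what the preprocessing notebook actually extracts.

```python
# Hypothetical grade-related columns that would leak the target.
GRADE_COLUMNS = {"arabic", "english", "total_score"}


def school_name_features(school_name):
    """Derive simple boolean features from keywords in a school name (illustrative keywords)."""
    name = school_name.lower()
    words = name.split()
    return {
        "is_private": "private" in words,
        "is_language_school": "language" in words,
        "is_single_gender": any(word in words for word in ("girls", "boys")),
    }


def build_features(record):
    """Drop grade columns and the raw school name, then add derived features."""
    features = {
        key: value
        for key, value in record.items()
        if key not in GRADE_COLUMNS and key != "school_name"
    }
    features.update(school_name_features(record["school_name"]))
    return features


# Toy record (made-up values).
record = {
    "school_name": "Al Nasr Private Language School for Girls",
    "gender": "F",
    "arabic": 70,
    "total_score": 300,
}
print(build_features(record))
```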

At the beginning of our analysis, we raised some questions that we intend to answer using our EDA, dashboard visualization, and modeling. The questions are:

  1. How many branches are there? Do grades differ by branch?
  2. Do Egypt's grades follow a normal distribution?
  3. Did any unusual cases happen to students during their exams?
  4. What happens if a student fails or misses their exam? Are they given another chance? Does their score improve on the second attempt?
  5. For students with disabilities: what is their gender distribution? How many can join university? What are their grades? Do they have more second attempts or more unusual cases?
  6. Do Egypt and Saudi Arabia have the same schooling system? How do schools in each country handle students with disabilities?
  7. Do we have gender equality in our schools?
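
As one example of how a question like number 4 might be checked, the sketch below computes the average change in total score between a first and second attempt. The field names `first_total` and `second_total` and the numbers are hypothetical, chosen only to show the calculation.

```python
from statistics import mean


def second_attempt_gain(records):
    """Mean score change (second minus first) over students with both attempts.

    Returns None when no student has both totals recorded.
    """
    gains = [
        r["second_total"] - r["first_total"]
        for r in records
        if r.get("first_total") is not None and r.get("second_total") is not None
    ]
    return mean(gains) if gains else None


# Toy records (made-up numbers); the second student had only one attempt.
records = [
    {"first_total": 200, "second_total": 230},
    {"first_total": 250, "second_total": None},
    {"first_total": 180, "second_total": 190},
]
print(second_attempt_gain(records))  # 20
```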

(back to top)

Preprocessing Overview

Preprocessing is the core of this project. This README gives only an overview of each step; for a more detailed description, see our Medium Blog Post.

Before the Preprocessing:

image

After the Preprocessing:

image

General Preprocessing steps:

image

Visualization

Saudi dashboards:

Screenshot 2023-01-04 051558

Screenshot 2023-01-04 051617

Egypt dashboards:

Screenshot 2023-01-04 051640

Screenshot 2023-01-04 051704

(back to top)

Modeling Results

All of the following models were evaluated in order to choose the best one.

image

Gradient Boosting was selected as the best model because it had the highest accuracy. The results after optimization are shown below.

image
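
Selecting a model by accuracy, as described above, can be sketched as follows. This is a plain-Python illustration, not the project's PySpark code; the model names `model_a` and `model_b` and all labels and predictions are made-up placeholders, not the project's results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_model(y_true, predictions):
    """Return the name of the most accurate model and all accuracy scores."""
    scores = {name: accuracy(y_true, preds) for name, preds in predictions.items()}
    return max(scores, key=scores.get), scores


# Placeholder labels and predictions for two hypothetical models.
y_true = [1, 0, 1, 1]
predictions = {
    "model_a": [1, 0, 0, 0],
    "model_b": [1, 0, 1, 0],
}
name, scores = best_model(y_true, predictions)
print(name, scores)
```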

(back to top)

Contributing Members Contact

Team Leader: Reema Alaswad (Reema's LinkedIn)

Other Members:

Name               LinkedIn
Raghad Aleisa      Raghad's LinkedIn
AlJohara Alkanhal  AlJohara's LinkedIn
Maha AlHazzani     Maha's LinkedIn
Eman Aldosari      Eman's LinkedIn

(back to top)

Acknowledgments

(back to top)
