Unstack the Flow, Two-level Cluster Analysis of StackOverflow Posts

Data Mining and Wrangling project by Ria Ysabelle Flora, James Bryan Labergas, Maria Angela Legaspi, & Elijah Justin Medina

This study uncovers the underlying themes in the top posts about Python on StackOverflow in the last five years (2015 to 2019). Data was gathered from both the StackOverflow website (through web scraping) and the StackExchange data dump, for a total of 18 million posted questions. This was narrowed down to 1.3 million posted questions pertaining to Python, from which a random sample of 10K posts was used to determine the natural grouping of the topics. To check the robustness of the random sampling, the Jaccard index was computed, which came out to roughly 70%. The results may help students, enthusiasts, and academics decide which topics to focus on, and guide the development of literature and programs that address the demand behind these queries.
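As an illustration of the robustness check mentioned above, here is a minimal sketch of the Jaccard index, defined as the size of the intersection of two sets divided by the size of their union. The summary does not state which sets were compared in the study, so the theme sets, the helper names `jaccard_index` and `sample_posts`, and the seeds below are hypothetical and only show the mechanics.

```python
import random


def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)


def sample_posts(post_ids, k, seed):
    """Draw k post IDs without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(post_ids, k)


# Hypothetical theme labels recovered from two independent random samples.
themes_run_1 = {"pandas", "numpy", "django", "list", "string", "file", "matplotlib"}
themes_run_2 = {"pandas", "numpy", "flask", "list", "string", "file", "dict"}

# 5 shared themes out of 9 distinct themes overall.
score = jaccard_index(themes_run_1, themes_run_2)
print(f"Jaccard index: {score:.2f}")
```

A value close to 1 means the two samples lead to nearly the same set of themes, while a value close to 0 means the result depends heavily on which posts happened to be sampled.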

Data

The data used in this analysis comes from the StackExchange data dump and from data manually scraped directly from StackOverflow. It consists of posts from 2015 to 2019, amounting to about 16 million rows with 106 features and a total size of 85GB. Due to its size, the data is not included in the repository.
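Because a dataset of this size will not fit comfortably in memory, one option is to stream it. The sketch below is not the project's wrangling code (see Appendix A for that). It assumes the data dump's `Posts.xml` layout, where each post is a `<row>` element with `PostTypeId`, `CreationDate`, and `Tags` attributes and questions have `PostTypeId="1"`. The function name `iter_python_questions` and the exact-match rule on the `python` tag are assumptions made for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET


def iter_python_questions(source, start_year=2015, end_year=2019):
    """Yield attribute dicts of questions tagged 'python' within the year range.

    `source` may be a file path or a binary file-like object.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            attrs = elem.attrib
            # Tags are stored like "<python><pandas>"; split on the delimiters.
            tags = re.findall(r"[^<>|]+", attrs.get("Tags", ""))
            year = int(attrs.get("CreationDate", "0000")[:4])
            if (attrs.get("PostTypeId") == "1"
                    and "python" in tags
                    and start_year <= year <= end_year):
                yield dict(attrs)  # copy before the element is cleared
            elem.clear()  # release the element's contents after processing


# Tiny in-memory stand-in for Posts.xml (hypothetical rows).
sample = io.BytesIO(b"""<posts>
  <row Id="1" PostTypeId="1" CreationDate="2016-03-01T10:00:00.000" Tags="&lt;python&gt;&lt;pandas&gt;" />
  <row Id="2" PostTypeId="2" CreationDate="2016-03-02T10:00:00.000" />
  <row Id="3" PostTypeId="1" CreationDate="2013-01-01T10:00:00.000" Tags="&lt;python&gt;" />
  <row Id="4" PostTypeId="1" CreationDate="2018-07-09T10:00:00.000" Tags="&lt;python-3.x&gt;&lt;java&gt;" />
</posts>""")

kept_ids = [row["Id"] for row in iter_python_questions(sample)]
print(kept_ids)
```

Only the first row passes all three filters here: the second is an answer, the third falls outside 2015 to 2019, and the fourth carries `python-3.x` rather than the exact `python` tag. Whether related tags such as `python-3.x` should count is a design choice this sketch does not settle.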

Report and analysis

The full report is included in the Jupyter notebook, along with supplementary notebooks for processing, the robustness test, and other results. If you have any questions regarding this study, please send me a message via e-mail or LinkedIn.
