ejmmedina/unstack-the-flow

Unstack the Flow: Two-level Cluster Analysis of StackOverflow Posts

Data Mining and Wrangling project by Ria Ysabelle Flora, James Bryan Labergas, Maria Angela Legaspi, & Elijah Justin Medina

This research aims to uncover the underlying themes in the top StackOverflow posts about Python from the last five years. Data was gathered from both the StackOverflow website (through web scraping) and the StackExchange data dump, for a total of 18 million posted questions. This was narrowed down to 1.3 million questions pertaining to Python, from which a random sample of 10K posts was used to determine the natural grouping of topics. To test the robustness of the random sampling, the Jaccard index was computed, yielding roughly 70%. The results may help students, enthusiasts, and academics identify topics to focus on, and develop literature and programs that address the demand behind these queries.
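As a rough illustration of the robustness check mentioned above, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below is not taken from the project notebooks: the function name `jaccard_index` and the two term sets are hypothetical, and it assumes the comparison is made between sets of items (for example, top cluster terms) obtained from two independent random samples.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen here: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical top terms from clustering two independent random samples.
sample_1_terms = {"pandas", "dataframe", "django", "list", "string", "numpy"}
sample_2_terms = {"pandas", "dataframe", "django", "list", "regex", "numpy"}

score = jaccard_index(sample_1_terms, sample_2_terms)
print(f"Jaccard index: {score:.2f}")  # 5 shared terms out of 7 distinct -> 0.71
```

A value near 1 means the two samples produce nearly the same set of items, while a value near 0 means they share almost nothing.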

Data

The data used in this analysis comes from the StackExchange data dump and from data manually scraped directly from StackOverflow. It covers posts from 2015 to 2019, amounting to about 16 million rows with 106 features and a total of 85 GB. Due to its size, the data is not included in the repository.

Report and analysis

The full report is included in the Jupyter notebook, along with supplementary notebooks for processing, the robustness test, and other results. If you have any questions regarding this study, please send me a message via e-mail or LinkedIn.
