
Worked Example Miner (WEM): A Comprehensive Tool for Analyzing Java Repositories.


BrenoFariasdaSilva/Worked-Example-Miner



Welcome to my Worked-Example-Miner Repository!

The Worked-Example-Miner is a comprehensive tool for Java repository analysis. This tool integrates CK, PyDriller, and RefactoringMiner to analyze Java repositories and generate data and metadata about the software evolution. The tool is designed to identify trends in how repositories evolve over time and select prime candidates for creating worked examples.

This project is large and complex: it integrates multiple tools and explores several goals and research questions. With that in mind, each directory in this repository has its own README.md file explaining its purpose and how it contributes to the overall project.



Introduction

The Worked-Example-Miner project is a comprehensive research endeavor that delves into the evolution of code in Java repositories, focusing on Distributed Systems (DS). By integrating specialized tools and metrics, we aim to analyze code quality, identify patterns of improvement, and select exemplary code segments for educational purposes. Our research explores the intricacies of software engineering, emphasizing the importance of code metrics, refactoring, and code evolution in enhancing software design quality and maintainability.

Within this repository, you'll find a wealth of resources, from detailed code analyses and data sets to insightful findings and theoretical advancements. Whether you're a researcher seeking to deepen your understanding of software evolution, a developer looking for proven practices in distributed systems, or an educator aiming to enrich your curriculum, this documentation offers valuable knowledge and tools to support your goals.

Setup

Clone without Submodule

In order to clone this repository without the submodule (CK), you can use the following command:

git clone https://github.com/BrenoFariasdaSilva/Worked-Example-Miner

Clone with Submodule

In order to clone this repository with the submodule (CK), you can use the following command:

git clone --recurse-submodules https://github.com/BrenoFariasdaSilva/Worked-Example-Miner.git

Clone Submodule

In case you have already cloned the repository and forgot to clone the submodule (CK), you can use the following command to clone the submodule:

git submodule init
git submodule update

Paper Submissions

This research project aims to contribute to the fields of Software Engineering (SE) and Distributed Systems (DS) by exploring the evolution of code quality in Java repositories. Our research findings and insights will be shared through academic papers, conference presentations, and educational resources. We are committed to advancing knowledge in software development practices and improving the educational quality of worked examples in SE.

EduComp 2024 - Ideas Laboratory (UPDATE)

We are excited to announce that our paper's submission to the EduComp 2024 conference was accepted! EduComp is a premier conference that focuses on educational computing, providing a platform for researchers, educators, and practitioners to share their insights and innovations in the field of educational technology. Our paper highlights the significance of worked examples in software engineering education, particularly within the domain of Distributed Systems, and discusses a novel approach for selecting these examples based on code quality metrics.

The study introduces a heuristic based on metrics to examine the evolution of code quality in Distributed Systems, aiming to identify code examples that demonstrate significant improvements. Using software projects such as Apache Kafka and ZooKeeper, the research applies tools like CK (Java code metrics calculator) and RefactoringMiner integrated into the developed Worked Example Miner (WEM) tool. This approach allowed for the generation of statistical descriptions, linear regressions, and refactorings that aid in selecting code changes for worked examples.

Our findings reveal that this methodology can effectively contribute to the selection of worked examples for Distributed Systems, highlighting improvements in modularization, cohesion, and code reusability. Such examples are instrumental in enhancing learning and understanding in software engineering education.

For further details on our approach and findings, you can read our paper submission, Abordagem para seleção de exemplos trabalhados para Engenharia de Software do domínio de Sistemas Distribuídos, and watch our EduComp 2024 presentation of April 25, which is available on YouTube.

EduComp24, April 22-27, 2024, São Paulo, São Paulo, Brazil (Online)

© 2024 Copyright maintained by the authors. Publication rights licensed to the Brazilian Computer Society (SBC).

SBES 2024 (UPDATE)

We are also planning to submit a paper to the SBES 2024 conference. SBES is the Brazilian Symposium on Software Engineering, a prestigious event that brings together researchers, practitioners, and students to discuss the latest trends and advancements in software engineering. Our paper will delve into the evolution of code quality metrics in Java repositories, focusing on Distributed Systems (DS) and the implications for software design and maintainability. You can read our paper submission here: Abordagem para seleção de exemplos trabalhados para Engenharia de Software do domínio de Sistemas Distribuídos.

Goals

  1. Code Metrics Generation:

    • Traverse the repository commit history using PyDriller.
    • Extract code metrics using CK (Chidamber & Kemerer) metrics for Java repositories.
    • Extract refactoring patterns using RefactoringMiner for Java repositories.
  2. Code Metrics Selection:

    • Identify relevant code quality metrics for analyzing Distributed Systems (DS) evolution.
    • Evaluate the significance of selected metrics in reflecting code quality improvements.
    • Analyze the correlation between code quality metrics and non-functional characteristics.
  3. Analyzing Code Evolution:

    • Analyze code that started with "bad" values for the selected metrics and evolved over time.
    • Identify good code examples that indicate what makes code better and what changes are typically made to improve it.
  4. Educational Code Examples:

    • Develop a heuristic for selecting code examples that represent quality improvements in DS.
    • Identify code segments that demonstrate effective practices for code improvement.
    • Create worked examples that highlight the adaptation and evolution of DS code over time.

Skills

Our research project involves expertise in the following areas:

  • Python Language.
  • Python Libraries (Pandas, Matplotlib, NumPy, Scikit-Learn).
  • Java Language.
  • CK (Chidamber & Kemerer) Metrics.
  • PyDriller.
  • RefactoringMiner (Refactoring Detection).
  • Software Engineering.
  • Distributed Systems.
  • Worked Examples.
  • Statistical Data Analysis and Visualization (Min, Max, Average, Third Quartile, Median, Linear Regression).
  • Apache Kafka (Distributed Messaging System).
  • Apache ZooKeeper (Distributed Coordination Service).
  • GitHub Repositories.
  • Data Collection and Analysis.
  • Makefile.
  • Virtual Environment.

Feel free to explore the code and data in this repository. If you have any questions or suggestions, please don't hesitate to reach out to me.

Directories

Each directory in this repository has its own README.md file explaining its purpose. Please refer to individual README files for more details.

  • PyDriller: This Python library excels in mining software repositories. Within Worked Example Miner, PyDriller is used to navigate through the commit tree of a repository and run CK at every commit, ensuring a comprehensive analysis across the development timeline. This directory contains two main files: code_metrics.py and metrics_changes.py. The code_metrics.py file is responsible for extracting the CK metrics from the Java repositories, as well as generating commit diff files and a commit hashes list file. The metrics_changes.py file, on the other hand, is responsible for reading the generated CK metrics files, generating the metrics statistics and linear regressions, detecting substantial changes, and identifying refactoring types.

  • RefactoringMiner: This directory contains the RefactoringMiner tool, which specializes in detecting refactorings in Java repositories. By integrating RefactoringMiner into Worked Example Miner, we can identify and analyze refactorings that contribute to code evolution, highlighting changes that enhance code quality and maintainability. This directory contains two main files: metrics_evolution_refactorings.py and repositories_refactorings.py. The metrics_evolution_refactorings.py file is responsible for generating the refactorings files for the selected files in the Java repositories. The repositories_refactorings.py file is responsible for generating the refactorings file for each selected Java repository as a whole.

By leveraging the combined strengths of these tools, Worked Example Miner emerges as a powerhouse for Java repository analysis. It not only facilitates the generation of differential analyses for each commit but also meticulously tracks the historical progression of selected CK metrics at each stage of code development. Furthermore, the tool is equipped to conduct linear regression analyses, detect substantial changes, and identify refactoring types cataloged by RefactoringMiner.

The integration of these capabilities allows Worked Example Miner to produce an array of outputs, from detailed commit diffs to analyses of repository evolution and potential trends. Such comprehensive data is instrumental in pinpointing exemplary candidates for the creation of worked examples, thus enriching educational resources and facilitating a deeper understanding of Java repository dynamics.

In essence, Worked Example Miner shows how combining specialized tools leads to a better understanding of software development practices through the evolution of code metrics. Through its detailed analyses, educators, researchers, and developers are better equipped to study Java repositories, enabling the creation of rich, informative worked examples that highlight best practices and evolutionary insights in software development.
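
The linear regression step mentioned above can be illustrated with a minimal sketch. It is not the code in metrics_changes.py; it assumes that a metric's history is a plain list of values ordered from the oldest to the newest commit and that the commit index is used as the x axis. The function name `metric_trend` and the sample CBO values are hypothetical.

```python
from statistics import mean


def metric_trend(values):
    """Return the least-squares slope and intercept of a metric over commit order.

    `values` holds one metric value per commit, oldest first. Using the commit
    index (0, 1, 2, ...) as the x axis is an assumption of this sketch.
    """
    if len(values) < 2:
        raise ValueError("need at least two commits to fit a line")
    xs = range(len(values))
    x_mean, y_mean = mean(xs), mean(values)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical CBO history of one class across five commits.
cbo_history = [12, 11, 9, 8, 5]
slope, intercept = metric_trend(cbo_history)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A negative slope suggests that the metric tends to decrease over the observed commits, which is the kind of trend a worked-example candidate is expected to show for metrics such as CBO.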

Repositories

Our research project focuses on analyzing the evolution of code in Java repositories, with a particular emphasis on Distributed Systems (DS). We have selected two prominent repositories, Apache Kafka and Apache ZooKeeper, to serve as case studies for our investigation. These repositories are renowned for their contributions to distributed messaging systems and coordination services, respectively, making them ideal candidates for studying code evolution in DS. Also, they are widely used in the industry and academia, are open-source, and are still actively maintained and developed.

  • Apache Kafka:
    • Purpose: Apache Kafka is a distributed messaging system based on the publish-subscribe model, widely used for building real-time data processing infrastructures. It is designed to handle large-scale data flows, enabling organizations to process, store, and transmit data efficiently.
    • Usage in Research: Kafka's architecture, real-world usage, and capability to handle massive volumes of real-time data make it an excellent candidate for our study. It provides insights into the design and maintenance of distributed systems and how they evolve to meet scalability, fault tolerance, and data distribution requirements.
  • Apache ZooKeeper:
    • Purpose: Apache ZooKeeper is a distributed coordination service widely used for large-scale internet systems. It offers a reliable and highly available environment for coordinating tasks across multiple nodes in a distributed cluster.
    • Usage in Research: ZooKeeper's role in providing a consensus service for distributed systems and its mechanisms for ensuring data consistency across nodes make it invaluable for studying distributed service coordination, management, and the evolution of critical infrastructure components in distributed systems.

These are the main repositories that we are analyzing in this research project. In future work, we can expand the analysis to other repositories in order to consolidate our methodology and improve the results.

Methodology

This research adopts a systematic approach to explore the evolution of Distributed Systems (DS) through code metric analysis. Our methodology encompasses data collection, code analysis, and the integration of several tools and metrics to examine how code evolves in terms of complexity, quality, efficiency, and other aspects.

Data Collection

  • Repositories Selection: We select relevant repositories that align with our research goals, focusing on projects like Apache Kafka and ZooKeeper.
  • CK Integration: The CK tool is integrated to conduct code metric analysis on chosen commits, classes, or methods within the repositories.
  • Mining Software Repositories: PyDriller is utilized to navigate through the commit history, extracting essential data regarding code metrics and their evolution.
  • Metric Evaluation: We compute the value of each selected metric for each state (commit) of the code. This allows us to identify trends, patterns, and changes in the code over time.
  • Metric Visualization: We employ Matplotlib for generating visual representations that illustrate the progression of code metrics over time.
  • RefactoringMiner Integration: RefactoringMiner is used to detect refactorings in the codebase that signal improvements or changes contributing to code evolution.
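
The repository mining step above can be sketched as follows. This is a minimal illustration rather than the project's code_metrics.py: it assumes PyDriller is installed and uses its `Repository` class to walk the commit history. The helper names `is_java_source` and `java_changes_per_commit` are hypothetical.

```python
def is_java_source(path):
    """True for paths that a Java metrics tool could analyze (plain .java files)."""
    return path is not None and path.endswith(".java")


def java_changes_per_commit(repo_path):
    """Yield (commit hash, changed .java paths) for every commit, oldest first."""
    # Imported lazily so the helper above works without PyDriller installed.
    from pydriller import Repository

    for commit in Repository(repo_path).traverse_commits():
        # new_path is None for deleted files, so those are filtered out here.
        changed = [m.new_path for m in commit.modified_files if is_java_source(m.new_path)]
        if changed:
            yield commit.hash, changed
```

Iterating over `java_changes_per_commit("path/to/clone")`, where the path is a placeholder for a local clone of Kafka or ZooKeeper, gives the commits at which a metrics run would be worthwhile because at least one Java file changed.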

Code Analysis

We analyze instances where code initially demonstrated suboptimal metrics but evolved positively over time. Identifying exemplary modifications sheds light on effective practices for code improvement, focusing on alterations that enhance metric scores.
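
One way to flag such positive evolution is to look for large relative decreases between consecutive commits. The sketch below is an assumption about how this could be done, not the project's implementation. The 20% threshold, the function name `substantial_drops`, and the sample WMC values are illustrative only.

```python
def substantial_drops(history, threshold=0.20):
    """Return (index, before, after, relative_drop) for each large decrease.

    `history` is one metric's values per commit, oldest first. The 20% default
    threshold is an illustrative assumption, not a value taken from the project.
    """
    drops = []
    for i in range(1, len(history)):
        before, after = history[i - 1], history[i]
        if before <= 0:
            continue  # relative change is undefined for a zero baseline
        relative = (before - after) / before
        if relative >= threshold:
            drops.append((i, before, after, relative))
    return drops


# Hypothetical WMC history of one class.
wmc_history = [40, 41, 30, 29, 20]
for index, before, after, relative in substantial_drops(wmc_history):
    print(f"commit #{index}: {before} -> {after} ({relative:.0%} lower)")
```

Commits flagged this way are only candidates: the diff and any detected refactorings still need to be inspected to confirm that the change is a genuine design improvement.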

Research Questions

Our investigation is guided by four principal questions:

  1. How to identify relevant code quality metrics for analyzing DS evolution?
  2. What patterns and trends signify clear code improvement in DS?
  3. How do code improvements reflect on selected metrics and their correlation with non-functional characteristics?
  4. Which metrics and characteristics are crucial for selecting appropriate code examples for educational purposes in Software Engineering (SE)?

Proposed Approach

The project aims to develop a heuristic for identifying code examples that represent quality improvements in DS. This heuristic will aid in selecting code segments for educational examples, illustrating the adaptation and evolution of DS code over time. The heuristic will focus on improvements detectable through selected metrics, using specific tools on carefully chosen open-source repositories.

Software Metrics

Our analysis leverages a suite of metrics for object-oriented design as outlined in the seminal work by Chidamber and Kemerer. The study, titled "A Metrics Suite for Object Oriented Design," was published in the IEEE Transactions on Software Engineering (vol. 20, no. 6, pp. 476–493, 1994). It introduces key metrics that have become foundational in assessing and improving the design quality of object-oriented software systems. These metrics include:

  • Coupling Between Object classes (CBO): Reflects the degree of coupling by measuring the number of classes directly associated with a given class through method calls. A higher CBO value suggests higher complexity and lower flexibility, potentially leading to increased maintenance challenges. Reducing CBO over time can indicate improvements in code quality, aiming for a more modular software design that minimizes the impact of changes across the system.

  • Response for a Class (RFC): Represents the set of methods that can be executed in response to a message received by an instance of the class. A lower RFC value denotes fewer behaviors and potentially lower complexity, making the class more cohesive and easier to maintain and test.

  • Weighted Methods per Class (WMC): Calculates the sum of complexity measures of the class's methods. High WMC values may indicate complex classes with multiple responsibilities, affecting development and maintenance costs. Lower WMC values suggest a more focused and cohesive class, facilitating understanding and extension.

Additionally, the CK tool reports the remaining metrics of the same suite, which also help in understanding code evolution:

  • Depth of Inheritance Tree (DIT): Measures the number of ancestor classes, indicating the complexity level and the potential for side effects from changes in superclasses. A higher DIT value can imply more complex inheritance structures that may affect maintainability.

  • Lack of Cohesion in Methods (LCOM): Indicates the degree of method cohesion within a class, ranging from 0 (high cohesion) to 1 (low cohesion). Preferred low values suggest that methods within a class are closely related to each other, enhancing the class's cohesiveness.

  • Number of Children (NOC): Counts the direct subclasses of a class, with higher values hinting at greater reusability and significance within the codebase, as it implies a foundational role due to other classes' dependency on it.
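
To make three of these definitions concrete, the toy sketch below computes DIT, NOC, and WMC over a hand-written class model. It does not reproduce how the CK tool computes them. The class names and complexity values are hypothetical, and DIT here counts only ancestors present in the model (a real tool may also count java.lang.Object).

```python
# Hypothetical class model: each class maps to its direct superclass (or None).
PARENTS = {
    "Request": None,
    "FetchRequest": "Request",
    "ProduceRequest": "Request",
    "BatchedFetchRequest": "FetchRequest",
}

# Hypothetical per-method cyclomatic complexities for one class.
METHOD_COMPLEXITY = {"parse": 4, "validate": 3, "toString": 1}


def dit(cls, parents):
    """Depth of Inheritance Tree: number of ancestors listed in `parents`."""
    depth = 0
    while parents[cls] is not None:
        cls = parents[cls]
        depth += 1
    return depth


def noc(cls, parents):
    """Number of Children: classes whose direct superclass is `cls`."""
    return sum(1 for parent in parents.values() if parent == cls)


def wmc(method_complexity):
    """Weighted Methods per Class: sum of the methods' complexity values."""
    return sum(method_complexity.values())


print(dit("BatchedFetchRequest", PARENTS), noc("Request", PARENTS), wmc(METHOD_COMPLEXITY))
```

In this model, BatchedFetchRequest has two ancestors, Request has two direct subclasses, and the method complexities add up to 8.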

Refactoring Patterns

Refactorings play a crucial role in software evolution, enabling developers to enhance code quality, maintainability, and extensibility. By detecting and analyzing refactorings, we can identify patterns of improvement and understand how code evolves to meet changing requirements and design goals. RefactoringMiner is a powerful tool that automates the detection of refactorings in Java repositories, providing valuable insights into code changes and their implications.

Refactorings can be categorized into several types, each serving a specific purpose in code improvement, but these are the ones we use in our research:

  • Extract Method: Involves extracting a block of code into a new method to improve readability, maintainability, and reusability. This refactoring reduces code duplication and enhances modularity.
  • Extract Class: Separates part of a class into a new class to enhance cohesion and reduce complexity. This refactoring promotes a more focused and modular design, facilitating future changes and extensions.
  • Extract Superclass: Creates a superclass to encapsulate common behavior shared by multiple classes, promoting code reuse and modularity. This refactoring simplifies the inheritance hierarchy and enhances maintainability.
  • Pull Up Method: Moves a method from a subclass to a superclass to promote code reuse and simplify the inheritance hierarchy. This refactoring enhances modularity and reduces duplication.
  • Push Down Method: Transfers a method from a superclass to a subclass to enhance encapsulation and modularity. This refactoring ensures that methods are located closer to the data they operate on, improving code organization and maintainability and avoiding the "God class" anti-pattern. "God class" is a design flaw where a single class handles most of the system's functionality, breaking the Single Responsibility Principle and leading to poor maintainability and extensibility.
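
Filtering a refactoring report down to the five types above can be sketched as follows. The sketch assumes a JSON report with a top-level "commits" list whose entries carry a "refactorings" list of objects with a "type" field, which is how RefactoringMiner's JSON output is commonly structured; verify this against the version you run. The function name `count_refactorings` and the sample data are hypothetical.

```python
import json
from collections import Counter

# The five refactoring types listed above.
TYPES_OF_INTEREST = {
    "Extract Method", "Extract Class", "Extract Superclass",
    "Pull Up Method", "Push Down Method",
}


def count_refactorings(report):
    """Count the refactorings of interest in a parsed report (assumed shape)."""
    counts = Counter()
    for commit in report.get("commits", []):
        for refactoring in commit.get("refactorings", []):
            if refactoring.get("type") in TYPES_OF_INTEREST:
                counts[refactoring["type"]] += 1
    return counts


# Hand-written sample; the SHA and descriptions are made up for illustration.
sample = json.loads("""
{"commits": [{"sha1": "abc123", "refactorings": [
    {"type": "Extract Method", "description": "..."},
    {"type": "Rename Variable", "description": "..."},
    {"type": "Pull Up Method", "description": "..."}
]}]}
""")
print(count_refactorings(sample))
```

Types outside the set, such as the Rename Variable entry in the sample, are ignored, so the counts reflect only the refactorings considered in this research.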

Collectively, these metrics and refactorings provide a comprehensive view of the codebase's complexity, quality, and maintainability. They serve as essential tools for developers to refine software design and architecture effectively. It's important to note that these metrics are derived from static code analysis, which involves evaluating the source code without executing the program. This approach allows for an in-depth understanding of the code's structural and qualitative aspects, facilitating targeted improvements and ensuring a more robust, maintainable, and efficient software system.

Dynamic code analysis complements our understanding by examining the code's behavior during execution. It sheds light on runtime characteristics, class communication, performance, and resource utilization, offering a holistic view of the software's operational efficiency. Despite the value of dynamic analysis, our research emphasizes static code analysis. This focus allows us to delve into the software quality's evolution within the domain of Distributed Systems (DS), providing insights into the code design changes and their impact on maintainability and reliability over time.

Tools Utilized

  • CK Tool (with Enhancements): This repository includes a fork of the original CK tool, tailored for static code analysis in Java projects. The CK tool is instrumental in assessing various software metrics related to complexity, coupling, and cohesion, among others. Our version extends the original functionality by fixing Java dependency issues that were causing build failures. Additionally, we've introduced new features to track the instantiation frequency of classes and the invocation frequency of methods across the codebase. These enhancements aim to provide deeper insights into object creation patterns and method usage within Java applications, further aiding in the evaluation of code quality and design.
  • PyDriller: A Python library for mining software repositories, facilitating the extraction of changes, contributions, and evolution of code.
  • RefactoringMiner: Specialized in identifying and analyzing source code refactorings in Java repositories, providing insights into code evolution and quality improvement.

Conclusion

This research methodology, underpinned by detailed code metric analysis and tool integration, aims to offer significant insights into the evolution of software quality in DS. By identifying and analyzing patterns of improvement, this work contributes to the broader field of Software Engineering, particularly in educational contexts where real-world examples of code evolution are invaluable.

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated, and suggestions for improving the code are welcome. To keep your contributions smooth and effective, please follow the guidelines below; the CONTRIBUTING.md file has more details, including the commit standards and the full pull request process.

  1. Set Up Your Environment: Ensure you've followed the setup instructions in the Setup section to prepare your development environment.

  2. Make Your Changes:

    • Create a Branch: git checkout -b feature/YourFeatureName
    • Implement Your Changes: Make sure to test your changes thoroughly.
    • Commit Your Changes: Use clear commit messages, for example:
      • For new features: git commit -m "FEAT: Add some AmazingFeature"
      • For bug fixes: git commit -m "FIX: Resolve Issue #123"
      • For documentation: git commit -m "DOCS: Update README with new instructions"
      • For refactorings: git commit -m "REFACTOR: Enhance component for better aspect"
      • For snapshots: git commit -m "SNAPSHOT: Temporary commit to save the current state for later reference"
    • See more about crafting commit messages in the CONTRIBUTING.md file.
  3. Submit Your Contribution:

    • Push Your Changes: git push origin feature/YourFeatureName
    • Open a Pull Request (PR): Navigate to the repository on GitHub and open a PR with a detailed description of your changes.
  4. Stay Engaged: Respond to any feedback from the project maintainers and make necessary adjustments to your PR.

  5. Celebrate: Once your PR is merged, celebrate your contribution to the project!

Collaborators

We thank the following people who contributed to this project:

Breno Farias da Silva
Marco Aurélio Graciotto Silva

License

This project is licensed under the Apache License 2.0. This license permits use, modification, distribution, and sublicense of the code for both private and commercial purposes, provided that the original copyright notice and a disclaimer of warranty are included in all copies or substantial portions of the software. It also requires a clear attribution back to the original author(s) of the repository. For more details, see the LICENSE file in this repository.