zurlog/MLDS_PR104

PR104

Reproduction of a Software Quality Prediction Study

Comparison of Machine Learning Techniques for Software Quality Prediction
a paper by Goyal, S. (2020) published in Int. J. Knowl. Syst. Sci., 11(2) - IGI Global


This repository reproduces, as a university project, a short paper on software quality prediction. The original paper compares several software quality prediction models by accuracy, recall, and ROC AUC. The goal of this reproduction is to validate the paper's findings, to build a deeper understanding of the models, and to critically evaluate the authors' approach and choices.

In addition to reproducing the paper, this repository critiques the authors' approach. One major criticism is that the authors did not properly handle class imbalance, which can strongly affect model performance. Furthermore, they relied on misleading and inadequate performance metrics, which undermines the conclusions drawn from the results.
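To make the metric criticism concrete, here is a minimal sketch (not code from this repository) of a trivial classifier that labels every module as clean. It uses the CM1 class counts reported in the Data section (49 buggy, 449 clean). The function name `confusion_metrics` is hypothetical and chosen only for this illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (accuracy, recall) with the buggy class treated as positive."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Recall is undefined without positive samples; report 0.0 in that case.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# CM1 class counts as listed in the Data section of this README.
buggy, clean = 49, 449

# A classifier that always predicts "clean" finds no buggy module:
# every clean module is a true negative, every buggy one a false negative.
accuracy, recall = confusion_metrics(tp=0, fp=0, tn=clean, fn=buggy)

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
# prints: accuracy=0.902 recall=0.000
```

An accuracy of about 90% here says nothing about the model's ability to find faults, which is why accuracy alone can mislead on imbalanced datasets.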

If you are interested in software quality prediction or simply want to learn about the different models and techniques used in this field, this repository is for you! The code is well documented and easy to follow, making it an excellent resource for anyone looking to get started with software quality prediction. In addition, the critical evaluation of the original paper provides valuable insight into the limitations and potential improvements in this field of research.

So feel free to take a look, experiment with the code, and let me know if you have any questions or suggestions!


Data

The work uses data collected from NASA projects and made available in the PROMISE repository. The study uses six fault prediction benchmark datasets: CM1, KC1, KC2, PC1, JM1, and ALL_DATA (a combination of the other five). The features were extracted from the source code of multiple projects using McCabe and Halstead feature extractors.

| Name | Instances | Buggy | Clean | Imbalance Ratio | Features | Source |
|------|-----------|-------|-------|-----------------|----------|--------|
| CM1 | 498 | 49 | 449 | 0.109 | 22 | CM1 is a NASA spacecraft instrument written in C |
| JM1 | 10885 | 2106 | 8779 | 0.240 | 22 | JM1 is written in C and is a real-time predictive ground system. It uses simulations to generate predictions |
| KC1 | 2109 | 326 | 1783 | 0.183 | 22 | KC1 is a C++ system implementing storage management for receiving and processing ground data |
| KC2 | 522 | 107 | 415 | 0.258 | 22 | C++ functions used in a scientific data project which is separate from another part known as KC1. These share some third-party software libraries with no other software overlap |
| PC1 | 1109 | 77 | 1032 | 0.075 | 22 | Data from C functions. Flight software for earth orbiting satellite |
| ALL_DATA | 15123 | 2665 | 12458 | 0.214 | 22 | Combined Dataset |
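The README does not define the Imbalance Ratio column, but the listed values are consistent with the number of buggy instances divided by the number of clean instances. The sketch below recomputes the column under that assumption, using the counts from the table above. The helper name `imbalance_ratio` is hypothetical.

```python
# (buggy, clean) counts copied from the dataset table above.
datasets = {
    "CM1": (49, 449),
    "JM1": (2106, 8779),
    "KC1": (326, 1783),
    "KC2": (107, 415),
    "PC1": (77, 1032),
}


def imbalance_ratio(buggy, clean):
    """Assumed definition: minority (buggy) count over majority (clean) count."""
    return buggy / clean


ratios = {
    name: round(imbalance_ratio(buggy, clean), 3)
    for name, (buggy, clean) in datasets.items()
}

# ALL_DATA is the combination of the five datasets, so its counts are sums.
all_buggy = sum(buggy for buggy, _ in datasets.values())
all_clean = sum(clean for _, clean in datasets.values())
ratios["ALL_DATA"] = round(imbalance_ratio(all_buggy, all_clean), 3)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The summed counts (2665 buggy, 12458 clean) match the ALL_DATA row, which supports reading ALL_DATA as a plain concatenation of the five datasets.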

Usage

To run the analysis, you must have Python 3.x and the required libraries installed. The required libraries are listed and imported in the setup.ipynb notebook.

The MLDS_PR104 repository contains the following folders:

  • scripts, which contains the Jupyter notebooks for the analysis and for the setup of utility functions;
  • conf, which contains any configuration files used by the scripts or Jupyter notebooks (present only if needed);
  • data, which contains the input benchmark datasets in both '.csv' and '.arff' format;
  • results, which contains the outputs of the reproduction, usually in '.csv' format, for easy comparison with the original study;
  • figures, which contains the plot files;
  • reference, which contains any referenced resources.
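As a quick sanity check before running the notebooks, the class distribution of a '.csv' dataset can be inspected with the standard library alone. This is a minimal sketch, not code from the repository: the helper `count_labels`, the label column name `defects`, and the sample rows are all assumptions made for illustration, so check the actual column names in the files under data.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column):
    """Count how often each class label occurs in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


# Tiny in-memory stand-in for a dataset file. The rows are made up and the
# column names are assumptions; they are not taken from the repository.
sample = io.StringIO(
    "loc,v(g),defects\n"
    "10,2,false\n"
    "45,7,true\n"
    "12,1,false\n"
)

counts = count_labels(sample, label_column="defects")
print(counts)
```

With a real dataset, the same function can be called on a file opened with `open(path, newline="")`, where `path` points to one of the '.csv' files in the data folder.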


Acknowledgments

References, Inspiration, Code Snippets, etc.