
deekshaarya4/Info_Types_in_OSS_Issue_Discussions


README


This artifact contains the data and code used in the paper Analysis and Detection of Information Types of Open Source Software Issue Discussions.

It comprises three main folders: data, experiments, and results. Their components are as follows:

  1. data: This folder contains all the data used in the experiments.
  • chosen_issues - contains the comments of the 15 chosen OSS project issue discussions, retrieved from the GitHub API, in JSON format.
  • Codebook.xlsx - the codebook used to classify a sentence into a particular information type.
  • Corpus.xlsx - the list of annotated sentences and their corresponding conversational feature sets.
  • annotated_data_with_metadata.xlsx - the file exported from the Atlas.ti annotation tool. It lists the annotated sentences along with meta-information provided by the tool. Additionally, it contains comment-creation-time and author phrases annotated as METADATA, used for extracting conversational features.
  • all_data.pkl - a pickle file containing a pandas dataframe with information similar to Corpus.xlsx: the annotated sentences, their conversational feature sets, and the document in which each sentence appears.
  • data_by_document.pkl - contains the same information as all_data.pkl, except structured as a dictionary with documents as keys and pandas dataframes of sentence information as values.
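As an illustration, the two pickle files can be read back with pandas along these lines (the column names below are hypothetical stand-ins; the actual columns follow Corpus.xlsx):

```python
import pandas as pd

# A tiny stand-in for the annotated corpus; the column names are
# hypothetical here and the real ones come from Corpus.xlsx.
all_data = pd.DataFrame({
    "sentence": ["Steps to reproduce the crash follow.", "Thanks, merging now!"],
    "info_type": ["Bug Reproduction", "Social Conversation"],
    "document": ["issue_1", "issue_1"],
})
all_data.to_pickle("all_data_demo.pkl")

# all_data.pkl loads as a single dataframe of all annotated sentences...
loaded = pd.read_pickle("all_data_demo.pkl")

# ...while data_by_document.pkl holds the same rows as a dictionary
# keyed by document, with one dataframe per document.
by_document = {doc: grp for doc, grp in loaded.groupby("document")}
```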
  2. experiments: This folder contains all the code used to perform the experiments presented in the paper.
  • preprocess.ipynb - the first step: it reads the annotations from the .xlsx file exported from Atlas.ti, extracts the conversational feature information for each sentence, and stores it in all_data.pkl and data_by_document.pkl.
  • transform_features.ipynb - performs transformations on the conversational features, such as converting categorical columns to one-hot encodings and converting datetime features to numerically comparable values.
  • logistic_regression/random_forest - these folders contain the experiments presented in the paper.

All the files in this folder are Jupyter notebooks with the extension .ipynb. Each notebook comprises multiple code cells, which encapsulate the different steps of the workflow and make intermediate results easy to inspect. More about Jupyter can be found on the Jupyter project page.
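The transformations performed by transform_features.ipynb can be sketched roughly as follows (the feature names below are hypothetical; the real conversational feature set is listed in Corpus.xlsx):

```python
import pandas as pd

# Hypothetical conversational features for two sentences.
feats = pd.DataFrame({
    "author_role": ["owner", "contributor"],
    "comment_created_at": pd.to_datetime(
        ["2018-01-01 10:00:00", "2018-01-01 12:30:00"]),
})

# Categorical columns -> one-hot encoding.
feats = pd.get_dummies(feats, columns=["author_role"])

# Datetime features -> numerically comparable values, here the
# seconds elapsed since the earliest comment.
origin = feats["comment_created_at"].min()
feats["comment_created_at"] = (
    feats["comment_created_at"] - origin
).dt.total_seconds()
```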

  3. results: This folder contains the results of the experiments performed.

Two additional folders exist:

  1. docker: This folder contains the compressed Docker image with the required environment, as well as the Dockerfile used to build this image.

  2. nltk_data: This contains the WordNet corpus required by the code to lemmatize words.

Instructions for Reproducibility:

To set up the environment for this work, refer to the file INSTALL.md.

  1. Run all cells, in order, in preprocess_data.ipynb
  2. Run all cells, in order, in transform_features.ipynb
  3. Enter either of the algorithm folders (logistic_regression or random_forest) and run all cells, in order, of the experiment you wish to reproduce.
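If you prefer to execute the notebooks headlessly rather than interactively, the same steps can be run with jupyter nbconvert; `<notebook>` below is a placeholder for whichever experiment notebook you choose, and the paths assume you run the commands from the repository root:

```shell
# Step 1 and 2: execute the preprocessing and transformation notebooks
# in place, in order.
jupyter nbconvert --to notebook --execute --inplace experiments/preprocess.ipynb
jupyter nbconvert --to notebook --execute --inplace experiments/transform_features.ipynb

# Step 3: execute an experiment notebook from either algorithm folder.
jupyter nbconvert --to notebook --execute --inplace experiments/random_forest/<notebook>.ipynb
```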

About

Data, Code and Results from the ICSE 2019 accepted paper: Analysis and Detection of Information Types of Open Source Software Issue Discussions
