ALFA-group/beacons-in-code-comprehension

Modeling and explaining beacons in code comprehension

The goal of this project is to understand how software engineers comprehend computer programs.
Do they use beacons or references in programs to ease their comprehension? Are there critical parts of the program that they tend to focus on and spend more time on?
We also investigate how state-of-the-art generative models such as GPT perform at the task of identifying such beacons.

Setup

conda create -n program-comprehension python=3.8.11
conda env update --file env.yml --prune
conda activate program-comprehension

or

PIP_EXISTS_ACTION=w conda env create -f env.yml

Data

Programs used in the behavioral experiments were sourced from the following repositories:

https://github.com/githubhuyang/refactory
https://github.com/jkoppel/QuixBugs

To run

Step 1

First, get model output information for each stimulus:

Mode 1: Get last-layer model activations for each input token

For each problem, this mode generates a torch pickle (.pkl) containing a dict mapping tokens to tensors.
Output path: ./experiments/custom-anonym

python comprehend/model_outputs.py \
--model_names santa-coder \
--number_of_records -1 \
--dataset_name custom-anonym \
--dataset_path ./data \
--infer_interval 1 \
--expt_dir ./experiments \
--mode 1
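To sanity-check a Mode 1 output file, the tokens -> tensor dict can be loaded and inspected. A minimal sketch (the file name is a hypothetical example, and the real files are saved by torch, so they should be loaded with torch.load; plain pickle and list-valued vectors are used here only to keep the sketch dependency-free):

```python
import pickle

# Hypothetical stand-in for a Mode 1 activation file: a dict mapping
# each input token to its last-layer activation vector. The real files
# hold torch tensors and are loaded with torch.load(path).
activations = {
    "def": [0.12, -0.40, 0.88],
    "remove_extras": [0.05, 0.31, -0.22],
}

path = "problem_001_activations.pkl"  # assumed naming scheme
with open(path, "wb") as f:
    pickle.dump(activations, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Print each token with the dimensionality of its activation vector.
for token, vector in loaded.items():
    print(token, len(vector))
```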

Mode 2: Get model log-likelihood (LL) support sizes for each input token

This mode generates a CSV for each problem.
Output path: ./experiments/custom-anonym

python comprehend/model_outputs.py \
--model_names santa-coder \
--number_of_records -1 \
--dataset_name custom-anonym \
--dataset_path ./data \
--infer_interval 1 \
--expt_dir ./experiments \
--mode 2
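The per-problem CSVs from Mode 2 can then be scanned for tokens with large support sizes, e.g. as candidate beacons. A sketch using only the standard library (the column names `token` and `ll_support_size` are assumptions for illustration, not the script's documented schema):

```python
import csv
import io

# Simulated Mode 2 output: one row per input token with its
# log-likelihood support size (column names are assumed).
csv_text = """token,ll_support_size
def,3
remove_extras,17
return,2
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Rank tokens by support size, largest first.
ranked = sorted(rows, key=lambda r: int(r["ll_support_size"]), reverse=True)
print([r["token"] for r in ranked])
```

With a real file, `io.StringIO(csv_text)` would be replaced by `open(path)`.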

Step 2

Next, align the model output data with the participant responses, exported from Qualtrics (place the export in ./data):

python comprehend/prepare_dataset.py \
--responses_path "data/code-comprehend_March 13, 2023_10.00.xlsx" \
--token_wise_ll_support_path experiments/custom-anonym \
--token_wise_representations_path experiments/custom-anonym \
--out_path experiments/results

Step 3

Analyze the prepared data by training models:

python comprehend/analyze.py \
--dataset_path experiments/results

Test config

An example argument list for testing comprehend/model_outputs.py:

[
"--model_names", "codeberta-small",
"--number_of_records", "-1",
"--infer_interval", "2",
"--dataset_name", "custom-anonym",
"--dataset_path", "./data",
]
