
10 add evaluation pipeline #25

Open · wants to merge 31 commits into base: develop
Conversation

J-Dymond (Contributor)

Evaluation pipeline

Has a few utils files containing metrics and utility functions, and some scripts which perform evaluation on a selected model. Below I briefly go over the evaluation scripts as they are at the moment, and the changes I made to the evaluation dataset class. The evaluation scripts can be run periodically throughout training to get a clearer picture of model performance as it is being trained.

quantitative_eval.py

Performs quantitative evaluation over a test set with a selected model, comparing ground-truth inputs against perturbed inputs as in the paper. Unlike the paper, we don't generate the perturbed inputs; rather, they are answers to different questions, randomly sampled from the same author. In the future we can change this according to our work package. The script outputs the truth ratio values and the raw losses, which are output as a numpy array for further processing. Currently, if run as main, these are saved to a .np file in a separate folder inside the parent folder of the one where the model weights are stored.
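
For reference, a minimal sketch of how a truth ratio can be recovered from the saved losses, assuming the losses are length-normalised negative log-likelihoods so that probability = exp(-loss); the function and variable names here are illustrative, not the actual ones in the scripts:

import numpy as np

def truth_ratio(gt_losses: np.ndarray, perturbed_losses: np.ndarray) -> np.ndarray:
    # gt_losses: shape (n_samples,), losses on the ground-truth answers
    # perturbed_losses: shape (n_samples, n_perturbed), losses on the perturbed answers
    gt_prob = np.exp(-gt_losses)
    perturbed_prob = np.exp(-perturbed_losses)
    # ratio of the mean perturbed-answer probability to the true-answer probability
    return perturbed_prob.mean(axis=1) / gt_prob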

qualitative_eval.py

This performs a qualitative evaluation of the model. It loops over the test data and generates an output answer for each input question. Both the question and the generated answer are printed along with the target, to allow qualitative comparison against the target answer.
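
Roughly, the loop looks something like the sketch below; the field names "question"/"answer" and the generation settings are assumptions rather than the exact script:

import torch

def qualitative_loop(model, tokenizer, test_data, max_new_tokens=50):
    model.eval()
    for sample in test_data:
        inputs = tokenizer(sample["question"], return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # print question, generated answer, and target side by side
        print(f"Question:  {sample['question']}")
        print(f"Generated: {generated}")
        print(f"Target:    {sample['answer']}\n")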

EvalQADataset() Changes

I made some changes to allow the quantitative evaluation script to work. Namely, I added a batch formatter which, given a question, outputs input IDs, labels, and attention masks with appropriate padding for batch computation. Furthermore, a method which locates perturbed answers has been added: given a question index, it locates a random question pertaining to the same author, which can be used as a perturbed answer.
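
As a rough sketch of the two additions (the prompt template, the -100 label masking, and the helper names are assumptions for illustration, not the real class methods):

import torch

def batch_formatter(questions, answers, tokenizer):
    # Tokenise question/answer pairs with padding so they stack into a batch;
    # padded positions are masked out of the labels for the loss.
    # Assumes the tokenizer has a pad token set.
    texts = [f"Question: {q}\nAnswer: {a}" for q, a in zip(questions, answers)]
    encoded = tokenizer(texts, padding=True, return_tensors="pt")
    labels = encoded["input_ids"].clone()
    labels[encoded["attention_mask"] == 0] = -100
    return encoded["input_ids"], labels, encoded["attention_mask"]

def get_perturbed_answer(data, question_index, q_per_author, rng):
    # Pick a different question by the same author to act as the perturbed
    # answer, assuming a fixed number of questions per author laid out contiguously.
    author = question_index // q_per_author
    candidates = [
        idx
        for idx in range(author * q_per_author, (author + 1) * q_per_author)
        if idx != question_index
    ]
    return data[int(rng.choice(candidates))]["answer"]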

@J-Dymond linked an issue May 13, 2024 that may be closed by this pull request
Resolved review threads on: src/arcsf/data/tofu.py (outdated), src/arcsf/data/data_module.py (4 threads, outdated), tests/test_data_module.py (outdated), tests/test_eval.py (4 threads, 2 outdated)
@jack89roberts (Contributor) commented May 22, 2024

  • Ensure we understand what's going on in the eval (e.g. document shapes etc.)
    • including NaNs appearing in places
  • Compute and return alternative=greater in the KS test (as well as the default); see the sketch after this list.
  • Try to understand why IDK model comes out worse for forgetting
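
A minimal sketch of both KS variants with scipy; the truth-ratio arrays here are random placeholders standing in for the real forget/base values:

import numpy as np
from scipy.stats import ks_2samp

# placeholder samples standing in for forget-set and base-model truth ratios
forget_truth_ratios = np.random.default_rng(0).uniform(0.0, 1.0, size=200)
base_truth_ratios = np.random.default_rng(1).uniform(0.0, 1.0, size=200)

two_sided = ks_2samp(forget_truth_ratios, base_truth_ratios)  # scipy's default alternative
one_sided = ks_2samp(forget_truth_ratios, base_truth_ratios, alternative="greater")
print(two_sided.pvalue, one_sided.pvalue)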

@jack89roberts (Contributor)

It would be nice if the tests captured some of what we were trying to think through earlier today, e.g. checking that the truth ratio of one of the dummy forget models is larger than that of the dummy fine-tuned model, and similar. I haven't checked back through the tests, so it might be that you've already done that.

… outputs the metrics we want to track. Also hidden away/cleaned up evaluation scripts and moved most functions to the utils file
@J-Dymond (Contributor, Author)

I've made some changes to my pull request now. I've added a function in evaluate_model.py called evaluate_model:

def evaluate_model(
    model: torch.nn.Module,
    base_truth_ratios_path: str,
    tokenizer: transformers.AutoTokenizer,
    experiment_config: dict,
) -> dict[float, float, float, float, float]:

It outputs a dictionary containing:

result_dict = {
        "mean_tr_retain": retain_tr,
        "mean_rouge_score": rouge_score,
        "forget_quality_1": forget_quality_one_sided,
        "forget_quality_2": forget_quality_two_sided,
        "model_utility": model_utilty,
    }

This should be everything we want to track in wandb. I've added some tests for it, and I've moved the old scripts I wrote into a /scripts folder within eval. The functions for these scripts are all contained in utils now, so within the eval folder there are just three files:

  • metrics.py : containing the functions for the metrics
  • utils.py : containing all of the functions used in the scripts folder and the evaluate_model function
  • evaluate_model.py : containing the function for evaluating the model. In hindsight, maybe this could be moved to utils, but I'll leave that up to whatever you think is best.

…he max() function in table 1 of the tofu paper (it wasn't)
@J-Dymond (Contributor, Author)

Just something minor I didn't explicitly point out above: the path for the base model truth ratios should currently be the relative path to where the forget truth ratios are stored.

The all_eval script will calculate and save these, provided you give it the forget dataset.

These are the only values that need to be stored locally for evaluate_model to run, everything else should be calculated within the function.
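
For context, a hedged sketch of how evaluate_model might be called and its outputs logged; the import path, checkpoint paths, wandb project name, and config keys below are placeholders made up for illustration:

import transformers
import wandb

from arcsf.eval.evaluate_model import evaluate_model  # module path assumed from the layout above

# hypothetical checkpoint and config values
model = transformers.AutoModelForCausalLM.from_pretrained("path/to/forget_checkpoint")
tokenizer = transformers.AutoTokenizer.from_pretrained("path/to/forget_checkpoint")
experiment_config = {"batch_size": 8}  # placeholder config

wandb.init(project="arcsf-eval")  # placeholder project name
result_dict = evaluate_model(
    model=model,
    base_truth_ratios_path="path/to/forget_checkpoint/eval/forget",  # where all_eval saved the forget truth ratios
    tokenizer=tokenizer,
    experiment_config=experiment_config,
)
wandb.log(result_dict)  # mean_tr_retain, mean_rouge_score, forget_quality_1/2, model_utility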

@jack89roberts (Contributor) left a comment

I haven't got my head around it fully yet; I will look again next week. I left a few comments about things being hardcoded, but not for everything. For this PR they can be kept hardcoded, but if so we should make an issue listing everything that's left outstanding and will need to be changed for future runs/experiments.

return f"Question: {question}\nAnswer: {answer}"


class EvalQADataset(Dataset):
def qa_formatter_autoregression(qa: tuple[str, str, int]) -> str:
Contributor:

Maybe qa_formatter_blank or similar (i.e. nothing added other than question and answer).

@@ -57,7 +58,7 @@ def get_data(
return data


def qa_formatter_basic(qa: tuple[str, str]) -> str:
def qa_formatter_basic(qa: tuple[str, str, int]) -> str:
Contributor:

Note to self: Forgetting branch has refactored QA formatters.

Comment on lines +176 to +177
def batch_formatter(
self,
Contributor:

Note to self: Data collators/padding choices.

perturbed_options = self.data.filter(
lambda sample: sample["author_index"] == author_n
and sample["question_index"] != question_n
).shuffle(seed=self.rand_gen.seed())
Contributor:

I've not gone through to check, but is setting the seed like this here OK, or will it give the same perturbed samples every time? i.e. is this function used multiple times during an evaluation run, and should/does it give different perturbed rows each time?
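
To illustrate the concern: if shuffle(seed=...) always receives the same value, the filter/shuffle returns identical perturbed rows on every call, whereas drawing a fresh seed from the dataset's generator each call keeps runs reproducible overall while varying the samples. A toy sketch, not the actual class:

import numpy as np

rng = np.random.default_rng(42)  # stands in for the dataset's random generator

def fresh_shuffle_seed() -> int:
    # a new seed per call, so successive shuffles differ, but the whole
    # sequence is still reproducible from the initial generator seed
    return int(rng.integers(0, 2**32 - 1))

print(fresh_shuffle_seed(), fresh_shuffle_seed())  # two different values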

Comment on lines +157 to +161
forget_set.num_authors = TOFU_NUM_AUTHORS
forget_set.q_per_author = TOFU_Q_PER_AUTHOR

retain_set.num_authors = TOFU_NUM_AUTHORS
retain_set.q_per_author = TOFU_Q_PER_AUTHOR
Contributor:

Are these still needed/used?

Comment on lines +68 to +69
) -> dict[float, float, float, float, float]:
"""
Contributor:

Suggested change
) -> dict[float, float, float, float, float]:
"""
) -> dict[str, float]:
"""



def all_eval(
model: torch.nn.Module,
Contributor:

All model type hints should maybe be transformers.PreTrainedModel (but it could be that they all work with torch modules anyway).

Suggested change
model: torch.nn.Module,
model: transformers.PreTrainedModel,

Comment on lines +134 to +138
"all_losses": torch.zeros(
(dataset.__len__(), n_perturbed + 1), dtype=torch.float64
),
"truth_ratios": torch.zeros(dataset.__len__()),
"rougeL_recall": torch.zeros(dataset.__len__()),
Contributor:

Use len() rather than calling __len__ directly.

Suggested change
"all_losses": torch.zeros(
(dataset.__len__(), n_perturbed + 1), dtype=torch.float64
),
"truth_ratios": torch.zeros(dataset.__len__()),
"rougeL_recall": torch.zeros(dataset.__len__()),
"all_losses": torch.zeros(
(len(dataset), n_perturbed + 1), dtype=torch.float64
),
"truth_ratios": torch.zeros(len(dataset)),
"rougeL_recall": torch.zeros(len(dataset)),

def get_analysis_values(
model_dir: str,
) -> dict[np.ndarray, np.ndarray, np.ndarray, torch.Tensor, torch.Tensor]:
"""
Contributor:

Suggested change
) -> dict[np.ndarray, np.ndarray, np.ndarray, torch.Tensor, torch.Tensor]:
"""
) -> dict[str, np.ndarray | torch.Tensor]:
"""

vals["forget_losses"] = np.loadtxt(model_dir + "/eval/forget/all_losses.txt")
vals["retain_losses"] = np.loadtxt(model_dir + "/eval/retain/all_losses.txt")
vals["rouge_scores"] = np.loadtxt(model_dir + "/eval/retain/rougeL_scores.txt")
# we re-calculate the truth ratio, since torch calculated many as NaNs
Contributor:

Could we check for NaNs and log a warning somewhere if they appear?
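
A minimal sketch of what that check might look like, using numpy and the standard logging module; the helper name is made up:

import logging

import numpy as np

logger = logging.getLogger(__name__)

def warn_on_nans(name: str, values: np.ndarray) -> None:
    # count NaNs in a loaded array and emit a warning if any are present
    n_nan = int(np.isnan(values).sum())
    if n_nan:
        logger.warning("%d NaN value(s) found in %s", n_nan, name)

# e.g. warn_on_nans("forget_losses", vals["forget_losses"]) after loading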


Successfully merging this pull request may close these issues: Add evaluation pipeline.