
pass@k seems to be incorrectly computed on my end. #18

Open
ZishunYu opened this issue Jun 6, 2023 · 0 comments

Comments


ZishunYu commented Jun 6, 2023

Hi folks, thanks a lot for releasing this amazing project!

I was trying to run the evaluation module with the following code:

import os
import sys
from collections import defaultdict
from pathlib import Path

sys.path.append(str(Path(".").absolute().parent))

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from accelerate import Accelerator
from transformers import StoppingCriteriaList
from evaluate import load

from codetf.models import load_model_pipeline
from codetf.data_utility.util import EOF_STRINGS, EndOfFunctionCriteria, remove_last_block
from codetf.data_utility.human_eval_dataset import HumanEvalDataset
from codetf.performance.model_evaluator import ModelEvaluator

def main():
    # code_eval executes model-generated code, so it must be explicitly enabled
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"
    os.environ["TOKENIZERS_PARALLELISM"] = "true"

    model_class = load_model_pipeline(model_name="codet5", task="pretrained",
                                      model_type="plus-220M", is_eval=True,
                                      load_in_8bit=True, weight_sharding=False)

    dataset = HumanEvalDataset(tokenizer=model_class.get_tokenizer())
    prompt_token_ids, prompt_attention_masks, references = dataset.load()

    problems = TensorDataset(prompt_token_ids, prompt_attention_masks)

    evaluator = ModelEvaluator(model_class)
    avg_pass_at_k = evaluator.evaluate_pass_k(problems=problems, unit_tests=references,
                                              num_workers=1, k=[1, 10, 100], batch_size=256,
                                              num_return_sequences=200, sequences_per_chunk=10)
    print("Pass@k: ", avg_pass_at_k)

if __name__ == "__main__":
    main()

where I changed the evaluation parameters and switched the model to CodeT5+ 220M.

Surprisingly, the pass@k results were too good to be true; I got

Pass@k:  {'pass@1': 0.09500000000000008, 'pass@10': 0.6403557771996593, 'pass@100': 0.9999992577464678}

which is much higher than the 220M results reported in the CodeT5+ paper.
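
For reference, my understanding is that pass@k should come from the unbiased estimator in the Codex paper, 1 - C(n-c, k)/C(n, k) averaged over problems, where n is the number of generated samples per problem and c the number of passing ones. Here is a minimal sketch of that estimator (the helper and numbers below are just illustrative, not CodeTF code):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k), computed stably as a product.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With num_return_sequences=200, a single passing sample per problem already
# gives pass@100 = 0.5, and two passing samples give roughly 0.75.
print(pass_at_k(n=200, c=1, k=100))  # 0.5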

Is there anything wrong with the evaluation code above? It seems fine to me, but I am sorry if I made some silly mistake.
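
One way to cross-check, independent of CodeTF's evaluator, would be to run the same generations through the code_eval metric from the evaluate library (already imported above via load). A minimal sketch with toy inputs (the add example is only for illustration):

code_eval = load("code_eval")  # needs HF_ALLOW_CODE_EVAL="1", as set above

# predictions: one list of candidate programs per problem
# references:  one unit-test string per problem
pass_at_k, results = code_eval.compute(
    references=["assert add(2, 3) == 5"],
    predictions=[["def add(a, b): return a * b", "def add(a, b): return a + b"]],
    k=[1, 2],
)
print(pass_at_k)  # {'pass@1': 0.5, 'pass@2': 1.0}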

BTW, could you kindly share some sample code for evaluating on MBPP/APPS? Thanks a lot!! :)
