
pass@k seems to be incorrectly computed on my end. #18

Open
ZishunYu opened this issue Jun 6, 2023 · 0 comments

Comments


ZishunYu commented Jun 6, 2023

Hi folks, thanks a lot for releasing this amazing project!

I was trying to run the evaluation module with the following code:

import os
import sys
from collections import defaultdict
from pathlib import Path

sys.path.append(str(Path(".").absolute().parent))

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.dataloader import DataLoader
from tqdm import tqdm
from accelerate import Accelerator
from transformers import StoppingCriteriaList
from evaluate import load

from codetf.models import load_model_pipeline
from codetf.data_utility.util import EOF_STRINGS, EndOfFunctionCriteria, remove_last_block
from codetf.data_utility.human_eval_dataset import HumanEvalDataset
from codetf.performance.model_evaluator import ModelEvaluator

def main():
    # code_eval executes model-generated code, so it must be explicitly enabled
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"
    os.environ["TOKENIZERS_PARALLELISM"] = "true"

    model_class = load_model_pipeline(model_name="codet5", task="pretrained",
                                      model_type="plus-220M", is_eval=True,
                                      load_in_8bit=True, weight_sharding=False)

    dataset = HumanEvalDataset(tokenizer=model_class.get_tokenizer())
    prompt_token_ids, prompt_attention_masks, references = dataset.load()

    problems = TensorDataset(prompt_token_ids, prompt_attention_masks)

    evaluator = ModelEvaluator(model_class)
    avg_pass_at_k = evaluator.evaluate_pass_k(problems=problems, unit_tests=references,
                                              num_workers=1, k=[1, 10, 100], batch_size=256,
                                              num_return_sequences=200, sequences_per_chunk=10)
    print("Pass@k: ", avg_pass_at_k)

if __name__ == "__main__":
    main()

where I changed the evaluation parameters and switched the model to CodeT5+ 220M.

Surprisingly, the pass@k results were too good to be true; I got

Pass@k:  {'pass@1': 0.09500000000000008, 'pass@10': 0.6403557771996593, 'pass@100': 0.9999992577464678}

which is much higher than the 220M results reported in the CodeT5+ paper.
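
For reference, my understanding is that pass@k should come from the unbiased estimator in the Codex paper, 1 - C(n-c, k)/C(n, k) averaged over problems, where n is the number of generated samples per problem and c the number of passing ones. Here is a minimal sketch of that estimator (the helper and numbers below are just illustrative, not CodeTF code):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k), computed stably as a product.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With num_return_sequences=200, a single passing sample per problem already
# gives pass@100 = 0.5, and two passing samples give roughly 0.75.
print(pass_at_k(n=200, c=1, k=100))  # 0.5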

Is there anything wrong with the evaluation code above? It seems fine to me, but I am sorry if I made some silly mistake.
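
One way to cross-check, independent of CodeTF's evaluator, would be to run the same generations through the code_eval metric from the evaluate library (already imported above via load). A minimal sketch with toy inputs (the add example is only for illustration):

code_eval = load("code_eval")  # needs HF_ALLOW_CODE_EVAL="1", as set above

# predictions: one list of candidate programs per problem
# references:  one unit-test string per problem
pass_at_k, results = code_eval.compute(
    references=["assert add(2, 3) == 5"],
    predictions=[["def add(a, b): return a * b", "def add(a, b): return a + b"]],
    k=[1, 2],
)
print(pass_at_k)  # {'pass@1': 0.5, 'pass@2': 1.0}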

BTW, could you kindly share some sample code for evaluating on MBPP/APPS? Thanks a lot!! :)
