
How to obtain perplexity evaluation datasets? #11

Open
LGH1gh opened this issue Dec 14, 2022 · 1 comment

Comments

LGH1gh commented Dec 14, 2022

Dear Author,

Thanks for releasing RITA for protein generation!
However, I wonder how I can obtain the perplexity evaluation datasets used in your paper, and how to calculate perplexity.
I hope for your suggestions. Thanks in advance!

@Detopall

You can use the following code to calculate perplexity. I can't really help you with obtaining the perplexity evaluation datasets used in their paper.

import math
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
model.eval()

# Sample two sequences from the model.
rita_gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
sequences = rita_gen("MAB", max_length=200, do_sample=True, top_k=950, repetition_penalty=1.2,
                     num_return_sequences=2, eos_token_id=2)

def calculate_perplexity(sequence, model, tokenizer):
    input_ids = torch.tensor(tokenizer.encode(sequence)).unsqueeze(0)
    input_ids = input_ids.to(model.device)

    # Passing the input ids as labels makes the model compute the mean
    # cross-entropy loss over the sequence; perplexity is exp(loss).
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss

    return math.exp(loss)

for seq in sequences:
    print(f"seq: {seq['generated_text'].replace(' ', '')}")
    ppl = calculate_perplexity(seq['generated_text'], model, tokenizer)
    print(f"Perplexity: {ppl}\n")

With these results:

seq: MABVVGTALYPGSDRFDGEYEVDIVIDTDGARYVLPVINTITHVKQGTSTRHPLGKAGQARKYATMHTGNLVLHLFDKGHTGVSIHGTSIDERIFGADGRVIAEAQGSGDMRHYGISPNRVAVCVARPFGGEGFSVPLSIHALGNETGVQTTGSGDVSTTSAVEGPAQEQMGFLDHTLSYASSTILTYRTQVTTGLGGAR
Perplexity: 132566.77587907546

seq: MABPVVTREPGVYFLAPRVSKFYEIIPWWNEMYVIECSIVSAAAGAPAVTPIQIRAPDVDIMSQVTSTAGMTAFVKVKRSRVIKMYQRVEPVERLHALVGGASILLDASLPQAALVTIEGGDIFEVFHGTEGLLAIIDGAIQQGLFSYKM
Perplexity: 127686.55561821107

The lower the perplexity score, the better. The lower perplexity of the second sequence means the model assigns it higher likelihood, i.e., it looks more like the sequences the model was trained on, which suggests it is the better-quality sequence of the two.
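If you want a single perplexity number over a whole evaluation set (rather than per sequence), the standard approach is to weight each sequence's mean loss by its token count before exponentiating, so that long and short sequences contribute proportionally. A minimal sketch of that aggregation, assuming you have already collected (mean negative log-likelihood, token count) pairs, e.g. from the `calculate_perplexity` loss values above:

```python
import math

def corpus_perplexity(per_sequence_stats):
    """Token-weighted corpus perplexity.

    per_sequence_stats: list of (mean_nll, num_tokens) pairs, one per
    sequence, where mean_nll is the average cross-entropy loss the model
    reported for that sequence.
    """
    # Recover each sequence's total NLL, pool over the corpus, then
    # exponentiate the per-token average.
    total_nll = sum(nll * n for nll, n in per_sequence_stats)
    total_tokens = sum(n for _, n in per_sequence_stats)
    return math.exp(total_nll / total_tokens)

# Example: one sequence with mean NLL 2.0 over 100 tokens,
# another with mean NLL 3.0 over 50 tokens.
print(corpus_perplexity([(2.0, 100), (3.0, 50)]))  # exp((200 + 150) / 150)
```

Note that simply averaging the per-sequence perplexities would over-weight short sequences; the token-weighted form above matches how perplexity is usually reported over held-out sets.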

Hope this helps.
