[Text Generation] Debug a Text Generator #681

Open
SatishDeshbhratar opened this issue Mar 27, 2022 · 4 comments
SatishDeshbhratar commented Mar 27, 2022

Hi,

I want to debug a text generator. I am using two fine-tuned models: facebook/bart-large-cnn and human-centered-summarization/financial-summarization-pegasus.

I am following this tutorial: https://pair-code.github.io/lit/tutorials/generation/. Since the tutorial uses T5 models, can I use that code file for my fine-tuned models (I am not able to import any model other than T5)? If yes, is there a reference code file for this?

I was also following this guide for adding my own model: https://github.com/PAIR-code/lit/wiki/api.md#adding-models-and-data.
However, I am working on a text summarization task, so how can I replace the following parameters with my specification?
import pandas

def __init__(self, path):
    # Read the eval set from a .tsv file as distributed with the GLUE benchmark.
    df = pandas.read_csv(path, sep='\t')
    # Store as a list of dicts, conforming to self.spec()
    self._examples = [{
        'premise': row['sentence1'],     # --> in my case, the input text
        'hypothesis': row['sentence2'],  # --> I don't require a hypothesis
        'label': row['gold_label'],      # --> in my case, the output summary
        'genre': row['genre'],           # --> I don't require this
    } for _, row in df.iterrows()]
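
For reference, here is what I imagine the adapted loader could look like for my case. This is only a sketch: I am assuming a TSV with hypothetical columns 'text' and 'summary', and the class and field names are my own placeholders.

import pandas
from lit_nlp.api import dataset as lit_dataset
from lit_nlp.api import types as lit_types

class SummarizationData(lit_dataset.Dataset):
    """Loads (input text, reference summary) pairs from a TSV file."""

    def __init__(self, path):
        df = pandas.read_csv(path, sep='\t')
        # One dict per example, matching the field names in spec() below.
        self._examples = [{
            'input_text': row['text'],
            'output_summary': row['summary'],
        } for _, row in df.iterrows()]

    def spec(self):
        return {
            'input_text': lit_types.TextSegment(),
            'output_summary': lit_types.TextSegment(),
        }

Does this look like the right direction?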

jameswex (Collaborator) commented:

The generation tutorial follows along with some demo code we wrote that uses a T5 model for text generation, but the concepts in the UI are the same regardless of the architecture of the text generation models you are using in LIT.

You are correct that you will want to define new LIT Dataset and Model classes for the specific dataset and models you wish to use in LIT. As per the documentation, in the LIT Dataset you specify what fields will exist in the dataset and what their names will be. If your data contains just a single field called "input text", then give your Dataset spec a single entry in its dictionary with the name "input text" and a value of type TextSegment. Then set self._examples to a list of dicts, one dict per input, each with that single key "input text" and the string from the loaded dataset as its value. For your model, its input spec can be the same as the dataset spec (just one TextSegment named "input text"), and the output should be of type GeneratedText, as our example T5 model shows in its code. You can define the predict_minibatch function to do whatever it needs to do to get predictions from your model and return the generated text, similar to our T5 example.
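
To make the shape concrete, a rough sketch of such a model class might look like the following (untested; the class name and the generate_fn hook are placeholders I'm making up for illustration, not part of the LIT API):

from lit_nlp.api import model as lit_model
from lit_nlp.api import types as lit_types

class SummarizerModel(lit_model.Model):
    """Wraps an arbitrary batch text-to-summary function for LIT."""

    def __init__(self, generate_fn):
        # generate_fn: takes a list of input strings, returns a list of summaries.
        self._generate_fn = generate_fn

    def input_spec(self):
        # Field name must match the dataset spec.
        return {'input_text': lit_types.TextSegment()}

    def output_spec(self):
        # parent= points at the reference-summary field in the dataset.
        return {'output_text': lit_types.GeneratedText(parent='output_summary')}

    def predict_minibatch(self, inputs):
        # Run the wrapped model on the batch; return one dict per input.
        texts = [ex['input_text'] for ex in inputs]
        return [{'output_text': s} for s in self._generate_fn(texts)]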

SatishDeshbhratar (Author) commented:


I have tried creating this, but I am running into some issues that I have been stuck on for a long time. Can you help identify the issue?
I have shared the Colab notebook below; feel free to edit it or comment if I have made any mistakes.
https://drive.google.com/file/d/1E1Iwr-vMFO11D3RRQIzTgk34ZvlGPn-_/view?usp=sharing

jameswex (Collaborator) commented:

Thanks for sharing. The first issue you are running into is that your self.tokenizer and self.model are swapped: self.model should be the BartForConditionalGeneration, not the BartTokenizer. If you fix that, then from my test you'll run into another error down in the call to batch_encode_plus, but I'm not an expert on these tokenizers/models, so I'm not sure of the root cause of that issue in your code.
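
Roughly, the assignments in your model's constructor presumably need to look like this (a sketch based on the standard transformers loading calls, not tested against your notebook; 'texts' stands in for your batch of input strings):

from transformers import BartForConditionalGeneration, BartTokenizer

# Inside your LIT model class's __init__: the model and tokenizer were swapped.
self.model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
self.tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# batch_encode_plus is a tokenizer method, so it must be called on
# self.tokenizer, not self.model:
encoded = self.tokenizer.batch_encode_plus(
    texts, return_tensors='pt', padding=True, truncation=True)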

SatishDeshbhratar (Author) commented:

Is there anyone else on the team who can help? I am trying to work through this issue but I am not making progress. Alternatively, is there an example or reference for summarization, since the T5 demo is not working for me?
