Some questions about the Hugging Face Models. #3838

Open · davidRFB opened this issue Feb 18, 2024 · 4 comments

Comments

@davidRFB (Contributor)

Hi!
I have a couple of questions related to the Hugging Face models.

Based on the tasks available in the DeepChem wrapper, is it not possible to perform a task like text generation, as in this kind of example?

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the ProtGPT2 protein language model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

# Generate a sequence from the prompt 'M<|endoftext|>' and decode it
seq = tokenizer(['M<|endoftext|>'], return_tensors='pt')
outputs = model.generate(input_ids=seq['input_ids'])
print(outputs[0])
result = tokenizer.decode(outputs[0])
print(result)

with the output

M<|endoftext|>
MSNDTPDTRRRLLRGTAAAGAAAAVAGCSGGGDGDGDGGDGD

So, if it is not possible to use a text-generation task, maybe it is possible to use masked language modeling with the parameter mlm.

In this kind of task we would predict the masked token in the sequence.

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc
from deepchem.models.torch_models.hf_models import HuggingFaceModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm1b_t33_650M_UR50S")

# One masked sequence and an empty task column for the DeepChem CSV loader
seqs = ['QAVMGYSMGGGGTLA<MASK>ARDNPGLKAAFALAPWHT']
task = ['']
df = pd.DataFrame({'seq': seqs, 'task': task})
df.to_csv('test.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset = loader.create_dataset('test.csv')

# Wrap the model (note: no `task` argument is passed here)
hf_model = HuggingFaceModel(model, tokenizer)
hf_model.predict(dataset)

However, I am getting an error:

TypeError                                 Traceback (most recent call last)
TypeError: cannot unpack non-iterable NoneType object


Is it related to the empty task column in the DeepChem dataset?

@arunppsg (Contributor)

At this point, I think generation cannot be done, but MLM can. When initializing the Hugging Face model, I would suggest passing the task along with the model and tokenizer. Example:

model = HuggingFaceModel(model=model, task='mlm', tokenizer=tokenizer)

@davidRFB (Contributor, Author)

Yeah! As far as I can tell, only the MLM task can be done, but it works.

from deepchem.models.torch_models.hf_models import HuggingFaceModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm1b_t33_650M_UR50S")

# One masked sequence and an empty task column
seqs = ['MAA<mask>AKML']
task = ['']
df = pd.DataFrame({'seq': seqs, 'task': task})
df.to_csv('test.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset = loader.create_dataset('test.csv')

# With task='mlm', predict() returns the masked-LM logits, which can be decoded
hf_model = HuggingFaceModel(model, tokenizer, task='mlm')
output = hf_model.predict(dataset)
print(tokenizer.decode(output[0].argmax(axis=1)))
>>> <cls> M A A A A K M L <eos>

However, I would like to know whether it is also possible to do some fine-tuning on certain types of sequences, so that the mask-filling task would use a specific fine-tuned model. It seems like these models require a lot of RAM to retrain, right?

I tried this in Colab and the session died :(

from deepchem.models.torch_models.hf_models import HuggingFaceModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm1b_t33_650M_UR50S")
torch.cuda.empty_cache()

# Tiny training set: five copies of the same sequence with an empty task column
seqs = ['MAAKAKML', 'MAAKAKML', 'MAAKAKML', 'MAAKAKML', 'MAAKAKML']
task = ['', '', '', '', '']
df = pd.DataFrame({'seq': seqs, 'task': task})
df.to_csv('test_small_train.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset_train = loader.create_dataset('test_small_train.csv')

# Fine-tune with the masked-LM objective for one epoch
hf_model = HuggingFaceModel(model, tokenizer, task='mlm')
training_loss = hf_model.fit(dataset_train, nb_epoch=1)
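
(One thing that might help, just an untested sketch: swapping in a much smaller ESM-2 checkpoint such as facebook/esm2_t6_8M_UR50D, which should fit in Colab memory, while keeping the rest of the DeepChem workflow the same. The checkpoint choice is an assumption, not something verified here.)

from deepchem.models.torch_models.hf_models import HuggingFaceModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc

# Same MLM fine-tuning workflow as above, but with the ~8M-parameter ESM-2
# checkpoint (assumed name) instead of the 650M-parameter ESM-1b
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")

seqs = ['MAAKAKML'] * 5
task = [''] * 5
pd.DataFrame({'seq': seqs, 'task': task}).to_csv('test_small_train.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset_train = loader.create_dataset('test_small_train.csv')

hf_model = HuggingFaceModel(model, tokenizer, task='mlm')
training_loss = hf_model.fit(dataset_train, nb_epoch=1)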

@rbharath (Member)

RAM usage is definitely challenging for these models. I'd say you need at least 16 GB of RAM to train a model. I would love to see if we can work out how to do completion with ChemBERTa, though, since that would be very useful for conditional generation.
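
One possible starting point, outside the DeepChem wrapper and only a rough sketch: since ChemBERTa is a masked LM rather than a causal LM, completion could be approximated by repeatedly appending a mask token to a partial SMILES and filling it greedily. The checkpoint name and the loop below are assumptions for illustration, not an existing API.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed ChemBERTa masked-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

smiles = "CCO"  # partial SMILES to extend
for _ in range(5):  # extend by at most 5 tokens
    # Append a mask token and let the MLM head predict it
    inputs = tokenizer(smiles + tokenizer.mask_token, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    next_id = logits[0, mask_pos].argmax().item()
    next_token = tokenizer.decode([next_id]).strip()
    # Stop if the model predicts an end-of-sequence token
    if next_token in (tokenizer.eos_token, tokenizer.sep_token):
        break
    smiles += next_token
print(smiles)

This is greedy, so it will not give diverse samples, but it would be a quick way to check whether the MLM head carries enough signal for completion.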

@AIzealotwu

Yeah, some hardware requirements must be satisfied for fine-tuning. Roughly one to two A100 GPUs can be used to fine-tune a large language model with 7 billion parameters. In my research, at least 16 GB was needed to predict protein structures with a total length of more than 1400 residues using AlphaFold-Multimer.
