Some questions about the Hugging Face Models. #3838

Open · davidRFB opened this issue Feb 18, 2024 · 4 comments

Comments

@davidRFB (Contributor)

Hi!
I have a couple of questions related to the Hugging Face models.

Based on the tasks available in the DeepChem wrapper, is it not possible to perform a task like text generation, as in this kind of example?

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the ProtGPT2 protein language model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

# Generate a sequence from the prompt 'M<|endoftext|>' and decode it
seq = tokenizer(['M<|endoftext|>'], return_tensors='pt')
outputs = model.generate(input_ids=seq['input_ids'])
print(outputs[0])
result = tokenizer.decode(outputs[0])
print(result)

with the output

M<|endoftext|>
MSNDTPDTRRRLLRGTAAAGAAAAVAGCSGGGDGDGDGGDGD

So, if it is not possible to use a text-generation task, maybe it is possible to use masked language modeling with the parameter mlm.

In this kind of task we would predict the masked token in the sequence.

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc
from deepchem.models.torch_models.hf_models import HuggingFaceModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm1b_t33_650M_UR50S")

# One masked sequence and an empty task column for the DeepChem CSV loader
seqs = ['QAVMGYSMGGGGTLA<MASK>ARDNPGLKAAFALAPWHT']
task = ['']
df = pd.DataFrame({'seq': seqs, 'task': task})
df.to_csv('test.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset = loader.create_dataset('test.csv')

# Wrap the model (note: no `task` argument is passed here)
hf_model = HuggingFaceModel(model, tokenizer)
hf_model.predict(dataset)

However, I am getting an error:

TypeError                                 Traceback (most recent call last)
TypeError: cannot unpack non-iterable NoneType object


Is it related to the empty task column in the DeepChem dataset?

@arunppsg (Contributor)

At this point, I think generation cannot be done, but MLM can. When initializing the Hugging Face model, I would suggest passing the task along with the model and tokenizer. Example:

model = HuggingFaceModel(model=model, task='mlm', tokenizer=tokenizer)

@davidRFB (Contributor, Author)

Yeah! As far as I can tell, only the MLM task can be done, but it works.

from deepchem.models.torch_models.hf_models import HuggingFaceModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm1b_t33_650M_UR50S")

# One masked sequence and an empty task column
seqs = ['MAA<mask>AKML']
task = ['']
df = pd.DataFrame({'seq': seqs, 'task': task})
df.to_csv('test.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset = loader.create_dataset('test.csv')

# With task='mlm', predict() returns the masked-LM logits, which can be decoded
hf_model = HuggingFaceModel(model, tokenizer, task='mlm')
output = hf_model.predict(dataset)
print(tokenizer.decode(output[0].argmax(axis=1)))
>>> <cls> M A A A A K M L <eos>

However, I would like to know whether it is also possible to do some fine-tuning on certain types of sequences, so that the mask-filling task would use a specific fine-tuned model. It seems like these models require a lot of RAM to retrain, right?

I tried this in Colab and the session died :(

from deepchem.models.torch_models.hf_models import HuggingFaceModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm1b_t33_650M_UR50S")
torch.cuda.empty_cache()

# Tiny training set: five copies of the same sequence with an empty task column
seqs = ['MAAKAKML', 'MAAKAKML', 'MAAKAKML', 'MAAKAKML', 'MAAKAKML']
task = ['', '', '', '', '']
df = pd.DataFrame({'seq': seqs, 'task': task})
df.to_csv('test_small_train.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset_train = loader.create_dataset('test_small_train.csv')

# Fine-tune with the masked-LM objective for one epoch
hf_model = HuggingFaceModel(model, tokenizer, task='mlm')
training_loss = hf_model.fit(dataset_train, nb_epoch=1)
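
(One thing that might help, just an untested sketch: swapping in a much smaller ESM-2 checkpoint such as facebook/esm2_t6_8M_UR50D, which should fit in Colab memory, while keeping the rest of the DeepChem workflow the same. The checkpoint choice is an assumption, not something verified here.)

from deepchem.models.torch_models.hf_models import HuggingFaceModel
from transformers import AutoTokenizer, AutoModelForMaskedLM
import pandas as pd
import deepchem as dc

# Same MLM fine-tuning workflow as above, but with the ~8M-parameter ESM-2
# checkpoint (assumed name) instead of the 650M-parameter ESM-1b
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")

seqs = ['MAAKAKML'] * 5
task = [''] * 5
pd.DataFrame({'seq': seqs, 'task': task}).to_csv('test_small_train.csv', index=False)
loader = dc.data.CSVLoader(["task"], feature_field="seq", featurizer=dc.feat.DummyFeaturizer())
dataset_train = loader.create_dataset('test_small_train.csv')

hf_model = HuggingFaceModel(model, tokenizer, task='mlm')
training_loss = hf_model.fit(dataset_train, nb_epoch=1)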

@rbharath (Member)

RAM usage is definitely challenging for these models. I'd say you need at least 16 GB of RAM to train a model. I would love to see if we can work out how to do completion with ChemBERTa, though, since that would be very useful for conditional generation.
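
One possible starting point, outside the DeepChem wrapper and only a rough sketch: since ChemBERTa is a masked LM rather than a causal LM, completion could be approximated by repeatedly appending a mask token to a partial SMILES and filling it greedily. The checkpoint name and the loop below are assumptions for illustration, not an existing API.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed ChemBERTa masked-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

smiles = "CCO"  # partial SMILES to extend
for _ in range(5):  # extend by at most 5 tokens
    # Append a mask token and let the MLM head predict it
    inputs = tokenizer(smiles + tokenizer.mask_token, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    next_id = logits[0, mask_pos].argmax().item()
    next_token = tokenizer.decode([next_id]).strip()
    # Stop if the model predicts an end-of-sequence token
    if next_token in (tokenizer.eos_token, tokenizer.sep_token):
        break
    smiles += next_token
print(smiles)

This is greedy, so it will not give diverse samples, but it would be a quick way to check whether the MLM head carries enough signal for completion.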

@AIzealotwu

Yeah, some hardware requirements must be satisfied for fine-tuning. Roughly one to two A100 GPUs can be used to fine-tune a large language model with 7 billion parameters. In my research, at least 16 GB was needed to predict protein structures with a total length of more than 1400 residues using AlphaFold-Multimer.
