
Question about the value of cls token #186

Open
SELECT-FROM opened this issue Apr 21, 2024 · 0 comments

Thanks for your amazing work. I have a question about the value of the cls token. During pretraining, the expression value assigned to <cls> is pad_value (default -2), while during fine-tuning for integration, the value assigned to <cls> is 0. Is there a special purpose behind this design, given that the <cls> value differs between the pretraining stage and the fine-tuning stage?

scGPT/examples/pretrain.py

Lines 430 to 441 in 4068d67

def _map_append_cls(dataset: Dataset) -> Dataset:
    logger.info(f"Rank {args.local_rank}: Appending <cls> to dataset")
    dataset = dataset.map(
        lambda example: {
            "genes": [vocab["<cls>"]] + example["genes"],
            "expressions": [args.pad_value] + example["expressions"],
        },
        # batched=True, # not using since then the map func needs to loop
        num_proc=len(os.sched_getaffinity(0)),
    )
    return dataset

if append_cls:
    genes = np.insert(genes, 0, cls_id)
    values = np.insert(values, 0, 0)
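
To make the difference concrete, here is a minimal sketch (not from the repository; the gene ids, expression values, and cls_id are made up) of what the two code paths above produce for the expression vector:

    import numpy as np

    cls_id = 1            # hypothetical <cls> vocab id
    pad_value = -2.0      # scGPT default pad_value
    genes = np.array([12, 345, 678])    # toy gene ids
    values = np.array([1.5, 0.0, 3.2])  # toy expression values

    # Pretraining (_map_append_cls): the <cls> slot gets pad_value.
    pretrain_values = np.insert(values, 0, pad_value)  # [-2.0, 1.5, 0.0, 3.2]

    # Fine-tuning tokenizer (append_cls branch): the <cls> slot gets 0.
    finetune_values = np.insert(values, 0, 0)          # [ 0.0, 1.5, 0.0, 3.2]

    genes_with_cls = np.insert(genes, 0, cls_id)       # same in both cases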

During fine-tuning for batch integration, the model is trained in a self-supervised manner. When the gene expression values are masked, the value at the <cls> position may also be masked. This cannot happen during pretraining, because there the <cls> value equals pad_value and is excluded from the maskable positions. I would like to know why the <cls> value can also be masked in the batch-integration fine-tuning. What is the reason for this design?

for i in range(len(values)):
    row = values[i]
    non_padding_idx = np.nonzero(row - pad_value)[0]
    n_mask = int(len(non_padding_idx) * mask_ratio)
    mask_idx = np.random.choice(non_padding_idx, n_mask, replace=False)
    row[mask_idx] = mask_value

if keep_first_n_tokens > 0:
    result_ = self._mask(
        expressions[:, keep_first_n_tokens:],
        keep_first_n_tokens=0,
    )
    return torch.cat([expressions[:, :keep_first_n_tokens], result_], dim=1)
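
A minimal sketch (my own simplification of the two snippets above, using a single 1-D row and assumed mask_value/mask_ratio) of why the <cls> slot is maskable during fine-tuning but not during pretraining, and how keep_first_n_tokens protects it:

    import numpy as np

    pad_value, mask_value, mask_ratio = -2.0, -1.0, 0.5

    def random_mask(row, keep_first_n_tokens=0):
        # Mask a fraction of the non-padding positions, optionally protecting
        # the first keep_first_n_tokens positions (e.g. the <cls> slot).
        row = row.copy()
        maskable = row[keep_first_n_tokens:]
        non_padding_idx = np.nonzero(maskable - pad_value)[0]
        n_mask = int(len(non_padding_idx) * mask_ratio)
        mask_idx = np.random.choice(non_padding_idx, n_mask, replace=False)
        maskable[mask_idx] = mask_value
        return row

    # Pretraining-style row: the <cls> value is pad_value, so position 0 is
    # never in non_padding_idx and can never be masked, even with
    # keep_first_n_tokens=0.
    print(random_mask(np.array([pad_value, 1.5, 0.7, 3.2])))

    # Fine-tuning-style row: the <cls> value is 0, a valid non-padding value,
    # so position 0 can be masked unless keep_first_n_tokens >= 1 excludes it.
    print(random_mask(np.array([0.0, 1.5, 0.7, 3.2]), keep_first_n_tokens=0))
    print(random_mask(np.array([0.0, 1.5, 0.7, 3.2]), keep_first_n_tokens=1))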
