Incorrect use of `dtype` resulting into incorrect `token_ids` #452

damin604 · 2023-07-11T17:06:32Z

In the __iter__ method of the ConcatTokensDataset class, the dtype argument is not specified for the statement yield {'tokens': np.asarray(concat_sample).tobytes()}. The default dtype used by numpy is np.int32.

On the other hand, in the _read_binary_tokenized_sample method of the StreamingTextDataset class, the dtype is specified as np.int64, which results in incorrect token_ids.

Below is example displaying the issue,

import numpy as np
import torch

tokenizer = config.tokenizer

def load_sample(tokens, dtype):
    np_tokens = np.frombuffer(tokens, dtype=dtype)
    pt_tokens = torch.from_numpy(np_tokens)
    print(f"dtype = {dtype}", pt_tokens, tokenizer.decode(pt_tokens), sep="\n")

# dataset is an object of StreamingTextDataset
tokens = dataset.get_item(0)["tokens"]
load_sample(tokens, np.int64)
print()
load_sample(tokens, np.int32)

"""
>>> Output:
dtype = <class 'numpy.int64'>
tensor([   77309411330, 31048318582927, 55344948613018,  ...,
           30064791424,    30064786861,  5420248739272])
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>...

dtype = <class 'numpy.int32'>
tensor([    2,    18,   143,  ...,     7, 11720,  1262], dtype=torch.int32)
</s>The present invention relates to novel compounds of formula (I) that are capable...
"""

The text was updated successfully, but these errors were encountered:

dakinggg · 2023-07-23T23:16:29Z

Are you possible on a 32bit system? This is what I get on my machine

In [11]: import numpy as np

In [12]: token_ids = [1,10,100]

In [13]: np_arr = np.asarray(token_ids)

In [14]: np_bytes = np_arr.tobytes()

In [15]: np_arr.dtype
Out[15]: dtype('int64')

In [16]: read_in = np.frombuffer(np_bytes, dtype=np.int64)

In [17]: read_in
Out[17]: array([  1,  10, 100])

In [18]: read_in.dtype
Out[18]: dtype('int64')

We should probably just make the dtype explicit though to avoid the possibility of your issue, but what system are you running into this problem on?

damin604 added the bug Something isn't working label Jul 11, 2023

hanlint assigned knighton Jul 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect use of `dtype` resulting into incorrect `token_ids` #452

Incorrect use of `dtype` resulting into incorrect `token_ids` #452

damin604 commented Jul 11, 2023 •

edited

dakinggg commented Jul 23, 2023

Incorrect use of dtype resulting into incorrect token_ids #452

Incorrect use of dtype resulting into incorrect token_ids #452

Comments

damin604 commented Jul 11, 2023 • edited

dakinggg commented Jul 23, 2023

Incorrect use of `dtype` resulting into incorrect `token_ids` #452

Incorrect use of `dtype` resulting into incorrect `token_ids` #452

damin604 commented Jul 11, 2023 •

edited