Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect use of dtype resulting into incorrect token_ids #452

Open
damin604 opened this issue Jul 11, 2023 · 1 comment
Open

Incorrect use of dtype resulting into incorrect token_ids #452

damin604 opened this issue Jul 11, 2023 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@damin604
Copy link

damin604 commented Jul 11, 2023

In the __iter__ method of the ConcatTokensDataset class, the dtype argument is not specified for the statement yield {'tokens': np.asarray(concat_sample).tobytes()}. The default dtype used by numpy is np.int32.

On the other hand, in the _read_binary_tokenized_sample method of the StreamingTextDataset class, the dtype is specified as np.int64, which results in incorrect token_ids.

Below is example displaying the issue,

import numpy as np
import torch

tokenizer = config.tokenizer

def load_sample(tokens, dtype):
    np_tokens = np.frombuffer(tokens, dtype=dtype)
    pt_tokens = torch.from_numpy(np_tokens)
    print(f"dtype = {dtype}", pt_tokens, tokenizer.decode(pt_tokens), sep="\n")

# dataset is an object of StreamingTextDataset
tokens = dataset.get_item(0)["tokens"]
load_sample(tokens, np.int64)
print()
load_sample(tokens, np.int32)

"""
>>> Output:
dtype = <class 'numpy.int64'>
tensor([   77309411330, 31048318582927, 55344948613018,  ...,
           30064791424,    30064786861,  5420248739272])
<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>...

dtype = <class 'numpy.int32'>
tensor([    2,    18,   143,  ...,     7, 11720,  1262], dtype=torch.int32)
</s>The present invention relates to novel compounds of formula (I) that are capable...
"""
@damin604 damin604 added the bug Something isn't working label Jul 11, 2023
@dakinggg
Copy link
Collaborator

Are you possible on a 32bit system? This is what I get on my machine

In [11]: import numpy as np

In [12]: token_ids = [1,10,100]

In [13]: np_arr = np.asarray(token_ids)

In [14]: np_bytes = np_arr.tobytes()

In [15]: np_arr.dtype
Out[15]: dtype('int64')

In [16]: read_in = np.frombuffer(np_bytes, dtype=np.int64)

In [17]: read_in
Out[17]: array([  1,  10, 100])

In [18]: read_in.dtype
Out[18]: dtype('int64')

We should probably just make the dtype explicit though to avoid the possibility of your issue, but what system are you running into this problem on?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants