StreamingTextDataset's default dtype for binarized data should be int32 #321

Open
vancoykendall opened this issue Jun 14, 2023 · 2 comments
Labels: enhancement (New feature or request)

@vancoykendall (Contributor) commented:

🚀 Feature Request

In StreamingTextDataset, the _read_binary_tokenized_sample() method assumes the data is a numpy array of type np.int64. The default dtype should be np.int32, and the user should be able to specify np.uint16 if that is what they used (or the dtype could even be chosen automatically based on the vocab size; see Additional Context below).

Motivation

Currently, the largest tokenizers used by any model have a vocab size of ~250k. Since the range of int32 is -2,147,483,648 to +2,147,483,647, any such tokenizer can use int32 for its input ids; using int64 just doubles the size of the binarized data at no benefit. Additionally, the range of uint16 is 0 to 65,535, so a tokenizer with a vocab size of 65,536 or less can use uint16.
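
To make the savings concrete, here is a quick sketch comparing the byte footprint of one sequence at each dtype (the 2048-token sample is made up for illustration):

import numpy as np

tokens = np.arange(2048)  # a hypothetical 2048-token sample

print(tokens.astype(np.int64).nbytes)   # 16384 bytes
print(tokens.astype(np.int32).nbytes)   # 8192 bytes: half the size
print(tokens.astype(np.uint16).nbytes)  # 4096 bytes: a quarter of the size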

Implementation

For backward compatibility it might make sense to allow the user to specify the dtype. However, I'm not sure there would ever be a reason to use np.int64, so ideally you could just have a flag to indicate whether np.uint16 was used.

def _read_binary_tokenized_sample(self, sample):
    return torch.from_numpy(
        np.frombuffer(sample['tokens'],
            dtype=np.int64)[:self.max_seq_len].copy())

would change to something like:

def _read_binary_tokenized_sample(self, sample, used_uint16: bool = False):
    binary_dtype = np.uint16 if used_uint16 else np.int32
    return torch.from_numpy(
        np.frombuffer(sample['tokens'],
            dtype=binary_dtype)[:self.max_seq_len].copy())

or this:

def _read_binary_tokenized_sample(self, sample, binary_dtype: type = np.int32):
    return torch.from_numpy(
        np.frombuffer(sample['tokens'],
            dtype=binary_dtype)[:self.max_seq_len].copy())
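
As an illustration of the second variant, here is a hypothetical round trip (the token ids and max_seq_len are made up, and the reader is inlined as a standalone snippet; the decoded array is cast up because many torch versions cannot build tensors directly from uint16 arrays):

import numpy as np
import torch

max_seq_len = 4

# Write side: token ids serialized as np.uint16 (fits any vocab <= 65536).
sample = {'tokens': np.array([101, 2023, 2003, 1037, 102], dtype=np.uint16).tobytes()}

# Read side: decode with the matching dtype, then cast up, since older
# torch versions cannot create tensors from uint16 arrays.
ids = np.frombuffer(sample['tokens'], dtype=np.uint16)[:max_seq_len]
tokens = torch.from_numpy(ids.astype(np.int64))
print(tokens)  # tensor([ 101, 2023, 2003, 1037])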

Additional Context

I first noticed this when using EleutherAI's gpt-neox repo to binarize my data. Before the script binarizes the data, it checks the vocab size; if the vocab size is < 65500, it uses np.uint16.

Link to the following code snippet:

def __best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    else:
        return np.int32
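
One way to realize the auto-check idea from above would be to reuse this heuristic on the read side. A minimal sketch, assuming the dataset has access to the tokenizer's vocab size (the wiring below is hypothetical):

import numpy as np

def best_fitting_dtype(vocab_size=None):
    # Mirror gpt-neox's write-side heuristic when decoding.
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    return np.int32

# Hypothetical wiring inside StreamingTextDataset.__init__:
#     self.binary_dtype = best_fitting_dtype(tokenizer.vocab_size)
# _read_binary_tokenized_sample would then decode with self.binary_dtype,
# matching whatever dtype the write side chose for the same vocab.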
@vancoykendall added the enhancement label on Jun 14, 2023
@hanlint (Collaborator) commented on Jul 24, 2023:

@knighton can you take a look?

@dakinggg (Collaborator) commented:

@knighton or @karan6181 what do you think?
