StreamingTextDataset's default dtype for binarized data should be int32 #321

Open
vancoykendall opened this issue Jun 14, 2023 · 2 comments
Labels: enhancement (New feature or request)

@vancoykendall (Contributor) commented:

🚀 Feature Request

In StreamingTextDataset, the _read_binary_tokenized_sample() method assumes the data is a numpy array of type np.int64. The default dtype should be np.int32, and the user should be able to specify np.uint16 if that is what they used (or the dtype could even be chosen automatically based on the vocab size; see Additional Context below).

Motivation

Currently, the largest tokenizers used by any model have a vocab size of ~250k. Since the range of int32 is -2,147,483,648 to +2,147,483,647, any such tokenizer can use int32 for its input ids; using int64 just doubles the size of the binarized data at no benefit. Additionally, the range of uint16 is 0 to 65,535, so a tokenizer with a vocab size of 65,536 or less can use uint16.
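
To make the savings concrete, here is a quick sketch comparing the byte footprint of one sequence at each dtype (the 2048-token sample is made up for illustration):

import numpy as np

tokens = np.arange(2048)  # a hypothetical 2048-token sample

print(tokens.astype(np.int64).nbytes)   # 16384 bytes
print(tokens.astype(np.int32).nbytes)   # 8192 bytes: half the size
print(tokens.astype(np.uint16).nbytes)  # 4096 bytes: a quarter of the size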

Implementation

For backward compatibility it might make sense to allow the user to specify the dtype. However, I'm not sure there would ever be a reason to use np.int64, so ideally you could just have a flag to indicate whether np.uint16 was used.

def _read_binary_tokenized_sample(self, sample):
    return torch.from_numpy(
        np.frombuffer(sample['tokens'],
            dtype=np.int64)[:self.max_seq_len].copy())

would change to something like:

def _read_binary_tokenized_sample(self, sample, used_uint16: bool = False):
    binary_dtype = np.uint16 if used_uint16 else np.int32
    return torch.from_numpy(
        np.frombuffer(sample['tokens'],
            dtype=binary_dtype)[:self.max_seq_len].copy())

or this:

def _read_binary_tokenized_sample(self, sample, binary_dtype: type = np.int32):
    return torch.from_numpy(
        np.frombuffer(sample['tokens'],
            dtype=binary_dtype)[:self.max_seq_len].copy())
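
As an illustration of the second variant, here is a hypothetical round trip (the token ids and max_seq_len are made up, and the reader is inlined as a standalone snippet; the decoded array is cast up because many torch versions cannot build tensors directly from uint16 arrays):

import numpy as np
import torch

max_seq_len = 4

# Write side: token ids serialized as np.uint16 (fits any vocab <= 65536).
sample = {'tokens': np.array([101, 2023, 2003, 1037, 102], dtype=np.uint16).tobytes()}

# Read side: decode with the matching dtype, then cast up, since older
# torch versions cannot create tensors from uint16 arrays.
ids = np.frombuffer(sample['tokens'], dtype=np.uint16)[:max_seq_len]
tokens = torch.from_numpy(ids.astype(np.int64))
print(tokens)  # tensor([ 101, 2023, 2003, 1037])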

Additional Context

I first noticed this when using EleutherAI's gpt-neox repo to binarize my data. Before the script binarizes the data, it checks the vocab size; if the vocab size is < 65500, it uses np.uint16.

Link to the following code snippet:

def __best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    else:
        return np.int32
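
One way to realize the auto-check idea from above would be to reuse this heuristic on the read side. A minimal sketch, assuming the dataset has access to the tokenizer's vocab size (the wiring below is hypothetical):

import numpy as np

def best_fitting_dtype(vocab_size=None):
    # Mirror gpt-neox's write-side heuristic when decoding.
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    return np.int32

# Hypothetical wiring inside StreamingTextDataset.__init__:
#     self.binary_dtype = best_fitting_dtype(tokenizer.vocab_size)
# _read_binary_tokenized_sample would then decode with self.binary_dtype,
# matching whatever dtype the write side chose for the same vocab.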
@vancoykendall added the enhancement label on Jun 14, 2023
@hanlint (Collaborator) commented on Jul 24, 2023:

@knighton can you take a look?

@dakinggg (Collaborator) commented:

@knighton or @karan6181 what do you think?
