StreamingTextDataset's default dtype for binarized data should be int32 #321

Labels: enhancement (New feature or request)

## 🚀 Feature Request
In StreamingTextDataset, the `_read_binary_tokenized_sample()` method assumes the data is a numpy array of type `np.int64`. The default type should be `np.int32`, and the user should be able to specify if they used `np.uint16` (or maybe implement some kind of auto-check for the dtype based on the vocab size; see additional context below).

## Motivation
Currently the largest tokenizers used by any model have a vocab size of ~250k. Since the range of `int32` is -2,147,483,648 to +2,147,483,647, any tokenizer can use `int32` for its input ids. Using `int64` just doubles the size of the binarized data at no benefit. Additionally, the range of `uint16` is 0 to +65535, so a tokenizer with a vocab size of 65536 or less can use `uint16`.

## Implementation
For backward compatibility it might make sense to allow the user to specify the dtype. However, I'm not sure there would ever be a reason to use `np.int64`. So ideally you could just have a flag to indicate if `np.uint16` was used, and `_read_binary_tokenized_sample()` would decode with that dtype instead.
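As a rough illustration, a dtype-aware read could look like the sketch below. The function name `read_binary_tokenized_sample` and the `token_dtype` parameter are hypothetical here, not the actual StreamingTextDataset API:

```python
import numpy as np

def read_binary_tokenized_sample(data: bytes, token_dtype=np.int32) -> np.ndarray:
    # Hypothetical sketch: interpret the raw bytes using the dtype the data
    # was binarized with. .copy() gives a writable array, since frombuffer
    # returns a read-only view of the bytes object.
    return np.frombuffer(data, dtype=token_dtype).copy()

# A sample binarized with uint16 round-trips only if decoded with uint16.
sample = np.array([1, 2, 65535], dtype=np.uint16).tobytes()
tokens = read_binary_tokenized_sample(sample, token_dtype=np.uint16)
```

Decoding the same bytes with the current hard-coded `np.int64` would silently produce garbage token ids, which is why the dtype needs to be configurable rather than assumed.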
## Additional Context
I first noticed this using Eleuther AI's gpt-neox repo to binarize my data. Before their script binarizes the data, it checks the vocab size. If the vocab size is < 65500, they use `np.uint16` (link to the relevant code snippet).
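The vocab-size-based auto-check mentioned above could be sketched like this; `dtype_for_vocab` is a hypothetical helper, with the 65500 threshold taken from the gpt-neox behavior described here:

```python
import numpy as np

def dtype_for_vocab(vocab_size: int):
    # Mirror of the gpt-neox heuristic: uint16 for vocabularies that fit
    # comfortably in 16 bits (they use < 65500, leaving a little headroom
    # below the uint16 maximum of 65535), int32 otherwise.
    if vocab_size < 65500:
        return np.uint16
    return np.int32
```

For example, a GPT-2-style tokenizer (vocab size 50257) would be stored as `uint16`, while a ~250k multilingual vocabulary would fall back to `int32`.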