Hello, while running the code below to check which sequences appear at a specific batch index in the OLMo-7B pre-training dataset, I hit `IndexError: 925801835 is out of bounds for dataset of size 925201012`, so I would like to ask about it.
1. Preparation
I downloaded the .npy files listed under data.path in ./configs/official/OLMo-7B.yaml to disk using wget.
I then changed data.path in OLMo-7B.yaml to the local paths where I had just downloaded the data.
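As a side note for anyone debugging the same setup: one quick way to check whether the download is complete is to count the instances implied by the raw file sizes. This is only a sketch; it assumes token IDs are stored as 2-byte uint16 values and instances are 2048 tokens long (per the OLMo-7B config), and `count_instances` and the example directory path are hypothetical names, not part of the OLMo API.

```python
from pathlib import Path

TOKEN_BYTES = 2         # assumption: token IDs stored as uint16
SEQUENCE_LENGTH = 2048  # assumption: max_sequence_length from OLMo-7B.yaml

def count_instances(data_dir: Path) -> int:
    """Number of training instances implied by the total size of the .npy files."""
    total_bytes = sum(p.stat().st_size for p in data_dir.glob("*.npy"))
    return total_bytes // (TOKEN_BYTES * SEQUENCE_LENGTH)

# Example (hypothetical path): print(count_instances(Path("/data/olmo/train")))
```

If the count comes out below the dataset size the training run expects, some files were likely truncated or skipped during download.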
2. Executing the code
Here is the code I used:
import os

import numpy as np
import torch
from cached_path import cached_path
from transformers import AutoTokenizer

from olmo.config import TrainConfig
from olmo.data import build_memmap_dataset

FILE_PATH = os.path.abspath(__file__)  # directory of this script holds OLMo_config/

data_order_file_path = cached_path("https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy")
train_config_path = os.path.join(os.path.dirname(FILE_PATH), "OLMo_config/OLMo-7B.yaml")
cfg = TrainConfig.load(train_config_path)
batch_size = cfg.global_train_batch_size
global_indices = np.memmap(data_order_file_path, mode="r+", dtype=np.uint32)
dataset = build_memmap_dataset(cfg, cfg.data)

def get_batch_instances(batch_idx: int) -> list[list[int]]:
    batch_start = batch_idx * batch_size
    batch_end = (batch_idx + 1) * batch_size
    batch_indices = global_indices[batch_start:batch_end]
    batch_instances = []
    for index in batch_indices:
        token_ids = dataset[index]["input_ids"].tolist()
        batch_instances.append(token_ids)
    return batch_instances

def main():
    steps = [1]
    results = [False for _ in range(len(steps))]
    tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
    for i, step in enumerate(steps):
        # get_batch_instances returns a list of length 2048 (one instance per batch slot)
        batch = torch.tensor(get_batch_instances(batch_idx=step))
        batch_in_text = tokenizer.batch_decode(batch, skip_special_tokens=True)
        for sequence in batch_in_text:
            if "apple" in sequence.lower():
                results[i] = True
                break  # one match per batch is enough
    print(results)

if __name__ == "__main__":
    main()
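A bounds check run before decoding would surface the mismatch immediately, and shows exactly which entries in `global_indices` overshoot the dataset. This is a sketch with small synthetic arrays standing in for the real memmaps; `check_bounds` is a hypothetical helper, not part of the OLMo codebase.

```python
import numpy as np

def check_bounds(global_indices: np.ndarray, dataset_size: int) -> list[int]:
    """Return the sorted, unique indices that fall outside the dataset."""
    out_of_bounds = np.unique(global_indices[global_indices >= dataset_size])
    return out_of_bounds.tolist()

# Synthetic stand-ins for the real memmapped index array and dataset length:
indices = np.array([10, 925801835, 42, 925201012], dtype=np.uint32)
print(check_bounds(indices, 925201012))  # [925201012, 925801835]
```

Any non-empty result means the `global_indices.npy` file was built against a larger dataset than the one assembled locally, i.e. the local copy and the index file are out of sync.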
3. Detailed Error Message
> Traceback (most recent call last):
> File "test.py", line 96, in <module>
> main()
> File "test.py", line 83, in main
> batch = torch.tensor(get_batch_instances(batch_idx=step))
> File "test.py", line 60, in get_batch_instances
> token_ids = dataset[index]["input_ids"].tolist()
> File "site-packages/olmo/data/memmap_dataset.py", line 176, in __getitem__
> raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}")
> IndexError: 925801835 is out of bounds for dataset of size 925201012
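For what it's worth, the gap between the failing index and the reported dataset size gives a rough lower bound on how much data is unaccounted for locally. The arithmetic below assumes 2048-token instances stored as 2-byte uint16 IDs (per the OLMo-7B config), so treat the byte figure as an estimate only:

```python
failing_index = 925_801_835
dataset_size = 925_201_012

missing_instances = failing_index - dataset_size + 1  # at least this many instances
missing_tokens = missing_instances * 2048             # assumption: 2048 tokens per instance
missing_bytes = missing_tokens * 2                    # assumption: uint16 token IDs

print(missing_instances)           # 600824
print(missing_bytes / 1024**3)     # roughly 2.3 GiB of .npy data unaccounted for
```

A shortfall of that size is consistent with one or more .npy files failing to download (or downloading only partially), rather than with an off-by-one in the indexing code.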
Is the OLMo-7B pre-training corpus hosted at these URLs wrong? Or is the dataset at these URLs fine and something went wrong on my side when I downloaded it?