
IndexError in OLMo-7B pre-training dataset #538

Open
Bread0288 opened this issue Apr 9, 2024 · 0 comments
Labels
type/question An issue that's a question

❓ The question

Hello, while using the code below to check which sequences appear at a specific batch index of the OLMo-7B pre-training dataset, I got IndexError: 925801835 is out of bounds for dataset of size 925201012, so I would like to ask about it.

1. Preparation

  • The .npy files listed in data.path of ./configs/official/OLMo-7B.yaml were downloaded to disk with wget.
  • I changed data.path in OLMo-7B.yaml to point at the local directory where I just downloaded the data.
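Since an interrupted wget download silently leaves a truncated shard on disk, one way to check the local copy is to recompute the dataset size from the file sizes and compare it with the 925201012 reported in the error. This is only a sketch under assumptions that should be confirmed against OLMo-7B.yaml: each instance is 2048 tokens stored as uint16, and each shard is a raw token stream; the paths passed in are placeholders.

```python
import os

import numpy as np


def memmap_instance_count(paths, seq_len=2048, dtype=np.uint16):
    """Count complete training instances across raw token shards on disk.

    Each shard contributes file_bytes // (seq_len * itemsize) instances,
    so a truncated download directly shrinks the dataset and can push
    recorded global indices out of bounds.
    """
    itemsize = np.dtype(dtype).itemsize
    return sum(os.path.getsize(p) // (seq_len * itemsize) for p in paths)
```

If this count comes out below 925801836 (the failing index plus one), at least one local shard is incomplete or missing.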

2. Executing the code

  • Here is the code I used:

import os
import numpy as np
import torch
from cached_path import cached_path
from transformers import AutoTokenizer
from olmo.config import TrainConfig
from olmo.data import build_memmap_dataset

# FILE_PATH is defined earlier in my script.
data_order_file_path = cached_path("https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy")
train_config_path = os.path.join(os.path.dirname(FILE_PATH), "OLMo_config/OLMo-7B.yaml")
cfg = TrainConfig.load(train_config_path)
batch_size = cfg.global_train_batch_size
global_indices = np.memmap(data_order_file_path, mode="r", dtype=np.uint32)  # read-only is sufficient here
dataset = build_memmap_dataset(cfg, cfg.data)

def get_batch_instances(batch_idx: int) -> list[list[int]]:
    batch_start = batch_idx * batch_size
    batch_end = (batch_idx + 1) * batch_size
    batch_indices = global_indices[batch_start:batch_end]
    batch_instances = []
    for index in batch_indices:
        token_ids = dataset[index]["input_ids"].tolist()
        batch_instances.append(token_ids)
    return batch_instances


def main():
    steps = [1]
    results = [False for i in range(len(steps))]
    
    tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
    for i, step in enumerate(steps):
        batch = torch.tensor(get_batch_instances(batch_idx=step))
        # 2048 instances per training batch
        batch_in_text = tokenizer.batch_decode(batch, skip_special_tokens=True)
        for sequence in batch_in_text:
            if 'apple'.lower() in sequence.lower():
                results[i] = True
                break  # one match in this batch is enough
    print(results)


if __name__=="__main__":
    main()

3. Detailed Error Message

> Traceback (most recent call last):
>   File "test.py", line 96, in <module>
>     main()
>   File "test.py", line 83, in main
>     batch = torch.tensor(get_batch_instances(batch_idx=step))
>   File "test.py", line 60, in get_batch_instances
>     token_ids = dataset[index]["input_ids"].tolist()
>   File "site-packages/olmo/data/memmap_dataset.py", line 176, in __getitem__
>     raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}")
> IndexError: 925801835 is out of bounds for dataset of size 925201012

Is something wrong with the OLMo-7B pre-training corpus hosted at these URLs, or did something go wrong on my end when downloading it?
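For what it's worth, the gap in the traceback can be converted into a rough estimate of how much data is absent locally. This assumes 2048-token instances stored as 2-byte (uint16) tokens, which should be checked against the config:

```python
# Numbers taken from the traceback above.
failing_index = 925_801_835
local_dataset_size = 925_201_012

# Indices are zero-based, so at least this many instances are missing locally.
missing_instances = failing_index - local_dataset_size + 1   # 600824

# Assumed layout: 2048 tokens per instance, 2 bytes (uint16) per token.
missing_bytes = missing_instances * 2048 * 2
print(missing_instances, round(missing_bytes / 2**30, 2))    # 600824 2.29 (GiB)
```

A deficit of roughly 2.3 GiB could correspond to a single shard that was truncated or skipped during download.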

4. Additional Question
