How to train a model on a local copy of the v1_6-sample dataset #535

scalaboy · 2024-04-03T04:22:10Z

❓ The question

train.py assumes a cloud-hosted dataset. However, I have downloaded the v1_6-sample dataset from https://huggingface.co/datasets/allenai/dolma, and I just want to try training a model on this local copy for fun. Could you help me with how to do that? In the config, the data files are .npy, whereas in v1_6-sample the data is JSON, so the two don't match.

epwalsh · 2024-04-03T16:28:19Z

Hey @scalaboy, you can use the tools in Dolma to tokenize the JSON files. See https://github.com/allenai/dolma/blob/main/docs/tokenize.md, for example.
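To make the format gap concrete, here is a minimal sketch of what "tokenizing the JSON files" amounts to: read the `text` field of each JSON line, turn it into token IDs, and write all IDs as one flat array of unsigned 16-bit integers. This is not the Dolma tokenizer. `toy_tokenize`, `json_to_token_file`, the whitespace vocabulary, and the `<eos>` ID of 0 are hypothetical stand-ins chosen for illustration, and the assumption that the trainer wants a flat uint16 array should be checked against the linked docs and your config.

```python
import gzip
import json
from array import array


def read_texts(path):
    """Yield the "text" field of each JSON line in a plain or gzipped file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)["text"]


def toy_tokenize(text, vocab):
    """Hypothetical stand-in tokenizer: one ID per whitespace-separated word.

    A real run would use the tokenizer named in the training config instead.
    """
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def json_to_token_file(src, dst):
    """Tokenize every document in `src` and write the IDs to `dst`.

    Documents are separated by an end-of-sequence ID (0 here, an assumption).
    Returns the number of token IDs written.
    """
    vocab = {"<eos>": 0}
    ids = array("H")  # unsigned 16-bit integers, native byte order
    for text in read_texts(src):
        ids.extend(toy_tokenize(text, vocab))
        ids.append(0)  # mark the document boundary
    with open(dst, "wb") as f:
        ids.tofile(f)  # raw binary, no header
    return len(ids)
```

A file written this way can be read back with `numpy.memmap(dst, dtype=numpy.uint16, mode="r")`. For real training, use the Dolma tooling from the link above so that the token IDs match the tokenizer your config expects.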

scalaboy added the type/question label (An issue that's a question) on Apr 3, 2024
