Add data streaming support through mosaic-streaming #1525

Open · wants to merge 5 commits into main

Conversation

@fmv1992 (Author) commented Apr 16, 2024:

Description

This PR adds support for (non-volatile) memory-efficient training through StreamingDataset.

Motivation and Context

Context: #585.

How has this been tested?

I have tested this through Docker on a VM.

I'm open to ideas as to how this should be added. Does the repo support an S3 bucket, for instance?
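For reference, a rough sketch of the intended flow, assuming the standard mosaicml-streaming API; the bucket URI, cache path, and single text column below are placeholders rather than anything this PR fixes:

    from streaming import MDSWriter, StreamingDataset

    # One-off conversion of samples into the MDS shard format.
    samples = [{"text": "hello world"}, {"text": "axolotl"}]
    with MDSWriter(out="s3://my-bucket/mds", columns={"text": "str"}) as writer:
        for sample in samples:
            writer.write(sample)

    # At training time, shards are fetched and cached lazily instead of
    # loading the whole dataset into memory.
    dataset = StreamingDataset(
        local="/tmp/mds-cache", remote="s3://my-bucket/mds", shuffle=True
    )
    for sample in dataset:
        print(sample["text"])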

@fmv1992 marked this pull request as draft April 16, 2024 12:42
requirements.txt Outdated
@@ -31,6 +31,7 @@ art
fschat==0.2.36
gradio==3.50.2
tensorboard
mosaicml-streaming
@fmv1992 (Author) commented:

question: mosaicml-streaming should be an optional dependency. Is this the right way of adding it in this capacity?
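For illustration only, one way I could see keeping it optional is an extras_require entry in setup.py; this is just a sketch, and the extra name "mosaic" is made up:

    # setup.py (sketch): expose mosaicml-streaming as an optional extra.
    from setuptools import find_packages, setup

    setup(
        name="axolotl",
        packages=find_packages(),
        install_requires=[
            # ... required dependencies ...
        ],
        extras_require={
            # Only installed with `pip install axolotl[mosaic]`.
            "mosaic": ["mosaicml-streaming"],
        },
    )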

Collaborator commented:

I'm okay with it being a required dependency. If it causes issues down the line we can make it optional then. Hoping to keep things simpler.

Collaborator commented:

Could this version be locked?

@fmv1992 (Author) commented:

Addressed by 976bc13.

setup.py Outdated (resolved)
@fmv1992 marked this pull request as ready for review April 16, 2024 12:43
#
# This is necessary because downstream functions use a different interface
# than `StreamingDataset` (e.g. the `features` attribute).
ds = Dataset.from_generator(
@winglian (Collaborator) commented Apr 16, 2024:

This becomes an IterableDataset, right?

@fmv1992 (Author) replied:

@winglian,

Sorry for the delay here.

No, that was something I wanted to verify, but it looks like it goes to def process and everything is evaluated eagerly.
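To illustrate the eager/lazy distinction with a toy generator (a sketch, not this PR's code):

    from datasets import Dataset, IterableDataset

    def gen():
        for i in range(3):
            yield {"text": f"sample {i}"}

    # Eager: materializes every sample to Arrow files at construction time.
    eager_ds = Dataset.from_generator(gen)

    # Lazy: samples are only produced while iterating.
    lazy_ds = IterableDataset.from_generator(gen)
    print(next(iter(lazy_ds)))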

I started a draft like:

    def process(self, dataset):
        features = list(dataset.features.keys())
        map_kwargs = {}
        if self.prompt_tokenizer.supports_batched:
            map_kwargs["batched"] = True
            map_kwargs["batch_size"] = 100

        if isinstance(dataset, IterableDataset):
            # IterableDataset.map is lazy and does not accept num_proc,
            # keep_in_memory, or desc.
            return dataset.map(
                self.prompt_tokenizer.tokenize_prompt,
                remove_columns=features,
                **map_kwargs,
            )

        num_proc = min(
            64, self.process_count if self.process_count else os.cpu_count()
        )
        return dataset.map(
            self.prompt_tokenizer.tokenize_prompt,
            num_proc=num_proc,
            remove_columns=features,
            keep_in_memory=self.keep_in_memory,
            desc="Tokenizing Prompts",
            **map_kwargs,
        )

But I don't know whether that's a good idea. The .map API is different between Dataset (here) and IterableDataset (here).
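Roughly, the difference is in the accepted keyword arguments; a sketch with a stand-in tokenizer (exact keyword support may vary across datasets versions):

    from datasets import Dataset, IterableDataset

    def fake_tokenize(batch):
        # Stand-in for prompt_tokenizer.tokenize_prompt.
        return {"length": [len(t) for t in batch["text"]]}

    data = {"text": ["a", "bb", "ccc"]}

    # Eager map: supports num_proc, desc, keep_in_memory and runs immediately.
    ds = Dataset.from_dict(data)
    ds = ds.map(fake_tokenize, batched=True, batch_size=2, desc="Tokenizing Prompts")

    # Lazy map: narrower signature (no num_proc/desc/keep_in_memory),
    # applied on the fly during iteration.
    ids = IterableDataset.from_generator(lambda: ({"text": t} for t in data["text"]))
    ids = ids.map(fake_tokenize, batched=True, batch_size=2)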

Feel free to remove the "ready to merge" tag from this.

setup.py Outdated (resolved)
@winglian (Collaborator) left a comment:

Good to go. Thank you @fmv1992!

@fmv1992 (Author) commented Apr 17, 2024:

Thanks, much appreciated; I'm just checking a few more things before merging.

The experience of contributing to this repo has been very positive.

@NanoCode012 (Collaborator) commented:

Hey, thanks for the PR. I just wanted to clarify something I asked previously. This would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud loading section. For example: add stream: true to load a Mosaic streaming dataset.

You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd

https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start

@Kesta-bos commented:

I think it also needs 'StreamingDataset' support for pretraining datasets (completion), in addition to fine-tuning datasets.

@ehartford (Collaborator) commented:

Can we pretrain with Axolotl, streaming a data mix from S3?

@winglian (Collaborator) commented:

> Hey, thanks for the PR. I just wanted to clarify something I asked previously. This would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud loading section. For example: add stream: true to load a Mosaic streaming dataset.
>
> You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd
>
> https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start

JSONL should be fine for streaming. See https://github.com/mosaicml/streaming?tab=readme-ov-file#1-prepare-your-data
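A sketch of what that could look like, assuming streaming's JSONWriter and an existing train.jsonl with a single text column (the bucket and paths are placeholders):

    import json

    from streaming import JSONWriter, StreamingDataset

    # Re-shard an existing JSONL file into a streaming-friendly layout.
    with JSONWriter(out="s3://my-bucket/shards", columns={"text": "str"}) as writer:
        with open("train.jsonl") as fin:
            for line in fin:
                writer.write(json.loads(line))

    # Shards are then downloaded and cached on demand during training.
    dataset = StreamingDataset(local="/tmp/shards", remote="s3://my-bucket/shards")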

@fmv1992 (Author) commented Apr 22, 2024:

> Can we pretrain with Axolotl, streaming a data mix from S3?

We can, but I'd prefer to include this in a second PR. Right now I would rather see this smaller change working and merged; expanding on it should be easier later.

@fmv1992 (Author) commented Apr 22, 2024:

> Hey, thanks for the PR. I just wanted to clarify something I asked previously. This would require users to preprocess their dataset into Mosaic's format first, right? If so, I would prefer this to be documented somewhere near the cloud loading section. For example: add stream: true to load a Mosaic streaming dataset.
>
> You should also add this parameter to https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd
>
> https://github.com/mosaicml/streaming?tab=readme-ov-file#quick-start

Addressed by ba86339. Let me know if that addresses all your points.

@fmv1992 (Author) commented Apr 22, 2024:

As per this comment, this is not ready for merging; maybe we want to remove that tag.

I posted a draft of the changes there, but the issue is that tokenization should happen as the data is downloaded. Right now I'm almost certain it does everything in a batch: it downloads everything, then tokenizes everything, then proceeds to do the fine-tuning.
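For contrast, the behaviour I am after would look roughly like this, sketched with a dummy stream and tokenizer rather than this PR's code:

    from datasets import IterableDataset

    def samples_from_stream():
        # Placeholder for samples arriving from the StreamingDataset.
        for i in range(1000):
            yield {"text": f"document {i}"}

    def tokenize(batch):
        # Placeholder for the real prompt tokenizer.
        return {"input_ids": [[len(t)] for t in batch["text"]]}

    ds = IterableDataset.from_generator(samples_from_stream)
    ds = ds.map(tokenize, batched=True, batch_size=100, remove_columns=["text"])

    # Nothing has been downloaded or tokenized yet; it happens per batch as
    # the trainer iterates, instead of as one big upfront pass.
    for sample in ds.take(5):
        print(sample["input_ids"])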

@NanoCode012 (Collaborator) commented Apr 22, 2024:

> ...but the issue is that tokenization should happen as the data is downloaded. Right now I'm almost certain it does everything in a batch: it downloads everything, then tokenizes everything, then proceeds to do the fine-tuning.

@fmv1992, this is correct. I only got to review your code in detail earlier; the section I pointed you to before was incorrect.

def load_tokenized_prepared_datasets(

This function runs over the whole dataset, merges it, and performs tokenization at this point here:

LOG.info("merging datasets")
dataset = concatenate_datasets(datasets)

The only part that "skips" tokenization before fine-tuning is the pretraining section that you attempted to modify before.

path = cfg.pretraining_dataset
split = "train"
name = None
if isinstance(cfg.pretraining_dataset, list) and isinstance(
    cfg.pretraining_dataset[0], dict
):
    path = cfg.pretraining_dataset[0]["path"]
    name = cfg.pretraining_dataset[0]["name"]
    if "split" in cfg.pretraining_dataset[0]:
        split = cfg.pretraining_dataset[0]["split"]

ds_wrapper_partial = functools.partial(
    get_dataset_wrapper,
    cfg.pretraining_dataset[0],
    tokenizer,
    cfg,
    cfg.pretraining_dataset[0]["type"] or "pretrain",
)

train_dataset = wrap_pretraining_dataset(
    load_dataset(path, streaming=True, split=split, name=name),
    tokenizer,
    cfg,
    ds_wrapper_partial,
    max_tokens=cfg.sequence_len,
    batch_size=cfg.micro_batch_size,
    seed=cfg.seed or 42,
    buffer_size=cfg.pretrain_multipack_buffer_size or 10_000,
)
# https://discuss.huggingface.co/t/how-to-use-huggingface-trainer-streaming-datasets-without-wrapping-it-with-torchdatas-iterablewrapper/25230
train_dataset = train_dataset.with_format("torch")
eval_dataset = None
return train_dataset, eval_dataset, cfg.max_steps, prompters

I have two ideas as of now:

  1. Discuss a better way to handle data preprocessing between the current pretraining_dataset and dataset formats before continuing further, as the code is currently messy.
  2. Hack around and support streaming for pretraining datasets first, and figure out SFT later. This is also because your code expects the data in completion format, i.e. ({ "text": ... }), which is not the case for SFT datasets (see the sketch after the snippet below).
    # Define dataset features according to the axolotl structure.
    features = Features({"text": Value("string")})

I would also appreciate @winglian's comments on this.


Side note: what should this batch_size be set to? Is it hardcoded to 4 on purpose?

local=None, remote=config_dataset.path, shuffle=True, batch_size=4
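For context, a sketch of how I understand that parameter is meant to be wired up; per the mosaicml-streaming docs it should match the per-device DataLoader batch size rather than a fixed value (the paths are placeholders, and in axolotl the value would presumably come from cfg.micro_batch_size):

    from streaming import StreamingDataset
    from torch.utils.data import DataLoader

    per_device_batch_size = 4  # e.g. cfg.micro_batch_size, not a hardcoded 4

    dataset = StreamingDataset(
        local="/tmp/mds-cache",
        remote="s3://my-bucket/mds",
        shuffle=True,
        # Used to partition samples across ranks/workers; should agree with
        # the DataLoader batch size instead of being hardcoded.
        batch_size=per_device_batch_size,
    )
    loader = DataLoader(dataset, batch_size=per_device_batch_size)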
