
Releases: mosaicml/streaming

v0.7.5

09 Apr 00:35
3ba9301

🚀 Streaming v0.7.5

Streaming v0.7.5 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.5

💎 New Features

1. Tensor/Sequence Parallelism Support

Using the replication argument, easily share data samples across multiple ranks, enabling sequence or tensor parallelism.

  • Replicating samples across devices (SP / TP enablement) by @knighton in #597
  • Expanded replication testing + documentation by @snarayan21 in #607
  • Make streaming use the correct number of unique samples with SP/TP by @snarayan21 in #619
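As a rough illustration of what replication does (a conceptual sketch assuming consecutive-rank grouping, not the library's internal code): groups of `replication` consecutive ranks receive the same samples, so the number of unique sample streams is the world size divided by the replication factor.

```python
# Sketch: which "sample group" each global rank falls into when
# `replication` consecutive ranks share the same data. This is an
# assumption about grouping for illustration, not Streaming's actual code.
def sample_group(rank: int, replication: int) -> int:
    """Ranks [0..replication-1] form group 0, the next block group 1, etc."""
    return rank // replication

# With 8 ranks and replication=2 (e.g. 2-way tensor parallelism),
# there are 4 unique sample streams.
groups = [sample_group(r, 2) for r in range(8)]
print(groups)  # [0, 0, 1, 1, 2, 2, 3, 3]
```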

2. Overhauled Streaming Documentation

New and improved Streaming documentation is now available; please submit issues with any feedback.

3. batch_size is now required for StreamingDataset

Because we have seen multiple errors and performance degradations when users do not set the batch_size argument of StreamingDataset, it is now required in order to iterate over the dataset.
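For context, batch_size is the per-device batch size, which Streaming uses to partition samples into complete per-device batches. A toy illustration of the arithmetic involved (illustrative only, not library code):

```python
# Illustrative arithmetic only: the batch_size passed to StreamingDataset
# is the per-device batch size, which Streaming needs in order to
# partition samples into complete per-device batches.
def global_batch_size(per_device_batch: int, world_size: int) -> int:
    return per_device_batch * world_size

def batches_per_epoch(num_samples: int, per_device_batch: int, world_size: int) -> int:
    # Drop the final incomplete global batch, as drop_last would.
    return num_samples // global_batch_size(per_device_batch, world_size)

print(global_batch_size(8, 16))          # 128
print(batches_per_epoch(10_000, 8, 16))  # 78
```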

4. Support for Python 3.11, deprecate Python 3.8

  • Add support for Python 3.11 and deprecate Python 3.8 by @karan6181 in #586

🐛 Bug Fixes

  • [easy typo fix] fix f-string by @bigning in #596
  • Change comparison in partitions to include equals by @JAEarly in #587
  • Use type int when initializing SharedMemory size by @bchiang2 in #604
  • COCO Dataset fix -- avoids allow_unsafe_types=True by @snarayan21 in #647


Full Changelog: v0.7.4...v0.7.5

v0.7.4

08 Feb 22:00
a0443bb

🚀 Streaming v0.7.4

Streaming v0.7.4 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.4

🐛 Bug Fixes

  • Download to temporary path from azure by @philipnrmn in #566
  • fix(merge_index): scheme was not well formatted by @fwertel in #576
  • Update misplaced params of _format_remote_index_files by @lsongx in #584
  • Modifications to resumption shared memory allowing load_state_dict multiple times. by @snarayan21 in #593


Full Changelog: v0.7.3...v0.7.4

v0.7.3

12 Jan 18:12
47efc9d

🚀 Streaming v0.7.3

Streaming v0.7.3 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.3

🐛 Bug Fixes

  • Logging messages for new defaults only show once per rank. (#543)
  • Fixed padding calculation for repeat samples in the partition. (#544)

🔧 Other improvements

  • Update copyright license year from 2023 -> 2022-2024. (#560)


Full Changelog: v0.7.2...v0.7.3

v0.7.2

14 Dec 17:26
fac84b4

🚀 Streaming v0.7.2

Streaming v0.7.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.2

💎 New Features

1. Canned ACL Support (#512)

Add support for Canned ACLs using the environment variable S3_CANNED_ACL for AWS S3. Check out the Canned ACL documentation for how to use it.
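For example, setting the environment variable before training (bucket-owner-full-control is one standard AWS canned ACL value; substitute whichever your bucket policy requires):

```python
import os

# Set the canned ACL that Streaming's S3 client should apply to requests.
# 'bucket-owner-full-control' is one standard AWS canned ACL; use whichever
# your bucket policy requires. Typically you would export this in your
# shell before launching training.
os.environ['S3_CANNED_ACL'] = 'bucket-owner-full-control'
print(os.environ['S3_CANNED_ACL'])
```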

2. Allow/reject datasets containing unsafe types (#519)

The pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag, allow_unsafe_types, to the StreamingDataset class to allow or reject datasets containing pickle-encoded fields.
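The kind of guard this flag controls can be sketched as follows (a simplified stand-in for illustration, not Streaming's actual check; 'pkl' is the MDS pickle encoding name):

```python
# Simplified sketch of an unsafe-type guard: scan a dataset's column
# encodings and refuse pickle-encoded columns unless explicitly allowed.
# This mirrors the idea behind allow_unsafe_types, not the real code.
UNSAFE_ENCODINGS = {'pkl'}  # pickle can execute arbitrary code on load

def check_encodings(column_encodings, allow_unsafe_types=False):
    unsafe = [e for e in column_encodings if e in UNSAFE_ENCODINGS]
    if unsafe and not allow_unsafe_types:
        raise ValueError(f'Dataset contains unsafe encodings {unsafe}; '
                         f'pass allow_unsafe_types=True to load it anyway.')
    return True

check_encodings(['int', 'str'])                    # safe encodings pass
check_encodings(['pkl'], allow_unsafe_types=True)  # explicitly allowed
```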

🐛 Bug Fixes

  • Retrieve batch size correctly from vision yamls for the streaming simulator (#501)
  • Fix for CVE-2023-47248 (#504)
  • Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (#514)
    • Proportion of None instead of a string 'None' is now handled correctly.
    • Repeat of None instead of a string 'None' is now handled correctly.
    • Added warning for StreamingDataset subclass defaults
  • Fix sample partitioning algorithm bug for tiny datasets (#517)

🔧 Improvements

  • Added warning messages for new streaming dataset defaults to inform users about the old and new values. (#502)


Full Changelog: v0.7.1...v0.7.2

v0.7.1

06 Nov 23:03
4c33ad3

🚀 Streaming v0.7.1

Streaming v0.7.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.1

🐛 Bug Fixes

  • Simulation from command line with simulator is fixed (#499)

What's Changed

  • Fixed the simulator command by including the simulation directories in the package, by @snarayan21 in #499

Full Changelog: v0.7.0...v0.7.1

v0.7.0

06 Nov 01:23
4e8c944

🚀 Streaming v0.7.0

Streaming v0.7.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.0

📈 Better Defaults for StreamingDataset (#479)

  • The default values for StreamingDataset have been updated to be more performant and applicable to most use cases, detailed below:

| Parameter | Old Value | New Value | Benefit |
| --- | --- | --- | --- |
| shuffle_algo | py1s | py1e | Better shuffle and balanced downloading |
| num_canonical_nodes | 64 * physical nodes | 64 * physical_nodes if py1s or py2s, otherwise physical_nodes | Consistently good shuffle for all shuffle algos |
| shuffle_block_size | 262,144 | 4,000,000 / num_canonical_nodes | Consistently good shuffle for all num_canonical_nodes values |
| predownload | max(batch_size, 256 * batch_size // num_canonical_nodes) | 8 * batch_size | Better balanced downloading |
| partition_algo | orig | relaxed | More flexible deterministic resumptions on nodes |
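The derived defaults above can be computed directly (illustrative arithmetic taken from the table, with integer division assumed):

```python
# Illustrative computation of the new derived defaults from the table.
def default_shuffle_block_size(num_canonical_nodes: int) -> int:
    return 4_000_000 // num_canonical_nodes

def default_predownload(batch_size: int) -> int:
    return 8 * batch_size

def old_predownload(batch_size: int, num_canonical_nodes: int) -> int:
    return max(batch_size, 256 * batch_size // num_canonical_nodes)

print(default_shuffle_block_size(64))  # 62500
print(default_predownload(32))         # 256
print(old_predownload(32, 64))         # 128
```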

💎 New Features

🤖 Streaming Simulator: Easily simulate the performance of training configurations. (#385)

  • After installing this version of streaming, simply run the command simulator in your terminal to open the simulation interface.
  • Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
  • Easily de-risk runs and find performant parameter settings.
  • Check out the docs for more information!

🔢 More flexible deterministic training and resumption (#476)

  • Deterministic training and resumption are now possible on a wider range of node counts!
  • Previously, the num_canonical_nodes parameter had to divide or be a multiple of the number of physical nodes for determinism.
  • Now, deterministic training is possible on any number of nodes, as long as that number evenly divides your run's global batch size.
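The stated condition can be checked with trivial arithmetic (a sketch of the rule as described above, not library code):

```python
# Sketch of the stated determinism condition: the number of nodes must
# evenly divide the run's global batch size.
def deterministic_on(num_nodes: int, global_batch_size: int) -> bool:
    return global_batch_size % num_nodes == 0

# A global batch of 96 supports 1, 2, 3, 4, 6, or 8 nodes, but not 5 or 7.
print([n for n in range(1, 9) if deterministic_on(n, 96)])  # [1, 2, 3, 4, 6, 8]
```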

🐛 Bug Fixes

  • Check for invalid hash algorithm names (#486)


Full Changelog: v0.6.1...v0.7.0

v0.6.1

18 Oct 21:28
8827d7a

🚀 Streaming v0.6.1

Streaming v0.6.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.6.1

💎 New Features

🚃 Merge meta-data information from sub-directories dataset to form one unified dataset. (#449)

  • Addition of the merge_index() utility method to merge the index files of an MDS dataset's subdirectories. The subdirectories can be local paths or any supported cloud provider URL path.
  • Check out the dataset conversion and Spark DataFrame to MDS Jupyter notebook for an example in action.
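Conceptually, merging index files concatenates each subdirectory's shard entries into one unified index, re-rooting shard paths under their subdirectory. A simplified pure-Python sketch of the idea (the real merge_index additionally handles remote URLs, downloads, and the full MDS index schema):

```python
# Simplified sketch of merging per-subdirectory index.json files into one
# unified index. Not Streaming's actual merge_index implementation; the
# index dicts below carry only a minimal 'shards' structure.
def merge_indexes(subdir_indexes: dict) -> dict:
    """subdir_indexes maps subdir name -> parsed index.json dict."""
    merged = {'version': 2, 'shards': []}
    for subdir, index in sorted(subdir_indexes.items()):
        for shard in index['shards']:
            shard = {**shard, 'raw_data': dict(shard['raw_data'])}
            # Re-root each shard's relative path under its subdirectory.
            shard['raw_data']['basename'] = f"{subdir}/{shard['raw_data']['basename']}"
            merged['shards'].append(shard)
    return merged

a = {'version': 2, 'shards': [{'raw_data': {'basename': 'shard.00000.mds'}}]}
b = {'version': 2, 'shards': [{'raw_data': {'basename': 'shard.00000.mds'}}]}
merged = merge_indexes({'part-a': a, 'part-b': b})
print([s['raw_data']['basename'] for s in merged['shards']])
# ['part-a/shard.00000.mds', 'part-b/shard.00000.mds']
```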

🔁 Retry uploading a file to a cloud provider path. (#448)

  • Added upload retry logic with backoff and jitter during dataset conversion as part of parameter retry in Writer.
from streaming import MDSWriter

with MDSWriter(..., retry=3) as out:
    for sample in dataset:
        out.write(sample)

🐛 Bug Fixes

  • Validate Writer arguments and raise a ValueError if any argument is invalid. (#434)
  • Terminate the main process if one of the upload threads receives an Exception during dataset conversion. (#448)

🔧 Improvements

  • More balanced inter-node downloading for the py1e shuffling algorithm by varying shard sample ranges, helping to reduce throughput drops at scale. (#442)


Full Changelog: v0.6.0...v0.6.1

v0.6.0

13 Sep 20:11
65ac4ca

🚀 Streaming v0.6.0

Streaming v0.6.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.6.0

New Features

🆕  Databricks File System and Databricks Unity Catalog (#362)

Support for reading and writing data from and to the Databricks File System (DBFS) and Unity Catalog (UC) Volumes. This means that you can now use DBFS and UC Volumes as a source or sink for your streaming data pipelines or model training. Below is the path structure:

Databricks File System (DBFS)

The DBFS path structure is a hierarchical namespace organized into directories and files. A DBFS path must start with dbfs:/.

UC Volumes

The path structure for UC Volumes is similar to the path structure for DBFS, but with a few key differences.

The root of the UC Volumes namespace is dbfs:/Volumes/<catalog>/<schema>/<volume>, where:

  • <catalog> is the name of the catalog where the volume is created.
  • <schema> is the name of the schema where the volume is created.
  • <volume> is the name of the volume.

Hence, use a dbfs:/Volumes prefix to specify a UC Volumes path.
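A small sketch of validating that a path follows the UC Volumes structure described above (illustrative only; the catalog/schema/volume names in the example are hypothetical):

```python
# Illustrative validator for the UC Volumes path structure:
# dbfs:/Volumes/<catalog>/<schema>/<volume>/...
def is_uc_volumes_path(path: str) -> bool:
    prefix = 'dbfs:/Volumes/'
    if not path.startswith(prefix):
        return False
    # Require at least catalog, schema, and volume components.
    parts = path[len(prefix):].split('/')
    return len(parts) >= 3 and all(parts[:3])

print(is_uc_volumes_path('dbfs:/Volumes/main/my_schema/my_volume/data'))  # True
print(is_uc_volumes_path('dbfs:/tmp/data'))                               # False
```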

💽 Spark DataFrame to MDS converter (#363)

Introducing the new DataFrameToMDS API, which lets users leverage Spark's capabilities for handling diverse datasets in various formats. This API converts Spark DataFrames into MDS datasets, with the flexibility to write output to both local and cloud storage. Index files are optionally merged. Users can also add data preprocessing steps by defining custom iterator functions and arguments. All of these features are bundled into a single Spark job, ensuring an efficient and streamlined workflow for data transformation. An example notebook is provided to help users get started.

🔀 Randomize and offset shuffle blocks algorithm (#373)

The new py1br shuffle algorithm helps mitigate download spikes that occur when using the py1b algorithm. With py1b, shuffle blocks are all the same size, so when progressing through training, nodes will have to download many shards at the same time. In contrast, with py1br, shuffle blocks are offset from each other and are variably sized. This results in more balanced downloads over time. The py1br algorithm is a replacement for the py1b algorithm, which will be deprecated soon.

from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1br',
    ...
)

🔀 Expanded range shuffle algorithm (#394)

The new py1e shuffle algorithm helps reduce the minimum cache limit needed for training, and results in much smoother downloads than both py1b and py1br. However, its shuffle quality is slightly lower. Rather than shuffling all samples in blocks of size shuffle_block_size, it instead spreads the samples of each shard over a range of maximum size shuffle_block_size, retaining most of the shuffle quality from py1b and py1br while reducing download spikes across the duration of training.

from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1e',
    ...
)

🔥 Per-Stream Batching (#407)

Users are now able to ensure that each batch contains samples from only a single stream. You can now set the new parameter batching_method to per_stream to access this functionality. Per-stream batching will still take into account upsampling and downsampling of streams, set by proportion, repeat, or choose. To make batches contain only samples from a group of streams, merge the streams' index.json files to create a single one for each group.

from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='per_stream',
    ...
)
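The effect can be sketched in plain Python: group samples by their stream, then emit each batch from a single stream (a conceptual stand-in for the real partitioning logic):

```python
from collections import defaultdict

# Conceptual sketch of per-stream batching: every batch is drawn from
# exactly one stream. Not Streaming's actual partitioning code.
def per_stream_batches(samples, batch_size):
    """samples: list of (stream_id, sample) pairs."""
    by_stream = defaultdict(list)
    for stream_id, sample in samples:
        by_stream[stream_id].append(sample)
    batches = []
    for stream_id in sorted(by_stream):
        group = by_stream[stream_id]
        # Emit only complete batches from this stream.
        for i in range(0, len(group) - len(group) % batch_size, batch_size):
            batches.append(group[i:i + batch_size])
    return batches

samples = [('a', 1), ('b', 2), ('a', 3), ('b', 4), ('a', 5)]
print(per_stream_batches(samples, 2))  # [[1, 3], [2, 4]]
```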

🔥 Stratified Batching (#408)

Users are now able to ensure that each batch has a consistent number of samples from every stream. Previously, stream proportions were satisfied in the aggregate but not at the batch level. You can now set the new parameter batching_method to stratified to access this functionality. Stratified batching will still take into account upsampling and downsampling of streams, set by proportion, repeat, or choose.

from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='stratified',
    ...
)
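Similarly sketched: stratified batching fixes the per-stream sample count in every batch (a conceptual stand-in assuming the per-stream counts sum to the batch size; not the library's partitioning code):

```python
# Conceptual sketch of stratified batching: each batch contains a fixed
# number of samples from every stream (counts sum to the batch size).
def stratified_batches(streams, counts):
    """streams: dict stream_id -> list of samples.
    counts: dict stream_id -> samples of that stream per batch."""
    num_batches = min(len(streams[s]) // counts[s] for s in counts)
    batches = []
    for b in range(num_batches):
        batch = []
        for s in sorted(counts):
            batch.extend(streams[s][b * counts[s]:(b + 1) * counts[s]])
        batches.append(batch)
    return batches

streams = {'a': [1, 2, 3, 4], 'b': [10, 20]}
print(stratified_batches(streams, {'a': 2, 'b': 1}))
# [[1, 2, 10], [3, 4, 20]]
```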

💪 Download-Efficient Sparse Sampling (#391)

Previous versions of StreamingDataset implement downsampling/upsampling by giving each sample an equal probability of being selected (plus or minus one when sampling is fractional), without regard to which shard a sample is on. This means that no matter how aggressively you downsample, StreamingDataset will still use each shard at as equal a rate as possible, which is problematic for download performance.

In this version of Streaming, we have added a new optional StreamingDataset argument sampling_granularity which can be used to configure how sampling is done. It is an integer, defaulting to 1, that determines how many samples are to be drawn at a time from a single random shard until we have enough samples.

Note that the default setting of 1 is equivalent to the old non-shard-aware behavior. Setting it high, e.g. to the number of samples in a full shard or more, means all the samples in a randomly chosen (without replacement) shard are drawn until enough samples have been collected. This is much more download-efficient, but results in the samples of each shard always being seen close together in training, which may have implications for convergence depending on your workload. Setting sampling_granularity to half a shard means, roughly speaking, you'll see half the samples of a shard at a time during training.

from streaming import StreamingDataset

dataset = StreamingDataset(
    sampling_granularity=1,
    ...
)
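The shard-aware draw can be sketched as follows (a conceptual stand-in with a fixed seed, not the library's sampler):

```python
import random

# Conceptual sketch of granularity-based sampling: repeatedly pick a
# random shard and take `granularity` samples from it until `n` samples
# are drawn. granularity=1 approximates shard-oblivious sampling; a large
# granularity drains whole shards at a time, which downloads fewer shards.
def draw(shards, n, granularity, seed=0):
    rng = random.Random(seed)
    remaining = {i: list(s) for i, s in enumerate(shards)}
    out = []
    while len(out) < n and remaining:
        shard_id = rng.choice(sorted(remaining))
        take = min(granularity, len(remaining[shard_id]), n - len(out))
        out.extend(remaining[shard_id][:take])
        del remaining[shard_id][:take]
        if not remaining[shard_id]:
            del remaining[shard_id]
    return out

shards = [list(range(10)), list(range(10, 20))]
print(len(draw(shards, 12, granularity=10)))  # 12
```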

📑 Reusable local directory (#406)

Users can now instantiate more than one StreamingDataset with the same local directory and remote=None. This is useful when high-speed storage is mounted on a node and multiple users want to read the dataset directly from the mounted storage on the same node, without copying the data to local disk.

from streaming import StreamingDataset

local = '<local disk directory or a mount point directory>'
dataset_0 = StreamingDataset(local=local, remote=None)
dataset_1 = StreamingDataset(local=local, remote=None)

🐛 Bug Fixes

  • Terminate the worker threads when process terminates to avoid deadlock. (#425)
  • Raise an exception if cache_limit is lower than the size of a single shard file to avoid deadlock. (#420)
  • Fixed the predownload-of-zero issue; users can now pass predownload=0 to StreamingDataset. (#383)

🔧 Improvements

  • Add Google Application Default Credentials (#376).
    • The order of authentication has changed, and App Engine and Compute Engine authentication channels are now used when available. The order of authentication is as follows:
      1. HMAC
      2. Google service account
      3. App Engine
      4. Compute Engine
      5. Raise an error
  • Check if index.json exists locally before downloading to avoid duplicate downloads (#372).
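The fallback order above can be sketched as a simple chain (illustrative only; the availability checks here are hypothetical placeholders, not Streaming's credential probing):

```python
# Illustrative fallback chain for GCS authentication. Each entry pairs a
# channel name with a hypothetical availability check; the first channel
# whose check passes wins, and exhausting the chain raises an error.
def pick_auth_channel(checks):
    """checks: ordered list of (name, callable returning bool)."""
    for name, available in checks:
        if available():
            return name
    raise RuntimeError('No GCS authentication channel available.')

chain = [
    ('hmac', lambda: False),
    ('service_account', lambda: True),  # pretend a service account is configured
    ('app_engine', lambda: False),
    ('compute_engine', lambda: False),
]
print(pick_auth_channel(chain))  # service_account
```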


v0.5.2

19 Jun 05:58
a301cd0

🚀 Streaming v0.5.2

Streaming v0.5.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.5.2

New features

  • Allow authentication with GCS for service accounts #315
  • Human-readable suffixes for size_limit and epoch_size #333
  • Static sampling #348

Documentation changes

  • Updated the contribution guide and improved unit test logic #343
  • Static sampling #348

Testing

  • Add a regression test for StreamingDataset instantiation and iteration #318
  • Fixed accidental shard delete test #341
  • Add a regression test for StreamingDataset using cloud providers #319
  • Add iteration time test as part of regression testing #358

Bug fix

  • Fix init local dir zip-only shard handling #330
  • Added default behavior when no streams and no epoch_size are specified #348


Full Changelog: v0.5.1...v0.5.2

v0.5.1

08 Aug 18:59
ac53002


Full Changelog: v0.5.0...v0.5.1