
Releases: mosaicml/streaming

v0.7.5

09 Apr 00:35
3ba9301

🚀 Streaming v0.7.5

Streaming v0.7.5 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.5

💎 New Features

1. Tensor/Sequence Parallelism Support

Using the replication argument, easily share data samples across multiple ranks, enabling sequence or tensor parallelism.

  • Replicating samples across devices (SP / TP enablement) by @knighton in #597
  • Expanded replication testing + documentation by @snarayan21 in #607
  • Make streaming use the correct number of unique samples with SP/TP by @snarayan21 in #619
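As a rough illustration of what replication does (a conceptual sketch assuming consecutive-rank grouping, not the library's internal code): groups of `replication` consecutive ranks receive the same samples, so the number of unique sample streams is the world size divided by the replication factor.

```python
# Sketch: which "sample group" each global rank falls into when
# `replication` consecutive ranks share the same data. This is an
# assumption about grouping for illustration, not Streaming's actual code.
def sample_group(rank: int, replication: int) -> int:
    """Ranks [0..replication-1] form group 0, the next block group 1, etc."""
    return rank // replication

# With 8 ranks and replication=2 (e.g. 2-way tensor parallelism),
# there are 4 unique sample streams.
groups = [sample_group(r, 2) for r in range(8)]
print(groups)  # [0, 0, 1, 1, 2, 2, 3, 3]
```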

2. Overhauled Streaming Documentation

New and improved Streaming documentation is now available; please submit issues with any feedback.

3. batch_size is now required for StreamingDataset

Because we have seen multiple errors and performance degradations when users do not set the batch_size argument of StreamingDataset, it is now required in order to iterate over the dataset.
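For context, batch_size is the per-device batch size, which Streaming uses to partition samples into complete per-device batches. A toy illustration of the arithmetic involved (illustrative only, not library code):

```python
# Illustrative arithmetic only: the batch_size passed to StreamingDataset
# is the per-device batch size, which Streaming needs in order to
# partition samples into complete per-device batches.
def global_batch_size(per_device_batch: int, world_size: int) -> int:
    return per_device_batch * world_size

def batches_per_epoch(num_samples: int, per_device_batch: int, world_size: int) -> int:
    # Drop the final incomplete global batch, as drop_last would.
    return num_samples // global_batch_size(per_device_batch, world_size)

print(global_batch_size(8, 16))          # 128
print(batches_per_epoch(10_000, 8, 16))  # 78
```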

4. Support for Python 3.11, deprecate Python 3.8

  • Add support for Python 3.11 and deprecate Python 3.8 by @karan6181 in #586

🐛 Bug Fixes

  • [easy typo fix] fix f-string by @bigning in #596
  • Change comparison in partitions to include equals by @JAEarly in #587
  • Use type int when initializing SharedMemory size by @bchiang2 in #604
  • COCO Dataset fix -- avoids allow_unsafe_types=True by @snarayan21 in #647


Full Changelog: v0.7.4...v0.7.5

v0.7.4

08 Feb 22:00
a0443bb

🚀 Streaming v0.7.4

Streaming v0.7.4 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.4

🐛 Bug Fixes

  • Download to temporary path from azure by @philipnrmn in #566
  • fix(merge_index): scheme was not well formatted by @fwertel in #576
  • Update misplaced params of _format_remote_index_files by @lsongx in #584
  • Modifications to resumption shared memory allowing load_state_dict multiple times. by @snarayan21 in #593


Full Changelog: v0.7.3...v0.7.4

v0.7.3

12 Jan 18:12
47efc9d

🚀 Streaming v0.7.3

Streaming v0.7.3 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.3

🐛 Bug Fixes

  • Logging messages for new defaults only show once per rank. (#543)
  • Fixed padding calculation for repeat samples in the partition. (#544)

🔧 Other improvements

  • Update copyright license year from 2023 -> 2022-2024. (#560)


Full Changelog: v0.7.2...v0.7.3

v0.7.2

14 Dec 17:26
fac84b4

🚀 Streaming v0.7.2

Streaming v0.7.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.2

💎 New Features

1. Canned ACL Support (#512)

Add support for Canned ACLs using the environment variable S3_CANNED_ACL for AWS S3. Check out the Canned ACL documentation for how to use it.
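For example, setting the environment variable before training (bucket-owner-full-control is one standard AWS canned ACL value; substitute whichever your bucket policy requires):

```python
import os

# Set the canned ACL that Streaming's S3 client should apply to requests.
# 'bucket-owner-full-control' is one standard AWS canned ACL; use whichever
# your bucket policy requires. Typically you would export this in your
# shell before launching training.
os.environ['S3_CANNED_ACL'] = 'bucket-owner-full-control'
print(os.environ['S3_CANNED_ACL'])
```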

2. Allow/reject datasets containing unsafe types (#519)

The pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag, allow_unsafe_types, to the StreamingDataset class to allow or reject datasets containing pickle-encoded fields.
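The kind of guard this flag controls can be sketched as follows (a simplified stand-in for illustration, not Streaming's actual check; 'pkl' is the MDS pickle encoding name):

```python
# Simplified sketch of an unsafe-type guard: scan a dataset's column
# encodings and refuse pickle-encoded columns unless explicitly allowed.
# This mirrors the idea behind allow_unsafe_types, not the real code.
UNSAFE_ENCODINGS = {'pkl'}  # pickle can execute arbitrary code on load

def check_encodings(column_encodings, allow_unsafe_types=False):
    unsafe = [e for e in column_encodings if e in UNSAFE_ENCODINGS]
    if unsafe and not allow_unsafe_types:
        raise ValueError(f'Dataset contains unsafe encodings {unsafe}; '
                         f'pass allow_unsafe_types=True to load it anyway.')
    return True

check_encodings(['int', 'str'])                    # safe encodings pass
check_encodings(['pkl'], allow_unsafe_types=True)  # explicitly allowed
```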

🐛 Bug Fixes

  • Retrieve batch size correctly from vision yamls for the streaming simulator (#501)
  • Fix for CVE-2023-47248 (#504)
  • Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (#514)
    • Proportion of None instead of a string 'None' is now handled correctly.
    • Repeat of None instead of a string 'None' is now handled correctly.
    • Added warning for StreamingDataset subclass defaults
  • Fix sample partitioning algorithm bug for tiny datasets (#517)

🔧 Improvements

  • Added warning messages for new streaming dataset defaults to inform users about the old and new values. (#502)


Full Changelog: v0.7.1...v0.7.2

v0.7.1

06 Nov 23:03
4c33ad3

🚀 Streaming v0.7.1

Streaming v0.7.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.1

🐛 Bug Fixes

  • Simulation from command line with simulator is fixed (#499)

What's Changed

  • Fixed the simulator command by including the simulation directories in the package, by @snarayan21 in #499

Full Changelog: v0.7.0...v0.7.1

v0.7.0

06 Nov 01:23
4e8c944

🚀 Streaming v0.7.0

Streaming v0.7.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.0

📈 Better Defaults for StreamingDataset (#479)

  • The default values for StreamingDataset have been updated to be more performant and applicable to most use cases, detailed below:

| Parameter | Old Value | New Value | Benefit |
| --- | --- | --- | --- |
| shuffle_algo | py1s | py1e | Better shuffle and balanced downloading |
| num_canonical_nodes | 64 * physical nodes | 64 * physical_nodes if py1s or py2s, otherwise physical_nodes | Consistently good shuffle for all shuffle algos |
| shuffle_block_size | 262,144 | 4,000,000 / num_canonical_nodes | Consistently good shuffle for all num_canonical_nodes values |
| predownload | max(batch_size, 256 * batch_size // num_canonical_nodes) | 8 * batch_size | Better balanced downloading |
| partition_algo | orig | relaxed | More flexible deterministic resumptions on nodes |
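The derived defaults above can be computed directly (illustrative arithmetic taken from the table, with integer division assumed):

```python
# Illustrative computation of the new derived defaults from the table.
def default_shuffle_block_size(num_canonical_nodes: int) -> int:
    return 4_000_000 // num_canonical_nodes

def default_predownload(batch_size: int) -> int:
    return 8 * batch_size

def old_predownload(batch_size: int, num_canonical_nodes: int) -> int:
    return max(batch_size, 256 * batch_size // num_canonical_nodes)

print(default_shuffle_block_size(64))  # 62500
print(default_predownload(32))         # 256
print(old_predownload(32, 64))         # 128
```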

💎 New Features

🤖 Streaming Simulator: Easily simulate the performance of training configurations. (#385)

  • After installing this version of streaming, simply run the command simulator in your terminal to open the simulation interface.
  • Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
  • Easily de-risk runs and find performant parameter settings.
  • Check out the docs for more information!

🔢 More flexible deterministic training and resumption (#476)

  • Deterministic training and resumption are now possible on a wider range of node counts!
  • Previously, the num_canonical_nodes parameter had to divide or be a multiple of the number of physical nodes for determinism.
  • Now, deterministic training is possible on any number of nodes, as long as that number evenly divides your run's global batch size.
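The stated condition can be checked with trivial arithmetic (a sketch of the rule as described above, not library code):

```python
# Sketch of the stated determinism condition: the number of nodes must
# evenly divide the run's global batch size.
def deterministic_on(num_nodes: int, global_batch_size: int) -> bool:
    return global_batch_size % num_nodes == 0

# A global batch of 96 supports 1, 2, 3, 4, 6, or 8 nodes, but not 5 or 7.
print([n for n in range(1, 9) if deterministic_on(n, 96)])  # [1, 2, 3, 4, 6, 8]
```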

🐛 Bug Fixes

  • Check for invalid hash algorithm names (#486)


Full Changelog: v0.6.1...v0.7.0

v0.6.1

18 Oct 21:28
8827d7a

🚀 Streaming v0.6.1

Streaming v0.6.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.6.1

💎 New Features

🚃 Merge meta-data information from sub-directories dataset to form one unified dataset. (#449)

  • Addition of the merge_index() utility method to merge the index files of an MDS dataset's subdirectories. The subdirectories can be local paths or any supported cloud provider URL path.
  • Check out the dataset conversion and Spark DataFrame to MDS Jupyter notebook for an example in action.
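Conceptually, merging index files concatenates each subdirectory's shard entries into one unified index, re-rooting shard paths under their subdirectory. A simplified pure-Python sketch of the idea (the real merge_index additionally handles remote URLs, downloads, and the full MDS index schema):

```python
# Simplified sketch of merging per-subdirectory index.json files into one
# unified index. Not Streaming's actual merge_index implementation; the
# index dicts below carry only a minimal 'shards' structure.
def merge_indexes(subdir_indexes: dict) -> dict:
    """subdir_indexes maps subdir name -> parsed index.json dict."""
    merged = {'version': 2, 'shards': []}
    for subdir, index in sorted(subdir_indexes.items()):
        for shard in index['shards']:
            shard = {**shard, 'raw_data': dict(shard['raw_data'])}
            # Re-root each shard's relative path under its subdirectory.
            shard['raw_data']['basename'] = f"{subdir}/{shard['raw_data']['basename']}"
            merged['shards'].append(shard)
    return merged

a = {'version': 2, 'shards': [{'raw_data': {'basename': 'shard.00000.mds'}}]}
b = {'version': 2, 'shards': [{'raw_data': {'basename': 'shard.00000.mds'}}]}
merged = merge_indexes({'part-a': a, 'part-b': b})
print([s['raw_data']['basename'] for s in merged['shards']])
# ['part-a/shard.00000.mds', 'part-b/shard.00000.mds']
```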

🔁 Retry uploading a file to a cloud provider path. (#448)

  • Added upload retry logic with backoff and jitter during dataset conversion as part of parameter retry in Writer.
from streaming import MDSWriter

with MDSWriter(..., retry=3) as out:
    for sample in dataset:
        out.write(sample)

🐛 Bug Fixes

  • Validate Writer arguments and raise a ValueError if any argument is invalid. (#434)
  • Terminate the main process if one of the upload threads receives an Exception during dataset conversion. (#448)

🔧 Improvements

  • More balanced inter-node downloading for the py1e shuffling algorithm by varying shard sample ranges, helping to reduce throughput drops at scale. (#442)


Full Changelog: v0.6.0...v0.6.1

v0.6.0

13 Sep 20:11
65ac4ca

🚀 Streaming v0.6.0

Streaming v0.6.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.6.0

New Features

🆕  Databricks File System and Databricks Unity Catalog (#362)

Support for reading and writing data from and to the Databricks File System (DBFS) and Unity Catalog (UC) Volumes. This means that you can now use DBFS and UC Volumes as a source or sink for your streaming data pipelines or model training. Below is the path structure:

Databricks File System (DBFS)

The DBFS path structure is a hierarchical namespace organized into directories and files. A DBFS path must start with dbfs:/.

UC Volumes

The path structure for UC Volumes is similar to the path structure for DBFS, but with a few key differences.

The root of the UC Volumes namespace is dbfs:/Volumes/<catalog>/<schema>/<volume>, where:

  • <catalog> is the name of the catalog where the volume is created.
  • <schema> is the name of the schema where the volume is created.
  • <volume> is the name of the volume.

Hence, use a dbfs:/Volumes prefix to specify a UC Volumes path.
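A small sketch of validating that a path follows the UC Volumes structure described above (illustrative only; the catalog/schema/volume names in the example are hypothetical):

```python
# Illustrative validator for the UC Volumes path structure:
# dbfs:/Volumes/<catalog>/<schema>/<volume>/...
def is_uc_volumes_path(path: str) -> bool:
    prefix = 'dbfs:/Volumes/'
    if not path.startswith(prefix):
        return False
    # Require at least catalog, schema, and volume components.
    parts = path[len(prefix):].split('/')
    return len(parts) >= 3 and all(parts[:3])

print(is_uc_volumes_path('dbfs:/Volumes/main/my_schema/my_volume/data'))  # True
print(is_uc_volumes_path('dbfs:/tmp/data'))                               # False
```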

💽 Spark DataFrame to MDS converter (#363)

Introducing the new DataFrameToMDS API, which lets users leverage Spark's capabilities for handling diverse datasets in various formats. This API converts Spark DataFrames into MDS datasets, with the flexibility to write output to both local and cloud storage. Index files are optionally merged. Users can also add data preprocessing steps by defining custom iterator functions and arguments. All of these features are bundled into a single Spark job, ensuring an efficient and streamlined workflow for data transformation. An example notebook is provided to help users get started.

🔀 Randomize and offset shuffle blocks algorithm (#373)

The new py1br shuffle algorithm helps mitigate download spikes that occur when using the py1b algorithm. With py1b, shuffle blocks are all the same size, so when progressing through training, nodes will have to download many shards at the same time. In contrast, with py1br, shuffle blocks are offset from each other and are variably sized. This results in more balanced downloads over time. The py1br algorithm is a replacement for the py1b algorithm, which will be deprecated soon.

from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1br',
    ...
)

🔀 Expanded range shuffle algorithm (#394)

The new py1e shuffle algorithm helps reduce the minimum cache limit needed for training, and results in much smoother downloads than both py1b and py1br. However, its shuffle quality is slightly lower. Rather than shuffling all samples in blocks of size shuffle_block_size, it instead spreads the samples of each shard over a range of maximum size shuffle_block_size, retaining most of the shuffle quality from py1b and py1br while reducing download spikes across the duration of training.

from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1e',
    ...
)

🔥 Per-Stream Batching (#407)

Users are now able to ensure that each batch contains samples from only a single stream. You can now set the new parameter batching_method to per_stream to access this functionality. Per-stream batching will still take into account upsampling and downsampling of streams, set by proportion, repeat, or choose. To make batches contain only samples from a group of streams, merge the streams' index.json files to create a single one for each group.

from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='per_stream',
    ...
)
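The effect can be sketched in plain Python: group samples by their stream, then emit each batch from a single stream (a conceptual stand-in for the real partitioning logic):

```python
from collections import defaultdict

# Conceptual sketch of per-stream batching: every batch is drawn from
# exactly one stream. Not Streaming's actual partitioning code.
def per_stream_batches(samples, batch_size):
    """samples: list of (stream_id, sample) pairs."""
    by_stream = defaultdict(list)
    for stream_id, sample in samples:
        by_stream[stream_id].append(sample)
    batches = []
    for stream_id in sorted(by_stream):
        group = by_stream[stream_id]
        # Emit only complete batches from this stream.
        for i in range(0, len(group) - len(group) % batch_size, batch_size):
            batches.append(group[i:i + batch_size])
    return batches

samples = [('a', 1), ('b', 2), ('a', 3), ('b', 4), ('a', 5)]
print(per_stream_batches(samples, 2))  # [[1, 3], [2, 4]]
```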

🔥 Stratified Batching (#408)

Users are now able to ensure that each batch has a consistent number of samples from every stream. Previously, stream proportions were satisfied in the aggregate but not at the batch level. You can now set the new parameter batching_method to stratified to access this functionality. Stratified batching will still take into account upsampling and downsampling of streams, set by proportion, repeat, or choose.

from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='stratified',
    ...
)
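Similarly sketched: stratified batching fixes the per-stream sample count in every batch (a conceptual stand-in assuming the per-stream counts sum to the batch size; not the library's partitioning code):

```python
# Conceptual sketch of stratified batching: each batch contains a fixed
# number of samples from every stream (counts sum to the batch size).
def stratified_batches(streams, counts):
    """streams: dict stream_id -> list of samples.
    counts: dict stream_id -> samples of that stream per batch."""
    num_batches = min(len(streams[s]) // counts[s] for s in counts)
    batches = []
    for b in range(num_batches):
        batch = []
        for s in sorted(counts):
            batch.extend(streams[s][b * counts[s]:(b + 1) * counts[s]])
        batches.append(batch)
    return batches

streams = {'a': [1, 2, 3, 4], 'b': [10, 20]}
print(stratified_batches(streams, {'a': 2, 'b': 1}))
# [[1, 2, 10], [3, 4, 20]]
```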

💪 Download-Efficient Sparse Sampling (#391)

Previous versions of StreamingDataset implement downsampling/upsampling by giving each sample an equal probability of being selected (plus or minus one when sampling is fractional), without regard to which shard a sample is on. This means that no matter how aggressively you downsample, StreamingDataset will still use each shard at as equal a rate as possible, which is problematic for download performance.

In this version of Streaming, we have added a new optional StreamingDataset argument sampling_granularity which can be used to configure how sampling is done. It is an integer, defaulting to 1, that determines how many samples are to be drawn at a time from a single random shard until we have enough samples.

Note that the default setting of 1 is equivalent to the old non-shard-aware behavior. Setting it high, e.g. to the number of samples in a full shard or more, means all the samples in a randomly chosen (without replacement) shard are drawn until enough samples have been collected. This is much more download-efficient, but results in the samples of each shard always being seen close together in training, which may have implications for convergence depending on your workload. Setting sampling_granularity to half a shard means, roughly speaking, you'll see half the samples of a shard at a time during training.

from streaming import StreamingDataset

dataset = StreamingDataset(
    sampling_granularity=1,
    ...
)
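The shard-aware draw can be sketched as follows (a conceptual stand-in with a fixed seed, not the library's sampler):

```python
import random

# Conceptual sketch of granularity-based sampling: repeatedly pick a
# random shard and take `granularity` samples from it until `n` samples
# are drawn. granularity=1 approximates shard-oblivious sampling; a large
# granularity drains whole shards at a time, which downloads fewer shards.
def draw(shards, n, granularity, seed=0):
    rng = random.Random(seed)
    remaining = {i: list(s) for i, s in enumerate(shards)}
    out = []
    while len(out) < n and remaining:
        shard_id = rng.choice(sorted(remaining))
        take = min(granularity, len(remaining[shard_id]), n - len(out))
        out.extend(remaining[shard_id][:take])
        del remaining[shard_id][:take]
        if not remaining[shard_id]:
            del remaining[shard_id]
    return out

shards = [list(range(10)), list(range(10, 20))]
print(len(draw(shards, 12, granularity=10)))  # 12
```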

📑 Reusable local directory (#406)

Users can now instantiate more than one StreamingDataset with the same local directory and remote=None. This is useful when high-speed storage is mounted on a node and multiple users want to read the dataset directly from the mounted storage on the same node, without copying the data to local disk.

from streaming import StreamingDataset

local = '<local disk directory or a mount point directory>'
dataset_0 = StreamingDataset(local=local, remote=None)
dataset_1 = StreamingDataset(local=local, remote=None)

🐛 Bug Fixes

  • Terminate the worker threads when process terminates to avoid deadlock. (#425)
  • Raise an exception if cache_limit is lower than the size of a single shard file to avoid deadlock. (#420)
  • Fixed the predownload-of-zero issue; users can now pass predownload=0 to StreamingDataset. (#383)

🔧 Improvements

  • Add Google Application Default Credentials (#376).
    • The order of authentication has changed, and App Engine and Compute Engine authentication channels are now used when available. The order of authentication is as follows:
      1. HMAC
      2. Google service account
      3. App Engine
      4. Compute Engine
      5. Raise an error
  • Check if index.json exists locally before downloading to avoid duplicate downloads (#372).
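The fallback order above can be sketched as a simple chain (illustrative only; the availability checks here are hypothetical placeholders, not Streaming's credential probing):

```python
# Illustrative fallback chain for GCS authentication. Each entry pairs a
# channel name with a hypothetical availability check; the first channel
# whose check passes wins, and exhausting the chain raises an error.
def pick_auth_channel(checks):
    """checks: ordered list of (name, callable returning bool)."""
    for name, available in checks:
        if available():
            return name
    raise RuntimeError('No GCS authentication channel available.')

chain = [
    ('hmac', lambda: False),
    ('service_account', lambda: True),  # pretend a service account is configured
    ('app_engine', lambda: False),
    ('compute_engine', lambda: False),
]
print(pick_auth_channel(chain))  # service_account
```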


v0.5.2

19 Jun 05:58
a301cd0

🚀 Streaming v0.5.2

Streaming v0.5.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.5.2

New features

  • Allow authentication with GCS for service accounts #315
  • Human-readable suffixes for size_limit and epoch_size #333
  • Static sampling #348

Documentation changes

  • Updated the contribution guide and improved unit test logic #343
  • Static sampling #348

Testing

  • Add a regression test for StreamingDataset instantiation and iteration #318
  • Fixed accidental shard delete test #341
  • Add a regression test for StreamingDataset using cloud providers #319
  • Add iteration time test as part of regression testing #358

Bug fix

  • Fix init local dir zip-only shard handling #330
  • Added default behavior when no streams and no epoch_size are specified #348


Full Changelog: v0.5.1...v0.5.2

v0.5.1

08 Aug 18:59
ac53002


Full Changelog: v0.5.0...v0.5.1