Releases: tensorwerk/hangar-py

v0.5.2 Release

08 May 18:21
88922c7

v0.5.2 (2020-05-08)

New Features

  • New column data type supporting arbitrary bytes data. (#198) @rlizzo

Improvements

  • str typed columns can now accept data containing any Unicode code point. In prior releases, data containing any non-ASCII character could not be written to this column type. (#198) @rlizzo
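
To illustrate why an ASCII-only restriction matters, here is a minimal sketch using only the Python standard library. It is not Hangar's serialization code (these notes do not describe it), and UTF-8 is used only as an assumed example of an encoding that covers all of Unicode.

```python
sample = "naïve café ☕ 数据"

# An ASCII-only encoder rejects any code point above U+007F.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 (an assumption for this sketch) round-trips the same text losslessly.
encoded = sample.encode("utf-8")
decoded = encoded.decode("utf-8")
```

Here `ascii_ok` ends up `False`, while `decoded` compares equal to `sample`.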

Bug Fixes

  • Fixed issue where str and (newly added) bytes column data could not be fetched / pushed between a local client repository and remote server. (#198) @rlizzo

Release v0.5.1

06 Apr 16:07
15761c8

v0.5.1 (2020-04-05)

Bug Fixes

  • Fixed issue where importing make_torch_dataloader or make_tf_dataloader under Python 3.6 would raise a NameError regardless of whether the package was installed. (#196) @rlizzo

v0.5.0 Release

04 Apr 11:54
01c94bd

v0.5.0 (2020-04-04)

Improvements

  • Python 3.8 is now fully supported. (#193) @rlizzo
  • Major backend overhaul which defines column layouts and data types in the same interchangeable / extensible manner as storage backends. This will allow rapid development of new layouts and data type support as new use cases are discovered by the community. (#184) @rlizzo
  • Column and backend classes are now fully serializable (pickleable) for read-only checkouts. (#180) @rlizzo
  • Modularized internal structure of API classes to easily allow new column layouts / data types to be added in the future. (#180) @rlizzo
  • Improved type / value checking of manual specification for column backend and backend_options. (#180) @rlizzo
  • Standardized column data access API to follow python standard library dict methods API. (#180) @rlizzo
  • Memory usage of arrayset checkouts has been reduced by ~70% by using C-structs for allocating sample record locating info. (#179) @rlizzo
  • Read times from the HDF5_00 and HDF5_01 backends have been reduced by 33-38% (or more for arraysets with many samples) by eliminating redundant computation of the chunked storage B-Tree. (#179) @rlizzo
  • Commit times and checkout times have been reduced by 11-18% by optimizing record parsing and memory allocation. (#179) @rlizzo
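
As an illustration of what "following the standard library dict methods API" can look like, the sketch below builds a toy container on `collections.abc.MutableMapping`. `ToyColumn` is a hypothetical class invented for this example and is not part of Hangar; it only shows how implementing five core methods yields the familiar `get()`, `keys()`, `update()`, and `pop()` behavior.

```python
from collections.abc import MutableMapping


class ToyColumn(MutableMapping):
    """Hypothetical stand-in for a column; not part of Hangar."""

    def __init__(self):
        self._samples = {}

    def __getitem__(self, key):
        return self._samples[key]

    def __setitem__(self, key, value):
        self._samples[key] = value

    def __delitem__(self, key):
        del self._samples[key]

    def __iter__(self):
        return iter(self._samples)

    def __len__(self):
        return len(self._samples)


col = ToyColumn()
col["a"] = [1, 2, 3]
col.update({"b": [4, 5], "c": [6]})   # mixin method from MutableMapping
removed = col.pop("c")                # mixin method from MutableMapping
keys = list(col.keys())
missing = col.get("zzz", None)
```

The mixin methods come for free from the abstract base class, which is one way to keep a custom container consistent with `dict` semantics.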

New Features

  • Added str type column with the same behavior as the ndarray column (supporting both single-level and nested layouts) to replace the functionality of the removed metadata container. (#184) @rlizzo
  • New backend based on LMDB has been added (specifier of lmdb_30). (#184) @rlizzo
  • Added .diff() method to Repository class to enable diffing changes between any pair of commits / branches without needing to open the diff base in a checkout. (#183) @rlizzo
  • New CLI command hangar diff which reports a summary view of changes made between any pair of commits / branches. (#183) @rlizzo
  • Added .log() method to Checkout objects so that a graphical commit graph or machine-readable commit details / DAG can be queried when operating on a particular commit. (#183) @rlizzo
  • "string" type columns now supported alongside "ndarray" column type. (#180) @rlizzo
  • New "column" API, which replaces "arrayset" name. (#180) @rlizzo
  • Arraysets can now contain "nested subsamples" under a common sample key. (#179) @rlizzo
  • New API to add and remove samples from an arrayset. (#179) @rlizzo
  • Added repo.size_nbytes and repo.size_human to report the disk usage of a repository. (#174) @rlizzo
  • Added method to traverse the entire repository history and cryptographically verify integrity. (#173) @rlizzo
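
The idea behind cryptographically verifying a history can be sketched with a toy hash chain. Everything here is an assumption made for illustration: the linear history, the record layout, the use of SHA-256, and the function names `commit_digest`, `build_history`, and `verify_history` are hypothetical and do not reflect Hangar's actual record format or verification method.

```python
import hashlib


def commit_digest(parent_digest: str, payload: bytes) -> str:
    """Hash a commit's payload together with its parent's digest."""
    h = hashlib.sha256()
    h.update(parent_digest.encode("ascii"))
    h.update(payload)
    return h.hexdigest()


def build_history(payloads):
    """Return a list of (digest, parent_digest, payload) records."""
    history, parent = [], ""
    for payload in payloads:
        digest = commit_digest(parent, payload)
        history.append((digest, parent, payload))
        parent = digest
    return history


def verify_history(history) -> bool:
    """Recompute every digest and check parent links from root to head."""
    expected_parent = ""
    for digest, parent, payload in history:
        if parent != expected_parent:
            return False
        if commit_digest(parent, payload) != digest:
            return False
        expected_parent = digest
    return True


history = build_history([b"first", b"second", b"third"])
intact = verify_history(history)

# Tamper with the middle record's payload without updating its digest.
digest, parent, _ = history[1]
tampered = history[:1] + [(digest, parent, b"SECOND")] + history[2:]
tampered_ok = verify_history(tampered)
```

Because each digest covers its parent's digest, changing any record invalidates the check at that record; in this sketch `intact` is `True` and `tampered_ok` is `False`.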

Changes

  • Changed the argument syntax of the __getitem__() and get() methods of the ReaderCheckout and WriterCheckout classes. The new format supports handling arbitrary arguments specific to retrieval of data from any column type. (#183) @rlizzo

Removed

  • metadata container for str typed data has been completely removed. It is replaced by a highly extensible and much more user-friendly str typed column. (#184) @rlizzo
  • __setitem__() method in WriterCheckout objects. Writing data to columns via a checkout object is no longer supported. (#183) @rlizzo

Bug Fixes

  • Backend data stores no longer use file symlinks, improving compatibility with some types of file systems. (#171) @rlizzo
  • All arrayset types ("flat" and "nested subsamples") and backend readers can now be pickled -- for parallel processing -- in a read-only checkout. (#179) @rlizzo

Breaking changes

  • New backend record serialization format is incompatible with repositories written in version 0.4 or earlier.
  • New arrayset API is incompatible with Hangar API in version 0.4 or earlier.

v0.5.0 Pre-Release 2

04 Apr 10:03
e1bb0e8

Pre-Release for v0.5.0. Full Changelog To Follow.

v0.5.0 Pre-Release

04 Apr 09:21
fae9052

Pre-Release for v0.5.0. Full Changelog To Follow.

Release v0.4.0

26 Nov 07:01
be7d40e

Release Notes

New Features

  • Added ability to delete branch names/pointers from a local repository via both API and CLI. #128 @rlizzo
  • Added local keyword arg to arrayset key/value iterators to return only locally available samples #131 @rlizzo
  • Ability to change the backend storage format and options applied to an arrayset after initialization. #133 @rlizzo
  • Added blosc compression to HDF5 backend by default on PyPI installations. #146 @rlizzo
  • Added Benchmarking Suite to Test for Performance Regressions in PRs. #155 @rlizzo
  • Added new backend optimized to increase speeds for fixed size arrayset access. #160 @rlizzo

Improvements

  • Removed msgpack and pyyaml dependencies. Cleaned up and improved remote client/server code. #130 @rlizzo
  • Multiprocess Torch DataLoaders allowed on Linux and MacOS. #144 @rlizzo
  • Added CLI options commit, checkout, arrayset create, & arrayset remove. #150 @rlizzo
  • Plugin system revamp. #134 @hhsecond
  • Documentation Improvements and Typo-Fixes. #156 @alessiamarcolini
  • Removed implicit removal of arrayset schema from checkout if every sample was removed from arrayset. This could potentially result in dangling accessors which may or may not self-destruct (as expected) in certain edge-cases. #159 @rlizzo
  • Added type codes to hash digests so that calculation function can be updated in the future without breaking repos written in previous Hangar versions. #165 @rlizzo
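
The benefit of tagging digests with a type code can be sketched as follows. The `"<code>=<hexdigest>"` layout, the `HASHERS` registry, and the choice of SHA-1 and BLAKE2b are all assumptions made for this example; the notes do not specify Hangar's actual codes or hash functions.

```python
import hashlib

# Hypothetical registry mapping a one-character type code to a hash function.
HASHERS = {
    "0": lambda data: hashlib.sha1(data).hexdigest(),
    "1": lambda data: hashlib.blake2b(data, digest_size=20).hexdigest(),
}
CURRENT_CODE = "1"


def tagged_digest(data: bytes, code: str = CURRENT_CODE) -> str:
    """Return '<code>=<hexdigest>' so the algorithm used is recorded."""
    return f"{code}={HASHERS[code](data)}"


def verify(data: bytes, stored: str) -> bool:
    """Check data against a stored digest using the algorithm it names."""
    code, _, _ = stored.partition("=")
    return tagged_digest(data, code) == stored


old = tagged_digest(b"sample", code="0")   # written by an "older" version
new = tagged_digest(b"sample")             # written by the "current" version
```

Since the stored value names its own algorithm, digests written under the old code still verify after the default changes, which is the compatibility property the bullet above describes.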

Bug Fixes

  • Programmatic access to repository log contents now returns branch heads alongside other log info. #125 @rlizzo
  • Fixed minor bug in types of values allowed for Arrayset names vs Sample names. #151 @rlizzo
  • Fixed issue where using checkout object to access a sample in multiple arraysets would try to create a namedtuple instance with invalid field names. Now incompatible field names are automatically renamed with their positional index. #161 @rlizzo
  • Explicitly raise error if commit argument is set while checking out a repository with write=True. #166 @rlizzo
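
The namedtuple renaming described above matches the behavior of the standard library's `rename=True` option, sketched below. The column names are made up for illustration, and the notes do not state that Hangar uses this exact call internally.

```python
from collections import namedtuple

# Hypothetical column names: one invalid identifier and one Python keyword.
column_names = ["images", "2d-labels", "class"]

# Without rename=True, invalid field names raise ValueError.
try:
    namedtuple("Sample", column_names)
    strict_ok = True
except ValueError:
    strict_ok = False

# With rename=True, invalid names become "_<positional index>".
Sample = namedtuple("Sample", column_names, rename=True)
fields = Sample._fields   # ('images', '_1', '_2')
```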

Breaking changes

  • New commit reference serialization format is incompatible with repositories written in version 0.3.0 or earlier.

v0.4.0b0 Beta Pre-Release

19 Oct 01:51
f1c5d05
Pre-release
Merge pull request #145 from rlizzo/version-0-4-0b0

Version 0.4.0b0

v0.3.0 Release

10 Sep 07:52
d337bec

New Features

  • API addition allowing reading and writing arrayset data from a checkout object directly. (#115) @rlizzo
  • Data importers, exporters, and viewers via CLI for common file formats. Includes plugin system for easy extensibility in the future. (#103) (@rlizzo, @hhsecond)

Improvements

  • Added tutorial on working with remote data. (#113) @rlizzo
  • Added Tutorial on Tensorflow and PyTorch Dataloaders. (#117) @hhsecond
  • Large performance improvement to diff/merge algorithm (~30x previous). (#112) @rlizzo
  • New commit hash algorithm which is much more reproducible in the long term. (#120) @rlizzo
  • HDF5 backend updated to increase speed of reading/writing variable sized dataset compressed chunks (#120) @rlizzo

Bug Fixes

  • Fixed ML Dataloaders errors for a number of edge cases surrounding partial-remote data and non-common keys. (#110) (@hhsecond, @rlizzo)

Breaking changes

  • New commit hash algorithm is incompatible with repositories written in version 0.2.0 or earlier

v0.2.0 Release

09 Aug 20:14
a47aaf0

See changelog for full details

New Features

  • Numpy memory-mapped array file backend added.
  • Remote server data backend added.
  • Selection heuristics to determine appropriate backend from arrayset schema.
  • Partial remote clones and fetch operations now fully supported.
  • CLI has been placed under test coverage, added interface usage to docs.
  • TensorFlow and PyTorch Machine Learning Dataloader Methods (Experimental Release).

Improvements

  • Record format versioning and standardization so as not to break backwards compatibility in the future.
  • Backend addition and update developer protocols and documentation.
  • Read-only checkout arrayset sample get methods now are multithread and multiprocess safe.
  • Read-only checkout metadata sample get methods are thread safe if used within a context manager.
  • Samples can be assigned integer names in addition to string names.
  • Forgetting to close a write-enabled checkout before terminating the python process will close the
    checkout automatically in many situations.
  • Repository software version compatibility methods added to ensure upgrade paths in the future.
  • Many tests added (including support for Mac OSX on Travis-CI).

Bug Fixes

  • Diffs for fast-forward merges now return sensible results.
  • Many type annotations added, and developer documentation improved.

Breaking changes

  • Renamed all references to datasets in the API / world-view to arraysets.
  • These are backwards incompatible changes. For all versions > 0.2, repository upgrade utilities will
    be provided if breaking changes occur.

v0.1.1 Release

24 May 18:21
019fffc

Fix for the README, which had typos and was pushed to PyPI.