Remove broadcasting. (#992)
Removes broadcasting in favour of using `squeezeMemSpace` and `reshapeMemSpace`. Details can be found in the Migration Guide.
1uc committed May 14, 2024
1 parent abf4c69 commit 5bd727d
Showing 9 changed files with 204 additions and 267 deletions.
66 changes: 66 additions & 0 deletions doc/migration_guide.md
@@ -168,3 +168,69 @@ We felt that the savings in typing effort weren't worth introducing the concept
of a "file driver". Removing the concept hopefully makes it easier to add a
better abstraction for the handling of the property lists, when we discover
such an abstraction.

## Removal of broadcasting
HighFive v2 had a feature that a dataset (or attribute) of shape `[n, 1]` could
be read into a one-dimensional array automatically.

The feature is prone to accidentally not failing. Consider an array that has
shape `[n, m]`, where in general both `n, m > 0`. Hence, one should always be
reading into a two-dimensional array, even if `n == 1` or `m == 1`. However,
due to broadcasting, if one of the dimensions (accidentally) happens to be one,
then the checks won't fail. This isn't a bug in itself, but it can hide one,
for example if the tests happen to use `[n, 1]` datasets and a one-dimensional
array.
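
To make the failure mode concrete, here is a minimal sketch of the two acceptance rules for a one-dimensional target. The function names are hypothetical stand-ins, not HighFive API; the v2 rule is modelled on the one-dimensional special case of the removed `checkDimensions`, and the strict rule assumes that without broadcasting only a rank-1 shape matches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the v2 rule (hypothetical name): a shape is accepted for a
// one-dimensional target if at most one of its dimensions differs from 1.
inline bool v2_accepts_as_1d(const std::vector<std::size_t>& dims) {
    if (dims.empty()) {
        return false;
    }
    auto n_ones =
        static_cast<std::size_t>(std::count(dims.begin(), dims.end(), std::size_t(1)));
    return n_ones >= dims.size() - 1;
}

// Stand-in for the strict rule (hypothetical name): only rank-1 shapes match.
inline bool strict_accepts_as_1d(const std::vector<std::size_t>& dims) {
    return dims.size() == 1;
}

inline void example_hidden_bug() {
    // A test that happens to use a `[5, 1]` dataset passes under the v2 rule...
    assert(v2_accepts_as_1d({5, 1}));
    // ...although the general `[5, 3]` case is rejected.
    assert(!v2_accepts_as_1d({5, 3}));
    // The strict rule rejects both, so the mismatch shows up immediately.
    assert(!strict_accepts_as_1d({5, 1}));
    assert(!strict_accepts_as_1d({5, 3}));
}
```

With the strict rule, a test suite that only exercises `[n, 1]` datasets can no longer mask the rank mismatch.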

Broadcasting in HighFive was different from broadcasting in NumPy. For reading
into one-dimensional data HighFive supported stripping all dimensions that are
`1`. When extending the feature to multi-dimensional arrays it gets tricky.
We can't strip from both the front and back. If we allowed stripping from both
ends, arrays such as `[1, n, m]` would be read into `[n, m]` if `m > 1` but
into `[1, n]` (instead of `[n, 1]`) if (coincidentally) `m == 1`. For HighFive,
avoiding being forced to read `[n, 1]` into `std::vector<std::vector<T>>` is
more important than avoiding the same for `[1, n]`: flattening the former
requires copying everything, while the latter can be made flat by just
accessing the first value. Therefore, HighFive had a preference to strip from
the right, while NumPy adds `1`s to the front/left of the shape.

In `v3` we've removed broadcasting. Instead, users must use one of the two
alternatives: squeezing and reshaping. The examples shown use datasets and
reading, but it works the same for attributes and writing.

### Squeezing
Often we know that the `k`th dimension is `1`, e.g. a column is `[n, 1]` and a
row is `[1, m]`. In this case it's convenient to simply state: remove dimension
`k`. The syntax to simultaneously remove the dimensions `{0, 2}` is:

```
dset.squeezeMemSpace({0, 2}).read(array);
```
This will read a dataset with dimensions `[1, n, 1]` into an array of shape
`[n]`.
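
The shape arithmetic behind squeezing can be sketched without any HDF5 dependency. The helper `squeezed_shape` below is hypothetical and is not HighFive's implementation; it assumes that naming an axis whose length is not `1` (or that is out of range) is an error.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: remove the listed axes from `dims`, insisting that
// every removed axis has length 1.
inline std::vector<std::size_t> squeezed_shape(const std::vector<std::size_t>& dims,
                                               const std::vector<std::size_t>& axes) {
    for (std::size_t axis: axes) {
        if (axis >= dims.size()) {
            throw std::invalid_argument("axis out of range");
        }
        if (dims[axis] != 1) {
            throw std::invalid_argument("can only squeeze axes of length 1");
        }
    }

    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < dims.size(); ++i) {
        // Keep every axis that was not listed for removal.
        if (std::find(axes.begin(), axes.end(), i) == axes.end()) {
            out.push_back(dims[i]);
        }
    }
    return out;
}

inline void example_squeeze() {
    // `[1, n, 1]` with axes `{0, 2}` removed becomes `[n]`.
    assert((squeezed_shape({1, 4, 1}, {0, 2}) == std::vector<std::size_t>{4}));
    // A column `[n, 1]` with axis `1` removed becomes `[n]`.
    assert((squeezed_shape({4, 1}, {1}) == std::vector<std::size_t>{4}));
}
```

Naming the axes explicitly is what removes the ambiguity of the old behaviour: the caller, not the library, decides which singleton dimensions disappear.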

### Reshape
Sometimes it's easier to state what the new shape must be. For this we have the
syntax:
```
dset.reshapeMemSpace(dims).read(array);
```
This declares that `array` should have dimensions `dims` even if
`dset.getDimensions()` is something different. The number of elements must be
the same in both shapes.

Example:
```
dset.reshapeMemSpace({dset.getElementCount()}).read(array);
```
This reads the dataset into a one-dimensional array.
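
The only condition for reshaping is that the element count is preserved, which is what `assert_compatible_spaces` in this commit checks. A minimal sketch of that condition, using the hypothetical stand-ins `element_count` and `reshape_is_allowed` on plain shape vectors rather than HighFive types:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in: product of all dimensions. An empty shape is treated
// as a scalar and therefore has one element.
inline std::size_t element_count(const std::vector<std::size_t>& dims) {
    return std::accumulate(dims.begin(),
                           dims.end(),
                           std::size_t(1),
                           std::multiplies<std::size_t>());
}

// Hypothetical stand-in: a reshape is allowed exactly when the number of
// elements stays the same.
inline bool reshape_is_allowed(const std::vector<std::size_t>& old_dims,
                               const std::vector<std::size_t>& new_dims) {
    return element_count(old_dims) == element_count(new_dims);
}

inline void example_reshape() {
    assert(reshape_is_allowed({3, 1}, {3}));     // column to flat array
    assert(reshape_is_allowed({2, 3}, {6}));     // flatten
    assert(reshape_is_allowed({2, 3}, {3, 2}));  // same count, different shape
    assert(!reshape_is_allowed({2, 3}, {5}));    // element counts differ
}
```

In the commit itself a mismatch raises an `Exception` that reports both element counts; the boolean return here is only for illustration.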

### Scalars
There's a safe case that seems needlessly strict to enforce: if the dataset is
a multi-dimensional array with one element, one should be able to read into
(write from) a scalar.

The reverse, i.e. reading a scalar value in the HDF5 file into a
multi-dimensional array, isn't supported, because if we want to support arrays
with runtime-defined rank, we can't deduce the correct shape, e.g. `[1]` vs.
`[1, 1, 1]`, when reading into an array.
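
The resulting rule is small enough to state in code. The sketch below mirrors the simplified `checkDimensions` from this commit on plain shape vectors; `element_count` and `rank_is_compatible` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in: product of all dimensions; an empty shape has one
// element.
inline std::size_t element_count(const std::vector<std::size_t>& dims) {
    return std::accumulate(dims.begin(),
                           dims.end(),
                           std::size_t(1),
                           std::multiplies<std::size_t>());
}

// The ranks must match exactly, except that a single-element array of any
// rank may be paired with a scalar (requested rank 0).
inline bool rank_is_compatible(const std::vector<std::size_t>& dims,
                               std::size_t n_dim_requested) {
    if (dims.size() == n_dim_requested) {
        return true;
    }
    return element_count(dims) == 1 && n_dim_requested == 0;
}

inline void example_scalars() {
    assert(rank_is_compatible({1, 1, 1}, 0));  // one element into a scalar
    assert(rank_is_compatible({}, 0));         // scalar into a scalar
    assert(!rank_is_compatible({}, 3));        // scalar into a rank-3 array
    assert(!rank_is_compatible({3, 1}, 1));    // no implicit squeezing
}
```

The asymmetry is deliberate: going from one element to a scalar loses no information, whereas going from a scalar to an array would require guessing a rank.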


2 changes: 1 addition & 1 deletion include/highfive/H5Attribute.hpp
@@ -71,7 +71,7 @@ class Attribute: public Object, public PathTraits<Attribute> {
/// \since 1.0
DataType getDataType() const;

/// \brief Get the DataSpace of the current Attribute.
/// \brief Get a copy of the DataSpace of the current Attribute.
/// \code{.cpp}
/// Attribute attr = dset.createAttribute<int>("foo", DataSpace(1, 2));
/// auto dspace = attr.getSpace(); // This will be a DataSpace of dimension 1 * 2
14 changes: 3 additions & 11 deletions include/highfive/bits/H5Attribute_misc.hpp
@@ -24,6 +24,7 @@
#include "h5a_wrapper.hpp"
#include "h5d_wrapper.hpp"
#include "squeeze.hpp"
#include "assert_compatible_spaces.hpp"

namespace HighFive {

@@ -81,10 +82,7 @@ inline void Attribute::read(T& array) const {
auto dims = mem_space.getDimensions();

if (mem_space.getElementCount() == 0) {
auto effective_dims = details::squeezeDimensions(dims,
details::inspector<T>::recursive_ndim);

details::inspector<T>::prepare(array, effective_dims);
details::inspector<T>::prepare(array, dims);
return;
}

@@ -172,13 +170,7 @@ inline Attribute Attribute::squeezeMemSpace(const std::vector<size_t>& axes) con
}

inline Attribute Attribute::reshapeMemSpace(const std::vector<size_t>& new_dims) const {
auto n_elements_old = this->getMemSpace().getElementCount();
auto n_elements_new = compute_total_size(new_dims);
if (n_elements_old != n_elements_new) {
throw Exception("Invalid parameter `new_dims` number of elements differ: " +
std::to_string(n_elements_old) + " (old) vs. " +
std::to_string(n_elements_new) + " (new)");
}
detail::assert_compatible_spaces(this->getMemSpace(), new_dims);

auto attr = *this;
attr._mem_space = DataSpace(new_dims);
6 changes: 2 additions & 4 deletions include/highfive/bits/H5Converter_misc.hpp
@@ -415,10 +415,8 @@ struct data_converter {
static Reader<T> get_reader(const std::vector<size_t>& dims,
T& val,
const DataType& file_datatype) {
// TODO Use bufferinfo for recursive_ndim
auto effective_dims = details::squeezeDimensions(dims, inspector<T>::recursive_ndim);
inspector<T>::prepare(val, effective_dims);
return Reader<T>(effective_dims, val, file_datatype);
inspector<T>::prepare(val, dims);
return Reader<T>(dims, val, file_datatype);
}
};

88 changes: 6 additions & 82 deletions include/highfive/bits/H5Inspector_misc.hpp
@@ -28,92 +28,16 @@ namespace HighFive {
namespace details {

inline bool checkDimensions(const std::vector<size_t>& dims, size_t n_dim_requested) {
size_t n_dim_actual = dims.size();

// We should allow reading scalar from shapes like `(1, 1, 1)`.
if (n_dim_requested == 0) {
if (n_dim_actual == 0ul) {
return true;
}

return size_t(std::count(dims.begin(), dims.end(), 1ul)) == n_dim_actual;
}

// For non-scalar datasets, we can squeeze away singleton dimension, but
// we never add any.
if (n_dim_actual < n_dim_requested) {
return false;
}

// Special case for 1-dimensional arrays, which can squeeze `1`s from either
// side simultaneously if needed.
if (n_dim_requested == 1ul) {
return n_dim_actual >= 1ul &&
size_t(std::count(dims.begin(), dims.end(), 1ul)) >= n_dim_actual - 1ul;
if (dims.size() == n_dim_requested) {
return true;
}

// All other cases strip front only. This avoid unstable behaviour when
// squeezing singleton dimensions.
size_t n_dim_excess = n_dim_actual - n_dim_requested;

bool squeeze_back = true;
for (size_t i = 1; i <= n_dim_excess; ++i) {
if (dims[n_dim_actual - i] != 1) {
squeeze_back = false;
break;
}
}

return squeeze_back;
// Scalar values still support broadcasting
// into arrays with one element.
size_t n_elements = compute_total_size(dims);
return n_elements == 1 && n_dim_requested == 0;
}


inline std::vector<size_t> squeezeDimensions(const std::vector<size_t>& dims,
size_t n_dim_requested) {
auto format_error_message = [&]() -> std::string {
return "Can't interpret dims = " + format_vector(dims) + " as " +
std::to_string(n_dim_requested) + "-dimensional.";
};

if (n_dim_requested == 0) {
if (!checkDimensions(dims, n_dim_requested)) {
throw std::invalid_argument("Failed dimensions check: " + format_error_message());
}

return {1ul};
}

auto n_dim = dims.size();
if (n_dim < n_dim_requested) {
throw std::invalid_argument("Failed 'n_dim < n_dim_requested: " + format_error_message());
}

if (n_dim_requested == 1ul) {
size_t non_singleton_dim = size_t(-1);
for (size_t i = 0; i < n_dim; ++i) {
if (dims[i] != 1ul) {
if (non_singleton_dim == size_t(-1)) {
non_singleton_dim = i;
} else {
throw std::invalid_argument("Failed one-dimensional: " +
format_error_message());
}
}
}

return {dims[std::min(non_singleton_dim, n_dim - 1)]};
}

size_t n_dim_excess = dims.size() - n_dim_requested;
for (size_t i = 1; i <= n_dim_excess; ++i) {
if (dims[n_dim - i] != 1) {
throw std::invalid_argument("Failed stripping from back:" + format_error_message());
}
}

return std::vector<size_t>(dims.begin(),
dims.end() - static_cast<std::ptrdiff_t>(n_dim_excess));
}
} // namespace details


10 changes: 2 additions & 8 deletions include/highfive/bits/H5Slice_traits_misc.hpp
@@ -22,6 +22,7 @@
#include "H5Converter_misc.hpp"
#include "squeeze.hpp"
#include "compute_total_size.hpp"
#include "assert_compatible_spaces.hpp"

namespace HighFive {

@@ -316,14 +317,7 @@ template <typename Derivate>
inline Selection SliceTraits<Derivate>::reshapeMemSpace(const std::vector<size_t>& new_dims) const {
auto slice = static_cast<const Derivate&>(*this);

auto n_elements_old = slice.getMemSpace().getElementCount();
auto n_elements_new = compute_total_size(new_dims);
if (n_elements_old != n_elements_new) {
throw Exception("Invalid parameter `new_dims` number of elements differ: " +
std::to_string(n_elements_old) + " (old) vs. " +
std::to_string(n_elements_new) + " (new)");
}

detail::assert_compatible_spaces(slice.getMemSpace(), new_dims);
return detail::make_selection(DataSpace(new_dims), slice.getSpace(), detail::getDataSet(slice));
}

29 changes: 29 additions & 0 deletions include/highfive/bits/assert_compatible_spaces.hpp
@@ -0,0 +1,29 @@
/*
* Copyright (c), 2024, BlueBrain Project, EPFL
*
* Distributed under the Boost Software License, Version 1.0.
* (See accompanying file LICENSE_1_0.txt or copy at
* http://www.boost.org/LICENSE_1_0.txt)
*
*/
#pragma once

#include <vector>
#include "../H5Exception.hpp"
#include "../H5DataSpace.hpp"

namespace HighFive {
namespace detail {

inline void assert_compatible_spaces(const DataSpace& old, const std::vector<size_t>& dims) {
auto n_elements_old = old.getElementCount();
auto n_elements_new = dims.size() == 0 ? 1 : compute_total_size(dims);

if (n_elements_old != n_elements_new) {
throw Exception("Invalid parameter `new_dims` number of elements differ: " +
std::to_string(n_elements_old) + " (old) vs. " +
std::to_string(n_elements_new) + " (new)");
}
}
} // namespace detail
} // namespace HighFive
7 changes: 3 additions & 4 deletions src/examples/broadcasting_arrays.cpp
@@ -36,10 +36,9 @@ int main(void) {

auto dset = file.createDataSet("dset", DataSpace(dims), create_datatype<double>());

// Note that even though `values` is one-dimensional, we can still write it
// to an array of dimensions `[3, 1]`. Only the number of elements needs to
// match.
dset.write(values);
// Note that because `values` is one-dimensional, we can't write it
// to a dataset of dimensions `[3, 1]` directly. Instead we use:
dset.squeezeMemSpace({1}).write(values);

// When reading, (re-)allocation might occur. The shape to be allocated is
// the dimensions of the memspace. Therefore, one might want to either remove
