Remove broadcasting. (#992)

Removes broadcasting in favour of using `squeezeMemSpace` and `reshapeMemSpace`. Details can be found in the Migration Guide.
BlueBrain · May 14, 2024 · 5bd727d
1 parent abf4c69
commit 5bd727d

Showing 9 changed files with 204 additions and 267 deletions.
diff --git a/doc/migration_guide.md b/doc/migration_guide.md
@@ -168,3 +168,69 @@ We felt that the savings in typing effort weren't worth introducing the concept
 of a "file driver". Removing the concept hopefully makes it easier to add a
 better abstraction for the handling of the property lists, when we discover
 such an abstraction.
+
+## Removal of broadcasting
+HighFive v2 had a feature that a dataset (or attribute) of shape `[n, 1]` could
+be read into a one-dimensional array automatically.
+
+The feature is prone to accidentally not failing. Consider an array of shape
+`[n, m]` where, in general, both `n, m > 0`. One should then always be reading
+into a two-dimensional array, even if `n == 1` or `m == 1`. However, due to
+broadcasting, if one of the dimensions (accidentally) happens to be one, then
+the checks won't fail. This isn't a bug in itself, but it can hide a bug, for
+example if the tests happen to use `[n, 1]` datasets and a one-dimensional
+array.
+
+Broadcasting in HighFive was different from broadcasting in NumPy. For reading
+into one-dimensional data HighFive supported stripping all dimensions that are
+`1`. When extending the feature to multi-dimensional arrays it gets tricky.
+We can't strip from both the front and back. If we allowed stripping from both
+ends, arrays such as `[1, n, m]` would read into `[n, m]` if `m > 1` but into
+`[1, n]` (instead of `[n, 1]`) if (coincidentally) `m == 1`. For HighFive,
+avoiding being forced to read `[n, 1]` into `std::vector<std::vector<T>>` is
+more important than for `[1, n]`: flattening the former requires copying
+everything, while the latter can be made flat by just accessing the first value.
+Therefore, HighFive had a preference to strip from the right, while NumPy adds
+`1`s to the front/left of the shape.
+
+In `v3` we've removed broadcasting. Instead users must use one of the two
+alternatives: squeezing and reshaping. The examples shown use datasets and
+reading, but it works the same for attributes and writing.
+
+### Squeezing
+Often we know that the `k`th dimension is `1`, e.g. a column is `[n, 1]` and a
+row is `[1, m]`. In this case it's convenient to state: remove dimension `k`.
+The syntax to simultaneously remove the dimensions `{0, 2}` is:
+
+```
+dset.squeezeMemSpace({0, 2}).read(array);
+```
+This reads a dataset with dimensions `[1, n, 1]` into an array of shape
+`[n]`.
+
+### Reshape
+Sometimes it's easier to state what the new shape must be. For this we have the
+syntax:
+```
+dset.reshapeMemSpace(dims).read(array);
+```
+to declare that `array` should have dimensions `dims` even if
+`dset.getDimensions()` is something different.
+
+Example:
+```
+dset.reshapeMemSpace({dset.getElementCount()}).read(array);
+```
+to read into a one-dimensional array.
+
+### Scalars
+There's one safe case where enforcing matching ranks seems needlessly strict:
+if the dataset is a multi-dimensional array with one element, one should be
+able to read it into (write it from) a scalar.
+
+The reverse, i.e. reading a scalar value in the HDF5 file into a
+multi-dimensional array, isn't supported, because if we want to support arrays
+with runtime-defined rank, we can't deduce the correct shape, e.g. `[1]` vs.
+`[1, 1, 1]`, when read into an array.
+
+
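The squeezing rule described in the migration guide can be illustrated without HDF5. The following is a minimal sketch, not HighFive's implementation: `squeeze_shape` is a hypothetical helper introduced here, and it assumes that only axes of length `1` may be removed and that anything else is an error.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of HighFive): returns `dims` with the listed
// axes removed. Assumption: every squeezed axis must have length 1.
inline std::vector<std::size_t> squeeze_shape(const std::vector<std::size_t>& dims,
                                              const std::vector<std::size_t>& axes) {
    // Reject axes that do not exist in `dims`.
    for (std::size_t axis: axes) {
        if (axis >= dims.size()) {
            throw std::invalid_argument("axis out of range");
        }
    }

    std::vector<std::size_t> squeezed;
    for (std::size_t i = 0; i < dims.size(); ++i) {
        const bool drop = std::find(axes.begin(), axes.end(), i) != axes.end();
        if (!drop) {
            squeezed.push_back(dims[i]);  // keep axes that were not listed
        } else if (dims[i] != 1) {
            throw std::invalid_argument("can only squeeze axes of length 1");
        }
    }
    return squeezed;
}
```

Under these assumptions, squeezing axes `{0, 2}` of `[1, n, 1]` yields `[n]`, matching the example in the guide, whereas asking to squeeze an axis of length greater than one fails loudly instead of silently reinterpreting the shape.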
diff --git a/include/highfive/H5Attribute.hpp b/include/highfive/H5Attribute.hpp
@@ -71,7 +71,7 @@ class Attribute: public Object, public PathTraits<Attribute> {
     /// \since 1.0
     DataType getDataType() const;
 
-    /// \brief Get the DataSpace of the current Attribute.
+    /// \brief Get a copy of the DataSpace of the current Attribute.
     /// \code{.cpp}
     /// Attribute attr = dset.createAttribute<int>("foo", DataSpace(1, 2));
     /// auto dspace = attr.getSpace(); // This will be a DataSpace of dimension 1 * 2

diff --git a/include/highfive/bits/H5Attribute_misc.hpp b/include/highfive/bits/H5Attribute_misc.hpp
@@ -24,6 +24,7 @@
 #include "h5a_wrapper.hpp"
 #include "h5d_wrapper.hpp"
 #include "squeeze.hpp"
+#include "assert_compatible_spaces.hpp"
 
 namespace HighFive {
 
@@ -81,10 +82,7 @@ inline void Attribute::read(T& array) const {
     auto dims = mem_space.getDimensions();
 
     if (mem_space.getElementCount() == 0) {
-        auto effective_dims = details::squeezeDimensions(dims,
-                                                         details::inspector<T>::recursive_ndim);
-
-        details::inspector<T>::prepare(array, effective_dims);
+        details::inspector<T>::prepare(array, dims);
         return;
     }
 
@@ -172,13 +170,7 @@ inline Attribute Attribute::squeezeMemSpace(const std::vector<size_t>& axes) con
 }
 
 inline Attribute Attribute::reshapeMemSpace(const std::vector<size_t>& new_dims) const {
-    auto n_elements_old = this->getMemSpace().getElementCount();
-    auto n_elements_new = compute_total_size(new_dims);
-    if (n_elements_old != n_elements_new) {
-        throw Exception("Invalid parameter `new_dims` number of elements differ: " +
-                        std::to_string(n_elements_old) + " (old) vs. " +
-                        std::to_string(n_elements_new) + " (new)");
-    }
+    detail::assert_compatible_spaces(this->getMemSpace(), new_dims);
 
     auto attr = *this;
     attr._mem_space = DataSpace(new_dims);

diff --git a/include/highfive/bits/H5Converter_misc.hpp b/include/highfive/bits/H5Converter_misc.hpp
@@ -415,10 +415,8 @@ struct data_converter {
     static Reader<T> get_reader(const std::vector<size_t>& dims,
                                 T& val,
                                 const DataType& file_datatype) {
-        // TODO Use bufferinfo for recursive_ndim
-        auto effective_dims = details::squeezeDimensions(dims, inspector<T>::recursive_ndim);
-        inspector<T>::prepare(val, effective_dims);
-        return Reader<T>(effective_dims, val, file_datatype);
+        inspector<T>::prepare(val, dims);
+        return Reader<T>(dims, val, file_datatype);
     }
 };
 

diff --git a/include/highfive/bits/H5Inspector_misc.hpp b/include/highfive/bits/H5Inspector_misc.hpp
@@ -28,92 +28,16 @@ namespace HighFive {
 namespace details {
 
 inline bool checkDimensions(const std::vector<size_t>& dims, size_t n_dim_requested) {
-    size_t n_dim_actual = dims.size();
-
-    // We should allow reading scalar from shapes like `(1, 1, 1)`.
-    if (n_dim_requested == 0) {
-        if (n_dim_actual == 0ul) {
-            return true;
-        }
-
-        return size_t(std::count(dims.begin(), dims.end(), 1ul)) == n_dim_actual;
-    }
-
-    // For non-scalar datasets, we can squeeze away singleton dimension, but
-    // we never add any.
-    if (n_dim_actual < n_dim_requested) {
-        return false;
-    }
-
-    // Special case for 1-dimensional arrays, which can squeeze `1`s from either
-    // side simultaneously if needed.
-    if (n_dim_requested == 1ul) {
-        return n_dim_actual >= 1ul &&
-               size_t(std::count(dims.begin(), dims.end(), 1ul)) >= n_dim_actual - 1ul;
+    if (dims.size() == n_dim_requested) {
+        return true;
     }
 
-    // All other cases strip front only. This avoid unstable behaviour when
-    // squeezing singleton dimensions.
-    size_t n_dim_excess = n_dim_actual - n_dim_requested;
-
-    bool squeeze_back = true;
-    for (size_t i = 1; i <= n_dim_excess; ++i) {
-        if (dims[n_dim_actual - i] != 1) {
-            squeeze_back = false;
-            break;
-        }
-    }
-
-    return squeeze_back;
+    // Scalar values still support broadcasting
+    // into arrays with one element.
+    size_t n_elements = compute_total_size(dims);
+    return n_elements == 1 && n_dim_requested == 0;
 }
 
-
-inline std::vector<size_t> squeezeDimensions(const std::vector<size_t>& dims,
-                                             size_t n_dim_requested) {
-    auto format_error_message = [&]() -> std::string {
-        return "Can't interpret dims = " + format_vector(dims) + " as " +
-               std::to_string(n_dim_requested) + "-dimensional.";
-    };
-
-    if (n_dim_requested == 0) {
-        if (!checkDimensions(dims, n_dim_requested)) {
-            throw std::invalid_argument("Failed dimensions check: " + format_error_message());
-        }
-
-        return {1ul};
-    }
-
-    auto n_dim = dims.size();
-    if (n_dim < n_dim_requested) {
-        throw std::invalid_argument("Failed 'n_dim < n_dim_requested: " + format_error_message());
-    }
-
-    if (n_dim_requested == 1ul) {
-        size_t non_singleton_dim = size_t(-1);
-        for (size_t i = 0; i < n_dim; ++i) {
-            if (dims[i] != 1ul) {
-                if (non_singleton_dim == size_t(-1)) {
-                    non_singleton_dim = i;
-                } else {
-                    throw std::invalid_argument("Failed one-dimensional: " +
-                                                format_error_message());
-                }
-            }
-        }
-
-        return {dims[std::min(non_singleton_dim, n_dim - 1)]};
-    }
-
-    size_t n_dim_excess = dims.size() - n_dim_requested;
-    for (size_t i = 1; i <= n_dim_excess; ++i) {
-        if (dims[n_dim - i] != 1) {
-            throw std::invalid_argument("Failed stripping from back:" + format_error_message());
-        }
-    }
-
-    return std::vector<size_t>(dims.begin(),
-                               dims.end() - static_cast<std::ptrdiff_t>(n_dim_excess));
-}
 }  // namespace details
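To see what the simplified check accepts, here is a standalone sketch of the same logic. The names `check_dimensions` and `total_size` are stand-ins introduced for illustration; `total_size` assumes that `compute_total_size` is the plain product of the extents, with the empty product equal to `1`.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for `compute_total_size`. Assumption: the product of
// all extents, where a rank-0 shape (no extents) counts as one element.
inline std::size_t total_size(const std::vector<std::size_t>& dims) {
    return std::accumulate(dims.begin(),
                           dims.end(),
                           std::size_t{1},
                           std::multiplies<std::size_t>());
}

// Sketch of the rule in the new `checkDimensions`: ranks must match exactly,
// except that a one-element shape may be used with a rank-0 (scalar) target.
inline bool check_dimensions(const std::vector<std::size_t>& dims,
                             std::size_t n_dim_requested) {
    if (dims.size() == n_dim_requested) {
        return true;
    }
    return total_size(dims) == 1 && n_dim_requested == 0;
}
```

With this rule a shape such as `{3, 1}` is accepted only for a two-dimensional target, which is exactly the case the old broadcasting let through for one-dimensional targets. A shape such as `{1, 1, 1}` is still accepted for a scalar target.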
 
 

diff --git a/include/highfive/bits/H5Slice_traits_misc.hpp b/include/highfive/bits/H5Slice_traits_misc.hpp
@@ -22,6 +22,7 @@
 #include "H5Converter_misc.hpp"
 #include "squeeze.hpp"
 #include "compute_total_size.hpp"
+#include "assert_compatible_spaces.hpp"
 
 namespace HighFive {
 
@@ -316,14 +317,7 @@ template <typename Derivate>
 inline Selection SliceTraits<Derivate>::reshapeMemSpace(const std::vector<size_t>& new_dims) const {
     auto slice = static_cast<const Derivate&>(*this);
 
-    auto n_elements_old = slice.getMemSpace().getElementCount();
-    auto n_elements_new = compute_total_size(new_dims);
-    if (n_elements_old != n_elements_new) {
-        throw Exception("Invalid parameter `new_dims` number of elements differ: " +
-                        std::to_string(n_elements_old) + " (old) vs. " +
-                        std::to_string(n_elements_new) + " (new)");
-    }
-
+    detail::assert_compatible_spaces(slice.getMemSpace(), new_dims);
     return detail::make_selection(DataSpace(new_dims), slice.getSpace(), detail::getDataSet(slice));
 }
 

diff --git a/include/highfive/bits/assert_compatible_spaces.hpp b/include/highfive/bits/assert_compatible_spaces.hpp
@@ -0,0 +1,29 @@
+/*
+ *  Copyright (c), 2024, BlueBrain Project, EPFL
+ *
+ *  Distributed under the Boost Software License, Version 1.0.
+ *    (See accompanying file LICENSE_1_0.txt or copy at
+ *          http://www.boost.org/LICENSE_1_0.txt)
+ *
+ */
+#pragma once
+
+#include <vector>
+#include "../H5Exception.hpp"
+#include "../H5DataSpace.hpp"
+
+namespace HighFive {
+namespace detail {
+
+inline void assert_compatible_spaces(const DataSpace& old, const std::vector<size_t>& dims) {
+    auto n_elements_old = old.getElementCount();
+    auto n_elements_new = dims.size() == 0 ? 1 : compute_total_size(dims);
+
+    if (n_elements_old != n_elements_new) {
+        throw Exception("Invalid parameter `new_dims` number of elements differ: " +
+                        std::to_string(n_elements_old) + " (old) vs. " +
+                        std::to_string(n_elements_new) + " (new)");
+    }
+}
+}  // namespace detail
+}  // namespace HighFive
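The shared check boils down to comparing element counts, with a rank-0 shape treated as one element. A minimal sketch of that idea, independent of HighFive, could look as follows; `element_count` and `require_same_element_count` are hypothetical names, and `std::invalid_argument` stands in for HighFive's `Exception`.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: number of elements described by `dims`.
inline std::size_t element_count(const std::vector<std::size_t>& dims) {
    if (dims.empty()) {
        return 1;  // a scalar (rank 0) holds exactly one element
    }
    return std::accumulate(dims.begin(),
                           dims.end(),
                           std::size_t{1},
                           std::multiplies<std::size_t>());
}

// Hypothetical helper: throws if `new_dims` does not describe `n_old` elements.
inline void require_same_element_count(std::size_t n_old,
                                       const std::vector<std::size_t>& new_dims) {
    const std::size_t n_new = element_count(new_dims);
    if (n_old != n_new) {
        throw std::invalid_argument("number of elements differ: " + std::to_string(n_old) +
                                    " (old) vs. " + std::to_string(n_new) + " (new)");
    }
}
```

Under this sketch, reshaping six elements to `{2, 3}` or `{6}` passes, reshaping a single element to `{}` passes, and reshaping six elements to `{4}` throws.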
diff --git a/src/examples/broadcasting_arrays.cpp b/src/examples/broadcasting_arrays.cpp
@@ -36,10 +36,9 @@ int main(void) {
 
     auto dset = file.createDataSet("dset", DataSpace(dims), create_datatype<double>());
 
-    // Note that even though `values` is one-dimensional, we can still write it
-    // to an array of dimensions `[3, 1]`. Only the number of elements needs to
-    // match.
-    dset.write(values);
+    // Note that because `values` is one-dimensional, we can't write it
+    // to a dataset of dimensions `[3, 1]` directly. Instead we use:
+    dset.squeezeMemSpace({1}).write(values);
 
     // When reading, (re-)allocation might occur. The shape to be allocated is
     // the dimensions of the memspace. Therefore, one might want to either remove