Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switching away from AxisArrays to DimArrays #336

Open
ParadaCarleton opened this issue Nov 16, 2021 · 11 comments
Open

Switching away from AxisArrays to DimArrays #336

ParadaCarleton opened this issue Nov 16, 2021 · 11 comments

Comments

@ParadaCarleton
Copy link
Member

ParadaCarleton commented Nov 16, 2021

I decided to separate this out from the question of whether we want to drop the use of regular arrays or switch to TupleVectors/StructArrays. Would we want to switch to DimArrays (or potentially AxisKeys, although that has some problems)?

@devmotion
Copy link
Member

I guess some things to consider are:

  • what are the pros and cons of DimArrays compared with AxisArrays?
  • how breaking is it, for developers and users, if AxisArray is replaced with DimArray?
  • are there other, possibly better, alternatives?

@ParadaCarleton
Copy link
Member Author

I guess some things to consider are:

  • what are the pros and cons of DimArrays compared with AxisArrays?
  • how breaking is it, for developers and users, if AxisArray is replaced with DimArray?
  • are there other, possibly better, alternatives?
  1. The main pros would be active development and getting rid of a lot of bugs. Avoiding this bug, for instance. Other smaller benefits, e.g. getting all the features provided by Dim objects and dealing with integer indexing more elegantly.
  2. Depends on whether this behavior is changed. Obviously anything that relies on lots of AxisArray-specific implementation details (e.g. playing with axis objects) is going to break; however, if the behavior for DimArrays is switched back, indexing should work just as before, and I suspect indexing is what most people use most of the time in their arrays.
  3. There are alternatives. The main one I'm aware of is AxisKeys.jl, but that one suffers from some problems. The most notable one is that every array is treated as a functor, so that indexing is done with round parentheses instead of square brackets; this tends to result in a lot of method ambiguities and also makes any switch a more-breaking change, since square bracket indexing no longer works.

@rafaqz
Copy link

rafaqz commented Nov 19, 2021

I would be happy to support this and work on any required changes to DimensionalData.jl.

Something not mentioned here is that with a DimArray broadcasting and most base/stats methods (e.g. reverse) work and keep the DimArray wrapper through all transformations. There is also a Tables.jl interface.

Edit: I think point 2 is resolvable. we should be able to make DimArray a drop-in replacement.

@ParadaCarleton
Copy link
Member Author

Mentioning @cpfiffer who I think has mentioned not liking AxisArrays. I could get a PR up in a couple days if you'd like.

@devmotion
Copy link
Member

I think nobody likes AxisArrays, including me 😄

Nevertheless we need some concrete motivation and clear advantages for why we should make these breaking changes and why it is good to break users' workflows and downstream packages. It seems that many issues with the current design would not be solved by DimArray (e.g., we would still have to group and ungroup e.g. vector- or matrix-valued variables manually and would not be able to work with variables of different types properly), but of course it's nice if we can fix (or rather avoid) JuliaArrays/AxisArrays.jl#182.

Something not mentioned here is that with a DimArray broadcasting and most base/stats methods (e.g. reverse) work and keep the DimArray wrapper through all transformations.

It would be interesting to see concrete examples of how this would differ from the current behaviour when indexing or slicing chains. Produce the standard stats implementations the same results that we get with our manual implementations currently? Many statistical summaries have to be implemented manually anyway I assume since they are quite specific for our use case here, and IIRC also the implementations of e.g. mean (

MCMCChains.jl/src/stats.jl

Lines 280 to 298 in e7b3db5

function mean(chains::Chains; kwargs...)
# Store everything.
funs = [meancskip]
func_names = [:mean]
# Summarize.
summary_df = summarize(
chains, funs...;
func_names = func_names,
name = "Mean",
kwargs...
)
return summary_df
end
mean(chn::Chains, syms) = mean(chn[:, syms, :])
# resolve method ambiguity with `mean(f, ::AbstractArray)`
mean(chn::Chains, syms::AbstractVector) = mean(chn[:, syms, :])
) are a bit specialized since e.g. they always skip missing values, don't support all aggregations, and produce a light data frame.

It would be nice to make this discussion a bit more concrete but directly comparing current behaviour and how it would change with DimArray.

@rafaqz
Copy link

rafaqz commented Nov 19, 2021

I don't really have a horse in this race or know Turing.jl internals, I was just offering to help out if any changes are needed to DimensionalData.jl. But there shouldn't need to be any breaking workflow or syntax changes here when indexing or slicing.

There are some small things that are nicer for this codebase, like:

# niter = size(chn, 1)
niter = size(chn, :iter)

So you don't need to use magic numbers to specify axes.

And I do get the feeling there is code that could leverage DimensionalData.jl here given some thought, like the Tables.jl interface is very similar.

we would still have to group and ungroup e.g. vector- or matrix-valued variables manually and would not be able to work with variables of different types properly

I'm not sure I understand the use case, but for variables of different types and sizes I would use a DimStack instead of DimArray. Its like a NamedTuple of arrays that share some or all dimensions and can be indexed together, returning a NamedTuple over all layers, or a sliced DimStack if sliced.

It's also usable as a table where the columns are the dimensions (which are looped to length) and the layers are the other columns.

@devmotion
Copy link
Member

Maybe before I reply I should repeat that I'm not against this change (as I said nobody is really happy with AxisArray) but that the current discussion is just a bit too high-level for my taste and we should be careful that we actually gain something and don't break or remove existing functionality that works well 🙂

There are some small things that are nicer for this codebase, like:

# niter = size(chn, 1)
niter = size(chn, :iter)

While size(chn, i) is supported (of course), ideally already now users should not have to deal with the internal layout. niter is just length(chn). The iterations can be retrieved by range(chn) (in the current design limited to ranges, as the name indicates), the parameters with names(chn) (or names(chn, section) for a section of parameters, e.g., excluding internal metadata), and the chain numbers with chains(chn). They are implemented internally with the quite descriptive (but more verbose) Axis{:iter}, Axis{:chain} etc.

I'm not sure I understand the use case,

One of the main pain points with AxisArrays is not that we use AxisArray instead of one of its alternatives but that the underlying data has to be a 3d array. This is easy and works well in many cases but is annoying and problematic e.g. if one works with samples of different types (e.g. discrete variables of type Int and continuous variables of type Float64) and if some samples are vectors or matrices (because they have to be flattened and it is annoying/difficult to obtain the original array from the flattened version without heuristics and type stability issues).

@rafaqz
Copy link

rafaqz commented Nov 19, 2021

I was imagining using DimensionalData.jl could simplify some code in this package, more than for the user! That code is from this package.

It should be essentially the same for the user, but with hopefully less bugs.

One of the main pain points with AxisArrays is not that we use AxisArray instead of one of its alternatives but that the underlying data has to be a 3d array.

I guess I don't understand why you are stuck in that situation. Why not just use a NamedTuple holding a separate 1d/2d Array for each variable? Why does the data have to be in one array?

Essentially DimStack makes that easy, automates running map over all the arrays with base functions, and gives it a Tables.jl interface.

@devmotion
Copy link
Member

Why not just use a NamedTuple holding a separate Array for each variable? Why does the data have to be in one array?

It doesn't have to and we want to support this case (or maybe rather StructArrays or vectors of NamedTuples since this might be a more natural output for samplers) 🙂 The problem is time and resources - it took a long time to somewhat fix and clean the statistical analyses and other implementations in this package, it's an old and a bit messy code base with some parts extracted from Mamba.jl. Many algorithms were (and still are) written for the 3d-array/AxisArray design specifically. Recently some algorithms were moved to MCMCDiagnosticTools and made backend agnostic (at least that's the goal), which hopefully will make it easier to support other chain structures.

@cpfiffer
Copy link
Member

David's got good points here. There are a lot of far better ways to do what we currently do, but switching over to them is the hard bit. We could do DimensionalData.jl, @cscherrer's TupleVectors.jl which I quite like, raw vectors of named tuples, vectors-of-vectors, etc.

All of these kind of get us around the two big issues I see with the backend which are (a) forced equal chain sizes and (b) bad handling of multi-type chains.

Ultimately I just kind of want anything that supports the convenience function getthing(chain, parameter), and returns something array-like with respect to that parameter or parameters. I am very agnostic about how we do this so from my perspective DimensionalData.jl is as good as any other.

@cscherrer
Copy link

Thanks @cpfiffer . Of course I like TupleVectors too, but the whole SampleChains thing hasn't worked out as easily as I had hoped. I think it's yet another case where we really need an abstract interface.

To get there, I think it's important to consider different kinds of inference. A result might be an abstract vector of named tuples, but it could also be a Measure or Distribution (using VI or Laplace approximation) or even a continuous trajectory (using @mschauer's ZigZagBoomerang). For standard MCMC samples, TupleVectors works great. But to be more general, I think there just needs to be

  1. A "summarize this solution" with a consistent show method (I like the Measurements-like approach. but YMMV),
  2. a way to evaluate expectations, including an error term, and
  3. An abstraction for sampler-specific diagnostics

As an aside, I hope more packages can work toward an iterator interface rather than a fixed number of samples. But that's more of a community concern, very hard to enforce.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants