Initializing histograms from arrays of bin values #421

Open
alexander-held opened this issue Jul 15, 2020 · 14 comments

Comments

@alexander-held
Member

alexander-held commented Jul 15, 2020

Is it possible to set the bin contents (and variances, for storage=bh.storage.Weight()) of a bh.Histogram() directly without having to .fill() in events?

I assume that aghast could be used to create a histogram from this information and then convert it. Is there a more direct way?

For context: I am considering using bh.Histogram or possibly the hist version as a container for histograms in another library. For this I would like to have the flexibility to create histograms in multiple ways.

@henryiii
Member

Two ways:

import boost_histogram as bh
import numpy as np

h = bh.Histogram(bh.axis.Regular(10, 0, 1), storage=bh.storage.Weight())
values = np.ones(10)
variances = np.ones(10) * 0.1

Using the shortcut for Weight Histograms:

h[...] = np.stack([values, variances], axis=-1)

For an AxBxC histogram, you need an AxBxCx2 array; the last axis holds the value and the variance, in that order. You can optionally include the flow bins, and those will get set too.

Using the view method:

h.view().value = values
h.view().variance = variances

You can pass flow=True to .view() if you want to set the flow bins too.

@HDembinski
Member

HDembinski commented Jul 16, 2020

@alexander-held I use this feature myself. I fit invariant-mass distributions in eta and pt of some particle. Then I store the fitted particle yield and its uncertainty squared as "values" and "variances" in a histogram with Weight() storage. This way, the yields are not only nicely organized, but I can also add yields from fits of subsections of the data simply by adding the histograms.

@alexander-held
Member Author

Thanks for the quick reply! There is no way to instantiate a Histogram and fill the value and variance at the same time, right? Something like this:

h = bh.Histogram(bh.axis.Regular(10,0,1), storage=bh.storage.Weight(), values=values, variances=variances)

This might be convenient to have (maybe something for hist?).

@HDembinski
Member

No, but long constructors with many keywords are not a good design. It is not significantly more efficient to do this instead of assigning.

@tacaswell

Comment from the 🥜 gallery: if you are going to have many end-users all extending the class to add a sub-set of the keyword arguments you can end up with a mess of slightly different sub-classes floating around the world. It could be better to standardize on how to manage that bit of complexity. You could do something like

def __init__(self, ..., **kwargs):
     # current init
     for k, v in kwargs.items():
         setattr(self, k, v)

It is not much code and makes the API a bit more ergonomic for your users.
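
As a self-contained illustration of that pattern, here is a sketch on a hypothetical stand-in class (Container is not part of boost-histogram). The check that rejects unknown keywords is an extra assumption added here so that typos do not silently create new attributes.

```python
class Container:
    """Hypothetical stand-in for a histogram-like class."""

    def __init__(self, name, **kwargs):
        # "Current init": handle the explicitly named parameters first.
        self.name = name
        self.values = None
        self.variances = None
        # Forward the remaining keywords to existing attributes.
        for key, value in kwargs.items():
            if not hasattr(self, key):
                raise TypeError(f"unexpected keyword argument {key!r}")
            setattr(self, key, value)

c = Container("demo", values=[1.0, 2.0], variances=[0.1, 0.2])
```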

@alexander-held
Member Author

I was surprised that the default bh.storage.Double() storage changes the API and I could not use h.view().value anymore to set bin contents. h[...] = [...] still works. Does this mean I should prefer h[...] = ... over using .view()?

@henryiii
Member

henryiii commented Jul 17, 2020

.view() returns a view of the storage. It's always an ndarray, though for accumulator storages it is also record-array-like (RecArray/RecordArray, I forget the naming scheme), which allows attribute access into the dtype fields. That's why .view().value works. If you want to do the same thing with a simpler dtype, you have to set the contents (this is a Python limitation: you can't assign to a function call):

h.view()[...] = np.array(...)

This is actually good, because it emphasizes that you are changing the contents, rather than changing the object. You can do the same thing with .value as well; I believe it is view().value[...].

Note: I am horribly mixing meanings above. The first ... is the literal Python Ellipsis object, while the second one means "whatever you want to put in the array".

For a general way to do this regardless of the backend storage, see #423, but that won't work very well for, and is not designed for, setting values on Profiles. Maybe we should expose and provide nice constructors or shortcuts for the AccumulatorViews?

@henryiii
Member

PS: of course, you can use h[...] =, and it works well, though you have to build in an extra dimension for the weighted storage (or you can use the actual dtype), while the simpler storages have a simple dtype.

@HDembinski
Member

Comment from the 🥜 gallery: if you are going to have many end-users all extending the class to add a sub-set of the keyword arguments you can end up with a mess of slightly different sub-classes floating around the world. It could be better to standardize on how to manage that bit of complexity. You could do something like

def __init__(self, ..., **kwargs):
     # current init
     for k, v in kwargs.items():
         setattr(self, k, v)

It is not much code and makes the API a bit more ergonomic for your users.

Thank you @tacaswell for chiming in and making a valid point. My background is in C++ and there constructors with many arguments are frowned upon, hence my reaction...

@henryiii
Member

If we ever add the ability to use existing memory allocated in Python as the Storage's memory, this would suddenly make histograms initialized this way more efficient (and would have a side effect that the array passed in would start being the one that changes).

@henryiii
Member

Note: .view() does what it says it does - it returns a view of the underlying data. If it's non-simple, the view is non-simple, as it should be.
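
The difference between a simple and a non-simple view can be illustrated with plain NumPy. This is only an analogy for the mechanism: the dtype below is an assumption that borrows the field names used in this thread, and it is not boost-histogram's own view class.

```python
import numpy as np

# Simple storage analogue: one float per bin, no named fields.
simple = np.zeros(4)

# Accumulator storage analogue: two named fields per bin.
# The field names mirror the thread; the dtype itself is an assumption.
weighted = np.zeros(4, dtype=[("value", float), ("variance", float)])
rec = weighted.view(np.recarray)  # shares memory with `weighted`

# Attribute access reaches into the dtype fields and writes in place.
rec.value[...] = [1.0, 2.0, 3.0, 4.0]
rec.variance[...] = 0.25
```

Because rec is a view, the writes land in weighted itself; simple has no fields, so there is nothing for attribute access to reach.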

I've proposed an API for accessing the values() and variances() (if present) of all storages in #423. It comes up in the context of making a standard API for "PlottableHistogram"s, which it would be really nice for Scikit-HEP histograms to follow. It would also be useful, as in your case, as a standard way to get the values/variances regardless of the Histogram storage.

@henryiii
Member

henryiii commented Mar 9, 2021

I intend to implement this in Hist first (it's already available for pandas data frame storages, actually, but that one can't be upstreamed to boost-histogram because it depends on named axes). I'd like a classmethod, like the one used for that feature, but in this case having the other args (axes, storage, etc.) is useful, so it is probably better as a keyword-only argument instead of forwarding a lot of stuff.

@dcervenkov

Unless I'm mistaken, neither h[...] = np.stack([values, variances], axis=-1) nor h.view().value = values is mentioned in the documentation. These are very useful, so it would be great if you could add them.

@henryiii
Member

Update: this is supported in Hist (as you can see from the linked merged PR above). I'd be fine to upstream it (as with any non-dependency addition to Hist) if @HDembinski would like a data= argument in bh.Histogram. It would not reuse memory currently, but if we were able to add that later, you'd immediately get the benefit of it without changing your code.
