Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to add a value_range #482

Open
HDembinski opened this issue Dec 10, 2020 · 6 comments
Open

Proposal to add a value_range #482

HDembinski opened this issue Dec 10, 2020 · 6 comments

Comments

@HDembinski
Copy link
Member

It may be useful to have a way to do an inclusive selection on a value range which follows the rule "keep all bins that overlap with the value range".

I think the clean way of introducing that is via a special slice-like object. I suggest value_range as a placeholder for a proper name. You would use it instead of slice inside __getitem__.

h[value_range(None, upper)] # selects all bins up to and including the bin containing upper
h[value_range(lower, None)] # selects all bins starting with the bin containing lower
h[value_range(lower, upper)] # selects all bins starting with the bin containing lower up to and including the bin container upper

bh.sum and bh.rebin and friends are supported as an optional third argument. value_range can be transformed internally into an ordinary slice.

@HDembinski
Copy link
Member Author

I am not super committed to this proposal, I mainly added this here as an anchor for discussion. I am personally fine with not adding too many ways of doing the same thing, bh.loc probably does the job just fine.

@henryiii
Copy link
Member

I think we should make sure UHI is powerful enough that users could create a smart locator - see #485. This would greatly complicate UHI by having to track multiple types of slices and/or having a slice transform step, and would not be composable like the current UHI objects are (mixing bins and values is very useful in one special case - start to value or value to end; these are not easy to write any other way - None includes flow). It also does not have Python syntax support like slices do; that's why we use getitem in the first place over a random Python method; and for UHI takes advantage of this unless you want to use dict-based access (if the keyword getitem PEP gets accepted, you won't need verbose syntax for that anymore either!).

@HDembinski
Copy link
Member Author

HDembinski commented Dec 22, 2020

The problem with a "smart locator" is that it breaks some fundamental design principles. In well designed interfaces, everything is plain and obvious and consistent. We hate it when stuff happens behind the scenes, like global state is modified, objects that depend on each other without being explicitly coupled, etc.

A "smart locator" knows whether it is passed to __getitem__ or to a slice and changes the returned index accordingly. This is not at all what you expect. The locator purpose is to distinguish an index from a value, to signal to the histogram that it has to convert the value to an index before it can do anything with it. We made lots of compromises with respect to how Boost.Histogram works to get the behavior of indices and slices exactly how a Python user expects them to be.

Now you want to break all that with a "smart locator" that is context-aware and behaves differently in a slice. It is an abomination and it goes against all the simpleness and consistency that we worked towards.

Replacing the slice object like I propose here is the right design. If we do that, then we are not bound by any preconceptions on how slices and indexing works in Python. The value_range or you could also call it value_slice is not context-aware, because it does not have to be.

@HDembinski
Copy link
Member Author

HDembinski commented Dec 22, 2020

@jpivarski Henry wants smart locators, I think they are a bad idea. @henryiii seems to believe that you were also in favor of "context-aware" locators, but I think it goes against consistency and what is idiomatic Python, which you have also favored in the past. If Henry insists on his context-aware locators, we need another call to arbitrate, because I am strongly against them.

@henryiii
Copy link
Member

Tiny clarification: I want the ability for users to write smart locators; I'm proposing the default locators like bh.loc stay "dumb". With one exception: I have been long bothered by this:

h[bh.overflow] # -> overflow bin
h[bh.underflow:bh.overflow] # returns the whole histogram (normal slicing always bring along flow bins anyway)
h[bh.underflow:bh.overflow:sum] # Does not include the overflow! Slices do not include the endpoint, so no overflow bin

h[::sum] # correct syntax

If a locator can detect where it is in the slice, we can fix bh.overflow to always mean the overflow bin, regardless of where it is. If we add smart locators, then this finally can be fixed.

Users were not supposed to use these this way, but I fear that some users, wanting to be explicit, will find bh.underflow/bh.overflow and think it's a great way to be clear to readers that you are including flow bins. Which actually works everywhere except for a reduction over a slice (where it's really hard to debug, since you can't count the number of bins returned - they are summed together, the value you get is just smaller than expected).

At the end of our meeting, we said we wanted bh.loc to not include the flow bins inside a slice, and to include them outside a slice. That requires smart locators. I've relented on bh.loc, I think it's slightly better, but since you have to know about handling the upper bin anyway, I think it's best to keep them simple.

@jpivarski
Copy link
Member

If we're talking about "the ability for users to write" smart locators, then absolutely yes. We can apply maintainability and design constraints to ourselves and our own libraries, but restricting a user's freedom is known as an "opinionated library," usually in a negative light. Most applications are short-lived: maintainability should not be the primary concern for code that I write in my /tmp directory to test something.

The ability to write locators is a public interface that should allow for gradual complexity: simple things should be easy and complex things should be possible. So, for instance, we shouldn't require locator functions to take this "placement context" as an argument because most locators aren't going to use it. Attaching it to the object that gets passed in sounds like a great solution because users who write locators only have to think about it if they need it.


On a different topic, the standard set of locators provided by boost-histogram, such as bh.loc: I think they should be the easiest thing to describe and understand by a user. (Note that this is a different "user," one who is only slicing histograms, not one who is writing locators.) But "easiest to describe and understand" doesn't necessarily mean context free. Humans don't think in the simplest possible terms. To illustrate, consider the fact that LISP has simpler syntax than just about any other programming language, but humans prefer the complexity of infix notation, statements, and such: the whole field of programming language syntax is one of making computers complex enough for humans to understand. Mathematical simplicity is often easier for humans to think about—programs are not written in natural human languages—to a degree, in most situations.

Coming back to the locators, this means they should be highly constrained by mathematical simplicity, but mathematical simplicity should not win over user expectations. As a user, if

h[bh.underflow:bh.overflow]

returns the whole histogram, I would expect that

h[bh.underflow:bh.overflow:sum]

sums over the whole histogram. If bh.overflow has to be an exception to the "exclude endpoint" rule, then it should be a well-documented exception, even if it requires some minimal context-awareness to make that happen. As a human, I can accept bh.overflow having a special rule. That's why it has a special spelling: it was to make this bin different because in the course of doing data analysis, I think about this bin (and the underflow) in a special way.

If this sounds like the start of a slippery slope, we can just insist that it won't be and apply some self-constraint. Having the ability to use context doesn't mean that we will be tempted to use context just because it's there. Each rule can be approached from the point of view of "what would a user expect?" and the answer to that question will usually veer toward mathematical simplicity. This, I would argue, was the guiding principle in designing the Python language, not the insistence on consistency. I can think of much more mathematical languages than Python (Haskell leaps to mind). The choices Python made to optimize user expectation usually also optimize mathematical simplicity (such as "only one way to do it... mostly"), but as an instrumental good that serves the intrinsic good of making it easier for programmers to write programs. Python doesn't look like Perl because mathematical simplicity is highly correlated with ease of programming, but it doesn't look like Haskell either (or LISP, or Forth, ...) because these two goals aren't perfectly correlated.

@gohil-jay gohil-jay added good first issue Good for newcomers and removed good first issue Good for newcomers labels Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants