Skip to content
This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

optimized jagged operations #99

Closed
aminnj opened this issue Jul 10, 2018 · 17 comments
Closed

optimized jagged operations #99

aminnj opened this issue Jul 10, 2018 · 17 comments

Comments

@aminnj
Copy link

aminnj commented Jul 10, 2018

Hi,

First, thanks for making an excellent package!

I'm trying to get set up to run on CMS' nanoAOD format using uproot and the one main obstacle I'm facing is so-called jagged arrays.
For example, let's say I'm looking at a variable number of jets per event. I found that numpy has some related functionality to "reduce"
groups of entries in an array (reduceat, where I would just
need to feed in content, starts, stops for jagged arrays to get what I want. However, it seems that reduceat does not handle events with
0 jets (i.e., start = stop for that event) from this issue which still has not been resolved.

Of course, I could always just loop through in python, but that gets to be pretty slow. A couple of concrete examples of what I might want to do:

  • sum up the pT for jets with pT larger than x for each event
  • (more convoluted) for each electron, the index into the jets collection is stored, and I want to follow it to match a jet to each electron (and then divide the pTs, say)

I'm having to turn to numba to JIT some general jagged operations.

Now I'm wondering if you can offer advice on how to easily (and quickly) handle these jagged arrays. Or maybe I'm missing some feature of uproot that handles this already...

Thanks!
Nick

@jpivarski
Copy link
Member

Wow— this is precisely the goal of awkward-array project. Previously, I was trying to support this by presenting these arrays as objects that you can compute with using normal for loops, accelerated Numba (in OAMap), and while I still think it's necessary to have such access for particularly complex algorithms, a few weeks ago, I realized that we could extend Numpy idioms to provide exactly what you described without having install Numba (and therefore LLVM). These operations are fundamental things, unrelated to ROOT I/O, so it should be its own library. Also, I want to tackle chunked arrays (ROOT baskets and Arrow pages) the same framework: hence, awkward-array.

For your first bullet point, boolean-returning ufuncs like >, when applied to JaggedArrays, would return a JaggedArray of booleans that you can use as a mask on a JaggedArray same structure: jets[jets["pt"] > 30] to get a pruned set of jets. Then the .count() or ["pt"].sum() of that to reduce up to the event level (one value per event), which is a logical extension of flat array reduction a scalar.

Your second bullet point will involve a new IndexedArray type, development awkward-array.

awkward-array should be in a usable state with uproot depending on it later in the summer (once all these conferences are done). That will be uproot 3.0, which will also be adding write support. I'll leave this issue open until then because it's on target for the uproot 3.0 goals.

@aminnj
Copy link
Author

aminnj commented Jul 11, 2018

That's great, and exactly the general kind of operations I want! I'm looking forward to trying it out when it's usable.

Thanks!
Nick

@jpivarski
Copy link
Member

jpivarski commented Aug 10, 2018

Here is an interactive demo of the new functionality, which is being implemented in awkward-array (will be a dependency of uproot 3.0):

https://mybinder.org/v2/gh/scikit-hep/awkward-array/0.0.5?filepath=binder%2Fjagged-arrays.ipynb

It's very beta, but a lot of the manipulations I described above are possible now. Try it out!

@aminnj
Copy link
Author

aminnj commented Aug 10, 2018

Perfect! Played around with the notebook and I'm pretty sure I can use this to do what I mentioned in the original post (plus some more things I didn't list).

However, when I locally tried

from awkward import JaggedArray
ja = JaggedArray.fromiter([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5, 6.6, 7.7, 8.8, 9.9]])
print ja.sum()
print ja > 25

I got

[  3.3   0.    7.7  38.5]
Traceback (most recent call last):
  File "play_awkward.py", line 20, in <module>
    print ja > 25
  File "/home/users/namin/.local/lib/python2.7/site-packages/awkward/util.py", line 195, in func
    return ufunc(self, other)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

(in Python 2.7.11, since I was in a cmssw environment). With 3.6.2 on my laptop, I see

[[False False False] [] [False False] [False False False False False]]

as I'd expect. Does this only work in 3+? I installed with pip install git+https://github.com/scikit-hep/awkward-array@master --user.

@jpivarski
Copy link
Member

That's weird. But at this early stage, I expect weird.

The goal is to support Pythons 2.6, 2.7, and 3.4+. This could be related to Numpy version though. I'm going to be adding Numpy versions to the testing matrix, probably from 1.8 to the latest 1.15. I'm pretty sure I used a 1.8 feature, and the new 1.15 is different enough to raise a lot of warnings when I run it with Pandas.

@aminnj
Copy link
Author

aminnj commented Aug 11, 2018

Oops, I didn't think to list my numpy version above. I was using 1.12.1. I just now tried 1.14.5 with python 2.7 and the simple script works.

@jpivarski
Copy link
Member

I see: it's something that was added between Numpy 1.12.1 and 1.14.5 that fails in the very generic code on line 195 of awkward/util.py. That code came from something that was added in Numpy 1.15 (util.py is full of backports), so it's very likely too new. It'll be as important to support old Numpys as it is to support old Pythons, so I have some work to do in this area. (Here's a good reason for me to start using virtual environments: to catch these things before they hit the testing matrix!)

@aminnj
Copy link
Author

aminnj commented Aug 12, 2018

(Let me know if I should start moving this to the awkward repo)

I noticed one more thing with 1.14.5 that I claimed worked above. While it doesn't throw a ufunc error with 1.14.5, I don't think it's actually using numpy ufuncs when I tried min and max on jagged arrays. Sum seems fine, however. For example, with this test script, I got

awkward 0.0.5
numpy 1.14.5
numba 0.38.1
python sys.version_info(major=2, minor=7, micro=11, releaselevel='final', serial=0)
sum -- numba jit: 27.27MHz, numpy/awkward: 7.25MHz, python: 0.19MHz
min -- numba jit: 2.10MHz, numpy/awkward: 0.10MHz, python: 0.12MHz

and (with python3 + numpy 1.15)

awkward 0.0.5
numpy 1.15.0
numba 0.38.1
python sys.version_info(major=3, minor=6, micro=2, releaselevel='final', serial=0)
sum -- numba jit: 55.57MHz, numpy/awkward: 13.65MHz, python: 0.31MHz
min -- numba jit: 41.30MHz, numpy/awkward: 0.13MHz, python: 0.14MHz

The numpy/awkward sum is a lot faster than doing a loop over jagged entries in python, but still ~5x slower than jitting the whole operation. I don't trust myself to not be cheating in the jit comparison, though. For min/max, numpy is as slow as python.

@jpivarski
Copy link
Member

Good catch— it is a Python for loop. min and max are the only things that do a Python for loop, rather than Numpy operations. There's a way to do it with a modified prefix reduction (cumsum), but not the standard one.

The JaggedArray.sum method is Numpythonic, because I use a standard cumsum and then subtract off the excess. I can't do the same thing with min and max because they have no inverse operation— it literally works because addition is a group and minimization/maximization is a semigroup (or monoid because I claim that infinity is the identity of minimization; minus infinity for maximization). Those mathematical properties matter.

Accelerating min and max will require compiled code (probably Cython). In fact, it's better than that because our GSoC student found ways to vectorize these operations that don't appear to be vectorizable. (Just vectorizing cumsum on its own yields a factor of 10 over Numpy, and most of the tricks Jaydeep discovered were modifications of the prefix sum.) But the base version of awkward-array only depends on Numpy— a version with compiled dependencies, taking advantage of vectorization, will be an optional add-on. Code execution speed and ease of installation are trade-offs for the user to make.

It's good to see that my intuition of using Numpy everywhere possible pays off. It's too bad that this one (or two, depending on how you count) operation can't be expressed in that way.

@jpivarski
Copy link
Member

Actually, there's no guarantee that a Numpythonic solution to JaggedArray.min and max doesn't exist— at the last minute, I found such a solution for JaggedArray.parents (assuming that starts and stops are contiguous). Until then, I had to say that there were two methods with Python for loops— now there's only one.

If you have any ideas, they're more than welcome.

@aminnj
Copy link
Author

aminnj commented Aug 12, 2018

Thanks for the explanation. It makes sense. I guess I should have checked the code first, because it's clear min/max are loops...

Playing around a bit, I found that

def ufunc_reduce(ufunc, arr, initial=0.0):
    # np.ufunc.reduceat doesn't handle empty slices properly because there's no identity assumed
    indices = np.insert(arr.counts[:-1].cumsum(),0,0)
    out = ufunc.reduceat(arr.content,indices)
    # override the empty slices with our identity
    out[arr.counts == 0] = initial
    return out

seems to work ok when I give it things like np.add, np.minimum, ... since we specify an identity element.

@jpivarski
Copy link
Member

You're right! I forgot about ufunc.reduceat! This is the second time sometime pointed me to it, and it's definitely the right thing to do.

When I previously looked at it, I saw that it doesn't handle indexes exactly the same way we do, but that's just edge effects (might have to handle the last subarray separately).

I'll switch the implementation to one based on ufunc.reduceat as soon as possible. Thanks!

@jpivarski
Copy link
Member

This should do it: scikit-hep/awkward-0.x@c89e13a

JaggedArray.min and max now use reduceat (if the indexes form an offset, which will be true in all cases you're likely to meet), and JaggedArray.argmin and argmax use another trick because argmin/argmax are not ufuncs with a reduceat method.

JaggedArray.argmin and argmax will likely be more important than the equivalent min and max because they can be used as jagged fancy indexes. For example, you'd want to take argmax of pT with some cut (mask) and then someotherjagged[argmax_result] to get some other jagged quantity for those maximum pT particles.

(Answering a much earlier question: I don't think it's a problem to discuss awkward-array on the uproot repo because awkward-array isn't even "released" yet— I haven't made any announcements about it and there probably aren't many people looking at it yet. At first, its primary application will be in uproot, so it's relevant to this group of people anyway.)

@aminnj
Copy link
Author

aminnj commented Aug 13, 2018

Thanks! I pulled from master and now I can see compiled speeds. Though to get things like arr > 25. to return a jagged array of indices (which was failing in 1.12.1) I had to locally modify https://github.com/scikit-hep/awkward-array/blob/c89e13a30fcc5037201dffaaf0c3abc35ab27402/awkward/util.py#L195 to be

return self.copy(content=ufunc(self.content, other))

...not very general, but it lets me have a nice workflow with my current environment. Also, that ValueError from above might be related to this.

@jpivarski
Copy link
Member

I found the issue: the crucial feature that makes JaggedArrays usable by Numpy in ufuncs is its recognition of the __array_ufunc__ method in my class. It started checking for this method in version 1.13.1. (There was an earlier protocol using a method called __numpy_ufunc__ (Numpy 1.11.0), but neither I nor anyone on StackOverflow could get it to work; the Numpy group declared it "broken," and it's hard to even find documentation on it. As far as I'm concerned, it doesn't exist.)

Without __array_ufunc__, Numpy converts my JaggedArrays into Numpy object arrays, which have a different behavior (none of the jagged broadcasting, for instance). All of the different behaviors observed in old Numpy versions traced to this.

Requiring 1.13.1 and above fixes everything. After plugging away at it all day, I've concluded that I have to require at least this version— the whole idea of awkward-array is based on it. Transitively, this means that uproot 3.0 will require Numpy 1.13.1 and above (July 2017). It also means there's no point in me supporting Python 2.6 anymore.

(Yay! I get to use dict comprehensions!)

@jpivarski
Copy link
Member

jpivarski commented Aug 30, 2018

Take a look at

https://mybinder.org/v2/gh/scikit-hep/uproot/master?filepath=binder%2Fversion-3-features.ipynb

A pre-release of uproot 3 is available, and this notebook demonstrates the new jagged array features.

@jpivarski
Copy link
Member

Since this was asking for the above, I'll close the issue now, even though it's only in pre-release.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants