optimized jagged operations #99
Comments
Wow, this is precisely the goal of the awkward-array project. Previously, I was trying to support this by presenting these arrays as objects that you can compute with using normal for loops, accelerated by Numba (in OAMap). While I still think it's necessary to have such access for particularly complex algorithms, a few weeks ago I realized that we could extend Numpy idioms to provide exactly what you described without having to install Numba (and therefore LLVM). These operations are fundamental things, unrelated to ROOT I/O, so they should be in their own library. Also, I want to tackle chunked arrays (ROOT baskets and Arrow pages) in the same framework: hence, awkward-array. For your first bullet point, boolean-returning ufuncs like Your second bullet point will involve a new IndexedArray type, development awkward-array. awkward-array should be in a usable state, with uproot depending on it, later in the summer (once all these conferences are done). That will be uproot 3.0, which will also add write support. I'll leave this issue open until then because it's on target for the uproot 3.0 goals. |
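To make the "extend Numpy idioms" idea concrete, here is a minimal sketch in plain NumPy. It is not awkward-array's implementation: the names `content`, `counts`, `parents`, and `n_pass` are assumptions chosen for illustration, and the flat-content-plus-counts layout is only one way to represent a jagged array. It applies a boolean-returning ufunc to the flat content and then counts, per event, how many entries pass.

```python
import numpy as np

# Hypothetical jagged data: 4 events with 3, 0, 2 and 5 entries each,
# stored as one flat content array plus per-event counts.
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
counts = np.array([3, 0, 2, 5])

# An elementwise, boolean-returning ufunc acts on the flat content;
# the jagged structure (counts) is unchanged by it.
mask = content > 2.5

# Map every entry to the index of the event it belongs to.
parents = np.repeat(np.arange(len(counts)), counts)

# Count passing entries per event; empty events naturally get 0.
n_pass = np.bincount(parents[mask], minlength=len(counts))
any_pass = n_pass > 0
```

Because `np.bincount` groups by event index rather than by slice boundaries, events with no entries need no special handling here.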
That's great, and exactly the general kind of operations I want! I'm looking forward to trying it out when it's usable. Thanks! |
Here is an interactive demo of the new functionality, which is being implemented in awkward-array (will be a dependency of uproot 3.0): https://mybinder.org/v2/gh/scikit-hep/awkward-array/0.0.5?filepath=binder%2Fjagged-arrays.ipynb It's very beta, but a lot of the manipulations I described above are possible now. Try it out! |
Perfect! I played around with the notebook and I'm pretty sure I can use this to do what I mentioned in the original post (plus some more things I didn't list). However, when I locally tried

from awkward import JaggedArray
ja = JaggedArray.fromiter([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5, 6.6, 7.7, 8.8, 9.9]])
print(ja.sum())
print(ja > 25)

I got
(in Python 2.7.11, since I was in a cmssw environment). With 3.6.2 on my laptop, I see
as I'd expect. Does this only work in 3+? I installed with |
That's weird. But at this early stage, I expect weird. The goal is to support Pythons 2.6, 2.7, and 3.4+. This could be related to Numpy version though. I'm going to be adding Numpy versions to the testing matrix, probably from 1.8 to the latest 1.15. I'm pretty sure I used a 1.8 feature, and the new 1.15 is different enough to raise a lot of warnings when I run it with Pandas. |
Oops, I didn't think to list my numpy version above. I was using 1.12.1. I just now tried 1.14.5 with python 2.7 and the simple script works. |
I see: the very generic code on line 195 of awkward/util.py depends on something that was added between Numpy 1.12.1 and 1.14.5, so it fails on the older version. That code came from something that was added in Numpy 1.15 (util.py is full of backports), so it's very likely too new. It'll be as important to support old Numpys as it is to support old Pythons, so I have some work to do in this area. (Here's a good reason for me to start using virtual environments: to catch these things before they hit the testing matrix!) |
(Let me know if I should start moving this to the awkward repo) I noticed one more thing with 1.14.5 that I claimed worked above. While it doesn't throw a ufunc error with 1.14.5, I don't think it's actually using numpy ufuncs when I tried min and max on jagged arrays. Sum seems fine, however. For example, with this test script, I got
and (with python3 + numpy 1.15)
The numpy/awkward sum is a lot faster than doing a loop over jagged entries in python, but still ~5x slower than jitting the whole operation. I don't trust myself to not be cheating in the jit comparison, though. For min/max, numpy is as slow as python. |
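For anyone who wants to reproduce this kind of comparison, here is a rough sketch of a per-event sum done two ways, with a correctness check before timing. It is not the test script referred to above: the data are randomly generated, the names `loop_sum` and `vectorized_sum` are made up for this example, and no Numba variant is included.

```python
import timeit
import numpy as np

# Hypothetical jagged data: 10000 events with Poisson-distributed sizes.
rng = np.random.RandomState(0)
counts = rng.poisson(3.0, size=10000)
content = rng.normal(size=counts.sum())
stops = np.cumsum(counts)
starts = stops - counts

def loop_sum():
    # Python-level loop over events; an empty slice sums to 0.0.
    return np.array([content[a:b].sum() for a, b in zip(starts, stops)])

def vectorized_sum():
    # Group entries by event index and add them up in compiled code.
    parents = np.repeat(np.arange(len(counts)), counts)
    return np.bincount(parents, weights=content, minlength=len(counts))

# Check that both give the same answer before comparing speed.
assert np.allclose(loop_sum(), vectorized_sum())

t_loop = timeit.timeit(loop_sum, number=3)
t_vec = timeit.timeit(vectorized_sum, number=3)
print("loop: %.4f s, vectorized: %.4f s" % (t_loop, t_vec))
```

Checking agreement first guards against the "cheating" worry: a fast version that computes something different is not a fair comparison.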
Good catch— it is a Python for loop. The Accelerating It's good to see that my intuition of using Numpy everywhere possible pays off. It's too bad that this one (or two, depending on how you count) operation can't be expressed in that way. |
Actually, there's no guarantee that a Numpythonic solution to If you have any ideas, they're more than welcome. |
Thanks for the explanation. It makes sense. I guess I should have checked the code first, because it's clear min/max are loops... Playing around a bit, I found that

import numpy as np

def ufunc_reduce(ufunc, arr, initial=0.0):
    # np.ufunc.reduceat doesn't handle empty slices properly because there's no identity assumed
    indices = np.insert(arr.counts[:-1].cumsum(), 0, 0)
    out = ufunc.reduceat(arr.content, indices)
    # override the empty slices with our identity
    out[arr.counts == 0] = initial
    return out

seems to work ok when I give it things like |
You're right! I forgot about When I previously looked at it, I saw that it doesn't handle indexes exactly the same way we do, but that's just edge effects (might have to handle the last subarray separately). I'll switch the implementation to one based on |
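One way to deal with the edge effects mentioned above is sketched below. This is not the implementation that went into awkward-array; `jagged_reduce` and its parameters are hypothetical names, and it assumes `content` holds exactly `counts.sum()` entries in event order. It only passes the starts of non-empty subarrays to `reduceat`, so a trailing empty subarray never produces an out-of-range index.

```python
import numpy as np

def jagged_reduce(ufunc, content, counts, identity):
    """Reduce each subarray with `ufunc`, filling empty ones with `identity`."""
    content = np.asarray(content)
    counts = np.asarray(counts)

    # Start with the identity everywhere; empty subarrays keep this value.
    out = np.full(len(counts), identity, dtype=np.result_type(content, identity))

    nonempty = counts > 0
    if nonempty.any():
        starts = np.cumsum(counts) - counts
        # Empty subarrays contribute no entries, so consecutive non-empty
        # starts are strictly increasing and delimit exactly one subarray each.
        out[nonempty] = ufunc.reduceat(content, starts[nonempty])
    return out

content = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
counts = [3, 0, 2, 5, 0]   # note the empty subarray at the end

sums = jagged_reduce(np.add, content, counts, 0.0)
mins = jagged_reduce(np.minimum, content, counts, np.inf)
maxs = jagged_reduce(np.maximum, content, counts, -np.inf)
```

The identity has to be chosen per operation (0 for sums, +inf for minima, -inf for maxima), since `reduceat` itself does not assume one.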
This should do it: scikit-hep/awkward-0.x@c89e13a
(Answering a much earlier question: I don't think it's a problem to discuss awkward-array on the uproot repo because awkward-array isn't even "released" yet— I haven't made any announcements about it and there probably aren't many people looking at it yet. At first, its primary application will be in uproot, so it's relevant to this group of people anyway.) |
Thanks! I pulled from master and now I can see compiled speeds. Though to get things like
...not very general, but it lets me have a nice workflow with my current environment. Also, that ValueError from above might be related to this. |
I found the issue: the crucial feature that makes JaggedArrays usable by Numpy in ufuncs is its recognition of the Without Requiring 1.13.1 and above fixes everything. After plugging away at it all day, I've concluded that I have to require at least this version— the whole idea of awkward-array is based on it. Transitively, this means that uproot 3.0 will require Numpy 1.13.1 and above (July 2017). It also means there's no point in me supporting Python 2.6 anymore. (Yay! I get to use dict comprehensions!) |
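A version requirement like this can be checked at import time. The sketch below is only an illustration, not code from awkward-array or uproot: `MIN_NUMPY`, `parse_version`, and `numpy_is_supported` are hypothetical names, and the parser deliberately looks only at the leading numeric components of the version string.

```python
import re
import numpy as np

# Minimum Numpy version stated above.
MIN_NUMPY = (1, 13, 1)

def parse_version(text):
    """Return the leading major.minor[.patch] of a version string as ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognized version string: %r" % text)
    # A missing patch component is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())

def numpy_is_supported(version=None):
    """True if `version` (default: the installed Numpy) meets the minimum."""
    if version is None:
        version = np.__version__
    return parse_version(version) >= MIN_NUMPY

if not numpy_is_supported():
    raise ImportError("Numpy %s or later is required" % ".".join(map(str, MIN_NUMPY)))
```

With this check, the two versions discussed earlier in the thread fall on opposite sides: 1.12.1 is rejected and 1.14.5 is accepted.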
Take a look at https://mybinder.org/v2/gh/scikit-hep/uproot/master?filepath=binder%2Fversion-3-features.ipynb A pre-release of uproot 3 is available, and this notebook demonstrates the new jagged array features. |
Since this was asking for the above, I'll close the issue now, even though it's only in pre-release. |
Hi,
First, thanks for making an excellent package!
I'm trying to get set up to run on CMS' nanoAOD format using uproot and the one main obstacle I'm facing is so-called jagged arrays.
For example, let's say I'm looking at a variable number of jets per event. I found that numpy has some related functionality to "reduce" groups of entries in an array (reduceat), where I would just need to feed in content, starts, stops for jagged arrays to get what I want. However, it seems that reduceat does not handle events with 0 jets (i.e., start = stop for that event), according to this issue, which still has not been resolved.
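The empty-event behaviour can be seen with a small sketch. The arrays here are made up for illustration; `content`, `starts`, and `stops` simply follow the naming used above.

```python
import numpy as np

# Three events: two jets, zero jets, three jets.
content = np.array([1, 2, 3, 4, 5])
starts = np.array([0, 2, 2])
stops = np.array([2, 2, 5])

# reduceat only takes the starts. When a start is not smaller than the
# next one, it returns the single element at that start instead of an
# identity value, so the empty event does not come out as 0.
naive = np.add.reduceat(content, starts)

# Reference answer from an explicit (slow) Python loop.
expected = np.array([content[a:b].sum() for a, b in zip(starts, stops)])

print(naive)     # the empty event picks up content[2]
print(expected)  # the empty event sums to 0
```

Only the middle entry differs between the two results, which is why the problem is easy to miss on data without empty events.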
Of course, I could always just loop through in python, but that gets to be pretty slow. A couple of concrete examples of what I might want to do:
I'm having to turn to numba to JIT some general jagged operations.
Now I'm wondering if you can offer advice on how to easily (and quickly) handle these jagged arrays. Or maybe I'm missing some feature of uproot that handles this already...
Thanks!
Nick