Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unzip: the inverse of zip #33515

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

unzip: the inverse of zip #33515

wants to merge 6 commits into from

Conversation

StefanKarpinski
Copy link
Sponsor Member

@StefanKarpinski StefanKarpinski commented Oct 9, 2019

Closes #13942. Alternative to #33324, with apologies to @bramtayl, who did a lot of work on unzip.

@StefanKarpinski StefanKarpinski added needs compat annotation Add !!! compat "Julia x.y" to the docstring needs news A NEWS entry is required for this change needs tests Unit tests are required for this change labels Oct 9, 2019
@JeffBezanson
Copy link
Sponsor Member

It seems better to return a tuple. Maybe something like

function unzip(itr)
    c = collect(itr)
    ntuple(i->map(t->getfield(t,i),c), Val(length(c[1])))
end

Of course the downside to that is it uses more temporary space. We could optimize it later by hooking in to the collect machinery. For example wrapping itr in an Unzip type and defining Base._similar_for to return a StructArray for an Unzip should magically work. I suppose we'd have to add a minimal StructArray-like type in Base.

@quinnj
Copy link
Member

quinnj commented Oct 9, 2019

I've really enjoyed using https://github.com/piever/StructArrays.jl/ and seeing it mature, FWIW.

@JeffBezanson
Copy link
Sponsor Member

Definitely. If we wanted to add things to Base, StructArrays would be on my short list. It's arguably the most useful general "shape" of container that's not really provided by Base.

@StefanKarpinski
Copy link
Sponsor Member Author

It was simple to change this implementation to return a tuple: just convert it at the end. Since one expects the number of collections returned to be small, this should be quite cheap.

base/iterators.jl Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Sponsor Member Author

I think this is about the most straightforward eager unzip that doesn't allocate a lot of temporary data and iterates the iterators in the expected order. Do we want this or something else? I'll write tests, NEWS, etc. if we do.

The potential alternative that makes sense to me would be to have Unzip produce a tuple of iterators such that the first one produces that first values of each of the iterators in itrs, the second one produces the second value in each of the iterators in itrs, etc., but as was pointed out here, that's fundamentally incompatible with the order one wants to consume the iterators in. That is, in order to support consuming from the nth iterator before the earlier ones and still have them yield values, you'd need to consume and save all the values from earlier iterators, at which point you might as well just do this whole thing eagerly, which is what this PR does.

@StefanKarpinski
Copy link
Sponsor Member Author

One possible variation would be to call promote_type instead of promote_typejoin which would change this example to produce a concrete Float64 vector instead of an abstract Real one:

julia> unzip([[1, "apple"], [2.5, "orange"], [0, "mango"]])
([1.0, 2.5, 0.0], ["apple", "orange", "mango"])

@bramtayl
Copy link
Contributor

How would the performance of this compare to my version?

@StefanKarpinski
Copy link
Sponsor Member Author

I haven't compared the performance, but if the goal is to get a tuple of unzipped vectors, this ought to be about as fast as possible since it doesn't do any extraneous work. Do you have a test case that you've been doing benchmarking on?

@JeffBezanson
Copy link
Sponsor Member

The performance issues I'd expect here would be the lack of type info in vecs, and that push! is always used even when the size is known (sizehint helps of course but push will probably still have a little overhead). StructArray has been designed so that you can collect into it without those issues.

@StefanKarpinski
Copy link
Sponsor Member Author

True. I could rewrite it so that it collects into a tuple and replaces the whole tuple if the eltype of any of the slots needs to change with the usual recursive trick.

@StefanKarpinski
Copy link
Sponsor Member Author

Here's a version that does the widening recursion trick. This should be fast in the case where itrs already returns tuples since inference should be able to reason about that. I'm not sure about the case where each itr yielded by itrs is not a tuple, since it will generally be tricky for the compiler to reason about. It's possible that having a function barrier around the inner for loop here would help in that case, although you'd still be doing a dynamic dispatch per iteration of itrs. What do you think, @JeffBezanson?

@vtjnash
Copy link
Sponsor Member

vtjnash commented Oct 15, 2019

you'd still be doing a dynamic dispatch per iteration

You're doing that anyways, because of the call to zip. When all tuple eltypes are concrete, inference should elide it (via isa-branch splitting). If you used a loop over ii = eachindex(vecs) instead, it might do better, although Tuple(itr) was also already a dynamic call, and vals[ii] would still need to hit the runtime to lookup the layout (although not a dynamic call), so it's not necessarily actually any better.

@StefanKarpinski
Copy link
Sponsor Member Author

My main thought was to avoid any dynamic dispatch on each loop iteration and only do it once before the loop. I also figured that if the compiler knows about the length and type of each slot in vals then it will also know the length and types of eltypes and should be able to unroll the entire loop ideally. Otherwise, yes, you're likely dispatching all the time, in which case my original vector-based generic version is likely just as fast and doesn't tax the compiler as much.

@StefanKarpinski
Copy link
Sponsor Member Author

Neither not using zip in the for loop, nor factoring out the inner unzip loop seem to make any performance difference: when the "row type" is something inferrable like a tuple, it's fast, when it's something uninferrable like an array, it's slow. So the moral of the story is that if you're going to unzip something, make sure the inner collection types is tuple—which, fortunately, is what zip gives produces, so the natural use case of using zip, filtering or transforming tuples, and then unzipping to get "columns" back, is the fast case.

@StefanKarpinski
Copy link
Sponsor Member Author

Using the following test data:

itrs1 = collect(enumerate(eachline("/usr/share/dict/words")))
itrs2 = map(t->[t...], itrs1)
itrs3 = collect(enumerate(rand(1:100, 10^6)))
itrs4 = map(t->[t...], itrs3)

Benchmark results on this branch:

julia> @benchmark unzip($itrs1)
BenchmarkTools.Trial:
  memory estimate:  10.78 MiB
  allocs estimate:  470760
  --------------
  minimum time:     14.655 ms (0.00% GC)
  median time:      16.224 ms (0.00% GC)
  mean time:        16.807 ms (3.20% GC)
  maximum time:     23.046 ms (29.50% GC)
  --------------
  samples:          298
  evals/sample:     1

julia> @benchmark unzip($itrs2)
BenchmarkTools.Trial:
  memory estimate:  147.54 MiB
  allocs estimate:  4715666
  --------------
  minimum time:     522.516 ms (1.65% GC)
  median time:      546.774 ms (3.06% GC)
  mean time:        563.799 ms (2.50% GC)
  maximum time:     691.221 ms (1.36% GC)
  --------------
  samples:          10
  evals/sample:     1

julia> @benchmark unzip($itrs3)
BenchmarkTools.Trial:
  memory estimate:  61.03 MiB
  allocs estimate:  1999495
  --------------
  minimum time:     29.845 ms (0.00% GC)
  median time:      38.098 ms (16.04% GC)
  mean time:        36.911 ms (12.67% GC)
  maximum time:     44.987 ms (20.47% GC)
  --------------
  samples:          136
  evals/sample:     1

julia> @benchmark unzip($itrs4)
BenchmarkTools.Trial:
  memory estimate:  564.54 MiB
  allocs estimate:  17997947
  --------------
  minimum time:     1.594 s (3.17% GC)
  median time:      1.604 s (3.55% GC)
  mean time:        1.616 s (3.69% GC)
  maximum time:     1.661 s (4.44% GC)
  --------------
  samples:          4
  evals/sample:     1

Benchmark results on #33324:

julia> @benchmark unzip($itrs1)
BenchmarkTools.Trial:
  memory estimate:  3.60 MiB
  allocs estimate:  6
  --------------
  minimum time:     1.094 ms (0.00% GC)
  median time:      2.678 ms (0.00% GC)
  mean time:        2.573 ms (11.13% GC)
  maximum time:     12.254 ms (70.55% GC)
  --------------
  samples:          1938
  evals/sample:     1

julia> @benchmark unzip($itrs2)
ERROR: BoundsError: attempt to access 0-element Base.Rows{Tuple{},1,Tuple{}} at index [Base.OneTo(235886)]

julia> @benchmark unzip($itrs3)
BenchmarkTools.Trial:
  memory estimate:  15.26 MiB
  allocs estimate:  6
  --------------
  minimum time:     1.454 ms (0.00% GC)
  median time:      2.490 ms (0.00% GC)
  mean time:        3.486 ms (28.96% GC)
  maximum time:     12.152 ms (56.13% GC)
  --------------
  samples:          1432
  evals/sample:     1

julia> @benchmark unzip($itrs4)
ERROR: BoundsError: attempt to access 0-element Base.Rows{Tuple{},1,Tuple{}} at index [Base.OneTo(1000000)]

So @bramtayl's version does have really impressive performance in the inferrable case, but it fails in the non-inferrable case. I guess that means that this code can stand some optimization.

@StefanKarpinski
Copy link
Sponsor Member Author

Basically, it comes down to being tricky to convince the compiler to unroll the inner unzip loop even though it knows the length and type of the value tuple and the tuple of vectors to write into.

@pdeffebach
Copy link
Contributor

I'm going to give this a friendly bump. It would be nice to have in 1.7.

@JeffBezanson
Copy link
Sponsor Member

If we put this in Base, I do think it should be inferable and as fast as possible, especially since some good implementations (for particular cases) are already available in packages.

If I'm reading it right the big issue with the implementation in #33515 (comment) is that it does multiple passes over the iterator, and so is not general enough.

I will try to experiment with the StructArrays kind of approach I was talking about above.

@bramtayl
Copy link
Contributor

https://github.com/bramtayl/Unzip.jl exists and is registered (I think it's pretty performant, but I'm sure there's ways to speed it up more)

@JeffBezanson
Copy link
Sponsor Member

Yes, that's the approach I'm talking about. Unzip.jl seems quite good --- fast, infers, and no generated functions. StructArrays.jl does the same thing as well, but it is astonishing how different the code is. It uses some generated functions, and also re-implements some of the Base collect machinery (I believe to add features like allowing tuples and named tuples interchangeably?)

@bramtayl
Copy link
Contributor

Well, make_columns in LightQuery handles NamedTuples https://github.com/bramtayl/LightQuery.jl/blob/master/src/make_columns.jl too, also sans generated functions (still with a @pure function though), using fairly similar code. But the implementation there doesn't work with tuples, and Unzip doesn't work with NamedTuples.

@JeffreySarnoff
Copy link
Contributor

I am a proponent of allowing Tuples and NamedTuples interchangeably.

@pdeffebach
Copy link
Contributor

I noted this in an issue at Unzip, but I should note that my main use-cast is when I do

map(x) do _
    (1, 2)
end

i.e. I use map and return a vector of tuples, but want a tuple of vectors. So maybe this is an XY problem. I would probably not see a huge benefit from a highly optimized unzip if what I really need is a map that returns Tuple.

@bramtayl
Copy link
Contributor

bramtayl commented Dec 15, 2020

Agreed it might be nice for it to work with tuples or namedtuples. What would be even cooler would be to merge namedtuples and tuples into the same thing, where you could refer to items by name if they have it, or else just by number. Something like ((name"a", 1), 2, (name"c", 3))

@JeffBezanson
Copy link
Sponsor Member

That design would certainly have its own trade-offs; I don't think it's universally better.

@bramtayl
Copy link
Contributor

Out of curiosity, what would be the downsides?

@JeffreySarnoff
Copy link
Contributor

While it may make sense to have

abstract type AbstractTuple end
Tuple <: AbstractTuple
NamedTuple <: AbstractTuple

Allowing unnamed entries within NamedTuples seems a confusing tack to me.

@bramtayl
Copy link
Contributor

Hmm, well, I guess it's a mostly unrelated issue. But if we want an unzip that works with NamedTuples and Tuples, it would make it a lot easier if Tuples and NamedTuples were...less different?

@pdeffebach
Copy link
Contributor

Implementing a really basic unzip would make my life easier in the short term and I would really like to have it. I don't think contemplating a re-design of the Tuples system is necessary in order to agree on an implementation.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 8, 2021

bump! can we get tests and news so this can be closed?

@stevengj stevengj removed needs tests Unit tests are required for this change needs news A NEWS entry is required for this change needs compat annotation Add !!! compat "Julia x.y" to the docstring labels Jul 10, 2023
@stevengj
Copy link
Member

I added some tests, NEWS, and a compat annotation, and fixed a bug due to a missing import.

There is currently a test failure for the case of an empty iterator of iterators, where the behavior of returning () seems a bit off to me:

julia> itrs = ((), ())
((), ())

julia> collect.(itrs)
(Union{}[], Union{}[])

julia> unzip(zip(itrs...))
()

Note that this also makes unzip type-unstable:

julia> unzip(Tuple{Int8,Bool}[])
()

julia> unzip(Tuple{Int8,Bool}[(3,true)])
(Int8[3], Bool[1])

I know that the empty-iterator case is notoriously tricky, but can we do something better at least in the case where the eltype is known?

@stevengj stevengj added the domain:iteration Involves iteration or the iteration protocol label Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:iteration Involves iteration or the iteration protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Base.unzip()