Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proportionmap accepts iterators #855

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ArunS-tack
Copy link

closes #842.

src/counts.jl Outdated
@@ -450,5 +450,5 @@ Return a dictionary mapping each unique value in `x` to its proportion in `x`.
If a vector of weights `wv` is provided, the proportion of weights is computed rather
than the proportion of raw counts.
"""
proportionmap(x::AbstractArray) = _normalize_countmap(countmap(x), length(x))
proportionmap(x) = _normalize_countmap(countmap(x), length(collect(x)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just count the total number of elements when building the countmap? It seems inefficient to materialize x only to obtain its length if we already iterate through it anyway.

Copy link
Author

@ArunS-tack ArunS-tack Mar 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something around sum(values(countmap(x))? But I think that's memory inefficient even though it doesn't iterate again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I thought counting directly inside of countmap. But probably sum(values, countmap(x)) would still be more efficient than using collect(x) if x is an iterator with a large number of elements.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

julia> @btime proportionmap(skipmissing(a)) 8.625 μs (27 allocations: 146.67 KiB) Dict{Int64, Float64} with 4 entries: 4 => 0.25 2 => 0.25 3 => 0.25 1 => 0.25

julia> @btime proportionmap(skipmissing(a)) 316.667 ns (9 allocations: 1.08 KiB) Dict{Int64, Float64} with 4 entries: 4 => 0.25 2 => 0.25 3 => 0.25 1 => 0.25

Looks like a significant improvement 🧐

Comment on lines +454 to +459
countm = Dict{eltype(x), Int}()
n = 0
for y in x
countm[y] = get(countm, y, 0) + 1
n += 1
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reinvents countmap. Better make countmap allow iterators instead, so that both functions benefit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

countmap already accepts iterators; I did that to keep a count of n while iterating.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. The problem is that countmap uses different algorithms under the hood for performance. By using a Dict here, you lose the benefit of the fast radix sort and count sort algorithms.

I see two solutions:

  • do n = Base.IteratorSize(x) isa Union{HasLength, HasShape} ? length(x) : sum(values(countm))
  • adjust all _addcounts! methods to return the number of elements (this should be cheap so not a big deal if it's not used by addcounts)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am looking to help get this across the line. Is this your first proposed solution?

function proportionmap(x)
    countm = countmap(x)
    n = Base.IteratorSize(x) isa Union{Base.HasLength, Base.HasShape} ? length(x) : sum(values(countm))
    _normalize_countmap(countm, n)
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

proportionmap should accept any iterator
4 participants