Implement time-domain convolution and use it for integers #545
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##           master     #545      +/-   ##
==========================================
+ Coverage   97.74%   97.79%   +0.04%
==========================================
  Files          18       18
  Lines        3199     3220      +21
==========================================
+ Hits         3127     3149      +22
+ Misses         72       71       -1
Should […]
Maybe Val(N) would help with type stability.
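To illustrate the suggestion (a standalone sketch, not this PR's code; `out_size` is a hypothetical helper): passing `Val(N)` instead of a plain integer lets `ntuple` produce an `NTuple{N}` whose length is part of the type, so the result is inferable.

```julia
# Standalone sketch of the Val(N) suggestion; `out_size` is a
# hypothetical helper name, not part of this PR.
# With a plain integer, `ntuple(f, N)` returns a tuple whose length is
# only known at run time; with `Val(N)` the length is encoded in the
# type, so the compiler can infer a concrete NTuple{N,Int}.
out_size(E::AbstractArray{<:Any,N}, k::AbstractArray{<:Any,N}) where {N} =
    ntuple(i -> size(E, i) + size(k, i) - 1, Val(N))
```

For example, `out_size(zeros(3, 4), zeros(2, 2))` gives the full-convolution output size `(4, 5)`.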
It does for […]
Hm. Export […]
I kind of like the names […]
Force-pushed from 68c20b4 to 3bb56ca.
I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okay-ish. The […]
I used FastConv.jl (licensed under the "MIT License (Expat)") as a reference and wrote a version with an explicit loop:

function _tdconv1!(out, E::AbstractArray{<:Number,N}, k::AbstractArray{<:Number,N}) where {N}
    output_indices = CartesianIndices(ntuple(i -> 1:(size(E, i) + size(k, i) - 1), Val(N)))
    checkbounds(out, output_indices)
    fill!(out, 0)
    offset = CartesianIndex(ntuple(_ -> 1, Val(N)))
    for j in CartesianIndices(E), i in CartesianIndices(k)
        out[i+j-offset] = muladd(E[j], k[i], out[i+j-offset])
    end
    return out
end

Benchmarks:

julia> x = rand(100); y = rand(100); out = rand(199);
julia> @benchmark _conv_td!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 9.000 μs … 38.000 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.100 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.216 μs ± 602.871 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ ▃ ▂
▅▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁█▁▁▁█ █
9 μs Histogram: log(frequency) by time 10.2 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark _tdconv1!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.510 μs … 5.720 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.520 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.583 μs ± 101.580 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▂
▄▁▁█▁▁▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▄▁▁█▁▁▂ ▂
1.51 μs Histogram: frequency by time 1.69 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
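For readers following along, here is a self-contained correctness check of the explicit-loop kernel against a naive 1-d reference (the kernel is restated from above; `reference_conv` is a hypothetical name introduced only for this check):

```julia
# Restatement of the explicit-loop kernel from above, checked against a
# naive O(m*n) 1-d reference implementation of full convolution.
function tdconv!(out, E::AbstractArray{<:Number,N}, k::AbstractArray{<:Number,N}) where {N}
    fill!(out, 0)
    offset = CartesianIndex(ntuple(_ -> 1, Val(N)))
    for j in CartesianIndices(E), i in CartesianIndices(k)
        out[i + j - offset] = muladd(E[j], k[i], out[i + j - offset])
    end
    return out
end

# Naive full convolution: out[n] = sum over valid j of x[j] * y[n - j + 1]
reference_conv(x, y) =
    [sum(x[j] * y[n - j + 1] for j in max(1, n - length(y) + 1):min(n, length(x)))
     for n in 1:(length(x) + length(y) - 1)]

x, y = rand(100), rand(100)
tdconv!(zeros(199), x, y) ≈ reference_conv(x, y)  # true
```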
Yep. I used the generator-based version because it allowed deriving the output type based on the actual values. With the latest change from […]
If you're referring to the somewhat messy organization in […]
I was primarily referring to the keyword-based API. Once that's settled, the organization of the […]

So let's discuss API. Some random thoughts I have gathered while working on this:
I think the current kwargs approach addresses the first three items, although choosing between […]
My opinion, for what it's worth: […]
I'm no fan of […]

Otherwise, this is slowly taking shape -- partially in code, partially in my head for now. One thing I haven't made up my mind about yet is how […]
The two cases are inconsistent with each other for the input-indices-start-at-1 case, but each one individually is very much the only reasonable choice, so that's where we are. Now if the output provided to […]

Of course, erroring for unclear cases would keep open the possibility of deciding later on, so that might be the safest route for now.
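The axes arithmetic behind the inconsistency mentioned above can be written down directly (a sketch; `output_range` is a hypothetical name): the full convolution of inputs with axes `ax_u` and `ax_v` naturally spans `first(ax_u)+first(ax_v)` through `last(ax_u)+last(ax_v)`, which for 1-based inputs of lengths m and n gives `2:(m+n)` rather than the conventional `1:(m+n-1)`.

```julia
# Sketch of the natural output-axis rule for full convolution;
# `output_range` is a hypothetical helper name.
output_range(ax_u::AbstractUnitRange, ax_v::AbstractUnitRange) =
    (first(ax_u) + first(ax_v)):(last(ax_u) + last(ax_v))

output_range(1:100, 1:100)   # 2:200 -- shifted by one from the usual 1:199
output_range(0:99, 0:99)     # 0:198 -- zero-based inputs stay zero-based
```

Either way the output has length m + n - 1; only the offset differs, which is exactly the tension between the 1-based and offset-axes conventions.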
Sounds good. Another thought I had regarding commutativity and axes, while reading this after prematurely micro-optimizing the loop: even if we consider […]
I also just realized that […]
Force-pushed from addb53b to fce6cf9.
These all seem like great changes!
This turns out to be a bit more of a rabbit hole than I had anticipated, but it's slowly getting into shape. Apart from some more tests and documentation, the main to-do appears to be a better heuristic to choose the fastest algorithm. But the workings of the […]
The default for […]

Another alternative would be to replace the […]
I think allowing the user to choose the algorithm would be better. In my previous attempts to add direct convolution and to figure out which algorithm to choose, I found it very hard to come up with heuristics, since dimensionality, size of the problem, element type, number of threads available, and architecture (e.g. whether AVX instructions are available) all come into play. Doing a best-effort selection and letting users who really care override it seems better, imo.
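As a strawman for the "best-effort selection" idea, the fallback could start as simple as a size cutoff (the threshold below is entirely made up, not a benchmarked value):

```julia
# Toy best-effort heuristic for defaulting the algorithm choice; the
# 2^16 cutoff is a made-up placeholder, not a benchmarked value.
function choose_algorithm(u::AbstractArray, v::AbstractArray)
    if length(u) * length(v) <= 2^16
        return :direct           # time-domain wins for small problems
    else
        return :fft_overlapsave  # frequency-domain for large ones
    end
end
```

Users who care could then pass the algorithm explicitly and bypass the heuristic entirely.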
That would then include the option to choose between simple FFT and overlap-save, right? (I had this in an earlier iteration; I can revive it.)
I agree with giving users the ability to choose specific algorithms. Maybe we could have […]
I think the second option with […]
Do you mean […]?

@wheeheee I don't think allowing different kwargs in […]
Yes, that's what I meant.
Force-pushed from bcfbb77 to 7c603ec.
OK, as there were no complaints about the current API, I've added docs and more tests. If someone could check the docs for comprehensibility, that would be welcome.
A couple of potential improvements to the docstrings.
typos
Sorry this has stalled. The only thing left to do from my perspective is to come up with a heuristic to choose between time-domain and frequency-domain. I meant to base this on some benchmarking, but that "some" turned out to be quite a lot, especially when not restricting to 1-d. For the time-domain algorithm, my benchmarking confirmed that […]

If anyone wants to pick this up or has a good idea for a benchmarking strategy -- any help is appreciated!
Not sure what the detailed benchmarks say, but would it be acceptable to start with a conservative heuristic here, since it's IIUC a definite improvement for small sizes?

julia> x = rand(ComplexF64, 245); rx = reverse(x);
julia> @btime conv($x, $rx; algorithm=:fft_simple);
45.600 μs (13 allocations: 31.91 KiB)
julia> x = rand(ComplexF64, 246); rx = reverse(x);
julia> @btime conv($x, $rx; algorithm=:fft_simple);
27.500 μs (13 allocations: 32.28 KiB)
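The size sensitivity in the benchmark above (245 vs. 246 samples) comes from how well the transform length factors into small primes. A sketch of how one could pad to an "FFT-friendly" length using Base's `nextprod` (`fast_fft_length` is a hypothetical name, not an existing function in this PR):

```julia
# The FFT length for a full linear convolution of lengths m and n is at
# least m + n - 1; padding up to the next 7-smooth size (a product of
# 2s, 3s, 5s, and 7s) keeps the FFT in its fast regime.
fast_fft_length(m, n) = nextprod([2, 3, 5, 7], m + n - 1)

fast_fft_length(245, 245)   # 489 is 3*163, so pad to 490 = 2*5*7^2
```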
* add `:algorithm` kwarg
* add `conv!`
Co-authored-by: wheeheee <104880306+wheeheee@users.noreply.github.com>
This does the very minimal thing to get discussion started. I think we should do the time-domain convolution in more cases, but before going in that direction, I'd first like to get feedback.
In particular, for "small" u and v, time-domain convolution would be more efficient, and I see no reason not to do it. It just takes a bit of benchmarking to figure out what "small" should be exactly, here. And it would also fix this minor issue: the time-domain convolution from this PR does the right thing: […]
Another point worth considering is whether to use time-domain convolution for more input types. Consider […]

For BigFloat, this would change from throwing an error to working, but potentially slowly, which I'd consider an improvement. For Rational, I wonder whether doing the computation on Rationals (and also returning Rationals) wouldn't be better. The extreme measure would be to do time-domain convolution by default and only use FFT-based approaches for Float32 and Float64 input (above a certain problem size), which would still cover most cases in practice, I guess. Opinions?

Note that if we decide to do time-domain convolution in more cases, we can always do that in later PRs; this one could stand on its own.
Closes #411, fixes #410.