
Sparse Kronecker sums #49

Open
elisno opened this issue May 29, 2020 · 6 comments
elisno commented May 29, 2020

Right now I'm working on an in-place CPU version of kronsum(A, B) = kron(A, oneunit(B)) + kron(oneunit(A), B).

So far, this works for dense arrays:

function kronsum!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    Base.require_one_based_indexing(A, B)
    (mA, nA) = size(A); (mB, nB) = size(B); (mC, nC) = size(C)
    @boundscheck (mC, nC) == (mA * mB, nA * nB) ||
        throw(DimensionMismatch("expect C to be a $(mA * mB)x$(nA * nB) matrix, got size $(mC)x$(nC)"))

    C .= 0
    # First pass: write the kron(A, I_B) part. Each A[k,j] lands on the
    # diagonal of block (k, j); m walks C by linear (column-major) index.
    m = 1
    @inbounds for j = 1:nA
        for l = 1:mB
            for k = 1:mA
                C[m] = A[k,j]
                m += nB
            end
            m += 1
        end
        m -= nB
    end

    # Second pass: accumulate the kron(I_A, B) part, i.e. copies of B
    # along the block diagonal of C.
    m = 1
    @inbounds for j = 1:nA
        for k = 1:nB
            for l = 1:mB
                C[m] += B[l,k]
                m += 1
            end
            m += (nA - 1) * mB
        end
        m += mB
    end

    return C
end
A = rand(3,3); B = rand(2,2); C = zeros(6,6);
kronsum!(C, A, B);
C == kron(A, oneunit(B)) + kron(oneunit(A), B) # true

I'm having trouble doing something similar with SparseMatrixCSC, since parts of A and B only overlap on the diagonal of C.

Using Diagonal Format (from DIA.jl) could work, but I think converting to CSC would take too much time (especially if either A or B is sparse), and there's the risk of storing too many zeros in the diagonal storage Vectors.

Currently, I'm inspired by the sparse kron! in JuliaLang/julia#31069. It works for either kron(A,oneunit(B)) or kron(oneunit(A), B), but not the sum.

  • Could I use two loops like I did for the dense arrays? (One nested loop for A, the other for B)
  • Can I somehow efficiently check if an index points to a diagonal term of C where C isa SparseMatrixCSC?
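
For the second question, one approach (a sketch only; diag_nzindices is a hypothetical helper name, not an existing API) would be to precompute, once per sparsity pattern, the position in C.nzval of each stored diagonal entry by scanning rowvals within each column's nzrange:

```julia
using SparseArrays

# Hypothetical helper: for each column j, return the index into nonzeros(C)
# of the stored entry C[j, j], or 0 if the diagonal entry is not stored.
function diag_nzindices(C::SparseMatrixCSC)
    rows = rowvals(C)
    idx = zeros(Int, size(C, 2))
    for j in 1:size(C, 2)
        for p in nzrange(C, j)
            if rows[p] == j
                idx[j] = p
                break
            end
        end
    end
    return idx
end
```

With a fixed pattern this costs O(nnz) once, after which "is this a diagonal term?" becomes a plain index comparison.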
@Roger-luo (Member)

Have you tried kron + IMatrix from LuxurySparse? That's how we calculate our own Hamiltonian matrices in Yao at the moment. I'm curious whether kronsum! would be faster than that, and how much speedup it could give (say, in the dense-matrix case).


GiggleLiu commented May 29, 2020

To get a feeling for which part to optimize:

julia> a = sprand(500, 500, 0.02);

julia> m = kron(a, IMatrix{500}());

julia> n = kron(IMatrix{500}(), a);

julia> @benchmark $m + $n
BenchmarkTools.Trial: 
  memory estimate:  78.57 MiB
  allocs estimate:  8
  --------------
  minimum time:     43.870 ms (0.69% GC)
  median time:      69.541 ms (7.15% GC)
  mean time:        72.327 ms (10.12% GC)
  maximum time:     160.216 ms (61.05% GC)
  --------------
  samples:          70
  evals/sample:     1

julia> @benchmark kron($a, IMatrix{500}())
BenchmarkTools.Trial: 
  memory estimate:  40.24 MiB
  allocs estimate:  8
  --------------
  minimum time:     7.091 ms (0.00% GC)
  median time:      8.242 ms (0.00% GC)
  mean time:        8.540 ms (2.21% GC)
  maximum time:     14.998 ms (4.26% GC)
  --------------
  samples:          585
  evals/sample:     1

julia> @benchmark kron(IMatrix{500}(), $a)
BenchmarkTools.Trial: 
  memory estimate:  59.40 MiB
  allocs estimate:  10
  --------------
  minimum time:     7.096 ms (0.00% GC)
  median time:      9.283 ms (4.30% GC)
  mean time:        11.296 ms (6.95% GC)
  maximum time:     25.206 ms (3.04% GC)
  --------------
  samples:          443
  evals/sample:     1

julia> @benchmark copy($res)
BenchmarkTools.Trial: 
  memory estimate:  77.62 MiB
  allocs estimate:  8
  --------------
  minimum time:     37.340 ms (0.00% GC)
  median time:      55.627 ms (0.00% GC)
  mean time:        58.335 ms (7.91% GC)
  maximum time:     72.644 ms (20.63% GC)
  --------------
  samples:          86
  evals/sample:     1

So, the question is whether kronsum! can be faster than the SparseMatrixCSC + operation in Base. It should be possible if you preallocate the output matrix, because most of the time is spent in memory allocation.
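
To illustrate the preallocation point: if the pattern of C already contains the patterns of both terms, the sum can reuse C's storage with zero allocations. A sketch (add_into_pattern! is a hypothetical name; it assumes pattern(C) is a superset of pattern(A), with rows sorted within each column, which the CSC format guarantees):

```julia
using SparseArrays

# Add sparse A into a preallocated C, assuming pattern(C) ⊇ pattern(A).
# Walks each column of C and A together; allocates nothing.
function add_into_pattern!(C::SparseMatrixCSC, A::SparseMatrixCSC)
    crows, cvals = rowvals(C), nonzeros(C)
    arows, avals = rowvals(A), nonzeros(A)
    for j in 1:size(A, 2)
        p = first(nzrange(C, j))
        for q in nzrange(A, j)
            while crows[p] != arows[q]  # advance to the matching row in C
                p += 1
            end
            cvals[p] += avals[q]
        end
    end
    return C
end
```

Reuse then becomes fill!(nonzeros(C), 0) followed by one add_into_pattern! call per term.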


elisno commented May 29, 2020

So, the question is whether kronsum! can be faster than the SparseMatrixCSC + operation in Base. It should be possible if you preallocate the output matrix, because most of the time is spent in memory allocation.

In Kronecker.jl, the (out-of-place) equivalent of kronsum uses SparseMatrixCSC +:

"""
    collect(K::AbstractKroneckerSum)
Collects a lazy instance of the `AbstractKroneckerSum` type into a full,
native matrix. Returns the result as a sparse matrix.
"""
function Base.collect(K::AbstractKroneckerSum)
    A, B = getmatrices(sparse(K))
    IA, IB = oneunit(A), oneunit(B)
    return kron(A, IB) + kron(IA, B)
end

"""
    SparseArrays.sparse(K::KroneckerSum)
Creates a lazy instance of a `KroneckerSum` type with sparse
matrices. If the matrices are already sparse, `K` is returned.
"""
function SparseArrays.sparse(K::KroneckerSum{T, TA, TB}) where {T, TA <: AbstractSparseMatrix, TB <: AbstractSparseMatrix}
    return K
end

function SparseArrays.sparse(K::KroneckerSum{T, TA, TB}) where {T, TA <: AbstractMatrix, TB <: AbstractMatrix}
    return sparse(K.A) ⊕ sparse(K.B)
end

For my use case, I would preallocate the output matrix C with kron once, then update the non-zero values of A and B (assuming they have a fixed sparsity structure).

julia> A, B = (sprand(32, 32, 0.1) for _ in 1:2);

julia> IA, IB = (IMatrix(X) for X in (A,B));

julia> C = kron(A, IB) + kron(IA, B); C.nzval .= 0;
...

When the sparsity structure is fixed, a batched_kron! implementation for batches of sparse matrices would be very nice. One would probably need a new type:

struct BatchSparseMatrixCSC{Tv,Ti<:Integer} <: AbstractSparseArray{Tv,Ti,3} end
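
A minimal sketch of such a type (the field layout here is my assumption, not an existing proposal): share one colptr/rowval across the batch and widen nzval to an nnz-by-batch matrix, so batched kernels can loop over columns of nzval while reusing the pattern:

```julia
using SparseArrays

# Hypothetical layout: one shared CSC pattern, with one column of stored
# values per matrix in the batch.
struct BatchSparseMatrixCSC{Tv,Ti<:Integer} <: AbstractSparseArray{Tv,Ti,3}
    m::Int
    n::Int
    colptr::Vector{Ti}
    rowval::Vector{Ti}
    nzval::Matrix{Tv}   # size nnz × nbatch
end

Base.size(B::BatchSparseMatrixCSC) = (B.m, B.n, size(B.nzval, 2))
```

A batched_kron! could then update all batch members in one sweep over the shared pattern.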


elisno commented May 29, 2020

have you tried to use kron + IMatrix from LuxurySparse? this is how we calculate our own Hamiltonian matrices in Yao at the moment. I'm curious if kronsum! will be faster than it and how much speedup could it be. (let's say in dense matrix case)

julia> A, B = (sprand(100, 100, 0.1) for _ in 1:2);

julia> IA, IB = (IMatrix(X) for X in (A,B));

julia> C = kron(A, IB) + kron(IA, B); C.nzval .= 0;

julia> @benchmark kronsum!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     65.690 ms (0.00% GC)
  median time:      66.072 ms (0.00% GC)
  mean time:        66.199 ms (0.00% GC)
  maximum time:     67.885 ms (0.00% GC)
  --------------
  samples:          76
  evals/sample:     1

julia> @benchmark kron($A, $IB) + kron($IA, $B)
BenchmarkTools.Trial: 
  memory estimate:  6.88 MiB
  allocs estimate:  26
  --------------
  minimum time:     1.381 ms (0.00% GC)
  median time:      1.427 ms (0.00% GC)
  mean time:        1.479 ms (2.43% GC)
  maximum time:     2.693 ms (31.06% GC)
  --------------
  samples:          3364
  evals/sample:     1

julia> C == kron(A,IB) + kron(IA, B)
true

IMatrix appears to be slightly faster than oneunit, but I have yet to take the construction of both matrices into account.

julia> unA, unB = (oneunit(X) for X in (A,B));

julia> @benchmark kron($A, $unB) + kron($unA, $B)
BenchmarkTools.Trial: 
  memory estimate:  6.15 MiB
  allocs estimate:  24
  --------------
  minimum time:     1.922 ms (0.00% GC)
  median time:      1.968 ms (0.00% GC)
  mean time:        2.012 ms (1.59% GC)
  maximum time:     2.960 ms (26.67% GC)
  --------------
  samples:          2475
  evals/sample:     1


elisno commented May 29, 2020

If the matrices are dense, using IMatrix and + is much faster, but kronsum! is still allocation-free.

julia> A, B = (rand(40, 40) for _ in 1:2);

julia> IA, IB = (IMatrix(X) for X in (A,B));

julia> C = kron(A, IB) + kron(IA, B); C.nzval .= 0;

julia> @benchmark kronsum!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.664 ms (0.00% GC)
  median time:      3.728 ms (0.00% GC)
  mean time:        3.746 ms (0.00% GC)
  maximum time:     5.585 ms (0.00% GC)
  --------------
  samples:          1334
  evals/sample:     1

julia> @benchmark kron($A, $IB) + kron($IA, $B)
BenchmarkTools.Trial: 
  memory estimate:  3.94 MiB
  allocs estimate:  21
  --------------
  minimum time:     710.607 μs (0.00% GC)
  median time:      733.885 μs (0.00% GC)
  mean time:        768.422 μs (2.74% GC)
  maximum time:     1.837 ms (44.32% GC)
  --------------
  samples:          6452
  evals/sample:     1

julia> unA, unB = (oneunit(X) for X in (A,B));

julia> @benchmark kron($A, $unB) + kron($unA, $B)
BenchmarkTools.Trial: 
  memory estimate:  58.59 MiB
  allocs estimate:  6
  --------------
  minimum time:     9.340 ms (0.00% GC)
  median time:      9.791 ms (0.00% GC)
  mean time:        9.924 ms (2.67% GC)
  maximum time:     10.843 ms (7.24% GC)
  --------------
  samples:          504
  evals/sample:     1

@Roger-luo (Member)

IMatrix actually doesn't allocate anything, and kron produces a sparse matrix when one operand is an IMatrix, so I believe the allocation is fine. We could add kron! interface support for these matrices, so you could preallocate those as well.

NOTE: oneunit allocates an entire dense array and the result of kron will be an Array as well, while IMatrix allocates nothing and produces a SparseMatrixCSC.
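
To illustrate that note with Base types only (IMatrix goes further by storing nothing at all):

```julia
using LinearAlgebra, SparseArrays

A = rand(3, 3)
# oneunit of a dense matrix materializes a dense identity, and kron
# with it stays dense:
typeof(oneunit(A))            # Matrix{Float64}
typeof(kron(A, oneunit(A)))   # Matrix{Float64}
# a sparse identity keeps the result in CSC form:
typeof(kron(sparse(A), sparse(1.0I, 3, 3)))   # SparseMatrixCSC
```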

Could I use two loops like I did for the dense arrays? (One nested loop for A, the other for B)

No: setindex! on the CSC format is not efficient; you have to generate the colptr etc. directly, which would end up being the same as what's in Base.
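
For a flavor of "generate the colptr directly": in kron(A, I_p), column (j-1)*p + l of the result has exactly as many stored entries as column j of A, so the pointer array can be written in one pass (kron_eye_colptr is a hypothetical helper name, shown as a sketch):

```julia
using SparseArrays

# Build the colptr of kron(A, I_p) directly, with no setindex! calls:
# each column of A contributes p consecutive result columns of equal count.
function kron_eye_colptr(A::SparseMatrixCSC{Tv,Ti}, p::Integer) where {Tv,Ti}
    n = size(A, 2)
    colptr = Vector{Ti}(undef, n * p + 1)
    colptr[1] = 1
    c = 1
    for j in 1:n
        cnt = length(nzrange(A, j))
        for l in 1:p
            colptr[c + 1] = colptr[c] + cnt
            c += 1
        end
    end
    return colptr
end
```

The rowval and nzval arrays can be filled with analogous single-pass loops, which is essentially what the Base sparse kron does.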
