Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory corruption #510

Open
ojwoodford opened this issue Sep 24, 2023 · 2 comments
Open

Memory corruption #510

ojwoodford opened this issue Sep 24, 2023 · 2 comments

Comments

@ojwoodford
Copy link

I am seeing some memory corruption in a function that I believe should be working. The code is:

using LoopVectorization

function At_mul_A!(C, A)
   @turbo for n in indices((C, A), 2), m in indices((C, A), (1, 2))
       Cmn = zero(eltype(C))
       for k in indices(A, 1)
           Cmn += A[k,m] * A[k,n]
       end
       C[m,n] = Cmn
   end
end

function test()
A = randn(16, 64)
C = randn(64, 64)
for a = 1:1000
    At_mul_A!(C, A)
end
end
test()

If I run this once, or sometimes a few times, I get the error:

julia(57934,0x1f15fe080) malloc: Incorrect checksum for freed object 0x12b391000: probably modified after being freed.
Corrupt value: 0x4037203caf449798
julia(57934,0x1f15fe080) malloc: *** set a breakpoint in malloc_error_break to debug

and julia crashes. I'm running Julia 1.10.0-beta2 and LoopVectorization v0.12.165.

I cannot see any reason that this shouldn't work - apologies if there is, but I really have looked it over several times, and run it without @turbo to check it's valid.

@chriselrod
Copy link
Member

I got

julia> function test()
       A = randn(16, 64)
       C = randn(64, 64)
       for a = 1:1000
           At_mul_A!(C, A)
       end
       end
test (generic function with 1 method)

julia> @time test()
  0.004525 seconds (3 allocations: 40.172 KiB)

julia> @time test()
  0.004688 seconds (3 allocations: 40.172 KiB)

julia> @time test()
  0.004536 seconds (3 allocations: 40.172 KiB)

julia> @time test()
  0.004673 seconds (3 allocations: 40.172 KiB)

julia> @time foreach(_->test(),1:1000)
  4.794809 seconds (5.07 k allocations: 39.368 MiB, 0.20% gc time, 0.09% compilation time)

julia> function test2()
       A = randn(64, 16)' # 16x64
       C = randn(64, 64)
       for a = 1:1000
           At_mul_A!(C, A)
       end
       end
test2 (generic function with 1 method)

julia> @time test2()
  0.001879 seconds (3 allocations: 40.172 KiB)

julia> malloc(): invalid size (unsorted)

[7231] signal (6.-6): Aborted
in expression starting at none:0

It'd be faster to use a different memory layout if possible, which you can see: test2 is over twice as fast as test.
That is, A_mul_At is faster than At_mul_A due to the better memory layout.
This surprises most people who only think about a matrix multiply as a bunch of dot products; i.e. some people naively suspect that setting the layout to make a dot product of a view as fast as possible would be best, but column-major A' * B is actually the worst of the four combinations, even A' * B' is better.
A' * A requires traversing memory much more quickly, increasing bandwdith requirements, and requires reductions at the end of the k loop, requiring extra FLOPs. None of the other 3 orientations need this.

Anyway, I got a segfault, too.

I tried replaces indices with axes, and didn't get a segfault.
It might be a bug in the optimizations it does for indices.

julia> using LoopVectorization

julia> function At_mul_A!(C, A)
          @turbo for n in axes(C, 2), m in axes(C, 1)
              Cmn = zero(eltype(C))
              for k in axes(A, 1)
                  Cmn += A[k,m] * A[k,n]
              end
              C[m,n] = Cmn
          end
       end
At_mul_A! (generic function with 1 method)

julia> function test()
       A = randn(16, 64)
       C = randn(64, 64)
       for a = 1:1000
           At_mul_A!(C, A)
       end
       end
test (generic function with 1 method)

julia> @time test()
  0.004558 seconds (3 allocations: 40.172 KiB)

julia> @time test()
  0.004533 seconds (3 allocations: 40.172 KiB)

julia> @time test()
  0.004505 seconds (3 allocations: 40.172 KiB)

julia> @time test()
  0.004553 seconds (3 allocations: 40.172 KiB)

julia> @time foreach(_->test(),1:1000)
  4.710555 seconds (12.46 k allocations: 39.802 MiB, 0.24% gc time, 0.08% compilation time)

julia> function test2()
       A = randn(64, 16)' # 16x64
       C = randn(64, 64)
       for a = 1:1000
           At_mul_A!(C, A)
       end
       end
test2 (generic function with 1 method)

julia> @time test2()
  0.002044 seconds (3 allocations: 40.172 KiB)

julia> @time test2()
  0.001877 seconds (3 allocations: 40.172 KiB)

julia> @time test2()
  0.001880 seconds (3 allocations: 40.172 KiB)

julia> @time test2()
  0.001868 seconds (3 allocations: 40.172 KiB)

julia> @time foreach(_->test2(),1:1000)
  2.011197 seconds (5.07 k allocations: 39.368 MiB, 0.45% gc time, 0.20% compilation time)

@ojwoodford
Copy link
Author

Thanks, @chriselrod . Especially for the tips on faster layouts. However, I also get the memory corruption using axes instead of indices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants