Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try new .Net Core 2.1 PopCount() intrinsic for HAMT implementation #5

Open
dadhi opened this issue Jun 15, 2018 · 9 comments
Open

Try new .Net Core 2.1 PopCount() intrinsic for HAMT implementation #5

dadhi opened this issue Jun 15, 2018 · 9 comments

Comments

@dadhi
Copy link
Owner

dadhi commented Jun 15, 2018

Currently it uses HammingWeight algorithm for calculating number of set bits in a int value:
https://bitbucket.org/dadhi/dryioc/src/e788ffce78b727b35fda5355a2297d40f0e7731b/Net45/Playground/HashArrayMappedTrieTests.cs#lines-190

Here is the thing Aye, that's there System.Runtime.Intrinsics.X86.PopCount(uint/ulong)

First step will to BDN benchmark against Swar or HammingWeight with PopCount.

@dadhi dadhi changed the title Try new .Net Core 2.1 PopCount() intrisinc for HAMT implementation Try new .Net Core 2.1 PopCount() intrinsic for HAMT implementation Jun 15, 2018
@dzmitry-lahoda
Copy link

Have you done that? May I do it?

@dadhi
Copy link
Owner Author

dadhi commented Jan 17, 2019

No, not yet. Yes, you can try!

dadhi pushed a commit that referenced this issue Jan 17, 2019
@dadhi
Copy link
Owner Author

dadhi commented Jan 17, 2019

@dzmitry-lahoda I have added two basic HAMT implementations into the Playground project with some tests.

There are two (or more) implementations: IntHashTrie<V> and HashTrie<K, V> (for arbitrary keys).

The method(s) to consider for replace with PopCount are named uint GetSetBitsCount(uint n):

private static uint GetSetBitsCount(uint n)

The hand-written benchmarks (without BenchmarkDotNet) are in https://github.com/dadhi/ImTools/blob/master/src/Playground/TreeBenchmarks.cs

@dzmitry-lahoda
Copy link

dzmitry-lahoda commented Jan 17, 2019

               Method | ItemCount |        Mean |     Error |       StdDev |
--------------------- |---------- |------------:|----------:|-------------:|
         SWAR_ImTools |      1000 |  2,362.5 ns |  47.07 ns |    73.284 ns |
               Popcnt |      1000 |    504.7 ns |  10.05 ns |     8.908 ns |
 SWAR_ImTools_Inlined |      1000 |  1,658.4 ns |  27.01 ns |    25.261 ns |
    SWAR_ForkOfCoreFx |      1000 |  2,378.0 ns |  45.01 ns |    42.106 ns |
         SWAR_ImTools |     10000 | 23,281.8 ns | 362.54 ns |   339.117 ns |
               Popcnt |     10000 |  4,912.9 ns |  73.60 ns |    68.845 ns |
 SWAR_ImTools_Inlined |     10000 | 16,221.2 ns | 265.45 ns |   248.305 ns |
    SWAR_ForkOfCoreFx |     10000 | 24,810.7 ns | 492.58 ns | 1,049.733 ns |

I did in my fork as it has .NET Core only build (i have hard time work with legacy projects:()

https://github.com/dzmitry-lahoda/ImTools/blob/unnoficial/src/Playground/BitCountsBenchmarks.cs

For synthetic benchmark.

  1. Seem better to mark for inlining to be 100% sure you are (I guess it will be better for exact that method, so will test https://gist.github.com/mrange/d6e7415113ebfa52ccb660f4ce534dd4#gistcomment-2026775)
  2. You SWAR is slightly better to some than other SWAR:) on my Intel 7gen
  3. Popcnt is 3 to 4 times faster.

I guess need to make TreeBenchmarks to be BDN. And add copy of Trie code. Will send results.

I guess I may merge these classes into separate branch and send pull. But better if whole solution will be .NET Core for contrib.

@dzmitry-lahoda
Copy link

Adding 10 items (ms):
Trie PopCnt- 0
Trie - 0

Getting one out of 10 items 1,000,000 times (ms):
Trie PopCnt - 21
Trie - 20
====================
Adding 100 items (ms):
Trie PopCnt- 2
Trie - 0

Getting one out of 100 items 1,000,000 times (ms):
Trie PopCnt - 39
Trie - 38
====================
Adding 1000 items (ms):
Trie PopCnt- 0
Trie - 0

Getting one out of 1000 items 1,000,000 times (ms):
Trie PopCnt - 40
Trie - 38
====================
Adding 10000 items (ms):
Trie PopCnt- 4
Trie - 3

Getting one out of 10000 items 1,000,000 times (ms):
Trie PopCnt - 49
Trie - 22
====================
Adding 100000 items (ms):
Trie PopCnt- 66
Trie - 80

Getting one out of 100000 items 1,000,000 times (ms):
Trie PopCnt - 64
Trie - 23
====================
Adding 1000000 items (ms):
Trie PopCnt- 1489
Trie - 1533

Getting one out of 1000000 items 1,000,000 times (ms):
Trie PopCnt - 18
Trie - 24

Strange result.

  1. May be transform to BDN to avoid any influences (but do not seems that issues relevant). I have intel 7700hq and process is 64 bits.
  2. Tried inline GetSetBitsCount. No changes.

So seem no need to popcnt for now for specific case.

@dzmitry-lahoda
Copy link

dzmitry-lahoda commented Jan 17, 2019

I see Trie is faster than Tree always. And only slower on 1M items to add. Interesting replacement of IoC. But may be be better try out BDN to measure memory and reduce any fluctuations. E.g. right now GC.Collect() seems(not sure) starts GC, but not waits to finish.

@dzmitry-lahoda
Copy link

dzmitry-lahoda commented Jan 17, 2019

popcnt is with netcoreapp3.0 preview. not with 2.1 or 2.2:) so I suggest to close story as it seems irrelevant for both performance and usage in near future.

@dadhi
Copy link
Owner Author

dadhi commented Jan 17, 2019

Thanks for the test.

  • BDN is needed - I fully agree
  • For some reason, I was thinking that PopCount is available in the .Net Core 2.1. Hmmm...

I will keep it open until I played with it myself, will decide how to proceed later.

@dzmitry-lahoda
Copy link

I was trying to get the code of intrinsic for some reason. Found that for .NET Core 2.1 there should be kind of System.Runtime.Intrinsics.Experimental.dll. But I could not find such.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants