
Core #6

Open
wants to merge 6 commits into
base: master
Conversation

@thild commented Sep 21, 2020

Hi Kevin,

To test on Linux, I had to update the projects to .NET Core 3.1. I had to disable some tests, and others are breaking. If this update does not break your dev environment, you may want to consider applying it.

@KevinBaselinesw (Collaborator)

I apologize for the delay, I had a very busy week.

I spent time today analyzing the performance of your app and the call to np.dot. I see that we are allocating huge amounts of memory, but I think it is all legitimate. Your code is in a loop calling np.dot on 2000×2000×8-byte buffers (32 MB each). Each call to np.dot allocates one or more buffers of that size to get its work done. It adds up fast.
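
For illustration, the allocation pattern and the obvious mitigation can be sketched in C (names are hypothetical, and a simple element-wise kernel stands in for the real matrix product):

```c
/* Sketch: why a tight loop over dot calls allocates heavily,
   and how a caller-owned output buffer avoids it.
   Illustrative names only, not the NumpyDotNet API. */
#include <stdlib.h>

/* Allocates a fresh result buffer on every call, like np.dot does. */
double *dot_alloc(const double *a, const double *b, size_t n) {
    double *out = malloc(n * sizeof *out);  /* 2000*2000*8 B = 32 MB at n = 4,000,000 */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];               /* stand-in for the real product kernel */
    return out;                             /* caller must free; in a loop this churns the heap */
}

/* Same kernel writing into a preallocated buffer: zero allocations per call. */
void dot_into(const double *a, const double *b, double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```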

I also ran performance analysis software on the code to see where we are spending the most time. It turns out that my old friend NpyArray_ITER_NEXT, plus the other code I shared with you earlier, is taking most of the time. As a "strided" system, numpy creates data structures that map "views" into the allocated arrays. When operations are performed on these arrays, each element needs to run through NpyArray_ITER_NEXT so the correct offset into the buffer is calculated from the mapped views.
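
The per-element offset bookkeeping described above looks roughly like this: an odometer-style advance over the view's shape, with byte strides per dimension. This is a simplified sketch, not the actual numpy source:

```c
/* Simplified strided iterator: the current element's address is the base
   pointer plus the dot product of the multi-index with the byte strides.
   Advancing walks the index like an odometer in row-major order. */
#include <stddef.h>

typedef struct {
    int ndim;
    ptrdiff_t coords[8];    /* current multi-index           */
    ptrdiff_t dims[8];      /* shape of the view             */
    ptrdiff_t strides[8];   /* byte strides per dimension    */
    char *dataptr;          /* points at the current element */
} iter_t;

static void iter_next(iter_t *it) {
    for (int i = it->ndim - 1; i >= 0; i--) {
        if (it->coords[i] < it->dims[i] - 1) {
            it->coords[i]++;
            it->dataptr += it->strides[i];
            return;
        }
        /* wrap this axis and carry into the next-slower one */
        it->dataptr -= it->coords[i] * it->strides[i];
        it->coords[i] = 0;
    }
}
```

The branching per element is exactly why this dominates the profile when it cannot be specialized away.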

In the C code, this is a macro, which means the compiler inserts the code directly into the calling C code. This allows for optimal performance. As you may know, C# does not support macros, so I had to port those macros to C# functions, which makes them much slower than the C versions. I did mark these functions for aggressive inlining, but I am not sure the compiler is agreeing to do that because the functions are quite large.
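
The macro-vs-function difference can be shown with a stripped-down 1-D advance (illustrative only; the real NpyArray_ITER_NEXT is far larger):

```c
/* The C version: a macro, so the compiler pastes the advance step
   into every call site with zero call overhead. */
#include <stddef.h>

#define ITER_NEXT_1D(ptr, stride) ((ptr) += (stride))

/* What a C# port must do instead (sketched here in C): a function call
   per element, which only matches the macro if the JIT actually inlines it. */
static inline void iter_next_1d(char **ptr, ptrdiff_t stride) {
    *ptr += stride;
}
```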

I did find a way to perform these iteration loops in parallel for some of the operations, which allows me to be faster than the C code in some situations. If I am not able to use parallel processing, this code path will be slower, which is what we are seeing with your np.dot calls.
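
The parallel iteration idea amounts to giving each thread its own contiguous slice of the output, sketched here with pthreads (hypothetical names; not the port's actual code, which uses .NET tasks):

```c
/* Sketch: splitting an element-wise loop across threads. Each thread
   walks a disjoint contiguous slice, so no locking is needed. */
#include <pthread.h>
#include <stddef.h>

typedef struct { const double *a, *b; double *out; size_t lo, hi; } chunk_t;

static void *add_chunk(void *arg) {
    chunk_t *c = arg;
    for (size_t i = c->lo; i < c->hi; i++)
        c->out[i] = c->a[i] + c->b[i];
    return NULL;
}

void parallel_add(const double *a, const double *b, double *out,
                  size_t n, int nthreads) {     /* nthreads <= 16 */
    pthread_t tid[16];
    chunk_t chunk[16];
    size_t per = n / nthreads;
    for (int t = 0; t < nthreads; t++) {
        chunk[t] = (chunk_t){ a, b, out, t * per,
                              (t == nthreads - 1) ? n : (size_t)(t + 1) * per };
        pthread_create(&tid[t], NULL, add_chunk, &chunk[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

The thread start/join overhead is why this only pays off on large arrays; on small ones the serial path wins.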

At some point, I/we/someone should look at calling into a C DLL to perform some of these heavy processing functions. That may be necessary to make this tool really competitive. If we can process in parallel AND use C to perform the calculations, it may end up being much faster than the original Python code.
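
On the native side, that idea would look something like the following (hypothetical kernel and library names; the C# P/Invoke signature is sketched in a comment):

```c
/* Sketch: a heavy kernel compiled into a shared library and called
   from C# via P/Invoke. On Windows this would carry __declspec(dllexport). */
#include <stddef.h>

double dot_kernel(const double *a, const double *b, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* C# side (sketch, names hypothetical):
   [DllImport("npkernels")]
   static extern double dot_kernel(double[] a, double[] b, UIntPtr n);
*/
```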

Next up I will try to convert the sample apps and unit tests to .NET Core so they can be run on Linux.

@thild (Author) commented Sep 29, 2020

Hi Kevin,

> I apologize for the delay, I had a very busy week.

No problem.

> I spent time today analyzing the performance of your app and the call to np.dot. I see that we are allocating huge amounts of memory, but I think it is all legitimate. Your code is in a loop calling np.dot on 2000×2000×8-byte buffers (32 MB each). Each call to np.dot allocates one or more buffers of that size to get its work done. It adds up fast.

The profiler is telling me that Gen 0 is getting full and being collected more than 100 times per second. I think there is some overhead in the heap allocations.

> I also ran performance analysis software on the code to see where we are spending the most time. It turns out that my old friend NpyArray_ITER_NEXT, plus the other code I shared with you earlier, is taking most of the time. As a "strided" system, numpy creates data structures that map "views" into the allocated arrays. When operations are performed on these arrays, each element needs to run through NpyArray_ITER_NEXT so the correct offset into the buffer is calculated from the mapped views.

> In the C code, this is a macro, which means the compiler inserts the code directly into the calling C code. This allows for optimal performance. As you may know, C# does not support macros, so I had to port those macros to C# functions, which makes them much slower than the C versions. I did mark these functions for aggressive inlining, but I am not sure the compiler is agreeing to do that because the functions are quite large.

> I did find a way to perform these iteration loops in parallel for some of the operations, which allows me to be faster than the C code in some situations. If I am not able to use parallel processing, this code path will be slower, which is what we are seeing with your np.dot calls.

NpyArray_ITER_NEXT has many branches indeed, but I copied an older serial version of your MatrixProduct and got interesting results. The serial version is twice as fast.

#Parallel#

```
Running Mackey...
Loading...
    Elapsed: 91ms
Constructing ESN...
    Elapsed: 2588ms
Fit...
    Elapsed: 32129ms
Predict...
    Elapsed: 20501ms
Error...
    STRING
    { test error:
    0,13960390995923377 }
    Elapsed: 43ms

Total time: 55364ms
```


#Serial#

```
Running Mackey...
Loading...
    Elapsed: 96ms
Constructing ESN...
    Elapsed: 2584ms
Fit...
    Elapsed: 15478ms
Predict...
    Elapsed: 3931ms
Error...
    STRING
    { test error:
    0,13960390995923377 }
    Elapsed: 31ms

Total time: 22128ms
```

[screenshot: parallel_dot profiler trace]

As you can see in the previous screenshot, _update takes up 71% of the time and TaskReplication takes up 67% of the time. In the serial version, on the other hand, _update takes up only 31% of the time; the bottleneck there is MathNet.Numerics.

[screenshot: serial_dot profiler trace]

> At some point, I/we/someone should look at calling into a C DLL to perform some of these heavy processing functions. That may be necessary to make this tool really competitive. If we can process in parallel AND use C to perform the calculations, it may end up being much faster than the original Python code.

My curiosity was to see whether it would be possible to beat Numpy's numbers with a pure C# implementation. I think the JIT can get close enough to be competitive. For the past few days I have been digging into your code, trying to understand how it is architected. What are your plans for the library's architecture? Will you stay close to the Numpy architecture or move to a more object-oriented one? What were your criteria when you ported the Numpy code? Is all of the C code in the numpyinternal class? What is the role of the numpyAPI class?

> Next up I will try to convert the sample apps and unit tests to .NET Core so they can be run on Linux.

Only the WPF examples do not run on Linux, and only a few tests that involve dynamic assembly emitting had to be disabled, because .NET Core 3.1 does not support some of those APIs.

@thild (Author) commented Sep 29, 2020

Did you consider a version of ndarray that uses generics?

@KevinBaselinesw (Collaborator)

Can we communicate directly by email rather than through GitHub? I think we can do a better job sharing information than this tool allows. kmckenna@baselinesw.com

@KevinBaselinesw (Collaborator) commented Sep 29, 2020 via email

@thild (Author) commented Sep 29, 2020

The code was running with 8 threads on average.
