Parallel scaling: should we care about it? #2

Open
szaghi opened this issue Jul 30, 2015 · 4 comments

@szaghi (Member) commented Jul 30, 2015

WENO schemes are, in general, non-linear procedures (due to the smoothness-indicator computation) that can take a non-negligible amount of CPU time. However, WENO procedures typically operate on a very small stencil, which at first glance seems to offer little room for parallel scaling.
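
As an illustrative sketch (in Python, with hypothetical names and an assumed `eps`, not this library's actual API), a 3rd-order WENO reconstruction shows where the non-linear cost lives: the smoothness indicators and the resulting weights must be recomputed at every interface from the local data:

```python
def weno3(um1, u, up1, eps=1e-6):
    """Illustrative 3rd-order WENO reconstruction of the value at the
    right interface (i+1/2) from three cell averages. Names and the
    eps value are assumptions for this sketch, not the library's API."""
    # candidate 2nd-order reconstructions on the two 2-point stencils
    p0 = -0.5 * um1 + 1.5 * u      # stencil {i-1, i}
    p1 = 0.5 * u + 0.5 * up1       # stencil {i, i+1}
    # smoothness indicators: the non-linear, data-dependent part
    b0 = (u - um1) ** 2
    b1 = (up1 - u) ** 2
    # non-linear weights built from the linear (optimal) weights 1/3, 2/3
    a0 = (1.0 / 3.0) / (eps + b0) ** 2
    a1 = (2.0 / 3.0) / (eps + b1) ** 2
    w0, w1 = a0 / (a0 + a1), a1 / (a0 + a1)
    return w0 * p0 + w1 * p1

# smooth (linear) data: both candidates agree and the result is exact
print(weno3(0.0, 1.0, 2.0))  # ≈ 1.5
```

Every call touches only its three arguments, which is also why (as discussed below) the per-interface work parallelizes at the level of the caller's loop, not inside the stencil.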

As Rouson clearly states in his great book, premature optimization can be very dangerous: for the moment we will not worry about possible performance bottlenecks on parallel architectures. However, I would like to use this issue to discuss future strategies for supporting parallel architectures.

As a first guess, I suppose the main parallel features the library should provide as soon as possible are:

  • being thread safe: WENO generally operates on a small stencil that is typically a slice of a larger block; the procedures operating on this parent block are generally well suited to shared-memory thread parallelism, so it could be crucial to ensure the thread safety of our library;
  • exploiting vectorization: this could greatly speed up the interpolation.

In this context, your experience (I am thinking of Zaak, Andrea, Francesco, Rouson, Muller, Americi, and many other members of our group) is very important. Please feel free to post any pertinent comments.

@zbeekman (Member) commented

My experience with WENO is with finite-difference WENO. If you use a symmetric, bandwidth-optimized WENO scheme, you can perform high-fidelity turbulence DNS at moderate and high Mach numbers with only a 4th-order-accurate spatial discretization. In a 4th-order-accurate symmetric, bandwidth-optimized WENO scheme, each candidate stencil is only 4 grid points, and the collection of candidate stencils spans 8 grid points.

While the smoothness indicator computation is one of the most expensive parts of the computation, I don’t understand what your concerns about parallel efficiency are, regarding the non-linearity of the WENO scheme and the small stencil size.

The non-linearity of the scheme arises from the smoothness measurements, which are used to weight the candidate stencils. This is a mathematical non-linearity: the stencil weights are now some function of the solution at those points, unlike a linear finite-difference scheme, where the stencil coefficients/weights are typically constant. It does not refer to the algorithmic complexity. The algorithmic complexity is linear: a constant number of operations per grid point. This makes load balancing trivial: just match the sizes of the subdomains used on different cores/MPI ranks/etc.
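
This distinction can be checked numerically with a tiny sketch (hypothetical helper name, Jiang-Shu-style weights for a 3rd-order WENO reconstruction, assumed here for illustration): on smooth data the non-linear weights collapse to the constant linear weights, while near a jump they shift almost entirely onto the smooth stencil.

```python
def weno3_weights(um1, u, up1, eps=1e-6):
    """Hypothetical helper: non-linear weights for the two 2-point
    candidate stencils of a 3rd-order WENO reconstruction."""
    b0 = (u - um1) ** 2                   # smoothness of stencil {i-1, i}
    b1 = (up1 - u) ** 2                   # smoothness of stencil {i, i+1}
    a0 = (1.0 / 3.0) / (eps + b0) ** 2    # linear weight d0 = 1/3
    a1 = (2.0 / 3.0) / (eps + b1) ** 2    # linear weight d1 = 2/3
    return a0 / (a0 + a1), a1 / (a0 + a1)

# smooth data: the weights are (essentially) the constant linear weights
print(weno3_weights(1.0, 2.0, 3.0))   # ≈ (1/3, 2/3)
# a jump on the right: almost all weight falls on the left stencil
print(weno3_weights(0.0, 0.0, 1.0))   # w0 ≈ 1
```

The operation count per interface is identical in both cases, which is exactly why the load-balancing story stays trivial despite the data dependence.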

I am also confused as to why you think the small stencil size will hurt the parallel efficiency. The latency associated with sending boundary points will be the same, no matter how large the stencil (i.e., how much data needs to be sent), but sending less data means that you won’t have to wait as long from the time that the receiver gets the first part of the boundary until it gets the last. Perhaps in the context of finite-volume simulations there are other issues to consider.

I agree that you shouldn’t spend much/any time optimizing at first. Pretty much every operation can be expressed in vector operations, without forming matrices. This helps make it natural for a compiler to vectorize. If you’re performing flux splitting, some information can be reused between neighboring cells.

@szaghi (Member, Author) commented Jul 30, 2015

Hi @zbeekman

While the smoothness indicator computation is one of the most expensive parts of the computation, I don’t understand what your concerns about parallel efficiency are, regarding the non-linearity of the WENO scheme and the small stencil size.

Indeed, I am not really concerned about parallelism; I just want to create a space where we can discuss it.

The algorithmic complexity is linear, constant number of operations per grid point. This makes load balancing trivial: try to match the sizes of the subdomains used on different cores/MPI ranks/etc...

Sure, this is not a problem.

I am also confused as to why you think the small stencil size will hurt the parallel efficiency...

Not at all; I was just not clear (sorry for my bad English).

What I meant is: there is probably no room for speeding up the WENO interpolation itself by parallelizing it via OpenMP or MPI, simply because of the small size of the stencil computation (in this case we would likely lose more time in communication than we save in computation). On the contrary, the flux-computation procedures (taken as an example of outer procedures calling the WENO interpolator) generally operate on larger stencils, so they can be easily parallelized. In my typical application, a domain decomposition is done over mesh blocks and parallelized via MPI, the computations within each block are parallelized via OpenMP, and the "atomic" computations are (hopefully) vectorized. In this regard, I think we should only ensure that our WENO library is thread-safe (so that it can be safely used by user procedures that may themselves be parallelized) and "internally vectorizable".

In order to be thread-safe, I plan to exploit Rouson's lessons: encapsulate all data in one object, making it thread-safe by hiding the data and exposing only the necessary thread-safe methods. I am using an abstract derived type as a contract for the actual WENO interpolators, which we will implement with different algorithms but the same API. As soon as I clarify my messy idea, I would like to apply Rouson's lessons on the Factory pattern to this aim, but that is a topic for another discussion...
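
The abstract-contract-plus-factory idea can be sketched, in Python rather than Fortran and with hypothetical names (the real library would use an abstract derived type and deferred type-bound procedures), as a stateless interpolator hierarchy behind one API:

```python
from abc import ABC, abstractmethod

class WenoInterpolator(ABC):
    """Analogue of the abstract derived type acting as a contract:
    every concrete scheme exposes the same interpolate() API and holds
    no mutable state, which keeps concurrent calls thread-safe."""
    @abstractmethod
    def interpolate(self, um1, u, up1):
        ...

class WenoJS3(WenoInterpolator):
    """Hypothetical 3rd-order Jiang-Shu-style implementation."""
    def interpolate(self, um1, u, up1, eps=1e-6):
        p0, p1 = -0.5 * um1 + 1.5 * u, 0.5 * u + 0.5 * up1
        a0 = (1.0 / 3.0) / (eps + (u - um1) ** 2) ** 2
        a1 = (2.0 / 3.0) / (eps + (up1 - u) ** 2) ** 2
        return (a0 * p0 + a1 * p1) / (a0 + a1)

def weno_factory(scheme):
    """Hypothetical factory: maps a scheme name to a concrete type."""
    schemes = {"JS-3": WenoJS3}
    return schemes[scheme]()

interp = weno_factory("JS-3")
print(interp.interpolate(0.0, 1.0, 2.0))  # ≈ 1.5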

To summarize, these are my conclusions:

  • we do not need to care about parallelism (via threads or processes);
  • we do need to provide thread-safe methods;
  • we do need to let the compiler exploit any hardware vector units by keeping the atomic computations vectorizable.
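
The second point can be checked mechanically: because the interpolator takes all its input as arguments and writes no shared state, calling it concurrently from several threads gives exactly the same answers as a serial loop. A minimal Python sketch (hypothetical function, not this library's API):

```python
from concurrent.futures import ThreadPoolExecutor

def weno3(um1, u, up1, eps=1e-6):
    # pure function: all inputs are arguments and there is no
    # global/shared state, so concurrent calls cannot interfere
    p0, p1 = -0.5 * um1 + 1.5 * u, 0.5 * u + 0.5 * up1
    a0 = (1.0 / 3.0) / (eps + (u - um1) ** 2) ** 2
    a1 = (2.0 / 3.0) / (eps + (up1 - u) ** 2) ** 2
    return (a0 * p0 + a1 * p1) / (a0 + a1)

stencils = [(float(i), float(i + 1), float(i + 2)) for i in range(100)]
serial = [weno3(*s) for s in stencils]
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(lambda s: weno3(*s), stencils))
print(threaded == serial)  # → True
```

In the Fortran library the same property would come from `pure` procedures with no module-level mutable state, so user code parallelized with OpenMP or coarrays can call the interpolator safely.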

Do you agree?

@zbeekman (Member) commented Aug 2, 2015

I agree, and I think that if your aim is to create an efficient WENO interpolation library that tries to divorce itself from notions of CFD, Riemann solvers, etc., then yes, it should be thread-safe and need not worry at all about parallelism or threading explicitly.

@szaghi (Member, Author) commented Aug 3, 2015

Yes, we are interested in CFD, but many Fortraners are not; we should think of our community as a whole.

Today I hope to push some updates.
