
Move dimensions to allocator #85


Conversation

Collaborator

@Nanoseb Nanoseb commented Apr 9, 2024

closes #77

@Nanoseb Nanoseb added the core Issue affecting core mechanisms of the software label Apr 9, 2024
@Nanoseb Nanoseb self-assigned this Apr 9, 2024
@semi-h semi-h mentioned this pull request Apr 10, 2024
Member

semi-h commented Apr 10, 2024

The main change here is introducing new member variables and removing some of them, involving the dirps_t, field_t, and allocator_t classes. Some of these are great and I completely agree with them, but there are a few I have concerns about.

  • cdims in allocator_t ✔️
    • A new member stores the actual sizes of the subdomain, which I think is very helpful. It also helped get rid of globs in certain places, which is great.
  • n_groups_x/y/z in allocator_t ✔️
    • We manage the padding in the allocator and the number of groups in the domain depends directly on this, so it's a good idea to store this info in the allocator.
  • Add n in field_t and remove it from dirps_t
    • I think this is the most important one, mainly due to the difference between the velocity n and the pressure n. I'll leave a separate review on this so that I can point out specific examples and suggest an alternative solution. I think going ahead with this will make managing the difference between the velocity and pressure grids harder later on.
  • Add sz in field_t
    • If we want to avoid using the SZ from m_omp_common, then I think we can make sz a member of backend_t, because once the backend is instantiated, SZ is fixed and it'll never change. Having this as a member in field_t is a bit misleading in my opinion; it implies that fields might have different SZ values, but that's never the case. There are also a few locations where passing and using sz as a variable instead of a parameter will cause performance issues; I'll comment on these separately.
  • Add n_groups in field_t and remove the corresponding member from dirps_t
    • As far as I can see this only replaced accessing n_groups from dirps_t with field_t. Also, it didn't really allow us to avoid passing dirps_t in any of the subroutines where we access this member, so I'm not sure it simplifies anything. In my view it's better to store n_groups in dirps_t, because it doesn't really depend on the field instance (a field instance is just a scratch space after all), but it completely depends on the direction we're working in at a given time.

There is also a new member n_padded in field_t, but it's not used anywhere yet, so I'm not commenting on it.

Member

@semi-h semi-h left a comment


Here I focused entirely on SZ.

backend%yblocks = dim3(globs%n_groups_y, 1, 1)
backend%zthreads = dim3(SZ, 1, 1)
backend%zblocks = dim3(globs%n_groups_z, 1, 1)
sz = allocator%sz

For example, here I would assign allocator%sz to a new sz member of backend_t if we want to avoid using the SZ parameter from m_omp_common/m_cuda_common. In all the use cases of field_t%sz the PR introduces, we could instead do self%sz.
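
A minimal sketch of what I mean, with the derived types heavily simplified (these are not the actual x3d2 definitions):

module m_backend_sz_sketch
   implicit none

   type :: allocator_t
      integer :: sz
   end type allocator_t

   type :: backend_t
      integer :: sz
   end type backend_t

contains

   subroutine init_backend(backend, allocator)
      type(backend_t), intent(inout) :: backend
      type(allocator_t), intent(in) :: allocator

      ! SZ is fixed once the backend is instantiated and never changes,
      ! so store it once here instead of on every field_t instance.
      backend%sz = allocator%sz
   end subroutine init_backend

end module m_backend_sz_sketch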


However, I think using the SZ parameter directly from m_omp_common/m_cuda_common is a better idea, though I don't have a strong opinion on this. If we're already under omp/ or cuda/, it's fine to rely on the corresponding SZ parameter directly; otherwise we end up using sz from allocator_t to set a very critical value for us.

@@ -360,7 +365,7 @@ subroutine tds_solve_cuda(self, du, u, dirps, tdsops)
error stop 'DIR mismatch between fields and dirps in tds_solve.'
end if

blocks = dim3(dirps%n_blocks, 1, 1); threads = dim3(SZ, 1, 1)
blocks = dim3(u%n_groups, 1, 1); threads = dim3(u%sz, 1, 1)

For example, here sz doesn't depend on the field u. It is always the same SZ, which is effectively fixed the moment the backend is instantiated.


self%nx = xdirps%n; self%ny = ydirps%n; self%nz = zdirps%n
sz = allocator%sz

With #83 the Poisson solver doesn't require SZ any more, just wanted to give an update.

@@ -2,7 +2,6 @@ module m_omp_exec_dist
use mpi

use m_common, only: dp
use m_omp_common, only: SZ

Not being able to access SZ as a parameter will probably make performance worse, because the compiler won't be certain at compile time whether the simd loops can be vectorised or not.

@@ -127,15 +128,15 @@ subroutine exec_dist_transeq_compact( &
! Handle dud by locally generating u*v
do j = 1, n
!$omp simd
do i = 1, SZ
do i = 1, sz

For example, here we basically tell the compiler to do its best to vectorise the loop. But if the loop goes from 1 to 7, let's say, it's impossible for the compiler to vectorise it. And if sz here is a variable rather than a parameter, the compiler can't be sure at compile time that it's a good integer for vectorisation. In the best case it'll generate multiple versions of the assembly and have to decide which one to execute at runtime, which is not ideal. It may altogether refuse to generate vectorised assembly; this will depend on the compiler. So here I recommend using SZ as a parameter directly from m_omp_common.
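
A standalone sketch of the difference (simplified loops, not the actual exec_dist code): in the first subroutine the trip count of the simd loop is a compile-time parameter, in the second it is a runtime argument.

module m_simd_sketch
   implicit none
   integer, parameter :: dp = kind(0.d0)
   integer, parameter :: SZ = 32  ! compile-time constant, as in m_omp_common

contains

   subroutine scale_param(u, n)
      ! Trip count of the simd loop is known at compile time, so the
      ! compiler can emit a single, clean vectorised version.
      integer, intent(in) :: n
      real(dp), intent(inout) :: u(SZ, n)
      integer :: i, j

      do j = 1, n
         !$omp simd
         do i = 1, SZ
            u(i, j) = 2._dp*u(i, j)
         end do
      end do
   end subroutine scale_param

   subroutine scale_var(u, sz, n)
      ! Trip count is only known at runtime: the compiler may emit
      ! several code paths and pick one at runtime, or not vectorise.
      integer, intent(in) :: sz, n
      real(dp), intent(inout) :: u(sz, n)
      integer :: i, j

      do j = 1, n
         !$omp simd
         do i = 1, sz
            u(i, j) = 2._dp*u(i, j)
         end do
      end do
   end subroutine scale_var

end module m_simd_sketch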

allocate (ud_recv_e(SZ, n_halo))
allocate (ud_recv_s(SZ, n_halo))
allocate (ud(sz, n))
allocate (ud_recv_e(sz, n_halo))

Another minor example: if we have both sz and n_halo as parameters, it'll probably be better for performance, as the size of the allocation can be determined at compile time. n_halo is not a parameter yet, but we can actually make it one; it's always a constant value in the backends and there are places where we hardcode things based on n_halo=4.
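
For illustration, a sketch under the assumption that n_halo becomes a module-level parameter alongside SZ (names simplified, not the real module):

module m_halo_alloc_sketch
   implicit none
   integer, parameter :: dp = kind(0.d0)
   integer, parameter :: SZ = 32
   integer, parameter :: n_halo = 4  ! currently hardcoded as 4 in places

contains

   subroutine alloc_buffers(ud, ud_recv_s, ud_recv_e, n)
      integer, intent(in) :: n
      real(dp), allocatable, intent(out) :: ud(:, :), ud_recv_s(:, :), ud_recv_e(:, :)

      ! With SZ and n_halo both parameters, the halo buffer shapes are
      ! compile-time constants; only n remains a runtime quantity.
      allocate (ud(SZ, n))
      allocate (ud_recv_s(SZ, n_halo))
      allocate (ud_recv_e(SZ, n_halo))
   end subroutine alloc_buffers

end module m_halo_alloc_sketch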

@@ -177,7 +178,7 @@ subroutine exec_dist_transeq_compact( &

do j = 1, n
!$omp simd
do i = 1, SZ
do i = 1, sz

Just another example where a parameter SZ will make performance better.

Member

@semi-h semi-h left a comment


I'm starting a new review specifically for discussing field_t%n.

I tried to avoid inferring the actual size the tridiagonal solver will use from the domain shape. This is mainly because of the difference between the velocity grid and the pressure grid.

I'll try to give an example:
We call tds_solve in various settings and often we pass velocity grid data and get back pressure grid data, such as

x3d2/src/solver.f90

Lines 284 to 285 in ce2efbc

call self%backend%tds_solve(du_x, u, self%xdirps, &
self%xdirps%stagder_v2p)

In this case, it's very important to use the correct n in tds_solve.

subroutine tds_solve_dist(self, du, u, dirps, tdsops, blocks, threads)

The tdsops class we have is the key here. It has an n member where the size of the tridiagonal system is defined. It's always the correct value, either the velocity n or the pressure n, because it is literally the class that defines the operation; it is determined based on whether the operation is a _p2v, a _v2p, and so on. We pass this down to exec_dist via tdsops_dev in #L409 below.

x3d2/src/cuda/backend.f90

Lines 404 to 411 in ce2efbc

call exec_dist_tds_compact( &
du_dev, u_dev, &
self%u_recv_s_dev, self%u_recv_e_dev, &
self%du_send_s_dev, self%du_send_e_dev, &
self%du_recv_s_dev, self%du_recv_e_dev, &
tdsops_dev, dirps%nproc, dirps%pprev, dirps%pnext, &
blocks, threads &
)

And eventually we call the kernel with tdsops%n below in #L43.
call der_univ_dist<<<blocks, threads>>>( & !&
du, du_send_s, du_send_e, u, u_recv_s, u_recv_e, &
tdsops%coeffs_s_dev, tdsops%coeffs_e_dev, tdsops%coeffs_dev, &
tdsops%n, tdsops%dist_fw_dev, tdsops%dist_bw_dev, tdsops%dist_af_dev &
)

Halo exchange

The actual n of the field is also important in halo exchanges, because if we can't determine the actual n, whether it's pressure or velocity, we can't send the right last 4 slices to the neighbouring ranks.

x3d2/src/cuda/backend.f90

Lines 395 to 396 in ce2efbc

call copy_into_buffers(self%u_send_s_dev, self%u_send_e_dev, u_dev, &
tdsops_dev%n)

That's why we need to pass an n into copy_into_buffers, so that the last 4 layers are accessed correctly based on whether it's pressure or velocity data. See below.
u_send_e(i, j, k) = u(i, n - n_halo + j, k)

Because of this we need to pass an n value to copy_into_buffers, and can't assume it's field_t%n.

I think having a field_t%n is not a good idea because of all this. It would be misleading, and the n in a given direction can be stored in dirps_t%n. When it's safe to do so we can use it, and when necessary we'll have to access tdsops_t%n.

u_send_e(i, j, k) = u(i, n - n_halo + j, k)
do i = 1, u%sz
u_send_s(i, j, k) = u%data(i, j, k)
u_send_e(i, j, k) = u%data(i, u%n - n_halo + j, k)

In particular here, we can't have field_t%n. We have to pass an n value into this subroutine, determined from the corresponding tdsops, when copy_into_buffers is called in tds_solve. tds_solve always has a tdsops, and passing tdsops%n down to copy_into_buffers is necessary. Otherwise we'll end up sending the wrong set of slices.
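
A simplified standalone sketch of the idea (the argument list is trimmed relative to the real routine):

module m_copy_buffers_sketch
   implicit none
   integer, parameter :: dp = kind(0.d0)
   integer, parameter :: n_halo = 4

contains

   subroutine copy_into_buffers(u_send_s, u_send_e, u, n)
      ! n comes from tdsops%n in the caller (tds_solve), not from the
      ! field itself, so the last n_halo layers are picked w.r.t. the
      ! operator's grid (pressure or velocity), never the storage size.
      real(dp), intent(out) :: u_send_s(:, :, :), u_send_e(:, :, :)
      real(dp), intent(in) :: u(:, :, :)
      integer, intent(in) :: n
      integer :: i, j, k

      do k = 1, size(u, 3)
         do j = 1, n_halo
            do i = 1, size(u, 1)
               u_send_s(i, j, k) = u(i, j, k)
               u_send_e(i, j, k) = u(i, n - n_halo + j, k)
            end do
         end do
      end do
   end subroutine copy_into_buffers

end module m_copy_buffers_sketch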

Comment on lines +201 to +204
handle%n = -1
handle%SZ = -1
handle%n_padded = -1
handle%n_groups = -1

I think this also suggests that we shouldn't store these variables in field_t. field_t is effectively a scratch space and sometimes these variables don't have a meaning. These assignments in particular can cause annoying bugs.

Comment on lines +312 to +314
do k = 1, u%n_groups
do j = 1, u%n
do i = 1, u%sz

To simplify reorder_omp, how about introducing a variable n_groups(3) in the allocator_t class and:

do k = 1, self%allocator%n_groups(u%dir)
  do j = 1, self%allocator%cdims(u%dir)
    do i = 1, self%sz  ! from backend%sz, or the SZ parameter directly

It doesn't require passing anything extra and removes all the select case stuff above.

@JamieJQuinn
Collaborator

Seems like there are a few different definitions of n used here:

  1. Global n{x,y,z} - entire simulation domain across all nodes
  2. Local n{x,y,z} - local simulation domain unique to one node
  3. Padded local n{x,y,z} - actual size of array used to store local domain
  4. Pressure/velocity n{x,y,z} - either n or n-1 depending on staggeredness of variable

So questions to quell my quonfusion:

  1. What is the value intended to be stored in field_t%n and field_t%n_padded @Nanoseb? The actual size of the field, including staggeredness?
  2. What should tdsops%n contain @semi-h? It looks like it's always local, pressure grid n{x,y,z}. So we can't naively derive this from field_t%n because field_t%n can either be pressure or velocity size? Would having field_t%n help with error handling, e.g. if someone passes in the wrong staggeredness of array?
  3. Unsure why field_t%n wouldn't contain the right n for the halo exchange @semi-h. Can you elaborate?

FWIW it feels like having this information about a field within the field is sensible, but I agree with @semi-h that it mustn't be misleading. I wonder if there's a better naming scheme we can use to distinguish between the above kinds of nx we have floating around.

Contributor

slaizet commented Apr 16, 2024

I think we should discuss n and a proper naming scheme in our next meeting on Monday. We can get some inspiration from 2DECOMP&FFT.

Collaborator Author

Nanoseb commented Apr 22, 2024

I think there are two different questions here: where these variables are stored, and what they are called.
The main purpose of this PR was to put everything in the fields. Most of the time when we act on fields we want to loop through them, so having all the sizes we would need at hand was the main drive for this work. That's why, even if it leads to some duplication of quantities, I still think it makes sense to have them accessible via the field object.

Concerning the naming conventions, I agree this needs to be clarified/improved. One of the reasons it isn't really clear is that get_block doesn't distinguish between velocity (n) and pressure (n+1) fields. Something we could do is, just like we do with the direction, ask for either a cell-based block or a vertex-based one when requesting get_block; requesting either would set the right value in n (see the sketch below).
That way, we can always assume that n is the size we need, whether it's a pressure or a velocity array.
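
A purely hypothetical sketch of the idea — the VERT/CELL constants and the extra get_block argument don't exist in the code, and the types are reduced to the bare minimum:

module m_get_block_sketch
   implicit none

   integer, parameter :: VERT = 1, CELL = 2

   type :: field_t
      integer :: n
   end type field_t

   type :: allocator_t
      integer :: n_vert  ! velocity (vertex) size in this direction
      integer :: n_cell  ! pressure (cell) size in this direction
   contains
      procedure :: get_block
   end type allocator_t

contains

   function get_block(self, loc) result(f)
      class(allocator_t), intent(in) :: self
      integer, intent(in) :: loc
      type(field_t) :: f

      ! Requesting VERT or CELL sets the right n, so downstream code
      ! could trust field%n for either grid.
      if (loc == VERT) then
         f%n = self%n_vert
      else
         f%n = self%n_cell
      end if
   end function get_block

end module m_get_block_sketch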

As we mentioned in the related issue, the plan was to remove the _loc from n and add _global whenever we talk about global sizes. So any n here is local.

For SZ, I will have a think about how we could do it, but indeed, if it means we can't vectorise any loops any more, that's not ideal.

Member

semi-h commented Apr 25, 2024

So questions to quell my quonfusion:

...
2. What should tdsops%n contain @semi-h? It looks like it's always local, pressure grid n{x,y,z}. So we can't naively derive this from field_t%n because field_t%n can either be pressure or velocity size? Would having field_t%n help with error handling, e.g. if someone passes in the wrong staggeredness of array?

tdsops%n is not actually always the local pressure grid size; it depends on the type of operation the tdsops_t instance is set up to carry out. For example, we have a first derivative operator, der1st, and with this one we input velocity grid data and output velocity grid data, so in this case der1st%n is equal to the velocity grid size. But sometimes we input velocity grid data and output on the pressure grid; then tdsops%n is the pressure grid size. Also, the pressure grid size is not always n_{velocity} - 1; it actually depends on the boundary conditions and the location of the subdomain in the global domain. Only the subdomains at the end (w.r.t. x, y, and z) might have a pressure grid size of n_{velocity} - 1; the rest are identical to the velocity grid.

  3. Unsure why field_t%n wouldn't contain the right n for the halo exchange @semi-h. Can you elaborate?

If field_t%n is set to the velocity grid size, sending the last 4 layers of a pressure field to the next rank can be a problem. Let's assume we have a field with 256 velocity points along the primary direction, and 255 pressure points. Then we might want to send pressure points 252, 253, 254, 255 to the next rank. But if we have field_t%n=256 and take a slice like field%data(i, u%n - n_halo + j, k) (j from 1 to 4), we'll access data at 253, 254, 255, 256, which is not the last 4 layers of the pressure field.
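
As a minimal standalone check of those indices:

program halo_index_check
   implicit none
   integer, parameter :: n_halo = 4
   integer :: n, j

   n = 255  ! pressure grid: prints 252 253 254 255, the correct last layers
   print *, (n - n_halo + j, j=1, n_halo)

   n = 256  ! velocity grid size used by mistake: prints 253 254 255 256
   print *, (n - n_halo + j, j=1, n_halo)
end program halo_index_check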

I think there are two different questions here: where these variables are stored, and what they are called. The main purpose of this PR was to put everything in the fields. Most of the time when we act on fields we want to loop through them, so having all the sizes we would need at hand was the main drive for this work. That's why, even if it leads to some duplication of quantities, I still think it makes sense to have them accessible via the field object.

Concerning the naming conventions, I agree this needs to be clarified/improved. One of the reasons it isn't really clear is that get_block doesn't distinguish between velocity (n) and pressure (n+1) fields. Something we could do is, just like we do with the direction, ask for either a cell-based block or a vertex-based one when requesting get_block; requesting either would set the right value in n. That way, we can always assume that n is the size we need, whether it's a pressure or a velocity array.

Passing an extra argument when calling get_block to set field_t%n correctly is an option, but in my opinion this would only increase the complexity. As of now the issue with different grid sizes is nicely hidden down in the backends; it works just fine and people working at the solver level don't have to worry about it at all. Also, this can be a problem when we request a field at a level other than the solver. For example, we request intermediate storage in transeq, the time integrator, and possibly more. If we plan to rely on field_t%n, we would have to make sure that it is always set correctly every single time; this is not only error prone, but the correct value of n (or access to this info) may not even be available at all these levels.

Also, if we require the user to specify whether a field is pressure or velocity, this is an extra responsibility we give to the user. For example, to get a derivative on the staggered grid, the user would have to take two distinct but correlated actions to get a single desired output: first specifying that the field is pressure or velocity, then calling tds_solve with the right tdsops_t instance. However, we don't require this as of now and it works fine.

I think we need to focus on where exactly in the codebase we use the proposed members %SZ, %n, and %n_groups of field_t, and maybe consider alternative solutions. If their use is limited, then the big changes in the codebase (particularly around tds_solve) may not be worth it. In the PR there is a loop in reorder_omp that makes use of these members and it's very helpful there, and some of these members are also used in copy_into_buffers, with minor changes. Of course this is only a draft and not complete, so in what other subroutines do you think these members will be useful?

Collaborator Author

Nanoseb commented May 16, 2024

Going via another route (#91), so closing this PR.

@Nanoseb Nanoseb closed this May 16, 2024