ROMIO: excessive number of calls to memcpy() #6985

Open

wkliao opened this issue Apr 18, 2024 · 7 comments

@wkliao
Contributor

wkliao commented Apr 18, 2024

A PnetCDF user reported poor performance of collective writes when using a
noncontiguous write buffer. The root of the problem is a large number of
calls to memcpy() in ADIOI_BUF_COPY in mpich/src/mpi/romio/adio/common/ad_write_coll.c.
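For illustration only, here is a rough sketch (not the actual ADIOI_BUF_COPY macro; names and sizes are made up) of the copy pattern at issue: the per-round buffer copy advances in small increments and issues one memcpy() per increment, rather than one bulk copy per contiguous run of the user buffer.

    /* Illustrative sketch of "many small memcpy calls" vs. one bulk copy.
     * Not ROMIO code; copy_in_increments() and its arguments are made up. */
    #include <string.h>

    static long copy_in_increments(char *collect_buf, const char *user_buf,
                                   size_t total, size_t incr)
    {
        long ncalls = 0;
        for (size_t done = 0; done < total; done += incr) {
            size_t n = (total - done < incr) ? total - done : incr;
            memcpy(collect_buf + done, user_buf + done, n);  /* one call per increment */
            ncalls++;
        }
        return ncalls;   /* roughly total/incr calls, vs. 1 for a single bulk memcpy */
    }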

A performance reproducer is available in
https://github.com/wkliao/mpi-io-examples/blob/master/tests/pio_noncontig.c

This program makes a single call to MPI_File_write_at_all. The user buffer can
be either contiguous (command-line option -g 0) or noncontiguous (the default).
The noncontiguous case adds a gap of 16 bytes into the buffer. The file
view consists of multiple subarray datatypes, appended one after another.
A further description of the I/O pattern can be found at the beginning of the
program file.
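As a rough sketch (not the actual pio_noncontig.c; the sizes and the 2-D subarray shape are illustrative assumptions), the access pattern looks like a single collective write with a subarray file view and a user buffer made noncontiguous by a small gap:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* file view: a 2-D subarray owned by this rank (illustrative sizes) */
        int gsizes[2] = {nprocs * 1024, 1024};
        int lsizes[2] = {1024, 1024};
        int starts[2] = {rank * 1024, 0};
        MPI_Datatype ftype;
        MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                                 MPI_ORDER_C, MPI_BYTE, &ftype);
        MPI_Type_commit(&ftype);

        /* user buffer: two contiguous pieces separated by a 16-byte gap,
         * described with an hindexed datatype */
        size_t len0 = 256, len1 = (size_t)lsizes[0] * lsizes[1] - len0;
        char *buf = malloc(len0 + 16 + len1);
        int blens[2] = {(int)len0, (int)len1};
        MPI_Aint disps[2] = {0, (MPI_Aint)(len0 + 16)};
        MPI_Datatype btype;
        MPI_Type_create_hindexed(2, blens, disps, MPI_BYTE, &btype);
        MPI_Type_commit(&btype);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_BYTE, ftype, "native", MPI_INFO_NULL);
        MPI_File_write_at_all(fh, 0, buf, 1, btype, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Type_free(&ftype);
        MPI_Type_free(&btype);
        free(buf);
        MPI_Finalize();
        return 0;
    }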

Running this program on a Linux machine with the UFS ADIO driver on 16 MPI
processes gave run times of 33.07 and 8.27 seconds; the former is with a
noncontiguous user buffer and the latter with a contiguous one. The user buffer
on each process is 32 MB in size, and the noncontiguous case adds a 16-byte gap
into it. The run commands used were:

    mpiexec -n 16 ./pio_noncontig -k 256 -c 32768 -w
    mpiexec -n 16 ./pio_noncontig -k 256 -c 32768 -w -g 0

The following patch, when applied to MPICH, prints the number of calls to memcpy():
https://github.com/wkliao/mpi-io-examples/blob/master/tests/0001-print-number-of-calls-to-memcpy.patch

The numbers of memcpy calls are 2097153 and 0 from the above two runs,
respectively.

@hzhou
Contributor

hzhou commented Apr 19, 2024

I haven't looked at the code, but a factor of 4 (from 8.27 to 33.07 seconds) going from contiguous to noncontiguous seems normal to me, especially if the data consists of many small segments.

@wkliao
Contributor Author

wkliao commented Apr 19, 2024

"The noncontiguous case adds a gap of 16 bytes into the buffer" means the
buffer has two contiguous segments: one of size 256 bytes and the other of size
256 x 16 x 8191 bytes, separated by a gap of 16 bytes.

The focus of this issue is the number of memcpy calls, as indicated in
the issue title, which is 2097153 per process. In fact, ROMIO can be fixed to
reduce that to 2 memcpy calls.

The test runs I provided were just to prove the point. The case is small and
reproducible even on a single compute node, which makes it easier to debug.
When tested with fewer processes, say 8, the timing gap becomes even bigger:
24.48 vs. 1.98 seconds. The actual runs reported by the PnetCDF user are at a
much larger scale, with a total write amount of more than 20 GB; there the time
difference was 198.5 vs. 14.9 seconds.

@hzhou
Contributor

hzhou commented Apr 19, 2024 via email

@hzhou
Contributor

hzhou commented Apr 21, 2024 via email

@wkliao
Contributor Author

wkliao commented Apr 22, 2024

Your understanding of the issue is correct.

"I am not familiar with ROMIO code so I could be way off – why don't we use MPI_Pack to prepare the send buffer?"

I think it is because of the memory footprint. In my test program, the additional memory space would be 32 MB. For a bigger problem size, the footprint would be bigger.
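For reference, a hedged sketch of the MPI_Pack approach mentioned above (not ROMIO code; the helper name and its arguments are hypothetical). It makes the footprint cost explicit: the packed copy is as large as the data itself.

    #include <mpi.h>
    #include <stdlib.h>

    /* Pack a noncontiguous user buffer into a contiguous scratch buffer so the
     * rest of the write path can treat it as contiguous. The extra allocation
     * is the memory-footprint concern discussed above. */
    static int pack_user_buffer(const void *ubuf, int count, MPI_Datatype btype,
                                MPI_Comm comm, void **packed, int *packed_size)
    {
        MPI_Pack_size(count, btype, comm, packed_size);
        *packed = malloc(*packed_size);   /* a full extra copy of the payload */
        if (*packed == NULL) return MPI_ERR_NO_MEM;
        int pos = 0;
        return MPI_Pack(ubuf, count, btype, *packed, *packed_size, &pos, comm);
    }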

I do not follow the idea of a "partial datatype". Would it help construct a datatype that is the intersection of two other datatypes (the user buffer type and the file view)?

@roblatham00
Contributor

@wkliao Hui implemented a way to work on datatypes without flattening the whole thing first. We would still have to compute the intersection of the memory type and the file view, but I think his hope is that the datatype data structures might be less memory intensive -- not as a solution to this issue, but an idea for a ROMIO enhancement that came to mind while looking at this code.

@wkliao
Contributor Author

wkliao commented Apr 25, 2024

As the current implementation of collective I/O is done in multiple rounds of two-phase I/O,
if such partial datatype flattening could work, then I expect the memory footprint could
be reduced significantly, which would be great.

FYI, I added code inside ROMIO to measure the memory footprint and ran pio_noncontig.c
using the commands provided in my earlier comments. The high-water mark is about
300 MB (the maximum among the 16 processes) for such a small test case. I think it mainly
comes from flattening the file view datatype.

As for this issue, my own solution is to check whether the part of the user buffer
used in each two-phase I/O round is contiguous. If it is, pass it to MPI_Issend
directly and thus skip most of the memcpy calls. A rough sketch of that idea is shown below.
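The sketch below is not the actual patch; the flat_off/flat_len segment bookkeeping, the staging buffer, and the helper name are made up for illustration. It only shows the core idea: send straight from the user buffer when this round's portion is a single contiguous range, otherwise fall back to copying.

    #include <mpi.h>
    #include <string.h>

    /* If the part of the user buffer destined for one aggregator in this
     * two-phase round is a single contiguous range, send it in place;
     * otherwise copy the pieces into a staging buffer first. */
    static void send_round(const char *user_buf, const MPI_Aint *flat_off,
                           const MPI_Aint *flat_len, int nsegs, MPI_Aint total_len,
                           int aggregator, int tag, MPI_Comm comm,
                           char *staging, MPI_Request *req)
    {
        if (nsegs == 1) {
            /* contiguous: no memcpy at all */
            MPI_Issend(user_buf + flat_off[0], (int)flat_len[0], MPI_BYTE,
                       aggregator, tag, comm, req);
        } else {
            /* noncontiguous: copy the pieces, then send the staging buffer */
            MPI_Aint pos = 0;
            for (int i = 0; i < nsegs; i++) {
                memcpy(staging + pos, user_buf + flat_off[i], flat_len[i]);
                pos += flat_len[i];
            }
            MPI_Issend(staging, (int)total_len, MPI_BYTE, aggregator, tag, comm, req);
        }
    }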
