
NVHPC Support #693

Draft · wants to merge 59 commits into master
Conversation

@ptheywood (Member) commented Sep 22, 2021

  • Patch RapidJSON if using nvhpc
  • Change warning levels / suppressions if using nvhpc. Unfortunately, it does not support -isystem.
  • Fix std::experimental::filesystem linker errors.
    • NVHPC/nvc++ uses GCC's standard library under the hood, so linking against stdc++fs is required for std::experimental::filesystem (see the CMake sketch after this list).
  • Visualisation
    • This builds, links and runs, but plenty of warnings are generated. Requires changes to the vis repo.
  • pyflamegpu / swig
    • SWIG itself does not build with nvc++ as CXX; its build scripts set flags which nvc++ does not recognise (-ansi).
    • Building the python bindings themselves appears to work.
  • Document in the readme as an alternative to gcc
    • Requires CMake >= 3.20 for NVHPC support
  • CMake warning/error if using less than the minimum supported version?
    • 20.7 (the first nvhpc rebrand of pgi) shipped with CUDA 10.1, 10.2 and 11.0 support, so going forwards this should work as long as the install includes 11.0. Since 20.11, it has shipped with 10.2, 11.0 + the latest 11.x at the time of release.
  • NVHPC CI?
    • CI using nvc++ as the host compiler, with nvcc provided by nvhpc.
    • CI using gcc as the host compiler, but nvcc provided by nvhpc (i.e. account for the math_libs relocation).
    • Pick min/max nvhpc CI to run.
    • Large container sizes might limit the use of this, potentially with out-of-memory errors during compilation for multiple architectures.
    • Using a container image is probably easiest, from https://ngc.nvidia.com/catalog/containers/nvidia:nvhpc/tags, using one or more of the -devel-cudaXX.YY- images for nvhpc versions.
      • CentOS images might make more sense, as I would assume most NVHPC users will be on RHEL/CentOS/Rocky based HPC systems.
      • 21.7-devel-cuda11.4-ubuntu20.04
      • 21.7-devel-cuda11.4-centos7
      • Older releases have older / different CUDA versions available.
  • Run every generated executable
    • Non-RTC examples
    • RTC example(s)
    • CUDA C++ test suite
      • Debug passes
      • Release passes with the included patch from emplace -> insert (see the sketch further down). This could be made nvhpc-specific if desired.
    • Python test suite
  • Some versions don't work due to gcc stdlib + compiler incompatibility, so we'll "support" this with the caveat that not all versions will be usable.
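
A minimal sketch of the stdc++fs linking fix noted above, assuming the flamegpu target name and CMake >= 3.20 (the first version to report NVHPC as its own compiler ID):

# Sketch: nvc++ uses GCC's standard library under the hood, so
# std::experimental::filesystem needs an explicit link against stdc++fs
# when NVHPC is the host C++ compiler.
if(CMAKE_CXX_COMPILER_ID STREQUAL "NVHPC")
    target_link_libraries(flamegpu PUBLIC stdc++fs)
endif()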

Closes #977.

@ptheywood added and removed the blocked label Sep 22, 2021

@ptheywood (Member Author) commented Sep 22, 2021

Not actually blocked by std::experimental::filesystem - nvc++ just uses GCC's stdlib, so we can pass the appropriate linker args.

Mostly seems to work, other than Release RTC test suite failures. Debug is fine, which makes tracing the fault more interesting. Release RTC examples also work, so it's test-suite specific in some way?

Vis works, but the vis repo needs CMake changes to address warnings (the same as the main repo + some extras).

Segfault notes

  • The segfault occurs when newRTCFunction is called, but duplicating that content into an example instead runs OK.
  • A new test file with just one of the offending tests in it runs OK.
  • test_cuda_simulation.cu in tests_dev is NOT OK...
  • Building it all, and filtering to only the single test, is NOT OK...
  • Commenting out most of test_cuda_simulation is OK.
  • TestCUDASimulation.SetGetPopulationData being compiled in appears to cause the issue, regardless of whether or not it is filtered out.
    • GetAgent, Step, AgentDeath and AgentID_MultipleStatesUniqueIDs also all cause issues if they are compiled, even if they are not executed?
    • Commenting out a.newFunction("DeathFunc", DeathTestFunc) in AgentDeath is enough to remove the segfault...
  • The segfault appears to occur inside newRTCFunction if newFunction and newRTCFunction are used in the same compilation unit with nvhpc in Release mode, regardless of which comes first in the file. Possibly a compiler issue?
#include "flamegpu/flamegpu.h"
#include "gtest/gtest.h"

namespace flamegpu {
namespace tests {
namespace test_nvhpc {

FLAMEGPU_AGENT_FUNCTION(cudacxx_test_func, flamegpu::MessageNone, flamegpu::MessageNone) {
    return flamegpu::ALIVE;
}

const char* rtc_test_func = R"###(
FLAMEGPU_AGENT_FUNCTION(rtc_test_func, flamegpu::MessageNone, flamegpu::MessageNone) {
    return flamegpu::ALIVE;
}
)###";

TEST(testNVHPC, RTCElapsedTime) {
    ModelDescription m("m");
    AgentDescription &agent = m.newAgent("agent");

    // Using newRTCFunction and newFunction in the same compilation unit appears to cause the segfault within newRTCFunction.
    // Comment out either call to remove the segfault.
    agent.newFunction("cudacxx_test_func", cudacxx_test_func);
    AgentFunctionDescription &func = agent.newRTCFunction("rtc_test_func", rtc_test_func);
}

}  // namespace test_nvhpc
}  // namespace tests
}  // namespace flamegpu

After chucking a bunch of printf __LINE__ calls into AgentFunctionDescription::newRTCFunction, the offending line appears to be agent->functions.emplace(function_name, rtn);. However, function_name, rtn and agent->functions all appear to be valid...
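
For reference, the emplace -> insert patch mentioned in the checklist is roughly the following shape (a sketch with stand-in types; the real container maps function names to shared_ptr-held *Data instances):

#include <map>
#include <memory>
#include <string>

struct AgentFunctionData {};  // Stand-in for the real *Data class.

// Stand-in for agent->functions.
std::map<std::string, std::shared_ptr<AgentFunctionData>> functions;

void addFunction(const std::string &function_name,
                 std::shared_ptr<AgentFunctionData> rtn) {
    // Original form, which segfaults under nvhpc in Release builds:
    // functions.emplace(function_name, rtn);

    // Patched form: build the pair explicitly, then copy it into the map.
    functions.insert({function_name, rtn});
}

int main() {
    addFunction("rtc_test_func", std::make_shared<AgentFunctionData>());
    return 0;
}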

@ptheywood (Member Author) commented:

If using GCC 8's stdlib rather than GCC 9's, this builds OK in Release mode (nvc++ 21.7-0, Ubuntu 21.04).

@ptheywood (Member Author) commented Oct 1, 2021

Currently working on reproducing this with a simpler use case. Currently leaning towards one or more of the following:

  • A compiler error (as this only appears to affect nvhpc + gcc 9).
  • UB related to the use of enable_shared_from_this, whose documentation lists several opportunities for UB (see the sketch after this list).
  • Uninitialised member variables, which are potentially present for many of the *Data classes.
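
For context, a generic illustration (not code from this repo) of the best-known enable_shared_from_this pitfall:

#include <memory>

struct AgentData : std::enable_shared_from_this<AgentData> {
    std::shared_ptr<AgentData> self() { return shared_from_this(); }
};

int main() {
    // OK: the object is already owned by a shared_ptr, so
    // shared_from_this() can find the control block.
    auto owned = std::make_shared<AgentData>();
    auto ok = owned->self();

    // Not OK: no shared_ptr owns this object. Undefined behaviour
    // before C++17; throws std::bad_weak_ptr since C++17.
    AgentData stack_object;
    // auto bad = stack_object.self();

    return 0;
}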

Will continue to work on the MWE a little, but if it doesn't reproduce soon it'll just get dumped into a gist for future reference. Running the gcc and nvhpc builds through valgrind (with an appropriate CUDA suppressions list) would be good, and generally worthwhile on the whole.

I tried enabling -Wuninitialized on the gcc build, but this appeared to produce no additional warnings on my office box when building the flamegpu target.

@ptheywood (Member Author) commented Jan 9, 2024

NVHPC relocates curand compared to a standalone nvcc install. Prior to nvhpc 22.3, this is not correctly reflected in the include path during compilation via CMake when using nvhpc-installed nvcc but gcc as the host compiler.

We may be able to resolve this by requiring curand as a dependency in CMake (see the sketch below); otherwise we might need to explicitly add an edge case to CMake to ensure this include path is set.
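
A minimal sketch of the first option, assuming the flamegpu target name; the CUDA::curand imported target comes from CMake's FindCUDAToolkit module and carries its include directories with it:

# Sketch: depend on curand explicitly so its (possibly relocated)
# headers are added to the include path, rather than assuming they
# sit next to nvcc.
find_package(CUDAToolkit REQUIRED)
target_link_libraries(flamegpu PRIVATE CUDA::curand)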

@ptheywood (Member Author) commented Jan 9, 2024

After some horrible CMake additions to explicitly add the non-symlinked math_libs include directory to the include path(s) if required, curand is now found when using nvhpc-installed nvcc and nvhpc as the host compiler.

However, this then exposes an issue with include path ordering and the finding / use of CUB and Thrust.

The CUB/Thrust version mismatch check is identifying that they do not agree. Locally, using CUDA 11.8 from nvhpc 22.11 (which ships with CUB/Thrust 1.15), this conflicts with the CUB/Thrust 1.17 we explicitly fetch.

This will be due to include directory precedence. It might not be the case for all CMake/nvhpc combos, so I will force CI to investigate for me (once the outage ends?).

Some commits are WIP, as CMake 3.18 needs to use a different method for symlink resolution compared to 3.19, which is not tested (see the sketch below).
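
A sketch of that version split (variable names illustrative); file(REAL_PATH) only exists from CMake 3.19, so 3.18 has to fall back to get_filename_component:

# Sketch: resolve symlinks in a path across CMake versions.
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.19")
    file(REAL_PATH "${_nvhpc_include_dir}" _resolved_include_dir)
else()
    get_filename_component(_resolved_include_dir "${_nvhpc_include_dir}" REALPATH)
endif()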

@ptheywood (Member Author) commented:

Some NVHPC builds via containers fail to configure with CMake, erroring due to:

2024-01-11T18:00:25.4234501Z     #error -- unsupported pgc++ configuration! Only pgc++ 18, 19, 20 and 21 are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

This is with a version of nvcc that was distributed with the very version of nvhpc it claims is incompatible.
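
If we decide to override the check, a sketch of the workaround the error message itself suggests (the scoping condition is illustrative):

# Sketch: pass nvcc's escape hatch when nvc++ is the host compiler,
# accepting the miscompilation risk the error message warns about.
if(CMAKE_CXX_COMPILER_ID STREQUAL "NVHPC")
    string(APPEND CMAKE_CUDA_FLAGS " -allow-unsupported-compiler")
endif()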

@ptheywood added this to the 2.0.0-rc2 milestone Jan 12, 2024

  • Includes CMP0152, which in CMake >= 3.28 changes symlink resolution behaviour, relevant to the nvhpc workarounds.
  • cudaMemset takes an int, not a uint64, so 0xfffffff was triggering an implicit-cast sign change.
  • nvcc believes it is incompatible with the versions of nvhpc it was distributed with...
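
For reference, a generic illustration of that cudaMemset fix (not the repo's actual code; cudaMemset has memset semantics, so only the low byte of its int value parameter is used):

#include <cuda_runtime.h>
#include <cstdint>

int main() {
    uint64_t *d_data = nullptr;
    cudaMalloc(&d_data, 64 * sizeof(uint64_t));

    // cudaMemset's value parameter is an int; a wide all-ones constant
    // triggers an implicit signed/unsigned conversion warning:
    // cudaMemset(d_data, 0xffffffff, 64 * sizeof(uint64_t));

    // Only the low byte is written anyway, so a single-byte value has
    // the identical effect without the sign-changing cast:
    cudaMemset(d_data, 0xff, 64 * sizeof(uint64_t));

    cudaFree(d_data);
    return 0;
}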
Successfully merging this pull request may close these issues:

  • Support nvc++ as the host compiler