With help from @casparvl, I've added the following to /project/def-users/bot/shared/host-injections/2023.06/.lmod/SitePackage.lua on our AWS build cluster, which will be picked up by the bot for builds relying on libfabric:
require("strict")
local hook = require("Hook")

-- LmodMessage("Load bot-specific SitePackage.lua")

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    local simpleName = string.match(t.modFullName, "(.-)/")
    -- we may want to be more specific in the future, and only do this for specific versions of libfabric
    if simpleName == 'libfabric' then
        -- set environment variable PSM3_DEVICES as a workaround for MPI applications hanging in libfabric's PSM3 provider
        -- cf. https://github.com/easybuilders/easybuild-easyconfigs/issues/18925
        setenv('PSM3_DEVICES', 'self,shm')
    end
end

-- combine all load hook functions into a single one
function site_specific_load_hook(t)
    eessi_bot_libfabric_set_psm3_devices_hook(t)
end

local function combined_load_hook(t)
    -- Assuming this was called from EESSI's SitePackage.lua, this should be defined and thus run
    if eessi_load_hook ~= nil then
        eessi_load_hook(t)
    end
    site_specific_load_hook(t)
end

hook.register("load", combined_load_hook)
This solves the Haswell OpenMPI issues that we observed in several PRs. I was going to make a PR for it, but I have some doubts about how this should be done:
does it have to be restricted to Haswell (we also saw some hangs with other architectures, but it's not entirely clear if they were caused by the same issue)?
does it have to be restricted to certain versions of libfabric?
do we also need this for the tests? Answer from @casparvl: yes, might be needed.
which script should make sure that this SitePackage.lua is picked up / copied to the right location? bot/build.sh, EESSI-install-software.sh, eessi_container.sh, ...?
what if a PR wants to update SitePackage.lua: should its build already pick up the new version? If so, we should probably prevent it from being copied to the shared directory, otherwise other builds will also pick it up before it's merged.
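For the first two questions, a rough sketch of what a restricted version of the hook could look like. This is untested and rests on assumptions: that EESSI_SOFTWARE_SUBDIR is set in the build environment and holds a value like x86_64/intel/haswell, and that we know which libfabric versions are affected (here left as nil, meaning "any version"). The needs_psm3_workaround helper and the two lookup tables are hypothetical names, not something that exists in EESSI's SitePackage.lua.

```lua
-- Sketch only: the target list is a placeholder, not a verified set of affected CPUs.
local affected_targets = { ["x86_64/intel/haswell"] = true }
-- nil means "apply to any libfabric version"; otherwise a set keyed by version string.
local affected_versions = nil

-- Hypothetical helper: decide whether the PSM3_DEVICES workaround should be applied.
local function needs_psm3_workaround(modFullName, subdir)
    local name, version = string.match(modFullName, "^([^/]+)/(.+)$")
    if name ~= "libfabric" then
        return false
    end
    if affected_versions ~= nil and not affected_versions[version] then
        return false
    end
    if affected_targets ~= nil and not affected_targets[subdir or ""] then
        return false
    end
    return true
end

local function eessi_bot_libfabric_set_psm3_devices_hook(t)
    -- Assumption: EESSI_SOFTWARE_SUBDIR is defined when the bot runs the build.
    if needs_psm3_workaround(t.modFullName, os.getenv("EESSI_SOFTWARE_SUBDIR")) then
        setenv("PSM3_DEVICES", "self,shm")
    end
end
```

Setting affected_targets to nil as well would give the current behaviour (all libfabric modules on all architectures), so the restriction could be tightened or relaxed without touching the hook itself.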
@TopRichard also found an issue with our CUDA hook when trying to use it on NESSI: it currently forbids loading dependency modules that have GPU support, even for building purposes. Disabling that hook as part of the bot-specific SitePackage.lua seems like a good idea.