Optimize_group not thread safe when generating PTX for GPU runtime. #1181
Digging into this a little more, I've found that putting a lock around the spot where the pass manager runs the optimization passes, like so, solves the symbol issues:

```diff
diff --git a/src/liboslexec/llvm_util.cpp b/src/liboslexec/llvm_util.cpp
index 81d377b0..5e906614 100644
--- a/src/liboslexec/llvm_util.cpp
+++ b/src/liboslexec/llvm_util.cpp
@@ -1586,8 +1586,12 @@ LLVM_Util::ptx_compile_group (llvm::Module* lib_module, const std::string& name,
         fn_pm.run (*i);
     }
 
+    {
+        static OIIO::spin_mutex mutex;
+        OIIO::spin_lock lock (mutex);
     // Run the optimization passes on the module to generate the PTX
     mod_pm.run (*linked_module);
+    }
 
     // TODO: Minimize string copying
     out = assembly_stream.str().str();
```

This, however, is where most of the time is being spent, so serializing it largely defeats the parallelism.
This is a good clue. Given that the same code works in parallel for the CPU case, maybe the problem is somewhere in the PTX backend itself? Have you tried LLVM 10? (Or even master?)
As a totally wild guess -- the static `uniqueCallSite` counter in the NVPTX target?
I did a bit of grepping through the LLVM source, and sure enough, the `uniqueCallSite` counter is a plain (non-atomic) file-scope static. So I'm afraid the bug is deep inside LLVM. A naive change would be to switch it to an atomic -- perhaps that would be enough?
Chris, have you already emailed llvm-dev about this? Aside: that should be a regular mutex, not a spin lock, since running the LLVM optimizer is not the kind of short operation you want to spin on.
No, not yet -- from my limited read through the code, it's not totally obvious whether simply making it an atomic will fix it, or whether there will be other subtle issues. @DeclanRussell, do you see the spot I'm referring to in the LLVM source? Are you able to run some tests?
If you point out the issue on the mailing list, maybe somebody will either say "that is the solution", or have a different suggestion, or volunteer to fix other similar issues.
Thanks for the suggestions, guys! Looks like @fpsunflower was on the money with that static counter in the NVPTX target. I changed the counter to be an atomic with the following patch, which seems to work great! I guess the best course of action now would be to submit this patch to the LLVM community.

```diff
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 9669fb5964e..1e8a99de8db 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -59,12 +59,13 @@
 #include <string>
 #include <utility>
 #include <vector>
+#include <atomic>
 
 #define DEBUG_TYPE "nvptx-lower"
 
 using namespace llvm;
 
-static unsigned int uniqueCallSite = 0;
+static std::atomic<unsigned int> uniqueCallSiteCount = 0;
 
 static cl::opt<bool> sched4reg(
     "nvptx-sched4reg",
@@ -327,6 +328,9 @@ VectorizePTXValueVTs(const SmallVectorImpl<EVT> &ValueVTs,
 NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
                                          const NVPTXSubtarget &STI)
     : TargetLowering(TM), nvTM(&TM), STI(STI) {
+
+  uniqueCallSite = uniqueCallSiteCount.fetch_add(1, std::memory_order_acq_rel);
+
   // always lower memset, memcpy, and memmove intrinsics to load/store
   // instructions, rather
   // then generating calls to memset, mempcy or memmove.
@@ -1837,7 +1841,7 @@ SDValue NVPTXTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
                                  true),
                InFlag, dl);
   InFlag = Chain.getValue(1);
-  uniqueCallSite++;
+  uniqueCallSite = uniqueCallSiteCount.fetch_add(1, std::memory_order_acq_rel);
 
   // Append ProxyReg instructions to the chain to make sure that `callseq_end`
   // will not get lost. Otherwise, during libcalls expansion, the nodes can become
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index df9cd415996..79ca357c71f 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -580,6 +580,8 @@ private:
   Align getArgumentAlignment(SDValue Callee, const CallBase *CB, Type *Ty,
                              unsigned Idx, const DataLayout &DL) const;
+
+  mutable unsigned int uniqueCallSite;
 };
 
 } // namespace llvm
```
Awesome. Yes, please submit that to LLVM. (If you have too much bureaucratic nonsense on your side to let you do it, let me know.) If we're really lucky, maybe it can make it into an LLVM 10.x? On our side, I think maybe what we should do is have an appropriate lock in place in the meantime.
Hey guys, thought I would give an update on this. After further testing, it seems that unfortunately my LLVM patch wasn't a complete fix and I was probably just getting lucky. I tried some other simple hacks, but to no avail. It seems the issue is deeper in the NVPTX backend than I have the knowledge to fix. I raised the issue on the LLVM mailing list, but it never really got any traction.
Even if the patch is incomplete, it's at least worth submitting, right? It definitely was an obviously non-thread-safe spot in the code. Maybe the review process would help jog the memory of whoever wrote the code in the first place about other areas that may also be unsafe.
In the patch above, I am curious why you are incrementing `uniqueCallSiteCount` in two places? The original code only had one increment. I can't imagine it makes much difference either way, just curious.
Nevermind - I think I see why you need two. |
Are the issues you are getting still centered around this `prototype_nnn` symbol?
I'm hesitant to submit the patch if it doesn't actually fix the thing, as perhaps the real solution means removing the global variable altogether. It could be a means to get more eyes on the matter, though, true. Yes, the issues are still centered around the `prototype_nnn` error.
That suggests that we are at least looking at the right place. With exactly zero understanding of what the code is doing or how LLVM is invoking it -- maybe the other thing to try would be to just make the counter local to the class? Can you tell (via printfs?) whether there is one instance of this class per thread, or whether a single one is being shared? Maybe re-using numbers is OK as long as it's for different PTXs?
Let's bring this up at the TSC meeting. Maybe people from NVIDIA are involved in LLVM's PTX back end, or have connections who are?
Sorry that we didn't get to this in time today -- @ee33, could you help us connect with anyone on the NVIDIA side? |
So we've talked to NVIDIA a little about this already. It seems that making the whole NVPTX backend thread-safe would be a big task. However, given that we only care about the PTX generation and not about linking, perhaps that simplifies the problem.
Problem
We're trying to parallelize the PTX compilation of our OSL shaders, as this generation time is becoming a bottleneck. In our current prototype we build the groups in serial for all the shaders and then run `ShadingSystem::optimize_group` on the groups in parallel to generate the PTX. All the callbacks into code on our end that could be invoked during this have been made thread safe. It seems that calling `ShadingSystem::optimize_group` in parallel produces bad PTX that OptiX will reject. Here is the output from OptiX:
Interestingly, when compiling in serial, all these "prototype" symbols are no longer present in the PTX produced. So I guess something in this API call isn't thread safe and the code is getting mashed up.
Improving this would be super helpful, as currently we're forced to compile all the PTX in serial, and as soon as you have, say, 10 shaders that each take 6 seconds to compile, you're into the minutes before you have even started rendering. Happy to provide more information or generated PTX if needed.