
optimize_group not thread-safe when generating PTX for GPU runtime #1181

Open
DeclanRussell opened this issue May 27, 2020 · 19 comments

@DeclanRussell
Contributor

DeclanRussell commented May 27, 2020

Problem

We're trying to parallelize the PTX compilation of our OSL shaders, as the generation time is becoming a bottleneck. In our current prototype we build the groups serially for all the shaders and then just run ShadingSystem::optimize_group on the groups in parallel to generate the PTX. All the callbacks into our code that could be invoked during this have been made thread-safe. Even so, calling ShadingSystem::optimize_group in parallel produces bad PTX that OptiX rejects. Here is the output from OptiX:

Parse error (Details: Function "_rtProgramCreateFromPTXString" caught exception: (api input string): error: Failed to parse input PTX string
(api input string), line 26694; error   : Unknown symbol 'prototype_708'
(api input string), line 26694; error   : Arguments mismatch for instruction 'call'
(api input string), line 26699; error   : Unknown symbol 'prototype_708'
(api input string), line 27288; error   : Unknown symbol 'prototype_724'
(api input string), line 27288; error   : Arguments mismatch for instruction 'call'
(api input string), line 27293; error   : Unknown symbol 'prototype_724'
(api input string), line 34108; error   : Unknown symbol 'prototype_1118'
(api input string), line 34108; error   : Arguments mismatch for instruction 'call'
(api input string), line 34113; error   : Unknown symbol 'prototype_1118'
Cannot parse input PTX string
)

Interestingly, when compiling serially, none of these "prototype" symbols are present in the PTX produced. So I guess something in this API call isn't thread-safe and the code is getting mangled.

Improving this would be super helpful: currently we're forced to compile all the PTX serially, and once you have, say, 10 shaders that each take 6 seconds to compile, you're minutes in before you've even started rendering. Happy to provide more information or generated PTX if needed.
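
For reference, here's a minimal sketch of the kind of parallel driver we're running; compile_all_ptx, shadingsys, and groups are stand-ins for our actual setup, and the exact optimize_group signature may differ between OSL versions:

#include <thread>
#include <vector>
#include <OSL/oslexec.h>

// Build the groups serially beforehand, then JIT each group's PTX on its own
// thread. Run serially this produces valid PTX; run like this, OptiX rejects
// the result with the "prototype" errors above.
void compile_all_ptx (OSL::ShadingSystem* shadingsys,
                      std::vector<OSL::ShaderGroupRef>& groups)
{
    std::vector<std::thread> workers;
    for (OSL::ShaderGroupRef& group : groups)
        workers.emplace_back ([shadingsys, &group] {
            shadingsys->optimize_group (group.get());
        });
    for (std::thread& w : workers)
        w.join();
}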

Versions

  • OSL branch/version: master
  • OS: windows / linux
  • C++ compiler: clang 8.0.1 / 9.0.1
  • LLVM version: 8.0.1 / 9.0.1
@DeclanRussell
Contributor Author

Digging into this a little more, I've found that putting a lock around the point where the pass manager runs the optimization passes, like so, solves the symbol issues.

diff --git a/src/liboslexec/llvm_util.cpp b/src/liboslexec/llvm_util.cpp
index 81d377b0..5e906614 100644
--- a/src/liboslexec/llvm_util.cpp
+++ b/src/liboslexec/llvm_util.cpp
@@ -1586,8 +1586,12 @@ LLVM_Util::ptx_compile_group (llvm::Module* lib_module, const std::string& name,
         fn_pm.run (*i);
     }

+{
+   static OIIO::spin_mutex mutex;
+   OIIO::spin_lock lock (mutex);
     // Run the optimization passes on the module to generate the PTX
     mod_pm.run (*linked_module);
+}

     // TODO: Minimize string copying
     out = assembly_stream.str().str();

This, however, is where most of the time in optimize_group is spent, so unfortunately it doesn't really solve our problem. I don't see why these optimization passes shouldn't be able to run in parallel. Perhaps it's something to do with the use of llvm::legacy::PassManager, and this just needs to be updated to something newer?
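For what it's worth, here's a minimal sketch of what running the module optimization through the newer PassBuilder API would look like (optimize_module_newpm is a made-up name). Note that the code-generation passes that actually write the PTX still require the legacy pass manager in these LLVM versions, so this alone may not remove the race:

#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"

// Run the default O2 module pipeline through the new pass manager.
void optimize_module_newpm (llvm::Module& M)
{
    llvm::PassBuilder PB;
    llvm::LoopAnalysisManager LAM;
    llvm::FunctionAnalysisManager FAM;
    llvm::CGSCCAnalysisManager CGAM;
    llvm::ModuleAnalysisManager MAM;
    // Wire the analysis managers together, as PassBuilder requires.
    PB.registerModuleAnalyses (MAM);
    PB.registerCGSCCAnalyses (CGAM);
    PB.registerFunctionAnalyses (FAM);
    PB.registerLoopAnalyses (LAM);
    PB.crossRegisterProxies (LAM, FAM, CGAM, MAM);
    llvm::ModulePassManager MPM = PB.buildPerModuleDefaultPipeline (
        llvm::PassBuilder::OptimizationLevel::O2);
    MPM.run (M, MAM);
}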

@fpsunflower
Contributor

This is a good clue. Given that the same code works in parallel for the CPU case, maybe the problem is somewhere in the PTX backend itself?

Have you tried LLVM 10? (or even master?)

@fpsunflower
Contributor

As a totally wild guess -- the prototype_724 strings suggest there is some sequential numbering of symbols for the PTX code. Perhaps the NVPTX backend is using an unprotected global counter here?

@fpsunflower
Contributor

I did a bit of grepping through the LLVM source, and sure enough, the prototype_%d string is assembled using a static counter.

So I'm afraid the bug is deep inside LLVM. A naive change would be to switch this to an atomic -- perhaps that would be enough?
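
To make the suspected failure mode concrete, here's a standalone toy model (not the actual LLVM code): the prototype declaration and the call site are both named from the same unsynchronized counter, on the assumption that both reads see the same value.

#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the unsynchronized counter in the NVPTX backend.
static unsigned uniqueCallSite = 0;

int main ()
{
    int mismatches[4] = { 0, 0, 0, 0 };
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back ([t, &mismatches] {
            for (int i = 0; i < 100000; ++i) {
                // If another thread bumps the counter between these two
                // reads, "call prototype_N" references a declaration that
                // was never emitted -- the "Unknown symbol" error above.
                unsigned declared = uniqueCallSite;    // names the prototype
                unsigned called   = uniqueCallSite++;  // names the call site
                if (declared != called)
                    ++mismatches[t];
            }
        });
    for (std::thread& th : threads)
        th.join();
    std::printf ("mismatched prototype numbers: %d\n",
                 mismatches[0] + mismatches[1] + mismatches[2] + mismatches[3]);
}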

@lgritz
Collaborator

lgritz commented May 28, 2020

Chris, have you already emailed llvm-dev about this?

Aside: that should be a regular mutex, not a spin lock, since running the LLVM optimizer is not the kind of short operation you want to spin on.
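
That is, something like this in ptx_compile_group (assuming <mutex> is added to the includes):

{
    // Same workaround as above, but the thread that loses the race sleeps
    // instead of spinning for the whole optimization run.
    static std::mutex ptx_mutex;
    std::lock_guard<std::mutex> lock (ptx_mutex);
    // Run the optimization passes on the module to generate the PTX
    mod_pm.run (*linked_module);
}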

@fpsunflower
Contributor

No, not yet -- from my limited read-through of the code, it's not totally obvious whether simply making it an atomic will fix it, or whether there will be other subtle issues.

@DeclanRussell do you see the spot I'm referring to in the LLVM source? Are you able to run some tests?

git blame on the source suggests Google is actually the main author of that particular file, so reaching out via llvm-dev is probably the right channel for further discussion.

@lgritz
Collaborator

lgritz commented May 28, 2020

If you point out the issue on the mailing list, maybe somebody will either say "that is the solution", have a different suggestion, or volunteer to fix other similar issues.

@DeclanRussell
Contributor Author

Thanks for the suggestions, guys! Looks like @fpsunflower was on the money with that static counter in the NVPTX target. I changed the counter to an atomic with the following patch, which seems to work great! I guess the best course of action now would be to submit this patch to the LLVM community.

diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index 9669fb5964e..1e8a99de8db 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -59,12 +59,13 @@
 #include <string>
 #include <utility>
 #include <vector>
+#include <atomic>
 
 #define DEBUG_TYPE "nvptx-lower"
 
 using namespace llvm;
 
-static unsigned int uniqueCallSite = 0;
+static std::atomic<unsigned int> uniqueCallSiteCount { 0 };
 
 static cl::opt<bool> sched4reg(
     "nvptx-sched4reg",
@@ -327,6 +328,9 @@ VectorizePTXValueVTs(const SmallVectorImpl<EVT> &ValueVTs,
 NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
                                          const NVPTXSubtarget &STI)
     : TargetLowering(TM), nvTM(&TM), STI(STI) {
+
+  uniqueCallSite = uniqueCallSiteCount.fetch_add(1, std::memory_order_acq_rel);
+
   // always lower memset, memcpy, and memmove intrinsics to load/store
   // instructions, rather
   // then generating calls to memset, mempcy or memmove.
@@ -1837,7 +1841,7 @@ SDValue NVPTXTargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
                                                    true),
                              InFlag, dl);
   InFlag = Chain.getValue(1);
-  uniqueCallSite++;
+  uniqueCallSite = uniqueCallSiteCount.fetch_add(1, std::memory_order_acq_rel);
 
   // Append ProxyReg instructions to the chain to make sure that `callseq_end`
   // will not get lost. Otherwise, during libcalls expansion, the nodes can become
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
index df9cd415996..79ca357c71f 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.h
@@ -580,6 +580,8 @@ private:
 
   Align getArgumentAlignment(SDValue Callee, const CallBase *CB, Type *Ty,
                              unsigned Idx, const DataLayout &DL) const;
+
+  mutable unsigned int uniqueCallSite;
 };
 } // namespace llvm
 

@lgritz
Collaborator

lgritz commented May 29, 2020

Awesome. Yes, please submit that to LLVM. (If you have too much bureaucratic nonsense on your side to let you do it, let me know.) If we're really lucky, maybe it can make it into an LLVM 10.x?

On our side, I think what we should do is have an appropriate #if controlling whether there is a mutex lock around the PTX generation. It doesn't need the lock if LLVM >= 11.0 (adjusting for wherever the patch actually lands) or if you set a build-time option indicating that you have a custom LLVM build with the patch already applied.
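
Roughly like this around the mod_pm.run call; OSL_LLVM_VERSION already exists in our build, but the 110 cutoff and the OSL_LLVM_NVPTX_THREADSAFE option name are placeholders until we know where the patch lands:

// Serialize PTX generation only when the LLVM in use still has the race.
#if OSL_LLVM_VERSION < 110 && !defined(OSL_LLVM_NVPTX_THREADSAFE)
    static std::mutex ptx_compile_mutex;
    std::lock_guard<std::mutex> lock (ptx_compile_mutex);
#endif
    // Run the optimization passes on the module to generate the PTX
    mod_pm.run (*linked_module);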

@DeclanRussell
Contributor Author

Hey guys, thought I'd give an update on this. After further testing, it seems that unfortunately my LLVM patch wasn't a complete fix and I was probably just getting lucky. I tried some other simple hacks, but to no avail. It seems the issue is deeper in the NVPTX backend than I have the knowledge to fix. I raised the issue on the LLVM mailing list, but it never really got any traction.

@fpsunflower
Contributor

Even if the patch is incomplete -- it's at least worth submitting, right? It was definitely an obviously non-thread-safe spot in the code. Maybe the review process would help jog the memory of whoever wrote the code in the first place about other areas that may also be unsafe.

@fpsunflower
Contributor

In the patch above, I'm curious: why are you incrementing uniqueCallSiteCount in two places? The original code only had one increment.

I can't imagine it makes much difference either way, just curious.

@fpsunflower
Contributor

Nevermind -- I think I see why you need two: the constructor gives each NVPTXTargetLowering instance its own starting number, and LowerCall then fetches a fresh number for each call so that two instances can never hand out the same one.

@fpsunflower
Contributor

Are the issues you are getting still centered around this prototype_nnn error?

@DeclanRussell
Contributor Author

I'm hesitant to submit the patch if it doesn't actually fix the problem, since perhaps the real solution means removing the global variable altogether. True, though, it could be a means of getting more eyes on the matter.

Yes the issues are still centered around the prototype_nnn error.

@fpsunflower
Contributor

That suggests we are at least looking in the right place. With exactly zero understanding of what the code is doing or how LLVM is invoking it -- maybe the other thing to try would be to make the counter local to the class? Can you tell (via printfs?) whether there is one instance of this class per thread, or whether a single one is being shared? Maybe re-using numbers is OK as long as it's for different PTX modules?
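
For the printf test, a two-line probe at the top of NVPTXTargetLowering's constructor in NVPTXISelLowering.cpp should tell us (llvm::get_threadid() lives in llvm/Support/Threading.h, which would need to be added to the includes):

// Probe: one NVPTXTargetLowering per thread, or a single shared instance?
llvm::errs() << "NVPTXTargetLowering " << this << " on thread "
             << llvm::get_threadid() << "\n";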

@lgritz
Collaborator

lgritz commented Aug 6, 2020

Let's bring this up at the TSC meeting. Maybe some people from NVIDIA are involved in LLVM's PTX back end, or have connections with people who are?

@fpsunflower
Contributor

Sorry that we didn't get to this in time today -- @ee33, could you help us connect with anyone on the NVIDIA side?

@DeclanRussell
Contributor Author

So we've already talked to NVIDIA a little about this. It seems that making the whole NVPTX backend thread-safe would be a big task. However, given that we only care about the PTX generation and not about linking, perhaps that simplifies the problem.
