
[SDK] BatchSpanProcessor::ForceFlush appears to be hanging forever #2574

Open
lidavidm opened this issue Mar 4, 2024 · 4 comments · May be fixed by #2584
Labels: bug, triage/accepted

lidavidm (Contributor) commented Mar 4, 2024

Describe your environment
Platform: Linux/x86_64
Build system: CMake
Package manager: conda/conda-forge
OpenTelemetry version: 1.14.1

Steps to reproduce
Unfortunately, the problem is entangled in a large proprietary application and I have not been able to extract a standalone reproduction. We are using the OTLP exporter with the BatchSpanProcessor, and we call ForceFlush periodically with a timeout.
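Roughly, our setup looks like the sketch below (heavily simplified; the gRPC exporter, the factory helpers, the option values, and the flush interval here are illustrative stand-ins from memory, not our real configuration):

#include <chrono>
#include <memory>

#include "opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h"
#include "opentelemetry/sdk/trace/batch_span_processor_factory.h"
#include "opentelemetry/sdk/trace/batch_span_processor_options.h"
#include "opentelemetry/sdk/trace/tracer_provider_factory.h"

namespace trace_sdk = opentelemetry::sdk::trace;
namespace otlp      = opentelemetry::exporter::otlp;

int main()
{
  // OTLP exporter feeding a BatchSpanProcessor.
  auto exporter  = otlp::OtlpGrpcExporterFactory::Create();
  trace_sdk::BatchSpanProcessorOptions options;
  auto processor = trace_sdk::BatchSpanProcessorFactory::Create(std::move(exporter), options);
  trace_sdk::SpanProcessor *processor_ptr = processor.get();
  auto provider  = trace_sdk::TracerProviderFactory::Create(std::move(processor));

  // Periodically, from the application's main thread: flush with a 1 second
  // timeout and expect the call to return (true or false) within roughly that time.
  bool flushed = processor_ptr->ForceFlush(std::chrono::microseconds{1000000});
  (void)flushed;
  (void)provider;
  return 0;
}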

What is the expected behavior?
ForceFlush with a timeout should reliably return within (roughly) the given timeout.

What is the actual behavior?
ForceFlush appears to get stuck inside the spinlock:

#0  0x00007f52a6c49f17 in __GI_sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007f52a84ce4ee in opentelemetry::v1::sdk::trace::BatchSpanProcessor::ForceFlush(std::chrono::duration<long, std::ratio<1l, 1000000l> >) () from /home/lidavidm/miniforge3/envs/dev/bin/../lib/libopentelemetry_trace.so

or

   0x00007f52a84ce49e <+334>:	jne    0x7f52a84ce4bd <_ZN13opentelemetry2v13sdk5trace18BatchSpanProcessor10ForceFlushENSt6chrono8durationIlSt5ratioILl1ELl1000000EEEE+365>
   0x00007f52a84ce4a0 <+336>:	pause
=> 0x00007f52a84ce4a2 <+338>:	mov    %ebx,%eax
   0x00007f52a84ce4a4 <+340>:	and    $0x7f,%eax
   0x00007f52a84ce4a7 <+343>:	cmp    $0x7f,%eax

These traces appear to correspond to the spin-wait loop in ForceFlush, which does not appear to respect the timeout:

  // If it will be already signaled, we must wait util notified.
  // We use a spin lock here
  if (false ==
      synchronization_data_->is_force_flush_pending.exchange(false, std::memory_order_acq_rel))
  {
    for (int retry_waiting_times = 0;
         false == synchronization_data_->is_force_flush_notified.load(std::memory_order_acquire);
         ++retry_waiting_times)
    {
      opentelemetry::common::SpinLockMutex::fast_yield();
      if ((retry_waiting_times & 127) == 127)
      {
        std::this_thread::yield();
      }
    }
  }
  synchronization_data_->is_force_flush_notified.store(false, std::memory_order_release);
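
For comparison, bounding that wait with the caller's deadline would at least let ForceFlush give up and return false. A rough sketch of what I mean (illustrative only, not the SDK's actual code or a proposed patch):

#include <atomic>
#include <chrono>
#include <thread>

// Spin-wait for the flush notification, but stop at the deadline instead of
// looping forever when the notification never arrives.
// (A real implementation would also special-case timeout == microseconds::max().)
bool WaitForFlushNotification(std::atomic<bool> &is_force_flush_notified,
                              std::chrono::microseconds timeout)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  for (int retry_waiting_times = 0;
       !is_force_flush_notified.load(std::memory_order_acquire); ++retry_waiting_times)
  {
    if (std::chrono::steady_clock::now() >= deadline)
    {
      return false;  // timed out; the caller can report the flush as failed
    }
    if ((retry_waiting_times & 127) == 127)
    {
      std::this_thread::yield();
    }
  }
  return true;
}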

Additional context
We think this only started after we upgraded from 1.13.0 to 1.14.1. Note that we did not previously pass a timeout, since we had not felt the need. After upgrading we noticed the export getting stuck, so we added the timeout to try to avoid that (and as general good practice), but there seems to be a deeper issue causing the export to never complete.

I did set a breakpoint on the actual Export method, and it does appear to be called regularly, but the main thread blocked in ForceFlush never gets unblocked.

lidavidm added the bug label Mar 4, 2024
github-actions bot added the needs-triage label Mar 4, 2024
owent (Member) commented Mar 5, 2024

Thanks, this may happen when more than one thread calls ForceFlush.

owent self-assigned this Mar 5, 2024
lidavidm (Contributor, Author) commented Mar 5, 2024

Thanks for the quick diagnosis. I can confirm the problem also reproduces on 1.13.0 (so it was just coincidence that we started noticing it after the 1.14.1 upgrade), and adding a lock does appear to fix it.
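
For reference, the workaround on our side is roughly the following (simplified sketch, assuming the fix is simply to serialize our own ForceFlush calls behind a mutex):

#include <chrono>
#include <mutex>

#include "opentelemetry/sdk/trace/processor.h"

// Application-side guard so only one of our threads is ever inside ForceFlush at a time.
std::mutex flush_mutex;

bool SafeForceFlush(opentelemetry::sdk::trace::SpanProcessor &processor,
                    std::chrono::microseconds timeout)
{
  std::lock_guard<std::mutex> lock(flush_mutex);
  return processor.ForceFlush(timeout);
}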

owent (Member) commented Mar 6, 2024

Thanks. Internally only one thread should call this function, but it's still a problem when users call it from another thread.
I've been thinking about a lock-free way to solve this; adding a lock could also make the critical section long enough to cause more contention.
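For example, one lock-free scheme could use flush "tickets": each ForceFlush caller takes a ticket and waits (with its own deadline) until the worker has published a completed count at or past that ticket, so no waiter can consume another waiter's notification. A sketch only, not necessarily what the linked PR will do:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

struct FlushSync
{
  std::atomic<std::uint64_t> requested{0};  // tickets handed out to ForceFlush callers
  std::atomic<std::uint64_t> completed{0};  // highest ticket the worker has finished
};

// Called by ForceFlush after waking the worker thread.
bool WaitForFlush(FlushSync &sync, std::chrono::microseconds timeout)
{
  const std::uint64_t my_ticket =
      sync.requested.fetch_add(1, std::memory_order_acq_rel) + 1;
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (sync.completed.load(std::memory_order_acquire) < my_ticket)
  {
    if (std::chrono::steady_clock::now() >= deadline)
    {
      return false;  // timed out; the worker will still finish the flush eventually
    }
    std::this_thread::yield();
  }
  return true;
}

// Worker side (single worker thread): snapshot 'requested' before draining,
// export everything that was queued, then publish the snapshot:
//   const auto snapshot = sync.requested.load(std::memory_order_acquire);
//   /* drain queue and export */
//   sync.completed.store(snapshot, std::memory_order_release);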

marcalff added the triage/accepted label and removed the needs-triage label Mar 6, 2024
owent linked pull request #2584 on Mar 10, 2024 that will close this issue
github-actions bot commented May 8, 2024

This issue was marked as stale due to lack of activity.

github-actions bot added the Stale label May 8, 2024
marcalff changed the title from "BatchSpanProcessor::ForceFlush appears to be hanging forever" to "[SDK] BatchSpanProcessor::ForceFlush appears to be hanging forever" May 23, 2024
github-actions bot removed the Stale label May 26, 2024