RuntimeConfig: Immense Mutex Locking Times #156

Open
oxzi opened this issue Jan 29, 2024 · 0 comments

The config.RuntimeConfig struct ensures data integrity through the use of its mu sync.RWMutex.
When lots of events are created in bulk, e.g., after loading multiple incidents on a fresh Icinga configuration, this mutex results in long locking times and high CPU usage.
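
To illustrate the underlying pattern, here is a minimal, self-contained Go sketch (not taken from the code base): a reader holding the RWMutex across slow work delays a pending writer, and while the writer waits, every new reader is blocked as well, so the delays cascade.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Reader holding the read lock across slow work (think bcrypt or SQL).
	go func() {
		mu.RLock()
		time.Sleep(2 * time.Second)
		mu.RUnlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the reader grab its lock first

	// Writer (think applyPending): has to wait until the slow reader is done.
	go func() {
		start := time.Now()
		mu.Lock()
		fmt.Println("writer got the lock after", time.Since(start))
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the writer start waiting

	// New reader (think the next HTTP request): blocked behind the pending
	// writer, although it only asks for a read lock.
	start := time.Now()
	mu.RLock()
	fmt.Println("second reader got the lock after", time.Since(start))
	mu.RUnlock()
}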

I experienced this bug after starting with a fresh Icinga 2 configuration including lots of firing checks.
After configuring a new notification channel, it still had not appeared minutes later. Taking a look at the process showed huge CPU utilization and, after enabling debug logs, missing configuration reloads.

root@9d9171726bb8:/go/src/github.com/Icinga/icinga-notifications# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  998  3.0 3752980 1002368 ?     Ssl  09:00  78:16 /usr/local/bin/icinga-notifications-daemon --config /etc/icinga-notifications/config.yml
root        2519  0.0  0.0 1842300 22912 ?       Sl   09:05   0:00 /usr/libexec/icinga-notifications/channel/email
root        2525  0.3  0.0   4188  3328 pts/0    Ss   09:08   0:00 /bin/bash
root        2533  0.0  0.0   8088  4096 pts/0    R+   09:08   0:00 ps aux
:g/runtime-updates.*\(fetched configuration from database\|applied pending configuration\)
   73 2024-01-29T09:01:11.728Z  DEBUG runtime-updates fetched configuration from database {"took": "14.037396ms"}
   75 2024-01-29T09:01:11.728Z  DEBUG runtime-updates applied pending configuration {"took": "16.432µs"}
   96 2024-01-29T09:01:13.338Z  DEBUG runtime-updates fetched configuration from database {"took": "547.094244ms"}
  126 2024-01-29T09:01:34.164Z  DEBUG runtime-updates applied pending configuration {"took": "11.953µs"}
  144 2024-01-29T09:01:41.130Z  DEBUG runtime-updates fetched configuration from database {"took": "6.964389945s"}
  220 2024-01-29T09:05:52.842Z  DEBUG runtime-updates applied pending configuration {"took": "90.671µs"}
  248 2024-01-29T09:05:55.517Z  DEBUG runtime-updates fetched configuration from database {"took": "2.674626508s"}
  332 2024-01-29T09:09:47.432Z  DEBUG runtime-updates applied pending configuration {"took": "550.435µs"}
  357 2024-01-29T09:09:47.577Z  DEBUG runtime-updates fetched configuration from database {"took": "144.995215ms"}
  433 2024-01-29T09:09:52.194Z  DEBUG runtime-updates applied pending configuration {"took": "292.178µs"}
  459 2024-01-29T09:09:52.202Z  DEBUG runtime-updates fetched configuration from database {"took": "8.296999ms"}
  462 2024-01-29T09:09:52.203Z  DEBUG runtime-updates applied pending configuration {"took": "183.127µs"}

There is a delay of over four minutes between log lines 144 and 220 for a job that should be executed every second. Between those log messages, only locking the RuntimeConfig's mutex mu takes place.

A small patch was applied to log each mutex usage together with the call origin and the time it took to acquire the lock.

Logging Patch
diff --git a/internal/config/runtime.go b/internal/config/runtime.go
index ad6dd2e..af70651 100644
--- a/internal/config/runtime.go
+++ b/internal/config/runtime.go
@@ -13,6 +13,7 @@ import (
 	"github.com/jmoiron/sqlx"
 	"go.uber.org/zap"
 	"golang.org/x/crypto/bcrypt"
+	"runtime/debug"
 	"strconv"
 	"strings"
 	"sync"
@@ -86,13 +87,19 @@ func (r *RuntimeConfig) PeriodicUpdates(ctx context.Context, interval time.Durat
 }
 
 // RLock locks the config for reading.
-func (r *RuntimeConfig) RLock() {
+func (r *RuntimeConfig) RLock(start time.Time) {
 	r.mu.RLock()
+	r.logger.Debugw("RLock",
+		zap.String("trace", string(debug.Stack())),
+		zap.Duration("took", time.Since(start)))
 }
 
 // RUnlock releases a lock obtained by RLock().
-func (r *RuntimeConfig) RUnlock() {
+func (r *RuntimeConfig) RUnlock(start time.Time) {
 	r.mu.RUnlock()
+	r.logger.Debugw("RUnlock",
+		zap.String("trace", string(debug.Stack())),
+		zap.Duration("took", time.Since(start)))
 }
 
 func (r *RuntimeConfig) GetRecipient(k recipient.Key) recipient.Recipient {
@@ -147,8 +154,9 @@ func (r *RuntimeConfig) GetContact(username string) *recipient.Contact {
 // This method returns either a *Source or a nil pointer and logs the cause to the given logger. This is in almost all
 // cases a debug logging message, except when something server-side is wrong, e.g., the hash is invalid.
 func (r *RuntimeConfig) GetSourceFromCredentials(user, pass string, logger *logging.Logger) *Source {
-	r.RLock()
-	defer r.RUnlock()
+	start := time.Now()
+	r.RLock(start)
+	defer r.RUnlock(start)
 
 	sourceIdRaw, sourceIdOk := strings.CutPrefix(user, "source-")
 	if !sourceIdOk {
@@ -220,8 +228,14 @@ func (r *RuntimeConfig) fetchFromDatabase(ctx context.Context) error {
 }
 
 func (r *RuntimeConfig) applyPending() {
+	lockStart := time.Now()
 	r.mu.Lock()
-	defer r.mu.Unlock()
+	r.logger.Debugw("Lock", zap.Duration("took", time.Since(lockStart)))
+
+	defer func() {
+		r.mu.Unlock()
+		r.logger.Debugw("Unlock", zap.Duration("took", time.Since(lockStart)))
+	}()
 
 	r.logger.Debug("applying pending configuration")
 	start := time.Now()
diff --git a/internal/config/verify.go b/internal/config/verify.go
index d6b7bf4..446ba5d 100644
--- a/internal/config/verify.go
+++ b/internal/config/verify.go
@@ -7,12 +7,14 @@ import (
 	"github.com/icinga/icinga-notifications/internal/recipient"
 	"github.com/icinga/icinga-notifications/internal/rule"
 	"github.com/icinga/icinga-notifications/internal/timeperiod"
+	"time"
 )
 
 // debugVerify performs a set of config validity/consistency checks that can be used for debugging.
 func (r *RuntimeConfig) debugVerify() error {
-	r.RLock()
-	defer r.RUnlock()
+	start := time.Now()
+	r.RLock(start)
+	defer r.RUnlock(start)
 
 	if r.Channels == nil {
 		return errors.New("RuntimeConfig.Channels is nil")
diff --git a/internal/incident/incident.go b/internal/incident/incident.go
index b0d60aa..a55f515 100644
--- a/internal/incident/incident.go
+++ b/internal/incident/incident.go
@@ -107,8 +107,9 @@ func (i *Incident) ProcessEvent(ctx context.Context, ev *event.Event, created bo
 	i.Lock()
 	defer i.Unlock()
 
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	tx, err := i.db.BeginTxx(ctx, nil)
 	if err != nil {
@@ -187,8 +188,9 @@ func (i *Incident) RetriggerEscalations(ev *event.Event) {
 	i.Lock()
 	defer i.Unlock()
 
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	if !i.RecoveredAt.IsZero() {
 		// Incident is recovered in the meantime.
@@ -636,8 +638,9 @@ func (i *Incident) processAcknowledgementEvent(ctx context.Context, tx *sqlx.Tx,
 
 // RestoreEscalationStateRules restores this incident's rules based on the given escalation states.
 func (i *Incident) RestoreEscalationStateRules(states []*EscalationState) {
-	i.runtimeConfig.RLock()
-	defer i.runtimeConfig.RUnlock()
+	start := time.Now()
+	i.runtimeConfig.RLock(start)
+	defer i.runtimeConfig.RUnlock(start)
 
 	for _, state := range states {
 		escalation := i.runtimeConfig.GetRuleEscalation(state.RuleEscalationID)

The logs can be downloaded zipped here.

A first inspection of the new logs shows locking times of over four minutes.
[Plot lock-time.png, generated by the script below: lock waiting time in seconds per log line, logarithmic y axis]

Script
awk '
  BEGIN {
    print "set logscale y"
    print "set terminal png size 640, 480"
    print "set output \"lock-time.png\""
    print "plot \"-\" with lines title \"Lock Waiting, Seconds\""
  }

  function duration(ts,    sum, m) {
    sum = 0
    if (match(ts, /([0-9]+)m[0-9]+/, m))
      sum += strtonum(m[1]) * 60
    if (match(ts, /([0-9]+\.[0-9]+)s.*/, m))
      sum += strtonum(m[1])
    if (match(ts, /([0-9]+\.[0-9]+)ms.*/, m))
      sum += strtonum(m[1]) / 1000
    return sum
  }

  /runtime-updates[\t ]+(R|)Lock/ {
    match($0, /"took": "(.*)"/, m)
    print NR, duration(m[1])
  }
  ' runtime-log3 | \
  gnuplot -p

Taking a look at the (R|)Lock origins reveals only three places:

    271 UpdateFromDatabase->applyPending
    271 UpdateFromDatabase->debugVerify
   3734 Listener->ProcessEvent
Script
awk '
/runtime-updates.*Lock/ {
  if ($4 == "Lock") {
    print "UpdateFromDatabase->applyPending"
  } else if (index($0, "debugVerify")) {
    print "UpdateFromDatabase->debugVerify"
  } else if (index($0, "ProcessEvent")) {
    print "Listener->ProcessEvent"
  } else {
    print $0
  }
}' runtime-log3 | sort | uniq -c | sort -n

The identical count for both UpdateFromDatabase origins matches the fact that applyPending and debugVerify are called one after the other from UpdateFromDatabase, as stated above.
However, most locks originate in the Listener. A second look reveals two different locking points there:

   2695  github.com/icinga/icinga-notifications/internal/config.(*RuntimeConfig).GetSourceFromCredentials(?, {?, ?}, {?, ?}, ?)
   1039  github.com/icinga/icinga-notifications/internal/incident.(*Incident).ProcessEvent(?, {?, ?}, ?, ?)
Script
awk '
/runtime-updates.*RLock.*ProcessEvent/ {
  match($0, /"trace": "(.*)",/, m)

  trace = ""
  split(m[1], p, "\\\\n")
  for (i = 6; i < length(p); i++) {
    if (index(p[i], "\\t"))
      continue
    if (index(p[i], "(*Listener).ProcessEvent"))
      break
    trace = trace " " p[i]
  }

  gsub(/0x[0-9a-f]+/, "?", trace)
  print trace
}' runtime-log3 | sort | uniq -c

  • The (*RuntimeConfig).GetSourceFromCredentials method holds its RLock unnecessarily long, namely over the whole method call including the bcrypt verification. This could be narrowed down (see the sketch below).
  • (*Incident).ProcessEvent holds its RLock even longer, over a SQL statement and sub-methods.
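
A possible direction for the first point, as a rough sketch only (r.Sources and ListenerPasswordHash are placeholders here, not necessarily the real field names): hold the read lock just long enough to look up the source, then run the expensive bcrypt comparison outside the critical section.

func (r *RuntimeConfig) GetSourceFromCredentials(user, pass string, logger *logging.Logger) *Source {
	sourceIdRaw, sourceIdOk := strings.CutPrefix(user, "source-")
	if !sourceIdOk {
		logger.Debugw("credential user is missing the source- prefix", zap.String("user", user))
		return nil
	}
	sourceId, err := strconv.ParseInt(sourceIdRaw, 10, 64)
	if err != nil {
		logger.Debugw("cannot parse source id from credential user", zap.String("user", user), zap.Error(err))
		return nil
	}

	// Hold the read lock only for the lookup, not for the whole call.
	r.RLock()
	source := r.Sources[sourceId] // placeholder: id-indexed source map
	r.RUnlock()
	if source == nil {
		logger.Debugw("no source found for credential user", zap.Int64("id", sourceId))
		return nil
	}

	// The bcrypt comparison is the expensive part; with the read lock already
	// released, it no longer delays a pending writer such as applyPending().
	if err := bcrypt.CompareHashAndPassword([]byte(source.ListenerPasswordHash), []byte(pass)); err != nil {
		logger.Debugw("invalid credentials", zap.Int64("id", sourceId), zap.Error(err))
		return nil
	}
	return source
}

The same idea would apply to the second point: ProcessEvent could snapshot the rules and escalations it needs under a short RLock instead of holding the lock across the whole SQL transaction.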

Taking the caller into account might reveal a pattern:

[...]
190 97.3µs Listener->GetSourceFromCredentials
191 95.009µs Listener->GetSourceFromCredentials
326 22.086814835s UpdateFromDatabase->applyPending
329 21.887824507s Listener->GetSourceFromCredentials
330 19.199263445s Listener->GetSourceFromCredentials
[...]
1905 388.92µs Listener->GetSourceFromCredentials
1906 108.036µs Listener->GetSourceFromCredentials
3544 4m50.195727767s UpdateFromDatabase->applyPending
3552 4m49.201059997s Listener->GetSourceFromCredentials
3553 4m49.211130819s Listener->GetSourceFromCredentials
[...]
4570 9.854047633s Listener->GetSourceFromCredentials
4572 4m34.851601492s Listener->GetSourceFromCredentials
5641 3m23.225567006s UpdateFromDatabase->applyPending
5647 85µs UpdateFromDatabase->debugVerify
5648 1m27.909018073s Listener->GetSourceFromCredentials
Script
awk '
/runtime-updates[\t ]+(R|)Lock/ {
  match($0, /"took": "(.*)"/, m)
  took = m[1]

  if ($4 == "Lock") {
    origin = "UpdateFromDatabase->applyPending"
  } else if (index($0, "debugVerify")) {
    origin = "UpdateFromDatabase->debugVerify"
  } else if (index($0, "GetSourceFromCredentials")) {
    origin = "Listener->GetSourceFromCredentials"
  } else if (index($0, "ProcessEvent")) {
    origin = "Listener->ProcessEvent"
  } else {
    origin = $0
  }

  print NR, took, origin
}' runtime-log3

At least the first big spikes are related to applyPending and not so much to GetSourceFromCredentials, which is all over the place anyway. (Edit: Comparing only the points in time at which a mutex lock was acquired does not make sense on its own.)
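
As a possible follow-up for measuring this more directly (just a suggestion, not part of the patch above): Go's built-in mutex profiler attributes contention to the lock holders that caused the waiting instead of comparing acquisition timestamps. A minimal, stand-alone sketch, assuming a pprof HTTP endpoint is acceptable in a debug build:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
	"sync"
	"time"
)

func main() {
	// Sample roughly one in five mutex contention events.
	runtime.SetMutexProfileFraction(5)

	// Hypothetical debug listen address.
	go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()

	// Stand-in for the daemon's workload: goroutines fighting over one mutex.
	var mu sync.Mutex
	for i := 0; i < 4; i++ {
		go func() {
			for {
				mu.Lock()
				time.Sleep(10 * time.Millisecond)
				mu.Unlock()
			}
		}()
	}
	select {}
}

The resulting profile can then be fetched and inspected with go tool pprof http://localhost:6060/debug/pprof/mutex, which shows the cumulative delay per blocking call site.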
