Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Readloop processing time when using authorization #5255

Open
orchestd opened this issue Mar 28, 2024 · 8 comments
Open

Readloop processing time when using authorization #5255

orchestd opened this issue Mar 28, 2024 · 8 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@orchestd
Copy link

orchestd commented Mar 28, 2024

Observed behavior

A server running only few hundreds of connections (centos 9, 2 CPUs, 4GB RAM running on GCP), is working fine with this config:

listen: 0.0.0.0:4222
websocket {
  listen: 0.0.0.0:9222
  compression: false
  no_tls: true
}

When we add an auth section with 9 users, like so:

authorization {
  timeout: 1

  LIMITED_USER = {
    subscribe = ["somechan.>", "otherchan.>"]
    publish = {
        deny: ">"
    }
  }

  users [
    { user: user1, password: "xxx" }    
    { user: user2, password: "yyy" }
    { user: user3, password: "zzzz" }
....
    { user: user8, password: "....", permissions: $LIMITED_USER }
    { user: user9, password: "...." }
  ]
}

2 sec after starting the service we are getting

2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 4.322702242s

after a minute or so, it goes up

2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 17.422702242s

also, nats is taking almost 100% CPU, clients are disconnecting

We have upgraded the server (4 CPUs 16G RAM) and it did not solve the issue (although looks like Readloop processing time was less than 10 sec)

Removed the auth section - and all is back to normal.

Expected behavior

The server should run with no error, clients should connect easily

Server and client version

We tried both version (which worked before for months):
2.9.16
and
2.9.21

Host environment

Originally we used (for months)
GCP VM, running Centos 7.0 2 CPUs, 4GB RAM

When it failed, we moved to
GCP VM, running Centos 9.0 2 CPUs, 4GB RAM

When it failed, we upgraded to
GCP VM, running Centos 9.0 4 CPUs, 16GB RAM

None worked...

Steps to reproduce

No response

@orchestd orchestd added the defect Suspected defect such as a bug or regression label Mar 28, 2024
@wallyqs
Copy link
Member

wallyqs commented Mar 28, 2024

Hi @orchestd will check, but just to confirm that you only saw this in v2.9.X series?

@orchestd
Copy link
Author

orchestd commented Mar 28, 2024

Hi.

2.9.16 was the version we're using, changing the version to 2.9.21 was just "in case" the specific version we chose was buggy, we can try a diff one .

BTW, we have reconstructed the issue in a more vanilla manner

  1. wrote a small program that connect nats
  2. ran the nats server with auth - that has 1 user (alice:foo)
  3. ran the below code X 1000 times on parallel
  4. got the error

then we ran the same without auth - all was ok

code:

package main

import (
	"fmt"
	"github.com/nats-io/nats.go"
	"time"
)


func main() {
	testConnection("defaultBackend", "nats://some.server:4222", "alice", "foo")

}

func testConnection(entity string, url string, username string, password string) {

	nc, err := nats.Connect(url, nats.UserInfo(username, password), nats.Name(entity)
	if err != nil {
		fmt.Println("ERROR: " + err.Error())
	}

	defer nc.Close()

}

@orchestd
Copy link
Author

orchestd commented Apr 1, 2024

Update:
changing

authorization {
  timeout: 1

to

authorization {
  timeout: 8

"solved" the issue, but now its a risk of false authorization window

@orchestd
Copy link
Author

Any ideas on how can I avoid the Readloop processing time issue without setting authorization timeout to a large number which is a security risk ?

@derekcollison
Copy link
Member

Can you share the complete server config file that accompanies the test that shows the issue?

@orchestd
Copy link
Author

orchestd commented Apr 24, 2024

The COMPLETE configuration is on the first post (nothing fancy needed on config - its the bare minimum)

reposting here:

listen: 0.0.0.0:4222

websocket {
  listen: 0.0.0.0:9222
  compression: false
  no_tls: true
}

authorization {
    timeout: 4
    
    SOME_PERM = {
    subscribe = ["xxx.>", "yyy.>"]
    publish = {
    deny: ">"
    }
    }
    
    users [
    { user: super, password: "....." }
    { user: regular, password: "....", permissions: $SOME_PERM }
    ]
    }

@derekcollison
Copy link
Member

So I used your config file and your program and ran it in parallel 100 times with no issues.

So we must be doing something different. So I am running a v2.10.15 server with the config above and the program (changed the user/pass). So could you show exactly what command you used to run the program and the exact program code used?

@orchestd
Copy link
Author

orchestd commented May 8, 2024

We run
nats-server -c server.conf
no extra params

As for the code, we have various services, including WS frontend
When we look at the error
2024/03/28 11:25:51.523011 [WRN] x.y.z.a:10998 - cid:76885 - Readloop processing time: 17.422702242s

most of the x.y.z.a IP addresses are go backends using this lib to connect to nats
https://github.com/orchestd/nats.io
using NewNatsServiceWithBasicAuth
and then connect:

func (n *defaultNatsService) Connect(natsUrl, serviceName string, authOpt nats.Option,
	connectionAttempts, reconnectionAttempts, reconnectWaitSec, maxPingsOutstanding, pingIntervalSec int) error {
	opts := []nats.Option{nats.Name(serviceName)}
	opts = append(opts, authOpt)

	opts = append(opts, nats.ReconnectWait(time.Duration(reconnectWaitSec)*time.Second))

	opts = append(opts, nats.MaxReconnects(reconnectionAttempts))

	opts = append(opts, nats.MaxPingsOutstanding(maxPingsOutstanding))

	opts = append(opts, nats.PingInterval(time.Duration(pingIntervalSec)*time.Second))

	opts = append(opts, nats.ErrorHandler(func(_ *nats.Conn, s *nats.Subscription, err error) {

		n.logger.Error(context.Background(), n.formatErrorMsg("nats.ErrorHandler fired", err))
	}))

	opts = append(opts, nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		n.logger.Error(context.Background(), n.formatErrorMsg("nats.DisconnectErrHandler fired", err))
	}))

	opts = append(opts, nats.ReconnectHandler(func(nc *nats.Conn) {
		n.logger.Warn(context.Background(), n.formatErrorMsg("nats.ReconnectHandler fired", nil))
	}))

	err := n.connect(natsUrl, opts)
	if err != nil {
		n.logger.Error(context.Background(), n.formatErrorMsg("can't connect to nats server", err))
		if connectionAttempts < 2 {
			return err
		}
		go func() {
			var connectionError error
			for i := 1; i <= connectionAttempts; i++ {
				time.Sleep(time.Duration(reconnectWaitSec) * time.Second)
				connectionError = n.connect(natsUrl, opts)
				if connectionError == nil {
					break
				}
				n.logger.Error(context.Background(), n.formatErrorMsg("can't connect to nats server attempt - "+fmt.Sprint(i), connectionError))
			}

			connectionError = fmt.Errorf(n.formatErrorMsg(fmt.Sprintf("can't connect to nats server after %v attempts. Stop trying.", connectionAttempts), connectionError))

			if n.connectionFailedHandler != nil {
				n.connectionFailedHandler(connectionError)
			}
			n.logger.Error(context.Background(), connectionError.Error())
		}()
		return nil
	}
	return nil
}

Hope this sheds more light

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect Suspected defect such as a bug or regression
Projects
None yet
Development

No branches or pull requests

3 participants