Force Init Connection #520

Open
khepin opened this issue May 26, 2022 · 7 comments

khepin commented May 26, 2022

Description

I wonder whether there is already a way to force-initialize the connection that I didn't find, or whether we could add one.

context

I receive data in an HTTP request. Based on that data, I have some processing to do and a couple of services to contact. Once that's done, I have a couple of messages that I need to push to a Kafka topic.

Because the Kafka cluster I'm talking to is in a separate datacenter, I really feel the lack of persistent connections: a call to ->produce() followed by ->flush() takes roughly 300ms to complete.

However, if I call ->produce() with a bogus message early in the process, before I do my local processing and call the other services, then by the time I'm ready to produce my actual messages the connection is already established and it only takes 50ms.
Because calls to ->produce() hand the message to a separate thread, my main program isn't blocked at any point and no time is spent waiting here.

The only way I've found to make this happen is to produce a message early on, but that means having a bunch of bogus messages on my topic. I was wondering if there is any other way to prep the connection in the background, without blocking the PHP program, so that it's fully available when needed.

thanks

No matter the result of this discussion, I wanted to thank you for the hard work put into this extension and for enabling Kafka in PHP. I appreciate it very much.

khepin added the feature label May 26, 2022
Collaborator

nick-zh commented May 26, 2022

That's interesting. I am pretty sure the connection is established when you create the producer. It could be that some other things (such as topic metadata) are only loaded when you produce the first message to that topic. The only thing that comes to mind: you could try calling poll first (this fetches producer events).

Author

khepin commented May 26, 2022

I had tried poll and it did not have the same effect. I tried both:

$rk->poll(0);
sleep(1);
$topic->produce(...);
$rk->flush();

To see if the async poll call would work. And:

$rk->poll(10_000);
$topic->produce(...);
$rk->flush();

And in both cases there was no latency improvement on the first produce + flush call.

Collaborator

nick-zh commented May 26, 2022

I see. Then I am not sure what you could do to improve this.
If you change your broker to a wrong address, you will see connection errors, which indicates that the producer connects upon creation.
Maybe the librdkafka community (they are on Gitter) could shed some light on what could cause this.
This extension is just a wrapper around that C library. If there is a way to optimize this behaviour, it could be adopted in this extension.

Collaborator

nick-zh commented May 26, 2022

So I tried to reproduce this, but I get flush times of around 300 to 400ms regardless of how many messages I produce: 1, 2, or even 100.
If I wait 300ms before calling flush, the time drops to 50ms.
Now why is that? Because flush actually waits until all messages are sent to the broker and replicated properly (according to your settings). Waiting for 300ms let the broker do its job, so the queue was already empty and flush was more or less a no-op.
If you can consistently reproduce your 300ms versus 50ms behaviour when sending just 1 or 2 messages, I would indeed also be very curious about the findings 😄
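
The wait-then-flush experiment described above could be sketched roughly as follows. This is not the code that was actually run in the thread: the function name time_first_flush, the KAFKA_BROKERS environment variable, and the topic name are hypothetical, and a fresh producer is created for every measurement so that an earlier flush cannot warm up a later one.

```php
<?php
// Sketch only: needs the rdkafka extension and a reachable broker to run the
// measurements. KAFKA_BROKERS and the topic name are hypothetical placeholders.

/**
 * Create a fresh producer, produce one message, wait $waitMs,
 * then return how long flush() takes, in milliseconds.
 */
function time_first_flush(string $brokers, string $topicName, int $waitMs): float
{
    $conf = new RdKafka\Conf();
    $conf->set('bootstrap.servers', $brokers);
    $rk = new RdKafka\Producer($conf);
    $topic = $rk->newTopic($topicName);

    // produce() only enqueues; delivery happens in a background thread.
    $topic->produce(RD_KAFKA_PARTITION_UA, 0, "message");
    usleep($waitMs * 1000);          // let the background thread work

    $start = hrtime(true);           // monotonic clock, nanoseconds
    $rk->flush(1000);                // blocks until the queue drains or 1s passes
    return (hrtime(true) - $start) / 1e6;
}

$brokers = getenv('KAFKA_BROKERS');
if (extension_loaded('rdkafka') && $brokers) {
    printf("flush, no wait: %.0fms\n", time_first_flush($brokers, 'test', 0));
    printf("flush, 300ms wait: %.0fms\n", time_first_flush($brokers, 'test', 300));
}
```

If the second number is much smaller than the first, the time is being spent on delivery in the background thread rather than inside flush() itself.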

Author

khepin commented May 26, 2022

I didn't play with the number of messages. I just noticed that if something has been flushed before, the next call is much faster, no matter whether that flush was triggered manually or happened asynchronously in the background thread.

Example:

dump_time(function () use ($topic, $rk) {
    $topic->produce(RD_KAFKA_PARTITION_UA, 0, "message");
    $rk->flush(1000);
}); // 297ms
dump_time(function () use ($topic, $rk) {
    $topic->produce(RD_KAFKA_PARTITION_UA, 0, "message");
    $rk->flush(1000);
}); // 53ms
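
The dump_time() helper is never defined in the thread. A minimal sketch, assuming it simply runs a callable and prints the elapsed wall-clock time in milliseconds (the return value is an addition here, for convenience), might look like this:

```php
<?php
// Hypothetical helper: the thread does not show how dump_time() is implemented.
function dump_time(callable $fn): float
{
    $start = hrtime(true);                        // monotonic clock, nanoseconds
    $fn();
    $elapsedMs = (hrtime(true) - $start) / 1e6;   // nanoseconds to milliseconds
    printf("%.0fms\n", $elapsedMs);
    return $elapsedMs;
}

// Usage with a stand-in workload instead of produce() + flush().
$ms = dump_time(function () {
    usleep(20_000); // sleep for 20 ms
});
```

hrtime() is used rather than microtime() because it is monotonic and unaffected by system clock adjustments.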

Collaborator

nick-zh commented May 26, 2022

Ah, I see. So you shouldn't flush after every message, only after the last one 😄

Author

khepin commented May 27, 2022

I get that. The first call to produce + flush is only there to contrast the timings.
The first latency is what I'm effectively getting right now.
The second one shows what is possible once everything is fully initialized. That is why I was looking for a way to ensure that everything is fully initialized asynchronously. That way, if you initialize things early in a request, by the time you call flush you'd benefit from the lower latency.
