
refresh site creds on file fetcher processes #613

Merged
merged 1 commit into main from kathiabarahona/refresh_site_creds on May 7, 2024

Conversation

@kathia-barahona (Contributor) commented Jan 31, 2024

About this change - What it does

pghoard dispatches processes in charge of fetching files from sites. When starting such processes, pghoard provides its config as an argument, meaning that if pghoard gets restarted with a different config (e.g. the object storage got new credentials), the running file fetcher processes won't pick this up and will keep using the old config.

To change this behavior, it's better to provide the current config directly on each task instead of passing it to the process at startup. This way the process can update the transfer to the object storage whenever the site's config changes.
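
Roughly, the idea can be sketched as below (a hypothetical sketch; create_transfer, the task dict layout, and the queue wiring are illustrative names, not pghoard's actual internals):

# Hypothetical sketch of the per-task config idea, not pghoard's actual code.
import multiprocessing
from typing import Any, Dict


def create_transfer(site_config: Dict[str, Any]) -> Any:
    """Stand-in for building an object storage transfer client from a site config."""
    return object()


def file_fetcher(task_queue: multiprocessing.Queue, result_queue: multiprocessing.Queue) -> None:
    transfers: Dict[str, Any] = {}
    known_configs: Dict[str, Dict[str, Any]] = {}
    while (task := task_queue.get()) is not None:
        site = task["site"]
        site_config = task["config"]  # the config now travels with each task, not with the process
        if known_configs.get(site) != site_config:
            # Rebuild the transfer when the site's config changed, e.g. rotated credentials
            transfers[site] = create_transfer(site_config)
            known_configs[site] = site_config
        # ... fetch task["filename"] from object storage using transfers[site] ...
        result_queue.put(task["filename"])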

Resolves: #BF-2385

@kathia-barahona force-pushed the kathiabarahona/refresh_site_creds branch from 0bb58ba to 1e64714 on February 1, 2024 11:09
@rdunklau (Contributor) left a comment

Since those Queues are based on pipes, it means we will pipe a lot of redundant data on each IPC, which may hurt performance as it will need to be serialized / deserialized.
Also, I'm afraid we will have errors this way: the old tasks will still use the old config.

Couldn't we signal those processes to reload their config from disk whenever our own config changes? Or use another IPC mechanism to send the new config.
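
For illustration, a reload-on-signal variant could look roughly like this (a hypothetical sketch; CONFIG_PATH, worker_loop, and the task handling are made-up names, and pghoard does not necessarily work this way):

# Hypothetical sketch of "signal the process to reload its config from disk".
import json
import signal

CONFIG_PATH = "pghoard.json"  # illustrative path to the shared config file

_reload_requested = False


def _request_reload(signum, frame):
    global _reload_requested
    _reload_requested = True


def worker_loop(task_queue):
    global _reload_requested
    signal.signal(signal.SIGHUP, _request_reload)
    with open(CONFIG_PATH) as f:
        config = json.load(f)
    while (task := task_queue.get()) is not None:
        if _reload_requested:
            with open(CONFIG_PATH) as f:
                config = json.load(f)  # pick up rotated credentials before the next task
            _reload_requested = False
        # ... process the task using the possibly refreshed config ...

The parent process would then send SIGHUP to the fetcher processes (e.g. os.kill(child.pid, signal.SIGHUP)) whenever its own config changes.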

@rikonen (Contributor) commented Feb 8, 2024

> Since those Queues are based on pipes, it means we will pipe a lot of redundant data on each IPC, which may hurt performance as it will need to be serialized / deserialized.

Is this really a problem? I thought the same myself, but the full config should be in the range of a few kilobytes, and the cost of just sending it all over does not seem especially high unless I'm missing some big inefficiency here.

@rdunklau (Contributor) commented Feb 9, 2024

Looks like a default config on a fresh service, once pickled, is in the 5 kB range.
Compared to the other members of that tuple, that's huge, so the performance impact should at least be evaluated.
But regardless of that point, the other one still stands: what happens to all previously enqueued tasks, which will be fetched using the wrong config?
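
(For reference, such a size check can be done with a quick snippet along these lines; the config file path is just an example.)

import json
import pickle

with open("pghoard.json") as f:
    config = json.load(f)
# Approximate number of bytes that would ride along with every task
print(len(pickle.dumps(config)))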

For minimal changes, we could use a separate queue for config update events: checking it without blocking means we could update the config at any time, and all subsequent tasks would use the new config.
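
A rough sketch of that variant, with hypothetical names; the non-blocking check relies on queue.Empty being raised by get_nowait():

import queue  # multiprocessing/manager queues raise queue.Empty on an empty non-blocking get


def file_fetcher(task_queue, config_queue, result_queue, config):
    while (task := task_queue.get()) is not None:
        # Drain pending config updates without blocking; keep only the newest one
        try:
            while True:
                config = config_queue.get_nowait()
        except queue.Empty:
            pass
        # ... fetch the file described by the task using the current config ...
        result_queue.put(task["filename"])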

@rikonen (Contributor) commented Feb 9, 2024

The acute problem being fixed here is that pghoard does not refresh keys at all, so key rotation cannot be completed without restarting the application. For that particular problem it is irrelevant if the old key is used a bit longer, as there is anyway a long grace period of key inactivity before the old key is disabled. For cases where the storage location actually changes this could be more relevant, though in that case too the exact point when data starts getting read from / written to the other location is arbitrary, and touching the queued events does not feel like it would have a big impact one way or the other.

A separate queue would work, but I'd check the actual performance impact first. My guess is the overhead is low single-digit milliseconds, which would probably be acceptable.

@rdunklau (Contributor) commented Feb 9, 2024

I'm fine with both of those points, as long as they are considered :-)

@rikonen (Contributor) commented Apr 15, 2024

Is there a plan to work on this? The problem is still very valid.

@kathia-barahona (Contributor, Author) commented

> Is there a plan to work on this? The problem is still very valid.

Hi! Sorry, I had to pause this task for a bit. I agreed with @rdunklau that I'll measure how much this affects restoration; will give it some prio.

@kathia-barahona (Contributor, Author) commented

@rdunklau I did multiple test runs and measured restoration times. I saw no major difference after including these changes. Considered DB sizes (MB): 100, 300, 600, 1000. The dataset was not super big, but I don't think it would have a bigger impact at larger sizes.

@rikonen (Contributor) commented Apr 18, 2024

It was unlikely that the performance would be so much worse that it would be visible unless you used a very heavy stress test. In this case a synthetic test should be quite sufficient, because you can easily simulate the config being passed as part of the tasks or not being passed there. For example, the following test app should work:

import json
import multiprocessing
import time


def task_handler(task_queue: multiprocessing.Queue, result_queue: multiprocessing.Queue) -> None:
	while (task := task_queue.get()):
		id_ = task["id"]
		result_queue.put(id_)


def main() -> None:
	manager = multiprocessing.Manager()
	result_queue = manager.Queue()
	task_queue = manager.Queue()
	process = multiprocessing.Process(target=task_handler, args=(task_queue, result_queue))
	process.start()
	# Wait for one message to be processed to ensure target process is running normally
	task_queue.put({"id": -1})
	result_queue.get()
	with open("pghoard.json") as f:
		full_config = json.load(f)
	start_time = time.monotonic()
	task_count = 10_000
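	# Each task carries the full config dict, simulating the per-task config passing proposed in this PR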
	for id_ in range(task_count):
		task_queue.put({"id": id_, "config": full_config})
	task_queue.put(None)
	for _ in range(task_count):
		result_queue.get()
	duration_ms = (time.monotonic() - start_time) * 1000
	time_per_task_ms = duration_ms / task_count
	print(f"Processed {task_count} tasks in {duration_ms:.1f} milliseconds; {time_per_task_ms:.2f} milliseconds")


if __name__ == "__main__":
	main()

This consistently gives me 0.11 ms processing time per task when not passing the full config and 0.13 ms when passing it. So the overhead isn't even a full millisecond, but rather about 0.02 milliseconds, which is completely negligible given that the actual task processing is way heavier than the 0.11 ms no-op time.

@rdunklau does this validation seem sufficient to you?

@alexole alexole merged commit 58b56f7 into main May 7, 2024
7 checks passed
@alexole alexole deleted the kathiabarahona/refresh_site_creds branch May 7, 2024 08:57
@rdunklau (Contributor) commented May 7, 2024

Sorry I missed the previous comments.
