
refresh site creds on file fetcher processes #613

Merged
merged 1 commit into main from kathiabarahona/refresh_site_creds on May 7, 2024

Conversation

@kathia-barahona (Contributor) commented Jan 31, 2024

About this change - What it does

pghoard dispatches processes in charge of fetching files from sites. When starting such processes, pghoard provides its config as an argument, meaning that if pghoard gets restarted with a different config (e.g. the object storage got new credentials), the running file fetcher processes won't pick this up and will keep using the old config.

To change this behavior, it's better to provide the current config directly on each task instead of passing it to the process at startup. This way the process can update the transfer to the object storage whenever the site's config changes.
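
Roughly, the idea can be sketched as below (a hypothetical sketch; create_transfer, the task dict layout, and the queue wiring are illustrative names, not pghoard's actual internals):

# Hypothetical sketch of the per-task config idea, not pghoard's actual code.
import multiprocessing
from typing import Any, Dict


def create_transfer(site_config: Dict[str, Any]) -> Any:
    """Stand-in for building an object storage transfer client from a site config."""
    return object()


def file_fetcher(task_queue: multiprocessing.Queue, result_queue: multiprocessing.Queue) -> None:
    transfers: Dict[str, Any] = {}
    known_configs: Dict[str, Dict[str, Any]] = {}
    while (task := task_queue.get()) is not None:
        site = task["site"]
        site_config = task["config"]  # the config now travels with each task, not with the process
        if known_configs.get(site) != site_config:
            # Rebuild the transfer when the site's config changed, e.g. rotated credentials
            transfers[site] = create_transfer(site_config)
            known_configs[site] = site_config
        # ... fetch task["filename"] from object storage using transfers[site] ...
        result_queue.put(task["filename"])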

Resolves: #BF-2385

@kathia-barahona force-pushed the kathiabarahona/refresh_site_creds branch from 0bb58ba to 1e64714 on February 1, 2024 11:09
@rdunklau (Contributor) left a comment

Since those Queues are based on pipes, it means we will pipe a lot of redundant data on each IPC, which may hurt performance as it will need to be serialized / deserialized.
Also, I'm afraid we will have errors this way: the old tasks will still use the old config.

Couldn't we signal those processes to reload their config from disk whenever our own config changes? Or use another IPC mechanism to send the new config.
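
For illustration, a reload-on-signal variant could look roughly like this (a hypothetical sketch; CONFIG_PATH, worker_loop, and the task handling are made-up names, and pghoard does not necessarily work this way):

# Hypothetical sketch of "signal the process to reload its config from disk".
import json
import signal

CONFIG_PATH = "pghoard.json"  # illustrative path to the shared config file

_reload_requested = False


def _request_reload(signum, frame):
    global _reload_requested
    _reload_requested = True


def worker_loop(task_queue):
    global _reload_requested
    signal.signal(signal.SIGHUP, _request_reload)
    with open(CONFIG_PATH) as f:
        config = json.load(f)
    while (task := task_queue.get()) is not None:
        if _reload_requested:
            with open(CONFIG_PATH) as f:
                config = json.load(f)  # pick up rotated credentials before the next task
            _reload_requested = False
        # ... process the task using the possibly refreshed config ...

The parent process would then send SIGHUP to the fetcher processes (e.g. os.kill(child.pid, signal.SIGHUP)) whenever its own config changes.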

@rikonen (Contributor) commented Feb 8, 2024

> Since those Queues are based on pipes, it means we will pipe a lot of redundant data on each IPC, which may hurt performance as it will need to be serialized / deserialized.

Is this really a problem? I thought the same myself, but the full config should be in the range of a few kilobytes, and the cost of just sending it all over does not seem especially high unless I'm missing some big inefficiency here.

@rdunklau (Contributor) commented Feb 9, 2024

Looks like a default config on a fresh service, once pickled, is in the 5 kB range.
Compared to the other members of that tuple, that's huge, so the performance impact should at least be evaluated.
But regardless of that point, the other one still stands: what happens to all previously enqueued tasks, which will be fetched using the wrong config?
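
(For reference, such a size check can be done with a quick snippet along these lines; the config file path is just an example.)

import json
import pickle

with open("pghoard.json") as f:
    config = json.load(f)
# Approximate number of bytes that would ride along with every task
print(len(pickle.dumps(config)))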

For minimal changes, we could use a separate queue for config update events: checking it without blocking means we could update the config at any time, and all subsequent tasks would use the new config.
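
A rough sketch of that variant, with hypothetical names; the non-blocking check relies on queue.Empty being raised by get_nowait():

import queue  # multiprocessing/manager queues raise queue.Empty on an empty non-blocking get


def file_fetcher(task_queue, config_queue, result_queue, config):
    while (task := task_queue.get()) is not None:
        # Drain pending config updates without blocking; keep only the newest one
        try:
            while True:
                config = config_queue.get_nowait()
        except queue.Empty:
            pass
        # ... fetch the file described by the task using the current config ...
        result_queue.put(task["filename"])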

@rikonen (Contributor) commented Feb 9, 2024

The acute problem being fixed here is that pghoard does not refresh keys at all, so key rotation cannot be completed without restarting the application. For that particular problem it is irrelevant if the old key is used a bit longer, as there is anyway a long grace period of key inactivity before the old key is disabled. For cases where the storage location actually changes this could be more relevant, though in that case too the exact point when data starts getting read from / written to the other location is arbitrary, and touching the queued events does not feel like it would have a big impact one way or the other.

A separate queue would work, but I'd check the actual performance impact first. My guess is the overhead is low single-digit milliseconds, which would probably be acceptable.

@rdunklau (Contributor) commented Feb 9, 2024

I'm fine with both of those points, as long as they are considered :-)

@rikonen (Contributor) commented Apr 15, 2024

Is there a plan to work on this? The problem is still very valid.

@kathia-barahona (Contributor, Author) commented

> Is there a plan to work on this? The problem is still very valid.

Hi! Sorry, I had to pause this task for a bit. I agreed with @rdunklau that I'll measure how much this affects restoration; will give it some prio.

@kathia-barahona (Contributor, Author) commented

@rdunklau I did multiple test runs and measured restoration times. I saw no major difference after including these changes. Considered DB sizes (MB): 100, 300, 600, 1000. The dataset was not super big, but I don't think it would have a bigger impact at larger sizes.

@rikonen (Contributor) commented Apr 18, 2024

It was unlikely that the performance would be so much worse that it would be visible unless you used a very heavy stress test. In this case a synthetic test should be quite sufficient, because you can easily simulate the config being passed as part of the tasks or not being passed there. For example, the following test app should work:

import json
import multiprocessing
import time


def task_handler(task_queue: multiprocessing.Queue, result_queue: multiprocessing.Queue) -> None:
	while (task := task_queue.get()):
		id_ = task["id"]
		result_queue.put(id_)


def main() -> None:
	manager = multiprocessing.Manager()
	result_queue = manager.Queue()
	task_queue = manager.Queue()
	process = multiprocessing.Process(target=task_handler, args=(task_queue, result_queue))
	process.start()
	# Wait for one message to be processed to ensure target process is running normally
	task_queue.put({"id": -1})
	result_queue.get()
	with open("pghoard.json") as f:
		full_config = json.load(f)
	start_time = time.monotonic()
	task_count = 10_000
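	# Each task carries the full config dict, simulating the per-task config passing proposed in this PR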
	for id_ in range(task_count):
		task_queue.put({"id": id_, "config": full_config})
	task_queue.put(None)
	for _ in range(task_count):
		result_queue.get()
	duration_ms = (time.monotonic() - start_time) * 1000
	time_per_task_ms = duration_ms / task_count
	print(f"Processed {task_count} tasks in {duration_ms:.1f} milliseconds; {time_per_task_ms:.2f} milliseconds")


if __name__ == "__main__":
	main()

This consistently gives me 0.11 ms processing time per task when not passing the full config and 0.13 ms when passing it. So the overhead isn't even a full millisecond, but rather about 0.02 milliseconds, which is completely negligible given that the actual task processing is way heavier than the 0.11 ms no-op time.

@rdunklau does this validation seem sufficient to you?

@alexole alexole merged commit 58b56f7 into main May 7, 2024
7 checks passed
@alexole alexole deleted the kathiabarahona/refresh_site_creds branch May 7, 2024 08:57
@rdunklau (Contributor) commented May 7, 2024

Sorry I missed the previous comments.
