
Possible data race when saving large results in file storage #869

Open
alex-van-vliet opened this issue Jun 3, 2022 · 3 comments

alex-van-vliet commented Jun 3, 2022

Hi,

When returning a large result from the main function (one that takes several seconds to serialize), there seems to be a data race on the run_entry object while the heartbeat is saving the file.

For example, with:

import numpy as np
import sacred
from sacred.observers import FileStorageObserver

experiment = sacred.Experiment('Exp')
experiment.observers.append(FileStorageObserver('tmp'))

@experiment.automain
def main():
    return np.random.rand(10000, 10000)

I get the following error:

Warning - Exp - An error ocurred in the '<...FileStorageObserver ...>' observer: dictionary changed size during iteration
Warning - Exp - The observer '<...FileStorageObserver...>' failed at some point during the run.

I am using sacred 0.8.2.

My understanding of the issue is the following:

There is a heartbeat thread; that heartbeat triggers the heartbeat_event in the observer, which calls save_json on the run_entry. The problem is that this takes some time, so while save_json is still flattening the run_entry, the job completes and calls completed_event, which adds stop_time to the run_entry. Since stop_time was not present before, the run_entry gains a new key, and the flatten from the heartbeat fails because it is still iterating over the run_entry.
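
For reference, the "dictionary changed size during iteration" failure can be reproduced outside of sacred with two threads sharing a dict (a minimal sketch; the flatten function below just stands in for the iteration that save_json performs over the run_entry):

import threading
import time

run_entry = {"status": "RUNNING", "result": None}

def flatten(d):
    # Stand-in for the serialization walk over run_entry: iterating the
    # dict while another thread inserts a key raises RuntimeError.
    out = {}
    for key, value in d.items():
        time.sleep(0.1)  # pretend each entry is slow to serialize
        out[key] = value
    return out

def heartbeat():
    try:
        flatten(run_entry)
    except RuntimeError as e:
        print("heartbeat failed:", e)  # dictionary changed size during iteration

t = threading.Thread(target=heartbeat)
t.start()
time.sleep(0.05)
run_entry["stop_time"] = time.time()  # what completed_event effectively does
t.join()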

Now technically, because of the exception handling, the completed_event will overwrite the run.json anyway, so the file does not seem to be corrupted. But this could still create other issues I am not aware of.

@thequilo
Collaborator

thequilo commented Jun 3, 2022

Hi @alex-van-vliet!
Thanks for reporting. From your description, this can also happen for a smaller run_entry; it just becomes more likely the longer serialization takes.

I think the only viable solution would be a lock. This would have the drawback that it locks the main thread when an event overlaps with saving the heartbeat event. Would you be willing to implement/test this and provide a pull request?
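
For illustration, the lock could look roughly like this (only a sketch with made-up names, not sacred's actual internals; Run, JsonObserver, and the _emit_* methods here are toy stand-ins):

import json
import threading
import time

class JsonObserver:
    # Toy stand-in for FileStorageObserver: serializing the entry is the
    # part that must not overlap with mutations of run_entry.
    def heartbeat_event(self, entry):
        json.dumps(entry)

    def completed_event(self, entry):
        json.dumps(entry)

class Run:
    # Toy stand-in for sacred's Run, just to show where the lock would go.
    def __init__(self, observers):
        self.run_entry = {"status": "RUNNING"}
        self.observers = observers
        self._entry_lock = threading.Lock()  # hypothetical, not in sacred

    def _emit_heartbeat(self):
        # Heartbeat thread: serialize only while holding the lock, so no
        # other event can add keys to run_entry mid-iteration.
        with self._entry_lock:
            for observer in self.observers:
                observer.heartbeat_event(self.run_entry)

    def _emit_completed(self, result):
        # Main thread: mutations happen under the same lock, so this blocks
        # until an in-flight heartbeat save has finished (the drawback
        # mentioned above).
        with self._entry_lock:
            self.run_entry["result"] = result
            self.run_entry["stop_time"] = time.time()
            for observer in self.observers:
                observer.completed_event(self.run_entry)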

BTW, what are you saving that is so large that it takes several seconds to serialize? Maybe another way of saving your results could solve the issue for you.
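
For example (a sketch reusing the example from the top of the issue, assuming the result can live on disk), the array could be written to a file and attached with add_artifact instead of being returned, so the run_entry itself stays small:

import os
import tempfile

import numpy as np
import sacred
from sacred.observers import FileStorageObserver

experiment = sacred.Experiment('Exp')
experiment.observers.append(FileStorageObserver('tmp'))

@experiment.automain
def main():
    result = np.random.rand(10000, 10000)
    # Save the large array to a file and register it as an artifact
    # instead of returning it from main().
    path = os.path.join(tempfile.mkdtemp(), 'result.npy')
    np.save(path, result)
    experiment.add_artifact(path)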

@alex-van-vliet
Author

Yes, that is true.

I am not familiar enough with the codebase: could another event clash with the heartbeat? If not, maybe completed could be triggered only after the heartbeat thread has been joined? In that sense, this seems similar to #345. I agree that one solution could be to remove the timeout when stopping the heartbeat, but I don't think killing the thread is a good option, as that could corrupt data.
Looking at the interface of the observer, maybe artifact_event/resource_event could also have a data race with the heartbeat? Then the above solution would not fix this issue and a lock would be needed anyway.

We would need to lock every loop that iterates over the observers (basically all the emit functions), right, to prevent one observer from receiving a different value than another? Then we could probably also remove the timeout, because if the heartbeat blocks it will never release the lock and the experiment will never finish anyway. But this raises another question from #345: what if the observer is reused in another experiment and the heartbeat from one run clashes with an event from the second run? (I have never had this use case, so I can't really comment.)

I don't have the time right now to implement this, but I can try when I do.

Just a lot of segmentations for further analysis. Thanks to the exception handling, the completed event is called anyway and saves the results properly, so in this case no data is corrupted, but I could indeed also save them to a file.

@vnmabus
Contributor

vnmabus commented Aug 23, 2023

This could explain the behavior that I and others have observed, where the info dict is sometimes not fully saved after completion.
