ConfigPackageUtility misses atomic file operations and #10055
Labels
area/api
REST API
area/configuration
DSL, parser, compiler, error handling
area/distributed
Distributed monitoring (master, satellites, clients)
bug
Something isn't working
ref/IP
In https://github.com/Icinga/icinga2/blob/v2.14.2/lib/remote/configpackageutility.cpp, files are written in-place using a basic
std::ofstream
. So if the Icinga 2 process is terminated while a file is being written, this can result in an incomplete file that will still be picked up when the configuration is loaded, resulting in a possibly incomplete or otherwise broken configuration. Additionally, as nofsync()
or similar is done, other corruption may happen if the whole system crashes (I think we hat such problems in the past on Windows in particular).Some of the
std::ofstream
should probably simply be replaced by our ownAtomicFile
, for example when writing files like/var/lib/icinga2/api/packages/director/active.conf
.However, if
AtomicFile
would be used everywhere, especially where the actual config package contents are written, this will probably slow down writing all the files quite a bit as it would enforce a sequential "write one file, sync it to disk, write the next one, ..." pattern. Keeping thestd::ofstream
there and syncing all files afterwards should allow the operating system to combine the sync operations, making them cheaper.1 The whole config stage directory tree can't be written atomically, so the atomic commit at the end should be replacing the active stage.There have been a few reports of broken config packages in the past, the problem described here probably is their cause.
ref/IP/532402
Footnotes
I did not specifically test this, but I expect that if
fsync()
calls are bundled at the end, some of them may not even have to do anything as the file was flushed in the meantime already. At least that optimization is possible, unlike whenAtomicFile
is used sequentially. ↩I noticed this problem while looking into this support ticket, but I couldn't confirm it as the likely cause. ↩
The text was updated successfully, but these errors were encountered: