Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util: check for errors after close and read in AutoFile #29307

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

vasild
Copy link
Contributor

@vasild vasild commented Jan 24, 2024

fclose(3) may fail to flush the previously written data to disk, thus a failing fclose(3) is as serious as a failing fwrite(3).

Previously the code ignored fclose(3) failures. This PR improves the explicit callers to check whether it failed. However there is a design issue that fclose(3) is also called from the AutoFile destructor. There is no good way to signal a failure to the caller from the destructor. So, if closing from the destructor fails, then log a message (the file name is unavailable at this time 😞) and terminate the program.

Also check for ferror(3) after calls to fread(3).


Other alternatives are:

  1. fflush(3) after each write to the file (and throw if it fails from the AutoFile::write() method) and hope that fclose(3) will then always succeed. Assert that it succeeds from the destructor 🙄. Will hurt performance.
  2. Throw nevertheless from the destructor. Exception within the exception in C++ I think results in terminating the program without a useful message.
  3. Redesign AutoFile so that its destructor cannot fail. Adjust all its users 😭. For example, if the file has been written to, then require the callers to explicitly call the AutoFile::fclose() method before the object goes out of scope. In the destructor, as a sanity check, assume/assert that this is indeed the case. Defeats the purpose of a RAII wrapper for FILE* which automatically closes the file when it goes out of scope and there are a lot of users of AutoFile.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jan 24, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #29612 (rpc: Optimize serialization and enhance metadata of dumptxoutset output by fjahr)
  • #27432 (contrib: add tool to convert compact-serialized UTXO set to SQLite database by theStack)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/20819752683

src/streams.cpp Outdated Show resolved Hide resolved
src/test/fuzz/autofile.cpp Outdated Show resolved Hide resolved
@maflcko
Copy link
Member

maflcko commented Jan 29, 2024

I don't think any care close-check needs to be done when reading from a file.

So what about removing the write method from AutoFile, and introduce a new derived class to add it back. This class could Assume that the file was flushed/closed before the destructor is called?

@vasild
Copy link
Contributor Author

vasild commented Feb 4, 2024

I don't think any care close-check needs to be done when reading from a file.

That is my understanding too.

So what about removing the write method from AutoFile, and introduce a new derived class to add it back. This class could Assume that the file was flushed/closed before the destructor is called?

Assume is more for code correctness, not for external errors (like IO error). If it does not fail during testing and on CI, that does not mean IO errors are absent and will not occur on the users' machines. Given that Assume() is removed from production builds, it still means that file corruption could remain unnoticed on live environments, like now on master.

@maflcko
Copy link
Member

maflcko commented Feb 5, 2024

This class could Assume that the file was flushed/closed before the destructor is called?

Assume is more for code correctness, not for external errors (like IO error).

The assumption would be that the close was called before the destructor is called. This is a code correctness question and seems like the perfect fit for Assume. That is, if the Assume fails, the code was missing a close, and the code must be changed to fix the internal bug. The Assume does not in any way check for IO errors.

@vasild
Copy link
Contributor Author

vasild commented Feb 5, 2024

Then I misunderstood this:

This class could Assume that the file was flushed/closed before the destructor is called

How?

@maflcko
Copy link
Member

maflcko commented Feb 5, 2024

AutoFile holds a file, which will be nullptr when it is closed. So if the file is nullptr, then it can be assumed to be flushed.

@vasild
Copy link
Contributor Author

vasild commented Feb 5, 2024

How/when would the file be closed?

@maflcko
Copy link
Member

maflcko commented Feb 6, 2024

How/when would the file be closed?

An AutoFile can be closed by calling the fclose() method in the code.

@vasild
Copy link
Contributor Author

vasild commented Feb 6, 2024

Ok, the users of the class then have to explicitly close. That is 1.4. from the OP, I elaborated it with more words.

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/20850328376

Retropex pushed a commit to Retropex/bitcoin that referenced this pull request Mar 28, 2024
Explicit callers of `AutoFile::fclose()` should check whether
`fclose(3)` failed because this may be due to a failure to
flush written data to disk. Such failures after writing to
a file should be treated as failures to `fwrite(3)`.

Github-Pull: bitcoin#29307
Rebased-From: e9abf95
Retropex pushed a commit to Retropex/bitcoin that referenced this pull request Mar 28, 2024
`fread(3)` does not distinguish between end-of-file and error,
and callers must use `feof(3)` and `ferror(3)` to determine which
occurred.

Github-Pull: bitcoin#29307
Rebased-From: 5543990
@achow101
Copy link
Member

achow101 commented Apr 9, 2024

Seems like there is some discussion in the description that is a better fit for a brainstorming issue.

@DrahtBot DrahtBot marked this pull request as draft April 23, 2024 06:50
@DrahtBot
Copy link
Contributor

Converted to draft due to failing CI and inactivity

@vasild
Copy link
Contributor Author

vasild commented May 10, 2024

5543990321...1fa61d9edc: rebase and log a message from ~AutoFile() if closing fails and terminate the program.

Ready for review. I think this would be a strict improvement over what we have now in master - a silent ignoring of a possible file corruption.

@vasild vasild marked this pull request as ready for review May 10, 2024 10:28
@Sjors
Copy link
Member

Sjors commented May 10, 2024

@vasild CI seems unhappy.

@vasild vasild marked this pull request as draft May 13, 2024 12:25
Explicit callers of `AutoFile::fclose()` should check whether
`fclose(3)` failed because this may be due to a failure to
flush data to disk. Such failures should be treated as failures to
`fwrite(3)`.

Also terminate the program if `AutoFile::~AutoFile()` is unable
to close the file because that is a file write failure which
cannot be handled inside the destructor or reported to the caller.
`fread(3)` does not distinguish between end-of-file and error,
and callers must use `feof(3)` and `ferror(3)` to determine which
occurred.
@vasild vasild marked this pull request as ready for review May 14, 2024 09:50
@vasild
Copy link
Contributor Author

vasild commented May 14, 2024

Fixed the CI!

@@ -23,8 +25,9 @@ std::size_t AutoFile::detail_fread(Span<std::byte> dst)

void AutoFile::read(Span<std::byte> dst)
{
if (detail_fread(dst) != dst.size()) {
throw std::ios_base::failure(feof() ? "AutoFile::read: end of file" : "AutoFile::read: fread failed");
if (detail_fread(dst) != dst.size() || ferror(m_file)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

de23848: I still don't understand this. Why would fread succeed to read the requested number of bytes, but then mark the stream with an error? If the goal is to somehow detect errors after the read succeeded, I am not sure this should be a goal. Otherwise, it would be good to explain what real user-facing bug this fixes or introduces.

Either this is a refactor, in which case it seems pointless, or it is a behavior change, in which case it needs to explain why it beneficial and correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was somewhat answered in #29307 (comment) above but let me elaborate - the thing that prompted me to do this change is this text from the fread(3) manual page:

The function fread() does not distinguish between end-of-file and error,
and callers must use feof(3) and ferror(3) to determine which occurred.

I think checking for ferror(3) after fread(3) is a kind of harmless "belt-and-suspenders". If you are sure that if detail_fread(dst) == dst.size() then ferror(m_file) will always be false or feel strongly against this then I could drop the commit de23848 util: check for error after reading from a file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are sure that if detail_fread(dst) == dst.size() then ferror(m_file) will always be false or feel strongly against this then I could drop the commit de23848 util: check for error after reading from a file.

I don't feel strongly in one way or another. I am just saying that with the currently available information, it is not possible to confidently review this and know it is a good change. At a minimum, it should be clear whether this is a refactor or a behavior change, and what user-facing bug this could possibly prevent or fix. (I can't think of one, but maybe I am missing something?)

@@ -213,6 +218,7 @@ bool DumpMempool(const CTxMemPool& pool, const fs::path& dump_path, FopenFn mock
fs::file_size(dump_path));
} catch (const std::exception& e) {
LogInfo("Failed to dump mempool: %s. Continuing anyway.\n", e.what());
(void)file.fclose();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this change either. This seems to be mixing the third alternative in the pull request description with the approach you wanted to take?

Defeats the purpose of a RAII wrapper for FILE* which automatically closes the file when it goes out of scope and there are a lot of users of AutoFile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is best if all users of AutoFile explicitly call the fclose() method and check that it succeeds (option 3. from PR description). This is not feasible because there are many users of AutoFile. However, some of those users already do call the fclose() method explicitly. For those users it is easy and I added checks whether fclose() succeeds. DumpMempool() is such a user and the change is:

             throw std::runtime_error("FileCommit failed");
-        file.fclose();           
+        if (file.fclose() != 0) {    
+            throw std::runtime_error(
+                strprintf("Error closing %s: %s", fs::PathToString(file_fspath), SysErrorString(errno)));
+        }
         if (!RenameOver(dump_path + ".new", dump_path)) {

For consistency in DumpMempool() I changed all code paths in that function to explicitly call fclose(). There are two codepaths and this added (void)file.fclose(); is the other one. I ignore the return value because this is already failure handling and the function is going to return false. I didn't want to do something like "while handling error1, error2 occurred".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, now I see this is required, because otherwise the node would crash on this error instead of continuing with the current code, if this was missing?

Not sure this is desirable, see also https://github.com/bitcoin/bitcoin/pull/29307/files#r1612065514

@DrahtBot
Copy link
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

Comment on lines +409 to +413
~AutoFile()
{
if (fclose() != 0 && m_was_written) {
LogPrintLevel(BCLog::ALL, BCLog::Level::Error, "Error closing file: %s\n", SysErrorString(errno));
std::abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about unconditionally crashing the node on any failure here. This may or may not be an issue, so the call site will have to decide about this.

For example, If an RPC user dumps the mempool to an external (corrupt) USB thumb drive, a failure shouldn't crash the node, but notify the user over the RPC.

Also, I am not sure about involving logging in this low-level code. Logging may not be set up, so this may never be printed before the crash.

Also, I am not sure about the log message. It just says that there was an error, but not whose fault it was or what should be done about it.

A better solution overall may be to just call if (m_was_written) Assume(IsNull());. This should highlight all violating call sites during CI, because the test should be deterministic. (If there is a code path that is not deterministically tested, all hopes are lost anyway, and I don't think the solution should be to crash the users node completely and without considering the context?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants