Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pause recording errors instead of clearing the error stack #4475

Open
wants to merge 21 commits into
base: develop
Choose a base branch
from

Conversation

qkoziol
Copy link
Contributor

@qkoziol qkoziol commented May 10, 2024

An internal capability that's similar to the H5E_BEGIN_TRY / H5E_END_TRY macros in H5Epublic.h, but more efficient since we can avoid pushing errors on the stack entirely (and those macros use public API routines).

This capability (and other techniques) can be used to remove use of H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY within library routines.

We want to remove H5E_clear_stack() because it can trigger calls to the H5I interface from within the H5E code, which creates a great deal of complexity for threadsafe code. And we want to remove H5E_BEGIN_TRY / H5E_END_TRY's because they make public API calls from within the library code.

Also some other minor tidying in routines related to removing the use of H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY from H5Fint.c

qkoziol and others added 2 commits May 10, 2024 13:27
An internal capability that's similar to the H5E_BEGIN_TRY / H5E_END_TRY
macros in H5Epublic.h, but more efficient since we can avoid pushing errors on
the stack entirely (and those macros use public API routines).

This capability (and other techniques) can be used to remove use of
H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY within library routines.

We want to remove H5E_clear_stack() because it can trigger calls to the H5I
interface from within the H5E code, which creates a great deal of complexity
for threadsafe code.  And we want to remove H5E_BEGIN_TRY / H5E_END_TRY's
because they make public API calls from within the library code.

Also some other minor tidying in routines related to removing the use of
H5E_clear_stack() and H5E_BEGIN_TRY / H5E_END_TRY from H5Fint.c

Signed-off-by: Quincey Koziol <quincey@koziol.cc>
@qkoziol
Copy link
Contributor Author

qkoziol commented May 13, 2024

I have a change to make for this, please hold off on reviewing for the moment. I should have something up later today.

@derobins derobins added Merge - To 1.14 This needs to be merged to HDF5 1.14 Priority - 1. High 🔼 These are important issues that should be resolved in the next release Component - C Library Core C library issues (usually in the src directory) Type - Improvement Improvements that don't add a new feature or functionality labels May 13, 2024
Uses 'try open' routines now, down to the foundation of the operation

Signed-off-by: Quincey Koziol <quincey@koziol.cc>
@qkoziol
Copy link
Contributor Author

qkoziol commented May 13, 2024

I have a change to make for this, please hold off on reviewing for the moment. I should have something up later today.

OK, I've updated things to a cleaner solution. Please go ahead with reviews now.

github-actions bot and others added 4 commits May 13, 2024 21:16
Signed-off-by: Quincey Koziol <quincey@koziol.cc>
Signed-off-by: Quincey Koziol <quincey@koziol.cc>
@qkoziol
Copy link
Contributor Author

qkoziol commented May 14, 2024

Interesting - removing the H5E_clear_stack() from the virtual dataset storage code has revealed a failure in the VDS SWMR test. I'm looking into it, but it's another motivator to get the clear stack calls out of the library.

@jhendersonHDF
Copy link
Collaborator

jhendersonHDF commented May 14, 2024

How does this interact with the old way of "pausing" the error stack by just removing the callback function temporarily? I'm thinking that this PR might need to add a way of publicly retrieving whether an error stack has been paused. The error macros I used in different VFDs/VOLs currently decide whether to suppress that driver's/connector's own error reporting based on whether the default error stack currently has a callback function set, so with these changes it seems I'd likely also need to check if the error stack has been paused to be correct.

@qkoziol
Copy link
Contributor Author

qkoziol commented May 14, 2024

How does this interact with the old way of "pausing" the error stack by just removing the callback function temporarily? I'm thinking that this PR might need to add a way of publicly retrieving whether an error stack has been paused. The error macros I used in different VFDs/VOLs currently decide whether to suppress that driver's/connector's own error reporting based on whether the default error stack currently has a callback function set, so with these changes it seems I'd likely also need to check if the error stack has been paused to be correct.

Setting the reporting callback to NULL just suppresses error reports from being printed when leaving an API routine, it doesn't pause actually pushing errors on the stack. This change allows the library to internally pause pushing errors on the stack in circumstances when it knows the operation could fail and would just call H5E_clear_stack() on a failure.

@qkoziol
Copy link
Contributor Author

qkoziol commented May 14, 2024

Currently, I don't see a need for an API for this, but I'm open to ideas around one.

brtnfld and others added 2 commits May 14, 2024 17:42
…up#4477)

* Properly clean up cache when failing to load an object header

* Don't check message type a second time in H5G__open_oid if the first attempt returns error

* Add more asserts to H5O__assert() to avoid segfaults

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Fixes a race condition where the reader opens the file and sets its EOF from the
file's size (from the stat() call in the driver open callback).  Then, before
the reader can read the file's superblock, a SWMR writer races in, extends the
file, and closes the file, writing an updated superblock with the 'writer' and
'SWMR writer' flags in the superblock off (appropriately).  Then the reader
proceeds to read the superblock, and flags the EOF as wrong.  Taking out the
check for the 'writer' and 'SWMR writer' flags will cause SWMR readers to avoid
flagging the file as incorrect.

Signed-off-by: Quincey Koziol <quincey@koziol.cc>
@qkoziol
Copy link
Contributor Author

qkoziol commented May 14, 2024

Interesting - removing the H5E_clear_stack() from the virtual dataset storage code has revealed a failure in the VDS SWMR test. I'm looking into it, but it's another motivator to get the clear stack calls out of the library.

I've put up a PR (#4489) that addresses the SWMR bug, and cherry-picked that commit into this branch, so the tests should pass now.

*/
if (tent_flags != flags) {
/* Make tentative attempt to open file */
H5E_BEGIN_TRY
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an example, I believe switching from H5E_BEGIN_TRY -> H5E_PAUSE_ERRORS (inside H5FD_try_open) here will cause some VOL connectors/VFDs to start printing out error stacks from failures during speculative file opens because, at the time, the easiest way to determine whether or not to print an error stack was to check if the library's default error stack has error reporting disabled. Since I setup these connectors/drivers with their own error stack separate from the library, it generally only makes sense for that plugin to print out the contents of its own error stack on the way out from a "top-level" function (e.g., one of the callback functions the plugin implements). The error macros I setup check whether the library's default error stack has a callback function set before deciding whether to print out the error stack on the way out. With these changes, it seems I'd instead need to know whether error reporting on the error stack has been paused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that makes sense.

I can add an API routine (H5Eis_paused ?) that will allow plugin developers to query if the library has paused error reporting on the default stack. I don't think it's a good idea to let those developers unpause the default stack, but I could also add a H5Epause & H5Eresume for them to have similar behavior for their own stacks.

I'll try to get these additional routines added tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new routines

@qkoziol
Copy link
Contributor Author

qkoziol commented May 20, 2024

@jhendersonHDF - How do these new API routines look?

src/H5E.c Outdated
FUNC_ENTER_API(FAIL)

/* Applications can't pause the default stack */
if (stack_id == H5E_DEFAULT)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think this sort of comes down to deciding if we want to have multiple methods to essentially disable error reporting, but it seems to me that H5Epause/resume_stack on the default error stack is a reasonable use case. Otherwise, it seems like there's a bit of a disconnect where internal code is intended to use the internal H5E_pause_stack to temporarily disable error reporting, while "external" code has to use H5Eget/set_auto, which developers have to remember. Is there harm in allowing an application to call H5Epause/resume_stack on the default error stack and promoting that as a more natural way of temporarily disabling error reporting, as opposed to just temporarily NULLing out the printing callback?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember thinking "I don't think this will work from user space" when I was writing it, so I disabled it. But, after your prompt, I went back and think it should be OK. I think my main concern is that a plugin could manipulate the default error stack's pause status (unpause it, or make it paused when it returns), but that's true for the get_auto/set_auto also.

I'll update the code to allow it, and also switch the H5E_BEGIN_TRY / H5E_END_TRY macros to pause & resume pushing errors on the stack, since that's more logical than just disabling the reporting (as you say).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it a little more - I can add some code to save & restore library state around making a call to a plugin's function pointer (starting with the error stack reporting routine and pause status). We're going to need that concept when we start converting routines that make plugin callbacks to be concurrent, and it's not bad to start now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember thinking "I don't think this will work from user space" when I was writing it, so I disabled it. But, after your prompt, I went back and think it should be OK. I think my main concern is that a plugin could manipulate the default error stack's pause status (unpause it, or make it paused when it returns), but that's true for the get_auto/set_auto also.

I'll update the code to allow it, and also switch the H5E_BEGIN_TRY / H5E_END_TRY macros to pause & resume pushing errors on the stack, since that's more logical than just disabling the reporting (as you say).

OK, I've pushed an update to allow pausing and resuming the default error stack. I decided not to change the H5E_BEGIN_TRY / H5E_END_TRY macros, since that would modify application visible behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it a little more - I can add some code to save & restore library state around making a call to a plugin's function pointer (starting with the error stack reporting routine and pause status). We're going to need that concept when we start converting routines that make plugin callbacks to be concurrent, and it's not bad to start now.

I haven't made this change yet, since I want to think about it a little more. Generally speaking, I'm planning to add the state of the default error stack to the information retrieved and restored with H5CX_retrieve_state / H5CX_restore_state, but I'm not 100% certain yet. There's no urgency to this defensive state saving, so I've put it on my to-do list to tackle in a little while.

Signed-off-by: Quincey Koziol <quincey@koziol.cc>
@qkoziol
Copy link
Contributor Author

qkoziol commented May 23, 2024

@jhendersonHDF - I believe I've covered the changes necessary to allow plugin authors to use the error pause / resume capability correctly, so this should be ready for another review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component - C Library Core C library issues (usually in the src directory) Merge - To 1.14 This needs to be merged to HDF5 1.14 Priority - 1. High 🔼 These are important issues that should be resolved in the next release Type - Improvement Improvements that don't add a new feature or functionality
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

None yet

5 participants