-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-10813. Improve Recon Debuggability for Snapshot Fetching, Sync Monitoring, and Permission Validation. #6640
base: master
Are you sure you want to change the base?
Conversation
…Sync Monitoring, and Permission Validation.
@devmadhuu @dombizita Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArafatKhan2198 Thanks for working over this, have few comments
@@ -264,10 +270,12 @@ private void startSyncDataFromOM(long initialDelay) { | |||
LOG.debug("Started the OM DB sync scheduler."); | |||
scheduler.scheduleWithFixedDelay(() -> { | |||
try { | |||
LOG.info("Last known sequence number before sync: {}", getCurrentOMDBSequenceNumber()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check the frequency of the log ... else recon log will be filled with this log only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number of times the sequence number is logged will be equal to the frequency set in the configuration below.
In the recon logs, the log messages displaying the sequence number occur at one-minute intervals, which are determined by the value set by the user for the flags OZONE_RECON_OM_SNAPSHOT_TASK_INTERVAL_DELAY
. The default value for this flag is set to 10 minutes. This delay represents the interval between Recon initiating updates on its DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can consolidate the two log messages into a single entry, which will be printed after the sync operation, displaying the sequence number before and after the sync.
What do you think @sumitagrawl
LOG.warn("No SST files found in the OM snapshot directory: {}", untarredDbDir); | ||
return null; | ||
} | ||
LOG.info("Valid OM snapshot with SST files found at: {}", untarredDbDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be debug log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets printed only once when we are fetching the snapshot, should we set it to debug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, Recon keeps fetching partial updated from OM, we can have a test over live system with on-going regular updates.
// Fetch expected permissions from configuration | ||
String expectedPermissions = | ||
ServerUtils.getPermissions(ReconConfigKeys.OZONE_RECON_DB_DIR, configuration); | ||
String expectedPermissionsSymbolic = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we controlling exact permission? I think depends on OS configuration also. IMO, we need set minimum permission required in validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments and offline discussion @devmadhuu
I have made a few changes like your asked for :-
In new implementation:
- We retrieve the expected permissions from the configuration.
- We convert the expected permissions to a
Set<PosixFilePermission>
. - We retrieve the actual permissions of the directory.
- We check if the actual permissions contain all the expected permissions.
- We log whether the actual permissions meet the minimum required permissions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ArafatKhan2198 for the patch. Few comments. Pls check.
LOG.warn("No SST files found in the OM snapshot directory: {}", untarredDbDir); | ||
return null; | ||
} | ||
LOG.info("Valid OM snapshot with SST files found at: {}", untarredDbDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to debug level, I would request to log the list of valid SST files, so in case of issues where compaction of SST files happens at OM, SST files fetched may not be exactly matching at Recon side, or another scenario where due to some unexpected issue at OM, SST files being fetched but not all being fetched since the last sequence number, then printing the list of SST files along with corresponding sequence number as debug log would be helpful to debug.
- Add the last sequence number somewhere in DB under ReconTask table with the corresponding task details and list of SST files, if not done yet. This can be added later in troubleshooting doc in case of OM DB display data and customer can get the details from table.
- Same thing need to do for other tasks like NSSummary tasks which may impact DU and customer should be able to troubleshoot on its own.
dbSnapshot.getCheckpointLocation()); | ||
try { | ||
omMetadataManager.updateOmDB(dbSnapshot.getCheckpointLocation().toFile()); | ||
LOG.info("Successfully updated Recon OM DB with new snapshot."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend this to change to debug level. because this will flood the logs of recon. And also log the sequence number being used to fetch the OM DB snapshot.
And check if this task details being updated in ReconTask table as suggested above with sequence number and list of SST files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest leaving the information to be printed in the logs, as it will only be called once, specifically during the first update of the full snapshot.
LOG.warn("Permissions for Recon DB directory '{}' are set to '{}', but expected '{}'", | ||
dbDir.getAbsolutePath(), actualPermissionsStr, expectedPermissionsSymbolic); | ||
} else { | ||
LOG.info("Permissions for Recon DB directory '{}' are correctly set to '{}'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log as well will flood logs of recon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkAndValidateReconDbPermissions
will be called by updateReconOmDBWithNewSnapshot
which will only get called when the full DB sync happens once which will be for the first time, I believe it wont flood the logs as much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What changes were proposed in this pull request?
Changes in this patch include the following :-
Enhanced Logging for OM Snapshot Fetching
We included checks for the presence of SST files, logging a warning if the snapshot is empty or lacks SST files, which could indicate issues such as OM leader instability or snapshot corruption.
Logging Last Sequence Number on Periodic Syncs
The synchronization process between Recon and the OM database now includes logging of the last sequence number before and after each sync.
Permission Checks for Recon DB Directories
We implemented a method to validate the permissions of the Recon DB directories against the expected configuration settings. The method logs a warning if the actual permissions do not match the expected permissions, enhancing operational awareness.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10813
How was this patch tested?
Manually Tested out the logs. The required log messages get printed :-