Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-10813. Improve Recon Debuggability for Snapshot Fetching, Sync Monitoring, and Permission Validation. #6640

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ArafatKhan2198
Copy link
Contributor

@ArafatKhan2198 ArafatKhan2198 commented May 5, 2024

What changes were proposed in this pull request?

Changes in this patch include the following :-

Enhanced Logging for OM Snapshot Fetching

We included checks for the presence of SST files, logging a warning if the snapshot is empty or lacks SST files, which could indicate issues such as OM leader instability or snapshot corruption.

Logging Last Sequence Number on Periodic Syncs

The synchronization process between Recon and the OM database now includes logging of the last sequence number before and after each sync.

Permission Checks for Recon DB Directories

We implemented a method to validate the permissions of the Recon DB directories against the expected configuration settings. The method logs a warning if the actual permissions do not match the expected permissions, enhancing operational awareness.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10813

How was this patch tested?

Manually Tested out the logs. The required log messages get printed :-

  1. Logging for SST Files Availability on Full DB Fetch:
2024-05-06 02:13:03 2024-05-05 20:43:03,285 [Recon-SyncOM-0] INFO impl.OzoneManagerServiceProviderImpl: Valid OM snapshot with SST files found at: /data/metadata/om.snapshot.db_1714941781590
  1. Printing the Sequence Number Before and After Sync:
2024-05-06 02:13:01 2024-05-05 20:43:01,587 [Recon-SyncOM-0] INFO impl.OzoneManagerServiceProviderImpl: Last known sequence number before sync: 0
.....
2024-05-06 02:13:03 2024-05-05 20:43:03,620 [Recon-SyncOM-0] INFO impl.OzoneManagerServiceProviderImpl: Sequence number after sync: 2
  1. Recon Data Directory Permissions:
2024-05-06 02:13:01 2024-05-05 20:43:01,590 [Recon-SyncOM-0] INFO impl.OzoneManagerServiceProviderImpl: Permissions for Recon DB directory '/data/metadata' are correctly set to 'rwxr-x---'

@ArafatKhan2198
Copy link
Contributor Author

@devmadhuu @dombizita Please take a look.

@ivandika3 ivandika3 added the recon label May 6, 2024
@adoroszlai adoroszlai changed the title HDDS-10813. Improve Ozone Recon Debuggability for Snapshot Fetching, Sync Monitoring, and Permission Validation. HDDS-10813. Improve Recon Debuggability for Snapshot Fetching, Sync Monitoring, and Permission Validation. May 6, 2024
Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArafatKhan2198 Thanks for working over this, have few comments

@@ -264,10 +270,12 @@ private void startSyncDataFromOM(long initialDelay) {
LOG.debug("Started the OM DB sync scheduler.");
scheduler.scheduleWithFixedDelay(() -> {
try {
LOG.info("Last known sequence number before sync: {}", getCurrentOMDBSequenceNumber());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the frequency of the log ... else recon log will be filled with this log only

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of times the sequence number is logged will be equal to the frequency set in the configuration below.

In the recon logs, the log messages displaying the sequence number occur at one-minute intervals, which are determined by the value set by the user for the flags OZONE_RECON_OM_SNAPSHOT_TASK_INTERVAL_DELAY. The default value for this flag is set to 10 minutes. This delay represents the interval between Recon initiating updates on its DB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can consolidate the two log messages into a single entry, which will be printed after the sync operation, displaying the sequence number before and after the sync.

What do you think @sumitagrawl

LOG.warn("No SST files found in the OM snapshot directory: {}", untarredDbDir);
return null;
}
LOG.info("Valid OM snapshot with SST files found at: {}", untarredDbDir);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be debug log

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets printed only once when we are fetching the snapshot, should we set it to debug?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, Recon keeps fetching partial updated from OM, we can have a test over live system with on-going regular updates.

// Fetch expected permissions from configuration
String expectedPermissions =
ServerUtils.getPermissions(ReconConfigKeys.OZONE_RECON_DB_DIR, configuration);
String expectedPermissionsSymbolic =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we controlling exact permission? I think depends on OS configuration also. IMO, we need set minimum permission required in validation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments and offline discussion @devmadhuu
I have made a few changes like your asked for :-
In new implementation:

  1. We retrieve the expected permissions from the configuration.
  2. We convert the expected permissions to a Set<PosixFilePermission>.
  3. We retrieve the actual permissions of the directory.
  4. We check if the actual permissions contain all the expected permissions.
  5. We log whether the actual permissions meet the minimum required permissions.

Copy link
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for the patch. Few comments. Pls check.

LOG.warn("No SST files found in the OM snapshot directory: {}", untarredDbDir);
return null;
}
LOG.info("Valid OM snapshot with SST files found at: {}", untarredDbDir);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to debug level, I would request to log the list of valid SST files, so in case of issues where compaction of SST files happens at OM, SST files fetched may not be exactly matching at Recon side, or another scenario where due to some unexpected issue at OM, SST files being fetched but not all being fetched since the last sequence number, then printing the list of SST files along with corresponding sequence number as debug log would be helpful to debug.

  1. Add the last sequence number somewhere in DB under ReconTask table with the corresponding task details and list of SST files, if not done yet. This can be added later in troubleshooting doc in case of OM DB display data and customer can get the details from table.
  2. Same thing need to do for other tasks like NSSummary tasks which may impact DU and customer should be able to troubleshoot on its own.

dbSnapshot.getCheckpointLocation());
try {
omMetadataManager.updateOmDB(dbSnapshot.getCheckpointLocation().toFile());
LOG.info("Successfully updated Recon OM DB with new snapshot.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend this to change to debug level. because this will flood the logs of recon. And also log the sequence number being used to fetch the OM DB snapshot.
And check if this task details being updated in ReconTask table as suggested above with sequence number and list of SST files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest leaving the information to be printed in the logs, as it will only be called once, specifically during the first update of the full snapshot.

LOG.warn("Permissions for Recon DB directory '{}' are set to '{}', but expected '{}'",
dbDir.getAbsolutePath(), actualPermissionsStr, expectedPermissionsSymbolic);
} else {
LOG.info("Permissions for Recon DB directory '{}' are correctly set to '{}'",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log as well will flood logs of recon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkAndValidateReconDbPermissions will be called by updateReconOmDBWithNewSnapshot which will only get called when the full DB sync happens once which will be for the first time, I believe it wont flood the logs as much.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants