Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: log DirectPath misconfiguration #2105

Merged
merged 9 commits into from Oct 18, 2023

Conversation

mohanli-ml
Copy link
Contributor

@mohanli-ml mohanli-ml commented Oct 11, 2023

We met issues where customers wanted to use DirectPath, but DirectPath was misconfigured and it took 2 weeks for debugging. So we want to add WARNING logs if DirectPath flag/option is set but DirectPath is not used.

Issue: #2164.

@blakeli0 @frankyn

@mohanli-ml mohanli-ml requested a review from a team as a code owner October 11, 2023 22:40
@product-auto-label product-auto-label bot added the size: s Pull request size is small. label Oct 11, 2023
@mohanli-ml
Copy link
Contributor Author

@blakeli0 This log will happen in every channel creation. I am worried this PR will make too many logs, especially if the customer has a large channel pool size. Does Java open source has something similar to LOG_EVERY_N_SEC as C++ absl?

@sonarcloud
Copy link

sonarcloud bot commented Oct 11, 2023

[java_showcase_unit_tests] SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 3 Code Smells

46.7% 46.7% Coverage
0.0% 0.0% Duplication

idea Catch issues before they fail your Quality Gate with our IDE extension sonarlint SonarLint

Copy link
Member

@frankyn frankyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling in a colleague from Dataproc team to weigh in; Thanks for tackling this @mohanli-ml;

// Case 2: just enable DirectPath xDS
if (!isDirectPathOptionSet && (isDirectPathXdsEnvSet || isDirectPathXdsOptionSet)) {
LOG.log(
Level.WARNING, "DirectPath is misconfigured. Please set the attemptDirectPath option.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singhravidutt this PR is w.r.t R.1 for observability.

Would hadoop connector customers be able to see these warnings emitted when running the connector or would this require a change?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should be able to surface these logs.

// Case 2: just enable DirectPath xDS
if (!isDirectPathOptionSet && (isDirectPathXdsEnvSet || isDirectPathXdsOptionSet)) {
LOG.log(
Level.WARNING, "DirectPath is misconfigured. Please set the attemptDirectPath option.");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add values of isDirectPathOptionSet, isDirectPathXdsEnvSet and isDirectPathXdsOptionSet as part of log statement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case user enables DirectPathXds either by env or option, but DirectPath itself is not enabled. I think the log is already clear that the user needs to set attemptDirectPath and there is no need to log the values for judging the case. WDYT?

@@ -266,6 +270,40 @@ boolean isDirectPathXdsEnabled() {
return false;
}

private void logDirectPathMisconfig() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this logic is added specifically for log statements. How do we make sure it is in parity with directPath decision making?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to the bit that actually handles connecting to DirectPath in gRPC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DirectPath decision making is L363:

if (isDirectPathEnabled() && isNonDefaultServiceAccountAllowed() && isOnComputeEngine()) {

And I came up with the 4 cases based on it. WDYT?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was that this logic will be invoked only when directPath is misconfigured. But this is not the logic which decides if directPath is misconfigured. This code snippet could become stale if there is change in logic which is actually deciding when directpath is misconfigured.

Just as an e.g. this is the code snippet which decides directPath misconfiguration

  if (isDirectPathEnabled() && isNonDefaultServiceAccountAllowed() && isOnComputeEngine()) {

Here we check for isNonDefaultServiceAccountAllowed. Hypothetically in future if isNonDefaultServiceAccountAllowed check is removed there is no easy way to know that fun:logDirectPathMisconfig needs to be updated as well.

I guess there is no easy way to enforce it and this is fine for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your inputs! isNonDefaultServiceAccountAllowed() and isOnComputeEngine() should be kept even after DirectPath rolls out, and then we will need to cleanup env/options, and this warning logs should also be cleaned.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same concern as @singhravidutt, the logic in logDirectPathMisconfig() could evolve to be different than (isDirectPathEnabled() && isNonDefaultServiceAccountAllowed() && isOnComputeEngine()). Ideally, we should check if we should use DirectPath or not during the initialization of this class, set the value to a field useDirectPath, and use it to check if we should create a DirectPath channel afterwards.
See the following code that demonstrates my idea:

public InstantiatingGrpcChannelProvider build() {
      if (shouldUseDirectPath()) {
          setUseDirectPath(true);
      }
      return new InstantiatingGrpcChannelProvider(this);
}

  private boolean shouldUseDirectPath() {
    if (isDirectPathEnabled()) {
      return isOnComputeWithValidCredential();
    } else {
      if (isDirectPathXdsEnabled()) {
        //log
        isOnComputeWithValidCredential();
      }
    }
    return false;
  }

  private boolean isOnComputeWithValidCredential() {
    if (!isNonDefaultServiceAccountAllowed()) {
      //log
      return false;
    }
    if (!isOnComputeEngine()) {
      //log
      return false;
    }
    return true;
  }

This also makes the new code much more testable as SonarCheck is failing due to lacking test coverage of the new code.

@mohanli-ml Let me know if it makes sense, if you think it might be too much at this moment, our team can take it as a follow up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! Moving the DirectPath enabling logic to the builder instead of checking it every time when construct a channel sounds good to me. However, pay attention to the logic here:

private boolean shouldUseDirectPath() {
    if (isDirectPathEnabled()) {       <- This should be `if (isDirectPathEnabled() && !isDirectPathXdsEnabled())`
      return isOnComputeWithValidCredential();
    } else {
      if (isDirectPathXdsEnabled()) {
        //log
        isOnComputeWithValidCredential();
      }
    }
    return false;
  }

In short I feel the logic is already complex and I would suggest to do this with the cleanup of removing DirectPath gRPCLB, which is blocked by CBT and is expected to be done in Q4.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In short I feel the logic is already complex and I would suggest to do this with the cleanup of removing DirectPath gRPCLB, which is blocked by CBT and is expected to be done in Q4.

I think that's reasonable. For now, can you please add some unit tests for these logging statement? So that the test coverage for new code meets our least standard and gives us more confidence when we refactor it. For example, one way to test it is to create a FakeLogger and assert the log message captured in the FakeLogger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!

@blakeli0
Copy link
Collaborator

@blakeli0 This log will happen in every channel creation. I am worried this PR will make too many logs, especially if the customer has a large channel pool size. Does Java open source has something similar to LOG_EVERY_N_SEC as C++ absl?

It's a valid concern, I would suggest refactor the code so that we don't have to log it on channel creation in createSingleChannel(). From what I can see, we can move pretty much all the logging logic to the constructor/builder of InstantiatingGrpcChannelProvider.

}

// The following WARNING logs are GCS only.
if (serviceAddress.contains("storage.googleapis.com")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is absolutely not allowed. Gax must not be aware of anything related to a specific service. Looks like we need two logging strategies for opt-in and opt-out, can we make it a flag and Storage can pass the flag into gax?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this generally useful though? This log helps call out that attempting directpath wasn't configured correctly to developers.

Copy link
Member

@frankyn frankyn Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed your comment above, about adding this into Constructor which sounds like a good path for this; just calling out this is generally useful.

Copy link
Collaborator

@blakeli0 blakeli0 Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree this log is generally useful. But hardcoding the storage specific endpoint storage.googleapis.com in Gax is not allowed, sorry if I didn't make this clear in previous comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a new idea. Let’s only make the WARNING log if DirectPath xDS env/option is set. DirectPath xDS is opt-in for GCS/CBT/CS, so we do not need to have two strategies. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these warnings should be emitted for both cases; a user is attempting DirectPath and it would be helpful to know they're doing something wrong in configuring it. Otherwise, it's a blackbox and users are left to weeks of debugging what went wrong. The goal of this is to make it easier to understand a user did something wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Your concern is that what if a customer set attemptDirectPath but not attemptDirectPathXds option/env, there will be no warnings.

Let me provide some context here. attemptDirectPath is originally created for DirectPath gRPCLB (gRPCLB is the name resolver). Then DirectPath xDS is developed (TD is the name resolver) and we decide to migrate DirectPath gRPCLB to DirectPath xDS. To differentiate the two DirectPath infrastructures, attemptDirectPathXds option/env are created:

  • If attemptDirectPath is NOT set, CloudPath will be used;
  • If attemptDirectPath is set AND attemptDirectPathXds option/env is NOT set, DirectPath gRPCLB is used;
  • If attemptDirectPath is set AND attemptDirectPathXds option/env is set, DirectPath xDS is used.

Currently the migration is still ongoing as CBT still has DirectPath gRPCLB traffic (CBT's timeline is Q4). After that, I will cleanup DirectPath gRPCLB code in the gax library. For example, attemptDirectPath will be removed and we just need to keep attemptDirectPathXds option/env.

With that being said, I can change the DirectPath enabling logic to:

  • If attemptDirectPath is NOT set AND attemptDirectPathXds option/env is NOT set, CloudPath will be used;
  • If attemptDirectPath is set AND attemptDirectPathXds option/env is NOT set, DirectPath gRPCLB is used;
  • If attemptDirectPathXds option/env is set, DirectPath xDS is used.

In this way, GCS customers just need to set attemptDirectPathXds option/env, and they do not need to care about attemptDirectPath. Then the current log logic should be good. WDYT?

Copy link
Member

@frankyn frankyn Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the way the options work sounds like a reasonable thing to do after migration is complete.
In java-storage, the attemptDirectPath and attemptDirectPathXds issue has been addressed by enabling both, but these warnings would have helped save the team time. Moreover, we don't expose configuration of each flag attemptDirectPath and attemptDirectPathXds directly to users instead only expose 1 flag that enables both.

My original goal is to add warnings to let GCS users know they have misconfigured the client to use DirectPath. This also lends to supporting future warnings if new conditions are added.

The three main cases surfaced in this logic (ignoring gRPCLB case because it's not ready yet):

  1. attemptDirectPath vs. attemptDirectPathXds is one way;
  2. non GCE credentials; this is likely to bite a customer
  3. Not using GCE; not as much a concern, but we don't document DirectPath in public documentation so a user could do something silly and use the our clients outside GCE without understanding that DP is not available.

I think the warnings would be generally useful outside of GCS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, we don't expose configuration of each flag attemptDirectPath and attemptDirectPathXds directly to users instead only expose 1 flag that enables both.

Sorry I am confused. What are the two cases in the previous comment I think these warnings should be emitted for both cases?

Copy link
Member

@frankyn frankyn Oct 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we are speaking past each other?

I have a new idea. Let’s only make the WARNING log if DirectPath xDS env/option is set. DirectPath xDS is opt-in for GCS/CBT/CS, so we do not need to have two strategies. WDYT?

The point i was driving at is that there's nothing telling a user they configured DirectPath incorrectly when they use attemptDirectPath or attemptDirectPathXds; and selecting the cases only when attemptDirectPathXds is set to emit these warnings misses this point.

@blakeli0
Copy link
Collaborator

@mohanli-ml Thanks for opening this PR! The logic to determine different logs seems pretty complicated and people have different opinions. Can you please create an issue and/or an internal doc, list all the scenarios there, and make sure we all agree on the approach first?
cc: @meltsufin

@mohanli-ml
Copy link
Contributor Author

@blakeli0 This log will happen in every channel creation. I am worried this PR will make too many logs, especially if the customer has a large channel pool size. Does Java open source has something similar to LOG_EVERY_N_SEC as C++ absl?

It's a valid concern, I would suggest refactor the code so that we don't have to log it on channel creation in createSingleChannel(). From what I can see, we can move pretty much all the logging logic to the constructor/builder of InstantiatingGrpcChannelProvider.

Yes I moved the method to the channel pool creation, so no matter how many channels customers want, the log should only be logged once. Can you double check if this is the right place?

@mohanli-ml
Copy link
Contributor Author

@mohanli-ml Thanks for opening this PR! The logic to determine different logs seems pretty complicated and people have different opinions. Can you please create an issue and/or an internal doc, list all the scenarios there, and make sure we all agree on the approach first? cc: @meltsufin

Before we do this, can you take a look at the latest code?

LOG.log(
Level.WARNING,
"DirectPath is misconfigured. Please make sure the credential is an instance of"
+ " ComputeEngineCredentials.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use fully qualified class name ComputeEngineCredentials.class.getName()

TIL you need GCE credentials for DirectPath; what is a customer is using Workload Identity would that work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

GKE workload identity is compatible with DirectPath.

// Case 1: does not enable DirectPath
if (!isDirectPathOptionSet) {
LOG.log(
Level.WARNING, "DirectPath is misconfigured. Please set the attemptDirectPath option.");
Copy link
Member

@frankyn frankyn Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend:

DirectPath is misconfigured. Please enable the attemptDirectPath option along with the attemptDirectPathXds option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Case 3: not running on GCE
if (!isOnComputeEngine()) {
LOG.log(
Level.WARNING, "DirectPath is misconfigured. Please run in the GCE environment. ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:

DirectPath is misconfigured. DirectPath is only available in a GCE environment.

You also have whitespace at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

+ " attemptDirectPathXds option.");
} else {
// Case 2: credential is not correctly set
if (!isNonDefaultServiceAccountAllowed()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a ComputeEngineCredentials but not on ComputeEngine? I though auth would only return ComputeEngineCredentials if on ComputeEngine, so isOnComputeEngine() may not be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is possible, this is the original fix for this problem, googleapis/gax-java#1250.

@@ -233,6 +237,7 @@ public TransportChannel getTransportChannel() throws IOException {
}

private TransportChannel createChannel() throws IOException {
logDirectPathMisconfig();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createChannel() could still be called multiple times, I would suggest to log the info in the builder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not do this because logDirectPathMisconfig() is a non-static function so it can not be used before the construction. Can you suggest more?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move logDirectPathMisconfig() into the Builder class then it should be fine, ideally we should do it but then we would need to move all related logics into it as well. The quickest solution would be add it to the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

boolean isDirectPathXdsEnvSet =
Boolean.parseBoolean(envProvider.getenv(DIRECT_PATH_ENV_ENABLE_XDS));
boolean isDirectPathXdsOptionSet = Boolean.TRUE.equals(attemptDirectPathXds);
if (isDirectPathXdsEnvSet || isDirectPathXdsOptionSet) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already have isDirectPathXdsEnabled() for the same purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

boolean isDirectPathXdsOptionSet = Boolean.TRUE.equals(attemptDirectPathXds);
if (isDirectPathXdsEnvSet || isDirectPathXdsOptionSet) {
// Case 1: does not enable DirectPath
if (!isDirectPathOptionSet) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use isDirectPathEnabled()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -266,6 +270,40 @@ boolean isDirectPathXdsEnabled() {
return false;
}

private void logDirectPathMisconfig() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same concern as @singhravidutt, the logic in logDirectPathMisconfig() could evolve to be different than (isDirectPathEnabled() && isNonDefaultServiceAccountAllowed() && isOnComputeEngine()). Ideally, we should check if we should use DirectPath or not during the initialization of this class, set the value to a field useDirectPath, and use it to check if we should create a DirectPath channel afterwards.
See the following code that demonstrates my idea:

public InstantiatingGrpcChannelProvider build() {
      if (shouldUseDirectPath()) {
          setUseDirectPath(true);
      }
      return new InstantiatingGrpcChannelProvider(this);
}

  private boolean shouldUseDirectPath() {
    if (isDirectPathEnabled()) {
      return isOnComputeWithValidCredential();
    } else {
      if (isDirectPathXdsEnabled()) {
        //log
        isOnComputeWithValidCredential();
      }
    }
    return false;
  }

  private boolean isOnComputeWithValidCredential() {
    if (!isNonDefaultServiceAccountAllowed()) {
      //log
      return false;
    }
    if (!isOnComputeEngine()) {
      //log
      return false;
    }
    return true;
  }

This also makes the new code much more testable as SonarCheck is failing due to lacking test coverage of the new code.

@mohanli-ml Let me know if it makes sense, if you think it might be too much at this moment, our team can take it as a follow up PR.

@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Oct 17, 2023
@sonarcloud
Copy link

sonarcloud bot commented Oct 18, 2023

[gapic-generator-java-root] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

94.7% 94.7% Coverage
0.0% 0.0% Duplication

@sonarcloud
Copy link

sonarcloud bot commented Oct 18, 2023

[java_showcase_integration_tests] SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

26.3% 26.3% Coverage
0.0% 0.0% Duplication

idea Catch issues before they fail your Quality Gate with our IDE extension sonarlint SonarLint

@blakeli0 blakeli0 added the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 18, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Oct 18, 2023
@blakeli0 blakeli0 merged commit 860ae76 into googleapis:main Oct 18, 2023
34 of 35 checks passed
@mohanli-ml
Copy link
Contributor Author

@blakeli0 Hi Blake, I just realized that this PR could be misleading. For example, a customer's application may contains a GCS client and a Bigquery client. The customer use setAttemptDirectPath to enable DirectPath in its GCS client, and set GOOGLE_CLOUD_ENABLE_DIRECT_PATH_XDS to enable DirectPath xDS. In this case, this warning log will be printed in the customer's Bigquery client. I think this is a P2 or P3 issue, and could you find someone to fix it (as you mentioned your team want to take over DirectPath)?

cc @frankyn

@frankyn
Copy link
Member

frankyn commented Jan 26, 2024

Warnings started being emitted in recent version of java-storage w.r.t "incorrect" GCE credentials; however requests are still using GCE ADC, was there a change recently to credentials that could trigger this incorrectly?

@blakeli0
Copy link
Collaborator

@mohanli-ml I see, the problem is that isDirectPathXdsEnabled() could be enabled by environment variable, which is shared by all clients. So maybe we want to change the logic to check isDirectPathEnabled() first. Can you please create an issue and I'll put it into our backlog?

@frankyn Thanks for reporting this issue. #2281 might caused it, looking into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants