Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing callers of tentacle client to drive orchestration themselves #799

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sburmanoctopus
Copy link
Contributor

@sburmanoctopus sburmanoctopus commented Feb 9, 2024

Background

In the ExecutionsV2 pipeline, all tentacle communications are done asynchronously via events.

Results

Before

The TentacleClient.ExecuteScript method does the entire communication stack, from starting, to getting status, to completing etc.

After

We now expose the required operations needed to interact with Tentacle.

Considerations

Script Ticket

Tentacle V1 creates the script ticket on the Tentacle itself, whereas Tentacle V2 onwards are given the script ticket. This concept was hidden in the TentacleClient.

But now that we are driving the orchestration in Octopus Server, we need to expose this to the caller.

Orchestration vs Execution

We had to split apart the execution away from orchestration. This is because our new methods on the client are purely execution. But we want to make sure we are executing in the same way we do during orchestration.

Cancelling/Completing During Start

One thing we had to maintain when splitting execution and orchestration is that we try to cancel/complete in orchestration if we started attempting to start during execution.

This is why the ScriptExecutionCancelledAfterPotentiallyStartingException was introduced.

We also need to respect this contract in the ExecutionsV2 pipeline.

Using Latest Command/Response

The code became far simpler when the orchestrators and executors respected the contract that parameters and return values were always using the latest version (in this case V3Alpha).

This prevented the need for many generic arguments between layers.

It also simplified mapping.

ClientOperationMetricsBuilder

This PR does not have a good solution for the metrics we are currently gathering.

ClientOperationMetricsBuilder gets all the metrics about an operation, and reports back on that. But the operation is now split up into separate calls. So we cannot do this.

We need to revisit this at some point.

Testing

We should write our own set of tests on the new methods of the TentacleClient if this draft is considered worth progressing.

How to review this PR

This PR moves logic around so that it can be exposed. It would be great if we could please see if they are still logically equivalent.

Quality ✔️

Pre-requisites

  • I have read How we use GitHub Issues for help deciding when and where it's appropriate to make an issue.
  • I have considered informing or consulting the right people, according to the ownership map.
  • I have considered appropriate testing for my change.

using System.Threading;
using System.Threading.Tasks;

namespace Octopus.Tentacle.Client.Scripts
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see the use of this.


namespace Octopus.Tentacle.Contracts.ScriptServiceV3Alpha
{
public class GetStatusCommandV3Alpha
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these classes need to be tied to what is going over the wire, maybe it could just be GetStatusCommand no v3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't need to be. I wasn't sure what convention we were following.

I guess it depends whether we want the versioning of all commands to increment together (with StartScriptCommandV3Alpha etc.), or whether they can progress at their own cadence.

I leant towards keeping them in sync, as we tend to make orchestrators (and now executors) for a specific version.

But if this is wrong, then I am happy to change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need more guidance.

But in general the tentacle client should be working as hard as it can be for you. Which I think should me trying to hide away details best it can, including new different wire command/responses.

@@ -191,11 +193,11 @@ async Task<DataStream> DownloadFileAction(CancellationToken ct)
clientOptions,
logger);

var orchestrator = await factory.CreateOrchestrator(scriptExecutionCancellationToken);
var orchestrator = factory.CreateOrchestrator(scriptServiceVersion);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sburmanoctopus Could we make it so that the "in memory" version of execute a scripts calls the below methods. That way the client itself gives an example of how to call the methods below and we always test those methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I follow this one sorry.

The below methods are the publicly facing versions, which will have their own issues with metrics gathering etc. So I would not be inclined to call them from other methods.

They do call the same executors, so end up calling the same code.

It would be nice to have something show you how they should be called. But that's all done within the orchestrator.

Maybe this one is worth a chat?

//TODO: sast - think about metrics. This isn't right, so we have ignored it for now. But we need to ask what we want to report on in an async event based world.
var operationMetricsBuilder = ClientOperationMetricsBuilder.Start();

var scriptServiceVersion = await DetermineScriptServiceVersionToUse(operationMetricsBuilder, logger, scriptExecutionCancellationToken);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be cached within that second object that could be returned by start script. Otherwise we are very much relying on halibut caching to work. If we don't cache this we will need to ensure we don't log too much since DetermineScriptServiceVersionToUse might be logging to task log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good points 🤔

The complication is a multi-node environment.

Returning the version in start script would be the most efficient, but it does expose internal concerns to the public 😞

Like you say, caching does help, but it does mean every server node that does a task (e.g. GetStatus) now has to determine the version again (and log it to the task log 😑 ).

Which then brings us to the idea of reducing what we log when doing this.

@jimmyp , do you have any thoughts on this one?

Copy link

@jimmyp jimmyp Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very much relying on halibut caching to work

I dont understand halibut enough to understand this

ensure we don't log too much since DetermineScriptServiceVersionToUse might be logging to task log.

I'm wondering if this is something that might need to change about tentacle client. The way I've been thinking about task logging in the new pipeline is that its not something most components should know about. The pipeline will raise events and task logging component somewhere will build up the task log as a read model.

Returning the version in start script would be the most efficient, but it does expose internal concerns to the public 😞

What exactly is the internal concern being exposed here sorry? In general our multi-node caching mechanism is going to be the DeploymentExecution so my first thought is to reach for that if we do need caching. But I'm keen to here more about why we need caching here and what dependancies this would create

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything written down anywhere of the goals and concerns tentacle client is supposed to deal with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is the internal concern being exposed here sorry?

Before any RPC call, we determine the capabilities of the Tentacle (i.e. is it V1, V2 etc.) so we can use the correct interface. We cache the result of this so that when we keep doing multiple "get status" checks etc., we don't have to keep asking the same question.

Right now, this is all done within the Tentacle Client itself. So the caller does not need to care about it.

So the idea we are exploring is returning this information about the tentacle back to Octopus Server, so then Octopus Server can use it for future calls to Tentacle Client (and therefore Tentacle Client would not have to ask the Tentacle again).

But in doing so, we are exposing some info to achieve this.

However, recently, @gb-8 and I were exploring the idempotency issue with V1 tentacles, and realized that we may need to make some changes to support that. The main change is to actually expose the results of "Get Capabilities" so that the caller can make choices about how to drive Tentacle Client if it's a V1 tentacle.

If we do that, then we would already have exposed this info. So maybe it becomes moot.

Is there anything written down anywhere of the goals and concerns tentacle client is supposed to deal with?

Not that I'm aware of. But I wasn't around in the early days. @LukeButters might know if there is?

@@ -209,6 +211,136 @@ async Task<DataStream> DownloadFileAction(CancellationToken ct)
}
}


public async Task<ScriptStatusResponseV3Alpha> StartScript(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to return something that is not the type that is received over the wire.

Perhaps two things distinct objects should be returned here

  • The interesting data returned e.g. logs.
  • An object which can be passed to GetStatus containing everything required to do the next GetStatus call. This object could additionally have some extension methods for serialization and deserialization, which is all taken care of in the client (and tested in the client).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like that. It helps as an anti-corruption layer, plus it helps to define responsibilities.

Thinking this through, I'd say we create:

  • New class CommandContext (or a better name?)
    • Has ScriptTicket and NextLogSequence
  • New class CommandResponse (or a better name?)
    • Has State, ExitCode, and Logs (where Logs is still a list of ProcessOutput?)
    • A new property that holds the CommandContext for the next command. E.g. NextCommandContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we could use CommandContext as the parameter to the other commands, to really highlight this is what you need to use these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, in the future it would be nice if the client can tell us if the script is hung (which is after the script has started and it has not logged output for more than some amount of time). Could we stash the information needed to do that in the command context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably.

This sounds like a nice separate piece of work to come afterwards. I'll create a card for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, that hung stuff is something we almost did (2nd highest voted bet). Also I think it has to be in the client since it is not obvious when a script actually started :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Also, story created

return result;
}

public async Task<ScriptStatusResponseV3Alpha> Finish(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling Finish out makes sense since cleaning up is destructive and we do need to let callers do stuff like collect artifices before we clean up

var result = await executor.GetStatus(getStatusCommand, scriptExecutionCancellationToken);

return result;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the caller of this method now responsible for delays between calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 🙂

The current ExecutionsV2 pipeline does that already

@LukeButters
Copy link
Contributor

I think in general this is rather good.

@sburmanoctopus sburmanoctopus force-pushed the sast/SpikeOutExecutionsV2TentacleClientChanges branch 2 times, most recently from d685d53 to 6494980 Compare February 19, 2024 05:43
@gb-8
Copy link

gb-8 commented Feb 22, 2024

Summary of discussion of PR:

What the PR achieves:

  • It separates orchestration and execution logic inside tentacle client.
  • It adds new methods to allow callers to manage orchestration, leaving client responsible for execution.

What the PR does not address:

  • It does not allow the determination of what features the tentacle supports, so does not allow us to detect when we are dealing with a V1 tentacle which does not support idempotent script execution, and generates a new script ticket on the tentacle.

Benefits of getting this PR merged:

  • It does separate orchestration and execution concerns in TentacleClient and moves us towards being able to fully orchestrate tentacle interactions (with the exclusion of handling V1 tentacle limitations).
  • It would allow us to stop using FudgedTentacleClient and reduce filesystem usage that that is involves.

Cost of getting this PR merged:

  • Still requires further review before we can consider merging (It's had some attention from Luke).
  • It does not move us towards our cycle goal, and could distract from that.

Risks management:

  • The PR would need to be re-worked if we want to roll it out via a feature toggle.
  • The changes in the PR should not change behaviour for existing usages of TentacleClient, and this behaviour is well covered by automated tests.
  • New methods (for external orchestration) are not covered by automated tests to the same level.
  • Even if we wanted to address all required changes to tentacle client as a contiguous piece of work (including the missing tentacle version management concerns), we would likely want to split it up into smaller PRs, and this PR would be a reasonable first step.

@sburmanoctopus sburmanoctopus force-pushed the sast/SpikeOutExecutionsV2TentacleClientChanges branch from 6494980 to 568d071 Compare February 27, 2024 03:29
@sburmanoctopus sburmanoctopus marked this pull request as ready for review February 27, 2024 03:29
@sburmanoctopus sburmanoctopus requested a review from a team as a code owner February 27, 2024 03:29
@pawelpabich
Copy link
Contributor

A few thoughts:

  1. ITentacleClient is used by the classic execution pipeline and Dynamic Workers. Would it make sense to create a different interface for the resilient pipeline so you can freely experiment with it in isolation?
  2. The Modern Deployment Team will prepare a mini RFC re how to expose the new K8s feature in the Tentacle Client. I don't think this should affect this PR and it is more of FYI.

using Octopus.Tentacle.Contracts.ScriptServiceV3Alpha;

namespace Octopus.Tentacle.Client
{
public interface ITentacleClient : IDisposable
{
Task<CommandResponseV3Alpha> StartScript(StartScriptCommandV3Alpha startScriptCommand, ILog logger, CancellationToken scriptExecutionCancellationToken);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further to @LukeButters point around service types leaking out of tentacle client, The StartScriptCommandV3Alpha also really needs to get changed to a non-service type.

In my mini-RFC (as mentioned by Pawel), part of the proposal is to remove service types leaking into the Server codebase.

I'm happy for you to leave it for now, but just be aware that this is something I'm intending to address as part of the proposed solution in the RFC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. When you're back, maybe we should get together to sync our two streams of work so we get a good path forward 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants