
[WIP] finalizer approach to fix race condition due to pruning in results watcher #703

Open

wants to merge 4 commits into base: main

Conversation

Member

@ramessesii2 commented Feb 5, 2024

Changes

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Tested your changes locally (if this is a code change)
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contain the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

@tekton-robot tekton-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. labels Feb 5, 2024
@tekton-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign sayan-biswas after the PR has been reviewed.
You can assign the PR to them by writing /assign @sayan-biswas in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 5, 2024
@tekton-robot

Hi @ramessesii2. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member Author

@ramessesii2 left a comment

Debugging showed a few anomalies in the send-log process. For example, this piece of code from the tknlog writer is blocking function execution.
I've added comments on the other items as well.

@sayan-biswas @enarha fyi

@@ -342,11 +356,6 @@ func (r *Reconciler) sendLog(ctx context.Context, o results.Object) error {
zap.Error(err),
)
}
logger.Debugw("Streaming log completed",
Member Author

This asynchronous function never reaches this point when logs are streamed successfully, whereas we do see it when there is an error while streaming logs.

Contributor

@khrm left a comment

/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 5, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 60.8% -8.5

IsReadyForDeletionFunc: func(ctx context.Context, object results.Object) (bool, error) {
if LogsFinalizerExist(object, LogFinalizer) {
Contributor

I would think that this is not needed, as it is redundant with the finalizer pattern.

I would expect the resource to be marked as being deleted, but the deletion to be deferred by k8s until the last finalizer has been removed.

I could definitely be wrong though 😉

Member Author

You're right, deletion is deferred if the finalizer is there.
So, prior to that change we could see the following message while debugging even if the resource was not actually deleted (due to the finalizer), which is misleading; more importantly, we do go on to execute AfterDeletion.

Contributor

@gabemontero left a comment

The child-context-with-timeout idea from @adambkaplan looks like what I expected .... just the moving of the CloseSend is needed @ramessesii2

if gofuncerr != nil {
logger.Error(gofuncerr)
}
if gofuncerr = logsClient.CloseSend(); gofuncerr != nil {
Contributor

per the comment from @adambkaplan in our internal discussion, @ramessesii2, we want to execute CloseSend() here regardless of which case in the select fires, meaning it needs to be moved out of this case (out of the select, in fact) so that it is also executed when ctx.Done() is processed as well

Contributor

i.e. release the streaming resources even if the tkn client library is not properly reconciling the error channel used to block the goroutine from completing

that at least would help slow things down .... cleaning up the threads (or moving to a different model) could become a separate concern based on what you find as you drill down on your suspicion

Contributor

So we've confirmed in an internal slack thread including @ramessesii2 @sayan-biswas and myself that the streamLogs code in the "main" thread needs to close the channel the goroutine is blocked on.

Contributor

OK @ramessesii2, your call to CloseSend is still only in the case writeErr := <-echan: block

we need to call this as well if the case <-ctx.Done(): block fires first ... i.e. there was some weird hang on the write, so we can abort and free up the resources

my initial assumption is that the flush is irrelevant in this error case and can stay where it is, but do let us know if your testing uncovers a different conclusion

please update your PR so we can close this thread and have @pmacik 's next round of testing include this change
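
For reference, a minimal sketch of the pattern being requested, assuming a stand-in closer interface in place of the PR's actual gRPC logs client; names like echan follow the discussion above and are not the PR's code:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// closer stands in for the gRPC log client used in the PR; only CloseSend matters here.
type closer interface {
	CloseSend() error
}

type fakeLogsClient struct{}

func (fakeLogsClient) CloseSend() error { return nil }

// streamAndClose defers CloseSend so the stream is released no matter which
// select case fires, including the ctx.Done() timeout path.
func streamAndClose(ctx context.Context, echan <-chan error, c closer) {
	defer func() {
		if err := c.CloseSend(); err != nil {
			fmt.Println("CloseSend error:", err)
		}
	}()
	select {
	case <-ctx.Done():
		fmt.Println("context done before the write completed:", ctx.Err())
	case writeErr := <-echan:
		if writeErr != nil {
			fmt.Println("write error:", writeErr)
		}
		// Any flush of buffered log data would stay on this path only.
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	echan := make(chan error, 1)
	echan <- errors.New("simulated write error")
	streamAndClose(ctx, echan, fakeLogsClient{})
}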

Member Author

done, and I've done a bit more refactoring. Hopefully this is what you're asking for here as well. Pushing the changes now.

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 60.2% -9.1


// logctx is derived from ctx. Therefore, if ctx is cancelled (either explicitly through a call to its cancel
// function or when it reaches its deadline), logctx will be cancelled automatically.
logctx, _ := context.WithTimeout(ctx, 10*time.Minute)

go func(ctx context.Context, echan <-chan error, o metav1.Object) {
defer close(errChanRepeater)
Contributor

good use of defer @ramessesii2 @sayan-biswas

where are we closing the original logChan and errChan from the reader.Read() ?

Member Author

The logChan and errChan channels are receive ends here. When the channels are no longer used, they will be garbage collected.
It's a new subject for me; this is my reference: https://stackoverflow.com/questions/8593645/is-it-ok-to-leave-a-channel-open

Contributor

yeah there was a bit of back and forth on that thread @ramessesii2 but yes I think that is the end interpretation.

good find / research @ramessesii2

that said, given the exchanges I just saw with @pmacik today in slack, I should probably make a more thorough pass of the changes here .... admittedly I only focused on the mem leak bit initially.

will do that today

thanks

Contributor

ok @ramessesii2 @sayan-biswas I think I looked at this too quickly yesterday

the close of the errChanRepeater should not be in the goroutine that blocks on it !!

move the defer close(errChanRepeater) to immediately after the declaration of errChanRepeater in your current line 400

Member Author

The reason why I'm deferring close(errChanRepeater) in this goroutine is that it's important to tell the receiving side that all data has been sent, which happens here:

tknlog.NewWriter(logType, true).Write(&cli.Stream{
		Out: writer,
		Err: writer,
	}, logChan, errChanRepeater)

otherwise the above code will not finish executing if errChanRepeater is never closed. I confirmed this while debugging.
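
To illustrate the point, a simplified, self-contained stand-in for the behavior described here (not the tkn code itself): drain mimics a writer that only returns once both channels are closed, so the sending goroutines must own and close them.

package main

import "fmt"

// drain keeps reading until both channels are closed, similar to how the
// tknlog writer is described as blocking in this thread.
func drain(logs <-chan string, errs <-chan error) {
	for logs != nil || errs != nil {
		select {
		case l, ok := <-logs:
			if !ok {
				logs = nil // channel closed; stop selecting on it
				continue
			}
			fmt.Println("log:", l)
		case e, ok := <-errs:
			if !ok {
				errs = nil
				continue
			}
			fmt.Println("err:", e)
		}
	}
}

func main() {
	logCh := make(chan string)
	errCh := make(chan error)

	// The sending side owns the channels, so it closes them; without these
	// deferred closes, drain would block forever waiting for more values.
	go func() {
		defer close(logCh)
		logCh <- "step-one finished"
	}()
	go func() {
		defer close(errCh)
	}()

	drain(logCh, errCh) // returns only once both channels are closed
}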

Contributor

yep we are good here per the thread below

Contributor

@gabemontero left a comment

have more requests for changes

pkg/watcher/reconciler/dynamic/dynamic.go (resolved)
pkg/watcher/reconciler/dynamic/dynamic.go (outdated, resolved)
pkg/watcher/reconciler/dynamic/dynamic.go (outdated, resolved)
pkg/watcher/reconciler/dynamic/dynamic.go (outdated, resolved)
gabemontero and others added 4 commits February 12, 2024 12:47
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.0% -8.4

@tekton-robot

@ramessesii2: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-tekton-results-build-tests 52aafe7 link true /test pull-tekton-results-build-tests
pull-tekton-results-integration-tests 52aafe7 link true /test pull-tekton-results-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

zap.String("namespace", o.GetNamespace()),
zap.String("name", o.GetName()),
)
case writeErr := <-echan:
Contributor

to close the loop @ramessesii2 on your comment https://github.com/tektoncd/results/pull/703/files#r1486049142: if you are only closing the channel in the goroutine, what unblocks this call?

are we always hitting the case <-ctx.Done() path?

Member Author

Hey Gabe,
We don't hit ctx.Done() most of the time. Actually, I've never encountered it because of the 10-minute timeout.

case writeErr := <-echan, what unblocks this call?

My understanding is based on the fact that even if echan is closed by the tkn reader, receiving from a closed channel always succeeds, returning the zero value.
We could run into a problem if echan is nil: reading from or writing to a nil channel causes the code to hang forever. But since the check was not there in the original Results code, I assume we expect a non-nil channel from tknReader.
I'll address it in the next iteration; I missed this edge case.
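
The two channel behaviors mentioned here are easy to check in isolation; a small, self-contained example (not Results code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Receiving from a closed channel never blocks: it yields the zero value
	// and ok == false.
	closed := make(chan error)
	close(closed)
	v, ok := <-closed
	fmt.Println(v, ok) // prints: <nil> false

	// Receiving from a nil channel blocks forever; this select can only
	// return through the timeout case.
	var nilChan chan error
	select {
	case <-nilChan:
		fmt.Println("never reached")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("timed out waiting on a nil channel")
	}
}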

Contributor

neither the tekton reader nor the writer is closing this channel, based on the code I've seen @ramessesii2

can you provide the code link, other than this goroutine, where this channel is being closed?

Contributor

nevermind I found it :-)

Member Author

sure, the flow is tknReader.Read() -> readPipelineLog() -> finally landing in readAvailablePipelineLogs() for our case, since we only stream logs once PipelineRuns/TaskRuns have reached a specific closed state -> close echan (errC)

Contributor

not sure how the original echan could ever be nil, but in general defensive programming is good :-)

I'll wait for your update and then we can go from there - thanks

@ramessesii2
Member Author

hi @gabemontero @adambkaplan, with finalizers, at least locally, I've been able to fix the race condition. There's a small caveat with the finalizer: because we add a finalizer so the PipelineRun does not get pruned until the streaming/sending of logs is done, the finalizer is stored along with the PipelineRun object as well.
While that might not be a deal breaker (I'm not sure though), I find it simpler to use this field from LogStatus and simply hold off pruning until we find isStored: true.
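
For context, a minimal sketch of the finalizer pattern being weighed here; the package name, finalizer string, and helper names are illustrative assumptions, not the PR's actual identifiers:

package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// logFinalizer is a hypothetical finalizer name; the PR's LogFinalizer
// constant may differ.
const logFinalizer = "results.tekton.dev/streaming-logs"

// hasLogFinalizer reports whether the object still carries the log finalizer,
// i.e. whether Kubernetes will defer its deletion.
func hasLogFinalizer(o metav1.Object) bool {
	for _, f := range o.GetFinalizers() {
		if f == logFinalizer {
			return true
		}
	}
	return false
}

// removeLogFinalizer drops the log finalizer once log streaming has finished,
// which lets a pending deletion (pruning) proceed. The caller still needs to
// update/patch the object on the cluster afterwards.
func removeLogFinalizer(o metav1.Object) {
	kept := make([]string, 0, len(o.GetFinalizers()))
	for _, f := range o.GetFinalizers() {
		if f != logFinalizer {
			kept = append(kept, f)
		}
	}
	o.SetFinalizers(kept)
}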

@gabemontero
Contributor

for me, at least, I like using IsStored() instead as well @ramessesii2 @adambkaplan @sayan-biswas, assuming the watcher reconciler that handles pruning can watch for that and requeue if it is not yet stored

I believe that is the case, based on what I recall from the fix for handling cancelled pipeline/task runs

putting metadata in the object we are storing seems more fragile to me in hindsight

what do you all think?

perhaps as part of breaking out the mem leak fix from this one, either create a separate PR or a separate commit in this PR so we can compare IsStored vs. finalizer
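
A rough sketch of what the IsStored-based alternative could look like, using a stand-in type since the real results.Object and LogStatus wiring are not shown here; this illustrates the idea being compared, not the Results API:

package main

import (
	"context"
	"fmt"
)

// runObject is a stand-in for the watcher's results.Object; LogsStored mirrors
// the isStored field from LogStatus mentioned above and is an assumption here.
type runObject struct {
	Name       string
	LogsStored bool
}

// isReadyForDeletion gates pruning on log storage instead of a finalizer: the
// pruning reconciler would requeue the object until the logs are confirmed stored.
func isReadyForDeletion(_ context.Context, o runObject) (bool, error) {
	if !o.LogsStored {
		return false, nil // logs not persisted yet; requeue and check later
	}
	return true, nil
}

func main() {
	fmt.Println(isReadyForDeletion(context.Background(), runObject{Name: "run-1"}))
	fmt.Println(isReadyForDeletion(context.Background(), runObject{Name: "run-1", LogsStored: true}))
}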

@ramessesii2 ramessesii2 changed the title [WIP] fix race condition due to pruning in results watcher [WIP] finalizer approach to fix race condition due to pruning in results watcher Feb 16, 2024
@ramessesii2
Member Author

FYI: #713 uses the Logs API to address the race condition

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2024
@tekton-robot

@ramessesii2: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants