
Use base64+json for trace context encoding #1775

Open · wants to merge 3 commits into main from json-trace-context
Conversation


@goodspark goodspark commented Oct 4, 2022

Back when I added trace propagation to the agent, I was short-sighted and used the golang-binary format. Since the trace context was meant to be propagated between builds of a pipeline, it was safe to assume the consumers would always be able to decode golang-binary formats.

But now I want to trace additional spans within non-Golang code and link them to the parent trace of the pipeline build. This will let me gather holistic metrics about specific failure modes and the performance of various commands/tooling running in my repos.

So I picked JSON, since it's a format all modern languages can easily parse without the need for a special library.

TODOs

  • Tests
  • Test a build and verify traces go through and trace context is properly passed to subprocesses

@goodspark goodspark force-pushed the json-trace-context branch 2 times, most recently from fd54544 to 4d53ce9 Compare October 5, 2022 01:55
@@ -3,7 +3,7 @@ package tracetools
import (
@goodspark (author) commented:

Test:

Launched an agent configured to use the Datadog tracing backend and set up a DD agent for it to talk to.
Ran a job that simply printed the env vars:

[2022-10-05T01:52:48Z] BUILDKITE_TRACE_CONTEXT=eyJ4LWRhdGFkb2ctcGFyZW50LWlkIjoiNDQ4OTAzMDg0NTI3NTk3ODUxMSIsIngtZGF0YWRvZy1zYW1wbGluZy1wcmlvcml0eSI6IjIiLCJ4LWRhdGFkb2ctdGFncyI6Il9kZC5wLmRtPS00IiwieC1kYXRhZG9nLXRyYWNlLWlkIjoiODk0NzQ0MDA1Njc4MTEwNTcxNCJ9Cg==

Decoding it:

{"x-datadog-parent-id":"4489030845275978511","x-datadog-sampling-priority":"2","x-datadog-tags":"_dd.p.dm=-4","x-datadog-trace-id":"8947440056781105714"}

Then launched another job but with the above env vars given as inputs (to mimic distributed tracing).

[2022-10-05T02:05:04Z] BUILDKITE_TRACE_CONTEXT=eyJ4LWRhdGFkb2ctcGFyZW50LWlkIjoiODE0NzEzODA3MDc2MDA0OTEzMyIsIngtZGF0YWRvZy1zYW1wbGluZy1wcmlvcml0eSI6IjIiLCJ4LWRhdGFkb2ctdGFncyI6Il9kZC5wLmRtPS00IiwieC1kYXRhZG9nLXRyYWNlLWlkIjoiODk0NzQ0MDA1Njc4MTEwNTcxNCJ9Cg==

Decoded:

{"x-datadog-parent-id":"8147138070760049133","x-datadog-sampling-priority":"2","x-datadog-tags":"_dd.p.dm=-4","x-datadog-trace-id":"8947440056781105714"}

Verified traces still get sent and that the two builds were linked together in Datadog.
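The decode step in the walkthrough above can be reproduced offline. A minimal consumer-side sketch, assuming only that the env var holds base64-wrapped JSON as shown in the logs (the helper name is hypothetical):

```go
// Consumer-side sketch: recover the trace context from
// BUILDKITE_TRACE_CONTEXT without any Go-specific decoder.
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// decodeTraceContext undoes base64, then JSON. json.Unmarshal
// tolerates the trailing newline that json.Encoder appends.
func decodeTraceContext(encoded string) (map[string]string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	var carrier map[string]string
	if err := json.Unmarshal(raw, &carrier); err != nil {
		return nil, err
	}
	return carrier, nil
}

func main() {
	// The value captured in the first test job above.
	encoded := "eyJ4LWRhdGFkb2ctcGFyZW50LWlkIjoiNDQ4OTAzMDg0NTI3NTk3ODUxMSIsIngtZGF0YWRvZy1zYW1wbGluZy1wcmlvcml0eSI6IjIiLCJ4LWRhdGFkb2ctdGFncyI6Il9kZC5wLmRtPS00IiwieC1kYXRhZG9nLXRyYWNlLWlkIjoiODk0NzQ0MDA1Njc4MTEwNTcxNCJ9Cg=="
	carrier, err := decodeTraceContext(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println("trace id:", carrier["x-datadog-trace-id"])
	fmt.Println("parent id:", carrier["x-datadog-parent-id"])
}
```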

@@ -22,8 +22,7 @@ func EncodeTraceContext(span opentracing.Span, env map[string]string) error {
}

buf := bytes.NewBuffer([]byte{})
enc := gob.NewEncoder(buf)
if err := enc.Encode(textmap); err != nil {
if err := json.NewEncoder(buf).Encode(textmap); err != nil {
@goodspark (author) commented:

Switching immediately means there will be a period when people might lose trace linking (but not traces) while they roll out the new agent across their CI fleet. That seemed simpler than trying to plumb through all the env vars and configs, which is what I started with but had trouble getting to work.

@moskyb commented:

so i'm not an opentracing expert, but the value of the BUILDKITE_TRACE_CONTEXT env var is something that downstream (i.e. non-buildkite-agent) processes have to decode to get a trace context, right? it's not just an opaque value?

if that's the case, wouldn't switching encodings here mean that tracing setups that previously worked would have to change their decoding strategy? if so, this would be a breaking change, and i don't think we'd be able to ship it.

there are other options, i think - we could expose a second env var like BUILDKITE_TRACE_CONTEXT_BASE64 or something. we could also do the other thing and wire up all the CLI flags and configs and such, though i agree with you that that's not a great solution; not least because it's a pain to implement, but it's also good to keep the number of options on the agent down where we can.

@goodspark (author) replied:

I wound up adding an arg and plumbing it through. It turns out the code I had before was mostly fine; I just had typos that were breaking some agent-to-bootstrap plumbing.

@goodspark goodspark changed the title Provide option for base64 json trace context encoding USe base64+json for trace context encoding Oct 5, 2022
@goodspark goodspark marked this pull request as ready for review October 5, 2022 02:15
@goodspark goodspark changed the title USe base64+json for trace context encoding Use base64+json for trace context encoding Oct 5, 2022
@@ -3,7 +3,7 @@ package tracetools
import (
@goodspark (author) commented:

It'll also be interesting to see if we can use the proposed env vars in open-telemetry/opentelemetry-specification#740 (also mentioned in #1663) as a move towards more standardization. This would help others reuse libraries native to whatever language/tool they're using while still maintaining trace linking across process boundaries.
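For context, the standardized alternative referenced above carries a W3C Trace Context `traceparent` value (`version-traceid-spanid-flags`) in an env var. A minimal sketch, with the sample IDs taken from the W3C spec's own example rather than this PR:

```go
// Sketch: formatting a W3C Trace Context traceparent string, the kind
// of value the proposed TRACEPARENT env var would carry.
package main

import "fmt"

// traceparent formats version 00 of the header: trace-id is 16 bytes
// hex (32 chars), span-id is 8 bytes hex (16 chars), flags 01 = sampled.
func traceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

func main() {
	fmt.Println("TRACEPARENT=" + traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true))
}
```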

@moskyb left a comment:

the code of this looks great, but i have some concerns about its backward compatibility. provided we can sort those out, i'd love to get this into the agent - more (and more standard) encoding methods are a good thing!


agates4 commented Nov 7, 2022

I'm excited about this PR - it would help us send additional traces through to Datadog!

@goodspark (author) commented:

This has been sitting around for months. Any chance someone from Buildkite can review this?
