
[Feature Request] Improve detection of parent-child relationships #537

Open
worldtiki opened this issue Sep 28, 2018 · 13 comments
@worldtiki
Contributor

A service is only displayed in the service graph if a parent-child relationship is found.
This seems to be done in SpanAccumulator.scala, which looks for the 'cs'/'cr' and 'sr'/'ss' tags (to identify a span as client or server) and then tries to find span pairs based on whether they share the same IDs.

If a server-side service does not report tracing data (e.g. an external third party, or a system that doesn't support distributed tracing) and we only have spans coming from the client service, this parent-child relationship is never established and the server service is not represented in the service graph.

Apart from its name, there is enough information to represent this service in the service graph, as well as its stats (rate of calls, % of failures, etc.).

I'm not really sure what the implementation would look like. One option would be to extend the IDL.
Currently it only supports the name of the local service. Adding the name of the remote service as an optional field would allow us to display the correct name on the service graph (as well as in the traces view).

@shreyaa-sharma
Member

Thanks for sharing, Daniel. The team will discuss and get back to you here.

@ashishagg
Contributor

There is some legacy code that uses tags to generate the service graph, which we hope to get rid of very soon. However, there is an alternate and better mechanism working in parallel that computes the service graph using three IDs, namely TraceId, SpanId and ParentSpanId, plus a serviceName.

How does it work?
Say service 'A' calls 'B'. 'A' submits a client span with three IDs: T1, S1, P1. 'B' receives these IDs (in particular T1 and S1) over the HTTP headers, and then submits a server span with three IDs: T1, S2, S1.

If you notice, B's span is a child of A's, and haystack uses this parent-child relationship to generate an edge between the two service nodes. Note that when B calls its own downstream service (say C), B's client span will be a child of B's server span. But since those two spans come from the same service (B in this case), we don't create a loopback edge.
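The pairing logic described above can be sketched as follows (an illustrative approximation in Python, not Haystack's actual Scala implementation; all names here are made up):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: str
    service_name: str

def build_edges(spans):
    """Pair each span with its parent via (span_id, parent_span_id) and
    emit service-graph edges, skipping same-service loopbacks."""
    by_span_id = {s.span_id: s for s in spans}
    edges = set()
    for child in spans:
        parent = by_span_id.get(child.parent_span_id)
        if parent is None:
            continue  # unpaired span: currently dropped entirely
        if parent.service_name == child.service_name:
            continue  # e.g. B's client span under B's server span: no loopback edge
        edges.add((parent.service_name, child.service_name))
    return edges

# A's client span (T1, S1, P1) pairs with B's server span (T1, S2, S1):
spans = [Span("T1", "S1", "P1", "A"), Span("T1", "S2", "S1", "B")]
print(build_edges(spans))  # {('A', 'B')}
```

Note how the unpaired branch is exactly where the feature request bites: a client span whose server never reports is silently dropped.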

Regarding your ask to show a third-party system that doesn't support tracing as a graph node, we can think of using the peer.service tag instead of extending the IDL.

Now, if you look at the current implementation: for every span, we wait a few seconds (configurable) for its parent or child span to arrive. If we don't receive a pair, we drop that span altogether. We could add a fallback to use the 'peer.service' tag, but that might conflict with the case where the span pair simply arrived late and was missed.

In such a scenario we would show two edges, one for the peer.service name and one for the actual service name, if they differ: A calls the peer service 'foo', but foo calls itself 'bar'.

I need to evaluate whether the fallback strategy is OK; maybe turn it on behind a feature flag?
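A minimal sketch of that fallback (purely illustrative; the function name and flag are assumptions, not Haystack code):

```python
def edge_for_unpaired_span(local_service, tags, fallback_enabled=False):
    """Fallback for a span whose pair never arrived within the wait window:
    use its 'peer.service' tag to name the remote node, if present.
    Off by default, mirroring the feature-flag idea above."""
    peer = tags.get("peer.service")
    if not fallback_enabled or peer is None:
        return None  # current behaviour: the unpaired span is dropped
    # Caveat from above: if 'foo' later reports itself as 'bar', the graph
    # may show both A -> foo (from this fallback) and A -> bar.
    return (local_service, peer)

print(edge_for_unpaired_span("A", {"peer.service": "foo"}, fallback_enabled=True))
# ('A', 'foo')
```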

@worldtiki
Contributor Author

Thanks for your time @ashishagg.
I'm happy to help test this feature if you go with the feature flag approach.

@codefromthecrypt
Contributor

codefromthecrypt commented Oct 10, 2018

This field is too important to be relegated to a tag, and even if it weren't, the OpenTracing peer convention is neither widely nor consistently used.

You should consider a field remoteServiceName to go next to your existing field serviceName. It is more consistent with your existing model and likely to result in less trouble.

Long history and context below

Bear in mind that the original concept of peer.service came literally from Zipkin-based conversions. For example, it was conventionally the "sa" (server address) or "ca" (client address) binary annotation in Zipkin. When Yuri decided to make the tag named "peer.service", Uber was still a Zipkin site and probably wanted something easier than the old binary annotations we had. They knew the limitations of looking up based on different names ("sa" vs "ca"), so one name ("peer.service") was a bit of an improvement back then. Note that Yuri and Ben started and decided this in 2 days while a lot of folks were on holiday (27-29 Dec 2015). Notably, "service" was left out until several years later!

Meanwhile, Zipkin knew where the bodies were buried... we all suffered them, but wanted to always remain backwards compatible from a data point of view. Fixing the root problem requires buy-in from more than two people, though; at least that's been the case in Zipkin. We also need tests to pass: for example, we can't just define something and not use it in practice (such as in a service graph).

In early 2016, we started an effort to get rid of the confusion around the remote side and fix the model. This was called v2. Through that process, we noted some other things. Firstly, "peer" is an awkward and bad name; it likely wouldn't have been used if others had been involved in OpenTracing's PR besides just Ben and Yuri. Anyway, "remote" was a far more accessible term. Next, indexing nested collections has performance problems and merging complexity, and the concept of remote is quite a primary concern.

So, a top-level remote service field is both more efficient and easier to map to APIs. PS.. yes APIs! because this is too important to be relegated to a tag. In early 2017 we had the v2 format including this field, and Brave 4, which had an API for it as well. This made things much, much simpler on the user and data sides.
https://github.com/openzipkin/zipkin-api/blob/master/zipkin2-api.yaml#L412

Anyway.. instead of using a tag naming convention, which would only be present on a subset of instrumentation, I'd just add a remoteServiceName field to your data instead.

You already have a serviceName field, and a sibling remoteServiceName makes the most sense intuitively, versus having the local service as a field while relegating the remote one to a tag naming convention! Plus, you can always map "peer.service" into it in the rare case it is present, just like you map into your span service field today. Meanwhile, many Zipkin-based instrumentations have top-level APIs for the service name, including the remote service, so you'd get a stronger mapping when that's the case as well.

So, would you consider just doing this with a model field? I'd be more likely to help with work involved if you agree.
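As a sketch, the suggested model change might look like this (field names assumed by analogy with Zipkin v2; not an actual Haystack or Zipkin API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    service_name: str
    remote_service_name: Optional[str] = None  # proposed first-class field
    tags: dict = field(default_factory=dict)

def remote_service(span):
    """Prefer the top-level field; map the OpenTracing 'peer.service' tag
    into it in the rare case it is present."""
    return span.remote_service_name or span.tags.get("peer.service")

print(remote_service(Span("A", remote_service_name="B")))       # B
print(remote_service(Span("A", tags={"peer.service": "foo"})))  # foo
```

The point of the accessor is that consumers such as the service-graph builder never query nested tags directly; the tag is folded in once, at ingestion or lookup.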

@codefromthecrypt
Contributor

PS if it helps, here's our dependency linker code, which covers a bunch of edge cases as well. For example, people making circular links etc.

https://github.com/openzipkin/zipkin/blob/master/zipkin/src/main/java/zipkin2/internal/DependencyLinker.java
https://github.com/openzipkin/zipkin/blob/master/zipkin/src/test/java/zipkin2/internal/DependencyLinkerTest.java

Since the haystack data model is somewhat like Zipkin v1.5 ( :) ), the following mapper might also be of interest: https://github.com/openzipkin/zipkin/blob/master/zipkin-storage/mysql-v1/src/main/java/zipkin2/storage/mysql/v1/DependencyLinkV2SpanIterator.java

@ashishagg
Contributor

Good to read the history behind these 'tags' 👍

My concern with putting remoteServiceName in the data model is about ownership and uniformity. I will illustrate this with an example of two services, A calling B. The choice of remoteServiceName is entirely up to A, and B may not call itself 'B' in its own span's serviceName.

The two services can set whatever they desire, which is correct. As long as the service graph depends only on serviceName, the semantics remain intact. However, using both attributes (serviceName and remoteServiceName) with different naming may complicate things?

remoteServiceName makes absolute sense if service names are uniform across the service mesh.

Your thoughts?

@codefromthecrypt
Contributor

codefromthecrypt commented Oct 10, 2018 via email

@ashishagg
Contributor

Makes sense.
If we think of remoteServiceName as an optional attribute, then it would be roughly equivalent to a tag on the span, because this tag key is documented in the OpenTracing spec.

But as you mentioned, there is a history behind its introduction in the spec. Anyhow, give us a couple of days to think it through and we will create a work item for it.

@codefromthecrypt
Contributor

codefromthecrypt commented Oct 10, 2018

yep: the key here will be likely a decision of..

span.serviceName and span.tags[] where key = "peer.service"

  • a nested query, and only valid in OpenTracing, not Zipkin, Census, etc.

vs

span.serviceName and span.remoteServiceName

  • matches your model and what you need

:)
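In other words, the two lookup styles contrast like this (illustrative dict-shaped spans, not a real schema):

```python
# Option 1: nested tag query; only populated by OpenTracing-style instrumentation.
def remote_via_tag(span):
    return span.get("tags", {}).get("peer.service")

# Option 2: top-level field; uniform regardless of instrumentation.
def remote_via_field(span):
    return span.get("remoteServiceName")

span = {"serviceName": "A",
        "remoteServiceName": "B",
        "tags": {"peer.service": "B"}}
print(remote_via_tag(span), remote_via_field(span))  # B B
```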

@mchandramouli
Contributor

@ashishagg @ayansen I was looking at this again as I ran into a scenario where I need to record a span for an AWS call. Re-reading the thread, I think we should add remoteServiceName as a first-class field in the data model and go with the overriding logic mentioned in the thread. Can we discuss again and make the change?

@shreyaa-sharma
Member

@abhinavtripathi7 and @ashishagg to look into and prioritize this issue.

@jickyasu

jickyasu commented Oct 11, 2019

Any idea when we would be able to get this implemented? We are seeing some holes in our service graphs as a result of this.

@ashishagg
Contributor

Did you get a chance to try our ServiceInsights (beta) feature? You need to install a newer haystack UI version.
