Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve metrics and access logs #937

Open
5 tasks
FlorentinDUBOIS opened this issue May 16, 2023 · 3 comments
Open
5 tasks

Improve metrics and access logs #937

FlorentinDUBOIS opened this issue May 16, 2023 · 3 comments
Assignees
Milestone

Comments

@FlorentinDUBOIS
Copy link
Collaborator

FlorentinDUBOIS commented May 16, 2023

Hello there 👋,

After some discussions with @StevenLeRoux, @BlackYoup and @Wonshtrum.
We found out that Sōzu need to provides a little more information about request and response.
So, I use this issue to make a summary of what we need.

Firstly, I will remember that we have to tackle the issue #641, #892 and #804.

What we already have

Access logs

Currently, when we handle request (TCP, HTTP[s] or WS[s]) will write following access logs.

01H0FXZVQK7RK3W55X7V26M259 TcpTest TcpTest-0-127.0.0.1:4000	127.0.0.1:52378 -> 127.0.0.1:4000	19ms 168μs 79 68522	owner=John, uuid=3f740af1-45fd-45ce-b61f-17bf1a51505f	TCP -
01H0FXX3VTR8T3B1K0ZHXQZZNQ MyCluster the-backend-to-my-app	127.0.0.1:50444 -> 127.0.0.1:1026	61ms 4959μs 79 68535	key=value	HTTPS-TLS1.3 lolcatho.st:8443 200	GET /

Below, an explanation of the above access logs:

request id cluster id backend id client socket address -> backend socket address response time service time bytes in bytes out metadata protocol hostname status verb path
01H0FXX3VTR8T3B1K0ZHXQZZNQ MyCluster the-backend-to-my-app 127.0.0.1:50444 -> 127.0.0.1:1026 61ms 4959μs 79 68535 key=value HTTPS-TLS1.3 lolcatho.st:8443 200 GET /
request id cluster id backend id client socket address -> backend socket address response time service time bytes in bytes out metadata protocol websocket context
01H0FXZVQK7RK3W55X7V26M259 TcpTest TcpTest-0-127.0.0.1:4000 127.0.0.1:52378 -> 127.0.0.1:4000 19ms 168μs 79 68522 owner=John, uuid=3f740af1-45fd-45ce-b61f-17bf1a51505f TCP -

response time : the total duration of the request
service time : the total duration that Sozu took to treat the request
websocket context: the value is - if the protocol is TCP

You can find the code source that output those logs here:

TCP or WS

info_access!(
"{}{} -> {}\t{} {} {} {}\t{}\t{} {}",
self.log_ctx,
session_addr,
backend,
LogDuration(response_time),
LogDuration(service_time),
metrics.bin,
metrics.bout,
OptionalString::from(tags.as_ref()),
proto,
self.websocket_context.as_deref().unwrap_or("-")
);

error_access!(
"{}{} -> {}\t{} {} {} {}\t{} {} | {}",
self.log_ctx,
session,
backend,
LogDuration(response_time),
LogDuration(service_time),
metrics.bin,
metrics.bout,
proto,
self.websocket_context.as_deref().unwrap_or("-"),
message
);

HTTP

info_access!(
"{}{} -> {}\t{} {} {} {}\t{}\t{} {} {}\t{}",
self.log_context(),
session,
backend,
LogDuration(response_time),
LogDuration(service_time),
metrics.bin,
metrics.bout,
OptionalString::from(tags.as_ref()),
proto,
host,
status_line,
request_line
);

info_access!(
"{}{} -> {}\t{} {} {} {}\t{}\t{} {} {}\t{}",
self.log_context(),
session,
backend,
LogDuration(response_time),
LogDuration(service_time),
metrics.bin,
metrics.bout,
OptionalString::from(tags.as_ref()),
proto,
host,
status_line,
request_line
);

error_access!(
"{}{} -> {}\t{} {} {} {}\t{} {} {}\t{} | {} {}",
self.log_context(),
session,
backend,
LogDuration(response_time),
LogDuration(service_time),
metrics.bin,
metrics.bout,
proto,
host,
request_line,
status_line,
message,
OptionalString::from(tags.as_ref())
);

Metrics

Today, we have a bunch of metrics and we do not use all of them.
We should investigate.
To do so, we have to:

  • make a proper list of metrics (classname+labels+kind+description).
  • Then, establish the cardinality of each one and segregate them in two parts:
    • The one that are Sōzu-related (event loop, worker, etc..)
    • The one that are cluster-related (status code, response time, service time, cluster response time, data transfer, etc..)
      • The cluster-related have to be labeled with the cluster identifier and optionaly the backend identifier.

What is missing

Access logs

To us, it miss some informations in access logs as :

  • client rtt: the time in millisecond to establish a connection between the client and Sōzu
  • server rtt: the time in millisecond to establish a connection between Sōzu and the backend
  • cluster response time: the time in millisecond between the end of the transfer of the request body and the response header betweend Sōzu and the backend
  • data transfer time: the total duration in millisecond of transfering data between the client to the backend and vice-versa
  • user-agent

Metrics

We should investigate first the item above.

Registry:

@Wonshtrum
Copy link
Member

The first draft can be found on the latest commit of #904.

@Keksoj
Copy link
Member

Keksoj commented Jun 15, 2023

In the spirit of the request counter of #947, we could keep track of how many access logs were produced. See #950

@Keksoj
Copy link
Member

Keksoj commented Jun 16, 2023

Oh and we should do something about this:

Worker 0
=========
┌──────────────────────┬───────┬───────┐
│                      │ gauge │ count │
├──────────────────────┼───────┼───────┤
│ accept_queue.count   │ 0     │       │
├──────────────────────┼───────┼───────┤
│ buffer.count         │ 0     │       │
├──────────────────────┼───────┼───────┤
│ bytes_in             │       │ 150   │
├──────────────────────┼───────┼───────┤
│ bytes_out            │       │ 178   │
├──────────────────────┼───────┼───────┤
│ client.connections   │ 0     │       │
├──────────────────────┼───────┼───────┤
│ http.404.errors      │       │ 2     │
├──────────────────────┼───────┼───────┤
│ http.active_requests │ 0     │       │
├──────────────────────┼───────┼───────┤
│ http.requests        │       │ 2     │
├──────────────────────┼───────┼───────┤
│ protocol.http        │ 0     │       │
├──────────────────────┼───────┼───────┤
│ sent_access_logs     │       │ 2     │
├──────────────────────┼───────┼───────┤
│ slab.count           │ 6     │       │
└──────────────────────┴───────┴───────┘

Notice that many .count are actually gauges. This is wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants