Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sql text counts stats to vtcombo,vtgate+vttablet #15897

Merged
merged 6 commits into from
Jun 4, 2024

Conversation

timvaillancourt
Copy link
Contributor

@timvaillancourt timvaillancourt commented May 9, 2024

Description

This PR adds stats for the size of query text processed by vtgate and vttablet

This is useful to understand large queries that potentially cause performance, grpc message size, bandwidth problems

Related Issue(s)

Resolves #15929

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented May 9, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels May 9, 2024
@github-actions github-actions bot added this to the v20.0.0 milestone May 9, 2024
@timvaillancourt timvaillancourt marked this pull request as ready for review May 13, 2024 14:27
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt timvaillancourt changed the title Add sql text counts stats to vtgate+vttablet Add sql text counts stats to vtcombo,vtgate+vttablet May 13, 2024
@timvaillancourt timvaillancourt removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 13, 2024
@timvaillancourt
Copy link
Contributor Author

Opened this one with a lot of failing CI to see if others have ideas how to debug

The failures are confusing and feel unrelated to this change 🤔

Signed-off-by: Harshit Gangal <harshit@planetscale.com>
@harshit-gangal
Copy link
Member

Opened this one with a lot of failing CI to see if others have ideas how to debug

The failures are confusing and feel unrelated to this change 🤔

There failures are related to the changes made. I have pushed a commit that fixes some of those. You should run the test locally and fix the remaining if any.

Copy link

codecov bot commented May 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.22%. Comparing base (0353ad4) to head (0531ce1).
Report is 59 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15897      +/-   ##
==========================================
- Coverage   68.45%   68.22%   -0.24%     
==========================================
  Files        1559     1541      -18     
  Lines      196825   197330     +505     
==========================================
- Hits       134736   134627     -109     
- Misses      62089    62703     +614     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@timvaillancourt
Copy link
Contributor Author

Opened this one with a lot of failing CI to see if others have ideas how to debug
The failures are confusing and feel unrelated to this change 🤔

There failures are related to the changes made. I have pushed a commit that fixes some of those.

@harshit-gangal indeed it is, thanks! 🤦 🙇

You should run the test locally and fix the remaining if any.

Good point, I've made a bad habit of relying on GitHub Actions CI because many go test CIs fail on my MacBook when they rely on external things (perhaps not this test). I'll give running CI locally another shot 👀

Somehow this was not obvious from the CI outputs I looked at, including this Unit test run that ends in just:

+ go-junit-report -set-exit-code
make: *** [Makefile:209: unit_test] Error 1
Error: Process completed with exit code 1.

Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative

@harshit-gangal
Copy link
Member

Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative

The first step records the output, if you click on the step below in the CI you will see the output.

@harshit-gangal
Copy link
Member

Screenshot 2024-05-17 at 8 54 54 PM

@timvaillancourt
Copy link
Contributor Author

timvaillancourt commented May 17, 2024

Perhaps it's just the v15 fork/release I'm working on, but I'm used to the Unit test CI including the output of go test -v to help explain the failure. I'll see if something about that changed and if there is an alternative

The first step records the output, if you click on the step below in the CI you will see the output.

@harshit-gangal ahh, that's where it goes! Thanks a lot more on this one soon

@timvaillancourt timvaillancourt added Component: Observability Pull requests that touch tracing/metrics/monitoring Component: VTCombo labels May 29, 2024
@frouioui
Copy link
Member

frouioui commented May 31, 2024

Since the PR has two approvals I am adding the Do Not Merge label so we don't merge this by accident before @deepthi's concern is answered.

timvaillancourt added a commit to slackhq/vitess that referenced this pull request Jun 1, 2024
* Add sql text counts stats to `vtcombo`,`vtgate`+`vttablet`

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* initialize sqlTextCounts where missed

Signed-off-by: Harshit Gangal <harshit@planetscale.com>

* missing fix

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* fix accidental line delete

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Harshit Gangal <harshit@planetscale.com>
@timvaillancourt
Copy link
Contributor Author

@deepthi / @frouioui, here are the metric outputs:

tvaillancourt@REDACTED:~$ curl -s localhost:15000/debug/vars | jq .QuerySQLTextCounts
{
  "Join.Select.": 3234,
  "dual.Select.": 3615
}
tvaillancourt@REDACTED:~$ curl -s localhost:15000/metrics 2>&1 | grep vttablet_query_sql_text_counts
# HELP vttablet_query_sql_text_counts query sql text counts
# TYPE vttablet_query_sql_text_counts counter
vttablet_query_sql_text_counts{plan="Select",table="Join",workload=""} 3234
vttablet_query_sql_text_counts{plan="Select",table="dual",workload=""} 3650

So looks good 👍. While doing that I thought of two potential alternatives in case this didn't work: QueryTextProcessed or QueryTextRead, any thoughts?

Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat repeating @frouioui , will merge once @deepthi 's concern is addressed (likely you need o change ...SQL... to ...Sql... and will then merge.

@timvaillancourt
Copy link
Contributor Author

Somewhat repeating @frouioui , will merge once @deepthi 's concern is addressed (likely you need o change ...SQL... to ...Sql... and will then merge.

@shlomi-noach the potential issue of s_q_l that Deepthi pointed out didn't seem to be a problem in practice, here is the metric output:

tvaillancourt@REDACTED:~$ curl -s localhost:15000/metrics 2>&1 | grep vttablet_query_sql_text_counts
# HELP vttablet_query_sql_text_counts query sql text counts
# TYPE vttablet_query_sql_text_counts counter
vttablet_query_sql_text_counts{plan="Select",table="Join",workload=""} 3234
vttablet_query_sql_text_counts{plan="Select",table="dual",workload=""} 3650

Let me know if any change is needed. I'll add a changelog/ note once we agree on the metric name and I think we're done! 🙇

deepthi
deepthi previously requested changes Jun 3, 2024
@@ -209,6 +209,11 @@ var (
"VtgateApiRowsAffected",
"Rows affected by a write (DML) operation through the VTgate API",
[]string{"Operation", "Keyspace", "DbType"})

sqlTextCounts = stats.NewCountersWithMultiLabels(
"VtgateSQLTextCounts",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now I understand the intent of this metric, it won't work as implemented. Counters should only be used for monotonically increasing numbers, prometheus will not let you reduce the value of a "counter".
The right type of metric to use here is a Gauge, which can take any value, and change up or down. If I had read the linked issue properly, I would have realized this during the original review.
The name of the metric is also misleading, because it is not a "count" of some specific kind of sql text, it is instead the length in text of the query that is being processed, so the name needs to reflect that.

@@ -209,6 +209,11 @@ var (
"VtgateApiRowsAffected",
"Rows affected by a write (DML) operation through the VTgate API",
[]string{"Operation", "Keyspace", "DbType"})

sqlTextCounts = stats.NewCountersWithMultiLabels(
"VtgateSQLTextCounts",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QueryTextProcessed will in fact be a better name for the metric.

@deepthi
Copy link
Member

deepthi commented Jun 3, 2024

@shlomi-noach please feel free to dismiss my review if @timvaillancourt is able complete the requested changes before code freeze.

@timvaillancourt
Copy link
Contributor Author

timvaillancourt commented Jun 3, 2024

OK, now I understand the intent of this metric, it won't work as implemented. Counters should only be used for monotonically increasing numbers, prometheus will not let you reduce the value of a "counter".

@deepthi the intent of this metric is a counter that increments forever so we can infer the rate of query text processed at any given time. I don't believe there is a need to ever reduce this metric, it only increases from the time the daemon starts like other counters

EDIT: here we can see the counter is only incremented by 0 or more: vtg.sqlTextCounts.Add(statsKey, int64(len(sql)))

@deepthi
Copy link
Member

deepthi commented Jun 3, 2024

OK, now I understand the intent of this metric, it won't work as implemented. Counters should only be used for monotonically increasing numbers, prometheus will not let you reduce the value of a "counter".

@deepthi the intent of this metric is a counter that increments forever so we can infer the rate of query text processed at any given time. I don't believe there is a need to ever reduce this metric, it only increases from the time the daemon starts like other counters

EDIT: here we can see the counter is only incremented by 0 or more: vtg.sqlTextCounts.Add(statsKey, int64(len(sql)))

I think the term count is misleading in this context. Count is typically how many of an entity we encountered, so for example total number of queries. Here we are adding the length of the query each time. A counter will work for this, but I do request that the name be changed to avoid this kind of confusion.

@shlomi-noach
Copy link
Contributor

@timvaillancourt to summarize the current state: the changes make sense, and the metric looks correct. But if you could please rename it to something such as QueryProcessedTextLengths? And with that we can merge the PR.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt
Copy link
Contributor Author

timvaillancourt commented Jun 3, 2024

But if you could please rename it to something such as QueryProcessedTextLengths

@shlomi-noach / @deepthi I felt Lengths is a bit awkward so I've gone with QueryTextCharactersProcessed which I think is long but very clear. I've also added this to the 20.0.0 changelog

Rerequested your review 🙇

changelog/20.0/20.0.0/summary.md Outdated Show resolved Hide resolved
changelog/20.0/20.0.0/summary.md Outdated Show resolved Hide resolved
changelog/20.0/20.0.0/summary.md Outdated Show resolved Hide resolved
@deepthi
Copy link
Member

deepthi commented Jun 3, 2024

Some minor wording changes to release notes are required to avoid user confusion. Rest LGTM. @shlomi-noach you may dismiss my review once these changes are done.

timvaillancourt and others added 2 commits June 4, 2024 01:16
Co-authored-by: Deepthi Sigireddi <deepthi.sigireddi@gmail.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Deepthi Sigireddi <deepthi.sigireddi@gmail.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@shlomi-noach shlomi-noach dismissed deepthi’s stale review June 4, 2024 04:43

comments have been addressed

@shlomi-noach shlomi-noach merged commit e640384 into vitessio:main Jun 4, 2024
92 checks passed
@timvaillancourt timvaillancourt deleted the sql-text-counts-stats branch June 4, 2024 13:50
timvaillancourt added a commit to slackhq/vitess that referenced this pull request Jun 4, 2024
…15897)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Deepthi Sigireddi <deepthi.sigireddi@gmail.com>
timvaillancourt added a commit to slackhq/vitess that referenced this pull request Jun 4, 2024
* `vtgate`: support filtering tablets by tablet-tags  (vitessio#15911)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Add support for sampling rate in `streamlog` (vitessio#15919)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Fix merge conflict resolution

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* update rand import

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* Add sql text counts stats to `vtcombo`,`vtgate`+`vttablet` (vitessio#15897)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Deepthi Sigireddi <deepthi.sigireddi@gmail.com>

* missing rename

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

* missing rename again

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>

---------

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Harshit Gangal <harshit@planetscale.com>
Co-authored-by: Deepthi Sigireddi <deepthi.sigireddi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Observability Pull requests that touch tracing/metrics/monitoring Component: Query Serving Component: VTCombo Component: VTGate Component: VTTablet Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC: add stats for rate of SQL text processed
5 participants