Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: instrument Spanner client with OpenCensus metrics #54

Merged
merged 8 commits into from Feb 6, 2020
Merged

feat: instrument Spanner client with OpenCensus metrics #54

merged 8 commits into from Feb 6, 2020

Conversation

mayurkale22
Copy link
Member

@mayurkale22 mayurkale22 commented Jan 31, 2020

Updates #53

First set of Session Pool related metrics:

  1. cloud.google.com/java/spanner/max_allowed_sessions
    The maximum number of sessions allowed. Configurable by the user. This metric is labeled by database name, instance id and library version.
  2. cloud.google.com/java/spanner/in_use_sessions
    The number of sessions currently in use (or checked out from the pool at this very moment). This metric is labeled by database name, instance id and library version.
  3. cloud.google.com/java/spanner/max_in_use_session
    The maximum number of sessions in use during the last maintenance window interval. A maintenance window is a set 10 minute interval. After a complete maintenance window has passed, the value is reset to zero (and then start increasing again). This metric is labeled by database name, instance id and library version.

I will open follow-up PRs to address other metrics mentioned in the issue.

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Jan 31, 2020
Copy link

@rghetia rghetia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.

@mayurkale22 mayurkale22 changed the title Instrument Spanner client with OpenCensus metrics feat: instrument Spanner client with OpenCensus metrics Feb 4, 2020
Copy link
Contributor

@skuruppu skuruppu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks good as well. Adding @olavloite to get his thoughts on the implementation as well since he's the most familiar with the Java client library.

Copy link
Collaborator

@olavloite olavloite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks generally good to me, but I would recommend changing the description of the measurement of maxSessionsInUse. I think it would confuse many users if they see a value that is only increasing (or at least never decreasing) during a period of 10 minutes that is called 'The number of active sessions'.

// The Metric name and description
public static final String ACTIVE_SESSIONS = "cloud.google.com/java/spanner/active_sessions";
public static final String MAX_SESSIONS = "cloud.google.com/java/spanner/max_sessions";
public static final String ACTIVE_SESSIONS_DESCRIPTION = "The number of active sessions";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the description of this metric should be different, considering it is reading maxSessionsInUse. This value will only increase during a 10 minute interval, and then be reset to zero (and then start increasing again). Suggestion, something like:
'Max number of sessions in use during the last 10 minutes'.

I noticed the other discussion about whether to measure numSessionsInUse or maxSessionsInUse. I think both could be valuable for the user, and numSessionsInUse should have a description like 'The number of sessions currently in use'.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will change the description for maxSessionsInUse. Could you please elaborate more on numSessionsInUse metric. How is it different from maxSessionsInUse and how do you think it is useful for users in conjunction with maxSessionsInUse?

Just so you know, we don't want to add too many metrics due to costs associated with exporting to backend (like stackdriver). Having said that, I am more than happy to include metrics which are meaningful (and useful) for the users to resolve/understand the production issues at some extend.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • numSessionsInUse reflects the number of sessions checked out from the pool at this very moment. Depending on probe intervals and average transaction duration, this number might often be measured to be zero for a system under low load. E.g. if the system executes 1 transaction per second, the average transaction duration is 300ms and the value is probed once every 10 seconds, the chances are big that you will often measure the value between two transactions and get a zero. On a system with higher load, measuring this value will give the user a better idea about how the distribution of the transactions are over time. If this value is relatively constant most of the time, but suddenly shows a spike that cannot be correlated with a similar sudden spike in user requests, it might be an indication of a concurrency problem somewhere (locks in the database, concurrency issues in the application itself etc.).
  • maxSessionsInUse reflects the maximum number of numSessionsInUse during the current maintenance window. A maintenance window is a hard-coded 10 minute interval. After a complete maintenance window has passed, the value is reset to 0. The value is updated every time a session is checked out of the pool by the simple calculation maxSessionsInUse = Math.max(numSessionsInUse, maxSessionsInUse).

As you can conclude from the above (and which I also just now realized ;-)), the maxSessionsInUse has a slight flaw: If no new session is checked out during a maintenance window, the value will remain zero during the entire window, even if there are sessions that are kept checked out during the entire window (i.e. the sessions were checked out during a previous maintenance window).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I have added numSessionsInUse metric PTAL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to get this PR merged before adding other metrics.

Copy link
Collaborator

@olavloite olavloite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think you should change the packages of the MetricsRegistryConstants and MetricsRegistryTestUtils. Unless there's a very good reason for it, I would suggest placing these in the normal com.google.cloud.spanner package. Sorry for not noticing this before :-(

For the rest; LGTM

@mayurkale22
Copy link
Member Author

@rghetia and @skuruppu I have addressed all the comments, PTAL. As mentioned earlier, I will work on follow-up PR to add other metrics.

"Max number of sessions in use during the last 10 minutes";
public static final String MAX_SESSIONS_DESCRIPTION = "The number of max sessions configured";
public static final String SESSIONS_IN_USE_DESCRIPTION =
"The number of sessions checked out from the pool";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names are confusing. I would suggest

old new Description
active_sessions max_in_use_session The max number of sessions concurrently in use during the current 10 minutes interval
max sessions max_allowed_sessions The Maximum number of sessions allowed. Configurable by the user.
sessions_in_use in_use_sessions The number of sessions currently in use

Alternatively, in_use can be replaced with 'active' in name and description.
checked out has different meaning than active/in_use. If that is appropriate, then it should be used in the name and the description.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in e6938bc, decided to use in_use. PTAL

Copy link
Contributor

@skuruppu skuruppu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits and a comment about the test but LGTM otherwise.

Add active_sessions (The number of active sessions in use) and max_sessions (The number of max sessions configured the user) metrics
Use maxSessionsInUse instead of numSessionsInUse and update the description.
gopherbot pushed a commit to googleapis/google-cloud-go that referenced this pull request Mar 31, 2020
This change addes five metrics for session management:

* in_use_sessions: the current number of sessions that are checked out
from the session pool.
* max_in_use_sessions: the maximum number of in_use_sessions in last 10
minutes. This is based on tumbling windows, instead of sliding windows.
* max_allowed_sessions: the maximum number of sessions that is
configured by the user.
* get_sessions_timeout: the cumulative number of get sessions timeouts
when pool exhaustion happens.
* num_acquired_sessions: the cumulative number of sessions that are
checked out from the session pool.
* num_released_sessions: the cumulative number of sessions that are
released back to the session pool.

All metrics are tagged by:

* client_id: each database has its own increasing ID sequence. For
two different databases, their client IDs all start from "client-1".
* database: the database ID.
* instance_id: the instance ID.
* library_version: the library version from
google-cloud-go/internal/version which is a date in YYYYMMDD format.

Notes:

There are three ways to check out a session from the pool:

1) take(): get a read session called by a user
2) takeWriteSession(): get a read/write session by a user
3) getNextForTx() in healthchecker's worker (session.go:1336):
healthchecker's workers convert read sessions to r/w sessions. E.g.,
when initializing the pool, workers check out 20 read sessions from
the pool, turn them into r/w sessions, and put them back to the pool
, assume that MinOpended=100 and WriteSessions=0.2.

So some metrics are also emitted by case 3. This might confuse users if
they are not aware of this behaviour.

Related Java PRs:

* googleapis/java-spanner#54
* googleapis/java-spanner#65
* googleapis/java-spanner#67

Change-Id: Ie163b08ef18ac2a47e1669fefab92d61fe8f2a82
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/52953
Reviewed-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Knut Olav Løite <koloite@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants