feat: client side metrics handlers #924

Open: wants to merge 11 commits into base branch client_side_metrics_data_model

Conversation

daniel-sanche (Contributor) commented Jan 26, 2024

This PR builds on #923 to add handlers to the client-side metrics system; handlers subscribe to the metrics stream and export the results to different collection systems.

Follow-up PR:


We add three handlers to the system (a sketch of how handlers plug in follows the list):

  • GoogleCloudMetricsHandler: sends metrics to a private OpenTelemetry meter, and then periodically exports them to GCP. Built on top of OpenTelemetryMetricsHandler
  • OpenTelemetryMetricsHandler: sends metrics to the root MeterProvider, so the user can access the exported metrics for their own systems. This will be off by default, but can be added alongside GoogleCloudMetricsHandler if needed
  • _StdoutMetricsHandler: can print metrics to stdout as they arrive. Mostly for debugging (we can remove this if you don't think it's useful)
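The handler interface itself isn't shown in this excerpt, so here is a minimal sketch of how a controller might fan completed-operation metrics out to registered handlers. Only the three handler class names above and on_operation_complete come from the PR; everything else is an assumption for illustration.

    # Minimal sketch, not the PR's actual classes.
    from dataclasses import dataclass, field
    from typing import Any, List

    class MetricsHandler:
        """Base interface; subclasses receive completed operation metrics."""

        def on_operation_complete(self, op: Any) -> None:
            pass

    @dataclass
    class MetricsController:
        handlers: List[MetricsHandler] = field(default_factory=list)

        def add_handler(self, handler: MetricsHandler) -> None:
            self.handlers.append(handler)

        def on_operation_complete(self, op: Any) -> None:
            # fan the completed operation out to every registered handler
            for handler in self.handlers:
                handler.on_operation_complete(op)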

@daniel-sanche daniel-sanche requested review from a team as code owners January 26, 2024 23:25
@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. api: bigtable Issues related to the googleapis/python-bigtable API. labels Jan 26, 2024
# process each metric from OTel format into Cloud Monitoring format
for resource_metric in metrics_data.resource_metrics:
    for scope_metric in resource_metric.scope_metrics:
        for metric in scope_metric.metrics:
Contributor:

Is there a way to filter only Bigtable-related metrics, to avoid publishing irrelevant metrics?

Contributor Author:

This exporter is attached to a private MeterProvider, so there shouldn't be any other metrics showing up here
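For illustration, a minimal standalone sketch (not code from the PR) of the isolation described above: a reader/exporter attached to a private MeterProvider only ever observes instruments created from that provider, so metrics registered on the global provider never reach it.

    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import InMemoryMetricReader

    # instruments created from this private provider are the only ones the
    # attached reader (and therefore the exporter) will ever see
    private_reader = InMemoryMetricReader()
    private_provider = MeterProvider(metric_readers=[private_reader])

    meter = private_provider.get_meter("bigtable.googleapis.com")
    retry_count = meter.create_counter(name="retry_count")
    retry_count.add(1)

    print(private_reader.get_metrics_data())  # contains only this provider's metrics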

)
self.retry_count = meter.create_counter(
    name="retry_count",
    description="A count of additional RPCs sent after the initial attempt. Under normal circumstances, this will be 1.",

# grab meter for this module
meter = meter_provider.get_meter("bigtable.googleapis.com")
# create instruments
self.operation_latencies = meter.create_histogram(
Contributor:

let's make sure the metric descriptions are identical to our public doc https://cloud.google.com/bigtable/docs/client-side-metrics-descriptions

Contributor Author:

Fixed, although I didn't include this part of the client blocking latencies description, since it doesn't apply: "For versions 2.21.0 and later, this metric also includes the latencies of requests queued on gRPC channels."

# fixed labels sent with each metric update
self.shared_labels = {
    "client_name": f"python-bigtable/{bigtable_version}",
    "client_uid": client_uid or str(uuid4()),
Contributor:

Is it possible to detect the hostname where the client is running, similar to Java? https://github.com/googleapis/java-bigtable/blob/main/google-cloud-bigtable-stats/src/main/java/com/google/cloud/bigtable/stats/BigtableStackdriverExportUtils.java#L158-L173 Or is that not possible with Python?

Contributor Author:

It should be possible to do something like that; I'll take a look.

Contributor Author:

done
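For reference, a hedged sketch of what hostname/PID detection could look like in Python, loosely mirroring the Java helper linked above; the exact client_uid format used in the PR isn't shown here and may differ.

    import os
    import socket
    import uuid

    def _generate_client_uid() -> str:
        # uuid@pid@hostname; format is an assumption, not the PR's final choice
        try:
            hostname = socket.gethostname() or "localhost"
        except Exception:
            hostname = "localhost"
        return f"{uuid.uuid4()}@{os.getpid()}@{hostname}"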

"resource_table": table_id,
}
if app_profile_id:
self.shared_labels["app_profile"] = app_profile_id
Contributor:

In the else branch we should tag app_profile with "default".
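A small sketch of the suggested fallback (the "default" value comes from the comment above; the helper name is hypothetical):

    from typing import Optional

    def app_profile_label(app_profile_id: Optional[str]) -> str:
        # fall back to "default" when no app profile is configured
        return app_profile_id if app_profile_id else "default"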

try:
    status = str(op.final_status.value[0])
except (IndexError, TypeError):
    status = "2"  # unknown
Contributor:

We actually export the status string instead of the numeric value, e.g. OK, DEADLINE_EXCEEDED, etc.
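A sketch of exporting the status name rather than the numeric code, assuming op.final_status is a grpc.StatusCode (the PR's actual type and fallback may differ):

    from grpc import StatusCode

    def status_label(final_status) -> str:
        # e.g. "OK", "DEADLINE_EXCEEDED"; fall back to UNKNOWN when the value
        # isn't a StatusCode-like enum
        try:
            return final_status.name
        except AttributeError:
            return StatusCode.UNKNOWN.name

    print(status_label(StatusCode.DEADLINE_EXCEEDED))  # DEADLINE_EXCEEDED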

self.otel.operation_latencies.record(
    op.duration, {"streaming": is_streaming, **labels}
)
self.otel.retry_count.add(len(op.completed_attempts) - 1, labels)
Contributor:

We leave retry_count empty if there are no retries, so we should only export this if len(op.completed_attempts) - 1 > 0.

Contributor Author:

OK, so we shouldn't send anything for this metric if there are no errors? And I assume that applies to connectivity_error_count too, then?
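A sketch of the guard being discussed, mirroring the snippet above (not the final PR code; the connectivity_error_count question is still open at this point):

    def record_retry_count(otel, op, labels: dict) -> None:
        # only export retry_count when at least one retry actually happened
        retries = len(op.completed_attempts) - 1
        if retries > 0:
            otel.retry_count.add(retries, labels)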

labels = {
    "method": op.op_type.value,
    "status": status,
    "resource_zone": op.zone,
Contributor:

Do we also need to fall back to the default zone and cluster here?

Contributor Author:

We do that when building the CompletedOperationMetric, before calling on_operation_complete.

The idea is that ActiveOperationMetric has some possibly empty fields, but when it's finalized into a CompletedOperationMetric, all the defaults are applied and it becomes more type-strict.
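An illustrative sketch of that finalization step; the field list and the specific default values ("global", "unspecified") are assumptions, not the PR's definitions.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ActiveOperationMetric:
        # fields may be unset while the operation is in flight
        zone: Optional[str] = None
        cluster: Optional[str] = None

    @dataclass(frozen=True)
    class CompletedOperationMetric:
        # finalized metrics are fully populated and type-strict
        zone: str
        cluster: str

    def finalize(active: ActiveOperationMetric) -> CompletedOperationMetric:
        # defaults are applied here, before on_operation_complete is called
        return CompletedOperationMetric(
            zone=active.zone or "global",
            cluster=active.cluster or "unspecified",
        )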

- export_interval: The interval (in seconds) at which to export metrics to Cloud Monitoring.
"""

def __init__(self, *args, project_id: str, export_interval=60, **kwargs):
Contributor:

I think the minimum interval at which we can publish metrics to Cloud Monitoring is 60 seconds, and I don't think we want to update at larger intervals. So let's remove this option?
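For reference, pinning the interval to 60 seconds would look roughly like this; ConsoleMetricExporter stands in for the PR's Cloud Monitoring exporter.

    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    # export on a fixed 60-second cycle instead of exposing export_interval
    gcp_reader = PeriodicExportingMetricReader(
        ConsoleMetricExporter(), export_interval_millis=60_000
    )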

[
0, 0.01, 0.05, 0.1, 0.3, 0.6, 0.8, 1, 2, 3, 4, 5, 6, 8, 10, 13, 16,
20, 25, 30, 40, 50, 65, 80, 100, 130, 160, 200, 250, 300, 400,
500, 650, 800, 1000, 2000, 5000, 10000, 20000, 50000, 100000,
Contributor:

We updated the buckets in java:

      ImmutableList.of(
          0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0, 16.0, 20.0, 25.0, 30.0, 40.0,
          50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0, 300.0, 400.0, 500.0, 650.0,
          800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0, 50000.0, 100000.0, 200000.0,
          400000.0, 800000.0, 1600000.0, 3200000.0));

I think 100k is too small
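If the buckets are applied through OpenTelemetry views (as the VIEW_LIST used later in this diff suggests), adopting the Java boundaries would look roughly like this; the instrument name and the final boundary list in the PR are assumptions here.

    from opentelemetry.sdk.metrics.view import ExplicitBucketHistogramAggregation, View

    # Java's updated boundaries, quoted above
    LATENCY_BUCKETS = [
        0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0, 16.0, 20.0, 25.0,
        30.0, 40.0, 50.0, 65.0, 80.0, 100.0, 130.0, 160.0, 200.0, 250.0, 300.0,
        400.0, 500.0, 650.0, 800.0, 1000.0, 2000.0, 5000.0, 10000.0, 20000.0,
        50000.0, 100000.0, 200000.0, 400000.0, 800000.0, 1600000.0, 3200000.0,
    ]

    VIEW_LIST = [
        View(
            instrument_name="operation_latencies",  # instrument name assumed
            aggregation=ExplicitBucketHistogramAggregation(boundaries=LATENCY_BUCKETS),
        ),
    ]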

)
# use private meter provider to store instruments and views
meter_provider = MeterProvider(metric_readers=[gcp_reader], views=VIEW_LIST)
otel = _OpenTelemetryInstruments(meter_provider=meter_provider)
Contributor:

How would a customer provide their own OTel instrumentation? (This could be in a follow-up PR.)

In Java we let customers override it in the settings and pass the OTel instance down; see the description in googleapis/java-bigtable#1796.

Contributor Author:

The client holds a controller that manages a set of handlers. Users can add their own OpenTelemetryHandler to send metrics to a different MeterProvider if they want

I'll have to write up some documentation for this at some point
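A hypothetical usage sketch of what the author describes: the handler class name matches the PR description, but the controller attribute and add_handler method are assumptions, not a documented API.

    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    # the user's own provider, wired to whatever exporter they already use
    user_provider = MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())]
    )

    # assumed wiring: hand the provider to an OpenTelemetryMetricsHandler and
    # register it with the client's metrics controller
    # handler = OpenTelemetryMetricsHandler(meter_provider=user_provider)
    # client._metrics_controller.add_handler(handler)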

if data_point.attributes:
    project_id = data_point.attributes.get("resource_project")
    if not isinstance(project_id, str):
        # we expect string for project_id field
Contributor:

Maybe log a warning? Something like: "malformatted resource_project x, skipping publishing".
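A sketch of the suggested warning-and-skip behavior; the logger setup, helper name, and message wording are assumptions.

    import logging
    from typing import Optional

    logger = logging.getLogger(__name__)

    def extract_project_id(attributes: dict) -> Optional[str]:
        # returns None (after logging) when resource_project is missing or not
        # a string, so the caller can skip publishing that data point
        project_id = attributes.get("resource_project")
        if not isinstance(project_id, str):
            logger.warning(
                "malformatted resource_project %r; skipping data point", project_id
            )
            return None
        return project_id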


def __init__(self, project_id: str):
    super().__init__()
    self.client = MetricServiceClient()
Contributor:

Do we need to configure the number of retry attempts for the create_service_time_series method? What's the default? In Java we don't retry this method, I think because republishing could lead to errors.

Contributor Author:

Good point. I just looked at the gapic code, and it looks like this RPC defaults to no retries.

We can add retries if we want, but since we export every 60 seconds, maybe we should add failed metrics back to the queue for the next batch? (This may actually be the default for OpenTelemetry too; I'd have to look into it. Right now we just return MetricExportResult.FAILURE, and I'm not sure what happens to the failed metrics.)

Would republishing lead to errors here too?
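For context, a sketch of calling the RPC with retries left disabled, matching the gapic default mentioned above; the exact call site and timeout in the PR may differ.

    from google.cloud.monitoring_v3 import MetricServiceClient

    def publish_time_series(project_id: str, time_series: list) -> None:
        client = MetricServiceClient()
        client.create_service_time_series(
            name=client.common_project_path(project_id),
            time_series=time_series,
            retry=None,   # keep the default of no retries, as discussed
            timeout=10.0,
        )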

    self.project_name = self.client.common_project_path(project_id)

def export(
    self, metrics_data: MetricsData, timeout_millis: float = 10_000, **kwargs
Contributor:

Why not just pass in seconds? :) Then we wouldn't need to convert it back to seconds on line 145 later :)

Contributor Author:

This is part of the superclass method definition, unfortunately; we inherit from opentelemetry.sdk.metrics.export.MetricExporter.
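For reference, a sketch of the inherited shape: export() must accept timeout_millis because that is the superclass signature, and the millis-to-seconds conversion the reviewer refers to happens inside it. The real export() body is omitted.

    from opentelemetry.sdk.metrics.export import (
        MetricExporter,
        MetricExportResult,
        MetricsData,
    )

    class _SketchExporter(MetricExporter):
        def export(
            self, metrics_data: MetricsData, timeout_millis: float = 10_000, **kwargs
        ) -> MetricExportResult:
            deadline_seconds = timeout_millis / 1000  # superclass passes millis
            # ... convert each data point and publish, honoring deadline_seconds
            return MetricExportResult.SUCCESS

        def force_flush(self, timeout_millis: float = 10_000) -> bool:
            return True

        def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
            pass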

@daniel-sanche daniel-sanche requested a review from a team as a code owner February 9, 2024 00:52