Observability #2759
mattwiller started this conversation in Ideas
Hi @mattwiller, I saw issue #2994 to build a logging service. I would like to join you in building the logging service using OpenTelemetry. Can I take part in this?
Goals
Background: Observability
Service observability covers three core categories of server activity: logging, recording and computing metrics, and tracing both execution within a service and calls between services.
Logging
The simplest form of observability for a service is emitting logs. Log entries provide point-in-time information about the state of the system, and can include arbitrary content. Logging is generally opt-in, and occurs only if a developer specifically writes the data needed from the relevant code. Logs provide forensic information about what a system did in the past, and can be used to reconstruct the sequence of events from an incident.
Logs are useful out of the box, and do not require a backend to function: the service by default will emit logs to the console (stdout), which can be saved to a file or otherwise ingested into a data store for querying later.
Metrics
Metrics allow gathering aggregated numeric measurements of service health indicators, which can be tracked to easily determine how a service's behavior has changed over time. Many systems also allow setting up automated alerts to notify developers if a metric value reaches some predefined threshold. These provide critical near-realtime information to help diagnose problems as they arise.
Metrics are generally recorded in a data store optimized for time series data, and usually require some kind of visualization tool to make the most use of them.
Tracing
Tracing provides rich information about the execution path of a single user interaction with a service, or optionally following an interaction between multiple services involved in handling the request. A trace is essentially a timeline view of an interaction, showing the execution path that the code took and how long critical parts of the handling took. They provide detailed forensic information about how a request was executed, and can be used to correlate requests that pass through multiple different services.
Each trace is broken up into spans: areas of code representing a specific operation that are tracked as they execute. The start and end times of the span are automatically recorded, but additional attributes can be added to the span as well. OpenTelemetry curates a list of conventional span attributes for common operations (e.g. HTTP requests, database queries) that should be used where possible to align with industry standards and tooling.
Proposed solution
The industry already appears to be coalescing around OpenTelemetry as the standard foundation for observability services. Aligning with this ecosystem of tools should provide maximum compatibility for customers with whatever backend they might choose. OpenTelemetry currently supports metrics and tracing, with standardized logs in development.
In broad strokes, adopting OpenTelemetry would involve a sequence of steps:
We also have an existing logger in `@medplum/server` that sends logs to CloudWatch. This should be abstracted to allow customers to bring their own logging solution. Concurrently, we should also evaluate migrating to a structured logging pattern to enable easier integration and automation over the logs emitted by our server.
Logging
Migrating to an approach where server logs are written as JSON with explicitly named fields will improve our ability to add and search for rich debugging information. Structured logs can be searched more quickly than unstructured text in AWS CloudWatch, and can more easily be parsed into metrics as well. Adopting structured logging would involve updating our logger utility with methods that take data as an object and write out JSON, e.g.:
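A minimal sketch of such a logger utility follows; all names here (`LogData`, `formatLogEntry`, `info`) are illustrative, not the actual `@medplum/server` API:

```typescript
// Arbitrary structured data to attach to a log entry
type LogData = Record<string, unknown>;

// Build a single-line JSON log entry with explicitly named fields
function formatLogEntry(level: string, msg: string, data: LogData = {}): string {
  return JSON.stringify({
    level,
    msg,
    timestamp: new Date().toISOString(),
    ...data,
  });
}

function info(msg: string, data?: LogData): void {
  console.log(formatLogEntry('INFO', msg, data));
}

// Emits e.g. {"level":"INFO","msg":"Batch processed","timestamp":"...","count":3,"durationMs":125}
info('Batch processed', { count: 3, durationMs: 125 });
```

Because the fields are named, a log aggregator can filter on `durationMs` or `count` directly rather than parsing free-form text.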
By formatting our logs as JSON, we can ensure that the entire log entry fits on one line: any newlines naturally occurring in the log data would be escaped by `JSON.stringify()`, which also emits the JSON string as a single line by default.
In addition to changing the log format, we should consider adding some standard log fields to capture critical troubleshooting data. By generating a unique ID for every request to the server and then adding that ID to all logs generated by that request, we can more easily correlate logs from the same user interaction. We could additionally consider adding details like the server version to aid in troubleshooting for self-hosted customers.
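One way to propagate such a request ID in Node.js without threading it through every function call is `AsyncLocalStorage`; this is a hypothetical sketch, not the proposed implementation:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Per-request context carried implicitly across async calls
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Any log emitted while handling a request automatically includes its ID
function logInfo(msg: string, data: Record<string, unknown> = {}): void {
  const ctx = requestContext.getStore();
  console.log(
    JSON.stringify({ level: 'INFO', msg, requestId: ctx?.requestId, ...data })
  );
}

// In an Express-style middleware, wrap each request handler in a fresh context
function withRequestId(handler: () => void): void {
  requestContext.run({ requestId: randomUUID() }, handler);
}

withRequestId(() => logInfo('Request received'));
```

All log lines produced inside `withRequestId` share the same `requestId`, making it straightforward to query CloudWatch (or any backend) for every log from a single user interaction.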
The simplest way for services to emit logs is via stdout, so that no changes are required to integrate with different logging backends: anything that can process the standard log stream can ingest logs from the server. We should align fully with this approach and remove our dedicated CloudWatch logger from the server. Instead, we can leverage a telemetry agent like FluentBit to route log entries to different log streams based on their contents. This would decouple the server logs from any specific provider, and allow customers to configure the infrastructure to send logs to a backend of their own choosing.
Metrics
First, we can begin gathering explicit metrics from the server instead of relying on CloudWatch's parsing of the unstructured server logs. This would decouple the server from its dependency on CloudWatch, and also improve our ability to scalably report various types of metrics. Setting up the OpenTelemetry SDK can provide automatic instrumentation of several core Node.js and third-party modules, including metrics for the duration of all inbound and outbound HTTP requests.
OpenTelemetry metrics can be reported as different kinds of instruments, including counters, up/down counters, gauges, and histograms.
Reporting metrics directly from the service rather than relying on post-processing service logs into metrics allows customers to bring their own logging backend while still receiving service health metrics in a standardized format. It also allows us to compute more advanced metrics like a histogram that would be much more expensive to post-process out of logs.
OpenTelemetry generally exposes an internal HTTP endpoint on the service, from which metrics data is served. A collector agent is responsible for reading metrics from the service and exporting them to a backend metrics platform. AWS publishes an OpenTelemetry collector that can easily be used with ECS services to read and report telemetry data directly into CloudWatch. FluentBit also supports ingesting OpenTelemetry metrics over HTTP/Protobuf and sending them to CloudWatch, if we wanted to align on a single sidecar agent for telemetry.
In addition to the out-of-the-box HTTP metrics, the OpenTelemetry SDK enables us to report our own custom metrics:
Since OpenTelemetry only appears to provide `http.*.duration` metrics out of the box, we will need to identify and add the key service health metrics that should be measured. These should include, at a minimum, some basic service health statistics based on OpenTelemetry's semantic conventions:
- `http.server.requests` ({request})
- `http.server.responses.2xx` ({response})
- `http.server.responses.3xx` ({response})
- `http.server.responses.4xx` ({response})
- `http.server.responses.5xx` ({response})
FluentBit also provides its own plugins to gather basic system metrics like CPU and memory statistics.
Tracing
The OpenTelemetry SDK also offers excellent support for tracing, and provides out-of-the-box tracing for Node.js internals (e.g. `fs`, `http`) and common libraries (e.g. `express`, `pg`, `ioredis`). This should give us rich visibility into the execution of a server request even before we instrument any of our own code. We can subsequently use the SDK to record specific key parts of our own server code:
This would allow viewing the end-to-end timeline of a single request, as shown in the X-Ray documentation. This information can be incredibly useful in troubleshooting and diagnosing issues in the live service. We can instrument the server code this way opportunistically, adding new spans only where needed to capture important context and relying mostly on the built-in instrumentation around core operations like network calls, filesystem access, and database queries.