Observability #2759
mattwiller started this conversation in Ideas
Hi @mattwiller, I saw issue #2994 to build a logging service. I would like to join you in building the logging service using OpenTelemetry. Can I take part in this?
Goals
Background: Observability
Service observability covers three core categories of server activity: logging, recording and computing metrics, and tracing both execution within a service and calls between services.
Logging
The simplest form of observability for a service is emitting logs. Log entries provide point-in-time information about the state of the system, and can include arbitrary content. Logging is generally opt-in, and occurs only if a developer specifically writes the data needed from the relevant code. Logs provide forensic information about what a system did in the past, and can be used to reconstruct the sequence of events from an incident.
Logs are useful out of the box, and do not require a backend to function: the service by default will emit logs to the console (stdout), which can be saved to a file or otherwise ingested into a data store for querying later.
Metrics
Metrics allow gathering aggregated numeric measurements of service health indicators, which can be tracked to easily determine how a service's behavior has changed over time. Many systems also allow setting up automated alerts to notify developers if a metric value reaches some predefined threshold. These provide critical near-realtime information to help diagnose problems as they arise.
Metrics are generally recorded in a data store optimized for time series data, and usually require some kind of visualization tool to make the most use of them.
Tracing
Tracing provides rich information about the execution path of a single user interaction with a service, or optionally following an interaction between multiple services involved in handling the request. A trace is essentially a timeline view of an interaction, showing the execution path that the code took and how long critical parts of the handling took. They provide detailed forensic information about how a request was executed, and can be used to correlate requests that pass through multiple different services.
Each trace is broken up into spans: areas of code representing a specific operation that are tracked as they execute. The start and end times of the span are automatically recorded, but additional attributes can be added to the span as well. OpenTelemetry curates a list of conventional span attributes for common operations (e.g. HTTP requests, database queries) that should be used where possible to align with industry standards and tooling.
Proposed solution
The industry already appears to be coalescing around OpenTelemetry as the standard foundation for observability services. Aligning with this ecosystem of tools should provide maximum compatibility for customers with whatever backend they might choose. OpenTelemetry currently supports metrics and tracing, with standardized logs in development.
In broad strokes, adopting OpenTelemetry would involve a sequence of steps:
We also have an existing logger in `@medplum/server` that sends logs to CloudWatch. This should be abstracted to allow customers to bring their own logging solution. Concurrently, we should also evaluate migrating to a structured logging pattern to enable easier integration and automation over the logs emitted by our server.
Logging
Migrating to an approach where server logs are written as JSON with explicitly named fields will improve our ability to add and search for rich debugging information. Structured logs can be searched more quickly than unstructured text in AWS CloudWatch, and can more easily be parsed into metrics as well. Adopting structured logging would involve updating our logger utility with methods that take data as an object and write out JSON, e.g.:
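A minimal sketch of such a logger utility follows; all names here (`LogData`, `formatLogEntry`, `info`) are illustrative, not the actual `@medplum/server` API:

```typescript
// Arbitrary structured data to attach to a log entry
type LogData = Record<string, unknown>;

// Build a single-line JSON log entry with explicitly named fields
function formatLogEntry(level: string, msg: string, data: LogData = {}): string {
  return JSON.stringify({
    level,
    msg,
    timestamp: new Date().toISOString(),
    ...data,
  });
}

function info(msg: string, data?: LogData): void {
  console.log(formatLogEntry('INFO', msg, data));
}

// Emits e.g. {"level":"INFO","msg":"Batch processed","timestamp":"...","count":3,"durationMs":125}
info('Batch processed', { count: 3, durationMs: 125 });
```

Because the fields are named, a log aggregator can filter on `durationMs` or `count` directly rather than parsing free-form text.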
By formatting our logs as JSON, we can ensure that the entire log entry fits on one line: any newlines naturally occurring in the log data would be escaped by `JSON.stringify()`, which also emits the JSON string as a single line by default.
In addition to changing the log format, we should consider adding some standard log fields to capture critical troubleshooting data. By generating a unique ID for every request to the server and then adding that ID to all logs generated by that request, we can more easily correlate logs from the same user interaction. We could additionally consider adding details like the server version to aid in troubleshooting for self-hosted customers.
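One way to propagate such a request ID in Node.js without threading it through every function call is `AsyncLocalStorage`; this is a hypothetical sketch, not the proposed implementation:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Per-request context carried implicitly across async calls
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Any log emitted while handling a request automatically includes its ID
function logInfo(msg: string, data: Record<string, unknown> = {}): void {
  const ctx = requestContext.getStore();
  console.log(
    JSON.stringify({ level: 'INFO', msg, requestId: ctx?.requestId, ...data })
  );
}

// In an Express-style middleware, wrap each request handler in a fresh context
function withRequestId(handler: () => void): void {
  requestContext.run({ requestId: randomUUID() }, handler);
}

withRequestId(() => logInfo('Request received'));
```

All log lines produced inside `withRequestId` share the same `requestId`, making it straightforward to query CloudWatch (or any backend) for every log from a single user interaction.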
The simplest way for services to emit logs is via stdout, so that no changes are required to integrate with different logging backends: anything that can process the standard log stream can ingest logs from the server. We should align fully with this approach and remove our dedicated CloudWatch logger from the server. Instead, we can leverage a telemetry agent like FluentBit to route log entries to different log streams based on their contents. This would decouple the server logs from any specific provider, and allow customers to configure the infrastructure to send logs to a backend of their own choosing.
Metrics
First, we can begin gathering explicit metrics from the server instead of relying on CloudWatch's parsing of the unstructured server logs. This would decouple the server from its dependency on CloudWatch, and also improve our ability to scalably report various types of metrics. Setting up the OpenTelemetry SDK can provide automatic instrumentation of several core Node.js and third-party modules, including metrics for the duration of all inbound and outbound HTTP requests.
OpenTelemetry metrics can be reported as different kinds of instruments, including counters, up/down counters, gauges, and histograms.
Reporting metrics directly from the service rather than relying on post-processing service logs into metrics allows customers to bring their own logging backend while still receiving service health metrics in a standardized format. It also allows us to compute more advanced metrics like a histogram that would be much more expensive to post-process out of logs.
OpenTelemetry generally exposes an internal HTTP endpoint on the service, from which metrics data is served. A collector agent is responsible for reading metrics from the service and exporting them to a backend metrics platform. AWS publishes an OpenTelemetry collector that can easily be used with ECS services to read and report telemetry data directly into CloudWatch. FluentBit also supports ingesting OpenTelemetry metrics over HTTP/Protobuf and sending them to CloudWatch, if we wanted to align on a single sidecar agent for telemetry.
In addition to the out-of-the-box HTTP metrics, the OpenTelemetry SDK enables us to report our own custom metrics:
Since OpenTelemetry only appears to provide `http.*.duration` metrics out of the box, we will need to identify and add the key service health metrics that should be measured. These should include, at a minimum, some basic service health statistics based on OpenTelemetry's semantic conventions:
- `http.server.requests` ({request})
- `http.server.responses.2xx` ({response})
- `http.server.responses.3xx` ({response})
- `http.server.responses.4xx` ({response})
- `http.server.responses.5xx` ({response})
FluentBit also provides its own plugins to gather basic system metrics like CPU and memory statistics.
Tracing
The OpenTelemetry SDK also offers excellent support for tracing, and provides out-of-the-box tracing for Node.js internals (e.g. `fs`, `http`) and common libraries (e.g. `express`, `pg`, `ioredis`). This should give us rich visibility into the execution of a server request even before we instrument any of our own code. We can subsequently use the SDK to record specific key parts of our own server code:
This would allow viewing the end-to-end timeline of a single request, as shown in the X-Ray documentation. This information can be incredibly useful in troubleshooting and diagnosing issues in the live service. We can instrument the server code this way opportunistically, adding new spans only where needed to capture important context and relying mostly on the built-in instrumentation around core operations like network calls, filesystem access, and database queries.