Online logging #13

TSchmiedlechner · 2021-10-02T11:14:33Z

User story

As a user of the Middleware Launcher, I want relevant log messages to show up in the Portal, so that I can diagnose issues without having to connect to the POS system.

Context

As of now, we use Application Insights to monitor the Launcher installations, and make this information accessible to our users via the Metrics feature of our Portal. However, the current approach has several disadvantages:

It's not possible to set different log levels for file logging and AppInsights monitoring
The current approach generates very large amounts of data, which we have to sample heavily to reduce costs
There is no underlying concept to decide which data should be sent to AppInsights

We therefore need to define, together with the other development teams, which upcoming features we might introduce for online monitoring purposes and which data we need for them.

Important: we should absolutely not send any user-specific data to AppInsights, e.g. the content of requests.

TSchmiedlechner · 2021-10-02T11:15:03Z

Of course it might also make sense to discuss if AppInsights is the correct tool for monitoring on-premise software, and which alternatives are available.

mijomilicevic · 2022-11-10T13:18:33Z

Launcher 2.0 – online Monitoring Tools

Metrics:

Request processing time

Including: Operation name, CashboxID

Events

Launcher Startup (with OS, Launcher config, Versions of Launcher & Cashbox configuration Timestamp, user)
Launcher Shutdown
Self update

Exceptions

There should be a maximum number of logs collected per period, e.g. per day

This value should be controllable
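
A minimal sketch of such a cap, in Python purely for illustration (not the Launcher's actual language or code). `DailyLogBudget`, `max_per_day` and `try_acquire` are hypothetical names, and resetting the counter per UTC calendar day is an assumption:

```python
from datetime import date, datetime, timezone
from typing import Callable, Optional


class DailyLogBudget:
    """Allows at most `max_per_day` records per calendar day (hypothetical helper)."""

    def __init__(self, max_per_day: int,
                 today: Optional[Callable[[], date]] = None) -> None:
        self.max_per_day = max_per_day  # the controllable value
        # Assumption: a "day" is a UTC calendar day; `today` is injectable for tests.
        self._today = today or (lambda: datetime.now(timezone.utc).date())
        self._day = self._today()
        self._used = 0

    def try_acquire(self) -> bool:
        """Return True if one more record may be sent today."""
        current = self._today()
        if current != self._day:
            # A new day has started: reset the counter.
            self._day = current
            self._used = 0
        if self._used >= self.max_per_day:
            return False
        self._used += 1
        return True
```

A caller would check `budget.try_acquire()` before sending each exception and drop (or only log to file) once it returns `False`.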

General requirements

It should be possible to disable AI logging entirely, e.g. with a config feature flag
We will need different log levels, which ideally can be configured remotely (debug mode: everything; default: only exceptions)
Visualization in the portal “metrics” page
We should create separate AI instances to differentiate between “old” and “new” and fine-tune the sampling
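
To make the first two requirements concrete, here is a rough sketch in Python (for illustration only). The names `OnlineLoggingConfig`, `enabled`, `debug` and `should_send` are made up, as is the JSON shape of the remotely delivered configuration; mapping "only exceptions" to the `ERROR` level is also an assumption:

```python
import json
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class OnlineLoggingConfig:
    """Hypothetical settings for online (AI) logging."""
    enabled: bool = True   # feature flag: False disables online logging entirely
    debug: bool = False    # debug mode sends everything

    @property
    def min_level(self) -> int:
        # Assumption: "only exceptions" corresponds to ERROR and above.
        return logging.DEBUG if self.debug else logging.ERROR

    @classmethod
    def from_json(cls, text: str) -> "OnlineLoggingConfig":
        """Parse a remotely delivered config document (assumed JSON shape)."""
        raw = json.loads(text)
        return cls(enabled=bool(raw.get("enabled", True)),
                   debug=bool(raw.get("debug", False)))


def should_send(config: OnlineLoggingConfig, level: int) -> bool:
    """Decide whether a record of the given level goes to the online sink."""
    return config.enabled and level >= config.min_level
```

File logging would keep its own, independent level, which addresses the first disadvantage listed in the context.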

Queue lifecycle

Lifecycle events should be recorded in the actionjournal
Startup
Closing
Information on whether the middleware is signing properly and whether it is in failed mode or not (header info)

StefanKert · 2022-12-05T10:19:34Z

Hi folks,

sorry for not getting back to this earlier.

To me, the most important part here is the definition of the different events that we want to track. The lifecycle events in particular are more than just logging, since they could also be used for auditing purposes (e.g. when was the Queue started?). We are already using this mechanism when initialising the Queue (ftActionJournal), so IMO we should also think about those use cases first. This will not only enable us to have better logging / auditing capabilities, it will also hugely benefit the free-of-charge product, since PosCreators could just pull up the ActionJournal entries to see what has been going on.

IMO this also applies to things that are happening during the startup / shutdown of the launcher. Things like starting the Launcher on a different PC in particular are currently not detected and also hard to figure out, even though we could use this information to detect things like accidentally duplicated cashboxes.

Generally speaking, I think that we should split things between the following signals (also matching the OpenTelemetry standards). We should also think about basing our implementations on the OpenTelemetry standard.

Metrics
Traces
Logs
Audit entries (e.g. ftActionJournal)

Audit entries

Explained above, but IMO this is the most important category since it gives us a clear overview of what has happened. Another important example IMO is the first start with a new configuration. This will help to figure out when changes are applied and also make it easier to relate new problems to changes.

One thing that is important to note is that not all audit entries are directly connected to a Queue. Some things are probably connected to another entity (SCU). For these cases we need to decide specifically whether they are really audit entries or logs.

Metrics

One example of a metric is the sign duration. In most cases we don't need this information, since the queue runs stably and there is no need to investigate. In cases that exceed expectations (e.g. duration > 2 seconds) it would help to have this information, along with details on what has happened, since it is otherwise hard to figure out why the time was higher than usual.

Traces

For this purpose we need to be able to trace the sign call from inbound (launcher endpoint), to queue, to SCU and back, so that we can make a clear statement on what has happened. In addition to that, in some cases we need specific information on what has happened.
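
As a rough illustration of the idea (plain Python, not the OpenTelemetry SDK and not the middleware's actual code), nested spans sharing one trace id could look like the sketch below. `Trace`, `span` and the span field names are made up for this example:

```python
import time
import uuid
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class Trace:
    """Collects nested spans that all share one trace id (hypothetical helper)."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Dict[str, Any]] = []
        self._stack: List[str] = []  # names of the currently open spans

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            # Spans are recorded when they end, so inner spans come first.
            self._stack.pop()
            self.spans.append({
                "trace_id": self.trace_id,
                "name": name,
                "parent": parent,
                "duration_seconds": time.perf_counter() - start,
            })


trace = Trace()
with trace.span("launcher"):      # inbound request at the launcher endpoint
    with trace.span("queue"):     # queue processing
        with trace.span("scu"):   # call to the SCU
            pass
```

With the parent links and durations per hop, it becomes possible to say where in the launcher, queue, SCU chain the time was spent.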

Logs

"A log is a timestamped text record, either structured (recommended) or unstructured, with metadata" (taken from the OpenTelemetry docs). This gives us information on what exactly has happened (e.g. an exception).
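
For illustration, a structured record could look like the following Python sketch. The field names are loosely modelled on the OpenTelemetry log data model; `make_log_record`, `to_json_line` and the example attribute `queue_id` are hypothetical. Attributes should only carry technical identifiers, never request content:

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def make_log_record(message: str, severity: str, **attributes: Any) -> Dict[str, Any]:
    """Build a structured log record: timestamp, severity, text body, metadata."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": severity,
        "body": message,
        # Metadata only; no user-specific data such as request content.
        "attributes": attributes,
    }


def to_json_line(record: Dict[str, Any]) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(record, sort_keys=True)


record = make_log_record("SCU not reachable", "ERROR", queue_id="example-queue")
line = to_json_line(record)
```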

Which things should we store?

As outlined above, one of our biggest issues is the volume of logs that we expect. There are lots of queues with very little traffic, but there are also cases that create lots of noise with no real value (exceptions because of a missing card... could be an audit log though?).

To solve this, I think that we should reduce the noise produced by the client by already dropping things in the client. This could ~~be based on a statistical method (95 percentile)~~ be a static rule such as "If a sign call takes longer than 2 seconds, store all related signals". In these cases we should store as many signals as possible, since otherwise we will just end up with lots of alerts but no real information on what has happened. From an implementation perspective, this could mean that before sending something off to our cloud endpoint we check whether it is something that we should log or not. This would greatly reduce noise and still give us the option to notice real issues.
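
A minimal sketch of that client-side check, in Python for illustration only. `SignCallSignals`, `finish_sign_call` and the `send` callback (standing in for whatever ships data to the cloud endpoint) are hypothetical names; only the 2-second threshold comes from the rule above:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

SLOW_SIGN_THRESHOLD_SECONDS = 2.0  # static rule from the discussion above


@dataclass
class SignCallSignals:
    """Buffers all signals (logs, metrics, span data) of one sign call."""
    operation: str
    signals: List[Dict[str, Any]] = field(default_factory=list)

    def record(self, kind: str, **data: Any) -> None:
        self.signals.append({"kind": kind, **data})


def finish_sign_call(call: SignCallSignals,
                     duration_seconds: float,
                     send: Callable[[Dict[str, Any]], None],
                     threshold: float = SLOW_SIGN_THRESHOLD_SECONDS) -> bool:
    """Forward the buffered signals only if the call was slow.

    Returns True if something was sent, False if the buffer was dropped.
    """
    if duration_seconds <= threshold:
        call.signals.clear()  # normal call: drop everything on the client
        return False
    send({
        "operation": call.operation,
        "duration_seconds": duration_seconds,
        "signals": list(call.signals),
    })
    return True
```

The important design point is that signals are buffered per call and the decision is made once at the end, so a slow call is shipped with its full context rather than as an isolated alert.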

TSchmiedlechner added the needs-clarification and planned-feature labels on Oct 2, 2021
