Online logging #13

Open
TSchmiedlechner opened this issue Oct 2, 2021 · 3 comments
Labels
needs-clarification Features or bugs that need further clarification and should be discussed. planned-feature A feature that was planned and committed to by the maintainers

Comments

@TSchmiedlechner
Member

User story

As a user of the Middleware Launcher, I want relevant log messages to show up in the Portal, so that I can diagnose issues without having to connect to the POS system.

Context

As of now, we use Application Insights to monitor the Launcher installations, and make this information accessible to our users via the Metrics feature of our Portal. However, the current approach has several disadvantages:

  • It's not possible to set different log levels for file logging and AppInsights monitoring
  • The current approach generates very large amounts of data, which we have to sample heavily to reduce costs
  • There is no underlying concept to decide which data should be sent to AppInsights

We therefore need to define, together with the other development teams, which upcoming features we might introduce for online monitoring purposes and which data we need for them.

Important: we should absolutely not send any user-specific data to AppInsights, e.g. the content of requests.

@TSchmiedlechner
Member Author

Of course it might also make sense to discuss if AppInsights is the correct tool for monitoring on-premise software, and which alternatives are available.

@TSchmiedlechner TSchmiedlechner added needs-clarification Features or bugs that need further clarification and should be discussed. planned-feature A feature that was planned and committed to by the maintainers labels Oct 2, 2021
@mijomilicevic
Member

mijomilicevic commented Nov 10, 2022

Launcher 2.0 – online Monitoring Tools

Metrics:

  • Request processing time

Including: Operation name, CashboxID

Events

  • Launcher Startup (with OS, Launcher config, Versions of Launcher & Cashbox configuration Timestamp, user)

  • Launcher Shutdown

  • Self update

Exceptions

  • There should be a maximum number of logs collected per period, e.g. per day

This value should be controllable
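
A minimal sketch of how such a cap could work, assuming a per-calendar-day counter. The class name `DailyLogBudget` and the parameter `max_per_day` are hypothetical and not part of the Launcher; where the limit is configured is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DailyLogBudget:
    """Caps how many logs are sent per calendar day (hypothetical helper)."""

    max_per_day: int  # the controllable value; its source (config, portal) is an assumption
    _day: Optional[date] = None
    _sent: int = 0

    def try_consume(self, today: date) -> bool:
        """Return True if one more log may be sent today, False once the cap is reached."""
        if today != self._day:
            # A new day has started: reset the counter.
            self._day = today
            self._sent = 0
        if self._sent >= self.max_per_day:
            return False
        self._sent += 1
        return True
```

The caller would check `try_consume(date.today())` before sending an exception and drop it (or only write it to the file log) when the result is `False`.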

General requirements

  • It should be possible to disable AI logging entirely, e.g. with a config feature flag

  • We will need different log levels, which ideally can be configured remotely (debug mode: everything; default: only exceptions)

  • Visualization in the portal “metrics” page

  • We should create separate AI instances to differentiate between “old” and “new” and fine-tune the sampling
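
As a rough illustration of the first two requirements, the feature flag and the log level could be combined into one check before anything is sent. All names here (`OnlineLoggingConfig`, `Level`, `should_send`) are hypothetical; how the configuration reaches the Launcher remotely is not covered.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical severity levels, ordered from most to least verbose."""

    DEBUG = 10
    INFO = 20
    WARNING = 30
    EXCEPTION = 40


@dataclass(frozen=True)
class OnlineLoggingConfig:
    enabled: bool = True                # feature flag: False disables online logging entirely
    min_level: Level = Level.EXCEPTION  # default: only exceptions are sent


def should_send(config: OnlineLoggingConfig, level: Level) -> bool:
    """Decide whether a signal of the given level is sent to the online backend."""
    return config.enabled and level >= config.min_level
```

Setting `min_level=Level.DEBUG` corresponds to the "debug mode: everything" case, while `enabled=False` suppresses all online logging regardless of level.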

Queue lifecycle

  • Lifecycle events should be recorded in the actionjournal

  • Startup

  • Closing

  • Information on whether the middleware is signing properly and whether it is in failed mode (header info)

@StefanKert
Member

StefanKert commented Dec 5, 2022

Hi folks,

sorry for not getting back to this earlier.

To me the most important part here is the definition of the different events that we want to track. The lifecycle events in particular are more than just logging, since they could also be used for auditing purposes (e.g. when was the Queue started?). We are already using this mechanism when initialising the Queue (ftActionJournal), so IMO we should think about those use cases first. This will not only give us better logging / auditing capabilities, it will also hugely benefit the free-of-charge product, since PosCreators could just pull up the ActionJournal entries to see what has been going on.

IMO this also applies to things that happen during the startup / shutdown of the Launcher. In particular, things like starting the Launcher on a different PC are currently not detected and are hard to figure out, even though we could use this information to spot things like accidentally duplicated cashboxes.

Generally speaking, I think we should split things into the following signals (also matching the OpenTelemetry standards). We should also think about basing our implementations on the OpenTelemetry standard.

  • Metrics
  • Traces
  • Logs
  • Audit entries (e.g. ftActionJournal)

Audit entries

Explained above, but IMO this is the most important category since it gives us a clear overview of what has happened. Another important example IMO is the first start with a new configuration. This will help to figure out when changes are applied and also make it easier to relate new problems to changes.

One important thing to note is that not all audit entries are directly connected to a Queue. Some are probably connected to another entity (e.g. the SCU). For these cases we need to decide specifically whether they are really audit entries or logs.

Metrics

One example of a metric is the sign duration. In most cases we don't need this information, since the queue runs stably and there is nothing to investigate. In cases that exceed expectations (e.g. duration > 2 s) it would help to have this information, along with details on what has happened, since otherwise it is hard to figure out why the duration was higher than usual.

Traces

For this purpose we need to be able to trace the sign call from inbound (Launcher endpoint), to Queue, to SCU and back, so that we can make a clear statement about what has happened. In addition, in some cases we need specific information on what has happened.
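
To make the idea concrete, here is a small in-process sketch of nested spans that share one trace ID along the endpoint, queue and SCU path. It deliberately does not use the OpenTelemetry SDK; `Tracer`, `Span` and the span names are hypothetical, and a real implementation would also have to carry the trace context across process boundaries.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one sign call
    span_id: str
    parent_id: Optional[str]  # None for the inbound (root) span
    name: str
    duration_s: float = 0.0


class Tracer:
    """Single-threaded toy tracer; keeps the active spans on a stack."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
        )
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - start
            self._stack.pop()
            self.finished.append(current)


def sign(tracer: Tracer) -> None:
    """Placeholder sign call: endpoint -> queue -> SCU and back."""
    with tracer.span("launcher.endpoint"):
        with tracer.span("queue.sign"):
            with tracer.span("scu.sign"):
                pass  # the real SCU call would happen here
```

After one `sign(tracer)` call, `tracer.finished` holds three spans with the same `trace_id`, innermost first, so the time spent in each hop can be read off per call.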

Logs

A log is a timestamped text record, either structured (recommended) or unstructured, with metadata (taken from the OpenTelemetry docs). This gives us information on what exactly has happened (exception?).

Which things should we store?

As outlined above, one of our biggest issues is the amount of logs that we expect. There are lots of queues with very little traffic, but there are also cases that create lots of noise with no real value (e.g. exceptions because of a missing card, although that could be an audit log?).

To solve this, I think we should reduce the noise produced by the client by already dropping things in the client. This could be based on a statistical method (e.g. the 95th percentile) or on a static rule such as "If a sign call takes longer than 2 seconds, store all related signals". In these cases we should store as many signals as possible, since otherwise we will just end up with lots of alerts but no real information on what has happened. From an implementation perspective, this could mean that before sending something off to our cloud endpoint, we check whether it is something we should log or not. This would greatly reduce noise and still give us the option to notice real issues.
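
A possible shape for such a client-side check, as a sketch only: signals of a sign call are buffered and only sent if the call was slow, either by the static 2-second rule or relative to a nearest-rank 95th percentile over recent calls. `SignCallFilter`, `finish_sign_call`, the window size and the minimum sample count are all assumptions made for illustration.

```python
import math
from collections import deque
from typing import Callable, Deque, List


class SignCallFilter:
    """Decides after each sign call whether its buffered signals are worth sending."""

    def __init__(self, static_threshold_s: float = 2.0,
                 window: int = 100, min_samples: int = 20) -> None:
        self.static_threshold_s = static_threshold_s
        self.min_samples = min_samples  # below this, the percentile is not trusted
        self._recent: Deque[float] = deque(maxlen=window)

    def _p95(self) -> float:
        # Nearest-rank 95th percentile of the recent durations.
        ordered = sorted(self._recent)
        rank = max(1, math.ceil(0.95 * len(ordered)))
        return ordered[rank - 1]

    def should_keep(self, duration_s: float) -> bool:
        slow_static = duration_s > self.static_threshold_s
        slow_relative = (len(self._recent) >= self.min_samples
                         and duration_s > self._p95())
        self._recent.append(duration_s)
        return slow_static or slow_relative


def finish_sign_call(call_filter: SignCallFilter, duration_s: float,
                     buffered_signals: List[str],
                     send: Callable[[str], None]) -> int:
    """Send all buffered signals for a slow call, drop them otherwise."""
    if not call_filter.should_keep(duration_s):
        return 0
    for signal in buffered_signals:
        send(signal)
    return len(buffered_signals)
```

The trade-off of this design is that signals have to be held in memory until the sign call finishes, but in exchange fast, uneventful calls never leave the client.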

3 participants