Questions about the roadmap #15648

StefanSa · 2023-07-31T10:58:22Z

StefanSa
Jul 31, 2023

Hi there,
First of all, thank you for Netdata, in particular for ML, alerting, and of course for being open source.
I have rarely seen such a well-made tool (in more than 30 years as an IT admin).
But for me, and I think for others, Netdata is too top-heavy for *nix, less so for Windows.
That is less than optimal, because we have a "zoo" to manage here, with everything from *nix to Windows.

Wouldn't it be better in the future to have only parents and no Netdata agents?
To these parents, one could send metrics, logs, and traces from every operating system in a standardized format via the OTEL (OpenTelemetry) collector. Such an open-source observability solution (APM) would be a good vision for the future of Netdata.
Several market leaders have been using OTEL for some time, and it is becoming more and more common.

regards
Stefan

andrewm4894 · 2023-07-31T11:29:44Z

andrewm4894
Jul 31, 2023

Hi @StefanSa, thanks for posting. This is an interesting topic. I think we have actually been recommending parents more and more, for a variety of reasons (high availability, scalability, central config, and probably a few others I'm missing).

Your idea of "parents only" certainly makes sense for certain use cases, e.g. if you are only interested in a very specific subset of metrics, or if the thing you are monitoring just does not fit well into the "nodes" abstraction. I think we have actually seen a fair few users take this approach, especially with SNMP devices, and also in some APM use cases where all that is being emitted is metrics to some Prometheus endpoint, which can basically just be a Netdata parent. The same goes for StatsD metrics. For example, we do something similar to monitor our Apache Airflow instance, which does some ETL work: essentially, it emits StatsD metrics to a parent.
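As a rough illustration of the "emit StatsD metrics to a parent" pattern, here is a minimal Python sketch. It assumes the parent accepts plain StatsD lines over UDP on port 8125 (the usual StatsD port). The metric names and the host in the usage note are hypothetical and are not taken from the Airflow setup mentioned above.

```python
import socket

# StatsD metric types covered by this sketch:
# counter, gauge, timer, histogram, set.
_STATSD_TYPES = {"c", "g", "ms", "h", "s"}


def format_statsd(name: str, value: float, kind: str = "g") -> bytes:
    """Encode one metric in the plain StatsD line format: name:value|type."""
    if kind not in _STATSD_TYPES:
        raise ValueError(f"unsupported StatsD type: {kind}")
    return f"{name}:{value}|{kind}".encode("ascii")


def send_statsd(payload: bytes, host: str, port: int = 8125) -> None:
    """Send one StatsD datagram over UDP (fire and forget, no reply expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A job would then call something like `send_statsd(format_statsd("etl.rows_loaded", 1250, "c"), "netdata-parent.example")`, where the host name is a placeholder. Because UDP is connectionless, the emitting process does not block or fail if the parent is briefly unavailable, at the cost of silently losing those samples.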

I know @ralphm is also actively looking at OpenTelemetry and what it would mean to say we are "OpenTelemetry compatible" or "fully supported".

We also have ongoing work to bring more log-based capabilities into Netdata. The mega PR that focuses on that is here, for background.

I work on ML, so that is my main area of expertise and I am not the most informed on the overall architecture side of things, but I'm sure @ralphm, @Ferroin and @ktsaou have some opinions too.

My understanding is that a "parents only" approach like the one you suggest would be just another totally valid topology or deployment strategy. We would rather try to support as many different approaches as make sense for users, and share the best practices and pros and cons of each.

We recently put down some latest thinking on this here: https://learn.netdata.cloud/docs/architecture/deployment-strategies

One thing I'm less sure about is the difference between "parent only" and what we call "standalone" in the link above (maybe it is more about the emphasis on scale and the variety of metrics that a parent might receive in your scenario). Essentially, I think a parent-only approach can be considered a flavor of standalone, and if you wanted to do active-active, that would be two parents sharing metrics.

0 replies

ilyam8 · 2023-07-31T11:35:53Z

ilyam8
Jul 31, 2023
Collaborator

I'd start with

> But for me, and I think for others, Netdata is too top-heavy for *nix

What do you mean by top-heavy, and why is Netdata too top-heavy? I don't think it is. Netdata is not heavy in general and can be configured to use less CPU, memory, and I/O (e.g. by changing the memory mode, disabling some collectors, changing the default data collection frequency, or disabling components such as ML and health).

0 replies

StefanSa · 2023-07-31T11:43:04Z

StefanSa
Jul 31, 2023
Author

Hi Andrew,
I think I expressed myself badly.
Children can still be present and remain part of the deployment strategies, but they would not send Netdata-specific data. In my view, the important thing is to have only one data collector that works on every OS, in this case OTEL.

Hi Ilya,
I don't mean CPU load; I mean that Netdata is more optimized for *nix. You can see this in virtual hosts, charts, and the number of (different) metrics.
The integration of Windows metrics, graphs, and handling should be better.

0 replies

ktsaou · 2023-08-01T00:02:52Z

ktsaou
Aug 1, 2023
Maintainer

Hi @StefanSa and thank you for your message!

OTEL is great and we are working in the direction you suggest. But even OTEL has limitations we are trying to overcome.

One of the biggest differences in Netdata is that we automate most of the monitoring configuration. Netdata automatically picks what is useful to monitor and makes a lot of very important decisions behind the scenes.

Take errors, for example: error counters are usually zero. Netdata collects many thousands of "usually" zero metrics behind the scenes (hardware errors, operating system errors, etc.), and as long as they are zero, it ignores most of them. It doesn't do anything at all about them. But it will automatically inject charts and alerts as soon as a collected error counter becomes non-zero. Without this, expect 2-3 times more metrics per node.
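The behavior described above can be sketched roughly as follows. This is only an illustration of the idea, not Netdata's implementation: the class, the counter names, and the choice to keep a counter visible once it has been non-zero are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ErrorCounterTracker:
    """Collects every counter each cycle, but exposes only those seen non-zero."""

    seen: set[str] = field(default_factory=set)     # everything ever collected
    visible: set[str] = field(default_factory=set)  # counters that got a chart

    def collect(self, samples: dict[str, int]) -> list[str]:
        """Ingest one collection cycle and return newly exposed counters."""
        newly_exposed = []
        for name, value in samples.items():
            self.seen.add(name)
            # Assumption: a counter stays visible after its first non-zero value.
            if value != 0 and name not in self.visible:
                self.visible.add(name)
                newly_exposed.append(name)
        return newly_exposed
```

In this sketch, `seen` keeps growing with every collected counter while `visible` only contains the ones that would get a chart and an alert, which is where the saving in stored and displayed metrics would come from.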

Another issue is the OTEL data model. At Netdata we need fully automated visualization. For this to work we need to make a few important decisions: group metrics together into meaningful charts, decide what exactly each metric monitors, identify (and name) the specific instance of each component monitored, create a tree structure of metrics, give titles to all charts, and many more.

None of these exist in OTEL, and in many cases OTEL is fuzzy (you can't really tell what exactly is monitored, or more than one component is monitored at the same time). Of course Netdata could receive OTEL data and enrich it, but this would involve a lot of guesswork to derive the metadata we need, which can easily turn into a nightmare as the data models of OTEL and Netdata change over time.
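To make the guesswork concrete, here is a small Python sketch of what such heuristic enrichment could look like. Everything in it is hypothetical: the field names, the dotted metric naming, and the attribute keys are illustrative assumptions, not the actual Netdata or OTEL data model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChartMetadata:
    """Illustrative metadata an automated dashboard would need per metric."""

    family: str     # position in the metrics tree
    context: str    # what exactly is being monitored
    instance: str   # which specific component is monitored
    dimension: str  # the individual series inside the chart
    title: str      # human-readable chart title


def enrich(metric_name: str, attributes: dict[str, str]) -> ChartMetadata | None:
    """Guess chart metadata from a dotted metric name and its attributes.

    Returns None when the metric is too "fuzzy" to place in a chart.
    """
    parts = metric_name.split(".")
    if len(parts) < 3:
        return None  # cannot tell what is monitored
    # Assumption: the instance hides in one of these attribute keys.
    instance = attributes.get("device") or attributes.get("instance")
    if instance is None:
        return None  # cannot tell which component is monitored
    context = ".".join(parts[:-1])
    title = f"{' '.join(parts[:-1]).title()} ({instance})"
    return ChartMetadata(parts[0], context, instance, parts[-1], title)
```

Every branch here is a guess that breaks as soon as a source names its metrics or attributes differently, which is the maintenance problem described above.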

One of the things I want us to do is to document what OTEL data should look like for Netdata to work without heuristics when OTEL is the primary data source. Unfortunately, we haven't found the time to do it yet.

Another thing you would miss is Netdata Functions. Netdata Functions are exposed by collectors to allow us to interact with the data source in a way that is not just metrics. The first function is processes, which exposes all the PIDs of the system with all possible data about them, including CPU utilization, memory usage, page faults, disk I/O, file descriptors, and more. It is like top, iotop, fdstop, pagefaultstop, in one tool.

We currently have another function that we will release this week: systemd-journal, which allows you to query the systemd log.

We also plan to build functions for database slow queries, for tracing operating system calls per PID, and even for restarting a systemd service or rebooting a server. Our goal is to reach a point where complex applications like k9s or perf can be built as Netdata functions.

So, your idea is very nice, I can see a future like this in Netdata, and we are working to make Netdata fully OTEL compatible so that your idea is supported. But it may not be as practical as you expect...

0 replies

StefanSa · 2023-08-01T08:13:01Z

StefanSa
Aug 1, 2023
Author

@ktsaou
Hello,
I know this is off topic,
but I hope your home survived the bad fires reasonably well.
We wanted to visit your beloved island Rodos this summer, all very sad.

Now to the topic.
I understand very well that with such historically grown software, it is not easy to just want to implement OTEL.
And to do that without endangering or even losing all the advantages that Netdata offers.
As an admin with more than 30 years in the business, I'm still looking for the universal tool to do everything with.
I would be happy with a tool that has one data collector for all OSes and, if possible, graphs that are available for all OSes too.
Especially in a hybrid operating system environment where Windows is still present and keeps recurring.
Something similar to "uptrace.dev" or "signoz.io", but with those I miss everything Netdata offers.

I am still a seeker after all these years, from MRTG, Nagios, Zabbix, Influx, Elastic, etc., and now Netdata :)
I wish you continued success with Netdata and will watch how you integrate OTEL.

Many greetings from Germany.

1 reply

ktsaou Aug 1, 2023
Maintainer

> I know this is off topic,
> but I hope your home survived the bad fires reasonably well.
> We wanted to visit your beloved island Rodos this summer, all very sad.

Thanks!

> I understand very well that with such historically grown software, it is not easy to just want to implement OTEL.

The opposite, actually. We want it badly.

> And to do that without endangering or even losing all the advantages that Netdata offers.

This is the problem. Monitoring should also be architecturally somewhat different from what is common today. We really need zero configuration, zero-touch ML, automated dashboards, easy scalability, more structure and information built into the tool so that monitoring is easier for most people regardless of their skills, best practices in health monitoring, and so much more. I don't want us to lose all of these.

> As an admin with more than 30 years in the business, I'm still looking for the universal tool to do everything with.
> I would be happy with a tool that has one data collector for all OSes and, if possible, graphs that are available for all OSes too.
> Especially in a hybrid operating system environment where Windows is still present and keeps recurring.

Another path is for Netdata to evolve to unify and become a viable edge tool to replace the others. Netdata is designed to kill the console, not just collect some metrics.

Until then, I think that using all the others is the only viable solution...

Let's see. We need everyone's help to make this successful. Let's hope that more and more people love Netdata and help to evolve it.

Answer selected by StefanSa
