Alternative to Ganglia #1159

Spenhouet · 2020-03-09T06:53:50Z

Ganglia currently has no maintainers: https://sourceforge.net/p/ganglia/mailman/message/36795542/
For the CentOS 8.1 version, you might want to think about replacing Ganglia.

koomie · 2020-03-24T22:09:28Z

Thanks for that note - do you have suggestions for open-source alternatives?

Perhaps something simple for a load dashboard using Grafana or the like....

Spenhouet · 2020-03-24T22:27:21Z

I did find netdata to be very interesting: https://github.com/netdata/netdata
It seems actively maintained and has many features. It provides a dashboard but also allows using the ELK stack, Grafana, ... as far as I can see.

xdmod is another one that seems well maintained: https://github.com/ubccr/xdmod/
But I'm not sure on the feature set.

EDIT:

Other mentions:

It would be nice if OpenHPC would push for an ELK stack or TICK stack.

This would allow for an integration into existing monitoring solutions.

e-alfred · 2020-06-03T11:05:03Z

Now that Ganglia (and Nagios) are deprecated from OpenHPC 2.0, which alternatives are officially recommended for monitoring?

mtds · 2020-06-05T15:58:30Z

@e-alfred I can offer my 2 cents: choose modern monitoring software where tasks are neatly separated, like Prometheus for collecting and storing the data, in combination with Grafana for the dashboards. Or another stack like TICK, as already mentioned by @Spenhouet.
Personally, I am not a big fan of the ELK stack: it's quite a complicated piece of software, deceptively simple to set up, but the devil lies in the details once you need to increase the number of servers under monitoring.

What I don't like about Ganglia is its monolithic approach, which leaves almost no way to decouple individual components (e.g. replacing and/or extending gmond is not an easy task), not to mention changing or adapting the web dashboard. While Ganglia was an important piece of monitoring technology (with the added bonus of being open source), its approach is nowadays quite dated (IMHO).

Modern monitoring software should be modular, made of different components that are each replaceable. That means we can now talk about monitoring pipelines (or stacks) rather than a single piece of software in charge of every aspect of monitoring.

In particular, modern monitoring software components should fall into the following four macro categories of tasks, with a neat distinction between them:

Collect
Store
Graphing
Alerting
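To make the separation above concrete, here is a minimal Python sketch of a pipeline whose stages are replaceable. All class and function names (`StaticCollector`, `MemoryStore`, `run_once`) are hypothetical and only stand in for real components such as an agent, a time-series database and an alert manager; graphing is left out because it would simply read from the store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol, Tuple


class Collector(Protocol):
    """Anything that can return a snapshot of metric values."""
    def collect(self) -> Dict[str, float]: ...


class Store(Protocol):
    """Anything that can persist a snapshot (a graphing tool would read from it)."""
    def write(self, timestamp: float, metrics: Dict[str, float]) -> None: ...


@dataclass
class StaticCollector:
    """Hypothetical collector returning fixed values (stands in for a real agent)."""
    values: Dict[str, float]

    def collect(self) -> Dict[str, float]:
        return dict(self.values)


@dataclass
class MemoryStore:
    """Hypothetical in-memory store (stands in for a time-series database)."""
    rows: List[Tuple[float, str, float]] = field(default_factory=list)

    def write(self, timestamp: float, metrics: Dict[str, float]) -> None:
        for name, value in metrics.items():
            self.rows.append((timestamp, name, value))


def run_once(collector: Collector, store: Store,
             rules: Dict[str, Callable[[float], bool]],
             timestamp: float) -> List[str]:
    """One pipeline pass: collect, store, then evaluate the alert rules."""
    metrics = collector.collect()
    store.write(timestamp, metrics)
    return [name for name, rule in rules.items()
            if name in metrics and rule(metrics[name])]
```

Because `run_once` only depends on the two small interfaces, either stage can be swapped without touching the other, which is the property the monolithic design lacks.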

mtds · 2020-06-06T11:38:14Z

I just forgot to add that I am also a great believer in Netdata, even though it doesn't exactly follow the clear division of tasks I mentioned before. It can be easily installed (via packages too nowadays) and immediately gives you a very rich and detailed dashboard accessible via a web browser.

But there is an important issue that is not yet solved: there is no support for InfiniBand network monitoring.
There was a long discussion about this aspect in this issue, but things will possibly improve soon thanks to the following pull request. It is still marked as work in progress, but I hope it will be integrated in the near future.

Saruspete · 2020-06-06T23:27:00Z

Greetings,
I'm coming from the netdata repository, as my InfiniBand integration PR was linked here.

For the InfiniBand collector, I currently monitor all elements that would be provided by perfquery. Do you have other counters or info that would be useful?
I also plan to provide a collector for MPI message types in the future to get total coverage.

Also, I'd like to share a bit of experience with netdata usage in an HPC environment (real experience, not just ideas):

The main advantage is its very low overhead. It originated from a jitter issue that no other tool managed to find, and it has had optimization as a main objective from day 1.
That helps avoid slowdowns between MPI jobs when they become CPU intensive. As it's very safe by default, it'll run itself at a very high nice level with batch scheduling. If you have runs that consume all CPU time, you might want to give it a higher priority.
It's the only monitoring tool that gathers almost all the metrics a Linux system can provide, and it lets you see issues no other tool can show. It's a very good indicator for checking whether you should invest in NVMe drives for your local scratch.
The cgroup plugin integrates nicely with schedulers like Slurm (where the jobid is part of the cgroup name), as it allows you to get the CPU, memory and swap consumption of the jobs. You'll easily see when your job is slowed down by a shared resource (disk, network) or whether it's your own code that needs optimization. With a few queries, you can provide easy reports to your users (usage efficiency, nodes waiting for others to finish, etc.), and insights like "well, you don't need to poll here, you're spending 90% of your CPU time in kernel... you see 100% usage, but in reality you're wasting resources".
Alerts are integrated into its core and don't consume network bandwidth needlessly. Almost all other monitoring tools send data to a central server that gathers everything and decides whether to generate an alert, wasting incredible network resources and needing a beefy & redundant central server. Here, everything is distributed, allowing you to scale your cluster without issue, which no other tool can provide.
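As an illustration of the cgroup point above (the jobid being part of the cgroup name), here is a small Python sketch that maps a cgroup path back to a Slurm job. The path layout `/slurm/uid_<uid>/job_<jobid>/step_<step>` is an assumption based on common Slurm cgroup setups, and `slurm_job_id` is a hypothetical helper, not part of netdata or Slurm; adjust the pattern to what your nodes actually expose.

```python
import re
from typing import Optional

# Assumed layout: paths such as "/slurm/uid_1000/job_12345/step_0".
# Some tools flatten the path with underscores, so both separators are accepted.
_JOB_RE = re.compile(r"(?:^|[/_])job_(\d+)(?:/|_|$)")


def slurm_job_id(cgroup_path: str) -> Optional[int]:
    """Return the Slurm job id embedded in a cgroup path, or None if absent."""
    match = _JOB_RE.search(cgroup_path)
    return int(match.group(1)) if match else None
```

With such a mapping, per-cgroup CPU, memory and swap charts can be grouped by job when building the user reports mentioned above.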

FWIW, I'm using netdata as the only monitoring tool on my 1K+ HFT nodes & 1K+ HPC nodes. Best of both worlds.

Cheers

zack-shoylev · 2020-06-08T16:27:11Z

Hello!
Are there any other blockers to using Netdata outside of infiniband support?

e-alfred · 2020-06-09T10:50:12Z

While I really like Netdata, it has two problems/disadvantages:

It is intended for real-time monitoring; you still need a separate monitoring system in place for long-term data and alerting (using their exporters: https://learn.netdata.cloud/docs/agent/exporting)
It is decentralized, running a server process on every machine that has to be secured, and it needs a central interface if multiple nodes/systems should be watched

All of this makes it not completely useful on e.g. diskless systems and for centralized monitoring.

Saruspete · 2020-06-09T11:40:53Z

Allow me to correct the points you made:

Netdata does alert itself, and provides a wide range of notification methods by default (https://my-netdata.io/infographic.html): mail, http request, irc, syslog, pagerduty, slack, alerta, awssns, dynatrace, flock, hangouts, matrix, messagebird, discord, pushbullet, prowl, telegram, twilio, pushover...
It also has features like hysteresis to avoid notification spamming.
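For readers unfamiliar with hysteresis in alerting, the following Python sketch shows the general idea: the alarm is raised above one threshold and only cleared below a lower one, so a value hovering around a single limit does not trigger a notification on every sample. This is a generic illustration with hypothetical names (`HysteresisAlarm`, `raise_above`, `clear_below`), not netdata's actual implementation or configuration syntax.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlarm:
    raise_above: float   # value that must be exceeded to raise the alarm
    clear_below: float   # lower value that must be undercut to clear it
    active: bool = False

    def update(self, value: float) -> bool:
        """Feed one sample; return True only when the alarm state changes."""
        if not self.active and value > self.raise_above:
            self.active = True
            return True
        if self.active and value < self.clear_below:
            self.active = False
            return True
        return False
```

For example, with `raise_above=90` and `clear_below=75`, a series bouncing between 85 and 92 produces a single notification instead of one per crossing.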

It also provides a cloud-based long-term retention service (netdata.cloud), or you can integrate it in any TSDB you already have that understands one of these formats: graphite, json, mongodb, opentsdb, prometheus.

If you don't want to have a TSDB, you can also set one or more netdata servers to act as collectors (streaming feature), and they'll hold the data of other nodes, without any tool other than the default binary. This also lets you avoid storing any data locally, which is useful for diskless nodes.

If you don't want the web server, you can either filter IP Addresses, or disable it entirely by a single configuration:

[web]
  mode = none

and it'll only push the data to backends or a netdata collector.
It also follows the security best practices, as detailed in the doc. FWIW, it's the standard monitoring tool in at least two top-10 EU banks, with no hardening required.

If you don't want an instance of netdata everywhere, you can set a single netdata instance to monitor multiple out-of-band elements, like BMC (through freeipmi), fping, or the multiple collectors available.

My current workflow is the following:

I have a central dashboard (similar to the one provided by alerta.io) that centralizes alerts
When a single alert arises, I can go to the dashboard to immediately get detailed info on what generated it, and who is responsible for it (jobs currently running). This saves precious time by not having to check the logs for "who ran what where", and allows digging in directly with performance tools like eBPF.
When multiple alerts arise, I can find the source / SPOF and fix it quickly.

And I think I have pretty acceptable InfiniBand monitoring now. If you can provide feedback, it'll be greatly appreciated ❤️

severgun · 2021-03-24T09:28:04Z

netdata is not a replacement for Ganglia.
Most, or at least a decent number of, HPC clusters have no internet connection at all.
A replacement MUST support self-hosted operation.

Saruspete · 2021-03-24T15:04:53Z

That's the case: it's already deployed in banks, industry, HPC and other air-gapped environments. I also stated it in my earlier comment:

It also provides a cloud-based long-term retention service (netdata.cloud), or you can integrate it in any TSDB you already have that understands one of these formats: graphite, json, mongodb, opentsdb, prometheus.

You confused the cloud dashboard with the agent, or just didn't read the doc at all.

As a general matter, please read thoughtfully before forming a (wrong) opinion.

severgun · 2021-03-24T17:37:58Z

Top-10 banks have enough resources to build their own netdata.cloud alternative. What about enthusiasts without frontend dev experience?

Ganglia is not only the gmond service; it is also a web frontend.

If we accept that the netdata monitor can/should replace gmond, then Grafana integration must be described in the installation guide and some kind of dashboard provided.

viniciusferrao · 2021-03-24T17:49:44Z

Hello, I'm deeply interested in this thread. According to this site: https://www.netdata.cloud/integrations/#featured there is some support for exporters. I think we may be able to archive historical data on those services, is this correct @Saruspete? If yes, the question from @severgun would be answered.

Also, can netdata read from them, after exporting, to display the historical data?

At this moment we are considering zabbix-grafana on all our Clusters to replace the hole that Ganglia is leaving. Perhaps not the best solution but we already have Zabbix in place, and for Grafana it's the only supported out-of-the-box monitoring tool from BeeGFS, so it was a natural fit.

Saruspete · 2021-03-24T19:38:51Z

What about enthusiasts without frontend dev experience?

No dev experience is required: just one configuration file and it'll start sending data to the TSDB of your choice (e.g. opentsdb / grafana)
As Prometheus is the de facto standard in monitoring & TSDB, netdata can also integrate into an existing Prometheus installation: Using netdata with prometheus
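To show what a Prometheus-style integration exchanges on the wire, here is a minimal Python sketch that parses the Prometheus text exposition format into samples. It is a simplified parser (escaped characters in label values are not unescaped), and the function name `parse_exposition` is hypothetical. As far as I know, a netdata agent serves this format at `/api/v1/allmetrics?format=prometheus`, but treat that path as an assumption to check against the documentation.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric_name{label="value",...} value [timestamp]
_LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> List[Sample]:
    """Parse exposition text into (name, labels, value) tuples.

    Comment lines (# HELP / # TYPE) and malformed lines are skipped.
    """
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value, _timestamp = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

A Prometheus server does this scraping itself; the sketch is only meant to show that the exchange is plain text that any tool can consume.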

Ganglia is not only the gmond service; it is also a web frontend.

Netdata also provides a web frontend, which by default only monitors the local host. Any netdata instance can be a central collector and show the data that other nodes send to it.
You can also create custom pages (like for TV monitoring or to embed in an existing dashboard) by just including one JS file and creating a div like:

<div data-netdata="system.io" data-host="https://registry.my-netdata.io"  data-common-max="io"  data-common-min="io" data-title="I/O on registry.my-netdata.io"  data-chart-library="dygraph"  data-width="49%"   data-height="100%" data-after="-300" ></div>

And that's all: you have a live graph of your choice.

If we accept that the netdata monitor can/should replace gmond, then Grafana integration must be described in the installation guide and some kind of dashboard provided.

Indeed, I can help by providing these configuration templates (a lot of them are already in the GitHub documentation, but not really well indexed on the website)

But even better: you can let end users choose which solution they want, e.g.:

simple / small cluster and not wanting to manage a full TSDB: plain netdata installation
standard cluster: all netdata agents (with or without local storage) send data to the central database.
advanced multi-site cluster: nodes send data to one or more local collectors, which in turn send data to the central database. That minimizes the number of connections to handle and allows batch processing, while still providing real-time monitoring in case of issue.

according to this site there is some support for exporters.

Yes, netdata is able to send its collected values into any of these databases (and many more: 33 in fact)

I think we may be able to archive historical data on those services, is this correct @Saruspete?

Of course. You can also choose to send to multiple, different databases at the same time.

Also, can netdata read from them, after exporting, to display the historical data?

I don't think so: the agent can keep its values in a highly compressed local DB (in 1G in /var/cache/netdata I store a whole week of data, sampled every second), but it isn't designed to read an external TSDB from its frontend (that's Grafana's job)
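To put the retention figure above in perspective, here is a short back-of-the-envelope calculation in Python. It reads "1G" as 1 GiB, and the metric count of 2000 per node is purely an assumed example value, not a number from this thread.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # one sample per second for a week
CACHE_BYTES = 1 * 1024 ** 3           # "1G", read here as 1 GiB (assumption)


def bytes_per_second(cache_bytes: int = CACHE_BYTES,
                     seconds: int = SECONDS_PER_WEEK) -> float:
    """Storage budget available for each one-second collection round."""
    return cache_bytes / seconds


def bytes_per_sample(metric_count: int) -> float:
    """Budget per stored value if `metric_count` metrics are sampled each second."""
    return bytes_per_second() / metric_count


if __name__ == "__main__":
    print(f"{bytes_per_second():.0f} bytes per second of history")
    print(f"{bytes_per_sample(2000):.2f} bytes per sample at 2000 metrics (assumed)")
```

A week is 604,800 seconds, so the budget is roughly 1.7 KiB per second of history; at a couple of thousand metrics that is under one byte per sample, which is why "highly compressed" is the operative phrase.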

At this moment we are considering zabbix-grafana on all our Clusters to replace the hole that Ganglia is leaving. Perhaps not the best solution but we already have Zabbix in place, and for Grafana it's the only supported out-of-the-box monitoring tool from BeeGFS, so it was a natural fit.

I'm using BeeGFS, and was planning on writing a BeeGFS collector. I can either create a wrapper over beegfs-mon, or add an output format in BeeGFS itself. The former would allow integration with older versions of BeeGFS, while the latter would be more efficient.

Top-10 banks have enough resources to build their own netdata.cloud alternative.

As a side note, this is not how most companies for which IT is not a core business work: IT is seen as a cost center, and they are not willing to do internal development and take the blame in case of issue, when they can just pay for support and shift the blame to the vendor ("nobody ever got fired for buying IBM").
So when I made netdata the monitoring solution for trading systems in the top EU banks, that was because it provides features that no other solution could (especially precision and low overhead).

Final note: I have no stake or interest in netdata and am not an employee. I'm only pushing a tool which has tremendous potential in all kinds of workloads, enabled me to investigate & fix hidden issues with major constructors, and enforce high code standards to trust them in the future.

fangjzh · 2021-08-09T03:40:25Z

I recommend Nightingale (GitHub page). It has these advantages:

Adjustable
As the volume of business to be monitored increases, the Nightingale server can easily increase its capacity by adding more machines.
High performance
At Didi, 28 million data points are processed per second, with a total index of more than 700 million.
High availability
The server-side module can easily form a cluster by deploying multiple machines to achieve high availability, and there is no impact on the service if one machine goes down.
Scalable
It can flexibly integrate with the Prometheus, Grafana and Open-Falcon ecosystems, and storage is pluggable.
Efficient
It supports traditional physical-machine and virtual-machine scenarios as well as container scenarios, and provides one-stop, efficient handling of hybrid-cloud ecosystems.
Easy to deploy
The server has only one core module, which can be deployed with a few commands, so you can get started quickly.

berlin2123 · 2022-04-29T08:11:41Z

Maybe, we can still work with Ganglia through a centos7-ganglia-web-docker-container

vkhodygo · 2023-04-25T17:43:25Z

Any updates?

@berlin2123 that's some abomination tbh. It might work, but it's not supposed to be like that. If you really want to use Ganglia that much just clone/migrate the repository to GitHub and keep it updated. I bet there are still people who are willing to maintain the project.

iGeorgeX · 2023-09-10T13:47:19Z

@vkhodygo It doesn't look to me like anyone is still willing to maintain the Ganglia software. There are dozens of forks of the project, but nothing has been done. I think that Ganglia is not usable in this state, so I would like to join in asking: does OpenHPC already have a plan for what to use instead?

alanorth · 2024-02-19T08:49:00Z

Ganglia still exists in EPEL for CentOS Stream 8 and CentOS Stream 9. Eventually we'll have to find a solution that is maintained.

I have a VictoriaMetrics server collecting statistics from my HPC cluster, with each node running the Prometheus node_exporter agent. I suppose it should be "simple" to create a custom Grafana dashboard to show critical metrics for the cluster nodes.
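As a starting point for such a dashboard, here is a hedged Python sketch that reads the 1-minute load of every node through the Prometheus-compatible query API that VictoriaMetrics exposes. `node_load1` is a standard node_exporter metric; the base URL used in the usage note and the helper names (`build_query_url`, `parse_vector`, `fetch_load`) are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_vector(payload: dict) -> Dict[str, float]:
    """Map each series' `instance` label to its current value."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    values: Dict[str, float] = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        _timestamp, raw_value = series["value"]
        values[instance] = float(raw_value)
    return values


def fetch_load(base_url: str) -> Dict[str, float]:
    """Query the 1-minute load average exported by node_exporter."""
    url = build_query_url(base_url, "node_load1")
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_vector(json.load(response))
```

Calling `fetch_load("http://vm.example:8428")` (a placeholder host) would return one load value per node; Grafana issues the same kind of query when it renders a panel.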

berlin2123 · 2024-02-24T09:22:06Z

ganglia-web version 3.7.6 was released 3 days ago, and it now works directly on RHEL 9/8 with PHP 8/7.

The official RPMs (EPEL) or deb packages may be available soon.

iGeorgeX · 2024-03-02T02:50:02Z

ganglia-web version 3.7.6 was released 3 days ago, and it now works directly on RHEL 9/8 with PHP 8/7.

The official RPMs (EPEL) or deb packages may be available soon.

ganglia-web (rpm) in version 3.7.6 is in epel-testing

berlin2123 · 2024-03-04T07:22:35Z

3.7.6 is in epel (not-testing) now. Feel free to test.

One issue that has been identified is that the MONTH and YEAR pages still have problems with PHP 8 (el9); this has been fixed in pull request 379 of ganglia-web.

Feel free to submit other issues!

iGeorgeX · 2024-03-05T16:06:47Z

There is also a bug in physical_view.php Issue

ilyam8 mentioned this issue Jun 6, 2020

Add Infiniband monitoring to collector proc.plugin netdata/netdata#9091

Merged
