Alternative to Ganglia #1159

Open
Spenhouet opened this issue Mar 9, 2020 · 23 comments

Comments

@Spenhouet

Ganglia currently has no maintainers: https://sourceforge.net/p/ganglia/mailman/message/36795542/
For the CentOS 8.1 release, you might want to think about replacing Ganglia.

@koomie
Contributor

koomie commented Mar 24, 2020

Thanks for that note - do you have suggestions for open-source alternatives?

Perhaps something simple for a load dashboard using Grafana or the like....

@Spenhouet
Author

Spenhouet commented Mar 24, 2020

I did find netdata to be very interesting: https://github.com/netdata/netdata
It seems actively maintained and has many features. It provides its own dashboard but can also feed the ELK stack, Grafana, etc., as far as I can see.

xdmod is another one that seems well maintained: https://github.com/ubccr/xdmod/
But I'm not sure on the feature set.

EDIT:

Other mentions:

It would be nice if OpenHPC pushed for an ELK stack or TICK stack.

This would allow for an integration into existing monitoring solutions.

@e-alfred

e-alfred commented Jun 3, 2020

Now that Ganglia (and Nagios) are deprecated from OpenHPC 2.0, which alternatives are officially recommended for monitoring?

@mtds

mtds commented Jun 5, 2020

@e-alfred I can offer my two cents: choose modern monitoring software where tasks are neatly separated, like Prometheus for collecting and storing the data, in combination with Grafana for the dashboards. Or another stack like TICK, as already mentioned by @Spenhouet.
Personally, I am not a big fan of the ELK stack: it's quite a complicated piece of software, deceptively simple to set up, but the devil is in the details once you need to increase the number of servers under monitoring.

What I don't like about Ganglia is that its approach is monolithic, with almost no possibility of decoupling single components (e.g. replacing and/or extending gmond is not an easy task), not to mention changing or adapting the web dashboard. While Ganglia was an important piece of monitoring technology (with the added bonus of being open source), its approach is nowadays quite dated (IMHO).

Modern monitoring software should be modular, with each component replaceable. That means we now talk about monitoring pipelines (or stacks) rather than a single piece of software in charge of every aspect of monitoring.

In particular, the components of modern monitoring software should fall into the following four broad categories, with a neat distinction between them:

  • Collect
  • Store
  • Graphing
  • Alerting
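To make the four categories concrete, here is a minimal Python sketch of a pipeline in which each stage is a plain callable that can be swapped independently. All names (`Pipeline`, `fake_collector`, the `load1` threshold, and so on) are hypothetical stand-ins for illustration, not the API of Prometheus, Grafana or any other tool mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, float]  # (metric name, value)


@dataclass
class Pipeline:
    """Four replaceable stages: collect, store, graph, alert."""
    collect: Callable[[], List[Sample]]
    store: Callable[[List[Sample]], None]
    graph: Callable[[], str]
    alert: Callable[[List[Sample]], List[str]]

    def tick(self) -> List[str]:
        """Run one monitoring cycle and return any alert messages."""
        samples = self.collect()
        self.store(samples)
        return self.alert(samples)


# Hypothetical in-memory stand-ins for real components.
history: Dict[str, List[float]] = {}


def fake_collector() -> List[Sample]:
    # A real collector would read /proc, IPMI, InfiniBand counters, etc.
    return [("load1", 7.5), ("mem_used_pct", 42.0)]


def memory_store(samples: List[Sample]) -> None:
    for name, value in samples:
        history.setdefault(name, []).append(value)


def text_graph() -> str:
    # A real graphing stage would be a dashboard; this just dumps the history.
    return "\n".join(f"{name}: {values}" for name, values in sorted(history.items()))


def threshold_alert(samples: List[Sample]) -> List[str]:
    limits = {"load1": 4.0}  # assumed threshold, for illustration only
    return [f"{n} too high: {v}" for n, v in samples if v > limits.get(n, float("inf"))]


pipeline = Pipeline(fake_collector, memory_store, text_graph, threshold_alert)
alerts = pipeline.tick()
print(alerts)
print(pipeline.graph())
```

Replacing one stage (say, `memory_store` with a writer for a real TSDB) does not touch the other three, which is the property being argued for here.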

@mtds

mtds commented Jun 6, 2020

I forgot to add that I am also a great believer in Netdata, even though it does not exactly follow the clear division of tasks I mentioned before. It can be installed easily (via packages too, nowadays) and immediately gives a very rich and detailed dashboard accessible via web browser.

But there is one important issue not yet solved: there is no support for InfiniBand network monitoring.
There was a long discussion about this in this issue, but things may improve soon thanks to the following pull request, which is still marked as work in progress but which I hope will be integrated in the near future.

@Saruspete

Saruspete commented Jun 6, 2020

Greetings,
I'm coming from the netdata repository, as my InfiniBand integration PR was linked here.

For the InfiniBand collector, I currently monitor all elements that would be provided by perfquery. Do you have other counters or info that would be useful?
I also plan to provide an MPI message type collector in the future to get total coverage.

Also, I'd like to share a bit of experience with netdata usage in an HPC environment (real experience, not just ideas):

  • The main advantage is its very low overhead. It originated from a jitter issue that no other tool managed to find, and has had optimization as a main objective from day 1.
    That avoids slowdowns between the MPI jobs when they become CPU intensive. As it's very safe by default, it runs at a high nice level with batch scheduling. If you have runs that consume all CPU time, you might want to give it a higher priority.
  • It's the only monitoring tool that gathers almost all metrics a Linux system can provide, and it lets you see issues no other tool can show. It's a very good indicator for checking whether you should invest in NVMe drives for your local scratch.
  • The cgroup plugin integrates nicely with schedulers like slurm (where the jobid is part of the cgroup name), as it allows you to get the CPU, memory and swap consumption of the jobs. You'll easily see when your job is slowed down by a shared resource (disk, network) or whether it's your own code that needs optimization. With a few queries, you can provide easy reports to your users (usage efficiency, nodes waiting for others to finish, etc.), and insights like "well, you don't need to poll here, you're spending 90% of your cpu time in kernel... you see 100% usage, but in reality you're wasting resources".
  • Alerts are integrated within its core and don't consume network bandwidth needlessly. Almost all other monitoring tools send data to a central server that gathers everything and decides whether to generate an alert, wasting considerable network resources and requiring a beefy, redundant central server. Here, everything is distributed, allowing you to scale your cluster without issue, which no other tool can provide.
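As a rough illustration of the cgroup point, the Python sketch below pulls a job ID out of a cgroup path and computes how much of a job's CPU time was spent in the kernel. It assumes a path layout containing a `job_<id>` component (as in `/slurm/uid_1000/job_12345/step_0`) and user/system tick counters such as those in a cgroup v1 `cpuacct.stat` file; both are assumptions that should be checked against your own cluster. The function names are hypothetical and this is not how the netdata cgroup plugin is implemented.

```python
import re
from typing import Optional

# Assumed naming: the cgroup path contains a "job_<id>" component.
JOB_RE = re.compile(r"(?:^|/)job_(\d+)(?:/|$)")


def slurm_job_id(cgroup_path: str) -> Optional[int]:
    """Return the job ID embedded in a cgroup path, or None if there is none."""
    match = JOB_RE.search(cgroup_path)
    return int(match.group(1)) if match else None


def kernel_time_share(user_ticks: int, system_ticks: int) -> float:
    """Fraction of consumed CPU time that was spent in kernel mode."""
    total = user_ticks + system_ticks
    return system_ticks / total if total else 0.0


print(slurm_job_id("/slurm/uid_1000/job_12345/step_0"))
print(kernel_time_share(user_ticks=10, system_ticks=90))
```

A job that shows 100% CPU usage with a kernel share close to 0.9 is the "you're wasting resources" case described in the list above.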

fwiw, I'm using netdata as the only monitoring tool on my 1K+ HFT nodes & 1K+ HPC nodes. Best of both worlds.

Cheers

@zack-shoylev

Hello!
Are there any other blockers to using Netdata outside of infiniband support?

@e-alfred

e-alfred commented Jun 9, 2020

While I really like Netdata, it has two problems/disadvantages:

  • It is intended for real-time monitoring; you still need a monitoring system in place for long-term data and alerting (using their exporters: https://learn.netdata.cloud/docs/agent/exporting)
  • It is decentralized, running a server process on every machine that has to be secured, and it needs a central interface if multiple nodes/systems are to be watched

All of this makes it not completely useful on e.g. diskless systems and for centralized monitoring.

@Saruspete

Saruspete commented Jun 9, 2020

Allow me to correct the points you made:

Netdata does its own alerting and provides a wide range of notification methods by default (https://my-netdata.io/infographic.html): mail, http request, irc, syslog, pagerduty, slack, alerta, awssns, dynatrace, flock, hangouts, matrix, messagebird, discord, pushbullet, prowl, telegram, twilio, pushover...
It also has features like hysteresis to avoid notification spamming.
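
For readers unfamiliar with the term, here is a minimal Python sketch of what hysteresis means for alerting: the alarm is raised above one threshold and only cleared below a lower one, so a value hovering around a single limit does not produce a stream of notifications. This is a generic illustration, not netdata's implementation; the class name and the thresholds are made up.

```python
class HysteresisAlarm:
    """Raise above one threshold, clear only below a lower one."""

    def __init__(self, raise_above: float, clear_below: float) -> None:
        if clear_below > raise_above:
            raise ValueError("clear_below must not exceed raise_above")
        self.raise_above = raise_above
        self.clear_below = clear_below
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one reading; return True only when the alarm state changes."""
        if not self.active and value > self.raise_above:
            self.active = True
            return True
        if self.active and value < self.clear_below:
            self.active = False
            return True
        return False


alarm = HysteresisAlarm(raise_above=90.0, clear_below=75.0)
readings = [85, 91, 89, 92, 88, 74, 80]
changes = [alarm.update(v) for v in readings]
print(changes)
```

With a single threshold at 90, the readings 91, 89, 92, 88 would flip the state four times; with the two thresholds above, only the first crossing of 90 and the drop to 74 produce a notification.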

It also provides a cloud-based long-term retention service (netdata.cloud), or you can integrate it with any TSDB you already have that understands one of these formats: graphite, json, mongodb, opentsdb, prometheus.

If you don't want a TSDB, you can also set up one or more netdata servers to act as collectors (streaming feature), and they'll hold the data of other nodes with no tool other than the default binary. This also makes it possible to store no data locally, which is useful for diskless nodes.

If you don't want the web server, you can either filter IP Addresses, or disable it entirely by a single configuration:

[web]
  mode = none

and it'll only push the data to backends or a netdata collector.
It also follows security best practices, as detailed in the doc. FWIW, it's the standard monitoring tool in at least 2 top-10 EU banks, with no hardening required.

If you don't want an instance of netdata everywhere, you can set a single netdata instance to monitor multiple out-of-band elements, like BMC (through freeipmi), fping, or the multiple collectors available.

My current workflow is the following:

  • I have a central dashboard (similar to the one provided by alerta.io) that centralizes alerts.
  • When a single alert arises, I can go to the dashboard to immediately get detailed info on what generated it and who is responsible for it (jobs currently running). This saves precious time by not having to check the logs for "who ran what where", and allows me to dig in directly using performance tools like eBPF.
  • When multiple alerts arise, I can find the source / SPOF and fix it quickly.

And I think I have pretty acceptable InfiniBand monitoring now. If you can provide feedback, it'll be greatly appreciated ❤️

@severgun

netdata is not a replacement for Ganglia.
Most, or at least a decent number of, HPC clusters have no internet connection at all.
A replacement MUST support self-hosted operation.

@Saruspete

That's the case: it's already deployed in banks, industry, HPC and other air-gapped environments. I also stated it in the comment:

> It also provides a cloud-based long-term retention service (netdata.cloud), or you can integrate it with any TSDB you already have that understands one of these formats: graphite, json, mongodb, opentsdb, prometheus.

You confused the cloud dashboard with the agent, or just didn't read the doc at all.

As a general matter, please read thoughtfully before forming a (wrong) opinion.

@severgun

severgun commented Mar 24, 2021

Top10 banks have enough resources to build their own netdata.cloud alternative. What about enthusiasts without frontend dev experience?

Ganglia is not only the gmond service; it is also a web frontend.

If we accept that netdata can/should replace gmond, then Grafana integration must be described in the installation guide and some kind of dashboard provided.

@viniciusferrao
Contributor

Hello, I'm deeply interested in this thread. According to this site: https://www.netdata.cloud/integrations/#featured there is some support for exporters. I think we may be able to archive historical data on those services; is this correct, @Saruspete? If yes, the question from @severgun would be answered.

Also, can netdata read from them, after exporting, to display the historical data?

At this moment we are considering zabbix-grafana on all our clusters to fill the hole that Ganglia is leaving. Perhaps not the best solution, but we already have Zabbix in place, and Grafana is the only monitoring tool that BeeGFS supports out of the box, so it was a natural fit.

@Saruspete

Saruspete commented Mar 24, 2021

> What about enthusiasts without frontend dev experience?

No dev experience is required: just 1 configuration file and it'll start sending data to the TSDB of your choice (e.g. opentsdb / grafana).
As prometheus is the de facto standard in monitoring & TSDBs, netdata can also integrate into an existing prometheus installation: Using netdata with prometheus

> Ganglia is not only the gmond service; it is also a web frontend.

Netdata also provides a web frontend, which by default only monitors the local host. Any netdata instance can act as a central collector and show the data that other nodes send it.
You can also create custom pages (e.g. for TV monitoring or to embed in an existing dashboard) by just including 1 JS file and creating a div like:

<div data-netdata="system.io"
     data-host="https://registry.my-netdata.io"
     data-common-max="io"
     data-common-min="io"
     data-title="I/O on registry.my-netdata.io"
     data-chart-library="dygraph"
     data-width="49%"
     data-height="100%"
     data-after="-300"></div>

And that's all: you have a live graph of your choice.

> If we accept that netdata can/should replace gmond, then Grafana integration must be described in the installation guide and some kind of dashboard provided.

Indeed, I can help by providing these configuration templates (a lot of them are already in the GitHub documentation, but not really well indexed on the website).

But even better: you can let end users choose which solution they want, e.g.:

  • simple / small cluster, with no wish to manage a full TSDB: plain netdata installation
  • standard cluster: all netdata agents (with or without local storage) send data to the central database.
  • advanced multi-site cluster: nodes send data to one or more local collectors, which in turn send data to the central database. That minimizes the number of connections the central database has to handle and allows batch processing, while still providing real-time monitoring in case of issues.
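
The connection-count argument for the multi-site layout can be put in numbers with a small Python sketch. The node and collector counts are invented for illustration, and `central_connections` is a hypothetical helper, not part of netdata.

```python
import math


def central_connections(nodes: int, nodes_per_collector: int) -> int:
    """Connections the central database sees when nodes stream via local collectors."""
    return math.ceil(nodes / nodes_per_collector)


nodes = 2000  # hypothetical cluster size
direct = nodes  # "standard" layout: every agent talks to the database
tiered = central_connections(nodes, nodes_per_collector=100)
print(f"direct: {direct} connections, via collectors: {tiered} connections")
```

With these made-up numbers, the central database handles 20 connections instead of 2000, at the cost of running and maintaining the intermediate collectors.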

> According to this site there is some support for exporters.

Yes, netdata is able to send its collected values to any of these databases (and many more: 33 in fact).

> I think we may be able to archive historical data on those services; is this correct, @Saruspete?

Of course. You can also choose to send to multiple, different databases at the same time.

> Also, can netdata read from them, after exporting, to display the historical data?

I don't think so: the agent can keep its values in a highly compressed local DB (in 1G in /var/cache/netdata I store a whole week of data, sampled every second), but its frontend is not designed to read from an external TSDB (that's grafana's job).
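
To give a feel for what "1G for a week at one-second resolution" implies, here is a back-of-the-envelope calculation in Python. It assumes "1G" means 1 GiB, and the metric count of 2000 per node is purely hypothetical (the comment above does not state one), so treat the per-sample figure as an illustration rather than a measurement.

```python
GIB = 1024 ** 3
SECONDS_PER_DAY = 86_400


def bytes_per_second(total_bytes: int, days: int) -> float:
    """Average on-disk bytes available per second of retained history."""
    return total_bytes / (days * SECONDS_PER_DAY)


def bytes_per_sample(total_bytes: int, days: int, metrics: int, interval_s: int = 1) -> float:
    """Average on-disk bytes per stored sample for a given number of metrics."""
    samples = metrics * (days * SECONDS_PER_DAY) // interval_s
    return total_bytes / samples


budget = bytes_per_second(GIB, days=7)
per_sample = bytes_per_sample(GIB, days=7, metrics=2000)  # 2000 metrics is an assumption
print(f"{budget:.0f} bytes per second of history")
print(f"{per_sample:.2f} bytes per sample at 2000 metrics")
```

Under these assumptions the budget is under 2 KB for each second of history, i.e. less than one byte per sample, which is only achievable with the kind of compression mentioned above.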

> At this moment we are considering zabbix-grafana on all our clusters to fill the hole that Ganglia is leaving. Perhaps not the best solution, but we already have Zabbix in place, and Grafana is the only monitoring tool that BeeGFS supports out of the box, so it was a natural fit.

I'm using beegfs, and was planning on writing a beegfs collector. I can either create a wrapper over beegfs-mon, or add an output format in beegfs itself. The former would allow integration with older versions of beegfs, while the latter would be more efficient.

> Top10 banks have enough resources to build their own netdata.cloud alternative.

As a side note, this is not how most companies for which IT is not a core business work: IT is seen as a cost center, and they are not willing to do internal development and take the blame in case of issues when they can just pay for support and shift the blame to the vendor ("nobody ever got fired for buying IBM").
So when I made netdata the monitoring solution for trading systems in the top EU banks, it was because of features that no other solution could provide (especially precision and low overhead).

Final note: I have no stake or interest in netdata and am not an employee. I'm only promoting a tool that has tremendous potential for all kinds of workloads, has enabled me to investigate & fix hidden issues with major vendors, and enforces high code standards, which makes it trustworthy for the future.

@fangjzh

fangjzh commented Aug 9, 2021

I recommend Nightingale (GitHub page). It has these advantages:

  1. Adjustable
    As the volume of the business to be monitored increases, the Nightingale server can easily increase its capacity by adding more machines.
  2. High performance
    At Didi, 28 million data points are processed per second, with a total index of more than 700 million.
  3. High availability
    The server-side module can easily form a cluster by deploying multiple machines to achieve high availability, and there is no impact on the service if one machine goes down.
  4. Scalable
    It can flexibly integrate with the Prometheus, Grafana and Open-Falcon ecosystems, and storage is pluggable.
  5. Efficient
    It supports traditional physical-machine and virtual-machine scenarios as well as container scenarios, and handles hybrid-cloud environments in one place.
  6. Easy to deploy
    The server has only one core module, which can be deployed with a few commands, so you can get started quickly.

@berlin2123

Maybe, we can still work with Ganglia through a centos7-ganglia-web-docker-container

@vkhodygo

Any updates?

@berlin2123 that's some abomination tbh. It might work, but it's not supposed to be like that. If you really want to use Ganglia that much just clone/migrate the repository to GitHub and keep it updated. I bet there are still people who are willing to maintain the project.

@iGeorgeX

@vkhodygo It doesn't look to me like anyone is still willing to maintain the Ganglia software. There are dozens of forks of the project, but nothing has been done. I think Ganglia is not usable in this state, so I'd like to second the question: does OpenHPC already have a plan for what to use instead?

@alanorth

alanorth commented Feb 19, 2024

Ganglia still exists in EPEL for CentOS Stream 8 and CentOS Stream 9. Eventually we'll have to find a solution that is maintained.

I have a VictoriaMetrics server collecting statistics from my HPC cluster, with each node running the Prometheus node_exporter agent. I suppose it should be "simple" to create a custom Grafana dashboard to show critical metrics for the cluster nodes.
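
As a starting point for such a dashboard, here is a hedged Python sketch that builds a Prometheus-style instant query URL and turns the JSON response into a per-node load table. It assumes the server exposes the Prometheus-compatible `/api/v1/query` endpoint and that node_exporter's `node_load1` metric is being collected; the host name `vm.example:8428`, the function names and the sample response are placeholders, so no network call is made here.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus-style instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def load_by_instance(response_body: str) -> Dict[str, float]:
    """Map each instance label to its value from an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    return {
        series["metric"].get("instance", "unknown"): float(series["value"][1])
        for series in payload["data"]["result"]
    }


# Hand-written sample in the instant-query response format (not real data).
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "node01:9100"}, "value": [1708300800, "0.42"]},
            {"metric": {"instance": "node02:9100"}, "value": [1708300800, "7.10"]},
        ],
    },
})

url = instant_query_url("http://vm.example:8428/", "node_load1")
loads = load_by_instance(sample_response)
print(url)
print(loads)
```

In Grafana itself the same `node_load1` expression would go into a panel query; the script form is mainly useful for quick checks or reports outside the dashboard.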

@berlin2123

berlin2123 commented Feb 24, 2024

ganglia-web version 3.7.6 was released 3 days ago; it now works directly on RHEL 9/8 with PHP 8/7.

The official rpms (EPEL) or deb packages may become available soon.

@iGeorgeX

iGeorgeX commented Mar 2, 2024

> ganglia-web version 3.7.6 was released 3 days ago; it now works directly on RHEL 9/8 with PHP 8/7.
>
> The official rpms (EPEL) or deb packages may become available soon.

ganglia-web (rpm) in version 3.7.6 is in epel-testing

@berlin2123

3.7.6 is in epel (not-testing) now. Feel free to test.

An issue that has been identified is that the MONTH and YEAR pages still have problems in php8 (el9), which has been fixed in PULL 379 of ganglia-web.

Feel free to submit other issues!

@iGeorgeX

iGeorgeX commented Mar 5, 2024

There is also a bug in physical_view.php Issue
