Ganglia Missing Metrics for Some Nodes #308

Open
suanmiao opened this issue Oct 19, 2018 · 0 comments

We are using Ganglia (gmond 3.6.0), and some metrics are missing for some nodes.
Any explanation or suggestion regarding this issue would be appreciated. Thanks!

Details are below.

Symptom

The metric load_one (the one-minute load average) is partially or completely missing for some nodes.

[image: load_one graph for an affected node]

And the metric reported from other nodes during the same period looks like this:

[image: load_one graph for a normal node over the same period]

We also checked the RRD files under /var/lib/ganglia/rrds and found that some of them are missing for the affected nodes.

Below are the files under that folder for a normal node:

boottime.rrd bytes_out.rrd cpu_idle.rrd cpu_num.rrd cpu_system.rrd cpu_wio.rrd disk_total.rrd load_five.rrd mem_buffers.rrd mem_free.rrd mem_total.rrd pkts_in.rrd proc_run.rrd swap_free.rrd
bytes_in.rrd cpu_aidle.rrd cpu_nice.rrd cpu_speed.rrd cpu_user.rrd disk_free.rrd load_fifteen.rrd load_one.rrd mem_cached.rrd mem_shared.rrd part_max_used.rrd pkts_out.rrd proc_total.rrd swap_total.rrd

Below are the files for an affected node:

boottime.rrd cpu_num.rrd cpu_speed.rrd mem_total.rrd swap_total.rrd

Thus we conclude that Ganglia failed to record these metrics.
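
The comparison above was done by hand. A minimal sketch of how it could be automated across all nodes is shown below. It assumes the usual gmetad layout of one directory per host under a cluster directory (`<rrds>/<cluster>/<host>/<metric>.rrd`) plus a `__SummaryInfo__` directory; the function name `missing_metrics` is hypothetical and the layout should be checked against the local installation.

```python
from pathlib import Path


def missing_metrics(cluster_dir, reference_host):
    """Map each host directory to the .rrd files it lacks relative to reference_host."""
    cluster = Path(cluster_dir)
    # The reference host is assumed to be a node known to report every metric.
    expected = {p.name for p in (cluster / reference_host).glob("*.rrd")}
    report = {}
    for host_dir in sorted(p for p in cluster.iterdir() if p.is_dir()):
        # Skip the reference itself and gmetad's per-cluster summary directory.
        if host_dir.name in (reference_host, "__SummaryInfo__"):
            continue
        present = {p.name for p in host_dir.glob("*.rrd")}
        missing = sorted(expected - present)
        if missing:
            report[host_dir.name] = missing
    return report
```

Calling `missing_metrics(cluster_dir, reference_host)` with the cluster directory under /var/lib/ganglia/rrds and the name of a normal node returns a dictionary that lists, per affected node, which metric files are absent.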

Setup & Environment

Ganglia Version: gmond 3.6.0
System Version:

Gmond.conf:

/* Ganglia modules are defined in terms of .conf files in /etc/ganglia/conf.d; the directive below
includes all such .conf files.

Ganglia Python modules (e.g. the NVIDIA python module for monitoring GPUs) are specified
in .pyconf files within /etc/ganglia/conf.d. When we install support for Ganglia python modules,
(via apt-get install ganglia-monitor-python in our GPU base image) the installation process
creates a /etc/ganglia/conf.d/modpython.conf file. The directive below includes the modpython.conf
file, which in turn contains a directive to include all .pyconf files within /etc/ganglia/conf.d.
*/
include ('/etc/ganglia/conf.d/*.conf')

/* This configuration is as close to 2.5.x default behavior as possible
The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
daemonize = yes
setuid = yes
user = ganglia
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
host_dmax = 0 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
send_metadata_interval = 0
}

/* The host section describes attributes of the host, like the location */
host {
location = "unspecified"
}

/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}

/* Each metrics module that is referenced by gmond must be specified and
loaded. If the module has been statically linked with gmond, it does not
require a load path. However all dynamically loadable modules must include
a load path. */
modules {
module {
name = "core_metrics"
}
module {
name = "cpu_module"
path = "/usr/lib/ganglia/modcpu.so"
}
module {
name = "disk_module"
path = "/usr/lib/ganglia/moddisk.so"
}
module {
name = "load_module"
path = "/usr/lib/ganglia/modload.so"
}
module {
name = "mem_module"
path = "/usr/lib/ganglia/modmem.so"
}
module {
name = "net_module"
path = "/usr/lib/ganglia/modnet.so"
}
module {
name = "proc_module"
path = "/usr/lib/ganglia/modproc.so"
}
module {
name = "sys_module"
path = "/usr/lib/ganglia/modsys.so"
}
}

/* The old internal 2.5.x metric array has been replaced by the following
collection_group directives. What follows is the default behavior for
collecting and sending metrics that is as close to 2.5.x behavior as
possible. */

/* This collection group will cause a heartbeat (or beacon) to be sent every
20 seconds. In the heartbeat is the GMOND_STARTED data which expresses
the age of the running gmond. */
collection_group {
collect_once = yes
time_threshold = 20
metric {
name = "heartbeat"
}
}

/* This collection group will send general info about this host every 1200 secs.
This information doesn't change between reboots and is only collected once. */
collection_group {
collect_once = yes
time_threshold = 1200
metric {
name = "cpu_num"
title = "CPU Count"
}
metric {
name = "cpu_speed"
title = "CPU Speed"
}
metric {
name = "mem_total"
title = "Memory Total"
}
/* Should this be here? Swap can be added/removed between reboots. */
metric {
name = "swap_total"
title = "Swap Space Total"
}
metric {
name = "boottime"
title = "Last Boot Time"
}
metric {
name = "machine_type"
title = "Machine Type"
}
metric {
name = "os_name"
title = "Operating System"
}
metric {
name = "os_release"
title = "Operating System Release"
}
metric {
name = "location"
title = "Location"
}
}

/* This collection group will send the status of gexecd for this host every 300 secs */
/* Unlike 2.5.x the default behavior is to report gexecd OFF. */
collection_group {
collect_once = yes
time_threshold = 300
metric {
name = "gexec"
title = "Gexec Status"
}
}

/* This collection group will collect the CPU status info every 20 secs.
The time threshold is set to 90 seconds. In honesty, this time_threshold could be
set significantly higher to reduce unnecessary network chatter. */
collection_group {
collect_every = 20
time_threshold = 90
/* CPU status */
metric {
name = "cpu_user"
value_threshold = "1.0"
title = "CPU User"
}
metric {
name = "cpu_system"
value_threshold = "1.0"
title = "CPU System"
}
metric {
name = "cpu_idle"
value_threshold = "5.0"
title = "CPU Idle"
}
metric {
name = "cpu_nice"
value_threshold = "1.0"
title = "CPU Nice"
}
metric {
name = "cpu_aidle"
value_threshold = "5.0"
title = "CPU aidle"
}
metric {
name = "cpu_wio"
value_threshold = "1.0"
title = "CPU wio"
}
/* The next two metrics are optional if you want more detail...
... since they are accounted for in cpu_system.
metric {
name = "cpu_intr"
value_threshold = "1.0"
title = "CPU intr"
}
metric {
name = "cpu_sintr"
value_threshold = "1.0"
title = "CPU sintr"
}
*/
}

collection_group {
collect_every = 20
time_threshold = 90
/* Load Averages */
metric {
name = "load_one"
value_threshold = "1.0"
title = "One Minute Load Average"
}
metric {
name = "load_five"
value_threshold = "1.0"
title = "Five Minute Load Average"
}
metric {
name = "load_fifteen"
value_threshold = "1.0"
title = "Fifteen Minute Load Average"
}
}

/* This group collects the number of running and total processes */
collection_group {
collect_every = 80
time_threshold = 950
metric {
name = "proc_run"
value_threshold = "1.0"
title = "Total Running Processes"
}
metric {
name = "proc_total"
value_threshold = "1.0"
title = "Total Processes"
}
}

/* This collection group grabs the volatile memory metrics every 40 secs and
sends them at least every 180 secs. This time_threshold can be increased
significantly to reduce unneeded network traffic. */
collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "mem_free"
value_threshold = "1024.0"
title = "Free Memory"
}
metric {
name = "mem_shared"
value_threshold = "1024.0"
title = "Shared Memory"
}
metric {
name = "mem_buffers"
value_threshold = "1024.0"
title = "Memory Buffers"
}
metric {
name = "mem_cached"
value_threshold = "1024.0"
title = "Cached Memory"
}
metric {
name = "swap_free"
value_threshold = "1024.0"
title = "Free Swap Space"
}
}

collection_group {
collect_every = 40
time_threshold = 300
metric {
name = "bytes_out"
value_threshold = 4096
title = "Bytes Sent"
}
metric {
name = "bytes_in"
value_threshold = 4096
title = "Bytes Received"
}
metric {
name = "pkts_in"
value_threshold = 256
title = "Packets Received"
}
metric {
name = "pkts_out"
value_threshold = 256
title = "Packets Sent"
}
}

/* Different than 2.5.x default since the old config made no sense */
collection_group {
collect_every = 1800
time_threshold = 3600
metric {
name = "disk_total"
value_threshold = 1.0
title = "Total Disk Space"
}
}

collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "disk_free"
value_threshold = 1.0
title = "Disk Space Available"
}
metric {
name = "part_max_used"
value_threshold = 1.0
title = "Maximum Disk Space Used"
}
}
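
The tcp_accept_channel in the configuration above serves an XML description of the cluster state on port 8649. A minimal sketch of how that XML could be used to check which hosts currently report a given metric is shown below. It assumes the XML contains `HOST` elements with `METRIC` children carrying a `NAME` attribute; the function names `fetch_gmond_xml` and `hosts_missing_metric` are hypothetical helpers, not part of Ganglia.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML that gmond writes on its tcp_accept_channel until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_missing_metric(xml_bytes, metric="load_one"):
    """Return the names of HOST elements that have no METRIC with the given NAME."""
    root = ET.fromstring(xml_bytes)
    missing = []
    for host in root.iter("HOST"):
        names = {m.get("NAME") for m in host.iter("METRIC")}
        if metric not in names:
            missing.append(host.get("NAME"))
    return sorted(missing)
```

For example, `hosts_missing_metric(fetch_gmond_xml("some-node"))` lists the hosts for which that gmond currently holds no load_one value (`"some-node"` is a placeholder). Comparing this list with the missing .rrd files may help narrow down whether the metric is absent from gmond itself or only from the stored RRD data.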
