Puppet managed VM and HW nodes

Collectd metrics

Collectd is deployed and configured in all DC nodes managed via Puppet.

Several metrics are pre-deployed and preconfigured on these nodes.

From upstream plugins

From IT provided plugins

Each metric comes with default settings. This configuration can be overridden for your hostgroup using Hiera:

collectd::plugin::{plugin_name}::{parameter}: {expected_value}
# e.g: collectd::plugin::cpu::reportbycpu: true

Collectd alarms

Collectd is deployed and configured in all DC nodes managed via Puppet.

Several alarms are pre-deployed and preconfigured on all nodes.

Each alarm comes with default settings. This configuration can be overridden for your hostgroup using Hiera:

cerncollectd_contrib::alarm::swap_full::failure_max: 70
cerncollectd_contrib::alarm::swap_full::hits: 20

Nocontact alarms

Nocontact alarms are a special type of alarm, generated outside of the host from heartbeats produced by the host. When the heartbeats stop, a Nocontact alarm is generated. These alarms create SNOW tickets using information from the Puppet facts below. If required, the default values of these facts can be overridden for Nocontact alarms.

cerncollectd::nc_override::fename: "My new fe"
cerncollectd::nc_override::troubleshooting: "My troubleshooting URL"
cerncollectd::nc_override::snow_assignment_level: "3"
cerncollectd::nc_override::snow_grouping: "1"
cerncollectd::nc_override::egroup_name: "foo@cern.ch,bar@cern.ch"

All SNOW tickets created for Nocontact alarms have the following additional features:

  • They are automatically closed when an OK notification is received (heartbeats are produced again).
  • A reminder is sent every 24 hours while the Nocontact alarm is still active.
  • No ticket is created if the "roger_alarmed" flag is set to false (see the Roger docs); for nocontact, this flag is "nc_alarmed".

Nocontact alarms are also available from the Host Nocontact Grafana dashboard.

Why is my node in nocontact?

As mentioned above, nocontact is a special flow that relies on the heartbeat generated by the nc_heartbeat "service" (systemd) on Puppet-managed nodes and by the "remote probes" service for Windows infrastructure.

If you are receiving nocontact alarms for a Puppet-managed node, here is a list of things to check before contacting MONIT.

  • Service is active: systemctl status nc_heartbeat already gives a good indication of the service state.
  • Check /var/log/messages: the nc_heartbeat script logs any issue it hits while running there.
  • Ping monit-remote.cern.ch: your node may have general connectivity but still fail to reach the "nocontact" infrastructure.
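
The three checks above can be run in one go. The following is only a sketch: the function name nc_checks, the grep pattern, the 20-line tail and the ping count/timeout are illustrative assumptions, not part of the MONIT tooling.

```shell
#!/bin/sh
# Sketch of the nocontact checks for a Puppet-managed node.
# Assumptions (not from the MONIT docs): the function name, the grep pattern,
# the number of log lines shown, and the ping count/timeout values.

nc_checks() {
    # 1. Is the heartbeat service active?
    if systemctl is-active --quiet nc_heartbeat; then
        echo "nc_heartbeat service: active"
    else
        echo "nc_heartbeat service: NOT active (see: systemctl status nc_heartbeat)"
    fi

    # 2. Show recent heartbeat-related lines from /var/log/messages.
    #    Reading this file may require root; errors are silently ignored.
    echo "Recent nc_heartbeat log lines:"
    grep -i 'nc_heartbeat' /var/log/messages 2>/dev/null | tail -n 20

    # 3. Can the node reach the nocontact infrastructure?
    if ping -c 3 -W 5 monit-remote.cern.ch >/dev/null 2>&1; then
        echo "monit-remote.cern.ch: reachable"
    else
        echo "monit-remote.cern.ch: NOT reachable"
    fi
}
```

Source the snippet and call nc_checks on the affected node; the output can be attached to the SNOW ticket if you still need to contact MONIT.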

For Windows nodes, which are handled by the remote probes service, nocontact can be triggered by different factors (lack of connectivity, timeouts, etc.); you can find more details in the available dashboard. If a Windows machine has been removed but is still generating nocontacts, please make sure it has been properly deleted from Foreman, as Foreman is the source of truth for our PuppetDB service discovery.

Prevent your node from registering for nocontact at all

If you have a very specific kind of workflow, where your nodes are only supposed to live for a very short time (a few hours), it makes sense not to register these nodes in the nocontact infrastructure at all, which avoids extra load on both sides.

Please note that this is different from masking your nocontact in Roger: masking still allows the node to register, and "silent" nocontacts might be triggered until the cleanup on our side runs.

If your workflow matches this scenario, you can override the following variable in Hiera to disable the registration:

cerncollectd::enable_nc_regestration: false

NoMonitoring alarms

NoMonitoring alarms, like Nocontact ones, are a special type of alarm. They are generated from the monitoring heartbeats emitted on the host by a producer run by Collectd. When Collectd stops producing heartbeats, or detects an issue with Flume (the forwarding agent), a NoMonitoring alarm is generated.

The default values for these alarms can be overridden using the plugin configuration:

cerncollectd_contrib::plugin::heartbeat::snow_functional_element: 'ignore'
...

As mentioned, there are two types of "NoMonitoring" alarms:

  • NoCollectd: Collectd is not able to send heartbeats. This may be caused by a stuck plugin or by the service not running at all. Please check the "collectd" service and the logs in "/var/log/collectd.log" to better understand the reason.
  • NoFlume: Flume is not listening for Collectd metrics on the host. Please check the "flume-ng-agent-collectd" service and the logs in "/var/log/flume-ng-agent-collectd/flume-agent-collectd.log" to better understand the situation.
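
As a starting point for either case, the services and logs named above can be inspected together. This is a minimal sketch: the function name nomonitoring_checks and the 20-line tail are illustrative assumptions, while the unit names and log paths are the ones listed above.

```shell
#!/bin/sh
# Sketch of the NoCollectd / NoFlume checks.
# Assumptions (not from the MONIT docs): the function name and the number
# of log lines shown. Unit names and log paths come from the text above.

nomonitoring_checks() {
    # Report the state of both services involved in NoMonitoring alarms.
    for unit in collectd flume-ng-agent-collectd; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
        fi
    done

    # Show the tail of each log; print a note if a log cannot be read.
    for log in /var/log/collectd.log \
               /var/log/flume-ng-agent-collectd/flume-agent-collectd.log; do
        echo "== last lines of $log =="
        tail -n 20 "$log" 2>/dev/null || echo "(cannot read $log)"
    done
}
```

Source the snippet and call nomonitoring_checks on the affected node; a "NOT active" line for collectd points to NoCollectd, and one for flume-ng-agent-collectd points to NoFlume.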

If you are receiving these alarms and the reason is not clear, please contact the MONIT team via SNOW.

If you would just like to stop receiving these alarms, please set the "snow_functional_element" parameter to 'ignore' when configuring the plugin in Hiera.

To avoid receiving no_monitoring alarms when draining or removing nodes, the best option is to mask the node in Roger by setting the "app_alarmed" flag to "false". Please note that this stops any alarm that is not an "Operating system, Hardware or Nocontact" alarm from being raised as a SNOW ticket.

Syslog

By default, monitoring gathers certain syslog information and sends it to the central infrastructure. This data is available in monit-syslog.

Abrt

By default, monitoring doesn't enable any notification related to the abrt module. However, because abrt disappears in RHEL9+, we provide a small daemon, based on journal-notify, that aims to replace the notification part.

This component is disabled by default, but it can be enabled with the setting below. Please note that, for the time being, you also need to include the monitoring module in your manifest, as it is not yet part of base:

monitoring::enable_journal_notify: true

Once enabled, it starts sending emails to the owner of the machine for systemd-coredump messages. The monitoring module also comes with a Puppet definition for journal-notify, so service managers can add their own notifications with whatever configuration they need.