Collectd Alarms

This section explains how to set alarms on Collectd metrics.

In the Collectd world an alarm is created using the threshold plugin. Alarms are defined by configuring the collectd Threshold plugin through a Puppet wrapper that we provide in cerncollectd::alarms::threshold::plugin. Note that the equivalent of "value_instance" in Grafana views is "data_source" in the threshold wrapper. Alarm definitions (base alarms and user alarms) go into the shared Puppet module cerncollectd_contrib.

Defining an alarm

To define your own collectd alarm, create its configuration in cerncollectd_contrib and start a CRM process. You can take the example below, or use existing alarm definitions as a template. Once defined, any of the existing alarms from cerncollectd_contrib can be used in your hostgroups.

It is strongly recommended to keep a $custom_targets parameter in your alarm definition, for the SNOW override options. You can take the parameter from any existing alarm definition and just change the resource name and the ctd_namespace parameter. The same applies to $actuator, in case you want your alarm to specify one (an actuator is a command that will be executed on your host if the alarm is triggered), and to $troubleshooting (usually a URL with some extra information about the alert).

class cerncollectd_contrib::alarm::swap_full (
  Integer $failure_max = 85,
  Integer $hits = 15,
  Optional[Hash] $custom_targets = undef,
  Optional[Hash] $custom_workflow = undef,
  Optional[String] $actuator = undef,
  Optional[String] $troubleshooting = undef,
) {
  ::cerncollectd::alarms::threshold::plugin {'swap_full':
    plugin        => 'swap',
    type          => 'percent',
    type_instance => 'used',
    failure_max   => $failure_max,
    hits          => $hits,
  }
  if $custom_targets != undef {
    ::cerncollectd::alarms::extra {'swap_full':
      ctd_namespace   => 'swap_percent_used',
      targets         => $custom_targets,
      workflow        => $custom_workflow,
      actuator        => $actuator,
      troubleshooting => $troubleshooting,
    }
  }
}
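Once the definition exists, the alarm can be enabled and tuned from your hostgroup's Hiera data. A minimal sketch follows; the class-inclusion mechanism and the parameter values shown are assumptions, so adapt them to your hostgroup setup:

```yaml
# Hypothetical hostgroup Hiera data: include the alarm class and
# override its thresholds (the values here are illustrative only)
classes:
  - cerncollectd_contrib::alarm::swap_full
cerncollectd_contrib::alarm::swap_full::failure_max: 90
cerncollectd_contrib::alarm::swap_full::hits: 10
```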

Release process

To add a new alarm in cerncollectd_contrib please follow these steps:

  1. Complete the implementation of the puppet module for your alarm (please branch from QA).
  2. If you are updating an existing alarm, open a CRM ticket and add "monit-support" in the watch list.
  3. Open one Merge Request for your module into QA, tag it with the CRM number ("[CRM-XXXX] Descriptive message") or with "[NOCRM] Descriptive message", and select "Squash commits" and "Remove source branch" in both cases.
  4. The MONIT team will take care of reviewing the MR and moving the changes to QA and master (we will ping you in the CRM if there is one, otherwise the change will be moved to master straight away).
  5. Once the MR is in production you can close the CRM ticket.

Service not running

It's a common use case to generate alarms when a given process is not running. At CERN we call this kind of alarm a "service_wrong" alarm. To be alarmed when a service is not running you need two steps:

  • Monitoring the process
  • Generating the alarm

As the monitoring team, we offer some predefined configuration you can use to achieve this.

Monitoring the process

First things first: you need to generate a metric indicating the status of your process. Usually this means counting the processes running under a given name, for which Collectd offers three options.

  • Processes plugin (not recommended): core Collectd plugin; counts the number of running processes and collects many other metrics.
  • Systemd plugin: Counts the matching processes running under systemd.
  • ProcessCount plugin: Counts the matching processes running, use it when the systemd plugin is not a valid option.

In terms of configuration you just need to append the process you want to monitor to the list using Hiera. The following configuration example shows this for the Systemd plugin; a similar approach is used for the other plugins.

Metrics specific to every single process that come from the Collectd Processes plugin go to service databases (so generic use cases are isolated from the service-specific ones).

cerncollectd_contrib::plugin::systemd::services:
  - process1
  - process2

If filtering is required, the ProcessCount plugin can be configured similarly to filter the monitored processes by user id and/or parent process id:

cerncollectd_contrib::plugin::processcount::filtering:
  process_1:
    myfilters:
      ppid: [0, 1]
      uid: [100]
    anotherfilter:
      ppid: [987]
  process_2:
  process_3:
    afilter:
      uid: [0, 1, 2]

Generating the alarm

Now that the metric is produced within the Collectd world, we will show how to use the Threshold plugin to generate the alarm. Since this is a generic use case, we already provide a wrapper which will trigger an alert in case the number of running processes is below 1.

What you need to do is write the alarm definition (the swap_full class above can serve as a template) and include it in your hostgroup like any other alarm.
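As a sketch, a service_wrong definition could look like the following, modeled on the swap_full example above. The plugin/type values for a systemd-monitored process and the failure_min option are assumptions; check the threshold wrapper for the exact parameters it accepts:

```puppet
# Hypothetical service_wrong alarm for a process monitored via the
# Systemd plugin; everything here is modeled on the swap_full example
class cerncollectd_contrib::alarm::httpd_wrong (
  Integer $hits = 3,
) {
  ::cerncollectd::alarms::threshold::plugin {'httpd_wrong':
    plugin        => 'systemd',   # assumption: plugin producing the metric
    type_instance => 'httpd',     # assumption: the monitored unit name
    failure_min   => 1,           # assumption: alert when the count drops below 1
    hits          => $hits,
  }
}
```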

NOTE: Defining multiple filters for a single process with the ProcessCount plugin results in a dedicated metric per filter. As the default service_wrong behaviour is to raise one alarm per process, that alarm will be triggered if any of the metrics defined for a given process reports a value below 1. If you prefer a dedicated alarm for each process filter, set the type_instance variable to match your filter name when creating the service_wrong alarms.

If you want different bounds for the alarm, just skip the wrapper and use it instead as inspiration for your own alarm.

Actuators

An actuator is a concept coming from the old Lemon world: it refers to a command linked to an alarm that will be executed if the alarm is triggered, usually to try to fix the issue (e.g. restarting a process).

In the case of Collectd this concept has been ported as a plugin, allowing Collectd to run a command when a specific alarm is received.

Note: For the time being, Collectd needs to run in permissive mode under SELinux to use actuators, so some extra user configuration is needed.

cerncollectd::alarms::config::allow_actuator: true

Once that is allowed, you can specify the $actuator parameter in your alarms, which will add the corresponding configuration to the alarm configuration file (similar to custom_targets).
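Putting it together in Hiera, a sketch might look like this; the clean_swap.sh command is purely hypothetical and only illustrates where the actuator is attached:

```yaml
# Hypothetical Hiera data: allow actuators on the host and attach a
# command to an existing alarm (the script path is an illustration)
cerncollectd::alarms::config::allow_actuator: true
cerncollectd_contrib::alarm::swap_full::actuator: '/usr/local/bin/clean_swap.sh'
```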

Custom workflow

In some cases it can be useful for service managers to define a "workflow" for the alarm. This configuration selects which Collectd plugins will run when the notification is received.

Currently there are two possible Collectd plugins that the workflow can target:

  • "alarm": stands for the alarm handler plugin, which generates a GNI notification
  • "actuator": stands for the alarm actuator plugin, which runs a specific actuator

Workflow configuration can be defined per Collectd notification severity: "ok", "warning" or "failure".

An example overriding it for a specific alarm is shown below: "ok" won't trigger anything, "warning" will run the actuator, and only when the alarm reaches "failure" will a notification be generated:

cerncollectd_contrib::alarm::swap_full::custom_workflow:
  ok: []
  warning: ['actuator']
  failure: ['alarm']

To preserve the default behaviour of the Collectd flow (i.e. without the workflow overridden), the "workflow" parameter is used as follows:

  • If workflow is not defined at all, fall back to the regular behaviour:
      • the alarm handler sends the notification independently of its severity
      • the alarm actuator is only triggered by "FAILURE" notifications
  • If workflow is defined but a severity is not explicitly overridden, fall back to the regular behaviour for that specific severity.
  • If workflow is defined and a severity is overridden, execute the targets defined.
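Spelled out as a workflow, that default behaviour would be equivalent to the following sketch (assuming the same severity keys as in the earlier example):

```yaml
# Explicit equivalent of the default flow (illustration only):
# notifications are always sent, the actuator only runs on failure
cerncollectd_contrib::alarm::swap_full::custom_workflow:
  ok: ['alarm']
  warning: ['alarm']
  failure: ['alarm', 'actuator']
```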

Note: The order of the defined actions has no impact on the flow (i.e. setting ['actuator', 'alarm'] doesn't mean the actuator will run first and only then the alarm will be generated), as Collectd runs the plugins in parallel.

Checking alarm status

You can check the current status of the Collectd metrics on the host by running the "collectdctl" tool, as explained in this section.
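For example, with collectd running on the host, the standard collectdctl subcommands can be used to inspect values; the metric identifier below is only an illustration, list the real ones with listval:

```shell
# List all value identifiers currently known to the local collectd
collectdctl listval
# Read the current value of one metric (the identifier is an example)
collectdctl getval myhost.cern.ch/swap/percent-used
```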