Using Prometheus

Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
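
For context, here is a minimal sketch of the Prometheus server side of this setup: the server loads alerting rule files and is pointed at an Alertmanager. The rule file path and Alertmanager address are placeholders, not values used by the infrastructure described here.

# prometheus.yml (excerpt)
rule_files:
  - /etc/prometheus/rules/*.yml            # alerting rule files loaded by the server
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - my-alertmanager.cern.ch:9093       # placeholder Alertmanager address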

Schema

The alerts produced by Prometheus and sent by the Alertmanager should contain the labels described in the schema below (a sketch of an alerting rule setting some of these fields follows the list).

  • (mandatory) timestamp: the timestamp of the alarm (check #1 below)
  • (default) source: the source will be set to "prometheus"
  • (mandatory) alarm_name: the name of the alarm
  • (default) entities: in the case of Prometheus, entities will be composed by joining together any of the following fields available in the message:
  • "instance", "pod_name", "pod", "cluster", "service", "producer"
  • (mandatory) status: the alarm state, possible values: "resolved" -> OK, "firing" -> FAILURE
  • (optional) summary: a short summary of the alarm; if provided it is used as the ticket subject
  • (optional) description: a more verbose description of the alarm
  • (optional) submitter_environment: environment from where the alarm was fired
  • (optional) submitter_hostgroup: hostgroup from where the alarm was fired
  • (optional) troubleshooting: a link providing troubleshooting details
  • (optional) roger_alarmed: whether the alarm is masked by roger
  • (optional) snow_service_element: the SNOW SE of the incident
  • (optional) snow_functional_element: the SNOW FE of the incident
  • (optional) snow_assignment_level: the SNOW assignment level of the incident (as an int)
  • (optional) snow_hostgroup_grouping: whether the incident should be grouped by hostgroup
  • (optional) snow_auto_closing: whether OK alerts automatically close tickets (applies only when grouping is disabled)
  • (optional) snow_notification_type: the alarm type, possible values: app (default), hw, os
  • (optional) snow_watchlist: list of emails that will form the watchlist of the SNOW incident
  • (optional) snow_troubleshooting: used historically by nocontacts; it will be overridden by "troubleshooting" if that is specified
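
As an illustration, here is a sketch of an alerting rule that attaches some of the schema fields above. The alert name, expression, and values are hypothetical, and whether a given field is better carried as a label or as an annotation depends on how the alarmsource parses the payload.

groups:
- name: example-monit-alarms
  rules:
  - alert: HighRequestLatency            # becomes the alarm_name
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      monit_forward: "true"              # picked up by the Alertmanager route shown in the sections below
      snow_functional_element: "My Service"
      snow_assignment_level: "3"
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "Mean request latency has been above 0.5s for 10 minutes."
      troubleshooting: "https://cern.ch/my-service-troubleshooting"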

Kubernetes and Alertmanager

The Cloud team provides a way to deploy Kubernetes on OpenStack. If the required flags were used as described here, your cluster should already be sending some base alarms to Monit. To create your own alarms, you will need to deploy them via Helm. Take a look at the cloud docs to better understand how to use Helm.
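
As a rough sketch, assuming the cluster runs the prometheus-operator (so alerting rules are defined as PrometheusRule resources), a custom alarm shipped in a Helm chart could look like the following. The resource name and the selector label are placeholders that depend on your chart and monitoring stack.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alarms              # placeholder name
  labels:
    release: prometheus            # match the label selector used by your Prometheus instance
spec:
  groups:
  - name: my-app
    rules:
    - alert: MyAppDown
      expr: up{job="my-app"} == 0
      for: 5m
      labels:
        monit_forward: "true"      # so the default Alertmanager route forwards it to Monit
      annotations:
        summary: "my-app is down"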

Alertmanager standalone

If you want your Prometheus alarms to reach Monit, you will need to create a new receiver pointing to Monit's alarmsource.

It is possible to define and manage your Alertmanager through Puppet. Take a look here for more information on how to do it.

Here is an example of the configuration deployed by default in the Kubernetes clusters, which can be used to send alarms to the monitoring infrastructure.

global:
  resolve_timeout: 5m                    # time after which an alert can be declared resolved if it has not been updated
route:
  receiver: default-receiver             # alerts matching no sub-route stay with the default receiver
  group_by:
  - alertname
  - cluster
  - service
  - pod_name
  routes:
  - match:
      monit_forward: "true"              # only alerts carrying this label are forwarded to Monit
    receiver: cern-central-monitoring
    continue: true                       # keep evaluating any sibling routes after this one matches
  group_wait: 1m
  group_interval: 5m
receivers:
- name: default-receiver                 # no notification configs: these alerts are not sent anywhere
- name: cern-central-monitoring
  webhook_configs:
  - send_resolved: true                  # also forward resolutions so alarms can be marked OK
    http_config: {}
    url: http://monit-alarms.cern.ch:10014   # Monit's alarmsource endpoint

GNI integration

Prometheus alarm integration with GNI is already in place, in the sense that a Prometheus alarm is treated like any other alarm inside the infrastructure.

Currently the only possible targets for these alarms are the storage backends (Elasticsearch and InfluxDB) and SNOW. There is no plan to extend this to email, as the Alertmanager can already handle that.