
Service alarms

This page describes the alarm document schema and the endpoints where alarm data can be accessed.

Document schema

Let's start by explaining what an alarm document looks like.

Inside metadata you will find different fields depending on the alarm source:

  • _id: used to identify a message inside the infrastructure
  • producer: alarm
  • type_prefix: raw
  • type: gni
  • timestamp: when the alarm was created

Additional fields for collectd alarms:

  • toplevel_hostgroup: identifies the toplevel hostgroup of the host where the alarm was triggered
  • submitter_hostgroup: identifies the full hostgroup of the host where the alarm was triggered
  • submitter_environment: identifies the environment of the host where the alarm was triggered
  • submitter_host: identifies the host on which the alarm was triggered (usually matches data.entities)

Inside data you will find information related to the triggered alarm:

  • source: identifies the source of the alarm (collectd, grafana...)
  • metrics: high level categorisation for the alarm
  • alarm_type: the type of the alarm [os, hw, app]
  • correlation: shows why the alarm was triggered, usually a comparison of the real value with the threshold
  • alarm_name: full categorisation of the alarm
  • entities: set of entities that caused the alarm to be triggered
  • targets: set of final destinations, outside of the monit infrastructure, for the alarm (only snow and email for the moment)
  • message: a descriptive message for the alarm

A sample MONIT alarm document looks like this:

{
    "metadata": {
        "availability_zone": "None",
        "submitter_hostgroup": "bi/condor/gridworker/shareshort",
        "type_prefix": "raw",
        "submitter_host": "b6d596ff20.cern.ch",
        "submitter_environment": "batchtest",
        "producer": "alarm",
        "_id": "6bbb4288-539c-d077-a97b-8620f5478018",
        "type": "gni",
        "toplevel_hostgroup": "bi",
        "timestamp": 1525329635233
    },
    "data": {
        "source": "collectd",
        "metrics": "tail_count",
        "alarm_type": "os",
        "status": "OK",
        "snow": {
            "functional_element": "LXBATCH",
            "troubleshooting": "",
            "hostgroup_grouping": true,
            "assignment_level": 3,
            "service_element": "default"
        },
        "correlation": "nan > 0.0",
        "alarm_name": "tail_base_count_vm_kill_value",
        "entities": "b6d596ff20.cern.ch",
        "targets": ["snow"],
        "message": "Im a descriptive message"
    }
}
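
For illustration, here is a minimal Python sketch that reads such a document and pulls out the fields described above. The file name (alarm.json) and the summary format are assumptions made for the example, not part of the MONIT schema itself.

import json

def summarise_alarm(doc: dict) -> str:
    """Build a one-line summary from a MONIT alarm document (illustrative only)."""
    meta = doc["metadata"]
    data = doc["data"]
    return (
        f"[{data['alarm_type']}] {data['alarm_name']} on {meta['submitter_host']} "
        f"({meta['submitter_hostgroup']}): {data['correlation']} -> {data['targets']}"
    )

# Assumed: the sample document above has been saved locally as alarm.json
with open("alarm.json") as fh:
    alarm = json.load(fh)

print(summarise_alarm(alarm))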

Data access

The alarms are treated inside the monitoring infrastructure like any other document, which means they are available in the default endpoints (see the example after the list):

  • Kafka: under the "alarm_raw_gni" topic
  • HDFS: under "/project/monitoring/archive/alarm/raw/gni" folder
  • OpenSearch: in monit-opensearch select "monit_prod_alarm_raw_*"
  • InfluxDB: in monit-grafana use the "monit_idb_alarms" datasource

We also provide a set of generic dashboards to help service managers visualise their alarms; you can find them in the MONIT organisation.