Service alarms
This page describes the schema of alarm documents and how to access them.
Document schema
Let's start by explaining what an alarm document looks like.
Inside metadata you will find different fields depending on the alarm source:
- _id: used to identify a message inside the infrastructure
- producer: alarm
- type_prefix: raw
- type: gni
- timestamp: when the alarm was created
(additional collectd fields)
- toplevel_hostgroup: identifies the toplevel hostgroup of the host where the alarm was triggered
- submitter_hostgroup: identifies the full hostgroup of the host where the alarm was triggered
- submitter_environment: identifies the environment of the host where the alarm was triggered
- submitter_host: identifies the host on which the alarm was triggered (usually matches data.entities)
Inside data you will find information related to the triggered alarm:
- source: identifies the source of the alarm (collectd, grafana...)
- metrics: high level categorisation for the alarm
- alarm_type: the type of the alarm [os, hw, app]
- correlation: shows why the alarm was triggered, usually a comparison of the real value with the threshold
- alarm_name: full categorisation of the alarm
- entities: set of entities that caused the alarm to be triggered
- targets: set of final destinations for the alarm outside of the MONIT infrastructure (only snow and email for the moment)
- message: a descriptive message for the alarm
A sample MONIT alarm document looks like this:
{
  "metadata": {
    "availability_zone": "None",
    "submitter_hostgroup": "bi/condor/gridworker/shareshort",
    "type_prefix": "raw",
    "submitter_host": "b6d596ff20.cern.ch",
    "submitter_environment": "batchtest",
    "producer": "alarm",
    "_id": "6bbb4288-539c-d077-a97b-8620f5478018",
    "type": "gni",
    "toplevel_hostgroup": "bi",
    "timestamp": 1525329635233
  },
  "data": {
    "source": "collectd",
    "metrics": "tail_count",
    "alarm_type": "os",
    "status": "OK",
    "snow": {
      "functional_element": "LXBATCH",
      "troubleshooting": "",
      "hostgroup_grouping": true,
      "assignment_level": 3,
      "service_element": "default"
    },
    "correlation": "nan > 0.0",
    "alarm_name": "tail_base_count_vm_kill_value",
    "entities": "b6d596ff20.cern.ch",
    "targets": ["snow"],
    "message": "I'm a descriptive message"
  }
}
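To make the schema concrete, here is a minimal Python sketch (not an official MONIT client) that takes such a document, already parsed from JSON, and builds a one-line summary from the fields described above. The field names follow the sample; the summarising logic and the alarm.json file name are purely illustrative.

# Minimal sketch: summarise one alarm document using the fields described above.
# The field names come from the sample; everything else is illustrative.
import json


def summarise_alarm(doc: dict) -> str:
    """Build a one-line summary from the metadata and data blocks."""
    meta = doc.get("metadata", {})
    data = doc.get("data", {})
    return (
        f"[{data.get('alarm_type')}] {data.get('alarm_name')} "
        f"on {meta.get('submitter_host')} "
        f"({meta.get('submitter_hostgroup')}, {meta.get('submitter_environment')}): "
        f"{data.get('correlation')} -> targets {data.get('targets')}"
    )


if __name__ == "__main__":
    # e.g. the sample document above saved to a local file (hypothetical name)
    with open("alarm.json") as f:
        alarm = json.load(f)
    print(summarise_alarm(alarm))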
Data access
Alarms are treated inside the monitoring infrastructure like any other document, which means they are available at the default endpoints (a Kafka consumer sketch follows the list):
- Kafka: under the "alarm_raw_gni" topic
- HDFS: under "/project/monitoring/archive/alarm/raw/gni" folder
- OpenSearch: in monit-opensearch select "monit_prod_alarm_raw_*"
- InfluxDB: in monit-grafana use the "monit_idb_alarms" datasource
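As an illustration of how the Kafka endpoint could be consumed, here is a minimal sketch using the kafka-python library. The broker address, group id and the absence of authentication settings are placeholders, not the real MONIT values; the actual connection details depend on your MONIT configuration.

# Minimal consumer sketch for the alarm topic listed above.
# Broker address, group id and security settings are placeholders.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "alarm_raw_gni",                                    # topic from the list above
    bootstrap_servers=["monit-kafka.example.ch:9093"],  # placeholder broker
    group_id="my-alarm-consumer",                       # placeholder group id
    auto_offset_reset="latest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for record in consumer:
    alarm = record.value
    print(alarm["metadata"]["submitter_host"], alarm["data"]["alarm_name"])

The same documents can equally be read from the HDFS path or queried through the OpenSearch index pattern listed above; Kafka is shown here only because it is the most direct way to react to alarms as they arrive.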
We also provide a set of generic dashboards to help service managers visualise their alarms; you can find them in the MONIT organisation.