Data Storage

The following backends are used today by MONIT: HDFS, OpenSearch, InfluxDB, Mimir.

HDFS

The monitoring data in HDFS is stored under `/project/monitoring/archive`.

OpenSearch

As part of the MONIT service, several OpenSearch clusters are provided:

* monit-opensearch.cern.ch: Stores metrics for short-term periods (max 90 days).
* monit-opensearch-lt.cern.ch: Stores metrics for long-term periods (some data has no expiration time).
* monit-timber.cern.ch: Stores logs that the service managers consider public (i.e. they don't contain sensitive information).
* monit-timberprivate.cern.ch: Stores logs that the service managers consider private; access is fully managed by tenancy and security roles.
* monit-syslog.cern.ch: Stores system logs from the DC nodes managed by Puppet.

InfluxDB

Collectd data is stored in InfluxDB. Depending on the type of data (base monitoring or service monitoring), it ends up in a different database.

Base monitoring

For base monitoring there are currently 15 different InfluxDB instances, one per base plugin. Each of them runs on a different port, but they all share the same read-only account and the same database name (monit_production_collectd).

Service monitoring

By default, service-specific metrics are stored in a common InfluxDB instance (dbod-m-ctd), split into different databases. These databases are based on the service and the toplevel hostgroups of the data. In some exceptional cases, metrics can also be stored in dedicated instances.

Measurements

In InfluxDB, a measurement is similar to an SQL table. For the Collectd use case, measurements are named "plugin"_"type", so for example the following document:

time                plugin type
----                ------ ----
1515486329000000000 cpu    percent

Will end up in a measurement called "cpu_percent".
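The naming rule can be sketched in a few lines of Python. This is only an illustration of the "plugin"_"type" convention described above; the function name `measurement_name` is hypothetical and not part of any MONIT tooling.

```python
def measurement_name(plugin: str, type_: str) -> str:
    """Build the measurement name as "<plugin>_<type>"."""
    return f"{plugin}_{type_}"


# The example document from the table above.
document = {"time": 1515486329000000000, "plugin": "cpu", "type": "percent"}
print(measurement_name(document["plugin"], document["type"]))
```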

Data schema

The InfluxDB data schema differs from the rest of the infrastructure due to InfluxDB restrictions. However, all the Collectd databases share the same schema.

Inside the data you will find three main types of fields:

- Monitoring metadata such as host, submitter_environment, submitter_hostgroup, toplevel_hostgroup, which provide information about the host environment.
- Collectd metadata such as plugin, plugin_instance, type, type_instance, which provide information about the Collectd namespace.
- Collectd data such as *_value, which represent the different aggregations provided for Collectd data.

Note: Collectd has the concept of multi-metrics, which means the same document might have multiple values, e.g. the network plugin comes with rate_in and rate_out. In the monitoring infrastructure we split these documents into single-value ones and promote the value name to a new metadata-like field, "value_instance".

- time: long:milliseconds
- host: string
- max_value: double|long
- mean_value: double|long
- min_value: double|long
- plugin: string
- plugin_instance: string|UNKNOWN
- producer: string
- submitter_environment: string
- submitter_hostgroup: string
- sum_value: double|long
- toplevel_hostgroup: string
- type: string
- type_instance: string:UNKNOWN
- type_prefix: string
- value_instance: string:UNKNOWN
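
As a rough illustration of the multi-metric split mentioned in the note above, the sketch below turns one document with several values into single-value documents, promoting each value name to `value_instance`. The function name `split_multi_metric`, the `value` key, the host name and the `type` value are all assumptions made for the example; the real pipeline and its field names may differ.

```python
from typing import Any, Dict, List


def split_multi_metric(
    document: Dict[str, Any], values: Dict[str, float]
) -> List[Dict[str, Any]]:
    """Return one single-value document per entry in `values`.

    The metadata in `document` is copied unchanged; the value name is
    promoted to the "value_instance" field. The "value" key is a
    placeholder chosen for this sketch.
    """
    return [
        {**document, "value_instance": name, "value": value}
        for name, value in values.items()
    ]


# Hypothetical multi-metric document from the network plugin.
network = {"host": "node1.example.org", "plugin": "network", "type": "example_type"}
for doc in split_multi_metric(network, {"rate_in": 10.0, "rate_out": 4.0}):
    print(doc)
```

Each resulting document carries the same metadata and exactly one value, so it fits the single-value schema listed above.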

Data binning

To improve the resource usage of the InfluxDB databases and extend the amount of time we can keep the data, we have set up three different retention policies for Collectd data.

- one_week: one-minute resolution data kept for one week (the default policy).
- one_month: five-minute resolution data kept for one month.
- five_years: one-hour resolution data kept for five years.
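
When querying, the retention policy has to match how far back the data is needed. The sketch below picks the finest-resolution policy that still covers a requested lookback and builds an InfluxQL query string using the `"retention_policy"."measurement"` form. The helper names are hypothetical, and the day counts used for "one month" and "five years" are approximations assumed for this example.

```python
from datetime import timedelta

# (policy name, resolution, retention) as listed above. The 30-day month
# and 365-day year are assumptions of this sketch.
RETENTION_POLICIES = [
    ("one_week", timedelta(minutes=1), timedelta(weeks=1)),
    ("one_month", timedelta(minutes=5), timedelta(days=30)),
    ("five_years", timedelta(hours=1), timedelta(days=5 * 365)),
]


def pick_retention_policy(lookback: timedelta) -> str:
    """Return the finest-resolution policy that still covers `lookback`."""
    for name, _resolution, retention in RETENTION_POLICIES:
        if lookback <= retention:
            return name
    raise ValueError("requested range exceeds the longest retention policy")


def build_query(measurement: str, lookback: timedelta) -> str:
    """Build an InfluxQL query for `mean_value` over the last `lookback`."""
    policy = pick_retention_policy(lookback)
    hours = int(lookback.total_seconds() // 3600)
    return (
        f'SELECT mean_value FROM "{policy}"."{measurement}" '
        f"WHERE time > now() - {hours}h"
    )


print(build_query("cpu_percent", timedelta(days=3)))
```

Queries that do not name a retention policy use the default one (one_week), so longer time ranges need the policy spelled out explicitly.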

Prometheus

MONIT provides a central PromQL-compatible storage, currently based on Grafana Mimir. The storage is split by tenant, and each tenant has its own limits.