Data Storage
The following backends are used today by MONIT: HDFS, OpenSearch, InfluxDB, Mimir.
HDFS
The monitoring data in HDFS is stored under `/project/monitoring/archive`.
OpenSearch
As part of the MONIT service, several OpenSearch clusters are provided:
- monit-opensearch.cern.ch: stores metrics for short-term periods (max 90 days).
- monit-opensearch-lt.cern.ch: stores metrics for long-term periods (some data has no expiration time).
- monit-timber.cern.ch: stores logs that the service managers consider public (i.e. they don't contain sensitive information).
- monit-timberprivate.cern.ch: stores logs that the service managers consider private; access is fully managed through tenancy and security roles.
- monit-syslog.cern.ch: stores system logs from the DC nodes managed by Puppet.
InfluxDB
Collectd data is stored in InfluxDB. Depending on the type of data (base monitoring or service monitoring), it ends up in a different database.
Base monitoring
For base monitoring there are currently 15 InfluxDB instances, one per base plugin. Each runs on a different port, but they all share the same read-only account and the same database name (`monit_production_collectd`).
Service monitoring
By default, service-specific metrics are stored in a common InfluxDB instance (`dbod-m-ctd`), split into different databases. These databases are based on the service and the top-level hostgroups of the data. As an exception, some metrics are stored in dedicated instances.
Measurements
InfluxDB has the concept of a measurement, which is similar to an SQL table. For the Collectd use case, measurements are named "plugin"_"type", so for example the following document:
time plugin type
---- ------ ----
1515486329000000000 cpu percent
will end up in a measurement called "cpu_percent".
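The naming rule can be illustrated with a minimal Python sketch. The function name `measurement_name` is hypothetical and not part of any MONIT tooling; it only joins the two Collectd fields as described above.

```python
def measurement_name(plugin: str, type_: str) -> str:
    """Return the InfluxDB measurement name for a Collectd plugin/type pair."""
    # Measurements are named "<plugin>_<type>".
    return f"{plugin}_{type_}"


print(measurement_name("cpu", "percent"))  # cpu_percent
```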
Data schema
Due to InfluxDB restrictions, the InfluxDB data schema differs from the rest of the infrastructure. However, all the Collectd databases share the same schema.
The data contains three main types of fields:
- Monitoring metadata such as host, submitter_environment, submitter_hostgroup and toplevel_hostgroup, providing information about the host environment.
- Collectd metadata such as plugin, plugin_instance, type and type_instance, providing information about the Collectd namespace.
- Collectd data such as *_value, representing the different aggregations provided for Collectd data.
Note: Collectd has the concept of multi-metrics, which means the same document might carry multiple values, e.g. the network plugin comes with rate_in and rate_out. The monitoring infrastructure splits these documents into single-value ones and promotes the value name to a new metadata-like field, "value_instance".
- time: long:milliseconds
- host: string
- max_value: double|long
- mean_value: double|long
- min_value: double|long
- plugin: string
- plugin_instance: string:UNKNOWN
- producer: string
- submitter_environment: string
- submitter_hostgroup: string
- sum_value: double|long
- toplevel_hostgroup: string
- type: string
- type_instance: string:UNKNOWN
- type_prefix: string
- value_instance: string:UNKNOWN
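The multi-metric split described in the note can be sketched in Python as follows. This is only an illustration, not the actual MONIT processing code: the function name `split_multi_metric`, the input key `values` and the output key `value` are all assumptions made for the example, and the host name is a placeholder.

```python
from typing import Any, Dict, List


def split_multi_metric(document: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split a multi-value document into single-value documents."""
    # "values" is a hypothetical key holding the named values of the document.
    values = document["values"]
    # Everything except the values is metadata copied into each output document.
    metadata = {k: v for k, v in document.items() if k != "values"}
    # The value name is promoted to the "value_instance" metadata field.
    return [
        {**metadata, "value_instance": name, "value": value}
        for name, value in values.items()
    ]


doc = {
    "host": "node1.example.org",  # placeholder host name
    "plugin": "network",
    "values": {"rate_in": 10.0, "rate_out": 4.0},
}
for single in split_multi_metric(doc):
    print(single)
```

Each output document carries exactly one value, so it can be stored under the single-value schema listed above.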
Data binning
To reduce the resource usage of the InfluxDB databases and extend the time we can keep the data, we have set up three retention policies for Collectd data:
- one_week: one-minute resolution data that is kept for one week (default).
- one_month: five-minute resolution data that is kept for one month.
- five_years: one-hour resolution data that is kept for five years.
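When querying, the retention policy has to match how far back the query looks. The Python sketch below picks the finest-resolution policy that still covers a given lookback window and builds an InfluxQL query that addresses it with the `"retention_policy"."measurement"` form. The helper names, the 30-day month and the 365-day year are assumptions of this example. So is the idea that the downsampled policies keep the same measurement and field names (`cpu_percent`, `mean_value`), which should be checked against the actual database.

```python
from datetime import timedelta

# Retention policies from the list above, finest resolution first.
# The 30-day month and 365-day year are approximations assumed for this sketch.
RETENTION_POLICIES = [
    ("one_week", timedelta(weeks=1)),
    ("one_month", timedelta(days=30)),
    ("five_years", timedelta(days=5 * 365)),
]


def pick_retention_policy(lookback: timedelta) -> str:
    """Return the finest-resolution policy that still holds data `lookback` old."""
    for name, retention in RETENTION_POLICIES:
        if lookback <= retention:
            return name
    raise ValueError("no retention policy keeps data that long")


def build_query(measurement: str, lookback: timedelta) -> str:
    """Build an InfluxQL query against the policy chosen for `lookback`."""
    policy = pick_retention_policy(lookback)
    seconds = int(lookback.total_seconds())
    return (
        f'SELECT "mean_value" FROM "{policy}"."{measurement}" '
        f"WHERE time > now() - {seconds}s"
    )


print(build_query("cpu_percent", timedelta(days=20)))
```

Queries that do not name a retention policy go to the default one (one_week), so longer time ranges need the explicit form.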
Prometheus
MONIT provides a central PromQL-compatible storage, currently based on Grafana Mimir. The storage is split by tenant, and each tenant has its own limits.