Data Storage
The following backends are used today by MONIT: HDFS, OpenSearch, InfluxDB.
HDFS
The monitoring data in HDFS is stored under `/project/monitoring/archive`.
OpenSearch
InfluxDB
Collectd data is stored in InfluxDB. Depending on the type of data (base monitoring or service monitoring), it ends up in a different database.
Base monitoring
Base monitoring currently uses a total of 15 InfluxDB instances, one per base plugin. Each instance runs on a different port, but they all share the same read-only account and the same database name (`monit_production_collectd`).
Base metrics | Port |
---|---|
dbod-m-c-cpu | 8080 |
dbod-m-c-df | 8081 |
dbod-m-c-disk | 8082 |
dbod-m-c-inte | 8085 |
dbod-m-c-irq | 8086 |
dbod-m-c-load | 8087 |
dbod-m-c-memo | 8088 |
dbod-m-c-moni | 8080 |
dbod-m-c-pupp | 8083 |
dbod-m-c-proc | 8089 |
dbod-m-c-swap | 8090 |
dbod-m-c-tcpc | 8091 |
dbod-m-c-upti | 8092 |
dbod-m-c-user | 8093 |
dbod-m-c-vmem | 8094 |
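As a minimal sketch of how a client could resolve the table above into a connection target, the following Python snippet maps each instance name to its port. The `endpoint` helper, the `https` scheme and the `domain` argument are assumptions made for illustration; the actual host domain and access method are not stated here.

```python
# Instance names and ports copied from the table above.
BASE_INSTANCES = {
    "dbod-m-c-cpu": 8080,
    "dbod-m-c-df": 8081,
    "dbod-m-c-disk": 8082,
    "dbod-m-c-inte": 8085,
    "dbod-m-c-irq": 8086,
    "dbod-m-c-load": 8087,
    "dbod-m-c-memo": 8088,
    "dbod-m-c-moni": 8080,
    "dbod-m-c-pupp": 8083,
    "dbod-m-c-proc": 8089,
    "dbod-m-c-swap": 8090,
    "dbod-m-c-tcpc": 8091,
    "dbod-m-c-upti": 8092,
    "dbod-m-c-user": 8093,
    "dbod-m-c-vmem": 8094,
}

# All base monitoring instances share the same database name.
BASE_DATABASE = "monit_production_collectd"


def endpoint(instance: str, domain: str) -> str:
    """Build a URL for a base monitoring instance.

    `domain` and the https scheme are assumptions: replace them with the
    values that apply to your environment.
    """
    port = BASE_INSTANCES[instance]  # raises KeyError for unknown instances
    return f"https://{instance}.{domain}:{port}"
```

For example, `endpoint("dbod-m-c-df", "example.org")` builds a URL on port 8081, to be queried against the `monit_production_collectd` database with the shared read-only account.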
Service monitoring
By default, service-specific metrics are stored in a common InfluxDB instance (`dbod-m-ctd`), split into different databases. These databases are based on the service and the top-level hostgroups of the data. As an exception, some metrics are stored in dedicated instances.
Service metrics | Port |
---|---|
dbod-m-ctd | 8084 |
dbod-m-batch | 8080 |
dbod-m-cld | 8091 |
dbod-m-mig | 8080 |
dbod-m-moni | 8080 |
Measurements
In InfluxDB, a measurement is similar to an SQL table. For the Collectd use case, measurements are named after the "plugin"_"type" of the data, so for example the following document:
time plugin type
---- ------ ----
1515486329000000000 cpu percent
will end up in a measurement called "cpu_percent".
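The naming rule can be sketched in a couple of lines of Python. The function name `measurement_name` is hypothetical and only illustrates the "plugin"_"type" convention described above.

```python
def measurement_name(plugin: str, type_: str) -> str:
    """Join the Collectd plugin and type with an underscore."""
    return f"{plugin}_{type_}"


# The document from the example above maps to the "cpu_percent" measurement.
print(measurement_name("cpu", "percent"))
```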
Data schema
Due to InfluxDB restrictions, the InfluxDB data schema differs from the rest of the infrastructure. However, all the Collectd databases share the same schema.
The data contains three main kinds of fields:
- Monitoring metadata such as host, submitter_environment, submitter_hostgroup and toplevel_hostgroup, which provide information about the host environment.
- Collectd metadata such as plugin, plugin_instance, type and type_instance, which provide information about the Collectd namespace.
- Collectd data such as *_value, which represent the different aggregations provided for Collectd data.
Note: Collectd has the concept of multi-metrics, which means the same document might have multiple values; e.g. the network plugin comes with rate_in and rate_out. In the monitoring infrastructure we split these documents into single-value ones, and we promote the value name to a new metadata-like field, "value_instance".
- time: long:milliseconds
- host: string
- max_value: double|long
- mean_value: double|long
- min_value: double|long
- plugin: string
- plugin_instance: string|UNKNOWN
- producer: string
- submitter_environment: string
- submitter_hostgroup: string
- sum_value: double|long
- toplevel_hostgroup: string
- type: string
- type_instance: string:UNKNOWN
- type_prefix: string
- value_instance: string:UNKNOWN
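The multi-metric split described in the note above can be sketched as follows. This is an illustration only, not the pipeline's actual code: the function name `split_multi_metric` and the output field name `value` are assumptions, while `value_instance` is the field named in the schema.

```python
from typing import Any


def split_multi_metric(
    doc: dict[str, Any], value_names: list[str]
) -> list[dict[str, Any]]:
    """Split one multi-value document into single-value documents.

    Each output document keeps the shared metadata, carries exactly one
    value (under the assumed key "value") and records which value it was
    in "value_instance".
    """
    metadata = {k: v for k, v in doc.items() if k not in value_names}
    return [
        {**metadata, "value": doc[name], "value_instance": name}
        for name in value_names
        if name in doc
    ]


# Hypothetical network document with two values.
network_doc = {"host": "node1", "plugin": "network", "rate_in": 10.0, "rate_out": 4.0}
single_docs = split_multi_metric(network_doc, ["rate_in", "rate_out"])
```

Here `single_docs` holds two documents, one with `value_instance` set to `rate_in` and one with `rate_out`, both sharing the same `host` and `plugin` metadata.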
Data binning
To reduce the resource usage of the InfluxDB databases and extend the amount of time we can keep the data, we have set up three different retention policies for Collectd data.
- one_week: one-minute resolution data that is kept for one week (the default policy).
- one_month: five-minute resolution data that is kept for one month.
- five_years: one-hour resolution data that is kept for five years.
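A minimal sketch of how a client might pick a retention policy from the age of the data it needs and build a fully qualified InfluxQL query. The helper names, the 30-day month and the 365-day year are assumptions made for illustration; the policy names and the database name come from this page.

```python
from datetime import timedelta

# Policies ordered from finest to coarsest resolution.
# The 30-day and 365-day approximations are assumptions.
RETENTION_POLICIES = [
    ("one_week", timedelta(weeks=1)),
    ("one_month", timedelta(days=30)),
    ("five_years", timedelta(days=5 * 365)),
]


def pick_retention_policy(age: timedelta) -> str:
    """Return the finest-resolution policy that still holds data this old."""
    for name, keep in RETENTION_POLICIES:
        if age <= keep:
            return name
    raise ValueError("data older than five years is not retained")


def build_query(
    measurement: str,
    field: str,
    age: timedelta,
    database: str = "monit_production_collectd",
) -> str:
    """Build an InfluxQL query using "database"."policy"."measurement"."""
    policy = pick_retention_policy(age)
    seconds = int(age.total_seconds())
    return (
        f'SELECT "{field}" FROM "{database}"."{policy}"."{measurement}" '
        f"WHERE time > now() - {seconds}s"
    )
```

For instance, `build_query("cpu_percent", "mean_value", timedelta(days=20))` targets the `one_month` policy, since 20-day-old data is no longer available at one-minute resolution.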