MONIT Agent

The MONIT agent has existed since the beginning of MONIT, although it was never named as such. It refers to all the daemons and tools that the MONIT team installs on the data centre hosts to collect base monitoring information.

Old Agent

The old MONIT agent is based on Collectd (for metric collection and alarm evaluation) and Apache Flume (for forwarding metrics, alarms and syslog).

All this is configured through the cerncollectd module in Puppet, so please check the module documentation for more information on how to use it.

Current Agent

As required by the department, we started evaluating alternative tools in order to improve the monitoring offering and to converge on a single system for Puppet, non-Puppet and Kubernetes nodes. We chose Prometheus as the ecosystem to focus on.

The new agent is currently based on exporters such as node exporter (for exposing metrics) and fluent-bit (for collecting and forwarding metrics and logs). This allows us to keep using Collectd as a metric collection tool (since it comes with a Prometheus exporter), thus preserving the plugins developed by service managers while still integrating them into the new ecosystem.

We have kept the support for Collectd metrics in this new agent, both in InfluxDB (to be retired in the future) and in Prometheus backends. The current recommendation is to not develop/invest in Collectd plugins and instead focus on Prometheus exporters when configuring new metrics.

All this is configured with a new module named monitoring in Puppet.

Disabling node exporter + otlp flow

By default, the flow to Prometheus is enabled, as this is the preferred way to send metrics. We understand that not all service managers are ready for it yet, so it can be disabled.

To do so, please set the following hiera variable:

monitoring::monit_agent::node_exporter_otlp: false

This will take care of removing node exporter as well as unsetting some default configuration.

Extending monit agent metrics input

If you want to extend the monit agent capabilities with new Prometheus endpoints, there are several ways of doing so:

Enable more collectors in node exporter

By default, the list of enabled collectors in node exporter is limited to base monitoring. You can extend this list by setting a hiera variable like the one below. Please check the official node exporter documentation for the list of available collectors.

monitoring::monit_agent::node_exporter_enabled_collectors:
- newcollector
- anothernewcollector

Extend the scraping to new Prometheus endpoints

As part of the monit agent, we provide a wrapper that allows you to configure fluent-bit as a scraper for Prometheus endpoints (local or remote). To do so, you can use the following hiera variable:

monitoring::monit_agent::user_metrics_scrape_targets:
  remote1.cern.ch:
    2001:
      tag: mytag
      path: /metrics
      interval: 5s
      labels:
        add:
          my_new_label: value
        remove: [default_label]
    2002:
      tag: mytag2
      path: /metrics2
      interval: 52s
      tls:
        enabled: true
    2003:
      tag: mytag2
      path: /metrics2
      http_user: login
      http_passwd: <tbag_key>
      interval: 52s
      tls:
        enabled: true
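
To make the nesting of this hash easier to read (host, then port, then the options of each endpoint), the following Ruby sketch expands it into one scrape URL per endpoint. It is only an illustration and not the wrapper's implementation: the helper name `scrape_urls`, the `/metrics` default path and the mapping of `tls.enabled` to an `https` scheme are assumptions made for this example.

```ruby
require 'yaml'

# Hypothetical helper (not part of the monitoring module): expand the
# host -> port -> options hash into one URL per scrape target.
# Assumptions: 'path' defaults to /metrics and tls.enabled selects https.
def scrape_urls(targets)
  targets.flat_map do |host, ports|
    ports.map do |port, opts|
      opts ||= {}
      scheme = opts.dig('tls', 'enabled') ? 'https' : 'http'
      path = opts.fetch('path', '/metrics')
      "#{scheme}://#{host}:#{port}#{path}"
    end
  end
end

# Shortened copy of the hiera example above.
hiera = YAML.safe_load(<<~HIERA)
  monitoring::monit_agent::user_metrics_scrape_targets:
    remote1.cern.ch:
      2001:
        tag: mytag
        path: /metrics
        interval: 5s
      2002:
        tag: mytag2
        path: /metrics2
        interval: 52s
        tls:
          enabled: true
HIERA

urls = scrape_urls(hiera['monitoring::monit_agent::user_metrics_scrape_targets'])
puts urls
```

Each port under a host is an independent scrape target with its own tag, path, interval and credentials, so several endpoints on the same machine can be scraped with different settings.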

Send data as a different tenant (defaults to toplevel hostgroup)

If you want specific metric flows to send data to tenants other than the "toplevel_hostgroup" one, this is possible with a few configuration options.

The first thing you need to do is to set up the new tenant configuration:

monitoring::monit_agent::user_extra_tenants:
  new_tenant: tbag_secret_key_with_newtenant_password

Second, you need to specify which of the custom scrape targets should use the new tenant as their destination:

monitoring::monit_agent::user_metrics_scrape_targets:
  remote1.cern.ch:
    2001:
      tag: mytag
      tenant: new_tenant
      path: /metrics
      interval: 5s

Please note that if the configured tenants and the tenants actually used do not match, Puppet will raise an error.
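
The kind of consistency this implies can be sketched in Ruby as follows. This is a hypothetical check written for illustration only: the module's actual validation is not shown on this page, and the helper name `tenant_mismatches` and the `other_tenant` value are made up for the example. It compares the tenants declared in `user_extra_tenants` with those referenced by the scrape targets, in both directions.

```ruby
require 'yaml'

# Hypothetical check (not the module's real validation): report tenants that
# are used by a scrape target but not declared, and declared but never used.
def tenant_mismatches(extra_tenants, targets)
  declared = (extra_tenants || {}).keys
  used = targets.values
                .flat_map(&:values)
                .map { |opts| (opts || {})['tenant'] }
                .compact
                .uniq
  { undeclared: used - declared, unused: declared - used }
end

hiera = YAML.safe_load(<<~HIERA)
  monitoring::monit_agent::user_extra_tenants:
    new_tenant: tbag_secret_key_with_newtenant_password
  monitoring::monit_agent::user_metrics_scrape_targets:
    remote1.cern.ch:
      2001:
        tag: mytag
        tenant: new_tenant
        path: /metrics
        interval: 5s
      2002:
        tag: mytag2
        tenant: other_tenant   # made-up tenant, never declared above
        path: /metrics2
HIERA

result = tenant_mismatches(
  hiera['monitoring::monit_agent::user_extra_tenants'],
  hiera['monitoring::monit_agent::user_metrics_scrape_targets']
)
puts result.inspect
```

Running a check like this against your hiera data before committing can catch a misspelled tenant name earlier than a failed Puppet run.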

Phasing in the new agent

The move to the new agent is done in three phases. The first keeps the data as it is but replaces Flume with fluent-bit for data forwarding (2024 Q1). The second enables the Prometheus integration in parallel with the current flow (2024). The third and last switches dashboards and data storage to Prometheus only (2025).

Disable Flume replacement

Since the first phase should already be deployed for everyone as part of the base Puppet definition, here is how to disable it in case it is causing serious issues in your hostgroup. Please note that this should only be done temporarily while the underlying issue is being fixed (the old stack will not be maintained further), so please contact us if you need any help.

To disable the new monit agent, service managers need to set the following hiera variables:

monitoring::monit_agent: false
monitoring::monit_agent::flume_replacement: false
cerncollectd::enable_flume_replacement: false

Remove Flume completely

If you are absolutely sure that you are not using Flume in your instances and want to remove any Flume leftovers, you can use the following cerncollectd flag to clean them up:

cerncollectd::ensure_flume_is_removed: true

This will remove any log, configuration and service files created by Flume. It will also stop and remove the service and uninstall the package.

Special cases

Puppet hostgroup tests start failing after enabling the new agent

Note: Some people have reported issues when loading the fixtures after adding the monitoring module. This is a known issue, and the current workaround is to add a pre-condition to the spec file to make sure cerncollectd is included before monitoring.

require 'spec_helper'

describe 'hg_myhostgroup' do
  # Make sure cerncollectd is included before monitoring
  let(:pre_condition) do
    ['
        include cerncollectd
    ']
  end

  # ... existing examples for the hostgroup ...
end

Running fluent-bit before MONIT installs the new one

Enabling the new agent also enables the new MONIT repositories on the machine, which contain the package "monit-fluent-bit".

Although it has a different name, it has been built as a virtual package for fluent-bit, so if you were already using fluent-bit the new package will be identified as an upgrade.

This by itself should not be a big issue (unless there is a breaking change between versions), but the way the new package's service works might cause problems. The new package ships with a templated service, "fluent-bit@", which allows multiple instances of fluent-bit to run on a single machine.

[Unit]
Description=Fluent Bit daemon
Documentation=https://docs.fluentbit.io/manual/
Requires=network.target
After=network.target

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/fluent-bit.d/%i
ExecStart=/usr/sbin/fluent-bit $FLUENT_BIT_OPTIONS
Restart=always

SyslogIdentifier=fluent-bit@%i

[Install]
WantedBy=multi-user.target

You will therefore need to adapt your current configuration to work with this. There are a few ways of managing it, but we recommend using the wrapper we provide.

    # Fluent-bit configuration constants.
    $fluentbit_agent_name = 'my-agent'
    $fluentbit_service_name = "fluent-bit@${fluentbit_agent_name}.service"
    $fluentbit_agent_config_base_dir = "/etc/fluent-bit/${fluentbit_agent_name}"

    # Instantiate a fluent-bit service as monit-agent
    monitoring::monit_agent::forwarders::fluentbit::agent { $fluentbit_agent_name: }

This makes sure that the service environment configuration for "fluent-bit@my-agent" exists, together with all the folders needed to hold your configuration.
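
As a small illustration of how the instance name ties these pieces together, the Ruby sketch below mimics the `%i` substitution that systemd performs in the template unit shown above. The helper name `fluentbit_instance` is hypothetical, and the sketch only covers simple instance names that need no systemd escaping.

```ruby
# Illustration only: derive the names that the fluent-bit@ template unit
# resolves to for a given instance name (what systemd substitutes for %i).
def fluentbit_instance(agent_name)
  {
    unit: "fluent-bit@#{agent_name}.service",
    environment_file: "/etc/sysconfig/fluent-bit.d/#{agent_name}",
    syslog_identifier: "fluent-bit@#{agent_name}"
  }
end

instance = fluentbit_instance('my-agent')
instance.each { |key, value| puts "#{key}: #{value}" }
```

With these names, the instance can be inspected with `systemctl status fluent-bit@my-agent.service`, and its log lines are tagged with the syslog identifier `fluent-bit@my-agent`.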