1 Introduction
This document describes monitoring of the IT deployments at Cerro Pachón and La Serena.
2 Goal
The goal of the IT monitoring system is to provide an interface into the state of the IT deployment, provide tools to inspect and reason about that state, and generate notifications when the system is in an undesired state.
3 Implementation
The monitoring implementation has four main concerns:
- Metrics collection
- Metrics storage and querying
- Dashboarding and visualization
- Alerting
3.1 Metrics collection
Metrics are collected by Telegraf from three main sources:
- System and application metrics
- Remote availability/liveness metrics
- Metrics relayed from an external system
3.1.1 Host-based metrics
Telegraf instances are deployed on all Puppet-managed hosts and collect system and application metrics. Standard system metrics are collected, such as CPU use, free memory, disk I/O, and so forth. In addition, application metrics can be collected from sources such as Kubernetes, Docker, and Kafka.
See the profile::core::telegraf <https://github.com/lsst-it/lsst-itconf/blob/master/site/profile/manifests/core/telegraf.pp> Puppet profile for implementation details.
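For illustration, the per-host collection amounts to a Telegraf configuration along these lines; the input plugins shown are standard Telegraf plugins, but the output URL and database name are placeholders rather than the deployed values:

    # Standard system inputs collected on every host.
    [[inputs.cpu]]
      percpu = true
      totalcpu = true

    [[inputs.mem]]

    [[inputs.disk]]

    [[inputs.diskio]]

    # Optional application inputs, enabled where the service runs.
    [[inputs.docker]]
      endpoint = "unix:///var/run/docker.sock"

    # Ship everything to the central InfluxDB instance.
    [[outputs.influxdb]]
      urls = ["https://influxdb.example.com:8086"]
      database = "telegraf"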
3.1.2 Availability metrics
Metrics tracking external availability (e.g. ping, SSH, HTTP, FTP, Gopher) must be collected independently of the hosts being monitored so that alerts can be generated when hosts or services become unreachable. This functionality is provided by standalone Telegraf instances deployed on Kubernetes. Each check runs in a separate Telegraf deployment.
Telegraf automatically monitors hosts as they are added to and removed from Foreman by way of the Foreman Telegraf Configurer (ftc). Ftc works by querying Foreman for a list of all hosts, generating Telegraf configurations to monitor a given service (DNS records, ping, etc.), and updating the Telegraf deployments and ConfigMaps in Kubernetes.
See k8s-monitoring <https://github.com/lsst-it/k8s-monitoring> for details.
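As an illustration, one generated availability check could be a standalone Telegraf instance running only the ping input against the hosts reported by Foreman; the hostnames and output settings below are placeholders, and the real configurations are produced by ftc:

    # Liveness check run from Kubernetes, independent of the target hosts.
    [[inputs.ping]]
      urls = ["host1.example.com", "host2.example.com"]
      count = 3

    [[outputs.influxdb]]
      urls = ["https://influxdb.example.com:8086"]
      database = "telegraf"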
3.1.3 Relaying metrics
Additional Telegraf instances may be deployed within Kubernetes to monitor services that cannot directly run Telegraf. Examples include devices that expose metrics via SNMP, such as the summit generator, switches and routers, and other embedded devices.
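For example, a relay instance might poll a device with Telegraf's SNMP input roughly as follows; the agent address, community string, and OID are placeholders:

    # Poll an SNMP-only device and relay the values to InfluxDB.
    [[inputs.snmp]]
      agents = ["192.0.2.10:161"]
      version = 2
      community = "public"

      [[inputs.snmp.field]]
        name = "uptime"
        oid = "RFC1213-MIB::sysUpTime.0"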
3.2 Metrics storage
InfluxDB is the core of the monitoring system, and provides a central point for metrics storage, retrieval, and querying.
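For example, a dashboard panel or an ad-hoc inspection typically issues an InfluxQL query such as the following; the measurement and field names match Telegraf defaults, and the time range is arbitrary:

    SELECT mean("usage_idle") FROM "cpu"
      WHERE time > now() - 1h
      GROUP BY time(5m), "host"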
Note
InfluxDB 2.0 is currently in beta, and when released will necessitate a partial overhaul of the IT monitoring implementation. InfluxDB 2.0 includes a new query language, Flux, and provides both alerting and dashboarding in a single package.
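As a rough sketch of what that migration implies, the InfluxQL query above would be expressed in Flux along these lines (the bucket name is illustrative):

    from(bucket: "telegraf/autogen")
      |> range(start: -1h)
      |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
      |> aggregateWindow(every: 5m, fn: mean)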
3.3 Visualization/dashboarding
Note
Dashboarding and alerting are explicitly decoupled so that implementations, features, and availability can be handled separately.
3.4 Alerting
Kapacitor monitors time series from InfluxDB and generates corresponding alerts.
Alerts can be generated on a variety of conditions, such as the following:
- Value above/below threshold
  - CPU idle too low
  - Swap utilization too high
- Standard deviation of a time series exceeded
  - System load repeatedly jumps to 100 and then falls back to 0
- Deadman alert when a time series is not sent
  - Telegraf on a given system is no longer submitting metrics
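To make the threshold and deadman cases above concrete, minimal Kapacitor checks of this kind look roughly like the following TICKscript; the thresholds, measurement names, and Slack handler are illustrative rather than the deployed checks:

    // Threshold check: CPU idle too low on any host.
    stream
        |from()
            .measurement('cpu')
            .groupBy('host')
        |alert()
            .crit(lambda: "usage_idle" < 10.0)
            .message('CPU idle low on {{ index .Tags "host" }}')
            .slack()

    // Deadman check: a host has stopped submitting metrics.
    stream
        |from()
            .measurement('cpu')
            .groupBy('host')
        |deadman(0.0, 10m)
            .slack()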
Note
As noted above, the advent of InfluxDB 2.0 GA will require that IT rewrite monitoring checks in Flux from their current form as TICKscript. While this rewrite will take time and effort, development in TICKscript is painful and costly, so the rewrite should ultimately reduce the barrier to entry for writing alerts.
Note that Kapacitor must be directly reachable from InfluxDB: Kapacitor registers InfluxDB subscriptions <https://docs.influxdata.com/kapacitor/v1.5/administration/subscription-management/>, which InfluxDB uses to push metrics to Kapacitor. Without a direct connection, Kapacitor cannot operate.
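When troubleshooting this link, the subscriptions InfluxDB will push to can be listed directly with InfluxQL:

    SHOW SUBSCRIPTIONS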
3.4.1 Alert management
Alert management is the process of aggregating alerts generated by the monitoring system, deduplicating alerts generated by the same failure, and acknowledging/silencing alerts for known failures.
At present alert management is a weak spot in the monitoring implementation: alert notifications are generated by sending a one-time message to Slack for a given alert.
Open issues:
- Flap detection and mitigation
- Alert deduplication
- Maintenance windows/alert silencing
The following systems were tested and found lacking:
- Kapacitor has some rudimentary shims where alert management can be injected, but there is no out-of-the-box solution.
- Prometheus Alertmanager has most of the desired functionality but integrating Kapacitor and Alertmanager is particularly painful, and might not be possible without non-trivial development.
- Alerta was tested because Kapacitor supports Alerta as a first-class citizen, and Alerta provides all required functionality. Unfortunately, Alerta does not appear to be mature enough to be an integral part of the monitoring implementation.
Prometheus Alertmanager is a suitable option where Prometheus is used, but for the InfluxData side of the equation the available options in the open-source space are insufficient. Proprietary options such as VictorOps or PagerDuty should be considered to solve this problem.