Incident Description
The BRIAN Sensu cluster, which schedules SNMP polling checks, had an outage of approximately 1 hour in total, spanning Sunday/Monday. No counters were fetched from routers and saved to InfluxDB during this time.
The reason for degradation:
- A network outage caused loss of connectivity between all Sensu cluster nodes (prod-poller-sensu-agent(01|02|03).geant.org).
- This resulted in complete loss of clustering, causing Sensu to unschedule all checks.
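The loss of clustering described above can be illustrated with a majority-quorum calculation. This is a minimal sketch under an assumption this report does not confirm: that the Sensu backends coordinate through a majority-quorum store, as etcd-based clusters do. The function names are hypothetical and used only for illustration; the host names are expanded from the pattern given above.

```python
# Assumption: the cluster needs a strict majority of members to operate,
# as in etcd-style (Raft) clusters. The report does not state the mechanism.

NODES = [
    "prod-poller-sensu-agent01.geant.org",
    "prod-poller-sensu-agent02.geant.org",
    "prod-poller-sensu-agent03.geant.org",
]


def quorum_size(cluster_size):
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable_members, cluster_size):
    """True if a node that can reach `reachable_members` members
    (counting itself) is part of a majority."""
    return reachable_members >= quorum_size(cluster_size)


cluster_size = len(NODES)

# Full partition: every node can reach only itself.
for node in NODES:
    print(node, "has quorum:", has_quorum(1, cluster_size))
```

Under this assumption a three-node cluster tolerates the loss of one member, but when every node is isolated from the other two, no node holds a majority and the cluster cannot schedule work.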
The impact of this service degradation was:
- No interfaces were polled between approximately 16:00 and 16:50 UTC, resulting in loss of data on the Production BRIAN instance.
- No interfaces were polled between approximately 10:20 and 10:35 UTC while the degraded Sensu cluster was rebooted, resulting in further loss of data on the Production BRIAN instance.
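As a quick cross-check of the "approximately 1 hour" figure, the two polling gaps listed above can be summed. This is a small sketch: the helper names are hypothetical, the times are the approximate ones from this report, and each window is assumed not to cross midnight.

```python
from datetime import datetime

# Polling gaps as reported above (UTC). The report gives only times of day
# for these windows, so calendar dates are deliberately left out.
GAPS = [("16:00", "16:50"), ("10:20", "10:35")]


def gap_minutes(start, end):
    """Length of one window in minutes (assumes start and end are on the same day)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def total_gap_minutes(gaps):
    """Sum of all window lengths in minutes."""
    return sum(gap_minutes(start, end) for start, end in gaps)


print(total_gap_minutes(GAPS))  # 65
```

The two windows add up to about 65 minutes without polling, which is consistent with the stated total of roughly one hour.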
Incident severity: CRITICAL (temporary service outage)
Data loss: YES
Total duration of incident: ~18 hours
Timeline
All times are in UTC
| Date | Time (UTC) | Description |
|---|---|---|
| | 12:52:37 | The first evidence of this incident appeared in the logs of |
| 31 May 2022 | 11:56 | Keith Slater informed APMs - BRIAN is back to normal operation. |
Proposed Solution
- TBD