Incident Description
Newly collected BRIAN bitrate traffic data was not available for approximately 21 hours.
The reasons for the degradation were:
- cf. IT incident: 30052022
- Local system partition corruption
- Failure to connect or write data to InfluxDB
The impact of this service degradation was:
- Data points between 12:50 UTC (30 May) and 09:30 UTC (31 May) were temporarily not visible in the BRIAN GUI
Incident severity: CRITICAL (temporary service outage)
Data loss: NO
Total duration of incident: 21 hours
Timeline
All times are in UTC
| Date | Time (UTC) | Description |
|---|---|---|
| 30 May 2022 | 12:52:37 | The first evidence of this incident appeared in the logs of |
| | afternoon | Several performance issues started being reported across the network: |
| | 19:08 | Keith Slater (and others) alerted on the |
| | 20:30 | Bjarke Madsen replied that it seemed related to the service problems seen earlier in the day |
| | 21:12 | Massimiliano Adamo replied on |
| | 23:28 | Linda Ness sent a mail to gn4-3-all@lists.geant.org indicating that several services were down |
| | 12:53 | For the duration of this event, Kapacitor continuously logged failures regarding writing to or communicating with InfluxDB, as below: This means that while Kapacitor was receiving live network counters in real time, the results of the rate calculations weren't being saved to InfluxDB. |
| | 08:12 | |
| | 02:34-08:11 | There were many incidents of disk I/O failure logged over the duration of the event, indicating filesystem/disk corruption. For example: |
| | 07:34 | Keith Slater took ownership of informing APMs |
| | 08:12 | Pete Pedersen stopped the system and fixed the corrupt partition. |
| | 08:26:55 | System was rebooted. |
| | 08:26:55 | There was a network DNS failure during the boot process, and haproxy logged:
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server prod-inventory-provider01.geant.org' : could not resolve address 'prod-inventory-provider01.geant.org'.
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server prod-inventory-provider02.geant.org' : could not resolve address 'prod-inventory-provider02.geant.org'. |
| | 08:27:07 | Since the Kapacitor tasks weren't running, network counters were still not being processed or saved to InfluxDB. |
| | 08:41:11 | At this time DNS resolution was back to normal, but |
| | 09:27:10 | Manual restart of Kapacitor. Normal BRIAN processing of real-time data was restored. |
| | 10:39 | Sam Roberts copied the data points lost during the incident from UAT to production |
| 31 May 2022 | 11:56 | Keith Slater informed APMs that BRIAN was back to normal operation. |
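The haproxy alerts at 08:26:55 show that configuration parsing aborted because the server addresses could not be resolved during boot. haproxy can be configured to tolerate this: a sketch of a hypothetical haproxy.cfg fragment using `init-addr last,libc,none` and a runtime `resolvers` section (the backend name, port, and resolver address are assumptions, not the actual production configuration):

```
resolvers localdns
    nameserver dns1 127.0.0.53:53
    resolve_retries 3
    timeout resolve 1s
    hold valid 10s

backend inventory_provider
    # init-addr last,libc,none: if DNS fails at startup, fall back to the
    # last known address, then libc resolution, then start with no address
    # at all, instead of aborting config parsing.
    default-server init-addr last,libc,none check
    server prod-inventory-provider01.geant.org prod-inventory-provider01.geant.org:443 resolvers localdns
    server prod-inventory-provider02.geant.org prod-inventory-provider02.geant.org:443 resolvers localdns
```

With `resolvers`, a server whose address could not be resolved at boot is retried at runtime once DNS recovers, rather than staying unresolved until a restart.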
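The 09:27:10 entry shows that Kapacitor required a manual restart to resume processing. One option, if the service unit does not already do this, is a systemd drop-in so the service is restarted automatically after a failure (the path and values below are illustrative, and this would not help if the process stays up with its tasks wedged):

```
# Hypothetical drop-in, e.g. /etc/systemd/system/kapacitor.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=10s
```

After adding a drop-in, `systemctl daemon-reload` is needed for it to take effect.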
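The 12:53 timeline entry notes that Kapacitor kept computing rates from live counters but the results could not be written to InfluxDB and were lost. A minimal sketch of one possible mitigation, a writer that buffers computed points locally and retries once the backend is reachable again (all names here are hypothetical, not BRIAN's actual code):

```python
# Hypothetical sketch: buffer computed data points locally and retry
# writes, so rate-calculation results survive a period when the
# backend (e.g. InfluxDB) is unreachable instead of being dropped.
from collections import deque


class BufferedWriter:
    def __init__(self, write_fn, max_buffer=100_000):
        # write_fn writes a single point and raises ConnectionError
        # (or similar) while the backend is down.
        self.write_fn = write_fn
        self.buffer = deque(maxlen=max_buffer)

    def add(self, point):
        """Queue a computed point for writing."""
        self.buffer.append(point)

    def flush(self):
        """Attempt to write all buffered points in order; points stay
        in the buffer if the backend is still unreachable.
        Returns the number of points successfully written."""
        written = 0
        while self.buffer:
            try:
                self.write_fn(self.buffer[0])
            except ConnectionError:
                break  # backend still down; keep remaining points
            self.buffer.popleft()
            written += 1
        return written
```

With a bounded buffer, a long enough outage eventually drops the oldest points (the `deque(maxlen=...)` behaviour), so the buffer size trades memory against how long an outage can be bridged.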
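At 10:39, the lost data points were copied from UAT back to production. A sketch of how such a copy might be scripted against an InfluxDB 1.x-style client (the `query`/`write_points` call shapes follow the influxdb Python client; the measurement name, time range handling, and batching are assumptions, and this is not necessarily the procedure actually used):

```python
# Hypothetical sketch: copy a time range of points from one InfluxDB
# (e.g. UAT) to another (e.g. production). src and dst are assumed to be
# client objects exposing query(...).get_points() and write_points(...),
# as the influxdb 1.x Python client does.

def to_write_points(measurement, points):
    """Convert raw query-result rows into the write_points payload
    shape, dropping the time key from fields and any null values."""
    payload = []
    for p in points:
        fields = {k: v for k, v in p.items() if k != "time" and v is not None}
        payload.append({"measurement": measurement,
                        "time": p["time"],
                        "fields": fields})
    return payload


def copy_range(src, dst, measurement, start, end, batch=5000):
    """Read all points of one measurement in [start, end] from src
    and write them to dst in batches."""
    q = (f'SELECT * FROM "{measurement}" '
         f"WHERE time >= '{start}' AND time <= '{end}'")
    points = list(src.query(q).get_points())
    payload = to_write_points(measurement, points)
    for i in range(0, len(payload), batch):
        dst.write_points(payload[i:i + batch])

# usage sketch (hosts/database names are assumptions):
# from influxdb import InfluxDBClient
# src = InfluxDBClient(host="uat-influx.example.org", database="brian")
# dst = InfluxDBClient(host="prod-influx.example.org", database="brian")
# copy_range(src, dst, "rates", "2022-05-30T12:50:00Z", "2022-05-31T09:30:00Z")
```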
Proposed Solution
- The core issue appears to be related to VMware, and IT needs to provide a solution. S.M.A.R.T. alerts have been found in vCenter, but monitoring has not been configured to detect these alerts.
- This incident suggests that a previously logged technical-debt issue (POL1-529), which had been considered medium/low priority, could be prioritized for development:
- fixing this issue could generally help with temporary DNS resolution errors; however, the DNS issues were secondary in this incident, and fixing it would not have prevented the overall outage
- while VMware disk corruption and network DNS failures are external events outside the control of SWD, a further investigation into potential improvements in processing resiliency is described in POL1-607.
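The first bullet notes that S.M.A.R.T. alerts were visible in vCenter but nothing was monitoring them. At the host or guest level, a small check built on smartmontools' JSON output could feed an existing alerting system; a sketch (the device path and how the check is scheduled are left to the deployer):

```python
# Hypothetical sketch: check disk health via smartctl's JSON output
# (smartmontools 7+ supports the -j flag). Running smartctl usually
# requires root, and the device path is an assumption.
import json
import subprocess


def smart_healthy(report: dict) -> bool:
    """Interpret the JSON output of `smartctl -j -H <device>`.
    Returns True only if the overall health self-assessment passed."""
    return bool(report.get("smart_status", {}).get("passed"))


def check_device(device: str) -> bool:
    out = subprocess.run(["smartctl", "-j", "-H", device],
                         capture_output=True, text=True)
    return smart_healthy(json.loads(out.stdout))

# usage sketch: page an operator if check_device("/dev/sda") is False
```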
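For the DNS-resolution technical debt (POL1-529), the general technique is retrying resolution with backoff so a transient resolver failure does not immediately abort the caller; a sketch of that approach (an illustration only, not POL1-529's actual design):

```python
# Hypothetical sketch: retry DNS resolution with exponential backoff
# so that a transient resolver failure is masked from the caller.
import socket
import time


def resolve_with_retry(hostname, attempts=3, base_delay=0.5):
    """Resolve hostname, retrying on socket.gaierror with exponential
    backoff; re-raises the last error if all attempts fail."""
    for i in range(attempts):
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)
```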