Incident description
The host names opsdb1.dante.net and opsdb2.dante.net could not be resolved from either of the Dashboard boxes. The root-cause incident is described at: DNS Outage 2019-02-27
Incident severity: CRITICAL
Data loss: NO
Timeline
Time (CET) | Event |
---|---|
28 Jul, 22:00 | Issue reported by OC |
28 Jul, 22:30 | Picked up by Robert L |
28 Jul, 22:45 | Fixed by updating the Dashboard application to point at prod-opsdb01.geant.net (and 02). |
29 Jul, 08:10 | The issue was reported to DevOps for root cause analysis. Details at: DNS Outage 2019-02-27 |

Fixed by turning off SSL temporarily to restore the service. Initial investigation suggested that a certificate had expired, but this later turned out not to be the case.
Further investigations were carried out to avoid such failures in future.
The actual cause identified for the failure: certificates were automatically changed as a result of IT patching.
A proposal was discussed between IT and SWD to avoid such failures in future.
Part one of the Nagios check in the proposed solution has been implemented.
New certs provided by IT were installed on the crowd servers. SSL was switched back on (crowd ↔ AD).
Total downtime: 09:14 hours.
Proposed Solution
The Dashboard application was updated to point at prod-opsdb01.geant.net (and 02).
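Because the failure was a host name that stopped resolving, a small resolution check run from the Dashboard boxes could flag a recurrence early. The following is a minimal sketch only, not the check actually deployed: the function names are hypothetical, the second host name `prod-opsdb02.geant.net` is an assumption inferred from "(and 02)", and the return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import socket
from typing import Callable, Iterable, List, Tuple

# Assumed host names: the first is from the report, the second is inferred from "(and 02)".
DEFAULT_HOSTS = ["prod-opsdb01.geant.net", "prod-opsdb02.geant.net"]


def unresolved_hosts(
    hostnames: Iterable[str],
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the host names that the resolver fails to resolve."""
    failed = []
    for name in hostnames:
        try:
            # Port is irrelevant here; only name resolution is being tested.
            resolver(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed


def nagios_status(failed: List[str]) -> Tuple[int, str]:
    """Map the list of failures to a Nagios-style (exit code, message) pair."""
    if failed:
        return 2, "CRITICAL - cannot resolve: " + ", ".join(failed)
    return 0, "OK - all hostnames resolve"


def main() -> int:
    """Check the default hosts, print the status line, return the exit code."""
    code, message = nagios_status(unresolved_hosts(DEFAULT_HOSTS))
    print(message)
    return code
```

A wrapper script would call `sys.exit(main())` so that the monitoring system sees the exit code. The `resolver` parameter exists only so the logic can be exercised without real DNS lookups.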
...