Incident description

The host names opsdb1.dante.net and opsdb2.dante.net could not be resolved from either of the two Dashboard boxes. The root cause is described in the incident report: DNS Outage 2019-02-27

Incident severity: CRITICAL

Data loss: NO

Timeline

Time (CET)

28 Jul, 22:00	Issue reported by OC
28 Jul, 22:30	Picked up by Robert L
28 Jul, 22:45	Fixed by updating the Dashboard application to point at prod-opsdb01.geant.net (and 02).
29 Jul, 08:10	Issue reported to Devops for root cause analysis. Details at: DNS Outage 2019-02-27

Entries removed from Old Version 2:

18 Jun, 23:16	Issue reported by OC
19 Jun, 07:35	Picked up by Michael H the following morning
19 Jun, 08:30	Fixed by turning off SSL temporarily to restore the service. Initial investigation suggested that a certificate had expired, but this later turned out not to be the case.
19 Jun, 10:30	Further investigations were carried out to avoid such failures in future.
19 Jun, 16:08	Actual cause of the failure identified: certificates were changed automatically as a result of IT patching.
20 Jun, 09:30	Proposal discussed between IT and SWD to avoid such failures in future.
20 Jun, 11:30	Part one of the Nagios check in the proposed solution implemented.
20 Jun, 16:30	New certs provided by IT installed on crowd servers. SSL switched back on (crowd ↔ AD).

Total downtime: 09:14 hours.

Proposed Solution

The Dashboard application was updated to point at prod-opsdb01.geant.net (and 02).
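
A resolution failure of this kind could be caught early by a small monitoring check run on the Dashboard boxes. The following is only a minimal sketch, not something described in this report: the function names `unresolved_hosts` and `check` are hypothetical, and the exit codes simply follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL). The resolver is passed in as a parameter so the logic can be exercised without touching real DNS.

```python
import socket
from typing import Callable, Iterable, List, Tuple

# Nagios plugin convention: 0 = OK, 2 = CRITICAL.
OK, CRITICAL = 0, 2


def unresolved_hosts(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> List[str]:
    """Return the hostnames that the resolver fails to look up."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


def check(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Tuple[int, str]:
    """Return a (status code, message) pair in Nagios plugin style."""
    failed = unresolved_hosts(hostnames, resolve)
    if failed:
        return CRITICAL, "CRITICAL - cannot resolve: " + ", ".join(failed)
    return OK, "OK - all hostnames resolve"
```

A wrapper script would call `check()` with the database host names the Dashboard application is configured to use, print the message, and exit with the returned code.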

...

Versions Compared

Old Version 2

New Version Current
