Production Dashboard Outage 2018-07-11

Incident description

Both the Primary and Secondary instances of Dashboard exhausted their inodes, leaving Dashboard unable to process any traps.

Incident severity: CRITICAL

Data Loss: YES

Timeline

Time (UTC)	Event
11 July 2018 21:09	Dashboard stops processing traps
11 July 2018 21:50	Notification of MySQL error received
12 July 2018 07:35	Michael H notified Robert L
12 July 2018 07:55	Cause identified and fixed by deleting old files to restore service as soon as possible
12 July 2018 08:30	Proposal discussed between Michael H and Robert L
12 July 2018 09:00	Michael H implemented changes on the test servers, RL confirmed that Dashboard worked as expected
12 July 2018 11:00	Michael H implemented changes on the UAT servers, RL confirmed that Dashboard worked as expected
12 July 2018 14:00	As the updates require the Dashboard servers to be stopped, RL liaised with IC to ensure that they were always using an operational Dashboard whilst the necessary changes were made
12 July 2018 14:30	Work started on Prod instances. An additional Nagios communication issue was noticed and corrected during this time
12 July 2018 16:00	All work complete and OC informed

Total Downtime: 10h 46min (21:09 on 11 July to 07:55 on 12 July)

Details of Solution

Increased the number of inodes available by building a separate file system for trap storage and configuring its inode count (this had to be done at build time).

Future mitigation

Monitor inode usage and generate email alerts if threshold is breached (80%)
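
The inode check described above could look roughly like the following minimal Python sketch. It is illustrative only and is not the monitoring that was deployed. The mount point, SMTP host and email addresses are hypothetical placeholders. Only the 80% threshold comes from this report.

```python
import os
import smtplib
from email.message import EmailMessage

THRESHOLD_PERCENT = 80.0  # alert threshold stated in this report


def inode_usage_percent(path):
    """Return the percentage of inodes in use on the file system holding path."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some file systems do not report an inode total.
        return 0.0
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


def breaches_threshold(percent_used, threshold=THRESHOLD_PERCENT):
    """True when usage has reached or passed the alert threshold."""
    return percent_used >= threshold


def send_alert(path, percent_used, smtp_host, sender, recipient):
    """Email an alert. Host and addresses are assumptions supplied by the caller."""
    msg = EmailMessage()
    msg["Subject"] = f"Inode usage {percent_used:.1f}% on {path}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Inode usage on {path} is {percent_used:.1f}%, "
        f"threshold is {THRESHOLD_PERCENT:.0f}%."
    )
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


def check(path, smtp_host, sender, recipient):
    """Check one mount point and send an email if the threshold is breached."""
    pct = inode_usage_percent(path)
    if breaches_threshold(pct):
        send_alert(path, pct, smtp_host, sender, recipient)
    return pct
```

A script like this would typically be run at a regular interval (for example from cron) against the trap storage mount point on both the Primary and Secondary instances.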

Ensure that archiving of historical traps is in place
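
One possible shape for the archiving step is sketched below, assuming traps are stored as individual files in a single directory and that age is judged by modification time. Neither assumption is confirmed by this report. The function name, directory layout and retention period are hypothetical. Packing many small files into one archive frees one inode per archived file.

```python
import os
import tarfile
import time


def archive_old_files(src_dir, archive_path, max_age_days, now=None):
    """Pack regular files older than max_age_days into a gzip tar, then delete them.

    Returns the sorted names of the files that were archived.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400

    # Collect candidates first so the directory is not modified while scanning.
    old_paths = []
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                old_paths.append(entry.path)

    if not old_paths:
        return []

    with tarfile.open(archive_path, "w:gz") as tar:
        for path in old_paths:
            tar.add(path, arcname=os.path.basename(path))

    # Delete originals only after the archive has been written and closed.
    for path in old_paths:
        os.remove(path)

    return sorted(os.path.basename(p) for p in old_paths)
```

Writing the archive to a different file system from the trap storage would keep the archive itself from competing for the same inodes and space.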
