Incident description
Both the Primary and Secondary instances of Dashboard exhausted their inodes, leaving Dashboard unable to process any traps.
Incident severity: CRITICAL
Data Loss: YES
Timeline
Time (UTC) | Event
---|---
11 July 2018 21:09 | Dashboard stops processing traps |
11 July 2018 21:50 | Notification of MySQL error received |
12 July 2018 07:35 | Michael H notified Robert L |
12 July 2018 07:55 | Cause identified; fixed by deleting old files to restore service as soon as possible |
12 July 2018 08:30 | Proposal discussed between Michael H and Robert L |
12 July 2018 09:00 | Michael H implemented changes on the test servers, RL confirmed that Dashboard worked as expected |
12 July 2018 11:00 | Michael H implemented changes on the UAT servers, RL confirmed that Dashboard worked as expected |
12 July 2018 14:00 | As the updates required Dashboard to be stopped on each server, RL liaised with IC to ensure that they always had an operational Dashboard available whilst the necessary changes were made |
12 July 2018 14:30 | Work started on Prod instances. An additional Nagios communication issue was noticed and corrected during this time |
12 July 2018 16:00 | All work complete and OC informed |
Total Downtime: 10h 46min
Details of Solution
Increased the number of inodes available by building a separate file system for trap storage and configuring the inode count explicitly (the inode count can only be set at build time); a sketch of the commands follows.
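A minimal sketch of the rebuild, assuming the trap store sits on a dedicated block device. The device path, mount point, label, and inode count here are illustrative, not the values used in production. With ext4 the inode count is fixed when the file system is created, which is why this had to be done at build time.

```bash
# Build the file system with an explicit inode count (-N); this cannot
# be raised after creation. Device, label, and count are assumptions.
mkfs.ext4 -N 8000000 -L trapstore /dev/sdb1

# Mount it and persist the mount across reboots.
mkdir -p /var/dashboard/traps
mount /dev/sdb1 /var/dashboard/traps
echo 'LABEL=trapstore /var/dashboard/traps ext4 defaults 0 2' >> /etc/fstab

# Verify the inode capacity of the new file system.
df -i /var/dashboard/traps
```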
Future mitigation
Monitor inode usage and generate email alerts if the threshold (80%) is breached; see the monitoring sketch below
Ensure that archiving of historical traps is in place; see the archiving sketch below
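A minimal sketch of the proposed inode check, suitable for running from cron. The mount point and alert address are assumptions; only the 80% threshold comes from the mitigation above.

```bash
#!/bin/sh
# Alert when inode usage on the trap file system breaches the threshold.
MOUNT=/var/dashboard/traps   # assumed mount point
THRESHOLD=80                 # threshold from the mitigation plan
ALERT_TO=ops@example.com     # hypothetical alert address

# df -Pi reports inode usage in POSIX format; IUse% is the fifth column.
USAGE=$(df -Pi "$MOUNT" | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "Inode usage on $MOUNT is ${USAGE}% (threshold ${THRESHOLD}%)" \
        | mail -s "Dashboard inode alert: $(hostname)" "$ALERT_TO"
fi
```

A cron entry such as `*/15 * * * * /usr/local/bin/check_inodes.sh` would run the check every 15 minutes; the Nagios instance mentioned in the timeline could equally host this as a service check.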
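A minimal sketch of trap archiving, also suitable for cron. The paths and the 30-day retention window are assumptions, not agreed values; the point is that deleting the originals after bundling releases their inodes.

```bash
#!/bin/sh
# Archive historical trap files and free their inodes.
TRAP_DIR=/var/dashboard/traps                                  # assumed trap store
ARCHIVE=/var/dashboard/archive/traps-$(date +%Y%m%d).tar.gz   # assumed archive path

mkdir -p "$(dirname "$ARCHIVE")"

# Bundle traps older than 30 days into one compressed archive, then
# remove the originals so their inodes are released.
find "$TRAP_DIR" -type f -mtime +30 -print0 \
    | tar --null -czf "$ARCHIVE" --files-from=- --remove-files
```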