Incident Description
The EMS service was not running on either prod-events01.geant.org or prod-events02.geant.org,
which resulted in EMS being unavailable.
The impact of this outage was:
- Users could not access EMS
- No emails from EMS were sent during the outage period
Incident severity: CRITICAL (complete service outage)
Data loss: NO
Total duration of incident: 2 days
Timeline
All times are in UTC
| Date | Time | Description |
|---|---|---|
| | 06:20 | According to the puppet and systemd logs, puppet tried to re-install EMS. The pip re-install of EMS left partial EMS directories in the Python virtual environment. This is explained on the following Stack Overflow page: https://stackoverflow.com/questions/55565760/anaconda-python-site-packages-subfolders-with-tilde-in-name-what-are-they |
| | 08:18 | A user enquired on Slack (#general) whether EMS was down |
| | 08:30 - 09:12 | Ian Galpin investigated and resolved the issue |
| | 09:12 | Service was restored and Ian Galpin informed the user on the Slack channel |
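
The partial directories mentioned in the 06:20 entry can be looked for with a short script. The following is a minimal sketch, assuming the leftovers follow the pattern described on the linked Stack Overflow page (entries in site-packages whose name starts with a tilde). The function names are hypothetical, and the location of the EMS virtual environment is not stated in this report, so the sketch takes the site-packages path as a parameter.

```python
import sysconfig
from pathlib import Path
from typing import List


def current_site_packages() -> Path:
    """Return the site-packages directory of the interpreter running this script."""
    return Path(sysconfig.get_paths()["purelib"])


def find_pip_leftovers(site_packages: Path) -> List[Path]:
    """Return entries whose name starts with '~'.

    Assumption: such entries are remnants of an interrupted pip
    uninstall/re-install, as described on the linked Stack Overflow page.
    """
    if not site_packages.is_dir():
        return []
    return sorted(p for p in site_packages.iterdir() if p.name.startswith("~"))
```

Run with the virtual environment's own Python interpreter, `find_pip_leftovers(current_site_packages())` lists any such entries. An empty list does not prove the environment is healthy; it only rules out this one symptom.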
Proposed Solution
- One reason the service outage was not detected by Sensu:
  - Sensu checks that the service is up with a standard HTTP check against the primary service hostname (https://events.geant.org), which is served by HAProxy
  - When both downstream servers (prod-events01.geant.org and prod-events02.geant.org) are unreachable, HAProxy returns a generic 'Maintenance' page
  - The 'Maintenance' page is returned with an HTTP 200 OK code, so the Sensu check marks the service as up and available
  - Massimiliano Adamo has updated HAProxy to return an HTTP error code instead of a 200 code
- A secondary reason for not detecting the outage is that we're not monitoring the service availability on the downstream servers
  - Plans are in place to add additional monitoring and checks for all SWD services and to test the outage detection and alerting
- Ian Galpin to determine why puppet and pip break the virtualenv
- Mandeep Saini to present the Incident Handling Process presentation
  - Many new team members have joined since the last time the presentation was given
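
The two monitoring gaps described above (a 200 response carrying the 'Maintenance' page, and no direct checks on the downstream servers) could be covered by a check along the following lines. This is a minimal sketch, not the check deployed in Sensu: the function names are hypothetical, the marker string "Maintenance" is an assumption about the page body, and the scheme, port and path on which the downstream servers can be reached are not given in this report. Exit codes follow the Nagios-style convention used by Sensu checks (0 = OK, 2 = CRITICAL).

```python
import urllib.error
import urllib.request
from typing import Dict, Iterable

OK, CRITICAL = 0, 2  # Nagios-style exit codes


def evaluate(status: int, body: str, marker: str = "Maintenance") -> int:
    """Return OK only for a 200 response that is not the maintenance page.

    Assumption: the maintenance page can be recognised by `marker` in its body.
    """
    if status != 200:
        return CRITICAL
    if marker in body:
        return CRITICAL
    return OK


def check_url(url: str, timeout: float = 10.0) -> int:
    """Fetch `url` and evaluate the response; any connection error is CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered with a 4xx/5xx code
        status, body = exc.code, ""
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, ...
        return CRITICAL
    return evaluate(status, body)


def check_all(urls: Iterable[str]) -> Dict[str, int]:
    """Check the front-end hostname and each downstream server separately."""
    return {url: check_url(url) for url in urls}
```

Calling `check_all` with https://events.geant.org plus one URL per downstream server, and exiting with the highest returned code, would raise an alert if either the front end or a single backend fails. Checking the body as well as the status code keeps the check meaningful even if a proxy answers with 200 on behalf of a dead backend.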