Incident Description

The EMS service was not running on either prod-events01.geant.org or prod-events02.geant.org, which made EMS completely unavailable.

The impact of this service degradation was:

Incident severity: CRITICAL (complete service outage)

Data loss: NO

Total duration of incident: 2 days

Timeline

All times are in UTC

Date	Time	Description
12 Mar 2022	06:20	According to the puppet and systemd logs, puppet tried to re-install EMS on `prod-events01.geant.org` and `prod-events02.geant.org`. The pip re-install of EMS left partial EMS directories in the Python virtual environment. This is explained in the following Stack Overflow page: https://stackoverflow.com/questions/55565760/anaconda-python-site-packages-subfolders-with-tilde-in-name-what-are-they
14 Mar 2022	08:18	User enquired on Slack (#general) if EMS is down
14 Mar 2022	08:30 - 09:12	Ian Galpin investigated and resolved the issue: determined that the venv for EMS was broken, disabled puppet, fixed the virtualenv, and restarted the services
14 Mar 2022	09:12	Service was restored and Ian Galpin informed the user on the Slack channel
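
The partial directories mentioned in the 06:20 entry are, per the linked Stack Overflow page, site-packages subfolders whose names contain a tilde. A minimal sketch of how such leftovers could be detected is shown below. It assumes the leftovers start with `~`; the function name `find_partial_installs` and any path passed to it are hypothetical and not part of the EMS deployment.

```python
from pathlib import Path


def find_partial_installs(site_packages):
    """Return the sorted names of entries in site_packages that start with '~'.

    Assumption: a leading tilde marks a directory left behind by an
    interrupted pip install or uninstall.
    """
    root = Path(site_packages)
    return sorted(p.name for p in root.iterdir() if p.name.startswith("~"))
```

Run against the virtualenv's site-packages directory, a non-empty result would suggest the environment should be inspected or rebuilt before the service is restarted.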

One of the reasons why the service outage was not discovered by Sensu is that:
- Sensu checks that the service is up by doing a standard http check on the primary service hostname (https://events.geant.org) which is hosted by HAProxy
- When both downstream servers/services are unreachable ( prod-events01.geant.org and prod-events02.geant.org ), HAProxy returns a generic 'Maintenance' page
- The 'Maintenance' page is returned with an HTTP 200 OK code, which results in the Sensu check marking the service as up and available
Massimiliano Adamo has updated HAProxy to return an HTTP error code instead of a 200 code
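
The failure mode described above can be illustrated with a small sketch of a stricter check. It is not the actual Sensu check: the function name `service_is_up` and the marker string are assumptions, chosen only to show that a 200 response carrying the maintenance page should not count as "up".

```python
def service_is_up(status_code, body, maintenance_marker="Maintenance"):
    """Decide whether an HTTP response indicates a healthy service.

    A response counts as healthy only if the status code is 2xx AND the
    body does not contain the maintenance marker. The marker text is an
    assumption about what the HAProxy fallback page contains.
    """
    if not 200 <= status_code < 300:
        return False
    return maintenance_marker not in body
```

With the HAProxy change in place the status code alone is sufficient, so the body test is only a second line of defence in case the fallback page is ever served with a 200 again.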
A secondary reason for not detecting the outage is that we're not monitoring the service availability on the downstream servers
- Plans are in place to add additional monitoring and checks for all SWD services and to test the outage detection and alerting
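
As a rough sketch of what monitoring the downstream servers directly could look like, the example below probes each backend individually instead of going through HAProxy. The URL scheme, the `/` path, and all function names are assumptions; the exit codes follow the usual convention for Sensu checks (0 OK, 1 WARNING, 2 CRITICAL).

```python
import urllib.error
import urllib.request

# Backend hostnames taken from the incident description.
BACKENDS = ["prod-events01.geant.org", "prod-events02.geant.org"]


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def check_backends(hosts, fetch=fetch_status, scheme="https"):
    """Map each host to True if it answered with a 2xx status."""
    results = {}
    for host in hosts:
        status = fetch(f"{scheme}://{host}/")
        results[host] = status is not None and 200 <= status < 300
    return results


def exit_code(results):
    """0 if all backends are up, 1 if some are down, 2 if all are down."""
    down = [host for host, ok in results.items() if not ok]
    if not down:
        return 0
    if len(down) < len(results):
        return 1
    return 2
```

A check script would call `exit_code(check_backends(BACKENDS))` and pass the result to `sys.exit`. Reporting a warning when only one backend is down would surface a partial failure before it becomes a full outage.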
Ian Galpin to determine why puppet and pip break the virtualenv
Mandeep Saini to present the Incident Handling Process presentation
- Many new team members have joined since the last time the presentation was given