...
- One reason Sensu did not detect the service outage:
- Sensu checks that the service is up with a standard HTTP check against the primary service hostname (https://events.geant.org), which is served by HAProxy
- When both downstream servers/services (prod-events01.geant.org and prod-events02.geant.org) are unreachable, HAProxy returns a generic 'Maintenance' page
- The 'Maintenance' page is returned with an HTTP 200 OK code, which results in the Sensu check marking the service as up and available
- Massimiliano Adamo has updated HAProxy to return an HTTP error code instead of a 200 code
- A secondary reason for not detecting the outage is that we're not monitoring the service availability on the downstream servers
- Plans are in place to add further monitoring and checks for all SWD services, and to test outage detection and alerting
- Ian Galpin to determine why Puppet and pip break the virtualenv
- Mandeep Saini to give the Incident Handling Process presentation
- Many new team members have joined since the last time the presentation was given
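
The monitoring gap noted above (a 'Maintenance' page served with HTTP 200 being read as "up") can be illustrated with a minimal Python sketch. It is not the team's actual Sensu check. It assumes the Nagios-style exit codes that Sensu checks use (0 = OK, 2 = CRITICAL); the `MAINTENANCE_MARKER` string, the function names, and the idea of probing the downstream hosts over HTTPS are hypothetical choices made for illustration.

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # Nagios-style exit codes, as used by Sensu checks

# Hypothetical marker: the real wording of the maintenance page is an assumption.
MAINTENANCE_MARKER = "Maintenance"


def evaluate(status, body):
    """Map one HTTP response to (exit_code, message)."""
    if status != 200:
        return CRITICAL, f"unexpected HTTP status {status}"
    # Guard against a fallback page that is served with a 200 status.
    if MAINTENANCE_MARKER.lower() in body.lower():
        return CRITICAL, "HTTP 200 but maintenance page served"
    return OK, "HTTP 200"


def check_url(url, timeout=10):
    """Fetch a URL and evaluate the response; network errors are CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses arrive here, e.g. once HAProxy returns an error code.
        return evaluate(exc.code, "")
    except OSError as exc:
        # URLError, timeouts, and connection failures are all OSError subclasses.
        return CRITICAL, f"unreachable: {exc}"


def run_checks(urls):
    """Check every URL, print one line per result, return the worst exit code."""
    worst = OK
    for url in urls:
        code, message = check_url(url)
        print(f"{url}: {message}")
        worst = max(worst, code)
    return worst
```

A wrapper script could call `sys.exit(run_checks([...]))` with the primary hostname plus each downstream server, so that an outage behind HAProxy is reported even if the front end still answers. The body match is deliberately crude: a healthy page that happens to contain the marker word would be flagged, so returning a non-200 status from HAProxy remains the more robust fix.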