
...

One reason Sensu did not detect the service outage:
- Sensu checks that the service is up by running a standard HTTP check against the primary service hostname (https://events.geant.org), which is served by HAProxy
- When both downstream servers (prod-events01.geant.org and prod-events02.geant.org) are unreachable, HAProxy returns a generic 'Maintenance' page
- The 'Maintenance' page is returned with an HTTP 200 OK status, so the Sensu check marks the service as up and available
Massimiliano Adamo has updated HAProxy so that the 'Maintenance' page is returned with an HTTP error code instead of 200 OK
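
As an illustration of the failure mode above, here is a minimal Python sketch of a check that does not trust the status code alone. It is not the actual Sensu check in use. The marker text `MAINTENANCE_MARKER` and the helper names `evaluate_response` and `check_url` are assumptions made for this example; the exit codes follow the Nagios-style convention that Sensu checks use (0 = OK, 2 = CRITICAL).

```python
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # Nagios-style exit codes, as used by Sensu checks

# Assumption: a string that appears on the HAProxy fallback page.
MAINTENANCE_MARKER = "Maintenance"


def evaluate_response(status: int, body: str) -> int:
    """Return OK only for a 2xx response that is not the maintenance page."""
    if not 200 <= status < 300:
        return CRITICAL
    if MAINTENANCE_MARKER.lower() in body.lower():
        # A 200 response carrying the fallback page still means an outage.
        return CRITICAL
    return OK


def check_url(url: str, timeout: float = 10.0) -> int:
    """Fetch url and classify the response; network errors are CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return evaluate_response(exc.code, "")
    except (urllib.error.URLError, OSError):
        return CRITICAL
```

A check script could call `sys.exit(check_url("https://events.geant.org"))` so that Sensu reads the result from the exit code. Matching on body text is fragile if a real page legitimately contains the marker word, so the HAProxy change to return an error code remains the more robust fix.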
A secondary reason for not detecting the outage is that we're not monitoring the service availability on the downstream servers
- Plans are in place to add further monitoring and checks for all SWD services, and to test outage detection and alerting
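
One possible shape for a downstream check is sketched below in Python. It is a sketch under assumptions, not the planned implementation: the function names `summarize_backends` and `http_probe`, the use of HTTPS on the root path, and the choice to report WARNING when only some servers are down are all hypothetical. Only the two hostnames come from the notes above.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style exit codes, as used by Sensu

# Downstream servers named in the incident notes.
BACKENDS = ["prod-events01.geant.org", "prod-events02.geant.org"]


def http_probe(host: str, timeout: float = 5.0) -> bool:
    """Return True if the host answers an HTTP request without an error.

    Assumption: the service listens on HTTPS at the root path.
    """
    try:
        with urllib.request.urlopen(f"https://{host}/", timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def summarize_backends(
    hosts: Iterable[str], probe: Callable[[str], bool]
) -> Tuple[int, List[str]]:
    """Probe each host directly and return (exit_code, unreachable_hosts)."""
    hosts = list(hosts)
    down = [host for host in hosts if not probe(host)]
    if not down:
        return OK, down
    if len(down) == len(hosts):
        return CRITICAL, down  # every backend is down: full outage
    return WARNING, down  # degraded: at least one backend still serves
```

Calling `summarize_backends(BACKENDS, http_probe)` probes each server without going through HAProxy, so an outage of both backends is reported even if the front end still returns a page.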
Ian Galpin to determine why Puppet and pip break the virtualenv
Mandeep Saini to present the Incident Handling Process presentation
- Many new team members have joined since the last time the presentation was given
