  • The main reason Sensu did not detect the service outage:
    • Sensu checks that the service is up with a standard HTTP check against the primary service hostname (https://events.geant.org), which is served by HAProxy
    • When both downstream servers (prod-events01.geant.org and prod-events02.geant.org) are unreachable, HAProxy returns a generic 'Maintenance' page
    • The 'Maintenance' page is returned with an HTTP 200 OK status code, so the Sensu check marks the service as up and available
  • Massimiliano Adamo has updated HAProxy to return an HTTP error code instead of a 200 code
  • A secondary reason the outage was not detected is that we do not monitor service availability on the downstream servers themselves
    • Plans are in place to add further monitoring and checks for all SWD services, and to test outage detection and alerting
  • Ian Galpin to determine why Puppet and pip break the virtualenv
  • Mandeep Saini to give the Incident Handling Process presentation again
    • Many new team members have joined since the last time the presentation was given
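
The monitoring gap described in the first two findings can be illustrated with a short sketch. It is not the actual Sensu check in use. It assumes a hypothetical marker string for the 'Maintenance' page (the real page wording is not given here), and it assumes the downstream servers can be queried directly over HTTP, which may not match the real setup. The names `evaluate_response`, `check_service`, and `http_fetch` are illustrative only. The exit codes follow the Nagios-style convention that Sensu checks use (0 OK, 1 WARNING, 2 CRITICAL).

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Tuple

# Nagios-style exit codes, which Sensu checks also use.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumption: hypothetical marker text; the real maintenance page wording is unknown.
MAINTENANCE_MARKER = "Maintenance"

Fetcher = Callable[[str], Tuple[int, str]]


def evaluate_response(status: int, body: str) -> int:
    """Classify one HTTP response; a 200 that carries the maintenance page counts as down."""
    if status != 200:
        return CRITICAL
    if MAINTENANCE_MARKER.lower() in body.lower():
        return CRITICAL
    return OK


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL and return (status, body); status 0 means the host was unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:  # 4xx/5xx responses
        return err.code, ""
    except OSError:  # URLError, timeouts, and connection errors are all OSError subclasses
        return 0, ""


def check_service(frontend: str, backends: List[str],
                  fetch: Fetcher = http_fetch) -> Tuple[int, Dict[str, int]]:
    """Check the frontend and every backend; return (overall exit code, per-URL results)."""
    results = {url: evaluate_response(*fetch(url)) for url in [frontend, *backends]}
    down = [b for b in backends if results[b] != OK]
    if results[frontend] != OK or (backends and len(down) == len(backends)):
        return CRITICAL, results
    if down:
        # The service is still reachable, but redundancy is lost.
        return WARNING, results
    return OK, results
```

A check script would call `check_service` with the frontend URL and the two downstream URLs, print a one-line summary, and pass the returned code to `sys.exit`. Checking the response body as well as the status code guards against a regression in which the maintenance page is again served with a 200.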