...

At 14:00 on Saturday the 7th of February our SharePoint Servers suffered an outage with the message

"The server is busy now. Try again later."

This was a message from SharePoint informing us that it was receiving unusually high traffic. The geant.net domain was not being resolved by the root servers.

Incident severity: CRITICAL

Data loss: NO

Affected Services 

The following services were affected:

Cause

Our SharePoint servers receive requests from a “load balancer”, which assigns each request to the server with the least load. SharePoint communicates with the load balancer using a service called the Request Management Service.

It appears that the load balancer was sending all requests to only one of the servers. This overloaded that server until it hit a threshold, after which it stops accepting requests. It does this to force the load balancer to send requests to the other servers.
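
The behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Server` class, the `pick_least_loaded` and `simulate` functions, the server names and the request count are all hypothetical, and the code does not reflect how SharePoint or the load balancer is actually implemented. The threshold of 500 queued requests is the value that was in place at the time of the incident.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    queued: int = 0        # requests currently waiting on this server
    threshold: int = 500   # queue depth at which the server stops accepting

    def accepting(self) -> bool:
        return self.queued < self.threshold


def pick_least_loaded(servers):
    """Return the accepting server with the shortest queue, or None."""
    candidates = [s for s in servers if s.accepting()]
    return min(candidates, key=lambda s: s.queued, default=None)


def simulate(servers, requests, picker):
    """Send `requests` requests via `picker`; return how many were rejected."""
    rejected = 0
    for _ in range(requests):
        target = picker(servers)
        if target is None or not target.accepting():
            rejected += 1      # the user would see "The server is busy now"
        else:
            target.queued += 1
    return rejected


# Healthy balancer: the load is spread evenly and nothing is rejected.
healthy = [Server("wfe1"), Server("wfe2")]
healthy_rejected = simulate(healthy, 800, pick_least_loaded)

# Faulty balancer (assumed failure mode): every request goes to one server.
faulty = [Server("wfe1"), Server("wfe2")]
faulty_rejected = simulate(faulty, 800, lambda servers: servers[0])
```

In this model the faulty balancer fills one server to its threshold while the other stays idle, so requests are rejected even though total capacity is sufficient.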

However, either the load balancer itself malfunctioned, or the Request Management Service stopped, which in turn caused the load balancer to malfunction. We are still investigating, but the problem appears to lie with the “load balancer”.

This may or may not be linked to high traffic due to a DDoS attack. The load balancer may have been failing for some time without the request volume being high enough to trigger the threshold.

Resolution.

To bring the SharePoint servers back online, we turned off throttling so that the servers would continue to accept requests.

Later, we raised the threshold from 500 to 1000 queued requests.

Future Mitigation.

We are still investigating the issue.

...

  • Dashboard
  • mail.geant.net
  • repositories.geant.net
  • All systems trying to send email to users on dante.net and geant.net

Cause

Our DNS was sending queries for geant.net and dante.net to the external root servers. This configuration had been in place for around 2 years, but the root servers suddenly stopped returning the records for our domains.

BIND, running on the Consul servers, was set up to route queries as follows:

  • service.ha.geant.net was forwarded to the local Consul on port 8600
  • win.dante.org.uk and geant.local were forwarded to the Windows servers geantdc01.geant.local and geantdc02.geant.local
  • everything else went to the Internet root servers

Resolution.

BIND was changed to forward as follows:

  • service.ha.geant.net is still forwarded to the local Consul on port 8600
  • geant.local is still forwarded to geantdc01.geant.local and geantdc02.geant.local
  • win.dante.org.uk is now forwarded to am-prd-dc01.win.dante.org.uk and am-prd-dc02.win.dante.org.uk
  • geant.net, geant.org and dante.net are now forwarded to the Infoblox grid members 62.40.104.250, 62.40.116.122 and 62.40.116.114
  • everything else still goes to the Internet root servers
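
As a rough illustration of how the forwarding rules above select a target, the sketch below models them as a lookup table matched on the most specific domain suffix. It is not BIND configuration and is not taken from our servers: the names `FORWARD_ZONES`, `DEFAULT` and `forwarders_for` are hypothetical, and the strings "local consul:8600" and "Internet root servers" are placeholders rather than real addresses.

```python
# Hypothetical model of the new forwarding rules; placeholders, not real config.
INFOBLOX = ["62.40.104.250", "62.40.116.122", "62.40.116.114"]
DEFAULT = ["Internet root servers"]  # placeholder for normal recursion

FORWARD_ZONES = {
    "service.ha.geant.net": ["local consul:8600"],  # placeholder address
    "geant.local": ["geantdc01.geant.local", "geantdc02.geant.local"],
    "win.dante.org.uk": [
        "am-prd-dc01.win.dante.org.uk",
        "am-prd-dc02.win.dante.org.uk",
    ],
    "geant.net": INFOBLOX,
    "geant.org": INFOBLOX,
    "dante.net": INFOBLOX,
}


def forwarders_for(name: str):
    """Return the forwarders for `name`, matching the longest zone suffix."""
    labels = name.rstrip(".").lower().split(".")
    # Try the most specific suffix first: a.b.c, then b.c, then c.
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in FORWARD_ZONES:
            return FORWARD_ZONES[zone]
    return DEFAULT


for query in ("mail.geant.net", "web.service.ha.geant.net", "example.com"):
    print(query, "->", forwarders_for(query))
```

Matching the longest suffix first means that names under service.ha.geant.net keep going to Consul even though geant.net as a whole is now forwarded to the Infoblox grid members.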