Incident description
At 14:00 on Saturday the 7th of February, the geant.net domain was no longer being resolved by the ROOT servers.
"The server is busy now. Try again later."
This was a message from SharePoint informing us that it was suffering unusually high traffic.
Incident severity: CRITICAL
Data loss: NO
Affected Services
All systems trying to send email to users on dante.net and geant.net
- relay.geant.org and mail.geant.net, Dashboard, repositories.geant.net, etc.
Cause
Our DNS was routing requests for geant.net and dante.net to the external ROOT servers. This configuration had been in place for around 2 years, but the ROOT servers were suddenly missing the records for our domains.
BIND, running on the Consul servers, was set up to forward queries as follows:
- domain service.ha.geant.net was forwarded to the local Consul on port 8600
- win.dante.org.uk and geant.local were forwarded to Windows servers geantdc01.geant.local and geantdc02.geant.local
- everything else was going to Internet ROOT servers
Resolution
BIND was changed to forward queries as follows:
- domain service.ha.geant.net is still forwarded to the local Consul on port 8600
- geant.local remains the forward to geantdc01.geant.local and geantdc02.geant.local
- win.dante.org.uk is now forwarded to am-prd-dc01.win.dante.org.uk and am-prd-dc02.win.dante.org.uk
- geant.net, geant.org and dante.net are now forwarded to infoblox grid members 62.40.104.250, 62.40.116.122, 62.40.116.114
- everything else remains on Internet ROOT servers
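
The forwarding rules listed under Cause and Resolution behave like a "most specific zone wins" lookup on the queried name. The sketch below is an illustrative Python model of that lookup, not the actual BIND configuration. The names `OLD_RULES`, `NEW_RULES`, `ROOT` and `pick_upstream` are hypothetical and exist only for this example. The zone names and upstreams are copied from the lists in this report.

```python
# Illustrative model of the forwarding rules; this is NOT the BIND config itself.
# All identifiers here are hypothetical names chosen for the example.

ROOT = "Internet ROOT servers"  # label for the default path
CONSUL = ["local consul, port 8600"]
INFOBLOX = ["62.40.104.250", "62.40.116.122", "62.40.116.114"]

# Rules before the incident, as listed under "Cause".
OLD_RULES = {
    "service.ha.geant.net": CONSUL,
    "win.dante.org.uk": ["geantdc01.geant.local", "geantdc02.geant.local"],
    "geant.local": ["geantdc01.geant.local", "geantdc02.geant.local"],
}

# Rules after the change, as listed under "Resolution".
NEW_RULES = {
    "service.ha.geant.net": CONSUL,
    "geant.local": ["geantdc01.geant.local", "geantdc02.geant.local"],
    "win.dante.org.uk": [
        "am-prd-dc01.win.dante.org.uk",
        "am-prd-dc02.win.dante.org.uk",
    ],
    "geant.net": INFOBLOX,
    "geant.org": INFOBLOX,
    "dante.net": INFOBLOX,
}


def pick_upstream(name, rules):
    """Return the upstreams of the most specific zone that contains `name`."""
    labels = name.lower().rstrip(".").split(".")
    # Try the full name first, then drop the leftmost label on each step,
    # so a longer (more specific) zone always wins over a shorter one.
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in rules:
            return rules[zone]
    # No forward rule matched: fall through to the default path.
    return [ROOT]


if __name__ == "__main__":
    for rules_name, rules in (("old", OLD_RULES), ("new", NEW_RULES)):
        print(rules_name, pick_upstream("mail.geant.net", rules))
```

In this model, `mail.geant.net` falls through to the ROOT servers under the old rules and goes to the infoblox grid members under the new ones, while names under `service.ha.geant.net` keep going to Consul because that zone is more specific than `geant.net`.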
Future Mitigation
We are still investigating the issue.