NERSC Case Study: Stateful Firewall
Background
This case study was presented at Joint Techs 2006 (Albuquerque, NM) by Brian Draney of NERSC. NERSC is the US DoE's scientific computer centre, which has ~20 TFlops of processing power and 8.8 PB of storage. It uses a 10GE LAN backbone and connects to ESnet at 10 Gbps.
Symptom
Traffic between a pair of TCP-tuned hosts was being lost. When un-tuned TCP was used there were no such drops.
Troubleshooting
The point at which the transfer hung can be seen here in Xplot (the end of the white line).
A close-up shows where there are 3 DUPACKs (marked with a green 3 in the plot below).
Using tcpdump it could be seen that the sender was sending re-transmits, but these were never getting through to the receiver. Each re-transmit would time out and the next re-transmit would be sent (after an increased interval), and this would continue until the sender gave up.
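The "increased interval" between re-transmits is TCP's retransmission timer backing off. A minimal sketch of that schedule follows, assuming the timer doubles after each timeout up to a cap. The initial timeout, cap and retry count used here are illustrative assumptions, not values taken from the NERSC hosts, and the function name is hypothetical.

```python
def backoff_schedule(initial_rto=1.0, max_rto=60.0, max_retries=6):
    """Return the wait in seconds before each successive re-transmit.

    All three defaults are assumptions for illustration; real stacks
    derive the initial timeout from measured round-trip times and use
    their own retry limits.
    """
    waits = []
    rto = initial_rto
    for _ in range(max_retries):
        waits.append(rto)
        # Double the timer after every timeout, but never exceed the cap.
        rto = min(rto * 2, max_rto)
    return waits


if __name__ == "__main__":
    schedule = backoff_schedule()
    print("wait before each re-transmit (s):", schedule)
    print("total time before giving up (s):", sum(schedule))
```

In a tcpdump trace this pattern shows up as the same segment being sent repeatedly with a growing gap between the copies and no ACK advancing in between.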
Outcome
It was determined that there was a stateful firewall in the path that did not believe the re-transmits were legitimate (because the TCP SEQ numbers were so different from what preceded them).
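One way such a check could behave is sketched below. This is a hypothetical model, not the logic of the firewall involved: it assumes the firewall remembers the highest sequence number seen on a connection and drops any segment further than a fixed tolerance from it. The tolerance of 65535 bytes (an unscaled TCP window) and all names are assumptions for illustration.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def seq_distance(a, b):
    """Shortest distance between two sequence numbers, allowing for wrap."""
    d = (a - b) % SEQ_MOD
    return min(d, SEQ_MOD - d)


def firewall_accepts(seq, highest_seq_seen, tolerance=65535):
    """Hypothetical check: accept only segments close to what came before."""
    return seq_distance(seq, highest_seq_seen) <= tolerance


if __name__ == "__main__":
    highest = 10_000_000

    # Un-tuned host: little data in flight, so a re-transmit sits
    # close to the highest sequence number already seen.
    print(firewall_accepts(highest - 30_000, highest))     # True

    # Tuned host: several megabytes in flight, so the re-transmit of
    # an early lost segment is far behind the highest sequence seen.
    print(firewall_accepts(highest - 4_000_000, highest))  # False
```

Under these assumptions the un-tuned transfer passes while the tuned one has its re-transmits dropped, which matches the symptom described above.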
– Main.TobyRodwell - 16 Feb 2006