NERSC Case Study: Stateful Firewall

Background

This case study was presented at Joint Techs 2006 (Albuquerque, NM) by Brian Draney of NERSC. NERSC is the US DoE's scientific computing centre, with ~20 TFlops of processing power and 8.8 PB of storage. It uses a 10GE LAN backbone and connects to ESnet at 10 Gbps.

Symptom

Traffic between a pair of TCP-tuned hosts was being lost. When un-tuned TCP was used between the same hosts, there were no such drops.

Troubleshooting

The point at which the transfer hung can be seen here in Xplot (the end of the white line):

Xplot of hanging transfer

A close-up shows three DUPACKs (marked with a green 3 in the plot below).

Xplot of retransmit timeouts

Using tcpdump it could be seen that the sender was sending re-transmits, but these were never getting through to the receiver. Each re-transmit would time out and the next would be sent after an increased interval, and this would continue until the sender gave up.
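
The growing gap between re-transmits can be sketched as follows. This is a minimal illustration of exponential back-off, not a record of how the NERSC hosts were configured: the 1 s initial timeout, the 60 s cap and the limit of 15 attempts are assumed values, and rto_schedule is a hypothetical helper name.

```python
def rto_schedule(initial_rto=1.0, max_rto=60.0, max_retries=15):
    """Return the wait (in seconds) before each successive re-transmit.

    All three defaults are illustrative assumptions, not values taken
    from the case study.
    """
    waits = []
    rto = initial_rto
    for _ in range(max_retries):
        waits.append(rto)
        # Each unanswered re-transmit doubles the timeout, up to the cap.
        rto = min(rto * 2, max_rto)
    return waits


schedule = rto_schedule()
print("waits between re-transmits (s):", schedule)
print("total time before giving up (s):", sum(schedule))
```

Under these assumptions the sender keeps trying for roughly ten minutes, which is why a transfer in this state appears to hang rather than fail at once.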

Outcome

It was determined that a stateful firewall in the path did not accept the re-transmits as legitimate, because their TCP SEQ numbers were so different from those of the packets that preceded them.
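
One way such a check could behave is sketched below. The case study does not describe the firewall's actual rule, so everything here is an assumption: signed_seq_diff and firewall_accepts are hypothetical names, the 65,535-byte tolerance (the largest TCP window without window scaling) is chosen only for illustration, and the 8 MiB window stands in for whatever the tuned hosts really used.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32 bits and wrap around
HALF_SPACE = 2 ** 31


def signed_seq_diff(a, b):
    """Return a - b in sequence space as a signed value in [-2**31, 2**31)."""
    return ((a - b + HALF_SPACE) % SEQ_SPACE) - HALF_SPACE


def firewall_accepts(highest_seen, seq, tolerance=65535):
    """Hypothetical rule: drop segments that start too far behind the
    highest sequence number the firewall has already seen."""
    behind = signed_seq_diff(highest_seen, seq)
    return behind <= tolerance


highest_seen = 1_000_000_000

# Un-tuned TCP: at most one unscaled window (< 64 KiB) is in flight,
# so a re-transmit is never far behind the newest data.
untuned_retransmit = highest_seen - 60_000

# Tuned TCP (assumed 8 MiB window): the lost segment can be megabytes
# behind the newest data by the time it is re-sent.
tuned_retransmit = highest_seen - 8 * 1024 * 1024

print("un-tuned re-transmit accepted:", firewall_accepts(highest_seen, untuned_retransmit))
print("tuned re-transmit accepted:", firewall_accepts(highest_seen, tuned_retransmit))
```

With a rule of this kind, only the tuned hosts would see their re-transmits dropped, which is consistent with the symptom reported above.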

-- TobyRodwell - 16 Feb 2006

Topic revision: r1 - 2006-02-16 - TobyRodwell
 