A.3 Performance Problems between FNAL and DESY

PTS Case 4

Background

The German Electron Synchrotron (DESY) is one of the leading accelerator centres in the world, with locations in Hamburg and Zeuthen (Brandenburg). The Fermi National Accelerator Laboratory (Fermilab, FNAL) is an accelerator centre in Illinois, USA. DESY contacted the PERT because, although DESY was connected to the German NREN G-WiN at 1 GE and Fermilab was connected to ESnet with an OC-12 (622 Mbps) link, data transfer between the two was limited to approximately 150 Mbps. DESY reported that it had no problem exchanging data with CERN (CH) at almost line rate (close to 1 Gbps), which indicates that there is no bottleneck on the route DESY - G-WiN - GÉANT - CERN.

Investigation

Initial progress was slow. Several iperf tests were run from third-party devices (that is, hosts other than the two end points), but these were inconclusive, and it was deemed necessary to run all subsequent tests from the LANs in question (FNAL and DESY). Even then, regular access to a suitable machine was not always possible, which is quite often the case in such investigations.

To give the PERT investigation some continuity, Chris Welti from SWITCH volunteered to become the 'Special Case Manager' (SCM) for this issue, which meant that he, and not the weekly-changing Duty Case Manager, would concentrate on the case. This helped continuity, but it also added further delay, since Chris was of course not always available to work on the case.

The first breakthrough came at the end of November. The SCM co-ordinated an end-to-end iperf test with people from all the involved networks on a conference call, each checking how many packets made it across their own network (packet counters were set up on the boundary interfaces, so that packets-in could be compared with packets-out). At this point it became apparent that some packets were being lost on the GÉANT FR-UK link. A careful look at this circuit showed consistent framing errors on the FR receive side. COLT (the FR-UK provider) were slow to fix the problem, so as a temporary workaround an MPLS LSP was used to engineer FNAL-DESY traffic away from the FR-UK link.
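The counter comparison described above can be sketched in a few lines of Python. This is only an illustration: the domain names follow the path mentioned in the text, but the counter values are hypothetical and were not taken from the case, and `DomainCounters` and `loss_report` are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DomainCounters:
    """Packets counted on a network's ingress and egress boundary interfaces."""
    name: str
    packets_in: int
    packets_out: int


def loss_report(domains):
    """Return (name, lost, loss_fraction) for each network on the path."""
    report = []
    for d in domains:
        lost = d.packets_in - d.packets_out
        fraction = lost / d.packets_in if d.packets_in else 0.0
        report.append((d.name, lost, fraction))
    return report


# HYPOTHETICAL readings for one test run in the FNAL -> DESY direction.
path = [
    DomainCounters("ESnet", 1_000_000, 1_000_000),
    DomainCounters("GEANT", 1_000_000, 998_000),
    DomainCounters("G-WiN", 998_000, 998_000),
]

for name, lost, fraction in loss_report(path):
    print(f"{name}: lost {lost} packets ({fraction:.2%})")
```

A network whose packets-out is lower than its packets-in is where the loss occurs; the others simply forward what they receive.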

Eventually (26 Dec) the faulty hardware was traced and replaced, and the LSP bypass was removed. End-to-end performance was still lower than expected (13 MB/s, compared with 70 MB/s to CERN), but on 27 Jan, after getting root access to two DESY machines, the PERT SCM was able to identify a problem with the line card of the switch to which the two workstations were connected. When running 500 Mbps UDP streams between the two hosts, which were in the same subnet, packet loss of up to 1% was seen (in comparison, the FR-UK packet loss was 0.2%). The DESY staff then connected the two hosts to a newer, high-performance line card, which was also less congested, and this appeared to eliminate the issue. Afterwards the SCM was able to reach a steady 700 Mbps from fntst-1.fnal.gov to both end hosts for a period of 5 minutes, using an iperf TCP test with 10 parallel streams and a 2 MB window size at both the sender and the receiver (the use of parallel streams helps to reduce the effect of cross traffic - 10 small streams get better service than one large stream if there is other traffic on a link).
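A second, general reason why window size and stream count matter on a long path is that a single TCP stream cannot send faster than one window per round trip. The sketch below works through that bound for the 2 MB window and 10 streams used in the test. The round-trip time is an assumption made purely for illustration (the case notes do not record it), and `window_limited_mbps` is a name invented here.

```python
def window_limited_mbps(window_bytes, rtt_s, streams=1):
    """Upper bound on TCP throughput in Mbit/s: streams * window / RTT."""
    return streams * window_bytes * 8 / rtt_s / 1e6


WINDOW = 2 * 1024 * 1024  # 2 MB window, as in the test described above
RTT = 0.120               # ASSUMED round-trip time in seconds; not from the case notes

single = window_limited_mbps(WINDOW, RTT)
ten = window_limited_mbps(WINDOW, RTT, streams=10)

print(f"1 stream:   {single:.0f} Mbit/s ceiling")
print(f"10 streams: {ten:.0f} Mbit/s ceiling")
```

Under the assumed RTT, one stream is capped at roughly 140 Mbit/s and ten streams at roughly 1.4 Gbit/s, so the aggregate window would not be the limiting factor for a 700 Mbps result. A different RTT changes these figures proportionally.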

Outcome

The FR-UK framing errors had not been detected by the GN2 NOC because the router did not record them as 'input errors'. Juniper (the router manufacturer) were asked about this and said that there was an internal debate as to what exactly ought to be classified as an input error. Significantly, the current MIBs did not allow framing errors to be counted, and as a result of the PERT’s comments Juniper raised the priority of the Problem Report (PR) that should fix this omission. In the meantime the GÉANT NOC will start monitoring framing errors against a low threshold, which should identify future problems of this sort.
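As a rough illustration of low-threshold monitoring, the sketch below flags any polling interval in which a framing-error counter grows by at least a small threshold. It is not the NOC's actual tooling: the function name, the threshold and the sample values are all hypothetical, and how the counter is read from the router is left out.

```python
def framing_error_alerts(samples, threshold=1):
    """Return (poll_index, delta) for each poll where the counter grew
    by at least `threshold` since the previous poll."""
    alerts = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0:
            # Counter went backwards (reset or wrap); skip this interval.
            continue
        if delta >= threshold:
            alerts.append((i, delta))
    return alerts


# HYPOTHETICAL cumulative framing-error counts from successive polls.
polls = [0, 0, 3, 3, 10, 2]
print(framing_error_alerts(polls))
```

Alerting on the increase between polls, rather than on the absolute value, is what lets a small but persistent error rate stand out.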

For a long while the DESY network connection was fully committed to running tests with CERN, which meant that it was not possible to carry out a final verification test. The case was suspended during this time. Eventually, in August 2006, a data transfer between Fermilab and DESY achieved an average rate of 100 MB/s over a 24-hour period, peaking at ~150 MB/s. This was deemed a success and the case was closed.

-- TobyRodwell - 29 Jan 2007

Topic revision: r1 - 2007-01-29 - TobyRodwell
 