DEISA Throughput Reduction
Problem Summary
On 21 April 2005, Thomas Schmid (DFN) sent a report (DeisaThrougputReductionOriginalDescription) to the pert-report mailing list, explaining a problem observed by participants in the DEISA project. Their TCP transfers between Juelich (Germany) and partner sites in Italy and France sometimes see their throughput drop from about 900 Mb/s to 400 Mb/s or less. As can be seen in the #TrafficMeasurements below, the times when these throughput reductions occur are correlated with high traffic from Karlsruhe/CERN transfers. The DEISA traffic uses PremiumIP, while the Karlsruhe/CERN traffic should be marked as LessThanBestEffort.
Status of Investigation (13 June 2005)
Of all the impediments that could be measured on the network paths used by DEISA, only packet reordering was (a) measured and (b) found to correlate with the periods of lower throughput. Therefore, packet reordering and TCP's reaction to it seem to be the most promising bottleneck to address.
According to the #ReorderingMeasurements taken on 3 May, there is some reordering of TCP segments on all paths that cross the GEANT backbone (there is no reordering on the one path that only crosses DFN's G-WiN backbone).
It has been known for some time that the Juniper M-160 routers used in GEANT are prone to reorder packets, because the load from a single 10 Gb/s (e.g. STM-64c/OC-192c POS) interface is "striped" over four 2.5 Gb/s forwarding paths internal to the router. In the past, such reordering could be produced easily between packets of different sizes (when a large packet enters the router followed by a small packet, the small packet often leaves the router before the large one). In the DEISA test, all packets are presumably of equal size, but obviously reordering still occurs, especially when the total traffic offered to the GEANT backbone (at the DFN interface of the DE GEANT router) exceeds 2-3 Gb/s.
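The mixed-size case described above can be illustrated with a toy model. The following is only a sketch under stated assumptions, not a description of the actual M-160 internals: it assumes packets are assigned to the four internal paths strictly round-robin, that each path is a simple FIFO running at 2.5 Gb/s, and that packets arrive back-to-back at 10 Gb/s. The function names are made up for this illustration.

```python
def striped_departure_order(sizes_bytes, n_paths=4, in_rate=10e9, path_rate=2.5e9):
    """Return packet indices in the order they leave the toy router.

    Assumptions (not taken from router documentation): round-robin
    assignment of packets to paths, one FIFO per path, back-to-back arrivals.
    """
    path_free_at = [0.0] * n_paths
    arrival = 0.0
    departures = []
    for i, size in enumerate(sizes_bytes):
        # Time at which the packet has been fully received on the input link.
        arrival += size * 8 / in_rate
        path = i % n_paths
        # The packet waits if its path is still busy with an earlier packet.
        start = max(arrival, path_free_at[path])
        done = start + size * 8 / path_rate
        path_free_at[path] = done
        departures.append((done, i))
    return [i for _, i in sorted(departures)]


def count_reordered(order):
    """Count packets that leave after a packet with a higher index."""
    highest = -1
    late = 0
    for i in order:
        if i < highest:
            late += 1
        else:
            highest = i
    return late


mixed = striped_departure_order([1500, 40, 1500, 40])
equal = striped_departure_order([1500] * 8)
print(mixed, count_reordered(mixed))
print(equal, count_reordered(equal))
```

In this simplified model the small packets overtake the large ones, while a train of equal-size packets leaves in order. The model therefore does not explain the reordering seen in the DEISA tests with equal-size packets, which presumably depends on how cross traffic is distributed over the internal paths.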
The end systems run IBM's AIX 5.2, and the TCP/IP parameters are configured as described in DeisaTcpConfiguration. One suggestion is for the DEISA people to turn on SACK and check whether this makes their transfers perform better during times of higher load/reordering. In earlier tests, selective acknowledgements were used with no positive effect, but that was in a different situation; see the note below about the possible real problem.
Missing Information
More measurements to test reordering impact hypothesis
It would still be useful to get some data from the hosts used:
- Looking at the end-systems' TCP counters, do the periods of lower throughput coincide with reordering?
- If there is indeed reordering, does this also lead to retransmissions at the sender?
- Does CPU utilization increase when there is reordering? (in particular at the receiver)
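One way to answer the first two questions is to snapshot the hosts' TCP statistics before and after a test period and compare the counters. The sketch below assumes BSD-style `netstat -s` text, where each counter appears as a number followed by a description. The sample snapshots are invented for illustration and are not real output from the DEISA hosts; the exact counter wording on AIX may differ.

```python
import re

# A counter line is assumed to look like "<number> <description>".
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def parse_counters(text):
    """Map counter descriptions to integer values."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_LINE.match(line)
        if m:
            # Drop parenthesised details such as "(1234 bytes)" from the name.
            name = re.sub(r"\s*\([^)]*\)", "", m.group(2))
            counters[name] = int(m.group(1))
    return counters


def counter_deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Hypothetical snapshots, for illustration only.
BEFORE = """
    120 out-of-order packets (175200 bytes)
    4 data packets (5840 bytes) retransmitted
"""
AFTER = """
    318 out-of-order packets (464280 bytes)
    4 data packets (5840 bytes) retransmitted
"""

print(counter_deltas(parse_counters(BEFORE), parse_counters(AFTER)))
```

Taking such snapshots at the start and end of each hourly iperf run would allow the out-of-order and retransmission deltas to be plotted against the measured throughput.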
Real-time (MRTG-style) statistics of the periodic iperf measurements exist at the FZJ (Juelich, DE) site, but current project policy prevents the DEISA project from giving them out. Ralph Niederberger would probably be the appropriate contact for getting access to these.
The Real Problem - A Different One?
One fundamental question is: How relevant are these single-TCP Iperf test results for the actual DEISA applications?
In a message to pert-discuss from Thu May 12 14:20:18 BST 2005, Klaus Desinger (MPG) explains that GPFS (Tiger Shark File System) with several streams would be the main application for DEISA data transfers. But he also provided the following thought:
But if we don't achieve stable throughput with a single stream (on an otherwise "empty" link), adding more streams won't help much.
If reordering, and TCP's reaction to it, are the limiting factor for single-stream TCP, then adding more streams probably will help, though.
However, in a phone conversation between Klaus Desinger (RZG) and SimonLeinen, Klaus explained that the DEISA project had in fact tested GPFS with many parallel streams. The tests were between two DEISA sites separated by an RTT of about 20 ms. The results of these tests weren't good, either. In particular, the aggregate GPFS traffic (of multiple TCP streams) followed a "sawtooth" curve, where traffic would increase to fill the 1 Gb/s path bottleneck (presumably within the sending site), then break down to a very low rate, take up to three minutes to recover towards line rate, break down again, and so on. Klaus noted that selective acknowledgements were tried too, but didn't help.
This description is very different from the symptoms that the single-stream iperf measurements show, and the explanation is probably also quite different.
From the "sawtooth" pattern, it looks as if the parallel TCPs experience catastrophic loss (i.e. many segments of the same connection are lost, throwing TCP back to slow start) as offered load exceeds bottleneck bandwidth. There may also be a synchronization effect because all TCPs probably run over the same path, and so share the same round-trip time.
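The suspected mechanism can be illustrated with a very rough model. The sketch below is not a TCP implementation: it assumes that all flows see the same round-trip time, that every flow loses many segments in the same round as soon as the offered load exceeds the bottleneck, and that each flow then restarts from a small rate, doubling up to half its previous rate and growing linearly after that. The flow count, rates, and step sizes are arbitrary illustrative values.

```python
def simulate_synchronised_flows(n_flows=8, capacity=1000.0, steps=200,
                                initial=1.0, increment=1.0):
    """Return the aggregate rate (arbitrary Mb/s units) per round trip.

    Toy model: all flows share one bottleneck and one RTT, and all of
    them collapse together whenever the offered load exceeds capacity.
    """
    rate = [initial] * n_flows
    ssthresh = [capacity] * n_flows
    aggregate = []
    for _ in range(steps):
        total = sum(rate)
        # The bottleneck cannot carry more than its capacity.
        aggregate.append(min(total, capacity))
        if total > capacity:
            # Synchronised "catastrophic" loss: every flow restarts.
            ssthresh = [max(r / 2, initial) for r in rate]
            rate = [initial] * n_flows
        else:
            # Exponential growth below the threshold, linear growth above it.
            rate = [r * 2 if r < s else r + increment
                    for r, s in zip(rate, ssthresh)]
    return aggregate


curve = simulate_synchronised_flows()
collapses = sum(1 for a, b in zip(curve, curve[1:]) if b < a)
print(collapses, max(curve), min(curve))
```

Because the flows never desynchronise in this model, the aggregate repeatedly climbs to the bottleneck rate and then falls to almost nothing, which is qualitatively the pattern Klaus described. The recovery times in the model are not meant to match the observed three minutes.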
There is no indication that packet reordering had an impact on these (GPFS over many TCP connections) measurements.
In order to solve this problem of GPFS performance, it would be necessary to look at measurements that reflect this situation - the iperf measurements aren't really that helpful.
Complete Topology
Because the correlation between the Karlsruhe/CERN and DEISA traffic is so obvious, the PERT should focus on those places where these two traffic aggregates share resources, i.e. in Frankfurt. Therefore a complete view of the topology doesn't seem necessary here.
Possible Avenues for Improvement/Further Analysis
Go on to the real problem
As explained in #MaybeTheRealProblem, it seems that the network and end-systems are already sufficient to run the DEISA application at the desired aggregate rate, given that several parallel TCP connections will be used.
But as also explained there, previous measurements with GPFS over parallel TCP streams didn't achieve the desired throughput either. It would make sense to close this current issue, and open a new one based on GPFS measurements, as it is very unlikely that both are symptoms of the same bottleneck.
DEISA to enable SACK (Selective Acknowledgements)
Given that the throughput loss seems to be related to packet reordering, it might help to change the DeisaTcpConfiguration so that SACK (Selective Acknowledgements) is used.
DEISA to use a different TCP
SimonLeinen suggested that modern versions of TransmissionControlProtocol such as BIC or FAST might be able to better use the available capacity in the presence of cross traffic. The BIC variant has been included in the Linux kernel since version 2.6.7, although a bug prevented it from being effective until version 2.6.11.7.
Given that the DEISA machines run IBM AIX, the options of using a different TCP may be limited.
Study/reduce reordering within the GEANT backbone
Reordering can be observed on all paths traversing the GEANT backbone, and the Juniper M-160 routers are architecturally prone to reorder packets received on 10 Gb/s links. Therefore it can be assumed that reordering is introduced in one or several M-160 routers on the path. While it would be interesting to find out more about where exactly reordering happens, and which situations lead to higher amounts of reordering, it might be more difficult to reduce or eliminate reordering.
The current plans for GEANT2 won't completely eliminate reordering in the IP part of the backbone, because some/most of the M-160 routers will be re-used. However, in GEANT2 the DEISA project could consider separate "lightpath" connections that would not use the M-160 platform for forwarding. Note that the current separation of the DEISA traffic from other GEANT traffic using MPLS doesn't help, because this separation is merely logical, and all MPLS VPNs go through the same M-160 forwarding paths together.
Therefore - at least until M-160-free lightpaths are available to the DEISA project - the best way to improve TCP performance would be to improve the resilience of the TCPs in the end systems with respect to reordering (by using SACK, for instance).
Enable QoS on DFN's cr-frankfurt1 router
We cannot completely exclude the possibility that the impact of the Karlsruhe/CERN traffic on the DEISA traffic arises at DFN's Frankfurt core router sending to GEANT. Therefore it would be useful to see whether the DEISA performance changes when scheduling is enabled on cr-frankfurt1 to prioritise PremiumIP traffic and/or deprioritise LessThanBestEffort traffic.
Measurements
Reordering Measurements (3 May 2005)
Klaus Desinger from RZG (Rechenzentrum Garching of the Max-Planck-Gesellschaft) contributed the following measurements.
Thomas had suggested that we take a closer look at the number of packets arriving out-of-order, so I did tests between all DEISA sites, sending a TCP stream with iperf for 10 seconds and noted the TCP-out-of-order counter on the receiving machine before and after:
| dst: | FZJ | IDRIS | CINECA | RZG 2.4MB | RZG 4MB |
| src: | | | | | |
| FZJ | - | 198 | 408 | 0 | 0 |
| IDRIS | 196 | - | 180 | 130 | 2007 |
| CINECA | 97 | 24 | - | 68 | 121 |
| RZG | 0 | 6 | 100 | - | - |
(these were single tests on May 3rd, with no other DEISA traffic).
The good thing: There is no packet reorder between RZG and FZJ (both are within DFN's GWiN).
About the two columns for RZG:
We have a special problem with data being sent from IDRIS to RZG, which may or may not be related to this issue. To get stable throughput IDRIS->RZG we currently have slowed things down by slightly reducing the TCP receive buffer size at RZG to 2400000 bytes. If we test with a 4 MB receive buffer (last column) we get lots of packets out of order from IDRIS.
I'm now also monitoring the number of packets we receive out of order at RZG:
http://post.rzg.mpg.de/mrtg/deisy1-order.html
We seem to receive more of them during daytime than at night. But since traffic on the DFN-GEANT interface hasn't been high (>1Gbps) for a longer period lately it's not clear yet whether there is a correlation.
(There was a traffic peak today around 10 a.m. corresponding to a drop in throughput IDRIS->FZJ and out-of-order packets here at RZG, but that might just have been a coincidence).
It might also be the case that we currently don't observe the drops in throughput at RZG because of the reduced receive buffer, but we will have to increase it again when other sites with larger RTTs join in.
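The link between receive buffer size, RTT, and achievable rate mentioned above follows from the fact that a single TCP connection can have at most one window of data in flight per round trip. The sketch below works through that arithmetic. The RTT values are illustrative assumptions, since the RTTs of the individual DEISA paths are not given here, and the function names are made up for this example.

```python
def window_limited_mbps(window_bytes, rtt_ms):
    """Upper bound on single-stream TCP throughput: one window per RTT."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def window_needed_bytes(target_mbps, rtt_ms):
    """Window (bandwidth-delay product) needed to sustain a target rate."""
    return target_mbps * 1e6 * (rtt_ms / 1000.0) / 8


# 2400000 bytes is the reduced RZG receive buffer; the RTTs are assumed values.
for rtt_ms in (10, 20, 40):
    cap = window_limited_mbps(2400000, rtt_ms)
    need = window_needed_bytes(1000, rtt_ms)
    print(f"RTT {rtt_ms} ms: cap {cap:.0f} Mb/s, "
          f"window for 1 Gb/s {need:.0f} bytes")
```

Under these assumptions a 2400000-byte buffer caps a single stream slightly below 1 Gb/s at a 20 ms RTT and at about half that at 40 ms, which is why the buffer would need to grow again for sites with larger RTTs.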
Traffic Graphs Showing Correlation Between DEISA and KA/CERN
These graphs and explanations were sent by Thomas Schmid (DFN) on April 22, 2005.
aggregate traffic received and sent over the GEANT-DFN link
The 2-3 Gb/s traffic in weeks 12, 13 and 15 comes from Karlsruhe-CERN tests
tcp iperf throughput between Jülich and IDRIS
The iperf numbers are not from permanent traffic, but from hourly short tests (1-2mins).
interface between DFN and Geant
(suppressed for brevity, because this is just the complement to the #GeantDfn graph - see attached file.)
DE-FR GEANT trunk load, April 2005
It can be seen that the DE-FR GEANT trunk has been heavily loaded on occasions, with 1-hour averages exceeding 3Gb/s.
Contact information for this issue:
Thomas Schmid <schmid@noc.dfn.de>
Links
-- ChrisWelti - 25 Apr 2005
-- SimonLeinen - 17 May 2005