r9 - 13 Jun 2005 - 14:06:52 - SimonLeinen

DEISA Throughput Reduction

Problem Summary

On 21 April 2005, Thomas Schmid (DFN) sent a report (DeisaThrougputReductionOriginalDescription) to the pert-report mailing list, explaining a problem observed by participants in the DEISA project. Their TCP transfers between Juelich (Germany) and partner sites in Italy and France sometimes see their throughput drop from about 900 Mb/s to 400 Mb/s or less. As can be seen in the #TrafficMeasurements below, the times when these throughput reductions occur are correlated with high traffic from Karlsruhe/CERN transfers. The DEISA traffic uses PremiumIP, while the Karlsruhe/CERN traffic should be marked as LessThanBestEffort.

Status of Investigation (13 June 2005)

Of all the impediments that could be measured on the network paths used by DEISA, only packet reordering was (a) measured and (b) found to correlate with the periods of lower throughput. Therefore, packet reordering and TCP's reaction to it seem the most attractive bottleneck to address.
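One way to put a number on such a correlation is a Pearson coefficient between throughput samples and the matching increments of the out-of-order counter. The following is a minimal sketch; the sample values are invented for illustration and are not DEISA measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical hourly samples (NOT real measurements):
# iperf throughput in Mb/s and out-of-order segments seen during each test.
throughput_mbps = [900, 880, 410, 390, 905, 450]
out_of_order_delta = [3, 5, 190, 210, 2, 160]

r = pearson(throughput_mbps, out_of_order_delta)
print(f"correlation: {r:.2f}")
```

A coefficient close to -1 would support the reordering hypothesis; a value near 0 would not. Either way, correlation alone says nothing about which side is the cause.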

According to the #ReorderingMeasurements taken on 3 May, there is some reordering of TCP segments on all paths that cross the GEANT backbone (there is no reordering on the one path that only crosses DFN's G-WiN backbone).

It has been known for some time that the Juniper M-160 routers used in GEANT are prone to reorder packets, because the load from a single 10 Gb/s (e.g. STM-64c/OC-192c POS) interface is "striped" over four 2.5 Gb/s forwarding paths internal to the router. In the past, such reordering could be produced easily between packets of different sizes (when a large packet enters the router followed by a small packet, the small packet often leaves the router before the large one). In the DEISA test, all packets are presumably of equal size, but obviously reordering still occurs, especially when the total traffic offered to the GEANT backbone (at the DFN interface of the DE GEANT router) exceeds 2-3 Gb/s.
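The load dependence can be illustrated with a toy queueing model. This is only a sketch under strong assumptions: packets of equal size arrive at random times, each one is assigned to one of four internal paths at random, and each path is a simple FIFO that needs four time units per packet (a quarter of the line rate). The real M-160 striping algorithm is not described here and may well differ.

```python
import random


def count_reordered(num_packets=10000, num_paths=4, load=0.3, seed=1):
    """Count packets that leave the toy router before an earlier packet.

    Assumptions (hypothetical, not the M-160's actual behaviour):
    - the input line carries one packet per time unit at load == 1.0,
    - arrivals are Poisson with the given load,
    - each packet is sent to a randomly chosen internal path,
    - each path is a FIFO needing `num_paths` time units per packet.
    """
    rng = random.Random(seed)
    service_time = float(num_paths)
    path_free_at = [0.0] * num_paths
    now = 0.0
    departures = []
    for _ in range(num_packets):
        now += rng.expovariate(load)           # mean gap is 1 / load
        path = rng.randrange(num_paths)
        start = max(now, path_free_at[path])   # wait if the path is busy
        path_free_at[path] = start + service_time
        departures.append(path_free_at[path])

    latest_departure = 0.0
    reordered = 0
    for departure in departures:
        if departure < latest_departure:
            reordered += 1                     # overtook an earlier packet
        else:
            latest_departure = departure
    return reordered


for load in (0.1, 0.3, 0.6, 0.9):
    print(load, count_reordered(load=load))
```

In this model a packet can only overtake its predecessor when that predecessor had to queue on a busy path, which becomes more likely as the offered load grows. With a single path (`num_paths=1`) no reordering is possible at all.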

The end-systems used run IBM's AIX 5.2, and the TCP/IP parameters are configured as described in DeisaTcpConfiguration. One suggestion could be for the DEISA people to turn on SACK, and check whether this makes their transfers perform better during times of higher load/reordering. In earlier tests, selective acknowledgements were used with no positive effect, but that was in a different situation; see the note below about the possible real problem.

Missing Information

More measurements to test reordering impact hypothesis

It would still be useful to get some data from the hosts used:

  1. Looking at the end-systems' TCP counters, do the periods of lower throughput coincide with reordering?
  2. If there is indeed reordering, does this also lead to retransmissions at the sender?
  3. Does CPU utilization increase when there is reordering? (in particular at the receiver)
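The first two questions come down to reading the hosts' TCP counters before and after a test and comparing them. A minimal sketch of that bookkeeping follows. It assumes BSD-style `netstat -s` wording for the two counters; the exact text on the AIX hosts has not been checked here and the patterns would need to be adapted. The snapshots in the example are made up.

```python
import re

# Assumed BSD-style wording of the TCP counters in `netstat -s` output.
# The exact text differs between operating systems; verify on the real host.
PATTERNS = {
    "out_of_order": re.compile(
        r"^\s*(\d+) out-of-order packets?", re.MULTILINE),
    "retransmitted": re.compile(
        r"^\s*(\d+) data packets? \(\d+ bytes?\) retransmitted", re.MULTILINE),
}


def read_counters(netstat_text):
    """Extract the counters named in PATTERNS from one netstat snapshot."""
    counters = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(netstat_text)
        if match is None:
            raise ValueError(f"counter {name!r} not found")
        counters[name] = int(match.group(1))
    return counters


def counter_deltas(before_text, after_text):
    """Return how much each counter grew between two snapshots."""
    before = read_counters(before_text)
    after = read_counters(after_text)
    return {name: after[name] - before[name] for name in before}


# Made-up snapshots, for illustration only.
BEFORE = """\
tcp:
        1200 data packets (1500000 bytes) retransmitted
        340 out-of-order packets (480000 bytes)
"""
AFTER = """\
tcp:
        1212 data packets (1517000 bytes) retransmitted
        400 out-of-order packets (567000 bytes)
"""

print(counter_deltas(BEFORE, AFTER))
```

The out-of-order counter is meaningful on the receiver and the retransmission counter on the sender, so in practice the two snapshots would be taken on different hosts.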

Real-time (MRTG-style) statistics of the periodic iperf measurements exist at the FZJ (Juelich, DE) site, but the current project policy prevents them from being given out by the DEISA project. Ralph Niederberger would probably be the appropriate contact for getting access to these.

The Real Problem - A Different One?

One fundamental question is: How relevant are these single-TCP Iperf test results for the actual DEISA applications?

In a message to pert-discuss from Thu May 12 14:20:18 BST 2005, Klaus Desinger (MPG) explains that GPFS (Tiger Shark File System) with several streams would be the main application for DEISA data transfers. But he also provided the following thought:

But if we don't achieve stable throughput with a single stream (on an otherwise "empty" link), adding more streams won't help much.

If reordering, and TCP's reaction to it, are the limiting factor for single-stream TCP, then adding more streams probably will help, though.

However, in a phone conversation between Klaus Desinger (RZG) and SimonLeinen, Klaus explained that the DEISA project had in fact tested GPFS with many parallel streams. The tests were between two DEISA sites separated by an RTT of about 20 ms. The results of these tests weren't good, either. In particular, the aggregate GPFS traffic (of multiple TCP streams) followed a "sawtooth" curve, where traffic would increase to fill the 1 Gb/s path bottleneck (presumably within the sending site), then break down to a very low rate, take up to three minutes to recover towards line rate, break down again, and so on. Klaus noted that selective acknowledgements were tried too, but didn't help.

This description is very different from the symptoms that the single-stream iperf measurements show, and the explanation is probably also quite different.

From the "sawtooth" pattern, it looks as if the parallel TCPs experience catastrophic loss (i.e. many segments of the same connection are lost, throwing TCP back to slow start) as offered load exceeds bottleneck bandwidth. There may also be a synchronization effect because all TCPs probably run over the same path, so share the same round-trip time.
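As a rough yardstick for recovery times in such a sawtooth, the following sketch estimates how long standard additive increase takes to grow a single connection's window back to the bandwidth-delay product. The 1 Gb/s and 20 ms figures are taken from the description above. The 1460-byte segment size and the growth of one segment per round trip are assumptions about the TCP in use, and timeouts are not modelled.

```python
def bdp_segments(rate_bps, rtt_s, mss_bytes=1460):
    """Bandwidth-delay product expressed in TCP segments."""
    return rate_bps * rtt_s / 8 / mss_bytes


def linear_recovery_seconds(rate_bps, rtt_s, start_fraction=0.5,
                            mss_bytes=1460, segments_per_rtt=1.0):
    """Time for additive increase to grow the window back to the BDP.

    start_fraction is the window right after the loss, as a fraction of
    the BDP (0.5 after a fast recovery, close to 0 after a timeout with
    a very small slow-start threshold).
    """
    full_window = bdp_segments(rate_bps, rtt_s, mss_bytes)
    missing = full_window * (1.0 - start_fraction)
    return missing / segments_per_rtt * rtt_s


RATE_BPS = 1e9   # 1 Gb/s bottleneck, from the text
RTT_S = 0.020    # about 20 ms, from the text

print(f"BDP: {bdp_segments(RATE_BPS, RTT_S):.0f} segments")
print(f"from half window: {linear_recovery_seconds(RATE_BPS, RTT_S):.1f} s")
print(f"from near zero:   "
      f"{linear_recovery_seconds(RATE_BPS, RTT_S, start_fraction=0.0):.1f} s")
```

With several parallel connections each one only needs to reach its share of the path, so the per-connection figures from this model would be smaller. The numbers are meant for comparison with observed recovery times, not as an explanation of them.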

There is no indication that packet reordering had an impact on these (GPFS over many TCP connections) measurements.

In order to solve this problem of GPFS performance, it would be necessary to look at measurements that reflect this situation - the iperf measurements aren't really that helpful.

Complete Topology

Because the correlation between the Karlsruhe/CERN and DEISA traffic is so obvious, the PERT should focus on those places where these two traffic aggregates share resources, i.e. in Frankfurt. Therefore a complete view of the topology doesn't seem necessary here.

Possible Avenues for Improvement/Further Analysis

Go on to the real problem

As explained in #MaybeTheRealProblem, it seems that the network and end-systems are already sufficient to run the DEISA application at the desired aggregate rate, given that several parallel TCP connections will be used.

But as also explained there, previous measurements with GPFS over parallel TCP streams didn't achieve the desired throughput either. It would make sense to close this current issue, and open a new one based on GPFS measurements, as it is very unlikely that both are symptoms of the same bottleneck.

DEISA to enable SACK (Selective Acknowledgements)

Given that the throughput loss seems to be related to packet reordering, it might help to change the DeisaTcpConfiguration so that SACK (Selective Acknowledgements) is used.

DEISA to use a different TCP

SimonLeinen suggested that modern versions of TransmissionControlProtocol such as BIC or FAST might be able to better use the available capacity in the presence of cross traffic. The BIC variant has been included in the Linux kernel since version 2.6.7, although a bug has prevented it from being effective until version 2.6.11.7.

Given that the DEISA machines run IBM AIX, the options of using a different TCP may be limited.

Study/reduce reordering within the GEANT backbone

Reordering can be observed on all paths traversing the GEANT backbone, and the Juniper M-160 routers are architecturally prone to reorder packets received on 10 Gb/s links. Therefore it can be assumed that reordering is introduced in one or several M-160 routers on the path. While it would be interesting to find out more about where exactly reordering happens, and which situations lead to higher amounts of reordering, it might be more difficult to reduce or eliminate reordering.

The current plans for GEANT2 won't completely eliminate reordering in the IP part of the backbone, because some/most of the M-160 routers will be re-used. However, in GEANT2 the DEISA project could consider separate "lightpath" connections that would not use the M-160 platform for forwarding. Note that the current separation of the DEISA traffic from other GEANT traffic using MPLS doesn't help, because this separation is merely logical, and all MPLS VPNs go through the same M-160 forwarding paths together.

Therefore - at least until M-160-free lightpaths are available to the DEISA project - the best way to improve TCP performance would be to improve the resilience of the TCPs in the end systems with respect to reordering (by using SACK, for instance).

Enable QoS on DFN's cr-frankfurt1 router

We cannot completely exclude the possibility that the interaction between the Karlsruhe/CERN traffic and the DEISA traffic arises at DFN's Frankfurt core router, which sends to GEANT. Therefore it would be useful to see whether the DEISA performance changes when scheduling is enabled on cr-frankfurt1 to prioritise PremiumIP traffic and/or deprioritise LessThanBestEffort traffic.

Measurements

Reordering Measurements (3 May 2005)

Klaus Desinger from RZG (Rechenzentrum Garching of the Max-Planck-Gesellschaft) contributed the following measurements.

Thomas had suggested that we take a closer look at the number of packets arriving out-of-order, so I did tests between all DEISA sites, sending a TCP stream with iperf for 10 seconds and noted the TCP-out-of-order counter on the receiving machine before and after:

src \ dst   FZJ   IDRIS   CINECA   RZG (2.4 MB)   RZG (4 MB)
FZJ           -     198      408              0            0
IDRIS       196       -      180            130         2007
CINECA       97      24        -             68          121
RZG           0       6      100              -            -

(these were single tests on May 3rd, with no other DEISA traffic).

The good thing: There is no packet reordering between RZG and FZJ (both are within DFN's G-WiN).

About the two columns for RZG:

We have a special problem with data being sent from IDRIS to RZG, which may or may not be related to this issue. To get stable throughput IDRIS->RZG we currently have slowed things down by slightly reducing the TCP receive buffer size at RZG to 2400000 bytes. If we test with a 4 MB receive buffer (last column) we get lots of packets out of order from IDRIS.

I'm now also monitoring the number of packets we receive out of order at RZG: http://post.rzg.mpg.de/mrtg/deisy1-order.html

We seem to receive more of them during daytime than at night. But since traffic on the DFN-GEANT interface hasn't been high (>1Gbps) for a longer period lately it's not clear yet whether there is a correlation. (There was a traffic peak today around 10 a.m. corresponding to a drop in throughput IDRIS->FZJ and out-of-order packets here at RZG, but that might just have been a coincidence).

It might also be the case that we currently don't observe the drops in throughput at RZG because of the reduced receive buffer, but we will have to increase it again when other sites with larger RTTs join in.
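The link between receive buffer size and RTT follows from the fact that a TCP connection can have at most one receive window of data in flight per round trip. A small sketch of that arithmetic follows. The 2400000-byte and 4 MB buffer sizes come from the message above; the 20 ms RTT is a placeholder and not a measured IDRIS-RZG value.

```python
def window_limited_mbps(window_bytes, rtt_s):
    """Upper bound on TCP throughput (Mb/s) imposed by the receive window."""
    return window_bytes * 8 / rtt_s / 1e6


def window_needed_bytes(rate_mbps, rtt_s):
    """Receive window needed to sustain a given rate at a given RTT."""
    return rate_mbps * 1e6 * rtt_s / 8


ASSUMED_RTT_S = 0.020  # placeholder RTT, not a measured value

for label, window in (("2400000 bytes", 2_400_000),
                      ("4 MB", 4 * 1024 * 1024)):
    limit = window_limited_mbps(window, ASSUMED_RTT_S)
    print(f"{label}: at most {limit:.0f} Mb/s")

# Window needed for 900 Mb/s if the RTT were twice as large.
print(f"{window_needed_bytes(900, 2 * ASSUMED_RTT_S):.0f} bytes")
```

The bound scales inversely with RTT, which is why a buffer that is just sufficient for the current sites would cap throughput on longer paths.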

Traffic Graphs Showing Correlation Between DEISA and KA/CERN

These graphs and explanations were sent by Thomas Schmid (DFN) on April 22, 2005.

aggregate traffic received and sent over the GEANT-DFN link

The 2-3 Gb/s traffic in weeks 12, 13 and 15 comes from Karlsruhe-CERN tests.

GEANT-DFN-20040422.png

tcp iperf throughput between Jülich and IDRIS

The iperf numbers are not from permanent traffic, but from short hourly tests (1-2 min).

iperf-padded.png

interface between DFN and Geant

(suppressed for brevity, because this is just the complement to the #GeantDfn graph - see attached file.)

DE-FR GEANT trunk load, April 2005

It can be seen that the DE-FR GEANT trunk has been heavily loaded on occasions, with 1-hour averages exceeding 3 Gb/s.

de-fr-2005-04-cropped.png

Contact information for this issue:

Thomas Schmid <schmid@noc.dfn.de>

-- ChrisWelti - 25 Apr 2005 -- SimonLeinen - 17 May 2005

Topic attachments
cr-fra1_pos11_0_ifoctets-month.png (10.9 K, 25 Apr 2005 - 07:54, ChrisWelti): interface between DFN and Geant. The 2-3 Gig traffic in week 12,13 and 15 come from Karlsruhe-CERN tests
PERTDiary.DeisaThroughputReduction moved from PERTDiary.SlowPerformance on 21 Apr 2005 - 16:15 by ChrisWelti



 
GEANT2
Copyright © 2004-2005 by the contributing authors.