DS3.3.3b - Advanced Guide to Network Performance

Network Performance Metrics

There are many metrics that are commonly used to characterize the performance of networks and parts of networks. We present the most important of these, explain what influences them, how they can be measured, how they influence end-to-end performance, and what can be done to improve them.

A framework for network performance metrics has been defined by the IETF's IP Performance Metrics (IPPM) Working Group in RFC 2330. The group also developed definitions for several specific performance metrics; those are referenced from the respective sub-topics.

-- SimonLeinen - 31 Oct 2004

One-way Delay (OWD)

One-way delay is the time it takes for a packet to travel from its source to its destination. It is considered a property of network links or paths. RFC 2679 contains the definition of the one-way delay metric of the IETF's IPPM (IP Performance Metrics) working group.

Decomposition

One-way delay along a network path can be decomposed into per-hop one-way delays, and these in turn into per-link and per-node delay components.

Per-link delay components: propagation delay and serialization delay

The link-specific component of one-way delay consists of two sub-components:

Propagation Delay is the time it takes for signals to move from the sending to the receiving end of the link. On simple links, this is the product of the link's physical length and the characteristic propagation speed. The velocities of propagation (VoP) for copper and fibre optics are similar (copper is slightly faster), at approximately 2/3 the speed of light in a vacuum.

Serialization delay is the time it takes for a packet to be serialized into link transmission units (typically bits). It is the packet size (in bits) divided by the link's capacity (in bits per second).

In addition to the propagation and serialization delays, some types of links may introduce additional delays, for example to avoid collisions on a shared link, or when the link-layer transparently retransmits packets that were damaged during transmission.

Per-node delay components: forwarding delay, queueing delay

Within a network node such as a router, a packet experiences different kinds of delay between its arrival on one link and its departure on another (or the same) link:

Forwarding delay is the time it takes for the node to read forwarding-relevant information (typically the destination address and other headers) from the packet, to compute the "forwarding decision" based on routing tables and other information, and to actually forward the packet towards the destination. The last step involves copying the packet to a different interface inside the node, rewriting parts of it (such as the IP TTL and any media-specific headers), and possibly other processing such as fragmentation, accounting, or checking access control lists.

Depending on router architecture, forwarding can compete for resources with other activities of the router. In this case, packets can be held up until the router's forwarding resource is available, which can take many milliseconds on a router with CPU-based forwarding. Routers with dedicated hardware for forwarding don't have this problem, although there may be delays when a packet arrives as the hardware forwarding table is being reprogrammed due to a routing change.

Queueing delay is the time a packet has to wait inside the node for the output link to become available. Queueing delay depends on the amount of competing traffic towards the output link, and on the priorities of the packet itself and the competing traffic. The amount of queueing that a given packet may encounter is hard to predict, but it is bounded by the buffer size available for queueing.

There can be causes for queueing other than contention on the outgoing link, for example contention on the node's backplane interconnect.

Impact on end-to-end performance

When studying end-to-end performance, it is usually more interesting to look at the following metrics that are derived from one-way delay:

  • Round-trip time (RTT) is the time from node A to B and back to A. It is the sum of the one-way delays from A to B and from B to A, plus the response time in B.
  • Delay Variation represents the variation in one-way delay. It is important for real-time applications. People often call this "jitter".

Measurement

One-way delays from a node A to a node B can be measured by sending timestamped packets from A, and recording the reception times at B. The difficulty is that A and B need clocks that are synchronized to each other. This is typically achieved by having clocks synchronized to a standard reference time such as UTC (Universal Time Coordinated) using techniques such as Global Positioning System (GPS)-derived time signals or the Network Time Protocol (NTP).
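
As an illustration of the principle, here is a minimal sketch of such a measurement over UDP. The port number and packet count are arbitrary assumptions; production tools such as OWAMP add calibration, error estimation, and authentication:

# owd_sketch.py -- minimal one-way delay measurement (illustrative only).
# Assumes the clocks of sender and receiver are already synchronized,
# e.g. via GPS or NTP; any residual offset appears directly in the result.
import socket, struct, time

PORT = 9000          # hypothetical port for the measurement stream

def sender(receiver_host, n=100):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(n):
        # 8-byte sequence number + 8-byte send timestamp (ns since epoch)
        s.sendto(struct.pack("!qq", seq, time.time_ns()),
                 (receiver_host, PORT))
        time.sleep(0.01)

def receiver():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    while True:
        data, _ = s.recvfrom(64)
        seq, t_send = struct.unpack("!qq", data)
        owd_ms = (time.time_ns() - t_send) / 1e6
        print(f"packet {seq}: one-way delay {owd_ms:.3f} ms")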

There are several infrastructures that continuously measure one-way delays and packet loss: The HADES Boxes in DFN and GÉANT2; RIPE TTM Boxes between various research and commercial networks, and RENATER's QoSMetrics boxes.

OWAMP and RUDE/CRUDE are examples of tools that can be used to measure one-way delay.

Improving delay

Shortest-(Physical)-Path Routing and Proper Provisioning

On high-speed wide-area network paths, delay is usually dominated by propagation times. Therefore, physical routing of network links plays an important role, as well as the topology of the network and the selection of routing metrics. Ensuring minimal delays is simply a matter of

  • using an internal routing protocol with "shortest-path routing" (such as OSPF or IS-IS) and a metric proportional to per-link delay
  • provisioning the network so that these shortest paths aren't congested.

This could be paraphrased as "use proper provisioning instead of traffic engineering".

The node contributions to delay can be addressed by:

  • using nodes with fast forwarding
  • avoiding queueing by provisioning links to accommodate typical traffic bursts
  • reducing the number of forwarding nodes

-- SimonLeinen - 31 Oct 2004 - 25 Jan 2007

Propagation Delay

The propagation delay is the time it takes for a signal to propagate. It depends on the distance traveled, and the specific propagation speed of the medium. For instance, information transmitted via radio or through copper cables will travel at a speed close to c (speed of light in vacuum, ~300000 km/s). The prevalent medium for long-distance digital transmission is now light in optical fibers, where the propagation speed is about 2/3 c, i.e. 200000 km/s.

Propagation delay, along with serialization delay and processing delays in nodes such as routers, is a component of overall delay/RTT. For uncongested long-distance network paths, it is usually the dominant component.

Examples

Here are a few examples for propagation delay components of one-way and round-trip delays over selected distances in fiber.

Fibre length   One-way delay   Round-trip time
1 m            5 ns            10 ns
1 km           5 µs            10 µs
10 km          50 µs           100 µs
100 km         500 µs          1 ms
1000 km        5 ms            10 ms
10000 km       50 ms           100 ms
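
These values follow directly from dividing the fibre length by the assumed propagation speed of 200000 km/s, as this small check shows:

# Propagation delay in fibre, assuming a propagation speed of 2/3 c
# (about 200000 km/s).
SPEED_KM_PER_S = 200_000.0
for km in (0.001, 1, 10, 100, 1_000, 10_000):
    one_way_s = km / SPEED_KM_PER_S
    print(f"{km:>8g} km: one-way {one_way_s:.1e} s, "
          f"round-trip {2 * one_way_s:.1e} s")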

-- SimonLeinen - 28 Feb 2006

Serialization Delay (or Transmission Delay)

Serialization delay is the time it takes for a unit of data, such as a packet, to be serialized for transmission on a narrow (e.g. serial) channel such as a cable. Serialization delay is dependent on size, which means that longer packets experience longer delays over a given network path. Serialization delay is also dependent on channel capacity ("bandwidth"), which means that for equal-size packets, the faster the link, the lower the serialization delay.

Serialization delays are incurred at processing nodes, when packets are stored and copied between links and (router/switch) buffers. This includes the copying over internal links in processing nodes, such as router backplanes/switching fabrics.

In the core of the Internet, serialization delay has largely become a non-issue, because link speeds have increased much faster over the past years than packet sizes. As a consequence, the "hop count" shown by e.g. traceroute is a bad predictor for delay today.

Example Serialization Delays

To illustrate the effects of link rates and packet sizes on serialization delay, here is a table of some representative values. Note that the maximum packet size for most computers is 1500 bytes today, but 9000-byte "jumbo frames" are already supported by many research networks.

Packet Size   64 kb/s    1 Mb/s     10 Mb/s    100 Mb/s   1 Gb/s     10 Gb/s
64 bytes      8 ms       0.512 ms   51.2 µs    5.12 µs    0.512 µs   51.2 ns
512 bytes     64 ms      4.096 ms   409.6 µs   40.96 µs   4.096 µs   409.6 ns
1500 bytes    187.5 ms   12 ms      1.2 ms     120 µs     12 µs      1.2 µs
9000 bytes    1125 ms    72 ms      7.2 ms     720 µs     72 µs      7.2 µs
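
The table entries follow directly from the formula given above (packet size in bits divided by the link rate); a small sketch that reproduces them:

# Serialization delay = packet size (bits) / link rate (bits per second).
for size_bytes in (64, 512, 1500, 9000):
    row = [f"{size_bytes * 8 / rate:.3g} s"
           for rate in (64e3, 1e6, 10e6, 100e6, 1e9, 10e9)]
    print(f"{size_bytes:>5} bytes:", "  ".join(row))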

-- SimonLeinen - 28 Oct 2004 - 17 Jun 2010

Round-Trip Time (RTT)

Round-trip time (RTT) is the total time for a packet sent by a node A to reach its destination B, and for a response sent back by B to reach A.

Decomposition

The round-trip time is the sum of the one-way delays from A to B and from B to A, and of the time it takes B to send the response.

Impact on end-to-end performance

For window-based transport protocols such as TCP, the round-trip time influences the achievable throughput at a given window size, because there can only be a window's worth of unacknowledged data in the network, and the RTT is the lower bound for a packet to be acknowledged.

For interactive applications such as conversational audio/video, instrument control, or interactive games, the RTT represents a lower bound on response time, and thus impacts responsiveness directly.

Measurement

The round-trip time is often measured with tools such as ping (Packet InterNet Groper) or one of its cousins such as fping, which send ICMP Echo requests to a destination, and measure the time until the corresponding ICMP Echo Reply messages arrive.

However, note that while the round-trip time reported by ping is a relatively precise measurement, some network devices handle ICMP in a slow processing path or rate-limit it, so the measured values may not correspond to the delays experienced by regular traffic.

-- MichalPrzybylski - 19 Jul 2005

Improving the round-trip time

The network components of RTT are the one-way delays in both directions (which can use different paths), so see the OneWayDelay topic on how those can be improved. The speed of response generation can be improved through upgrades of the responding host's processing power, or by optimizing the responding program.

-- SimonLeinen - 31 Oct 2004

Bandwidth-Delay Product (BDP)

The BDP of a path is the product of the (bottleneck) bandwidth and the delay of the path. Its dimension is "information", because bandwidth (here) expresses information per time, and delay is a time (duration). Typically, one uses bytes as a unit, and it is often useful to think of BDP as the "memory capacity" of a path, i.e. the amount of data that fits entirely into the path (between two end-systems).

BDP is an important parameter for the performance of window-based protocols such as TCP. Network paths with a large bandwidth-delay product are called Long Fat Networks or "LFNs".
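
As a worked example (with illustrative numbers): a 1 Gb/s path with 100 ms round-trip time has a BDP of 100 Mbit, i.e. 12.5 MBytes, and a window-based protocol needs at least that much window to keep the path full:

# Bandwidth-delay product, and the window needed to fill a path.
def bdp_bytes(bandwidth_bps, rtt_s):
    return bandwidth_bps * rtt_s / 8

# Example: 1 Gb/s bottleneck, 100 ms RTT -> 12.5 MB "in flight".
print(bdp_bytes(1e9, 0.100))        # 12500000.0 bytes
# Conversely, a window limits throughput to window/RTT:
print(12.5e6 / 0.100 * 8 / 1e9)     # 1.0 (Gb/s)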

-- SimonLeinen - 14 Apr 2005

"Long Fat Networks" (LFNs)

Long Fat Networks (LFNs, pronounced like "elephants") are networks with a high bandwidth-delay product.

One of the issues with this type of network is that it can be challenging to achieve high throughput for individual data transfers with window-based transport protocols such as TCP. LFNs are thus a main focus of research on high-speed improvements for TCP.

-- SimonLeinen - 27 Oct 2004 - 17 Jun 2005

Delay Variation ("Jitter")

Delay variation or "jitter" is a metric that describes the level of disturbance of packet arrival times with respect to an "ideal" pattern, typically the pattern in which the packets were sent. Such disturbances can be caused by competing traffic (i.e. queueing), or by contention on processing resources in the network.

RFC 3393 defines an IP Delay Variation Metric (IPDV). This particular metric only compares the delays experienced by packets of equal size, on the grounds that delay is naturally dependent on packet size, because of serialization delay.

Delay variation is an issue for real-time applications such as audio/video conferencing systems. They usually employ a jitter buffer to eliminate the effects of delay variation.

Delay variation is related to packet reordering. But note that the RFC 3393 IPDV of a network can be arbitrarily low, even zero, even though that network reorders packets, because the IPDV metric only compares delays of equal-sized packets.

Decomposition

Jitter is usually introduced in network nodes (routers), as an effect of queueing or contention for forwarding resources, especially on CPU-based router architectures. Some types of links can also introduce jitter, for example through collision avoidance (shared Ethernet) or link-level retransmission (802.11 wireless LANs).

Measurement

Contrary to one-way delay, one-way delay variation can be measured without requiring precisely synchronized clocks at the measurement endpoints. Many tools that measure one-way delay also provide delay variation measurements.
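
The reason clock synchronization is not needed: a constant clock offset between the two hosts shifts every measured one-way delay by the same amount, so it cancels when the delays of consecutive packets are subtracted. A minimal sketch (timestamps are illustrative):

# IPDV sketch: differences between one-way delays of consecutive packets.
# A constant clock offset between sender and receiver adds the same amount
# to every measured delay, so it cancels in the differences; this is why
# delay variation needs no precise clock synchronization.
def ipdv(send_times, recv_times):
    delays = [r - s for s, r in zip(send_times, recv_times)]
    return [d2 - d1 for d1, d2 in zip(delays, delays[1:])]

# Example with a huge (2000 time-unit) receiver clock offset; the IPDV
# values are unaffected by it.
print(ipdv([0, 10, 20], [2005, 2014, 2026]))   # [-1, 2]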

References

The IETF IPPM (IP Performance Metrics) Working Group has formalized metrics for IPDV, and more recently started work on an "applicability statement" that explains how IPDV can be used in practice and what issues have to be considered.

  • RFC 3393, IP Packet Delay Variation Metric for IP Performance Metrics (IPPM), C. Demichelis, P. Chimento. November 2002.
  • draft-morton-ippm-delay-var-as-03.txt, Packet Delay Variation Applicability Statement, A. Morton, B. Claise, July 2007.

-- SimonLeinen - 28 Oct 2004 - 24 Jul 2007

Packet Loss

Packet loss is the probability of a packet being lost in transit from a source to a destination.

A One-way Packet Loss Metric for IPPM is defined in RFC 2680. RFC 3357 contains One-way Loss Pattern Sample Metrics.

Decomposition

There are two main reasons for packet loss:

Congestion

When the offered load exceeds the capacity of a part of the network, packets are buffered in queues. Since these buffers are also of limited capacity, severe congestion can lead to queue overflows, which lead to packet drops. In this context, severe congestion could mean that a moderate overload condition holds for an extended amount of time, but could also consist of the sudden arrival of a very large amount of traffic (burst).

Errors

Another reason for loss of packets is corruption, where parts of the packet are modified in-transit. When such corruptions happen on a link (due to noisy lines etc.), this is usually detected by a link-layer checksum at the receiving end, which then discards the packet.

Impact on end-to-end performance

Bulk data transfers usually require reliable transmission, so lost packets must be retransmitted. In addition, congestion-sensitive protocols such as TCP must assume that packet loss is due to congestion, and reduce their transmission rate accordingly (although recently there have been some proposals to allow TCP to identify non-congestion related losses and treat those differently).

For real-time applications such as conversational audio/video, it usually doesn't make much sense to retransmit lost packets, because the retransmitted copy would arrive too late (see delay variation). The result of packet loss is usually a degradation in sound or image quality. Some modern audio/video codecs provide a level of robustness to loss, so that the effect of occasional lost packets is benign. On the other hand, some of the most effective image compression methods are very sensitive to loss, in particular those that use relatively rare "anchor frames" and represent the intermediate frames by compressed differences to these anchor frames: when such an anchor frame is lost, many other frames cannot be reconstructed.

Measurement

Packet loss can be actively measured by sending a set of packets from a source to a destination and comparing the number of received packets against the number of packets sent.
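
A minimal sketch of this counting approach, assuming the probe packets carry sequence numbers as in the one-way delay example above:

# Loss ratio from a set of numbered probe packets.
def loss_ratio(sent_count, received_seqs):
    received = len(set(received_seqs))     # ignore duplicated packets
    return (sent_count - received) / sent_count

print(loss_ratio(1000, range(0, 990)))     # 0.01, i.e. 1% loss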

Network elements such as routers also contain counters for events such as checksum errors or queue drops, which can be retrieved through protocols such as SNMP. When this kind of access is available, this can point to the location and cause of packet losses.

Reducing packet loss

Congestion-induced packet loss can be avoided by proper provisioning of link capacities. Depending on the probability of bursts (which is somewhat difficult to estimate, taking into account both link capacities in the network, and the traffic rates and patterns of a possibly large number of hosts at the edge of the network), buffers in network elements such as routers must also be sufficient. Note that large buffers can be detrimental to other aspects of network performance, in particular one-way delay (and thus round-trip time) and delay variation.

Quality-of-Service mechanisms such as DiffServ or IntServ can be used to protect some subset of traffic against loss, but this necessarily increases packet loss for the remaining traffic.

Lastly, Active Queue Management (AQM) and Explicit Congestion Notification (ECN) can be used to mitigate both packet loss and queueing delay (and thus one-way delay, round-trip time and delay variation).

References

  • A One-way Packet Loss Metric for IPPM, G. Almes, S. Kalidindi, M. Zekauskas, September 1999, RFC 2680
  • Improving Accuracy in End-to-end Packet Loss Measurement, J. Sommers, P. Barford, N. Duffield, A. Ron, August 2005, SIGCOMM'05 (PDF)
-- SimonLeinen - 01 Nov 2004

Packet Reordering

The Internet Protocol (IP) does not guarantee that packets are delivered in the order in which they were sent. This was a deliberate design choice that distinguishes IP from protocols such as, for instance, ATM and IEEE 802.3 Ethernet.

Decomposition

There are several reasons why a network may reorder packets, usually involving some kind of parallelism: either a choice of alternative routes (Equal Cost Multipath, ECMP), or internal parallelism inside switching elements such as routers. One particular kind of packet reordering concerns packets of different sizes: because a larger packet takes longer to transfer over a serial link (or a limited-width backplane inside a router), larger packets may be "overtaken" by smaller packets that were sent subsequently. This is usually not a concern for high-speed bulk transfers, where the segments tend to be equal-sized (hopefully Path MTU-sized), but may pose problems for naive implementations of multi-media (audio/video) transport.

Impact on end-to-end performance

In principle, applications that use a transport protocol such as TCP or SCTP don't have to worry about packet reordering, because the transport protocol is responsible for reassembling the byte stream (message stream(s) in the case of SCTP) into the original ordering. However, reordering can have a severe performance impact on some implementations of TCP. Recent TCP implementations, in particular those that support Selective Acknowledgements (SACK), can exhibit robust performance even in the face of reordering in the network.

Real-time media applications such as audio/video conferencing tools often experience problems when run over networks that reorder packets. This is somewhat remarkable in that all of these applications have jitter buffers to eliminate the effects of delay variation on the real-time media streams. Obviously, the code that manages these jitter buffers is often not written to accommodate reordered packets sensibly, although this could be done with moderate effort.

Measurement

Packet reordering is measured by sending a numbered sequence of packets, and comparing the received sequence number sequence with the original. There are many possible ways of quantifying reordering, some of which are described in the IPPM Working Group's documents (see the references below).

The measurement can be done differently depending on its purpose:

a) reordering for a particular application can be measured by capturing the application traffic (e.g. using the Wireshark/Ethereal tool), injecting the same traffic pattern via a traffic generator, and calculating the reordering.

b) the maximal reordering introduced by the network can be measured by injecting a relatively small amount of traffic at line rate, shaped as a short burst of long packets immediately followed by a short burst of short packets. After capture and calculation at the other end of the path, the results will reflect the worst packet reordering which may occur on the particular path.

For more information please refer to the measurement page, available at the packet reordering measurement site. Note that although the tool is based on older versions of the reordering drafts, the metrics used are compatible with the definitions in the newer ones.

In particular, Reorder Density and Reorder Buffer-Occupancy Density can be obtained from tcpdump by subjecting the dump files to RATtool, available at the Reordering Analysis Tool website.

Improving reordering

The probability of reordering can be reduced by avoiding parallelism in the network, or by using network nodes that take care to use parallel paths in such a way that packets belonging to the same end-to-end flow are kept on a single path. This can be ensured by using a hash on the destination address, or the (source address, destination address) pair to select from the set of available paths. For certain types of parallelism this is hard to achieve, for example when a simple striping mechanism is used to distribute packets from a very high-speed interface to multiple forwarding paths.
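
A minimal sketch of such hash-based path selection (the hash function and field choice are illustrative; real routers typically hash additional header fields such as ports):

import hashlib

def select_path(src_addr, dst_addr, n_paths):
    # Hash the (source, destination) pair so that all packets of a flow
    # map to the same path index, keeping per-flow packet order intact.
    key = f"{src_addr}-{dst_addr}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# All packets between these two hosts use the same one of 4 paths:
print(select_path("192.0.2.1", "198.51.100.7", 4))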

-- SimonLeinen - 28 Oct 2004 - 04 Jun 2008
-- MichalPrzybylski - 19 Jul 2005
-- BartoszBelter - 28 Mar 2006

Maximum Transmission Unit (MTU)

The MTU (or to be exact, 'protocol MTU') of a link (logical IP subnet) describes the maximum size of an IP packet that can be transferred over the link without fragmentation. Common MTUs include

  • 1500 bytes (Ethernet, 802.11 WLAN)
  • 4470 bytes (FDDI, common default for POS and serial links)
  • 9000 bytes (Internet2 and GÉANT convention, limit of some Gigabit Ethernet adapters)
  • 9180 bytes (ATM, SMDS)

For entire network paths, see PathMTU. For specific information on configuring large (Jumbo) MTUs, see JumboMTU.

The term 'media MTU' refers to the maximum-sized Layer 2 PDU that a given interface can support. The media MTU must be equal to or greater than the sum of the protocol MTU and the Layer 2 header size.

-- SimonLeinen - 27 Oct 2004
-- TobyRodwell - 24 Jan 2005
-- MarcoMarletta - 20 Jan 2006

Path MTU

The Path MTU is the Maximum Transmission Unit (MTU) supported by a network path. It is the minimum of the MTUs of the links (segments) that make up the path. Larger Path MTUs generally allow for more efficient data transfers, because source and destination hosts, as well as the switching devices (routers) along the network path, have to process fewer packets. However, it should be noted that modern high-speed network adapters have mechanisms such as LSO (Large Send Offload) and Interrupt Coalescence that diminish the influence of MTUs on performance. Furthermore, routers are typically dimensioned to sustain very high packet loads (so that they can resist denial-of-service attacks), so the packet rates caused by high-speed transfers are not normally an issue for today's high-speed networks.

The prevalent Path MTU on the Internet is now 1500 bytes, the Ethernet MTU. There are some initiatives to support larger MTUs (JumboMTU) in networks, in particular on research networks. But their usability is hampered by last-mile issues and lack of robustness of RFC 1191 Path MTU Discovery.

Path MTU Discovery Mechanisms

Traditional (RFC1191) Path MTU Discovery

RFC 1191 describes a method for a sender to detect the Path MTU to a given receiver. (RFC 1981 describes the equivalent for IPv6.) The method works as follows:

  • The sending host will send packets to off-subnet destinations with the "Don't Fragment" bit set in the IP header.
  • The sending host keeps a cache containing Path MTU estimates per destination host address. This cache is often implemented as an extension of the routing table.
  • The Path MTU estimate for a new destination address is initialized to the MTU of the outgoing interface over which the destination is reached according to the local routing table.
  • When the sending host receives an ICMP "Too Big" (or "Fragmentation Needed and Don't Fragment Bit Set") destination-unreachable message, it learns that the Path MTU to that destination is smaller than previously assumed, and updates the estimate accordingly.
  • Normally, an ICMP "Too Big" message contains the next-hop MTU, and the sending host will use that as the new Path MTU estimate. The estimate can still be wrong because a subsequent link on the path may have an even smaller MTU.
  • For destination addresses with a Path MTU estimate lower than the outgoing interface MTU, the sending host will occasionally attempt to raise the estimate, in case the path has changed to support a larger MTU.
  • When trying ("probing") a larger Path MTU, the sending host can use a list of "common" MTUs, such as the MTUs associated with popular link layers, perhaps combined with popular tunneling overheads. This list can also be used to guess a smaller MTU in case an ICMP "Too Big" message is received that doesn't include any information about the next-hop MTU (maybe from a very very old router).

This method is widely implemented, but is not robust in today's Internet, because it relies on ICMP packets sent by routers along the path. Such packets are often suppressed, either at the router that should generate them (to protect its resources) or on the way back to the source, because of firewalls and other packet filters or rate limitations. These problems are described in RFC 2923. When packets are lost due to MTU issues without any ICMP "Too Big" message, this is sometimes called an (MTU) black hole. Some operating systems have added heuristics to detect such black holes and work around them. Workarounds can include lowering the MTU estimate or disabling PMTUD for certain destinations.

Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821)

An IETF Working Group (pmtud) was chartered to define a new mechanism for Path MTU Discovery to solve these issues. This process resulted in RFC 4821, Packetization Layer Path MTU Discovery ("PLPMTUD"), which was published in March 2007. This scheme requires cooperation from a network layer above IP, namely the layer that performs "packetization". This could be TCP, but could also be a layer above UDP, let's say an RPC or file transfer protocol. PLPMTUD does not require ICMP messages. The sending packetization layer starts with small packets, and probes progressively larger sizes. When there's an indication that a larger packet was successfully transmitted to the destination (presumably because some sort of ACK was received), the Path MTU estimate is raised accordingly.

When a large packet was lost, this might have been due to an MTU limitation, but it might also be due to other causes, such as congestion or a transmission error - or maybe it's just the ACK that was lost! PLPMTUD recommends that the first time this happens, the sending packetization layer should assume an MTU issue, and try smaller packets. An isolated incident need not be interpreted as an indication of congestion.
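
The search can be pictured as follows. In this sketch, send_probe() is a hypothetical helper standing in for the packetization layer sending a probe segment of the given size and reporting whether it was acknowledged; this is a simplification of RFC 4821, which is more careful about when a loss may be attributed to MTU limits:

# Simplified PLPMTUD-style search (illustrative; RFC 4821 is more subtle
# about confirming losses and the timing of probes).
def discover_pmtu(send_probe, low=1280, high=9000):
    # Invariant: a 'low'-sized packet is known to get through,
    # a packet larger than 'high' is not worth trying.
    while low < high:
        probe = (low + high + 1) // 2
        if send_probe(probe):       # ACK received: path supports this size
            low = probe
        else:                       # loss: assume an MTU limit (see text)
            high = probe - 1
    return low

print(discover_pmtu(lambda size: size <= 1500))   # -> 1500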

An implementation of the new scheme for the Linux kernel was integrated into version 2.6.17. It is controlled by a "sysctl" value that can be observed and set through /proc/sys/net/ipv4/tcp_mtu_probing. Possible values are:

  • 0: Don't perform PLPMTUD
  • 1: Perform PLPMTUD only after detecting a "blackhole" in old-style PMTUD
  • 2: Always perform PLPMTUD, and use the value of tcp_base_mss as the initial MSS.

A user-space implementation over UDP is included in the VFER bulk transfer tool.

Implementations

  • for Linux 2.6 - integrated in mainstream kernel as of 2.6.17. However, it is disabled by default (see net.ipv4.tcp_mtu_probing sysctl)
  • for NetBSD

-- HankNussbacher - 03 Jul 2005
-- SimonLeinen - 19 Jul 2006 - 01 Nov 2010

Network Protocols

This section contains information about a few common Internet protocols, with a focus on performance questions.

For information about higher-level protocols, see under application protocols.

-- SimonLeinen - 31 Oct 2004

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream implementations (after thorough review). Recently there has been an upsurge in work towards improving TCP's behavior on LongFatNetworks. The standard congestion control algorithms have been shown to limit efficiency in network resource utilization on such paths. The various new TCP variants change how the size of the congestion window is determined; some of them require extra feedback from the network. Generally, such congestion control protocols are divided as follows.

  • implicit congestion control protocols (rely on implicit congestion signals such as loss or delay)
  • explicit congestion control protocols (require explicit feedback about congestion from the network, as XCP does)

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. In addition, several studies compare the performance of the various new TCP variants against each other.

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- SimonLeinen - 28 May 2005 - 03 Feb 2011
-- OrlaMcGann - 10 Oct 2005
-- ChrisWelti - 25 Feb 2008

HighSpeed TCP

HighSpeed TCP is a modification to the TCP congestion control mechanism, proposed by Sally Floyd of ICIR (The ICSI Center for Internet Research). The main problem with standard TCP is that it takes a long time to fully recover from packet loss on high-bandwidth, long-distance connections. Like the other new TCP variants, HighSpeed TCP proposes a modification of the congestion control mechanism for use with TCP connections with large congestion windows. For standard TCP with a steady-state packet loss rate p, the average congestion window is 1.2/sqrt(p) segments. This places a serious constraint on achievable throughput in realistic environments.

Achieving an average TCP throughput of B bps requires a loss event at most every BR/(12D) round-trip times, and an average congestion window W of BR/(8D) segments, where R is the round-trip time and D the packet size. For a round-trip time R = 0.1 seconds and 1500-byte packets (D), a throughput of 10 Gbps would require an average congestion window of 83000 segments and a packet drop rate of at most 2×10^-10, i.e. at most one congestion event every 5,000,000,000 packets, or equivalently at most one congestion event every 1 2/3 hours. This is an unrealistic constraint, and it severely limits TCP connections with larger congestion windows. What is needed is a way to achieve high per-connection throughput without requiring unrealistically low packet loss rates.

HighSpeed TCP proposes changing the TCP response function w = 1.2/sqrt(p) to achieve high throughput with more realistic requirements for the steady-state packet drop rate. The modified HighSpeed TCP function uses three parameters: Low_Window, High_Window, and High_P. HighSpeed TCP uses the same response function as Standard TCP when the current congestion window is at most Low_Window, and the HighSpeed response function when the current congestion window is greater than Low_Window. For an average congestion window w greater than Low_Window, the response function is

HSTCP_w.png

where,

  • Low_P is the packet drop rate corresponding to Low_Window, using the Standard TCP response function,
  • S is a constant as follows

HSTCP_S.png

The window needed to sustain 10 Gbps throughput with a HighSpeed TCP connection is 83000 segments, so High_Window can be set to this value. The packet drop rate needed in the HighSpeed response function to achieve an average congestion window of 83000 segments is 10^-7. For information on how to set the remaining parameters, see Sally Floyd's document HighSpeed TCP for Large Congestion Windows. With appropriately calibrated values, larger congestion windows can be sustained with realistic intervals between congestion events, achieving greater throughput.
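
For illustration, here is a sketch of the response function with the default parameters from RFC 3649 (Low_Window = 38, High_Window = 83000, Low_P = 10^-3, High_P = 10^-7); treat it as a reading aid for the formula above rather than a reference implementation:

import math

LOW_WINDOW, HIGH_WINDOW = 38, 83000
LOW_P, HIGH_P = 1e-3, 1e-7          # drop rates at the two window sizes

# Exponent S from the formula above: the slope of the log-log line
# through (Low_P, Low_Window) and (High_P, High_Window).
S = (math.log(HIGH_WINDOW) - math.log(LOW_WINDOW)) / \
    (math.log(HIGH_P) - math.log(LOW_P))

def response(p):
    if p >= LOW_P:                  # Standard TCP regime
        return 1.2 / math.sqrt(p)
    return LOW_WINDOW * (p / LOW_P) ** S

print(round(response(1e-7)))        # ~83000 segments at p = 10^-7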

HighSpeed TCP also modifies the increase parameter a(w) and the decrease parameter b(w) of AIMD (Additive Increase, Multiplicative Decrease). For Standard TCP, a(w) = 1 and b(w) = 1/2, regardless of the value of w. HighSpeed TCP uses the same values of a(w) and b(w) for w lower than Low_Window. The parameters a(w) and b(w) for HighSpeed TCP are specified below.

HSTCP_aw.png

HSTCP_bw.png

High_Decrease = 0.1 means a decrease of 10% of the congestion window when a congestion event occurs.

For example, for w = 83000, the parameters a(w) and b(w) for Standard TCP and HSTCP are given in the table below.

        TCP   HSTCP
a(w)    1     72
b(w)    0.5   0.1

Summary

Results from HighSpeed TCP and Standard TCP connections sharing a bottleneck lead to the following conclusions. HSTCP flows share the available bandwidth fairly among themselves. HSTCP may take longer to converge than Standard TCP. The great disadvantage is that a HighSpeed TCP flow starves a Standard TCP flow even at relatively low or moderate bandwidth. The author of the HSTCP specification argues that there are not a lot of TCP connections effectively operating in this regime today, with large congestion windows, and that therefore the benefits of the HighSpeed response function outweigh the unfairness experienced by Standard TCP in this regime. Another benefit of HSTCP is that it is easy to deploy, and no router support is needed.

Implementations

HS-TCP is included in Linux as a selectable option in the modular TCP congestion control framework. A few problems with the implementation in Linux 2.6.16 were found by D.X. Wei; these were fixed as of Linux 2.6.19.2.

A proposed OpenSolaris project foresees the implementation of HS-TCP and several other congestion control algorithms for OpenSolaris.

-- WojtekSronek - 13 Jul 2005
-- HankNussbacher - 06 Oct 2005
-- SimonLeinen - 03 Dec 2006 - 17 Nov 2009

Hamilton TCP

Hamilton TCP (H-TCP), developed at the Hamilton Institute, is an enhancement to the existing TCP congestion control protocol, with a sound mathematical and empirical basis. The design goal is that H-TCP should behave like regular TCP in standard LAN and WAN networks, where RTT and link capacity are not extremely high, but on long-distance fast networks it is far more aggressive and more flexible in adjusting to the available bandwidth while avoiding packet losses. Two modes are therefore used for congestion control, and mode selection depends on the time since the last experienced congestion. In the regular TCP window control algorithm, the window is increased by adding a constant value α and decreased by multiplying by β; the corresponding default values are 1 and 0.5. In the H-TCP case, the window calculation is a bit more complex, because both factors are not constants but are calculated during transmission.

The parameter values are estimated as follows. For each acknowledgment set:

HTCP_alpha.png

and then

HTCP_alpha2.png

On each congestion event set:

HTCP_beta.png

Where,

  • Δ_i is the time that elapsed since the last congestion experienced by i'th source,
  • Δ_L is the threshold for switching between fast and slow modes,
  • B_i(k) is the throughput of flow i immediately before the k'th congestion event,
  • RTT_max,i and RTT_min,i are maximum and minimum round trip time values for the i'th flow.

The α value depends on the time between experienced congestion events, while β depends on the RTT and achieved bandwidth values. The constant values, as well as the Δ_L value, can be modified in order to adjust the algorithm's behavior to user expectations, but the default values were determined empirically and seem to be the most suitable in most cases. The figures below present the behavior of the congestion window, throughput and RTT values during tests performed by the H-TCP authors. In the first case the β value was fixed at 0.5; in the second case the adaptive approach (explained by the equations above) was used. The throughput in the second case is far more stable and approaches its maximal value.
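
In code form, these update rules look roughly as follows; this is a sketch based on Leith and Shorten's description (the quadratic increase function and the 0.2 throughput-change threshold are the commonly cited defaults; consult the paper for the exact definitions):

# H-TCP increase/decrease factors; a sketch following Leith and Shorten.
# delta and DELTA_L are in seconds; beta is the current backoff factor.
DELTA_L = 1.0            # default threshold between the two modes

def htcp_alpha(delta, beta):
    if delta <= DELTA_L:
        a = 1.0                               # low-speed mode: standard TCP
    else:
        t = delta - DELTA_L
        a = 1.0 + 10.0 * t + 0.25 * t * t     # high-speed quadratic increase
    return 2.0 * (1.0 - beta) * a             # scaling used by H-TCP

def htcp_beta(b_prev, b_now, rtt_min, rtt_max):
    if abs(b_now - b_prev) / b_prev > 0.2:    # throughput changed sharply:
        return 0.5                            # fall back to standard backoff
    return rtt_min / rtt_max                  # adaptive backoff otherwise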

HTCP_figure1.jpg

HTCP_figure2.jpg

H-TCP congestion window and throughput behavior (source: "H-TCP: TCP for high-speed and long-distance networks", D. Leith, R. Shorten).

Summary

Another advantage of H-TCP is fairness and friendliness: the algorithm is not "greedy" and will not consume the whole link capacity, but is able to share the bandwidth fairly with other H-TCP transmissions as well as with any other TCP-like transmissions. H-TCP congestion window control improves the dynamics of the sending rate and therefore provides better overall throughput than regular TCP implementations. The Hamilton Institute web page (http://www.hamilton.ie/net/htcp/) offers more information, including papers, test results and algorithm implementations for Linux kernels 2.4 and 2.6.

-- WojtekSronek - 13 Jul 2005 -- SimonLeinen - 30 Jan 2006 - 14 Apr 2008

TCP Westwood

Standard TCP implementations rely on packet loss as an indicator of network congestion. The problem is that TCP cannot distinguish congestion loss from loss caused by noisy links, and consequently reacts to either with a drastic reduction of the congestion window. In wireless connections, overlapping radio channels, signal attenuation and other noise sources cause a large share of such losses.

TCP Westwood (TCPW) is a small modification of the Standard TCP congestion control algorithm. When the sender perceives that congestion has appeared, it uses the estimated available bandwidth to set the congestion window and slow start threshold sizes. TCP Westwood avoids overly drastic reductions of these values and ensures both faster recovery and effective congestion avoidance. It does not require any support from lower and higher layers, and does not need any explicit congestion feedback from the network.

Bandwidth measurement in TCP Westwood relies on a simple relationship between the amount of data sent by the sender and the time at which acknowledgments are received. The bandwidth estimation process is somewhat similar to the RTT estimation in Standard TCP, with additional refinements. When no ACKs arrive because no packets were sent, the estimated value decays towards zero. The formulas for the bandwidth estimation process are given in many documents about TCP Westwood; such documents are listed on the UCLA Computer Science Department website.

There are some issues that can potentially disrupt the bandwidth estimation process, such as delayed or cumulative ACKs indicating the wrong order of received segments. The source must therefore keep track of the number of duplicate ACKs, and should be able to detect delayed ACKs and act accordingly.

As mentioned above, the general idea is to use the bandwidth estimate to set the congestion window and the slow start threshold after a congestion episode. The following pseudocode shows how these values are set:

if (n duplicate ACKs are received) {
   /* size the window to match the estimated bandwidth-delay product */
   ssthresh = (BWE * RTTmin) / seg_size;
   if (cwin > ssthresh) cwin = ssthresh; /* enter congestion avoidance */
}

where

  • BWE is the estimated bandwidth,
  • RTTmin is the smallest RTT value observed during the connection's lifetime,
  • seg_size is the length of the TCP segment payload.
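
For intuition, BWE can be thought of as a low-pass filter over per-ACK bandwidth samples (bytes acknowledged divided by the time since the previous ACK). The sketch below uses a plain exponentially weighted moving average as a simplified stand-in for the filter actually specified in the TCP Westwood papers; the gain value is an arbitrary assumption:

# Simplified bandwidth estimation from ACK arrivals (illustrative only;
# TCP Westwood specifies a more careful discrete-time low-pass filter).
class BandwidthEstimator:
    def __init__(self, gain=0.1):             # gain: assumption, not from the spec
        self.bwe = 0.0
        self.gain = gain
    def on_ack(self, acked_bytes, interval_s):
        sample = acked_bytes / interval_s      # instantaneous sample (bytes/s)
        self.bwe += self.gain * (sample - self.bwe)
        return self.bwe

est = BandwidthEstimator()
print(est.on_ack(1448, 0.01))   # one 1448-byte segment acked after 10 ms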

Additional benefits

Test results for TCP Westwood show that its fairness is at least as good as that of the widely used Standard TCP, even if two flows have different round-trip times. A main advantage of this TCP variant is that it does not starve the other variants of TCP, so TCP Westwood can also be considered friendly.

-- WojtekSronek - 27 Jul 2005

TCP Selective Acknowledgements (SACK)

Selective Acknowledgements are a refinement of TCP's traditional "cumulative" acknowledgements.

SACKs allow a receiver to acknowledge non-consecutive data, so that the sender can retransmit only what is missing at the receiver's end. This is particularly helpful on paths with a large bandwidth-delay product (BDP).

TCP may experience poor performance when multiple packets are lost from one window of data. With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time. An aggressive sender could choose to retransmit packets early, but such retransmitted segments may have already been successfully received.

A Selective Acknowledgment (SACK) mechanism, combined with a selective repeat retransmission policy, can help to overcome these limitations. The receiving TCP sends back SACK packets to the sender informing the sender of data that has been received. The sender can then retransmit only the missing data segments.

Multiple packet losses from a window of data can have a catastrophic effect on TCP throughput. TCP uses a cumulative acknowledgment scheme in which received segments that are not at the left edge of the receive window are not acknowledged. This forces the sender to either wait a roundtrip time to find out about each lost packet, or to unnecessarily retransmit segments which have been correctly received. With the cumulative acknowledgment scheme, multiple dropped segments generally cause TCP to lose its ACK-based clock, reducing overall throughput. Selective Acknowledgment (SACK) is a strategy which corrects this behavior in the face of multiple dropped segments. With selective acknowledgments, the data receiver can inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost.

The selective acknowledgment extension uses two TCP options. The first is an enabling option, "SACK-permitted", which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given by SACK-permitted.
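
As an illustration of what SACK information conveys, here is a toy model (not the actual TCP option encoding) that derives the cumulative ACK and SACK blocks from a set of received byte ranges:

# Toy model: derive cumulative ACK and SACK blocks from received segments.
def ack_and_sack(segments):                 # segments: [(start, end), ...]
    merged = []
    for start, end in sorted(segments):     # merge overlapping/adjacent ranges
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    cum_ack = merged[0][1] if merged and merged[0][0] == 0 else 0
    sacks = [tuple(m) for m in merged if m[0] > cum_ack]
    return cum_ack, sacks

print(ack_and_sack([(0, 1000), (2000, 3000), (3000, 4000)]))
# -> (1000, [(2000, 4000)]): bytes up to 1000 ACKed, 2000-4000 SACKed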

Blackholing Issues

Enabling SACK globally used to be somewhat risky, because in some parts of the Internet, TCP SYN packets offering/requesting the SACK capability were filtered, causing connection attempts to fail. By now, it seems that the increased deployment of SACK has caused most of these filters to disappear.

Performance Issues

Sometimes it is not recommended to enable the SACK feature. For example, the Linux 2.4.x TCP SACK implementation suffers from significant performance degradation in the case of a burst of packet loss. People at CERN observed that a burst of packet loss considerably affects TCP connections with a large bandwidth-delay product (several MBytes), because the TCP connection doesn't recover as it should. After a burst of loss, the throughput measured in their testbed was close to zero for 70 seconds. This behavior does not comply with TCP's RFC: a timeout should occur after a few RTTs, because one of the losses couldn't be repaired quickly enough, and the sender should go back into slow start.

For more information take a look at http://sravot.home.cern.ch/sravot/Networking/TCP_performance/tcp_bug.htm

Additionally, work done at the Hamilton Institute found that SACK processing in the Linux kernel is inefficient even for later 2.6 kernels; on 1 Gb/s networks, a long file transfer could lose about 100 Mb/s of potential throughput. Most of these issues should have been fixed in Linux kernel 2.6.16.

Detailed Explanation

The following is closely based on a mail that Baruch Even sent to the pert-discuss mailing list on 25 Jan '07:

The Linux TCP SACK handling code has in the past been extremely inefficient: there were multiple passes over a linked list holding all the sent packets, and on a large-BDP link this list can span 20,000 packets. The multiple traversals of this list could take longer than the interval until the next packet arrived. Soon after a loss with SACK, the sender's incoming queue fills up and ACKs start getting dropped. There used to be an anti-DoS mechanism that would drop all packets until the queue had emptied; with a default of 1000 packets, that took a long time and could easily drop all the outstanding ACKs, resulting in a great degradation of performance. This value is set with /proc/sys/net/core/netdev_max_backlog.

This situation has been slightly improved: although there is still a 1000-packet limit for the network queue, it now acts as a normal buffer, i.e. packets are accepted as soon as the queue dips below 1000 again.

Kernel 2.6.19 should be the preferred kernel for high-speed networks. It is believed to have fixes for all former major issues (at least those fixes that were accepted into the kernel), but it should be noted that some other (more minor) bugs have appeared and will need to be fixed in future releases.

Historical Note

Selective acknowledgement schemes were known long before they were added to TCP. Noel Chiappa mentioned that Xerox's PUP protocols had them in a posting to the tcp-ip mailing list in August 1986. Vint Cerf responded with a few notes about the thinking that led to the cumulative form of the original TCP acknowledgements.

-- UlrichSchmid & SimonLeinen - 02 Jun 2005 - 14 Jan 2007
-- BartoszBelter - 16 Dec 2005
-- BaruchEven - 05 Jan 2006

Automatic Tuning of TCP Buffers

Note: This mechanism is sometimes referred to as "Dynamic Right-Sizing" (DRS).

The issues mentioned under "Large TCP Windows" are arguments in favor of "buffer auto-tuning", a promising but relatively new approach to better TCP performance in operating systems. See the TCP auto-tuning zoo reference for a description of some approaches.

Microsoft introduced (receive-side) buffer auto-tuning in Windows Vista. This implementation is explained in a TechNet Magazine "Cable Guy" article.

FreeBSD introduced buffer auto-tuning as part of its 7.0 release.

Mac OS X introduced buffer auto-tuning in release 10.5.

Linux auto-tuning details

Some automatic buffer tuning is implemented in Linux 2.4 (sender-side), and Linux 2.6 implements it for both the send and receive directions.

In a post to the web100-discuss mailing list, John Heffner describes the Linux 2.6.16 (March 2006) Linux implementation as follows:

For the sender, we explored separating the send buffer and retransmit queue, but this has been put on the back burner. This is a cleaner approach, but is not necessary to achieve good performance. What is currently implemented in Linux is essentially what is described in Semke '98, but without the max-min fair sharing. When memory runs out, Linux implements something more like congestion control for reducing memory. It's not clear that this is well-behaved, and I'm not aware of any literature on this. However, it's rarely used in practice.

For the receiver, we took an approach similar to DRS, but not quite the same. RTT is measured with timestamps (when available), rather than using a bounding function. This allows it to track a rise in RTT (for example, due to path change or queuing). Also, a subtle but important difference is that receiving rate is measured by the amount of data consumed by the application, not data received by TCP.

Matt Mathis reports on the end2end-interest mailing list (26/07/06):

Linux 2.6.17 now has sender and receiver side autotuning and a 4 MB DEFAULT maximum window size. Yes, by default it negotiates a TCP window scale of 7.
4 MB is sufficient to support about 100 Mb/s on a 300 ms path or 1 Gb/s on a 30 ms path, assuming you have enough data and an extremely clean (loss-less) network.

References

  • TCP auto-tuning zoo, Web page by Tom Dunegan of Oak Ridge National Laboratory, PDF
  • A Comparison of TCP Automatic Tuning Techniques for Distributed Computing, E. Weigle, W. Feng, 2002, PDF
  • Automatic TCP Buffer Tuning, J. Semke, J. Mahdavi, M. Mathis, SIGCOMM 1998, PS
  • Dynamic Right-Sizing in TCP, M. Fisk, W. Feng, Proc. of the Los Alamos Computer Science Institute Symposium, October 2001, PDF
  • Dynamic Right-Sizing in FTP (drsFTP): Enhancing Grid Performance in User-Space, M.K. Gardner, Wu-chun Feng, M. Fisk, July 2002, PDF. This paper describes an implementation of buffer-tuning at application level for FTP, i.e. outside of the kernel.
  • Socket Buffer Auto-Sizing for High-Performance Data Transfers, R. Prasad, M. Jain, C. Dovrolis, PDF
  • The Cable Guy: TCP Receive Window Auto-Tuning, J. Davies, January 2007, Microsoft TechNet Magazine
  • What's New in FreeBSD 7.0, F. Biancuzzi, A. Oppermann et al., February 2008, ONLamp
  • How to disable the TCP autotuning diagnostic tool, Microsoft Support, Article ID: 967475, February 2009

-- SimonLeinen - 04 Apr 2006 - 21 Mar 2011
-- ChrisWelti - 03 Aug 2006

Application Protocols

-- TobyRodwell - 28 Feb 2005

File Transfer

A common problem for many scientific applications is the replication of (often large) data sets (files) from one system to another. (For the generalized problem of transferring data sets from a source to multiple destinations, see DataDissemination.) Typically this requires reliable transfer (protection against transmission errors) such as that provided by TCP, usually access control based on some sort of authentication, and sometimes confidentiality against eavesdroppers, which can be provided by encryption. There are many protocols that can be used for file transfer, some of which are outlined here.

  • FTP, the File Transfer Protocol, was one of the earliest protocols used on the ARPAnet and the Internet, and predates both TCP and IP. It supports simple file operations over a variety of operating systems and file abstractions, and has both a text and a binary mode. FTP uses separate TCP connections for control and data transfer.
  • HTTP, the Hypertext Transfer Protocol, is the basic protocol used by the World Wide Web. It is quite efficient for transferring files, but is typically used to transfer from a server to a client only.
  • RCP, the Berkeley Remote Copy Protocol, is a convenient protocol for transferring files between Unix systems, but lacks real security beyond address-based authentication and clear-text passwords. Therefore it has mostly fallen out of use.
  • SCP is a file-transfer application of the SSH protocol. It provides various modern methods of authentication and encryption, but its current implementations have some performance limitations over "long fat networks" that are addressed under the SSH topic.
  • BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them.
  • VFER is a tool for high-performance data transfer developed at Internet2. It is layered on UDP and implements its own delay-based rate control algorithm in user-space, which is designed to be "TCP friendly". Its security is based on SSH.
  • UDT is another UDP-based bulk transfer protocol, optimized for high-capacity (1 Gb/s and above) wide-area network paths. It has been used in the winning entry at the Supercomputing'06 Bandwidth Challenge.

Several high-performance file transfer protocols are used in the Grid community. The "comparative evaluation..." paper in the references compares FOBS, RBUDP, UDT, and bbFTP. Other protocols include GridFTP and Tsunami. The eVLBI community uses file transfer tools from the Mark5 software suite: File2Net and Net2File. The ESnet "Fasterdata" knowledge base has a very nice section on Data Transfer Tools, providing both general background information and information about several specific tools.

Network File Systems

Another possibility of exchanging files over the network involves networked file systems, which make remote files transparently accessible in a local system's normal file namespace. Examples for such file systems are:

  • NFS, the Network File System, was initially developed by Sun and is widely utilized on Unix-like systems. Very recently, NFS 4.1 added support for pNFS (parallel NFS), where data access can be striped over multiple data servers.
  • AFS, the Andrew File System from CMU, evolved into DFS (Distributed File System)
  • SMB (Server Message Block) or CIFS (Common Internet File System) is the standard protocol for connecting to "network shares" (remote file systems) in the Windows world.
  • GPFS (General Parallel File System) is a high-performance scalable network file system by IBM.
  • Lustre is an open-source file system for high-performance clusters, distributed by Sun.

References

  • A Comparative Evaluation of High-Performance File Transfer Systems for Data-intensive Grid Applications, C. Anglano, M. Canonico, Proc. 13th IEEE Int. Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE'04) PDF, requires IEEE account
  • Transfer Tools, ESnet fasterdata Network Performance Knowledge Base, February 2011 (last checked)

-- TobyRodwell - 28 Feb 2005
-- SimonLeinen - 26 Jun 2005 - 27 Feb 2011

Secure Shell (SSH)

SSH is a widely used protocol for remote terminal access with secure authentication and data encryption. It is also used for file transfers, using tools such as scp (Secure Copy), sftp (Secure FTP), or rsync-over-ssh.

Performance Issues With SSH

Application Layer Window Limitation

When users use SSH to transfer large files, they often think that performance is limited by the processing power required for encryption and decryption. While this can indeed be an issue in a LAN context, the bottleneck over "long fat networks" (LFNs) is most likely a window limitation. Even when TCP parameters have been tuned to allow sufficiently large TCP windows, the most common SSH implementation (OpenSSH) has a hardwired window size at the application level. Until OpenSSH 4.7, the limit was ~64 KB; since then, the limit has been increased 16-fold (see below) and the window increase logic has been made more aggressive.

This limitation is replaced with a more advanced logic in a modification of the OpenSSH software provided by the Pittsburgh Supercomputing Center (see below).

The performance difference is substantial, especially as the RTT grows. In a test setup with 45 ms RTT, two Linux systems with 8 MB read/write buffers could achieve 1.5 MB/s with regular OpenSSH (3.9p1 + 4.3p1). Switching to OpenSSH 5.1p1 + HPN-SSH patches on both ends allowed up to 55-70 MB/s (no encryption) or 35/50 MB/s (aes128-cbc/ctr encryption), with the stable rate somewhat lower, the bottleneck being the CPU on one end. By upgrading just the receiver (client) side, transfers could still reach 50 MB/s (with encryption).
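
These figures are consistent with simple window/RTT arithmetic: at most one application-layer window can be in flight per round trip, so with the old ~64 KB limit on a 45 ms path the transfer rate is capped near the 1.5 MB/s observed above:

# Upper bound imposed by a fixed application-layer window.
def max_throughput_MBps(window_bytes, rtt_s):
    return window_bytes / rtt_s / 1e6

print(max_throughput_MBps(64 * 1024, 0.045))   # ~1.46 MB/s, matching above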

Crypto overhead

When the window-size limitation is removed, encryption/decryption performance may become the bottleneck again. So it is useful to choose a "cipher" (encryption/decryption method) that performs well, while still being regarded as sufficiently secure to protect the data in question. Here is a table that displays the performance of several ciphers supported by OpenSSH in a reference setting:

cipher                        throughput
3des-cbc                      2.8 MB/s
arcfour                       24.4 MB/s
aes192-cbc                    13.3 MB/s
aes256-cbc                    11.7 MB/s
aes128-ctr                    12.7 MB/s
aes192-ctr                    11.7 MB/s
aes256-ctr                    11.3 MB/s
blowfish-cbc                  16.3 MB/s
cast128-cbc                   7.9 MB/s
rijndael-cbc@lysator.liu.se   12.2 MB/s

The High Performance Enabled SSH/SCP (HPN-SSH) version also adds an option to the scp program that enables the "none" cipher, for cases where confidentiality protection of the transferred data is not required. It also supports a cipher-switch option, where password authentication is encrypted but the transferred data is not.

References

Basics

SSH Performance

-- ChrisWelti - 03 Apr 2006
-- SimonLeinen - 12 Feb 2005 - 26 Apr 2010
-- PekkaSavola - 01 Oct 2008

BitTorrent

BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them. It was developed by Bram Cohen, and is widely used to distribute large files over the Internet. Because much of this copying concerns audio and video material distributed without the consent of the rights owners, BitTorrent has become a focus of attention of media interest groups such as the Motion Picture Association of America (MPAA). But BitTorrent is also used to distribute large software archives under "Free Software" or similar legal-redistribution regimes.

References

-- SimonLeinen - 26 Jun 2005

Explicit Control Protocol (XCP)

The eXplicit Control Protocol (XCP) is a congestion control protocol developed by Dina Katabi at the MIT Computer Science & Artificial Intelligence Laboratory. It extracts information about congestion from routers along the path between endpoints. It is more complicated to implement than other proposed Internet congestion control protocols.

XCP-capable routers compute fair per-flow bandwidth allocations without keeping per-flow congestion state; that state is carried in the packets themselves. To request a desired throughput, the sender includes a congestion header (the XCP header) between the IP and transport headers. This enables the sender to learn about the bottleneck on the path from sender to receiver within a single round trip.

To increase the congestion window of a TCP connection, the sender requires feedback from the network about the maximum throughput available along the path for injecting data into the network. The routers update this information in the congestion header as the packet moves from the sender to the receiver. The receiver's main task is to copy the network feedback into outgoing packets belonging to the same bidirectional flow.

The congestion header consists of four fields as follows:

  • Round-Trip Time (RTT): The current round-trip time for the flow.
  • Throughput: The throughput used currently by the sender.
  • Delta_Throughput: The value by which the sender would like to change its throughput. This field is updated by the routers along the path. If one of the routers sets a negative value, the sender must slow down.
  • Reverse_Feedback: The value copied from the Delta_Throughput field and returned by the receiver to the sender; it contains the maximum feedback allocation from the network.
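
To illustrate how these fields are used, here is a minimal Python sketch of the per-packet logic at the sender, routers and receiver (all names are invented for clarity; this is a sketch, not code from a real XCP implementation):

# Sketch of XCP's per-packet feedback cycle (illustrative names only).
class CongestionHeader:
    def __init__(self, rtt, throughput, delta_throughput):
        self.rtt = rtt                            # sender's current RTT estimate
        self.throughput = throughput              # rate currently used by the sender
        self.delta_throughput = delta_throughput  # requested change, updated by routers
        self.reverse_feedback = None              # filled in by the receiver

def router_process(header, allowed_delta):
    # A router may only reduce the requested change; a negative value
    # tells the sender to slow down.
    header.delta_throughput = min(header.delta_throughput, allowed_delta)

def receiver_process(data_header, ack_header):
    # The receiver copies the network feedback into packets flowing back
    # to the sender on the same bidirectional flow.
    ack_header.reverse_feedback = data_header.delta_throughput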

A router that implements XCP maintains two control algorithms, executed periodically. The first, the congestion controller, is responsible for making maximum use of the outbound link. The second, the fairness controller, is responsible for fairly distributing throughput among the flows sharing the link.

Benefits of XCP

  • it achieves fairness and converges to it rapidly,
  • it achieves maximum link utilization (better bottleneck link utilization),
  • it learns about the bottleneck much more rapidly,
  • it is potentially applicable to any transport protocol,
  • it is more stable on long-distance connections with large RTTs.

Deployment and open issues with XCP

  • it has to be tested in real network environments,
  • solutions for tunnelling protocols still need to be specified,
  • it has to be supported by all routers along the path,
  • the workload for the routers is higher.

References

-- WojtekSronek - 08 Jul 2005
-- SimonLeinen - 01 Apr 2006 - 18 Nov 2006

Source Quench

Source Quench is an ICMP-based mechanism used by network devices to inform a data sender that packets cannot be forwarded due to buffer overload. When the message is received by a TCP sender, that sender should decrease its send window to the respective destination in order to limit outgoing traffic. ICMP Source Quench usage is specified in RFC 896 - Congestion Control in IP/TCP Internetworks. The currently used standard, RFC 1812 - Requirements for IP Version 4 Routers, says that routers should not originate this message, and therefore Source Quench should not be used any more.

Problems with Source Quench

There are several reasons why Source Quench has fallen out of favor as a congestion control mechanism over the years:

  1. Source Quench messages can be lost on their way to the sender (due to congestion on the return path, filtering/return routing issues etc.). A congestion control mechanism should be robust against these sorts of problems.
  2. A Source Quench message carries very little information per packet, namely only that some amount of congestion was sensed at the gateway (router) that sent it.
  3. Source Quench messages, like all ICMP messages, are expensive for a router to generate. This is bad because the congestion control mechanism could contribute additional congestion, if router processing resources become a bottleneck.
  4. Source Quench messages could be abused by a malevolent third party to slow down connections, causing denial of service.

In effect, ICMP Source Quench messages are almost never generated on the Internet today, and would be ignored almost everywhere if they were.

References

  • Congestion Control in IP/TCP Internetworks, RFC 896, J. Nagle, January 1984
  • Something a Host Could Do with Source Quench: The Source Quench Introduced Delay (SQuID), RFC 1016, W. Prue and J. Postel, July 1987
  • Requirements for IP Version 4 Routers, RFC 1812, F. Baker (Ed.), June 1995
  • IETF-discuss message with notes on the history of Source Quench, F. Baker, Feb. 2007 - in archive

-- WojtekSronek - 05 Jul 2005

Explicit Congestion Notification (ECN)

The Internet's end-to-end rate control schemes (notably the TransmissionControlProtocol (TCP)) traditionally have to rely on packet loss as the prime indicator of congestion (queueing delay is also used for congestion feedback, although mostly implicitly, except in newer mechanisms such as TCP FAST).

Both loss and delay are implicit signals of congestion. The alternative is to send explicit congestion signals.

ICMP Source Quench

Such a signal was indeed part of the original IP design, in the form of the "Source Quench" ICMP (Internet Control Message Protocol) message. A router experiencing congestion could send ICMP Source Quench messages towards a source, in order to tell that source to send more slowly. This method was deprecated by RFC 1812 ("Requirements for IP Version 4 Routers") in June 1995. The biggest problems with ICMP Source Quench are:

  • that the mechanism causes more traffic to be generated in a situation of congestion (although in the other direction)
  • that, when ICMP Source Quench messages are lost, it fails to slow down the sender.

The ECN Proposal

The new ECN mechanism consists of two components:

  • Two new "ECN" bits in the former TOS field of the IP header:
    • The "ECN-Capable Transport" (ECT) bit must only be set for packets controlled by ECN-aware transports
    • The "Congestion Experienced" (CE) bit can be set by a router if
      • the router has detected congestion on the outgoing link
      • and the ECT bit is set.

  • Transport-specific protocol extensions which communicate the ECN signal back from the receiver to the sender. For TransmissionControlProtocol, this takes the form of two new flags in the TCP header, ECN-Echo (ECE) and Congestion Window Reduced (CWR). A similar mechanism has been included in SCTP.

ECN works as follows. When a transport supports ECN, it sends IP packets with ECT (ECN-Capable Transport) set. Then, when there is congestion, a router will set the CE (Congestion Experienced) bit in some of these packets. The receiver notices this, and sends a signal back to the sender (by setting the ECE flag). The sender then reduces its sending rate, as if it had detected the loss of a single packet, and sets the CWR flag so as to inform the receiver of this action.

(Note that the two-bit ECN field in the IP header has been redefined in the current ECN RFC (RFC3168), so that "ECT" and "CE" are no longer actual bits. But the old definition is somewhat easier to understand. If you want to know how these "conceptual" bits are encoded, please check out RFC 3168.)
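
For reference, RFC 3168 encodes the four states of this two-bit field as follows (rendered here as a small Python table for illustration):

# The two-bit ECN field in the IP header, as defined in RFC 3168:
ECN_CODEPOINTS = {
    0b00: "Not-ECT - transport is not ECN-capable",
    0b01: "ECT(1)  - ECN-capable transport",
    0b10: "ECT(0)  - ECN-capable transport",
    0b11: "CE      - congestion experienced",
}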

Benefits of ECN

ECN provides two significant benefits:

  • ECN-aware transports can properly adapt their rates to congestion without requiring packet loss
  • Congestion feedback can be quicker with ECN, because detecting a dropped packet requires a retransmission timeout or duplicate acknowledgements.

Deployment Issues with ECN

ECN requires AQM, which isn't widely deployed

ECN requires routers to use an Active Queue Management (AQM) mechanism such as Random Early Detection (RED). In addition, routers have to be able to mark eligible (ECT) packets with the CE bit when the AQM mechanism notices congestion. RED is widely implemented on routers today, although it is rarely activated in actual networks.

ECN must be added to routers' forwarding paths

The capability to ECN-mark packets can be added to CPU- or Network-Processor-based routing platforms relatively easily - Cisco's CPU-based routers such as the 7200/7500 routers support this with newer software, for example, but if queueing/forwarding is performed by specialized hardware (ASICs), this function has to be designed into the hardware from the start. Therefore, most of today's high-speed routers cannot easily support ECN to my knowledge.

ECN "Blackholing"

Another problem is that attempts to use ECN can break connectivity through certain "middlebox" devices such as firewalls or load balancers, which drop packets when unexpected TCP flags (or, more rarely, unexpected IP TOS values) are encountered. The original ECN RFC (RFC 2481) didn't handle this gracefully, so activating ECN on hosts that implement this version caused much frustration because of "hanging" connections. RFC 3168 proposes a mechanism to deal with ECN-unfriendly networks, but that hasn't been widely implemented yet. In particular, the Linux ECN implementation doesn't seem to implement it as of November 2007 (Linux 2.6.23).
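
On Linux hosts, TCP ECN can be toggled at run time via a sysctl, which is handy when debugging suspected blackholing (shown for illustration; the default setting varies between kernel versions):

sysctl -w net.ipv4.tcp_ecn=1    # request ECN on outgoing connections
sysctl -w net.ipv4.tcp_ecn=0    # disable ECN if middleboxes break connectivity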

See Floyd's ECN Problems page for more.

References

The Addition of Explicit Congestion Notification (ECN) to IP
RFC 3168, K. Ramakrishnan, S. Floyd, D. Black, September 2001, ftp://ftp.ietf.org/rfc/rfc3168.txt

Robust Explicit Congestion Notification (ECN) Signaling with Nonces
RFC 3540, N. Spring, D. Wetherall, D. Ely, June 2003, ftp://ftp.ietf.org/rfc/rfc3540.txt

ECN (Explicit Congestion Notification) in TCP/IP
Web page by Sally Floyd, http://www.icir.org/floyd/ecn.html

-- SimonLeinen - 07 Jan 2005 - 04 Nov 2007

Rate Control Protocol (RCP)

Developed by Nandita Dukkipati and Nick McKeown at Stanford University, RCP aims to emulate processor sharing (PS) over a broad range of operating conditions. TCP's congestion control algorithm, and most of the other proposed alternatives such as ExplicitControlProtocol, try to emulate processor sharing by giving each competing flow an equal share of a bottleneck link. They emulate PS well in a static scenario where all flows are long-lived, but in scenarios where flows are short-lived, arrive randomly and have a finite amount of data to send, as is the case in today's Internet, they do not perform as well.

In RCP a router assigns a single rate to all flows that pass through it. The router does not keep flow-state nor does it do per-packet calculations. The flow rate is picked by routers based on the current queue occupancy and the aggregate input traffic rate.

The Algorithm

The basic RCP algorithm is as follows:

  1. Every router maintains a single fair-share rate, R(t), that it offers to all flows. It updates R(t) approximately once every RTT.
  2. Every packet header carries a rate field, Rp. When transmitted by the source, Rp = infinity. When a router receives a packet, if R(t) at the router is smaller than Rp, then Rp <- R(t); otherwise it's unchanged. The destination copies Rp into the acknowledgement packets, so as to notify the source. The packet header also carries an RTT field, RTTp; where RTTp is the source's current estimate of the RTT for the flow. When a router receives a packet it uses RTTp to update its moving average of the RTT of the flows passing through it, d0.
  3. The source transmits at rate Rp, which corresponds to the smallest offered rate along the path.
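
A minimal Python sketch of this per-packet processing may help; all names are invented, and the authoritative description is in the paper below:

# Sketch of RCP's per-packet processing (illustrative names only).
INFINITY = float("inf")

def source_stamp(packet, rtt_estimate):
    packet.Rp = INFINITY        # request as much as the path will offer
    packet.RTTp = rtt_estimate  # source's current RTT estimate

def router_process(packet, R_t):
    # Routers only ever lower the offered rate; no per-flow state is kept.
    packet.Rp = min(packet.Rp, R_t)

def destination_ack(packet, ack):
    ack.Rp = packet.Rp          # echo the smallest offered rate to the source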

Papers

Processor Sharing Flows in the Internet. N. Dukkipati and N. McKeown, Stanford University High Performance Networking Group Technical Report TR04-HPNG-061604, June 2004

-- OrlaMcGann - 11 Oct 2005

UDP (User Datagram Protocol)

The User Datagram Protocol (UDP) is a very simple layer over the host-to-host Internet Protocol (IP). It only adds 16-bit source and destination port numbers for multiplexing between different applications on the pair of hosts, and 16-bit length and checksum fields. It has been defined in RFC 768.
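
Since the header is just four 16-bit fields, it can be constructed in a couple of lines; the following Python sketch (the helper function is our own, for illustration) shows the layout:

import struct

def udp_header(src_port, dst_port, payload, checksum=0):
    # Four 16-bit big-endian fields: source port, destination port,
    # length (header plus data), and checksum (0 means "none" in IPv4).
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)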

UDP is used directly for protocols such as the Domain Name System (DNS) or the Network Time Protocol (NTP), which consist of isolated request-response transactions between hosts, where the negotiation and maintenance of TCP connections would be prohibitive.

There are other protocols layered on top of UDP, for example the Real-time Transport Protocol (RTP) used in real-time media applications. UDT, VFER, RBUDP, Tsunami, and Hurricane are examples of UDP-based bulk transport protocols.

References

  • RFC 768, User Datagram Protocol, J. Postel, August 1980

-- SimonLeinen - 31 Oct 2004 - 02 Apr 2006

Real-Time Transport Protocol (RTP)

RTP (RFC 3550) is a generic transport protocol for real-time media streams such as audio or video. RTP is typically run over UDP. RTP's services include timestamps and identification of media types. RTP includes RTCP (RTP Control Protocol), whose primary use is to provide feedback about the quality of transmission in the form of Receiver Reports.

RTP is used as a framing protocol for many real-time audio/video applications, including those based on the ITU-T H.323 protocol and the Session Initiation Protocol (SIP).

References

  • RFC 3550, RTP: A Transport Protocol for Real-Time Applications, H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. July 2003

-- SimonLeinen - 28 Oct 2004 - 24 Nov 2007

pathping

pathping seems to be similar to mtr, but for Windows systems. It is included in at least some modern versions of Windows.

References

  • "pathping", Windows XP Professional Product Documentation,
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/pathping.mspx

-- SimonLeinen - 26 Feb 2006

PingPlotter

PingPlotter by Nessoft LLC is a tool for the Windows platform that combines traceroute, ping and whois to collect data.

Example

A nice example of the tool can be found in the "Screenshot" section of its Web page. This contains not only the screenshot (included below by reference) but also a description of what you can read from it.

nessoftgraph.gif

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006

Traceroute Mesh Server

The Traceroute Mesh Server combines traceroutes from many sources at once and builds a large map of the interconnections from the Internet to a specific IP address. Very cool!

References

Example

Partial view of traceroute map created:

partial.png

-- HankNussbacher - 10 Jul 2005

tracepath

tracepath and tracepath6 trace the path to a network host, discovering MTU and asymmetry along this path. As described below, their applicability for path asymmetry measurements is quite limited, but the tools can still measure MTU rather reliably.

Methodology and Caveats

A path is considered asymmetric if the number of hops to a router differs from the amount by which the TTL was decremented while the ICMP error message was forwarded back from that router. The latter depends on knowing the original TTL with which the router sent the ICMP error; the tools guess among the values 64, 128 and 255. Obviously, a path might be asymmetric even if the forward and return paths are equally long, so the tool only catches one case of path asymmetry.

A major operational issue with this approach is that at least Juniper's M/T-series routers decrement TTL for ICMP errors they originate (e.g., the first hop router returns ICMP error with TTL=254 instead of TTL=255) as if they were forwarding the packet. This shows as path asymmetry.

Path MTU is measured by sending UDP packets with the DF bit set. The initial packet size is the MTU of the host's outgoing link, or a cached Path MTU from earlier Path MTU Discovery for the given destination address. If a link along the path has a lower MTU than the current probe size, the ICMP error reports the new path MTU, which is then used in subsequent probes.

As explained in tracepath(8), if MTU changes along the path, then the route will probably erroneously be declared as asymmetric.
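
The guessing logic can be summarized in a few lines of Python (a simplification of what the tools do; the names are ours):

def guessed_return_hops(received_ttl):
    # Assume the router initialized the TTL of its ICMP error to 64, 128 or 255.
    for initial_ttl in (64, 128, 255):
        if received_ttl <= initial_ttl:
            return initial_ttl - received_ttl + 1
    return None  # implausible TTL, no guess possible

def looks_asymmetric(forward_hops, received_ttl):
    back = guessed_return_hops(received_ttl)
    return back is not None and back != forward_hops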

Examples

IPv4 Example

This example shows a path from a host with 9000-byte "jumbo" MTU support to a host on a traditional 1500-byte Ethernet.

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.203ms pmtu 9000
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  1.024ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.959ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              5.287ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    5.456ms
 5:  swiCS3-P1.switch.ch (130.59.36.221)                  asymm  4   5.467ms pmtu 1500
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.864ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.209ms !H
     Resume: pmtu 1500

The router (interface) swiCS3-P1.switch.ch occurs twice; on the first line (hop 4), it returns an ICMP TTL Exceeded error, on the next (hop 5) it returns an ICMP "fragmentation needed and DF bit set" error. Unfortunately this causes tracepath to miss the "real" hop 5, and also to erroneously assume that the route is asymmetric at that point. One could consider this a bug, as tracepath could distinguish these different ICMP errors, and refrain from incrementing TTL when it reduces MTU (in response to the "fragmentation needed..." error).

When one retries the tracepath, the discovered Path MTU for the destination has been cached by the host, and you get a different result:

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.211ms pmtu 1500
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  0.384ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.214ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              4.620ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    4.623ms
 5:  swiNM1-G1-0-25.switch.ch (130.59.15.237)               5.861ms
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.845ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.226ms !H
     Resume: pmtu 1500

Note that hop 5 now shows up correctly and without an "asymm" warning. There is still an "asymm" warning at the end of the path, because a filter on the last-hop router swiLM1-V610.switch.ch prevents the UDP probes from reaching the final destination.

IPv6 Example

Here is the same path for IPv6, using tracepath6. Because of more relaxed UDP filters, the final destination is actually reached in this case:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 9000
 1:  swiCE2-G5-2.switch.ch                      1.654ms
 2:  swiLS2-10GE-1-3.switch.ch                  2.235ms
 3:  swiEZ2-10GE-1-1.switch.ch                  5.616ms
 4:  swiCS3-P1.switch.ch                        5.793ms
 5:  swiCS3-P1.switch.ch                      asymm  4   5.872ms pmtu 1500
 5:  swiNM1-G1-0-25.switch.ch                   5. 47ms
 6:  swiLM1-V610.switch.ch                      5. 79ms
 7:  diotima.switch.ch                          4.766ms reached
     Resume: pmtu 1500 hops 7 back 7

Again, once the Path MTU has been cached, tracepath6 starts out with that MTU, and will discover the correct path:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 1500
 1:  swiCE2-G5-2.switch.ch                      0.703ms
 2:  swiLS2-10GE-1-3.switch.ch                  8.786ms
 3:  swiEZ2-10GE-1-1.switch.ch                  4.904ms
 4:  swiCS3-P1.switch.ch                        4.979ms
 5:  swiNM1-G1-0-25.switch.ch                   4.989ms
 6:  swiLM1-V610.switch.ch                      6.578ms
 7:  diotima.switch.ch                          5.191ms reached
     Resume: pmtu 1500 hops 7 back 7

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006
-- PekkaSavola - 31 Aug 2006

Bandwidth Measurement Tools

(Click on blue headings to link through to detailed tool descriptions and examples)

Iperf

Iperf is a tool to measure maximum TCP bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, and datagram loss. http://sourceforge.net/projects/iperf/

BWCTL

"BWCTL is a command line client application and a scheduling and policy daemon that wraps Iperf"

Home page: http://e2epi.internet2.edu/bwctl/

Example:
http://e2epi.internet2.edu/pipes/pmp/pmp-switch.htm

nuttcp

A measurement tool very similar to iperf. It can be found at http://www.nuttcp.net/nuttcp/Welcome%20Page.html

NDT

Web100-based TCP tester that can be used from a Java applet.

Pchar

Hop-by-hop capacity measurements along a network path.

Netperf

A client/server network performance benchmark.

Home page: http://www.netperf.org/netperf/NetperfPage.html

Thrulay

A tool that performs TCP throughput tests and RTT measurements at the same time.

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two points. The rude tool generates traffic to the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. The tool is available at http://rude.sourceforge.net/

NEPIM

nepim stands for network pipemeter, a tool for measuring available bandwidth between hosts. nepim is also useful to generate network traffic for testing purposes.

nepim operates in client/server mode, is able to handle multiple parallel traffic streams, reports periodic partial statistics along the testing, and supports IPv6.

The tool is available at http://www.nongnu.org/nepim/

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A utility for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm

DSL Reports

DSL Reports doesn't require any installation. It checks the speed of your connection via a Java applet. One can choose from over 300 sites in the world from which to run the tests: http://www.dslreports.com/stest

* Sample report produced by DSL Reports:
dslreport.png

Other online bandwidth measurement sites:

* http://myspeed.visualware.com/

* http://www.toast.net/performance/

* http://www.beelinebandwidthtest.com/

-- FrancoisXavierAndreu, SimonMuyal & SimonLeinen - 06-30 Jun 2005

-- HankNussbacher - 07 Jul 2005 & 15 Oct 2005 (DSL and online Reports section)

pchar

Characterize the bandwidth, latency and loss on network links. (See below the example for information on how pchar works.)

(Debian package: pchar)

Example:

pchar to 193.51.180.221 (193.51.180.221) using UDP/IPv4
Using raw socket input
Packet size increments from 32 to 1500 by 32
46 test(s) per repetition
32 repetition(s) per hop
 0: 193.51.183.185 (netflow-nri-a.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 0.124246 ms, (b = 0.000206 ms/B), r2 = 0.997632
                       stddev rtt = 0.001224, stddev b = 0.000002
    Partial queueing:  avg = 0.000158 ms (765 bytes)
    Hop char:          rtt = 0.124246 ms, bw = 38783.892367 Kbps
    Hop queueing:      avg = 0.000158 ms (765 bytes)
 1: 193.51.183.186 (nri-a-g13-1-50.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 1.087330 ms, (b = 0.000423 ms/B), r2 = 0.991169
                       stddev rtt = 0.004864, stddev b = 0.000006
    Partial queueing:  avg = 0.005093 ms (23535 bytes)
    Hop char:          rtt = 0.963084 ms, bw = 36913.554996 Kbps
    Hop queueing:      avg = 0.004935 ms (22770 bytes)
 2: 193.51.179.122 (nri-n3-a2-0-110.cssi.renater.fr)
    Partial loss:      5 / 1472 (0%)
    Partial char:      rtt = 697.145142 ms, (b = 0.032136 ms/B), r2 = 0.999991
                       stddev rtt = 0.011554, stddev b = 0.000014
    Partial queueing:  avg = 0.009681 ms (23679 bytes)
    Hop char:          rtt = 696.057813 ms, bw = 252.261443 Kbps
    Hop queueing:      avg = 0.004589 ms (144 bytes)
 3: 193.51.180.221 (caledonie-S1-0.cssi.renater.fr)
    Path length:       3 hops
    Path char:         rtt = 697.145142 ms r2 = 0.999991
    Path bottleneck:   252.261443 Kbps
    Path pipe:         21982 bytes
    Path queueing:     average = 0.009681 ms (23679 bytes)
    Start time:        Mon Jun  6 11:38:54 2005
    End time:          Mon Jun  6 12:15:28 2005
If you do not have access to a Unix system, you can run pchar remotely via: http://noc.greatplains.net/measurement/pchar.php

The README text below was written by Bruce A Mah and is taken from http://www.kitchenlab.org/www/bmah/Software/pchar/README, where there is information on how to obtain and install pchar.

PCHAR:  A TOOL FOR MEASURING NETWORK PATH CHARACTERISTICS
Bruce A. Mah
<bmah@kitchenlab.org>
$Id: PcharTool.txt,v 1.3 2005/08/09 16:59:27 TobyRodwell Exp www-data $
---------------------------------------------------------

INTRODUCTION
------------

pchar is a reimplementation of the pathchar utility, written by Van
Jacobson.  Both programs attempt to characterize the bandwidth,
latency, and loss of links along an end-to-end path through the
Internet.  pchar works in both IPv4 and IPv6 networks.

As of pchar-1.5, this program is no longer under active development,
and no further releases are planned.

...

A FEW NOTES ON PCHAR'S OPERATION
--------------------------------

pchar sends probe packets into the network of varying sizes and
analyzes ICMP messages produced by intermediate routers, or by the
target host.  By measuring the response time for packets of different
sizes, pchar can estimate the bandwidth and fixed round-trip delay
along the path.  pchar varies the TTL of the outgoing packets to get
responses from different intermediate routers.  It can use UDP or ICMP
packets as probes; either or both might be useful in different
situations.

At each hop, pchar sends a number of packets (controlled by the -R flag)
of varying sizes (controlled by the -i and -m flags).  pchar determines
the minimum response times for each packet size, in an attempt to
isolate jitter caused by network queueing.  It performs a simple
linear regression fit to the resulting minimum response times.  This
fit yields the partial path bandwidth and round-trip time estimates.

To yield the per-hop estimates, pchar computes the differences in the
linear regression parameter estimates for two adjacent partial-path
datasets.  (Earlier versions of pchar differenced the minima for the
datasets, then computed a linear regressions.)  The -a flag selects
between one of (currently) two different algorithms for performing the
linear regression, either a least squares fit or a nonparametric
method based on Kendall's test statistic.

Using the -b option causes pchar to send small packet bursts,
consisting of a string of back-to-back ICMP ECHO_REPLY packets
followed by the actual probe.  This can be useful in probing switched
networks.

CAVEATS
-------

Router implementations may very well forward a packet faster than they
can return an ICMP error message in response to a packet.  Because of
this fact, it's possible to see faster response times from longer
partial paths; the result is a seemingly non-sensical, negative
estimate of per-hop round-trip time.

Transient fluctuations in the network may also cause some odd results.

If all else fails, writing statistics to a file will give all of the
raw data that pchar used for its analysis.

Some types of networks are intrinsically difficult for pchar to
measure.  Two notable examples are switched networks (with multiple
queues at Layer 2) or striped networks.  We are currently
investigating methods for trying to measure these networks.

pchar needs superuser access due to its use of raw sockets.

...

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- HankNussbacher - 10 Jul 2005 (Great Plains server)

-- TobyRodwell - 09 Aug 2005 (added Bruce A Mah's Readme text)

Iperf

Iperf is a tool to measure TCP throughput and available bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay variation, and datagram loss.

The popular Iperf 2 releases were developed by NLANR/DAST (http://dast.nlanr.net/Projects/Iperf/) and maintained at http://sourceforge.net/projects/iperf. As of April 2014, the last released version was 2.0.5, from July 2010.

A script that automates starting and then stopping iperf servers is here. This can be invoked from a remote machine (say, a NOC workstation) to simplify starting, and more importantly stopping, an iperf server.

Iperf 3

ESnet and the Lawrence Berkeley National Laboratory have developed a from-scratch reimplementation of Iperf called Iperf 3. It has a Github repository. It is not compatible with iperf 2, and has additional interesting features such as a zero-copy TCP mode (-Z flag), JSON output (-J), and reporting of TCP retransmission counts and CPU utilization (-V). It also supports SCTP in addition to UDP and TCP. Since December 2013, various public releases have been made on http://stats.es.net/software/.

Usage Examples

TCP Throughput Test

The following shows a TCP throughput test, which is iperf's default action. The following options are given:

  • -s - server mode. In iperf, the server will receive the test data stream.
  • -c server - client mode. The name (or IP address) of the server should be given. The client will transmit the test stream.
  • -i interval - display interval. Without this option, iperf will run the test silently, and only write a summary after the test has finished. With -i, the program will report intermediate results at given intervals (in seconds).
  • -w windowsize - select a non-default TCP window size. To achieve high rates over paths with a large bandwidth-delay product, it is often necessary to select a larger TCP window size than the (operating system) default.
  • -l buffer length - specify the length of send or receive buffer. In UDP, this sets the packet size. In TCP, this sets the send/receive buffer length (possibly using system defaults). Using this may be important especially if the operating system default send buffer is too small (e.g. in Windows XP).

NOTE: the -c and -s arguments must be given first; otherwise some configuration options are ignored.

The -i 1 option was given to obtain intermediate reports every second, in addition to the final report at the end of the ten-second test. The TCP buffer size was set to 2 Megabytes (4 Megabytes effective, see below) in order to permit close to line-rate transfers. The systems haven't been fully tuned, otherwise up to 7 Gb/s of TCP throughput should be possible. Normal background traffic on the 10 Gb/s backbone is on the order of 30-100 Mb/s. Note that in iperf, by default it is the client that transmits to the server.

Server Side:

welti@ezmp3:~$ iperf -s -w 2M -i 1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  4] local 130.59.35.106 port 5001 connected with 130.59.35.82 port 41143
[  4]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  4]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  4]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  4]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  9.0-10.0 sec    413 MBytes  3.47 Gbits/sec
[  4]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec

Client Side:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  3] local 130.59.35.82 port 41143 connected with 130.59.35.106 port 5001
[  3]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  3]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  3]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  3]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec

UDP Test

In the following example, we send a 300 Mb/s UDP test stream. No packets were lost along the path, although one arrived out-of-order. Another interesting result is jitter, which is displayed as 27 or 28 microseconds (apparently there is some rounding error or other imprecision that prevents the client and server from agreeing on the value). According to the documentation, "Jitter is the smoothed mean of differences between consecutive transit times."

Server Side

: leinen@mamp1[leinen]; iperf -s -u
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.82 port 5001 connected with 130.59.35.106 port 38750
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.028 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

Client Side

: leinen@ezmp3[leinen]; iperf -c mamp1-eth0 -u -b 300M
------------------------------------------------------------
Client connecting to mamp1-eth0, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.106 port 38750 connected with 130.59.35.82 port 5001
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec
[  3] Sent 256411 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.027 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

As you would expect, during a UDP test traffic is only sent from the client to the server; see here for an example with tcpdump.

Problem isolation procedures using iperf

TCP Throughput measurements

Typically, end users report throughput problems as they see them with their applications, such as unexpectedly slow file transfer times. Some users may already report TCP throughput results as measured with iperf. In any case, network administrators should validate the throughput problem. It is recommended this be done using iperf end-to-end measurements in TCP mode between the end systems' memory (not involving disks). The window size of the TCP measurement should follow the bandwidth*delay product rule, and should therefore be set to at least the measured round-trip time multiplied by the path's bottleneck speed. If the actual bottleneck is not known (because of lack of knowledge of the end-to-end path), then it should be assumed that the bottleneck is the slower of the two end systems' network interface cards.

For instance, if one system is connected with Gigabit Ethernet but the other one with Fast Ethernet, and the measured round-trip time is 150 ms, then the window size should be set to at least 100 Mbit/s * 0.150 s / 8 = 1875000 bytes, so setting the TCP window to a value of 2 MBytes would be a good choice.
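
The same calculation, written out in Python with the values from this example:

bottleneck_bps = 100e6   # Fast Ethernet, the slower of the two interfaces
rtt_s = 0.150            # measured round-trip time in seconds
window_bytes = bottleneck_bps * rtt_s / 8
print("%d bytes" % window_bytes)   # 1875000 -> round up to 2 MBytes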

In theory, TCP throughput can reach, but not exceed, the available bandwidth on an end-to-end path. Knowledge of that network metric is therefore important for distinguishing between issues with the end systems' TCP stacks and network-related problems.

Available bandwidth measurements

Iperf can be used in UDP mode for measuring the available bandwidth. Only short measurements in the range of 10 seconds should be done, so as not to disturb production flows. The goal of the UDP measurements is to find the maximum UDP sending rate that results in almost no packet loss on the end-to-end path; in good practice the packet-loss threshold is 1%. UDP data transfers that result in higher packet losses are likely to disturb TCP production flows and should therefore be avoided. A practicable procedure for finding the available bandwidth is to start with a UDP transfer of 10 s duration, with interim result reports at one-second intervals, at a data rate slightly below the previously reported TCP throughput. If the measured packet loss values are below the threshold, then a new measurement with a slightly increased data rate can be started. This procedure of short UDP transfers with increasing data rates is repeated until the packet-loss threshold is exceeded. Depending on the required accuracy, further tests can be started beginning with the maximum data rate that stayed below the threshold, using smaller rate increments. In the end, the maximum data rate that caused packet losses below the threshold can be taken as a good estimate of the available bandwidth on the end-to-end path.
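
This stepping procedure lends itself to a simple loop. The sketch below is illustrative only: run_udp_test stands in for a short iperf UDP run that returns the measured loss fraction, and is passed in as a function since iperf itself provides no such API:

def find_available_bandwidth(run_udp_test, start_rate, step, loss_threshold=0.01):
    # run_udp_test(rate) performs a short (e.g. 10 s) UDP test at the
    # given rate and returns the measured loss fraction.
    rate, best = start_rate, None
    while True:
        loss = run_udp_test(rate)
        if loss >= loss_threshold:
            return best        # highest tested rate with acceptable loss
        best = rate
        rate += step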

By comparing the reported application throughput with the measured TCP throughput and the measured available bandwidth, it is possible to distinguish between application problems, TCP stack problems, and network issues. Note however that the differing nature of UDP and TCP flows means that their measurements should not be compared directly. Iperf sends UDP datagrams at a constant, steady rate, whereas TCP tends to send packet trains. This means that TCP is likely to suffer from congestion effects at a lower data rate than UDP.

In case of unexpectedly low available-bandwidth measurements on the end-to-end path, network administrators are interested in locating the bandwidth bottleneck. The best way to get this value is to retrieve it from passively measured link utilisations and provisioned capacities on all links along the path. However, if the path crosses multiple administrative domains this is often not possible because of restrictions on getting those values from other domains. Therefore, it is common practice to use measurement workstations along the end-to-end path, and thus to separate the end-to-end path into segments on which available-bandwidth measurements are done. This way it is possible to identify the segment on which the bottleneck occurs and to concentrate on it during further troubleshooting.

Other iperf use cases

Besides the capability of measuring TCP throughput and available bandwidth, in UDP mode iperf can report on packet reordering and delay jitter.

Other use cases for measurements using iperf are IPv6 bandwidth measurements and IP multicast performance measurements. More information of the iperf features, its source and binary code for different UNIXes and Microsoft Windows operating systems can be retrieved from the Iperf Web site.

Caveats and Known Issues

Impact on other traffic

As Iperf sends real full data streams it can reduce the available bandwidth on a given path. In TCP mode, the effect to the co-existing production flows should be negligible, assuming the number of production flows is much greater than the number of test data flows, which is normally a valid assumption on paths through a WAN. However, in UDP mode iperf has the potential to disturb production traffic, and in particular TCP streams, if the sender's data rate exceeds the available bandwidth on a path. Therefore, one should take particular care whenever running iperf tests in UDP mode.

TCP buffer allocation

On Linux systems, if you request a specific TCP buffer size with the "-w" option, the kernel will always try to allocate twice as many bytes as you specified.

Example: when you request 2MB window size you'll receive 4MB:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)    <<<<<<
------------------------------------------------------------

Counter overflow

Some versions seem to suffer from a 32-bit integer overflow which will lead to wrong results.

e.g.:

[ 14]  0.0-10.0 sec  953315416 Bytes  762652333 bits/sec
[ 14] 10.0-20.0 sec  1173758936 Bytes  939007149 bits/sec
[ 14] 20.0-30.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 30.0-40.0 sec  1173769072 Bytes  939015258 bits/sec
[ 14] 40.0-50.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 50.0-60.0 sec  1173751696 Bytes  939001357 bits/sec
[ 14]  0.0-60.0 sec  2531115008 Bytes  337294201 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

As you can see, the summary for 0-60 seconds doesn't match the average that one would expect. This is because the total number of Bytes is incorrect as a result of a counter wrap: the six 10-second intervals sum to roughly 6.8 GBytes, and subtracting 2^32 (4294967296) from that yields approximately the 2531115008 Bytes reported.

If you're experiencing this kind of effects, upgrade to the latest version of iperf, which should have this bug fixed.

UDP buffer sizing

UDP buffer sizing (the -w parameter) is also required with high-speed UDP transmissions. Otherwise the UDP receive buffer will typically overflow, and this will look like packet loss (even though looking at tcpdumps or counters reveals that all the data arrived). This shows up in UDP statistics (for example, on Linux with 'netstat -s' under Udp: ... receive packet errors). See more information at: http://www.29west.com/docs/THPM/udp-buffer-sizing.html

Control of measurements

There are two typical deployment scenarios, which differ in the kind of access the operator has to the sender and receiver instances. Measurements between well-placed measurement workstations within an administrative domain, e.g. a campus network, allow network administrators full control of the server and client configurations (including test schedules), and allow them to retrieve full measurement results. Measurements on paths that extend beyond the administrative domain's borders require access to, or collaboration with administrators of, the far-end systems. Iperf has two features that simplify its use in this scenario, such that the operator does not need to have an interactive login account on the far-end system:

  • The server instance may run as a daemon (option -D) listening on a configurable transport protocol port, and
  • It is possible to run bidirectional tests, either one after the other (option -r) or simultaneously (option -d).

Screen

Another method of running iperf on a *NIX device is to use 'screen'. Screen is a utility that lets you keep a session running even once you have logged out. It is described more fully here in its man pages, but a simple sequence applicable to iperf would be as follows:

[user@host]$screen -d -m iperf -s -p 5002 
This starts iperf -s -p 5002 as a 'detached' session

[user@host]$screen -ls
There is a screen on:
        25278..mp1      (Detached)
1 Socket in /tmp/uscreens/S-admin.
'screen -ls' shows open sessions.

'screen -r' reconnects to a running session. When in that session, keying 'CTRL+a', then 'd' detaches the screen. You can, if you wish, log out, log back in again, and re-attach. To end the iperf session (and the screen) just hit 'CTRL+c' whilst attached.

Note that BWCTL offers additional control and resource limitation features that make it more suitable for use over administrative domains.

Related Work

Public iperf servers

There used to be some public iperf servers available, but at present none are known. Similar services are provided by BWCTL (see below) and by public NDT servers.

BWCTL

BWCTL (BandWidth test ConTroLler) is a wrapper around iperf that provides scheduling and remote control of measurements.

Instrumented iperf (iperf100)

The iperf code provided by NLANR/DAST was instrumented in order to provide more information to the user. Iperf100 displays various web100 variables at the end of a transfer.

Patches are available at http://www.csm.ornl.gov/~dunigan/netperf/download/

The instrumented iperf requires a machine running a kernel.org linux-2.X.XX kernel with the latest web100 patches applied (http://www.web100.org)

Jperf

Jperf is a Java-based graphical front-end to iperf. It is now included as part of the iperf project on SourceForge. This was once available as a separate project called xjperf, but that seems to have been given up in favor of iperf/SourceForge integration.

iPerf for Android

An Android version of iperf appeared on Google Play (formerly the Android Market) in 2010.

nuttcp

Similar to iperf, but with an additional control connection that makes it somewhat more versatile.

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 10 Jul 2005 (Great Plains server)
-- AnnHarding & OrlaMcGann - Aug 2005 (DS3.3.2 content)
-- SimonLeinen - 08 Feb 2006 (OpenSS7 variant, BWCTL pointer)
-- BartoszBelter - 28 Mar 2006 (iperf100)
-- ChrisWelti - 11 Apr 2006 (examples, 32-bit overflows, buffer allocation)
-- SimonLeinen - 01 Jun 2006 (integrated DS3.3.2 text from Ann and Orla)
-- SimonLeinen - 17 Sep 2006 (tracked iperf100 pointer)
-- PekkaSavola - 26 Mar 2008 (added warning about -c/s having to be first, a common gotcha)
-- PekkaSavola - 05 Jun 2008 (added discussion of '-l' parameter and its significance
-- PekkaSavola - 30 Apr 2009 (added discussion of UDP (receive) buffer sizing significance
-- SimonLeinen - 23 Feb 2012 (removed installation instructions, obsolete public iperf servers)
-- SimonLeinen - 22 May 2012 (added notes about Iperf 3, Jperf and iPerf for Android)
-- SimonLeinen - 01 Feb 2013 (added pointer to Android app; cross-reference to Nuttcp)
-- SimonLeinen - 06 April 2014 (updated Iperf 3 section: now on Github and with releases)
-- SimonLeinen - 05 May 2014 (updated Iperf 3 section: documented more new features)

BWCTL

BWCTL is a command line client application and a scheduling and policy daemon that wraps throughput measurement tools such as iperf, thrulay, and nuttcp (versions prior to 1.3 only support iperf).

More Information: For configuration, common problems, examples etc
Description of bwctld.limits parameters

A typical BWCTL result looks like one of the two print outs below. The first print out shows a test run from the local host to 193.136.3.155 ('-c' stands for 'collector'). The second print out shows a test run to the localhost from 193.136.3.155 ('-s' stands for 'sender').

[user@ws4 user]# bwctl -c 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497760.433479: iperf -B 193.136.3.155 -P 1 -s -f b -m -p 5001 -t 10
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 193.136.3.155
TCP window size: 65536 Byte (default)
------------------------------------------------------------
[ 14] local 193.136.3.155 port 5001 connected with 62.40.108.82 port 35360
[ 14]  0.0-10.0 sec  16965632 Bytes  13547803 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
RECEIVER END


[user@ws4 user]# bwctl -s 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497801.610690: iperf -B 62.40.108.82 -P 1 -s -f b -m -p 5004 -t 10
------------------------------------------------------------
Server listening on TCP port 5004
Binding to local address 62.40.108.82
TCP window size: 87380 Byte (default)
------------------------------------------------------------
[ 16] local 62.40.108.82 port 5004 connected with 193.136.3.155 port 5004
[ ID] Interval       Transfer     Bandwidth
[ 16]  0.0-10.0 sec  8298496 Bytes  6622281 bits/sec
[ 16] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
[ 16] Read lengths occurring in more than 5% of reads:
[ 16]  1448 bytes read  5708 times (99.8%)
RECEIVER END

The read lengths are shown when the test is for received traffic. In his e-mail to TobyRodwell on 28 March 2005 Stanislav Shalunov of Internet2 writes:

"The values are produced by Iperf and reproduced by BWCTL.

An Iperf server gives you several most frequent values that the read() system call returned during a TCP test, along with the percentages of all read() calls for which the given read() lengths accounted.

This can let you judge the efficiency of interrupt coalescence implementations, kernel scheduling strategies and other such esoteric applesauce. Basically, reads shorter than 8kB can put quite a bit of load on the CPU (each read() call involves several microseconds of context switching). If the target rate is beyond a few hundred megabits per second, one would need to pay attention to read() lengths."

References (Pointers)

-- TobyRodwell - 28 Oct 2005
-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 22 Apr 2008 - 07 Dec 2008

Netperf

A client/server network performance benchmark (debian package: netperf).

Example

coming soon

netperf-wrapper

For his measurement work on fq_codel, Toke Høiland-Jørgensen developed Python wrappers for Netperf and added some tests, notably the RRUL (Realtime Response Under Load) test, which runs bulk TCP transfers in parallel with UDP and ICMP response-time measurements. This can be used to expose Bufferbloat on a path.

References

-- FrancoisXavierAndreu & SimonMuyal - 2005-06-06
-- SimonLeinen - 2013-03-30 - 2014-10-05

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two endpoints (hosts). The rude tool generates traffic to the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. This tool is available under http://rude.sourceforge.net/

Example

The following rude script will cause a constant-rate packet stream to be sent to a destination for 4 seconds, starting one second (1000 milliseconds) from the time the script was read, stopping five seconds (5000 milliseconds) after the script was read. The script can be called using rude -s example.rude.

START NOW
1000 0030 ON 3002 10.1.1.1:10001 CONSTANT 200 250
5000 0030 OFF

Explanation of the values: 1000 and 5000 are relative timestamps (in milliseconds) for the actions. 0030 is the identifier of a traffic flow. At 1000, a CONSTANT-rate flow is started (ON) with source port 3002, destination address 10.1.1.1 and destination port 10001. CONSTANT 200 250 specifies a fixed rate of 200 packets per second (pps) of 250-byte UDP datagrams, i.e. about 400 kbit/s of UDP payload.

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 28 Sep 2008

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A utility for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

Active Measurement Tools

Active measurement injects traffic into a network to measure properties about that network.

  • ping - a simple RTT and loss measurement tool using ICMP ECHO messages
  • fping - ping variant supporting concurrent measurement to multiple destinations
  • SmokePing - nice graphical representation of RTT distribution and loss rate from periodic "pings"
  • OWAMP - a protocol and tool for one-way measurements

There are several permanent infrastructures performing active measurements:

  • HADES - the DFN-developed HADES Active Delay Evaluation System, deployed in GEANT/GEANT2 and DFN
  • RIPE TTM
  • QoSMetricsBoxes used in RENATER's Métrologie project

-- SimonLeinen - 29 Mar 2006

SmokePing

Smokeping is a software-based measurement framework that uses various software modules (called probes) to measure round-trip times, packet loss and availability at layer 3 (IPv4 and IPv6), and even application latencies. Layer 3 measurements are based on the ping tool; for the analysis of applications there are probes such as those measuring DNS lookup and RADIUS authentication latencies. The measurements are centrally controlled on a single host, from which the software probes are started; these in turn emit active measurement flows. The load impact of these streams on the network is usually negligible.

As with MRTG, the results are stored in RRD databases at the original polling intervals for 24 hours and are then aggregated over time; the peak and mean values over 24-hour intervals are stored for more than one year. Like MRTG, the results are usually displayed in a web browser, using HTML-embedded graphs with daily, weekly, monthly and yearly timeframes. A particular strength of SmokePing is the graphical manner in which it displays the statistical distribution of latency values over time.

The tool has also an alarm feature that, based on flexible threshold rules, either sends out emails or runs external scripts.

The following picture shows an example of a weekly graph. The background colour indicates the link availability, and the foreground lines display the mean round-trip times. The shadows around the lines give a graphical indication of the statistical distribution of the measured round-trip times.

  • Example SmokePing graph:
    Example SmokePing graph

The 2.4* versions of Smokeping included SmokeTrace, an AJAX-based traceroute tool similar to mtr, but browser/server based. This was removed in 2.5.0 in favor of a separate, more general, tool called remOcular.

In August 2013, Tobi Oetiker announced that he received funding for the development of SmokePing 3.0. This will use Extopus as a front-end. This new version will be developed in public on Github. Another plan is to move to an event-based design that will make Smokeping more efficient and allow it to scale to a large number of probes.

Related Tools

The OpenNMS network management platform includes a tool called StrafePing which is heavily inspired by SmokePing.

References

-- SimonLeinen - 2006-04-07 - 2013-11-03

OWAMP (One-Way Active Measurement Tool)

OWAMP is a command line client application and a policy daemon used to determine one way latencies between hosts. It is an implementation of the OWAMP protocol (see references) that is currently going through the standardization process within the IPPM WG in the IETF.

With roundtrip-based measurements, it is hard to isolate the direction in which congestion is experienced. One-way measurements solve this problem and make the direction of congestion immediately apparent. Since traffic can be asymmetric at many sites that are primarily producers or consumers of data, this allows for more informative measurements. One-way measurements allow the user to better isolate the effects of specific parts of a network on the treatment of traffic.

The current OWAMP implementation (V3.1) supports IPv4 and IPv6.

Example using owping (OWAMP v3.1) on a Linux host:

welti@atitlan:~$ owping ezmp3
Approximately 13.0 seconds until results available

--- owping statistics from [2001:620:0:114:21b:78ff:fe30:2974]:52887 to [ezmp3]:41530 ---
SID:    feae18e9cd32e3ee5806d0d490df41bd
first:  2009-02-03T16:40:31.678
last:   2009-02-03T16:40:40.894
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 2.4/2.5/3.39 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [ezmp3]:41531 to [2001:620:0:114:21b:78ff:fe30:2974]:42315 ---
SID:    fe302974cd32e3ee7ea1a5004fddc6ff
first:  2009-02-03T16:40:31.479
last:   2009-02-03T16:40:40.685
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 1.99/2.1/2.06 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering

References

-- ChrisWelti - 03 Apr 2006 - 03 Feb 2009 -- SimonLeinen - 29 Mar 2006 - 01 Dec 2008

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

HADES

Hades Active Delay Evaluation System (HADES) devices (previously called IPPM devices) were developed by the WiN-Labor at RRZE (Regional Computing Centre Erlangen) to provide QoS measurements in DFN's G-WiN infrastructure, based on the metrics developed by the IETF IPPM WG. In addition to DFN's backbone, HADES devices have been deployed in numerous GEANT/GEANT2 Points of Presence (PoPs).

Example Output

The examples that follow were generated using the new (experimental) HADES front-end in early June 2006.

One-Way Delay Plots

A typical one-way delay plot looks like this.

  • IPPM Example: FCCN-GARR one-way delays:
    IPPM Example: FCCN-GARR one-way delays

The default view selects the scale of the y (delay) axis so that all values fit. In this case we see two occasions (around 12:30 and around 18:10) with "outliers" of up to 190 ms delay, much higher than the average of about 30 ms.

In the presence of such outliers, the auto-scaling feature makes it hard to discern variations in the non-outlying values. Fortunately, the new interface allows us to select the y axis scale by hand, so that we can zoom in to the delay range that we are interested in:

  • IPPM Example: FCCN-GARR one-way delay, narrower delay scale:
    IPPM Example: FCCN-GARR one-way delay, narrower delay scale

It now becomes evident that most of the samples lie in a very narrow band between 29.25 and 29.5 milliseconds. Interestingly, there are a few outliers towards lower delays. These could be artifacts of clock inaccuracies, or they could point to some kind of routing anomaly, although the latter seems less probable, because routing anomalies normally don't lead to shorter delays.

Concluding Remarks

Note that the IPPM system is being developed for the NREN community; hence its feature development focuses on that community's specific needs. For example, it implements measurement of out-of-order packets as well as metric measurements for different IP Precedence values.

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 01 Jun 2006

RIPE TTM Boxes

Summary

The Test Traffic Measurements (TTM) service has been offered by the RIPE NCC since the year 2000. The system continuously records one-way delay and packet-loss measurements as well as router-level paths ("traceroutes") between a large set of probe machines ("test boxes"). The test boxes are hosted by many different organisations, including NRENs, commercial ISPs, universities and others, and are usually maintained by the RIPE NCC as part of the service. While the vast majority of the test boxes are in Europe, there are a few machines on other continents, including outside the RIPE NCC's service area. Every test box includes a precise time source - typically a GPS receiver - so that accurate one-way delay measurements can be provided. Measurement data are entered into a central database at the RIPE NCC's premises every night. The database is based on CERN's ROOT system. Measurement results can be retrieved over the Web in various presentations, both pre-generated and "on-demand".

Applicability

RIPE TTM data is often useful for finding out "historical" quality information about a network path of interest, provided that TTM test boxes are deployed near (in the topological sense, not just geographically) the ends of the path. For research network paths throughout Europe, the coverage of the RIPE TTM infrastructure is not complete, but quite comprehensive. When one suspects (non-temporary) congestion on a path covered by the RIPE TTM measurement infrastructure, the TTM graphs make it easy to verify this, because such congestion will show up as delay and/or loss.

The RIPE TTM system can also be used by operators to precisely measure - but only after the fact - the impact of changes in the network. These changes can include disturbances such as network failures or misconfigurations, but also things such as link upgrades or routing improvements.

Notes

The TTM project has provided extremely high-quality and stable delay and loss measurements for a large set of network paths (IPv4 and recently also IPv6) throughout Europe. These paths cover an interesting mix of research and commercial networks. The Web interface to the collected measurements supports in-depth exploration quite well, such as looking at the delay/loss evolution of a specific path over a wide range of intervals, from the very short to the very long. On the other hand, it is hard to get at useful "overview" pictures. The RIPE NCC's DNS Server Monitoring (DNSMON) service provides such overviews for specific sets of locations.

The raw data collected by the RIPE TTM infrastructure is not generally available, in part because of restrictions on data disclosure. This is understandable in that RIPE NCC's main member/customer base consists of commercial ISPs, who presumably wouldn't allow full disclosure for competitive reasons. But it also means that in practice, only the RIPE NCC can develop new tools that make use of these data, and their resources for doing so are limited, in part due to lack of support from their ISP membership. On a positive note, the RIPE NCC has, on several occasions, given scientists access to the measurement infrastructure for research. The TTM team also offers to extract raw data (in ROOT format) from the TTM database manually on demand.

Another drawback of RIPE TTM is that the central TTM database is only updated once a day. This makes the system unusable for near-real-time diagnostic purposes. (For test box hosts, it is possible to access the data collected by one's hosted test boxes in near real time, but this provides only very limited functionality, because only information about "inbound" packets is available locally; therefore one can neither correlate delays with path changes, nor compute loss figures without access to the sending host's data.) In contrast, RIPE NCC's Routing Information Service (RIS) infrastructure provides several ways to get at the collected data almost as it is collected. And because the RIS raw data is openly available in a well-defined and relatively easy-to-use format, it is possible for third parties to develop innovative and useful ways to look at that data.

Examples

The following screenshot shows an on-demand graph of the delays over a 24-hour period from one test box hosted by SWITCH (tt85 at ETH Zurich) to the other one (tt86 at the University of Geneva). The delay range has been narrowed so that the fine-grained distribution can be seen. Five packets are listed as "overflow" under "STATISTICS - Delay & Hops" because their one-way delays exceeded the range of the graph. This is an example of an almost completely uncongested path, showing no packet loss and the vast majority of delays very close to the "baseline" value.

ripe-ttm-sample-delay-on-demand.png

The following plot shows the impact of a routing change on a path to a test box in New Zealand (tt47). The hop count decreased by one, but the base one-way delay increased by about 20 ms. The old and new routes can be retrieved from the Web interface too. Note also that the delay values are spread out over a fairly large range, which indicates that there is some congestion (and thus queueing) on the path. The loss rate over the entire day is 1.1%. An interesting question is how much of this is due to the routing change and how much (if any) is due to "normal" congestion. The lower right graph shows "arrived/lost" packets over time and indicates that most packets were in fact lost during the path change - so although some congestion is visible in the delays, it is not severe enough to cause visible loss. On the other hand, given the total number of probe packets sent (about 2850 per day), congestion loss would probably go unnoticed until it reached a level of 0.1% or so.

day.tt85.tt47.gif

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

QoSMetrics Boxes

This measurement infrastructure was built as part of RENATER's Métrologie efforts. The infrastructure performs continuous delay measurement between a mesh of measurement points on the RENATER backbone. The measurements are sent to a central server every minute, and are used to produce a delay matrix and individual delay histories (IPv4 and IPv6 measurements, soon for each CoS).

Please note that all results (tables and graphs) presented on this page are generated by RENATER's scripts from the QoSMetrics MySQL database.

Screendump from RENATER site (http://pasillo.renater.fr/metrologie/get_qosmetrics_results.php)

new_Renater_IPPM_results_screendump.jpg

Example of an asymmetric path:

(after the problem was resolved, only one of the two paths had its hop count decreased)

Graphs:

Paris_Toulouse_delay Toulouse_Paris_delay
Paris_Toulouse_jitter Toulouse_Paris_jitter
Paris_Toulouse_hop_number Toulouse_Paris_hop_number
Paris_Toulouse_pktsLoss Toulouse_Paris_pktsLoss

Traceroute results:

traceroute to 193.51.182.xx (193.51.182.xx), 30 hops max, 38 byte packets
1 209.renater.fr (195.98.238.209) 0.491 ms 0.440 ms 0.492 ms
2 gw1-renater.renater.fr (193.49.159.249) 0.223 ms 0.219 ms 0.252 ms
3 nri-c-g3-0-50.cssi.renater.fr (193.51.182.6) 0.726 ms 0.850 ms 0.988 ms
4 nri-b-g14-0-0-101.cssi.renater.fr (193.51.187.18) 11.478 ms 11.336 ms 11.102 ms
5 orleans-pos2-0.cssi.renater.fr (193.51.179.66) 50.084 ms 196.498 ms 92.930 ms
6 poitiers-pos4-0.cssi.renater.fr (193.51.180.29) 11.471 ms 11.459 ms 11.354 ms
7 bordeaux-pos1-0.cssi.renater.fr (193.51.179.254) 11.729 ms 11.590 ms 11.482 ms
8 toulouse-pos1-0.cssi.renater.fr (193.51.180.14) 17.471 ms 17.463 ms 17.101 ms
9 xx.renater.fr (193.51.182.xx) 17.598 ms 17.555 ms 17.600 ms

[root@CICT root]# traceroute 195.98.238.xx
traceroute to 195.98.238.xx (195.98.238.xx), 30 hops max, 38 byte packets
1 toulouse-g3-0-20.cssi.renater.fr (193.51.182.202) 0.200 ms 0.189 ms 0.111 ms
2 bordeaux-pos2-0.cssi.renater.fr (193.51.180.13) 16.850 ms 16.836 ms 16.850 ms
3 poitiers-pos1-0.cssi.renater.fr (193.51.179.253) 16.728 ms 16.710 ms 16.725 ms
4 nri-a-pos5-0.cssi.renater.fr (193.51.179.17) 22.969 ms 22.956 ms 22.972 ms
5 nri-c-g3-0-0-101.cssi.renater.fr (193.51.187.21) 22.972 ms 22.961 ms 22.844 ms
6 gip-nri-c.cssi.renater.fr (193.51.182.5) 17.603 ms 17.582 ms 17.616 ms
7 250.renater.fr (193.49.159.250) 17.836 ms 17.718 ms 17.719 ms
8 xx.renater.fr (195.98.238.xx) 17.606 ms 17.707 ms 17.226 ms

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005 - 07 Apr 2006

Passive Measurement Tools

Network Usage:

  • SNMP-based tools retrieve network information such as state and utilization of links, router CPU loads, etc.: MRTG, Cricket
  • Netflow-based tools use flow-based accounting information from routers for traffic analysis, detection of routing and security problems, denial-of-service attacks etc.

Traffic Capture and Analysis Tools

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006

SNMP-based tools

The Simple Network Management Protocol (SNMP) is widely used for device monitoring over IP, especially for monitoring infrastructure components of such networks - routers, switches, etc. There are many tools that use SNMP to retrieve network information such as the state and utilization of links, routers' CPU load, etc., and generate various types of visualizations, other reports, and alarms.

  • MRTG (Multi-Router Traffic Grapher) is a widely-used open-source tool that polls devices every five minutes using SNMP, and plots the results over day, week, month, and year timescales.
  • Cricket serves the same purpose as MRTG, but is configured in a different way - targets are organized in a tree, which allows inheritance of target specifications. Cricket has been using RRDtool from the start, although MRTG has picked that up (as an option) as well.
  • RRDtool is a round-robin database used to store periodic measurements such as SNMP readings. Recent values are stored at finer time granularity than older values; RRDtool incrementally "consolidates" values as they age, and uses an efficient constant-size representation. RRDtool is agnostic to SNMP, but it is used by Cricket, MRTG (optionally), and many other SNMP tools; a small hedged example follows this list.
  • Synagon is a Python-based tool that performs SNMP measurements (Juniper firewall counters) at much shorter timescales (seconds) and graphs them in real time. This can be used to look at momentary link utilizations, within the limits of how often devices actually update their SNMP values.
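
As a rough sketch of how such tools use RRDtool (the file name, data source, and retention periods are made up for this example):

# create a database sampled every 300 seconds, keeping one day of
# 5-minute averages and two weeks of 30-minute averages
rrdtool create link.rrd --step 300 \
    DS:ifInOctets:COUNTER:600:0:U \
    RRA:AVERAGE:0.5:1:288 \
    RRA:AVERAGE:0.5:6:672

# feed it a counter reading ("N" means now)
rrdtool update link.rrd N:1234567890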

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006 - 16 Jul 2007

Synagon SNMP Graphing Tool

Overview

Synagon is a Python-based program that collects and graphs firewall counter values from Juniper routers. The tool first requires the user to select a router from a list. It will retrieve all the available firewall filters/counters from that router, and the user can select any of these to graph. It was developed by DANTE for internal use, but could be made available on request (on a case-by-case basis), in an entirely unsupported manner. Please contact operations@dante.org.uk for more details.

[The information below is taken from DANTE's internal documentation.]

Installation and Setup procedures

Synagon comes as a single compressed archive that can be copied into a directory and then invoked like any other standard Python script.

It requires the following python libraries to operate:

1) The Python SNMP framework (available at http://pysnmp.sourceforge.net)

The SNMP framework is required to talk SNMP to the routers. For guidance on how to install the package, please see the section below.

You also need to create a network.py file that describes the network topology the tool will act upon. The configuration file should be located in the same directory as the tool itself. Please see the section below for more information on the format of the file.

  • Example Synagon screen:
    Example Synagon screen

How to use Synagon

The tool can be invoked by double clicking on the file name (under windows) or by typing the following on the command shell:

python synagon.py

Prior to that, a network.py configuration file must have been created in the directory from which the tool is invoked. It describes the location of the routers and the SNMP community to use when contacting them.

Immediately afterwards, a window with a single menu entry will appear. Clicking on the menu displays a list of all routers described in network.py as submenus.

Under each router, a submenu shows the list of firewall filters for that router (if the list has previously been collected), together with a collect submenu. The collect submenu initiates the collection of the firewall filters from the router and then updates the list of firewall filters in the submenu where the collect button is located. The list of firewall filters of each router is also stored on the local disk, so the next time the tool is invoked it won't be necessary to interrogate the router for the firewall list; it will be read from the file instead. When the firewall filter list is retrieved, a submenu with all counters and policers for each filter is displayed. A policer is marked with [p] and a counter with a [c] just after its name.

When a counter or policer of a firewall filter of a router is selected, SNMP polling for that entity begins. After a few seconds, a window appears that plots the requested values.

The graph shows the rate of bytes dropped since monitoring started for a particular counter, or the rate of packets dropped since monitoring started for a particular policer. The save graph button can save the graph to a file in encapsulated PostScript format (.eps).

Because values are retrieved frequently, there are memory consumption and performance concerns. To avoid these, a configurable limit is placed on the number of samples plotted on a graph. The handle at the top of the window limits the number of samples plotted; it has an impact on performance if it is positioned at a high scale (in the thousands).

Developer's guide

This section provides information about how the tool internally works.

The tool mainly consists of the felegktis.py file, but it also relies on the external csnmp.py and graph.py libraries. felegktis.py defines two threads of control; one initiates the application and is responsible for collecting the results, the other is responsible for the visual part of the application and the interaction with the user.

The GUI thread updates a shared structure (a dictionary) with contact information about all the firewalls the user has chosen by that time to monitor.

active_firewall_list = {}

The structure holds the following information:

router_name->filter_name->counter_name(collect, instance, type, bytes, epoch)

There is an entry per counter; each entry records whether the counter should be retrieved (collect), the SNMP instance for that counter (instance), whether it is a policer or a counter (type), the last byte counter value (bytes), and the time the last value was received (epoch).

It operates as follows:

do forever:
   if the user has shutdown the GUI thread:
      terminate

   for each router in active_firewall_list:
      for each filter of each router:
         for each counter/policer of each filter:
            retrieve counter value
            calculate change rate 
            pass rate of that counter to the GUI thread

It uses SNMP to request the values from the router.

The GUI thread (MainWindow class) first creates the user interface (based on the Tkinter Tcl/Tk module) and is passed the GEANT network topology found in the network.py file. It then creates a menu list with all routers found in the topology. Each router entry has a submenu with a collect item. This item is bound to a function; when it is selected, the router is interrogated via SNMP on the Juniper firewall MIB, and the collection of filter names/counter names for that router is retrieved. Because the interrogation may take tens of seconds, the firewall filter list is serialised to the local directory using the cPickle built-in module. The file is given the name router_name.filter (e.g. uk1.filters for router uk1) and is stored in the directory the tool was invoked from. The next time the tool is invoked, it checks whether there is a filter list file for the router; if so, it deserialises the list of firewall filters from the file and populates the router's submenu. The serialised firewall list can of course be out of date, but the collect submenu entry for each router is always present and can be invoked to replace the current list (both in memory and on disk) with the latest details. If the user selects a firewall counter or policer to monitor from the submenu list, it is added to the active_firewall_list, and the main thread will pick it up and begin retrieving data for that counter/policer.

The main thread retrieves values and passes the calculated rates to the MainWindow thread via a queue that the main thread populates and the GUI thread reads from. Because the updates can come from a variety of different firewall filters, the queue is indexed with a text string: a unique identifier built from the router name, firewall filter name, and counter/policer name. The GUI thread periodically (every 250 ms) checks whether there are data in this queue. If an identifier hasn't been seen before, a new window is created in which the values are plotted. If the window already exists, it is updated with the new value and its contents are redrawn. A window is an instance of the class PlotList.

When a window is closed by the user, the GUI thread modifies the active_firewall_list entry for that counter/policer by setting the collect attribute to None; the next time the main thread reaches that counter/policer, it will ignore it.

Creating a single synagon package

The tool consists of several files:

  • felegktis.py [main logic]
  • graph.py [graph library]
  • csnmp.py [ SNMP library]

Using the squeeze tool (available at http://www.pythonware.com/products/python/squeeze/index.htm), these files can be combined into a single compressed, executable Python archive.

The build_synagon.bat script uses the squeeze tool to archive all these files into one.

build_synagon.bat:
REM build the sinagon distribution
python squeezeTool.py -1 -o synagon -b felegktis felegktis.py graph.py csnmp.py

The command above builds synagon.py from files felegktis.py, graph.py and csnmp.py.

Appendices

A. network.py

network.py describes the network's topology - its format can be deduced by inspection. Currently the network topology name (graph name) should be set to geant - the script could be adjusted to change this behaviour.

Compulsory router properties:

  • type: the type of the router [juniper, cisco, etc.] (the tool only operates on Juniper routers)
  • address: the address at which the router can be contacted
  • community: the SNMP community to authenticate on the router
Example:
# Geant network
geant = graph.Graph()

# Defaults for router
class Router(graph.Vertex):
    """ Common options for geant routers """

    def __init__(self):
        # Initialise base class
        graph.Vertex.__init__(self)
        # Set the default properties for these routers
        self.property['type'] = 'juniper'
        self.property['community'] = 'commpassword'

uk1 = Router()
gr1 = Router()


geant.vertex['uk1'] = uk1
geant.vertex['gr1'] = gr1

uk1.property['address'] = 'uk1.uk.geant.net'
gr1.property['address'] = 'gr1.gr.geant.net'

B. Python Package Installation

Python has built-in support for installing packages.

Packages usually come in source format and are compiled at installation time. Packages can be pure Python source code, or have extensions in other languages (C, C++). Packages can also come in binary form.

Many Python packages that have extensions written in some other language need to be compiled into a binary package before being distributed to platforms like Windows, because not all Windows machines have the C or C++ compilers that a package may require.

The Python SNMP framework library

The source version can be downloaded from http://pysnmp.sourceforge.net

When the file is uncompressed, the user should invoke setup.py:

python setup.py install

There is also a binary version for Windows in the NEPs wiki. It has a simple GUI; just keep pushing the next button until the package is installed.

-- TobyRodwell - 23 Jan 2006

Netflow-based Tools

Netflow: analysis tools to characterize the traffic in the network, detect routing problems, security problems (e.g. denial-of-service attacks), etc.

There are different versions of NDE (NetFlow Data Export): v1 (the first), v5, v7, v8, and the latest version (v9), which is based on templates. This last version makes it possible to analyze IPv6, MPLS and multicast flows.

  • NDE v9 description:
    NDE v9 description

References

-- FrancoisXavierAndreu - 06 Jun 2005 - 07 Apr 2006 -- SimonLeinen - 05 Jan 2006

libtrace

libtrace is a library for trace processing. It supports multiple input methods, including device capture, raw and gz-compressed traces, and sockets; and multiple input formats, including pcap and DAG.

References

-- SimonLeinen - 04 Mar 2006

Netdude

Netdude (Network Dump data Displayer and Editor) is a framework for inspection, analysis and manipulation of tcpdump (libpcap) trace files.

References

-- SimonLeinen - 04 Mar 2006

jnettop

jnettop captures traffic passing through the host it is running on and displays streams sorted by the bandwidth they use. The result is a listing of network communication grouped by stream, showing transported bytes and consumed bandwidth.

References

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 04 Mar 2006

Host and Application Measurement Tools

Network performance is only one part of a distributed system's performance. An equally important part is the performance of the actual application and the hardware it runs on.

  • Web100 - fine-grained instrumentation of TCP implementation, currently available for the Linux kernel.
  • SIFTR - TCP instrumentation similar to Web100, available for *BSD
  • DTrace - Dynamic tracing facility available in Solaris and *BSD
  • NetLogger - a framework for instrumentation and log collection from various components of networked applications

-- TobyRodwell - 22 Mar 2006
-- SimonLeinen - 28 Mar 2006

NetLogger

NetLogger (copyright Lawrence Berkeley National Laboratory) is both a set of tools and a methodology for analysing the performance of a distributed system. The methodology (below) can be implemented separately from the LBL-developed tools. For full information on NetLogger see its website, http://dsd.lbl.gov/NetLogger/

Methodology

From the NetLogger website

The NetLogger methodology is really quite simple. It consists of the following:

  1. All components must be instrumented to produce monitoring events. These components include application software, middleware, operating system, and networks. The more components that are instrumented, the better.
  2. All monitoring events must use a common format and a common set of attributes. Monitoring events must also all contain a precision timestamp which is in a single timezone (GMT) and globally synchronized via a clock synchronization method such as NTP.
  3. Log all of the following events: Entering and exiting any program or software component, and begin/end of all IO (disk and network).
  4. Collect all log data in a central location
  5. Use event correlation and visualization tools to analyze the monitoring event logs.

Toolkit

From the NetLogger website

The NetLogger Toolkit includes a number of separate components which are designed to help you do distributed debugging and performance analysis. You can use any or all of these components, depending on your needs.

These include:

  • NetLogger message format and data model: A simple, common message format for all monitoring events which includes high-precision timestamps
  • NetLogger client API library: C/C++, Java, PERL, and Python calls that you add to your existing source code to generate monitoring events. The destination and logging level of NetLogger messages are all easily controlled using an environment variable. These libraries are designed to be as lightweight as possible, and to never block or adversely affect application performance.
  • NetLogger visualization tool (nlv): a powerful, customizable X-Windows tool for viewing and analysis of event logs based on time correlated and/or object correlated events.
  • NetLogger host/network monitoring tools: a collection of NetLogger-instrumented host monitoring tools, including tools to interoperate with Ganglia and MonALisa.
  • NetLogger storage and retrieval tools, including a daemon that collects NetLogger events from several places at a single, central host; a forwarding daemon to forward all NetLogger files in a specified directory to a given location; and a NetLogger event archive based on mySQL.

References

-- TobyRodwell - 22 Mar 2006

NREN Tools and Statistics

A list of all NRENs and other networks who offer statistics/info for their networks and/or have tools available for PERT staff to use.

Outside Europe:

For information on other NRENs' monitoring tools please see the traffic monitoring URLs section of the TERENA Compendium

A list of network weathermaps is maintained in the JRA1 Wiki.

-- TobyRodwell - 14 Apr 2005 -- SimonLeinen - 20 Sep 2005 - 05 Aug 2011 -- FrancoisXavierAndreu - 05 Apr 2006 -- LuchesarIliev - 27 Apr 2006

GÉANT Tools

PERT Staff can access a range of tools for checking the status of the GÉANT network by logging into https://tools.geant.net/portal/ (if a member of the PERT does not have an account for this site they should contact DANTE operations operations@dante.org.uk).

The following tools are available:

GÉANT Usage Map: For each country currently connected to GÉANT, this map shows the level of usage of the access link connecting the country's national research network to GÉANT.

Network Weathermap: Map showing the GÉANT routers and the circuits interconnecting them. The circuits are coloured according to their utilisation. LOGIN REQUIRED

GÉANT Looking Glass: this allows users to execute basic operational commands on GÉANT routers, including ping, traceroute and show route

IPv4 and IPv6 Beacon: Beacon nodes are located in the majority of GÉANT PoPs. They send and receive a specific multicast group, and from this are able to infer the performance of multicast traffic throughout GÉANT.

HADES Measurement Points: These network performance Measurement Points (MPs) run DFN's HADES IPPM software (OWD, OWDV, packet loss) and also iperf/BWCTL. PERT engineers can log in to the devices and run AB tests (they should use common sense so as not to overload low-capacity or already heavily loaded paths).

BWCTL Measurement Points: NRENs can run BWCTL/iperf tests to, from and between GEANT2's BWCTL servers. See GeantToolsDanteBwctl for more information.

-- TobyRodwell - 05 Apr 2005 -- SimonLeinen - 20 Sep 2005 - 05 Aug 2011

Performance-Related Tools at SWITCH

  • IPv4 Looking Glass -- allows running IPv4 ping, traceroute, and several BGP monitoring commands on external border routers
  • IPv6 Looking Glass -- like the above, but for IPv6
  • Network Diagnostic Tester (NDT) -- run TCP throughput tests to a Web100-instrumented server, and detect configuration and cabling issues on your client. Java 1.4 or newer required
  • BWCTL & OWAMP testboxes -- run bwctl (iperf) and one-way delay tests (owamp) to our PMPs
  • SmokePing -- permanent RTT ("ping") measurements to select places on the Internet, with good visualization
  • RIPE TTM -- tt85 (at SWITCH's PoP at ETH Zurich) and tt86 (at the University of Geneva) show one-way delays, delay variations and loss from/to SWITCH from many other networks.
  • IEPM-BW -- regular throughput measurements from SLAC to many hosts, including one at SWITCH
  • Multicast beacons -- IPv4/IPv6 multicast reachability matrices

-- AlexGall - 15 Apr 2005

PSNC - Network Monitoring Tools

Public Tools

Private Tools

Contact PSNC NOC <noc@man.poznan.pl> for information from any of the following tools

  • Smokeping Latency Graphs
  • MRTG traffic statistics

-- BartoszBelter - 18 Aug 2005

Public Tools

  • MRTG - Traffic statistics for core and access circuits
  • Map - Weathermap based on MRTG
  • Looking Glass - IPv4 and IPv6

Private Tools

Contact HEAnet NOC for information from any of the following tools

  • Smokeping - latency graphing
  • Netsaint - reachability and latency alarms
  • Netflow
  • RIPE TTM
  • Multicast Beacon

-- AnnRHarding - 17 May 2005

ISTF Network Monitoring Tools

Network Monitoring Tools Portal -- includes links to:

  • Looking Glass (IPv4 and IPv6 ping, trace, bgp)
  • Cacti Network Graphs
  • Smokeping Latency Graphs
  • Rancid CVS repository

-- VedrinJeliazkov - 30 May 2005

Public Tools

Private Tools

Contact FCCN NOC for information from any of the following tools:

  • Netsaint - reachability and latency alarms
  • Multicast Beacon IPv4 - IPv4 multicast reachability matrix
  • RRDTool
  • Iperf (IPv4/ IPv6)

-- MonicaDomingues - 31 May 2005

Public Tools

Private Tools

Contact IUCC NOC (nocplus@noc.ilan.net.il) for information from any of the following tools:

IUCC is willing for PERT CMs to have access to non-public IUCC network information. If as a part of an investigation a PERT CM needs more info about the IUCC network they should contact Hank (or the IUCC NOC) for more assistance.

-- HankNussbacher - 02 Jun 2005 -- RafiSadowsky - 25 Mar 2006

Public Tools

Private Tools

  • IPv6 cricket -- IPv6 cricket
  • Cricket -- Traffic statistics for core and access circuits
  • Weathermap -- Hungarnet Weathermap
  • Router configuration database IPv4/IPv6 -- CVS database of router configurations
  • IPv6 looking glass -- allows ping, traceroute and BGP commands
  • Nagios -- latency and service availability alarms (both IPv6/IPv4)

-- MihalyMeszaros - 16 Mar 2007

RENATER - Network Monitoring Tools

Public Tools

-- FrancoisXavierAndreu - 05 Apr 2006

Network Emulation Tools

In general, it is difficult to assess performance of distributed applications and protocols before they are deployed on the real network, because the interaction with network impairments such as delay is hard to predict. Therefore, researchers and practitioners often use emulation to mimic deployment (typically wide-area) networks for use in laboratory testbeds.

The emulators listed below implement logical interfaces with configurable parameters such as delay, bandwidth (capacity) and packet loss rates.

-- TobyRodwell - 06 Apr 2005 -- SimonLeinen - 15 Dec 2005

Linux netem (introduced in recent 2.6 kernels)

NISTnet works only on 2.4.x Linux kernels, so for the best support for recent GE cards, I would recommend kernel 2.6.10 or .11 and the new netem (network emulator). A lot of similar functionality is built into this special QoS queue. It is buried in menuconfig:

 Networking -->
   Networking Options -->
     QoS and/or fair queuing -->
        Network emulator

You will need a recent iproute2 package that supports netem. The tc command is used to configure netem characteristics such as loss rates and distribution, delay and delay variation, and packet corruption rate (packet corruption was added in Linux version 2.6.16).
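
As a hedged example of such configuration (the interface name eth0 and all values are assumptions for the example):

# emulate 100 ms delay with 10 ms random variation on outgoing packets
tc qdisc add dev eth0 root netem delay 100ms 10ms

# additionally drop 0.1% of packets
tc qdisc change dev eth0 root netem delay 100ms 10ms loss 0.1%

# remove the emulation again
tc qdisc del dev eth0 root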

References

-- TobyRodwell - 14 Apr 2005
-- SimonLeinen - 21 Mar 2006 - 07 Jan 2007 -- based on information from David Martinez Moreno (RedIRIS)

Larry Dunn from Cisco has set up a portable network emulator based on NISTnet. He wrote this about it:

My systems started as 2x100BT, but not for the (cost) reasons you suspect. I have 3 cheap Linux systems in a mobile rack (3x1RU, <$5k USD total). They are attached in a "row", 2 with web100, 1 with NISTNet like this: web100 - NISTNet - web100. The NISTNet node acts as a variable delay-line (emulates different fiber propagation delays), and as a place to inject loss, if desired. DummyNet/BSD could work as well.

I was looking for a very "shallow" (front-to-back) 1RU system, so I could fit it into a normal-sized Anvil case (for easy shipping to Networkers, etc). Penguin Computing was one of 2 companies I found that was then selling such "shallow" rack-mount systems. The shallow units were "low-end" servers, so they only had 2x100BT ports.

Their current low-end systems have two 1xGE ports, and are (still) around $1100 USD each. So a set of 3 systems (inc. rack), Linux preloaded, 3-yr warranty, is still easily under $5k (Actually, closer to $4k, including a $700 Anvil 2'x2'x2' rolling shock-mount case, and a cheap 8-port GE switch). Maybe $5k if you add more memory & disk...

Here's a reference to their current low-end stuff:

http://penguincomputing.com/products/servers/relion1XT.php

I have no affiliation with Penguin, just happened to buy their stuff, and it's working OK (though the fans are pretty loud). I suppose there are comparable vendors in Europe.

-- TobyRodwell - 14 Apr 2005

Large Send Offload (LSO)

This feature is also known as "Segmentation Offload", "TCP Segmentation Offload (TSO)", "[TCP] Multidata Transmit (MDT)", or "TCP Large Send".

From Microsoft's document, Windows Network Task Offload:

With Segmentation Offload, or TCP Large Send, TCP can pass a buffer to be transmitted that is bigger than the maximum transmission unit (MTU) supported by the medium. Intelligent adapters implement large sends by using the prototype TCP and IP headers of the incoming send buffer to carve out segments of required size. Copying the prototype header and options, then calculating the sequence number and checksum fields creates TCP segment headers. All other information, such as options and flag values, are preserved except where noted.

Large Send Offload can be seen as doing for output what interrupt coalescence combined with large-receive offload does for input, namely reduce the number of (bus/interrupt) transactions between CPUs and network adapters by bundling multiple packets to larger transactions (scatter/gather).

Hardware (network adapter) support for LSO is a refinement of transmit chaining, where multiple transmitted frames can be sent from the host to the adapter in a single transaction.

Issues with Large Send Offload

Timing

Like Interrupt Coalescence, LSO can affect packet timing and increase burstiness. An illustration of this effect is this patch that modified LSO (TSO as it is called in Linux) to bound the time that outgoing segments can be held while trying to accumulate a larger transfer unit. The accompanying message to the netdev mailing list includes some graphs that show the impact of (pre-patch) TSO on RTTs over a low-speed link.

(Transport) Protocol Fossilization

The way it is defined by most of the industry, LSO needs to be aware of the transport protocols. In particular, it must be able to split over-large transport segments into suitable sub-segments, and generate transport (e.g. TCP) headers for these sub-segments. This function is typically implemented in the adapter's firmware, for some popular transport protocol such as TCP. This makes it hard to implement additional functions such as IPSec, or the TCP MD5 Authentication option, or even other transport protocols such as SCTP.

There is a weakened form of LSO that requires the host operating system to prepare the segmentation and construct headers. This allows for "dumber" network adapters, and in particular it doesn't require them to be transport protocol-aware. It still provides significant performance improvement because multiple segments can be transferred between host and adapter in a single transaction, which reduces bus occupation and other overhead. Sun's Solaris operating system supports this variant of LSO under the name of "MDT" (Multidata Transmit), and the Linux kernel added something similar as part of "GSO" in 2.6.18 (September 2006) for IPv4 and in 2.6.35 (August 2010) for IPv6.

Configuration

Under Linux, LSO/TSO can be controlled using the -K option to the ethtool command, which can also be used to control other offloading features. It is typically enabled by default if kernel/driver and adapter support it.
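
A hedged example (the interface name eth0 is an assumption; exact option names can vary between ethtool versions):

# show the current offload settings
ethtool -k eth0

# disable TCP segmentation offload, e.g. to check whether it causes burstiness
ethtool -K eth0 tso off

# re-enable it
ethtool -K eth0 tso on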

References

-- TobyRodwell & SimonLeinen - 28 Feb 2005 - 11 Aug 2010

Interrupt Coalescence (also called Interrupt Moderation, Interrupt Blanking, or Interrupt Throttling)

A common bottleneck for high-speed data transfers is the high rate of interrupts that the receiving system has to process - traditionally, a network adapter generates an interrupt for each frame that it receives. These interrupts consume signaling resources on the system's bus(es), and introduce significant CPU overhead as the system transitions back and forth between "productive" work and interrupt handling many thousand times a second.

To alleviate this load, some high-speed network adapters support interrupt coalescence. When multiple frames are received in a short timeframe ("back-to-back"), these adapters buffer those frames locally and only interrupt the system once.

Interrupt coalescence together with large-receive offload can roughly be seen as doing on the "receive" side what transmit chaining and large-send offload (LSO) do for the "transmit" side.

Issues with interrupt coalescence

While this scheme lowers interrupt-related system load significantly, it can have adverse effects on timing, and make TCP traffic more bursty or "clumpy". Therefore it would make sense to combine interrupt coalescence with on-board timestamping functionality. Unfortunately that doesn't seem to be implemented in commodity hardware/driver combinations yet.

The way that interrupt coalescence works, a network adapter that has received a frame doesn't send an interrupt to the system right away, but waits for a little while in case more packets arrive. This can have a negative impact on latency.

In general, interrupt coalescence is configured such that the additional delay is bounded. On some implementations, these delay bounds are specified in units of milliseconds, on other systems in units of microseconds. It requires some thought to find a good trade-off between latency and load reduction. One should be careful to set the coalescence threshold low enough that the additional latency doesn't cause problems. Setting a low threshold will prevent interrupt coalescence from occurring when successive packets are spaced too far apart. But in that case, the interrupt rate will probably be low enough so that this is not a problem.

Configuration

Configuration of interrupt coalescence is highly system dependent, although there are some parameters that are more or less common over implementations.

Linux

On Linux systems with additional driver support, the ethtool -C command can be used to modify the interrupt coalescence settings of network devices on the fly.
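
A hedged example of such on-the-fly tuning (eth0 and the values are assumptions; the set of supported parameters depends on the driver):

# show the current coalescence settings
ethtool -c eth0

# interrupt at most once every 125 microseconds for received frames,
# or as soon as 32 frames have accumulated
ethtool -C eth0 rx-usecs 125 rx-frames 32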

Some Ethernet drivers in Linux have parameters to control Interrupt Coalescence (Interrupt Moderation, as it is called in Linux). For example, the e1000 driver for the large family of Intel Gigabit Ethernet adapters has the following parameters according to the kernel documentation:

InterruptThrottleRate
limits the number of interrupts per second generated by the card. Values >= 100 are interpreted as the maximum number of interrupts per second. The default value was 8'000 up to and including kernel release 2.6.19. A value of zero (0) disables interrupt moderation completely. Above 2.6.19, some values between 1 and 99 can be used to select adaptive interrupt rate control. The first adaptive modes are "dynamic conservative" (1) and dynamic with reduced latency (3). In conservative mode (1), the rate changes between 4'000 interrupts per second when only bulk traffic ("normal-size packets") is seen, and 20'000 when small packets are present that might benefit from lower latency. In the more aggressive mode (3), "low-latency" traffic may drive the interrupt rate up to 70'000 per second. This mode is supposed to be useful for cluster communication in grid applications.
RxIntDelay
specifies, in multiples of 1'024 microseconds, the time after reception of a frame to wait for another frame to arrive before sending an interrupt.
RxAbsIntDelay
bounds the delay between reception of a frame and generation of an interrupt. It is specified in units of 1'024 microseconds. Note that InterruptThrottleRate overrides RxAbsIntDelay, so even when a very short RxAbsIntDelay is specified, the interrupt rate should never exceed the rate specified (either directly or by the dynamic algorithm) by InterruptThrottleRate
RxDescriptors
specifies the number of descriptors to store incoming frames on the adapter. The default value is 256, which is also the maximum for some types of E1000-based adapters. Others can allocate up to 4'096 of these descriptors. The size of the receive buffer associated with each descriptor varies with the MTU configured on the adapter. It is always a power-of-two number of bytes. The number of descriptors available will also depend on the per-buffer size. When all buffers have been filled by incoming frames, an interrupt will have to be signaled in any case.
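
As a hedged illustration, such parameters are typically set when the driver module is loaded (the values here are made up for the example):

# fix the interrupt rate at 8'000 interrupts per second
modprobe e1000 InterruptThrottleRate=8000

# or select the "dynamic conservative" adaptive mode instead
modprobe e1000 InterruptThrottleRate=1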

Solaris

As an example, see the Platform Notes: Sun GigaSwift Ethernet Device Driver. It lists the following parameters for that particular type of adapter:

rx_intr_pkts
Interrupt after this number of packets have arrived since the last packet was serviced. A value of zero indicates no packet blanking. (Range: 0 to 511, default=3)
rx_intr_time
Interrupt after this many 4.5-microsecond ticks have elapsed since the last packet was serviced. A value of zero indicates no time blanking. (Range: 0 to 524287, default=1250)

References

-- SimonLeinen - 04 Jul 2005 - 02 Jul 2011

Checksum Offload

A large part of the processing costs related to TCP is the generation and verification of the TCP checksum. Many Gigabit Ethernet chipsets include "on-board" hardware that can verify and/or generate these checksums. This significantly reduces the amount of work that has to be done by the system kernel on a CPU, especially when combined with other adapter/driver enhancements such as Large-Send Offload. Checksum Offload is also part of TCP Offload Engines (TOEs), which move the entire TCP processing from the CPU(s) to the adapter. Checksum Offload requires special driver support and a kernel infrastructure that supports such drivers.

TCP Checksum Offload is the most common form of checksum offload, but of course it is possible to offload other checksums such as the UDP or SCTP checksums.

Some people (at HP?) abbreviate Checksum Offload as "CKO".

A possible issue with offloading checksums to the controller is that the integrity protection is less "end-to-end": If there are errors in the internal (bus) transmission of data from the host processor/memory to the adapter, the adapter will happily compute checksums on the corrupted data, which means that the corruption will go undetected at the receiver.

-- SimonLeinen - 18 May 2008

TCP Offload Engines (TOEs)

The idea of a TOE is to put the TCP implementation onto the network adapter itself. This relieves the computer's CPUs of handling TCP segmentation, reassembly, checksum calculation and verification, and so on. Large-Send Offload (LSO) and Checksum Offload are typically assumed to be subsets of TOE functionality.

The drawbacks of TOEs are that they require driver support in the operating system, as well as additional kernel/driver interfaces for TCP-relevant operations (as opposed to the frame-based operations of traditional adapters). Also, when the operating system implements improvements to TCP over time, those normally have to be implemented on the TOE as well. And additional instrumentation such as the Web100 kernel instrumentation set would also need to be implemented separately.

For these and other reasons, TOEs (which are a relatively old idea) have never become a "mainstream" technology. In contrast, some more generic performance enhancements such as LSO, interrupt coalescence, or checksum offload, are now part of many "commodity" network adapter chip-sets, and enjoy increasing support in operating systems.

Mogul (2003) nicely presents most of the issues with TOEs, and argues that they aren't overly useful for general TCP use. However TOE - or, more generally, transport offload - may find good use as part of network adapters that provide Remote DMA (Remote Direct Memory Access) functionality for use in networked storage or clustering applications.

TOE on Windows: TCP Chimney

Microsoft Windows has its own architecture for TCP offload-capable network adapters called TCP Chimney. The name "chimney" stands for the channel that is established between the operating system and the adapter for each TCP connection that is offloaded. TCP Chimney was first introduced (in 2006?) as part of an addition to Windows Server 2003 called the Scalable Networking pack. In March 2008, it found its way into the first Service Pack for the Vista operating system (Vista SP1).
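
On Windows versions that include TCP Chimney, offloading can reportedly be toggled globally with netsh; a hedged example:

netsh int tcp set global chimney=enabled
netsh int tcp set global chimney=disabled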

References

-- SimonLeinen - 04 Jul 2005 - 26 Mar 2008

Performance Hints for Application Developers

Caveat

Premature optimization is the root of all evil in programming

E. Dijkstra... or was it C.A.R. Hoare... or was it D. Knuth?

You should always get your program correct first, and think about optimizations once it works right. That said, it is always good to keep performance in mind while programming. The most important decisions are related to the choice of suitable algorithms, of course.

Regarding networked applications in particular, here are a few performance-related things to think about:

-- SimonLeinen - 05 Jul 2005

"Chatty" Protocols

A common problem with naively designed application protocols is that they are too "chatty", i.e. they imply too many "round-trip" cycles where one party has to wait for a response from the other. It is an easy mistake to make, because when testing such a protocol locally, these round-trips usually don't have much of an impact on overall performance. But when used over network paths with large RTTs, chattiness can dramatically impact perceived performance.

Example: SMTP (Simple Mail Transfer Protocol)

The Simple Mail Transfer Protocol (SMTP) is used to transport most e-mail messages over the Internet. In its original design (RFC 821, now superseded by RFC 2821), the protocol consisted of a strict sequence of request/response transactions, some of them very small. Taking an example from RFC 2920, a typical SMTP conversation between a client "C" that wants to send a message, and a server "S" that receives it, would look like this:

   S: <wait for open connection>
   C: <open connection to server>
   S: 220 Innosoft.com SMTP service ready
   C: HELO dbc.mtview.ca.us
   S: 250 Innosoft.com
   C: MAIL FROM:<mrose@dbc.mtview.ca.us>
   S: 250 sender <mrose@dbc.mtview.ca.us> OK
   C: RCPT TO:<ned@innosoft.com>
   S: 250 recipient <ned@innosoft.com> OK
   C: RCPT TO:<dan@innosoft.com>
   S: 250 recipient <dan@innosoft.com> OK
   C: RCPT TO:<kvc@innosoft.com>
   S: 250 recipient <kvc@innosoft.com> OK
   C: DATA
   S: 354 enter mail, end with line containing only "."
    ...
   C: .
   S: 250 message sent
   C: QUIT
   S: 221 goodbye

This simple conversation contains nine places where the client waits for a response from the server.

In order to improve this, the PIPELINING extension (RFC 2920) was later defined. When the server supports it - as signaled through the ESMTP extension mechanism in the response to an EHLO request - the client is allowed to send multiple requests in a row, and collect the responses later. The previous conversation becomes the following one with PIPELINING:

   S: <wait for open connection>
   C: <open connection to server>
   S: 220 innosoft.com SMTP service ready
   C: EHLO dbc.mtview.ca.us
   S: 250-innosoft.com
   S: 250 PIPELINING
   C: MAIL FROM:<mrose@dbc.mtview.ca.us>
   C: RCPT TO:<ned@innosoft.com>
   C: RCPT TO:<dan@innosoft.com>
   C: RCPT TO:<kvc@innosoft.com>
   C: DATA
   S: 250 sender <mrose@dbc.mtview.ca.us> OK
   S: 250 recipient <ned@innosoft.com> OK
   S: 250 recipient <dan@innosoft.com> OK
   S: 250 recipient <kvc@innosoft.com> OK
   S: 354 enter mail, end with line containing only "."
    ...
   C: .
   C: QUIT
   S: 250 message sent
   S: 221 goodbye

There are still a couple of places where the client has to wait for responses, notably during initial negotiation; but the number of these situations has been reduced to those where the response has an impact on further processing. The PIPELINING extension reduces the number of "turn-arounds" from nine to four. This speeds up the overall mail submission process when the RTT is high, reduces the number of packets that have to be sent (because several requests, or several responses, can be sent as a single TCP segment), and significantly decreases the risk of timeouts (and consequent loss of connection) when the connectivity between client and server is really bad.

The X Window System protocol (X11) is an example of a protocol that has been designed from the start to reduce turn-arounds.

References

-- SimonLeinen - 05 Jul 2005

Performance-friendly I/O interfaces

read()/write() vs. mmap()/write() vs. sendfile()

For applications with high input/output performance requirements (including network I/O), it is worthwhile to look at operating system support for efficient I/O routines.

As an example, here is simple pseudo-code that reads the contents of an open file in and writes them to an open socket out - this code could be part of a file server. A straightforward way of coding this uses the read()/write() system calls to copy the bytes through a memory buffer:

#include <unistd.h>

#define BUFSIZE 4096

long send_file (int in, int out) {
  unsigned char buffer[BUFSIZE];
  int result; long written = 0;
  /* note the parentheses around the assignment: without them,
     "result" would receive the comparison result, not the byte count */
  while ((result = read (in, buffer, BUFSIZE)) > 0) {
    if (write (out, buffer, result) != result)
      return -1;
    written += result;
  }
  return (result == 0 ? written : result);
}

Unfortunately, this common programming paradigm results in high memory traffic and inefficient use of a system's caches. Also, if a small buffer is used, the number of system operations and, in particular, of user/kernel context switches will be quite high.

On systems that support memory mapping of files using mmap(), the following is more efficient if the source is an actual file:

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

long send_file (int in, int out) {
  unsigned char *b;
  struct stat st;
  long result;
  if (fstat (in, &st) == -1) return -1;
  /* map the whole file read-only into our address space */
  b = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
  if (b == MAP_FAILED) return -1;
  madvise (b, st.st_size, MADV_SEQUENTIAL);
  result = write (out, b, st.st_size);
  munmap (b, st.st_size);
  return result;
}

An even more efficient - and also more concise - variant is the sendfile() call, which directly copies the bits from the file to the network.

#include <sys/sendfile.h>
#include <sys/stat.h>

long send_file (int in, int out) {
  struct stat st;
  off_t offset = 0;
  if (fstat (in, &st) == -1) return -1;
  /* let the kernel copy from the file to the socket directly */
  return sendfile (out, in, &offset, st.st_size);
}

Note that an operating system could optimize this internally up to the point where data blocks are copied directly from the disk controller to the network controller without any involvement of the CPU.

For more complex situations, the sendfilev() interface can be used to send data from multiple files and memory buffers to construct complex protocol units with a single call.
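
As a hedged sketch of that interface (following the Solaris sendfilev() conventions; the function name and its arguments are invented for this example), an in-memory header and a file body can be emitted with one call:

#include <sys/sendfile.h>

/* Hedged sketch (Solaris): send an in-memory header followed by the
   contents of the open file `in' with a single sendfilev() call. */
long send_with_header (int out, int in,
                       const char *hdr, size_t hdrlen, size_t filelen)
{
  struct sendfilevec vec[2];
  size_t xferred;

  vec[0].sfv_fd   = SFV_FD_SELF;     /* data comes from user memory... */
  vec[0].sfv_flag = 0;
  vec[0].sfv_off  = (off_t) hdr;     /* ...namely from this pointer */
  vec[0].sfv_len  = hdrlen;

  vec[1].sfv_fd   = in;              /* data comes from the file... */
  vec[1].sfv_flag = 0;
  vec[1].sfv_off  = 0;               /* ...starting at its beginning */
  vec[1].sfv_len  = filelen;

  return sendfilev (out, vec, 2, &xferred);
}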

-- SimonLeinen - 26 Jun 2005

One thing to note: the above usage of write() and sendfile() is simplified. These system calls can stop in the middle and return the number of bytes written so far, so for real-world usage you should put a loop around them to continue sending the rest of the file and to handle signal interruptions.

The first loop should be written as:

#include <errno.h>
#include <unistd.h>

#define BUFSIZE 4096

long send_file (int in, int out) {
  unsigned char buffer[BUFSIZE];
  int result; long written = 0;
  while ((result = read (in, buffer, BUFSIZE)) > 0) {
    ssize_t tosend = result;
    ssize_t offset = 0;
    while (tosend > 0) {
      result = write (out, buffer + offset, tosend);
      if (result == -1) {
        if (errno == EINTR || errno == EAGAIN)
          continue;               /* interrupted or would block - retry */
        else
          return -1;
      }
      written += result;
      offset += result;
      tosend -= result;
    }
  }
  return (result == 0 ? written : result);
}

References

  • Project Volo Design Document, OpenSolaris, 2008. This document contains, in section 4.5 ("zero-copy interface") a description of how sendfilev is implemented in current Solaris, as well as suggestions on generalizing the internal kernel interfaces that support it.

-- BaruchEven - 05 Jan 2006
-- SimonLeinen - 12 Sep 2008

Network Tuning

This section describes a few ways that can be used to improve performance of a network.

-- SimonLeinen - 01 Nov 2004 - 30 Mar 2006
-- PekkaSavola - 13 Jun 2008 (addition of WAN accelerators)
-- Alessandra Scicchitano - 11 Jun 2012 (Buffer Bloat)

Router Architectures

Basic Functions

Basic functions of an IP router can be divided into two main groups:

  • The forwarding plane is responsible for packet forwarding (or packet switching), which is the act of receiving packets on the router's interfaces and sending them out on (usually) other interfaces.
  • The control plane gathers and maintains network topology information, and passes it to the forwarding plane so that it knows where to forward the received packets.

Architectures

CPU-based vs. hardware forwarding support

Small routers usually have architectures similar to general-purpose computers: they have a CPU, operational memory, some persistent storage device (to store configuration settings and the operating system software), and network interfaces. Both forwarding and control functions are carried out by the CPU; the only separation between the forwarding and the control plane is that they run as different processes. Network interfaces in these routers are treated just like NICs in general-purpose computers, or like any other peripheral device.

High performance routers however, in order to achieve multi-Gbps throughput, use specialized hardware to forward packets. (The control plane is still very similar to a general-purpose computer architecture.) This way the forwarding and the control plane are far more separated. They do not contend for shared resources like processing power, because they both have their own.

Centralized vs. distributed forwarding

Another difference between low-end and high-end routers is that in low-end routers, packets from all interfaces are forwarded using a central forwarding engine, while some high-end routers decentralize forwarding across line cards, so that packets received on an interface are handled by a forwarding engine local to that line card. Distributed forwarding is especially attractive where one wants to scale routers to large numbers of line cards (and thus interfaces). Some routers allow line cards with and without forwarding engines to be mixed in the same chassis - packets arriving on engine-less line cards are simply handled by the central engine.

On distributed architectures, the line cards typically have their own copy of the forwarding table, and run very few "control-plane" functions - usually just what's necessary to communicate with the (central) control-plane engine. So even when forwarding is CPU-based, a router with distributed forwarding behaves much like a hardware-based router, in that forwarding performance is decoupled from control-plane performance.

There are examples of all combinations of CPU-/hardware-based and centralized/distributed forwarding, even within the products of a single vendor, for example Cisco:

              CPU-based       hardware forwarding
centralized   7200            7600 OSR
distributed   7500 w/VIP      7600 w/DFCs

Effects on performance analysis

Because of the separation of the forwarding and the control plane in high-performance routers, the performance characteristics of traffic passing through the router and of traffic destined to the router itself may be significantly different. The reason is that transit traffic may be handled completely by the separate forwarding plane, which is optimized for exactly this function, while traffic destined to the router is passed to the control plane and handled by the control-plane CPU. That CPU may have more important tasks at the moment (e.g. running routing protocols, calculating routes), and even when it is free, it cannot process traffic as efficiently as the forwarding plane.

This means that performance metrics of intermediate hops obtained from ping or traceroute-like tools may be misleading, as they may show significantly worse values than the analyzed "normal" transit traffic actually experiences. In other words, it can easily happen that a router is forwarding transit traffic correctly, fast, and without any packet loss, while a traceroute through the router, or pinging the router itself, shows packet loss, high round-trip times, high delay variation, or other apparent problems.

However, the control-plane CPU of a high-performance router is not always busy, and in quiet periods it can deal well with the probe traffic directed to it. Measurement results for intermediate hops should therefore usually be interpreted by discarding the occasional bad values, on the assumption that the intermediate router's control-plane CPU had more important things to do than deal with the probe traffic.

Beware the slow path

Routers with hardware support for forwarding often restrict this support to a common subset of traffic and configurations. "Unusual" packets may be handled in a "slow path" using the general-purpose CPU which is otherwise mostly dedicated to control-plane tasks. This often includes packets with IP options.

Some routers also revert from hardware forwarding to CPU-based forwarding when certain complex functions are configured on the device - for example when the router has to do NAT or other payload-inspecting features. Another trigger is the exhaustion of some resource in the forwarding hardware, such as the forwarding table (the "hardware" equivalent of the routing table) or access control tables.

To effectively use routers with hardware-based forwarding, it is therefore essential to know the restrictions of the hardware, and to ensure that the majority of traffic can indeed be hardware-switched. Where this is supported, it may be a good idea to limit the amount of "slow-path" traffic using rate-limits such as Cisco's "Control Plane Policing" (CoPP).
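
For illustration, a minimal CoPP policy on a Cisco IOS router might look like the following sketch - the class and policy names, the access list, and the policing rate are all invented for this example:

! Sketch only - names, ACL and rate are illustrative
access-list 101 permit icmp any any
class-map match-all CM-SLOWPATH
 match access-group 101
policy-map PM-COPP
 class CM-SLOWPATH
  police 64000 conform-action transmit exceed-action drop
control-plane
 service-policy input PM-COPP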

-- AndrasJako - 06 Mar 2006

Active Queue Management (AQM)

Packet-switching nodes such as routers usually need to accommodate queues to buffer packets when incoming traffic exceeds the available outbound capacity. Traditionally, these buffers have been organised as tail-drop queues: packets are queued until the buffer is full, and while it is full, newly arriving packets are dropped until the queue empties out. With bursty traffic (as is typical with TCP/IP), this can lead to entire bursts of arriving packets being dropped because of a full queue. The effects of this are synchronisation of flows and a decrease in aggregate throughput. Another effect of the tail-drop queueing strategy is that, when congestion is long-lived, the queue will grow to fill the buffer, and will remain large until congestion eventually subsides. With large router buffers, this leads to increased one-way delay and round-trip times, which impacts network performance in various ways - see BufferBloat.

This has led to the idea of "active queue management", where network nodes send congestion signals as soon as they sense the onset of congestion, in order to avoid buffers filling up completely.

Active Queue Management is a precondition for Explicit Congestion Notification (ECN), which helps performance by reducing or eliminating packet loss during times of (light) congestion.

The earliest and best-known form of AQM on the Internet is Random Early Detection (RED). This is now supported by various routers and switches, although it is not typically activated by default. One possible reason is that RED, as originally specified, must be "tuned" depending on the given traffic mix (and optimisation goals) to be maximally effective. Various alternative methods have been proposed as improvements to RED, but none of them have enjoyed widespread use.

CoDel has recently (May 2012) been proposed as a promising practical alternative to RED. PIE was subsequently proposed as an alternative to CoDel, claiming to be easier to implement efficiently, in particular in hardware.

AQM in the IETF

RFC 2309, Recommendations on Queue Management and Congestion Avoidance in the Internet (1998), recommended "testing, standardization, and widespread deployment" of AQM, and specifically RED, on the Internet. The testing part was certainly followed, in the sense that a huge number of academic papers were published about RED, its perceived shortcomings, proposed alternative AQMs, and so on. But there was no standardization, and very little actual deployment. While RED is implemented in most routers today, it is generally not enabled by default, and very few operators explicitly enable it. There are many reasons for this, but an important one is that optimal configuration parameters for RED depend on traffic load and on tradeoffs between various optimization goals (e.g. throughput and delay). RFC 3819, Advice for Internet Subnetwork Designers, also discusses questions of AQM, in particular RED and its configuration parameters.

In March 2013, a new AQM mailing list (archive) was announced to discuss a possible replacement for RFC 2309. Fred Baker issued draft-baker-aqm-recommendation as a starting point.

References

  • RFC 2309, Recommendations on Queue Management and Congestion Avoidance in the Internet, B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski, L. Zhang. April 1998
  • RFC 3819, Advice for Internet Subnetwork Designers, P. Karn, Ed., C. Bormann, G. Fairhurst, D. Grossman, R. Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood. July 2004
  • Advice on network buffering, G. Fairhurst, B. Briscoe, slides presented to ICCRG at IETF-86, March 2013
  • draft-baker-aqm-recommendation, IETF Recommendations Regarding Active Queue Management, Fred Baker, March 2013

-- SimonLeinen - 2005-01-05 - 2013-04-03

Random Early Detection (RED)

RED is the first and most widely used Active Queue Management (AQM) mechanism. The basic idea is this: a queue controlled by RED samples its occupancy (queue size) over time, and computes a characteristic value from it (often a decaying average such as an Exponentially Weighted Moving Average (EWMA), although several researchers propose that using the instantaneous queue length works better). This value is then used to compute the drop probability for an incoming packet, according to a pre-defined profile. The decision whether to drop an incoming packet is made using a random number and the drop probability. In this way, a point of (light) congestion in the network can run with a short queue, which is beneficial for applications because it keeps one-way delay and round-trip time low. When the congested link is shared by only a few TCP streams, RED also prevents the synchronisation effects that cause degradation of throughput.
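
As an illustration, the following C sketch shows such a drop decision with the classic two-threshold linear profile from Floyd and Jacobson's original design; the parameter values are invented, and the "count since last drop" refinement of the full algorithm is omitted:

/* Minimal sketch of a RED drop decision (illustrative parameters). */
#include <stdlib.h>

#define W_Q    0.002    /* EWMA weight for the average queue size */
#define MIN_TH 5.0      /* below this average queue size, never drop */
#define MAX_TH 15.0     /* above this average queue size, always drop */
#define MAX_P  0.1      /* drop probability as the average reaches MAX_TH */

static double avg_q;    /* decaying average of the queue length */

/* Returns nonzero if the arriving packet should be dropped. */
int red_should_drop(int instantaneous_queue_len)
{
    double p;

    avg_q = (1.0 - W_Q) * avg_q + W_Q * instantaneous_queue_len;

    if (avg_q < MIN_TH)
        return 0;                         /* no congestion: enqueue */
    if (avg_q >= MAX_TH)
        return 1;                         /* persistent congestion: drop */

    /* Drop probability ramps linearly between the two thresholds. */
    p = MAX_P * (avg_q - MIN_TH) / (MAX_TH - MIN_TH);
    return ((double)rand() / RAND_MAX) < p;
}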

Although RED is widely implemented in networking equipment, it is usually "off by default", and not often activated by operators. One reason is that it has various parameters (queue thresholds, EWMA parameters etc.) that need to be configured, and finding good values for these parameters is considered as something of a black art. Also, a huge number of research papers have been published that point out shortcomings of RED in particular scenarios, which discourages network operators from using it. Unfortunately, none of the many alternatives to RED that have been proposed in these publications, such as EXPRED, REM, or BLUE, has gained any more traction.

The BufferBloat activity has rekindled interest in queue-management strategies, and Kathleen Nichols and Van Jacobson, who have a long history of research on practical AQM algorithms, published CoDel in 2012; it claims to improve upon RED in several key points using a fresh approach.

References

-- SimonLeinen - 2005-01-07 - 2012-07-12

Differentiated Services (DiffServ)

From the IETF's Differentiated Services (DiffServ) Work Group's site:

The differentiated services approach to providing quality of service in networks employs a small, well-defined set of building blocks from which a variety of aggregate behaviors may be built. A small bit-pattern in each packet, in the IPv4 TOS octet or the IPv6 Traffic Class octet, is used to mark a packet to receive a particular forwarding treatment, or per-hop behavior, at each network node.

Examples of DiffServ-based services include the Premium IP and "Less-than Best Effort" (LBE) services available on GEANT/GN2 and some NRENs.

References

-- TobyRodwell - 28 Feb 2005 -- SimonLeinen - 15 Jul 2005

Premium IP

Premium IP is a new service available on GEANT/GN2 and some of the NRENs connected to it. It uses DiffServ, and in particular the EF (Expedited Forwarding) PHB (per-hop behaviour) to protect a given aggregate of IP traffic against disturbances by other traffic, so that OneWayDelay, DelayVariation, and PacketLoss are assured to be low. The aggregate is specified by source and destination IP address ranges, or by ingress/egress AS (Autonomous System) numbers when in the core domain. In addition, a Premium IP aggregate must conform to strict rate limits. Unlike earlier proposals of similar "Premium" services, GEANT/GN2's Premium IP service downgrades excess traffic (over the contractual rate) to "best effort" (DSCP=0), rather than dropping it.

References

-- SimonLeinen - 21 Apr 2005

LBE (Less than Best Effort) Service

Less Than Best Effort (LBE) is a new service available on GEANT/GN2 and some of the NRENs connected to it. It uses DiffServ with the Class Selector 1 (CS=1) DSCP to mark traffic that can get by with less than the default "best effort" from the network. It can be used by high-volume, low-priority applications in order to limit their impact on other traffic.
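
As an illustration, an application on a Unix-like system could mark its own traffic with CS1 roughly as follows (a sketch; the helper function is hypothetical). The DSCP occupies the upper six bits of the former IPv4 TOS octet, so CS1 (DSCP 8) becomes the TOS value 0x20:

/* Sketch: mark a socket's traffic with the CS1 DSCP for LBE use. */
#include <netinet/in.h>
#include <netinet/ip.h>
#include <sys/socket.h>

int mark_lbe(int sock)
{
    int tos = 0x20;  /* CS1: DSCP 8, shifted left by two bits */
    return setsockopt(sock, IPPROTO_IP, IP_TOS, &tos, sizeof tos);
}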

References

-- SimonLeinen - 06 May 2005

Integrated Services (IntServ)

IntServ was an attempt by the IETF to add Quality of Service differentiation to the Internet architecture. Components of the Integrated Services architecture include

  • A set of predefined service classes with different parameters, in particular Controlled Load and Guaranteed Service
  • A ReSerVation Protocol (RSVP) for setting up specific service parameters for a flow
  • Mappings to different lower layers such as ATM, Ethernet, or low-speed links.

Concerns with IntServ include its scaling properties when many flow reservations are active in the "core" parts of the network, the difficulties of implementing the necessary signaling and packet treatment functions in high-speed routers, and the lack of policy control and accounting/billing infrastructure to make this worthwhile for operators. While IntServ never became widely implemented beyond intra-enterprise environments, RSVP has found new uses as a signaling protocol for Multi-Protocol Label Switching (MPLS).

As an alternative to IntServ, the IETF later developed the Differentiated Services architecture, which provides simple building blocks that can be composed into similar services.

References

Integrated Services Architecture

  • RFC 1633, Integrated Services in the Internet Architecture: an Overview, R. Braden, D. Clark, S. Shenker, 1994

RSVP

  • RFC 2205, Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification, R. Braden, Ed., L. Zhang, S. Berson, S. Herzog, S. Jamin, September 1997
  • RFC 2208, Resource ReSerVation Protocol (RSVP) -- Version 1 Applicability Statement Some Guidelines on Deployment, A. Mankin, Ed., F. Baker, B. Braden, S. Bradner, M. O'Dell, A. Romanow, A. Weinrib, L. Zhang, September 1997
  • RFC 2209, Resource ReSerVation Protocol (RSVP) -- Version 1 Message Processing Rules, R. Braden, L. Zhang, September 1997
  • RFC 2210, The Use of RSVP with IETF Integrated Services, J. Wroclawski, September 1997
  • RFC 2211, Specification of the Controlled-Load Network Element Service, J. Wroclawski, September 1997
  • RFC 2212, Specification of Guaranteed Quality of Service, S. Shenker, C. Partridge, R. Guerin, September 1997

Lower-Layer Mappings

  • RFC 2382, A Framework for Integrated Services and RSVP over ATM, E. Crawley, Ed., L. Berger, S. Berson, F. Baker, M. Borden, J. Krawczyk, August 1998
  • RFC 2689, Providing Integrated Services over Low-bitrate Links, C. Bormann, September 1999
  • RFC 2816, A Framework for Integrated Services Over Shared and Switched IEEE 802 LAN Technologies, A. Ghanwani, J. Pace, V. Srinivasan, A. Smith, M. Seaman, May 2000

-- SimonLeinen - 28 Feb 2006

Sizing of Network Buffers

Where temporary congestion cannot be avoided, some buffering in network nodes is required (in routers and other packet-forwarding devices such as Ethernet or MPLS switches) to queue incoming packets until they can be transmitted. The appropriate sizing of these buffers has been a subject of discussion for a long time.

Traditional wisdom recommends that a network node should be able to buffer an end-to-end round-trip time's worth of line-rate traffic, in order to be able to accommodate bursts of TCP traffic. This recommendation is often followed in "core" IP networks. For example, FPCs (Flexible PIC Concentrators) on Juniper's M- and T-Series routers contain buffer memory for 200ms (M-series) or 100ms (T-series) at the supported interface bandwidth (cf. the Juniper M-Series Datasheet and a posting from 8 May 2005 by Hannes Gredler to the juniper-nsp mailing list). These ideas also influenced RFC 3819, Advice for Internet Subnetwork Designers.

Recent research results suggest that much smaller buffers are sufficient when there is a high degree of multiplexing of TCP streams.
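
As a back-of-the-envelope illustration of the difference, the following C sketch contrasts the classic bandwidth-delay rule (B = C x RTT) with the small-buffer model B = C x RTT / sqrt(N) of the Appenzeller et al. paper referenced below; all numbers are invented:

/* Back-of-the-envelope buffer sizing; illustrative numbers only. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double capacity = 10e9;   /* link capacity C: 10 Gb/s */
    double rtt      = 0.2;    /* round-trip time: 200 ms */
    double flows    = 10000;  /* degree of multiplexing N */

    double classic = capacity * rtt / 8;   /* in bytes */
    double small   = classic / sqrt(flows);

    printf("classic rule:       %.0f MB\n", classic / 1e6);  /* 250 MB */
    printf("small-buffer model: %.1f MB\n", small / 1e6);    /* 2.5 MB */
    return 0;
}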

This work is highly relevant, because overly large buffers not only require more (expensive high-speed) memory, but bring about a risk of high delays that affect perceived quality of service; see BufferBloat.

References

ACM SIGCOMM Computer Communications Review

The October 2006 edition has a short summary paper on router buffer sizing. If you read one article, read this!

The July 2005 edition (Volume 35, Issue 3) has a special feature about sizing router buffers, consisting of these articles:

  • Making router buffers much smaller - Nick McKeown, Damon Wischik
  • Part I: buffer sizes for core routers - Nick McKeown, Damon Wischik
  • Part II: control theory for buffer sizing - Gaurav Raina, Don Towsley, Damon Wischik
  • Part III: routers with very small buffers - Mihaela Enachescu, Yashar Ganjali, Ashish Goel, Nick McKeown, Tim Roughgarden

Sizing Router Buffers (copy)
Guido Appenzeller, Isaac Keslassy, Nick McKeown, SIGCOMM'04, in: ACM Computer Communications Review 34(4), pp. 281--292

The Effect of Router Buffer Size on HighSpeed TCP Performance
Dhiman Barman, Georgios Smaragdakis and Ibrahim Matta. In Proceedings of IEEE Globecom 2004. (PowerPoint presentation)

Link Buffer Sizing: a New Look at the Old Problem
Sergey Gorinsky, A. Kantawala, and J. Turner, ISCC-05, June 2005
Another version was published as Technical Report WUCSE-2004-82, Department of Computer Science and Engineering, Washington University in St. Louis, December 2004.

Effect of Large Buffers on TCP Queueing Behavior
Jinsheng Sun, Moshe Zukerman, King-Tim Ko, Guanrong Chen and Sammy Chan, IEEE INFOCOM 2004

High Performance TCP in ANSNET
C. Villamizar and C. Song., in: ACM Computer Communications Review, 24(5), pp.45--60, 1994

RFC 3819, Advice for Internet Subnetwork Designers
P. Karn, Ed., C. Bormann, G. Fairhurst, D. Grossman, R. Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood. July 2004

-- SimonLeinen - 2005-01-07 - 2013-04-03

OS-specific Tuning: Cisco IOS

  • Cisco IOS specific Tuning
  • TCP MSS Adjustment. The TCP MSS Adjustment feature enables the configuration of the maximum segment size (MSS) for transient packets that traverse a router, specifically TCP segments with the SYN bit set, when Point-to-Point Protocol over Ethernet (PPPoE) is being used in the network. PPPoE reduces the effective Ethernet maximum transmission unit (MTU) to 1492 bytes, and if the effective MTU on the hosts (PCs) is not changed accordingly, TCP sessions through the router between host and server can fail. The ip tcp adjust-mss command rewrites the MSS value of SYN packets passing through the intermediate router to avoid this problem; see the sketch below.
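
For illustration, the feature is enabled per interface; in this sketch the interface name is arbitrary, and 1452 corresponds to the 1492-byte PPPoE MTU minus 40 bytes of IP and TCP headers:

! Sketch - interface name and MSS value are illustrative
interface FastEthernet0/0
 ip tcp adjust-mss 1452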

-- HankNussbacher - 06 Oct 2005

Support for Large Frames/Packets ("Jumbo MTU")

On current research networks (and most other parts of the Internet), end nodes (hosts) are restricted to a 1500-byte IP Maximum Transmission Unit (MTU). Because a larger MTU would save effort on the part of both hosts and network elements - fewer packets per volume of data means less work - many people have been advocating efforts to increase this limit, in particular on research network initiatives such as Internet2 and GEANT.

Impact of Large Packets

Improved Host Performance for Bulk Transfers

It has been argued that TCP is most efficient when the payload portions of TCP segments match integer multiples of the underlying virtual memory system's page size - this permits "page flipping" techniques in zero-copy TCP implementations. A 9000-byte MTU would "naturally" lead to 8960-byte segments, which doesn't correspond to any common page size. However, a TCP implementation should be smart enough to adapt segment size to page sizes where this actually matters, for example by using 8192-byte segments even though 8960-byte segments are permitted. In addition, Large Send Offload (LSO), which is becoming increasingly common with high-speed network adapters, removes the direct correspondence between TCP segments and driver transfer units, making this issue moot.

On the other hand, Large Send Offload (LSO) and Interrupt Coalescence remove most of the host-performance motivations for large MTUs: the segmentation and reassembly function between (large) application data units and (smaller) network packets is mostly moved to the network interface controller.

Lower Packet Rate in the Backbone

Another benefit of large frames is that their use reduces the number of packets that have to be processed by routers, switches, and other devices in the network. However, most high-speed networks are not limited by per-packet processing costs, and packet processing capability is often dimensioned for the worst case, i.e. the network should continue to work even when confronted with a flood of small packets. On the other hand, per-packet processing overhead may be an issue for devices such as firewalls, which have to do significant processing of the headers (but possibly not the contents) of each packet.

Reduced Framing Overhead

Another advantage of large packets is that they reduce the overhead for framing (headers) in relation to payload capacity. A typical TCP segment over IPv4 carries 40 bytes of IP and TCP header, plus link-dependent framing (e.g. 14 bytes over Ethernet). This represents about 3% of overhead with the customary 1500-byte MTU, whereas an MTU of 9000 bytes reduces this overhead to 0.5%.
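
The arithmetic can be checked with a few lines of C; this particular accounting counts the 14 bytes of Ethernet framing in both numerator and denominator, and ignores header options:

/* Framing overhead as a function of the MTU (40 bytes IPv4+TCP
   headers plus 14 bytes Ethernet framing per packet). */
#include <stdio.h>

static double overhead(double mtu)
{
    return (40.0 + 14.0) / (mtu + 14.0);
}

int main(void)
{
    printf("1500-byte MTU: %.1f%% overhead\n", 100 * overhead(1500));  /* ~3.5 */
    printf("9000-byte MTU: %.1f%% overhead\n", 100 * overhead(9000));  /* ~0.6 */
    return 0;
}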

Network Support for Large MTUs

Research Networks

The GEANT/GEANT2 and Abilene backbones now both support a 9000-byte IP MTU. The current access MTUs of Abilene connectors can be found on the Abilene Connector Technology Support page. The access MTUs of the NRENs to GEANT are listed in the GEANT Monthly Report (not publicly available).

Commercial Internet Service Providers

Many commercial ISPs support larger-than-1500-byte MTUs in their backbones (4470 bytes is a typical value) and on certain types of access interfaces, e.g. Packet-over-SONET (POS). But when Ethernet is used as an access interface, as is more and more frequently the case, the access MTU is usually set to the 1500 bytes corresponding to the Ethernet standard frame size limit. In addition, many inter-provider connections are over shared Ethernet networks at public exchange points, which also imply a 1500-byte limit, with very few exceptions - MAN LAN is a rare example of an Ethernet-based access point that explicitly supports the use of larger frames.

Possible Issues

Path MTU Discovery issues

Moving to larger MTUs may exhibit problems with the traditional Path MTU Discovery mechanism - see that topic for more information about this.

Inconsistent MTUs within a subnet

There are other deployment considerations that make the introduction of large MTUs tricky, in particular the requirement that all hosts on a logical IP subnet must use identical MTUs; this makes gradual introduction hard for large bridged (switched) campus or data center networks, and also for most Internet Exchange Points.

When different MTUs are used on a link (logical subnet), this can often go unnoticed for a long time. Packets smaller than the minimum MTU will always pass through the link, and the end with the smaller MTU configured will always fragment larger packets towards the other side. The only packets that will be affected are packets from the larger-MTU side to the smaller-MTU side that are larger than the smaller MTU. Those will typically be sent unfragmented, and dropped at the receiving end (the one with the smaller MTU). However, it may happen that such packets rarely occur in normal situations, and the misconfiguration isn't detected immediately.
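
A quick way to check a link for such a mismatch is to send packets of various sizes with fragmentation prohibited. A sketch using the Linux (iputils) ping, with an illustrative peer name:

ping -M do -s 8972 peer.example.net   # 8972 + 28 bytes of headers = 9000-byte IP packet
ping -M do -s 1472 peer.example.net   # 1472 + 28 = 1500 bytes; should always pass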

Some routing protocols such as OSPF and IS-IS detect MTU mismatches and will refuse to build adjacencies in this case. This helps diagnose such misconfigurations. Other protocols, in particular BGP, will appear to work as long as only small amounts of data are sent (e.g. during the initial handshake and option negotiation), but get stuck when larger amounts of data (i.e. the initial route advertisements) must be sent in the bigger-to-smaller-MTU direction.

Problems with large MTUs on end systems' network interface cards (NICs)

On some NICs it is possible to configure jumbo frames (for example MTU=9000), and the NIC appears to work fine when its functionality is checked with pings (jumbo-sized ICMP packets), although the NIC vendor states that there is no jumbo frame support. In such cases, the NIC may drop packets when it receives jumbo frames at typical production data rates.

Therefore, the NIC vendor's documentation should be consulted before activating large MTUs on a host interface. High-data-rate tests should then be performed with large MTUs, and the host's interface statistics should be checked for input packet drops. Typical commands on Unix systems: 'ifconfig <ifname>' or 'ethtool -S <ifname>'.


References

Vendor-specific

-- SimonLeinen - 16 Mar 2005 - 02 Sep 2009
-- HankNussbacher - 18 Jul 2005 (added Phil Dykstra paper)

Evil Middleboxes

An evil middlebox is a transparent device that sits in the middle of an end-to-end connection and disturbs the normal end-to-end traffic in some way. Because these devices, which usually work on layer 2, cannot be seen, it is difficult to debug issues that involve them. Examples are HTTP proxies and gateway proxies (all protocols). Normally, these devices are installed for security reasons, to filter out "bad" traffic. Bad traffic may be viruses, trojans, evil JavaScript, or anything that is not known to the device. Sometimes so-called rate shapers are also installed as middleboxes; while these do not change the contents of the traffic, they do drop packets according to rules known only to themselves. Bugs in such middleboxes can have fatal consequences for "legitimate" Internet traffic, which may lead to performance issues or, even worse, connection failures.

Middleboxes come in all shapes and flavors. The most popular are firewalls.

Examples of experienced performance issues

Two examples in the beginning of 2005 in SWITCH:

  • HttpProxy: very slow response from a webserver only for a specific circle of people

  • GatewayProxy: TCP transfers stall as soon as a packet is lost on the local segment between the middlebox and the end host.

A Cisco IOS Firewall in August 2006 in Funet:

  • WindowScalingProblems: when window scaling was enabled, TCP performance was bad (10-20 KBytes/sec). Some older versions of PIX could also be affected by window scaling issues.

DNS Based global load balancing problems

Juniper SRX3600 mistreats fragmented IPv6 packets

This firewall (up to at least version 11.4R3.7) performs fragment reassembly in order to apply certain checks to the entire datagram, for example in "DNS ALG" mode. It then tries to forward the reassembled packet instead of the original fragments, which triggers ICMP "packet too big" messages if the full datagram is larger than the MTU of the next link. This leads to a permanent failure on this path, because the (correct) fragmentation performed by the sender is undone by the erroneous reassembly at the firewall.

The same issue has also been found with some models of the Fortigate firewall.

-- ChrisWelti - 01 Mar 2005
-- PekkaSavola - 10 Oct 2006

-- PekkaSavola - 07 Nov 2006

-- AlexGall - 2012-10-31

Symptom: Accessing a specific web site that contains JavaScript is very slow (around 30 seconds for one page).

Analysis Summary: HTTP traffic is split between the webserver and a transparent HTTP proxy on the customer site, and between the HTTP proxy server and the end hosts. The transparent HTTP proxy fakes the end-points: to the HTTP web server it pretends to be the customer accessing it, and to the customer the HTTP proxy appears to be the web server (faked IP addresses). Accordingly, there are two TCP connections to be considered here. The proxy receives an HTTP request from the customer to the webserver. It then forwards this request to the webserver and WAITS until it has received the whole reply (this is essential, as it needs to analyze the whole reply to decide whether it is bad or not). If the content of that HTTP reply is dynamic, the length is not known. With HTTP/1.1, a TCP session is not built for every object but remains intact until a timeout has occurred. This means the proxy has to wait until the TCP session is torn down to be sure there is no more content coming. Only then does it forward the complete reply to the customer who asked for it. Of course the customer suffers a major delay.

-- ChrisWelti - 03 Mar 2005


Specific Network Technologies

This section of the knowledge base treats some (sub)network technologies with their specific performance implications.

-- SimonLeinen - 06 Dec 2005 -- BlazejPietrzak - 05 Sep 2007

Ethernet

Ethernet is now widely prevalent as a link-layer technology for local area/campus networks, and is making inroads into other market segments as well, for example as a framing technique for wide-area connections (replacing ATM or SDH/SONET in some applications), as a fabric for storage networks (replacing Fibre Channel etc.), or for clustered HPC systems (replacing special-purpose networks such as Myrinet etc.).

From the original media-access protocol based on CSMA/CD used on shared coaxial cabling at speeds of 10 Mb/s (originally 3 Mb/s), Ethernet has evolved to much higher speeds, from Fast Ethernet (100 Mb/s) through Gigabit Ethernet (1 Gb/s) to 10 Gb/s. Shared media access and bus topologies have been replaced by star-shaped topologies connected by switches. Additional extensions include speed, duplex-mode and cable-crossing auto-negotiation, virtual LANs (VLANs), flow control and other quality-of-service enhancements, port-based access control and many more.

The topics treated here are mostly relevant for the "traditional" use of Ethernet in local (campus) networks. Some of these topics are relevant for non-Ethernet networks, but are mentioned here nevertheless because Ethernet is so widely used that they have become associated with it.

-- SimonLeinen - 15 Dec 2005

Duplex Modes and Auto-Negotiation

A point-to-point Ethernet segment (typically between a switch and an end-node, or between two directly connected end-nodes) can operate in one of two duplex modes: half duplex means that only one station can send at a time, and full duplex means that both stations can send at the same time. Of course full-duplex mode is preferable for performance reasons if both stations support it.

Duplex Mismatch

Duplex mismatch describes the situation where one station on a point-to-point Ethernet link (typically between a switch and a host or router) uses full-duplex mode, and the other uses half-duplex mode. A link with duplex mismatch will seem to work fine as long as there is little traffic. But when there is traffic in both directions, it will experience packet loss and severely decreased performance. Note that the performance in the duplex mismatch case will be much worse than when both stations operate in half-duplex mode.

Work in the Internet2 "End-to-End Performance Initiative" suggests that duplex mismatch is one of the most common causes of bad bulk throughput. Rich Carlson's NDT (Network Diagnostic Tester) uses heuristics to try to determine whether the path to a remote host suffers from duplex mismatch.

Duplex Auto-Negotiation

In early versions of Ethernet, only half-duplex mode existed, mostly because point-to-point Ethernet segments weren't all that common - typically an Ethernet would be shared by many stations, with the CSMA/CD (Collision Sense Multiple Access/Collision Detection) protocol used to arbitrate the sending channel.

When "Fast Ethernet" (100 Mb/s Ethernet) over twisted pair cable (100BaseT) was introduced, an auto-negotiation procedure was added to allow two stations and the ends of an Ethernet cable to agree on the duplex mode (and also to detect whether the stations support 100 Mb/s at all - otherwise communication would fall back to traditional 10 Mb/s Ethernet). Gigabit Ethernet over twisted pair (1000BaseTX) had speed, duplex, and even "crossed-cable" (MDX) autonegotiation from the start.

Why people turn off auto-negotiation

Unfortunately, some early products supporting Fast Ethernet didn't include the auto-negotiation mechanism, and those that did sometimes failed to interoperate with each other. Many knowledgeable people therefore recommended avoiding duplex auto-negotiation, because it introduced more problems than it solved. The common recommendation was thus to manually configure the desired duplex mode - typically full duplex.

Problems with turning auto-negotiation off

There are two main problems with turning off auto-negotiation:

  1. You have to remember to configure both ends consistently. Even when the initial configuration is consistent on both ends, it often turns into an inconsistent one as devices and connections are moved around.
  2. Hardcoding one side to full duplex when the other does autoconfiguration causes duplex mismatch. In situations where one side must use auto-negotiation (maybe because it is a non-manageable switch), it is never right to manually configure full-duplex mode on the other. This is because the auto-negotiation mechanism requires that, when the other side doesn't perform auto-negotiation, the local side must set itself to half-duplex mode.

Both situations result in duplex mismatches, with the associated performance issues.

Recommendation: Leave auto-negotiation on

In the light of these problems with hard-coded duplex modes, it is generally preferable to rely on auto-negotiation of duplex mode. Recent equipment handles auto-negotiation in a reliable and interoperable way, with very few exceptions.

References

-- SimonLeinen - 12 Jun 2005 - 4 Sep 2006

LAN Collisions

In some legacy networks, workstations or other devices may still be connected into a LAN segment using hubs. All incoming and outgoing traffic is propagated throughout the entire segment, often resulting in a collision when two or more devices attempt to send data at the same time. After each collision, the original information has to be resent, reducing performance.

Operationally, this can lead to up to 100% of 5000-byte packets being lost when sending traffic off network and 31%-60% packet loss within a single subnet. It should be noted that common applications (e.g. email, FTP, WWW) on LANs produce packets close to 1500 bytes in size, and that packet loss rates >1% render applications such as video conferencing unusable.

To prevent collisions from traveling to every workstation in the entire network, bridges or switches should be installed. These devices will not forward collisions, but will permit broadcasts to all users and multicasts to specific groups to pass through.

When only a single system is connected to a single switch port, each collision domain is made up of only one system, and full-duplex communication also becomes possible.

-- AnnRHarding - 18 Jul 2005

LAN Broadcast Domains

While switches help network performance by reducing collision domains, they still permit broadcasts to all users, and multicasts to specific groups, to pass through. In a switched network with a lot of broadcast traffic, network congestion can occur despite high-speed backbones. As universities and colleges were often early adopters of Internet technologies, they may have large address allocations, perhaps even deployed as big flat networks which generate a lot of broadcast traffic. In some cases, these networks can be as large as a /16, and the potential to put up to sixty-five thousand hosts on a single network segment could be disastrous for performance.

The main purpose of subnetting is to help relieve network congestion caused by broadcast traffic. A successful subnetting plan is one where most of the network traffic will be isolated to the subnet in which it originated and broadcast domains are of a manageable size. This may be possible based on physical location, or it may be better to use VLANs. VLANs allow you to segment a LAN into different broadcast domains regardless of physical location. Users and devices on different floors or buildings have the ability to belong to the same LAN, since the segmentation is handled virtually and not via the physical layout.

References and Further Reading

RIPE Subnet Mask Information http://www.ripe.net/rs/ipv4/subnets.html

Priscilla Oppenheimer, Top-Down Network Design http://www.topdownbook.com/

-- AnnRHarding - 18 Jul 2005

Wireless LAN

Wireless Local Area Networks according to the IEEE 802.11 standards have become extremely widespread in recent years: in campus networks, for home networking, for convenient network access at conferences, and to a certain extent for commercial Internet access provision in hotels, public places, and even planes.

While wireless LANs are usually built more for convenience (or profit) than for performance, there are some interesting performance issues specific to WLANs. As an example, it is still a big challenge to build WLANs using multiple access points so that they can scale to large numbers of simultaneous users, e.g. for large events.

Common problems with 802.11 wireless LANs

Interference

In the 2.4 GHz band, the number of usable channels (frequencies) is low. Adjacent channels use overlapping frequencies, so there are typically only three truly non-overlapping channels in this band - channels 1, 6, and 11 are frequently used. In campus networks requiring many access points, care must be taken to avoid interference between same-channel access points. The problem is even more severe in areas where access points are deployed without coordination (such as in residential areas). Some modern access points can sense the radio environment during initialization, and try to use a channel that doesn't suffer from much interference. The 2.4 GHz band is also used by other technologies such as Bluetooth, and by microwave ovens.

Capacity loss due to backwards compatibility or slow stations

The radio link in 802.11 can work at many different data rates below the nominal rate. For example, the 802.11g (54 Mb/s) access point to which I am presently connected supports operation at 1, 2, 5.5, 6, 9, 11, 12, 18, 24, 36, 48, or 54 Mb/s. Using the lower speeds can be useful under adverse radio transmission conditions. In addition, this allows backwards compatibility - for example, 802.11g equipment interoperates with older 802.11b equipment, albeit at most at the lower 11 Mb/s supported by 802.11b.

When lower-rate and higher-rate stations coexist on the same access point, it should be noted that the lower-rate station will occupy disproportionately more of the medium's capacity, because of increased serialization times at lower rates. A single station operating at 1 Mb/s and transferring data at 500 kb/s will consume as much of the access point's capacity as 54 stations that each also transfer 500 kb/s, but at a 54 Mb/s wireless rate.
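
The arithmetic behind this example, as a tiny C sketch (a station's share of airtime is roughly its throughput divided by its link rate; numbers taken from the paragraph above):

/* Airtime shares for the slow-station example. */
#include <stdio.h>

int main(void)
{
    double slow_share = 0.5 / 1.0;   /* 500 kb/s sent at  1 Mb/s: 50% airtime */
    double fast_share = 0.5 / 54.0;  /* 500 kb/s sent at 54 Mb/s: ~0.9% each */

    printf("one 1 Mb/s station:      %4.1f%%\n", 100 * slow_share);
    printf("54 stations at 54 Mb/s:  %4.1f%%\n", 100 * 54 * fast_share);
    return 0;
}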

Multicast/broadcast

Wireless is a "natural" broadcast medium, so broadcast and multicast should be relatively efficient. But the access point normally sends multicast and broadcast frames at a low rate, to increase the probability that all stations can actually receive them. Thus, multicast traffic streams can quickly consume a large fraction of an access point's capacity, as per the considerations in the preceding section.

This is a reason why wireless networks often aren't multicast-enabled, even in environments that typically have multicast connectivity (such as campus networks). Note that broadcast and multicast cannot easily be disabled completely, because they are required for lower-layer protocols such as ARP (broadcast) or IPv6 Neighbor Discovery (multicast) to work.

802.11n Performance Features

IEEE 802.11n is a recent addition to the standards for wireless LANs, offering higher performance in terms of both capacity ("bandwidth") and reach. The standard supports both the 2.4 GHz and 5 GHz bands, although "consumer" 802.11n products often work with 2.4 GHz only unless marked "dual-band". Within each band, 802.11n equipment is normally backwards compatible with the respective prior standards, i.e. 802.11b/g for 2.4 GHz and 802.11a for 5 GHz. 802.11n achieves its performance increases by

  • physical diversity using multiple antennas in a "MIMO" multiple in/multiple out scheme
  • the option to use wider channels (spaced 40 MHz rather than 20 MHz)
  • frame aggregation options at the MAC level, allowing larger packets or the bundling of multiple frames into a single radio frame, in order to better amortize the (relatively high) link access overhead.

References

-- SimonLeinen - 15 Dec 2005 - 21 Oct 2008

Performance Case Studies

Detailed Case Studies

This section describes some case studies in which alterations to end systems, applications or network elements improved overall performance.

Short Case Studies

These are brief descriptions of problems seen, and the reasons behind the problems

-- AnnRHarding - 20 Jul 2005
-- SimonLeinen - 21 Jul 2006

Scaling Apache 2.x beyond 20,000 concurrent downloads

ftp.heanet.ie is HEAnet's National Mirror Server for Ireland. Currently mirroring over 50,000 projects, it is a popular source of content on the Internet. It serves mostly static content via HTTP, FTP and RSYNC, all available via IPv4 and IPv6. It regularly sustains over 20,000 concurrent connections on a single Apache instance and has served as many as 27,000, with about 3.5 Terabytes of content per day. The front-end system is a Dell 2650 with two 2.4 GHz Xeon processors, 12 GB of memory and the usual 2 system disks plus 15k RPM SCSI disks, running Debian GNU/Linux and Apache 2.x.

Considerable system and application tuning enabled this system to achieve these performance rates. Apachebench was used for web-server benchmarking, bonnie++ and iozone for file-system benchmarking, and an in-house script to measure buffering, virtual memory management and scheduling. The steps taken are highlighted below:

Apache

MPM Tuning

Apache 2.x has a choice of multi-processing modules (MPMs). For this system, the prefork MPM was chosen, tuned to keep 10 spare servers - enough child processes to handle new requests when the rate of new connections exceeds the rate at which Apache can create new processes.
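
A corresponding prefork configuration might look like the following sketch; apart from the ten spare servers mentioned above, the values are illustrative (the 20,000 figure matches Apache's compiled-in limit mentioned below):

<IfModule prefork.c>
    StartServers         10
    MinSpareServers      10
    MaxSpareServers      10
    ServerLimit       20000
    MaxClients        20000
</IfModule>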

Module Compilation

Apache modules can be compiled directly into one binary, or as dynamically loaded shared objects which are then loaded by a smaller binary. For this load, a small performance gain (about 0.2%) was found by compiling the modules in directly.

htaccess

As the highperformance.conf sample provided with Apache suggests, turning off the use of .htaccess files, where appropriate, can give significant performance improvements.

sendfile

Sendfile is a system call that enables programs to hand off the job of sending files out of network sockets to the kernel, improving performance and efficiency. It is enabled by default at compile time if Apache detects that the system supports the call. However, the Linux implementation of sendfile at the time corrupted IPv6 sessions, so it could not be used on ftp.heanet.ie (hence the --without-sendfile configure option below).

Mmap

Mmap (memory map) support allows Apache to treat a file as if it were a contiguous region of memory, greatly speeding up I/O by dispensing with unnecessary read operations. This allowed files to be served roughly 3 times quicker.

mod_cache

mod_disk_cache is an experimental feature in Apache 2.x that caches files in a defined area as they are being served for the first time. Repeated requests are served from this cache, avoiding the slower file systems. The default configuration was further tuned by increasing CacheDirLevels from 4 to 5, to accommodate more files in the cache.

Configure Options

The following configure options were used:

CFLAGS="-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE"; export CFLAGS
"./configure" \
"--with-mpm=prefork" \
"--prefix=/usr/local/web" \
"--enable-cgid" \
"--enable-rewrite" \
"--enable-expires" \
"--enable-cache" \
"--enable-disk-cache" \
"--without-sendfile"

As little as possible was compiled into the httpd binary to reduce the amount of memory used. The exported CFLAGS enabled serving of files over 2 GB in size.

File System

Choosing a fast and efficient filesystem is very important for a web server. Tests at the time showed XFS gave better performance than ext2 and ext3, by a margin of up to 20%. As a caveat, as the number of used inodes in the filesystem grows, XFS becomes very slow at directory traversal. This resulted in a migration to ext3, despite the reduced performance.

noatime

One significant mount option, noatime, was set. This is the single easiest way to dramatically increase filesystem performance for read operations. Normally, when a file is read, Unix-like systems update the file's inode with the access time, so that the time of last access is known. This means that read operations also involve writing to the filesystem - a severe performance bottleneck in most cases. If knowing the access time is not critical, and it certainly is not for a busy mirror server, it can be turned off by mounting the filesystem with the noatime option.

logbufs

For XFS, the logbufs mount option allows the administrator to specify how many in-memory log buffers are used. While it was not clear exactly what these log buffers do, increasing this number to its maximum increased performance. The performance increase comes at the expense of memory, which was acceptable for the overall design.

dir_index

For ext3, dir_index is an option whereby ext3 uses hashed binary trees to speed up lookups in directories. This proved much faster for directory traversal.

Kernel

The system originally ran the SGI Linux 2.4 kernel, which gave about 12,000 sessions as a maximum. However, after simply upgrading to the 2.6 kernel, the server hit Apache's compiled-in 20,000 limit without any additional effort, so the scheduler in the 2.6 kernel appears to have improved markedly.

File Descriptors

One of the most important tunables for a large-scale webserver is the maximum number of file descriptors the system is allowed to have open at once. The default is not sufficient when serving thousands of clients. It is important to remember that regular files, sockets, pipes and the standard streams of every running process all count as file descriptors, and that it is easy to run out.

This figure was set to 5049800.
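
On Linux, this limit is typically raised via the fs/file-max sysctl; in the notation used below for the other sysctls, presumably:

fs/file-max = 5049800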

Virtual Memory Manager

In Linux, the VM manages the memory allocated to processes and the kernel, and also manages the in-memory cache of files. By far the easiest way to "tune" the VM in this regard is to increase the amount of memory available to it. This is probably the most reliable and easy way of speeding up a webserver - add as much memory as you can afford.

The VM takes a similar approach to mod_disk_cache for freeing up space - it assigns programs memory as they request it, and then periodically prunes back what can be made free. If a lot of files are being read very quickly, the rate of increase of memory usage will be very high. If this rate is so high that memory is exhausted before the VM has had a chance to free any, there will be severe system instability. To correct for this, five sysctls were set:

vm/min_free_kbytes = 204800
vm/lower_zone_protection = 1024
vm/page-cluster = 20
vm/swappiness = 200
vm/vm_vfs_scan_ratio = 2

  • The first sysctl sets the VM to aim for at least 200 Megabytes of free memory.
  • The second sets the amount of "lower zone" memory (directly addressable by the CPU) that should be kept free.
  • The third, vm/page-cluster, tells Linux how many pages to free at a time when freeing pages.
  • The fourth, vm/swappiness, is a rather vague sysctl which seems to boil down to how much Linux "prefers" swap, or how "swappy" it should be.
  • The final one, vm/vm_vfs_scan_ratio, sets what proportion of the filesystem-data caches should be scanned when freeing memory: a value of N means that 1/Nth of them is scanned in one pass, so some cached data is kept longer than it otherwise would be, leading to increased opportunity for re-use.

Network Stack

Six sysctl's were set relating to the network stack:

net/ipv4/tcp_rfc1337=1
net/ipv4/tcp_syncookies=1
net/ipv4/tcp_keepalive_time = 300
net/ipv4/tcp_max_orphans=1000
net/core/rmem_default=262144
net/core/rmem_max=262144

  • TCP syncookies and the RFC 1337 option were enabled for security reasons.
  • The default TCP keepalive time was set to 5 minutes, so that httpd children handling connections that have been unresponsive for 5 minutes do not needlessly wait in the queue. This has the minor impact that if the client does try to continue with the TCP session at a later time, it will be disconnected.
  • The max-orphans option ensures that, despite the 5-minute timeout, there are never more than 1,000 connections held in such a state; beyond that, the sockets of the longest-waiting processes are closed. This prevents process starvation due to many broken connections.
  • The final two options increase the amount of memory generally available to the networking stack for queueing packets.

Hyperthreading

Hyperthreading is a technology which makes one physical processor appear as two, with the aim of improving resource-usage efficiency within the processor. The webserver was benchmarked with hyperthreading enabled and disabled. Enabling hyperthreading resulted in a 37% performance increase (from 721 requests per second to 989 requests per second in the same test), so it was left enabled.

References

Colm MacCárthaigh, Scaling Apache 2.x beyond 20,000 concurrent downloads, http://www.stdlib.net/~colmmacc/Apachecon-EU2005/scaling-apache-handout.pdf, ApacheCon 2005

-- AnnRHarding - 20 Jul 2005

Experimental 10Gbps Technology Results at Forschungszentrum Karlsruhe

A worldwide distributed Grid Computing environment is currently under development for the upcoming Large Hadron Collider (LHC) at CERN, organized in several so-called Tier centres. CERN will be the source of a very high data load, originating from particle detectors at the LHC accelerator ring, estimated at 2000 MByte/sec. Part of the data processing is done in a Tier-0 centre located at CERN, which is responsible mainly for the initial processing of data and its archiving to tape. One tenth of the data is distributed to each of the 10 Tier-1 centres for further processing and backup purposes. The GridKa cluster (www.gridka.de), located at Forschungszentrum Karlsruhe/Germany (www.fzk.de), is one of these Tier-1 centres. Further details about the LHC computing model can be found here: http://les.home.cern.ch/les/talks/lcg%20overview%20sep05.ppt.

A 10 Gbit/s link is active between the CERN Tier-0 centre and GridKa in order to handle this high load of data.

Often network administrators, after having set up a Gigabit connection (or even one with 10 Gb/s), sadly realize that what they get out of it is not what they had expected. The performance achieved with vanilla systems is often far from what it should be. The current problems of Gigabit and 10 Gb/s Ethernet were not yet apparent years ago, when the Fast Ethernet technology was released. Nowadays it is really difficult to get close to line speed, and it is almost impossible to fill a full-duplex communication when the latter technology is used. While at the height of the Fast Ethernet technology the networks were the bottlenecks, the problems have since moved towards the end systems, and the networks are not likely to be the bottlenecks anymore.

The requirements of a full-duplex 10Gbps communication are just too demanding for the capabilities of current end systems. Most of the applications in use today are based on the TCP/IP protocol suite. The inherent unreliability of the IP protocol is compensated for by several mechanisms in the TCP layer of the OSI model to ensure correct and reliable end-to-end communication. This reliability does of course not come for free, as each TCP flow within a system crosses the Front Side Bus (FSB) up to four times, and memory is accessed up to five times.

Currently, vendors try to decrease the number of transfers back and forth across the FSB, and the interrupt rate that a 10Gbps system has to deal with. This is done by adding offload engines (see http://www.redbooks.ibm.com/redbooks/SG245287/) and Interrupt Coalescence to current 10Gbps network devices, with the aim of getting the best out of the technology found in current end systems. The first approach moves into hardware some TCP processing that used to be done in software, thereby eliminating one cycle and reducing the load to two Front Side Bus transfers and three memory accesses. The second approach simply places each new packet into a queue for a set time rather than sending it as soon as it is ready, in the hope that when the time frame expires, more packets will be ready to be sent. This allows them all to be sent with a single interrupt, rather than generating an interrupt for every single packet.

The various 10Gbps tests run at the Forschungszentrum Karlsruhe can be divided into two big groups:

  • a local test inside the Forschungszentrum testbed

  • tests involving experiments in a Wide Area Network (WAN) environment, between Forschungszentrum Karlsruhe and CERN. Such tests use the Deutsche Forschungsnetz and GEANT through a 20 ms RTT path over a shared Least Best Effort ATM MPLS Tunnel. This allows DFN and GEANT to stay in control, as their 10Gbps backbone could easily be filled up by these tests alone, which would effectively cut off the communication of thousands of scientists across Europe.

The local environment at Forschungszentrum Karlsruhe consisted of two IBM xSeries 345 Intel Xeon based systems, both equipped with a 10Gbps LR card, kindly provided by Intel. With these unmodified systems, the throughput went slightly above 1Gbps. After modifying the default interrupt-coalescence configuration of Intel's device driver, using Ethernet extended non-standard jumbo frames, and setting the MMRBC register of the PCI-X command register set to its maximum, a unidirectional single stream of slightly over 5.3Gbps could be sent in a back-to-back transfer - an improvement of more than 400%. As the load on both IBM systems rose to 99%, no higher throughput could be achieved with these machines. The bottleneck in this case was the memory subsystem of the Xeon systems.

In the WAN environment, a single Intel Itanium node at CERN plus one of the already-tuned IBM systems were configured to take part in the wide-area tests. Both were configured in the same way. The first tests were really disappointing, as they did not go beyond a few MByte/s. Once TCP SACK (selective acknowledgements, RFC 2018) and TCP Timestamps (RFC 1323) were enabled, and the TCP windows were enlarged by means of the sysctl parameters in order to match the bandwidth-delay product (BDP), the throughput drastically increased, up to 5.4Gbps. The BDP is roughly 25 MByte for this 20 ms RTT path across Germany, France and Switzerland (10 Gb/s x 20 ms = 200 Mbit). In this situation the bottleneck was again the xSeries memory subsystem. This did not come unexpected, as two different architectures were brought face to face: Xeon versus Itanium.

Here is the modification of the TCP stack, as done using the Linux kernel's sysctl parameters:

net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_rmem = 10000000 25165824 50331648  # min/default/max TCP read buffer (default: 4096 87380 174760)
net.ipv4.tcp_wmem = 10000000 25165824 50331648  # min/pressure/max TCP write buffer (default: 4096 16384 131072)
net.ipv4.tcp_mem = 32768 65536 131072           # min/pressure/max TCP buffer space (default: 31744 32256 32768)

Related links:

Forschungszentrum Karlsruhe: http://www.fzk.de

GridKa: http://www.gridka.de

CERN: http://www.cern.ch

LHC GridComputing: http://les.home.cern.ch/les/talks/lcg%20overview%20sep05.ppt

IBM Redbook: http://www.redbooks.ibm.com/redbooks/SG245287

Authors

Marc García Martí

Bruno Hoeft

-- MonicaDomingues - 24 Oct 2005
-- SimonLeinen - 14 Oct 2006 (added cross-references)

  • figure1.bmp: Logical topology of the Forschungszentrum /CERN 10Gbps testbed

Network Performance People

There are many individuals who have contributed to the field of network performance. A few of them are listed here - if you think of someone else, just add them!

-- SimonLeinen - 06 May 2005

Van Jacobson

Traceroute, TCP, and DiffServ work

Van Jacobson made vast contributions to networking, in particular related to performance. He wrote the original traceroute tool based on an idea by Steve Deering, introduced Congestion Avoidance to TCP, proposed (with Sally Floyd) Explicit Congestion Notification, implemented the first zero-copy TCP in BSD Unix, and did some of the early work on what later became the Differentiated Services Architecture and Premium IP.

Channel-based Networking Driver Architecture

Recent work includes a rearchitecture (based on a new "channel" concept) of the device driver and buffer management architecture that is common to networking stack implementations of practically all current operating systems. This work is described in a talk at Linux Conference Australia (LCA2006) (slides from the talk; blog article by DaveM).

-- SimonLeinen - 06 May 2005-04 Feb 2006

Sally Floyd

Sally co-invented RandomEarlyDetection with VanJacobson, proposed ExplicitCongestionNotification, and works on the HS-TCP (High-Speed TCP) variant of the TransmissionControlProtocol.

References

Sally's home page, including papers, talks, and informal notes, as well as a number of valuable pointer collections and open research questions.

-- SimonLeinen - 06 May 2005

Related Efforts

There are many other places where useful information about network performance can be found. Here is a small selection - feel free to add more references as you encounter them.

Network Monitoring Tools

Standardization

  • IETF (Internet Engineering Task Force), in particular the working groups on IPPM, ...

-- SimonLeinen - 22 Jan 2006

-- SimonLeinen - 09 Apr 2006
