Unrolled PERTKB Contents

Welcome to the eduPERT Knowledge Base!

If you'd like to know more about us and what the eduPERT community is up to, visit our portal.

This is a wiki that collects useful information on how to get good performance out of networks, in particular research networks.

This KB is open and public: anybody can benefit from and contribute to its content. Over the past years, many topics on network performance and general networking knowledge have been collected. To make navigation easier, they have been grouped into five main categories:

NETWORK: Network protocols, tuning and more...

END HOST: Application protocols, End-Host tuning and more...

TOOLS: Active and passive measurement tools, traceroute and more...

GENERAL KNOWLEDGE: Wizard gap, performance people, evil middlebox and more...

PERFORMANCE CASE STUDIES: History of the PERT cases (solved or still open).

If you're looking for topics to work on, check out the ToDo list! You can also add suggestions for improvement there.

Much of the material written here was published in August 2006 as GN2-06-135v2 (DS3.3.3): PERT Performance Guides.

Note that you can subscribe to update notifications for this Knowledge Base through an RSS feed.



How to report a performance problem

Latest News

Windows 7 / Vista / 2008 TCP/IP tweaks

On Speedguide.net there is a great article about the TCP/IP Stack tuning possibilities in Windows 7/Vista/2008.

NDT Protocol Documented

A description of the protocol used by the NDT tool has recently been written up. Check this out if you want to implement a new client (or server), or if you simply want to understand the mechanisms.

-- SimonLeinen - 02 Aug 2011

What does the "+1" mean?

I would like to experiment with social web techniques, so I have added a (Google) "+1" button to the template for viewing PERT KB topics. It allows people to publicly share interesting Web pages. Please share what you think about this via the SocialWebExperiment topic!

-- SimonLeinen - 02 Jul 2011

News from Web10G

The Web10G project released its first code in May 2011. If you are interested in this follow-up project to the incredibly useful Web100, check out their site. I have added a short summary of the project to the WebTenG topic here.

-- SimonLeinen - 30 Jun 2011

FCC Open Internet Apps Challenge - Vote for Measurement Tools by 15 July

The U.S. Federal Communications Commission (FCC) is currently holding a challenge for applications that can help assess the "openness" of Internet access offerings. There's a page on challenge.gov where people can vote for their favorite tool. A couple of candidate applications are familiar to our community, for example NDT and NPAD. But there are some new ones that look interesting. Some applications are specific for mobile data networks ("3G" etc.). There are eight submissions - be sure to check them out. Voting is possible until 15 July 2011.

-- SimonLeinen - 29 Jun 2011

Web10G

Catching up with the discussion@web100.org mailing list, I noticed an announcement by Matt Mathis (formerly PSC, now at Google) from last September (yes, I'm a bit behind with my mail reading :-): A follow-on project called Web10G has received funding from the NSF. I'm all excited about this because Web100 produced so many influential results. So congratulations to the proposers, good luck for the new project, and let's see what Web10G will bring!

-- SimonLeinen - 25 Mar 2011

eduPERT Training Workshop in Zurich, 18/19 November 2010

We held another successful training event in November, with 17 attendees from various NRENs including some countries that have joined GÉANT recently, or are in the process of doing so. The program was based on the one used for the workshops held in 2008 and 2007. But this time we added some case studies, where the instructors walked through some actual cases they had encountered in their PERT work, and at many points asked the audience for their interpretation of the symptoms seen so far, and for suggestions on what to do next. Thanks to the active participation of several attendees, this interactive part turned out to be both enjoyable and instructive, and was well received by the audience. Program and slides are available on the PERT Training Workshop page at TERENA.

-- SimonLeinen - 18 Dec 2010

Efforts to increase TCP's initial congestion window (updated)

At the 78th IETF meeting in Maastricht this July, one of the many topics discussed was a proposal to further increase TCP's initial congestion window, possibly to as much as ten maximum-size segments (10 MSS). This made me notice that the PERT KB was missing a topic on TCP's initial congestion window, so I created one. A presentation by Matt Mathis on the topic is scheduled for Friday's ICCRG meeting. The slides, as well as some other materials and minutes, can be found on the IETF Proceedings page. Edited to add: Around IETF 79 in November 2010, the discussion on the initial window picked up steam, and there are now three different proposals in the form of Internet-Drafts. I have updated the TcpInitialWindow topic to briefly describe these.

-- SimonLeinen - 28 Jul 2010 - 09 Dec 2010

PERT meeting at TNC 2010

We held a PERT meeting on Sunday 30 May (1400-1730), just before the official start of the TNC 2010 conference in Vilnius. The attendees discussed technical and organizational issues facing PERTs in National Research and Education Networks (NRENs), large campus networks, and supporting network-intensive projects such as the multi-national research efforts in (Radio-) Astronomy, Particle Physics, or Grid/Cloud research. Slides are available on the TNC2010 website.

-- SimonLeinen - 15 Jun 2010

Web100 Project Re-Opening Soon?

In a mail to the ndt-users list, Matt Mathis has issued a call for input from people who use Web100, as part of NDT or otherwise. If you use this, please send Matt an e-mail. There's a possibility that the project will be re-opened "for additional development and some bug fixes".

-- SimonLeinen - 20 Feb 2010

ICSI releases Netalyzr Internet connectivity tester

The International Computer Science Institute (ICSI) at Berkeley has released an awesome new Java applet called Netalyzr, which lets you run a whole range of connectivity tests. It is a very nice tool for getting information about your Internet connectivity: whether your provider blocks ports, whether your DNS works properly, and much more. Read about it on our page here or go directly to their homepage.

-- ChrisWelti - 18 Jan 2010

Notes About Performance

This part of the PERT Knowledgebase contains general observations and hints about the performance of networked applications.

The mapping between network performance metrics and user-perceived performance is complex and often non-intuitive.

-- SimonLeinen - 13 Dec 2004 - 07 Apr 2006

User-Perceived Performance

User-perceived performance of network applications is made up of a number of mainly qualitative metrics, some of which conflict with each other. For some applications, a single metric outweighs the others, such as responsiveness for video services or throughput for bulk transfer applications. More commonly, a combination of factors determines the end-user's experience of network performance.

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the feeling of responsiveness vanishes. For example, for voice conversations, a (round-trip) delay of 150 ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see are significantly lower than the advertised capacity of their network connection.

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006
-- AlessandraScicchitano - 10 Feb 2012

The "Wizard Gap"

The Wizard Gap is an expression coined by Matt Mathis (then PSC) in 1999. It designates the difference between the performance that is "theoretically" possible on today's high-speed networks (in particular, research networks), and the performance that most users actually perceive. The idea is that today, the "theoretical" performance can only be (approximately) obtained by "wizards" with superior knowledge and skills concerning system tuning. Good examples of "wizard" communities are the participants in Internet2 Land-Speed Record or SC Bandwidth Challenge competitions.

The Internet2 end-to-end performance initiative strives to reduce the Wizard Gap by user education as well as improved instrumentation (see e.g. Web100) of networking stacks. In the GÉANT community, PERTs focus on assistance to users (case management), as well as user education through resources such as this knowledge base. Commercial players are also contributing to closing the wizard gap, by improving "out-of-the-box" performance of hardware and software, so that their customers can benefit from faster networking.

References

  • M. Mathis, Pushing up Performance for Everyone, Presentation at the NLANR/Internet2 Joint Techs Workshop, 1999 (PPT)

-- SimonLeinen - 2006-02-28 - 2016-04-27

Why Latency Is Important

Traditionally, the metric of focus in networking has been bandwidth. As more and more parts of the Internet have their capacity upgraded, bandwidth is often no longer the main problem. However, network-induced latency, as measured by One-Way Delay (OWD) or Round-Trip Time (RTT), often has a noticeable impact on performance.

It's not just for gamers...

The group of Internet users most aware of the importance of latency today is online gamers. It is intuitively obvious that in real-time multiplayer games over the network, players don't want to be put at a disadvantage because their actions take longer to reach the game server than their opponents'.

However, latency impacts the other users of the Internet as well, probably much more than they are aware of. At a given bottleneck bandwidth, a connection with lower latency will reach its achievable rate faster than one with higher latency. The effect of this should not be underestimated, since most connections on the Internet are short - in particular connections associated with Web browsing.

But even for long connections, latency often has a big impact on performance (throughput), because many of those long connections have their throughput limited by the window size that is available for TCP. And when the window size is the bottleneck, throughput is inversely proportional to round-trip time. Furthermore, for a given (small) loss rate, RTT places an upper limit on the achievable throughput of a TCP connection, as shown by the Mathis Equation.
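The two limits described above can be illustrated with a small calculation. This is only a sketch: the function names are made up for this example, the Mathis Equation is used in its common simplified form rate <= (MSS/RTT) * C/sqrt(p), and the constant C = 1 as well as the window size, MSS and loss rate in the usage part are assumptions chosen for illustration.

```python
import math


def window_limited_throughput(window_bytes: float, rtt_s: float) -> float:
    """Upper bound (bit/s) when the TCP window is the bottleneck: window / RTT."""
    return window_bytes * 8 / rtt_s


def mathis_throughput(mss_bytes: float, rtt_s: float, loss_rate: float,
                      c: float = 1.0) -> float:
    """Approximate upper bound (bit/s) from the simplified Mathis Equation.

    c is a constant of order 1; its exact value depends on the TCP variant
    and loss model, so c = 1.0 is an assumption made for this sketch.
    """
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


if __name__ == "__main__":
    # Assumed example values: 64 KiB window, 1460-byte MSS, loss rate 0.01 %.
    for rtt_ms in (10, 100):
        rtt = rtt_ms / 1000
        w = window_limited_throughput(65535, rtt)
        m = mathis_throughput(1460, rtt, 1e-4)
        print(f"RTT {rtt_ms:3d} ms: window limit {w / 1e6:6.1f} Mb/s, "
              f"loss limit {m / 1e6:6.1f} Mb/s")
```

In both cases a tenfold increase in RTT reduces the bound by a factor of ten, regardless of how much capacity the path has.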

But I thought latency was only important for multimedia!?

Common wisdom is that latency (and also jitter) is important for audio/video ("multimedia") applications. This is only partly true: Many applications of audio/video involve "on-demand" unidirectional transmission. In those applications, the real-time concerns can often be mitigated by clever buffering or transmission ahead of time. For conversational audio/video, such as "Voice over IP" or videoconferencing, the latency issue is very real. The principal sources of latency in these applications are not backbone-related, but related to compression/sampling rates (see packetization delay) and to transcoding devices such as H.323 MCUs (Multipoint Control Units).

-- SimonLeinen - 13 Dec 2004

Network Performance Metrics

There are many metrics that are commonly used to characterize the performance of networks and parts of networks. We present the most important of these, explain what influences them, how they can be measured, how they influence end-to-end performance, and what can be done to improve them.

A framework for network performance metrics has been defined by the IETF's IP Performance Metrics (IPPM) Working Group in RFC 2330. The group also developed definitions for several specific performance metrics; those are referenced from the respective sub-topics.

-- SimonLeinen - 31 Oct 2004

One-way Delay (OWD)

One-way delay is the time it takes for a packet to reach its destination. It is considered a property of network links or paths. RFC 2679 contains the definition of the one-way delay metric of the IETF's IPPM (IP Performance Metrics) working group.

Decomposition

One-way delay along a network path can be decomposed into per-hop one-way delays, and these in turn into per-link and per-node delay components.

Per-link delay components: propagation delay and serialization delay

The link-specific component of one-way delay consists of two sub-components:

Propagation Delay is the time it takes for signals to move from the sending to the receiving end of the link. On simple links, this is the link's physical length divided by the characteristic propagation speed. The velocities of propagation (VoP) for copper and fibre optic are similar (the VoP of copper is slightly faster), being approximately 2/3 the speed of light in a vacuum.

Serialization delay is the time it takes for a packet to be serialized into link transmission units (typically bits). It is the packet size (in bits) divided by the link's capacity (in bits per second).

In addition to the propagation and serialization delays, some types of links may introduce additional delays, for example to avoid collisions on a shared link, or when the link-layer transparently retransmits packets that were damaged during transmission.
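As a rough illustration of the two per-link components, the following sketch computes both for a single packet. The function names and the example link (1000 km of fibre at 1 Gb/s) are assumptions made for this example; the velocity factor of 2/3 is the approximate value given above, and the additional link-layer delays mentioned in the previous paragraph are ignored.

```python
SPEED_OF_LIGHT_M_S = 299_792_458  # speed of light in a vacuum


def propagation_delay(length_m: float, velocity_factor: float = 2 / 3) -> float:
    """Seconds for a signal to travel length_m at velocity_factor * c."""
    return length_m / (velocity_factor * SPEED_OF_LIGHT_M_S)


def serialization_delay(packet_bytes: int, link_bps: float) -> float:
    """Seconds to clock a packet onto a link: size in bits / capacity in bit/s."""
    return packet_bytes * 8 / link_bps


def link_delay(length_m: float, packet_bytes: int, link_bps: float) -> float:
    """Per-link one-way delay, ignoring collisions and link-layer retransmits."""
    return propagation_delay(length_m) + serialization_delay(packet_bytes, link_bps)


if __name__ == "__main__":
    # Assumed example: a 1500-byte packet over 1000 km of fibre at 1 Gb/s.
    prop = propagation_delay(1_000_000)
    ser = serialization_delay(1500, 1e9)
    print(f"propagation:   {prop * 1e3:.3f} ms")
    print(f"serialization: {ser * 1e6:.1f} us")
    print(f"total:         {link_delay(1_000_000, 1500, 1e9) * 1e3:.3f} ms")
```

On a long, fast link like this one, the propagation component is several hundred times larger than the serialization component.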

Per-node delay components: forwarding delay, queueing delay

Within a network node such as a router, a packet experiences different kinds of delay between its arrival on one link and its departure on another (or the same) link:

Forwarding delay is the time it takes for the node to read forwarding-relevant information (typically the destination address and other headers) from the packet, compute the "forwarding decision" based on routing tables and other information, and to actually forward the packet towards the destination, which involves copying the packet to a different interface inside the node, rewriting parts of it (such as the IP TTL and any media-specific headers) and possibly other processing such as fragmentation, accounting, or checking access control lists.

Depending on router architecture, forwarding can compete for resources with other activities of the router. In this case, packets can be held up until the router's forwarding resource is available, which can take many milliseconds on a router with CPU-based forwarding. Routers with dedicated hardware for forwarding don't have this problem, although there may be delays when a packet arrives as the hardware forwarding table is being reprogrammed due to a routing change.

Queueing delay is the time a packet has to wait inside the node for the output link to become available. Queueing delay depends on the amount of competing traffic towards the output link, and on the priorities of the packet itself and of the competing traffic. The amount of queueing that a given packet may encounter is hard to predict, but it is bounded by the buffer size available for queueing.

There can be causes for queueing other than contention on the outgoing link, for example contention on the node's backplane interconnect.
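The bound on queueing delay given by the buffer size can be estimated as follows. This is a minimal sketch that assumes a single first-in-first-out queue in front of the output link, with no priorities and no other source of contention; the function name and the buffer and link sizes are hypothetical.

```python
def max_queueing_delay(buffer_bytes: int, link_bps: float) -> float:
    """Worst-case wait in seconds for a packet that arrives at a full FIFO
    queue: the time needed to drain the whole buffer onto the output link."""
    return buffer_bytes * 8 / link_bps


if __name__ == "__main__":
    # Assumed example: the same 1 MB buffer in front of links of different speed.
    for name, rate in (("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)):
        delay_ms = max_queueing_delay(1_000_000, rate) * 1e3
        print(f"{name:>8}: up to {delay_ms:.1f} ms of queueing delay")
```

The same buffer that adds less than a millisecond on a 10 Gb/s link can add tens of milliseconds on a 100 Mb/s link.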

Impact on end-to-end performance

When studying end-to-end performance, it is usually more interesting to look at the following metrics that are derived from one-way delay:

  • Round-trip time (RTT) is the time from node A to B and back to A. It is the sum of the one-way delays from A to B and from B to A, plus the response time in B.
  • Delay Variation represents the variation in one-way delay. It is important for real-time applications. People often call this "jitter".

Measurement

One-way delays from a node A to a node B can be measured by sending timestamped packets from A, and recording the reception times at B. The difficulty is that A and B need clocks that are synchronized to each other. This is typically achieved by having clocks synchronized to a standard reference time such as UTC (Coordinated Universal Time) using techniques such as Global Positioning System (GPS)-derived time signals or the Network Time Protocol (NTP).

There are several infrastructures that continuously measure one-way delays and packet loss: the HADES boxes in DFN and GÉANT2, the RIPE TTM boxes between various research and commercial networks, and RENATER's QoSMetrics boxes.

OWAMP and RUDE/CRUDE are examples of tools that can be used to measure one-way delay.

Improving delay

Shortest-(Physical)-Path Routing and Proper Provisioning

On high-speed wide-area network paths, delay is usually dominated by propagation times. Therefore, physical routing of network links plays an important role, as well as the topology of the network and the selection of routing metrics. Ensuring minimal delays is simply a matter of

  • using an internal routing protocol with "shortest-path routing" (such as OSPF or IS-IS) and a metric proportional to per-link delay
  • provisioning the network so that these shortest paths aren't congested.

This could be paraphrased as "use proper provisioning instead of traffic engineering".

The node contributions to delay can be addressed by:

  • using nodes with fast forwarding
  • avoiding queueing by provisioning links to accommodate typical traffic bursts
  • reducing the number of forwarding nodes

-- SimonLeinen - 31 Oct 2004 - 25 Jan 2007

Propagation Delay

The propagation delay is the time it takes for a signal to propagate. It depends on the distance traveled and on the specific propagation speed of the medium. For instance, information transmitted via radio travels at a speed close to c (the speed of light in vacuum, approximately 300,000 km/s), while signals in copper cables travel at a substantial fraction of c that depends on the cable type. The prevalent medium for long-distance digital transmission is now light in optical fibers, where the propagation speed is about 2/3 c, i.e. approximately 200,000 km/s.

Propagation delay, along with serialization delay and processing delays in nodes such as routers, is a component of overall delay/RTT. For uncongested long-distance network paths, it is usually the dominant component.

Examples

Here are a few examples for propagation delay components of one-way and round-trip delays over selected distances in fiber.

Fibre length   One-way delay   Round-trip time
1 m            5 ns            10 ns
1 km           5 µs            10 µs
10 km          50 µs           100 µs
100 km         500 µs          1 ms
1000 km        5 ms            10 ms
10000 km       50 ms           100 ms

-- SimonLeinen - 28 Feb 2006

Serialization Delay (or Transmission Delay)

Serialization delay is the time it takes for a unit of data, such as a packet, to be serialized for transmission on a narrow (e.g. serial) channel such as a cable. Serialization delay is dependent on size, which means that longer packets experience longer delays over a given network path. Serialization delay is also dependent on channel capacity ("bandwidth"), which means that for equal-size packets, the faster the link, the lower the serialization delay.

Serialization delays are incurred at processing nodes, when packets are stored-and-copied between links and (router/switch) buffers. This includes the copying over internal links in processing nodes, such as router backplanes/switching fabrics.

In the core of the Internet, serialization delay has largely become a non-issue, because link speeds have increased much faster over the past years than packet sizes. Therefore, the "hopcount" as shown by e.g. traceroute is a bad predictor for delay today.

Example Serialization Delays

To illustrate the effects of link rates and packet sizes on serialization delay, here is a table of some representative values. Note that the maximum packet size for most computers is 1500 bytes today, but 9000-byte "jumbo frames" are already supported by many research networks.

Packet Size \ Link Rate   64 kb/s    1 Mb/s     10 Mb/s    100 Mb/s   1 Gb/s     10 Gb/s
64 bytes                  8 ms       0.512 ms   51.2 µs    5.12 µs    0.512 µs   51.2 ns
512 bytes                 64 ms      4.096 ms   409.6 µs   40.96 µs   4.096 µs   409.6 ns
1500 bytes                187.5 ms   12 ms      1.2 ms     120 µs     12 µs      1.2 µs
9000 bytes                1125 ms    72 ms      7.2 ms     720 µs     72 µs      7.2 µs

-- SimonLeinen - 28 Oct 2004 - 17 Jun 2010

Round-Trip Time (RTT)

Round-trip time (RTT) is the total time for a packet sent by a node A to reach its destination B, and for a response sent back by B to reach A.

Decomposition

The round-trip time is the sum of the one-way delays from A to B and from B to A, and of the time it takes B to send the response.

Impact on end-to-end performance

For window-based transport protocols such as TCP, the round-trip time influences the achievable throughput at a given window size, because there can only be a window's worth of unacknowledged data in the network, and the RTT is a lower bound on the time it takes for a packet to be acknowledged.

For interactive applications such as conversational audio/video, instrument control, or interactive games, the RTT represents a lower bound on response time, and thus impacts responsiveness directly.

Measurement

The round-trip time is often measured with tools such as ping (Packet InterNet Groper) or one of its cousins such as fping, which send ICMP Echo Request messages to a destination, and measure the time until the corresponding ICMP Echo Reply messages arrive.

However, please note that while the round-trip time reported by ping is a relatively precise measurement, some network devices handle ICMP messages with a different (often lower) priority than regular traffic, so the measured values may not correspond to the RTT experienced by other traffic.

-- MichalPrzybylski - 19 Jul 2005

Improving the round-trip time

The network components of RTT are the one-way delays in both directions (which can use different paths), so see the OneWayDelay topic on how those can be improved. The speed of response generation can be improved through upgrades of the responding host's processing power, or by optimizing the responding program.

-- SimonLeinen - 31 Oct 2004

Bandwidth-Delay Product (BDP)

The BDP of a path is the product of the (bottleneck) bandwidth and the delay of the path. Its dimension is "information", because bandwidth (here) expresses information per time, and delay is a time (duration). Typically, one uses bytes as a unit, and it is often useful to think of BDP as the "memory capacity" of a path, i.e. the amount of data that fits entirely into the path (between two end-systems).

BDP is an important parameter for the performance of window-based protocols such as TCP. Network paths with a large bandwidth-delay product are called Long Fat Networks or "LFNs".
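A BDP calculation is simple enough to sketch in a few lines. The function name and the example paths below are assumptions made for illustration; the round-trip time is used as the delay, which is the usual choice when the result is compared with a TCP window size.

```python
def bdp_bytes(bottleneck_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes: (bit/s * s) / 8."""
    return bottleneck_bps * rtt_s / 8


if __name__ == "__main__":
    # Assumed example paths: (description, bottleneck bandwidth, RTT).
    paths = [
        ("100 Mb/s, 10 ms", 100e6, 0.010),
        ("1 Gb/s, 100 ms", 1e9, 0.100),
        ("10 Gb/s, 150 ms", 10e9, 0.150),
    ]
    for name, bw, rtt in paths:
        print(f"{name:>16}: BDP = {bdp_bytes(bw, rtt) / 1e6:8.3f} MB")
```

A window-based sender needs a window of at least this size to keep such a path full, which is why paths with a large BDP are demanding for TCP.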

-- SimonLeinen - 14 Apr 2005

"Long Fat Networks" (LFNs)

Long Fat Networks (LFNs, pronounced like "elephants") are networks with a high bandwidth-delay product.

One of the issues with this type of network is that it can be challenging to achieve high throughput for individual data transfers with window-based transport protocols such as TCP. LFNs are thus a main focus of research on high-speed improvements for TCP.

-- SimonLeinen - 27 Oct 2004 - 17 Jun 2005

Delay Variation ("Jitter")

Delay variation or "jitter" is a metric that describes the level of disturbance of packet arrival times with respect to an "ideal" pattern, typically the pattern in which the packets were sent. Such disturbances can be caused by competing traffic (i.e. queueing), or by contention on processing resources in the network.

RFC 3393 defines an IP Delay Variation Metric (IPDV). This particular metric only compares the delays experienced by packets of equal size, on the grounds that delay is naturally dependent on packet size, because of serialization delay.

Delay variation is an issue for real-time applications such as audio/video conferencing systems. They usually employ a jitter buffer to eliminate the effects of delay variation.

Delay variation is related to packet reordering. But note that the RFC 3393 IPDV of a network can be arbitrarily low, even zero, even though that network reorders packets, because the IPDV metric only compares delays of equal-sized packets.

Decomposition

Jitter is usually introduced in network nodes (routers), as an effect of queueing or contention for forwarding resources, especially on CPU-based router architectures. Some types of links can also introduce jitter, for example through collision avoidance (shared Ethernet) or link-level retransmission (802.11 wireless LANs).

Measurement

Contrary to one-way delay, one-way delay variation can be measured without requiring precisely synchronized clocks at the measurement endpoints. Many tools that measure one-way delay also provide delay variation measurements.
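The following sketch shows why a constant clock offset between the endpoints does not disturb delay variation measurements. It assumes equal-sized packets and computes delay variation as the difference between the one-way delays of consecutive packets, which is one common way of selecting packet pairs; the function names, timestamps and the 3.5 s offset are made up for this example. A clock that drifts during the measurement would still introduce an error.

```python
def one_way_delays(send_times, recv_times):
    """Apparent one-way delays: receiver timestamp minus sender timestamp."""
    return [r - s for s, r in zip(send_times, recv_times)]


def delay_variation(delays):
    """Differences between the one-way delays of consecutive packets."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


if __name__ == "__main__":
    send = [0.000, 0.020, 0.040, 0.060]        # sender clock, seconds
    true_delay = [0.010, 0.012, 0.011, 0.015]  # actual one-way delays
    offset = 3.5  # receiver clock runs 3.5 s ahead (unknown in practice)
    recv = [s + d + offset for s, d in zip(send, true_delay)]

    measured = one_way_delays(send, recv)
    print("measured delays:", [f"{d:.3f}" for d in measured])
    print("delay variation:", [f"{v * 1e3:+.1f} ms" for v in delay_variation(measured)])
    print("true variation: ", [f"{v * 1e3:+.1f} ms" for v in delay_variation(true_delay)])
```

The measured delays are all wrong by the clock offset, but the offset cancels when two delays are subtracted, so the delay variation matches the true values.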

References

The IETF IPPM (IP Performance Metrics) Working Group has formalized metrics for IPDV, and more recently started work on an "applicability statement" that explains how IPDV can be used in practice and what issues have to be considered.

  • RFC 3393, IP Packet Delay Variation Metric for IP Performance Metrics (IPPM), C. Demichelis, P. Chimento. November 2002.
  • draft-morton-ippm-delay-var-as-03.txt, Packet Delay Variation Applicability Statement, A. Morton, B. Claise, July 2007.

-- SimonLeinen - 28 Oct 2004 - 24 Jul 2007

Packet Loss

Packet loss is the probability of a packet being lost in transit from a source to a destination.

A One-way Packet Loss Metric for IPPM is defined in RFC 2680. RFC 3357 contains One-way Loss Pattern Sample Metrics.

Decomposition

There are two main reasons for packet loss:

Congestion

When the offered load exceeds the capacity of a part of the network, packets are buffered in queues. Since these buffers are also of limited capacity, severe congestion can lead to queue overflows, which lead to packet drops. In this context, severe congestion could mean that a moderate overload condition holds for an extended amount of time, but could also consist of the sudden arrival of a very large amount of traffic (burst).

Errors

Another reason for loss of packets is corruption, where parts of the packet are modified in-transit. When such corruptions happen on a link (due to noisy lines etc.), this is usually detected by a link-layer checksum at the receiving end, which then discards the packet.

Impact on end-to-end performance

Bulk data transfers usually require reliable transmission, so lost packets must be retransmitted. In addition, congestion-sensitive protocols such as TCP must assume that packet loss is due to congestion, and reduce their transmission rate accordingly (although recently there have been some proposals to allow TCP to identify non-congestion related losses and treat those differently).

For real-time applications such as conversational audio/video, it usually doesn't make much sense to retransmit lost packets, because the retransmitted copy would arrive too late (see delay variation). The result of packet loss is usually a degradation in sound or image quality. Some modern audio/video codecs provide a level of robustness to loss, so that the effect of occasional lost packets is benign. On the other hand, some of the most effective image compression methods are very sensitive to loss, in particular those that use relatively rare "anchor frames" and represent the intermediate frames by compressed differences to these anchor frames: when such an anchor frame is lost, many other frames cannot be reconstructed.

Measurement

Packet loss can be actively measured by sending a set of packets from a source to a destination and comparing the number of received packets against the number of packets sent.
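As a minimal sketch of this comparison, the function below computes a loss ratio from the sequence numbers that were sent and those that arrived. The function name and the use of sequence numbers to identify packets are assumptions for illustration; real tools also have to decide how long to wait before declaring a packet lost.

```python
def loss_ratio(sent_seqs, received_seqs):
    """Fraction of sent packets that never arrived (duplicates are ignored)."""
    sent = set(sent_seqs)
    if not sent:
        raise ValueError("no packets were sent")
    lost = sent - set(received_seqs)
    return len(lost) / len(sent)

# Ten probes sent, packets 3 and 8 never arrived.
print(loss_ratio(range(10), [0, 1, 2, 4, 5, 6, 7, 9]))
```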

Network elements such as routers also contain counters for events such as checksum errors or queue drops, which can be retrieved through protocols such as SNMP. When this kind of access is available, this can point to the location and cause of packet losses.

Reducing packet loss

Congestion-induced packet loss can be avoided by proper provisioning of link capacities. Depending on the probability of bursts (which is somewhat difficult to estimate, taking into account both link capacities in the network, and the traffic rates and patterns of a possibly large number of hosts at the edge of the network), buffers in network elements such as routers must also be sufficient. Note that large buffers can be detrimental to other aspects of network performance, in particular one-way delay (and thus round-trip time) and delay variation.

Quality-of-Service mechanisms such as DiffServ or IntServ can be used to protect some subset of traffic against loss, but this necessarily increases packet loss for the remaining traffic.

Lastly, Active Queue Management (AQM) and Explicit Congestion Notification (ECN) can be used to mitigate both packet loss and queueing delay (and thus one-way delay, round-trip time and delay variation).

References

  • A One-way Packet Loss Metric for IPPM, G. Almes, S. Kalidindi, M. Zekauskas, September 1999, RFC 2680
  • Improving Accuracy in End-to-end Packet Loss Measurement, J. Sommers, P. Barford, N. Duffield, A. Ron, August 2005, SIGCOMM'05 (PDF)
-- SimonLeinen - 01 Nov 2004

Packet Reordering

The Internet Protocol (IP) does not guarantee that packets are delivered in the order in which they were sent. This was a deliberate design choice that distinguishes IP from protocols such as, for instance, ATM and IEEE 802.3 Ethernet.

Decomposition

A network usually reorders packets because of some kind of parallelism, either because of a choice of alternative routes (Equal Cost Multipath, ECMP), or because of internal parallelism inside switching elements such as routers. One particular kind of packet reordering concerns packets of different sizes. Because a larger packet takes longer to transfer over a serial link (or a limited-width backplane inside a router), larger packets may be "overtaken" by smaller packets that were sent subsequently. This is usually not a concern for high-speed bulk transfers, where the segments tend to be equal-sized (hopefully Path MTU-sized), but may pose problems for naive implementations of multi-media (Audio/Video) transport.

Impact on end-to-end performance

In principle, applications that use a transport protocol such as TCP or SCTP don't have to worry about packet reordering, because the transport protocol is responsible for reassembling the byte stream (message stream(s) in the case of SCTP) into the original ordering. However, reordering can have a severe performance impact on some implementations of TCP. Recent TCP implementations, in particular those that support Selective Acknowledgements (SACK), can exhibit robust performance even in the face of reordering in the network.

Real-time media applications such as audio/video conferencing tools often experience problems when run over networks that reorder packets. This is somewhat remarkable in that all of these applications have jitter buffers to eliminate the effects of delay variation on the real-time media streams. Apparently, the code that manages these jitter buffers is often not written in a way that accommodates reordered packets sensibly, although this could be done with moderate effort.

Measurement

Packet reordering is measured by sending a numbered sequence of packets, and comparing the received sequence number sequence with the original. There are many possible ways of quantifying reordering, some of which are described in the IPPM Working Group's documents (see the references below).
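One simple way of quantifying reordering, similar in spirit to the reordered-packet ratio used in the IPPM work, is sketched below. A packet counts as reordered when its sequence number is lower than the next one expected. The function name is made up for illustration, and the sketch ignores losses and duplicates.

```python
def reordered_ratio(received_seqs):
    """Fraction of received packets that arrived later than a higher-numbered one."""
    next_expected = None
    reordered = 0
    for seq in received_seqs:
        if next_expected is None or seq >= next_expected:
            next_expected = seq + 1      # in order: advance the expectation
        else:
            reordered += 1               # arrived after a later packet
    total = len(received_seqs)
    return reordered / total if total else 0.0

# Packets 3 and 6 were overtaken by their successors.
print(reordered_ratio([1, 2, 4, 3, 5, 7, 6, 8]))
```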

The measurements here can be done differently, depending on the measurement purpose:

a) Reordering for a particular application can be measured by capturing the application traffic (e.g. using the Wireshark/Ethereal tool), injecting the same traffic pattern via a traffic generator, and calculating the reordering.

b) The maximal reordering introduced by the network can be measured by injecting a relatively small amount of traffic at line rate, shaped as a short burst of long packets immediately followed by a short burst of short packets. After capture and calculation on the other end of the path, the results will reflect the worst possible packet reordering situation that may occur on the particular path.

For more information please refer to the measurement page, available at packet reordering - measurement site. Please note that although the tool is based on older versions of the reordering drafts, the metrics used are compatible with the definitions from the newer ones.

In particular Reorder Density and Reorder Buffer-Occupancy Density can be obtained from tcpdump by subjecting the dump files to RATtool, available at Reordering Analysis Tool Website

Improving reordering

The probability of reordering can be reduced by avoiding parallelism in the network, or by using network nodes that take care to use parallel paths in such a way that packets belonging to the same end-to-end flow are kept on a single path. This can be ensured by using a hash on the destination address, or the (source address, destination address) pair to select from the set of available paths. For certain types of parallelism this is hard to achieve, for example when a simple striping mechanism is used to distribute packets from a very high-speed interface to multiple forwarding paths.
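The hash-based path selection described above can be sketched as follows. The choice of CRC-32 as hash function and the function name are assumptions for illustration; actual routers use their own (often hardware-specific) hash functions and may include further header fields.

```python
import zlib

def select_path(src_addr, dst_addr, num_paths):
    """Map a (source, destination) address pair to one of num_paths paths."""
    key = f"{src_addr}|{dst_addr}".encode("ascii")
    return zlib.crc32(key) % num_paths

# All packets of the same address pair map to the same path index.
for _ in range(3):
    print(select_path("192.0.2.1", "198.51.100.7", 4))
```

Because the result depends only on the address pair, packets of one flow cannot overtake each other on different paths, at the price of a less even load distribution when few flows dominate the traffic.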

References

-- SimonLeinen - 2004-10-28 - 2014-12-29
-- MichalPrzybylski - 2005-07-19
-- BartoszBelter - 2006-03-28

Maximum Transmission Unit (MTU)

The MTU (or to be exact, 'protocol MTU') of a link (logical IP subnet) describes the maximum size of an IP packet that can be transferred over the link without fragmentation. Common MTUs include

  • 1500 bytes (Ethernet, 802.11 WLAN)
  • 4470 bytes (FDDI, common default for POS and serial links)
  • 9000 bytes (Internet2 and GÉANT convention, limit of some Gigabit Ethernet adapters)
  • 9180 bytes (ATM, SMDS)

For entire network paths, see PathMTU. For specific information on configuring large (Jumbo) MTUs, see JumboMTU.

The term 'media MTU' is used to refer to the maximum-sized Layer 2 PDU that a given interface can support. The media MTU must be equal to or greater than the sum of the protocol MTU and the Layer 2 header.

References

-- SimonLeinen - 27 Oct 2004
-- TobyRodwell - 24 Jan 2005
-- MarcoMarletta - 20 Jan 2006

Path MTU

The Path MTU is the Maximum Transmission Unit (MTU) supported by a network path. It is the minimum of the MTUs of the links (segments) that make up the path. Larger Path MTUs generally allow for more efficient data transfers, because source and destination hosts, as well as the switching devices (routers) along the network path have to process fewer packets. However, it should be noted that modern high-speed network adapters have mechanisms such as LSO (Large Send Offload) and Interrupt Coalescence that diminish the influence of MTUs on performance. Furthermore, routers are typically dimensioned to sustain very high packet loads (so that they can resist denial-of-service attacks) so the packet rates caused by high-speed transfers are not normally an issue for today's high-speed networks.

The prevalent Path MTU on the Internet is now 1500 bytes, the Ethernet MTU. There are some initiatives to support larger MTUs (JumboMTU) in networks, in particular on research networks. But their usability is hampered by last-mile issues and lack of robustness of RFC 1191 Path MTU Discovery.

Path MTU Discovery Mechanisms

Traditional (RFC1191) Path MTU Discovery

RFC 1191 describes a method for a sender to detect the Path MTU to a given receiver. (RFC 1981 describes the equivalent for IPv6.) The method works as follows:

  • The sending host will send packets to off-subnet destinations with the "Don't Fragment" bit set in the IP header.
  • The sending host keeps a cache containing Path MTU estimates per destination host address. This cache is often implemented as an extension of the routing table.
  • The Path MTU estimate for a new destination address is initialized to the MTU of the outgoing interface over which the destination is reached according to the local routing table.
  • When the sending host receives an ICMP "Too Big" (or "Fragmentation Needed and Don't Fragment Bit Set") destination-unreachable message, it learns that the Path MTU to that destination is smaller than previously assumed, and updates the estimate accordingly.
  • Normally, an ICMP "Too Big" message contains the next-hop MTU, and the sending host will use that as the new Path MTU estimate. The estimate can still be wrong because a subsequent link on the path may have an even smaller MTU.
  • For destination addresses with a Path MTU estimate lower than the outgoing interface MTU, the sending host will occasionally attempt to raise the estimate, in case the path has changed to support a larger MTU.
  • When trying ("probing") a larger Path MTU, the sending host can use a list of "common" MTUs, such as the MTUs associated with popular link layers, perhaps combined with popular tunneling overheads. This list can also be used to guess a smaller MTU in case an ICMP "Too Big" message is received that doesn't include any information about the next-hop MTU (maybe from a very very old router).
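The core of the steps above can be sketched as a small simulation. It assumes an idealized path on which every router reliably returns an ICMP "Too Big" message that carries the next-hop MTU, which is exactly the assumption that often fails in practice. The function names and the example path are made up for illustration.

```python
def first_too_big(link_mtus, packet_size):
    """Return the MTU of the first link that cannot carry the packet, else None."""
    for mtu in link_mtus:
        if packet_size > mtu:
            return mtu
    return None

def discover_path_mtu(link_mtus):
    """Simulate the sender's estimate; return (final estimate, history)."""
    estimate = link_mtus[0]            # start with the outgoing interface MTU
    history = [estimate]
    while True:
        next_hop_mtu = first_too_big(link_mtus, estimate)
        if next_hop_mtu is None:       # the packet got through
            return estimate, history
        estimate = next_hop_mtu        # value reported by ICMP "Too Big"
        history.append(estimate)

print(discover_path_mtu([9000, 4470, 1500, 1500]))
```

The history shows that the estimate may have to be lowered several times, once for each successively smaller link on the path.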

This method is widely implemented, but is not robust in today's Internet because it relies on ICMP packets sent by routers along the path. Such packets are often suppressed either at the router that should generate them (to protect its resources) or on the way back to the source, because of firewalls and other packet filters or rate limitations. These problems are described in RFC2923. When packets are lost due to MTU issues without any ICMP "Too Big" message, this is sometimes called a (MTU) black hole. Some operating systems have added heuristics to detect such black holes and work around them. Workarounds can include lowering the MTU estimate or disabling PMTUD for certain destinations.

Packetization Layer Path MTU Discovery (PLPMTUD, RFC 4821)

An IETF Working Group (pmtud) was chartered to define a new mechanism for Path MTU Discovery to solve these issues. This process resulted in RFC 4821, Packetization Layer Path MTU Discovery ("PLPMTUD"), which was published in March 2007. This scheme requires cooperation from a network layer above IP, namely the layer that performs "packetization". This could be TCP, but could also be a layer above UDP, let's say an RPC or file transfer protocol. PLPMTUD does not require ICMP messages. The sending packetization layer starts with small packets, and probes progressively larger sizes. When there's an indication that a larger packet was successfully transmitted to the destination (presumably because some sort of ACK was received), the Path MTU estimate is raised accordingly.

When a large packet was lost, this might have been due to an MTU limitation, but it might also be due to other causes, such as congestion or a transmission error - or maybe it's just the ACK that was lost! PLPMTUD recommends that the first time this happens, the sending packetization layer should assume an MTU issue, and try smaller packets. An isolated incident need not be interpreted as an indication of congestion.
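The probing idea can be sketched as a search for the largest size that gets through. The bisection strategy, the function names and the size bounds are assumptions for illustration and are not prescribed by the text above. The sketch also treats every lost probe as an MTU problem, whereas a real implementation has to cope with probes lost for other reasons.

```python
def probe_path_mtu(probe_ok, low, high):
    """Find the largest size in [low, high] for which probe_ok(size) is true.

    probe_ok(size) stands for "a probe of this size was acknowledged";
    low is assumed to be a size that is known to work.
    """
    while low < high:
        candidate = (low + high + 1) // 2
        if probe_ok(candidate):
            low = candidate            # probe acknowledged: raise the estimate
        else:
            high = candidate - 1       # probe lost: assume it was too big
    return low

# Hypothetical path that silently drops everything larger than 1500 bytes.
print(probe_path_mtu(lambda size: size <= 1500, low=512, high=9000))
```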

An implementation of the new scheme for the Linux kernel was integrated into version 2.6.17. It is controlled by a "sysctl" value that can be observed and set through /proc/sys/net/ipv4/tcp_mtu_probing. Possible values are:

  • 0: Don't perform PLPMTUD
  • 1: Perform PLPMTUD only after detecting a "blackhole" in old-style PMTUD
  • 2: Always perform PLPMTUD, and use the value of tcp_base_mss as the initial MSS.
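A small helper for inspecting this setting might look as follows. It only reads the /proc file named above; the function name and the wording of the descriptions are made up for illustration. On systems without this file it simply reports that the value is unavailable.

```python
from pathlib import Path

MODES = {
    0: "PLPMTUD disabled",
    1: "PLPMTUD only after a black hole is detected",
    2: "PLPMTUD always on, starting from tcp_base_mss",
}

def describe_mtu_probing(path="/proc/sys/net/ipv4/tcp_mtu_probing"):
    """Return a short description of the current tcp_mtu_probing value."""
    try:
        value = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return "unavailable"
    return MODES.get(value, f"unknown value {value}")

print(describe_mtu_probing())
```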

A user-space implementation over UDP is included in the VFER bulk transfer tool.

References

Implementations

  • for Linux 2.6 - integrated in mainstream kernel as of 2.6.17. However, it is disabled by default (see net.ipv4.tcp_mtu_probing sysctl)
  • for NetBSD

-- Hank Nussbacher - 2005-07-03 -- Simon Leinen - 2006-07-19 - 2017-10-30

Network Protocols

This section contains information about a few common Internet protocols, with a focus on performance questions.

For information about higher-level protocols, see under application protocols.

-- SimonLeinen - 2004-10-31 - 2015-04-26

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in RFC 793 in September 1981, TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP is selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. With SACK, the receiver can additionally report blocks of data that it has received beyond a gap. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents, M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol, J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-07, Transmission Control Protocol Specification, Wesley M. Eddy, November 2017
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options, M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry, H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2017-11-13 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender can transmit up to this amount of data before having to wait for a further window update from the receiver, and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been ACKed by the receiver, so that the data can be retransmitted if necessary. With each ACK, the acknowledged segment leaves the window, and a new segment can be sent if it fits into the (possibly updated) window.

Due to TCP's flow control mechanism, the TCP window size can limit the maximum theoretical throughput regardless of the bandwidth of the network path. A TCP window that is too small keeps throughput below what the path could deliver, and a window that is too large can also hurt performance, in particular during error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) >= Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: about 0.66 Mbit/sec (8192 bytes x 8 bits / 0.1 sec = 655,360 bit/sec)
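The arithmetic behind this example can be reproduced with a few lines of code. The function names are made up for illustration. Note that 655,360 bit/sec corresponds to about 0.66 Mbit/sec when a megabit is taken as 10^6 bits, and to 0.625 when dividing by 2^20.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput (bit/sec) for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds

def required_window_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: window needed to fill a path of this capacity."""
    return bandwidth_bps / 8 * rtt_seconds

# The example above: 8192-byte window, 100 ms round-trip time.
print(max_throughput_bps(8192, 0.100))

# Window needed to fill a 1 Gbit/sec path with the same round-trip time.
print(required_window_bytes(1e9, 0.100))
```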

References

-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16-bit value (bytes 15 and 16 in the TCP header) and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achievable over a TCP connection since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achievable throughput can never be more than WINDOW_SIZE/RTT, so for a trans-Atlantic link, with say an RTT (Round Trip Time) of 150ms, the throughput is limited to a maximum of about 3.5 Mbit/s (65535 bytes x 8 bits / 0.15 s). With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size from 64 KB to 1 GB, by shifting the window field left by up to 14 bits. The window scale option is sent only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).

It is important to use the TCP timestamps option with large TCP windows. With this option, each segment contains a timestamp. The receiver returns that timestamp in each ACK, which allows the sender to estimate the RTT. The timestamps also solve the problem of wrapped sequence numbers, which can occur with large windows at high data rates (PAWS - Protection Against Wrapped Sequences).
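To see why wrapped sequence numbers become a concern at high speeds, it helps to compute how long it takes to send 2^32 bytes, the size of TCP's 32-bit sequence number space. The sketch below does this for a few line rates; the function name and the chosen rates are made up for illustration.

```python
SEQUENCE_SPACE_BYTES = 2 ** 32   # TCP sequence numbers are 32 bits wide

def wrap_time_seconds(rate_bps):
    """Time needed to send one full sequence space at the given bit rate."""
    return SEQUENCE_SPACE_BYTES * 8 / rate_bps

for rate_bps in (10e6, 1e9, 10e9):
    print(f"{rate_bps / 1e6:>8.0f} Mbit/s: {wrap_time_seconds(rate_bps):10.1f} s")
```

At 10 Mbit/s the sequence space lasts for almost an hour, but at 1 Gbit/s it wraps after about 34 seconds, so an old duplicate segment could plausibly still be in the network when its sequence number comes around again.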

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to do this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - then a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: The system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM)
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher are the risks that this buffer overflows and causes multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014

TCP Acknowledgements

In TCP's sliding-window scheme, the receiver acknowledges the data it receives, so that the sender can advance the window and send new data. As originally specified, TCP's acknowledgements ("ACKs") are cumulative: the receiver tells the sender how much consecutive data it has received. More recently, selective acknowledgements were introduced to allow more fine-grained acknowledgements of received data.

Delayed Acknowledgements and "ack every other"

RFC 813 first suggested a delayed acknowledgement (Delayed ACK) strategy, where a receiver doesn't always immediately acknowledge segments as it receives them. This recommendation was carried forth and specified in more detail in RFC 1122 and RFC 5681 (formerly known as RFC 2581). RFC 5681 mandates that an acknowledgement be sent for at least every other full-size segment, and that no more than 500ms expire before any segment is acknowledged.

The resulting behavior is that, for longer transfers, acknowledgements are only sent for every two segments received ("ack every other"). This is in order to reduce the amount of reverse flow traffic (and in particular the number of small packets). For transactional (request/response) traffic, the delayed acknowledgement strategy often makes it possible to "piggy-back" acknowledgements on response data segments.

A TCP segment which is only an acknowledgement i.e. has no payload, is termed a pure ack.

Delayed Acknowledgments should be taken into account when doing RTT estimation. As an illustration, see this change note for the Linux kernel from Gavin McCullagh.

Critique on Delayed ACKs

John Nagle nicely explains problems with current implementations of delayed ACK in a comment to a thread on Hacker News:

Here's how to think about that. A delayed ACK is a bet. You're betting that there will be a reply, upon which an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That's just wrong.

The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

Duplicate Acknowledgements

A duplicate acknowledgement (DUPACK) is one with the same acknowledgement number as its predecessor - it signifies that the TCP receiver has received a segment newer than the one it was expecting, i.e. it has missed a segment. The missed segment might not be lost, it might just be re-ordered. For this reason the TCP sender will not assume data loss on the first DUPACK but (by default) on the third DUPACK, when it will, as per RFC 5681, perform a "Fast Retransmit", by sending the segment again without waiting for a timeout. DUPACKs are never delayed; they are sent as soon as the TCP receiver detects an out-of-order segment.
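The sender-side counting can be sketched as follows. This is a simplified illustration with a made-up function name. It looks only at acknowledgement numbers and ignores the additional conditions that a real TCP applies before counting an ACK as a duplicate (for example that the segment carries no data and does not change the window).

```python
def fast_retransmit_points(ack_numbers, threshold=3):
    """Return the ACK numbers at which a Fast Retransmit would be triggered."""
    triggers = []
    last_ack = None
    dup_count = 0
    for ack in ack_numbers:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:   # third duplicate: retransmit now
                triggers.append(ack)
        else:
            last_ack = ack               # new data acknowledged: start over
            dup_count = 0
    return triggers

# The segment starting at byte 2000 is missing; later segments cause DUPACKs.
print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 2000, 6000]))
```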

"Quickack mode" in Linux

The TCP implementation in Linux has a special receive-side feature that temporarily disables delayed acknowledgements/ack-every-other when the receiver thinks that the sender is in slow-start mode. This speeds up the slow-start phase when RTT is high. This "quickack mode" can also be explicitly enabled or disabled with setsockopt() using the TCP_QUICKACK option. The Linux modification has been criticized because it makes Linux' slow-start more aggressive than other TCPs' (that follow the SHOULDs in RFC 5681), without sufficient validation of its effects.
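A minimal sketch of setting this option from an application is shown below. The helper name is made up for illustration. The TCP_QUICKACK constant is only available on Linux, so the sketch falls back gracefully elsewhere. As far as the Linux documentation describes it, the setting is not permanent, so applications that depend on it typically set it again after receiving data.

```python
import socket

def set_quickack(sock, enabled=True):
    """Try to switch quickack mode on a TCP socket; return True on success."""
    option = getattr(socket, "TCP_QUICKACK", None)   # defined on Linux only
    if option is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, option, 1 if enabled else 0)
    except OSError:
        return False
    return True

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(set_quickack(sock))
sock.close()
```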

"Stretch ACKs"

Techniques such as LRO and GRO (and to some level, Delayed ACKs as well as ACKs that are lost or delayed on the path) can cause ACKs to cover many more than the two segments suggested by the historic TCP standards. Although ACKs in TCP have always been defined as "cumulative" (with the exception of SACKs), some congestion control algorithms have trouble with ACKs that are "stretched" in this way. In January 2015, a patch set was submitted to the Linux kernel network development with the goal to improve the behavior of the Reno and CUBIC congestion control algorithms when faced with stretch ACKs.

References

  • TCP Congestion Control, RFC 5681, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 813, Window and Acknowledgement Strategy in TCP, D. Clark, July 1982
  • RFC 1122, Requirements for Internet Hosts -- Communication Layers, R. Braden (Ed.), October 1989

-- TobyRodwell - 2005-04-05
-- SimonLeinen - 2007-01-07 - 2015-01-28

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65535 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65535 bytes. The way this works is that in the initial handshake, a TCP speaker announces a scaling factor, which is a power of two between 2^0 (no scaling) and 2^14, to allow for an effective window of up to 2^30 bytes (one Gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even with just a scaling factor of 2^0).
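The relationship between the 16-bit window field, the shift count and the effective window can be sketched as follows. The function names are made up for illustration, and the way an actual TCP stack picks its shift count may differ.

```python
MAX_SHIFT = 14   # largest shift count allowed for the window scale option

def shift_for_window(desired_window_bytes):
    """Smallest shift count with which a 16-bit field can express the window."""
    shift = 0
    while (0xFFFF << shift) < desired_window_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift

def effective_window(window_field, shift):
    """Window in bytes that a 16-bit field value represents with this shift."""
    return window_field << shift

# A 12.5 MB window (1 Gbit/s path with 100 ms RTT) needs a shift count of 8.
print(shift_for_window(12_500_000))
# The largest window that can be expressed is just below 2^30 bytes.
print(effective_window(0xFFFF, MAX_SHIFT))
```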

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment are mainly based on two key mechanisms: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance. (RFC 793 and RFC 5681)

Slow Start

To avoid that a starting TCP connection floods the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used and the effective window size is the lesser of the two. The starting value of the cwnd window is set initially to a value that has been evolving over the years, the TCP Initial Window. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or, to be strictly accurate, half the "Flight Size"; this distinction is important if the implementation lets cwnd grow beyond rwnd, the receiver's declared window). cwnd is then set either to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes cwnd to close exponentially over successive loss events.
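The interplay of cwnd and ssthresh can be sketched in a few lines. This is a simplified model under stated assumptions: windows are counted in segments, growth is applied once per loss-free RTT rather than per ACK, and the function names are hypothetical. The lower bound of two segments for ssthresh comes from RFC 5681.

```python
def grow_one_rtt(cwnd, ssthresh):
    """Window after one loss-free RTT, in segments."""
    if cwnd < ssthresh:
        # Slow Start: exponential growth, capped at ssthresh
        return min(cwnd * 2, ssthresh)
    # Congestion Avoidance: additive increase of one segment per RTT
    return cwnd + 1


def on_congestion(flight_size, timeout):
    """Return (cwnd, ssthresh) after a congestion event."""
    ssthresh = max(flight_size // 2, 2)
    # Timeout: restart from one segment. Duplicate ACKs: resume at ssthresh
    # (the value cwnd takes once Fast Recovery has terminated).
    cwnd = 1 if timeout else ssthresh
    return cwnd, ssthresh


if __name__ == "__main__":
    print(on_congestion(40, timeout=False))
    print(on_congestion(40, timeout=True))
```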

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast Recovery, the receipt of three duplicate ACKs, while being taken to mean the loss of a segment, does not result in a full Slow Start. This is because later segments obviously got through, and hence congestion is not stopping everything. In Fast Recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit) and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver, and cwnd is increased by one segment to allow the transmission of another segment if allowed by the new cwnd. When an ACK is received for new data, cwnd is reset to ssthresh, and TCP enters Congestion Avoidance mode.
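The sequence of window adjustments during Fast Retransmit and Fast Recovery can be traced with a toy state machine. This is a sketch only: the class `RenoSender` and its attributes are invented for illustration, windows are counted in segments, and cwnd is used as a stand-in for the current send window size.

```python
class RenoSender:
    """Toy model of Reno's reaction to duplicate ACKs (windows in segments)."""

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.in_fast_recovery = False
        self.fast_retransmits = 0

    def on_duplicate_ack(self):
        self.dup_acks += 1
        if self.in_fast_recovery:
            # Each further duplicate ACK means one more segment left the network.
            self.cwnd += 1
        elif self.dup_acks == 3:
            self.ssthresh = max(self.cwnd // 2, 2)
            self.fast_retransmits += 1           # resend the missing segment now
            self.cwnd = self.ssthresh + 3        # account for the three dup ACKs
            self.in_fast_recovery = True

    def on_new_ack(self):
        if self.in_fast_recovery:
            self.cwnd = self.ssthresh            # deflate, continue in Congestion Avoidance
            self.in_fast_recovery = False
        self.dup_acks = 0


if __name__ == "__main__":
    sender = RenoSender(cwnd=20, ssthresh=64)
    for _ in range(4):
        sender.on_duplicate_ack()
    print(sender.cwnd, sender.ssthresh)
    sender.on_new_ack()
    print(sender.cwnd, sender.ssthresh)
```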

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

High-Speed TCP Variants

There have been numerous ideas for improving TCP over the years. Some of those ideas have been adopted by mainstream operations (after thorough review). Recently there has been an upsurge in work towards improving TCP's behavior with LongFatNetworks. It has been shown that the current congestion control algorithms limit the efficiency of network resource utilization. The various new TCP implementations change how the size of the congestion window is determined. Some of them require extra feedback from the network. Generally, all such congestion control protocols are divided as follows.

An orthogonal technique for improving TCP's performance is automatic buffer tuning.

Comparative Studies

All papers about individual TCP-improvement proposals contain comparisons against older TCP to quantify the improvement. There are several studies that compare the performance of the various new TCP variants, including

Other References

  • Gigabit TCP, G. Huston, The Internet Protocol Journal, Vol. 9, No. 2, June 2006. Contains useful descriptions of many modern TCP enhancements.
  • Faster, G. Huston, June 2005. This article from ISP Column looks at various approaches to TCP transfers at very high rates.
  • Congestion Control in the RFC Series, M. Welzl, W. Eddy, July 2006, Internet-Draft (work in progress)
  • ICCRG Wiki, IRTF (Internet Research Task Force) ICCRG (Internet Congestion Control Research Group), includes bibliography on congestion control.
  • RFC 6077: Open Research Issues in Internet Congestion Control, D. Papadimitriou (Ed.), M. Welzl, M. Scharf, B. Briscoe, February 2011

Related Work

  • PFLDnet (International Workshop on Protocols for Fast Long-Distance Networks): 2003, 2004, 2005, 2006, 2007, 2009, 2010. (The 2008 pages seem to have been removed from the Net.)

-- Simon Leinen - 2005-05-28–2016-09-18 -- Orla Mc Gann - 2005-10-10 -- Chris Welti - 2008-02-25

Explicit Control Protocol (XCP)

The eXplicit Control Protocol (XCP) is a new congestion control protocol developed by Dina Katabi from the MIT Computer Science & Artificial Intelligence Lab. It extracts information about congestion from routers along the path between endpoints. It is more complicated to implement than other proposed Internet congestion control protocols.

XCP-capable routers make a fair per-flow bandwidth allocation without keeping per-flow state themselves; the per-flow congestion state is carried in the packets. To request the desired throughput, the sender includes a congestion header in its packets, located between the IP and transport headers. This enables the sender to learn about the bottleneck on the path from the sender to the receiver in a single round trip.

To increase the congestion window of a TCP connection, the sender requires feedback from the network about the maximum throughput available along the path. The routers update this information in the congestion header as it moves from the sender to the receiver. The main task of the receiver is to copy the network feedback into outgoing packets belonging to the same bidirectional flow.

The congestion header consists of four fields as follows:

  • Round-Trip Time (RTT): The current round-trip time for the flow.
  • Throughput: The throughput used currently by the sender.
  • Delta_Throughput: The value by which the sender would like to change its throughput. This field is updated by the routers along the path. If one of the routers sets a negative value, the sender must slow down.
  • Reverse_Feedback: The value that is copied from the Delta_Throughput field and returned by the receiver to the sender. It contains the maximum feedback allocation from the network.
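The way the four header fields travel around the loop can be sketched as follows. This is an illustration of the data flow only, not of the XCP control algorithms: the class and function names are hypothetical, throughput is in arbitrary units, and the rule that a router only ever lowers Delta_Throughput is an assumption made here so that the most constrained router along the path determines the feedback.

```python
from dataclasses import dataclass


@dataclass
class CongestionHeader:
    rtt: float                 # sender's current round-trip time estimate
    throughput: float          # throughput currently used by the sender
    delta_throughput: float    # change the sender asks for; routers may reduce it
    reverse_feedback: float = 0.0


def router_update(header, allowed_delta):
    """A router lowers the requested change if it can grant less (assumption)."""
    header.delta_throughput = min(header.delta_throughput, allowed_delta)


def receiver_echo(header):
    """The receiver copies the network feedback into the reverse direction."""
    header.reverse_feedback = header.delta_throughput


def sender_apply(header):
    """New sender throughput after one round trip; negative feedback slows it down."""
    return max(header.throughput + header.reverse_feedback, 0.0)


if __name__ == "__main__":
    hdr = CongestionHeader(rtt=0.1, throughput=1000.0, delta_throughput=500.0)
    for allowed in (200.0, -100.0, 300.0):   # three routers along the path
        router_update(hdr, allowed)
    receiver_echo(hdr)
    print(sender_apply(hdr))
```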

A router that implements XCP maintains two control algorithms, executed periodically. The first of them, implemented in the congestion controller, is responsible for specifying maximum use of the outbound port. The fairness controller is responsible for fair distribution of throughput to flows sharing the link.

Benefits of XCP

  • it achieves fairness and rapidly converges to fairness,
  • it achieves maximum link utilization (better bottleneck link utilization),
  • it learns about the bottleneck much more rapidly,
  • it is potentially applicable to any transport protocol,
  • it is more stable on long-distance connections with large RTTs.

Deployment issues and issues with XCP

  • has to be tested in real network environments,
  • it should describe solutions for tunneling protocols,
  • has to be supported by routers along the path,
  • the work load for the routers is higher.

References

-- Wojtek Sronek -2005-07-08 -- Simon Leinen - 2006-04-01–2016-05-30

Source Quench

Source Quench is an ICMP-based mechanism used by network devices to inform a data sender that packets cannot be forwarded because of buffer overload. When the message is received by a TCP sender, that sender should decrease its send window to the respective destination in order to limit outgoing traffic. The use of ICMP Source Quench is specified in RFC 896 - Congestion Control in IP/TCP Internetworks. The current standard, RFC 1812 - Requirements for IP Version 4 Routers, says that routers should not originate this message, and therefore Source Quench should not be used any more.

Problems with Source Quench

There are several reasons why Source Quench has fallen out of favor as a congestion control mechanism over the years:

  1. Source Quench messages can be lost on their way to the sender (due to congestion on the return path, filtering/return routing issues etc.). A congestion control mechanism should be robust against these sorts of problems.
  2. A Source Quench message carries very little information per packet, namely only that some amount of congestion was sensed at the gateway (router) that sent it.
  3. Source Quench messages, like all ICMP messages, are expensive for a router to generate. This is bad because the congestion control mechanism could contribute additional congestion, if router processing resources become a bottleneck.
  4. Source Quench messages could be abused by a malevolent third party to slow down connections, causing denial of service.

In effect, ICMP Source Quench messages are almost never generated on the Internet today, and would be ignored almost everywhere if they still existed.

References

  • Congestion Control in IP/TCP Internetworks, RFC 896, J. Nagle, January 1984
  • Something a Host Could Do with Source Quench: The Source Quench Introduced Delay (SQuID), RFC 1016, W. Prue and J. Postel, July 1987
  • Requirements for IP Version 4 Routers, RFC 1812, F. Baker (Ed.), June 1995
  • IETF-discuss message with notes on the history of Source Quench, F. Baker, Feb. 2007 - in archive

-- WojtekSronek - 05 Jul 2005

Explicit Congestion Notification (ECN)

The Internet's end-to-end rate control schemes (notably the TransmissionControlProtocol (TCP)) traditionally have to rely on packet loss as the prime indicator of congestion (queueing delay is also used for congestion feedback, although mostly implicitly, except in newer mechanisms such as TCP FAST).

Both loss and delay are implicit signals of congestion. The alternative is to send explicit congestion signals.

ICMP Source Quench

Such a signal was indeed part of the original IP design, in the form of the "Source Quench" ICMP (Internet Control Message Protocol) message. A router experiencing congestion could send ICMP Source Quench messages towards a source, in order to tell that source to send more slowly. This method was deprecated by RFC1995 in June 1812. Oops, I meant by RFC1812 ("Requirements for IP Version 4 Routers") in June 1995. The biggest problems with ICMP Source Quench are:

  • that the mechanism causes more traffic to be generated in a situation of congestion (although in the other direction)
  • that, when ICMP Source Quench messages are lost, it fails to slow down the sender.

The ECN Proposal

The new ECN mechanism consists of two components:

  • Two new "ECN" bits in the former TOS field of the IP header:
    • The "ECN-Capable Transport" (ECT) bit must only be set for packets controlled by ECN-aware transports
    • The "Congestion Experienced" (CE) bit can be set by a router if
      • the router has detected congestion on the outgoing link
      • and the ECT bit is set.

  • Transport-specific protocol extensions which communicate the ECN signal back from the receiver to the sender. For TransmissionControlProtocol, this takes the form of two new flags in the TCP header, ECN-Echo (ECE) and Congestion Window Reduced (CWR). A similar mechanism has been included in SCTP.

ECN works as follows. When a transport supports ECN, it sends IP packets with ECT (ECN-Capable Transport) set. Then, when there is congestion, a router will set the CE (Congestion Experienced) bit in some of these packets. The receiver notices this, and sends a signal back to the sender (by setting the ECE flag). The sender then reduces its sending rate, as if it had detected the loss of a single packet, and sets the CWR flag so as to inform the receiver of this action.

(Note that the two-bit ECN field in the IP header has been redefined in the current ECN RFC (RFC3168), so that "ECT" and "CE" are no longer actual bits. But the old definition is somewhat easier to understand. If you want to know how these "conceptual" bits are encoded, please check out RFC 3168.)
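For readers who want to see the actual encoding, the following sketch lists the four codepoints of the two-bit ECN field as defined in RFC 3168 and shows what an ECN-capable router does when its AQM mechanism decides to signal congestion on a packet. The function names are made up for illustration; only the codepoint values and the position of the field (the two least-significant bits of the former TOS octet) are taken from the RFC.

```python
# ECN codepoints from RFC 3168 (two least-significant bits of the former TOS octet)
NOT_ECT = 0b00   # transport is not ECN-capable
ECT_1 = 0b01     # ECN-Capable Transport, codepoint ECT(1)
ECT_0 = 0b10     # ECN-Capable Transport, codepoint ECT(0)
CE = 0b11        # Congestion Experienced


def ecn_from_tos(tos_octet):
    """Extract the ECN field from the former TOS octet."""
    return tos_octet & 0b11


def signal_congestion(ecn_field):
    """Apply an AQM congestion signal to one packet.

    Returns (new_ecn_field, dropped). Packets from ECN-capable transports are
    marked with CE; all others can only be signalled by dropping them.
    """
    if ecn_field == NOT_ECT:
        return ecn_field, True
    return CE, False


if __name__ == "__main__":
    print(signal_congestion(ecn_from_tos(0xBA)))
    print(signal_congestion(ecn_from_tos(0xB8)))
```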

Benefits of ECN

ECN provides two significant benefits:

  • ECN-aware transports can properly adapt their rates to congestion without requiring packet loss
  • Congestion feedback can be quicker with ECN, because detecting a dropped packet requires a timeout.

For more detailed information, see the Internet-Draft The Benefits of using Explicit Congestion Notification (ECN) mentioned in the References below.

Deployment Issues with ECN

ECN requires AQM, which isn't widely deployed

ECN requires routers to use an Active Queue Management (AQM) mechanism such as Random Early Detection (RED). In addition, routers have to be able to mark eligible (ECT) packets with the CE bit when the AQM mechanism notices congestion. RED is widely implemented on routers today, although it is rarely activated in actual networks.

ECN must be added to routers' forwarding paths

The capability to ECN-mark packets can be added to CPU- or Network-Processor-based routing platforms relatively easily - Cisco's CPU-based routers such as the 7200/7500 routers support this with newer software, for example, but if queueing/forwarding is performed by specialized hardware (ASICs), this function has to be designed into the hardware from the start. Therefore, most of today's high-speed routers cannot easily support ECN to my knowledge.

ECN "Blackholing"

Another issue is that attempts to use ECN can cause issues with certain "middlebox" devices such as firewalls or load balancers, which break connectivity when unexpected TCP flags (or, more rarely, unexpected IP TOS values) are encountered. The original ECN RFC (RFC 2481) didn't handle this gracefully, so activating ECN on hosts that implement this version caused much frustration because of "hanging" connections. RFC 3168 proposes a mechanism to deal with ECN-unfriendly networks, but that hasn't been widely implemented yet. In particular, the Linux ECN implementation doesn't seem to implement it as of November 2007 (Linux 2.6.23).

See Floyd's ECN Problems page for more.

How to Activate ECN

On Linux hosts

ECN for both IPv4 and IPv6 is controlled by a sysctl parameter net.ipv4.tcp_ecn. According to ip-sysctl.txt, it may have one of three values: 0 means "never do ECN", 1 means "actively try to negotiate ECN", 2 means "do ECN when asked for". To activate on a running system:

echo 1 > /proc/sys/net/ipv4/tcp_ecn

To make persistent, add the following line to /etc/sysctl.conf:

net.ipv4.tcp_ecn=1

On Mac OS X

In a message from 23 March 2015 to the iccrg mailing list, Stuart Cheshire explains how to activate ECN on Mac OS X. The following commands activate it on the running system:

sysctl -w net.inet.tcp.ecn_negotiate_in=1
sysctl -w net.inet.tcp.ecn_initiate_out=1

To make it persistent, add the following to /etc/sysctl.conf:

net.inet.tcp.ecn_initiate_out=1
net.inet.tcp.ecn_negotiate_in=1

Suggested Improvements

For situations where ECN is not explicit enough, RFC 7514 (Really Explicit Congestion Notification (RECN)) suggests a protocol where senders can be told to back off in increasingly explicit terms.

References

-- Simon Leinen - 2005-01-07 - 2017-11-17

Rate Control Protocol (RCP)

Developed by Nandita Dukkipati and Nick McKeown at Stanford University, RCP aims to emulate processor sharing (PS) over a broad range of operating conditions. TCP's congestion control algorithm, and most of the other proposed alternatives such as ExplicitControlProtocol, try to emulate processor sharing by giving each competing flow an equal share of a bottleneck link. They emulate PS well in a static scenario when all flows are long-lived, but in scenarios where flows are short-lived, arrive randomly and have a finite amount of data to send, as is the case in today's Internet, they do not perform as well.

In RCP a router assigns a single rate to all flows that pass through it. The router does not keep flow-state nor does it do per-packet calculations. The flow rate is picked by routers based on the current queue occupancy and the aggregate input traffic rate.

The Algorithm

The basic RCP algorithm is as follows:

  1. Every router maintains a single fair-share rate, R(t), that it offers to all flows. It updates R(t) approximately once every RTT.
  2. Every packet header carries a rate field, Rp. When transmitted by the source, Rp = infinity. When a router receives a packet, if R(t) at the router is smaller than Rp, then Rp <- R(t); otherwise it's unchanged. The destination copies Rp into the acknowledgement packets, so as to notify the source. The packet header also carries an RTT field, RTTp; where RTTp is the source's current estimate of the RTT for the flow. When a router receives a packet it uses RTTp to update its moving average of the RTT of the flows passing through it, d0.
  3. The source transmits at a rate Rp, which corresponds to the smallest offered rate along the path.
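Step 2 of the algorithm amounts to taking a minimum over the routers on the path. The following sketch shows only that part; the function name and the example rates are hypothetical, and how each router computes its fair-share rate R(t) is deliberately left out.

```python
import math


def stamp_rate(router_rates):
    """Return the rate field Rp as it arrives at the destination.

    The source sends Rp = infinity; each router overwrites Rp with its own
    fair-share rate R(t) if that rate is smaller. The destination echoes the
    result back to the source in the acknowledgements.
    """
    rp = math.inf
    for r_t in router_rates:
        if r_t < rp:
            rp = r_t
    return rp


if __name__ == "__main__":
    # Fair-share rates (bits per second) offered by three routers on the path
    print(stamp_rate([10e6, 2.5e6, 8e6]))
```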

Papers

Processor Sharing Flows in the Internet, N. Dukkipati and N. McKeown, Stanford University High Performance Networking Group Technical Report TR04-HPNG-061604, June 2004

-- OrlaMcGann - 11 Oct 2005

HighSpeed TCP

HighSpeed TCP is a modification to the TCP congestion control mechanism, proposed by Sally Floyd from ICIR (The ICSI Center for Internet Research). The main problem with standard TCP on high-bandwidth, long-distance connections is that it takes a long time to recover fully from packet loss. Like other new TCP variants, HighSpeed TCP modifies the congestion control mechanism for TCP connections with large congestion windows. For current standard TCP with a steady-state packet loss rate p, the average congestion window is 1.2/sqrt(p) segments. This places a serious constraint on realistic environments.

Achieving an average TCP throughput of B bps requires an average congestion window W of BR/(8D) segments and a loss event at most every BR/(12D) round-trip times, where R is the round-trip time and D the packet size in bytes. For R = 0.1 seconds and D = 1500 bytes, a throughput of 10 Gbps would require an average congestion window of 83000 segments and a packet drop rate of at most 0.0000000002, that is, at most one congestion event every 5,000,000,000 packets or, equivalently, at most one congestion event every 1 2/3 hours. This is an unrealistic constraint and has a huge impact on TCP connections with large congestion windows. What is needed is a way to achieve high per-connection throughput without requiring unrealistically low packet loss rates.
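The numbers in this paragraph follow directly from the two formulas W = BR/(8D) and w = 1.2/sqrt(p). A short calculation, with a hypothetical function name, reproduces them; small differences from the rounded figures quoted above are to be expected.

```python
def standard_tcp_requirements(bps, rtt_s, packet_bytes):
    """Return (window_segments, max_loss_rate, hours_between_losses).

    window:    W = B*R / (8*D)
    loss rate: obtained by inverting w = 1.2 / sqrt(p)
    hours:     time to send 1/p packets at W packets per round-trip time
    """
    window = bps * rtt_s / (8 * packet_bytes)
    loss_rate = (1.2 / window) ** 2
    packets_per_second = window / rtt_s
    hours = (1.0 / loss_rate) / packets_per_second / 3600.0
    return window, loss_rate, hours


if __name__ == "__main__":
    w, p, h = standard_tcp_requirements(10e9, 0.1, 1500)
    print(f"window = {w:.0f} segments, loss rate <= {p:.1e}, one loss per {h:.2f} hours")
```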

HighSpeed TCP proposes changing the TCP response function w = 1.2/sqrt(p) to achieve high throughput with more realistic requirements for the steady-state packet drop rate. The modified HighSpeed TCP response function uses three parameters: Low_Window, High_Window, and High_P. It is identical to the Standard TCP response function when the current congestion window is at most Low_Window, and uses the HighSpeed response function when the current congestion window is greater than Low_Window. For an average congestion window w greater than Low_Window, the value of w is

HSTCP_w.png

where,

  • Low_P is the packet drop rate corresponding to Low_Window, using Standard TCP response function,
  • S is a constant as follows

HSTCP_S.png

The window needed to sustain 10 Gbps throughput on a HighSpeed TCP connection is 83000 segments, so High_Window could be set to this value. The packet drop rate needed in the HighSpeed response function to achieve an average congestion window of 83000 segments is 10^-7. For information on how to set the remaining parameters, see Sally Floyd's document HighSpeed TCP for Large Congestion Windows. With appropriately calibrated values, larger congestion windows can be sustained with far fewer round-trip times between congestion events, which yields greater throughput.

HighSpeed TCP modifies the increase parameter a(w) and the decrease parameter b(w) of AIMD (Additive Increase and Multiplicative Decrease). For Standard TCP, a(w) = 1 and b(w) = 1/2, regardless of the value of w. HighSpeed TCP uses the same values of a(w) and b(w) for w lower than Low_Window. Parameters a(w) and b(w) for HighSpeed TCP are specified below.

HSTCP_aw.png

HSTCP_bw.png

High_Decrease = 0.1 means that the congestion window is decreased by 10% when a congestion event occurs.

For example, for w = 83000, parameters a(w) and b(w) for Standard TCP and HSTCP are specified in table below.

        TCP    HSTCP
a(w)    1      72
b(w)    0.5    0.1
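The values in the table can be reproduced from the formulas in RFC 3649 (HighSpeed TCP for Large Congestion Windows). The sketch below assumes that RFC's default parameters; Low_Window = 38 and Low_P = 10^-3 are taken from the RFC and are not stated in the text above. The function names are chosen for illustration.

```python
import math

LOW_WINDOW = 38        # RFC 3649 default (assumption, not given in the text above)
LOW_P = 1e-3           # drop rate at Low_Window, RFC 3649 default (assumption)
HIGH_WINDOW = 83000
HIGH_P = 1e-7
HIGH_DECREASE = 0.1

# Exponent S of the HighSpeed response function w = (p / Low_P)^S * Low_Window
S = (math.log(HIGH_WINDOW) - math.log(LOW_WINDOW)) / (math.log(HIGH_P) - math.log(LOW_P))


def drop_rate(w):
    """Packet drop rate p(w) obtained by inverting the response function."""
    return LOW_P * (w / LOW_WINDOW) ** (1.0 / S)


def b(w):
    """Multiplicative decrease parameter."""
    if w <= LOW_WINDOW:
        return 0.5
    frac = (math.log(w) - math.log(LOW_WINDOW)) / (math.log(HIGH_WINDOW) - math.log(LOW_WINDOW))
    return (HIGH_DECREASE - 0.5) * frac + 0.5


def a(w):
    """Additive increase parameter, in segments per round-trip time."""
    if w <= LOW_WINDOW:
        return 1.0
    return w * w * drop_rate(w) * 2.0 * b(w) / (2.0 - b(w))


if __name__ == "__main__":
    for w in (38, 1000, 83000):
        print(w, round(a(w), 1), round(b(w), 2))
```

For w = 83000 this gives a(w) of about 72 and b(w) of 0.1, in line with the table.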

Summary

The following conclusions are based on results from HighSpeed TCP and Standard TCP connections sharing bandwidth. HSTCP flows share the available bandwidth fairly among themselves. HSTCP may take longer to converge than Standard TCP. The great disadvantage is that a HighSpeed TCP flow starves a Standard TCP flow even at relatively low or moderate bandwidth. The author of the HSTCP specification argues that few TCP connections operate effectively in this regime today, with large congestion windows, and that the benefits of the HighSpeed response function would therefore outweigh the unfairness experienced by Standard TCP in this regime. Another benefit of HSTCP is that it is easy to deploy, because no router support is needed.

Implementations

HS-TCP is included in Linux as a selectable option in the modular TCP congestion control framework. A few problems with the implementation in Linux 2.6.16 were found by D.X. Wei (see references). These problems were fixed as of Linux 2.6.19.2.

A proposed OpenSolaris project foresees the implementation of HS-TCP and several other congestion control algorithms for OpenSolaris.

References

-- WojtekSronek - 13 Jul 2005
-- HankNussbacher - 06 Oct 2005
-- SimonLeinen - 03 Dec 2006 - 17 Nov 2009

Hamilton TCP

Hamilton TCP (H-TCP), developed by the Hamilton Institute, is an enhancement to the existing TCP congestion control protocol, with a good mathematical and empirical basis. The authors assume that their improvement should behave like regular TCP in standard LAN and WAN networks, where RTT and link capacity are not extremely high. In long-distance, fast networks, however, H-TCP is far more aggressive and more flexible in adjusting to the available bandwidth and avoiding packet losses. Two modes are therefore used for congestion control, and the mode selection depends on the time since the last experienced congestion. In the regular TCP window control algorithm, the window is increased by adding a constant value α and decreased by multiplying by β. By default, these values are 1 and 0.5, respectively. In H-TCP the window estimation is a bit more complex, because both factors are not constants but are calculated during transmission.

The parameter values are estimated as follows. For each acknowledgment set:

HTCP_alpha.png

and then

HTCP_alpha2.png

On each congestion event set:

HTCP_beta.png

Where,

  • Δ_i is the time that elapsed since the last congestion experienced by i'th source,
  • Δ_L is the threshold for switching between fast and slow modes,
  • B_i(k) is the throughput of flow i immediately before the k'th congestion event,
  • RTT_max,i and RTT_min,i are maximum and minimum round trip time values for the i'th flow.
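Because the formulas themselves appear only as images, the following sketch assumes the commonly published form of the H-TCP update rules by Leith and Shorten. The constants used here (the factor 10, the quadratic term, the 20% throughput-change threshold, the bounds 0.5 and 0.8 on β, and Δ_L = 1 second) come from that form and should be checked against the original papers; the function names are hypothetical.

```python
def htcp_alpha(delta, beta, delta_l=1.0):
    """Increase factor alpha, computed on each acknowledgment.

    delta:   seconds since the last congestion event (Δ_i)
    delta_l: threshold between the slow and fast modes (Δ_L)
    """
    if delta <= delta_l:
        base = 1.0                        # behave like regular TCP
    else:
        t = delta - delta_l
        base = 1.0 + 10.0 * t + (t / 2.0) ** 2
    return 2.0 * (1.0 - beta) * base      # scale by the current backoff factor


def htcp_beta(b_prev, b_curr, rtt_min, rtt_max):
    """Backoff factor beta, computed on each congestion event.

    b_prev, b_curr: throughput just before the previous and the current
                    congestion event (B_i(k) and B_i(k+1))
    """
    if abs(b_curr - b_prev) / b_prev > 0.2:
        return 0.5                        # throughput changed a lot: fall back to 0.5
    return min(max(rtt_min / rtt_max, 0.5), 0.8)


if __name__ == "__main__":
    print(htcp_alpha(0.5, 0.5), htcp_alpha(3.0, 0.5))
    print(htcp_beta(100.0, 105.0, 0.09, 0.10))
```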

The α value depends on the time between experienced congestion events, while β depends on RTT and achieved bandwidth values. The constant values, as well as the Δ_L value, can be modified in order to adjust the algorithm's behavior to user expectations, but the default values were determined empirically and seem to be the most suitable in most cases. The figures below present the behavior of congestion window, throughput and RTT values during tests performed by the H-TCP authors. In the first case the β value was set to 0.5; in the second case the adaptive approach (explained with the equations above) was used. The throughput in the second case is far more stable and stays closer to its maximum value.

HTCP_figure1.jpg

HTCP_figure2.jpg

H-TCP congestion window and throughput behavior (source: "H-TCP: TCP for high-speed and long-distance networks", D. Leith, R. Shorten).

Summary

Another advantage of H-TCP is fairness and friendliness: the algorithm is not "greedy" and will not consume the whole link capacity, but shares the bandwidth fairly with other H-TCP flows or any other TCP-like flows. H-TCP congestion window control improves the dynamics of the sending rate and therefore provides better overall throughput than regular TCP implementations. More information is available on the Hamilton Institute web page (http://www.hamilton.ie/net/htcp/), including papers, test results and algorithm implementations for Linux kernels 2.4 and 2.6.

References

-- WojtekSronek - 13 Jul 2005 -- SimonLeinen - 30 Jan 2006 - 14 Apr 2008

TCP Westwood

Current Standard TCP implementations rely on packet loss as an indicator of network congestion. The problem is that TCP cannot distinguish congestion loss from loss caused by noisy links. As a consequence, Standard TCP reacts to any loss with a drastic reduction of the congestion window. On wireless connections, overlapping radio channels, signal attenuation and additional noise contribute heavily to such losses.

TCP Westwood (TCPW) is a small modification of the Standard TCP congestion control algorithm. When the sender perceives that congestion has occurred, it uses the estimated available bandwidth to set the congestion window and the slow start threshold. TCP Westwood avoids a drastic reduction of these values and ensures both faster recovery and effective congestion avoidance. It does not require any support from lower or higher layers and does not need any explicit congestion feedback from the network.

Bandwidth measurement in TCP Westwood relies on a simple relationship between the amount of data sent by the sender and the time at which acknowledgments are received. The bandwidth estimation process is somewhat similar to the one used for RTT estimation in Standard TCP, with additional improvements. When there are no ACKs because no packets were sent, the estimated value goes to zero. The formulas for the bandwidth estimation process are given in many documents about TCP Westwood. Such documents are listed on the UCLA Computer Science Department website.

Some issues can potentially disrupt the bandwidth estimation process, such as delayed ACKs or cumulative ACKs indicating out-of-order reception of segments. The source must therefore keep track of the number of duplicate ACKs, and should be able to detect delayed ACKs and act accordingly.

As stated above, the general idea is to use the bandwidth estimate to set the congestion window and the slow start threshold after a congestion episode. The algorithm for setting these values is shown below.

if (n duplicate ACKs are received) {
   ssthresh = (BWE * RTTmin)/seg_size;
   if (cwin > ssthresh) cwin = ssthresh; /* congestion avoid. */
}

where

  • BWE is the estimated bandwidth,
  • RTTmin is the smallest RTT value observed during the connection time,
  • seg_size is a length of TCP’s segment payload.
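The same rule can be written as a small runnable function. This is a sketch under stated assumptions: BWE is taken in bytes per second, RTTmin in seconds and seg_size in bytes, so that the windows come out in segments; the function name and the rounding down to whole segments are choices made here, not part of the algorithm above.

```python
def westwood_on_duplicate_acks(cwin, bwe_bytes_per_s, rtt_min_s, seg_size_bytes):
    """Return (cwin, ssthresh) in segments after n duplicate ACKs were received."""
    # Window that matches the estimated bandwidth over the shortest observed RTT
    ssthresh = int(bwe_bytes_per_s * rtt_min_s / seg_size_bytes)
    if cwin > ssthresh:
        cwin = ssthresh        # continue in congestion avoidance from here
    return cwin, ssthresh


if __name__ == "__main__":
    # 1.6 MB/s estimated bandwidth, 62.5 ms minimum RTT, 1000-byte segments
    print(westwood_on_duplicate_acks(150, 1_600_000, 0.0625, 1000))
```

In this example the window drops from 150 to 100 segments, rather than to 75 as plain halving would give, because the bandwidth estimate indicates that the path can carry 100 segments per round trip.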

Additional benefits

Test results for TCP Westwood show that its fairness is at least as good as that of widely used Standard TCP, even if two flows have different round-trip times. The main advantage of this TCP variant is that it does not starve other variants of TCP, so TCP Westwood can also be called friendly.

References

-- WojtekSronek - 27 Jul 2005

TCP Selective Acknowledgements (SACK)

Selective Acknowledgements are a refinement of TCP's traditional "cumulative" acknowledgements.

SACKs allow a receiver to acknowledge non-consecutive data, so that the sender can retransmit only what is missing at the receiver's end. This is particularly helpful on paths with a large bandwidth-delay product (BDP).

TCP may experience poor performance when multiple packets are lost from one window of data. With the limited information available from cumulative acknowledgments, a TCP sender can only learn about a single lost packet per round trip time. An aggressive sender could choose to retransmit packets early, but such retransmitted segments may have already been successfully received.

A Selective Acknowledgment (SACK) mechanism, combined with a selective repeat retransmission policy, can help to overcome these limitations. The receiving TCP sends back SACK packets to the sender informing the sender of data that has been received. The sender can then retransmit only the missing data segments.

Multiple packet losses from a window of data can have a catastrophic effect on TCP throughput. TCP uses a cumulative acknowledgment scheme in which received segments that are not at the left edge of the receive window are not acknowledged. This forces the sender to either wait a roundtrip time to find out about each lost packet, or to unnecessarily retransmit segments which have been correctly received. With the cumulative acknowledgment scheme, multiple dropped segments generally cause TCP to lose its ACK-based clock, reducing overall throughput. Selective Acknowledgment (SACK) is a strategy which corrects this behavior in the face of multiple dropped segments. With selective acknowledgments, the data receiver can inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost.

The selective acknowledgment extension uses two TCP options. The first is an enabling option, "SACK-permitted", which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection once permission has been given by SACK-permitted.
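How a sender can use SACK information to find out what to retransmit is sketched below. This is an illustration of the idea, not of any particular TCP stack: the function name is hypothetical, sequence numbers are plain byte offsets without wrap-around, and each SACK block is a (left edge, right edge) pair where the right edge is the first byte after the block.

```python
def missing_ranges(cumulative_ack, sack_blocks):
    """Return the byte ranges [start, end) the receiver has reported as holes.

    cumulative_ack: next byte expected by the receiver (classic ACK field)
    sack_blocks:    (left, right) pairs of data received above the cumulative ACK
    """
    holes = []
    edge = cumulative_ack
    for left, right in sorted(sack_blocks):
        if left > edge:
            holes.append((edge, left))    # gap between acknowledged data and this block
        edge = max(edge, right)
    return holes


if __name__ == "__main__":
    # Receiver has bytes up to 1000, plus 3000-3999 and 5000-5999
    print(missing_ranges(1000, [(5000, 6000), (3000, 4000)]))
```

With cumulative acknowledgments alone, the sender in this example would only learn about the hole starting at byte 1000; with SACK it learns about both holes in the same round trip.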

Blackholing Issues

Enabling SACK globally used to be somewhat risky, because in some parts of the Internet, TCP SYN packets offering/requesting the SACK capability were filtered, causing connection attempts to fail. By now, it seems that the increased deployment of SACK has caused most of these filters to disappear.

Performance Issues

Enabling SACK is not always recommended. For example, the Linux 2.4.x TCP SACK implementation suffers from significant performance degradation when a burst of packets is lost. People from CERN observed that a burst of packet loss considerably affects TCP connections with a large bandwidth-delay product (several MBytes), because the TCP connection doesn't recover as it should. After a burst of loss, the throughput measured in their testbed stays close to zero for 70 seconds. This behavior is not compliant with TCP's RFC: a timeout should occur after a few RTTs because one of the losses couldn't be repaired quickly enough, and the sender should go back into slow start.

For more information take a look at http://sravot.home.cern.ch/sravot/Networking/TCP_performance/tcp_bug.htm

Additionally, work done at the Hamilton Institute found that SACK processing in the Linux kernel is inefficient even in later 2.6 kernels: on 1 Gbps networks, a long file transfer can lose about 100 Mbps of potential throughput. Most of these issues should have been fixed in Linux kernel 2.6.16.

Detailed Explanation

The following is closely based on a mail that Baruch Even sent to the pert-discuss mailing list on 25 Jan '07:

The Linux TCP SACK handling code has in the past been extremely inefficient - there were multiple passes on a linked list which holds all the sent packets. This linked list on a large BDP link can span 20,000 packets. This meant multiple traversals of this list take longer than it takes for another packet to come. Pretty quickly after a loss with SACK the sender's incoming queue fills up and ACKs start getting dropped. There used to be an anti-DoS mechanism that would drop all packets until the queue had emptied - with a default of 1000 packets that took a long time and could easily drop all the outstanding ACKs resulting in a great degradation of performance. This value is set with /proc/sys/net/core/netdev_max_backlog.

This situation has improved slightly: although there is still a 1000-packet limit for the network queue, it now acts as a normal buffer, i.e. packets are accepted again as soon as the queue dips below 1000.

Kernel 2.6.19 should be the preferred kernel now for high speed networks. It is believed it has the fixes for all former major issues (at least those fixes that were accepted to the kernel), but it should be noted that some other (more minor) bugs have appeared, and will need to be fixed in future releases.

Historical Note

Selective acknowledgement schemes were known long before they were added to TCP. In a posting to the tcp-ip mailing list in August 1986, Noel Chiappa mentioned that Xerox' PUP protocols had them. Vint Cerf responded with a few notes about the thinking that led to the cumulative form of the original TCP acknowledgements.

References

-- UlrichSchmid & SimonLeinen - 02 Jun 2005 - 14 Jan 2007
-- BartoszBelter - 16 Dec 2005
-- BaruchEven - 05 Jan 2006

Automatic Tuning of TCP Buffers

Note: This mechanism is sometimes referred to as "Dynamic Right-Sizing" (DRS).

The issues mentioned under "Large TCP Windows" are arguments in favor of "buffer auto-tuning", a promising but relatively new approach to better TCP performance in operating systems. See the TCP auto-tuning zoo reference for a description of some approaches.

Microsoft introduced (receive-side) buffer auto-tuning in Windows Vista. This implementation is explained in a TechNet Magazine "Cable Guy" article.

FreeBSD introduced buffer auto-tuning as part of its 7.0 release.

Mac OS X introduced buffer auto-tuning in release 10.5.

Linux auto-tuning details

Some automatic buffer tuning is implemented in Linux 2.4 (sender-side), and Linux 2.6 implements it for both the send and receive directions.

In a post to the web100-discuss mailing list, John Heffner describes the Linux 2.6.16 (March 2006) implementation as follows:

For the sender, we explored separating the send buffer and retransmit queue, but this has been put on the back burner. This is a cleaner approach, but is not necessary to achieve good performance. What is currently implemented in Linux is essentially what is described in Semke '98, but without the max-min fair sharing. When memory runs out, Linux implements something more like congestion control for reducing memory. It's not clear that this is well-behaved, and I'm not aware of any literature on this. However, it's rarely used in practice.

For the receiver, we took an approach similar to DRS, but not quite the same. RTT is measured with timestamps (when available), rather than using a bounding function. This allows it to track a rise in RTT (for example, due to path change or queuing). Also, a subtle but important difference is that receiving rate is measured by the amount of data consumed by the application, not data received by TCP.

Matt Mathis reports on the end2end-interest mailing list (26/07/06):

Linux 2.6.17 now has sender and receiver side autotuning and a 4 MB DEFAULT maximum window size. Yes, by default it negotiates a TCP window scale of 7.
4 MB is sufficient to support about 100 Mb/s on a 300 ms path or 1 Gb/s on a 30 ms path, assuming you have enough data and an extremely clean (loss-less) network.
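
The figures quoted above follow from the bandwidth-delay product (rate multiplied by round-trip time). The following is a minimal sketch of that arithmetic; the function name bdp_bytes is made up for illustration, and "4 MB" is assumed here to mean 4 * 1024 * 1024 bytes:

```python
def bdp_bytes(rate_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes, for a rate in bit/s and an RTT in ms."""
    # bits/s * ms / 1000 = bits in flight; dividing by 8 converts bits to bytes.
    return rate_bits_per_s * rtt_ms // 8000


MAX_WINDOW = 4 * 1024 * 1024  # assumed meaning of the "4 MB" maximum window

for rate, rtt_ms in [(100_000_000, 300), (1_000_000_000, 30)]:
    needed = bdp_bytes(rate, rtt_ms)
    print(f"{rate / 1e6:.0f} Mb/s at {rtt_ms} ms RTT needs {needed} bytes in flight "
          f"(fits in 4 MB: {needed <= MAX_WINDOW})")
```

Both paths need 3,750,000 bytes in flight, which is just below the 4 MB maximum window.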

References

  • TCP auto-tuning zoo, Web page by Tom Dunegan of Oak Ridge National Laboratory, PDF
  • A Comparison of TCP Automatic Tuning Techniques for Distributed Computing, E. Weigle, W. Feng, 2002, PDF
  • Automatic TCP Buffer Tuning, J. Semke, J. Mahdavi, M. Mathis, SIGCOMM 1998, PS
  • Dynamic Right-Sizing in TCP, M. Fisk, W. Feng, Proc. of the Los Alamos Computer Science Institute Symposium, October 2001, PDF
  • Dynamic Right-Sizing in FTP (drsFTP): Enhancing Grid Performance in User-Space, M.K. Gardner, Wu-chun Feng, M. Fisk, July 2002, PDF. This paper describes an implementation of buffer-tuning at application level for FTP, i.e. outside of the kernel.
  • Socket Buffer Auto-Sizing for High-Performance Data Transfers, R. Prasad, M. Jain, C. Dovrolis, PDF
  • The Cable Guy: TCP Receive Window Auto-Tuning, J. Davies, January 2007, Microsoft TechNet Magazine
  • What's New in FreeBSD 7.0, F. Biancuzzi, A. Oppermann et al., February 2008, ONLamp
  • How to disable the TCP autotuning diagnostic tool, Microsoft Support, Article ID: 967475, February 2009

-- SimonLeinen - 04 Apr 2006 - 21 Mar 2011
-- ChrisWelti - 03 Aug 2006

UDP (User Datagram Protocol)

The User Datagram Protocol (UDP) is a very simple layer over the host-to-host Internet Protocol (IP). It only adds 16-bit source and destination port numbers for multiplexing between different applications on the pair of hosts, and 16-bit length and checksum fields. It has been defined in RFC 768.
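
To make the header layout concrete, here is a small sketch that packs and unpacks the four 16-bit fields in network byte order. The helper names are hypothetical, and the checksum is simply left at zero (which RFC 768 defines as "no checksum computed" over IPv4) rather than being calculated:

```python
import struct

UDP_HEADER = struct.Struct("!HHHH")  # source port, destination port, length, checksum


def build_udp_header(src_port, dst_port, payload, checksum=0):
    """Pack a UDP header; the length field covers the header plus the payload."""
    return UDP_HEADER.pack(src_port, dst_port, UDP_HEADER.size + len(payload), checksum)


def parse_udp_header(data):
    """Unpack the first eight octets of a UDP datagram into a dictionary."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack(data[:UDP_HEADER.size])
    return {"src_port": src_port, "dst_port": dst_port,
            "length": length, "checksum": checksum}


# Illustrative values only: an arbitrary source port and the DNS port.
header = build_udp_header(40000, 53, b"hello")
print(len(header), parse_udp_header(header))
```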

UDP is used directly for protocols such as the Domain Name System (DNS) or the Network Time Protocol (NTP), which consist of isolated request-response transactions between hosts, where the negotiation and maintenance of TCP connections would be prohibitive.

There are other protocols layered on top of UDP, for example the Real-time Transport Protocol (RTP) used in real-time media applications. UDT, VFER, RBUDP, Tsunami, and Hurricane are examples of UDP-based bulk transport protocols.

References

  • RFC 768, User Datagram Protocol, J. Postel, August 1980

-- SimonLeinen - 31 Oct 2004 - 02 Apr 2006

Real-Time Transport Protocol (RTP)

RTP (RFC 3550) is a generic transport protocol for real-time media streams such as audio or video. RTP is typically run over UDP. RTP's services include timestamps and identification of media types. RTP includes RTCP (RTP Control Protocol), whose primary use is to provide feedback about the quality of transmission in the form of Receiver Reports.
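
As an illustration of where the timestamp and the media type identification live, the sketch below decodes the 12-octet fixed RTP header described in RFC 3550. The function name and the sample values are made up for this example, and header extensions and CSRC lists are ignored:

```python
import struct


def parse_rtp_header(packet):
    """Decode the 12-octet fixed RTP header (no CSRC list, no extension)."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header needs at least 12 octets")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,           # top two bits of the first octet
        "marker": second >> 7,           # top bit of the second octet
        "payload_type": second & 0x7F,   # identifies the media encoding
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, payload type 0, sequence 1, timestamp 160.
sample = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234) + b"payload"
print(parse_rtp_header(sample))
```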

RTP is used as a framing protocol for many real-time audio/video applications, including those based on the ITU-T H.323 protocol and the Session Initiation Protocol (SIP).

References

  • RFC 3550, RTP: A Transport Protocol for Real-Time Applications, H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. July 2003

-- SimonLeinen - 28 Oct 2004 - 24 Nov 2007

Application Protocols

-- TobyRodwell - 28 Feb 2005

File Transfer

A common problem for many scientific applications is the replication of - often large - data sets (files) from one system to another. (For the generalized problem of transferring data sets from a source to multiple destinations, see DataDissemination.) Typically this requires reliable transfer (protection against transmission errors) such as that provided by TCP, usually access control based on some sort of authentication, and sometimes confidentiality against eavesdroppers, which can be provided by encryption. There are many protocols that can be used for file transfer, some of which are outlined here.

  • FTP, the File Transfer Protocol, was one of the earliest protocols used on the ARPAnet and the Internet, and predates both TCP and IP. It supports simple file operations over a variety of operating systems and file abstractions, and has both a text and a binary mode. FTP uses separate TCP connections for control and data transfer.
  • HTTP, the Hypertext Transfer Protocol, is the basic protocol used by the World Wide Web. It is quite efficient for transferring files, but is typically used to transfer from a server to a client only.
  • RCP, the Berkeley Remote Copy Protocol, is a convenient protocol for transferring files between Unix systems, but lacks real security beyond address-based authentication and clear-text passwords. Therefore it has mostly fallen out of use.
  • SCP is a file-transfer application of the SSH protocol. It provides various modern methods of authentication and encryption, but its current implementations have some performance limitations over "long fat networks" that are addressed under the SSH topic.
  • BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them.
  • VFER is a tool for high-performance data transfer developed at Internet2. It is layered on UDP and implements its own delay-based rate control algorithm in user-space, which is designed to be "TCP friendly". Its security is based on SSH.
  • UDT is another UDP-based bulk transfer protocol, optimized for high-capacity (1 Gb/s and above) wide-area network paths. It has been used in the winning entry at the Supercomputing'06 Bandwidth Challenge.

Several high-performance file transfer protocols are used in the Grid community. The "comparative evaluation..." paper in the references compares FOBS, RBUDP, UDT, and bbFTP. Other protocols include GridFTP, Tsunami and FDT. The eVLBI community uses file transfer tools from the Mark5 software suite: File2Net and Net2File. The ESnet "Fasterdata" knowledge base has a very nice section on Data Transfer Tools, providing both general background information and information about several specific tools. Another useful document is Harry Mangalam's How to transfer large amounts of data via network—a nicely written general introduction to the problem of moving data with many usage examples of specialized tools, including performance numbers and tuning hints.

Network File Systems

Another possibility of exchanging files over the network involves networked file systems, which make remote files transparently accessible in a local system's normal file namespace. Examples for such file systems are:

  • NFS, the Network File System, was initially developed by Sun and is widely utilized on Unix-like systems. Very recently, NFS 4.1 added support for pNFS (parallel NFS), where data access can be striped over multiple data servers.
  • AFS, the Andrew File System from CMU, evolved into DFS (Distributed File System)
  • SMB (Server Message Block) or CIFS (Common Internet File System) is the standard protocol for connecting to "network shares" (remote file systems) in the Windows world.
  • GPFS (General Parallel File System) is a high-performance scalable network file system by IBM.
  • Lustre is an open-source file system for high-performance clusters, distributed by Sun.

References

-- TobyRodwell - 2005-02-28
-- SimonLeinen - 2005-06-26 - 2015-04-22

-- TobyRodwell - 28 Feb 2005

Secure Shell (SSH)

SSH is a widely used protocol for remote terminal access with secure authentication and data encryption. It is also used for file transfers, using tools such as scp (Secure Copy), sftp (Secure FTP), or rsync-over-ssh.

Performance Issues With SSH

Application Layer Window Limitation

When users use SSH to transfer large files, they often think that performance is limited by the processing power required for encryption and decryption. While this can indeed be an issue in a LAN context, the bottleneck over "long fat networks" (LFNs) is most likely a window limitation. Even when TCP parameters have been tuned to allow sufficiently large TCP windows, the most common SSH implementation (OpenSSH) has a hardwired window size at the application level. Until OpenSSH 4.7, the limit was ~64K; since then, it has been increased 16-fold (see below) and the window increase logic has been made more aggressive.

This limitation is replaced with more advanced logic in a modification of the OpenSSH software provided by the Pittsburgh Supercomputing Center (see below).

The performance difference is substantial, especially as RTT grows. In a test setup with 45 ms RTT, two Linux systems with 8 MB read/write buffers could achieve 1.5 MB/s with regular OpenSSH (3.9p1 + 4.3p1). Switching to OpenSSH 5.1p1 + HPN-SSH patches on both ends allowed up to 55-70 MB/s (no encryption) or 35/50 MB/s (aes128-cbc/ctr encryption), with the stable rate somewhat lower, the bottleneck being CPU on one end. By just upgrading the receiver (client) side, transfers could still reach 50 MB/s (with encryption).
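
The effect of a fixed application-level window can be estimated with a simple rule of thumb: at most one window of data can be in flight per round trip. The sketch below applies it to the 45 ms test path; the function name is hypothetical, and the "~64K" limit is assumed to be exactly 64 * 1024 bytes:

```python
def window_limited_rate(window_bytes, rtt_ms):
    """Upper bound on throughput (bytes/s) when one window is sent per RTT."""
    return window_bytes * 1000 / rtt_ms


RTT_MS = 45
old_window = 64 * 1024        # assumed value of the "~64K" limit
new_window = 16 * old_window  # the 16-fold increase mentioned above

for label, window in [("64 KiB", old_window), ("1 MiB", new_window)]:
    rate = window_limited_rate(window, RTT_MS)
    print(f"{label} window at {RTT_MS} ms RTT: at most {rate / 1e6:.2f} MB/s")
```

For the 64 KiB window this gives about 1.46 MB/s, which is in line with the roughly 1.5 MB/s measured with regular OpenSSH.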

Crypto overhead

When the window-size limitation is removed, encryption/decryption performance may become the bottleneck again. So it is useful to choose a "cipher" (encryption/decryption method) that performs well, while still being regarded as sufficiently secure to protect the data in question. Here is a table that displays the performance of several ciphers supported by OpenSSH in a reference setting:

cipher throughput
3des-cbc 2.8MB/s
arcfour 24.4MB/s
aes192-cbc 13.3MB/s
aes256-cbc 11.7MB/s
aes128-ctr 12.7MB/s
aes192-ctr 11.7MB/s
aes256-ctr 11.3MB/s
blowfish-cbc 16.3MB/s
cast128-cbc 7.9MB/s
rijndael-cbc@lysator.liu.se 12.2MB/s

The High Performance Enabled SSH/SCP (HPN-SSH) version also adds an option to the scp program that allows use of the "none" cipher when confidentiality protection of the transferred data is not required. It also supports a cipher-switch option, where password authentication is encrypted but the transferred data is not.

References

Basics

SSH Performance

-- ChrisWelti - 03 Apr 2006
-- SimonLeinen - 12 Feb 2005 - 26 Apr 2010
-- PekkaSavola - 01 Oct 2008

BitTorrent

BitTorrent is an example of a peer-to-peer file-sharing protocol. It employs local control mechanisms to optimize the global problem of replicating a large file to many recipients, by allowing peers to share partial copies as they receive them. It was developed by Bram Cohen, and is widely used to distribute large files over the Internet. Because much of this copying involves audio and video data and happens without the consent of the rights owners of that material, BitTorrent has become a focus of attention of media interest groups such as the Motion Picture Association of America (MPAA). But BitTorrent is also used to distribute large software archives under "Free Software" or similar legal-redistribution regimes.

References

-- SimonLeinen - 26 Jun 2005

Troubleshooting Procedures

PERT Troubleshooting Procedures, as used in the earlier centralised PERT in GN2 (2004-2008), are laid down in GN2 Deliverable DS3.5.2 (see references section), which also contains the User Guide for using the (now-defunct) PERT Ticket System (PTS).

Standard tests

  • Ping - simple RTT/loss measurement using ICMP ECHO requests
  • Fping - ping variant with concurrent measurement of multiple destinations

End-user tests

The tweak tools at http://www.dslreports.com/tweaks can run a simple end-user test, checking parameters such as TCP options, receive window size, and data transfer rates. Everything is done through the user's web browser, which makes the test simple for end users to perform.

Linux Tools and Checks

ip route show cache

Linux is able to apply specific conditions to specific routes. Some parameters (such as MSS and ssthresh) can be set manually (with ip route add|replace ...), whilst others are changed automatically by Linux in response to what it learns from TCP (such parameters include estimated RTT, cwnd and re-ordering). The learned information is stored in the route cache and can thus be shown with ip route show cache. Note that this learning behaviour can actually limit TCP performance: if the last transfer was poor, then the starting TCP parameters will be pessimistic. For this reason some tools, e.g. bwctl, always flush the route cache before starting a test.

References

  • PERT Troubleshooting Procedures - 2nd Edition, B. Belter, A. Harding, T. Rodwell, GN2 deliverable D3.5.2, August 2006 (PDF)

-- SimonLeinen - 18 Sep 2005 - 17 Aug 2010
-- BartoszBelter - 28 Mar 2006

Measurement Tools

Traceroute-like Tools: traceroute, MTR, PingPlotter, lft, tracepath, traceproto

There is a large and growing number of path-measurement tools derived from the well-known traceroute tool. Those tools all attempt to find the route a packet will take from the source (typically where the tool is running) to a given destination, and to find out some hop-wise performance parameters along the way.

Bandwidth Measurement Tools: pchar, Iperf, bwctl, nuttcp, Netperf, RUDE/CRUDE, ttcp, NDT, DSL Reports

Use: Evaluate the Bandwidth between two points in the network.

Active Measurement Boxes

Use: Obtain information about delay, jitter and packet loss between two probes.

Passive Measurement Tools

These tools perform their measurement by looking at existing traffic. They can be further classified into

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 01 Jun 2005
-- SimonLeinen - 2005-07-15 - 2014-12-27
-- AlessandraScicchitano - 2013-01-30

traceroute-like Tools

Traceroute is used to determine the route a packet takes through the Internet to reach its destination; i.e. the sequence of gateways or "hops" it passes through. Since its inception, traceroute has been widely used for network diagnostics as well as for research in the widest sense.

The basic idea is to send out "probe" packets with artificially small TTL (time-to-live) values, eliciting ICMP "time exceeded" messages from routers in the network, and reconstructing the path from these messages. This is described in more detail under the VanJacobsonTraceroute topic. This original traceroute implementation was followed by many attempts at improving on the idea in various directions: More useful output (often graphically enhanced, sometimes trying to map the route geographically), more detailed measurements along the path, faster operation for many targets ("topology discovery"), and more robustness in the face of packet filters, using different types of suitable probe packets.

Traceroute Variants

Close cousins

Variants with more filter-friendly probe packets

Extensions

GUI (Graphical User Interface) traceroute variants

  • 3d Traceroute plots per-hop RTTs over time in 3D. Works on Windows, free demo available
  • LoriotPro graphical traceroute plugin. Several graphical representations.
  • Path Analyzer Pro, graphical traceroute tool for Windows and MacOS X. Includes a "synopsis" feature.
  • PingPlotter combines traceroute, ping and whois to collect data for Windows platform.
  • VisualRoute performs traceroutes and provides various graphical representations and analysis. Runs on Windows, Web demo available.

More detailed measurements along the path

Other

  • Multicast Traceroute (mtrace) uses IGMP protocol extensions to allow "traceroute" functionality for multicast
  • Paris Traceroute claims to produce better results in the presence of "multipath" routing.
  • Scamper performs bulk measurements to large numbers of destinations, under configurable traffic limits.
  • tracefilter tries to detect where along a path packets are filtered.
  • Tracebox tries to detect middleboxes along the path
  • traceiface tries to find "both ends" of each link traversed using expanding-ring search.
  • Net::Traceroute::PurePerl is an implementation of traceroute in Perl that lends itself to experimentation.
  • traceflow is a proposed protocol to trace the path for traffic of a specific 3/5-tuple and collect diagnostic information

Another list of traceroute variants, with a focus on (geo)graphical display capabilities, can be found on the JRA1 Wiki.

Traceroute Servers

There are many TracerouteServers on the Internet that allow running traceroute from other parts of the network.

Traceroute Data Collection and Representation

Researchers from the network measurement community have created large collections of traceroute results to help understand the Internet's topology, i.e. the structure of its connectivity. Some of these collections are available for other researchers. Scamper is an example of a tool that can be used to efficiently obtain traceroute results towards a large number of destinations.

The IETF IPPM WG is standardizing an XML-based format to store traceroute results.

-- FrancoisXavierAndreu - SimonMuyal - 06 Jun 2005
-- SimonLeinen - 2005-05-06 - 2015-09-11

Original Van Jacobson/Unix/LBL Traceroute

The well-known traceroute program was written by Van Jacobson in December 1988. In the header comment of the source, Van explains that he just implemented an idea he got from Steve Deering (and that could have come from Guy Almes or Matt Mathis). Traceroute sends "probe" packets with TTL (Time to Live) values incrementing from one, and uses ICMP "Time Exceeded" messages to detect router "hops" on the way to the specified destination. It also records "response" times for each hop, and displays losses and other types of failures in a compact way.

This traceroute is also known as "Unix traceroute" (although some Unix variants have departed from the code base, see e.g. Solaris traceroute) or "LBL traceroute" after Van Jacobson's employer at the time he wrote this.

Basic function

UDP packets are sent as probes to a high ephemeral port (usually in the range 33434--33525) with the Time-To-Live (TTL) field in the IP header increasing by one for each probe sent until the end host is reached. The originating host listens for ICMP Time Exceeded responses from each of the routers/hosts en-route. It knows that the packet's destination has been reached when it receives an ICMP Port Unreachable message; we expect a port unreachable message as no service should be listening for connections in this port range. If there is no response to the probe within a certain time period (typically 5 seconds), then a * is displayed.
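
The TTL-stepping logic described above can be sketched without touching the network, by simulating the path as a list of hops. This is only a toy model under that assumption: the function names and hop names are made up, timeouts and multiple probes per hop are ignored, and a real implementation needs raw sockets to receive the ICMP responses.

```python
def simulate_probe(path, ttl):
    """Return (responder, icmp_type) for a probe sent with the given TTL.

    path lists the routers in order, with the destination host last.
    """
    if ttl < len(path):
        # The TTL expires at an intermediate router.
        return path[ttl - 1], "time-exceeded"
    # The probe reaches the destination, where no service listens on the port.
    return path[-1], "port-unreachable"


def trace(path, max_hops=30):
    """Send probes with increasing TTL until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, icmp_type = simulate_probe(path, ttl)
        hops.append((ttl, responder))
        if icmp_type == "port-unreachable":
            break
    return hops


# Hypothetical three-hop path.
for ttl, responder in trace(["router-a", "router-b", "destination"]):
    print(ttl, responder)
```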

Output

The output of the traceroute program shows each host that the packet passes through on its way to its destination and the RTT to each gateway en-route. Occasionally, the maximum number of hops (specified by the TTL field, which defaults to 64 hops in *NIX implementations) is exceeded before the port unreachable is received. When this happens an ! will be printed beside the RTT in the output.

Other error messages that may appear after the RTT in the output of a traceroute are:

!H
Host unreachable
!N
Network unreachable
!P
Protocol unreachable
!S
Source-route failed (that is to say, the router was not able to honour the source-route option set in an IP packet)
!F[pmtu]
Fragmentation needed. [pmtu] displays the Path MTU Discovery value, typically the "next-hop MTU" contained in the ICMP response packet.
!X
Administratively prohibited. The gateway prohibits these packets, but sends an ICMP message back to the source of the traceroute to inform them of this.
!V
Host precedence violation
!C
Precedence cut-off in effect
![num]
displays the ICMP unreachable code, as defined in a variety of RFCs, shown at http://www.iana.org/assignments/icmp-parameters

UDP Probes

Why UDP?

It would seem natural to use ICMP ECHO requests as probe packets, but Van Jacobson chose UDP packets to presumably unused ports instead. It is believed that this is because at that time, some gateways (as routers were called then) refused to send ICMP (TTL exceeded) messages in response to ICMP messages, as specified in the introduction of RFC 792, "Internet Control Message Protocol". Therefore the UDP variant was more robust.

These days, all gateways (routers) send ICMP TTL Exceeded messages about ICMP ECHO request packets (as specified in RFC1122, "Requirements for Internet Hosts -- Communication Layers"), so more recent traceroute versions (such as Windows tracert) do indeed use ICMP probes, and newer Unix traceroute versions allow ICMP probes to be selected with the -I option.

Why increment the UDP port number?

Traceroute varies (increments) the UDP destination port number for each probe sent out, in order to reliably match ICMP TTL Exceeded messages to individual probes. Because the UDP ports occur right after the IP header, they can be relied on to be included in the "original packet" portion of the ICMP TTL Exceeded messages, even though the ICMP standards only mandate that the first eight octets following the IP header of the original packet be included in ICMP messages (it is allowed to send more though).

When ICMP ECHO requests are used, the probes can be disambiguated by using the sequence number field, which also happens to be located just before that 8-octet boundary.
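
The matching of responses to UDP probes can be illustrated as follows. This sketch assumes a base port of 33434, three probes per TTL value, and a destination port that grows by one per probe; actual implementations may number their probes differently. The function names and the source port are hypothetical.

```python
import struct

BASE_PORT = 33434     # assumed first destination port
PROBES_PER_HOP = 3    # assumed number of probes per TTL value


def probe_port(ttl, probe_index):
    """Destination port for the probe_index-th probe (0-based) sent with this TTL."""
    return BASE_PORT + (ttl - 1) * PROBES_PER_HOP + probe_index


def match_probe(quoted_octets):
    """Map the eight octets quoted after the IP header back to (ttl, probe_index)."""
    # Those eight octets are the UDP header of the probe: source port,
    # destination port, length and checksum, 16 bits each.
    _src, dst_port, _length, _checksum = struct.unpack("!HHHH", quoted_octets[:8])
    sequence = dst_port - BASE_PORT
    return sequence // PROBES_PER_HOP + 1, sequence % PROBES_PER_HOP


# Pretend a router quoted the UDP header of the second probe sent with TTL 5.
quoted = struct.pack("!HHHH", 40000, probe_port(5, 1), 12, 0)
print(match_probe(quoted))
```

With these assumptions, 30 hops of three probes each use ports 33434 to 33523, which fits the usual range mentioned above.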

Filters

Note that either or both of ICMP and UDP may be blocked by firewalls, so this must be taken into account when troubleshooting. As an alternative, one can often use traceroute variants such as lft or tcptraceroute to work around these filters, provided that the destination host has at least one TCP port open towards the source.

References

  • ftp://ftp.ee.lbl.gov/ - the original distribution site. The latest traceroute version found as of February 2006 is traceroute-1.4a12 from December 2000.
  • original announcement from the archives of the end2end-interest mailing list

-- SimonLeinen - 26 Feb 2006

Traceroute6

Traceroute6 uses the Hop-Limit field of the IPv6 protocol to elicit an ICMPv6 Time Exceeded message from each gateway ("hop") along the path to some host. Just as with traceroute, it prints the route to the given destination and the RTT to each gateway/router.
The following is a list of possible errors that may appear after the RTT for a gateway (especially for OSes that use the KAME IPv6 network stack, such as the BSDs):

!N
No route to host
!P
Administratively prohibited (i.e. Blocked by a firewall, but the firewall issues an ICMPv6 message to the originating host to inform them of this)
!S
Not a Neighbour
!A
Address unreachable
!
The hop-limit is <= 1 on a Port Unreachable ICMPv6 message. This means that the packet got to its destination, but that the reply had a hop-limit that was only just large enough to allow it to get back to the source of the traceroute6. This indication was more interesting in IPv4, where bugs in some implementations of the IP stack could be identified by this behaviour.

Traceroute6 can also be specified to use ICMPv6 Echo messages to send the probe packets, instead of the default UDP probes, by specifying the -I flag when running the program. This may be useful in situations where UDP packets are blocked by a packet filter or firewall, while ICMP ECHO requests are permitted.

Note that on some systems, notably Sun's Solaris (see SolarisTraceroute), IPv6 functionality is integrated in the normal traceroute program.

-- TobyRodwell - 06 Apr 2005
-- SimonLeinen - 26 Feb 2006

Solaris traceroute (includes IPv6 functionality)

On Sun's Solaris, IPv6 traceroute functionality is included in the normal traceroute program, so a separate "traceroute6" isn't necessary. Sun's traceroute program has additional options to select between IPv4 and IPv6:

-A [inet|inet6]
Resolve hostnames to an IPv4 (inet) or IPv6 (inet6) address only

-a
Perform traceroutes to all addresses in all address families found for the given destination name.

Examples

Here are a few examples of the address-family selection switches for Solaris' traceroute. By default, IPv6 is preferred:

: leinen@diotima[leinen]; traceroute cemp1
traceroute: Warning: cemp1 has multiple addresses; using 2001:620:0:114:20b:cdff:fe1b:3d1a
traceroute: Warning: Multiple interfaces found; using 2001:620:0:4:203:baff:fe4c:d751 @ bge0:1
traceroute to cemp1 (2001:620:0:114:20b:cdff:fe1b:3d1a), 30 hops max, 60 byte packets
 1  swiLM1-V4.switch.ch (2001:620:0:4::1)  0.683 ms  0.757 ms  0.449 ms
 2  swiNM1-V610.switch.ch (2001:620:0:c047::1)  0.576 ms  0.576 ms  0.463 ms
 3  swiCS3-G3-3.switch.ch (2001:620:0:c046::1)  0.461 ms  0.334 ms  0.340 ms
 4  swiEZ2-P1.switch.ch (2001:620:0:c03f::2)  0.467 ms  0.332 ms  0.348 ms
 5  swiLS2-10GE-1-1.switch.ch (2001:620:0:c03c::1)  3.976 ms  3.825 ms  3.729 ms
 6  swiCE2-10GE-1-3.switch.ch (2001:620:0:c006::1)  4.817 ms  4.703 ms  4.740 ms
 7  cemp1-eth1.switch.ch (2001:620:0:114:20b:cdff:fe1b:3d1a)  4.583 ms  4.566 ms  4.590 ms

If IPv4 is desired, this can be selected using -A inet:

: leinen@diotima[leinen]; traceroute -A inet cemp1
traceroute to cemp1 (130.59.35.130), 30 hops max, 40 byte packets
 1  swiLM1-V4.switch.ch (130.59.4.1)  0.643 ms  0.539 ms  0.465 ms
 2  swiNM1-V610.switch.ch (130.59.15.229)  0.453 ms  0.553 ms  0.470 ms
 3  swiCS3-G3-3.switch.ch (130.59.15.238)  0.590 ms  0.426 ms  0.476 ms
 4  swiEZ2-P1.switch.ch (130.59.36.222)  0.463 ms  0.307 ms  0.352 ms
 5  swiLS2-10GE-1-1.switch.ch (130.59.36.205)  3.723 ms  3.755 ms  3.743 ms
 6  swiCE2-10GE-1-3.switch.ch (130.59.37.1)  4.677 ms  4.678 ms  4.690 ms
 7  cemp1-eth1.switch.ch (130.59.35.130)  5.028 ms  4.555 ms  4.567 ms

The -a switch can be used to trace to all addresses:

: leinen@diotima[leinen]; traceroute -a cemp1
traceroute: Warning: Multiple interfaces found; using 2001:620:0:4:203:baff:fe4c:d751 @ bge0:1
traceroute to cemp1 (2001:620:0:114:20b:cdff:fe1b:3d1a), 30 hops max, 60 byte packets
 1  swiLM1-V4.switch.ch (2001:620:0:4::1)  0.684 ms  0.515 ms  0.457 ms
 2  swiNM1-V610.switch.ch (2001:620:0:c047::1)  0.580 ms  0.848 ms  0.561 ms
 3  swiCS3-G3-3.switch.ch (2001:620:0:c046::1)  0.304 ms  0.428 ms  0.315 ms
 4  swiEZ2-P1.switch.ch (2001:620:0:c03f::2)  0.455 ms  0.516 ms  0.397 ms
 5  swiLS2-10GE-1-1.switch.ch (2001:620:0:c03c::1)  3.853 ms  3.826 ms  3.874 ms
 6  swiCE2-10GE-1-3.switch.ch (2001:620:0:c006::1)  5.071 ms  4.654 ms  4.702 ms
 7  cemp1-eth1.switch.ch (2001:620:0:114:20b:cdff:fe1b:3d1a)  4.581 ms  4.564 ms  4.585 ms

traceroute to cemp1 (130.59.35.130), 30 hops max, 40 byte packets
 1  swiLM1-V4.switch.ch (130.59.4.1)  0.677 ms  0.523 ms  0.716 ms
 2  swiNM1-V610.switch.ch (130.59.15.229)  0.462 ms  0.558 ms  0.470 ms
 3  swiCS3-G3-3.switch.ch (130.59.15.238)  0.340 ms  0.309 ms  0.352 ms
 4  swiEZ2-P1.switch.ch (130.59.36.222)  0.341 ms  0.307 ms  0.351 ms
 5  swiLS2-10GE-1-1.switch.ch (130.59.36.205)  3.722 ms  3.684 ms  3.719 ms
 6  swiCE2-10GE-1-3.switch.ch (130.59.37.1)  4.794 ms  4.695 ms  4.658 ms
 7  cemp1-eth1.switch.ch (130.59.35.130)  4.645 ms  4.653 ms  4.587 ms

-- SimonLeinen - 26 Feb 2006

NANOG Traceroute

Ehud Gavron maintains this version of traceroute. It is derived from the original traceroute program, but adds a few features such as AS (Autonomous System) number lookup, and detection of TOS (Type-of-Service) changes along the path.

TOS Change Detection

This feature can be used to check whether a network path is DSCP-transparent. Use the -t option to select the TOS byte, and watch out for TOS byte changes indicated with TOS=x!.

As shown in the following example, GEANT2 accepts TOS=32/DSCP=8/CLS=1 - this corresponds to "LBE"

: root@diotima[nanog]; ./traceroute -t 32 www.dfn.de.
traceroute to sirius.dfn.de (192.76.176.5), 30 hops max, 40 byte packets
1 swiLM1-V4.switch.ch (130.59.4.1) 1 ms 1 ms 0 ms
2 swiNM1-V610.switch.ch (130.59.15.229) 0 ms 1 ms 0 ms
3 swiCS3-G3-3.switch.ch (130.59.15.238) 0 ms 0 ms 0 ms
4 swiEZ2-P1.switch.ch (130.59.36.222) 0 ms 0 ms 0 ms
5 swiLS2-10GE-1-1.switch.ch (130.59.36.205) 4 ms 4 ms 4 ms
6 swiCE2-10GE-1-3.switch.ch (130.59.37.1) 5 ms 5 ms 5 ms
7 switch.rt1.gen.ch.geant2.net (62.40.124.21) 5 ms 5 ms 5 ms
8 so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) 13 ms 13 ms 13 ms
9 dfn-gw.rt1.fra.de.geant2.net (62.40.124.34) 13 ms 13 ms 13 ms
10 cr-berlin1-po1-0.x-win.dfn.de (188.1.18.53) 25 ms 25 ms 25 ms
11 ar-berlin1-ge6-1.x-win.dfn.de (188.1.20.34) 25 ms 25 ms 25 ms
12 zpl-gw.dfn.de (192.76.176.253) 25 ms 25 ms 25 ms
13 * ^C

On the other hand, TOS=64/DSCP=16/CLS=2 is rewritten to TOS=0 at hop 7 (the rewrite only shows up at hop 8):

: root@diotima[nanog]; ./traceroute -t 64 www.dfn.de.
traceroute to sirius.dfn.de (192.76.176.5), 30 hops max, 40 byte packets
1 swiLM1-V4.switch.ch (130.59.4.1) 2 ms 1 ms 0 ms
2 swiNM1-V610.switch.ch (130.59.15.229) 0 ms 0 ms 0 ms
3 swiCS3-G3-3.switch.ch (130.59.15.238) 0 ms 0 ms 0 ms
4 swiEZ2-P1.switch.ch (130.59.36.222) 0 ms 0 ms 0 ms
5 swiLS2-10GE-1-1.switch.ch (130.59.36.205) 4 ms 4 ms 4 ms
6 swiCE2-10GE-1-3.switch.ch (130.59.37.1) 5 ms 5 ms 5 ms
7 switch.rt1.gen.ch.geant2.net (62.40.124.21) 5 ms 5 ms 5 ms
8 so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) 13 ms (TOS=0!) 13 ms 13 ms
9 dfn-gw.rt1.fra.de.geant2.net (62.40.124.34) 13 ms 13 ms 13 ms
10 cr-berlin1-po1-0.x-win.dfn.de (188.1.18.53) 25 ms 25 ms 25 ms
11 ar-berlin1-ge6-1.x-win.dfn.de (188.1.20.34) 25 ms 25 ms 25 ms
12 zpl-gw.dfn.de (192.76.176.253) 25 ms 25 ms 25 ms
13 * ^C
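
The TOS, DSCP and CLS values quoted in these examples are related by simple bit shifts. The following is a minimal sketch of that relationship; the function name decode_tos is chosen here for illustration and is not part of any traceroute implementation.

```python
def decode_tos(tos: int) -> dict:
    """Split an IPv4 TOS byte into its DSCP, class-selector and ECN parts."""
    if not 0 <= tos <= 255:
        raise ValueError("the TOS field is a single byte (0..255)")
    return {
        "tos": tos,
        "dscp": tos >> 2,   # upper six bits of the byte
        "cls": tos >> 5,    # upper three bits (class selector / IP precedence)
        "ecn": tos & 0x03,  # lower two bits
    }


# The two values passed to "traceroute -t" in the examples above.
for value in (32, 64):
    fields = decode_tos(value)
    print(f"TOS={fields['tos']} DSCP={fields['dscp']} CLS={fields['cls']}")
```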

References

-- SimonLeinen - 26 Feb 2006

Windows tracert

All Windows versions that come with TCP/IP support include a tracert command, which is a simple traceroute client that uses ICMP probes. The (few) options are slightly different from the Unix variants of traceroute. In short, -d is used to suppress address-to-name resolution (-n in Unix traceroute), -h specifies the maximum hopcount (-m in the Unix version), -j is used to specify loose source routes (-g in Unix), and although the same -w option is used to specify the timeout, tracert interprets it in milliseconds, while in Unix it is specified in seconds.

References

-- SimonLeinen - 26 Feb 2006

TCP Traceroute

TCPTraceroute is a traceroute implementation that uses TCP packets instead of UDP or ICMP packets to send its probes. It can be used in situations where a firewall blocks ICMP and UDP traffic. It is based on the "half-open scanning" technique used by NMAP: it sends a TCP packet with the SYN flag set and waits for a SYN/ACK (which indicates that something is listening for connections on this port). When it receives a response, the tcptraceroute program sends a packet with the RST flag set to close the connection.

References

-- TobyRodwell - 06 Apr 2005

LFT (Layer Four Traceroute)

lft is a variant of traceroute that by default uses TCP to port 80, in order to get through packet-filter-based firewalls.

Example

142:/home/andreu# lft -d 80 -m 1 -M 3 -a 5 -c 20 -t 1000 -H 30 -s 53 www.cisco.com     
Tracing _________________________________.
TTL  LFT trace to www.cisco.com (198.133.219.25):80/tcp
1   129.renater.fr (193.49.159.129) 0.5ms
2   gw1-renater.renater.fr (193.49.159.249) 0.4ms
3   nri-a-g13-0-50.cssi.renater.fr (193.51.182.6) 1.0ms
4   193.51.185.1 0.6ms
5   PO11-0.pascr1.Paris.opentransit.net (193.251.241.97) 7.0ms
6   level3-1.GW.opentransit.net (193.251.240.214) 0.8ms
7   ae-0-17.mp1.Paris1.Level3.net (212.73.240.97) 1.1ms
8   so-1-0-0.bbr2.London2.Level3.net (212.187.128.42) 10.6ms
9   as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106) 72.1ms
10   as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133) 158.7ms
11   ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9) 159.2ms
12   p1-0.cisco.bbnplanet.net (4.0.26.14) 159.4ms
13   sjck-dmzbb-gw1.cisco.com (128.107.239.9) 159.0ms
14   sjck-dmzdc-gw2.cisco.com (128.107.224.77) 159.1ms
15   [target] www.cisco.com (198.133.219.25):80 159.2ms

References

  • Home page: http://pwhois.org/lft/
  • Debian package: lft - display the route packets take to a network host/socket

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 21 May 2006

traceproto

Another traceroute variant which allows different protocols and ports to be used. "It currently supports tcp, udp, and icmp traces with the possibility of others in the future." It comes with a wrapper script called HopWatcher, which can be used to quickly detect when a path has changed.

Example

142:/home/andreu# traceproto www.cisco.com
traceproto: trace to www.cisco.com (198.133.219.25), port 80
ttl  1:  ICMP Time Exceeded from 129.renater.fr (193.49.159.129)
        6.7040 ms       0.28100 ms      0.28600 ms
ttl  2:  ICMP Time Exceeded from gw1-renater.renater.fr (193.49.159.249)
        0.16900 ms      6.0140 ms       0.25500 ms
ttl  3:  ICMP Time Exceeded from nri-a-g13-0-50.cssi.renater.fr (193.51.182.6)
        6.8280 ms       0.58200 ms      0.52100 ms
ttl  4:  ICMP Time Exceeded from 193.51.185.1 (193.51.185.1)
        6.6400 ms       7.4230 ms       6.7690 ms
ttl  5:  ICMP Time Exceeded from PO11-0.pascr1.Paris.opentransit.net (193.251.241.97)
        0.58100 ms      0.64100 ms      0.54700 ms
ttl  6:  ICMP Time Exceeded from level3-1.GW.opentransit.net (193.251.240.214)
        6.9390 ms       0.62200 ms      6.8990 ms
ttl  7:  ICMP Time Exceeded from ae-0-17.mp1.Paris1.Level3.net (212.73.240.97)
        7.0790 ms       7.0250 ms       0.79400 ms
ttl  8:  ICMP Time Exceeded from so-1-0-0.bbr2.London2.Level3.net (212.187.128.42)
        10.362 ms       10.100 ms       16.384 ms
ttl  9:  ICMP Time Exceeded from as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106)
        109.93 ms       78.367 ms       80.352 ms
ttl  10:  ICMP Time Exceeded from as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133)
        156.61 ms       179.35 ms
          ICMP Time Exceeded from ae-0-0.bbr2.SanJose1.Level3.net (64.159.1.130)
        148.04 ms
ttl  11:  ICMP Time Exceeded from ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9)
        153.59 ms
         ICMP Time Exceeded from ge-11-0.ipcolo1.SanJose1.Level3.net (4.68.123.41)
        142.50 ms
         ICMP Time Exceeded from ge-7-1.ipcolo1.SanJose1.Level3.net (4.68.123.73)
        133.66 ms
ttl  12:  ICMP Time Exceeded from p1-0.cisco.bbnplanet.net (4.0.26.14)
        150.13 ms       191.24 ms       156.89 ms
ttl  13:  ICMP Time Exceeded from sjck-dmzbb-gw1.cisco.com (128.107.239.9)
        141.47 ms       147.98 ms       158.12 ms
ttl  14:  ICMP Time Exceeded from sjck-dmzdc-gw2.cisco.com (128.107.224.77)
        188.85 ms       148.17 ms       152.99 ms
ttl  15:no response     no response     

hop :  min   /  ave   /  max   :  # packets  :  # lost
      -------------------------------------------------------
  1 : 0.28100 / 2.4237 / 6.7040 :   3 packets :   0 lost
  2 : 0.16900 / 2.1460 / 6.0140 :   3 packets :   0 lost
  3 : 0.52100 / 2.6437 / 6.8280 :   3 packets :   0 lost
  4 : 6.6400 / 6.9440 / 7.4230 :   3 packets :   0 lost
  5 : 0.54700 / 0.58967 / 0.64100 :   3 packets :   0 lost
  6 : 0.62200 / 4.8200 / 6.9390 :   3 packets :   0 lost
  7 : 0.79400 / 4.9660 / 7.0790 :   3 packets :   0 lost
  8 : 10.100 / 12.282 / 16.384 :   3 packets :   0 lost
  9 : 78.367 / 89.550 / 109.93 :   3 packets :   0 lost
 10 : 148.04 / 161.33 / 179.35 :   3 packets :   0 lost
 11 : 133.66 / 143.25 / 153.59 :   3 packets :   0 lost
 12 : 150.13 / 166.09 / 191.24 :   3 packets :   0 lost
 13 : 141.47 / 149.19 / 158.12 :   3 packets :   0 lost
 14 : 148.17 / 163.34 / 188.85 :   3 packets :   0 lost
 15 : 0.0000 / 0.0000 / 0.0000 :   0 packets :   2 lost
     ------------------------Total--------------------------
total 0.0000 / 60.540 / 191.24 :  42 packets :   2 lost
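
The per-hop summary lines can be reproduced from the individual samples. The following sketch (the function name summarize is illustrative, not taken from traceproto) recomputes the minimum, average and maximum for hop 1 of the trace above; note that for hops 10 and 11 the samples come from different routers answering at the same TTL.

```python
def summarize(samples_ms):
    """Return (min, average, max, count) for one hop's RTT samples in ms."""
    if not samples_ms:
        # No replies at all, as for ttl 15 in the trace above.
        return (0.0, 0.0, 0.0, 0)
    return (
        min(samples_ms),
        sum(samples_ms) / len(samples_ms),
        max(samples_ms),
        len(samples_ms),
    )


# RTT samples reported for ttl 1 in the example output.
hop1 = [6.7040, 0.28100, 0.28600]
low, avg, high, count = summarize(hop1)
print(f"{low:.5g} / {avg:.5g} / {high:.5g} : {count} packets")
```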

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

MTR (Matt's TraceRoute)

mtr (Matt's TraceRoute) combines the functionality of the traceroute and ping programs in a single network diagnostic tool.

Example

mtr.png

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- SimonLeinen - 06 May 2005 - 26 Feb 2006

pathping

pathping appears to be similar to mtr, but for Windows systems. It is included in at least some modern versions of Windows.

References

  • "pathping", Windows XP Professional Product Documentation,
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/pathping.mspx

-- SimonLeinen - 26 Feb 2006

PingPlotter

PingPlotter by Nessoft LLC combines traceroute, ping and whois to collect data for the Windows platform.

Example

A nice example of the tool can be found in the "Screenshot" section of its Web page. This contains not only the screenshot (included below by reference) but also a description of what you can read from it.

nessoftgraph.gif

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006

Traceroute Mesh Server

Traceroute Mesh combines traceroutes from many sources at once and builds a large map of the interconnections from the Internet to the specific IP. Very cool!

References

Example

Partial view of traceroute map created:

partial.png

-- HankNussbacher - 10 Jul 2005

tracepath

tracepath and tracepath6 trace the path to a network host, discovering MTU and asymmetry along this path. As described below, their applicability for path asymmetry measurements is quite limited, but the tools can still measure MTU rather reliably.

Methodology and Caveats

A path is considered asymmetric if the number of hops to a router is different from how much TTL was decremented while the ICMP error message was forwarded back from the router. The latter depends on knowing what was the original TTL the router used to send the ICMP error. The tools guess TTL values 64, 128 and 255. Obviously, a path might be asymmetric even if the forward and return paths were equally long, so the tool just catches one case of path asymmetry.

A major operational issue with this approach is that at least Juniper's M/T-series routers decrement TTL for ICMP errors they originate (e.g., the first hop router returns ICMP error with TTL=254 instead of TTL=255) as if they were forwarding the packet. This shows as path asymmetry.
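
The asymmetry heuristic can be sketched as follows. This is an illustration under stated assumptions, not the actual tracepath source: the function names are invented here, the initial TTL is guessed as the smallest of 64, 128 and 255 that is not below the received value, and return hops are counted so that a directly attached router that sends with TTL=255 counts as one hop back.

```python
def guess_return_hops(received_ttl: int) -> int:
    """Estimate the return path length from the TTL of a received ICMP error."""
    if not 1 <= received_ttl <= 255:
        raise ValueError("TTL must be in the range 1..255")
    for initial_ttl in (64, 128, 255):
        if received_ttl <= initial_ttl:
            # A router one hop away whose reply is not decremented arrives
            # with the initial TTL, hence the "+ 1".
            return initial_ttl - received_ttl + 1
    raise AssertionError("unreachable: received_ttl is at most 255")


def is_flagged_asymmetric(forward_hops: int, received_ttl: int) -> bool:
    """True if the guessed return hop count differs from the forward hop count."""
    return guess_return_hops(received_ttl) != forward_hops


# First-hop router that sends its ICMP error with TTL=255: looks symmetric.
print(is_flagged_asymmetric(1, 255))
# First-hop router that decrements its own ICMP error to TTL=254: flagged.
print(is_flagged_asymmetric(1, 254))
```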

Path MTU is measured by sending UDP packets with the DF bit set. The initial packet size is the MTU of the host's outgoing link, or a smaller value already cached by Path MTU Discovery for the given destination address. If the MTU of a link along the path is lower than the size tried, the resulting ICMP error reports the new path MTU, which is then used in subsequent probes.

As explained in tracepath(8), if MTU changes along the path, then the route will probably erroneously be declared as asymmetric.
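
The Path MTU measurement described above can be illustrated with a small simulation. This is a sketch only: the function name and the list of link MTUs are hypothetical, and no packets are sent. Each entry of link_mtus stands for the MTU of the link leaving that hop, with hop 0 being the local host.

```python
def discover_path_mtu(link_mtus, initial_mtu):
    """Simulate probing with the DF bit set along a path.

    Returns the final path MTU and a list of (hop, new_mtu) pairs, one for
    each hop that would answer with "fragmentation needed and DF bit set".
    """
    pmtu = initial_mtu
    reductions = []
    for hop, link_mtu in enumerate(link_mtus):
        if link_mtu < pmtu:
            # The router at this hop cannot forward the probe unfragmented
            # and reports the MTU of its outgoing link in the ICMP error.
            pmtu = link_mtu
            reductions.append((hop, pmtu))
    return pmtu, reductions


# Hypothetical path: jumbo frames up to hop 4, standard Ethernet afterwards.
path = [9000, 9000, 9000, 9000, 1500, 1500, 1500]
print(discover_path_mtu(path, 9000))
# With the result already cached, no reduction is reported on the next run.
print(discover_path_mtu(path, 1500))
```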

Examples

IPv4 Example

This example shows a path from a host with 9000-byte "jumbo" MTU support to a host on a traditional 1500-byte Ethernet.

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.203ms pmtu 9000
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  1.024ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.959ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              5.287ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    5.456ms
 5:  swiCS3-P1.switch.ch (130.59.36.221)                  asymm  4   5.467ms pmtu 1500
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.864ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.209ms !H
     Resume: pmtu 1500

The router (interface) swiCS3-P1.switch.ch occurs twice; on the first line (hop 4), it returns an ICMP TTL Exceeded error, on the next (hop 5) it returns an ICMP "fragmentation needed and DF bit set" error. Unfortunately this causes tracepath to miss the "real" hop 5, and also to erroneously assume that the route is asymmetric at that point. One could consider this a bug, as tracepath could distinguish these different ICMP errors, and refrain from incrementing TTL when it reduces MTU (in response to the "fragmentation needed..." error).

When one retries the tracepath, the discovered Path MTU for the destination has been cached by the host, and you get a different result:

: leinen@cemp1[leinen]; tracepath diotima
 1:  cemp1.switch.ch (130.59.35.130)                        0.211ms pmtu 1500
 1:  swiCE2-G5-2.switch.ch (130.59.35.129)                  0.384ms
 2:  swiLS2-10GE-1-3.switch.ch (130.59.37.2)                1.214ms
 3:  swiEZ2-10GE-1-1.switch.ch (130.59.36.206)              4.620ms
 4:  swiCS3-P1.switch.ch (130.59.36.221)                    4.623ms
 5:  swiNM1-G1-0-25.switch.ch (130.59.15.237)               5.861ms
 6:  swiLM1-V610.switch.ch (130.59.15.230)                  4.845ms
 7:  swiLM1-V610.switch.ch (130.59.15.230)                asymm  6   5.226ms !H
     Resume: pmtu 1500

Note that hop 5 now shows up correctly and without an "asymm" warning. There is still an "asymm" warning at the end of the path, because a filter on the last-hop router swiLM1-V610.switch.ch prevents the UDP probes from reaching the final destination.

IPv6 Example

Here is the same path for IPv6, using tracepath6. Because of more relaxed UDP filters, the final destination is actually reached in this case:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 9000
 1:  swiCE2-G5-2.switch.ch                      1.654ms
 2:  swiLS2-10GE-1-3.switch.ch                  2.235ms
 3:  swiEZ2-10GE-1-1.switch.ch                  5.616ms
 4:  swiCS3-P1.switch.ch                        5.793ms
 5:  swiCS3-P1.switch.ch                      asymm  4   5.872ms pmtu 1500
 5:  swiNM1-G1-0-25.switch.ch                   5. 47ms
 6:  swiLM1-V610.switch.ch                      5. 79ms
 7:  diotima.switch.ch                          4.766ms reached
     Resume: pmtu 1500 hops 7 back 7

Again, once the Path MTU has been cached, tracepath6 starts out with that MTU, and will discover the correct path:

: leinen@cemp1[leinen]; tracepath6 diotima
 1?: [LOCALHOST]                      pmtu 1500
 1:  swiCE2-G5-2.switch.ch                      0.703ms
 2:  swiLS2-10GE-1-3.switch.ch                  8.786ms
 3:  swiEZ2-10GE-1-1.switch.ch                  4.904ms
 4:  swiCS3-P1.switch.ch                        4.979ms
 5:  swiNM1-G1-0-25.switch.ch                   4.989ms
 6:  swiLM1-V610.switch.ch                      6.578ms
 7:  diotima.switch.ch                          5.191ms reached
     Resume: pmtu 1500 hops 7 back 7

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 26 Feb 2006
-- PekkaSavola - 31 Aug 2006

Bandwidth Measurement Tools

Iperf

Iperf is a tool to measure maximum TCP bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, and datagram loss. Home page: http://sourceforge.net/projects/iperf/

BWCTL

"BWCTL is a command line client application and a scheduling and policy daemon that wraps Iperf"

Home page: http://e2epi.internet2.edu/bwctl/

Example:
http://e2epi.internet2.edu/pipes/pmp/pmp-switch.htm

nuttcp

A measurement tool very similar to iperf. It can be found at http://www.nuttcp.net/nuttcp/Welcome%20Page.html

NDT

Web100-based TCP tester that can be used from a Java applet.

Pchar

Hop-by-hop capacity measurements along a network path.

Netperf

A client/server network performance benchmark.

Home page: http://www.netperf.org/netperf/NetperfPage.html

Thrulay

A tool that performs TCP throughput tests and RTT measurements at the same time.

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two points. The rude program generates traffic into the network, which can be received and logged on the other side of the network by the crude program. The traffic pattern can be defined by the user. The tool is available at http://rude.sourceforge.net/

NEPIM

nepim stands for network pipemeter, a tool for measuring available bandwidth between hosts. nepim is also useful to generate network traffic for testing purposes.

nepim operates in client/server mode, is able to handle multiple parallel traffic streams, reports periodic partial statistics during the test, and supports IPv6.

The tool is available at http://www.nongnu.org/nepim/

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. Utility for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm

DSL Reports doesn't require any installation. It checks the speed of your connection via a Java applet. One can choose from over 300 sites in the world from which to run the tests: http://www.dslreports.com/stest

* Sample report produced by DSL Reports:
dslreport.png

Other online bandwidth measurement sites:

* http://myspeed.visualware.com/

* http://www.toast.net/performance/

* http://www.beelinebandwidthtest.com/

-- FrancoisXavierAndreu, SimonMuyal & SimonLeinen - 06-30 Jun 2005

-- HankNussbacher - 07 Jul 2005 & 15 Oct 2005 (DSL and online Reports section)

pchar

Characterize the bandwidth, latency and loss on network links. (See below the example for information on how pchar works)

(Debian package: pchar)

Example:

pchar to 193.51.180.221 (193.51.180.221) using UDP/IPv4
Using raw socket input
Packet size increments from 32 to 1500 by 32
46 test(s) per repetition
32 repetition(s) per hop
 0: 193.51.183.185 (netflow-nri-a.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 0.124246 ms, (b = 0.000206 ms/B), r2 = 0.997632
                       stddev rtt = 0.001224, stddev b = 0.000002
    Partial queueing:  avg = 0.000158 ms (765 bytes)
    Hop char:          rtt = 0.124246 ms, bw = 38783.892367 Kbps
    Hop queueing:      avg = 0.000158 ms (765 bytes)
 1: 193.51.183.186 (nri-a-g13-1-50.cssi.renater.fr)
    Partial loss:      0 / 1472 (0%)
    Partial char:      rtt = 1.087330 ms, (b = 0.000423 ms/B), r2 = 0.991169
                       stddev rtt = 0.004864, stddev b = 0.000006
    Partial queueing:  avg = 0.005093 ms (23535 bytes)
    Hop char:          rtt = 0.963084 ms, bw = 36913.554996 Kbps
    Hop queueing:      avg = 0.004935 ms (22770 bytes)
 2: 193.51.179.122 (nri-n3-a2-0-110.cssi.renater.fr)
    Partial loss:      5 / 1472 (0%)
    Partial char:      rtt = 697.145142 ms, (b = 0.032136 ms/B), r2 = 0.999991
                       stddev rtt = 0.011554, stddev b = 0.000014
    Partial queueing:  avg = 0.009681 ms (23679 bytes)
    Hop char:          rtt = 696.057813 ms, bw = 252.261443 Kbps
    Hop queueing:      avg = 0.004589 ms (144 bytes)
 3: 193.51.180.221 (caledonie-S1-0.cssi.renater.fr)
    Path length:       3 hops
    Path char:         rtt = 697.145142 ms r2 = 0.999991
    Path bottleneck:   252.261443 Kbps
    Path pipe:         21982 bytes
    Path queueing:     average = 0.009681 ms (23679 bytes)
    Start time:        Mon Jun  6 11:38:54 2005
    End time:          Mon Jun  6 12:15:28 2005

If you do not have access to a Unix system, you can run pchar remotely via: http://noc.greatplains.net/measurement/pchar.php

The README text below was written by Bruce A Mah and is taken from http://www.kitchenlab.org/www/bmah/Software/pchar/README, where there is information on how to obtain and install pchar.

PCHAR:  A TOOL FOR MEASURING NETWORK PATH CHARACTERISTICS
Bruce A. Mah
<bmah@kitchenlab.org>
---------------------------------------------------------

INTRODUCTION
------------

pchar is a reimplementation of the pathchar utility, written by Van
Jacobson.  Both programs attempt to characterize the bandwidth,
latency, and loss of links along an end-to-end path through the
Internet.  pchar works in both IPv4 and IPv6 networks.

As of pchar-1.5, this program is no longer under active development,
and no further releases are planned.

...

A FEW NOTES ON PCHAR'S OPERATION
--------------------------------

pchar sends probe packets into the network of varying sizes and
analyzes ICMP messages produced by intermediate routers, or by the
target host.  By measuring the response time for packets of different
sizes, pchar can estimate the bandwidth and fixed round-trip delay
along the path.  pchar varies the TTL of the outgoing packets to get
responses from different intermediate routers.  It can use UDP or ICMP
packets as probes; either or both might be useful in different
situations.

At each hop, pchar sends a number of packets (controlled by the -R flag)
of varying sizes (controlled by the -i and -m flags).  pchar determines
the minimum response times for each packet size, in an attempt to
isolate jitter caused by network queueing.  It performs a simple
linear regression fit to the resulting minimum response times.  This
fit yields the partial path bandwidth and round-trip time estimates.

To yield the per-hop estimates, pchar computes the differences in the
linear regression parameter estimates for two adjacent partial-path
datasets.  (Earlier versions of pchar differenced the minima for the
datasets, then computed a linear regressions.)  The -a flag selects
between one of (currently) two different algorithms for performing the
linear regression, either a least squares fit or a nonparametric
method based on Kendall's test statistic.

Using the -b option causes pchar to send small packet bursts,
consisting of a string of back-to-back ICMP ECHO_REPLY packets
followed by the actual probe.  This can be useful in probing switched
networks.

CAVEATS
-------

Router implementations may very well forward a packet faster than they
can return an ICMP error message in response to a packet.  Because of
this fact, it's possible to see faster response times from longer
partial paths; the result is a seemingly non-sensical, negative
estimate of per-hop round-trip time.

Transient fluctuations in the network may also cause some odd results.

If all else fails, writing statistics to a file will give all of the
raw data that pchar used for its analysis.

Some types of networks are intrinsically difficult for pchar to
measure.  Two notable examples are switched networks (with multiple
queues at Layer 2) or striped networks.  We are currently
investigating methods for trying to measure these networks.

pchar needs superuser access due to its use of raw sockets.

...
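
The estimation method described in the README (minimum response time per packet size, a linear fit, then differencing the slopes of adjacent partial paths) can be sketched as follows. This is an illustration with invented function names and synthetic probe data, using only an ordinary least-squares fit; it is not code from pchar itself.

```python
def min_rtt_by_size(probes):
    """Keep the smallest RTT (ms) seen for each probe size (bytes)."""
    best = {}
    for size, rtt in probes:
        if size not in best or rtt < best[size]:
            best[size] = rtt
    return best


def least_squares(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hop_bandwidth_kbps(slope_prev, slope_cur):
    """Per-hop bandwidth from two partial-path slopes given in ms per byte.

    The slope difference is the extra time per byte added by this hop;
    8 bits divided by that many milliseconds gives kbit/s.
    """
    delta = slope_cur - slope_prev
    if delta <= 0:
        return None  # the nonsensical case mentioned under CAVEATS
    return 8.0 / delta


# Synthetic probes: 1.0 ms fixed delay, 0.0004 ms/B, plus queueing delay
# that is zero in one of the four repetitions.
probes = [
    (size, 1.0 + 0.0004 * size + 0.05 * rep)
    for rep in range(4)
    for size in range(32, 1473, 32)
]
minima = min_rtt_by_size(probes)
sizes = sorted(minima)
slope, rtt0 = least_squares(sizes, [minima[s] for s in sizes])
print(slope, rtt0)

# Slopes ("b") of partial paths 1 and 2 from the example output above.
print(hop_bandwidth_kbps(0.000423, 0.032136))
```

Applied to the rounded b values printed for hops 1 and 2 in the example, the slope difference gives roughly the 252 Kbps that pchar reports as the path bottleneck.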

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- HankNussbacher - 10 Jul 2005 (Great Plains server)

-- TobyRodwell - 09 Aug 2005 (added Bruce A. Mah's README text)

Iperf

Iperf is a tool to measure TCP throughput and available bandwidth, allowing the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay variation, and datagram loss.

The popular Iperf 2 releases were developed by NLANR/DAST (http://dast.nlanr.net/Projects/Iperf/) and maintained at http://sourceforge.net/projects/iperf. As of April 2014, the last released version was 2.0.5, from July 2010.

A script is available that automates starting and then stopping iperf servers. It can be invoked from a remote machine (say a NOC workstation) to simplify starting, and more importantly stopping, an iperf server.

Iperf 3

ESnet and the Lawrence Berkeley National Laboratory have developed a from-scratch reimplementation of Iperf called Iperf 3. It has a GitHub repository. It is not compatible with iperf 2, and has additional interesting features such as a zero-copy TCP mode (-Z flag), JSON output (-J), and reporting of TCP retransmission counts and CPU utilization (-V). It also supports SCTP in addition to UDP and TCP. Since December 2013, various public releases have been made on http://stats.es.net/software/.

Usage Examples

TCP Throughput Test

The following shows a TCP throughput test, which is iperf's default action. The following options are given:

  • -s - server mode. In iperf, the server will receive the test data stream.
  • -c server - client mode. The name (or IP address) of the server should be given. The client will transmit the test stream.
  • -i interval - display interval. Without this option, iperf will run the test silently, and only write a summary after the test has finished. With -i, the program will report intermediate results at given intervals (in seconds).
  • -w windowsize - select a non-default TCP window size. To achieve high rates over paths with a large bandwidth-delay product, it is often necessary to select a larger TCP window size than the (operating system) default.
  • -l buffer length - specify the length of send or receive buffer. In UDP, this sets the packet size. In TCP, this sets the send/receive buffer length (possibly using system defaults). Using this may be important especially if the operating system default send buffer is too small (e.g. in Windows XP).

NOTE: The -c and -s arguments must be given first. Otherwise some configuration options are ignored.

The -i 1 option was given to obtain intermediate reports every second, in addition to the final report at the end of the ten-second test. The TCP buffer size was set to 2 Megabytes (4 Megabytes effective, see below) in order to permit close to line-rate transfers. The systems haven't been fully tuned, otherwise up to 7 Gb/s of TCP throughput should be possible. Normal background traffic on the 10 Gb/s backbone is on the order of 30-100 Mb/s. Note that in iperf, by default it is the client that transmits to the server.

Server Side:

welti@ezmp3:~$ iperf -s -w 2M -i 1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  4] local 130.59.35.106 port 5001 connected with 130.59.35.82 port 41143
[  4]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  4]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  4]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  4]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  4]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  4]  9.0-10.0 sec    413 MBytes  3.47 Gbits/sec
[  4]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec

Client Side:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)
------------------------------------------------------------
[  3] local 130.59.35.82 port 41143 connected with 130.59.35.106 port 5001
[  3]  0.0- 1.0 sec    405 MBytes  3.40 Gbits/sec
[  3]  1.0- 2.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  2.0- 3.0 sec    425 MBytes  3.56 Gbits/sec
[  3]  3.0- 4.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  4.0- 5.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  5.0- 6.0 sec    422 MBytes  3.54 Gbits/sec
[  3]  6.0- 7.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  7.0- 8.0 sec    423 MBytes  3.55 Gbits/sec
[  3]  8.0- 9.0 sec    424 MBytes  3.56 Gbits/sec
[  3]  0.0-10.0 sec  4.11 GBytes  3.53 Gbits/sec
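
When cross-checking the two columns of such a report, note that the figures above are consistent with byte counts using 1024-based prefixes and bit rates using 1000-based prefixes. The following sketch assumes that convention; the function name is illustrative.

```python
BYTE_UNITS = {"KBytes": 1024, "MBytes": 1024 ** 2, "GBytes": 1024 ** 3}


def rate_gbits_per_sec(amount, unit, seconds):
    """Convert a transferred amount (1024-based unit) to Gbit/s (1000-based)."""
    transferred_bits = amount * BYTE_UNITS[unit] * 8
    return transferred_bits / seconds / 1e9


# First one-second interval and the ten-second summary from the report above.
print(round(rate_gbits_per_sec(405, "MBytes", 1.0), 2))
print(round(rate_gbits_per_sec(4.11, "GBytes", 10.0), 2))
```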

UDP Test

In the following example, we send a 300 Mb/s UDP test stream. No packets were lost along the path, although one arrived out-of-order. Another interesting result is jitter, which is displayed as 27 or 28 microseconds (apparently there is some rounding error or other imprecision that prevents the client and server from agreeing on the value). According to the documentation, "Jitter is the smoothed mean of differences between consecutive transit times."

Server Side

: leinen@mamp1[leinen]; iperf -s -u
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.82 port 5001 connected with 130.59.35.106 port 38750
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.028 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

Client Side

: leinen@ezmp3[leinen]; iperf -c mamp1-eth0 -u -b 300M
------------------------------------------------------------
Client connecting to mamp1-eth0, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 130.59.35.106 port 38750 connected with 130.59.35.82 port 5001
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec
[  3] Sent 256411 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec    359 MBytes    302 Mbits/sec  0.027 ms    0/256410 (0%)
[  3]  0.0-10.0 sec  1 datagrams received out-of-order

As you would expect, during a UDP test traffic is only sent from the client to the server; a separate example shows this with tcpdump.
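
The jitter definition quoted above matches the interarrival jitter estimator defined for RTP (RFC 1889, later RFC 3550), which smooths the absolute difference between consecutive transit times with a gain of 1/16. The following is a minimal sketch of that estimator, on the assumption that iperf follows the RFC formula; the function name and the timestamps are made up for illustration.

```python
def smoothed_jitter(send_times, recv_times):
    """RTP-style jitter estimate: J += (|D| - J) / 16 for each new packet.

    D is the change in transit time (receive time minus send time) between
    consecutive packets. A constant clock offset between sender and receiver
    cancels out, since only differences of transit times are used.
    """
    jitter = 0.0
    previous_transit = None
    for sent, received in zip(send_times, recv_times):
        transit = received - sent
        if previous_transit is not None:
            delta = abs(transit - previous_transit)
            jitter += (delta - jitter) / 16.0
        previous_transit = transit
    return jitter


# Constant transit time: no jitter.
print(smoothed_jitter([0.0, 1.0, 2.0, 3.0], [0.5, 1.5, 2.5, 3.5]))
# The second packet takes 0.1 time units longer than the first.
print(smoothed_jitter([0.0, 1.0], [0.5, 1.6]))
```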

Problem isolation procedures using iperf

TCP Throughput measurements

Typically, end users report throughput problems as they see them with their applications, such as unexpectedly slow file transfers. Some users may already report TCP throughput results as measured with iperf. In any case, network administrators should validate the throughput problem. It is recommended that this be done using iperf end-to-end measurements in TCP mode between the end systems' memory. The window size for the TCP measurement should follow the bandwidth*delay product rule, and should therefore be set to at least the measured round-trip time multiplied by the path's bottleneck speed. If the actual bottleneck is not known (because of a lack of knowledge of the end-to-end path), then it should be assumed that the bottleneck is the slower of the two end systems' network interface cards.

For instance if one system is connected with Gigabit Ethernet, but the other one with Fast Ethernet and the measured round trip time is 150ms, then the window size should be set to 100 Mbit/s * 0.150s / 8 = 1875000 bytes, so setting the TCP window to a value of 2 MBytes would be a good choice.
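
The same calculation can be written as a small helper. This is a sketch with an illustrative function name; the inputs are the Fast Ethernet bottleneck and the 150 ms round-trip time from the example above.

```python
def bdp_bytes(bottleneck_bits_per_s, rtt_ms):
    """Bandwidth*delay product in bytes: the minimum useful TCP window."""
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes.
    return bottleneck_bits_per_s * rtt_ms / 8000


window = bdp_bytes(100_000_000, 150)
print(f"minimum window: {window:.0f} bytes")
```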

In theory the TCP throughput could reach, but not exceed, the available bandwidth on an end-to-end path. The knowledge of that network metric is therefore important for distinguishing between issues with the end system's TCP stacks, or network related problems.

Available bandwidth measurements

Iperf can be used in UDP mode to measure the available bandwidth. Only short measurements, in the range of 10 seconds, should be made so as not to disturb other production flows. The goal of the UDP measurements is to find the maximum UDP sending rate that results in almost no packet loss on the end-to-end path; in practice, a packet loss threshold of 1% is commonly used. UDP data transfers that result in higher packet loss are likely to disturb TCP production flows and should therefore be avoided.

A practicable procedure for finding the available bandwidth is to start with UDP data transfers of 10 s duration, with interim result reports at one-second intervals. The starting data rate should be slightly below the reported TCP throughput. If the measured packet loss is below the threshold, a new measurement with a slightly increased data rate can be started. This procedure of short UDP data transfers with increasing data rate should be repeated until the packet loss threshold is exceeded. Depending on the required accuracy, further tests can then be run, beginning with the highest data rate that kept packet loss below the threshold and using smaller rate increments. In the end, the maximum data rate that kept packet loss below the threshold can be taken as a good measurement of the available bandwidth on the end-to-end path.
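
The search procedure described above can be sketched as a loop. This is an illustration under stated assumptions: measure_loss is a hypothetical callable that runs one short UDP test at the given rate and returns the fraction of datagrams lost (for example by invoking iperf and reading its server report); all names, the halving of the step and the simulated path are invented here.

```python
def find_available_bandwidth(measure_loss, start_mbps, step_mbps, max_mbps,
                             loss_threshold=0.01, min_step_mbps=1.0):
    """Raise the UDP rate step by step until loss reaches the threshold.

    After a test that loses too much, the search continues from the best
    rate so far with half the step, until the step falls below
    min_step_mbps. Returns the highest rate that stayed below the
    threshold, or None if even the starting rate lost too much.
    """
    best = None
    rate = start_mbps
    step = step_mbps
    while step >= min_step_mbps and rate <= max_mbps:
        if measure_loss(rate) < loss_threshold:
            best = rate
            rate += step
        else:
            if best is None:
                return None
            step /= 2.0
            rate = best + step
    return best


def simulated_path(rate_mbps, capacity_mbps=300.0):
    """Stand-in for a real test: everything above the capacity is dropped."""
    if rate_mbps <= capacity_mbps:
        return 0.0
    return (rate_mbps - capacity_mbps) / rate_mbps


print(find_available_bandwidth(simulated_path, 200.0, 50.0, 1000.0))
```

Keeping the measurement behind a callable makes it easy to enforce the short test duration in one place and to try the search logic without sending any traffic.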

By comparing the reported application throughput with the measured TCP throughput and the measured available bandwidth, it is possible to distinguish between application problems, TCP stack problems, and network issues. Note, however, that the differing nature of UDP and TCP flows means that their measurements should not be directly compared. Iperf sends UDP datagrams at a constant, steady rate, whereas TCP tends to send packet trains. This means that TCP is likely to suffer from congestion effects at a lower data rate than UDP.

When available bandwidth measurements on the end-to-end path are unexpectedly low, network administrators will want to locate the bandwidth bottleneck. The best way to find it is from passively measured link utilisations and the capacities provided on all links along the path. However, if the path crosses multiple administrative domains, this is often not possible because of restrictions on obtaining those values from other domains. Therefore, it is common practice to use measurement workstations along the end-to-end path, and thus to split the end-to-end path into segments on which available bandwidth measurements are made. This way it is possible to identify the segment on which the bottleneck occurs and to concentrate on it during further troubleshooting.

Other iperf use cases

Besides the capability of measuring TCP throughput and available bandwidth, in UDP mode iperf can report on packet reordering and delay jitter.

Other use cases for measurements using iperf are IPv6 bandwidth measurements and IP multicast performance measurements. More information on iperf's features, as well as source and binary code for different UNIXes and Microsoft Windows operating systems, can be retrieved from the Iperf Web site.

Caveats and Known Issues

Impact on other traffic

As Iperf sends real, full data streams, it can reduce the available bandwidth on a given path. In TCP mode, the effect on co-existing production flows should be negligible, assuming the number of production flows is much greater than the number of test data flows, which is normally a valid assumption on paths through a WAN. In UDP mode, however, iperf has the potential to disturb production traffic, and in particular TCP streams, if the sender's data rate exceeds the available bandwidth on a path. Therefore, one should take particular care whenever running iperf tests in UDP mode.

TCP buffer allocation

On Linux systems, if you request a specific TCP buffer size with the "-w" option, the kernel will always try to allocate twice as many bytes as you specified.

Example: when you request a 2MB window size you'll receive 4MB:

welti@mamp1:~$ iperf -c ezmp3 -w 2M -i 1
------------------------------------------------------------
Client connecting to ezmp3, TCP port 5001
TCP window size: 4.00 MByte (WARNING: requested 2.00 MByte)    <<<<<<
------------------------------------------------------------
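The same doubling can be observed outside iperf with a few lines of Python that set and read back the SO_RCVBUF socket option. This is a minimal sketch: the requested size of 64 KB is an arbitrary example, the doubling is Linux behaviour, and on Linux the request is additionally capped by the net.core.rmem_max sysctl, so large requests may come back smaller than twice the requested value.

```python
import socket


def effective_rcvbuf(requested_bytes):
    """Request a receive buffer size and return what the kernel reports back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    requested = 65536
    print("requested:", requested, "effective:", effective_rcvbuf(requested))
```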

Counter overflow

Some versions seem to suffer from a 32-bit integer overflow which will lead to wrong results.

e.g.:

[ 14]  0.0-10.0 sec  953315416 Bytes  762652333 bits/sec
[ 14] 10.0-20.0 sec  1173758936 Bytes  939007149 bits/sec
[ 14] 20.0-30.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 30.0-40.0 sec  1173769072 Bytes  939015258 bits/sec
[ 14] 40.0-50.0 sec  1173783552 Bytes  939026842 bits/sec
[ 14] 50.0-60.0 sec  1173751696 Bytes  939001357 bits/sec
[ 14]  0.0-60.0 sec  2531115008 Bytes  337294201 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

As you can see, the 0-60 second summary doesn't match the average that one would expect. This is because the total number of bytes is wrong as a result of a counter wrap.

If you experience this kind of effect, upgrade to the latest version of iperf, which should have this bug fixed.
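The counter wrap in the example output can be checked with a little arithmetic. The Python sketch below sums the six interval byte counts from the output above and reduces the sum modulo 2^32. The result is within a fraction of a percent of the reported total; the interval reports presumably do not cover exactly the same bytes as the final summary, which is an assumption on our part.

```python
# Interval byte counts and reported total, copied from the iperf output above.
INTERVAL_BYTES = [953315416, 1173758936, 1173783552,
                  1173769072, 1173783552, 1173751696]
REPORTED_TOTAL = 2531115008
DURATION_S = 60


def wrap32(value):
    """Emulate an unsigned 32-bit counter."""
    return value % 2**32


total = sum(INTERVAL_BYTES)
print("sum of intervals      :", total)
print("wrapped to 32 bits    :", wrap32(total))
print("total reported        :", REPORTED_TOTAL)
print("expected rate (bits/s):", total * 8 / DURATION_S)
```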

UDP buffer sizing

Setting the UDP buffer size (the -w parameter) is also required for high-speed UDP transmissions. Otherwise the UDP receive buffer will typically overflow, which looks like packet loss in the iperf report, although tcpdump traces or interface counters show that all the data arrived. The overflow shows up in the UDP statistics (for example, on Linux, in the output of 'netstat -s' under Udp: ... receive packet errors). See more information at: http://www.29west.com/docs/THPM/udp-buffer-sizing.html
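On Linux, the counters behind 'netstat -s' can also be read from /proc/net/snmp, where the "Udp:" section consists of a line of counter names followed by a line of values. The following Python sketch parses that section; the sample text and its numbers are invented, and the set of columns varies between kernel versions. Comparing RcvbufErrors before and after a test run shows whether the receive buffer overflowed.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp: header/value line pair found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


# Made-up sample in the format of /proc/net/snmp (values are not real).
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 120034 12 17 98011 17 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
UdpLite: 0 0 0 0 0 0
"""

if __name__ == "__main__":
    counters = parse_udp_counters(SAMPLE)
    print("receive buffer errors:", counters["RcvbufErrors"])
    # On a Linux host, the live counters can be read like this:
    #   with open("/proc/net/snmp") as f:
    #       print(parse_udp_counters(f.read()))
```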

Control of measurements

There are two typical deployment scenarios, which differ in the kind of access the operator has to the sender and receiver instances. Measurements between well-located measurement workstations within one administrative domain, e.g. a campus network, give network administrators full control over the server and client configurations (including test schedules) and allow them to retrieve full measurement results. Measurements on paths that extend beyond the borders of the administrative domain require access to the far-end systems or collaboration with their administrators. Iperf implements two features that simplify its use in this scenario, so that the operator does not need an interactive login account on the far-end system:

  • The server instance may run as a daemon (option -D) listening on a configurable transport protocol port, and
  • It is possible to run bi-directional tests, either one after the other (option -r), or simultaneously (option -d).

Screen

Another method of running iperf on a *NIX device is to use 'screen'. Screen is a utility that lets you keep a session running even after you have logged out. It is described more fully in its man pages, but a simple sequence applicable to iperf would be as follows:

[user@host]$screen -d -m iperf -s -p 5002 
This starts iperf -s -p 5002 as a 'detached' session

[user@host]$screen -ls
There is a screen on:
        25278..mp1      (Detached)
1 Socket in /tmp/uscreens/S-admin.
'screen -ls' shows open sessions.

'screen -r' reconnects to a running session. When in that session, keying 'CTRL+a', then 'd' detaches the screen. You can, if you wish, log out, log back in again, and re-attach. To end the iperf session (and the screen), just hit 'CTRL+c' whilst attached.

Note that BWCTL offers additional control and resource limitation features that make it more suitable for use over administrative domains.

Related Work

Public iperf servers

There used to be some public iperf servers available, but at present none are known. Similar services are provided by BWCTL (see below) and by public NDT servers.

BWCTL

BWCTL (BandWidth test ConTroLler) is a wrapper around iperf that provides scheduling and remote control of measurements.

Instrumented iperf (iperf100)

The iperf code provided by NLANR/DAST was instrumented in order to provide more information to the user. Iperf100 displays various web100 variables at the end of a transfer.

Patches are available at http://www.csm.ornl.gov/~dunigan/netperf/download/

The instrumented iperf requires a machine running a kernel.org linux-2.X.XX kernel with the latest web100 patches applied (http://www.web100.org).

Jperf

Jperf is a Java-based graphical front-end to iperf. It is now included as part of the iperf project on SourceForge. This was once available as a separate project called xjperf, but that seems to have been given up in favor of iperf/SourceForge integration.

iPerf for Android

An Android version of iperf appeared on Google Play (formerly the Android Market) in 2010.

nuttcp

Similar to iperf, but with an additional control connection that makes it somewhat more versatile.

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 10 Jul 2005 (Great Plains server)
-- AnnHarding & OrlaMcGann - Aug 2005 (DS3.3.2 content)
-- SimonLeinen - 08 Feb 2006 (OpenSS7 variant, BWCTL pointer)
-- BartoszBelter - 28 Mar 2006 (iperf100)
-- ChrisWelti - 11 Apr 2006 (examples, 32-bit overflows, buffer allocation)
-- SimonLeinen - 01 Jun 2006 (integrated DS3.3.2 text from Ann and Orla)
-- SimonLeinen - 17 Sep 2006 (tracked iperf100 pointer)
-- PekkaSavola - 26 Mar 2008 (added warning about -c/s having to be first, a common gotcha)
-- PekkaSavola - 05 Jun 2008 (added discussion of '-l' parameter and its significance)
-- PekkaSavola - 30 Apr 2009 (added discussion of UDP (receive) buffer sizing significance)
-- SimonLeinen - 23 Feb 2012 (removed installation instructions, obsolete public iperf servers)
-- SimonLeinen - 22 May 2012 (added notes about Iperf 3, Jperf and iPerf for Android)
-- SimonLeinen - 01 Feb 2013 (added pointer to Android app; cross-reference to Nuttcp)
-- SimonLeinen - 06 April 2014 (updated Iperf 3 section: now on Github and with releases)
-- SimonLeinen - 05 May 2014 (updated Iperf 3 section: documented more new features)

BWCTL

BWCTL is a command line client application and a scheduling and policy daemon that wraps throughput measurement tools such as iperf, thrulay, and nuttcp (versions prior to 1.3 only support iperf).

More Information: For configuration, common problems, examples etc
Description of bwctld.limits parameters

A typical BWCTL result looks like one of the two print outs below. The first print out shows a test run from the local host to 193.136.3.155 ('-c' stands for 'collector'). The second print out shows a test run to the localhost from 193.136.3.155 ('-s' stands for 'sender').

[user@ws4 user]# bwctl -c 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497760.433479: iperf -B 193.136.3.155 -P 1 -s -f b -m -p 5001 -t 10
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 193.136.3.155
TCP window size: 65536 Byte (default)
------------------------------------------------------------
[ 14] local 193.136.3.155 port 5001 connected with 62.40.108.82 port 35360
[ 14]  0.0-10.0 sec  16965632 Bytes  13547803 bits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
RECEIVER END


[user@ws4 user]# bwctl -s 193.136.3.155
bwctl: 17 seconds until test results available
RECEIVER START
3339497801.610690: iperf -B 62.40.108.82 -P 1 -s -f b -m -p 5004 -t 10
------------------------------------------------------------
Server listening on TCP port 5004
Binding to local address 62.40.108.82
TCP window size: 87380 Byte (default)
------------------------------------------------------------
[ 16] local 62.40.108.82 port 5004 connected with 193.136.3.155 port 5004
[ ID] Interval       Transfer     Bandwidth
[ 16]  0.0-10.0 sec  8298496 Bytes  6622281 bits/sec
[ 16] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
[ 16] Read lengths occurring in more than 5% of reads:
[ 16]  1448 bytes read  5708 times (99.8%)
RECEIVER END

The read lengths are shown when the test is for received traffic. In his e-mail to TobyRodwell on 28 March 2005 Stanislav Shalunov of Internet2 writes:

"The values are produced by Iperf and reproduced by BWCTL.

An Iperf server gives you several most frequent values that the read() system call returned during a TCP test, along with the percentages of all read() calls for which the given read() lengths accounted.

This can let you judge the efficiency of interrupt coalescence implementations, kernel scheduling strategies and other such esoteric applesauce. Basically, reads shorter than 8kB can put quite a bit of load on the CPU (each read() call involves several microseconds of context switching). If the target rate is beyond a few hundred megabits per second, one would need to pay attention to read() lengths."
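To illustrate where such read-length statistics come from, the following Python sketch counts the lengths returned by successive recv() calls on a socket and lists those that account for more than 5% of all reads. It is not Iperf's implementation, just a minimal reconstruction of the idea; the function names and the 64 KB read buffer are our own choices.

```python
import socket
from collections import Counter


def read_length_histogram(sock, bufsize=65536):
    """Read until EOF and count how often each read length occurred."""
    counts = Counter()
    while True:
        data = sock.recv(bufsize)
        if not data:
            break
        counts[len(data)] += 1
    return counts


def frequent_lengths(counts, min_share=0.05):
    """Return (length, reads, share) for lengths above min_share of all reads."""
    total = sum(counts.values())
    return [(length, n, n / total)
            for length, n in counts.most_common()
            if n / total > min_share]


if __name__ == "__main__":
    # Local demonstration over a connected socket pair.
    sender, receiver = socket.socketpair()
    sender.sendall(b"x" * 1000)
    sender.close()
    for length, n, share in frequent_lengths(read_length_histogram(receiver)):
        print("%d bytes read %d times (%.1f%%)" % (length, n, 100 * share))
    receiver.close()
```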

-- TobyRodwell - 28 Oct 2005
-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 22 Apr 2008 - 07 Dec 2008

Netperf

A client/server network performance benchmark (debian package: netperf).

Example

coming soon

Flent (previously netperf-wrapper)

For his measurement work on fq_codel, Toke Høiland-Jørgensen developed Python wrappers for Netperf and added some tests, notably the RRUL (Realtime Response Under Load) test, which runs bulk TCP transfers in parallel with UDP and ICMP response-time measurements. This can be used to expose Bufferbloat on a path.

These tools used to be known as "netperf-wrappers", but were renamed to Flent (the Flexible Network Tester) in early 2015.

-- FrancoisXavierAndreu & SimonMuyal - 2005-06-06
-- SimonLeinen - 2013-03-30 - 2015-06-01

RUDE/CRUDE

RUDE is a package of applications to generate and measure UDP traffic between two endpoints (hosts). The rude tool generates traffic to the network, which can be received and logged on the other side of the network with crude. The traffic pattern can be defined by the user. This tool is available under http://rude.sourceforge.net/

Example

The following rude script will cause a constant-rate packet stream to be sent to a destination for 4 seconds, starting one second (1000 milliseconds) from the time the script was read, stopping five seconds (5000 milliseconds) after the script was read. The script can be called using rude -s example.rude.

START NOW
1000 0030 ON 3002 10.1.1.1:10001 CONSTANT 200 250
5000 0030 OFF

Explanation of the values: 1000 and 5000 are relative timestamps (in milliseconds) for the actions. 0030 is the identification of a traffic flow. At 1000, a CONSTANT-rate flow is started (ON) with source port 3002, destination address 10.1.1.1 and destination port 10001. CONSTANT 200 250 specifies a fixed rate of 200 packets per second (pps) of 250-byte UDP datagrams.
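The offered load of such a flow follows directly from these values. The Python sketch below parses the ON and OFF lines of the example script and computes duration, packet count and payload data rate. It only handles this simple CONSTANT case, the function name is made up, and the rate counts UDP payload only (IP and UDP headers are not included).

```python
def constant_flow_load(on_line, off_line):
    """Compute duration, packet count and payload rate of a CONSTANT flow."""
    start_ms, flow_id, action, _src_port, _dest, kind, pps, size = on_line.split()
    stop_ms, off_id, off_action = off_line.split()
    if (action, kind, off_action) != ("ON", "CONSTANT", "OFF") or flow_id != off_id:
        raise ValueError("expected matching ON ... CONSTANT and OFF lines")
    duration_s = (int(stop_ms) - int(start_ms)) / 1000
    return {
        "duration_s": duration_s,
        "packets": int(int(pps) * duration_s),
        "payload_bits_per_s": int(pps) * int(size) * 8,
    }


if __name__ == "__main__":
    print(constant_flow_load(
        "1000 0030 ON 3002 10.1.1.1:10001 CONSTANT 200 250",
        "5000 0030 OFF",
    ))
```

For the example script this gives a 4-second flow of 800 packets at 400 kb/s of UDP payload.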

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 28 Sep 2008

TTCP

TTCP (Test TCP) is a utility for benchmarking UDP and TCP performance. A version for Unix and Windows is available at http://www.pcausa.com/Utilities/pcattcp.htm/

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

NDT (Network Diagnostic Tool)

NDT can be used to check the TCP configuration of any host that can run Java applets. The client connects to a Web page containing a special applet. The Web server that serves the applet must run a kernel with the Web100 extensions for TCP measurement instrumentation. The applet performs TCP memory-to-memory tests between the client and the Web100-instrumented server, and then uses the measurements from the Web100 instrumentation to find out about the TCP configuration of the client. NDT also detects common configuration errors such as duplex mismatch.

Besides the applet, there is also a command-line client called web100clt that can be used without a browser. The client works on Linux without any need for Web100 extensions, and can be compiled on other Unix systems as well. It doesn't require a Web server on the server side, just the NDT server web100srv - which requires a Web100-enhanced kernel.

Applicability

Since NDT performs memory-to-memory tests, it avoids end-system bottlenecks such as file system or disk limitations. Therefore it is useful for estimating "pure" TCP throughput limitations for a system, better than, for instance, measuring the throughput of a large file retrieval from a public FTP or WWW server. In addition, NDT servers are supposed to be well-connected and lightly loaded.

When trying to tune a host for maximum throughput, it is a good idea to start testing it against an NDT server that is known to be relatively close to the host in question. Once the throughput with this server is satisfactory, one can try to further tune the host's TCP against a more distant NDT server. Several European NRENs as well as many sites in the U.S. operate NDT servers.

Example

Applet (tcp100bw.html)

This example shows the results (overview) of an NDT measurement between a client in Switzerland (Gigabit Ethernet connection) and an IPv6 NDT server in Ljubljana, Slovenia (Gigabit Ethernet connection).

  • Screenshot: NDT applet results overview

Command-line client

The following is an example run of the web100clt command-line client application. In this case, the client runs on a small Linux-based home router connected to a commercial broadband-over-cable Internet connection. The connection is marketed as "5000 kb/s downstream, 500 kb/s upstream", and it can be seen that this nominal bandwidth can in fact be achieved for an individual TCP connection.

$ ./web100clt -n ndt.switch.ch
Testing network path for configuration and performance problems  --  Using IPv6 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . .  Done
checking for firewalls . . . . . . . . . . . . . . . . . . .  Done
running 10s outbound test (client to server) . . . . .  509.00 kb/s
running 10s inbound test (server to client) . . . . . . 5.07 Mb/s
Your host is connected to a Cable/DSL modem
Information [C2S]: Excessive packet queuing detected: 10.18% (local buffers)
Information [S2C]: Excessive packet queuing detected: 73.20% (local buffers)
Server 'ndt.switch.ch' is not behind a firewall. [Connection to the ephemeral port was successful]
Client is probably behind a firewall. [Connection to the ephemeral port failed]
Information: Network Middlebox is modifying MSS variable (changed to 1440)
Server IP addresses are preserved End-to-End
Client IP addresses are preserved End-to-End

Note the remark "-- Using IPv6 address". The example used the new version 3.3.12 of the NDT software, which includes support for both IPv4 and IPv6. Because both the client and the server support IPv6, the test is run over IPv6. To run an IPv4 test in this situation, one could have called the client with the -4 option, i.e. web100clt -4 -n ndt.switch.ch.

Public Servers

There are many public NDT server installations available on the Web. You should choose between the servers according to distance (in terms of RTT) and the throughput range you are interested in. The limits of your local connectivity can best be tested by connecting to a server with a fast(er) connection that is close by. If you want to tune your TCP parameters for "LFNs", use a server that is far away from you, but still reachable over a path without bottlenecks.
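When choosing a distant server for "LFN" tuning, it helps to know the bandwidth-delay product (BDP) of the path, since that is roughly the TCP window needed to fill it. A minimal Python sketch follows; the two example paths are hypothetical.

```python
def bdp_bytes(rate_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes: rate (bit/s) times RTT (ms)."""
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes.
    return rate_bits_per_s * rtt_ms // 8000


if __name__ == "__main__":
    # Hypothetical paths: a nearby server and a far-away one, both at 1 Gb/s.
    for label, rtt_ms in (("nearby server", 2), ("distant server", 100)):
        print("%-15s RTT %3d ms -> window of about %d bytes"
              % (label, rtt_ms, bdp_bytes(10**9, rtt_ms)))
```

At 1 Gb/s, a 2 ms path needs a window of about 250 KB, while a 100 ms path needs about 12.5 MB, which is why tests against a distant server expose window limits that a nearby server does not.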

Development

Development of NDT is now hosted on Google Code, and as of July 2011 there is some activity. Notably, there is now documentation about the architecture as well as the detailed protocol of NDT. One of the stated goals is to allow outside parties to write compatible NDT clients without consulting the NDT source code.

-- SimonLeinen - 30 Jun 2005 - 22 Nov 2011

Active Measurement Tools

Active measurement injects traffic into a network to measure properties about that network.

  • ping - a simple RTT and loss measurement tool using ICMP ECHO messages
  • fping - ping variant supporting concurrent measurement to multiple destinations
  • SmokePing - nice graphical representation of RTT distribution and loss rate from periodic "pings"
  • OWAMP - a protocol and tool for one-way measurements

There are several permanent infrastructures performing active measurements:

  • HADES DFN-developed HADES Active Delay Evaluation System, deployed in GEANT/GEANT2 and DFN
  • RIPE TTM
  • QoSMetricsBoxes used in RENATER's Métrologie project

-- SimonLeinen - 29 Mar 2006

ping

Ping sends ICMP echo request messages to a remote host and waits for ICMP echo replies to determine the latency between those hosts. The output shows the Round Trip Time (RTT) between the host machine and the remote target. Ping is often used to determine whether a remote host is reachable. Unfortunately, it is quite common these days for ICMP traffic to be blocked by packet filters / firewalls, so a ping timing out does not necessarily mean that a host is unreachable.

(Another method for checking remote host availability is to telnet to a port that you know to be accessible, such as port 80 (HTTP) or 25 (SMTP). If the connection attempt times out, then the host is probably not reachable, often because a filter/firewall is silently blocking the traffic. If the host is reachable but no service is listening on that port, the attempt usually results in a "Connection refused" message. If a connection to the remote host is established, it can be ended by typing the escape character (usually 'CTRL+]') and then quit.)
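The same check can be scripted. The following Python sketch uses the standard socket module to attempt a TCP connection and distinguishes the three outcomes described above; the function name, the returned strings and the 5-second default timeout are our own choices.

```python
import socket


def probe_tcp_port(host, port, timeout=5.0):
    """Try a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # a service accepted the connection
    except ConnectionRefusedError:
        return "refused"           # host reachable, nothing listening
    except socket.timeout:
        return "timeout"           # no answer, possibly filtered
    except OSError as exc:
        return "error: %s" % exc   # e.g. name resolution or routing failure


if __name__ == "__main__":
    print(probe_tcp_port("127.0.0.1", 80, timeout=2.0))
```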

The -f flag may be specified to send ping packets as fast as they come back, or 100 times per second, whichever is more frequent. As this option can be very hard on the network, only a super-user (the root account on *NIX machines) is allowed to specify this flag in a ping command.

The -c flag specifies the number of Echo request messages sent to the remote host by ping. If this flag isn't used, then the ping continues to send echo request messages until the user types CTRL C. If the ping is cancelled after only a few messages have been sent, the RTT summary statistics that are displayed at the end of the ping output may not have finished being calculated and won't be completely accurate. To gain an accurate representation of the RTT, it is recommended to set a count of 100 pings. The MS Windows implementation of ping just sends 4 echo request messages by default.

Ping6 is the IPv6 implementation of the ping program. It works in the same way, but sends ICMPv6 Echo Request packets and waits for ICMPv6 Echo Reply packets to determine the RTT between two hosts. There are no discernible differences between the Unix implementations and the MS Windows implementation.

As with traceroute, Solaris integrates IPv4 and IPv6 functionality in a single ping binary, and provides options to select between them (-A <AF>).

See fping for a ping variant that supports concurrent measurements to multiple destinations and is easier to use in scripts.

Implementation-specific notes

-- TobyRodwell - 06 Apr 2005
-- MarcoMarletta - 15 Nov 2006 (Added the difference in size between Juniper and Cisco)

fping

fping is a ping-like program which uses Internet Control Message Protocol (ICMP) echo requests to determine whether a host is up. fping differs from ping in that you can specify any number of hosts on the command line, or specify a file containing the list of hosts to ping. Instead of trying one host until it times out or replies, fping sends out a ping packet and moves on to the next host in a round-robin fashion. If a host replies, it is noted and removed from the list of hosts to check. If a host does not respond within a certain time limit and/or retry limit, it is considered unreachable.

Unlike ping, fping is meant to be used in scripts and its output is easy to parse. It is often used as a probe for packet loss and round-trip time in SmokePing.

fping version 3

The original fping was written by Roland Schemers in 1992, and stopped being updated in 2002. In December 2011, David Schweikert decided to take up maintainership of fping, and increased the major release number to 3, mainly to reflect the change of maintainer. Changes from earlier versions include:

  • integration of all Debian patches, including IPv6 support (2.4b2-to-ipv6-16)
  • an optimized main loop that is claimed to bring performance close to the theoretical maximum
  • patches to improve SmokePing compatibility

The Web site location has changed to fping.org, and the source is now maintained on GitHub.

Since July 2013, there is also a Mailing List, which will be used to announce new releases.

Examples

Simple example of usage:

# fping -c 3 -s www.man.poznan.pl www.google.pl
www.google.pl     : [0], 96 bytes, 8.81 ms (8.81 avg, 0% loss)
www.man.poznan.pl : [0], 96 bytes, 37.7 ms (37.7 avg, 0% loss)
www.google.pl     : [1], 96 bytes, 8.80 ms (8.80 avg, 0% loss)
www.man.poznan.pl : [1], 96 bytes, 37.5 ms (37.6 avg, 0% loss)
www.google.pl     : [2], 96 bytes, 8.76 ms (8.79 avg, 0% loss)
www.man.poznan.pl : [2], 96 bytes, 37.5 ms (37.6 avg, 0% loss)

www.man.poznan.pl : xmt/rcv/%loss = 3/3/0%, min/avg/max = 37.5/37.6/37.7
www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81

       2 targets
       2 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
       6 ICMP Echos sent
       6 ICMP Echo Replies received
       0 other ICMP received

 8.76 ms (min round trip time)
 23.1 ms (avg round trip time)
 37.7 ms (max round trip time)
        2.039 sec (elapsed real time)
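Because the per-host summary lines have a fixed layout, they are easy to process in scripts. The following Python sketch parses lines of the form shown above with a regular expression. The pattern is derived only from this example output; that the min/avg/max part is missing for hosts that never answered is an assumption, and other fping versions or options may format their output differently.

```python
import re

SUMMARY_RE = re.compile(
    r"^(?P<host>\S+)\s*:\s*xmt/rcv/%loss = "
    r"(?P<xmt>\d+)/(?P<rcv>\d+)/(?P<loss>\d+)%"
    r"(?:, min/avg/max = (?P<min>[\d.]+)/(?P<avg>[\d.]+)/(?P<max>[\d.]+))?$"
)


def parse_fping_summary(line):
    """Parse one per-host summary line; return a dict, or None if no match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    result = {
        "host": match.group("host"),
        "sent": int(match.group("xmt")),
        "received": int(match.group("rcv")),
        "loss_percent": int(match.group("loss")),
    }
    if match.group("min") is not None:
        result.update(rtt_min=float(match.group("min")),
                      rtt_avg=float(match.group("avg")),
                      rtt_max=float(match.group("max")))
    return result


if __name__ == "__main__":
    line = "www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81"
    print(parse_fping_summary(line))
```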

IPv6 Support

Jeroen Massar has added IPv6 support to fping. This has been implemented as a compile-time variant, so that there are separate fping (for IPv4) and fping6 (for IPv6) binaries. The IPv6 patch has been partially integrated into the fping version on www.fping.com as of release "2.4b2_to-ipv6" (thus also integrated in fping 3.0). Unfortunately his modifications to the build routine seem to have been lost in the integration, so that the fping.com version only installs the IPv6 version as fping. Jeroen's original version doesn't have this problem, and can be downloaded from his IPv6 Web page.

ICMP Sequence Number handling

Older versions of fping used the Sequence Number field in ICMP ECHO requests in a peculiar way: they used a different sequence number for each destination host, but the same sequence number for all requests to a specific host. There have been reports of systems that suppress (or rate-limit) ICMP ECHO requests with repeated sequence numbers, which causes tools that use fping, such as SmokePing, to report high loss rates. Another issue is that fping could not distinguish a perfect link from one that drops every other packet and duplicates each of the remaining ones.

Newer fping versions such as 3.0 or 2.4.2b2_to (on Debian GNU/Linux) include a change to sequence number handling attributed to Stephan Fuhrmann. These versions increment sequence numbers for every probe sent, which should solve both of these problems.
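The second problem can be made concrete with a small simulation. In the Python sketch below, a "link" is modelled as a function from the list of probe sequence numbers to the list of replies that come back; this model and all names in it are illustrative and do not reflect fping's code. With a constant sequence number, the perfect link and the drop-and-duplicate link look identical; with incrementing numbers, the losses and duplicates become visible.

```python
from collections import Counter


def perfect_link(probes):
    """Every probe is answered exactly once."""
    return list(probes)


def drop_and_duplicate_link(probes):
    """Every other probe is dropped; each remaining one is answered twice."""
    replies = []
    for index, seq in enumerate(probes):
        if index % 2 == 0:
            replies += [seq, seq]
    return replies


def loss_and_duplicates(sent, replies):
    """Match replies to probes by sequence number only."""
    expected, received = Counter(sent), Counter(replies)
    lost = sum(max(0, expected[s] - received[s]) for s in expected)
    duplicates = sum(max(0, received[s] - expected[s]) for s in received)
    return lost, duplicates


if __name__ == "__main__":
    constant = [7] * 10             # old behaviour: same number for one host
    incrementing = list(range(10))  # new behaviour: one number per probe
    for name, probes in (("constant", constant), ("incrementing", incrementing)):
        for link in (perfect_link, drop_and_duplicate_link):
            print(name, link.__name__, loss_and_duplicates(probes, link(probes)))
```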

-- BartoszBelter - 2005-07-14 - 2005-07-26
-- SimonLeinen - 2008-05-19 - 2013-07-26

SmokePing

Smokeping is a software-based measurement framework that uses various software modules (called probes) to measure round trip times, packet losses and availability at layer 3 (IPv4 and IPv6), and even application latencies. Layer 3 measurements are based on the ping tool; for the analysis of applications there are probes that measure, for example, DNS lookup and RADIUS authentication latencies. The measurements are centrally controlled from a single host on which the software probes are started, which in turn emit active measurement flows. The load these flows place on the network is usually negligible.

As with MRTG, the results are stored in RRD databases at the original polling intervals for 24 hours and then aggregated over time; the peak and mean values over 24-hour intervals are kept for more than one year. Like MRTG, the results are usually displayed in a web browser, using HTML-embedded graphs in daily, weekly, monthly and yearly timeframes. A particular strength of SmokePing is the graphical manner in which it displays the statistical distribution of latency values over time.

The tool also has an alarm feature that, based on flexible threshold rules, either sends out emails or runs external scripts.

The following picture shows an example output of a weekly graph. The background colour indicates the link availability, and the foreground lines display the mean round trip times. The shadows around the lines graphically indicate the statistical distribution of the measured round-trip times.

  • Example SmokePing graph

The 2.4* versions of Smokeping included SmokeTrace, an AJAX-based traceroute tool similar to mtr, but browser/server based. This was removed in 2.5.0 in favor of a separate, more general, tool called remOcular.

In August 2013, Tobi Oetiker announced that he received funding for the development of SmokePing 3.0. This will use Extopus as a front-end. This new version will be developed in public on Github. Another plan is to move to an event-based design that will make Smokeping more efficient and allow it to scale to a large number of probes.

Related Tools

The OpenNMS network management platform includes a tool called StrafePing which is heavily inspired by SmokePing.

-- SimonLeinen - 2006-04-07 - 2013-11-03

OWAMP (One-Way Active Measurement Protocol)

OWAMP is a command line client application and a policy daemon used to determine one-way latencies between hosts. It is an implementation of the OWAMP protocol (see references), which was developed within the IPPM WG in the IETF and published as RFC 4656.

With roundtrip-based measurements, it is hard to isolate the direction in which congestion is experienced. One-way measurements solve this problem and make the direction of congestion immediately apparent. Since traffic can be asymmetric at many sites that are primarily producers or consumers of data, this allows for more informative measurements. One-way measurements allow the user to better isolate the effects of specific parts of a network on the treatment of traffic.

The current OWAMP implementation (V3.1) supports IPv4 and IPv6.

Example using owping (OWAMP v3.1) on a linux host:

welti@atitlan:~$ owping ezmp3
Approximately 13.0 seconds until results available

--- owping statistics from [2001:620:0:114:21b:78ff:fe30:2974]:52887 to [ezmp3]:41530 ---
SID:    feae18e9cd32e3ee5806d0d490df41bd
first:  2009-02-03T16:40:31.678
last:   2009-02-03T16:40:40.894
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 2.4/2.5/3.39 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering


--- owping statistics from [ezmp3]:41531 to [2001:620:0:114:21b:78ff:fe30:2974]:42315 ---
SID:    fe302974cd32e3ee7ea1a5004fddc6ff
first:  2009-02-03T16:40:31.479
last:   2009-02-03T16:40:40.685
100 sent, 0 lost (0.000%), 0 duplicates
one-way delay min/median/max = 1.99/2.1/2.06 ms, (err=0.43 ms)
one-way jitter = 0 ms (P95-P50)
TTL not reported
no reordering
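The output above labels the jitter figure as "(P95-P50)", i.e. the 95th percentile of the one-way delays minus their median. The following Python sketch computes such a figure from a list of delay samples. It uses the nearest-rank percentile definition, which is an assumption; owping's exact method may differ.

```python
def percentile(samples, p):
    """Nearest-rank percentile (p is an integer from 1 to 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # ceil(p * n / 100)
    return ordered[rank - 1]


def jitter_p95_p50(delays_ms):
    """Jitter as the difference between the 95th percentile and the median."""
    return percentile(delays_ms, 95) - percentile(delays_ms, 50)


if __name__ == "__main__":
    # Made-up delay samples in milliseconds.
    samples = [2.4, 2.5, 2.5, 2.5, 2.6, 2.5, 2.4, 2.5, 3.39, 2.5]
    print("jitter (P95-P50): %.2f ms" % jitter_p95_p50(samples))
```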

References

-- ChrisWelti - 03 Apr 2006 - 03 Feb 2009
-- SimonLeinen - 29 Mar 2006 - 01 Dec 2008

Active Measurement Boxes

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

HADES

Hades Active Delay Evaluation System (HADES) devices (previously called IPPM devices) were developed by the WiN-Labor at RRZE (Regional Computing Centre Erlangen) to provide QoS measurements in DFN's G-WiN infrastructure, based on the metrics developed by the IETF IPPM WG. In addition to DFN's backbone, HADES devices have been deployed in numerous GEANT/GEANT2 Points of Presence (PoPs).

Example Output

The examples that follow were generated using the new (experimental) HADES front-end in early June 2006.

One-Way Delay Plots

A typical one-way delay plot looks like this.

  • IPPM Example: FCCN-GARR one-way delays

The default view selects the scale on the y (delay) axis so that all values fit in. In this case we see that there were two cases (around 12:30 and around 18:10) where there were "outliers" with delays up to 190ms, much higher than the average of about 30ms.

In the presence of such outliers, the auto-scaling feature makes it hard to discern variations in the non-outlying values. Fortunately, the new interface allows us to select the y axis scale by hand, so that we can zoom in to the delay range that we are interested in:

  • IPPM Example: FCCN-GARR one-way delay, narrower delay scale

It now becomes evident that most of the samples lie in a very narrow band between 29.25 and 29.5 milliseconds. Interestingly, there are a few outliers towards lower delays. These could be artifacts of clock inaccuracies, or they could point to some kind of routing anomaly, although the latter seems less probable, because routing anomalies normally don't lead to shorter delays.

Concluding Remarks

Note that the IPPM system is being developed for the NREN community, hence its feature development focuses on that community's specific needs; for example, it implements measurement of out-of-order packets as well as metric measurements for different IP Precedence values.

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 01 Jun 2006

RIPE TTM Boxes

Summary

The Test Traffic Measurements (TTM) service has been offered by the RIPE NCC since the year 2000. The system continuously records one-way delay and packet-loss measurements as well as router-level paths ("traceroutes") between a large set of probe machines ("test boxes"). The test boxes are hosted by many different organisations, including NRENs, commercial ISPs, universities and others, and are usually maintained by RIPE NCC as part of the service. While the vast majority of the test boxes are in Europe, there are a couple of machines on other continents, including outside the RIPE NCC's service area. Every test box includes a precise time source, typically a GPS receiver, so that accurate one-way delay measurements can be provided. Measurement data are entered in a central database at RIPE NCC's premises every night. The database is based on CERN's ROOT system. Measurement results can be retrieved over the Web using various presentations, both pre-generated and "on-demand".

Applicability

RIPE TTM data is often useful to find out "historical" quality information about a network path of interest, provided that TTM test boxes are deployed near (in the topological sense, not just geographically) the ends of the path. For research network paths throughout Europe, the coverage of the RIPE TTM infrastructure is not complete, but quite comprehensive. When one suspects that there is (non-temporary) congestion on a path covered by the RIPE TTM measurement infrastructure, the TTM graphs can be used to easily verify that, because such congestion will show up as delay and/or loss if present.

The RIPE TTM system can also be used by operators to precisely measure - but only after the fact - the impact of changes in the network. These changes can include disturbances such as network failures or misconfigurations, but also things such as link upgrades or routing improvements.

Notes

The TTM project has provided extremely high-quality and stable delay and loss measurements for a large set of network paths (IPv4 and recently also IPv6) throughout Europe. These paths cover an interesting mix of research and commercial networks. The Web interface to the collected measurements supports in-depth exploration quite well, such as looking at the delay/loss evolution of a specific path over a wide range of intervals, from the very short to the very long. On the other hand, it is hard to get at useful "overview" pictures. The RIPE NCC's DNS Server Monitoring (DNSMON) service provides such overviews for specific sets of locations.

The raw data collected by the RIPE TTM infrastructure is not generally available, in part because of restrictions on data disclosure. This is understandable in that RIPE NCC's main member/customer base consists of commercial ISPs, who presumably wouldn't allow full disclosure for competitive reasons. But it also means that in practice, only the RIPE NCC can develop new tools that make use of these data, and their resources for doing so are limited, in part due to lack of support from their ISP membership. On a positive note, the RIPE NCC has, on several occasions, given scientists access to the measurement infrastructure for research. The TTM team also offers to extract raw data (in ROOT format) from the TTM database manually on demand.

Another drawback of RIPE TTM is that the central TTM database is only updated once a day. This makes the system impossible to use for near-real-time diagnostic purposes. (For test box hosts, it is possible to access the data collected by one's hosted test boxes in near real time, but this provides only very limited functionality, because only information about "inbound" packets is available locally, and therefore one can neither correlate delays with path changes, nor compute loss figures without access to the sending host's data.) In contrast, RIPE NCC's Routing Information Service (RIS) infrastructure provides several ways to get at the collected data almost as it is collected. And because RIS' raw data is openly available in a well-defined and relatively easy-to-use format, it is possible for third parties to develop innovative and useful ways to look at that data.

Examples

The following screenshot shows an on-demand graph of the delays over a 24-hour period from one test box hosted by SWITCH (tt85 at ETH Zurich) to another one (tt86 at the University of Geneva). The delay range has been narrowed so that the fine-grained distribution can be seen. Five packets are listed as "overflow" under "STATISTICS - Delay & Hops" because their one-way delays exceeded the range of the graph. This is an example of an almost completely uncongested path, showing no packet loss and the vast majority of delays very close to the "baseline" value.

ripe-ttm-sample-delay-on-demand.png

The following plot shows the impact of a routing change on a path to a test box in New Zealand (tt47). The hop count decreased by one, but the base one-way delay increased by about 20ms. The old and new routes can be retrieved from the Web interface too. Note also that the delay values are spread out over a fairly large range, which indicates that there is some congestion (and thus queueing) on the path. The loss rate over the entire day is 1.1%. An interesting question is how much of this is due to the routing change and how much (if any) is due to "normal" congestion. The lower right graph shows "arrived/lost" packets over time and indicates that most packets were in fact lost during the path change - so although some congestion is visible in the delays, it is not severe enough to cause visible loss. On the other hand, the total number of probe packets sent (about 2850 per day) means that congestion loss would probably go unnoticed until it reaches a level of 0.1% or so.

day.tt85.tt47.gif
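
The remark about probe counts can be made concrete with a little arithmetic. The sketch below is purely illustrative (the function names are not part of any TTM tool). It takes the figure of about 2850 probes per day quoted above and computes how many lost probes a given loss rate corresponds to, as well as the smallest non-zero loss rate that one day of probes can resolve.

```python
PROBES_PER_DAY = 2850  # approximate figure quoted in the text


def lost_packets(loss_rate, probes=PROBES_PER_DAY):
    """Expected number of lost probes for a loss rate given as a fraction."""
    return loss_rate * probes


def loss_resolution(probes=PROBES_PER_DAY):
    """Smallest non-zero loss rate observable: exactly one lost probe."""
    return 1.0 / probes


if __name__ == "__main__":
    print("1.1%% loss = about %.0f probes" % lost_packets(0.011))
    print("0.1%% loss = about %.1f probes" % lost_packets(0.001))
    print("one lost probe = %.3f%% loss" % (100 * loss_resolution()))
```

At 0.1% loss only about three probes per day would be missing, which is hard to distinguish from isolated, unrelated losses.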

References

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005

QoSMetrics Boxes

This measurement infrastructure was built as part of RENATER's Métrologie efforts. The infrastructure performs continuous delay measurement between a mesh of measurement points on the RENATER backbone. The measurements are sent to a central server every minute, and are used to produce a delay matrix and individual delay histories (IPv4 and IPv6 measurements, soon for each CoS).

Please note that all results (tables and graphs) presented on this page are generated by RENATER's scripts from the QoSMetrics MySQL database.

Screendump from RENATER site (http://pasillo.renater.fr/metrologie/get_qosmetrics_results.php)

new_Renater_IPPM_results_screendump.jpg

Example of asymmetric path:

(after the problem was resolved, only one path had its hop count decreased)

Graphs:

Paris_Toulouse_delay Toulouse_Paris_delay
Paris_Toulouse_jitter Toulouse_Paris_jitter
Paris_Toulouse_hop_number Toulouse_Paris_hop_number
Paris_Toulouse_pktsLoss Toulouse_Paris_pktsLoss

Traceroute results:

traceroute to 193.51.182.xx (193.51.182.xx), 30 hops max, 38 byte packets
1 209.renater.fr (195.98.238.209) 0.491 ms 0.440 ms 0.492 ms
2 gw1-renater.renater.fr (193.49.159.249) 0.223 ms 0.219 ms 0.252 ms
3 nri-c-g3-0-50.cssi.renater.fr (193.51.182.6) 0.726 ms 0.850 ms 0.988 ms
4 nri-b-g14-0-0-101.cssi.renater.fr (193.51.187.18) 11.478 ms 11.336 ms 11.102 ms
5 orleans-pos2-0.cssi.renater.fr (193.51.179.66) 50.084 ms 196.498 ms 92.930 ms
6 poitiers-pos4-0.cssi.renater.fr (193.51.180.29) 11.471 ms 11.459 ms 11.354 ms
7 bordeaux-pos1-0.cssi.renater.fr (193.51.179.254) 11.729 ms 11.590 ms 11.482 ms
8 toulouse-pos1-0.cssi.renater.fr (193.51.180.14) 17.471 ms 17.463 ms 17.101 ms
9 xx.renater.fr (193.51.182.xx) 17.598 ms 17.555 ms 17.600 ms

[root@CICT root]# traceroute 195.98.238.xx
traceroute to 195.98.238.xx (195.98.238.xx), 30 hops max, 38 byte packets
1 toulouse-g3-0-20.cssi.renater.fr (193.51.182.202) 0.200 ms 0.189 ms 0.111 ms
2 bordeaux-pos2-0.cssi.renater.fr (193.51.180.13) 16.850 ms 16.836 ms 16.850 ms
3 poitiers-pos1-0.cssi.renater.fr (193.51.179.253) 16.728 ms 16.710 ms 16.725 ms
4 nri-a-pos5-0.cssi.renater.fr (193.51.179.17) 22.969 ms 22.956 ms 22.972 ms
5 nri-c-g3-0-0-101.cssi.renater.fr (193.51.187.21) 22.972 ms 22.961 ms 22.844 ms
6 gip-nri-c.cssi.renater.fr (193.51.182.5) 17.603 ms 17.582 ms 17.616 ms
7 250.renater.fr (193.49.159.250) 17.836 ms 17.718 ms 17.719 ms
8 xx.renater.fr (195.98.238.xx) 17.606 ms 17.707 ms 17.226 ms

-- FrancoisXavierAndreu & SimonLeinen - 06 Jun 2005 - 06 Jul 2005 - 07 Apr 2006

Passive Measurement Tools

Network Usage:

  • SNMP-based tools retrieve network information such as state and utilization of links, router CPU loads, etc.: MRTG, Cricket
  • Netflow-based tools use flow-based accounting information from routers for traffic analysis, detection of routing and security problems, denial-of-service attacks etc.

Traffic Capture and Analysis Tools

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006

SNMP-based tools

The Simple Network Management Protocol (SNMP) is widely used for device monitoring over IP, especially for monitoring infrastructure components of IP networks, such as routers, switches etc. There are many tools that use SNMP to retrieve network information such as the state and utilization of links, routers' CPU load, etc., and generate various types of visualizations, other reports, and alarms.

  • MRTG (Multi-Router Traffic Grapher) is a widely-used open-source tool that polls devices every five minutes using SNMP, and plots the results over day, week, month, and year timescales.
  • Cricket serves the same purpose as MRTG, but is configured in a different way - targets are organized in a tree, which allows inheritance of target specifications. Cricket has been using RRDtool from the start, although MRTG has picked that up (as an option) as well.
  • RRDtool is a round-robin database used to store periodic measurements such as SNMP readings. Recent values are stored at finer time granularity than older values, and RRDtool incrementally "consolidates" values as they age, and uses an efficient constant-size representation. RRDtool is agnostic to SNMP, but it is used by Cricket, MRTG (optionally), and many other SNMP tools.
  • Synagon is a Python-based tool that performs SNMP measurements (Juniper firewall counters) at much shorter timescales (seconds) and graphs them in real time. This can be used to look at momentary link utilizations, within the limits of how often devices actually update their SNMP values.
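
The idea of round-robin storage with consolidation can be illustrated with a short sketch. This is not RRDtool's file format or API. The class name and its parameters are assumptions made for illustration, and averaging is just one possible consolidation function. Each archive keeps a fixed number of rows, and each row is the average of a fixed number of primary samples, so a coarse archive covers a longer period in the same constant space.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive; each row averages `steps` primary samples."""

    def __init__(self, steps, rows):
        self.steps = steps
        self.rows = deque(maxlen=rows)  # oldest rows are overwritten
        self._pending = []

    def update(self, value):
        """Feed one primary sample; consolidate once `steps` have arrived."""
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.rows.append(sum(self._pending) / self.steps)
            self._pending = []


if __name__ == "__main__":
    fine = RoundRobinArchive(steps=1, rows=3)    # recent data, full detail
    coarse = RoundRobinArchive(steps=2, rows=2)  # older data, averaged
    for sample in [1, 2, 3, 4, 5, 6]:
        fine.update(sample)
        coarse.update(sample)
    print(list(fine.rows), list(coarse.rows))
```

Both archives stay the same size no matter how many samples are fed in; only the time span they cover differs.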

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 05 Jan 2006 - 16 Jul 2007

Synagon SNMP Graphing Tool

Overview

Synagon is a Python-based program that collects and graphs firewall values from Juniper routers. The tool first requires the user to select a router from a list. It then retrieves all the available firewall filters/counters from that router, and the user can select any of these to graph. It was developed by DANTE for internal use, but could be made available on request (on a case by case basis), in an entirely unsupported manner. Please contact operations@dante.org.uk for more details.

[The information below is taken from DANTE's internal documentation.]

Installation and Setup procedures

Synagon comes as a single compressed archive, that can be copied in a directory and then invoked as any other standard python script.

It requires the following python libraries to operate:

1) The python SNMP framework (available at http://pysnmp.sourceforge.net)

The SNMP framework is needed to query the routers via SNMP. For guidance on how to install the package, see Appendix B below.

You also need to create a network.py file that describes the network topology on which the tool will operate. The configuration file should be located in the same directory as the tool itself. See Appendix A below for more information on the format of the file.

  • Example Synagon screen

How to use Synagon

The tool can be invoked by double clicking on the file name (under windows) or by typing the following on the command shell:

python synagon.py

Before that, a network.py configuration file must have been created and placed in the directory from which the tool is invoked. It describes the location of the routers and the SNMP community to use when contacting them.

Immediately after, a window with a single menu entry will appear. By clicking on the menu, a list of all routers described in the network.py will be given as submenus.

Under each router, a submenu shows the list of firewall filters for that router (if the list has previously been collected) together with a ‘collect’ submenu. The collect submenu initiates the collection of the firewall filters from the router and then updates the list of firewall filters in the submenu where the collect button is located. The list of firewall filters of each router is also stored on the local disk, so the next time the tool is invoked it does not need to interrogate the router for the firewall list; the list is read from the file instead. When the firewall filter list is retrieved, a submenu with all counters and policers for each filter is displayed. A policer is marked with [p] and a counter with [c] just after its name.

When a counter or policer of a firewall filter of a router is selected, SNMP polling for that entity begins. After a few seconds, a window will appear that plots the requested values.

The graph shows the rate of bytes dropped since monitoring started for a particular counter, or the rate of packets dropped since monitoring started for a particular policer. The save graph button creates a file for the graph in encapsulated PostScript format (.eps).

Because values are retrieved frequently, memory consumption and performance are a concern. To address this, a configurable limit is placed on the number of samples plotted on the graph. The handle at the top of the window sets this limit; positioning it at a high value (thousands of samples) has an impact on performance.

Developer’s guide

This section provides information about how the tool internally works.

The tool mainly consists of the felegktis.py file, but it also relies on the external csnmp.py and graph.py libraries. felegktis.py defines two threads of control: one initiates the application and is responsible for collecting the results, the other is responsible for the visual part of the application and the interaction with the user.

The GUI thread updates a shared structure (a dictionary) with contact information about all the firewalls the user has chosen by that time to monitor.

active_firewall_list = {}

The structure holds the following information:

router_name->filter_name->counter_name(collect, instance, type, bytes, epoch)

There is an entry per counter and that entry has information about whether the counter should be retrieved (collect), the SNMP instance (instance) for that counter, whether it is a policer or a counter (type), the last byte counter value (bytes), the time the last value was received (epoch).
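
As a rough illustration of this layout, the nested dictionary could be populated as follows. This is a sketch based only on the description above, not Synagon's actual code. The helper name register_counter and the sample router, filter, counter and instance values are hypothetical.

```python
def register_counter(active, router, filter_name, counter, instance, kind):
    """Add one counter/policer entry to the shared nested dictionary."""
    entry = {
        "collect": True,       # should the collector poll this entry?
        "instance": instance,  # SNMP instance identifying the counter
        "type": kind,          # "counter" or "policer"
        "bytes": None,         # last value read (none yet)
        "epoch": None,         # time of the last reading (none yet)
    }
    active.setdefault(router, {}).setdefault(filter_name, {})[counter] = entry
    return entry


if __name__ == "__main__":
    active_firewall_list = {}
    register_counter(active_firewall_list, "uk1", "example-filter",
                     "example-counter", "1.2.3", "counter")
    print(active_firewall_list)
```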

It operates as follows:

do forever:
   if the user has shutdown the GUI thread:
      terminate

   for each router in active_firewall_list:
      for each filter of each router:
         for each counter/policer of each filter:
            retrieve counter value
            calculate change rate 
            pass rate of that counter to the GUI thread

It uses SNMP to request the values from the router.
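
One pass of the loop above might look like the following sketch. It is not the tool's actual code: the SNMP request is replaced by a caller-supplied fetch function, emit stands in for handing the rate to the GUI thread, and both names are assumptions. Counter wrap-around is deliberately not handled.

```python
import time


def change_rate(prev_value, prev_epoch, value, epoch):
    """Per-second rate between two readings, or None if not computable."""
    if prev_value is None or prev_epoch is None or epoch <= prev_epoch:
        return None
    return (value - prev_value) / (epoch - prev_epoch)


def poll_once(active, fetch, emit, now=time.time):
    """Visit every router/filter/counter once and emit the change rates."""
    for router, filters in active.items():
        for filter_name, counters in filters.items():
            for counter, entry in counters.items():
                if not entry["collect"]:
                    continue  # the user is no longer watching this entry
                value = fetch(router, entry["instance"])  # stand-in for SNMP
                epoch = now()
                rate = change_rate(entry["bytes"], entry["epoch"], value, epoch)
                entry["bytes"], entry["epoch"] = value, epoch
                if rate is not None:
                    emit((router, filter_name, counter), rate)
```

The first reading of each counter only establishes a baseline; a rate is emitted from the second reading onwards.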

The GUI thread (MainWindow class) first creates the user interface (based on the Tkinter Tcl/Tk module) and is passed the GEANT network topology found in the network.py file. It then creates a menu list with all routers found in the topology. Each router entry has a submenu with a ‘collect’ item. This item is bound to a function: when it is selected, the router is interrogated via SNMP on the Juniper Firewall MIB, and the collection of filter names and counter names for that router is retrieved. Because the interrogation may take tens of seconds, the firewall filter list is serialised using the cPickle module. The file is given the name router_name.filters (e.g. uk1.filters for router uk1) and is stored in the local directory from which the tool was invoked. The next time the tool is invoked, it checks whether there is a filter list file for that router and, if so, deserialises the list of firewall filters from the file and populates the router's submenu. The serialised firewall list may of course be out of date, but the ‘collect’ submenu entry for each router is always present and can be invoked to replace the current list (both in memory and on disk) with the latest details.

If the user selects a counter or policer to monitor from the submenu list, the GUI thread adds that counter/policer to the active_firewall_list, and the main thread picks it up and begins retrieving data for it.
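
The on-disk caching of the filter list can be sketched as follows. This is an illustration rather than the tool's code: it uses the pickle module of current Python versions in place of the Python 2 cPickle module mentioned above, and the function names and the shape of the cached data are assumptions.

```python
import os
import pickle


def cache_path(router_name, directory="."):
    """File holding the cached filter list, e.g. uk1.filters."""
    return os.path.join(directory, router_name + ".filters")


def save_filters(router_name, filters, directory="."):
    """Serialise the freshly collected filter list to disk."""
    with open(cache_path(router_name, directory), "wb") as handle:
        pickle.dump(filters, handle)


def load_filters(router_name, directory="."):
    """Return the cached filter list, or None if nothing was collected yet."""
    path = cache_path(router_name, directory)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.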

The main thread passes each calculated rate to the MainWindow thread through a queue, which the main thread populates and from which the GUI thread reads. Because updates can come from many different firewall filters, each queue entry is tagged with a text string that serves as a unique identifier, built from the router name, firewall filter name and counter/policer name. The GUI thread periodically (every 250 ms) checks whether there is data in the queue. If the identifier has not been seen before, a new window is created in which the values are plotted. If the window already exists, it is updated with the new value and its contents are redrawn. A window is an instance of the class PlotList.
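
The hand-over between the two threads can be sketched with the standard queue module. This is an assumption-laden illustration, not the tool's code: plain lists stand in for PlotList windows, and the identifier format and function names are made up.

```python
import queue


def make_key(router, filter_name, counter):
    """Unique text identifier for one counter/policer."""
    return "%s/%s/%s" % (router, filter_name, counter)


def drain(updates, windows):
    """Move all pending (key, rate) updates into per-key sample lists.

    A key seen for the first time gets a new list, standing in for the
    creation of a new plot window.
    """
    while True:
        try:
            key, rate = updates.get_nowait()
        except queue.Empty:
            return
        windows.setdefault(key, []).append(rate)


if __name__ == "__main__":
    updates, windows = queue.Queue(), {}
    updates.put((make_key("uk1", "f", "c"), 1000.0))
    drain(updates, windows)
    print(windows)
```

In a Tkinter application, such a function would typically re-arm itself using the widget method after(250, ...), so that the queue is polled from within the GUI event loop.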

When the user closes a window, the GUI thread modifies the active_firewall_list entry for that counter/policer by setting its ‘collect’ attribute to None; the next time the main thread reaches that counter/policer, it ignores it.

Creating a single synagon package

The tool consists of several files:

  • felegktis.py [main logic]
  • graph.py [graph library]
  • csnmp.py [ SNMP library]

Using the squeeze tool (available at http://www.pythonware.com/products/python/squeeze/index.htm), these files can be combined into a single compressed Python executable archive.

The build_synagon.bat script uses the squeeze tool to archive all these files into one.

build_synagon.bat:
REM build the synagon distribution
python squeezeTool.py -1 -o synagon -b felegktis felegktis.py graph.py csnmp.py

The command above builds synagon.py from files felegktis.py, graph.py and csnmp.py.

Appendices

A. network.py

network.py describes the network's topology - its format can be deduced by inspection. Currently the network topology name (graph name) should be set to ‘geant’ - the script could be adjusted to change this behaviour.

Compulsory router properties:

  • type: the type of the router [juniper, cisco, etc.] (the tool only operates on ‘juniper’ routers)
  • address: The address where the router can be contacted at
  • community: the SNMP community to authenticate on the router
Example:
# Geant network
import graph  # assumed: graph.py, the graph library that ships with Synagon

geant = graph.Graph()

# Defaults for router
class Router(graph.Vertex):
    """ Common options for geant routers """

    def __init__(self):
        # Initialise base class
        graph.Vertex.__init__(self)
        # Set the default properties for these routers
        self.property['type'] = 'juniper'
        self.property['community'] = 'commpassword'

uk1 = Router()
gr1 = Router()


geant.vertex['uk1'] = uk1
geant.vertex['gr1'] = gr1

uk1.property['address'] = 'uk1.uk.geant.net'
gr1.property['address'] = 'gr1.gr.geant.net'

B. Python Package Installation

Python has built-in support for installing packages.

Packages usually come in source format and are compiled at installation time. Packages can be pure Python source code or have extensions in other languages (C, C++). Packages can also come in binary form.

Many Python packages that have extensions written in another language need to be compiled into a binary package before being distributed to platforms like Windows, because not all Windows machines have the C or C++ compilers that a package may require.

The Python SNMP framework library

The source version can be downloaded from http://pysnmp.sourceforge.net

When the file is uncompressed, the user should invoke setup.py:

setup.py install

There is also a binary version for Windows, available in the NEP’s wiki. It has a simple GUI. Just keep pushing the ‘next’ button until the package is installed.

-- TobyRodwell - 23 Jan 2006

Netflow-based Tools

Netflow: Analysis tools to characterize the traffic in the network, detect routing problems, security problems (Denials of Service), etc.

There are different versions of NDE (NetFlow Data Export): v1 (first), v5, v7, v8 and the last version (v9) which is based on templates. This last version makes it possible to analyze IPv6, MPLS and Multicast flows.

  • NDE v9 description

References

-- FrancoisXavierAndreu - 2005-06-06 - 2006-04-07
-- SimonLeinen - 2006-01-05 - 2015-07-18

Packet Capture and Analysis Tools:

Detect protocol problems via the analysis of packets; troubleshooting.

  • tcpdump: Packet capture tool based on the libpcap library. http://www.tcpdump.org/
  • snoop: tcpdump equivalent in Sun's Solaris operating system
  • Wireshark (formerly called Ethereal): Extended tcpdump with a user-friendly GUI. Its plugin architecture allowed many protocol-specific "dissectors" to be written. Also includes some analysis tools useful for performance debugging.
  • libtrace: A library for trace processing; works with libpcap (tcpdump) and other formats.
  • Netdude: The Network Dump data Displayer and Editor is a framework for inspection, analysis and manipulation of tcpdump trace files.
  • jnettop: Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use.
  • tcptrace: a statistics/analysis tool for TCP sessions using tcpdump capture files
  • capturing packets with Endace DAG cards (dagsnap, dagconvert, ...)

General Hints for Taking Packet Traces

Capture enough data - you can always throw away stuff later. Note that tcpdump's default capture length is small (96 bytes), so use -s 0 (or something like -s 1540) if you are interested in payloads. Wireshark and snoop capture entire packets by default. Seemingly unrelated traffic can impact performance. For example, Web pages from foo.example.com may load slowly because of the images from adserver.example.net. In situations of high background traffic, it may however be necessary to filter out unrelated traffic.

It can be extremely useful to collect packet traces from multiple points in the network:

  • On the endpoints of the communication
  • Near "suspicious" intermediate points such as firewalls

Synchronized clocks (e.g. by NTP) are very useful for matching traces.

Address-to-name resolution can slow down display and cause additional traffic which confuses the trace. With tcpdump, consider using -n or tracing to file (-w file).

Request remote packet traces

When a packet trace from a remote site is required, this often means having to ask someone at that site to provide it. When requesting such a trace, make this as easy as possible for the person who has to do it. Try to use a packet tracing tool that is already available - tcpdump for most BSD or Linux systems, snoop for Solaris machines. Windows doesn't seem to come bundled with a packet capturing program, but you can direct the user to Wireshark, which is reasonably easy to install and use under Windows. Try to give clear instructions on how to invoke the packet capture program. It is usually best to ask the user to capture to a file, and then send you the capture file as an e-mail attachment or similar.

References

  • Presentation slides from a short talk about packet capturing techniques given at the network performance section of the January 2006 GEANT2 technical workshop

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006-09 Apr 2006
-- PekkaSavola - 26 Oct 2006

tcpdump

tcpdump is one of the early diagnostic tools for TCP/IP. It was written by Van Jacobson, Craig Leres, and Steven McCanne. Tcpdump can be used to capture and decode packets in real time, or to capture packets to files (in "libpcap" format, see below), and analyze (decode) them later.

There are now more elaborate and, in some ways, more user-friendly packet capturing programs, such as Wireshark (formerly called Ethereal), but tcpdump is widely available and widely used, so it is very useful to know how to use it.

Tcpdump/libpcap is still actively being maintained, although not by its original authors.

libpcap

Libpcap is a library that is used by tcpdump, and also names a file format for packet traces. This file format - usually used in files with the extension .pcap - is widely supported by packet capture and analysis tools.

Selected options

Some useful options to tcpdump include:

  • -s snaplen capture snaplen bytes of each frame. By default, older versions of tcpdump capture only the first 68 bytes (96 bytes in builds with IPv6 support, as in the usage examples below), which is sufficient to capture IP/UDP/TCP/ICMP headers, but usually not payload or higher-level protocols. If you are interested in more than just headers, use -s 0 to capture packets without truncation.
  • -r filename read from a previously created capture file (see -w)
  • -w filename dump to a file instead of analyzing on-the-fly
  • -i interface capture on an interface other than the default (first "up" non-loopback interface). Under Linux, -i any can be used to capture on all interfaces, albeit with some restrictions.
  • -c count stop the capture after count packets
  • -n don't resolve addresses, port numbers, etc. to symbolic names - this avoids additional DNS traffic when analyzing a live capture.
  • -v verbose output
  • -vv more verbose output
  • -vvv even more verbose output

Also, a pflang expression can be appended to the command so as to filter the captured packets. An expression is made up of one or more of "type", "direction" and "protocol".

  • type Can be host, net or port. host is presumed unless otherwise specified
  • dir Can be src, dst, src or dst or src and dst
  • proto (for protocol) Common types are ether, ip, tcp, udp, arp ... If none is specified then all protocols for which the value is valid are considered.
Example expressions:
  • dst host <address>
  • src host <address>
  • udp dst port <number>
  • host <host> and not port ftp and not port ftp-data
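
When tcpdump is driven from a script, the options and a filter expression can be assembled into an argument list. The helper below is a hypothetical sketch (its name and parameters are not part of tcpdump); it uses only the options described above and does not run tcpdump itself.

```python
def tcpdump_argv(expression, outfile=None, count=None, snaplen=None,
                 numeric=True):
    """Build an argument list for tcpdump from options and a filter."""
    argv = ["tcpdump"]
    if numeric:
        argv.append("-n")                # no address-to-name resolution
    if snaplen is not None:
        argv += ["-s", str(snaplen)]     # 0 means no truncation
    if count is not None:
        argv += ["-c", str(count)]       # stop after this many packets
    if outfile:
        argv += ["-w", outfile]          # write to file instead of decoding
    return argv + expression.split()


if __name__ == "__main__":
    print(" ".join(tcpdump_argv("udp dst port 53", outfile="test.pcap",
                                count=1, snaplen=0)))
```

The resulting list can be passed to subprocess.run; capturing normally requires administrative privileges.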

Usage examples

Capture a single (-c 1) udp packet to file test.pcap:

: root@diotima[tmp]; tcpdump -c 1 -w test.pcap udp
tcpdump: listening on bge0, link-type EN10MB (Ethernet), capture size 96 bytes
1 packets captured
3 packets received by filter
0 packets dropped by kernel

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.pcap
-rw-r--r-- 1 root root 114 2006-04-09 18:57 test.pcap
: root@diotima[tmp]; file test.pcap
test.pcap: tcpdump capture file (big-endian) - version 2.4 (Ethernet, capture length 96)
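
The information reported by file comes from the fixed 24-byte header at the start of the capture file. The following sketch decodes that header for classic pcap files with microsecond timestamps; the function name is illustrative, and other variants (such as pcap-ng) are not handled. The 114-byte file size above is consistent with a 24-byte file header, a 16-byte per-packet header and a 74-byte frame (14 bytes Ethernet + 40 bytes IPv6 + 8 bytes UDP + 12 bytes payload).

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic libpcap magic (microsecond timestamps)


def read_pcap_header(data):
    """Decode the 24-byte global header of a classic pcap file."""
    if len(data) < 24:
        raise ValueError("too short for a pcap file header")
    # The magic number reveals the byte order used by the writing host.
    for endian, name in ((">", "big"), ("<", "little")):
        if struct.unpack(endian + "I", data[:4])[0] == PCAP_MAGIC:
            break
    else:
        raise ValueError("not a classic pcap file")
    major, minor, _zone, _sigfigs, snaplen, linktype = struct.unpack(
        endian + "HHiIII", data[4:24])
    return {"byte_order": name, "version": (major, minor),
            "snaplen": snaplen, "linktype": linktype}
```

For example, read_pcap_header(open("test.pcap", "rb").read()) would return the byte order, format version, capture length and link type (1 denotes Ethernet).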

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; tcpdump -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: UDP, length: 12

Display the same capture file in verbose mode:

: root@diotima[tmp]; tcpdump -v -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: [udp sum ok] UDP, length: 12 (len 20, hlim 118)

More examples with some advanced tcpdump use cases.

References

-- SimonLeinen - 2006-03-04 - 2016-03-17

Wireshark®

Wireshark is a packet capture/analysis tool, similar to tcpdump but much more elaborate. It has a graphical user interface (GUI) which allows "drilling down" into the header structure of captured packets. In addition, it has a "plugin" architecture that allows decoders ("dissectors" in Wireshark terminology) to be written with relative ease. This and the general user-friendliness of the tool have resulted in Wireshark supporting an abundance of network protocols - presumably writing Wireshark dissectors is often part of the work of developing/implementing new protocols. Lastly, Wireshark includes some nice graphical analysis/statistics tools, covering much of the functionality of tcptrace and xplot.

One of the main attractions of Wireshark is that it works nicely under Microsoft Windows, although it requires a third-party library to implement the equivalent of the libpcap packet capture library.

Wireshark used to be called Ethereal™, but was renamed in June 2006, when its principal maintainer changed employers. Version 1.8 adds support for capturing on multiple interfaces in parallel, simplified management of decryption keys (for 802.11 WLANs and IPsec/ISAKMP), and geolocation for IPv6 addresses. Some TCP events (fast retransmits and TCP Window updates) are no longer flagged as warnings/errors. The default format for saving capture files is changing from pcap to pcap-ng. Wireshark 1.10, announced in June 2013, adds many features including Windows 8 support and response-time analysis for HTTP requests.

Usage examples

The following screenshot shows Ethereal 0.10.14 under Linux/Gnome when called as ethereal -r test.pcap, reading the single-packet example trace generated in the tcpdump example. The "data" portion of the UDP part of the packet has been selected by clicking on it in the middle pane, and the corresponding bytes are highlighted in the lower pane.

ethereal -r test.pcap screendump

tshark

The package includes a command-line tool called tshark, which can be used in a similar (but not quite compatible) way to tcpdump. Through complex command-line options, it can give access to some of the more advanced decoding functionality of Wireshark. Because it generates text, it can be used as part of analysis scripts.

Scripting

Wireshark can be extended using scripting languages. The Lua language has been supported for several years. Version 1.4.0 (released in September 2010) added preliminary support for Python as an extension language.

CloudShark

In a cute and possibly even useful application of tshark, QA Cafe (an IP testing solutions vendor) has put up "Wireshark as a Service" under www.cloudshark.org. This tool lets users upload packet dumps without registration, and provides the familiar Wireshark interface over the Web. Uploads are limited to 512 Kilobytes, and there are no guarantees about confidentiality of the data, so it should not be used on privacy-sensitive data.

References

-- SimonLeinen - 2006-03-04 - 2013-06-09

libtrace

libtrace is a library for trace processing. It supports multiple input methods, including device capture, raw and gz-compressed trace, and sockets; and multiple input formats, including pcap and DAG.

References

-- SimonLeinen - 04 Mar 2006

Netdude

Netdude (Network Dump data Displayer and Editor) is a framework for inspection, analysis and manipulation of tcpdump (libpcap) trace files.

References

-- SimonLeinen - 04 Mar 2006

jnettop

jnettop captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use. The result is a nice listing of communication on the network grouped by stream, which shows transported bytes and consumed bandwidth.

References

-- FrancoisXavierAndreu - 06 Jun 2005 -- SimonLeinen - 04 Mar 2006

Host and Application Measurement Tools

Network performance is only one part of a distributed system's performance. An equally important part is the performance of the actual application and the hardware it runs on.

  • Web100 - fine-grained instrumentation of TCP implementation, currently available for the Linux kernel.
  • SIFTR - TCP instrumentation similar to Web100, available for *BSD
  • DTrace - Dynamic tracing facility available in Solaris and *BSD
  • NetLogger - a framework for instrumentation and log collection from various components of networked applications

-- TobyRodwell - 22 Mar 2006
-- SimonLeinen - 28 Mar 2006

NetLogger

NetLogger (copyright Lawrence Berkeley National Laboratory) is both a set of tools and a methodology for analysing the performance of a distributed system. The methodology (below) can be implemented separately from the LBL-developed tools. For full information on NetLogger see its website, http://dsd.lbl.gov/NetLogger/

Methodology

From the NetLogger website

The NetLogger methodology is really quite simple. It consists of the following:

  1. All components must be instrumented to produce monitoring events. These components include application software, middleware, operating system, and networks. The more components that are instrumented the better.
  2. All monitoring events must use a common format and common set of attributes. Monitoring events must also all contain a precision timestamp which is in a single timezone (GMT) and globally synchronized via a clock synchronization method such as NTP.
  3. Log all of the following events: Entering and exiting any program or software component, and begin/end of all IO (disk and network).
  4. Collect all log data in a central location
  5. Use event correlation and visualization tools to analyze the monitoring event logs.
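
The methodology can be followed without the NetLogger toolkit. The sketch below does not use the NetLogger API or message format. It is a hypothetical illustration that logs start and end events around an operation in one common format (JSON here, as an assumption) with UTC timestamps, appending them to a list that stands in for the central collection point.

```python
import json
from datetime import datetime, timezone


def make_event(name, **attrs):
    """One monitoring event in a common format with a UTC timestamp."""
    event = {"ts": datetime.now(timezone.utc).isoformat(), "event": name}
    event.update(attrs)
    return event


def log_span(sink, name, func, **attrs):
    """Log the start and end of func(), e.g. a disk or network operation."""
    sink.append(json.dumps(make_event(name + ".start", **attrs)))
    try:
        return func()
    finally:
        # The end event is logged even if func() raises an exception.
        sink.append(json.dumps(make_event(name + ".end", **attrs)))


if __name__ == "__main__":
    log = []
    log_span(log, "disk.read", lambda: sum(range(1000)), host="host1")
    print("\n".join(log))
```

The timestamps are only comparable across hosts if the clocks are synchronized, for example via NTP as the methodology requires.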

Toolkit

From the NetLogger website

The NetLogger Toolkit includes a number of separate components which are designed to help you do distributed debugging and performance analysis. You can use any or all of these components, depending on your needs.

These include:

  • NetLogger message format and data model: A simple, common message format for all monitoring events which includes high-precision timestamps
  • NetLogger client API library: C/C++, Java, PERL, and Python calls that you add to your existing source code to generate monitoring events. The destination and logging level of NetLogger messages are all easily controlled using an environment variable. These libraries are designed to be as lightweight as possible, and to never block or adversely affect application performance.
  • NetLogger visualization tool (nlv): a powerful, customizable X-Windows tool for viewing and analysis of event logs based on time correlated and/or object correlated events.
  • NetLogger host/network monitoring tools: a collection of NetLogger-instrumented host monitoring tools, including tools to interoperate with Ganglia and MonALisa.
  • NetLogger storage and retrieval tools, including a daemon that collects NetLogger events from several places at a single, central host; a forwarding daemon to forward all NetLogger files in a specified directory to a given location; and a NetLogger event archive based on mySQL.

References

-- TobyRodwell - 22 Mar 2006

Web100 Linux Kernel Extensions

The Web100 project was run by PSC (Pittsburgh Supercomputing Center), NCAR and NCSA. It was funded by the US National Science Foundation (NSF) between 2000 and 2003, although development and maintenance work extended well beyond the period of NSF funding. Its thrust was to close the "wizard gap" between what performance should be possible on modern research networks and what most users of these networks actually experience. The project focused on instrumentation for TCP to measure performance and find possible bottlenecks that limit TCP throughput. In addition, it included some work on "auto-tuning" of TCP buffer settings.

Most implementation work was done for Linux, and most of the auto-tuning code is now actually included in the mainline Linux kernel code (as of 2.6.17). The TCP kernel instrumentation is available as a patch from http://www.web100.org/, and usually tracks the latest "official" Linux kernel release pretty closely.

An interesting application of Web100 is NDT, which can be used from any Java-enabled browser to detect bottlenecks in TCP configurations and network paths, as well as duplex mismatches using active TCP tests against a Web100-enabled server.

In September 2010, the NSF agreed to fund a follow-on project called Web10G.

TCP Kernel Information Set (KIS)

A central component of Web100 is a set of "instruments" that permits the monitoring of many statistics about TCP connections (sockets) in the kernel. In the Linux implementation, these instruments are accessible through the proc filesystem.

TCP Extended Statistics MIB (TCP-ESTATS-MIB, RFC 4898)

The TCP-ESTATS-MIB (RFC 4898) includes a similar set of instruments, for access through SNMP. It has been implemented by Microsoft for the Vista operating system and later versions of Windows. In the Windows Server 2008 SDK, a tool called TcpAnalyzer.exe can be used to look at statistics of open TCP connections. IBM is also said to have an implementation of this MIB.

"Userland" Tools

Besides the kernel extension, Web100 comprises a small set of user-level tools which provide access to the TCP KIS. These tools include

  1. a libweb100 library written in C
  2. the command-line tools readvar, deltavar, writevar, and readall
  3. a set of GTK+-based GUI (graphical user interface) tools under the gutil command.

gutil

When started, gutil shows a small main panel with an entry field for specifying a TCP connection, and several graphical buttons for starting different tools on a connection once one has been selected.

gutil: main panel

The TCP connection can be chosen either by explicitly specifying its endpoints, or by selecting from a list of connections (using double-click):

gutil: TCP connection selection dialog

Once a connection of interest has been selected, a number of actions are possible. The "List" action provides a list of all kernel instruments for the connection. The list is updated every second, and "delta" values are displayed for those variables that have changed.

gutil: variable listing

Another action is "Display", which provides a graphical display of a KIS variable. The following screenshot shows the display of the DataBytesIn variable of an SSH connection.

gutil: graphical variable display

Related Work

Microsoft's Windows Software Development Kit for Server 2008 and .NET Version 3.5 contains a tool called TcpAnalyzer.exe, which is similar to gutil and uses Microsoft's RFC 4898 implementation.

The SIFTR module for FreeBSD can be used for similar applications, namely to understand what is happening inside TCP at fine granularity. Sun's DTrace would be an alternative on systems that support it, provided they offer suitable probes for the relevant actions and events within TCP. Both SIFTR and DTrace have user interfaces that are very different from Web100's.

References

-- SimonLeinen - 27 Feb 2006 - 25 Mar 2011
-- ChrisWelti - 12 Jan 2010

NREN Tools and Statistics

A list of all NRENs and other networks who offer statistics/info for their networks and/or have tools available for PERT staff to use.

Outside Europe:

For information on other NRENs' monitoring tools please see the traffic monitoring URLs section of the TERENA Compendium

A list of network weathermaps is maintained in the JRA1 Wiki.

-- TobyRodwell - 14 Apr 2005 -- SimonLeinen - 20 Sep 2005 - 05 Aug 2011 -- FrancoisXavierAndreu - 05 Apr 2006 -- LuchesarIliev - 27 Apr 2006

GÉANT Tools

PERT Staff can access a range of tools for checking the status of the GÉANT network by logging into https://tools.geant.net/portal/ (if a member of the PERT does not have an account for this site they should contact DANTE operations operations@dante.org.uk).

The following tools are available:

GÉANT Usage Map: For each country currently connected to GÉANT, this map shows the level of usage of the access link connecting the country's national research network to GÉANT.

Network Weathermap: Map showing the GÉANT routers and the circuits interconnecting them. The circuits are coloured according to their utilisation. LOGIN REQUIRED

GÉANT Looking Glass: this allows users to execute basic operational commands on GÉANT routers, including ping, traceroute and show route

IPv4 and IPv6 Beacon: Beacon nodes are located in the majority of GÉANT PoPs. They send and receive a specific multicast group, and from this are able to infer the performance of all multicast traffic throughout GÉANT.

HADES Measure Points: These network performance Measurement Points (MPs) run DFN's HADES IPPM software (OWD, OWDV, packet loss) and also iperf/BWCTL. PERT engineers can log in to the devices and run AB tests (they should use common sense so as not to overload low capacity or already heavily loaded paths).

BWCTL Measurement Points: NRENs can run BWCTL/iperf tests to, from and between GEANT2's BWCTL servers. See GeantToolsDanteBwctl for more information.

-- TobyRodwell - 05 Apr 2005 -- SimonLeinen - 20 Sep 2005 - 05 Aug 2011

Performance-Related Tools at SWITCH

  • IPv4 Looking Glass -- allows running IPv4 ping, traceroute, and several BGP monitoring commands on external border routers
  • IPv6 Looking Glass -- like the above, but for IPv6
  • Network Diagnostic Tester (NDT) -- run TCP throughput tests to a Web100-instrumented server, and detect configuration and cabling issues on your client. Java 1.4 or newer required
  • BWCTL & OWAMP testboxes -- run bwctl (iperf) and one-way delay tests (owamp) to our PMPs
  • SmokePing -- permanent RTT ("ping") measurements to select places on the Internet, with good visualization
  • RIPE TTM -- tt85 (at SWITCH's PoP at ETH Zurich) and tt86 (at the University of Geneva) show one-way delays, delay variations and loss from/to SWITCH from many other networks.
  • IEPM-BW -- regular throughput measurements from SLAC to many hosts, including one at SWITCH
  • Multicast beacons -- IPv4/IPv6 multicast reachability matrices

-- AlexGall - 15 Apr 2005

PSNC - Network Monitoring Tools

Public Tools

Private Tools

Contact PSNC NOC <noc@man.poznan.pl> for information from any of the following tools

  • Smokeping Latency Graphs
  • MRTG traffic statistics

-- BartoszBelter - 18 Aug 2005

Public Tools

  • MRTG - Traffic statistics for core and access circuits
  • Map - Weathermap based on MRTG
  • Looking Glass - IPv4 and IPv6

Private Tools

Contact HEAnet NOC for information from any of the following tools

  • Smokeping - latency graphing
  • Netsaint - reachability and latency alarms
  • Netflow
  • RIPE TTM
  • Multicast Beacon

-- AnnRHarding - 17 May 2005

ISTF Network Monitoring Tools

Network Monitoring Tools Portal -- includes links to:

  • Looking Glass (IPv4 and IPv6 ping, trace, bgp)
  • Cacti Network Graphs
  • Smokeping Latency Graphs
  • Rancid CVS repository

-- VedrinJeliazkov - 30 May 2005

Public Tools

Private Tools

Contact FCCN NOC for information from any of the following tools:

  • Netsaint - reachability and latency alarms
  • Multicast Beacon IPv4 - IPv4 multicast reachability matrix
  • RRDTool
  • Iperf (IPv4/ IPv6)

-- MonicaDomingues - 31 May 2005

Public Tools

Private Tools

Contact IUCC NOC (nocplus@noc.ilan.net.il) for information from any of the following tools:

IUCC is willing for PERT CMs to have access to non-public IUCC network information. If as a part of an investigation a PERT CM needs more info about the IUCC network they should contact Hank (or the IUCC NOC) for more assistance.

-- HankNussbacher - 02 Jun 2005 -- RafiSadowsky - 25 Mar 2006

Public Tools

Private Tools

  • IPv6 cricket -- IPv6 cricket
  • Cricket -- Traffic statistics for core and access circuits
  • Weathermap -- Hungarnet Weathermap
  • Router configuration database IPv4/IPv6 -- CVS database of router configurations
  • IPv6 looking glass -- allows ping, traceroute and bgp commands
  • Nagios -- latency and service availability alarms (both IPv6/IPv4)

-- MihalyMeszaros - 16 Mar 2007

RENATER - Network Monitoring Tools

Public Tools

-- FrancoisXavierAndreu - 05 Apr 2006

Network Emulation Tools

In general, it is difficult to assess performance of distributed applications and protocols before they are deployed on the real network, because the interaction with network impairments such as delay is hard to predict. Therefore, researchers and practitioners often use emulation to mimic deployment (typically wide-area) networks for use in laboratory testbeds.

The emulators listed below implement logical interfaces with configurable parameters such as delay, bandwidth (capacity) and packet loss rates.

-- TobyRodwell - 06 Apr 2005 -- SimonLeinen - 15 Dec 2005

Linux netem (introduced in recent 2.6 kernels)

NISTnet works only on 2.4.x Linux kernels, so to get the best support for recent GE cards, I would recommend kernel 2.6.10 or .11 and the new netem (Network emulator). A lot of similar functionality is built on this special QoS queue. It is buried in menuconfig:

 Networking -->
   Networking Options -->
     QoS and/or fair queuing -->
        Network emulator

You will need a recent iproute2 package that supports netem. The tc command is used to configure netem characteristics such as loss rates and distribution, delay and delay variation, and packet corruption rate (packet corruption was added in Linux version 2.6.16).
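
As an illustration of how the tc command is used with netem, the following Python sketch assembles a command line for the characteristics mentioned above. The function name and its parameters are hypothetical; the sketch only builds the command and does not run it, since applying it requires root and changes the behaviour of the interface.

```python
import shlex


def netem_command(dev, delay_ms=None, jitter_ms=None, loss_pct=None,
                  corrupt_pct=None):
    """Build (but do not run) a 'tc qdisc add ... netem' command line."""
    argv = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        argv += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            # Delay variation is given as a second value after the delay.
            argv.append(f"{jitter_ms}ms")
    if loss_pct is not None:
        argv += ["loss", f"{loss_pct}%"]
    if corrupt_pct is not None:
        # Packet corruption needs Linux 2.6.16 or later.
        argv += ["corrupt", f"{corrupt_pct}%"]
    return argv


# 100 ms delay with 10 ms variation and 0.1% loss on eth0
print(shlex.join(netem_command("eth0", delay_ms=100, jitter_ms=10,
                               loss_pct=0.1)))
```

The interface name eth0 and the values are examples only. An existing netem qdisc would be modified with "change" instead of "add" and removed with "tc qdisc del dev eth0 root".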

References

-- TobyRodwell - 14 Apr 2005
-- SimonLeinen - 21 Mar 2006 - 07 Jan 2007 -- based on information from David Martinez Moreno (RedIRIS).

Larry Dunn from Cisco has set up a portable network emulator based on NISTnet. He wrote this on it:

My systems started as 2x100BT, but not for the (cost) reasons you suspect. I have 3 cheap Linux systems in a mobile rack (3x1RU, <$5k USD total). They are attached in a "row", 2 with web100, 1 with NISTNet like this: web100 - NISTNet - web100. The NISTNet node acts as a variable delay-line (emulates different fiber propagation delays), and as a place to inject loss, if desired. DummyNet/BSD could work as well.

I was looking for a very "shallow" (front-to-back) 1RU system, so I could fit it into a normal-sized Anvil case (for easy shipping to Networkers, etc). Penguin Computing was one of 2 companies I found that was then selling such "shallow" rack-mount systems. The shallow units were "low-end" servers, so they only had 2x100BT ports.

Their current low-end systems have two 1xGE ports, and are (still) around $1100 USD each. So a set of 3 systems (inc. rack), Linux preloaded, 3-yr warranty, is still easily under $5k (Actually, closer to $4k, including a $700 Anvil 2'x2'x2' rolling shock-mount case, and a cheap 8-port GE switch). Maybe $5k if you add more memory & disk...

Here's a reference to their current low-end stuff:

http://penguincomputing.com/products/servers/relion1XT.php?PHPSESSID=f90f88b4d54aad4eedc2bdae33796032

I have no affiliation with Penguin, just happened to buy their stuff, and it's working OK (though the fans are pretty loud). I suppose there are comparable vendors in Europe.

-- TobyRodwell - 14 Apr 2005

End-System (Host) Tuning

This section contains hints for tuning end systems (hosts) for maximum performance.

References

-- SimonLeinen - 31 Oct 2004 - 23 Jul 2009
-- PekkaSavola - 17 Nov 2006

Calculating required TCP buffer sizes

Regardless of operating system, TCP buffers must be appropriately dimensioned if they are to support long distance, high data rate flows. Victor Reijs's TCP Throughput calculator is able to determine the maximum achievable TCP throughput based on RTT, expected packet loss, segment size, re-transmission timeout (RTO) and buffer size. It can be found at:

http://people.heanet.ie/~vreijs/tcpcalculations.htm
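
The kind of estimate such a calculator produces can be sketched in a few lines of Python. This is not the calculator's own model: it is a simplified illustration that combines the window limit (at most one buffer per round trip) with the well-known Mathis et al. approximation for loss-limited throughput, leaving out the constant factor of order 1 and ignoring retransmission timeouts. All function names are chosen for this example.

```python
import math


def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_bps * rtt_s / 8


def window_limited_bps(buffer_bytes, rtt_s):
    """Throughput ceiling when at most one buffer is sent per round trip."""
    return buffer_bytes * 8 / rtt_s


def loss_limited_bps(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. approximation: rate ~ MSS / (RTT * sqrt(p))."""
    return mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))


def estimate_bps(buffer_bytes, mss_bytes, rtt_s, loss_rate):
    """The smaller of the two limits above."""
    return min(window_limited_bps(buffer_bytes, rtt_s),
               loss_limited_bps(mss_bytes, rtt_s, loss_rate))


# A 1 Gbit/s path with 100 ms RTT needs about 12.5 MB of buffer,
# while a 64 KB buffer caps the same path at roughly 5 Mbit/s.
print(bdp_bytes(1e9, 0.1))
print(window_limited_bps(65535, 0.1))
```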

Tolerance to Packet Re-ordering

As explained in the section on packet re-ordering, packets can be re-ordered wherever there is parallelism in a network. If such parallelism exists in a network (e.g. as caused by the internal architecture of the Juniper M160s used in the GEANT network) then TCP should be set to tolerate a larger number of re-ordered packets than the default, which is 3 (see here for an example for Linux). For large flows across GEANT a value of 20 should be sufficient - an alternative workaround is to use multiple parallel TCP flows.

Modifying TCP Behaviour

TCP Congestion control algorithms are often configured conservatively, to avoid 'slow start' overwhelming a network path. In a capacity rich network this may not be necessary, and may be over-ridden.

-- TobyRodwell - 24 Jan 2005


[The following information is courtesy of Jeff Boote, from Internet2, and was forwarded to me by Nicolas Simar.]

Operating Systems (OSs) have a habit of reconfiguring network interface settings back to their default, even when the correct values have been written in a specific config file. This is the result of bugs, and they appear in almost all OSs. Sometimes they get fixed in a given release but then get broken again in a later release. It is not known why this is the case but it may be partly that driver programmers don't test their products under conditions of large latency. It is worth noting that experience shows 'ifconfig' works well for 'txqueuelen' and 'MTU'.

Operating-System Specific Configuration Hints

This topic points to configuration hints for specific operating systems.

Note that the main perspective of these hints is how to achieve good performance in a big bandwidth-delay-product environment, typically with a single stream. See the ServerScaling topic for some guidance on tuning servers for many concurrent connections.

-- SimonLeinen - 27 Oct 2004 - 02 Nov 2008
-- PekkaSavola - 02 Oct 2008

OS-Specific Configuration Hints: BSD Variants

Performance tuning

By default, earlier BSD versions use very modest buffer sizes and don't enable Window Scaling. See the references below for how to change this.

In contrast, FreeBSD 7.0 introduced TCP buffer auto-tuning, and thus should provide good TCP performance out of the box even over LFNs. This release also implements large-send offload (LSO) and large-receive offload (LRO) for some Ethernet adapters. FreeBSD 7.0 also announces the following in its presentation of technological advances:

10Gbps network optimization: With optimized device drivers from all major 10gbps network vendors, FreeBSD 7.0 has seen extensive optimization of the network stack for high performance workloads, including auto-scaling socket buffers, TCP Segment Offload (TSO), Large Receive Offload (LRO), direct network stack dispatch, and load balancing of TCP/IP workloads over multiple CPUs on supporting 10gbps cards or when multiple network interfaces are in use simultaneously. Full vendor support is available from Chelsio, Intel, Myricom, and Neterion.

Recent TCP Work

The FreeBSD Foundation has funded a project "Improvements to the FreeBSD TCP Stack" by Lawrence Stewart at Swinburne University. Goals for this project include support for Appropriate Byte Counting (ABC, RFC 3465), merging SIFTR into the FreeBSD codebase, and improving the implementation of the reassembly queue. Information is available on http://caia.swin.edu.au/urp/newtcp/. Other improvements done by this group are support for modular congestion control, implementations of CUBIC, H-TCP, TCP Vegas, the SIFTR TCP instrumentation tool, and a testing framework including improvements to iperf for better buffer size control.

The CUBIC implementation for FreeBSD was announced on the ICCRG mailing list in September 2009. Currently available as patches for the 7.0 and 8.0 kernels, it is planned to be merged "into the mainline FreeBSD source tree in the not too distant future".

In February 2010, a set of Software for FreeBSD TCP R&D was announced by the Swinburne group: This includes modular TCP congestion control, Hamilton TCP (H-TCP), a newer "Hamilton Delay" Congestion Control Algorithm v0.1, Vegas Congestion Control Algorithm v0.1, as well as a kernel helper/hook framework ("Khelp") and a module (ERTT) to improve RTT estimation.

Another release in August 2010 added a "CAIA-Hamilton Delay" congestion control algorithm as well as revised versions of the other components.

QoS tools

On BSD systems ALTQ implements a couple of queueing/scheduling algorithms for network interfaces, as well as some other QoS mechanisms.

To use ALTQ on a FreeBSD 5.x or 6.x box, the following are the necessary steps:

  1. build a kernel with ALTQ
    • options ALTQ and some others beginning with ALTQ_ should be added to the kernel config. Please refer to the ALTQ(4) man page.
  2. define QoS settings in /etc/pf.conf
  3. use pfctl to apply those settings
    • Set pf_enable to YES in /etc/rc.conf (set as well the other variables related to pf according to your needs) in order to apply the QoS settings every time the host boots.

References

-- SimonLeinen - 27 Jan 2005 - 18 Aug 2010
-- PekkaSavola - 17 Nov 2006

Linux-Specific Network Performance Tuning Hints

Linux has its own implementation of the TCP/IP Stack. With recent kernel versions, the TCP/IP implementation contains many useful performance features. Parameters can be controlled via the /proc interface, or using the sysctl mechanism. Note that although some of these parameters have ipv4 in their names, they apply equally to TCP over IPv6.

A typical configuration for high TCP throughput over paths with high bandwidth*delay product would include the following in /etc/sysctl.conf:

A description of each parameter listed below can be found in section Linux IP Parameters.

Basic tuning

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Since kernel 2.6.17, buffers have sensible automatically calculated values for most uses. Unless the path has a very high RTT or loss, or the performance requirement is very high (200+ Mbit/s), buffer settings may not need to be tuned at all.

Nonetheless, the following values may be used:

net/core/rmem_max=16777216
net/core/wmem_max=16777216
net/ipv4/tcp_rmem="8192 87380 16777216"
net/ipv4/tcp_wmem="8192 65536 16777216"

With kernel < 2.4.27 or < 2.6.7, receive-side autotuning may not be implemented, and the default (middle value) should be increased (at the cost of higher default memory consumption):

net/ipv4/tcp_rmem="8192 16777216 16777216"

NOTE: If you have a server with hundreds of connections, you might not want to use a large default value for TCP buffers, as memory may quickly run out.

There is a subtle but important implementation detail in the socket buffer management of Linux. When setting either the send or receive buffer size via the SO_SNDBUF and SO_RCVBUF socket options with setsockopt(2), the value passed in the system call is doubled by the kernel to accommodate buffer management overhead. Reading the values back with getsockopt(2) returns this modified value, but the effective buffer available to TCP payload is still the original value.

The values net/core/rmem_max and net/core/wmem_max apply to the argument to setsockopt(2).

In contrast, the maximum values of net/ipv4/tcp_rmem and net/ipv4/tcp_wmem apply to the total buffer sizes including the factor of 2 for the buffer management overhead. As a consequence, those values must be chosen twice as large as required by a particular BandwidthDelayProduct. Also note that the values net/core/rmem_max and net/core/wmem_max do not apply to the TCP autotuning mechanism.
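
The factor of 2 is easy to forget when deriving settings for a given path. As a rough sketch (the function names are invented here, and the minimum and default values are simply the ones used in the example settings above), the maximum autotuning values for a target rate and RTT could be computed like this:

```python
def tcp_mem_max(rate_bps, rtt_s):
    """Maximum for tcp_rmem/tcp_wmem: twice the bandwidth-delay product."""
    bdp = round(rate_bps * rtt_s / 8)
    # Factor 2 accounts for the kernel's buffer management overhead.
    return 2 * bdp


def sysctl_lines(rate_bps, rtt_s, minimum=8192, rdefault=87380,
                 wdefault=65536):
    """Render lines for /etc/sysctl.conf in dotted sysctl notation."""
    top = tcp_mem_max(rate_bps, rtt_s)
    return [
        f"net.ipv4.tcp_rmem = {minimum} {rdefault} {top}",
        f"net.ipv4.tcp_wmem = {minimum} {wdefault} {top}",
    ]


# Example: 1 Gbit/s over a path with 50 ms RTT
for line in sysctl_lines(1_000_000_000, 0.05):
    print(line)
```

An application that sets its buffers explicitly with setsockopt(2) would pass the plain bandwidth-delay product instead, because the kernel applies the doubling itself in that case.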

Interface queue lengths

InterfaceQueueLength describes how to adjust interface transmit and receive queue lengths. This tuning is typically needed with GE or 10GE transfers.

Host/adapter architecture implications

When going for 300 Mbit/s performance, it is worth verifying that host architecture (e.g., PCI bus) is fast enough. PCI Express is usually fast enough to no longer be the bottleneck in 1Gb/s and even 10Gb/s applications.

For the older PCI/PCI-X buses, when going for 2+ Gbit/s performance, the Maximum Memory Read Byte Count (MMRBC) usually needs to be increased using setpci.

Many network adapters support features such as checksum offload. In some cases, however, these may even decrease performance. In particular, TCP Segment Offload may need to be disabled, with:

ethtool -K eth0 tso off

Advanced tuning

Sharing congestion information across connections/hosts

2.4 series kernels have a TCP/IP weakness in that their interface buffers' maximum window size is based on the experience of previous connections - if you have loss at any point (or a bad end host on the same route) you limit your future TCP connections. So, you may have to flush the route cache to improve performance.

net.ipv4.route.flush=1

2.6 kernels also remember some performance characteristics across connections. In benchmarks and other tests, this might not be desirable.

# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save=1

Other TCP performance variables

If there is packet reordering in the network, reordering could end up being interpreted as a packet loss too easily. Increasing tcp_reordering parameter might help in that case:

net/ipv4/tcp_reordering=20   # (default=3)

Several variables already have good default values, but it may make sense to check that these defaults haven't been changed:

net/ipv4/tcp_timestamps=1
net/ipv4/tcp_window_scaling=1
net/ipv4/tcp_sack=1
net/ipv4/tcp_moderate_rcvbuf=1
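
Checking these values by hand on many hosts is tedious. The following Python sketch reads them through the /proc interface and reports any that differ from the values listed above. The function name and the `root` parameter (which exists only so that the check can be tried against a test directory) are choices made for this example.

```python
from pathlib import Path

EXPECTED = {
    "net/ipv4/tcp_timestamps": "1",
    "net/ipv4/tcp_window_scaling": "1",
    "net/ipv4/tcp_sack": "1",
    "net/ipv4/tcp_moderate_rcvbuf": "1",
}


def changed_settings(expected=EXPECTED, root="/proc/sys"):
    """Return {name: current value} for settings that differ from `expected`.

    Entries that cannot be read on this kernel are reported as None.
    """
    changed = {}
    for name, wanted in expected.items():
        try:
            current = (Path(root) / name).read_text().strip()
        except OSError:
            current = None
        if current != wanted:
            changed[name] = current
    return changed


if __name__ == "__main__":
    for name, value in changed_settings().items():
        print(f"{name}: {value!r} (expected {EXPECTED[name]!r})")
```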

TCP Congestion Control algorithms

Linux 2.6.13 introduced pluggable congestion control modules, which allow you to select one of the high-speed TCP congestion control variants, e.g. CUBIC:

net/ipv4/tcp_congestion_control = cubic

Alternative values include highspeed (HS-TCP), scalable (Scalable TCP), htcp (Hamilton TCP), bic (BIC), reno ("Reno" TCP), and westwood (TCP Westwood).

Note that on Linux 2.6.19 and later, CUBIC is already used as the default algorithm.
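
Besides the system-wide sysctl, Linux also lets an application choose the algorithm for an individual socket through the TCP_CONGESTION socket option. The Python sketch below is one way to read, and optionally set, the algorithm of a freshly created socket. The function name is invented for this example, and it returns None where the option is not available (for instance on non-Linux systems) or where the requested algorithm is not loaded or not permitted.

```python
import socket


def socket_congestion_control(algorithm=None):
    """Optionally set, then read back, the algorithm of a new TCP socket.

    Returns the algorithm name, or None where the TCP_CONGESTION socket
    option is unavailable or the requested algorithm is rejected.
    """
    option = getattr(socket, "TCP_CONGESTION", None)
    if option is None:
        return None
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            if algorithm is not None:
                sock.setsockopt(socket.IPPROTO_TCP, option,
                                algorithm.encode())
            raw = sock.getsockopt(socket.IPPROTO_TCP, option, 16)
    except OSError:
        return None
    # The kernel pads the name with NUL bytes.
    return raw.split(b"\0", 1)[0].decode()


if __name__ == "__main__":
    print("default for new sockets:", socket_congestion_control())
```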

Web100 kernel tuning

If you are using a web100 kernel, the following parameters seem to improve networking performance even further:

# web100 tuning
# turn off using txqueuelen as part of congestion window computation
net/ipv4/WAD_IFQ = 1

QoS tools

Modern Linux kernels have flexible traffic shaping built in.

See the Linux traffic shaping example for an illustration of how these mechanisms can be used to solve a real performance problem.

References

See also the reference to Congestion Control in Linux in the FlowControl topic.

-- SimonLeinen - 27 Oct 2004 - 23 Jul 2009
-- ChrisWelti - 27 Jan 2005
-- PekkaSavola - 17 Nov 2006
-- AlexGall - 28 Nov 2016

Interface queue lengths

txqueuelen

The txqueuelen parameter of an interface in the Linux kernel limits the number of packets in the transmission queue in the interface's device driver. In 2.6 series and e.g., RHEL3 (2.4.21) kernel, the default is 1000. In earlier kernels, the default is 100. '100' is often too low to support line-rate transfers over Gigabit Ethernet interfaces, and in some cases, even '1000' is too low.

For Gigabit Ethernet interfaces, it is suggested to use a txqueuelen of at least 1000 (values of up to 8000 have been used successfully to further improve performance), e.g.,

ifconfig eth0 txqueuelen 1000

If a host is low performance or has slow links, having too big txqueuelen may disturb interactive performance.
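
The effect on interactive performance can be estimated from the time it takes to drain a full queue. The sketch below is a back-of-the-envelope calculation, assuming every queued packet is a full-sized 1500-byte frame; the function name is made up for this example.

```python
def queue_drain_seconds(txqueuelen, link_bps, packet_bytes=1500):
    """Worst-case time to transmit a completely full interface queue."""
    return txqueuelen * packet_bytes * 8 / link_bps


# A queue of 1000 packets drains in about 12 ms on Gigabit Ethernet,
# but takes more than a second on a 10 Mbit/s link.
print(queue_drain_seconds(1000, 1e9))
print(queue_drain_seconds(1000, 10e6))
```

A packet from an interactive session that arrives behind a full queue has to wait this long before it is sent, which is why large values are best reserved for fast interfaces.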

netdev_max_backlog

The sysctl net.core.netdev_max_backlog defines the queue sizes for received packets. In recent kernels (like 2.6.18), the default is 1000; in older ones, it is 300. If the interface receives packets (e.g., in a burst) faster than the kernel can process them, this could overflow. A value in the order of thousands should be reasonable for GE, tens of thousands for 10GE.

For example,

net/core/netdev_max_backlog=2500

References

-- SimonLeinen - 06 Jan 2005
-- PekkaSavola - 17 Nov 2006

OS-Specific Configuration Hints: Mac OS X

As Mac OS X is mainly a BSD derivative, you can use similar mechanisms to tune the TCP stack - see under BsdOSSpecific.

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

For testing temporary improvements, you can directly use sysctl in a terminal window (you have to be root to do that):

sysctl -w kern.ipc.maxsockbuf=8388608
sysctl -w net.inet.tcp.rfc1323=1
sysctl -w net.inet.tcp.sendspace=1048576
sysctl -w net.inet.tcp.recvspace=1048576
sysctl -w kern.maxfiles=65536
sysctl -w net.inet.udp.recvspace=147456
sysctl -w net.inet.udp.maxdgram=57344
sysctl -w net.local.stream.recvspace=65535
sysctl -w net.local.stream.sendspace=65535

For permanent changes that last over a reboot, insert the appropriate configurations into /etc/sysctl.conf. If this file does not exist, you must create it. So, for the above, just add the following lines to sysctl.conf:

kern.ipc.maxsockbuf=8388608
net.inet.tcp.rfc1323=1
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
kern.maxfiles=65536
net.inet.udp.recvspace=147456
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535

Note: This only works for OSX 10.3 or later! For earlier versions you need to use /etc/rc where you can enter whole sysctl commands.

Users that are unfamiliar with terminal windows can also use the GUI tool "TinkerTool System" and use its Network Tuning option to set the TCP buffers.

TinkerTool System is available from:

-- ChrisWelti - 30 Jun 2005

OS-Specific Configuration Hints: Solaris (Sun Microsystems)

Planned Features

Pluggable Congestion Control for TCP and SCTP

A proposed OpenSolaris project foresees the implementation of pluggable congestion control for both TCP and SCTP. This includes implementation of the HighSpeed, CUBIC, Westwood+, and Vegas congestion control algorithms, as well as ipadm subcommands and socket options to get and set congestion control parameters.

On 15 December 2009, Artem Kachitchkine posted an initial draft of a work-in-progress design specification for this feature on the OpenSolaris networking-discuss forum. According to this proposal, the API for setting the congestion control mechanism for a specific TCP socket will be compatible with Linux: There will be a TCP_CONGESTION socket option to set and retrieve a socket's congestion control algorithm, as well as a TCP_INFO socket option to retrieve various kinds of information about the current congestion control state.

The entire pluggable congestion control mechanism will be implemented for SCTP in addition to TCP. For example, there will also be an SCTP_CONGESTION socket option. Note that congestion control in SCTP is somewhat trickier than in TCP, because a single SCTP socket can have multiple underlying paths through SCTP's "multi-homing" feature. Congestion control state must be kept separately for each path (address pair). This also means that there is no direct SCTP equivalent to TCP_INFO. The current proposal adds a subset of TCP_INFO's information to the result of the existing SCTP_GET_PEER_ADDR_INFO option for getsockopt().

The internal structure of the code will be somewhat different to what is in the Linux kernel. In particular, the general TCP code will only make calls to the algorithm-specific congestion control modules, not vice versa. The proposed Solaris mechanism also contains ipadm properties that can be used to set the default congestion control algorithm either globally or for a specific zone. The proposal also suggests "observability" features; for example, pfiles output should include the congestion algorithm used for a socket, and there are new kstat statistics that count certain congestion-control events.

Useful Features

TCP Multidata Transmit (MDT, aka LSO)

Solaris 10, and Solaris 9 with patches, supports TCP Multidata Transmit (MDT), which is Sun's name for (software-only) Large Send Offload (LSO). In Solaris 10, this is enabled by default, but in Solaris 9 (with the required patches for MDT support), the kernel and driver have to be reconfigured to be able to use MDT. See the following pointers for more information from docs.sun.com:

Solaris 10 "FireEngine"

The TCP/IP stack in Solaris 10 has been largely rewritten from previous versions, mostly to improve performance. In particular, it supports Interrupt Coalescence, integrates TCP and IP more closely in the kernel, and provides multiprocessing enhancements to distribute work more efficiently over multiple processors. Ongoing work includes UDP/IP integration for better performance of UDP applications, and a new driver architecture that can make use of flow classification capabilities in modern network adapters.

Solaris 10: New Network Device Driver Architecture

Solaris 10 introduces GLDv3 (project "Nemo"), a new driver architecture that generally improves performance, and adds support for several performance features. Some, but not all, Ethernet device drivers were ported over to the new architecture and benefit from those improvements. Notably, the bge driver was ported early, and the new "Neptune" adapters ("multithreaded" dual-port 10GE and four-port GigE with on-board connection de-multiplexing hardware) used it from the start.

Darren Reed has posted a small C program that lists the active acceleration features for a given interface. Here's some sample output:

$ sudo ./ifcapability
lo0 inet
bge0 inet +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1) +POLL
lo0 inet6
bge0 inet6 +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1)

Displaying and setting link parameters with dladm

Another OpenSolaris project called Brussels unifies many aspects of network driver configuration under the dladm command. For example, link MTUs (for "Jumbo Frames") can be configured using

dladm set-linkprop -p mtu=9000 web1

The command can also be used to look at current physical parameters of interfaces:

$ sudo dladm show-phys
LINK         MEDIA                STATE      SPEED DUPLEX   DEVICE
bge0         Ethernet             up         1000 full      bge0

Note that Brussels is still being integrated into Solaris. Driver support for some types of adapters has been available since SXCE (Solaris Express Community Edition) build 83. Eventually this should be integrated into regular Solaris releases.

Setting TCP buffers

# To increase the maximum tcp window
# Rule-of-thumb: max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152

# To increase the DEFAULT tcp window size
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536
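
As a rough guide for choosing these values, the bandwidth-delay product (BDP) of the path gives the amount of data that must be in flight to fill it. The following Python sketch applies the rule of thumb from the comment above. The function names and the example path (1 Gb/s, 32 ms RTT) are illustrative assumptions, not taken from the Solaris documentation.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product in bytes: data in flight needed to fill the path."""
    return bandwidth_bps * rtt_ms // 8000  # bits/s * ms -> bytes


def suggest_ndd_settings(bandwidth_bps, rtt_ms):
    """Apply the rule of thumb max_buf = 2 x cwnd_max, taking cwnd_max = BDP."""
    cwnd_max = bdp_bytes(bandwidth_bps, rtt_ms)
    return {"tcp_cwnd_max": cwnd_max, "tcp_max_buf": 2 * cwnd_max}


if __name__ == "__main__":
    # Hypothetical path: 1 Gb/s with a 32 ms round-trip time
    for name, value in suggest_ndd_settings(1_000_000_000, 32).items():
        print(f"ndd -set /dev/tcp {name} {value}")
```

Whether a particular system accepts values of this size depends on the Solaris release and available memory, so treat the output as a starting point for experiments.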

Pitfall when using asymmetric send and receive buffers

The documented default behaviour (tunable TCP parameter tcp_wscale_always = 0) of Solaris is to include the TCP window scaling option in an initial SYN packet when either the send or the receive buffer is larger than 64KiB. From the tcp(7P) man page:

          For all applications, use ndd(1M) to modify the configuration
          parameter tcp_wscale_always. If tcp_wscale_always is set to 1,
          the window scale option will always be set when connecting to a
          remote system. If tcp_wscale_always is 0, the window scale option
          will be set only if the user has requested a send or receive
          window larger than 64K. The default value of tcp_wscale_always
          is 0.

However, Solaris 8, 9 and 10 do not take the send window into account. This results in unexpected behaviour for a bulk transfer from node A to node B when the bandwidth-delay product is larger than 64KiB and

  • A's receive buffer (tcp_recv_hiwat) < 64KiB
  • A's send buffer (tcp_xmit_hiwat) > 64KiB
  • B's receive buffer > 64KiB

A will not advertise the window scaling option, and B will then not do so either, because RFC 1323 only enables scaling when both sides send the option in their SYN segments. As a consequence, B's advertised receive window cannot exceed 64KiB, and throughput is limited to one such window per round-trip time.

As a workaround, the window scaling option can be forcibly advertised by setting

# ndd -set /dev/tcp tcp_wscale_always 1

A bug report has been filed with Sun Microsystems.
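
The difference between the documented and the observed behaviour can be written down as two small predicates. This is only a model of the description above, not code from Solaris; the function names are made up for illustration.

```python
WSCALE_THRESHOLD = 64 * 1024  # 64 KiB


def offers_wscale_documented(xmit_hiwat, recv_hiwat, wscale_always=0):
    """tcp(7P) wording: either buffer larger than 64K triggers the option."""
    return wscale_always == 1 or max(xmit_hiwat, recv_hiwat) > WSCALE_THRESHOLD


def offers_wscale_observed(xmit_hiwat, recv_hiwat, wscale_always=0):
    """Behaviour reported above for Solaris 8, 9 and 10: send buffer ignored."""
    return wscale_always == 1 or recv_hiwat > WSCALE_THRESHOLD


if __name__ == "__main__":
    # Node A from the scenario above: large send buffer, small receive buffer
    a_xmit, a_recv = 1024 * 1024, 48 * 1024
    print("documented:", offers_wscale_documented(a_xmit, a_recv))
    print("observed:  ", offers_wscale_observed(a_xmit, a_recv))
    print("workaround:", offers_wscale_observed(a_xmit, a_recv, wscale_always=1))
```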

-- ChrisWelti - 11 Oct 2005, added section for setting default and maximum TCP buffers
-- AlexGall - 26 Aug 2005, added section on pitfall with asymmetric buffers
-- SimonLeinen - 27 Jan 2005 - 16 Dec 2009

Windows-Specific Host Tuning

Next Generation TCP/IP Stack in Windows Vista / Windows Server 2008 / Windows 7

According to http://www.microsoft.com/technet/community/columns/cableguy/cg0905.mspx, the new version of Windows, Vista, features a redesigned TCP/IP stack. Besides unifying IPv4/IPv6 dual-stack support, this new stack also claims much-improved performance for high-speed and asymmetric networks, as well as several auto-tuning features.

The Microsoft Windows Vista Operating System enables the TCP Window Scaling option by default (previous Windows OSes had this option disabled). This causes problems with various middleboxes, see WindowScalingProblems. As a consequence, the scaling factor is limited to 2 for HTTP traffic in Vista.

Another new feature of Vista is Compound TCP, a high-performance TCP variant that uses delay information to adapt its transmission rate (in addition to loss information). On Windows Server 2008 it is enabled by default, but it is disabled in Vista. You can enable it in Vista by running the following command as an Administrator on the command line.

netsh interface tcp set global congestionprovider=ctcp
(see CompoundTCP for more information).

Although the new TCP/IP stack has integrated auto-tuning for receive buffers, the TCP send buffer still seems to be limited to 16KB by default. This means you will not be able to get good upload rates for connections with higher RTTs. However, using the registry hack (see further below) you can manually change the default value to something higher.

Performance tuning for earlier Windows versions

Enabling TCP Window Scaling

The references below detail the various "whys" and "hows" to tune Windows network performance. Of particular note however is the setting of scalable TCP receive windows (see LargeTcpWindows).

It appears that, by default, not only does Windows not support RFC 1323 scalable windows, but the required key is not even in the Windows registry. The key (Tcp1323Opts) can be added to one of two places, depending on the particular system:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP] (Windows 95)
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters] (Windows 2000/XP)

Value Name: Tcp1323Opts
Data Type: REG_DWORD (DWORD Value)
Value Data: 0, 1, 2 or 3

  • 0 = disable RFC 1323 options
  • 1 = window scale enabled only
  • 2 = time stamps enabled only
  • 3 = both options enabled
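
The four values form a two-bit mask: bit 0 selects window scaling and bit 1 selects timestamps. A small Python sketch (the function name is hypothetical) makes the mapping explicit:

```python
def decode_tcp1323opts(value):
    """Interpret a Tcp1323Opts registry value (0..3) as two option flags."""
    if value not in (0, 1, 2, 3):
        raise ValueError("Tcp1323Opts must be 0, 1, 2 or 3")
    return {
        "window_scaling": bool(value & 1),  # bit 0
        "timestamps": bool(value & 2),      # bit 1
    }


if __name__ == "__main__":
    for v in range(4):
        print(v, decode_tcp1323opts(v))
```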

This can also be adjusted using the DrTCP GUI tool.

TCP Buffer Sizes

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Buffers for TCP Send Windows

Inquiry at Microsoft (thanks to Larry Dunn) has revealed that the default send window is 8KB and that there is no official support for configuring a system-wide default. However, the current Winsock implementation uses the following undocumented registry key for this purpose:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\AFD\Parameters]

Value Name: DefaultSendWindow
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. The maximum value is unknown.

According to Microsoft, this parameter may not be supported in future Winsock releases. However, this parameter is confirmed to be working with Windows XP, Windows 2000, Windows Vista and even the Windows 7 Beta 1.

This value needs to be manually adjusted (e.g., DrTCP can't do it), or the application needs to set it (e.g., iperf with '-l' option). It may be difficult to detect this as a bottleneck because the host will advertise large windows but will only have about 8.5KB of data in flight.
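
To see why such a small send window matters, recall that a sender can have at most one window of data in flight per round trip. The sketch below computes the resulting upper bound; the 100 ms RTT is an assumed example value, and the function name is illustrative.

```python
def max_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on throughput with at most window_bytes in flight per RTT."""
    return window_bytes * 8 * 1000 / rtt_ms


if __name__ == "__main__":
    # Default 8 KB send window over an assumed 100 ms path
    bound = max_throughput_bps(8 * 1024, 100)
    print(f"{bound / 1000:.0f} kb/s")
```

With these numbers the bound is well below 1 Mb/s, regardless of how large a window the receiver advertises.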

Buffers for TCP Receive Windows

(From this TechNet article)

Assuming Window Scaling is enabled, the following parameter can be set to between 1 and 1073741823 bytes:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters]

Value Name: TcpWindowSize
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. 1 to 65535 (or 1 to 1073741823 if Window scaling set).

DrTCP can also be used to adjust this parameter.
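
The relation between a desired receive window and the window scale factor can be illustrated as follows. The TCP header carries a 16-bit window field, and RFC 1323 allows it to be shifted left by at most 14 bits. The Python sketch below (function name hypothetical) finds the smallest shift that covers a given window size; how a particular Windows version actually chooses the shift is not covered here.

```python
MAX_SHIFT = 14  # largest shift count permitted by RFC 1323


def window_scale_shift(window_bytes):
    """Smallest shift s (capped at 14) such that 65535 << s >= window_bytes."""
    shift = 0
    while shift < MAX_SHIFT and (65535 << shift) < window_bytes:
        shift += 1
    return shift


if __name__ == "__main__":
    for size in (65535, 65536, 1024 * 1024, 4 * 1024 * 1024):
        print(size, "->", window_scale_shift(size))
```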

Windows XP SP2/SP3 simultaneous connection limit

Windows XP SP2 and SP3 limit the number of simultaneous TCP connection attempts (the so-called half-open state) to 10 by default. Exactly how this is enforced is not clear, but the limit can affect some kinds of applications; file-sharing applications such as BitTorrent in particular might suffer from this behaviour. You can check whether your performance is affected by looking in your event logs for the following entry:

EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts

A patcher has been developed to adjust this limit.

-- TobyRodwell - 28 Feb 2005 (initial version)
-- SimonLeinen - 06 Apr 2005 (removed dead pointers; added new ones; descriptions)
-- AlexGall - 17 Jun 2005 (added send window information)
-- HankNussbacher - 19 Jun 2005 (added new MS webcast for 2003 Server)
-- SimonLeinen - 12 Sep 2005 (added information and pointer about new TCP/IP in Vista)
-- HankNussbacher - 22 Oct 2006 (added Cisco IOS firewall upgrade)
-- AlexGall - 31 Aug 2007 (replaced IOS issue with a link to WindowScalingProblems, added info for scaling limitation in Vista)
-- SimonLeinen - 04 Nov 2007 (information on how to enable Compound TCP on Vista)
-- PekkaSavola - 05 Jun 2008 major reorganization to improve the clarity wrt send buffer adjustment, add a link to DrTCP; add XP SP3 connection limiting discussion
-- ChrisWelti - 02 Feb 2009 added new information regarding the send window in windows vista / server 2008 / 7

Network Adapter and Driver Issues

One aspect that causes many performance problems is adapter and NIC compatibility issues (a full vs. half duplex mismatch, for example). The document "Troubleshooting Cisco Catalyst Switches to NIC Compatibility Issues" from Cisco covers many vendor NICs.

Performance Impacts of Host Architecture

The host architecture has an impact on performance, especially at higher rates. Important factors include:

Performance-Friendly Adapter and Driver Features

There are several different techniques that enhance network performance by moving critical functions to the network adapter. Those techniques typically (but with the notable exception of GSO) require both special hardware support in the adapter and support at the device driver level to make use of that special hardware.

-- SimonLeinen - 2005-02-02 - 2015-01-28
-- PekkaSavola - 2006-11-15

Large Send Offload (LSO)

This feature is also known as "Segmentation Offload", "TCP Segmentation Offload (TSO)", "[TCP] Multidata Transmit (MDT)", or "TCP Large Send".

From Microsoft's document, Windows Network Task Offload:

With Segmentation Offload, or TCP Large Send, TCP can pass a buffer to be transmitted that is bigger than the maximum transmission unit (MTU) supported by the medium. Intelligent adapters implement large sends by using the prototype TCP and IP headers of the incoming send buffer to carve out segments of required size. Copying the prototype header and options, then calculating the sequence number and checksum fields creates TCP segment headers. All other information, such as options and flag values, are preserved except where noted.

Large Send Offload can be seen as doing for output what interrupt coalescence combined with large-receive offload does for input, namely reduce the number of (bus/interrupt) transactions between CPUs and network adapters by bundling multiple packets to larger transactions (scatter/gather).

Hardware (network adapter) support for LSO is a refinement of transmit chaining, where multiple transmitted frames can be sent from the host to the adapter in a single transaction.
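
The segmentation step described in the Microsoft text can be sketched in a few lines of Python. This is a simplified illustration, not driver or firmware code: it only shows how one large send buffer is carved into MSS-sized pieces with consecutive sequence numbers, and it leaves out header copying and checksum calculation. The function name is hypothetical.

```python
def segment_large_send(payload, first_seq, mss):
    """Split one large buffer into (sequence_number, chunk) pairs of <= mss bytes."""
    segments = []
    for offset in range(0, len(payload), mss):
        seq = (first_seq + offset) % 2**32  # TCP sequence numbers wrap at 2**32
        segments.append((seq, payload[offset:offset + mss]))
    return segments


if __name__ == "__main__":
    # A 4000-byte send handed down in one transaction, MSS of 1460 bytes
    for seq, chunk in segment_large_send(b"x" * 4000, first_seq=1000, mss=1460):
        print(seq, len(chunk))
```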

Issues with Large Send Offload

Timing and Burstiness

Like Interrupt Coalescence, LSO can affect packet timing and increase burstiness. An illustration of this effect is this patch that modified LSO (TSO as it is called in Linux) to bound the time that outgoing segments can be held while trying to accumulate a larger transfer unit. The accompanying message to the netdev mailing list includes some graphs that show the impact of (pre-patch) TSO on RTTs over a low-speed link.

In Linux, the burstiness issue was addressed in 2013 in a TSO autosizing patch by Eric Dumazet.

(Transport) Protocol Fossilization

The way it is defined by most of the industry, LSO needs to be aware of the transport protocols. In particular, it must be able to split over-large transport segments into suitable sub-segments, and generate transport (e.g. TCP) headers for these sub-segments. This function is typically implemented in the adapter's firmware, for some popular transport protocol such as TCP. This makes it hard to implement additional functions such as IPSec, or the TCP MD5 Authentication option, or even other transport protocols such as SCTP.

There is a weakened form of LSO that requires the host operating system to prepare the segmentation and construct headers. This allows for "dumber" network adapters, and in particular it doesn't require them to be transport protocol-aware. It still provides significant performance improvement because multiple segments can be transferred between host and adapter in a single transaction, which reduces bus occupation and other overhead. Sun's Solaris operating system supports this variant of LSO under the name of "MDT" (Multidata Transmit), and the Linux kernel added something similar as part of "GSO" in 2.6.18 (September 2006) for IPv4 and in 2.6.35 (August 2010) for IPv6.

Configuration

Under Linux, LSO/TSO can be controlled using the -K option to the ethtool command, which can also be used to control other offloading features. It is typically enabled by default if kernel/driver and adapter support it.

-- TobyRodwell & SimonLeinen - 2005-02-28 - 2015-04-26

Interrupt Coalescence (also called Interrupt Moderation, Interrupt Blanking, or Interrupt Throttling)

A common bottleneck for high-speed data transfers is the high rate of interrupts that the receiving system has to process - traditionally, a network adapter generates an interrupt for each frame that it receives. These interrupts consume signaling resources on the system's bus(es), and introduce significant CPU overhead as the system transitions back and forth between "productive" work and interrupt handling many thousand times a second.

To alleviate this load, some high-speed network adapters support interrupt coalescence. When multiple frames are received in a short timeframe ("back-to-back"), these adapters buffer those frames locally and only interrupt the system once.

Interrupt coalescence together with large-receive offload can roughly be seen as doing on the "receive" side what transmit chaining and large-send offload (LSO) do for the "transmit" side.

Issues with interrupt coalescence

While this scheme lowers interrupt-related system load significantly, it can have adverse effects on timing, and make TCP traffic more bursty or "clumpy". Therefore it would make sense to combine interrupt coalescence with on-board timestamping functionality. Unfortunately that doesn't seem to be implemented in commodity hardware/driver combinations yet.

The way that interrupt coalescence works, a network adapter that has received a frame doesn't send an interrupt to the system right away, but waits for a little while in case more packets arrive. This can have a negative impact on latency.

In general, interrupt coalescence is configured such that the additional delay is bounded. On some implementations, these delay bounds are specified in units of milliseconds, on other systems in units of microseconds. It requires some thought to find a good trade-off between latency and load reduction. One should be careful to set the coalescence threshold low enough that the additional latency doesn't cause problems. Setting a low threshold will prevent interrupt coalescence from occurring when successive packets are spaced too far apart. But in that case, the interrupt rate will probably be low enough so that this is not a problem.
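
The trade-off can be explored with a toy model. The Python sketch below assumes a simplified scheme in which the adapter raises an interrupt a fixed time after the first frame of a batch arrives; real adapters use various, often more elaborate, algorithms. The function name and the arrival times are made up for illustration.

```python
def coalesce(arrival_times_us, max_delay_us):
    """Group frame arrivals into interrupts.

    Assumed model: an interrupt fires max_delay_us after the first frame
    of a batch; frames arriving up to that moment join the batch.
    Returns a list of (interrupt_time_us, frames_in_batch) tuples.
    """
    interrupts = []
    batch_start = None
    count = 0
    for t in sorted(arrival_times_us):
        if batch_start is not None and t > batch_start + max_delay_us:
            interrupts.append((batch_start + max_delay_us, count))
            batch_start, count = None, 0
        if batch_start is None:
            batch_start = t
        count += 1
    if batch_start is not None:
        interrupts.append((batch_start + max_delay_us, count))
    return interrupts


if __name__ == "__main__":
    arrivals = [0, 10, 20, 500, 510]  # microseconds
    print(coalesce(arrivals, max_delay_us=50))  # frames are batched
    print(coalesce(arrivals, max_delay_us=5))   # spacing exceeds the bound
```

In this model the first frame of each batch is delayed by the full bound, while a bound smaller than the inter-frame spacing yields one interrupt per frame.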

Configuration

Configuration of interrupt coalescence is highly system dependent, although there are some parameters that are more or less common over implementations.

Linux

On Linux systems with additional driver support, the ethtool -C command can be used to modify the interrupt coalescence settings of network devices on the fly.

Some Ethernet drivers in Linux have parameters to control Interrupt Coalescence (Interrupt Moderation, as it is called in Linux). For example, the e1000 driver for the large family of Intel Gigabit Ethernet adapters has the following parameters according to the kernel documentation:

InterruptThrottleRate
limits the number of interrupts per second generated by the card. Values >= 100 are interpreted as the maximum number of interrupts per second. The default value was 8'000 up to and including kernel release 2.6.19. A value of zero (0) disables interrupt moderation completely. After 2.6.19, some values between 1 and 99 can be used to select adaptive interrupt rate control. The first adaptive modes are "dynamic conservative" (1) and "dynamic with reduced latency" (3). In conservative mode (1), the rate changes between 4'000 interrupts per second when only bulk traffic ("normal-size packets") is seen, and 20'000 when small packets are present that might benefit from lower latency. In the more aggressive mode (3), "low-latency" traffic may drive the interrupt rate up to 70'000 per second. This mode is supposed to be useful for cluster communication in grid applications.
RxIntDelay
specifies, in units of 1.024 microseconds, the time after reception of a frame to wait for another frame to arrive before sending an interrupt.
RxAbsIntDelay
bounds the delay between reception of a frame and generation of an interrupt. It is specified in units of 1.024 microseconds. Note that InterruptThrottleRate overrides RxAbsIntDelay, so even when a very short RxAbsIntDelay is specified, the interrupt rate should never exceed the rate specified (either directly or by the dynamic algorithm) by InterruptThrottleRate.
RxDescriptors
specifies the number of descriptors to store incoming frames on the adapter. The default value is 256, which is also the maximum for some types of E1000-based adapters. Others can allocate up to 4'096 of these descriptors. The size of the receive buffer associated with each descriptor varies with the MTU configured on the adapter. It is always a power-of-two number of bytes. The number of descriptors available will also depend on the per-buffer size. When all buffers have been filled by incoming frames, an interrupt will have to be signaled in any case.
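
To get a feeling for these numbers, the following Python sketch relates an interrupt rate limit to the spacing between interrupts and to the number of frames that accumulate per interrupt at line rate. The function names are illustrative; the 20 bytes of per-frame overhead are the Ethernet preamble (8 bytes) and inter-frame gap (12 bytes).

```python
def frames_per_second(link_bps, frame_bytes, overhead_bytes=20):
    """Maximum frame rate on the wire, counting preamble and inter-frame gap."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)


def min_interrupt_spacing_us(interrupts_per_second):
    """Shortest time between two interrupts under a given rate limit."""
    return 1_000_000 / interrupts_per_second


def frames_per_interrupt(link_bps, frame_bytes, interrupts_per_second):
    """Average number of frames handled per interrupt at full line rate."""
    return frames_per_second(link_bps, frame_bytes) / interrupts_per_second


if __name__ == "__main__":
    # Gigabit Ethernet, full-size 1518-byte frames, old default of 8000 int/s
    print(min_interrupt_spacing_us(8000), "us between interrupts")
    print(round(frames_per_interrupt(10**9, 1518, 8000), 1), "frames per interrupt")
```

At 8'000 interrupts per second, roughly ten full-size frames arrive between interrupts on a saturated Gigabit link, which is comfortably below the default of 256 receive descriptors.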

Solaris

As an example, see the Platform Notes: Sun GigaSwift Ethernet Device Driver. It lists the following parameters for that particular type of adapter:

rx_intr_pkts
Interrupt after this number of packets have arrived since the last packet was serviced. A value of zero indicates no packet blanking. (Range: 0 to 511, default=3)
rx_intr_time
Interrupt after this number of 4.5-microsecond ticks has elapsed since the last packet was serviced. A value of zero indicates no time blanking. (Range: 0 to 524287, default=1250)

-- SimonLeinen - 04 Jul 2005 - 02 Jul 2011

Checksum Offload

A large part of the processing costs related to TCP is the generation and verification of the TCP checksum. Many Gigabit Ethernet chipsets include "on-board" hardware that can verify and/or generate these checksums. This significantly reduces the amount of work that has to be done by the system kernel on a CPU, especially when combined with other adapter/driver enhancements such as Large-Send Offload. Checksum Offload is also part of TCP Offload Engines (TOEs), which move the entire TCP processing from the CPU(s) to the adapter. Checksum Offload requires special driver support and a kernel infrastructure that supports such drivers.

TCP Checksum Offload is the most common form of checksum offload, but of course it is possible to offload other checksums such as the UDP or SCTP checksums.

Some people (at HP?) abbreviate Checksum Offload as "CKO".

A possible issue with offloading checksums to the controller is that the integrity protection is less "end-to-end": If there are errors in the internal (bus) transmission of data from the host processor/memory to the adapter, the adapter will happily compute checksums on the corrupted data, which means that the corruption will go undetected at the receiver.
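
For reference, the computation that is being offloaded is the 16-bit one's-complement "Internet checksum" described in RFC 1071. The Python sketch below computes it over a plain byte string; a real TCP checksum additionally covers a pseudo-header with addresses and lengths, which is omitted here.

```python
def internet_checksum(data):
    """16-bit one's-complement checksum as used by IP, TCP and UDP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    data = bytes.fromhex("0001f203f4f5f6f7")  # example bytes from RFC 1071
    csum = internet_checksum(data)
    print(hex(csum))
    # A receiver checks that data plus checksum sums to zero
    print(internet_checksum(data + csum.to_bytes(2, "big")) == 0)
```

The end-to-end concern above follows directly: if the data is corrupted on its way to the adapter, the adapter computes a checksum that matches the corrupted bytes, and the receiver's check still succeeds.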

-- SimonLeinen - 18 May 2008

TCP Offload Engines (TOEs)

The idea of a TOE is to put the TCP implementation onto the network adapter itself. This relieves the computer's CPUs of handling TCP segmentation, reassembly, checksum calculation and verification, and so on. Large-Send Offload (LSO) and Checksum Offload are typically assumed to be subsets of TOE functionality.

The drawbacks of TOEs are that they require driver support in the operating system, as well as additional kernel/driver interfaces for TCP-relevant operations (as opposed to the frame-based operations of traditional adapters). Also, when the operating system implements improvements to TCP over time, those normally have to be implemented on the TOE as well. And additional instrumentation such as the Web100 kernel instrumentation set would also need to be implemented separately.

For these and other reasons, TOEs (which are a relatively old idea) have never become a "mainstream" technology. In contrast, some more generic performance enhancements such as LSO, interrupt coalescence, or checksum offload, are now part of many "commodity" network adapter chip-sets, and enjoy increasing support in operating systems.

Mogul (2003) nicely presents most of the issues with TOEs, and argues that they aren't overly useful for general TCP use. However TOE - or, more generally, transport offload - may find good use as part of network adapters that provide Remote DMA (Remote Direct Memory Access) functionality for use in networked storage or clustering applications.

TOE on Windows: TCP Chimney

Microsoft Windows has its own architecture for TCP offload-capable network adapters called TCP Chimney. The name "chimney" stands for the channel that is established between the operating system and the adapter for each TCP connection that is offloaded. TCP Chimney was first introduced (in 2006?) as part of an addition to Windows Server 2003 called the Scalable Networking pack. In March 2008, it found its way into the first Service Pack for the Vista operating system (Vista SP1).

-- SimonLeinen - 04 Jul 2005 - 26 Mar 2008

Performance Hints for Application Developers

Caveat

Premature optimization is the root of all evil in programming

E. Dijkstra... or was it C.A.R. Hoare... or was it D. Knuth?

You should always get your program correct first, and think about optimizations once it works right. That said, it is always good to keep performance in mind while programming. The most important decisions are related to the choice of suitable algorithms, of course.

Regarding networked applications in particular, here are a few performance-related things to think about:

-- SimonLeinen - 05 Jul 2005

"Chatty" Protocols

A common problem with naively designed application protocols is that they are too "chatty", i.e. they imply too many "round-trip" cycles where one party has to wait for a response from the other. It is an easy mistake to make, because when testing such a protocol locally, these round-trips usually don't have much of an impact on overall performance. But when used over network paths with large RTTs, chattiness can dramatically impact perceived performance.

Example: SMTP (Simple Mail Transfer Protocol)

The Simple Mail Transfer Protocol (SMTP) is used to transport most e-mail messages over the Internet. In its original design (RFC 821, superseded by RFC 2821 and later by RFC 5321), the protocol consisted of a strict sequence of request/response transactions, some of them very small. Taking an example from RFC 2920, a typical SMTP conversation between a client "C" that wants to send a message, and a server "S" that receives it, would look like this:

   S: <wait for open connection>
   C: <open connection to server>
   S: 220 Innosoft.com SMTP service ready
   C: HELO dbc.mtview.ca.us
   S: 250 Innosoft.com
   C: MAIL FROM:<mrose@dbc.mtview.ca.us>
   S: 250 sender <mrose@dbc.mtview.ca.us> OK
   C: RCPT TO:<ned@innosoft.com>
   S: 250 recipient <ned@innosoft.com> OK
   C: RCPT TO:<dan@innosoft.com>
   S: 250 recipient <dan@innosoft.com> OK
   C: RCPT TO:<kvc@innosoft.com>
   S: 250 recipient <kvc@innosoft.com> OK
   C: DATA
   S: 354 enter mail, end with line containing only "."
    ...
   C: .
   S: 250 message sent
   C: QUIT
   S: 221 goodbye

This simple conversation contains nine places where the client waits for a response from the server.

In order to improve this, the PIPELINING extension (RFC 2920) was later defined. When the server supports it - as signaled through the ESMTP extension mechanism in the response to an EHLO request - the client is allowed to send multiple requests in a row, and collect the responses later. The previous conversation becomes the following one with PIPELINING:

   S: <wait for open connection>
   C: <open connection to server>
   S: 220 innosoft.com SMTP service ready
   C: EHLO dbc.mtview.ca.us
   S: 250-innosoft.com
   S: 250 PIPELINING
   C: MAIL FROM:<mrose@dbc.mtview.ca.us>
   C: RCPT TO:<ned@innosoft.com>
   C: RCPT TO:<dan@innosoft.com>
   C: RCPT TO:<kvc@innosoft.com>
   C: DATA
   S: 250 sender <mrose@dbc.mtview.ca.us> OK
   S: 250 recipient <ned@innosoft.com> OK
   S: 250 recipient <dan@innosoft.com> OK
   S: 250 recipient <kvc@innosoft.com> OK
   S: 354 enter mail, end with line containing only "."
    ...
   C: .
   C: QUIT
   S: 250 message sent
   S: 221 goodbye

There are still a couple of places where the client has to wait for responses, notably during initial negotiation; but the number of these situations has been reduced to those where the response has an impact on further processing. The PIPELINING extension reduces the number of "turn-arounds" from nine to four. This speeds up the overall mail submission process when the RTT is high, reduces the number of packets that have to be sent (because several requests, or several responses, can be sent as a single TCP segment), and significantly decreases the risk of timeouts (and consequent loss of connection) when the connectivity between client and server is really bad.
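
The turn-around counts can be checked mechanically. The Python sketch below counts every point in a transcript where a server line follows a client line, i.e. where the client had to wait, and multiplies by an assumed RTT of 200 ms to estimate the time spent waiting. The function name and the RTT are illustrative choices; the transcripts are the two conversations shown above.

```python
PLAIN = """\
S: <wait for open connection>
C: <open connection to server>
S: 220 Innosoft.com SMTP service ready
C: HELO dbc.mtview.ca.us
S: 250 Innosoft.com
C: MAIL FROM:<mrose@dbc.mtview.ca.us>
S: 250 sender <mrose@dbc.mtview.ca.us> OK
C: RCPT TO:<ned@innosoft.com>
S: 250 recipient <ned@innosoft.com> OK
C: RCPT TO:<dan@innosoft.com>
S: 250 recipient <dan@innosoft.com> OK
C: RCPT TO:<kvc@innosoft.com>
S: 250 recipient <kvc@innosoft.com> OK
C: DATA
S: 354 enter mail, end with line containing only "."
 ...
C: .
S: 250 message sent
C: QUIT
S: 221 goodbye
"""

PIPELINED = """\
S: <wait for open connection>
C: <open connection to server>
S: 220 innosoft.com SMTP service ready
C: EHLO dbc.mtview.ca.us
S: 250-innosoft.com
S: 250 PIPELINING
C: MAIL FROM:<mrose@dbc.mtview.ca.us>
C: RCPT TO:<ned@innosoft.com>
C: RCPT TO:<dan@innosoft.com>
C: RCPT TO:<kvc@innosoft.com>
C: DATA
S: 250 sender <mrose@dbc.mtview.ca.us> OK
S: 250 recipient <ned@innosoft.com> OK
S: 250 recipient <dan@innosoft.com> OK
S: 250 recipient <kvc@innosoft.com> OK
S: 354 enter mail, end with line containing only "."
 ...
C: .
C: QUIT
S: 250 message sent
S: 221 goodbye
"""


def count_turnarounds(transcript):
    """Count points where the client waits: a server line after a client line."""
    turnarounds = 0
    previous = None
    for line in transcript.splitlines():
        speaker = line.strip()[:2]
        if speaker not in ("C:", "S:"):
            continue  # skip the elided message data ("...")
        if speaker == "S:" and previous == "C:":
            turnarounds += 1
        previous = speaker
    return turnarounds


if __name__ == "__main__":
    rtt_ms = 200  # assumed round-trip time
    for name, text in (("plain", PLAIN), ("pipelined", PIPELINED)):
        n = count_turnarounds(text)
        print(f"{name}: {n} turn-arounds, about {n * rtt_ms} ms waiting")
```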

The X Window System protocol (X11) is an example of a protocol that has been designed from the start to reduce turn-arounds.

-- SimonLeinen - 05 Jul 2005

Performance-friendly I/O interfaces

read()/write() vs. mmap()/write() vs. sendfile()

For applications with high input/output performance requirements (including network I/O), it is worthwhile to look at operating system support for efficient I/O routines.

As an example, here is simple pseudo-code that reads the contents of an open file in and writes them to an open socket out - this code could be part of a file server. A straightforward way of coding this uses the read()/write() system calls to copy the bytes through a memory buffer:

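#include <unistd.h>

#define BUFSIZE 4096

/* Copy everything from descriptor in to descriptor out.
   Returns the number of bytes written, or -1 on error. */
long send_file (int in, int out) {
  unsigned char buffer[BUFSIZE];
  ssize_t result; long written = 0;
  /* The inner parentheses matter: without them, result would receive
     the outcome of the comparison (0 or 1) instead of the byte count. */
  while ((result = read (in, buffer, BUFSIZE)) > 0) {
    if (write (out, buffer, result) != result)
      return -1;
    written += result;
  }
  return (result == 0 ? written : result);
}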

Unfortunately, this common programming paradigm results in high memory traffic and inefficient use of a system's caches. Also, if a small buffer is used, the number of system operations and, in particular, of user/kernel context switches will be quite high.

On systems that support memory mapping of files using mmap(), the following is more efficient if the source is an actual file:

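#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

long send_file (int in, int out) {
  void *b;
  struct stat st;
  long written;
  if (fstat (in, &st) == -1) return -1;
  /* Map the whole file read-only; mmap() reports failure as MAP_FAILED. */
  if ((b = mmap (NULL, st.st_size, PROT_READ, MAP_SHARED, in, 0)) == MAP_FAILED)
    return -1;
  madvise (b, st.st_size, MADV_SEQUENTIAL);
  written = write (out, b, st.st_size);
  munmap (b, st.st_size);
  return written;
}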

An even more efficient - and also more concise - variant is the sendfile() call, which directly copies the bits from the file to the network.

#include <sys/sendfile.h>
#include <sys/stat.h>

long send_file (int in, int out) {
  struct stat st;
  if (fstat (in, &st) == -1) return -1;
  off_t offset = 0;
  return sendfile (out, in, &offset, st.st_size);
}

Note that an operating system could optimize this internally up to the point where data blocks are copied directly from the disk controller to the network controller without any involvement of the CPU.

For more complex situations, the sendfilev() interface (available on Solaris) can be used to send data from multiple files and memory buffers to construct complex protocol units with a single call.

-- SimonLeinen - 26 Jun 2005

Note that the above usage of write() and sendfile() is simplified: these system calls can stop in the middle and return the number of bytes written so far. For real-world usage, you should wrap them in a loop that continues sending the rest of the file and handles interruption by signals.

The first loop should be written as:

#include <errno.h>
#include <unistd.h>

#define BUFSIZE 4096
long send_file (int in, int out) {
  unsigned char buffer[BUFSIZE];
  ssize_t result; long written = 0;
  while ((result = read (in, buffer, BUFSIZE)) > 0) {
    ssize_t tosend = result;
    ssize_t offset = 0;
    while (tosend > 0) {
      result = write (out, buffer + offset, tosend);
      if (result == -1) {
        if (errno == EINTR || errno == EAGAIN)
          continue;
        else
          return -1;
      }
      written += result;
      offset += result;
      tosend -= result;
    }
  }
  return (result == 0 ? written : result);
}
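
The same looping applies to sendfile(). As a minimal sketch in Python, whose os.sendfile() wraps the system call on platforms that provide it, the loop could look as follows. It assumes a blocking output descriptor (on a non-blocking socket, os.sendfile() raises BlockingIOError instead of waiting); Python itself retries calls interrupted by signals.

```python
import os


def send_file(in_fd, out_fd):
    """Send the whole file behind in_fd to out_fd, resuming after partial sends."""
    remaining = os.fstat(in_fd).st_size
    offset = 0
    while remaining > 0:
        sent = os.sendfile(out_fd, in_fd, offset, remaining)
        if sent == 0:
            break  # end of file reached early, e.g. the file was truncated
        offset += sent
        remaining -= sent
    return offset  # total number of bytes sent
```

A caller would pass the descriptor of an open file and that of a connected socket, for example send_file(f.fileno(), sock.fileno()).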

References

  • Project Volo Design Document, OpenSolaris, 2008. This document contains, in section 4.5 ("zero-copy interface") a description of how sendfilev is implemented in current Solaris, as well as suggestions on generalizing the internal kernel interfaces that support it.

-- BaruchEven - 05 Jan 2006
-- SimonLeinen - 12 Sep 2008

Network Tuning

This section describes a few ways that can be used to improve performance of a network.

-- SimonLeinen - 01 Nov 2004 - 30 Mar 2006
-- PekkaSavola - 13 Jun 2008 (addition of WAN accelerators)
-- Alessandra Scicchitano - 11 Jun 2012 (Buffer Bloat)

Router Architectures

Basic Functions

Basic functions of an IP router can be divided into two main groups:

  • The forwarding plane is responsible for packet forwarding (or packet switching), which is the act of receiving packets on the router's interfaces and sending them out on (usually) other interfaces.
  • The control plane gathers and maintains network topology information, and passes it to the forwarding plane so that it knows where to forward the received packets.

Architectures

CPU-based vs. hardware forwarding support

Small routers usually have similar architectures to general purpose computers. They have a CPU, operational memory, some persistent storage device (to store configuration settings and the operating system software), and network interfaces. Both forwarding and control functions are carried out by the CPU, and being different processes is the only separation between the forwarding and the control plane. Network interfaces in these routers are treated just like NICs in general purpose computers, or as any other peripheral device.

High performance routers however, in order to achieve multi-Gbps throughput, use specialized hardware to forward packets. (The control plane is still very similar to a general-purpose computer architecture.) This way the forwarding and the control plane are far more separated. They do not contend for shared resources like processing power, because they both have their own.

Centralized vs. distributed forwarding

Another difference between low-end and high-end routers is that in low-end routers, packets from all interfaces are forwarded using a central forwarding engine, while some high-end routers decentralize forwarding across line cards, so that packets received on an interface are handled by a forwarding engine local to that line card. Distributed forwarding is especially attractive where one wants to scale routers to large numbers of line cards (and thus interfaces). Some routers allow line cards with and without forwarding engines to be mixed in the same chassis - packets arriving on engine-less line cards are simply handled by the central engine.

On distributed architectures, the line cards typically have their own copy of the forwarding table, and run very few "control-plane" functions - usually just what's necessary to communicate with the (central) control-plane engine. So even when forwarding is CPU-based, a router with distributed forwarding behaves much like a hardware-based router, in that forwarding performance is decoupled from control-plane performance.

There are examples for all kinds of combinations of CPU-/hardware-based and centralized/distributed forwarding, even within the products of a single vendor, for example Cisco:

              CPU-based     hardware forw.
centralized   7200          7600 OSR
distributed   7500 w/VIP    7600 w/DFCs

Effects on performance analysis

Because of the separation of the forwarding and the control plane in high performance routers, the performance characteristics of traffic passing through the router and traffic destined to the router may be significantly different. The reason is that transit traffic may be handled completely by the separate forwarding plane, which is optimized for exactly this function, while traffic destined to the router is passed to the control plane and handled by the control plane CPU. That CPU may have more important tasks at the moment (e.g. running routing protocols, calculating routes), and even when it is free, it cannot process traffic as efficiently as the forwarding plane.

This means that performance metrics of intermediate hops obtained from ping or traceroute-like tools may be misleading, as they may show significantly worse metrics than those of the analyzed "normal" transit traffic. In other words, it may easily happen that a router is forwarding transit traffic correctly, fast, and without any packet loss, while a traceroute through the router or a ping to the router shows packet loss, high round-trip time, high delay variation, or other anomalies.

However, the control plane CPU of a high performance router is normally not busy all the time, and in quiet periods it can deal well with the probe traffic directed to it. So measurement results from probe traffic to intermediate hops should usually be interpreted by dropping the occasional bad values, based on the assumption that the intermediate router's control plane CPU had more important things to do than dealing with the probe traffic.
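
One way to apply this in practice is to summarise the probe results for an intermediate hop with statistics that are robust against a few bad samples, such as the minimum and the median round-trip time. The following is only a minimal sketch; the function name and the convention of using None for an unanswered probe are assumptions made for this illustration, not part of any particular tool.

```python
import statistics

def summarize_hop(rtts_ms):
    """Summarise RTT samples (in ms) for one intermediate hop.

    rtts_ms is a list of floats; None marks a probe that got no reply.
    Minimum and median are used because they are hardly affected by
    the occasional slow reply from a busy control-plane CPU.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe result")
    answered = [r for r in rtts_ms if r is not None]
    loss = 1.0 - len(answered) / len(rtts_ms)
    if not answered:
        return {"loss": loss, "min": None, "median": None}
    return {
        "loss": loss,
        "min": min(answered),
        "median": statistics.median(answered),
    }

if __name__ == "__main__":
    # One outlier (250 ms) and one unanswered probe among five samples.
    print(summarize_hop([1.2, 1.3, 250.0, 1.1, None]))
```

Loss and delay reported for an intermediate hop in this way only become a concern when they also show up at the hops behind it or in the end-to-end measurement.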

Beware the slow path

Routers with hardware support for forwarding often restrict this support to a common subset of traffic and configurations. "Unusual" packets may be handled in a "slow path" using the general-purpose CPU which is otherwise mostly dedicated to control-plane tasks. This often includes packets with IP options.

Some routers also revert from hardware-forwarding to CPU-based when certain complex functions are configured on the device - for example when the router has to do NAT or other payload-inspecting features. Another reason for reverting to CPU-based forwarding is when some resources in the forwarding hardware become exhausted, such as the forwarding table ("hardware" equivalent of the routing table) or access control tables.

To effectively use routers with hardware-based forwarding, it is therefore essential to know the restrictions of the hardware, and to ensure that the majority of traffic can indeed be hardware-switched. Where this is supported, it may be a good idea to limit the amount of "slow-path" traffic using rate-limits such as Cisco's "Control Plane Policing" (CoPP).

-- AndrasJako - 06 Mar 2006

Active Queue Management (AQM)

Packet-switching nodes such as routers usually need to accommodate queues to buffer packets when incoming traffic exceeds the available outbound capacity. Traditionally, these buffers have been organised as tail-drop queues, where packets are queued until the buffer is full, and when it is full, newly arriving packets are dropped until the queue has drained enough to make room again. With bursty traffic (as is typical with TCP/IP), this can lead to entire bursts of arriving packets being dropped because of a full queue. The effects of this are synchronisation of flows and a decrease of aggregate throughput. Another effect of the tail-drop queueing strategy is that, when congestion is long-lived, the queue will grow to fill the buffer, and will remain large until congestion eventually subsides. With large router buffers, this leads to increased one-way delay and round-trip times, which impacts network performance in various ways - see BufferBloat.

This has led to the idea of "active queue management", where network nodes send congestion signals once they sense the onset of congestion, to avoid buffers filling up completely.

Active Queue Management is a precondition for Explicit Congestion Notification (ECN), which helps performance by reducing or eliminating packet loss during times of (light) congestion.

The earliest and best-known form of AQM on the Internet is Random Early Detection (RED). This is now supported by various routers and switches, although it is not typically activated by default. One possible reason is that RED, as originally specified, must be "tuned" depending on the given traffic mix (and optimisation goals) to be maximally effective. Various alternative methods have been proposed as improvements to RED, but none of them have enjoyed widespread use.

CoDel (May 2012) has been proposed as a promising practical alternative to RED. PIE was then proposed as an alternative to CoDel, claiming to be easier to implement efficiently, in particular in "hardware" implementations.

AQM in the IETF

RFC 2309, Recommendations on Queue Management and Congestion Avoidance in the Internet, (1998) recommended "testing, standardization, and widespread deployment" of AQM, and specifically RED, on the Internet. The testing part was certainly followed, in the sense that a huge number of academic papers was published about RED, its perceived shortcomings, proposed alternative AQMs, and so on. There was no standardization, and very little actual deployment. While RED is implemented in most routers today, it is generally not enabled by default, and very few operators explicitly enable it. There are many reasons for this, but an important one is that optimal configuration parameters for RED depend on traffic load and tradeoffs between various optimization goals (e.g. throughput and delay). RFC 3819, Advice for Internet Subnetwork Designers, also discusses questions of AQM, in particular RED and its configuration parameters.

In March 2013, a new AQM mailing list ( archive) was announced to discuss a possible replacement for RFC 2309. This evolved into the AQM Working Group. Fred Baker issued draft-baker-aqm-recommendation as a starting point. This became a Working Group document ( draft-ietf-aqm-recommendation) and was submitted to the IESG in February 2015 with Gorry Fairhurst as a co-editor.

References

  • RFC 2309, Recommendations on Queue Management and Congestion Avoidance in the Internet, B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski, L. Zhang. April 1998
  • RFC 3819, Advice for Internet Subnetwork Designers, P. Karn, Ed., C. Bormann, G. Fairhurst, D. Grossman, R. Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood. July 2004
  • RFC 7567, IETF Recommendations Regarding Active Queue Management, F. Baker, Ed., G. Fairhurst, Ed., July 2015
  • RFC 7806, On Queuing, Marking, and Dropping, F. Baker, R. Pan, April 2016
  • Advice on network buffering, G. Fairhurst, B. Briscoe, slides presented to ICCRG at IETF-86, March 2013
  • RFC 7928, Characterization Guidelines for Active Queue Management (AQM), N. Kuhn, Ed., P. Natarajan, Ed., N. Khademi, Ed., D. Ros, July 2016
  • draft-lauten-aqm-gsp-03, Global Synchronization Protection for Packet Queues, Wolfram Lautenschlaeger, May 2016
  • draft-briscoe-tsvwg-aqm-tcpm-rmcat-l4s-problem-01, Low Latency, Low Loss, Scalable Throughput (L4S) Internet Service: Problem Statement, Bob Briscoe, Koen De Schepper, Marcelo Bagnulo, June 2016
  • draft-briscoe-tsvwg-aqm-dualq-coupled-00, DualQ Coupled AQM for Low Latency, Low Loss and Scalable Throughput, Koen De Schepper, Bob Briscoe, Olga Bondarenko, Ing-jyh Tsang, October 2016

-- Simon Leinen - 2005-01-05 - 2016-11-01

Random Early Detection (RED)

RED is the first and most widely used Active Queue Management (AQM) mechanism. The basic idea is this: a queue controlled by RED samples its occupancy (queue size) over time, and computes a characteristic (often a decaying average such as an Exponentially Weighted Moving Average (EWMA), although several researchers propose that using the instantaneous queue length works better). This sampled queue length is then used to compute the drop probability for an incoming packet, according to a pre-defined profile. The decision whether to drop an incoming packet is made using a random number and the drop probability. In this way, a point of (light) congestion in the network can run with a short queue, which is beneficial for applications because it keeps one-way delay and round-trip time low. When the congested link is shared by only a few TCP streams, RED also prevents synchronization effects that cause degradation of throughput.
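
The mechanism described above can be sketched in a few lines. This is a simplified illustration, not the complete algorithm from the original RED paper: it uses an EWMA of the queue length and a linear drop profile between two thresholds, and leaves out refinements such as spacing out drops and handling idle periods. The class name, parameter names and default values are assumptions chosen for the example.

```python
import random

class RedQueue:
    """Simplified RED drop decision (illustrative sketch only)."""

    def __init__(self, min_th=5.0, max_th=15.0, max_p=0.1,
                 weight=0.002, rng=None):
        self.min_th = min_th    # below this average, never drop
        self.max_th = max_th    # at or above this average, always drop
        self.max_p = max_p      # drop probability just below max_th
        self.weight = weight    # EWMA weight of the newest sample
        self.avg = 0.0          # averaged queue length
        self.rng = rng or random.Random()

    def update_average(self, queue_len):
        """Fold the current queue length into the moving average."""
        self.avg = (1.0 - self.weight) * self.avg + self.weight * queue_len
        return self.avg

    def drop_probability(self):
        """Linear drop profile between min_th and max_th."""
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def should_drop(self, queue_len):
        """Decide for one arriving packet, given the current queue length."""
        self.update_average(queue_len)
        return self.rng.random() < self.drop_probability()
```

Even this reduced version has four parameters (two thresholds, the maximum probability and the averaging weight) that have to be chosen for the traffic at hand.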

Although RED is widely implemented in networking equipment, it is usually "off by default", and not often activated by operators. One reason is that it has various parameters (queue thresholds, EWMA parameters etc.) that need to be configured, and finding good values for these parameters is considered something of a black art. Also, a huge number of research papers have been published that point out shortcomings of RED in particular scenarios, which discourages network operators from using it. Unfortunately, none of the many alternatives to RED that have been proposed in these publications, such as EXPRED, REM, or BLUE, has gained any more traction.

The BufferBloat activity has rekindled interest in queue-management strategies. Kathleen Nichols and Van Jacobson, who have a long history of research on practical AQM algorithms, published CoDel in 2012, which claims to improve upon RED in several key points using a fresh approach.

-- SimonLeinen - 2005-01-07 - 2012-07-12

Differentiated Services (DiffServ)

From the IETF's Differentiated Services (DiffServ) Work Group's site:

The differentiated services approach to providing quality of service in networks employs a small, well-defined set of building blocks from which a variety of aggregate behaviors may be built. A small bit-pattern in each packet, in the IPv4 TOS octet or the IPv6 Traffic Class octet, is used to mark a packet to receive a particular forwarding treatment, or per-hop behavior, at each network node.

Examples of DiffServ-based services include the Premium IP and "Less-than Best Effort" (LBE) services available on GEANT/GN2 and some NRENs.

-- TobyRodwell - 28 Feb 2005 -- SimonLeinen - 15 Jul 2005

Premium IP

Premium IP is a new service available on GEANT/GN2 and some of the NRENs connected to it. It uses DiffServ, and in particular the EF (Expedited Forwarding) PHB (per-hop behaviour) to protect a given aggregate of IP traffic against disturbances by other traffic, so that OneWayDelay, DelayVariation, and PacketLoss are assured to be low. The aggregate is specified by source and destination IP address ranges, or by ingress/egress AS (Autonomous System) numbers when in the core domain. In addition, a Premium IP aggregate must conform to strict rate limits. Unlike earlier proposals of similar "Premium" services, GEANT/GN2's Premium IP service downgrades excess traffic (over the contractual rate) to "best effort" (DSCP=0), rather than dropping it.

-- SimonLeinen - 21 Apr 2005

LBE (Less than Best Effort) Service

Less Than Best Effort (LBE) is a new service available on GEANT/GN2 and some of the NRENs connected to it. It uses DiffServ with the Class Selector 1 (CS=1) DSCP to mark traffic that can get by with less than the default "best effort" from the network. It can be used by high-volume, low-priority applications in order to limit their impact on other traffic.
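
An application that wants its traffic treated as LBE has to set the DSCP on the packets it sends. The sketch below shows how the DSCP maps onto the IPv4 TOS octet and how it could be set on a socket. It assumes a platform where Python exposes socket.IP_TOS (e.g. Linux); the helper names are made up for this example, and whether the marking is honoured or rewritten depends on the networks along the path.

```python
import socket

DSCP_CS1 = 8    # Class Selector 1, used for LBE
DSCP_EF = 46    # Expedited Forwarding, used for Premium IP

def dscp_to_tos(dscp):
    """Return the TOS octet value that carries the given DSCP.

    The DSCP occupies the upper six bits of the octet, so it is
    shifted left by two bits.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2

def mark_socket(sock, dscp):
    """Ask the OS to send this IPv4 socket's packets with the given DSCP."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))

if __name__ == "__main__":
    print(hex(dscp_to_tos(DSCP_CS1)))
    print(hex(dscp_to_tos(DSCP_EF)))
```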

-- SimonLeinen - 06 May 2005

Integrated Services (IntServ)

IntServ was an attempt by the IETF to add Quality of Service differentiation to the Internet architecture. Components of the Integrated Services architecture include

  • A set of predefined service classes with different parameters, in particular Controlled Load and Guaranteed Service
  • A ReSerVation Protocol (RSVP) for setting up specific service parameters for a flow
  • Mappings to different lower layers such as ATM, Ethernet, or low-speed links.

Concerns with IntServ include its scaling properties when many flow reservations are active in the "core" parts of the network, the difficulties of implementing the necessary signaling and packet treatment functions in high-speed routers, and the lack of policy control and accounting/billing infrastructure to make this worthwhile for operators. While IntServ never became widely implemented beyond intra-enterprise environments, RSVP has found new uses as a signaling protocol for Multi-Protocol Label Switching (MPLS).

As an alternative to IntServ, the IETF later developed the Differentiated Services architecture, which provides simple building blocks that can be composed to similarly granular services.

References

Integrated Services Architecture

  • RFC 1633, Integrated Services in the Internet Architecture: an Overview, R. Braden, D. Clark, S. Shenker, 1994

RSVP

  • RFC 2205, Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification, R. Braden, Ed., L. Zhang, S. Berson, S. Herzog, S. Jamin, September 1997
  • RFC 2208, Resource ReSerVation Protocol (RSVP) -- Version 1 Applicability Statement Some Guidelines on Deployment, A. Mankin, Ed., F. Baker, B. Braden, S. Bradner, M. O'Dell, A. Romanow, A. Weinrib, L. Zhang, September 1997
  • RFC 2209, Resource ReSerVation Protocol (RSVP) -- Version 1 Message Processing Rules, R. Braden, L. Zhang, September 1997
  • RFC 2210, The Use of RSVP with IETF Integrated Services, J. Wroclawski, September 1997
  • RFC 2211, Specification of the Controlled-Load Network Element Service, J. Wroclawski, September 1997
  • RFC 2212, Specification of Guaranteed Quality of Service, S. Shenker, C. Partridge, R. Guerin, September 1997

Lower-Layer Mappings

  • RFC 2382, A Framework for Integrated Services and RSVP over ATM, E. Crawley, Ed., L. Berger, S. Berson, F. Baker, M. Borden, J. Krawczyk, August 1998
  • RFC 2689, Providing Integrated Services over Low-bitrate Links, C. Bormann, September 1999
  • RFC 2816, A Framework for Integrated Services Over Shared and Switched IEEE 802 LAN Technologies, A. Ghanwani, J. Pace, V. Srinivasan, A. Smith, M. Seaman, May 2000

-- SimonLeinen - 28 Feb 2006

Sizing of Network Buffers

Where temporary congestion cannot be avoided, some buffering in network nodes is required (in routers and other packet-forwarding devices such as Ethernet or MPLS switches) to queue incoming packets until they can be transmitted. The appropriate sizing of these buffers has been a subject of discussion for a long time.

Traditional wisdom recommends that a network node should be able to buffer an end-to-end round-trip time's worth of line-rate traffic, in order to be able to accommodate bursts of TCP traffic. This recommendation is often followed in "core" IP networks. For example, FPCs (Flexible PIC Concentrators) on Juniper's M- and T-Series routers contain buffer memory for 200ms (M-series) or 100ms (T-series) at the supported interface bandwidth (cf. Juniper M-Series Datasheet and a posting from 8 May, 2005 by Hannes Gredler to the juniper-nsp mailing list). These ideas also influenced RFC 3819, Advice for Internet Subnetwork Designers.

Recent research results suggest that much smaller buffers are sufficient when there is a high degree of multiplexing of TCP streams.

This work is highly relevant, because overly large buffers not only require more (expensive high-speed) memory, but bring about a risk of high delays that affect perceived quality of service; see BufferBloat.
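
To get a feeling for the orders of magnitude, the two sizing rules can be compared numerically. The sketch below assumes the traditional rule B = RTT × C and the rule B = RTT × C / sqrt(n) for n long-lived TCP flows, which is the result proposed in "Sizing Router Buffers" (listed in the references below). The function names and the example figures are illustrative assumptions.

```python
import math

def rule_of_thumb_buffer_bytes(rtt_s, link_bps):
    """Traditional rule: one round-trip time's worth of line-rate traffic."""
    return rtt_s * link_bps / 8.0

def small_buffer_bytes(rtt_s, link_bps, n_flows):
    """Reduced buffer for n_flows multiplexed long-lived TCP flows."""
    if n_flows < 1:
        raise ValueError("n_flows must be at least 1")
    return rule_of_thumb_buffer_bytes(rtt_s, link_bps) / math.sqrt(n_flows)

if __name__ == "__main__":
    rtt = 0.2      # seconds, assumed round-trip time
    rate = 10e9    # bits per second, assumed 10 Gb/s link
    print(rule_of_thumb_buffer_bytes(rtt, rate) / 1e6, "MB")
    print(small_buffer_bytes(rtt, rate, 10000) / 1e6, "MB")
```

With these assumed figures, the traditional rule asks for about 250 MB of buffer, while the reduced rule asks for about 2.5 MB when 10000 flows share the link.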

References

ACM SIGCOMM Computer Communications Review

The October 2006 edition has a short summary paper on router buffer sizing. If you read one article, read this!

The July 2005 edition (Volume 35, Issue 3) has a special feature about sizing router buffers, consisting of these articles:
Making router buffers much smaller
Nick McKeown, Damon Wischik
Part I: buffer sizes for core routers
Nick McKeown, Damon Wischik
Part II: control theory for buffer sizing
Gaurav Raina, Don Towsley, Damon Wischik
Part III: routers with very small buffers
Mihaela Enachescu, Yashar Ganjali, Ashish Goel, Nick McKeown, Tim Roughgarden

Sizing Router Buffers (copy)
Guido Appenzeller, Isaac Keslassy, Nick McKeown, SIGCOMM'04, in: ACM Computer Communications Review 34(4), pp. 281--292

The Effect of Router Buffer Size on HighSpeed TCP Performance
Dhiman Barman, Georgios Smaragdakis and Ibrahim Matta. In Proceedings of IEEE Globecom 2004. (PowerPoint presentation)

Link Buffer Sizing: a New Look at the Old Problem
Sergey Gorinsky, A. Kantawala, and J. Turner, ISCC-05, June 2005
Another version was published as Technical Report WUCSE-2004-82, Department of Computer Science and Engineering, Washington University in St. Louis, December 2004.

Effect of Large Buffers on TCP Queueing Behavior
Jinsheng Sun, Moshe Zukerman, King-Tim Ko, Guanrong Chen and Sammy Chan, IEEE INFOCOM 2004

High Performance TCP in ANSNET
C. Villamizar and C. Song., in: ACM Computer Communications Review, 24(5), pp.45--60, 1994

RFC 3819, Advice for Internet Subnetwork Designers
P. Karn, Ed., C. Bormann, G. Fairhurst, D. Grossman, R. Ludwig, J. Mahdavi, G. Montenegro, J. Touch, L. Wood. July 2004

-- SimonLeinen - 2005-01-07 - 2013-04-03

OS-specific Tuning: Cisco IOS

  • Cisco IOS specific Tuning
  • TCP MSS Adjustment: The TCP MSS Adjustment feature enables the configuration of the maximum segment size (MSS) for transient packets that traverse a router, specifically TCP segments with the SYN bit set, when Point-to-Point Protocol over Ethernet (PPPoE) is being used in the network. PPPoE reduces the Ethernet maximum transmission unit (MTU) to 1492 bytes, and if the effective MTU on the hosts (PCs) is not changed, the router in between the host and the server can terminate the TCP sessions. The ip tcp adjust-mss command specifies the MSS value that the intermediate router sets in SYN packets to avoid truncation.
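
The MSS value to configure follows from the path MTU by subtracting the IP and TCP header sizes. The following is a small sketch of that arithmetic, assuming IPv4 and TCP headers without options (20 bytes each); the function name is made up for the example.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20    # TCP header without options

def mss_for_mtu(mtu, ip_header=IPV4_HEADER_BYTES, tcp_header=TCP_HEADER_BYTES):
    """Largest TCP payload that fits into one packet of the given MTU."""
    mss = mtu - ip_header - tcp_header
    if mss <= 0:
        raise ValueError("MTU too small for the given header sizes")
    return mss

if __name__ == "__main__":
    print(mss_for_mtu(1500))   # plain Ethernet
    print(mss_for_mtu(1492))   # Ethernet with PPPoE
```

For the PPPoE case this yields 1452 bytes, which is the kind of value one would pass to ip tcp adjust-mss.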

-- HankNussbacher - 06 Oct 2005

Support for Large Frames/Packets ("Jumbo MTU")

On current research networks (and most other parts of the Internet), end nodes (hosts) are restricted to a 1500-byte IP Maximum Transmission Unit (MTU). Because a larger MTU would save effort on the part of both hosts and network elements - fewer packets per volume of data means less work - many people have been advocating efforts to increase this limit, in particular on research network initiatives such as Internet2 and GÉANT.

Impact of Large Packets

Improved Host Performance for Bulk Transfers

It has been argued that TCP is most efficient when the payload portions of TCP segments match integer multiples of the underlying virtual memory system's page size - this permits "page flipping" techniques in zero-copy TCP implementations. A 9000 byte MTU would "naturally" lead to 8960-byte segments, which doesn't correspond to any common page size. However, a TCP implementation should be smart enough to adapt segment size to page sizes where this actually matters, for example by using 8192-byte segments even though 8960-byte segments are permitted. In addition, Large Send Offload (LSO), which is becoming increasingly common with high-speed network adapters, removes the direct correspondence between TCP segments and driver transfer units, making this issue moot.

On the other hand, Large Send Offload (LSO) and Interrupt Coalescence remove most of the host-performance motivations for large MTUs: the segmentation and reassembly function between (large) application data units and (smaller) network packets is mostly moved to the network interface controller.

Lower Packet Rate in the Backbone

Another benefit of large frames is that their use reduces the number of packets that have to be processed by routers, switches, and other devices in the network. However, most high-speed networks are not limited by per-packet processing costs, and packet processing capability is often dimensioned for the worst case, i.e. the network should continue to work even when confronted with a flood of small packets. On the other hand, per-packet processing overhead may be an issue for devices such as firewalls, which have to do significant processing of the headers (but possibly not the contents) of each packet.

Reduced Framing Overhead

Another advantage of large packets is that they reduce the overhead for framing (headers) in relation to payload capacity. A typical TCP segment over IPv4 carries 40 bytes of IP and TCP header, plus link-dependent framing (e.g. 14 bytes over Ethernet). This represents about 3% of overhead with the customary 1500-byte MTU, whereas an MTU of 9000 bytes reduces this overhead to 0.5%.
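
The overhead figures can be reproduced with a short calculation. This sketch counts the 40 bytes of IPv4 and TCP header plus 14 bytes of Ethernet framing against the total frame size; it ignores preamble, frame check sequence and inter-frame gap, so the results are approximations in the same spirit as the figures above. The function name and defaults are assumptions for the example.

```python
def header_overhead(mtu, ip_tcp_headers=40, link_framing=14):
    """Fraction of each full-sized frame that is taken up by headers."""
    if mtu <= ip_tcp_headers:
        raise ValueError("MTU must be larger than the IP and TCP headers")
    frame_size = mtu + link_framing
    return (ip_tcp_headers + link_framing) / frame_size

if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, "bytes:", round(100 * header_overhead(mtu), 2), "% overhead")
```

With these assumptions the result is roughly 3.6% for a 1500-byte MTU and roughly 0.6% for a 9000-byte MTU.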

Network Support for Large MTUs

Research Networks

The GÉANT/GÉANT2 and Abilene backbones now both support a 9000-byte IP MTU. The current access MTUs of Abilene connectors can be found on the Abilene Connector Technology Support page. The access MTUs of the NRENs to GÉANT are listed in the GÉANT Monthly Report (not publicly available).

Commercial Internet Service Providers

Many commercial ISPs support larger-than-1500-byte MTUs in their backbones (4470 bytes is a typical value) and on certain types of access interfaces, e.g. Packet-over-SONET (POS). But when Ethernet is used as an access interface, as is more and more frequently the case, the access MTU is usually set to the 1500 bytes corresponding to the Ethernet standard frame size limit. In addition, many inter-provider connections are over shared Ethernet networks at public exchange points, which also imply a 1500-byte limit, with very few exceptions - MAN LAN is a rare example of an Ethernet-based access point that explicitly supports the use of larger frames.

Possible Issues

Path MTU Discovery issues

Moving to larger MTUs may exhibit problems with the traditional Path MTU Discovery mechanism - see that topic for more information about this.

Inconsistent MTUs within a subnet

There are other deployment considerations that make the introduction of large MTUs tricky, in particular the requirement that all hosts on a logical IP subnet must use identical MTUs; this makes gradual introduction hard for large bridged (switched) campus or data center networks, and also for most Internet Exchange Points.

When different MTUs are used on a link (logical subnet), this can often go unnoticed for a long time. Packets smaller than the minimum MTU will always pass through the link, and the end with the smaller MTU configured will always fragment larger packets towards the other side. The only packets that will be affected are packets from the larger-MTU side to the smaller-MTU side that are larger than the smaller MTU. Those will typically be sent unfragmented, and dropped at the receiving end (the one with the smaller MTU). However, it may happen that such packets rarely occur in normal situations, and the misconfiguration isn't detected immediately.
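
The cases described above can be made explicit with a small model. It is a deliberately simplified sketch: it assumes that the sender fragments anything larger than its own MTU, that fragmentation is allowed, and that the receiver silently drops any packet larger than its own MTU. The function name and the return strings are made up for this illustration.

```python
def mismatch_outcome(packet_size, sender_mtu, receiver_mtu):
    """Classify what happens to one IP packet on a link with two MTU settings."""
    fragmented = packet_size > sender_mtu
    # Largest packet (or fragment) that actually appears on the wire.
    wire_size = sender_mtu if fragmented else packet_size
    if wire_size > receiver_mtu:
        return "dropped at receiver"
    return "delivered in fragments" if fragmented else "delivered"

if __name__ == "__main__":
    # One end configured with MTU 9000, the other with MTU 1500.
    print(mismatch_outcome(1400, 9000, 1500))
    print(mismatch_outcome(4000, 1500, 9000))
    print(mismatch_outcome(4000, 9000, 1500))
```

Only the third case fails, which is why such a misconfiguration can stay hidden as long as no large packets travel from the larger-MTU side to the smaller-MTU side.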

Some routing protocols such as OSPF and IS-IS detect MTU mismatches and will refuse to build adjacencies in this case. This helps diagnose misconfigurations. Other protocols, in particular BGP, will appear to work as long as only small amounts of data are sent (e.g. during initial handshake and option negotiation), but get stuck when larger amounts of data (i.e. the initial route advertisements) must be sent in the bigger-to-smaller-MTU direction.

Problems with Large MTUs on end systems network interface cards (NICs)

On some NICs it is possible to configure jumbo frames (for example MTU=9000), and the NIC appears to work fine when its functionality is checked with pings (jumbo-sized ICMP packets), although the NIC vendor states that there is no jumbo frame support.
In such cases, packets are lost on the NIC when it receives jumbo frames at typical production data rates.

Therefore the NIC vendor information should be consulted before activating large MTUs on the host interface. Then tests at high data rates should be done with large MTUs, and the host's interface statistics should be checked for input packet drops.
Typical commands on Unix systems: 'ifconfig <ifname>' or 'ethtool -S <ifname>'
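
On Linux, the same receive drop counters can also be read programmatically, for example before and after a test run. The sketch below parses text in the format of /proc/net/dev, where the fourth receive counter of each interface is the number of dropped packets. It is Linux-specific, and the function name is an assumption for this example.

```python
def parse_rx_drops(proc_net_dev_text):
    """Return {interface: receive drop count} from /proc/net/dev content."""
    drops = {}
    # The first two lines are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 4:
            continue
        # Receive columns: bytes, packets, errs, drop, ...
        drops[name.strip()] = int(fields[3])
    return drops

if __name__ == "__main__":
    with open("/proc/net/dev") as f:
        for ifname, count in parse_rx_drops(f.read()).items():
            print(ifname, count)
```

Comparing the counters before and after a high-rate test with large MTUs shows whether the NIC dropped incoming frames during the test.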


-- SimonLeinen - 16 Mar 2005 - 02 Sep 2009
-- HankNussbacher - 18 Jul 2005 (added Phil Dykstra paper)

An evil middlebox is a transparent device that sits in between the endpoints of an end-to-end connection and disturbs the normal end-to-end traffic in some way. As you cannot see these devices, which usually work on layer 2, it is difficult to debug issues that involve them. Examples are HTTP proxies and gateway proxies (all protocols). Normally, these devices are installed for security reasons to filter out "bad" traffic. Bad traffic may be viruses, trojans, evil JavaScript, or anything that is not known to the device. Sometimes so-called rate shapers are also installed as middleboxes; while these do not change the contents of the traffic, they do drop packets according to rules known only to themselves. Bugs in such middleboxes can have fatal consequences for "legitimate" Internet traffic, which may lead to performance issues or, even worse, connection issues.

Middleboxes come in all shapes and flavors. The most popular are firewalls:

Examples of experienced performance issues

Two examples in the beginning of 2005 in SWITCH:

  • HttpProxy: very slow response from a webserver only for a specific circle of people

  • GatewayProxy: tcp transfers get stalled as soon as a packet is lost on the local segment from the middlebox to the end host.

A Cisco IOS Firewall in August 2006 in Funet:

  • WindowScalingProblems: when window scaling was enabled, TCP performance was bad (10-20 KBytes/sec). Some older versions of PIX could also be affected by window scaling issues.

DNS Based global load balancing problems

Juniper SRX3600 mistreats fragmented IPv6 packets

This firewall (up to at least version 11.4R3.7) performs fragment reassembly in order to apply certain checks to the entire datagram, for example in "DNS ALG" mode. It then tries to forward the reassembled packet instead of the initial fragments, which triggers ICMP "packet too big" messages if the full datagram is larger than the MTU of the next link. This will lead to a permanent failure on this path, because the (correct) fragmentation at the sender is annihilated by the erroneous reassembly at the firewall.

The same issue has also been found with some models of the Fortigate firewall.

-- ChrisWelti - 01 Mar 2005
-- PekkaSavola - 10 Oct 2006

-- PekkaSavola - 07 Nov 2006

-- AlexGall - 2012-10-31

Symptom: Accessing a specific web site that contains JavaScript is very slow (around 30 seconds for one page).

Analysis Summary: HTTP traffic is split into two legs: one between the web server and a transparent HTTP proxy on the customer site, and one between the HTTP proxy and the end hosts. The transparent HTTP proxy fakes the end points: to the web server it pretends to be the customer accessing it, and to the customer the proxy appears to be the web server (faked IP addresses). Accordingly there are two TCP connections to be considered here. The proxy receives an HTTP request from the customer to the web server. It then forwards this request to the web server and WAITS until it has received the whole reply (this is essential, as it needs to analyze the whole reply to decide whether it is bad or not). If the content of that HTTP reply is dynamic, the length is not known. With HTTP/1.1, a TCP session is not built for every object but remains intact until a timeout has occurred. This means the proxy has to wait until the TCP session gets torn down to be sure that no more content is coming. Only when it has received the whole reply will it forward that reply to the customer who asked for it. As a result, the customer suffers a major delay.

-- ChrisWelti - 03 Mar 2005

Specific Network Technologies

This section of the knowledge base treats some (sub)network technologies with their specific performance implications.

-- SimonLeinen - 06 Dec 2005 -- BlazejPietrzak - 05 Sep 2007

Ethernet

Ethernet is now widely prevalent as a link-layer technology for local area/campus networks, and is making inroads in other market segments as well, for example as a framing technique for wide-area connections (replacing ATM or SDH/SONET in some applications), or as fabric for storage networks (replacing Fibre Channel etc.) or clustered HPC systems (replacing special-purpose networks such as Myrinet etc.).

From the original media access protocol based on CSMA/CD used on shared coaxial cabling at speeds of 10 Mb/s (originally 3 Mb/s), Ethernet has evolved to much higher speeds, from Fast Ethernet (100 Mb/s) through Gigabit Ethernet (1 Gb/s) to 10 Gb/s. Shared media access and bus topologies have been replaced by star-shaped topologies connected by switches. Additional extensions include speed, duplex-mode and cable-crossing auto-negotiation, virtual LANs (VLANs), flow-control and other quality-of-service enhancements, port-based access control and many more.

The topics treated here are mostly relevant for the "traditional" use of Ethernet in local (campus) networks. Some of these topics are relevant for non-Ethernet networks, but are mentioned here nevertheless because Ethernet is so widely used that they have become associated with it.

-- SimonLeinen - 15 Dec 2005

Duplex Modes and Auto-Negotiation

A point-to-point Ethernet segment (typically between a switch and an end-node, or between two directly connected end-nodes) can operate in one of two duplex modes: half duplex means that only one station can send at a time, and full duplex means that both stations can send at the same time. Of course full-duplex mode is preferable for performance reasons if both stations support it.

Duplex Mismatch

Duplex mismatch describes the situation where one station on a point-to-point Ethernet link (typically between a switch and a host or router) uses full-duplex mode, and the other uses half-duplex mode. A link with duplex mismatch will seem to work fine as long as there is little traffic. But when there is traffic in both directions, it will experience packet loss and severely decreased performance. Note that the performance in the duplex mismatch case will be much worse than when both stations operate in half-duplex mode.

Work in the Internet2 "End-to-End Performance Initiative" suggests that duplex mismatch is one of the most common causes of bad bulk throughput. Rich Carlson's NDT (Network Diagnostic Tester) uses heuristics to try to determine whether the path to a remote host suffers from duplex mismatch.

Duplex Auto-Negotiation

In early versions of Ethernet, only half-duplex mode existed, mostly because point-to-point Ethernet segments weren't all that common - typically an Ethernet would be shared by many stations, with the CSMA/CD (Carrier Sense Multiple Access with Collision Detection) protocol used to arbitrate the sending channel.

When "Fast Ethernet" (100 Mb/s Ethernet) over twisted pair cable (100BaseT) was introduced, an auto-negotiation procedure was added to allow two stations at the ends of an Ethernet cable to agree on the duplex mode (and also to detect whether the stations support 100 Mb/s at all - otherwise communication would fall back to traditional 10 Mb/s Ethernet). Gigabit Ethernet over twisted pair (1000BaseT) had speed, duplex, and even "crossed-cable" (MDI-X) auto-negotiation from the start.

Why people turn off auto-negotiation

Unfortunately, some early products supporting Fast Ethernet didn't include the auto-negotiation mechanism, and those that did sometimes failed to interoperate with each other. So many knowledgeable people recommended avoiding duplex auto-negotiation, because it introduced more problems than it solved. The common recommendation was thus to manually configure the desired duplex mode, typically full duplex.

Problems with turning auto-negotiation off

There are two main problems with turning off auto-negotiation:

  1. You have to remember to configure both ends consistently. Even when the initial configuration is consistent on both ends, it often turns into an inconsistent one as devices and connections are moved around.
  2. Hardcoding one side to full duplex when the other does auto-negotiation causes duplex mismatch. In situations where one side must use auto-negotiation (maybe because it is a non-manageable switch), it is never right to manually configure full-duplex mode on the other. This is because the auto-negotiation mechanism requires that, when the other side doesn't perform auto-negotiation, the local side must set itself to half-duplex mode.

Both situations result in duplex mismatches, with the associated performance issues.
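
The rule from the second point can be captured in a small model that determines the duplex mode each end of a link ends up with. It is a sketch under simplifying assumptions: both ends support full duplex, speed detection works, and each end is configured as 'auto', 'full' or 'half'. The function names are made up for the example.

```python
def resolve_duplex(local, peer):
    """Return the (local, peer) duplex modes that result on a link.

    Each argument is 'auto' (auto-negotiation enabled) or a hard-coded
    'full' or 'half'.
    """
    def effective(own, other):
        if own != "auto":
            return own       # hard-coded setting is used as configured
        if other == "auto":
            return "full"    # both ends negotiate and agree on full duplex
        return "half"        # peer does not negotiate: fall back to half duplex

    return effective(local, peer), effective(peer, local)

def has_duplex_mismatch(local, peer):
    """True if the two ends of the link end up in different duplex modes."""
    a, b = resolve_duplex(local, peer)
    return a != b

if __name__ == "__main__":
    for pair in (("auto", "auto"), ("auto", "full"), ("full", "full"), ("auto", "half")):
        print(pair, resolve_duplex(*pair), has_duplex_mismatch(*pair))
```

In this model, the only combination involving auto-negotiation that produces a mismatch is 'auto' against a hard-coded 'full'.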

Recommendation: Leave auto-negotiation on

In the light of these problems with hard-coded duplex modes, it is generally preferable to rely on auto-negotiation of duplex mode. Recent equipment handles auto-negotiation in a reliable and interoperable way, with very few exceptions.

References

-- SimonLeinen - 12 Jun 2005 - 4 Sep 2006

LAN Collisions

In some legacy networks, workstations or other devices may still be connected to a LAN segment using hubs. All incoming and outgoing traffic is propagated throughout the entire hub, often resulting in a collision when two or more devices attempt to send data at the same time. After each collision, the original information has to be resent, reducing performance.

Operationally, this can lead to up to 100% of 5000-byte packets being lost when sending traffic off network and 31%-60% packet loss within a single subnet. It should be noted that common applications (e.g. email, FTP, WWW) on LANs produce packets close to 1500 bytes in size, and that packet loss rates >1% render applications such as video conferencing unusable.

To prevent collisions from traveling to every workstation in the entire network, bridges or switches should be installed. These devices will not forward collisions, but will permit broadcasts to all users and multicasts to specific groups to pass through.

When only a single system is connected to a single switch port, each collision domain is made up of only one system, and full-duplex communication also becomes possible.

-- AnnRHarding - 18 Jul 2005

LAN Broadcast Domains

While switches help network performance by reducing collision domains, they will permit broadcasts to all users and multicasts to specific groups to pass through. In a switched network with a lot of broadcast traffic, network congestion can occur despite high speed backbones. As universities and colleges were often early adopters of Internet technologies, they may have large address allocations, perhaps even deployed as big flat networks which generate a lot of broadcast traffic. In some cases, these networks can be as large as a /16, and having the potential to put up to sixty five thousand hosts on a single network segment could be disastrous for performance.

The main purpose of subnetting is to help relieve network congestion caused by broadcast traffic. A successful subnetting plan is one where most of the network traffic will be isolated to the subnet in which it originated and broadcast domains are of a manageable size. This may be possible based on physical location, or it may be better to use VLANs. VLANs allow you to segment a LAN into different broadcast domains regardless of physical location. Users and devices on different floors or buildings have the ability to belong to the same LAN, since the segmentation is handled virtually and not via the physical layout.
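As a rough illustration of the sizes involved, Python's standard `ipaddress` module can show how many addresses a flat /16 contains and how it splits into smaller broadcast domains. The prefix `10.1.0.0/16` and the choice of /24 subnets are assumptions made for this example only.

```python
import ipaddress

# Hypothetical flat campus network (the address block is an assumption)
campus = ipaddress.ip_network("10.1.0.0/16")

# Split the flat network into /24 subnets, each its own broadcast domain
subnets = list(campus.subnets(new_prefix=24))

if __name__ == "__main__":
    print("addresses in the flat /16:", campus.num_addresses)
    print("number of /24 subnets:", len(subnets))
    print("first and last subnet:", subnets[0], subnets[-1])
    # Network and broadcast address are not usable for hosts
    print("usable hosts per /24:", subnets[0].num_addresses - 2)
```

Whether /24 is the right size depends on the traffic pattern; the point is only that each subnet limits a broadcast to a few hundred hosts instead of tens of thousands.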

References and Further Reading

RIPE Subnet Mask Information http://www.ripe.net/rs/ipv4/subnets.html

Priscilla Oppenheimer, Top-Down Network Design http://www.topdownbook.com/

-- AnnRHarding - 18 Jul 2005

Wireless LAN

Wireless Local Area Networks according to IEEE 802.11 standards have become extremely widespread in recent years, in campus networks, for home networking, for convenient network access at conferences, and to a certain extent for commercial Internet access provision in hotels, public places, and even planes.

While wireless LANs are usually built more for convenience (or profit) than for performance, there are some interesting performance issues specific to WLANs. As an example, it is still a big challenge to build WLANs using multiple access points so that they can scale to large numbers of simultaneous users, e.g. for large events.

Common problems with 802.11 wireless LANs

Interference

In the 2.4 GHz band, the number of usable channels (frequencies) is low. Adjacent channels use overlapping frequencies, so there are typically only three truly non-overlapping channels in this band - channels 1, 6, and 11 are frequently used. In campus networks requiring many access points, care must be taken to avoid interference between same-channel access points. The problem is even more severe in areas where access points are deployed without coordination (such as in residential areas). Some modern access points can sense the radio environment during initialization, and try to use a channel that doesn't suffer from much interference. The 2.4 GHz band is also used by other technologies such as Bluetooth, and by microwave ovens.
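Why channels 1, 6, and 11 end up as the usual choice can be checked with a little arithmetic. The sketch below assumes the common 5 MHz channel spacing (centre frequency 2407 + 5n MHz for channels 1 to 13) and a signal width of 22 MHz, which is the figure usually quoted for 802.11b; both the width and the function names are assumptions for illustration.

```python
def center_mhz(channel: int) -> int:
    """Centre frequency in MHz of 2.4 GHz channel 1..13 (5 MHz spacing)."""
    if not 1 <= channel <= 13:
        raise ValueError("only channels 1-13 are modelled here")
    return 2407 + 5 * channel


def overlaps(ch_a: int, ch_b: int, width_mhz: int = 22) -> bool:
    """True if two channels of the given width share spectrum."""
    return abs(center_mhz(ch_a) - center_mhz(ch_b)) < width_mhz


def non_overlapping(channels) -> list:
    """Greedily pick channels that do not overlap any channel already picked."""
    chosen = []
    for ch in channels:
        if all(not overlaps(ch, other) for other in chosen):
            chosen.append(ch)
    return chosen


if __name__ == "__main__":
    print(non_overlapping(range(1, 12)))  # channels 1..11 -> [1, 6, 11]
```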

Capacity loss due to backwards compatibility or slow stations

The radio link in 802.11 can work at many different data rates below the nominal rate. For example, the 802.11g (54 Mb/s) access point to which I am presently connected supports operation at 1, 2, 5.5, 6, 9, 11, 12, 18, 24, 36, 48, or 54 Mb/s. Using the lower speeds can be useful under adverse radio transmission conditions. In addition, it allows backwards compatibility - for example, 802.11g equipment interoperates with older 802.11b equipment, albeit at no more than the 11 Mb/s supported by 802.11b.

When lower-rate and higher-rate stations coexist on the same access point, it should be noted that the lower-rate station will occupy disproportionately more of the medium's capacity, because of increased serialization times at lower rates. So a single station operating at 1 Mb/s and transferring data at 500 kb/s will consume as much of the access point's capacity as 54 stations also transferring 500 kb/s each, but at a 54 Mb/s wireless rate.
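The comparison can be verified by computing the share of airtime each group of stations occupies. This is a minimal sketch that ignores all protocol overhead (preambles, acknowledgements, contention), so it only reproduces the serialization-time argument; `airtime_share` is a name invented for this example.

```python
from fractions import Fraction


def airtime_share(throughput_bps: int, link_rate_bps: int, stations: int = 1) -> Fraction:
    """Fraction of the medium's time used by `stations` identical stations.

    Assumption: airtime is simply payload bits divided by the radio rate.
    """
    return stations * Fraction(throughput_bps, link_rate_bps)


if __name__ == "__main__":
    slow = airtime_share(500_000, 1_000_000)                # one station at 1 Mb/s
    fast = airtime_share(500_000, 54_000_000, stations=54)  # 54 stations at 54 Mb/s
    print("one slow station:", slow)     # 1/2
    print("54 fast stations:", fast)     # 1/2
```

Exact fractions are used instead of floats so that the two results compare equal without rounding noise.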

Multicast/broadcast

Wireless is a "natural" broadcast medium, so broadcast and multicast should be relatively efficient. But the access point normally sends multicast and broadcast frames at a low rate, to increase the probability that all stations can actually receive them. Thus, multicast traffic streams can quickly consume a large fraction of an access point's capacity as per the considerations in the preceding section.

This is one reason why wireless networks often aren't multicast-enabled even in environments that typically have multicast connectivity (such as campus networks). Note that broadcast and multicast cannot easily be disabled completely, because they are required for lower-layer protocols such as ARP (broadcast) or IPv6 Neighbor Discovery (multicast) to work.

802.11n Performance Features

IEEE 802.11n is a recent addition to the standards for wireless LAN, offering higher performance in terms of both capacity ("bandwidth") and reach. The standard supports both the 2.4 GHz and 5 GHz bands, although "consumer" 802.11n products often work with 2.4 GHz only unless marked "dual-band". Within each band, 802.11n equipment is normally backwards compatible with the respective prior standards, i.e. 802.11b/g for 2.4 GHz and 802.11a for 5 GHz. 802.11n achieves performance increases by

  • physical diversity using multiple antennas in a "MIMO" multiple in/multiple out scheme
  • the option to use wider channels (spaced 40 MHz rather than 20 MHz)
  • frame aggregation options at the MAC level, allowing larger packets or bundling of multiple frames into a single radio frame, in order to better amortize the (relatively high) link access overhead.

References

-- SimonLeinen - 15 Dec 2005 - 21 Oct 2008

Performance Case Studies

Detailed Case Studies

This section describes some case studies in which alterations to end systems, applications or network elements improved overall performance.

Short Case Studies

These are brief descriptions of problems seen, and the reasons behind the problems

-- AnnRHarding - 20 Jul 2005
-- SimonLeinen - 21 Jul 2006

Scaling Apache 2.x beyond 20,000 concurrent downloads

ftp.heanet.ie is HEAnet's National Mirror Server for Ireland. Currently mirroring over 50,000 projects, it is a popular source of content on the Internet. It serves mostly static content via HTTP, FTP and RSYNC, all available via IPv4 and IPv6. It regularly sustains over 20,000 concurrent connections on a single Apache instance and has served as many as 27,000, with about 3.5 Terabytes of content per day. The front-end system is a Dell 2650 with two 2.4 GHz Xeon processors, 12 GB of memory, the usual 2 system disks and 15k RPM SCSI disks, running Debian GNU/Linux and Apache 2.x.

Considerable system and application tuning enabled this system to achieve these performance rates. Apachebench was used for web server benchmarking, bonnie++ and iozone for file system benchmarking and an in-house script to measure buffering, virtual memory management and scheduling. The steps taken are highlighted below:

Apache

MPM Tuning

Apache 2.x has a choice of multi-processing modules (MPMs). For this system, the prefork MPM was chosen and tuned to keep 10 spare servers. The number of spare servers was calculated so that enough child processes are available to handle new requests even when the rate of new connections exceeds the rate at which Apache can create new processes.

Module Compilation

Apache modules can be compiled directly into one binary, or built as dynamically-loaded shared objects which are then loaded by a smaller binary. For our load, a small performance gain (measurable as about 0.2%) was found by compiling the modules in directly.

htaccess

As the highperformance.conf sample provided with Apache suggests, turning off the use of .htaccess files, if appropriate, can give significant performance improvements.

sendfile

Sendfile is a system call that enables programs to hand off the job of sending files out of network sockets to the kernel, improving performance and efficiency. It is enabled by default at compile-time if Apache detects that the system supports the call. However, the Linux implementation of sendfile corrupted IPv6 sessions so this was not implementable on ftp.heanet.ie for policy reasons.

Mmap

Mmap (memory map) support allows Apache to treat a file as if it were a contiguous region of memory, greatly speeding up the I/O by dispensing with unnecessary read operations. This allowed serving of files roughly 3 times quicker.

mod_cache

mod_disk_cache is an experimental feature in Apache 2.x that caches files in a defined area as they are being served for the first time. Repeated requests are served from this cache, avoiding the slower file systems. The default configuration was further tuned by increasing CacheDirLevels to 5 to facilitate more files in the cache.

Configure Options

The following configure options were used:

CFLAGS="-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE"; export CFLAGS
"./configure" \
"--with-mpm=prefork" \
"--prefix=/usr/local/web" \
"--enable-cgid" \
"--enable-rewrite" \
"--enable-expires" \
"--enable-cache" \
"--enable-disk-cache" \
"--without-sendfile"

As little as possible was compiled into the httpd binary to reduce the amount of memory used. The exported CFLAGS enabled serving of files over 2 GB in size.

File System

Choosing a fast and efficient filesystem is very important for a web server. Tests at the time showed that XFS gave better performance than ext2 and ext3, by a margin of up to 20%. As a caveat, XFS becomes very slow at directory traversal as the number of used inodes in the filesystem grows. This resulted in a migration to ext3, despite the reduced performance.

noatime

One significant mount option, noatime, was set. This is the single easiest way to dramatically increase filesystem performance for read operations. Normally, when a file is read Unix-like systems update the inode for the file with this access time so that the time of last access is known. This operation means that read operations also involve writing to the filesystem - a severe performance bottleneck in most cases. If knowing this access time is not critical, and it certainly is not for a busy mirror server, it can be turned off by mounting the filesystem with the noatime option.
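Whether a filesystem is actually mounted with noatime can be checked from the mount table. The following sketch parses text in the format of Linux's /proc/mounts (device, mount point, filesystem type, options, two numeric fields); the device names and the mount point /srv/mirror in the sample are hypothetical.

```python
def mount_options(mounts_text: str, mount_point: str) -> list:
    """Return the option list for `mount_point` from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, comma-separated options, ...
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    raise KeyError(mount_point)


# Hypothetical sample; on a live Linux system read open("/proc/mounts").read()
SAMPLE = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb1 /srv/mirror ext3 rw,noatime 0 0
"""

if __name__ == "__main__":
    for point in ("/", "/srv/mirror"):
        has_noatime = "noatime" in mount_options(SAMPLE, point)
        print(point, "noatime" if has_noatime else "access times are updated")
```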

logbufs

For XFS, the logbufs mount option allows the administrator to specify how many in-memory log buffers are used. While it was not clear what these log buffers do, increasing this number to its maximum increased performance. This performance increase comes at the expense of memory, which was acceptable for the overall design.

dir_index

For ext3, dir_index is an option whereby ext3 uses hashed B-trees to speed up lookups in directories. This has proved much faster for directory traversal.

Kernel

The system originally ran the SGI Linux 2.4 kernel which gave about 12,000 sessions as a maximum. However after simply upgrading to the 2.6 kernel the server hit the compiled-in 20,000 limit of Apache without any additional effort, so the scheduler in the 2.6 kernel appears to have markedly improved.

File Descriptors

One of the most important tunables for a large-scale webserver is the maximum number of file descriptors the system is allowed to have open at once. The default is not sufficient when serving thousands of clients. It is important to remember that regular files, sockets, pipes and the standard streams for every running process all count as file descriptors, and that it is easy to run out.

This figure was set to 5049800.
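The current limits can be inspected from Python as a quick sanity check. The text does not say which tunable was changed; the sketch below assumes the system-wide limit exposed on Linux as /proc/sys/fs/file-max, and also reports the per-process limit, which is a separate setting. The function names are made up for this example.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_file_max(path="/proc/sys/fs/file-max"):
    """Return the system-wide open-file limit on Linux, or None if unreadable.

    Assumption: the figure quoted in the text refers to this tunable.
    """
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("per-process soft/hard limit:", soft, hard)
    print("system-wide limit:", system_file_max())
```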

Virtual Memory Manager

In Linux, the VM manages the memory allocated to processes and the kernel and also manages the in-memory cache of files. By far the easiest way to “tune” the VM in this regard is to increase the amount of memory available to it. This is probably the most reliable and easy way of speeding up a webserver - add as much memory as you can afford.

The VM takes a similar approach to mod_disk_cache for freeing up space - it assigns programs memory as they request it and then periodically prunes back what can be made free. If a lot of files are being read very quickly, the rate of increase of memory usage will be very high. If this rate is so high that memory is exhausted before the VM has had a chance to free any, there will be severe system instability. To correct for this, five sysctls were set:

vm/min_free_kbytes = 204800
vm/lower_zone_protection = 1024
vm/page-cluster = 20
vm/swappiness = 200
vm/vm_vfs_scan_ratio = 2

  • The first sysctl sets the VM to aim for at least 200 Megabytes of memory to be free.
  • The second sysctl sets the amount of “lower zone” memory directly addressable by the CPU that should be kept free.
  • The third sysctl, “vm/page-cluster”, tells Linux how many pages to free at a time when freeing pages.
  • The fourth sysctl, “swappiness”, is a very vague sysctl which seems to boil down to how much Linux “prefers” swap, or how “swappy” it should be.
  • The final sysctl, the “vm vfs scan ratio”, sets what proportion of the filesystem-data caches should be scanned when freeing memory. By setting this to 20 we mean that 1/20th of them should be scanned - this means that some cached data is kept longer than it otherwise would, leading to increased opportunity for re-use.
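Settings written in the slash notation used above map directly onto files under /proc/sys and onto the dotted names used by the sysctl tool. The helper below is a small sketch of that mapping; the function names are invented for this example, and only two of the settings from the list are used as sample input.

```python
def parse_sysctl_lines(text: str) -> dict:
    """Map dotted sysctl names to string values from 'a/b/c = value' lines."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key.replace("/", ".")] = value
    return settings


def proc_path(dotted_name: str) -> str:
    """Path of the /proc/sys file that backs a dotted sysctl name."""
    return "/proc/sys/" + dotted_name.replace(".", "/")


SAMPLE = """\
vm/min_free_kbytes = 204800
vm/page-cluster = 20
"""

if __name__ == "__main__":
    settings = parse_sysctl_lines(SAMPLE)
    for name, value in settings.items():
        print(name, "=", value, "->", proc_path(name))
    # 204800 kilobytes is the 200 Megabytes mentioned in the text
    print(int(settings["vm.min_free_kbytes"]) // 1024, "MB kept free")
```

Not every name in the list exists on every kernel version, so any script applying such settings should check that the file exists before writing to it.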

Network Stack

Six sysctls were set relating to the network stack:

net/ipv4/tcp_rfc1337=1
net/ipv4/tcp_syncookies=1
net/ipv4/tcp_keepalive_time = 300
net/ipv4/tcp_max_orphans=1000
net/core/rmem_default=262144
net/core/rmem_max=262144

  • TCP syncookies and the RFC1337 options were enabled for security reasons.
  • The default TCP keepalive time was set to 5 minutes, so that httpd children handling connections which have been unresponsive for 5 minutes do not wait needlessly in the queue. This has the minor impact that if the client does try to continue with the TCP session at a later time, it will be disconnected.
  • The max orphans option ensures that, even with the 5 minute timeout, there are never more than 1,000 processes held in such a state; the system will instead start closing the sockets of the longest waiting processes. This prevents process starvation due to many broken connections.
  • The final two options increase the amount of memory generally available to the networking stack for queueing packets.

Hyperthreading

Hyperthreading is a technology which makes one processor show up as two with the aim of improving resource usage efficiency within the processor. The webserver was benchmarked with hyperthreading enabled and hyperthreading disabled. Hyperthreading enabled resulted in a 37% performance increase (from 721 requests per second to 989 requests per second, with the same test). It was therefore enabled.

References

Colm MacCárthaigh, Scaling Apache 2.x beyond 20,000 concurrent downloads, http://www.stdlib.net/~colmmacc/Apachecon-EU2005/scaling-apache-handout.pdf, ApacheCon 2005

-- AnnRHarding - 20 Jul 2005

EXPERIMENTAL 10Gbps TECHNOLOGY RESULTS AT FORSCHUNGSZENTRUM KARLSRUHE

A world wide, distributed Grid Computing environment is currently under development for the upcoming Large Hadron Collider (LHC) at CERN, organized in several so called Tier-centres. CERN will be the source of a very high data load, originating from particle detectors at the LHC accelerator ring, with an estimated 2000 MByte/sec. Part of the data processing is done in a Tier-0 centre, located at CERN. It is responsible mainly for initial processing of data and its archiving to tape. One tenth of the data is distributed to each of the 10 Tier-1 centres for further processing and backup-purposes. The Gridka cluster (www.gridka.de), located at Forschungszentrum Karlsruhe/Germany (www.fzk.de), is one of these Tier-1 centres. Further details about the LHC computing model can be found here: http://les.home.cern.ch/les/talks/lcg%20overview%20sep05.ppt.

A 10 Gbit/s link is active between the CERN Tier-0 centre and GridKa in order to handle this high load of data.

Network administrators who have set up a Gigabit connection (or even a 10 Gb/s one) often sadly realize that what they get out of it is not what they had expected. The performance achieved with vanilla systems is often far away from what it should be. The current problems of Gigabit and 10 Gb/s Ethernet were not yet apparent years ago, when Fast Ethernet was released. Currently it is really difficult to get close to line speed, and almost impossible to fill a full-duplex link, when the latter technology is used. While in the heyday of Fast Ethernet the networks were the bottlenecks, nowadays the problems have moved towards the end-systems, and the networks are unlikely to be the bottlenecks anymore.

The requirements of full-duplex 10Gbps communication are just too demanding for the capabilities of current end-systems. Most of the applications in use today are based on TCP/IP. The inherent unreliability of IP is compensated for by several mechanisms in TCP, at the transport layer, to ensure correct and reliable end-to-end communication. This reliability does not come for free: each TCP flow within a system crosses the Front Side Bus (FSB) up to four times, and memory is accessed up to five times.

Vendors currently try to reduce both the number of transfers back and forth across the FSB and the interrupt rate that a 10Gbps system has to deal with. This is done by introducing offload engines (see http://www.redbooks.ibm.com/redbooks/SG245287/) and interrupt coalescence in current 10Gbps network devices, with the aim of getting the best out of the technology found in today's end-systems. The first approach implements in hardware some TCP procedures that were previously done in software, thereby eliminating one cycle, which reduces the cost to two Front Side Bus transfers and three memory accesses. The second approach places each new packet in a queue for a set time rather than sending it as soon as it is ready, in the hope that more packets will be ready to be sent by the time the interval ends. They can then all be sent with a single interrupt rather than one interrupt per packet.
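The effect of interrupt coalescence on the interrupt rate can be illustrated with a toy model. This is only a sketch under simple assumptions (a timer starts with the first packet of a batch, and everything arriving before it expires shares one interrupt); it does not describe the behaviour of any particular driver or network card, and `count_interrupts` is a name invented here.

```python
def count_interrupts(arrivals_us, window_us: int) -> int:
    """Count interrupts when packets are batched for `window_us` microseconds.

    The first packet of a batch starts a timer; every packet arriving up to
    and including the moment the timer expires is handled by the same interrupt.
    """
    interrupts = 0
    window_end = None
    for t in sorted(arrivals_us):
        if window_end is None or t > window_end:
            interrupts += 1               # this packet opens a new batch
            window_end = t + window_us
    return interrupts


if __name__ == "__main__":
    # Hypothetical load: 100 packets, one every 10 microseconds
    arrivals = list(range(0, 1000, 10))
    print("no coalescing:", count_interrupts(arrivals, 0))      # 100
    print("50 us window: ", count_interrupts(arrivals, 50))     # 17
```

The price of the lower interrupt count is added latency of up to one window per packet, which is why the window is normally tunable.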

The various 10Gbps tests run at the Forschungszentrum Karlsruhe can be divided into two big groups:

  • a local test inside the Forschungszentrum testbed

  • tests involving experiments in a Wide Area Network (WAN) environment, between Forschungszentrum Karlsruhe and CERN. These tests use the Deutsches Forschungsnetz (DFN) and Geant over a 20 ms RTT path through a shared Least Best Effort ATM MPLS Tunnel. This allows DFN and Geant to stay in control, as their 10Gbps backbone could easily be filled up by these tests alone, which would effectively cut off the communication of thousands of scientists across Europe.

The local environment at Forschungszentrum Karlsruhe consisted of two IBM xSeries 345 Intel Xeon based systems, both equipped with a 10Gbps LR card, kindly provided by Intel. With these unmodified systems the throughput reached slightly above 1Gbps. After modifying the default interrupt coalescence configuration of Intel's device driver, using non-standard Ethernet jumbo frames, and setting the MMRBC register of the PCI-X command register set to its maximum, a unidirectional single stream of slightly over 5.3Gbps could be sent in a back-to-back transfer, an improvement of more than 400%. As the load on both IBM systems rose to 99%, no higher throughput could be achieved with these machines. The bottleneck in this case was the memory subsystem of the Xeon systems.

In the WAN environment, a single Intel Itanium node at CERN and one of the already tuned IBM systems were configured to take part in the wide area tests. Both were configured in the same way. The first tests were really disappointing, as throughput did not go beyond a few megabytes per second. Once TCP SACK (selective acknowledgements, RFC 2018) and TCP Timestamps (RFC 1323) were enabled, and the TCP windows were enlarged by means of sysctl parameters to match the bandwidth-delay product (BDP), the throughput increased drastically, up to 5.4Gbps. At the full 10Gbps line rate, the BDP is roughly 200 Mbit (25 MByte) for this 20 ms RTT path across Germany, France and Switzerland. In this situation the bottleneck was again the memory subsystem of the IBM xSeries system. This was not unexpected, as two different architectures were brought face to face: Xeon versus Itanium.
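The bandwidth-delay product is simply the link rate multiplied by the round-trip time, and it gives the amount of data that must be "in flight" to keep the path full. The sketch below computes it with integer arithmetic; `bdp_bytes` is a name chosen for this example, and the 10 Gb/s and 1 Gb/s rates are nominal line rates rather than measured values.

```python
def bdp_bits(rate_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bits for a rate in bit/s and an RTT in ms."""
    return rate_bps * rtt_ms // 1000


def bdp_bytes(rate_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes."""
    return bdp_bits(rate_bps, rtt_ms) // 8


if __name__ == "__main__":
    for label, rate in (("10 Gb/s", 10_000_000_000), ("1 Gb/s", 1_000_000_000)):
        print(label, "x 20 ms:", bdp_bits(rate, 20), "bits =",
              bdp_bytes(rate, 20), "bytes")
```

A TCP window smaller than this value limits a single stream to window/RTT regardless of the link speed, which is why the socket buffer limits have to be raised for long, fast paths.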

Here is the modification of the TCP stack, as done using the Linux kernel’s sysctl parameters:

net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
# sets min/default/max TCP read buffer, default 4096 87380 174760
net.ipv4.tcp_rmem = 10000000 25165824 50331648
# sets min/default/max TCP write buffer, default 4096 16384 131072
net.ipv4.tcp_wmem = 10000000 25165824 50331648
# sets min/pressure/max TCP buffer space, default 31744 32256 32768
net.ipv4.tcp_mem = 32768 65536 131072

Related links:

Forschungszentrum Karlsruhe: http://www.fzk.de

GridKa: http://www.gridka.de

CERN: http://www.cern.ch

LHC GridComputing: http://les.home.cern.ch/les/talks/lcg%20overview%20sep05.ppt

IBM Redbook: http://www.redbooks.ibm.com/redbooks/SG245287

Authors

Marc García Martí

Bruno Hoeft

-- MonicaDomingues - 24 Oct 2005
-- SimonLeinen - 14 Oct 2006 (added cross-references)

  • figure1.bmp: Logical topology of the Forschungszentrum /CERN 10Gbps testbed

Network Performance People

There are many individuals who have contributed to the field of network performance. A few of them are listed here - if you think of someone else, just add them!

-- SimonLeinen - 06 May 2005

Van Jacobson

Traceroute, TCP, and DiffServ work

Van Jacobson made vast contributions to networking, in particular related to performance. He wrote the original traceroute tool based on an idea by Steve Deering, introduced Congestion Avoidance to TCP, co-invented (with Sally Floyd) Random Early Detection, implemented the first zero-copy TCP in BSD Unix, and did some of the early work on what later became the Differentiated Services Architecture and Premium IP.

Channel-based Networking Driver Architecture

Recent work includes a rearchitecture (based on a new "channel" concept) of the device driver and buffer management architecture that is common to networking stack implementations of practically all current operating systems. This work is described in a talk at Linux Conference Australia (LCA2006) (slides from the talk; blog article by DaveM).

-- SimonLeinen - 06 May 2005-04 Feb 2006

Sally Floyd

Sally co-invented Random Early Detection (RED) with Van Jacobson, proposed Explicit Congestion Notification (ECN), and works on the HS-TCP (High-Speed TCP) variant of the Transmission Control Protocol.

References

Sally's home page, including papers, talks, and informal notes, as well as a number of valuable pointer collections and open research questions.

-- SimonLeinen - 06 May 2005

Related Efforts

There are many other places where useful information about network performance can be found. Here is a small selection - feel free to add more references as you encounter them.

Network Monitoring Tools

Standardization

  • IETF (Internet Engineering Task Force), in particular the working groups on IPPM, LMAP...

-- SimonLeinen - 2006-01-22 - 2014-12-28

-- SimonLeinen - 05 Feb 2006


Topic revision: r25 - 2007-08-17 - TobyRodwell
 
Copyright © 2004-2009 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.