DS3.3.3a - Basic Guide to Network Performance

General Background and Context

Motivation and Purpose

The GN2 Performance Enhancement and Response Team (PERT) is a virtual team committed to helping academic users get efficient network performance for their end-systems. There is an emerging need for this kind of support, due to the proliferation of systems which are dependent on Long Fat pipe Networks (LFNs), and whose requirements exceed the scope of the original TCP specification (TCP being the most common of the network transport protocols used on top of the Internet Protocol (IP)).

The PERT offers support in two ways. First, they respond to requests to investigate quality of service (QoS) issues submitted by the PERT Primary Customers. The PERT Primary Customers are the European NRENs, peer academic networks and certain international academic research projects (the full list of PERT Primary Customers is given in ‘The PERT Policy’ (GN2-05-18). The purpose of specifying a relatively small number of Primary Customers is to ensure that the majority of potential users of the PERT (the PERT’s ‘End Customers’) first contact their NREN, who will be able to help with minor problems, or those issues that are in fact outside the scope of the PERT (such as network hardware failure).

The second way in which the PERT helps its clients is by producing reference documentation that end users and network administrators can use for self-help. This documentation explains the concepts of network performance, highlights the most common causes of quality of service (QoS) problems, and offers general guidelines on how to configure systems to optimise performance. It also provides pointers to other resources for troubleshooting issues, such as the NREN network statistics and monitoring tools that will help end users and network administrators to determine quickly for themselves whether current problems are likely to be the result of changes in network conditions. This information is all held in the PERT Knowledgebase, an online resource that is accessible to all (see GEANT2 web site, under ‘Users’ -> ‘PERT’ for a link).

In order to showcase the work of the PERT in this area, a snapshot has been taken of the content of the PERT Knowledgebase and this has been used to create two self-contained guides to network quality of service. The ‘User Guide’, as its name suggests, is targeted at end-users, and in particular those end-users who have demanding requirements (typically those who depend on LFNs), and the network administrators who supportthem. The User Guide is at Annex A of this document. The second guide (Annex B) is called the ‘Best Practice Guide’. This document is aimed more at campus network administrators, since it concentrates on issues which can not be controlled by re-configuring or upgrading end-systems. In fact the Best Practice Guide is really an addendum to the User Guide, in as much as its target audience (network administrators) is a subset of the User Guide’s target audience.

Conclusions and Future Work

Whilst the online PERT Knowledgebase, which will always have the most up to date content, is expected to be the main reference source for PERT users seeking specific advice on a given subject, the two guides presented here will provide a useful and easy to read introduction to the main issues surrounding network performance issues today.

A significant amount of effort was put into building up the PERT Knowledgebase, prior to producing these two guides, and the effort is planned to continue, albeit at a slower, steadier pace, over the rest of GEANT2. The PERT Knowledge base will be kept under regular review, and as and when it has changed significantly from its present state the guides will be re-published.

-- SimonLeinen - 09 Apr 2006

Glossary of Abbreviations and Acronyms

ACK - TCP acknowledgement packet
AIMD - Additive Increase/Multiplicative Decrease
AQM - Active Queue Management
ASIC - Application-Specific Integrated Circuit
ATM - Asynchronous Transfer Mode
BDP - Bandwidth-delay product
CE - Congestion Experienced
CPU - Central Processing Unit
CSMA/CD - Collision Sense Multiple Access/Collision Detection
Cwnd - Congestion Window
CWR - Congestion Window Reduced
DFN - Deutsches Forschungsnetz
ECE - ECN-Echo
ECMP - Equal Cost Multipath
ECN - Explicit Congestion Notification
ECT - ECN-Capable Transport
EGEE - Enabling Grids for E-Science in Europe
FDDI - Fiber Distributed Data Interface
FTP - File Transfer Protocol
HTTP - HyperText Transfer Protocol
ICMP - Internet Control Message Protocol
IEEE - Institute of Electrical and Electronics Engineers
IETF - Internet Engineering Task Force
IPDV - IP Delay Variation Metric
IP - Internet Protocol
IPPM - IP Performance Metrics, an IETF Working Group
LFN - Long Fat Network, a network with a large BDP
LSO - Large Send Offload
MDT - Multidata Transmit
MSS - maximum segment size
MTU - Maximum Transmission Unit
NIC - Network Interface Card
NIST - National Institute of Standards and Technology
NREN - National Research and Education Network
OS - Operating System
OWAMP - One-Way Active Measurement Protocol
OWD - One-Way Delay
PAWS - Protection Against Wrapped Sequence-numbers
PDU - Protocol Data Unit
PERT - Performance Enhancement & Response Team
PMTUD - Path MTU Discovery
POS - Packet Over Sonet
RCP - Rate Control Protocol
RCP - Remote Copy
RED - Random Early Detection, a form of active queue management
RENATER - Réseau National de télécommunications pour la Technologie l'Enseignement et la Recherche
RFC - Request For Comment - Technical and organisational notes about the Internet
RIPE NCC - RIPE Network Control Center
RIPE - Réseaux IP Européens
RTCP - RTP Control Protocol
RTP - Real-time Transport Protocol
RTT - Round-trip time
SACK - Selective ACKnowledgement
SCP - Secure Copy
SCTP - Stream Control Transmission Protocol
SMDS - Switched Multimegabit Data Service
SNMP - Simple Network Management Protocol
SSH - Secure Shell
Ssthresh - Slow Start Threshold
SYN - TCP synchronisation packet
TCP - Transmission Control Protocol
TOE - TCP Offload Engine
TSO - TCP Segmentation Offload
TTL - Time To Live
TTM - Test Traffic Measurements, a service provided by RIPE NCC
UDP - User Datagram Protocol
VOP - Velocity of Propagation
WIDE - Widely Integrated Distributed Environment
WLAN - Wireless Local Area Network
XCP - eXplicit Control Protocol

-- SimonLeinen - 09 Apr 2006

Abstract

This paper is a guide to network performance issues. We present the fundamentals of network performance, and then look at ways of investigating and remedying problems on end-systems. We finish with several case studies which demonstrate the effects of the issues previously examined.

-- SimonLeinen - 09 Apr 2006

Notes About Performance

This part of the PERT Knowledgebase contains general observations and hints about the performance of networked applications.

The mapping between network performance metrics and user-perceived performance is complex and often non-intuitive.

-- SimonLeinen - 13 Dec 2004 - 07 Apr 2006

User-Perceived Performance

User-perceived performance of network applications is made up of a number of mainly qualitative metrics, some of which are in conflict with each other. In the case of some applications, a single metric will outweigh the others, such as responsiveness from video services or throughput for bulk transfer applications. More commonly, a combination of factors usually determines the experience of network performance for the end-user.

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see is significantly lower than the advertised capacity of their network connection.

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006
-- AlessandraScicchitano - 10 Feb 2012

Responsiveness

One of the most important user experiences in networking applications is the perception of responsiveness. If end-users feel that an application is slow, it is often the case that it is slow to respond to them, rather than being directly related to network speed. This is a particular issue for real-time applications such as audio/video conferencing systems and must be prioritised in applications such as remote medical services and off-campus teaching facilities. It can be difficult to quantitatively define an acceptable figure for response times as the requirements may vary from application to application.

However, some applications have relatively well-defined "physiological" bounds beyond which the responsiveness feeling vanishes. For example, for voice conversations, a (round-trip) delay of 150ms is practically unnoticeable, but an only slightly larger delay is typically felt as very intrusive.

-- SimonLeinen - 07 Apr 2006 - 23 Aug 2007

Throughput/Capacity/"Bandwidth"

Throughput per se is not directly perceived by the user, although a lack of throughput will increase waiting times and reduce the impression of responsiveness. However, "bandwidth" is widely used as a "marketing" metric to differentiate "fast" connections from "slow" ones, and many applications display throughput during long data transfers. Therefore users often have specific performance expectations in terms of "bandwidth", and are disappointed when the actual throughput figures they see is significantly lower than the advertised capacity of their network connection.

-- SimonLeinen - 07 Apr 2006

Reliability

Reliability is often the most important performance criterion for a user: The application must be available when the user requires it. Note that this doesn't necessarily mean that it is always available, although there are some applications - such as the public Web presence of a global corporation - that have "24x7" availability requirements.

The impression of reliability will also be heavily influenced by what happens (or is expected to happen) in cases of unavailability: Does the user have a possibility to build a workaround? In case of provider problems: Does the user have someone competent to call - or can they even be sure that the provider will notice the problem themselves, and fix it in due time? How is the user informed during the outage, in particular concerning the estimated time to repair?

Another aspect of reliability is the predictability of performance. It can be profoundly disturbing to a user to see large performance variations over time, even if the varying performance is still within the required performance range - who can guarantee that variations won't increase beyond the tolerable during some other time when the application is needed? E.g., a 10 Mb/s throughput that remains rock-stable over time can feel more reliable than throughput figures that vary between 200 and 600 Mb/s.

-- SimonLeinen - 07 Apr 2006

The "Wizard Gap"

The Wizard Gap is an expression coined by Matt Mathis (then PSC) in 1999. It designates the difference between the performance that is "theoretically" possible on today's high-speed networks (in particular, research networks), and the performance that most users actually perceive. The idea is that today, the "theoretical" performance can only be (approximately) obtained by "wizards" with superior knowledge and skills concerning system tuning. Good examples for "Wizard" communities are the participants in Internet2 Land-Speed Record or SC Bandwidth Challenge competitions.

The Internet2 end-to-end performance initiative strives to reduce the Wizard Gap by user education as well as improved instrumentation (see e.g. Web100) of networking stacks. In the GÉANT community, PERTs focus on assistance to users (case management), as well as user education through resources such as this knowledge base. Commercial players are also contributing to closing the wizard gap, by improving "out-of-the-box" performance of hardware and software, so that their customers can benefit from faster networking.

References

  • M. Mathis, Pushing up Performance for Everyone, Presentation at the NLANR/Internet2 Joint Techs Workshop, 1999 (PPT)

-- SimonLeinen - 2006-02-28 - 2016-04-27

Why Latency Is Important

Traditionally, the metric of focus in networking has been bandwidth. As more and more parts of the Internet have their capacity upgraded, bandwidth is often not the main problem anymore. However, network-induced latency as measured in One-Way Delay (OWD) or Round-Trip Time often has noticeable impact on performance.

It's not just for gamers...

The one group of Internet users today that is most aware of the importance of latency are online gamers. It is intuitively obvious than in real-time multiplayer games over the network, players don't want to be put at a disadvantage because their actions take longer to reach the game server than their opponents'.

However, latency impacts the other users of the Internet as well, probably much more than they are aware of. At a given bottleneck bandwidth, a connection with lower latency will reach its achievable rate faster than one with higher latency. The effect of this should not be underestimated, since most connections on the Internet are short - in particular connections associated with Web browsing.

But even for long connections, latency often has a big impact on performance (throughput), because many of those long connections have their throughput limited by the window size that is available for TCP. And when the window size is the bottleneck, throughput is inversely proportional to round-trip time. Furthermore, for a given (small) loss rate, RTT places an upper limit on the achieveable throughput of a TCP connection, as shown by the Mathis Equation

But I thought latency was only important for multimedia!?

Common wisdom is that latency (and also jitter) is important for audio/video ("multimedia") applications. This is only partly true: Many applications of audio/video involve "on-demand" unidirectional transmission. In those applications, the real-time concerns can often be mitigated by clever buffering or transmission ahead of time. For conversational audio/video, such as "Voice over IP" or videoconferencing, the latency issue is very real. The principal sources of latency in these applications are not backbone-related, but related to compression/sampling rates (see packetization delay) and to transcoding devices such as H.323 MCUs (Multi-Channel Units).

References

-- SimonLeinen - 13 Dec 2004

TCP (Transmission Control Protocol)

The Transmission Control Protocol (TCP) is the prevalent transport protocol used on the Internet today. It uses window-based transmission to provide the service of a reliable byte stream, and adapts the rate of transfer to the state (of congestion) of the network and the receiver. Basic mechanisms include:

  • Segments that fit into IP packets, into which the byte-stream is split by the sender,
  • a checksum for each segment,
  • a Window, which bounds the amount of data "in flight" between the sender and the receiver,
  • Acknowledgements, by which the receiver tells the sender about segments that were received successfully,
  • a flow-control mechanism, which regulates data rate as a function of network and receiver conditions.

Originally specified in September 1981 RFC 793, TCP was clarified, refined and extended in many documents. Perhaps most importantly, congestion control was introduced in "TCP Tahoe" in 1988, described in Van Jacobson's 1988 SIGCOMM article on "Congestion Avoidance and Control". It can be said that TCP's congestion control is what keeps the Internet working when links are overloaded. In today's Internet, the enhanced "Reno" variant of congestion control is probably the most widespread.

RFC 7323 (formerly RFC 1323) specifies a set of "TCP Extensions for High Performance", namely the Window Scaling Option, which provides for much larger windows than the original 64K, the Timestamp Option and the PAWS (Protection Against Wrapped Sequence numbers) mechanism. These extensions are supported by most contemporary TCP stacks, although they frequently must be activated explicitly (or implicitly by configuring large TCP windows).

Another widely implemented performance enhancement to TCP are selective acknowledgements (SACK, RFC 2018). In TCP as originally specified, the acknowledgements ("ACKs") sent by a receiver were always "cumulative", that is, they specified the last byte of the part of the stream that was completely received. This is advantageous with large TCP windows, in particular where chances are high that multiple segments within a single window are lost. RFC 2883 describes an extension to SACK which makes TCP more robust in the presence of packet reordering in the network.

In addition, RFC 3449 - TCP Performance Implications of Network Path Asymmetry provides excellent information since a vast majority of Internet connections are asymmetrical.

Further Reading

References

  • RFC 7414, A Roadmap for Transmission Control Protocol (TCP) Specification Documents]], M. Duke, R. Braden, W. Eddy, E. Blanton, A. Zimmermann, February 2015
  • RFC 793, Transmission Control Protocol]], J. Postel, September 1981
  • draft-ietf-tcpm-rfc793bis-07, Transmission Control Protocol Specification, Wesley M. Eddy, November 2017
  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • RFC 2018, TCP Selective Acknowledgement Options]], M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, October 1996
  • RFC 5681, TCP Congestion Control]], M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 3449, TCP Performance Implications of Network Path Asymmetry]], H. Balakrishnan, V. Padmanabhan, G. Fairhurst, M. Sooriyabandara, December 2002

There is ample literature on TCP, in particular research literature on its performance and large-scale behaviour.

Juniper has a nice white paper, Supporting Differentiated Service Classes: TCP Congestion Control Mechanisms (PDF format) explaining TCP's congestion control as well as many of the enhancements proposed over the years.

-- Simon Leinen - 2004-10-27 - 2017-11-13 -- Ulrich Schmid - 2005-05-31

Window-Based Transmission

TCP is a sliding-window protocol. The receiver tells the sender the available buffer space at the receiver (TCP header field "window"). The total window size is the minimum of sender buffer size, advertised receiver window size and congestion window size.

The sender can transmit up to this amount of data before having to wait for further buffer update from the receiver and should not have more than this amount of data in transit in the network. The sender must buffer the sent data until it has been ACKed by the receiver, so that the data can be retransmitted if neccessary. For each ACK the sent segment left the window and a new segment fills the window if it fits the (possibly updated) window buffer.

Due to TCP's flow control mechanism, TCP window size can limit the maximum theoretical throughput regardless of the bandwidth of the network path. Using too small a TCP window can degrade the network performance lower than expected and a too large window may have the same problems in case of error recovery.

The TCP window size is the most important parameter for achieving maximum throughput across high-performance networks. To reach the maximum transfer rate, the TCP window should be no smaller than the bandwidth-delay product.

Window size (bytes) => Bandwidth (bytes/sec) x Round-trip time (sec)

Example:

window size: 8192 bytes
round-trip time: 100ms
maximum throughput: < 0.62 Mbit/sec.

References

-- UlrichSchmid & SimonLeinen - 31 May-07 Jun 2005

Large TCP Windows

In order to achieve high data rates with TCP over "long fat networks", i.e. network paths with a large bandwidth-delay product, TCP sinks (that is, hosts receiving data transported by TCP) must advertise a large TCP receive window (referred to as just 'the window', since there is not an equivalent advertised 'send window').

The window is a 16 bit value (bytes 15 and 16 in the TCP header) and so, in TCP as originally specified, it is limited to a value of 65535 (64K). The receive window sets an upper limit on the sustained throughput achieveable over a TCP connection since it represents the maximum amount of unacknowledged data (in bytes) there can be on the TCP path. Mathematically, achieveable throughput can never be more than WINDOW_SIZE/RTT, so for a trans-Atlantic link, with say an RTT (Round trip Time) of 150ms, the throughput is limited to a maximum of 3.4Mbps. With the emergence of "long fat networks", the limit of 64K bytes (on some systems even just 32K bytes!) was clearly insufficient and so RFC 7323 laid down (amongst other things) a way of scaling the advertised window, such that the 16-bit window value can represent numbers larger than 64K.

RFC 7323 extensions

RFC 7323, TCP Extensions for High Performance, (and formerly RFC 1323) defines several mechanisms to enable high-speed transfers over LFNs: Window Scaling, TCP Timestamps, and Protection Against Wrapped Sequence numbers (PAWS).

The TCP window scaling option increases the maximum window size fom 64KB to 1Gbyte, by shifting the window field left by up to 14. The window scale option is used only during the TCP 3-way handshake (both sides must set the window scale option in their SYN segments if scaling is to be used in either direction).

It is important to use TCP timestamps option with large TCP windows. With the TCP timestamps option, each segment contains a timestamp. The receiver returns that timestamp in each ACK and this allows the sender to estimate the RTT. On the other hand with the TCP timestamps option the problem of wrapped sequence number could be solved (PAWS - Protection Against Wrapped Sequences) which could occur with large windows.

(Auto) Configuration

In the past, most operating systems required manual tuning to use large TCP windows. The OS-specific tuning section contains information on how to to this for a set of operating systems.

Since around 2008-2010, many popular operating systems will use large windows and the necessary protocol options by default, thanks to TCP Buffer Auto-Tuning.

Can TCP Windows ever be too large?

There are several potential issues when TCP Windows are larger than necessary:

  1. When there are many active TCP connection endpoints (sockets) on a system - such as a popular Web or file server - then a large TCP window size will lead to high consumption of system (kernel) memory. This can have a number of negative consequences: The system may run out of buffer space so that no new connections can be opened, or the high occupation of kernel memory (which typically must reside in actual RAM and cannot be "paged out" to disk) can "starve" other processes of access to fast memory (cache and RAM)
  2. Large windows can cause large "bursts" of consecutive segments/packets. When there is a bottleneck in the path - because of a slower link or because of cross-traffic - these bursts will fill up buffers in the network device (router or switch) in front of that bottleneck. The larger these bursts, the higher are the risks that this buffer overflows and causes multiple segments to be dropped. So a large window can lead to "sawtooth" behavior and worse link utilisation than with a just-big-enough window where TCP could operate at a steady rate.

Several methods for automatic TCP buffer tuning have been developed to resolve these issues, some of which have been implemented in (at least) recent Linux versions.

References

-- SimonLeinen - 27 Oct 2004 - 27 Sep 2014

TCP Acknowledgements

In TCP's sliding-window scheme, the receiver acknowledges the data it receives, so that the sender can advance the window and send new data. As originally specified, TCP's acknowledgements ("ACKs") are cumulative: the receiver tells the sender how much consecutive data it has received. More recently, selective acknowledgements were introduced to allow more fine-grained acknowledgements of received data.

Delayed Acknowledgements and "ack every other"

RFC 831 first suggested a delayed acknowledgement (Delayed ACK) strategy, where a receiver doesn't always immediately acknowledge segments as it receives them. This recommendation was carried forth and specified in more detail in RFC 1122 and RFC 5681 (formerly known as RFC 2581). RFC 5681 mandates that an acknowledgement be sent for at least every other full-size segment, and that no more than 500ms expire before any segment is acknowledged.

The resulting behavior is that, for longer transfers, acknowledgements are only sent for every two segments received ("ack every other"). This is in order to reduce the amount of reverse flow traffic (and in particular the number of small packets). For transactional (request/response) traffic, the delayed acknowledgement strategy often makes it possible to "piggy-back" acknowledgements on response data segments.

A TCP segment which is only an acknowledgement i.e. has no payload, is termed a pure ack.

Delayed Acknowledgments should be taken into account when doing RTT estimation. As an illustration, see this change note for the Linux kernel from Gavin McCullagh.

Critique on Delayed ACKs

John Nagle nicely explains problems with current implementations of delayed ACK in a comment to a thread on Hacker News:

Here's how to think about that. A delayed ACK is a bet. You're betting that there will be a reply, upon with an ACK can be piggybacked, before the fixed timer runs out. If the fixed timer runs out, and you have to send an ACK as a separate message, you lost the bet. Current TCP implementations will happily lose that bet forever without turning off the ACK delay. That's just wrong.

The right answer is to track wins and losses on delayed and non-delayed ACKs. Don't turn on ACK delay unless you're sending a lot of non-delayed ACKs closely followed by packets on which the ACK could have been piggybacked. Turn it off when a delayed ACK has to be sent.

Duplicate Acknowledgements

A duplicate acknowledgement (DUPACK) is one with the same acknowledgement number as its predecessor - it signifies that the TCP receiver has received a segment newer than the one it was expecting i.e. it has missed a segment. The missed segment might not be lost, it might just be re-ordered. For this reason the TCP sender will not assume data loss on the first DUPACK but (by default) on the third DUPACK, when it will, as per RFC 5681, perform a "Fast Retransmit", by sending the segment again without waiting for a timeout. DUPACKs are never delayed; they are sent immediately the TCP receiver detects an out-of-order segment.

"Quickack mode" in Linux

The TCP implementation in Linux has a special receive-side feature that temporarily disables delayed acknowledgements/ack-every-other when the receiver thinks that the sender is in slow-start mode. This speeds up the slow-start phase when RTT is high. This "quickack mode" can also be explicitly enabled or disabled with setsockopt() using the TCP_QUICKACK option. The Linux modification has been criticized because it makes Linux' slow-start more aggressive than other TCPs' (that follow the SHOULDs in RFC 5681), without sufficient validation of its effects.

"Stretch ACKs"

Techniques such as LRO and GRO (and to some level, Delayed ACKs as well as ACKs that are lost or delayed on the path) can cause ACKs to cover many more than the two segments suggested by the historic TCP standards. Although ACKs in TCP have always been defined as "cumulative" (with the exception of SACKs), some congestion control algorithms have trouble with ACKs that are "stretched" in this way. In January 2015, a patch set was submitted to the Linux kernel network development with the goal to improve the behavior of the Reno and CUBIC congestion control algorithms when faced with stretch ACKs.

References

  • TCP Congestion Control, RFC 5681, M. Allman, V. Paxson, E. Blanton, September 2009
  • RFC 813, Window and Acknowledgement Strategy in TCP, D. Clark, July 1982
  • RFC 1122, Requirements for Internet Hosts -- Communication Layers, R. Braden (Ed.), October 1989

-- TobyRodwell - 2005-04-05
-- SimonLeinen - 2007-01-07 - 2015-01-28

TCP Window Scaling Option

TCP as originally specified only allows for windows up to 65536 bytes, because the window-related protocol fields all have a width of 16 bits. This prevents the use of large TCP windows, which are necessary to achieve high transmission rates over paths with a large bandwidth*delay product.

The Window Scaling Option is one of several options defined in RFC 7323, "TCP Extensions for High Performance". It allows TCP to advertise and use a window larger than 65536 bytes. The way this works is that in the initial handshake, the a TCP speaker announces a scaling factor, which is a power of two between 20 (no scaling) and 214, to allow for an effective window of 230 (one Gigabyte). Window Scaling only comes into effect when both ends of the connection advertise the option (even with just a scaling factor of 20).

As explained under the large TCP windows topic, Window Scaling is typically used in connection with the TCP Timestamp option and the PAWS (Protection Against Wrapped Sequence numbers), both defined in RFC 7323.

32K limitations on some systems when Window Scaling is not used

Some systems, notably Linux, limit the usable window in the absence of Window Scaling to 32 Kilobytes. The reasoning behind this is that some (other) systems used in the past were broken by larger windows, because they erroneously interpreted the 16-bit values related to TCP windows as signed, rather than unsigned, integers.

This limitation can have the effect that users (or applications) that request window sizes between 32K and 64K, but don't have Window Scaling enabled, do not actually benefit from the desired window sizes.

The artificial limitation was finally removed from the Linux kernel sources on March 21, 2006. The old behavior (window limited to 32KB when Window Scaling isn't used) can still be enabled through a sysctl tuning parameter, in case one really wants to interoperate with TCP implementations that are broken in the way described above.

Problems with window scaling

Middlebox issues

MiddleBoxes which do not understand window scaling may cause very poor performance, as described in WindowScalingProblems. Windows Vista limits the scaling factor for HTTP transactions to 2 to avoid some of the problems, see WindowsOSSpecific.

Diagnostic issues

The Window Scaling option has an impact on how the offered window in TCP ACKs should be interpreted. This can cause problems when one is faced with incomplete packet traces that lack the initial SYN/SYN+ACK handshake where the options are negotiated: The SEQ/ACK sequence may look incorrect because scaling cannot be taken into account. One effect of this is that tcpgraph or similar analysis tools (included in tools such as WireShark) may produce incoherent results.

References

  • RFC 7323, TCP Extensions for High Performance, D. Borman, B. Braden, V. Jacobson, B. Scheffenegger, September 2014 (obsoletes RFC 1323 from May 1993)
  • git note about the Linux kernel modification that got rid of the 32K limit, R. Jones and D.S. Miller, March 2006. Can also be retrieved as a patch.

-- SimonLeinen - 21 Mar 2006 - 27 September 2014
-- PekkaSavola - 07 Nov 2006
-- AlexGall - 31 Aug 2007 (added reference to scaling limitiation in Windows Vista)

TCP Flow Control

Note: This topic describes the Reno enhancement of classical "Van Jacobson" or Tahoe congestion control. There have been many suggestions for improving this mechanism - see the topic on high-speed TCP variants.

TCP flow control and window size adjustment is mainly based on two key mechanism: Slow Start and Additive Increase/Multiplicative Decrease (AIMD), also known as Congestion Avoidance. (RFC 793 and RFC 5681)

Slow Start

To avoid that a starting TCP connection floods the network, a Slow Start mechanism was introduced in TCP. This mechanism effectively probes to find the available bandwidth.

In addition to the window advertised by the receiver, a Congestion Window (cwnd) value is used and the effective window size is the lesser of the two. The starting value of the cwnd window is set initially to a value that has been evolving over the years, the TCP Initial Window. After each acknowledgment, the cwnd window is increased by one MSS. By this algorithm, the data rate of the sender doubles each round-trip time (RTT) interval (actually, taking into account Delayed ACKs, rate increases by 50% every RTT). For a properly implemented version of TCP this increase continues until:

  • the advertised window size is reached
  • congestion (packet loss) is detected on the connection.
  • there is no traffic waiting to take advantage of an increased window (i.e. cwnd should only grow if it needs to)

When congestion is detected, the TCP flow-control mode is changed from Slow Start to Congestion Avoidance. Note that some TCP implementations maintain cwnd in units of bytes, while others use units of full-sized segments.

Congestion Avoidance

Once congestion is detected (through timeout and/or duplicate ACKs), the data rate is reduced in order to let the network recover.

Slow Start uses an exponential increase in window size and thus also in data rate. Congestion Avoidance uses a linear growth function (additive increase). This is achieved by introducing - in addition to the cwnd window - a slow start threshold (ssthresh).

As long as cwnd is less than ssthresh, Slow Start applies. Once ssthresh is reached, cwnd is increased by at most one segment per RTT. The cwnd window continues to open with this linear rate until a congestion event is detected.

When congestion is detected, ssthresh is set to half the cwnd (or to be strictly accurate, half the "Flight Size". This distinction is important if the implementation lets cwnd grow beyond rwnd (the receiver's declared window)). cwnd is either set to 1 if congestion was signalled by a timeout, forcing the sender to enter Slow Start, or to ssthresh if congestion was signalled by duplicate ACKs and the Fast Recovery algorithm has terminated. In either case, once the sender enters Congestion Avoidance, its rate has been reduced to half the value at the time of congestion. This multiplicative decrease causes the cwnd to close exponentially with each detected loss event.

Fast Retransmit

In Fast Retransmit, the arrival of three duplicate ACKs is interpreted as packet loss, and retransmission starts before the retransmission timer (RTO) expires.

The missing segment will be retransmitted immediately without going through the normal retransmission queue processing. This improves performance by eliminating delays that would suspend effective data flow on the link.

Fast Recovery

Fast Recovery is used to react quickly to a single packet loss. In Fast recovery, the receipt of 3 duplicate ACKs, while being taken to mean a loss of a segment, does not result in a full Slow Start. This is because obviously later segments got through, and hence congestion is not stopping everything. In fast recovery, ssthresh is set to half of the current send window size, the missing segment is retransmitted (Fast Retransmit) and cwnd is set to ssthresh plus three segments. Each additional duplicate ACK indicates that one segment has left the network at the receiver and cwnd is increased by one segment to allow the transmission of another segment if allowed by the new cwnd. When an ACK is received for new data, cwmd is reset to the ssthresh, and TCP enters congestion avoidance mode.

References

-- UlrichSchmid - 07 Jun 2005 -- SimonLeinen - 27 Jan 2006 - 23 Jun 2011

Troubleshooting Procedures

PERT Troubleshooting Procedures, as used in the earlier centralised PERT in GN2 (2004-2008), are laid down in GN2 Deliverable DS3.5.2 (see references section), which also contains the User Guide for using the (now-defunct) PERT Ticket System (PTS).

Standard tests

  • Ping - simple RTT/loss measurement using ICMP ECHO requests
  • Fping - ping variant with concurrent measurement of multiple destinations

End-user tests

tweak tools http://www.dslreports.com/tweaks is able to run a simple end-user test, checking such parameters as TCP options, receive window size, data transfer rates. It is all done through the user's web-browser making it a simple test for them to do.

Linux Tools and Checks

ip route show cache

Linux is able to apply specific conditions (MSS, ssthresh) to specific routes. Some parameters (such as MSS, ssthresh) can be set manually (with ip route add|replace ...) whilst others are changed automatically by Linux in response to what it learns from TCP (such parameters include estimated rtt, cwnd and re-ordering). The learned info is stored in the route cache and thus can be shown with ip route show cache. Note, this learning behaviour can actually limit TCP performance - if the last transfer was poor, then the starting TCP parameters will be be pessimistic. For this reason some tools, e.g. bwctl, always flush the route cache before starting a test.

References

  • PERT Troubleshooting Procedures - 2nd Edition, B. Belter, A. Harding, T. Rodwell, GN2 deliverable D3.5.2, August 2006 (PDF)

-- SimonLeinen - 18 Sep 2005 - 17 Aug 2010
-- BartoszBelter - 28 Mar 2006

Measurement Tools

Traceroute-like Tools: traceroute, MTR, PingPlotter, lft, tracepath, traceproto

There is a large and growing number of path-measurement tools derived from the well-known traceroute tool. Those tools all attempt to find the route a packet will take from the source (typically where the tool is running) to a given destination, and to find out some hop-wise performance parameters along the way.

Bandwidth Measurement Tools: pchar, Iperf, bwctl, nuttcp, Netperf, RUDE/CRUDE, ttcp, NDT, DSL Reports

Use: Evaluate the Bandwidth between two points in the network.

Active Measurement Boxes

Use: information about Delay, Jitter and packets loss between two probes.

Passive Measurement Tools

These tools perform their measurement by looking at existing traffic. They can be further classified into

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- HankNussbacher - 01 Jun 2005
-- SimonLeinen - 2005-07-15 - 2014-12-27
-- AlessandraScicchitano - 2013-01-30

traceroute-like Tools

Traceroute is used to determine the route a packet takes through the Internet to reach its destination; i.e. the sequence of gateways or "hops" it passes through. Since its inception, traceroute has been widely used for network diagnostics as well as for research in the widest sense.

The basic idea is to send out "probe" packets with artificially small TTL (time-to-live) values, eliciting ICMP "time exceeded" messages from routers in the network, and reconstructing the path from these messages. This is described in more detail under the VanJacobsonTraceroute topic. This original traceroute implementation was followed by many attempts at improving on the idea in various directions: More useful output (often graphically enhanced, sometimes trying to map the route geographically), more detailed measurements along the path, faster operation for many targets ("topology discovery"), and more robustness in the face of packet filters, using different types of suitable probe packets.

Traceroute Variants

Close cousins

Variants with more filter-friendly probe packets

Extensions

GUI (Graphical User Interface) traceroute variants

  • 3d Traceroute plots per-hop RTTs over time in 3D. Works on Windows, free demo available
  • LoriotPro graphical traceroute plugin. Several graphical representations.
  • Path Analyzer Pro, graphical traceroute tool for Windows and MacOS X. Include "synopsis" feature.
  • PingPlotter combines traceroute, ping and whois to collect data for Windows platform.
  • VisualRoute performs traceroutes and provides various graphical representations and analysis. Runs on Windows, Web demo available.

More detailed measurements along the path

Other

  • Multicast Traceroute (mtrace) uses IGMP protocol extensions to allow "traceroute" functionality for multicast
  • Paris Traceroute claims to produce better results in the presence of "multipath" routing.
  • Scamper performs bulk measurements to large numbers of destinations, under configurable traffic limits.
  • tracefilter tries to detect where along a path packets are filtered.
  • Tracebox tries to detect middleboxes along the path
  • traceiface tries to find "both ends" of each link traversed using expanding-ring search.
  • Net::Traceroute::PurePerl is an implementation of traceroute in Perl that lends itself to experimentation.
  • traceflow is a proposed protocol to trace the path for traffic of a specific 3/5-tuple and collect diagnostic information

Another list of traceroute variants, with a focus on (geo)graphical display capabilities, can be found on the JRA1 Wiki.

Traceroute Servers

There are many TracerouteServers on the Internet that allow running traceroute from other parts of the network.

Traceroute Data Collection and Representation

Researchers from the network measurement community have created large collections of traceroute results to help understand the Internet's topology, i.e. the structure of its connectivity. Some of these collections are available for other researchers. Scamper is an example of a tool that can be used to efficiently obtain traceroute results towards a large number of destinations.

The IETF IPPM WG is standardizing an XML-based format to store traceroute results.

-- FrancoisXavierAndreu - SimonMuyal - 06 Jun 2005
-- SimonLeinen - 2005-05-06 - 2015-09-11

Original Van Jacobson/Unix/LBL Traceroute

The well known traceroute program has been written by Van Jacobson in December 1988. In the header comment of the source, Van explains that he just implemented an idea he got from Steve Deering (and that could have come from Guy Almes or Matt Mathis). Traceroute sends "probe" packets with TTL (Time to Live) values incrementing from one, and uses ICMP "Time Exceeded" messages to detect router "hops" on the way to the specified destination. It also records "response" times for each hop, and displays losses and other types of failures in a compact way.

This traceroute is also known as "Unix traceroute" (although some Unix variants have departed from the code base, see e.g. Solaris traceroute) or "LBL traceroute" after Van Jacobson's employer at the time he wrote this.

Basic function

UDP packets are sent as probes to a high ephemeral port (usually in the range 33434--33525) with the Time-To-Live (TTL) field in the IP header increasing by one for each probe sent until the end host is reached. The originating host listens for ICMP Time Exceeded responses from each of the routers/hosts en-route. It knows that the packet's destination has been reached when it receives an ICMP Port Unreachable message; we expect a port unreachable message as no service should be listening for connections in this port range. If there is no response to the probe within a certain time period (typically 5 seconds), then a * is displayed.

Output

The output of the traceroute program shows each host that the packet passes through on its way to its destination and the RTT to each gateway en-route. Occasionally, the maximum number of hops (specified by the TTL field, which defaults to 64 hops in *NIX implementations) is exceeded before the port unreachable is received. When this happens an ! will be printed beside the RTT in the output.

Other error messages that may appear after the RTT in the output of a traceroute are:

!H
Host unreachable
!N
Network unreachable
!P
Protocol unreachable
!S
Source-route failed (that is to say, the router was not able to honour the source-route option set in an IP packet)
!F[pmtu]
Fragmentation needed. [pmtu] displays the Path MTU Discovery value, typically the "next-hop MTU" contained in the ICMP response packet.
!X
Administratively prohibited. The gateway prohibits these packets, but sends an ICMP message back to the source of the traceroute to inform them of this.
!V
Host precedence violation
!C
Precedence cut-off in effect
![num]
displays the ICMP unreachable code, as defined in a variety of RFC, shown at http://www.iana.org/assignments/icmp-parameters

UDP Probes

The original traceroute sends UDP packets as probes to a high ephemeral port (usually in the range 33434--33525) with the Time-To-Live (TTL) field in the IP header increasing by one for each probe sent until the end host is reached. The originating host listens for ICMP Time Exceeded responses from each of the routers/hosts en-route. It knows that the packet's destination has been reached when it receives an ICMP Port Unreachable message; we expect a port unreachable message as no service should be listening for connections in this port range. If there is no response to the probe within a certain time period (typically 5 seconds), then a * is displayed.

Why UDP?

It would seem natural to use ICMP ECHO requests as probe packets, but Van Jacobson chose UDP packets to presumably unused ports instead. It is believed that this is because at that time, some gateways (as routers were called then) refused to send ICMP (TTL exceeded) messages in response to ICMP messages, as specified in the introduction of RFC 792, "Internet Control Message Protocol". Therefore the UDP variant was more robust.

These days, all gateways (routers) send ICMP TTL Exceeded messages about ICMP ECHO request packets (as specified in RFC1122, "Requirements for Internet Hosts -- Communication Layers", so more recent traceroute versions (such as Windows tracert) do indeed use ICMP probes, and newer Unix traceroute versions allow ICMP probes to be selected with the -I option.

Why increment the UDP port number?

Traceroute varies (increments) the UDP destination port number for each probe sent out, in order to reliably match ICMP TTL Exceeded messages to individual probes. Because the UDP ports occur right after the IP header, they can be relied on to be included in the "original packet" portion of the ICMP TTL Exceeded messages, even though the ICMP standards only mandate that the first eight octets following the IP header of the original packet be included in ICMP messages (it is allowed to send more though).

When ICMP ECHO requests are used, the probes can be disambiguated by using the sequence number field, which also happens to be located just before that 8-octet boundary.

Filters

Note that either or both of ICMP and UDP may be blocked by firewalls, so this must be taken into account when troubleshooting. As an alternative, one can often use traceroute variants such as lft or tcptraceroute to work around these filters, provided that the destination host has at least one TCP port open towards the source.

References

  • ftp://ftp.ee.lbl.gov/ - the original distribution site. The latest traceroute version found as of February 2006 is traceroute-1.4a12 from December 2000.
  • original announcement from the archives of the end2end-interest mailing list

-- SimonLeinen - 26 Feb 2006

Traceroute6

Traceroute6 uses the Hop-Limit field of the IPv6 protocol to elicit an ICMPv6 Time Exceeded ICMPv6 message from each gateway ("hop") along the path to some host. Just as with traceroute, it prints the route to the given destination and the RTT to each gateway/router.
The following are a list of possible errors that may appear after the RTT for a gateway (especially for OSes that use the KAME IPv6 network stack, such as the BSDs):

!N
No route to host
!P
Administratively prohibited (i.e. Blocked by a firewall, but the firewall issues an ICMPv6 message to the originating host to inform them of this)
!S
Not a Neighbour
!A
Address unreachable
!
The hop-limit is <= 1 on a Port Unreachable ICMPv6 message. This means that the packet got to its destination, but that the reply only had a hop-limit large enough that was just large enough to allow it to get back to the source of the traceroute6. This option was more interesting in IPv4, where bugs in some implementations of the IP stack could be identified by this behaviour.

Traceroute6 can also be specified to use ICMPv6 Echo messages to send the probe packets, instead of the default UDP probes, by specifying the -I flag when running the program. This may be useful in situations where UDP packets are blocked by a packet filter or firewall, while ICMP ECHO requests are permitted.

Note that on some systems, notably Sun's Solaris (see SolarisTraceroute), IPv6 functionality is integrated in the normal traceroute program.

-- TobyRodwell - 06 Apr 2005
-- SimonLeinen - 26 Feb 2006

Solaris traceroute (includes IPv6 functionality)

On Sun's Solaris, IPv6 traceroute functionality is included in the normal traceroute program, so a separate "traceroute6" isn't necessary. Sun's traceroute program has additional options to select between IPv4 and IPv6:

-A [inet|inet6]
Resolve hostnames to an IPv4 (inet) or IPv6 (inet6) address only

-a
Perform traceroutes to all addresses in all address families found for the given destination name.

Examples

Here are a few examples of the address-family selection switches for Solaris' traceroute. By default, IPv6 is preferred:

: leinen@diotima[leinen]; traceroute cemp1
traceroute: Warning: cemp1 has multiple addresses; using 2001:620:0:114:20b:cdff:fe1b:3d1a
traceroute: Warning: Multiple interfaces found; using 2001:620:0:4:203:baff:fe4c:d751 @ bge0:1
traceroute to cemp1 (2001:620:0:114:20b:cdff:fe1b:3d1a), 30 hops max, 60 byte packets
 1  swiLM1-V4.switch.ch (2001:620:0:4::1)  0.683 ms  0.757 ms  0.449 ms
 2  swiNM1-V610.switch.ch (2001:620:0:c047::1)  0.576 ms  0.576 ms  0.463 ms
 3  swiCS3-G3-3.switch.ch (2001:620:0:c046::1)  0.461 ms  0.334 ms  0.340 ms
 4  swiEZ2-P1.switch.ch (2001:620:0:c03f::2)  0.467 ms  0.332 ms  0.348 ms
 5  swiLS2-10GE-1-1.switch.ch (2001:620:0:c03c::1)  3.976 ms  3.825 ms  3.729 ms
 6  swiCE2-10GE-1-3.switch.ch (2001:620:0:c006::1)  4.817 ms  4.703 ms  4.740 ms
 7  cemp1-eth1.switch.ch (2001:620:0:114:20b:cdff:fe1b:3d1a)  4.583 ms  4.566 ms  4.590 ms

If IPv4 is desired, this can be selected using -A inet:

: leinen@diotima[leinen]; traceroute -A inet cemp1
traceroute to cemp1 (130.59.35.130), 30 hops max, 40 byte packets
 1  swiLM1-V4.switch.ch (130.59.4.1)  0.643 ms  0.539 ms  0.465 ms
 2  swiNM1-V610.switch.ch (130.59.15.229)  0.453 ms  0.553 ms  0.470 ms
 3  swiCS3-G3-3.switch.ch (130.59.15.238)  0.590 ms  0.426 ms  0.476 ms
 4  swiEZ2-P1.switch.ch (130.59.36.222)  0.463 ms  0.307 ms  0.352 ms
 5  swiLS2-10GE-1-1.switch.ch (130.59.36.205)  3.723 ms  3.755 ms  3.743 ms
 6  swiCE2-10GE-1-3.switch.ch (130.59.37.1)  4.677 ms  4.678 ms  4.690 ms
 7  cemp1-eth1.switch.ch (130.59.35.130)  5.028 ms  4.555 ms  4.567 ms

The -a switch can be used to trace to all addresses:

: leinen@diotima[leinen]; traceroute -a cemp1
traceroute: Warning: Multiple interfaces found; using 2001:620:0:4:203:baff:fe4c:d751 @ bge0:1
traceroute to cemp1 (2001:620:0:114:20b:cdff:fe1b:3d1a), 30 hops max, 60 byte packets
 1  swiLM1-V4.switch.ch (2001:620:0:4::1)  0.684 ms  0.515 ms  0.457 ms
 2  swiNM1-V610.switch.ch (2001:620:0:c047::1)  0.580 ms  0.848 ms  0.561 ms
 3  swiCS3-G3-3.switch.ch (2001:620:0:c046::1)  0.304 ms  0.428 ms  0.315 ms
 4  swiEZ2-P1.switch.ch (2001:620:0:c03f::2)  0.455 ms  0.516 ms  0.397 ms
 5  swiLS2-10GE-1-1.switch.ch (2001:620:0:c03c::1)  3.853 ms  3.826 ms  3.874 ms
 6  swiCE2-10GE-1-3.switch.ch (2001:620:0:c006::1)  5.071 ms  4.654 ms  4.702 ms
 7  cemp1-eth1.switch.ch (2001:620:0:114:20b:cdff:fe1b:3d1a)  4.581 ms  4.564 ms  4.585 ms

traceroute to cemp1 (130.59.35.130), 30 hops max, 40 byte packets
 1  swiLM1-V4.switch.ch (130.59.4.1)  0.677 ms  0.523 ms  0.716 ms
 2  swiNM1-V610.switch.ch (130.59.15.229)  0.462 ms  0.558 ms  0.470 ms
 3  swiCS3-G3-3.switch.ch (130.59.15.238)  0.340 ms  0.309 ms  0.352 ms
 4  swiEZ2-P1.switch.ch (130.59.36.222)  0.341 ms  0.307 ms  0.351 ms
 5  swiLS2-10GE-1-1.switch.ch (130.59.36.205)  3.722 ms  3.684 ms  3.719 ms
 6  swiCE2-10GE-1-3.switch.ch (130.59.37.1)  4.794 ms  4.695 ms  4.658 ms
 7  cemp1-eth1.switch.ch (130.59.35.130)  4.645 ms  4.653 ms  4.587 ms

-- SimonLeinen - 26 Feb 2006

NANOG Traceroute

Ehud Gavron maintains this version of traceroute. It is derived from the original traceroute program, but adds a few features such as AS (Autonomous System) number lookup, and detection of TOS (Type-of-Service) changes along the path.

TOS Change Detection

This feature can be used to check whether a network path is DSCP-transparent. Use the -t option to select the TOS byte, and watch out for TOS byte changes indicated with TOS=x!.

As shown in the following example, GEANT2 accepts TOS=32/DSCP=8/CLS=1 - this corresponds to "LBE"

: root@diotima[nanog]; ./traceroute -t 32 www.dfn.de.
traceroute to sirius.dfn.de (192.76.176.5), 30 hops max, 40 byte packets
1 swiLM1-V4.switch.ch (130.59.4.1) 1 ms 1 ms 0 ms
2 swiNM1-V610.switch.ch (130.59.15.229) 0 ms 1 ms 0 ms
3 swiCS3-G3-3.switch.ch (130.59.15.238) 0 ms 0 ms 0 ms
4 swiEZ2-P1.switch.ch (130.59.36.222) 0 ms 0 ms 0 ms
5 swiLS2-10GE-1-1.switch.ch (130.59.36.205) 4 ms 4 ms 4 ms
6 swiCE2-10GE-1-3.switch.ch (130.59.37.1) 5 ms 5 ms 5 ms
7 switch.rt1.gen.ch.geant2.net (62.40.124.21) 5 ms 5 ms 5 ms
8 so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) 13 ms 13 ms 13 ms
9 dfn-gw.rt1.fra.de.geant2.net (62.40.124.34) 13 ms 13 ms 13 ms
10 cr-berlin1-po1-0.x-win.dfn.de (188.1.18.53) 25 ms 25 ms 25 ms
11 ar-berlin1-ge6-1.x-win.dfn.de (188.1.20.34) 25 ms 25 ms 25 ms
12 zpl-gw.dfn.de (192.76.176.253) 25 ms 25 ms 25 ms
13 * ^C

On the other hand, TOS=64/DSCP=16/CLS=2 is rewritten to TOS=0 at hop 7 (the rewrite only shows up at hop 8):

: root@diotima[nanog]; ./traceroute -t 64 www.dfn.de.
traceroute to sirius.dfn.de (192.76.176.5), 30 hops max, 40 byte packets
1 swiLM1-V4.switch.ch (130.59.4.1) 2 ms 1 ms 0 ms
2 swiNM1-V610.switch.ch (130.59.15.229) 0 ms 0 ms 0 ms
3 swiCS3-G3-3.switch.ch (130.59.15.238) 0 ms 0 ms 0 ms
4 swiEZ2-P1.switch.ch (130.59.36.222) 0 ms 0 ms 0 ms
5 swiLS2-10GE-1-1.switch.ch (130.59.36.205) 4 ms 4 ms 4 ms
6 swiCE2-10GE-1-3.switch.ch (130.59.37.1) 5 ms 5 ms 5 ms
7 switch.rt1.gen.ch.geant2.net (62.40.124.21) 5 ms 5 ms 5 ms
8 so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22) 13 ms (TOS=0!) 13 ms 13 ms
9 dfn-gw.rt1.fra.de.geant2.net (62.40.124.34) 13 ms 13 ms 13 ms
10 cr-berlin1-po1-0.x-win.dfn.de (188.1.18.53) 25 ms 25 ms 25 ms
11 ar-berlin1-ge6-1.x-win.dfn.de (188.1.20.34) 25 ms 25 ms 25 ms
12 zpl-gw.dfn.de (192.76.176.253) 25 ms 25 ms 25 ms
13 * ^C

References

-- SimonLeinen - 26 Feb 2006

Windows tracert

All Windows versions that come with TCP/IP support include a tracert command, which is a simple traceroute client that uses ICMP probes. The (few) options are slightly different from the Unix variants of traceroute. In short, -d is used to suppress address-to-name resolution (-n in Unix traceroute), -h specifies the maximum hopcount (-m in the Unix version), -j is used to specify loose source routes (-g in Unix), and although the same -w option is used to specify the timeout, tracert interprets it in milliseconds, while in Unix it is specified in seconds.

References

-- SimonLeinen - 26 Feb 2006

TCP Traceroute

TCPTraceroute is a traceroute implementation that uses TCP packets instead of UDP or ICMP packets to send its probes. TCPtraceroute can be used in situations where a firewall blocks ICMP and UDP traffic. It is based on the "half-open scanning" technique that is used by NMAP, sending a TCP with the SYN flag set and waiting for a SYN/ACK (which indicates that something is listening on this port for connections). When it receives a response, the tcptraceroute program sends a packet with a RST flag to close the connection.

References

-- TobyRodwell - 06 Apr 2005

LFT (Layer Four Traceroute)

lft is a variant of traceroute that uses per default TCP and port 80 to go through packet-filter based firewalls.

Example

142:/home/andreu# lft -d 80 -m 1 -M 3 -a 5 -c 20 -t 1000 -H 30 -s 53 www.cisco.com     
Tracing _________________________________.
TTL  LFT trace to www.cisco.com (198.133.219.25):80/tcp
1   129.renater.fr (193.49.159.129) 0.5ms
2   gw1-renater.renater.fr (193.49.159.249) 0.4ms
3   nri-a-g13-0-50.cssi.renater.fr (193.51.182.6) 1.0ms
4   193.51.185.1 0.6ms
5   PO11-0.pascr1.Paris.opentransit.net (193.251.241.97) 7.0ms
6   level3-1.GW.opentransit.net (193.251.240.214) 0.8ms
7   ae-0-17.mp1.Paris1.Level3.net (212.73.240.97) 1.1ms
8   so-1-0-0.bbr2.London2.Level3.net (212.187.128.42) 10.6ms
9   as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106) 72.1ms
10   as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133) 158.7ms
11   ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9) 159.2ms
12   p1-0.cisco.bbnplanet.net (4.0.26.14) 159.4ms
13   sjck-dmzbb-gw1.cisco.com (128.107.239.9) 159.0ms
14   sjck-dmzdc-gw2.cisco.com (128.107.224.77) 159.1ms
15   [target] www.cisco.com (198.133.219.25):80 159.2ms

References

  • Home page: http://pwhois.org/lft/
  • Debian package: lft - display the route packets take to a network host/socket

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005
-- SimonLeinen - 21 May 2006

traceproto

Another traceroute variant which allows different protocols and port to be used. "It currently supports tcp, udp, and icmp traces with the possibility of others in the future." It comes with a wrapper script called HopWatcher, which can be used to quickly detect when a path has changed -

Example

142:/home/andreu# traceproto www.cisco.com
traceproto: trace to www.cisco.com (198.133.219.25), port 80
ttl  1:  ICMP Time Exceeded from 129.renater.fr (193.49.159.129)
        6.7040 ms       0.28100 ms      0.28600 ms
ttl  2:  ICMP Time Exceeded from gw1-renater.renater.fr (193.49.159.249)
        0.16900 ms      6.0140 ms       0.25500 ms
ttl  3:  ICMP Time Exceeded from nri-a-g13-0-50.cssi.renater.fr (193.51.182.6)
        6.8280 ms       0.58200 ms      0.52100 ms
ttl  4:  ICMP Time Exceeded from 193.51.185.1 (193.51.185.1)
        6.6400 ms       7.4230 ms       6.7690 ms
ttl  5:  ICMP Time Exceeded from PO11-0.pascr1.Paris.opentransit.net (193.251.241.97)
        0.58100 ms      0.64100 ms      0.54700 ms
ttl  6:  ICMP Time Exceeded from level3-1.GW.opentransit.net (193.251.240.214)
        6.9390 ms       0.62200 ms      6.8990 ms
ttl  7:  ICMP Time Exceeded from ae-0-17.mp1.Paris1.Level3.net (212.73.240.97)
        7.0790 ms       7.0250 ms       0.79400 ms
ttl  8:  ICMP Time Exceeded from so-1-0-0.bbr2.London2.Level3.net (212.187.128.42)
        10.362 ms       10.100 ms       16.384 ms
ttl  9:  ICMP Time Exceeded from as-0-0.bbr1.NewYork1.Level3.net (4.68.128.106)
        109.93 ms       78.367 ms       80.352 ms
ttl  10:  ICMP Time Exceeded from as-0-0.bbr1.SanJose1.Level3.net (64.159.1.133)
        156.61 ms       179.35 ms
          ICMP Time Exceeded from ae-0-0.bbr2.SanJose1.Level3.net (64.159.1.130)
        148.04 ms
ttl  11:  ICMP Time Exceeded from ge-7-0.ipcolo1.SanJose1.Level3.net (4.68.123.9)
        153.59 ms
         ICMP Time Exceeded from ge-11-0.ipcolo1.SanJose1.Level3.net (4.68.123.41)
        142.50 ms
         ICMP Time Exceeded from ge-7-1.ipcolo1.SanJose1.Level3.net (4.68.123.73)
        133.66 ms
ttl  12:  ICMP Time Exceeded from p1-0.cisco.bbnplanet.net (4.0.26.14)
        150.13 ms       191.24 ms       156.89 ms
ttl  13:  ICMP Time Exceeded from sjck-dmzbb-gw1.cisco.com (128.107.239.9)
        141.47 ms       147.98 ms       158.12 ms
ttl  14:  ICMP Time Exceeded from sjck-dmzdc-gw2.cisco.com (128.107.224.77)
        188.85 ms       148.17 ms       152.99 ms
ttl  15:no response     no response     

hop :  min   /  ave   /  max   :  # packets  :  # lost
      -------------------------------------------------------
  1 : 0.28100 / 2.4237 / 6.7040 :   3 packets :   0 lost
  2 : 0.16900 / 2.1460 / 6.0140 :   3 packets :   0 lost
  3 : 0.52100 / 2.6437 / 6.8280 :   3 packets :   0 lost
  4 : 6.6400 / 6.9440 / 7.4230 :   3 packets :   0 lost
  5 : 0.54700 / 0.58967 / 0.64100 :   3 packets :   0 lost
  6 : 0.62200 / 4.8200 / 6.9390 :   3 packets :   0 lost
  7 : 0.79400 / 4.9660 / 7.0790 :   3 packets :   0 lost
  8 : 10.100 / 12.282 / 16.384 :   3 packets :   0 lost
  9 : 78.367 / 89.550 / 109.93 :   3 packets :   0 lost
 10 : 148.04 / 161.33 / 179.35 :   3 packets :   0 lost
 11 : 133.66 / 143.25 / 153.59 :   3 packets :   0 lost
 12 : 150.13 / 166.09 / 191.24 :   3 packets :   0 lost
 13 : 141.47 / 149.19 / 158.12 :   3 packets :   0 lost
 14 : 148.17 / 163.34 / 188.85 :   3 packets :   0 lost
 15 : 0.0000 / 0.0000 / 0.0000 :   0 packets :   2 lost
     ------------------------Total--------------------------
total 0.0000 / 60.540 / 191.24 :  42 packets :   2 lost

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

MTR (Matt's TraceRoute)

mtr (Matt's TraceRoute) combines the functionality of the traceroute and ping programs in a single network diagnostic tool.

Example

mtr.png

References

-- FrancoisXavierAndreu & SimonMuyal - 06 Jun 2005

-- SimonLeinen - 06 May 2005 - 26 Feb 2006

NDT (Network Diagnostic Tool)

NDT can be used to check the TCP configuration of any host that can run Java applets. The client connects to a Web page containing a special applet. The Web server that serves the applet must run a kernel with the Web100 extensions for TCP measurement instrumentation. The applet performs TCP memory-to-memory tests between the client and the Web100-instrumented server, and then uses the measurements from the Web100 instrumentation to find out about the TCP configuration of the client. NDT also detects common configuration errors such as duplex mismatch.

Besides the applet, there is also a command-line client called web100clt that can be used without a browser. The client works on Linux without any need for Web100 extensions, and can be compiled on other Unix systems as well. It doesn't require a Web server on the server side, just the NDT server web100srv - which requires a Web100-enhanced kernel.

Applicability

Since NDT performs memory-to-memory tests, it avoids end-system bottlenecks such as file system or disk limitations. Therefore it is useful for estimating "pure" TCP throughput limitations for a system, better than, for instance, measuring the throughput of a large file retrieval from a public FTP or WWW server. In addition, NDT servers are supposed to be well-connected and lightly loaded.

When trying to tune a host for maximum throughput, it is a good idea to start testing it against an NDT server that is known to be relatively close to the host in question. Once the throughput with this server is satisfactory, one can try to further tune the host's TCP against a more distant NDT server. Several European NRENs as well as many sites in the U.S. operate NDT servers.

Example

Applet (tcp100bw.html)

This example shows the results (overview) of an NDT measurement between a client in Switzerland (Gigabit Ethernet connection) and an IPv6 NDT server in Ljubljana, Slovenia (Gigabit Ethernet connection).

Screen_shot_2011-07-25_at_18.11.44_.png

Command-line client

The following is an example run of the web100clt command-line client application. In this case, the client runs on a small Linux-based home router connected to a commercial broadband-over-cable Internet connection. The connection is marketed as "5000 kb/s downstream, 500 kb/s upstream", and it can be seen that this nominal bandwidth can in fact be achieved for an individual TCP connection.

$ ./web100clt -n ndt.switch.ch
Testing network path for configuration and performance problems  --  Using IPv6 address
Checking for Middleboxes . . . . . . . . . . . . . . . . . .  Done
checking for firewalls . . . . . . . . . . . . . . . . . . .  Done
running 10s outbound test (client to server) . . . . .  509.00 kb/s
running 10s inbound test (server to client) . . . . . . 5.07 Mb/s
Your host is connected to a Cable/DSL modem
Information [C2S]: Excessive packet queuing detected: 10.18% (local buffers)
Information [S2C]: Excessive packet queuing detected: 73.20% (local buffers)
Server 'ndt.switch.ch' is not behind a firewall. [Connection to the ephemeral port was successful]
Client is probably behind a firewall. [Connection to the ephemeral port failed]
Information: Network Middlebox is modifying MSS variable (changed to 1440)
Server IP addresses are preserved End-to-End
Client IP addresses are preserved End-to-End

Note the remark "-- Using IPv6 address". The example used the new version 3.3.12 of the NDT software, which includes support for both IPv4 and IPv6. Because both the client and the server support IPv6, the test is run over IPv6. To run an IPv4 test in this situation, one could have called the client with the -4 option, i.e. web100clt -4 -n ndt.switch.ch.

Public Servers

There are many public NDT server installations available on the Web. You should choose between the servers according to distance (in terms of RTT) and the throughput range you are interested in. The limits of your local connectivity can best be tested by connecting to a server with a fast(er) connection that is close by. If you want to tune your TCP parameters for "LFNs", use a server that is far away from you, but still reachable over a path without bottlenecks.

Development

Development of NDT is now hosted on Google Code, and as of July 2011 there is some activity. Notably, there is now documentation about the architecture as well as the detailed protocol or NDT. One of the stated goals is to allow outside parties to write compatible NDT clients without consulting the NDT source code.

References

-- SimonLeinen - 30 Jun 2005 - 22 Nov 2011

Active Measurement Tools

Active measurement injects traffic into a network to measure properties about that network.

  • ping - a simple RTT and loss measurement tool using ICMP ECHO messages
  • fping - ping variant supporting concurrent measurement to multiple destinations
  • SmokePing - nice graphical representation of RTT distribution and loss rate from periodic "pings"
  • OWAMP - a protocol and tool for one-way measurements

There are several permanent infrastructures performing active measurements:

  • HADES DFN-developed HADES Active Delay Evaluation System, deployed in GEANT/GEANT2 and DFN
  • RIPE TTM
  • QoSMetricsBoxes used in RENATER's Métrologie project

-- SimonLeinen - 29 Mar 2006

ping

Ping sends ICMP echo request messages to a remote host and waits for ICMP echo replies to determine the latency between those hosts. The output shows the Round Trip Time (RTT) between the host machine and the remote target. Ping is often used to determine whether a remote host is reachable. Unfortunately, it is quite common these days for ICMP traffic to be blocked by packet filters / firewalls, so a ping timing out does not necessarily mean that a host is unreachable.

(Another method for checking remote host availability is to telnet to a port that you know to be accessible; such as port 80 (HTTP) or 25 (SMTP). If the connection is still timing out, then the host is probably not reachable - often because of a filter/firewall silently blocking the traffic. When there is no service listening on that port, this usually results in a "Connection refused" message. If a connection is made to the remote host, then it can be ended by typing the escape character (usually 'CTRL ^ ]') then quit.)

The -f flag may be specified to send ping packets as fast as they come back, or 100 times per second - whichever is more frequent. As this option can be very hard on the network only a super-user (the root account on *NIX machines) is allowed to specified this flag in a ping command.

The -c flag specifies the number of Echo request messages sent to the remote host by ping. If this flag isn't used, then the ping continues to send echo request messages until the user types CTRL C. If the ping is cancelled after only a few messages have been sent, the RTT summary statistics that are displayed at the end of the ping output may not have finished being calculated and won't be completely accurate. To gain an accurate representation of the RTT, it is recommended to set a count of 100 pings. The MS Windows implementation of ping just sends 4 echo request messages by default.

Ping6 is the IPv6 implementation of the ping program. It works in the same way, but sends ICMPv6 Echo Request packets and waits for ICMPv6 Echo Reply packets to determine the RTT between two hosts. There are no discernable differences between the Unix implementations and the MS Windows implementation.

As with traceroute, Solaris integrates IPv4 and IPv6 functionality in a single ping binary, and provides options to select between them (-A <AF>).

See fping for a ping variant that supports concurrent measurements to multiple destinations and is easier to use in scripts.

Implementation-specific notes

-- TobyRodwell - 06 Apr 2005
-- MarcoMarletta - 15 Nov 2006 (Added the difference in size between Juniper and Cisco)

fping

fping is a ping like program which uses the Internet Control Message Protocol (ICMP) echo request to determine if a host is up. fping is different from ping in that you can specify any number of hosts on the command line, or specify a file containing the lists of hosts to ping. Instead of trying one host until it timeouts or replies, fping will send out a ping packet and move on to the next host in a round-robin fashion. If a host replies, it is noted and removed from the list of hosts to check. If a host does not respond within a certain time limit and/or retry limit it will be considered unreachable.

Unlike ping, fping is meant to be used in scripts and its output is easy to parse. It is often used as a probe for packet loss and round-trip time in SmokePing.

fping version 3

The original fping was written by Roland Schemers in 1992, and stopped being updated in 2002. In December 2011, David Schweikert decided to take up maintainership of fping, and increased the major release number to 3, mainly to reflect the change of maintainer. Changes from earlier versions include:

  • integration of all Debian patches, including IPv6 support (2.4b2-to-ipv6-16)
  • an optimized main loop that is claimed to bring performance close to the theoretical maximum
  • patches to improve SmokePing compatibility

The Web site location has changed to fping.org, and the source is now maintained on GitHub.

Since July 2013, there is also a Mailing List, which will be used to announce new releases.

Examples

Simple example of usage:

# fping -c 3 -s www.man.poznan.pl www.google.pl
www.google.pl     : [0], 96 bytes, 8.81 ms (8.81 avg, 0% loss)
www.man.poznan.pl : [0], 96 bytes, 37.7 ms (37.7 avg, 0% loss)
www.google.pl     : [1], 96 bytes, 8.80 ms (8.80 avg, 0% loss)
www.man.poznan.pl : [1], 96 bytes, 37.5 ms (37.6 avg, 0% loss)
www.google.pl     : [2], 96 bytes, 8.76 ms (8.79 avg, 0% loss)
www.man.poznan.pl : [2], 96 bytes, 37.5 ms (37.6 avg, 0% loss)

www.man.poznan.pl : xmt/rcv/%loss = 3/3/0%, min/avg/max = 37.5/37.6/37.7
www.google.pl     : xmt/rcv/%loss = 3/3/0%, min/avg/max = 8.76/8.79/8.81

       2 targets
       2 alive
       0 unreachable
       0 unknown addresses

       0 timeouts (waiting for response)
       6 ICMP Echos sent
       6 ICMP Echo Replies received
       0 other ICMP received

 8.76 ms (min round trip time)
 23.1 ms (avg round trip time)
 37.7 ms (max round trip time)
        2.039 sec (elapsed real time)

IPv6 Support

Jeroen Massar has added IPv6 support to fping. This has been implemented as a compile-time variant, so that there are separate fping (for IPv4) and fping6 (for IPv6) binaries. The IPv6 patch has been partially integrated into the fping version on www.fping.com as of release "2.4b2_to-ipv6" (thus also integrated in fping 3.0). Unfortunately his modifications to the build routine seem to have been lost in the integration, so that the fping.com version only installs the IPv6 version as fping. Jeroen's original version doesn't have this problem, and can be downloaded from his IPv6 Web page.

ICMP Sequence Number handling

Older versions of fping used the Sequence Number field in ICMP ECHO requests in a peculiar way: it used a different sequence number for each destination host, but used the same sequence number for all requests to a specific host. There have been reports of specific systems that suppress (or rate-limit) ICMP ECHO requests with repeated sequence numbers, which causes high loss rates reported from tools that use fping, such as SmokePing. Another issue is that fping could not distinguish a perfect link from one that drops every other packet and that duplicates every other.

Newer fping versions such as 3.0 or 2.4.2b2_to (on Debian GNU/Linux) include a change to sequence number handling attributed to Stephan Fuhrmann. These versions increment sequence numbers for every probe sent, which should solve both of these problems.

References

-- BartoszBelter - 2005-07-14 - 2005-07-26
-- SimonLeinen - 2008-05-19 - 2013-07-26

Packet Capture and Analysis Tools:

Detect protocol problems via the analysis of packets, trouble shooting

  • tcpdump: Packet capture tool based on the libpcap library. http://www.tcpdump.org/
  • snoop: tcpdump equivalent in Sun's Solaris operating system
  • Wireshark (formerly called Ethereal): Extended tcpdump with a user-friendly GUI. Its plugin architecture allowed many protocol-specific "dissectors" to be written. Also includes some analysis tools useful for performance debugging.
  • libtrace: A library for trace processing; works with libpcap (tcpdump) and other formats.
  • Netdude: The Network Dump data Displayer and Editor is a framework for inspection, analysis and manipulation of tcpdump trace files.
  • jnettop: Captures traffic coming across the host it is running on and displays streams sorted by the bandwidth they use.
  • tcptrace: a statistics/analysis tool for TCP sessions using tcpdump capture files
  • capturing packets with Endace DAG cards (dagsnap, dagconvert, ...)

General Hints for Taking Packet Traces

Capture enough data - you can always throw away stuff later. Note that tcpdump's default capture length is small (96 bytes), so use -s 0 (or something like -s 1540) if you are interested in payloads. Wireshark and snoop capture entire packets by default. Seemingly unrelated traffic can impact performance. For example, Web pages from foo.example.com may load slowly because of the images from adserver.example.net. In situations of high background traffic, it may however be necessary to filter out unrelated traffic.

It can be extremely useful to collect packet traces from multiple points in the network

  • On the endpoints of the communication
  • Near �suspicious� intermediate points such as firewalls

Synchronized clocks (e.g. by NTP) are very useful for matching traces.

Address-to-name resolution can slow down display and causes additional traffic which confuses the trace. With tcpdump, consider using -n or tracing to file (-w file).

Request remote packet traces

When a packet trace from a remote site is required, this often means having to ask someone at that site to provide it. When requesting such a trace, consider making this as easy as possible for the person having to do it. Try to use a packet tracing tool that is already available - tcpdump for most BSD or Linux systems, snoop for Solaris machines. Windows doesn't seem to come bundled with a packet capturing program, but you can direct the user to Wireshark, which is reasonably easy to install and use under Windows. Try to give clean indications on how to call the packet capture program. It is usually best to ask the user to capture to a file, and then send you the capture file as an e-mail attachment or so.

References

  • Presentation slides from a short talk about packet capturing techniques given at the network performance section of the January 2006 GEANT2 technical workshop

-- FrancoisXavierAndreu - 06 Jun 2005
-- SimonLeinen - 05 Jan 2006-09 Apr 2006
-- PekkaSavola - 26 Oct 2006

tcpdump

One of the early diagnostic tools for TCP/IP that was written by Van Jacobson, Craig Leres, and Steven McCanne. Tcpdump can be used to capture and decode packets in real time, or to capture packets to files (in "libpcap" format, see below), and analyze (decode) them later.

There are now more elaborate and, in some ways, user-friendly packet capturing programs, such as Wireshark (formerly called Ethereal), but tcpdump is widely available, widely used, so it is very useful to know how to use it.

Tcpdump/libpcap is still actively being maintained, although not by its original authors.

libpcap

Libpcap is a library that is used by tcpdump, and also names a file format for packet traces. This file format - usually used in files with the extension .pcap - is widely supported by packet capture and analysis tools.

Selected options

Some useful options to tcpdump include:

  • -s snaplen capture snaplen bytes of each frame. By default, tcpdump captures only the first 68 bytes, which is sufficient to capture IP/UDP/TCP/ICMP headers, but usually not payload or higher-level protocols. If you are interested in more than just headers, use -s 0 to capture packets without truncation.
  • -r filename read from an previously created capture file (see -w)
  • -w filename dump to a file instead of analyzing on-the-fly
  • -i interface capture on an interface other than the default (first "up" non-loopback interface). Under Linux, -i any can be used to capture on all interfaces, albeit with some restrictions.
  • -c count stop the capture after count packets
  • -n don't resolve addresses, port numbers, etc. to symbolic names - this avoids additional DNS traffic when analyzing a live capture.
  • -v verbose output
  • -vv more verbose output
  • -vvv even more verbose output

Also, a pflang expression can be appended to the command so as to filter the captured packets. An expression is made up of one or more of "type", "direction" and "protocol".

  • type Can be host, net or port. host is presumed unless otherwise speciified
  • dir Can be src, dst, src or dst or src and dst
  • proto (for protocol) Common types are ether, ip, tcp, udp, arp ... If none is specifiied then all protocols for which the value is valid are considered.
Example expressions:
  • dst host <address>
  • src host <address>
  • udp dst port <number>
  • host <host> and not port ftp and not port ftp-data

Usage examples

Capture a single (-c 1) udp packet to file test.pcap:

: root@diotima[tmp]; tcpdump -c 1 -w test.pcap udp
tcpdump: listening on bge0, link-type EN10MB (Ethernet), capture size 96 bytes
1 packets captured
3 packets received by filter
0 packets dropped by kernel

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.pcap
-rw-r--r-- 1 root root 114 2006-04-09 18:57 test.pcap
: root@diotima[tmp]; file test.pcap
test.pcap: tcpdump capture file (big-endian) - version 2.4 (Ethernet, capture length 96)

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; tcpdump -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: UDP, length: 12

Display the same capture file in verbose mode:

: root@diotima[tmp]; tcpdump -v -r test.pcap
reading from file test.pcap, link-type EN10MB (Ethernet)
18:57:28.732789 2001:630:241:204:211:43ff:fee1:9fe0.32832 > ff3e::beac.10000: [udp sum ok] UDP, length: 12 (len 20, hlim 118)

More examples with some advanced tcpdump use cases.

References

-- SimonLeinen - 2006-03-04 - 2016-03-17

Wireshark®

Wireshark is a packet capture/analysis tool, similar to tcpdump but much more elaborate. It has a graphical user interface (GUI) which allows "drilling down" into the header structure of captured packets. In addition, it has a "plugin" architecture that allows decoders ("dissectors" in Wireshark terminology) to be written with relative ease. This and the general user-friendliness of the tool has resulted in Wireshark supporting an abundance of network protocols - presumably writing Wireshark dissectors is often part of the work of developing/implementing new protocols. Lastly, Wireshark includes some nice graphical analysis/statictics tools such as much of the functionality of tcptrace and xplot.

One of the main attractions of Wireshark is that it works nicely under Microsoft Windows, although it requires a third-party library to implement the equivalent of the libpcap packet capture library.

Wireshark used to be called Ethereal™, but was renamed in June 2006, when its principal maintainer changed employers. Version 1.8 adds support for capturing on multiple interfaces in parallel, simplified management of decryption keys (for 802.11 WLANs and IPsec/ISAKMP), and geolocation for IPv6 addresses. Some TCP events (fast retransmits and TCP Window updates) are no longer flagged as warnings/errors. The default format for saving capture files is changing from pcap to pcap-ng. Wireshark 1.10, announced in June 2013, adds many features including Windows 8 support and response-time analysis for HTTP requests.

Usage examples

The following screenshot shows Ethereal 0.10.14 under Linux/Gnome when called as ethereal -r test.pcap, reading the single-packet example trace generated in the tcpdump example. The "data" portion of the UDP part of the packet has been selected by clicking on it in the middle pane, and the corresponding bytes are highlighted in the lower pane.

ethereal -r test.pcap screendump

tshark

The package includes a command-line tool called tshark, which can be used in a similar (but not quite compatible) way to tcpdump. Through complex command-line options, it can give access to some of the more advanced decoding functionality of Wireshark. Because it generates text, it can be used as part of analysis scripts.

Scripting

Wireshark can be extended using scripting languages. The Lua language has been supported for several years. Version 1.4.0 (released in September 2010) added preliminary support for Python as an extension language.

CloudShark

In a cute and possibly even useful application of tshark, QA Cafe (an IP testing solutions vendor) has put up "Wireshark as a Service" under www.cloudshark.org. This tool lets users upload packet dumps without registration, and provides the familiar Wireshark interface over the Web. Uploads are limited to 512 Kilobytes, and there are no guarantees about confidentiality of the data, so it should not be used on privacy-sensitive data.

References

-- SimonLeinen - 2006-03-04 - 2013-06-09

Solaris snoop

Sun's Solaris operating system includes a utility called snoop for taking and analyzing packet traces. While snoop is similar in spirit to tcpdump, its options, filter syntax, and stored file formats are all slightly different. The file format is described in the Informational RFC 1761, and supported by some other tools, notably Wireshark.

Selected options

Useful options in connection with snoop include:

  • -i filename read from capture file. Corresponds to tcpdump's -r option.
  • -o filename dump to file. Corresponds to tcpdump's -w option.
  • -d device capture on device rather than on the default (first non-loopback) interface. Corresponds to tcpdump's -i option.
  • -c count stop capture after capturing count packets (same as for tcpdump)
  • -r do not resolve addresses to names. Corresponds to tcpdump's -n option. This is useful for suppressing DNS traffic when looking at a live trace.
  • -V verbose summary mode - between (default) summary mode and verbose mode
  • -v (full) verbose mode

Usage examples

Capture a single (-c 1) udp packet to file test.snoop:

: root@diotima[tmp]; snoop -o test.snoop -c 1 udp
Using device /dev/bge0 (promiscuous mode)
1 1 packets captured

This produces a binary file containing the captured packet as well as a small file header and a timestamp:

: root@diotima[tmp]; ls -l test.snoop
-rw-r--r-- 1 root root 120 2006-04-09 18:27 test.snoop
: root@diotima[tmp]; file test.snoop
test.snoop:     Snoop capture file - version 2

Analyze the contents of the previously created capture file:

: root@diotima[tmp]; snoop -i test.snoop
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Display the same capture file in "verbose summary" mode:

: root@diotima[tmp]; snoop -V -i test.snoop
________________________________
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac ETHER Type=86DD (IPv6), size = 74 bytes
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac IPv6  S=2001:690:1fff:1661::3 D=ff1e::1:f00d:beac LEN=20 HOPS=116 CLASS=0x0 FLOW=0x0
  1   0.00000 2001:690:1fff:1661::3 -> ff1e::1:f00d:beac UDP D=10000 S=32820 LEN=20

Finally, in full verbose mode:

: root@diotima[tmp]; snoop -v -i test.snoop
ETHER:  ----- Ether Header -----
ETHER:
ETHER:  Packet 1 arrived at 18:27:16.81233
ETHER:  Packet size = 74 bytes
ETHER:  Destination = 33:33:f0:d:be:ac, (multicast)
ETHER:  Source      = 0:a:f3:32:56:0,
ETHER:  Ethertype = 86DD (IPv6)
ETHER:
IPv6:   ----- IPv6 Header -----
IPv6:
IPv6:   Version = 6
IPv6:   Traffic Class = 0
IPv6:   Flow label = 0x0
IPv6:   Payload length = 20
IPv6:   Next Header = 17 (UDP)
IPv6:   Hop Limit = 116
IPv6:   Source address = 2001:690:1fff:1661::3
IPv6:   Destination address = ff1e::1:f00d:beac
IPv6:
UDP:  ----- UDP Header -----
UDP:
UDP:  Source port = 32820
UDP:  Destination port = 10000
UDP:  Length = 20
UDP:  Checksum = 0635
UDP:

Note: the captured packet is an IPv6 multicast packet from a "beacon" system. Such traffic forms the majority of the background load on our office LAN at the moment.

References

  • RFC 1761, Snoop Version 2 Packet Capture File Format, B. Callaghan, R. Gilligan, February 1995
  • snoop(1) manual page for Solaris 10, April 2004, docs.sun.com

-- SimonLeinen - 09 Apr 2006

Web100 Linux Kernel Extensions

The Web100 project was run by PSC (Pittsburgh Supercomputing Center), the NCAA and NCSA. It was funded by the US National Science Foundation (NSF) between 2000 and 2003, although development and maintenance work extended well beyond the period of NSF funding. Its thrust was to close the "wizard gap" between what performance should be possible on modern research networks and what most users of these networks actually experience. The project focused on instrumentation for TCP to measure performance and find possible bottlenecks that limit TCP throughput. In addition, it included some work on "auto-tuning" of TCP buffer settings.

Most implementation work was done for Linux, and most of the auto-tuning code is now actually included in the mainline Linux kernel code (as of 2.6.17). The TCP kernel instrumentation is available as a patch from http://www.web100.org/, and usually tracks the latest "official" Linux kernel release pretty closely.

An interesting application of Web100 is NDT, which can be used from any Java-enabled browser to detect bottlenecks in TCP configurations and network paths, as well as duplex mismatches using active TCP tests against a Web100-enabled server.

In September 2010, the NSF agreed to fund a follow-on project called Web10G.

TCP Kernel Information Set (KIS)

A central component of Web100 is a set of "instruments" that permits the monitoring of many statistics about TCP connections (sockets) in the kernel. In the Linux implementation, these instruments are accessible through the proc filesystem.

TCP Extended Statistics MIB (TCP-ESTATS-MIB, RFC 4898)

The TCP-ESTATS-MIB (RFC 4898) includes a similar set of instruments, for access through SNMP. It has been implemented by Microsoft for the Vista operating system and later versions of Windows. In the Windows Server 2008 SDK, a tool called TcpAnalyzer.exe can be used to look at statistics of open TCP connections. IBM is also said to have an implementation of this MIB.

"Userland" Tools

Besides the kernel extension, Web100 comprises a small set of user-level tools which provide access to the TCP KIS. These tools include

  1. a libweb100 library written in C
  2. the command-line tools readvar, deltavar, writevar, and readall
  3. a set of GTK+-based GUI (graphical user interface) tools under the gutil command.

gutil

When started, gutil shows a small main panel with an entry field for specifying a TCP connection, and several graphical buttons for starting different tools on a connection once one has been selected.

gutil: main panel

The TCP connection can be chosen either by explicitly specifying its endpoints, or by selecting from a list of connections (using double-click):

gutil: TCP connection selection dialog

Once a connection of interest has been selected, a number of actions are possible. The "List" action provides a list of all kernel instruments for the connection. The list is updated every second, and "delta" values are displayed for those variables that have changed.

gutil: variable listing

Another action is "Display", which provides a graphical display of a KIS variable. The following screenshot shows the display of the DataBytesIn variable of an SSH connection.

gutil: graphical variable display

Related Work

Microsoft's Windows Software Development Kit for Server 2008 and .NET Version 3.5 contains a tool called TcpAnalyzer.exe, which is similar to gutil and uses Microsoft's RFC 4898 implementation.

The SIFTR module for FreeBSD can be used for similar applications, namely to understand what is happening inside TCP at fine granularity. Sun's DTrace would be an alternative on systems that support it, provided they offer suitable probes for the relevant actions and events within TCP. Both SIFTR and DTrace have very different user interfaces to Web100.

References

-- SimonLeinen - 27 Feb 2006 - 25 Mar 2011
-- ChrisWelti - 12 Jan 2010

End-System (Host) Tuning

This section contains hints for tuning end systems (hosts) for maximum performance.

References

-- SimonLeinen - 31 Oct 2004 - 23 Jul 2009
-- PekkaSavola - 17 Nov 2006

Operating-System Specific Configuration Hints

This topic points to configuration hints for specific operating systems.

Note that the main perspective of these hints is how to achieve good performance in a big bandwidth-delay-product environment, typically with a single stream. See the ServerScaling topic for some guidance on tuning servers for many concurrent connections.

-- SimonLeinen - 27 Oct 2004 - 02 Nov 2008
-- PekkaSavola - 02 Oct 2008

OS-Specific Configuration Hints: BSD Variants

Performance tuning

By default, earlier BSD versions use very modest buffer sizes and don't enable Window Scaling by default. See the references below how to do so.

In contrast, FreeBSD 7.0 introduced TCP buffer auto-tuning, and thus should provide good TCP performance out of the box even over LFNs. This release also implements large-send offload (LSO) and large-receive offload (LRO) for some Ethernet adapters. FreeBSD 7.0 also announces the following in its presentation of technological advances:

10Gbps network optimization: With optimized device drivers from all major 10gbps network vendors, FreeBSD 7.0 has seen extensive optimization of the network stack for high performance workloads, including auto-scaling socket buffers, TCP Segment Offload (TSO), Large Receive Offload (LRO), direct network stack dispatch, and load balancing of TCP/IP workloads over multiple CPUs on supporting 10gbps cards or when multiple network interfaces are in use simultaneously. Full vendor support is available from Chelsio, Intel, Myricom, and Neterion.

Recent TCP Work

The FreeBSD Foundation has granted a project "Improvements to the FreeBSD TCP Stack" to Lawrence Stewart at Swinburne University. Goals for this project include support for Appropriate Byte Counting (ABC, RFC 3465), merging SIFTR into the FreeBSD codebase, and improving the implementation of the reassembly queue. Information is available on http://caia.swin.edu.au/urp/newtcp/. Other improvements done by this group are support for modular congestion control, implementations of CUBIC, H-TCP, TCP Vegas, the SIFTR TCP implementation tool, and a testing framework including improvements to iperf for better buffer size control.

The CUBIC implementation for FreeBSD was announced on the ICCRG mailing list in September 2009. Currently available as patches for the 7.0 and 8.0 kernels, it is planned to be merged "into the mainline FreeBSD source tree in the not too distant future".

In February 2010, a set of Software for FreeBSD TCP R&D was announced by the Swinburne group: This includes modular TCP congestion control, Hamilton TCP (H-TCP), a newer "Hamilton Delay" Congestion Control Algorithm v0.1, Vegas Congestion Control Algorithm v0.1, as well as a kernel helper/hook framework ("Khelp") and a module (ERTT) to improve RTT estimation.

Another release in August 2010 added a "CAIA-Hamilton Delay" congestion control algorithm as well as revised versions of the other components.

QoS tools

On BSD systems ALTQ implements a couple of queueing/scheduling algorithms for network interfaces, as well as some other QoS mechanisms.

To use ALTQ on a FreeBSD 5.x or 6.x box, the following are the necessary steps:

  1. build a kernel with ALTQ
    • options ALTQ and some others beginning with ALTQ_ should be added to the kernel config. Please refer to the ALTQ(4) man page.
  2. define QoS settings in /etc/pf.conf
  3. use pfctl to apply those settings
    • Set pf_enable to YES in /etc/rc.conf (set as well the other variables related to pf according to your needs) in order to apply the QoS settings every time the host boots.

References

-- SimonLeinen - 27 Jan 2005 - 18 Aug 2010
-- PekkaSavola - 17 Nov 2006

Linux-Specific Network Performance Tuning Hints

Linux has its own implementation of the TCP/IP Stack. With recent kernel versions, the TCP/IP implementation contains many useful performance features. Parameters can be controlled via the /proc interface, or using the sysctl mechanism. Note that although some of these parameters have ipv4 in their names, they apply equally to TCP over IPv6.

A typical configuration for high TCP throughput over paths with high bandwidth*delay product would include the following in /etc/sysctl.conf:

A description of each parameter listed below can be found in section Linux IP Parameters.

Basic tuning

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Since 2.6.17 kernel, buffers have sensible automatically calculated values for most uses. Unless very high RTT, loss or performance requirement (200+ Mbit/s) is present, buffer settings may not need to be tuned at all.

Nonetheless, the following values may be used:

net/core/rmem_max=16777216
net/core/wmem_max=16777216
net/ipv4/tcp_rmem="8192 87380 16777216"
net/ipv4/tcp_wmem="8192 65536 16777216"

With kernel < 2.4.27 or < 2.6.7, receive-side autotuning may not be implemented, and the default (middle value) should be increased (at the cost of higher, by-default memory consumption):

net/ipv4/tcp_rmem="8192 16777216 16777216"

NOTE: If you have a server with hundreds of connections, you might not want to use a large default value for TCP buffers, as memory may quickly run out smile

There is a subtle but important implementation detail in the socket buffer management of Linux. When setting either the send- or receive buffer sizes via the SO_SNDBUF and SO_RCVBUF socket options via setsockopt(2), the value passed in the system call is doubled by the kernel to accomodate buffer management overhead. Reading the values back with getsockopt(2) return this modified value, but the effective buffer available to TCP payload is still the original value.

The values net/core/rmem and net/core/wmem apply to the argument to setsockopt(2).

In contrast, the maximum values of net/ipv4/tcp_rmem=/=net/ipv4/tcp_wmem apply to the total buffer sizes including the factor of 2 for the buffer management overhead. As a consequence, those values must be chosen twice as large as required by a particular BandwidthDelayProduct. Also note taht the values net/core/rmem and net/core/wmem do not apply to the TCP autotuning mechanism.

Interface queue lengths

InterfaceQueueLength describes how to adjust interface transmit and receive queue lengths. This tuning is typically needed with GE or 10GE transfers.

Host/adapter architecture implications

When going for 300 Mbit/s performance, it is worth verifying that host architecture (e.g., PCI bus) is fast enough. PCI Express is usually fast enough to no longer be the bottleneck in 1Gb/s and even 10Gb/s applications.

For the older PCI/PCI-X buses, when going for 2+ Gbit/s performance, the Maximum Memory Read Byte Count (MMRBC) usually needs to be increased using setpci.

Many network adapters support features such as checksum offload. In some cases, however, these may even decrease performance. In particular, TCP Segment Offload may need to be disabled, with:

ethtool -K eth0 tso off

Advanced tuning

Sharing congestion information across connections/hosts

2.4 series kernels have a TCP/IP weakness in that their interface buffers' maximum window size is based on the experience of previous connections - if you have loss at any point (or a bad end host at the same route) you limit your future TCP connections. So, you may have to flush the route cache to improve performance.

net.ipv4.route.flush=1

2.6 kernels also remember some performance characteristics across connections. In benchmarks and other tests, this might not be desirable.

# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save=1

Other TCP performance variables

If there is packet reordering in the network, reordering could end up being interpreted as a packet loss too easily. Increasing tcp_reordering parameter might help in that case:

net/ipv4/tcp_reordering=20   # (default=3)

Several variables already have good default values, but it may make sense to check that these defaults haven't been changed:

net/ipv4/tcp_timestamps=1
net/ipv4/tcp_window_scaling=1
net/ipv4/tcp_sack=1
net/ipv4/tcp_moderate_rcvbuf=1

TCP Congestion Control algorithms

Linux 2.6.13 introduced pluggable congestion modules, which allows you to select one of the high-speed TCP congestion control variants, e.g. CUBIC

net/ipv4/tcp_congestion_control = cubic

Alternative values include highspeed (HS-TCP), scalable (Scalable TCP), htcp (Hamilton TCP), bic (BIC), reno ("Reno" TCP), and westwood (TCP Westwood).

Note that on Linux 2.6.19 and later, CUBIC is already used as the default algorithm.

Web100 kernel tuning

If you are using a web100 kernel, the following parameters seem to improve networking performance even further:

# web100 tuning
# turn off using txqueuelen as part of congestion window computation
net/ipv4/WAD_IFQ = 1

QoS tools

Modern Linux kernels have flexible traffic shaping built in.

See the Linux traffic shaping example for an illustration of how these mechanisms can be used to solve a real performance problem.

References

See also the reference to Congestion Control in Linux in the FlowControl topic.

-- SimonLeinen - 27 Oct 2004 - 23 Jul 2009
-- ChrisWelti - 27 Jan 2005
-- PekkaSavola - 17 Nov 2006
-- AlexGall - 28 Nov 2016

Interface queue lengths

txqueuelen

The txqueuelen parameter of an interface in the Linux kernel. It limits the number of packets in the transmission queue in the interface's device driver. In 2.6 series and e.g., RHEL3 (2.4.21) kernel, the default is 1000. In earlier kernels, the default is 100. '100' is often too low to support line-rate transfers over Gigabit Ethernet interfaces, and in some cases, even '1000' is too low.

For Gigabit Ethernet interfaces, it is suggested to use at least a txqueuelen of 1000. (values of up to 8000 have been used successfully to further improve performance), e.g.,

ifconfig eth0 txqueuelen 1000

If a host is low performance or has slow links, having too big txqueuelen may disturb interactive performance.

netdev_max_backlog

The sysctl net.core.netdev_max_backlog defines the queue sizes for received packets. In recent kernels (like 2.6.18), the default is 1000; in older ones, it is 300. If the interface receives packets (e.g., in a burst) faster than the kernel can process them, this could overflow. A value in the order of thousands should be reasonable for GE, tens of thousands for 10GE.

For example,

net/core/netdev_max_backlog=2500

References

-- SimonLeinen - 06 Jan 2005
-- PekkaSavola - 17 Nov 2006

OS-Specific Configuration Hints: Mac OS X

As Mac OS X is mainly a BSD derivative, you can use similar mechanisms to tune the TCP stack - see under BsdOSSpecific.

TCP Socket Buffer Tuning

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

For testing temporary improvements, you can directly use sysctl in a terminal window: (you have to be root to do that)

sysctl -w kern.ipc.maxsockbuf=8388608
sysctl -w net.inet.tcp.rfc1323=1
sysctl -w net.inet.tcp.sendspace=1048576
sysctl -w net.inet.tcp.recvspace=1048576
sysctl -w kern.maxfiles=65536
sysctl -w net.inet.udp.recvspace=147456
sysctl -w net.inet.udp.maxdgram=57344
sysctl -w net.local.stream.recvspace=65535
sysctl -w net.local.stream.sendspace=65535

For permanent changes that last over a reboot, insert the appropriate configurations into Ltt>/etc/sysctl.conf. If this file does not exist must create it. So, for the above, just add the following lines to sysctl.conf:

kern.ipc.maxsockbuf=8388608
net.inet.tcp.rfc1323=1
net.inet.tcp.sendspace=1048576
net.inet.tcp.recvspace=1048576
kern.maxfiles=65536
net.inet.udp.recvspace=147456
net.inet.udp.maxdgram=57344
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535

Note This only works for OSX 10.3 or later! For earlier versions you need to use /etc/rc where you can enter whole sysctl commands.

Users that are unfamiliar with terminal windows can also use the GUI tool "TinkerTool System" and use its Network Tuning option to set the TCP buffers.

TinkerTool System is available from:

-- ChrisWelti - 30 Jun 2005

OS-Specific Configuration Hints: Solaris (Sun Microsystems)

Planned Features

Pluggable Congestion Control for TCP and SCTP

A proposed OpenSolaris project foresees the implementation of pluggable congestion control for both TCP and SCTP. HS-TCP and several other congestion control algorithms for OpenSolaris. This includes implementation of the HighSpeed, CUBIC, Westwood+, and Vegas congestion control algorithms, as well as ipadm subcommands and socket options to get and set congestion control parameters.

On 15 December 2009, Artem Kachitchkine posted an initial draft of a work-in-progress design specification for this feature has been announced on the OpenSolaris networking-discuss forum. According to this proposal, the API for setting the congestion control mechanism for a specific TCP socket will be compatible with Linux: There will be a TCP_CONGESTION socket option to set and retrieve a socket's congestion control algorithm, as well as a TCP_INFO socket option to retrieve various kinds of information about the current congestion control state.

The entire plugabble-congestion control mechanism will be implemented for SCTP in addition to TCP. For example, there will also be an SCTP_CONGESTION socket option. Note that congestion control in SCTP is somewhat trickier than in TCP, because a single SCTP socket can have multiple underlying paths through SCTP's "multi-homing" feature. Congestion control state must be kept separately for each path (address pair). This also means that there is no direct SCTP equivalent to TCP_INFO. The current proposal adds a subset of TCP_INFO's information to the result of the existing SCTP_GET_PEER_ADDR_INFO option for getsockopt().

The internal structure of the code will be somewhat different to what is in the Linux kernel. In particular, the general TCP code will only make calls to the algorithm-specific congestion control modules, not vice versa. The proposed Solaris mechanism also contains ipadm properties that can be used to set the default congestion control algorithm either globally or for a specific zone. The proposal also suggests "observability" features; for example, pfiles output should include the congestion algorithm used for a socket, and there are new kstat statistics that count certain congestion-control events.

Useful Features

TCP Multidata Transmit (MDT, aka LSO)

Solaris 10, and Solaris 9 with patches, supports TCP Multidata Transmit (MDT), which is Sun's name for (software-only) Large Send Offload (LSO). In Solaris 10, this is enabled by default, but in Solaris 9 (with the required patches for MDT support), the kernel and driver have to be reconfigured to be able to use MDT. See the following pointers for more information from docs.sun.com:

Solaris 10 "FireEngine"

The TCP/IP stack in Solaris 10 has been largely rewritten from previous versions, mostly to improve performance. In particular, it supports Interrupt Coalescence, integrates TCP and IP more closely in the kernel, and provides multiprocessing enhancements to distribute work more efficiently over multiple processors. Ongoing work includes UDP/IP integration for better performance of UDP applications, and a new driver architecture that can make use of flow classification capabilities in modern network adapters.

Solaris 10: New Network Device Driver Architecture

Solaris 10 introduces GLDv3 (project "Nemo"), a new driver architecture that generally improves performance, and adds support for several performance features. Some, but not all, Ethernet device drivers were ported over to the new architecture and benefit from those improvements. Notably, the bge driver was ported early, and the new "Neptune" adapters ("multithreaded" dual-port 10GE and four-port GigE with on-board connection de-multiplexing hardware) used it from the start.

Darren Reed has posted a small C program that lists the active acceleration features for a given interface. Here's some sample output:

$ sudo ./ifcapability
lo0 inet
bge0 inet +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1) +POLL
lo0 inet6
bge0 inet6 +HCKSUM(version=1 +full +ipv4hdr) +ZEROCOPY(version=1 flags=0x1)

Displaying and setting link parameters with dladm

Another OpenSolaris project called Brussels unifies many aspects of network driver configuration under the dladm command. For example, link MTUs (for "Jumbo Frames") can be configured using

dladm set-linkprop -p mtu=9000 web1

The command can also be used to look at current physical parameters of interfaces:

$ sudo dladm show-phys
LINK         MEDIA                STATE      SPEED DUPLEX   DEVICE
bge0         Ethernet             up         1000 full      bge0

Note that Brussels is still being integrated into Solaris. Driver support was added since SXCE (Solaris Express Community Edition) build 83 for some types of adapters. Eventually this should be integrated into regular Solaris releases.

Setting TCP buffers

# To increase the maximum tcp window
# Rule-of-thumb: max_buf = 2 x cwnd_max (congestion window)
ndd -set /dev/tcp tcp_max_buf 4194304
ndd -set /dev/tcp tcp_cwnd_max 2097152

# To increase the DEFAULT tcp window size
ndd -set /dev/tcp tcp_xmit_hiwat 65536
ndd -set /dev/tcp tcp_recv_hiwat 65536

Pitfall when using asymmetric send and receive buffers

The documented default behaviour (tunable TCP parameter tcp_wscale_always = 0) of Solaris is to include the TCP window scaling option in an initial SYN packet when either the send or the receive buffer is larger than 64KiB. From the tcp(7P) man page:

          For all applications, use ndd(1M) to modify the  confi-
          guration      parameter      tcp_wscale_always.      If
          tcp_wscale_always is set to 1, the window scale  option
          will  always be set when connecting to a remote system.
          If tcp_wscale_always is 0, the window scale option will
          be set only if the user has requested a send or receive
          window  larger  than  64K.   The   default   value   of
          tcp_wscale_always is 0.

However, Solaris 8, 9 and 10 do not take the send window into account. This results in an unexpected behaviour for a bulk transfer from node A to node B when the bandwidth-delay product is larger than 64KiB and

  • A's receive buffer (tcp_recv_hiwat) < 64KiB
  • A's send buffer (tcp_xmit_hiwat) > 64KiB
  • B's receive buffer > 64KiB

A will not advertize the window scaling option and B will not do so either according to RFC 1323. As a consequence, throughput will be limited by a congestion window of 64KiB.

As a workaround, the window scaling option can be forcibly advertised by setting

# ndd -set /dev/tcp tcp_wscale_always 1

A bug report has been filed with Sun Microsystems.

References

-- ChrisWelti - 11 Oct 2005, added section for setting default and maximum TCP buffers
-- AlexGall - 26 Aug 2005, added section on pitfall with asymmetric buffers
-- SimonLeinen - 27 Jan 2005 - 16 Dec 2009

Windows-Specific Host Tuning

Next Generation TCP/IP Stack in Windows Vista / Windows Server 2008 / Windows 7

According to http://www.microsoft.com/technet/community/columns/cableguy/cg0905.mspx, the new version of Windows, Vista, features a redesigned TCP/IP stack. Besides unifying IPv4/IPv6 dual-stack support, this new stack also claims much-improved performance for high-speed and asymmetric networks, as well as several auto-tuning features.

The Microsoft Windows Vista Operating System enables the TCP Window Scaling option by default (previous Windows OSes had this option disabled). This causes problems with various middleboxes, see WindowScalingProblems. As a consequence, the scaling factor is limited to 2 for HTTP traffic in Vista.

Another new feature of Vista is Compound TCP, a high-performance TCP variant that uses delay information to adapt its transmission rate (in addition to loss information). On Windows Server 2008 it is enabled by default, but it is disabled in Vista. You can enable it in Vista by running the following command as an Administrator on the command line.

netsh interface tcp set global congestionprovider=ctcp
(see CompoundTCP for more information).

Although the new TCP/IP-Stack has integrated auto-tuning for receive buffers, the TCP send buffer still seems to be limited to 16KB by default. This means you will not be able to get good upload rates for connections with higher RTTs. However, using the registry hack (see further below) you can manually change the default value to something higher.

Performance tuning for earlier Windows versions

Enabling TCP Window Scaling

The references below detail the various "whys" and "hows" to tune Windows network performance. Of particular note however is the setting of scalable TCP receive windows (see LargeTcpWindows).

It appears that, by default, not only does Windows not support TCP 1323 scalable windows, but the required key is not even in the Windows registry. The key (Tcp1323Opts) can be added to one of two places, depending on the particular system:

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP] (Windows'95)
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters] (Windows 2000/XP)

Value Name: Tcp1323Opts
Data Type: REG_DWORD (DWORD Value)
Value Data: 0, 1, 2 or 3

  • 0 = disable RFC 1323 options
  • 1 = window scale enabled only
  • 2 = time stamps enabled only
  • 3 = both options enabled

This can also be adjusted using DrTCP GUI tool (see link below)

TCP Buffer Sizes

See the EndSystemTcpBufferSizing topic for general information about sizing TCP buffers.

Buffers for TCP Send Windows

Inquiry at Microsoft (thanks to Larry Dunn) has revealed that the default send window is 8KB and that there is no official support for configuring a system-wide default. However, the current Winsock implementation uses the following undocumented registry key for this purpose

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\AFD\Parameters]

Value Name: DefaultSendWindow
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. The maximum value is unknown.

According to Microsoft, this parameter may not be supported in future Winsock releases. However, this parameter is confirmed to be working with Windows XP, Windows 2000, Windows Vista and even the Windows 7 Beta 1.

This value needs to be manually adjusted (e.g., DrTCP can't do it), or the application needs to set it (e.g., iperf with '-l' option). It may be difficult to detect this as a bottleneck because the host will advertise large windows but will only have about 8.5KB of data in flight.

Buffers for TCP Receive Windows

(From this TechNet article

Assuming Window Scaling is enabled, then the following parameter can be set to between 1 and 1073741823 bytes

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize]

Value Name: TcpWindowSize
Data Type: REG_DWORD (DWORD Value)
Value Data: The window size in bytes. 1 to 65535 (or 1 to 1073741823 if Window scaling set).

DrTCP can also be used to adjust this parameter.

Windows XP SP2/SP3 simultaneous connection limit

Windows XP SP2 and SP3 limits the number of simultaneous TCP connection attempts (so called half-open state) to 10 (by default). How exactly it's doing this is not clear, but this may come up with some kind of applications. Especially file sharing applications like bittorrent might suffer from this behaviour. You can see yourself if your performance is affected by looking into your event logs:

EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts

A patcher has been developed to adjust this limit.

References

-- TobyRodwell - 28 Feb 2005 (initial version)
-- SimonLeinen - 06 Apr 2005 (removed dead pointers; added new ones; descriptions)
-- AlexGall - 17 Jun 2005 (added send window information)
-- HankNussbacher - 19 Jun 2005 (added new MS webcast for 2003 Server)
-- SimonLeinen - 12 Sep 2005 (added information and pointer about new TCP/IP in Vista)
-- HankNussbacher - 22 Oct 2006 (added Cisco IOS firewall upgrade)
-- AlexGall - 31 Aug 2007 (replaced IOS issue with a link to WindowScalingProblems, added info for scaling limitation in Vista)
-- SimonLeinen - 04 Nov 2007 (information on how to enable Compound TCP on Vista)
-- PekkaSavola - 05 Jun 2008 major reorganization to improve the clarity wrt send buffer adjustment, add a link to DrTCP; add XP SP3 connection limiting discussion -- ChrisWelti - 02 Feb 2009 added new information regarding the send window in windows vista / server 2008 / 7

Network Adapter and Driver Issues

One aspect that causes many performance problems is adapater and NIC compatability issues (full vs half duplex as an example. The document "Troubleshooting Cisco Catalyst Switches to NIC Compatibility Issues" from Cisco covers many vendor NICs.

Performance Impacts of Host Architecture

The host architecture has impact on performance especially with higher rates. Important factors include:

Performance-Friendly Adapter and Driver Features

There are several different techniques that enhance network performance by moving critical functions to the network adapter. Those techniques typically (but with the notable exception of GSO) require both special hardware support in the adapter and support at the device driver level to make use of that special hardware.

References

-- SimonLeinen - 2005-02-02 - 2015-01-28
-- PekkaSavola - 2006-11-15

-- SimonLeinen - 09 Apr 2006

Edit | Attach | Watch | Print version | History: r6 < r5 < r4 < r3 < r2 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r6 - 2017-01-31 - KurtBaumann
 
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2004-2009 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.