Tuneable Linux Kernel IP Parameters

See the TCP Man Pages and Oscar Andreasson's Ipsysctl Tutorial for more information on Linux Kernel parameters.

Changes in newer Linux kernel versions

Recent Linux kernel versions have changed significantly with respect to these parameters; in particular, many aspects of TCP buffer auto-tuning have been implemented. As of Linux 2.6.17, auto-tuning is implemented for both the send and receive directions, which has made it possible to raise the default buffer size limits dramatically. So if you run a 2.6.17 or newer kernel, it should be possible to achieve very decent TCP throughput over networks with large bandwidth*delay products without tuning any of these parameters.
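To make "large bandwidth*delay product" concrete, the following minimal sketch computes the product in bytes, which is roughly the amount of data that must be in flight (and therefore buffered) to keep a path full. The function name and the example path (1 Gbit/s, 100 ms round-trip time) are illustrative assumptions, not values from this page.

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Return the bandwidth*delay product in bytes.

    bandwidth_bits_per_s: path capacity in bits per second
    rtt_ms:               round-trip time in milliseconds
    """
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes.
    return bandwidth_bits_per_s * rtt_ms // 8000


if __name__ == "__main__":
    # Hypothetical path: 1 Gbit/s with a 100 ms round-trip time.
    print(bdp_bytes(1_000_000_000, 100))  # 12500000 bytes, about 12.5 MB
```

A result well above the maximum buffer sizes described below suggests that a single connection cannot fill the path unless those limits are raised.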

NOTE! If the application manually selects the socket buffer size, buffer auto-tuning is automatically disabled for that connection. In some cases, manual selection can now cause worse (burstier) performance, especially if there is congestion-related packet loss!
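For illustration, this is what "manually selecting the socket buffer size" looks like from an application, sketched here with Python's standard socket module. The helper names and the 64 KB figure are assumptions made for the example. On Linux, socket(7) documents that the kernel doubles the requested value to allow for bookkeeping overhead and caps the request at net/core/rmem_max, so the value read back usually differs from the value set.

```python
import socket


def open_socket_with_fixed_rcvbuf(size_bytes: int) -> socket.socket:
    """Create a TCP socket with an explicitly requested receive buffer.

    Setting SO_RCVBUF is the manual selection described above: the
    receive buffer of this socket is no longer auto-tuned by the kernel.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock


def effective_rcvbuf(sock: socket.socket) -> int:
    """Return the receive buffer size the kernel actually applied."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    s = open_socket_with_fixed_rcvbuf(64 * 1024)  # hypothetical size
    print("requested 65536, kernel reports", effective_rcvbuf(s))
    s.close()
```

An application that wants auto-tuning should simply not call setsockopt() with SO_RCVBUF or SO_SNDBUF.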

Description of individual parameters

net/core/rmem_default

The default receive buffer size for all sockets (for TCP sockets, overridden by the default value in net/ipv4/tcp_rmem)

net/core/wmem_default

The default send buffer size for all sockets (for TCP sockets, overridden by the default value in net/ipv4/tcp_wmem)

net/core/rmem_max

The maximum receive buffer size for all sockets (not overridden by net/ipv4/tcp_rmem)

net/core/wmem_max

The maximum send buffer size for all sockets (not overridden by net/ipv4/tcp_wmem)

net/core/netdev_max_backlog

The maximum number of socket buffers (a socket buffer is the kernel's internal representation of a packet) that may be queued on the input side when an interface receives packets faster than the kernel can process them. The queue is drained during a softirq (soft interrupt), which is when the protocol handlers are run. See Gianluca Insolvibile's Inside the Linux Kernel for a more in-depth description.
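The parameter names on this page map directly to files under /proc/sys, so the current values can be inspected without any special tools. The sketch below assumes a Linux host; the helper names are made up for the example, and the simple dot-to-slash conversion is only meant for names like the ones listed here.

```python
from pathlib import Path
from typing import List

PROC_SYS = Path("/proc/sys")


def sysctl_path(name: str) -> Path:
    """Map 'net/core/rmem_max' or 'net.core.rmem_max' to its /proc/sys file."""
    return PROC_SYS / name.replace(".", "/")


def read_sysctl(name: str) -> List[int]:
    """Read a numeric sysctl; vector parameters yield several integers."""
    return [int(field) for field in sysctl_path(name).read_text().split()]


if __name__ == "__main__":
    for name in (
        "net/core/rmem_default",
        "net/core/wmem_default",
        "net/core/rmem_max",
        "net/core/wmem_max",
        "net/core/netdev_max_backlog",
    ):
        print(name, read_sysctl(name))
```

Reading these files normally requires no privileges; changing them requires root.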

net/ipv4/tcp_mem

from the TCP Man Pages
This is a vector of 3 integers: (low, pressure, high). These bounds are used by TCP to track its memory usage. The defaults are calculated at boot time from the amount of available memory.

  • low - TCP doesn't regulate its memory allocation when the number of pages it has allocated globally is below this number.

  • pressure - when the amount of memory allocated by TCP exceeds this number of pages, TCP moderates its memory consumption. This memory pressure state is exited once the number of pages allocated falls below the low mark.

  • high - the maximum number of pages, globally, that TCP will allocate. This value overrides any other limits imposed by the kernel.
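Because tcp_mem is expressed in pages while the other buffer parameters are in bytes, converting it is a common first step when comparing the two. The following sketch does the conversion; the function name and the sample values in the usage section are hypothetical, and the page size is taken from the running system.

```python
import os
from typing import Dict


def tcp_mem_bytes(tcp_mem_line: str, page_size: int) -> Dict[str, int]:
    """Convert the (low, pressure, high) page counts of tcp_mem to bytes.

    tcp_mem_line: contents of net/ipv4/tcp_mem, e.g. "100 200 300"
    page_size:    size of a memory page in bytes
    """
    low, pressure, high = (int(field) for field in tcp_mem_line.split())
    return {
        "low": low * page_size,
        "pressure": pressure * page_size,
        "high": high * page_size,
    }


if __name__ == "__main__":
    page_size = os.sysconf("SC_PAGE_SIZE")
    # Hypothetical tcp_mem contents; real defaults are computed at boot time.
    print(tcp_mem_bytes("100 200 300", page_size))
```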

net/ipv4/tcp_rmem

from the TCP Man Pages
This is a vector of 3 integers: (min, default, max). These parameters are used by TCP to regulate receive buffer sizes. TCP dynamically adjusts the size of the receive buffer from the defaults listed below, in the range of these sysctl variables, depending on memory available in the system.

  • min - minimum size of the receive buffer used by each TCP socket. The default value is 4K, and is lowered to PAGE_SIZE bytes in low memory systems. This value is used to ensure that in memory pressure mode, allocations below this size will still succeed. This is not used to bound the size of the receive buffer declared using SO_RCVBUF on a socket.

  • default - the default size of the receive buffer for a TCP socket. This value overwrites the initial default buffer size from the generic global net.core.rmem_default defined for all protocols. The default value is 87380 bytes, and is lowered to 43689 in low memory systems. If larger receive buffer sizes are desired, this value should be increased (to affect all sockets). To employ large TCP windows, the sysctl variable net.ipv4.tcp_window_scaling must be enabled (default).

  • max - the maximum size of the receive buffer used by each TCP socket. This value does not override the global net.core.rmem_max. This is not used to limit the size of the receive buffer declared using SO_RCVBUF on a socket. The default value of 87380*2 bytes is lowered to 87380 in low memory systems.

net/ipv4/tcp_wmem

from the TCP Man Pages
This is a vector of 3 integers: (min, default, max). These parameters are used by TCP to regulate send buffer sizes. TCP dynamically adjusts the size of the send buffer from the default values listed below, in the range of these sysctl variables, depending on memory available.

  • min - minimum size of the send buffer used by each TCP socket. The default value is 4K bytes. This value is used to ensure that in memory pressure mode, allocations below this size will still succeed. This is not used to bound the size of the send buffer declared using SO_SNDBUF on a socket.

  • default - the default size of the send buffer for a TCP socket. This value overwrites the initial default buffer size from the generic global net.core.wmem_default defined for all protocols. The default value is 16K bytes. If larger send buffer sizes are desired, this value should be increased (to affect all sockets). To employ large TCP windows, the sysctl variable net.ipv4.tcp_window_scaling must be enabled (default).

  • max - the maximum size of the send buffer used by each TCP socket. This value does not override the global net.core.wmem_max. This is not used to limit the size of the send buffer declared using SO_SNDBUF on a socket. The default value is 128K bytes. It is lowered to 64K depending on the memory available in the system.
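As a rough check of whether auto-tuning has enough headroom for a given path, the max field of tcp_rmem or tcp_wmem can be compared with the bandwidth*delay product. The sketch below parses the three-integer vector and performs that comparison. The class and function names are assumptions for the example, and the check ignores the kernel's per-buffer bookkeeping overhead, so passing it is necessary rather than sufficient.

```python
from typing import NamedTuple


class TcpBufLimits(NamedTuple):
    """The (min, default, max) vector of tcp_rmem or tcp_wmem, in bytes."""
    minimum: int
    default: int
    maximum: int


def parse_limits(line: str) -> TcpBufLimits:
    """Parse the contents of net/ipv4/tcp_rmem or net/ipv4/tcp_wmem."""
    minimum, default, maximum = (int(field) for field in line.split())
    return TcpBufLimits(minimum, default, maximum)


def autotuning_can_reach(limits: TcpBufLimits, target_bytes: int) -> bool:
    """True if an auto-tuned buffer may grow to at least target_bytes."""
    return limits.maximum >= target_bytes


if __name__ == "__main__":
    # The tcp_rmem defaults quoted above: 4K, 87380 and 87380*2 bytes.
    rmem = parse_limits("4096 87380 174760")
    # Hypothetical target: 12.5 MB, the product of 1 Gbit/s and 100 ms.
    print(autotuning_can_reach(rmem, 12_500_000))  # False
```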

net/ipv4/tcp_sack

The tcp_sack variable enables Selective Acknowledgements (SACK) as they are defined in RFC 2018 - TCP Selective Acknowledgement Options and RFC 2883 - An Extension to Selective Acknowledgement (SACK) Option for TCP. These RFC documents contain information on a TCP option that was especially developed to handle lossy connections.

If this variable is turned on, our host will set the SACK option in the TCP option field in the TCP header when it sends out a SYN packet. This tells the server we are connecting to that we are able to handle SACK. From then on, if the server knows how to handle SACK, it will send ACK packets with the SACK option turned on. This option selectively acknowledges each segment in a TCP window. This is especially good on very lossy connections (connections that lose a lot of data in the transfer), since it makes it possible to retransmit only the specific parts of the TCP window that were lost, and not the whole TCP window as the old standards told us to do. This means that if a certain segment of a TCP window is not received, the receiver will not return a SACK for that segment. The sender will then know which packets were not received by the receiver, and will hence retransmit those packets.

For redundancy, this option will fill up all available space within the option space, 40 bytes per segment. Each SACK'ed packet takes up two 32-bit unsigned integers, and hence the option space can contain 4 SACK'ed segments. However, normally the timestamp option is used in conjunction with this option. The timestamp option takes up 10 bytes of data, and hence only 3 segments may be SACK'ed in each packet in normal operation. ...

The tcp_sack option takes a boolean value. It is set to 1 (turned on) by default. This is generally a good idea and should cause no problems.
from Oscar Andreasson's Ipsysctl Tutorial
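The SACK block counts quoted above follow from simple arithmetic on the 40-byte TCP option space, sketched below. The 2-byte option header (kind and length) comes from RFC 2018 rather than from the quoted text, and the constant and function names are chosen for this example only.

```python
TCP_MAX_OPTION_BYTES = 40    # total option space in a TCP header
SACK_HEADER_BYTES = 2        # kind + length fields (RFC 2018)
SACK_BLOCK_BYTES = 8         # two 32-bit sequence numbers per block
TIMESTAMP_OPTION_BYTES = 10  # timestamp option size quoted above


def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """Return how many SACK blocks fit beside other TCP options."""
    free = TCP_MAX_OPTION_BYTES - other_option_bytes - SACK_HEADER_BYTES
    return max(free // SACK_BLOCK_BYTES, 0)


if __name__ == "__main__":
    print(max_sack_blocks())                        # 4 blocks, SACK alone
    print(max_sack_blocks(TIMESTAMP_OPTION_BYTES))  # 3 blocks with timestamps
```

The result with timestamps stays at 3 even if the timestamp option is padded to 12 bytes for alignment.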

-- TobyRodwell - 17 Feb 2006
-- SimonLeinen - 09 Aug 2006
-- PekkaSavola - 25 Oct 2006

Topic revision: r6 - 2006-10-25 - PekkaSavola
 