12/13/2012

Why does Linux enforce a minimum MTU of 552?


Today I want to write about an interesting question that once came up at university.

What's an MTU?


The Maximum Transmission Unit (MTU) is the maximum size, in bytes, of a network layer packet (protocols like IP) that fits into a single data link layer frame (protocols like IEEE 802.3 / Ethernet and 802.11 / Wi-Fi). The MTU depends on the layer 2 protocol in use. For Ethernet it is 1500, for PPPoE (used by many DSL providers) typically 1492.

Each host knows the MTU values of its own links. But think about a host, connected to a router, that wants to send data to an Internet server. It knows the MTU of the link to the router, say 1500, but how could it know the so-called Path MTU (PMTU), which is the minimum MTU of all the links used to reach the server? Initially it cannot. If it sends a 1500-byte IP packet to the router and the router is connected to a service provider via PPPoE, the router will not be able to forward it as it is, since the next link's MTU is lower than the packet size.

In IPv4 there are two possible outcomes of such a situation, depending on the Don't Fragment (DF) bit in the header. If it is not set, the router will split the packet and forward it in two fragments, each of which fits the next link's MTU. The sender will not notice the problem and will continue to send packets that are too large and need to be split and reassembled on their way. If the Don't Fragment bit is set, the router will not fragment the packet, but discard it and send an ICMP error message ("fragmentation needed and DF set") to the sender. The ICMP message contains the next link's MTU, so that the sender learns which size it may send. Of course there may be even lower values for some links on the rest of the path, which means more error messages and more than two tries until the sender knows the path's minimum MTU and the packet finally reaches its destination. This is basically what's called Path MTU Discovery, described in RFC 1191 for IPv4.
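The discovery loop described above can be sketched in a few lines of Python. This is only a toy model, not a real network stack: the function names `forward` and `discover_path_mtu` and the representation of a path as a list of link MTUs are my own assumptions for illustration.

```python
from typing import List, Optional, Tuple


def forward(packet_size: int, link_mtus: List[int],
            dont_fragment: bool = True) -> Optional[int]:
    """Send one packet along the path (toy model).

    Returns None if the packet reaches the destination. Otherwise returns
    the MTU of the first link that is too small, which is the value the
    router would report in its ICMP error message.
    """
    for mtu in link_mtus:
        if packet_size > mtu and dont_fragment:
            return mtu  # router discards the packet and reports this MTU
        # Without DF the router fragments silently and forwarding continues.
    return None


def discover_path_mtu(link_mtus: List[int]) -> Tuple[int, int]:
    """Return (discovered path MTU, number of packets sent)."""
    estimate = link_mtus[0]  # initially the sender only knows its own link
    attempts = 0
    while True:
        attempts += 1
        reported = forward(estimate, link_mtus)
        if reported is None:
            return estimate, attempts
        estimate = reported  # retry with the size the router reported
```

For a path with link MTUs of 1500, 1492 and 1400, `discover_path_mtu([1500, 1492, 1400])` returns `(1400, 3)`: two packets are rejected before the third one fits every link.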

The Linux implementation


In practice, IP implementations cache the PMTU whenever they receive an ICMP error message stating that it needs to be lowered. On UNIX operating systems you can see the cached values with, for example, ip route list cache. In Linux there's an interesting detail that you'll see whenever the PMTU is below 552. If you set a router's MTU to 500 and try to send a larger IP packet through it, it will, of course, send an ICMP error message stating that the next hop's MTU is 500. But Linux will not store the value 500. It stores 552 instead.

Is this a problem? It may be, although an MTU of 552 is rather low. As I already mentioned, most of the Internet has PMTUs of 1500 or 1492, so it's definitely not a problem in most cases. Still, RFC 791 defines a lower bound of 68 for IPv4, so it's actually a broken IPv4 implementation (in fact you can change . Given a PMTU between 68 and 552, which is perfectly valid, Linux will ignore the reported value and always assume 552. The host may then send packets that are larger than the actual PMTU but no larger than 552. These trigger further error messages, which have no effect, since Linux still believes that the PMTU is 552 and that the packet should get through.
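To make the effect concrete, here is a small Python model of the behaviour as I described it. It is a sketch under assumptions, not kernel code: `cache_pmtu` and `sends_until_delivery` are hypothetical names, and the model assumes the sender keeps the Don't Fragment bit set on every retry.

```python
from typing import Optional

MIN_PMTU = 552  # the lower bound Linux applies, as discussed in this post


def cache_pmtu(reported_mtu: int, min_pmtu: int = MIN_PMTU) -> int:
    """Value stored in the cache after an ICMP error reporting reported_mtu."""
    return max(reported_mtu, min_pmtu)


def sends_until_delivery(packet_size: int, real_pmtu: int,
                         min_pmtu: int = MIN_PMTU,
                         max_tries: int = 5) -> Optional[int]:
    """Number of sends until a packet fits, or None if it never does.

    Assumption: DF stays set, so a packet that is too large is always dropped.
    """
    size = packet_size
    for attempt in range(1, max_tries + 1):
        if size <= real_pmtu:
            return attempt  # packet fits and is delivered
        # The ICMP error arrives, but the cached value is clamped.
        size = min(size, cache_pmtu(real_pmtu, min_pmtu))
    return None  # still too large after max_tries sends
```

With a real PMTU of 1492 the second send succeeds. With a real PMTU of 500 the sender in this model stays stuck at 552 and never gets through, while a lower bound of 68 would let the second send succeed.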

Reality check: Are there really PMTUs below 552? The honest answer is no. You can always set an arbitrarily low MTU value, but there's no reason to do so. You may find references to SLIP or PPP connections over V.34 voiceband modems (28.8 to 33.6 kbit/s) that did not work reliably with an MTU larger than 296. Hence, dip(8) sets an MTU of 296 by default. But that's of no interest anymore today. What may apply to some legacy sites is X.25's MTU of 576, which would still be able to transport 552 bytes.

Digging into it


Nevertheless one may ask: Why does Linux restrict the PMTU at all? It may not be a problem, but is it of any use? And what's 552?

RFC 791 states that conforming IPv4 implementations must be prepared to accept datagrams of up to 576 octets, whole or in fragments. That means a 552-byte IP packet has to be handled correctly by any IP implementation. Maybe there's a connection, but what's that difference of 24 bytes? And inferring a minimum PMTU from the smallest datagram size that every host must accept, a datagram that may well arrive fragmented into multiple frames, does not make any sense at all.
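A quick calculation shows why the 576-octet requirement says nothing about link MTUs. The following Python sketch splits a datagram the way IPv4 fragmentation does, assuming a 20-byte header without options on every fragment. The function name `fragment_sizes` is hypothetical.

```python
from typing import List


def fragment_sizes(datagram_size: int, mtu: int,
                   header_len: int = 20) -> List[int]:
    """Sizes (header included) of the fragments of one IPv4 datagram.

    Assumes every fragment carries a header of header_len bytes.
    """
    if datagram_size <= mtu:
        return [datagram_size]  # fits as it is, no fragmentation needed
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")
    payload = datagram_size - header_len
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(header_len + chunk)
        payload -= chunk
    return sizes
```

A 576-byte datagram crossing a link with the minimum MTU of 68 ends up as 12 fragments under these assumptions: eleven of 68 bytes and a last one of 48 bytes. The receiving host still has to accept the reassembled 576 bytes, but no link on the path needs an MTU anywhere near that.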

The consequence of IPv4's requirement is that most IP implementations assume a default path MTU of 576, meaning it's neither too high for any of the involved hosts' IP stacks (at least if they are standard-compliant), nor lower than necessary. If the assumption is wrong (and the DF flag is set), the host will receive an ICMP message and may proceed accordingly.

Where's 576 coming from? The RFC states:
The number 576 is selected to allow a reasonable sized data block to
be transmitted in addition to the required header information. For
example, this size allows a data block of 512 octets plus 64 header
octets to fit in a datagram. The maximal internet header is 60
octets, and a typical internet header is 20 octets, allowing a
margin for headers of higher level protocols.
This leads to the term Maximum Segment Size (MSS), which denotes exactly the "data block" that's mentioned in the RFC, or: the transport protocol's maximum payload size. The MSS and MTU values are not directly connected, but if you want to avoid fragmentation, for example, the MSS may be at most the MTU minus the size of the network and transport protocol headers. RFC 879 from November 1983 describes some ways of calculating reasonable MSS values for TCP. It suggests taking the MTU and subtracting 40, i.e. 20 bytes each for the minimal (and most common) IP and TCP headers. This is indeed often applied in practice.

576 - 40 is 536, which would be a reasonable default MSS following this logic. This is in fact the default MSS in many operating systems, including Windows (only those versions that include a TCP/IP implementation, of course) and the commercial UNIX tree, e.g. Solaris and HP-UX. BSD, on the other hand, adopted a more "convenient" value of 512 as default MSS. AIX seems to have switched from 536 in v3 to 512 in v4.
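The rule of thumb is simple enough to write down as a Python sketch. The function name `mss_for_mtu` and the constants are my own, and the defaults assume headers without any options.

```python
IP_HEADER_MIN = 20   # IPv4 header without options
TCP_HEADER_MIN = 20  # TCP header without options


def mss_for_mtu(mtu: int, ip_header: int = IP_HEADER_MIN,
                tcp_header: int = TCP_HEADER_MIN) -> int:
    """Largest TCP payload that fits into one unfragmented packet."""
    return mtu - ip_header - tcp_header
```

`mss_for_mtu(576)` gives 536, the default mentioned above. For the link types from the beginning of this post, Ethernet (1500) yields 1460 and PPPoE (1492) yields 1452.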

Taking the lower value of 512 as default MSS is no problem, since it will fit into the default maximum IP packet size of 576 even better. Most probably there will be 40 additional bytes for the IP and TCP headers leading to 512 + 20 + 20 = 552. That's Linux's minimum PMTU! Indeed: Taking a look at the source code, you will find it at the top of net/ipv4/route.c, where it's defined as:
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;

Wait! What?


Let's sum it up: IP implementations are required to handle packets of up to 576 bytes. Many implementations take that as the default maximum packet size. Subtracting the lengths of the shortest possible IP and TCP headers, you get a default MSS of 536, or 512 if you want a more "convenient" value. Data of that size together with a TCP and an IP header should form an IP packet not larger than the default maximum size and what every host needs to be able to handle.

So, 512 / 536 bytes always fit on the wire and we can assume a minimum MTU of 552 / 576? Certainly not. The IPv4 RFC states that a packet of size 576 must be supported, but not that it must not be fragmented. Indeed it says "or in fragments" and explicitly defines a minimum MTU of 68. Although the RFC is quite clear on this, it seems to be a frequent source of misinterpretation. IBM's documentation, for example, states:
TCP in AIX defaults to a maximum segment size (MSS) of 512 bytes. This conservative value is based on a requirement that all IP routers support an MTU of at least 576 bytes
To find out the reason for Linux's minimum PMTU, I had a look at the Linux Cross Reference and quickly found that ip_rt_min_pmtu was introduced in Linux 2.3.15 in summer 1999. Sadly that's way before Git and even before BitKeeper, so there's nothing like a commit with an author and a comment. Back then, Linux patches were sent via mailing lists, but I could not find any related messages in the Linux Kernel Mailing List Archive. So I wrote to the netdev mailing list (which did not exist back then either), but I did not get a useful response from any of the developers.

There has to be a reason


Although the misinterpretation of the RFC seems obvious, this does not make much sense. The changes in Linux 2.3.15 specifically implement the limitation without any other pieces. The IPv4 implementation was already there. Why would somebody just add a constraint if there's no hard reason for it? And if it's really just a mistake, how could it be that, with so many absolute experts working on Linux networking, nobody has figured it out and corrected it for more than 13 years now?

The best I could find is a thread called MTU and 2.4.x kernel, started Feb 14, 2001. There, Alan Cox mentions:
Our handling of DF on syn frames is also broken due to that misassumption, but fortunately only for crazy mtus like 70.
And Alexey Kuznetsov points out:
It stops to work even earlier: at mtu<128.
It is strict limit. Pardon, discussing marginal cases is useless.
If someone has device with mtu of 128, let him to put it back to the place, where he found it.
So there's obviously an issue with MTUs smaller than 128 (if it has not been resolved yet). But that's not 552.

In fact, I was not able to find out why min_pmtu exists. Please leave a comment if you know something interesting about it.

Comments:

  1. I believe that 512 was chosen as being a nice number 2^9, and hence it still preserves nice overhead to data ratio, approx. 93% efficiency of the packet. So 512+20IP+20TCP = 552+24(for eventual IP/TCP options)=576, hence if you set your MTU to 576, you will most likely be able to forward a 512byte block of data across multiple link's without being fragmented. That is why it is a default, to avoid fragmentation....

  2. 576 - 40 is 536, which would be a reasonable default MSS following this logic. This is in fact the default MSS in many operating systems, including Windows
    *********************
    Completely wrong.

    http://packetlife.net/blog/2008/nov/5/mtu-manipulation/
    When a TCP client initiates a connection to a server, it includes its MSS as an option in the first (SYN) packet. On an Ethernet interface, this value is typically 1460 (1500 byte Ethernet MTU - 20 byte IP header - 20 byte TCP header).
