9. Queueing Disciplines for Bandwidth Management
9.7. The Intermediate queueing device (IMQ)
The IMQ iptables target is valid in the PREROUTING and POSTROUTING chains of the mangle table.
Its syntax is
IMQ [ --todev n ]
n : number of imq device
An ip6tables target is also provided.
Please note traffic is not enqueued when the target is hit but afterwards. The exact location where traffic enters the imq device depends on the direction of the traffic (in/out). These are the predefined netfilter hooks used by iptables:
enum nf_ip_hook_priorities {
        NF_IP_PRI_FIRST = INT_MIN,
        NF_IP_PRI_CONNTRACK = -200,
        NF_IP_PRI_MANGLE = -150,
        NF_IP_PRI_NAT_DST = -100,
        NF_IP_PRI_FILTER = 0,
        NF_IP_PRI_NAT_SRC = 100,
        NF_IP_PRI_LAST = INT_MAX,
};
For ingress traffic, imq registers itself with NF_IP_PRI_MANGLE + 1 priority which means packets enter the imq device directly after the mangle PREROUTING chain has been passed.
For egress imq uses NF_IP_PRI_LAST which honours the fact that packets dropped by the filter table won’t occupy bandwidth.
The patches and some more information can be found at the imq site (http://luxik.cdi.cz/~patrick/imq/).
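As a rough sketch of how the pieces above fit together, the following sends all traffic arriving on one interface through imq device 0 and shapes it there. The interface name eth0, the HTB qdisc, the class 1:20 and the 1mbit rate are assumptions chosen for illustration, and the whole thing only works on a kernel and iptables built with the IMQ patches. By default the commands are only printed; run the script as root with RUN= (empty) in the environment to execute them.

```shell
#!/bin/sh
# Sketch only: requires the out-of-tree IMQ kernel and iptables patches.
# RUN defaults to "echo" (dry run); set RUN to the empty string to execute.
RUN=${RUN-echo}

imq_ingress_setup() {
    dev=$1      # interface whose incoming traffic is shaped (assumed: eth0)
    rate=$2     # rate of the default class (assumed value, e.g. 1mbit)

    # Attach a classful qdisc to the imq device, as to any real interface.
    $RUN tc qdisc add dev imq0 root handle 1: htb default 20
    $RUN tc class add dev imq0 parent 1: classid 1:20 htb rate "$rate"

    # Send everything arriving on $dev through imq device number 0.
    $RUN iptables -t mangle -A PREROUTING -i "$dev" -j IMQ --todev 0

    # The imq device has to be brought up before it passes traffic.
    $RUN ip link set imq0 up
}

imq_ingress_setup eth0 1mbit
```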
Chapter 10. Load sharing over multiple interfaces
There are several ways of doing this. One of the easiest and most straightforward ways is ’TEQL’ - "True" (or "trivial") link equalizer. Like most things having to do with queueing, load sharing goes both ways. Both ends of a link may need to participate for full effect.
Imagine this situation:
                 +---+   eth1   +---+
                 |   |==========|   |
 ’network 1’ ----| A |          | B |---- ’network 2’
                 |   |==========|   |
                 +---+   eth2   +---+
A and B are routers, and for the moment we’ll assume both run Linux. If traffic is going from network 1 to network 2, router A needs to distribute the packets over both links to B. Router B needs to be configured to accept this. The same goes the other way around: when packets go from network 2 to network 1, router B needs to send the packets over both eth1 and eth2.
The distributing part is done by a ’TEQL’ device, like this (it couldn’t be easier):
# tc qdisc add dev eth1 root teql0
# tc qdisc add dev eth2 root teql0
# ip link set dev teql0 up
Don’t forget the ’ip link set up’ command!
This needs to be done on both hosts. The device teql0 is basically a round-robin distributor over eth1 and eth2, for sending packets. No data ever comes in over a teql device; it just appears on the ’raw’ eth1 and eth2.
But now we just have devices, we also need proper routing. One way to do this is to assign a /31 network to both links, and a /31 to the teql0 device as well:
On router A:
# ip addr add dev eth1 10.0.0.0/31
# ip addr add dev eth2 10.0.0.2/31
# ip addr add dev teql0 10.0.0.4/31
On router B:
# ip addr add dev eth1 10.0.0.1/31
# ip addr add dev eth2 10.0.0.3/31
# ip addr add dev teql0 10.0.0.5/31
Router A should now be able to ping 10.0.0.1, 10.0.0.3 and 10.0.0.5 over the 2 real links and the 1 equalized device. Router B should be able to ping 10.0.0.0, 10.0.0.2 and 10.0.0.4 over the links.
If this works, Router A should make 10.0.0.5 its route for reaching network 2, and Router B should make 10.0.0.4 its route for reaching network 1. For the special case where network 1 is your network at home, and network 2 is the Internet, Router A should make 10.0.0.5 its default gateway.
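The routing step just described could look like the sketch below. The prefix 192.168.1.0/24 for network 1 is a made-up example, since the text does not name the networks; substitute your own. As written the commands are only printed; run with RUN= (empty) in the environment, as root, to apply them.

```shell
#!/bin/sh
# Dry run by default: RUN=echo prints the commands instead of running them.
RUN=${RUN-echo}

teql_route() {
    dest=$1     # prefix to reach through the bundle, or "default"
    peer=$2     # teql0 address of the router at the other end
    $RUN ip route add "$dest" via "$peer" dev teql0
}

# On router A, when network 2 is the Internet:
teql_route default 10.0.0.5

# On router B, assuming network 1 is 192.168.1.0/24 (hypothetical prefix):
teql_route 192.168.1.0/24 10.0.0.4
```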
10.1. Caveats
Nothing is as easy as it seems. eth1 and eth2 on both router A and B need to have reverse path filtering turned off, because they will otherwise drop packets destined for IP addresses other than their own:
# echo 0 > /proc/sys/net/ipv4/conf/eth1/rp_filter
# echo 0 > /proc/sys/net/ipv4/conf/eth2/rp_filter
Then there is the nasty problem of packet reordering. Let’s say 6 packets need to be sent from A to B - eth1 might get 1, 3 and 5. eth2 would then do 2, 4 and 6. In an ideal world, router B would receive this in order, 1, 2, 3, 4, 5, 6. But the possibility is very real that the kernel gets it like this: 2, 1, 4, 3, 6, 5. The problem is that this confuses TCP/IP. While not a problem for links carrying many different TCP/IP sessions, you won’t be able to bundle multiple links and get to ftp a single file lots faster, except when your receiving or sending OS is Linux, which is not easily shaken by some simple reordering.
However, for lots of applications, link load balancing is a great idea.
10.2. Other possibilities
William Stearns has used an advanced tunneling setup to achieve good use of multiple, unrelated, internet connections together. It can be found on his tunneling page (http://www.stearns.org/tunnel/).
The HOWTO may feature more about this in the future.
Chapter 11. Netfilter & iproute - marking packets
So far we’ve seen how iproute works, and netfilter was mentioned a few times. This would be a good time to browse through Rusty’s Remarkably Unreliable Guides (http://netfilter.samba.org/unreliable-guides/).
Netfilter itself can be found here (http://netfilter.filewatcher.org/).
Netfilter allows us to filter packets, or mangle their headers. One special feature is that we can mark a packet with a number. This is done with the --set-mark facility.
As an example, this command marks all packets destined for port 25, outgoing mail:
# iptables -A PREROUTING -i eth0 -t mangle -p tcp --dport 25 \
  -j MARK --set-mark 1
Let’s say that we have multiple connections, one that is fast (and expensive, per megabyte) and one that is slower, but flat fee. We would most certainly like outgoing mail to go via the cheap route.
We’ve already marked the packets with a ’1’, we now instruct the routing policy database to act on this:
# echo 201 mail.out >> /etc/iproute2/rt_tables
# ip rule add fwmark 1 table mail.out
# ip rule ls
0: from all lookup local
32764: from all fwmark 1 lookup mail.out
32766: from all lookup main
32767: from all lookup default
Now we generate a route to the slow but cheap link in the mail.out table:
# /sbin/ip route add default via 195.96.98.253 dev ppp0 table mail.out
And we are done. Should we want to make exceptions, there are lots of ways to achieve this. We can modify the netfilter statement to exclude certain hosts, or we can insert a rule with a lower priority number, so that it is consulted before the fwmark rule, that points to the main table for our excepted hosts.
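One way such an exception might be written is sketched here. The host address 10.0.0.10 and the priority 1000 are invented for the example; the only requirement is that the priority number is lower than that of the fwmark rule (32764 in the listing above), because rules are consulted in order of increasing number. The command is printed rather than executed unless RUN is set to the empty string.

```shell
#!/bin/sh
# Dry run by default: RUN=echo prints the command instead of running it.
RUN=${RUN-echo}

mail_exception() {
    host=$1     # hypothetical host that should keep using the main table
    prio=$2     # must be numerically lower than the fwmark rule's priority
    $RUN ip rule add from "$host" table main priority "$prio"
}

mail_exception 10.0.0.10 1000
```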
We can also use this feature to honour TOS bits by marking packets with a different type of service with different numbers, and creating rules to act on that. This way you can even dedicate, say, an ISDN line to interactive sessions.
Needless to say, this also works fine on a host that’s doing NAT (’masquerading’).
IMPORTANT: We received a report that MASQ and SNAT at least collide with marking packets. Rusty Russell explains it in this posting (http://lists.samba.org/pipermail/netfilter/2000-November/006089.html). Turn off the reverse path filter to make it work properly.
Note: to mark packets, you need to have some options enabled in your kernel:
IP: advanced router (CONFIG_IP_ADVANCED_ROUTER) [Y/n/?]
IP: policy routing (CONFIG_IP_MULTIPLE_TABLES) [Y/n/?]
IP: use netfilter MARK value as routing key (CONFIG_IP_ROUTE_FWMARK) [Y/n/?]
See also Section 15.5 in the Cookbook.
Chapter 12. Advanced filters for (re-)classifying packets
As explained in the section on classful queueing disciplines, filters are needed to classify packets into any of the sub-queues. These filters are called from within the classful qdisc.
Here is an incomplete list of classifiers available:
fw
Bases the decision on how the firewall has marked the packet. This can be the easy way out if you don’t want to learn tc filter syntax. See the Queueing chapter for details.
u32
Bases the decision on fields within the packet (e.g. source IP address, etc)
route
Bases the decision on which route the packet will be routed by
rsvp, rsvp6
Routes packets based on RSVP (http://www.isi.edu/div7/rsvp/overview.html). Only useful on networks you control - the Internet does not respect RSVP.
tcindex
Used in the DSMARK qdisc, see the relevant section.
Note that in general there are many ways in which you can classify packets, and that it generally comes down to preference as to which system you wish to use.
Classifiers in general accept a few arguments in common. They are listed here for convenience:
protocol
The protocol this classifier will accept. Generally you will only be accepting IP traffic. Required.
parent
The handle this classifier is to be attached to. This handle must be an already existing class.
Required.
prio
The priority of this classifier. Lower numbers get tested first.
handle
This handle means different things to different filters.
All the following sections will assume you are trying to shape the traffic going to HostA. They will assume that the root class has been configured on 1: and that the class you want to send the selected traffic to is 1:1.
12.1. The u32 classifier
The U32 filter is the most advanced filter available in the current implementation. It is entirely based on hash tables, which makes it robust when there are many filter rules.
In its simplest form the U32 filter is a list of records, each consisting of two fields: a selector and an action. The selectors, described below, are compared with the currently processed IP packet until the first match occurs, and then the associated action is performed. The simplest type of action would be directing the packet into a defined class.
The command line of the tc filter program, used to configure the filter, consists of three parts: filter specification, a selector and an action. The filter specification can be defined as:
tc filter add dev IF [ protocol PROTO ]
[ (preference|priority) PRIO ] [ parent CBQ ]
The protocol field describes the protocol that the filter will be applied to. We will only discuss the case of the ip protocol. The preference field (priority can be used alternatively) sets the priority of the currently defined filter. This is important, since you can have several filters (lists of rules) with different priorities.
Each list will be passed in the order the rules were added, then the list with lower priority (higher preference number) will be processed. The parent field defines the CBQ tree top (e.g. 1:0) that the filter should be attached to.
The options described above apply to all filters, not only U32.
12.1.1. U32 selector
The U32 selector contains the definition of the pattern that will be matched against the currently processed packet. Precisely, it defines which bits are to be matched in the packet header and nothing more, but this simple method is very powerful. Let’s take a look at the following examples, taken directly from a pretty complex, real-world filter:
# tc filter add dev eth0 protocol ip parent 1:0 pref 10 u32 \
  match u32 00100000 00ff0000 at 0 flowid 1:10
For now, leave the first line alone - all these parameters describe the filter’s hash tables. Focus on the selector line, containing the match keyword. This selector will match IP headers whose second byte is 0x10 (0010). As you can guess, the 00ff number is the match mask, telling the filter exactly which bits to match. Here it’s 0xff, so the byte will match only if it is exactly 0x10. The at keyword means that the match is to start at the specified offset (in bytes) -- in this case at the beginning of the packet. Translating all that to human language, the packet will match if its Type of Service field has the ‘low delay’ bits set. Let’s analyze another rule:
# tc filter add dev eth0 protocol ip parent 1:0 pref 10 u32 \
  match u32 00000016 0000ffff at nexthdr+0 flowid 1:10
The nexthdr option means the next header encapsulated in the IP packet, i.e. the header of the upper-layer protocol. The match will also start here, at the beginning of the next header. The match occurs in the second 16-bit half of the first 32-bit word of that header. In the TCP and UDP protocols this field contains the packet’s destination port. The number is given in big-endian format, i.e. most significant bits first, so we simply read 0x0016 as 22 decimal, which stands for the SSH service if this was TCP. As you can guess, this match is ambiguous without context, and we will discuss this later.
Having understood all the above, we will find the following selector quite easy to read: match u32 c0a80100 ffffff00 at 16. What we have here is a three-byte match starting at the 17th byte, counting from the IP header start. This will match packets with a destination address anywhere in the 192.168.1.0/24 network.
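The comparison these selectors perform can be imitated with plain shell arithmetic. The sketch below is not how the kernel implements U32; it only shows the rule "take the 32-bit word at the offset, AND it with the mask, compare with the pattern". The function name u32_match and the sample packet words are made up for the illustration.

```shell
#!/bin/sh
# Succeeds (exit status 0) when a 32-bit word taken from the packet
# matches the pattern under the given mask.
u32_match() {
    pattern=$1 mask=$2 word=$3
    [ $(( $word & $mask )) -eq $(( $pattern & $mask )) ]
}

# First word of an IP header: version/IHL 0x45, TOS 0x10, length 0x0054.
u32_match 0x00100000 0x00ff0000 0x45100054 && echo "TOS 0x10: match"

# Destination 192.168.1.77 is 0xc0a8014d, inside 192.168.1.0/24.
u32_match 0xc0a80100 0xffffff00 0xc0a8014d && echo "192.168.1.0/24: match"

# The TOS test fails when the TOS byte is 0x00.
u32_match 0x00100000 0x00ff0000 0x45000054 || echo "TOS 0x00: no match"
```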
After analyzing the examples, we can summarize what we have learned.
12.1.2. General selectors
General selectors define the pattern, mask and offset at which the pattern will be matched against the packet contents. Using the general selectors you can match virtually any single bit in the IP (or upper layer) header. They are more difficult to write and read, though, than the specific selectors described below. The general selector syntax is:
match [ u32 | u16 | u8 ] PATTERN MASK [ at OFFSET | nexthdr+OFFSET]
One of the keywords u32, u16 or u8 specifies the length of the pattern in bits. PATTERN and MASK should follow, of the length defined by the previous keyword. The OFFSET parameter is the offset, in bytes, at which to start matching. If the nexthdr+ keyword is given, the offset is relative to the start of the upper layer header.
Some examples:
A packet will match this rule if its time to live (TTL) is 64. TTL is the field starting just after the 8th byte of the IP header.
# tc filter add dev ppp14 parent 1:0 prio 10 u32 \
  match u8 64 0xff at 8 \
  flowid 1:4
The following matches all TCP packets which have the ACK bit set:
# tc filter add dev ppp14 parent 1:0 prio 10 u32 \
  match ip protocol 6 0xff \
  match u8 0x10 0xff at nexthdr+13 \
  flowid 1:3
Use this to match ACKs on packets smaller than 64 bytes:
## match acks the hard way,
## IP protocol 6,
## IP header length 0x5(32 bit words),
## IP Total length 0x34 (ACK + 12 bytes of TCP options)
## TCP ack set (bit 5, offset 33)
# tc filter add dev ppp14 parent 1:0 protocol ip prio 10 u32 \
  match ip protocol 6 0xff \
  match u8 0x05 0x0f at 0 \
  match u16 0x0000 0xffc0 at 2 \
  match u8 0x10 0xff at 33 \
  flowid 1:3
This rule will only match TCP packets with the ACK bit set and no further payload. Here we can see an example of using several selectors; the final result is the logical AND of their results. If we take a look at a TCP header diagram, we can see that the ACK bit is the bit with value 0x10 in the 14th byte of the TCP header (at nexthdr+13). As for the protocol selector, if we’d like to make our life harder, we could write match u8 0x06 0xff at 9 instead of using the specific selector protocol tcp, because 6 is the number of the TCP protocol, present in the 10th byte of the IP header. On the other hand, in this example we couldn’t use any specific selector for the ACK match - simply because there’s no specific selector to match TCP ACK bits.
The filter below is a modified version of the filter above. The difference is that it doesn’t check the IP header length. Why? Because the filter above only works on 32 bit systems.
# tc filter add dev ppp14 parent 1:0 protocol ip prio 10 u32 \
  match ip protocol 6 0xff \
  match u8 0x10 0xff at nexthdr+13 \
  match u16 0x0000 0xffc0 at 2 \
  flowid 1:3
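Writing PATTERN and MASK by hand for address matches is error-prone, so a small helper can do the arithmetic. The following is a minimal sketch, with a made-up function name cidr_to_u32, that turns a dotted IPv4 network and prefix length into a general u32 selector. It assumes the usual IP header layout, with the source address at offset 12 and the destination address at offset 16.

```shell
#!/bin/sh
# Print a general u32 selector for an IPv4 source or destination network.
cidr_to_u32() {
    field=$1 addr=$2 plen=$3
    case $field in
        src) offset=12 ;;   # source address: bytes 12-15 of the IP header
        dst) offset=16 ;;   # destination address: bytes 16-19
        *) echo "usage: cidr_to_u32 src|dst ADDR PREFIXLEN" >&2; return 1 ;;
    esac
    old_ifs=$IFS; IFS=.
    set -- $addr            # split the dotted quad into $1..$4
    IFS=$old_ifs
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xffffffff << (32 - $plen)) & 0xffffffff ))
    printf 'match u32 %08x %08x at %d\n' $(( $ip & $mask )) "$mask" "$offset"
}

cidr_to_u32 dst 192.168.1.0 24    # prints: match u32 c0a80100 ffffff00 at 16
```

The printed text can be pasted after the u32 keyword of a tc filter command in place of a hand-written selector.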
12.1.3. Specific selectors
The following table contains a list of all specific selectors the author of this section has found in the tc program source code. They simply make your life easier and increase readability of your filter’s configuration.
FIXME: table placeholder - the table is in separate file „selector.html”
FIXME: it’s also still in Polish :-(
FIXME: must be sgml’ized
Some examples:
# tc filter add dev ppp0 parent 1:0 prio 10 u32 \
  match ip tos 0x10 0xff \
  flowid 1:4
FIXME: tcp dport match does not work as described below:
The above rule will match packets which have the TOS field set to 0x10. The TOS field starts at the second byte of the packet and is one byte long, so we could write an equivalent general selector: match u8 0x10 0xff at 1. This gives us a hint about the internals of the U32 filter -- the specific rules are always translated to general ones, and in this form they are stored in kernel memory. This leads to another conclusion -- the tcp and udp selectors are exactly the same, and this is why you can’t use a single match tcp dport 53 0xffff selector to match TCP packets sent to a given port -- it will also match UDP packets sent to this port. You must remember to also specify the protocol and end up with the following rule:
# tc filter add dev ppp0 parent 1:0 prio 10 u32 \
  match tcp dport 53 0xffff \
  match ip protocol 0x6 0xff \
  flowid 1:2
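To see why the protocol match is needed, it helps to spell out what a port selector looks like in general form. The sketch below, using a hypothetical helper named dport_selector, prints an equivalent pair of general selectors for a TCP or UDP destination port. It illustrates the translation idea described above and is not a dump of what tc really stores in the kernel.

```shell
#!/bin/sh
# Print general selectors equivalent to "PROTO dport PORT".
dport_selector() {
    proto=$1 port=$2
    case $proto in
        tcp) num=6 ;;
        udp) num=17 ;;
        *) echo "usage: dport_selector tcp|udp PORT" >&2; return 1 ;;
    esac
    # The protocol number is byte 9 of the IP header; the destination
    # port is the 16-bit field at offset 2 of both TCP and UDP headers.
    printf 'match u8 0x%02x 0xff at 9 match u16 0x%04x 0xffff at nexthdr+2\n' \
        "$num" "$port"
}

dport_selector tcp 53
dport_selector udp 53
```

The two output lines differ only in the protocol byte (0x06 versus 0x11); the port part is identical, which is exactly the ambiguity the protocol match removes.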
12.2. The route classifier
This classifier filters based on the results of the routing tables. When a packet that is traversing through the classes reaches one that is marked with the "route" filter, it splits the packets up based on information in the routing table.
# tc filter add dev eth1 parent 1:0 protocol ip prio 100 route
Here we add a route classifier onto the parent node 1:0 with priority 100. When a packet reaches this node (which, since it is the root, will happen immediately) it will consult the routing table. If the packet matches, it will be sent to the given class and have a priority of 100. Then, to finally kick it into action, you add the appropriate routing entry:
The trick here is to define ’realm’ based on either destination or source. The way to do it is like this:
# ip route add Host/Network via Gateway dev Device realm RealmNumber
For instance, we can define our destination network 192.168.10.0 with a realm number 10:
# ip route add 192.168.10.0/24 via 192.168.10.1 dev eth1 realm 10
When adding route filters, we can use realm numbers to represent the networks or hosts and specify how the routes match the filters.
# tc filter add dev eth1 parent 1:0 protocol ip prio 100 \
  route to 10 classid 1:10
The above rule matches the packets going to the network 192.168.10.0.
Route filters can also be used to match on the source of a packet. For example, there is a subnetwork attached to the Linux router on eth2.
# ip route add 192.168.2.0/24 dev eth2 realm 2
# tc filter add dev eth1 parent 1:0 protocol ip prio 100 \
  route from 2 classid 1:2
Here the filter specifies that packets from the subnetwork 192.168.2.0 (realm 2) will match class id 1:2.
12.3. Policing filters
To make even more complicated setups possible, you can have filters that only match up to a certain bandwidth. You can declare a filter either to cease matching entirely once a certain rate is exceeded, or to stop matching only the bandwidth that exceeds that rate.
So if you decided to police at 4mbit/s, but 5mbit/s of traffic is present, you can either stop matching the entire 5mbit/s, or stop matching only 1mbit/s and still send 4mbit/s to the configured class.
If bandwidth exceeds the configured rate, you can drop a packet, reclassify it, or see if another filter will match it.
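A policing filter for the 4mbit/s case might look like the sketch below. Everything specific in it is an assumption: the device eth0, the match-everything selector, the 10k burst and the class 1:1, as well as the short form of the overlimit action (drop or continue) written directly after the police parameters. The commands are printed rather than executed unless RUN is set to the empty string.

```shell
#!/bin/sh
# Dry run by default: RUN=echo prints the commands instead of running them.
RUN=${RUN-echo}

police_filter() {
    dev=$1      # device carrying the classful qdisc 1: (assumed: eth0)
    rate=$2     # rate above which the overlimit action applies
    action=$3   # "drop" the excess, or "continue" with the next filter
    $RUN tc filter add dev "$dev" parent 1:0 protocol ip prio 10 u32 \
        match ip dst 0.0.0.0/0 \
        police rate "$rate" burst 10k "$action" flowid 1:1
}

# Drop whatever exceeds 4mbit/s:
police_filter eth0 4mbit drop

# Or let the excess fall through to lower-priority filters:
police_filter eth0 4mbit continue
```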
12.3.1. Ways to police
There are basically two ways to police. If you compiled the kernel with ’Estimators’, the kernel can measure for each filter how much traffic it is passing, more or less. These estimators are very easy on the