                 Open Fabrics Enterprise Distribution (OFED)
                      IPoIB in OFED 1.4 Release Notes

                               December 2008

===============================================================================
Table of Contents
===============================================================================
1. Overview
2. New Features
3. Known Issues
4. DHCP Support of IPoIB
5. The ib-bonding driver
6. Bug Fixes and Enhancements Since OFED 1.3
7. Bug Fixes and Enhancements Since OFED 1.3.1
8. Performance tuning

===============================================================================
1. Overview
===============================================================================
IPoIB is a network driver implementation that transmits IP and ARP packets
over an InfiniBand UD channel. The implementation conforms to the RFCs of the
relevant IETF working group (http://www.ietf.org).

===============================================================================
2. New Features
===============================================================================
1. This version of OFED reduces the CPU overhead of handling received packets
   in datagram mode through Large Receive Offload (LRO): multiple incoming
   packets from a single stream are aggregated into a larger buffer before
   being passed up the networking stack, reducing the number of packets that
   must be processed. This feature is enabled on HCAs that support LRO, e.g.
   ConnectX.

2. Datagram mode: LSO (Large Send Offload) allows the networking stack to
   pass SKBs with a data size larger than the MTU to the IPoIB driver and
   have the HCA hardware fragment the data into multiple MSS-sized packets.
   This adds a device capability flag, IB_DEVICE_UD_TSO, for devices that
   can perform TCP segmentation offload; a new send work request opcode,
   IB_WR_LSO; header, hlen and mss fields in the work request structure; and
   a new IB_WC_LSO completion type.
   This feature is enabled on HCAs that support LSO, e.g. ConnectX.

Usage and configuration:
========================
1. To check the current mode used for outgoing connections, enter:
      cat /sys/class/net/ib0/mode

2. To disable IPoIB CM at compile time, enter:
      cd OFED-1.4
      export OFA_KERNEL_PARAMS="--without-ipoib-cm"
      ./install.pl

3. To change the run-time configuration for IPoIB, edit
   /etc/infiniband/openib.conf and change the following parameters:
      # Enable IPoIB Connected Mode
      SET_IPOIB_CM=yes
      # Set IPoIB MTU
      IPOIB_MTU=65520

4. You can also change the mode and MTU for a specific interface manually.
   To enable connected mode for interface ib0, enter:
      echo connected > /sys/class/net/ib0/mode
   To increase the MTU, enter:
      ifconfig ib0 mtu 65520

5. Switching between CM and UD mode can be done at run time:
      echo datagram > /sys/class/net/ib0/mode    sets the mode of ib0 to UD
      echo connected > /sys/class/net/ib0/mode   sets the mode of ib0 to CM

===============================================================================
3. Known Issues
===============================================================================
1. If a host has multiple interfaces and (a) each interface belongs to a
   different IP subnet, (b) they all use the same InfiniBand partition, and
   (c) they are connected to the same IB switch, then the host violates the
   IP rule requiring different broadcast domains. Consequently, the host may
   build an incorrect ARP table.

   The correct setting for a multi-homed IPoIB host is achieved by using a
   different PKey for each IP subnet. If a host has multiple interfaces on
   the same IP subnet, then to prevent a peer from building an incorrect ARP
   entry (neighbor), set the net.ipv4.conf.X.arp_ignore value to 1 or 2,
   where X stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc.).
   This causes the network stack to send ARP replies only on the interface
   with the IP address specified in the ARP request:
      sysctl -w net.ipv4.conf.ib0.arp_ignore=1
      sysctl -w net.ipv4.conf.ib1.arp_ignore=1
   Or, globally:
      sysctl -w net.ipv4.conf.all.arp_ignore=1
   To learn more about the arp_ignore parameter, see
   Documentation/networking/ip-sysctl.txt. Note that distributions provide
   means to make kernel parameters persistent.

2. There are IPoIB alias lines in modprobe.conf which prevent stopping/
   unloading the stack (i.e., '/etc/init.d/openibd stop' will fail). These
   alias lines cause the drivers to be loaded again by udev scripts.
   Workaround: Set OFA_KERNEL_PARAMS="--without-modprobe" before running
   install.pl, or remove the alias lines from modprobe.conf.

3. On SLES 10: The ib1 interface uses the configuration script of ib0.
   Workaround: Invoke ifup/ifdown using both the interface name and the
   configuration script name (example: ifup ib1 ib1).

4. After a hotplug event, the IPoIB interface falls back to datagram mode,
   and the MTU is reduced to 2K.
   Workaround: Re-enable connected mode and increase the MTU manually:
      echo connected > /sys/class/net/ib0/mode
      ifconfig ib0 mtu 65520

5. Since the IPoIB configuration files (ifcfg-ib<n>) are installed under the
   standard networking scripts location (RedHat:
   /etc/sysconfig/network-scripts/ and SuSE: /etc/sysconfig/network/), the
   option IPOIB_LOAD=no in openib.conf does not prevent the loading of IPoIB
   on boot.

6. On RedHat EL 4 up4, the IPoIB implementation is not spec-compliant:
   - IPoIB multicast does not work
   - IPoIB cannot inter-operate between RHEL4 U4 and other hosts
   This is due to code missing from the U4 kernel that was present in U3 and
   U5. As a workaround, upgrade to RHEL4 U5.

7. If IPoIB connected mode is enabled, it uses a large MTU for connected
   mode messages and a small MTU for datagram (in particular, multicast)
   messages, and relies on path MTU discovery to adjust the MTU
   appropriately.
   Packets sent in the window before MTU discovery automatically reduces the
   MTU for a specific destination will be dropped, producing the following
   message in the system log:
      "packet len <length> (> <mtu>) too long to send, dropping"
   To warn about this, a message is logged each time the MTU is set to a
   value higher than 2K.

8. In connected mode, TCP latency for short messages is higher by
   approximately 1 usec (~5%) than in datagram mode. As a workaround, use
   datagram mode.

9. Single-socket TCP bandwidth on kernels earlier than 2.6.18 is lower than
   on newer kernels. We recommend kernel 2.6.18 or later for best IPoIB
   performance.

10. Connectivity issues are encountered when using IPv6 on ia64 systems.

11. The IPoIB module uses the Linux implementation of Large Receive Offload
    (LRO) on kernel 2.6.24 and later. These kernels require installing the
    "inet_lro" module.

===============================================================================
4. DHCP Support of IPoIB
===============================================================================
Note: To use DHCP, the user must apply a special patch (see "DHCP Notes"
below).

DHCP Supported Operating Systems
--------------------------------
1. SLES 10
2. RHEL 5
3. All kernels from 2.6.14 and up

DHCP Unsupported Operating Systems
----------------------------------
RedHat EL 4 distributions are not supported.

DHCP Notes
----------
1. It may be required to run over UDP ports other than the well-known ports
   (67 and 68). Free port numbers greater than 0x8000 must be chosen. To
   specify a server or a client port number, use the option -p <port>. The
   client's port number must be the chosen server's port number plus one.

2. For IPoIB to use DHCP, you must patch ISC's DHCP. The patch file can be
   found under OFED-1.4/docs/dhcp after extracting the distribution file.
   (After installation it can also be found under <prefix>/docs/dhcp.) The
   patch should be applied to the server and to each client. Tests were run
   on version 3.0.4 of the DHCP package.
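The server/client port pairing rule in DHCP note 1 above is easy to get
wrong. The sketch below illustrates it; the helper name, the interface name
ib0, and the printed invocations are hypothetical examples (not part of
OFED or of the ISC DHCP patch), shown here only to make the arithmetic and
the 0x8000 lower bound concrete:

```shell
#!/bin/sh
# start_dhcp_over_ipoib: hypothetical helper illustrating the port rule from
# the notes above. It validates that the chosen server port is a free port
# number greater than 0x8000 (32768), derives the client port as server
# port + 1, and prints example dhcpd/dhclient invocations using -p.
start_dhcp_over_ipoib() {
    server_port="$1"
    # Reject empty or non-numeric input.
    case "$server_port" in
        ''|*[!0-9]*)
            echo "usage: start_dhcp_over_ipoib <server-port>" >&2
            return 1 ;;
    esac
    # Ports must be greater than 0x8000 (32768), per the DHCP notes.
    if [ "$server_port" -le 32768 ]; then
        echo "server port must be greater than 0x8000 (32768)" >&2
        return 1
    fi
    client_port=$((server_port + 1))
    # Print, rather than run, the example commands (ib0 is illustrative).
    echo "server: dhcpd -p $server_port ib0"
    echo "client: dhclient -p $client_port ib0"
}
```

For example, choosing server port 40000 pairs it with client port 40001.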
===============================================================================
5. The ib-bonding driver
===============================================================================
The ib-bonding driver is a high-availability solution for IPoIB interfaces.
It is based on the Linux Ethernet Bonding Driver and was adapted to work
with IPoIB. The ib-bonding package contains the bonding driver and a utility
called ib-bond to manage and control the driver operation (run
'rpm -qi ib-bonding' to get the package information).

Using the ib-bonding driver
---------------------------
The ib-bonding driver can be loaded manually or automatically.

1. Manual operation: Use the ib-bond utility to start, query, or stop the
   driver. For details on this utility, read the documentation for the
   ib-bonding package.

2. Automatic operation: Use standard OS tools (sysconfig in SuSE and
   initscripts in RedHat) to create a configuration that will come up with
   network restart. For details, read the documentation for the ib-bonding
   package.

Notes:
* Using /etc/infiniband/openib.conf to create a persistent configuration is
  no longer supported.

===============================================================================
6. Bug Fixes and Enhancements Since OFED 1.3
===============================================================================
- There is no default configuration for IPoIB interfaces: one should
  manually specify the full IP configuration or use the ofed_net.conf file.
  See OFED_Installation_Guide.txt for details on IPoIB configuration.
- Don't drop multicast sends when they can be queued
- IPoIB panics with RHEL5 U1, RHEL4 U6 and RHEL4 U5: bug fix when copying
  small SKBs (bug 989)
- IPoIB failed on stress testing (bug 1004)
- Kernel oops during "port up/down test" (bug 1040)
- Restarting the stack on the client side during an iperf 2.0.4 run caused
  a kernel panic (bug 985)
- Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
- Set max CM MTU when moving to CM mode, instead of setting it in the
  openibd script
- Fix CQ size calculations for IPoIB
- Bonding: Enable build for SLES 10 SP2
- Bonding: Fix issue in using the bonding module for Ethernet slaves (see
  documentation for details)

===============================================================================
7. Bug Fixes and Enhancements Since OFED 1.3.1
===============================================================================
- IPoIB: Refresh paths instead of flushing them on SM change events to
  improve failover response
- IPoIB: Fix loss of connectivity after bonding failover on both sides
- Bonding: Fix link state detection under RHEL4
- Bonding: Avoid annoying messages from initscripts when starting a bond
- Bonding: Set default number of gratuitous ARPs after failover to three
  (was one)

===============================================================================
8. Performance tuning
===============================================================================
- In IPoIB connected mode, the throughput of medium and large messages can
  be increased by setting the following TCP parameters:

     /sbin/sysctl -w net.ipv4.tcp_timestamps=0
     /sbin/sysctl -w net.ipv4.tcp_sack=0
     /sbin/sysctl -w net.core.netdev_max_backlog=250000
     /sbin/sysctl -w net.core.rmem_max=16777216
     /sbin/sysctl -w net.core.wmem_max=16777216
     /sbin/sysctl -w net.core.rmem_default=16777216
     /sbin/sysctl -w net.core.wmem_default=16777216
     /sbin/sysctl -w net.core.optmem_max=16777216
     /sbin/sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"
     /sbin/sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
     /sbin/sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
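Settings applied with 'sysctl -w' are lost at reboot. A minimal sketch of a
persistent equivalent, assuming the distribution applies /etc/sysctl.conf at
boot (standard behavior on RedHat and SuSE):

```
# /etc/sysctl.conf fragment: persistent form of the tuning commands above
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

The fragment can be applied without a reboot by running 'sysctl -p'.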