OpenFabrics Enterprise Distribution (for Windows)

User's Manual

Release 2.3

10/07/2010

Overview

The OpenFabrics Enterprise Distribution for Windows package is composed of software modules intended for use on Microsoft Windows based computer systems connected via an InfiniBand fabric.

The OpenFabrics Enterprise Distribution for Windows software package contains the following:

OpenFabrics InfiniBand core drivers and Upper Level Protocols (ULPs):

OpenFabrics Tools:

Documentation

 

OFED Features

 

 

 

Tools


The OpenFabrics Alliance Enterprise for Windows release contains a set of user-mode tools designed to facilitate the smooth operation of an OpenFabrics Enterprise Distribution installation. These tools are available from a command window (cmd.exe) because the installation path '%SystemDrive%\Program Files\OFED' is appended to the system-wide search path registry entry. A Start menu shortcut, 'OFED Cmd Window', is provided to facilitate correct tool operation.

IPoIB Partition Management

InfiniBand Subnet Management

QLogic VNIC Child Device Management

Performance

Diagnostics


 

User mode micro-benchmarks


The following user-mode test programs are intended as useful micro-benchmarks for HW or SW tuning and/or functional testing.

Tests use CPU cycle counters to get time stamps without context switch.

Tests measure round-trip time but report half of that as one-way latency
(i.e., the result may not be sufficiently accurate for asymmetrical configurations).

Min/Median/Max result is reported.
The median (vs. average) is less sensitive to extreme scores.
Typically the "Max" value is the first value measured.

Larger samples help only marginally. The default (1000) is usually sufficient.
Note that an array of cycles_t (typically unsigned long) is allocated
once to collect samples and again to store the difference between them.
Really big sample sizes (e.g. 1 million) might expose other problems
with the program.
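
The reporting scheme described above can be sketched in a few lines of Python. This is an illustration only, not the code used by the test programs: the function name, the timestamp values, and the cycles_per_usec conversion factor are all assumptions made for the example.

```python
from statistics import median

def one_way_latency_summary(timestamps, cycles_per_usec):
    """Summarize one-way latency from round-trip cycle-counter timestamps."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    # Second array: differences between consecutive samples (round-trip cycles).
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Report half of each round trip as one-way latency, in microseconds.
    one_way = [d / 2 / cycles_per_usec for d in deltas]
    return min(one_way), median(one_way), max(one_way)

# Hypothetical cycle counts; the first exchange is the slow outlier.
stamps = [0, 9000, 13000, 17200, 21000]
lo, med, hi = one_way_latency_summary(stamps, cycles_per_usec=1000.0)
print(f"min {lo:.2f}  median {med:.2f}  max {hi:.2f} usec")
```

With these sample values the slow first exchange sets the maximum while the median stays close to the typical value, which is why the median is reported rather than the average.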

"-H" option will dump the histogram for additional statistical analysis.
See xgraph, ygraph, r-base (http://www.r-project.org/), pspp, or other
statistical math programs.

Architectures tested: x86, x86_64, ia64

Also see winverbs performance tools.


ib_send_lat.exe      - latency test with send transactions

Usage:

ib_send_lat start a server and wait for connection
ib_send_lat <host> connect to server at <host>

Options:

-p, --port=<port> listen on/connect to port <port> (default 18515)
-c, --connection=<RC/UC> connection type RC/UC (default RC)
-m, --mtu=<mtu> mtu size (default 2048)
-d, --ib-dev=<dev> use IB device <dev> (default first device found)
-i, --ib-port=<port> use port <port> of IB device (default 1)
-s, --size=<size> size of message to exchange (default 1)
-t, --tx-depth=<dep> size of tx queue (default 50)
-l, --signal signal completion on each msg
-a, --all Run sizes from 2 till 2^23
-n, --iters=<iters> number of exchanges (at least 2, default 1000)
-C, --report-cycles report times in cpu cycle units (default microseconds)
-H, --report-histogram print out all results (default print summary only)
-U, --report-unsorted (implies -H) print out unsorted results (default sorted)
-V, --version display version number
-e, --events sleep on CQ events (default poll)
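
When the test is driven from a script, the server and client command lines can be assembled from the options listed above. The following Python sketch shows one way to do that. The helper name send_lat_argv is hypothetical, only flags from the list above are used, and passing each value as a separate argument assumes conventional getopt-style option parsing.

```python
def send_lat_argv(host=None, size=None, iters=None, events=False):
    """Build an ib_send_lat command line from documented options."""
    argv = ["ib_send_lat"]
    if size is not None:
        argv += ["-s", str(size)]          # message size
    if iters is not None:
        if iters < 2:
            raise ValueError("-n requires at least 2 exchanges")
        argv += ["-n", str(iters)]         # number of exchanges
    if events:
        argv.append("-e")                  # sleep on CQ events instead of polling
    if host is not None:
        argv.append(host)                  # client mode: connect to <host>
    return argv

# Hypothetical run: start the server first, then the client on the other node.
server = send_lat_argv(size=64, iters=5000)
client = send_lat_argv(host="host1", size=64, iters=5000)
print(" ".join(server))
print(" ".join(client))
```

The resulting lists can be handed to subprocess.run() on each node. Both sides should be given the same size and iteration count.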


ib_send_bw.exe     - BW (BandWidth) test with send transactions

Usage:

ib_send_bw start a server and wait for connection
ib_send_bw <host> connect to server at <host>

Options:

-p, --port=<port> listen on/connect to port <port> (default 18515)
-d, --ib-dev=<dev> use IB device <dev> (default first device found)
-i, --ib-port=<port> use port <port> of IB device (default 1)
-c, --connection=<RC/UC/UD> connection type RC/UC/UD (default RC)
-m, --mtu=<mtu> mtu size (default 1024)
-s, --size=<size> size of message to exchange (default 65536)
-a, --all Run sizes from 2 till 2^23
-t, --tx-depth=<dep> size of tx queue (default 300)
-n, --iters=<iters> number of exchanges (at least 2, default 1000)
-b, --bidirectional measure bidirectional bandwidth (default unidirectional)
-V, --version display version number
-e, --events sleep on CQ events (default poll)


ib_write_lat.exe      - latency test with RDMA write transactions

Usage:

ib_write_lat start a server and wait for connection
ib_write_lat <host> connect to server at <host>

Options:

-p, --port=<port> listen on/connect to port <port> (default 18515)
-c, --connection=<RC/UC> connection type RC/UC (default RC)
-m, --mtu=<mtu> mtu size (default 1024)
-d, --ib-dev=<dev> use IB device <dev> (default first device found)
-i, --ib-port=<port> use port <port> of IB device (default 1)
-s, --size=<size> size of message to exchange (default 1)
-a, --all Run sizes from 2 till 2^23
-t, --tx-depth=<dep> size of tx queue (default 50)
-n, --iters=<iters> number of exchanges (at least 2, default 1000)
-C, --report-cycles report times in cpu cycle units (default microseconds)
-H, --report-histogram print out all results (default print summary only)
-U, --report-unsorted (implies -H) print out unsorted results (default sorted)
-V, --version display version number


ib_write_bw.exe     - BW test with RDMA write transactions

Usage:

ib_write_bw start a server and wait for connection
ib_write_bw <host> connect to server at <host>

Options:

-p, --port=<port> listen on/connect to port <port> (default 18515)
-d, --ib-dev=<dev> use IB device <dev> (default first device found)
-i, --ib-port=<port> use port <port> of IB device (default 1)
-c, --connection=<RC/UC> connection type RC/UC (default RC)
-m, --mtu=<mtu> mtu size (default 1024)
-g, --post=<num of posts> number of posts for each qp in the chain (default tx_depth)
-q, --qp=<num of qp's> number of QPs (default 1)
-s, --size=<size> size of message to exchange (default 65536)
-a, --all Run sizes from 2 till 2^23
-t, --tx-depth=<dep> size of tx queue (default 100)
-n, --iters=<iters> number of exchanges (at least 2, default 5000)
-b, --bidirectional measure bidirectional bandwidth (default unidirectional)
-V, --version display version number



 


ttcp - Test TCP performance

TTCP accesses the Windows socket layer, hence it does not access IB verbs directly. IPoIB or WSD layers are invoked beneath the socket layer depending on configuration. TTCP is included as a quick baseline performance check.

Usage: ttcp -t [-options] host 
       ttcp -r [-options]
Common options:
	-l ##	length of bufs read from or written to network (default 8192)
	-u	use UDP instead of TCP
	-p ##	port number to send to or listen at (default 5001)
	-A	align the start of buffers to this modulus (default 16384)
	-O	start buffers at this offset from the modulus (default 0)
	-d	set SO_DEBUG socket option
	-b ##	set socket buffer size (if supported)
	-f X	format for rate: k,K = kilo{bit,byte}; m,M = mega; g,G = giga
Options specific to -t:
	-n##	number of source bufs written to network (default 2048)
	-D	don't buffer TCP writes (sets TCP_NODELAY socket option)
Options specific to -r:
	-B	for -s, only output full blocks as specified by -l (for TAR)
	-T	"touch": access each byte as it's read

Requires a receiver (server) side and a transmitter (client) side; host1 and host2 are IPoIB-connected hosts.

at host1 (receiver)        ttcp -r -f M -l 4096

at host2 (transmitter)    ttcp -t -f M -l 4096 -n1000 host1
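
As a sanity check on the numbers ttcp reports, the amount of data moved by the transmitter command above follows directly from -l and -n. The Python sketch below works through that arithmetic. The elapsed time is a made-up value, and treating 'M' as 2^20 bytes is an assumption about the rate format.

```python
def ttcp_transfer(buf_len, num_bufs, seconds):
    """Return (total bytes, MB/s) for num_bufs buffers of buf_len bytes."""
    total_bytes = buf_len * num_bufs
    # Assumption: one "M" unit is 1024 * 1024 bytes.
    mbytes_per_sec = total_bytes / (1024 * 1024) / seconds
    return total_bytes, mbytes_per_sec

# -l 4096 -n1000, with a hypothetical elapsed time of 0.5 seconds.
total, rate = ttcp_transfer(4096, 1000, seconds=0.5)
print(f"{total} bytes, {rate:.4f} MB/s")
```

A transfer this small (about 4 MB) finishes very quickly on an InfiniBand fabric, so a larger -n value may give more stable results.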


 

 

Diagnostics


IBADDR(8) OFED Diagnostics

NAME
ibaddr - query InfiniBand address(es)

SYNOPSIS
ibaddr [-d(ebug)] [-D(irect)] [-G(uid)] [-l(id_show)] [-g(id_show)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [-h(elp)] [<lid | dr_path | guid>]

DESCRIPTION
Display the lid (and range) as well as the GID address of the port
specified (by DR path, lid, or GUID) or the local port by default.

Note: this utility can be used as simple address resolver.
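
The "range" in a lid range comes from the port's LMC (LID Mask Control) value. The Python sketch below assumes the standard InfiniBand convention that a port with LMC n answers to 2^n consecutive LIDs starting at its base LID. The function name is hypothetical, and the sketch does not reproduce ibaddr's output format.

```python
def lid_range(base_lid, lmc):
    """Return (first, last) LID of a port, assuming 2**lmc consecutive LIDs."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit field (0..7)")
    return base_lid, base_lid + (1 << lmc) - 1

# Hypothetical port with base LID 16 and LMC 1.
first, last = lid_range(16, 1)
print(f"lid range {first}..{last}")
```

With LMC 0 the range collapses to the single base LID.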

OPTIONS
-G, --Guid
show lid range and gid for GUID address

-l, --lid_show
show lid range only

-L, --Lid_show
show lid range (in decimal) only

-g, --gid_show
show gid address only


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.
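
The default port selection rule above can be expressed as a short Python sketch. The Port record and its field names are hypothetical stand-ins for whatever the local driver reports; the sketch illustrates only the two-step rule, not how the diagnostics implement it.

```python
from typing import NamedTuple, Optional, Sequence

class Port(NamedTuple):
    ca_name: str
    port_num: int
    state: str       # logical port state, e.g. "ACTIVE", "INIT", "DOWN"
    link_up: bool    # True when the physical link is up

def select_default_port(ports: Sequence[Port]) -> Optional[Port]:
    """Pick the first ACTIVE port, else the first port with physical link up."""
    for port in ports:
        if port.state == "ACTIVE":
            return port
    for port in ports:
        if port.link_up:
            return port
    return None  # nothing usable was found

# Hypothetical two-port adapter: port 1 has link but is not yet ACTIVE.
ports = [
    Port("ibv_device0", 1, "INIT", True),
    Port("ibv_device0", 2, "ACTIVE", True),
]
chosen = select_default_port(ports)
print(chosen.ca_name, chosen.port_num)
```

Port 2 is chosen even though port 1 comes first, because an ACTIVE port takes precedence over one that merely has physical link.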


EXAMPLES
ibaddr # local port's address

ibaddr 32 # show lid range and gid of lid 32

ibaddr -G 0x8f1040023 # same but using guid address

ibaddr -l 32 # show lid range only

ibaddr -L 32 # show decimal lid range only

ibaddr -g 32 # show gid address only


SEE ALSO
ibroute(8), ibtracert(8)

AUTHOR
Hal Rosenstock
<halr@voltaire.com>


OFED June 18, 2007 IBADDR(8)
 

 

IBLINKINFO(8) OFED Diagnostics
 

NAME
iblinkinfo - report link info for all links in the fabric


SYNOPSIS
iblinkinfo [-Rhcdl -C <ca_name> -P <ca_port> -v <lt,hoq,vlstall> -S <guid> -D<direct_route>]


DESCRIPTION
iblinkinfo reports the link info for each port of each switch active
in the IB fabric.


OPTIONS
-R Recalculate the ibnetdiscover information, i.e., do not use the
cached information. This option is slower but should be used if
the diag tools have not been used for some time or if there are
other reasons to believe the fabric has changed.

-S <guid>
Output only the switch specified by <guid> (hex format)

-D <direct_route>
Output only the switch specified by the direct route path.

-l Print all information for each link on one line. Default is to
print a header with the switch information and then a list for
each port (useful for grep'ing output).

-d Print only switches which have a port in the "Down" state.

-v <lt,hoq,vlstall>
Verify additional switch settings
(<LifeTime>,<HoqLife>,<VLStallCount>)

-c Print port capabilities (enabled and supported values)

-C <ca_name> use the specified ca_name for the search.

-P <ca_port> use the specified ca_port for the search.



AUTHOR
Ira Weiny <weiny2@llnl.gov>


OFED Jan 24, 2008 IBLINKINFO(8)

 

 

IBNETDISCOVER(8) OFED Diagnostics
 

NAME
ibnetdiscover - discover InfiniBand topology

SYNOPSIS
ibnetdiscover [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-s(how)] [-l(ist)]
[-g(rouping)] [-H(ca_list)] [-S(witch_list)] [-R(outer_list)] [-C
ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)]
[--node-name-map <node-name-map>] [-p(orts)] [-h(elp)] [<topology-file>]

DESCRIPTION
ibnetdiscover performs IB subnet discovery and outputs a human readable
topology file. GUIDs, node types, and port numbers are displayed as
well as port LIDs and NodeDescriptions. All nodes (and links) are
displayed (full topology). Optionally, this utility can be used to list
the currently connected nodes by nodetype. The output is printed to
standard output unless a topology file is specified.

OPTIONS
-l, --list
List of connected nodes

-g, --grouping
Show grouping. Grouping correlates IB nodes by different vendor
specific schemes. It may also show the switch external ports
correspondence.

-H, --Hca_list
List of connected CAs

-S, --Switch_list
List of connected switches

-R, --Router_list
List of connected routers

-s, --show
Show progress information during discovery.

--node-name-map <node-name-map>
Specify a node name map. The node name map file maps GUIDs to
more user friendly names. See file format below.

-p, --ports
Obtain a ports report which is a list of connected ports with
relevant information (like LID, portnum, GUID, width, speed, and
NodeDescription).


COMMON OPTIONS
Most OpenIB diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


TOPOLOGY FILE FORMAT
The topology file format is human readable and largely intuitive. Most
identifiers are given textual names like vendor ID (vendid), device ID
(devid), GUIDs of various types (sysimgguid, caguid, switchguid,
etc.). PortGUIDs are shown in parentheses (). For switches, this is
shown on the switchguid line. For CA and router ports, it is shown on
the connectivity lines. The IB node is identified, followed by the
number of ports and the node GUID in quotes. On the right of this line is
a comment (#) followed by the NodeDescription in quotes. If the node
is a switch, this line also states whether switch port 0 is base or
enhanced, and gives the LID and LMC of port 0. Subsequent lines pertaining
to this node show the connectivity. On the left is the port number of
the current node. On the right is the peer node (the node at the other end
of the link). It is identified in quotes with the nodetype followed by -
followed by the NodeGUID, with the port number in square brackets. Further
on the right is a comment (#). What follows the comment depends on the
node type. If it is a switch node, it is followed by the NodeDescription
in quotes and the LID of the peer node. If it is a CA or router
node, it is followed by the local LID and LMC and then by the
NodeDescription in quotes and the LID of the peer node. The active
link width and speed are then appended to the end of this output line.

An example of this is:
#
# Topology file: generated on Tue Jun 5 14:15:10 2007
#
# Max of 3 hops discovered
# Initiated from node 0008f10403960558 port 0008f10403960559

Non-Chassis Nodes

vendid=0x8f1
devid=0x5a06
sysimgguid=0x5442ba00003000
switchguid=0x5442ba00003080(5442ba00003080)
Switch 24 "S-005442ba00003080" # "ISR9024 Voltaire" base port 0 lid 6 lmc 0
[22] "H-0008f10403961354"[1](8f10403961355) # "MT23108 InfiniHost Mellanox Technologies" lid 4 4xSDR
[10] "S-0008f10400410015"[1] # "SW-6IB4 Voltaire" lid 3 4xSDR
[8] "H-0008f10403960558"[2](8f1040396055a) # "MT23108 InfiniHost Mellanox Technologies" lid 14 4xSDR
[6] "S-0008f10400410015"[3] # "SW-6IB4 Voltaire" lid 3 4xSDR
[12] "H-0008f10403960558"[1](8f10403960559) # "MT23108 InfiniHost Mellanox Technologies" lid 10 4xSDR

vendid=0x8f1
devid=0x5a05
switchguid=0x8f10400410015(8f10400410015)
Switch 8 "S-0008f10400410015" # "SW-6IB4 Voltaire" base port 0 lid 3 lmc 0
[6] "H-0008f10403960984"[1](8f10403960985) # "MT23108 InfiniHost Mellanox Technologies" lid 16 4xSDR
[4] "H-005442b100004900"[1](5442b100004901) # "MT23108 InfiniHost Mellanox Technologies" lid 12 4xSDR
[1] "S-005442ba00003080"[10] # "ISR9024 Voltaire" lid 6 1xSDR
[3] "S-005442ba00003080"[6] # "ISR9024 Voltaire" lid 6 4xSDR

vendid=0x2c9
devid=0x5a44
caguid=0x8f10403960984
Ca 2 "H-0008f10403960984" # "MT23108 InfiniHost Mellanox Technologies"
[1](8f10403960985) "S-0008f10400410015"[6] # lid 16 lmc 1 "SW-6IB4 Voltaire" lid 3 4xSDR

vendid=0x2c9
devid=0x5a44
caguid=0x5442b100004900
Ca 2 "H-005442b100004900" # "MT23108 InfiniHost Mellanox Technologies"
[1](5442b100004901) "S-0008f10400410015"[4] # lid 12 lmc 1 "SW-6IB4 Voltaire" lid 3 4xSDR

vendid=0x2c9
devid=0x5a44
caguid=0x8f10403961354
Ca 2 "H-0008f10403961354" # "MT23108 InfiniHost Mellanox Technologies"
[1](8f10403961355) "S-005442ba00003080"[22] # lid 4 lmc 1 "ISR9024 Voltaire" lid 6 4xSDR

vendid=0x2c9
devid=0x5a44
caguid=0x8f10403960558
Ca 2 "H-0008f10403960558" # "MT23108 InfiniHost Mellanox Technologies"
[2](8f1040396055a) "S-005442ba00003080"[8] # lid 14 lmc 1 "ISR9024 Voltaire" lid 6 4xSDR
[1](8f10403960559) "S-005442ba00003080"[12] # lid 10 lmc 1 "ISR9024 Voltaire" lid 6 1xSDR

When grouping is used, IB nodes are organized into chassis, which are
numbered. Nodes which cannot be determined to be in a chassis are
displayed as "Non-Chassis Nodes". External ports are also shown on the
connectivity lines.
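
Because the format is line oriented, the connectivity lines under a Switch entry can be picked apart with a regular expression. The Python sketch below handles only switch-side lines of the kind shown in the example topology. The pattern and the field names are assumptions derived from that one example, not a complete grammar, and CA-side lines (which place the port GUID and local LID/LMC differently) are not covered.

```python
import re

# Matches lines such as:
#   [22] "H-0008f10403961354"[1](8f10403961355) # "..." lid 4 4xSDR
SWITCH_LINK = re.compile(
    r'^\[(?P<port>\d+)\]\s+'                            # local switch port
    r'"(?P<peer_type>[A-Z])-(?P<peer_guid>[0-9a-f]+)"'  # peer nodetype and NodeGUID
    r'\[(?P<peer_port>\d+)\]'                           # peer port number
    r'(?:\((?P<peer_portguid>[0-9a-f]+)\))?'            # PortGUID (CA/router peers)
    r'\s+#\s+"(?P<desc>[^"]*)"'                         # peer NodeDescription
    r'\s+lid\s+(?P<peer_lid>\d+)'                       # peer LID
    r'\s+(?P<width>\d+x)(?P<speed>\w+)\s*$'             # active width and speed
)

def parse_switch_link(line):
    """Return a dict of fields for a switch connectivity line, or None."""
    match = SWITCH_LINK.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("port", "peer_port", "peer_lid"):
        fields[key] = int(fields[key])
    return fields

example = ('[22] "H-0008f10403961354"[1](8f10403961355) '
           '# "MT23108 InfiniHost Mellanox Technologies" lid 4 4xSDR')
link = parse_switch_link(example)
print(link["port"], link["peer_type"], link["peer_lid"], link["width"], link["speed"])
```

Lines that do not match, such as the vendid or switchguid lines, return None and can be skipped or handled by separate patterns.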


NODE NAME MAP FILE FORMAT
The node name map is used to specify user friendly names for nodes in
the output. GUIDs are used to perform the lookup.

Generically:

# comment
0x<guid> "<name>"

Example:

# IB1
# Line cards
0x0008f104003f125c "IB1 (Rack 11 slot 1 ) ISR9288/ISR9096 Voltaire sLB-24D"
0x0008f104003f125d "IB1 (Rack 11 slot 1 ) ISR9288/ISR9096 Voltaire sLB-24D"
0x0008f104003f10d2 "IB1 (Rack 11 slot 2 ) ISR9288/ISR9096 Voltaire sLB-24D"
0x0008f104003f10d3 "IB1 (Rack 11 slot 2 ) ISR9288/ISR9096 Voltaire sLB-24D"
0x0008f104003f10bf "IB1 (Rack 11 slot 12 ) ISR9288/ISR9096 Voltaire sLB-24D"
# Spines
0x0008f10400400e2d "IB1 (Rack 11 spine 1 ) ISR9288 Voltaire sFB-12D"
0x0008f10400400e2e "IB1 (Rack 11 spine 1 ) ISR9288 Voltaire sFB-12D"
0x0008f10400400e2f "IB1 (Rack 11 spine 1 ) ISR9288 Voltaire sFB-12D"
0x0008f10400400e31 "IB1 (Rack 11 spine 2 ) ISR9288 Voltaire sFB-12D"
0x0008f10400400e32 "IB1 (Rack 11 spine 2 ) ISR9288 Voltaire sFB-12D"
# GUID Node Name
0x0008f10400411a08 "SW1 (Rack 3) ISR9024 Voltaire 9024D"
0x0008f10400411a28 "SW2 (Rack 3) ISR9024 Voltaire 9024D"
0x0008f10400411a34 "SW3 (Rack 3) ISR9024 Voltaire 9024D"
0x0008f104004119d0 "SW4 (Rack 3) ISR9024 Voltaire 9024D"
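
A node name map in this format is simple to read from a script, for example to check a map for errors before passing it to --node-name-map. The Python sketch below is an illustration under the format described above; the function name is hypothetical, and treating any other line shape as an error is a choice made for this sketch rather than documented tool behavior.

```python
import re

# One entry per line: 0x<guid> "<name>"
MAP_LINE = re.compile(r'^(0x[0-9a-fA-F]+)\s+"(.*)"\s*$')

def parse_node_name_map(text):
    """Return {guid (int): name} from node name map text."""
    names = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = MAP_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        names[int(match.group(1), 16)] = match.group(2)
    return names

sample = '''# GUID Node Name
0x0008f10400411a08 "SW1 (Rack 3) ISR9024 Voltaire 9024D"
0x0008f10400411a28 "SW2 (Rack 3) ISR9024 Voltaire 9024D"
'''
names = parse_node_name_map(sample)
print(names[0x0008f10400411a08])
```

Storing the GUID as an integer makes lookups insensitive to leading zeros, which matters because the topology file prints some GUIDs without them.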


AUTHORS
Hal Rosenstock    <halr@voltaire.com>
Ira Weiny    <weiny2@llnl.gov>


OFED January 3, 2008 IBNETDISCOVER(8)

 

 

IBPING(8) OFED Diagnostics
 

NAME
ibping - ping an InfiniBand address


SYNOPSIS
ibping [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-G(uid)] [-C ca_name] [-P
ca_port] [-s smlid] [-t(imeout) timeout_ms] [-V(ersion)] [-c
ping_count] [-f(lood)] [-o oui] [-S(erver)] [-h(elp)] <dest lid | guid>


DESCRIPTION
ibping uses vendor mads to validate connectivity between IB nodes. On
exit, (IP) ping-like output is shown. ibping is run as client/server.
Default is to run as client. Note also that a default ping server is
implemented within the kernel.


OPTIONS
-c stop after count packets

-f, --flood
flood destination: send packets back to back without delay

-o, --oui
use specified OUI number to multiplex vendor mads

-S, --Server
start in server mode (do not return)


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


AUTHOR
Hal Rosenstock <halr@voltaire.com>


OFED August 11, 2006 IBPING(8)



 

IBPORTSTATE(8) OFED Diagnostics


NAME
ibportstate - handle port (physical) state and link speed of an
InfiniBand port


SYNOPSIS
ibportstate [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-D(irect)] [-G(uid)] [-s smlid] [-V(ersion)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-h(elp)] <dest dr_path|lid|guid> <portnum> [<op>]


DESCRIPTION
ibportstate allows the port state and port physical state of an IB port
to be queried (in addition to link width and speed being validated
relative to the peer port when the port queried is a switch port), or a
switch port to be disabled, enabled, or reset. It also allows the link
speed enabled on any IB port to be adjusted.


OPTIONS
op Port operations allowed
supported ops: enable, disable, reset, speed, query
Default is query

ops enable, disable, and reset are only allowed on switch ports
(An error is indicated if attempted on CA or router ports)
speed op is allowed on any port
speed values are legal values for PortInfo:LinkSpeedEnabled
(An error is indicated if PortInfo:LinkSpeedSupported does not support
this setting)
(NOTE: Speed changes are not effected until the port goes through
link renegotiation)
query also validates port characteristics (link width and speed)
based on the peer port. This checking is done when the port
queried is a switch port as it relies on combined routing
(an initial LID route with directed routing to the peer) which
can only be done on a switch. This peer port validation feature
of query op requires LID routing to be functioning in the subnet.
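
The rules above about which op is allowed on which kind of port can be summarized as a small check. The Python sketch below is illustrative only: the function name and the node type strings are hypothetical, and the real utility reports an error itself when an op is not permitted.

```python
SWITCH_ONLY_OPS = {"enable", "disable", "reset"}
ANY_PORT_OPS = {"speed", "query"}

def op_allowed(op, node_type):
    """Return True if <op> is permitted on a port of node_type.

    node_type is one of "switch", "ca", or "router" (names assumed here).
    A missing op defaults to "query".
    """
    op = op or "query"
    if op in SWITCH_ONLY_OPS:
        return node_type == "switch"
    if op in ANY_PORT_OPS:
        return True
    raise ValueError(f"unsupported op: {op}")

for op, node in [("disable", "switch"), ("disable", "ca"), ("speed", "ca"), (None, "router")]:
    print(op or "query", node, op_allowed(op, node))
```

A wrapper script could use such a check to refuse a disable or reset request before it is ever sent toward a CA or router port.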


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
ibportstate 3 1 disable # by lid

ibportstate -G 0x2C9000100D051 1 enable # by guid

ibportstate -D 0 1 # (query) by direct route

ibportstate 3 1 reset # by lid

ibportstate 3 1 speed 1 # by lid


AUTHOR
Hal Rosenstock <halr@voltaire.com>


OFED October 19, 2006 IBPORTSTATE(8)



 

IBQUERYERRORS(8) OFED Diagnostics
 

NAME
ibqueryerrors - query and report non-zero IB port counters


SYNOPSIS
ibqueryerrors [-a -c -r -R -C <ca_name> -P <ca_port> -s
<err1,err2,...> -S <switch_guid> -D <direct_route> -d]


DESCRIPTION
ibqueryerrors reports the port counters of switches. This is similar
to ibcheckerrors with the additional ability to filter out selected
errors, include the optional transmit and receive data counters, report
actions to remedy a non-zero count, and report full link information
for the link reported.


OPTIONS
-a Report an action to take. Some of the counters are not errors
in and of themselves. This reports some more information on
what the counters mean and what actions can/should be taken if
they are non-zero.

-c Suppress some of the common "side effect" counters. These counters
usually do not indicate an error condition and can usually be
safely ignored.

-r Report the port information. This includes LID, port, external
port (if applicable), link speed setting, remote GUID, remote
port, remote external port (if applicable), and remote node
description information.

-R Recalculate the ibnetdiscover information, i.e., do not use the
cached information. This option is slower but should be used if
the diag tools have not been used for some time or if there are
other reasons to believe that the fabric has changed.

-s <err1,err2,...>
Suppress the errors listed in the comma separated list provided.

-S <switch_guid>
Report results only for the switch specified. (hex format)

-D <direct_route>
Report results only for the switch specified by the direct route
path.

-d Include the optional transmit and receive data counters.

-C <ca_name> use the specified ca_name for the search.

-P <ca_port> use the specified ca_port for the search.

AUTHOR
Ira Weiny <weiny2@llnl.gov>


OFED Jan 24, 2008 IBQUERYERRORS(8)



 

IBROUTE(8) OFED Diagnostics
 

NAME
ibroute - query InfiniBand switch forwarding tables


SYNOPSIS
ibroute [-d(ebug)] [-a(ll)] [-n(o_dests)] [-v(erbose)] [-D(irect)]
[-G(uid)] [-M(ulticast)] [-s smlid] [-C ca_name] [-P ca_port]
[-t(imeout) timeout_ms] [-V(ersion)] [-h(elp)] [<dest dr_path|lid|guid>
[<startlid> [<endlid>]]]


DESCRIPTION
ibroute uses SMPs to display the forwarding tables (unicast
(LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT))
for the specified switch LID and the optional lid (mlid) range. The
default range is all valid entries in the range 1...FDBTop.


OPTIONS
-a, --all
show all lids in range, even invalid entries

-n, --no_dests
do not try to resolve destinations

-M, --Multicast
show multicast forwarding tables. In this case, the range
parameters specify the mlid range.


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
Unicast examples

ibroute 4 # dump all lids with valid out ports of switch with lid 4

ibroute -a 4 # same, but dump all lids, even with invalid out ports

ibroute -n 4 # simple dump format - no destination resolution

ibroute 4 10 # dump lids starting from 10 (up to FDBTop)

ibroute 4 0x10 0x20 # dump lid range

ibroute -G 0x08f1040023 # resolve switch by GUID

ibroute -D 0,1 # resolve switch by direct path


Multicast examples

ibroute -M 4 # dump all non empty mlids of switch with lid 4

ibroute -M 4 0xc010 0xc020 # same, but with range

ibroute -M -n 4 # simple dump format


SEE ALSO
ibtracert(8)

AUTHOR
Hal Rosenstock <halr@voltaire.com>


OFED July 25, 2006 IBROUTE(8)


 

 


ibv_devinfo - print CA (Channel Adapter) attributes

usage: ibv_devinfo  [options]

Options:
    -d, --ib-dev=<dev> use IB device <dev> (default: first device found)
    -i, --ib-port=<port> use port <port> of IB device (default: all ports)
    -l, --list print only the IB device names
    -v, --verbose print all the attributes of the IB device(s)

 


IBSTAT(8) OFED Diagnostics

NAME
ibstat - query basic status of InfiniBand device(s)


SYNOPSIS
ibstat [-d(ebug)] [-l(ist_of_cas)] [-s(hort)] [-p(ort_list)] [-V(ersion)] [-h] <ca_name> [portnum]


DESCRIPTION
ibstat is a binary which displays basic information obtained from the
local IB driver. Output includes LID, SMLID, port state, link width
active, and port physical state.

It is similar to the ibstatus utility but implemented as a binary
rather than a script. It has options to list CAs and/or ports and
displays more information than ibstatus.


OPTIONS
-l, --list_of_cas
list all IB devices

-s, --short
short output

-p, --port_list
show port list

ca_name
InfiniBand device name

portnum
port number of InfiniBand device


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
ibstat # display status of all ports on all IB devices

ibstat -l # list all IB devices

ibstat -p # show port guids

ibstat ibv_device0 2 # show status of port 2 of 'ibv_device0'


SEE ALSO
ibstatus(8)


AUTHOR
Hal Rosenstock <halr@voltaire.com>


OFED July 25, 2006 IBSTAT(8)

 

 

IBSYSSTAT(8) OFED Diagnostics
 

NAME
ibsysstat - system status on an InfiniBand address


SYNOPSIS
ibsysstat [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-G(uid)] [-C ca_name]
[-P ca_port] [-s smlid] [-t(imeout) timeout_ms] [-V(ersion)] [-o oui]
[-S(erver)] [-h(elp)] <dest lid | guid> [<op>]


DESCRIPTION
ibsysstat uses vendor mads to validate connectivity between IB nodes
and obtain other information about the IB node. ibsysstat is run as
client/server. Default is to run as client.


OPTIONS
Current supported operations:
ping - verify connectivity to server (default)
host - obtain host information from server
cpu - obtain cpu information from server

-o, --oui
use specified OUI number to multiplex vendor mads

-S, --Server
start in server mode (do not return)



COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


AUTHOR
Hal Rosenstock    <halr@voltaire.com>


OFED August 11, 2006 IBSYSSTAT(8)



IBTRACERT(8) OFED Diagnostics


NAME
ibtracert - trace InfiniBand path


SYNOPSIS
ibtracert [-d(ebug)] [-v(erbose)] [-D(irect)] [-G(uids)] [-n(o_info)]
[-m mlid] [-s smlid] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms]
[-V(ersion)] [--node-name-map <node-name-map>] [-h(elp)] [<dest
dr_path|lid|guid> [<startlid> [<endlid>]]]


DESCRIPTION
ibtracert uses SMPs to trace the path from a source GID/LID to a
destination GID/LID. Each hop along the path is displayed until the
destination is reached or a hop does not respond. By using the -m option,
multicast path tracing can be performed between source and destination
nodes.


OPTIONS
-n, --no_info
simple format; don’t show additional information

-m show the multicast trace of the specified mlid

--node-name-map <node-name-map>
Specify a node name map. The node name map file maps GUIDs to
more user friendly names. See ibnetdiscover(8) for node name
map file format.


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.
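
As a small illustration of the directed path argument taken by -D, the sketch below splits a path string into its list of out ports. The function name is hypothetical and the validation is an assumption; the tools themselves may accept or reject input differently.

```python
from typing import List


def parse_dr_path(path: str) -> List[int]:
    """Split a directed path such as "0,1,2,1,4" into out-port numbers."""
    ports = [int(token) for token in path.split(",")]
    if any(port < 0 for port in ports):
        raise ValueError("port numbers must be non-negative")
    return ports
```

For the example path "0,1,2,1,4" the result is [0, 1, 2, 1, 4]: the leading 0 is the self port, after which the path leaves via port 1, then port 2, and so on.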

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
Unicast examples

ibtracert 4 16 # show path between lids 4 and 16

ibtracert -n 4 16 # same, but using simple output format

ibtracert -G 0x8f1040396522d 0x002c9000100d051 # use guid addresses


Multicast example

ibtracert -m 0xc000 4 16 # show multicast path of mlid 0xc000 between lids 4 and 16


SEE ALSO
ibroute(8)


AUTHOR
    Hal Rosenstock    <halr@voltaire.com>

    Ira Weiny    <weiny2@llnl.gov>

OFED April 14, 2007 IBTRACERT(8)


PERFQUERY(8) OFED Diagnostics


NAME
perfquery - query InfiniBand port counters


SYNOPSIS
perfquery [-d(ebug)] [-G(uid)] [-x|--extended] [-X|--xmtsl]
[-S|--rcvsl] [-a(ll_ports)] [-l(oop_ports)] [-r(eset_after_read)]
[-R(eset_only)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms]
[-V(ersion)] [-h(elp)] [<lid|guid> [[port] [reset_mask]]]


DESCRIPTION
perfquery uses PerfMgt GMPs to obtain the PortCounters (basic performance
and error counters), PortCountersExtended, PortXmitDataSL, or
PortRcvDataSL from the PMA at the node/port specified. It can optionally
show aggregated counters for all ports of the node, reset the counters
after reading them, or only reset them.

Note: In PortCounters, PortCountersExtended, PortXmitDataSL, and
PortRcvDataSL, components that represent Data (e.g. PortXmitData and
PortRcvData) indicate octets divided by 4 rather than just octets.
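
Because the Data components count octets divided by 4, a raw reading has to be multiplied by 4 to get bytes. The sketch below shows that conversion and a simple rate calculation between two readings; the function names are hypothetical, and counter wrap-around or saturation is deliberately ignored.

```python
def data_counter_to_bytes(counter_value: int) -> int:
    """Convert a PortXmitData/PortRcvData reading (octets / 4) to octets."""
    return counter_value * 4


def throughput_bytes_per_sec(first: int, second: int, seconds: float) -> float:
    """Average byte rate between two readings taken `seconds` apart.

    Assumes the counter neither wrapped nor was reset between readings.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return data_counter_to_bytes(second - first) / seconds
```

For example, two PortXmitData readings of 1000 and 3500 taken 10 seconds apart correspond to 10000 bytes sent, or 1000 bytes per second.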

Note: Inputting a port of 255 indicates an operation be performed on
all ports.


OPTIONS
-x, --extended
show extended port counters rather than (basic) port counters.
Note that extended port counters attribute is optional.

-X, --xmtsl
show transmit data SL counter. This is an optional counter for
QoS.

-S, --rcvsl
show receive data SL counter. This is an optional counter for
QoS.

-a, --all_ports
show aggregated counters for all ports of the destination lid or
reset all counters for all ports. If the destination lid does
not support the AllPortSelect flag, all ports will be iterated
through to emulate AllPortSelect behavior.

-l, --loop_ports
If all ports are selected by the user (either through the -a
option or port 255) iterate through each port rather than doing
an aggregate operation.

-r, --reset_after_read
reset counters after read

-R, --Reset_only
only reset counters


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
perfquery # read local port performance counters

perfquery 32 1 # read performance counters from lid 32, port 1

perfquery -x 32 1 # read extended performance counters from lid 32, port 1

perfquery -a 32 # read perf counters from lid 32, all ports

perfquery -r 32 1 # read performance counters and reset

perfquery -x -r 32 1 # read extended performance counters and reset

perfquery -R 0x20 1 # reset performance counters of port 1 only

perfquery -x -R 0x20 1 # reset extended performance counters of port 1 only

perfquery -R -a 32 # reset performance counters of all ports

perfquery -R 32 2 0x0fff # reset only error counters of port 2

perfquery -R 32 2 0xf000 # reset only non-error counters of port 2
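
The last two examples split the reset mask into an error part and a non-error part. The sketch below names those two masks and builds the matching command line. The constant and function names are hypothetical, and treating the mask as 16 bits wide is an assumption drawn only from the example values above.

```python
from typing import List

# Values taken from the examples above; the names are illustrative only.
ERROR_COUNTERS_MASK = 0x0FFF      # "reset only error counters"
NON_ERROR_COUNTERS_MASK = 0xF000  # "reset only non-error counters"


def reset_command(lid: int, port: int, mask: int) -> List[str]:
    """Build a 'perfquery -R <lid> <port> <reset_mask>' argument list."""
    if not 0 <= mask <= 0xFFFF:
        raise ValueError("mask is assumed to fit in 16 bits")
    return ["perfquery", "-R", str(lid), str(port), f"0x{mask:04x}"]
```

The two masks do not overlap, and OR-ing them together gives 0xffff, which selects both groups.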


AUTHOR
Hal Rosenstock    <halr@voltaire.com>


OFED March 10, 2009 PERFQUERY(8)



SAQUERY(8) OFED Diagnostics


NAME
saquery - query InfiniBand subnet administration attributes


SYNOPSIS
saquery [-h] [-d] [-p] [-N] [--list | -D] [-S] [-I] [-L] [-l] [-G] [-O]
[-U] [-c] [-s] [-g] [-m] [-x] [-C ca_name] [-P ca_port] [--smkey val]
[-t(imeout) <msec>] [--src-to-dst <src:dst>] [--sgid-to-dgid
<sgid-dgid>] [--node-name-map <node-name-map>] [<name> | <lid> |
<guid>]


DESCRIPTION
saquery issues the selected SA query. Node records are queried by
default.


OPTIONS
-p get PathRecord info

-N get NodeRecord info

--list | -D
get NodeDescriptions of CAs only

-S get ServiceRecord info

-I get InformInfoRecord (subscription) info

-L return the Lids of the name specified

-l return the unique Lid of the name specified

-G return the Guids of the name specified

-O return the name for the Lid specified

-U return the name for the Guid specified

-c get the SA’s class port info

-s return the PortInfoRecords with isSM or isSMdisabled capability
mask bit on

-g get multicast group info

-m get multicast member info. If a group is specified, limit the output to the group specified and print one line containing only the GUID and node description for each entry. Example: saquery -m 0xc000

-x get LinkRecord info

--src-to-dst
get a PathRecord for <src:dst> where src and dst are either node names or LIDs

--sgid-to-dgid
get a PathRecord for sgid to dgid where both GIDs are in an IPv6 format acceptable to inet_pton.

-C <ca_name>
use the specified ca_name.

-P <ca_port>
use the specified ca_port.

--smkey <val>
use SM_Key value for the query. Will be used only with "trusted"
queries. If non-numeric value (like ’x’) is specified then
saquery will prompt for a value.

-t, --timeout <msec>
Specify SA query response timeout in milliseconds. Default is
100 milliseconds. You may want to use this option if IB_TIMEOUT
is indicated.

--node-name-map <node-name-map>
Specify a node name map. The node name map file maps GUIDs to
more user friendly names. See ibnetdiscover(8) for node name
map file format. Only used with the -O and -U options.

Supported query names (and aliases):
ClassPortInfo (CPI)
NodeRecord (NR) [lid]
PortInfoRecord (PIR) [[lid]/[port]]
SL2VLTableRecord (SL2VL) [[lid]/[in_port]/[out_port]]
PKeyTableRecord (PKTR) [[lid]/[port]/[block]]
VLArbitrationTableRecord (VLAR) [[lid]/[port]/[block]]
InformInfoRecord (IIR)
LinkRecord (LR) [[from_lid]/[from_port]] [[to_lid]/[to_port]]
ServiceRecord (SR)
PathRecord (PR)
MCMemberRecord (MCMR)
LFTRecord (LFTR) [[lid]/[block]]
MFTRecord (MFTR) [[mlid]/[position]/[block]]

-d enable debugging

-h show help


AUTHORS
Ira Weiny <weiny2@llnl.gov>

Hal Rosenstock <halr@voltaire.com>


OFED October 19, 2008 SAQUERY(8)


SMINFO(8) OFED Diagnostics


NAME
sminfo - query InfiniBand SMInfo attribute


SYNOPSIS
sminfo [-d(ebug)] [-e(rr_show)] -s state -p prio -a activity
[-D(irect)] [-G(uid)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms]
[-V(ersion)] [-h(elp)] sm_lid | sm_dr_path [modifier]


DESCRIPTION
Optionally set and display the output of a sminfo query in human readable
format. The target SM is the one listed in the local port info, or the SM
specified by the optional SM lid or by the SM direct routed path.

Note: using sminfo for any purpose other than a simple query may be very
dangerous, and may result in a malfunction of the target SM.


OPTIONS
-s set SM state
0 - not active
1 - discovering
2 - standby
3 - master

-p set priority (0-15)

-a set activity count
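
The state numbers and the priority range listed above can be captured in a small lookup, for instance when decoding or checking values in a script. This is only a sketch; the class and function names are hypothetical and nothing here talks to an SM.

```python
from enum import IntEnum


class SMState(IntEnum):
    # Numeric values as listed for the -s option above.
    NOT_ACTIVE = 0
    DISCOVERING = 1
    STANDBY = 2
    MASTER = 3


def validate_priority(priority: int) -> int:
    """Check that a priority lies in the documented 0-15 range."""
    if not 0 <= priority <= 15:
        raise ValueError("SM priority must be in the range 0-15")
    return priority
```

As the note in the description warns, actually setting these values on a live SM can be dangerous; the sketch only validates and names them.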

COMMON OPTIONS

Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
sminfo         # local port's sminfo

sminfo 32     # show sminfo of lid 32

sminfo -G 0x8f1040023     # same but using guid address


SEE ALSO
smpdump(8)


AUTHOR
Hal Rosenstock    <halr@voltaire.com>

OFED July 25, 2006 SMINFO(8)



SMPDUMP(8) OFED Diagnostics


NAME
smpdump - dump InfiniBand subnet management attributes


SYNOPSIS
smpdump [-s(tring)] [-D(irect)] [-C ca_name] [-P ca_port] [-t(imeout)
timeout_ms] [-V(ersion)] [-h(elp)] <dlid|dr_path> <attr> [mod]


DESCRIPTION
smpdump is a general purpose SMP utility which gets SM attributes from
a specified SMA. The result is dumped in hex by default.


OPTIONS
attr IBA attribute ID for SM attribute

mod IBA modifier for SM attribute


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
Direct Routed Examples

smpdump -D 0,1,2,3,5 16 # NODE DESC

smpdump -D 0,1,2 0x15 2 # PORT INFO, port 2

LID Routed Examples

smpdump 3 0x15 2 # PORT INFO, lid 3 port 2

smpdump 0xa0 0x11 # NODE INFO, lid 0xa0
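
The attribute IDs used in the examples above can be collected in a small table so that scripts do not have to hard-code the numbers. The sketch below covers only the three IDs that appear in those examples; the dictionary and function names are hypothetical.

```python
from typing import List, Optional

# Attribute IDs as used in the examples above (16 decimal == 0x10).
SM_ATTR_IDS = {
    "NODE_DESC": 0x10,
    "NODE_INFO": 0x11,
    "PORT_INFO": 0x15,
}


def smpdump_args(dest: str, attr_name: str,
                 mod: Optional[int] = None, direct: bool = False) -> List[str]:
    """Build an smpdump argument list: [-D] <dlid|dr_path> <attr> [mod]."""
    args = ["smpdump"]
    if direct:
        args.append("-D")  # dest is a directed path such as "0,1,2"
    args += [dest, hex(SM_ATTR_IDS[attr_name])]
    if mod is not None:
        args.append(str(mod))
    return args
```

For instance, smpdump_args("0,1,2", "PORT_INFO", 2, direct=True) reproduces the second direct routed example.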


SEE ALSO
smpquery(8)


AUTHOR
Hal Rosenstock    <halr@voltaire.com>


OFED July 25, 2006 SMPDUMP(8)



SMPQUERY(8) OFED Diagnostics


NAME
smpquery - query InfiniBand subnet management attributes


SYNOPSIS
smpquery [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-D(irect)] [-G(uid)]
[-C ca_name] [-P ca_port] [-t(imeout) timeout_ms]
[--node-name-map node-name-map-file]
[-V(ersion)] [-h(elp)] <op> <dest dr_path|lid|guid> [op params]


DESCRIPTION
smpquery allows a basic subset of standard SMP queries including the
following: node info, node description, switch info, port info. Fields
are displayed in human readable format.


OPTIONS
Current supported operations and their parameters:
nodeinfo <addr>
nodedesc <addr>
portinfo <addr> [<portnum>] # default port is zero
switchinfo <addr>
pkeys <addr> [<portnum>]
sl2vl <addr> [<portnum>]
vlarb <addr> [<portnum>]
guids <addr>


--node-name-map <node-name-map>
Specify a node name map. The node name map file maps GUIDs to
more user friendly names. See ibnetdiscover(8) for node name
map file format.


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-D use directed path address arguments. The path
is a comma separated list of out ports.
Examples:
"0" # self port
"0,1,2,1,4" # out via port 1, then 2, ...

-c use combined route address arguments. The
address is a combination of a LID and a direct route path.
The LID specified is the DLID and the local LID is used
as the DrSLID.

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
smpquery portinfo 3 1 # portinfo by lid, with port modifier

smpquery -G switchinfo 0x2C9000100D051 1 # switchinfo by guid

smpquery -D nodeinfo 0 # nodeinfo by direct route

smpquery -c nodeinfo 6 0,12 # nodeinfo by combined route


SEE ALSO
smpdump(8)


AUTHOR
Hal Rosenstock <halr@voltaire.com>


OFED March 14, 2007 SMPQUERY(8)


VENDSTAT(8) OFED Diagnostics

NAME
vendstat - query InfiniBand vendor specific functions


SYNOPSIS
vendstat [-d(ebug)] [-G(uid)] [-N] [-w] [-i] [-c <num,num>] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [-h(elp)] <lid|guid>


DESCRIPTION
vendstat uses vendor specific MADs to access vendor specific functionality
beyond the IB spec. Currently, there is support for Mellanox InfiniSwitch-III (IS3) and InfiniSwitch-IV (IS4).


OPTIONS
-N show IS3 general information.

-w show IS3 port xmit wait counters.

-i show IS4 counter group info.

-c <num,num>
configure IS4 counter groups.

Configure IS4 counter groups 0 and 1. Such configuration is not
persistent across IS4 reboot. First number is for counter group
0 and second is for counter group 1.

Group 0 counter config values:
0 - PortXmitDataSL0-7
1 - PortXmitDataSL8-15
2 - PortRcvDataSL0-7

Group 1 counter config values:
1 - PortXmitDataSL8-15
2 - PortRcvDataSL0-7
8 - PortRcvDataSL8-15
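
Since each counter group accepts only the values listed above, a script can check a pair before passing it to -c. The sketch below does just that; the table and function names are hypothetical and the tables simply restate the lists above.

```python
# Allowed config values per counter group, restated from the lists above.
GROUP0_VALUES = {0: "PortXmitDataSL0-7", 1: "PortXmitDataSL8-15", 2: "PortRcvDataSL0-7"}
GROUP1_VALUES = {1: "PortXmitDataSL8-15", 2: "PortRcvDataSL0-7", 8: "PortRcvDataSL8-15"}


def counter_group_arg(group0: int, group1: int) -> str:
    """Return the '<num,num>' argument for 'vendstat -c' after validation."""
    if group0 not in GROUP0_VALUES:
        raise ValueError(f"unsupported group 0 value: {group0}")
    if group1 not in GROUP1_VALUES:
        raise ValueError(f"unsupported group 1 value: {group1}")
    return f"{group0},{group1}"
```

counter_group_arg(0, 1) gives "0,1", which selects PortXmitDataSL0-7 for group 0 and PortXmitDataSL8-15 for group 1.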


COMMON OPTIONS
Most OFED diagnostics take the following common flags. The exact list
of supported flags per utility can be found in the usage message and
can be shown using the util_name -h syntax.

# Debugging flags

-d raise the IB debugging level.
May be used several times (-ddd or -d -d -d).

-e show send and receive errors (timeouts and others)

-h show the usage message

-v increase the application verbosity level.
May be used several times (-vv or -v -v -v)

-V show the version info.

# Addressing flags

-G use GUID address argument. In most cases, it is the Port GUID.
Example:
"0x08f1040023"

-s <smlid> use ’smlid’ as the target lid for SM/SA queries.

# Other common flags:

-C <ca_name> use the specified ca_name.

-P <ca_port> use the specified ca_port.

-t <timeout_ms> override the default timeout for the solicited mads.

Multiple CA/Multiple Port Support

When no IB device or port is specified, the port to use is selected by
the following criteria:

1. the first port that is ACTIVE.

2. if not found, the first port that is UP (physical link up).

If a port and/or CA name is specified, the user request is attempted to
be fulfilled, and will fail if it is not possible.


EXAMPLES
vendstat -N 6 # read IS3 general information

vendstat -w 6 # read IS3 port xmit wait counters

vendstat -i 6 12 # read IS4 port 12 counter group info

vendstat -c 0,1 6 12 # configure IS4 port 12 counter groups for PortXmitDataSL

vendstat -c 2,8 6 12 # configure IS4 port 12 counter groups for PortRcvDataSL


AUTHOR
Hal Rosenstock    <halr@voltaire.com>


OFED April 16, 2009 VENDSTAT(8)



ib_limits - Infiniband verbs tests

Usage: ib_limits [options]

Options:
-m or --memory
    Direct ib_limits to test memory registration
-c or --cq
    Direct ib_limits to test CQ creation
-r or --resize_cq
    Direct ib_limits to test CQ resize
-q or --qp
    Direct ib_limits to test QP creation
-v or --verbose
    Enable verbose output to the debug console.
-h or --help
    Display this usage info then exit.



cmtest - Connection Manager Tests

Usage: cmtest [options]

    Options:

 -s --server This option directs cmtest to act as a Server
 -l --local This option specifies the local endpoint.
 -r --remote This option specifies the remote endpoint LID as a hex integer 0x; see vstat command for active port LID hex integer.
 -c --connect This option specifies the number of connections to open. Default of 1.
 -m --msize This option specifies the byte size of each message. Default is 100 bytes.
 -n --nmsgs This option specifies the number of messages to send at a time.
 -p --permsg This option indicates if a separate buffer should be used per message. Default is one buffer for all messages.
 -i --iterate This option specifies the number of times to loop through 'nmsgs'. Default of 1.
 -v --verbose This option enables verbosity level to debug console.
 -h --help Display this usage info then exit.


InfiniBand Partition Management

The part_man.exe application allows creating, deleting and viewing existing host partitions.

Usage : part_man.exe <show|add|rem> <port_guid> <pkey1 pkey2 ...>

show – shows existing partitions

Expected results after executing part_man.exe show:

1.      Output has a format 

port_guid1   pkey1  pkey2  pkey3  pkey4  pkey5  pkey6  pkey7  pkey8

port_guid2   pkey1  pkey2  pkey3  pkey4  pkey5  pkey6  pkey7  pkey8

where port_guid is the port GUID in hexadecimal format and each pkey is a partition key value (in hexadecimal format) for that port.

The default partition key (0xFFFF) is not shown and cannot be created by part_man.exe.
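
The output format described above is simple enough to parse in a script. The sketch below turns such output into a mapping from port GUID to partition keys. It assumes the output contains only lines of the form shown (no header or status lines), and the function name is hypothetical.

```python
from typing import Dict, List


def parse_part_man_show(output: str) -> Dict[str, List[int]]:
    """Map each port GUID (kept as text) to its list of partition keys."""
    partitions: Dict[str, List[int]] = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip the blank lines between entries
        port_guid, pkeys = fields[0], fields[1:]
        # Partition keys are hexadecimal, with or without a 0x prefix.
        partitions[port_guid] = [int(pkey, 16) for pkey in pkeys]
    return partitions
```

The port GUID is kept as an opaque string because the exact way part_man.exe prints it is not specified here.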

 

add - create new partition(s) on specified port

part_man.exe add <port_guid> <pkey1> <pkey2>

Creates new partition(s) on the port specified by the port_guid parameter (in hexadecimal format). Each pkey is a new partition key value in hexadecimal format (e.g. 0xABCD or ABCD).

The port GUID is taken from the vstat output and has the following format:

XXXX:XXXX:XXXX:XXXX.

vstat prints the node GUID, so the user has to add 1 to the node GUID value to obtain the port GUID. For example, if the node GUID is 0008:f104:0397:7ccc, the port GUID will be

0008:f104:0397:7ccd – for the first port,

0008:f104:0397:7cce – for the second port.
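
The node GUID to port GUID arithmetic above can be scripted as follows. This is a minimal sketch of the rule as stated in this section (node GUID plus the port number); the function name is hypothetical.

```python
def port_guid_from_node_guid(node_guid: str, port: int) -> str:
    """Derive a port GUID in XXXX:XXXX:XXXX:XXXX form from a node GUID."""
    if port < 1:
        raise ValueError("port numbers start at 1")
    value = int(node_guid.replace(":", ""), 16) + port
    digits = f"{value:016x}"
    # Regroup the 16 hex digits into four colon-separated groups.
    return ":".join(digits[i:i + 4] for i in range(0, 16, 4))
```

Calling port_guid_from_node_guid("0008:f104:0397:7ccc", 1) returns "0008:f104:0397:7ccd", matching the first port in the example above.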

 

Expected results of executing part_man.exe add 0x0D99:9703:04f1:0800 0xABCD:

1.      The part_man.exe output ends with a …Done message.

2.      A new instance of a network adapter named “OpenFabrics IPoIB Adapter Partition” appears in the Device Manager window.
If the new adapter appears with a yellow label, manual device driver installation is required.

3.      The new adapter name ends with “Partition”, e.g. “OpenFabrics IPoIB Adapter Partition”.

 

rem – removes partition key(s) on the specified port.

part_man.exe rem <port_guid> <pkey1>  <pkey2>

port_guid – in hexadecimal format (same as for the add command), identifies the port for the operation.

Expected results after executing part_man.exe rem <port_guid>  <pkey>:

1.      The application prints a …Done message.

2.      The IPoIB network adapter disappears from the Device Manager window.

3.      Executing part_man.exe show no longer shows the removed adapter.

 



PrintIP - print ip adapters and their addresses

PrintIP is used to print IP adapters and their addresses, or ARP (Address Resolution Protocol) and IP address.

Usage:
    printip <print_ips>
    printip <remoteip> <ip>        (example printip remoteip 10.10.2.20)




vstat - HCA Stats and Counters

Display HCA (Host Channel Adapter) attributes.

Usage: vstat [-v] [-c]
          -v - verbose mode
          -c - HCA error/statistic counters

Includes Node GUID, Subnet Manager and port LIDs.


Subnet Management with OpenSM version 3.3.6


A single running process (opensm.exe) is required to configure an InfiniBand subnet and thus make it usable.  In most cases, running InfiniBand Subnet Management as a Windows service is sufficient to correctly configure an InfiniBand fabric.

The Infiniband subnet management process (opensm) may exist on a Windows (OFED) node or a Linux (OFED) node.

Limit the number of OpenSM processes per IB fabric; one SM is sufficient although redundant SMs are supported. You do not need a Subnet Manager per node/system.

OpenIB Subnet Management as a Windows Service

InfiniBand subnet management (OpenSM), as a Windows service, is installed by default, although it is NOT started by default. There are two ways to enable the InfiniBand Subnet Management service (items 1 and 2); item 3 lists the log files to consult when troubleshooting.

  1. Reset the installed OpenSM service "InfiniBand Subnet Management" to start automatically: from a command window, type 'services.msc'.
    Locate the InfiniBand Subnet Management entry and select the start option; additionally, select the startup option 'Automatic' to start the OpenSM service on system startup.
     
  2. Install OpenSM as a 'running' Windows service:
    Select the OpenSM_service_Started install feature. Once the installation has completed, check the running InfiniBand Subnet Management service status via the Windows service manager (see #1).
     
  3. Consult the OpenSM log files to see what OpenSM thinks is happening.
        %TEMP%\osm.log
        %TEMP%\osm.syslog
    Note:
        When opensm.exe is run as a Windows Service, the 'normal' case, %temp% is defined as %windir%\TEMP\.
        If opensm.exe is run from a command window, %TEMP% is not defined as %windir%\TEMP\.
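
Because the log directory depends on how opensm.exe was started, a script looking for the logs has to consider both cases. The sketch below builds the candidate paths from the note above. The function name is hypothetical, the file names are the two listed in item 3, and ntpath is used so that the Windows-style joins also work when the sketch is run on another platform.

```python
import ntpath
from typing import List, Mapping


def opensm_log_paths(env: Mapping[str, str], as_service: bool) -> List[str]:
    """Return the expected OpenSM log file paths for the given start mode."""
    if as_service:
        # Service case: %TEMP% is %windir%\TEMP.
        base = ntpath.join(env["windir"], "TEMP")
    else:
        # Command window case: use the caller's own %TEMP%.
        base = env["TEMP"]
    return [ntpath.join(base, name) for name in ("osm.log", "osm.syslog")]
```

On a Windows system, os.environ can be passed as env; the second argument selects between the service and command window cases.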
     

InfiniBand Subnet Management from a command window

opensm - InfiniBand subnet manager and administration (SM/SA)

SYNOPSIS

opensm [--version] [-F | --config <file_name>] [-c(reate-config) <file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>] [-p(riority) <PRIORITY>] [-smkey <SM_Key>] [--sm_sl <SL number>] [-r(eassign_lids)] [-R <engine name(s)> | --routing_engine <engine name(s)>] [--do_mesh_analysis] [--lash_start_vl <vl number>] [-A | --ucast_cache] [-z | --connect_roots] [-M <file name> | --lid_matrix_file <file name>] [-U <file name> | --lfts_file <file name>] [-S | --sadb_file <file name>] [-a | --root_guid_file <path to file>] [-u | --cn_guid_file <path to file>] [-G | --io_guid_file <path to file>] [-H | --max_reverse_hops <max reverse hops allowed>] [-X | --guid_routing_order_file <path to file>] [-m | --ids_guid_file <path to file>] [-o(nce)] [-s(weep) <interval>] [-t(imeout) <milliseconds>] [--retries <number>] [-maxsmps <number>] [-console [off | local | socket | loopback]] [-console-port <port>] [-i(gnore-guids) <equalize-ignore-guids-file>] [-w | --hop_weights_file <path to file>] [-O | --dimn_ports_file <path to file>] [-f <log file path> | --log_file <log file path> ] [-L | --log_limit <size in MB>] [-e(rase_log_file)] [-P(config) <partition config file> ] [-N | --no_part_enforce] [-Q | --qos [-Y | --qos_policy_file <file name>]] [-y | --stay_on_fatal] [-B | --service --daemon] [-I | --inactive] [--perfmgr] [--perfmgr_sweep_time_s <seconds>] [--prefix_routes_file <path>] [--consolidate_ipv6_snm_req] [--log_prefix <prefix text>] [-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]

DESCRIPTION

opensm is an InfiniBand compliant Subnet Manager and Administration, and runs on top of OFED for Windows. opensm provides an implementation of an InfiniBand Subnet Manager and Administration. Such a software entity is required to run in order to initialize the InfiniBand hardware (at least one per InfiniBand subnet).

opensm also contains an experimental version of a performance manager.

opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.

opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports). If no port is specified, it will select the first "best" available port.

opensm can present the available ports and prompt for a port number to attach to.

By default, the run is logged to two files: %TEMP%\osm.syslog (aka %windir%\temp\osm.syslog) and %windir%\temp\opensm.log. The first file registers only general major events, whereas the second includes details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly. Note that when opensm.exe is run as a service, %TEMP% == %windir%\temp.

OPTIONS

--version
Prints OpenSM version and exits.
-F, --config <config file>
The name of the OpenSM config file. When not specified %ProgramFiles%\OFED\opensm\opensm.conf will be used (if exists).
-c, --create-config <file name>
OpenSM will dump its configuration to the specified file and exit. This is a way to generate OpenSM configuration file template.
-g, --guid <GUID in hex>
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-l, --lmc <LMC value>
This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e. multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.
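
The relationship between the LMC value and the number of LIDs per port can be checked with a one-line calculation. The sketch below is only an illustration of the 2^LMC rule and the 0-7 range stated above; the function name is hypothetical.

```python
def lids_per_port(lmc: int) -> int:
    """Number of LIDs assigned to each port for a given LMC (2^LMC)."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 2 ** lmc
```

The default LMC of 0 gives one LID per port, while the maximum of 7 gives 128 LIDs per port.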
-p, --priority <Priority value>
This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. The range goes from 0 (default and lowest priority) to 15 (highest).
-smkey <SM_Key value>
This option specifies the SM's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in host byte order. This is fixed now, but you may need this option to interoperate with an old OpenSM running on a little-endian machine.
--sm_sl <SL number>
This option sets the SL to use for communication with the SM/SA. Defaults to 0.
-r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments resolving multiple use of same LID.
-R, --routing_engine <Routing engine names>
This option chooses routing engine(s) to use instead of Min Hop algorithm (default). Multiple routing engines can be specified separated by commas so that specific ordering of routing algorithms will be tried if earlier routing engines fail. Supported engines: minhop, updn, file, ftree, lash, dor
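
When the -R argument is generated by a script, the engine names can be checked against the supported list before opensm is started. The sketch below does that and joins the names in the order given, which is the order in which they are tried. The function name is hypothetical.

```python
from typing import List, Sequence

# Engine names as listed in the option description above.
SUPPORTED_ENGINES = ("minhop", "updn", "file", "ftree", "lash", "dor")


def routing_engine_arg(engines: Sequence[str]) -> List[str]:
    """Build ['-R', 'a,b,...'] after checking every engine name."""
    if not engines:
        raise ValueError("at least one routing engine is required")
    unknown = [name for name in engines if name not in SUPPORTED_ENGINES]
    if unknown:
        raise ValueError(f"unsupported routing engine(s): {', '.join(unknown)}")
    return ["-R", ",".join(engines)]
```

routing_engine_arg(["ftree", "updn"]) yields ["-R", "ftree,updn"], so updn is tried if ftree fails.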
--do_mesh_analysis
This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes which may reduce the number of SLs required to give a deadlock free routing.
--lash_start_vl <vl number>
This option sets the starting VL to use for the lash routing algorithm. Defaults to 0.
-A, --ucast_cache
This option enables the unicast routing cache and prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require a new routing calculation, e.g. when one or more CAs/RTRs/leaf switches go down, or when one or more of these nodes come back after being down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
-z, --connect_roots
This option forces the routing engines (up/down and fat-tree) to provide connectivity between root switches and in this way to be fully IBA compliant. In many cases this can violate the "pure" deadlock-free algorithm, so use it carefully.
-M, --lid_matrix_file <file name>
This option specifies the name of the lid matrix dump file from which switch lid matrices (min hops tables) will be loaded.
-U, --lfts_file <file name>
This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.
-S, --sadb_file <file name>
This option specifies the name of the SA DB dump file from where SA database will be loaded.
-a, --root_guid_file <file name>
Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
-u, --cn_guid_file <file name>
Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
-G, --io_guid_file <file name>
Set the I/O nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line). I/O nodes are non-CN nodes allowed to use up to max_reverse_hops switches the wrong way around to improve connectivity.
-H, --max_reverse_hops <number>
Set the maximum number of reverse hops an I/O node is allowed to make. A reverse hop is the use of a switch the wrong way around.
-m, --ids_guid_file <file name>
Name of the map file with set of the IDs which will be used by Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).
-X, --guid_routing_order_file <file name>
Set the order port guids will be routed for the MinHop and Up/Down routing algorithms to the guids provided in the given file (one to a line).
-o, --once
This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.
-s, --sweep <interval value>
This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.
-t, --timeout <value>
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
--retries <number>
This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.
-maxsmps <number>
This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying -maxsmps 0 allows unlimited outstanding SMPs. Without -maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.
-console [off | local | socket | loopback]
This option brings up the OpenSM console (default off). Note that the socket and loopback options will only be available if OpenSM was built with --enable-console-socket.
-console-port <port>
Specify an alternate telnet port for the socket console (default 10000). Note that this option only appears if OpenSM was built with --enable-console-socket.
-i, -ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by node guid and port number) that will be ignored by the link load equalization algorithm.
-w, --hop_weights_file <path to file>
This option provides weighting factors per port representing a hop cost in computing the lid matrix. The file consists of lines containing a switch port GUID (specified as a 64 bit hex number, with leading 0x), output port number, and weighting factor. Any port not listed in the file defaults to a weighting factor of 1. Lines starting with # are comments. Weights affect only the output route from the port, so many useful configurations will require weights to be specified in pairs.
-O, --dimn_ports_file <path to file>
This option provides a mapping between hypercube dimensions and ports on a per switch basis for the DOR routing engine. The file consists of lines containing a switch node GUID (specified as a 64 bit hex number, with leading 0x) followed by a list of non-zero port numbers, separated by spaces, one switch per line. The order for the port numbers is in one to one correspondence to the dimensions. Ports not listed on a line are assigned to the remaining dimensions, in port order. Anything after a # is a comment.
-x, --honor_guid2lid
This option forces OpenSM to honor the guid2lid file, when it comes out of Standby state, if such file exists under OSM_CACHE_DIR, and is valid. By default, this is FALSE.
-f, --log_file <file name>
This option defines the log to be the given file. By default, the log goes to %windir%\temp\opensm.log. For the log to go to standard output use -f stdout.
-L, --log_limit <size in MB>
This option defines maximal log file size in MB. When specified the log file will be truncated upon reaching this limit.
-e, --erase_log_file
This option will cause deletion of the log file (if it already exists). By default, new messages are appended to the existing log file.
-P, --Pconfig <partition config file>
This option defines the optional partition configuration file. The default name is %ProgramFiles%\OFED\opensm\partitions.conf.
--prefix_routes_file <file name>
Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. By default, the SA fails such queries. The PREFIX ROUTES section below describes the format of the configuration file. The default path is %ProgramFiles%\OFED\opensm\prefix-routes.conf.
-Q, --qos
This option enables QoS setup. It is disabled by default.
-Y, --qos_policy_file <file name>
This option defines the optional QoS policy file. The default name is %ProgramFiles%\OFED\opensm\qos-policy.conf. See QoS_management_in_OpenSM.txt in opensm doc for more information on configuring QoS policy via this file.
-N, --no_part_enforce
This option disables partition enforcement on switch external ports.
-y, --stay_on_fatal
This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated guids or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.
-B, --service
OpenSM will run in the background (without a console window) as a Windows system service (the preferred Windows mode).
-I, --inactive
Start SM in inactive rather than init SM state. This option can be used in conjunction with the perfmgr so as to run a standalone performance manager without SM/SA. However, this is NOT currently implemented in the performance manager.
-perfmgr
Enable the perfmgr. Only takes effect if --enable-perfmgr was specified at configure time. See performance-manager-HOWTO.txt in opensm doc for more information on running perfmgr.
-perfmgr_sweep_time_s <seconds>
Specify the sweep time for the performance manager in seconds (default is 180 seconds). Only takes effect if --enable-perfmgr was specified at configure time.
--consolidate_ipv6_snm_req
Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.
-log_prefix <prefix text>
This option specifies the prefix to the syslog messages from OpenSM. A suitable prefix can be used to identify the IB subnet in syslog messages when two or more instances of OpenSM run in a single node to manage multiple fabrics. For example, in a dual-fabric (or dual-rail) IB cluster, the prefix for the first fabric could be "mpi" and the other fabric could be "storage".
-v, --verbose
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.
-V
This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.
-D <value>
This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:

 BIT    LOG LEVEL ENABLED
 ----   -----------------
 0x01 - ERROR (error messages)
 0x02 - INFO (basic messages, low volume)
 0x04 - VERBOSE (interesting stuff, moderate volume)
 0x08 - DEBUG (diagnostic, high volume)
 0x10 - FUNCS (function entry/exit, very high volume)
 0x20 - FRAMES (dumps all SMP and GMP frames)
 0x40 - ROUTING (dump FDB routing information)
 0x80 - currently unused.

Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.

-d, --debug <value>
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:

 OPT   Description
 ---    -----------------
 -d0  - Ignore other SM nodes
 -d1  - Force single threaded dispatching
 -d2  - Force log flushing after each log message
 -d3  - Disable multicast support
-h, --help
Display this usage info then exit.
-?
Display this usage info then exit.
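
The flags field taken by the -D option is a plain bit mask, so a value can be composed or decoded mechanically. The following Python sketch is illustrative only and is not part of OpenSM. The names LOG_LEVELS, decode_log_flags and encode_log_flags are hypothetical; the bit values are the ones listed in the -D table above.

```python
# Log-level bits as listed in the -D table (0x80 is currently unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags):
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]


def encode_log_flags(names):
    """Return the -D flags value that enables the given log levels."""
    by_name = {name: bit for bit, name in LOG_LEVELS.items()}
    flags = 0
    for name in names:
        flags |= by_name[name.upper()]
    return flags


if __name__ == "__main__":
    print(decode_log_flags(0x03))                      # the documented default
    print(hex(encode_log_flags(["error", "info", "debug"])))
```

For example, enabling ERROR, INFO and DEBUG together corresponds to -D 0x0B.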

ENVIRONMENT VARIABLES

The following environment variables control opensm behavior:

NOTES

When opensm is running as a Windows service and the opensm process receives a service control code of 129, it starts a new heavy sweep as if a trap had been received or a topology change had been found.

Also, service control code 128 can be used to trigger a reopen of %windir%\temp\osm.log for logrotate purposes.
 

Examples:

    sc.exe control OpenSM 128            # will clear the contents of %windir%\temp\osm.log, logrotate.
    sc.exe control OpenSM 129            # start a new heavy sweep

 

PARTITION CONFIGURATION

The default name of the OpenSM partitions configuration file is %ProgramFiles%\OFED\OpenSM\partitions.conf. The default may be changed by using the --Pconfig (-P) option with OpenSM.

The default partition will be created by OpenSM unconditionally even when partition configuration file does not exist or cannot be accessed.

The default partition has P_Key value 0x7fff. OpenSM's port will always have full membership in default partition. All other end ports will have full membership if the partition configuration file is not found or cannot be accessed, or limited membership if the file exists and can be accessed but there is no rule for the Default partition.

Effectively, this amounts to the same as if one of the following rules appeared in the partition configuration file.

In the case of no rule for the Default partition:

Default=0x7fff : ALL=limited, SELF=full ;

In the case of no partition configuration file or file cannot be accessed:

Default=0x7fff : ALL=full ;

File Format

Comments:

Anything that follows a '#' character on a line is a comment and is ignored by the parser.

General file format:

<Partition Definition>:<PortGUIDs list> ;

Partition Definition:

[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

 PartitionName - string, will be used with logging. When omitted
                 empty string will be used.
 PKey          - P_Key value for this partition. Only low 15 bits will
                 be used. When omitted will be autogenerated.
 flag          - used to indicate IPoIB capability of this partition.
 defmember=full|limited - specifies default membership for port guid
                 list. Default is limited.

Currently recognized flags are:

 ipoib       - indicates that this partition may be used for IPoIB, as
               result IPoIB capable MC group will be created.
 rate=<val>  - specifies rate for this IPoIB MC group
               (default is 3 (10 Gbps))
 mtu=<val>   - specifies MTU for this IPoIB MC group
               (default is 4 (2048))
 sl=<val>    - specifies SL for this IPoIB MC group
               (default is 0)
 scope=<val> - specifies scope for this IPoIB MC group
               (default is 2 (link local)).  Multiple scope settings
               are permitted for a partition.

Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).

PortGUIDs list:

 PortGUID         - GUID of partition member EndPort. Hexadecimal
                    numbers should start from 0x, decimal numbers
                    are accepted too.
 full or limited  - indicates full or limited membership for this
                    port.  When omitted (or unrecognized) limited
                    membership is assumed.

There are several useful keywords for PortGUID definition:

 - 'ALL' means all end ports in this subnet.
 - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
 - 'ALL_SWITCHES' means all Switch end ports in this subnet.
 - 'ALL_ROUTERS' means all Router end ports in this subnet.
 - 'SELF' means subnet manager's port.

Empty list means no ports in this partition.

Notes:

White space is permitted between delimiters ('=', ',',':',';').

A line can be wrapped after the ':' that follows the Partition Definition and between entries of the PortGUIDs list.

PartitionName does not need to be unique, but PKey does. If a PKey is repeated, those partition configurations will be merged and the first PartitionName will be used (see also next note).

It is possible to split partition configuration in more than one definition, but then PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions).

Examples:

 Default=0x7fff : ALL, SELF=full ;
 Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;

 NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;

 YetAnotherOne = 0x300 : SELF=full ;
 YetAnotherOne = 0x300 : ALL=limited ;

 ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
 # 0x123453, 0x123454 will be limited
 ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
 # 0x123456, 0x123457 will be limited
 ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
 ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
 ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

Note:

The following rule is equivalent to how OpenSM used to run prior to the partition manager:

 Default=0x7fff,ipoib:ALL=full;  
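
To make the file format concrete, the following Python sketch parses a single rule of the general form '<Partition Definition>:<PortGUIDs list> ;'. It is an illustration only, not the OpenSM parser. The function name parse_partition and the returned dictionary layout are hypothetical. It assumes one rule per string, does not handle wrapped lines or repeated flags such as multiple scope settings, and treats an omitted or unrecognized membership as the defmember value (which is 'limited' unless stated otherwise).

```python
def parse_partition(line):
    """Parse one '<Partition Definition> : <PortGUIDs list> ;' rule."""
    line = line.split("#", 1)[0].strip()        # drop a trailing comment
    if line.endswith(";"):
        line = line[:-1]
    definition, _, guid_list = line.partition(":")

    fields = [field.strip() for field in definition.split(",")]
    name, _, pkey_text = fields[0].partition("=")
    pkey_text = pkey_text.strip()
    # int(text, 0) accepts 0x-prefixed hex as well as decimal numbers.
    # Only the low 15 bits of the P_Key are used.
    pkey = (int(pkey_text, 0) & 0x7FFF) if pkey_text else None

    flags, defmember = {}, "limited"
    for field in fields[1:]:
        key, _, value = field.partition("=")
        key, value = key.strip(), value.strip()
        if key == "defmember":
            defmember = value
        elif key:
            flags[key] = value or True          # e.g. 'ipoib' has no value

    members = []
    for entry in guid_list.split(","):
        port, _, membership = entry.partition("=")
        port, membership = port.strip(), membership.strip()
        if not port:
            continue                            # empty list: no ports
        if membership not in ("full", "limited"):
            membership = defmember              # omitted or unrecognized
        members.append((port, membership))

    return {"name": name.strip(), "pkey": pkey, "flags": flags,
            "defmember": defmember, "members": members}


if __name__ == "__main__":
    print(parse_partition("Default=0x7fff : ALL=limited, SELF=full ;"))
```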

QOS CONFIGURATION

There are a set of QoS related low-level configuration parameters. All these parameter names are prefixed by "qos_" string. Here is a full list of these parameters:

 qos_max_vls    - The maximum number of VLs that will be on the subnet
 qos_high_limit - The limit of High Priority component of VL
                  Arbitration table (IBA 7.6.9)
 qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
                  template
 qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
                  template
                  Both VL arbitration templates are pairs of
                  VL and weight
 qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
                  a list of VLs corresponding to SLs 0-15 (Note
                  that VL15 used here means drop this SL)

Typical default values (hard-coded in OpenSM initialization) are:

 qos_max_vls 15
 qos_high_limit 0
 qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
 qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
 qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

The syntax is compatible with rest of OpenSM configuration options and values may be stored in OpenSM config file (cached options file).
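
The template values above are simple comma-separated lists. As a hedged illustration (not OpenSM code), the following Python sketch reads a VL arbitration template as pairs of VL and weight, and an SL2VL template as a list indexed by SL. The function names parse_vlarb, parse_sl2vl and dropped_sls are hypothetical.

```python
def parse_vlarb(text):
    """Parse a VL arbitration template 'VL:weight,...' into (vl, weight) pairs."""
    pairs = []
    for item in text.split(","):
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(text):
    """Parse an SL2VL template into a list of VLs indexed by SL."""
    return [int(vl) for vl in text.split(",") if vl.strip()]


def dropped_sls(sl2vl):
    """Return the SLs mapped to VL15, which means 'drop this SL'."""
    return [sl for sl, vl in enumerate(sl2vl) if vl == 15]


if __name__ == "__main__":
    low = parse_vlarb("0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,"
                      "8:4,9:4,10:4,11:4,12:4,13:4,14:4")
    sl2vl = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
    print(low[:3], len(sl2vl), dropped_sls(sl2vl))
```

With the default qos_sl2vl shown above, SL 15 is mapped to VL 7 and no SL is dropped.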

In addition to the above, we may define separate QoS configuration parameters sets for various target types. As targets, we currently support CAs, routers, switch external ports, and switch's enhanced port 0. The names of such specialized parameters are prefixed by "qos_<type>_" string. Here is a full list of the currently supported sets:

 qos_ca_  - QoS configuration parameters set for CAs.
 qos_rtr_ - parameters set for routers.
 qos_sw0_ - parameters set for switches' port 0.
 qos_swe_ - parameters set for switches' external ports.

Examples:
 qos_sw0_max_vls=2
 qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
 qos_swe_high_limit=0  

PREFIX ROUTES

Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. By default, the SA fails such queries. Note that IBA does not specify how the SA should obtain off-subnet path record information. The prefix routes configuration is meant as a stop-gap until the specification is completed.

Each line in the configuration file is a 64-bit prefix followed by a 64-bit GUID, separated by white space. The GUID specifies the router port on the local subnet that will handle the prefix. Blank lines are ignored, as is anything between a # character and the end of the line. The prefix and GUID are both in hex, the leading 0x is optional. Either, or both, can be wild-carded by specifying an asterisk instead of an explicit prefix or GUID.

When responding to a path record query for an off-subnet DGID, opensm searches for the first prefix match in the configuration file. Therefore, the order of the lines in the configuration file is important: a wild-carded prefix at the beginning of the configuration file renders all subsequent lines useless. If there is no match, then opensm fails the query. It is legal to repeat prefixes in the configuration file, opensm will return the path to the first available matching router. A configuration file with a single line where both prefix and GUID are wild-carded means that a path record query specifying any off-subnet DGID should return a path to the first available router. This configuration yields the same behavior formerly achieved by compiling opensm with -DROUTER_EXP which has been obsoleted.
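
The matching rules above can be sketched in a few lines of Python. This is an illustration of the described behavior under stated assumptions, not the OpenSM implementation. The names parse_prefix_routes and lookup are hypothetical, the list of available router port GUIDs is assumed to be supplied by the caller, and a line whose router is unavailable is assumed to be skipped in favor of later matching lines.

```python
def _parse_field(token):
    """Return None for a wild card, otherwise the hex value (0x optional)."""
    return None if token == "*" else int(token, 16)


def parse_prefix_routes(text):
    """Return a list of (prefix, guid) tuples in file order."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()    # strip comments
        if not line:
            continue                            # blank lines are ignored
        prefix, guid = line.split()
        routes.append((_parse_field(prefix), _parse_field(guid)))
    return routes


def lookup(routes, dgid_prefix, available_routers):
    """Return the router port GUID for an off-subnet prefix, or None."""
    for prefix, guid in routes:
        if prefix is not None and prefix != dgid_prefix:
            continue                            # prefix does not match
        if guid is None:
            if available_routers:
                return available_routers[0]     # first available router
        elif guid in available_routers:
            return guid
    return None                                 # no match: the query fails
```

A file consisting of the single line '* *' therefore returns the first available router for any off-subnet prefix.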

ROUTING

OpenSM now offers five routing engines:

1. Min Hop Algorithm - based on the minimum hops to each node where the path length is optimized.

2. UPDN Unicast routing algorithm - also based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and deadlock may occur due to a loop in the subnet.

3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing for congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types, not just K-ary-N-Trees: non-constant K, not fully staffed, any Constant Bisectional Bandwidth (CBB) ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.

4. LASH unicast routing algorithm - uses Infiniband virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free topology-agnostic routing algorithm to the non-minimal UPDN algorithm avoiding the use of a potentially congested root node.

5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock free routes for hypercubes when the fabric is cabled as a hypercube and for meshes when cabled as a mesh (see details below).

OpenSM also supports a file method which can load routes from a table. See 'Modular Routing Engine' for more information on this.

The basic routing algorithm is comprised of two stages:

1. MinHop matrix calculation
   How many hops are required to get from each port to each LID ?
   The algorithm to fill these tables is different if you run standard (min hop) or Up/Down.
   For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches
   For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would go up after a down step was used.

2. Once MinHop matrices exist, each switch is visited and for each target LID a decision is made as to what port should be used to get to that LID.
   This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it.
   When there are multiple alternative ports with the same MinHop to a LID, the one with the fewest previously assigned LIDs is selected.
   If LMC > 0, more checks are added: Within each group of LIDs assigned to same target port,
   a. use only ports which have same MinHop
   b. first prefer the ones that go to a different systemImageGuid (than the previous LID of the same LMC group)
   c. if none - prefer those which go through another NodeGuid
   d. fall back to the number of paths method (if all go to same node).
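
The two stages above can be sketched for the LMC = 0 case as follows. This Python fragment is a simplified illustration, not the OpenSM code: it models switches only, uses a plain BFS in place of the "relaxation" algorithm to obtain the same hop counts, and breaks load ties by the lowest port number. The names min_hops and build_lfts and the data layout (links maps each switch to a {port: neighbor switch} dictionary, lid_to_switch maps each target LID to the switch it hangs off) are assumptions made for the example.

```python
from collections import deque


def min_hops(links, target):
    """Stage 1: hop count from every reachable switch to the target switch."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        switch = queue.popleft()
        for neighbor in links[switch].values():
            if neighbor not in hops:
                hops[neighbor] = hops[switch] + 1
                queue.append(neighbor)
    return hops


def build_lfts(links, lid_to_switch):
    """Stage 2: per switch and target LID, pick the least-loaded min-hop port."""
    load = {sw: {port: 0 for port in ports} for sw, ports in links.items()}
    lfts = {sw: {} for sw in links}
    for lid in sorted(lid_to_switch):
        hops = min_hops(links, lid_to_switch[lid])
        for sw in links:
            if hops.get(sw, 0) == 0:
                continue            # LID is local to this switch, or unreachable
            candidates = [port for port, neighbor in links[sw].items()
                          if hops.get(neighbor) == hops[sw] - 1]
            port = min(candidates, key=lambda p: (load[sw][p], p))
            load[sw][port] += 1     # counter of target LIDs through this port
            lfts[sw][lid] = port
    return lfts
```

On a fabric with two leaf switches that are each cabled to the same two spine switches, this spreads the LIDs behind the remote leaf evenly across the two uplinks.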

Effect of Topology Changes

OpenSM will preserve existing routing in any case where there is no change in the fabric switches unless the -r (--reassign_lids) option is specified.

-r
--reassign_lids
          This option causes OpenSM to reassign LIDs to all
          end nodes. Specifying -r on a running subnet
          may disrupt subnet traffic.
          Without -r, OpenSM attempts to preserve existing
          LID assignments resolving multiple use of same LID.

If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.

When file-based routing is used, any topology changes are currently ignored. The 'file' routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the 'file' routing engine (this uses min hop tables).

Min Hop Algorithm

The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'.

The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch and LFT output port assignment. Link subscription is also equalized with the ability to override based on port GUID. The latter is supplied by:

-i <equalize-ignore-guids-file>
-ignore-guids <equalize-ignore-guids-file>
          This option provides the means to define a set of ports
          (by guid) that will be ignored by the link load
          equalization algorithm. Note that only endports (CA,
          switch port 0, and router ports) and not switch external
          ports are supported.

LMC-aware routing is performed on a (remote) system or switch basis.

Purpose of UPDN Algorithm

The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree, and one of its loops may experience a deadlock (due, for example, to high pressure).

The UPDN algorithm is based on the following main stages:

1. Auto-detect root nodes - based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.


    Note 1: The user can override the node list manually.
    Note 2: If this stage cannot find any root nodes, and the user did
            not specify a guid list file, OpenSM defaults back to the
            Min Hop routing algorithm.

2. Ranking process - All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.

3. Min Hop Table setting - after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.

At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
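
The ranking stage and the rule it enforces can be illustrated with a short Python sketch. It is a simplified model, not the OpenSM implementation: links is assumed to map each node to a list of its neighbors, the function names are hypothetical, and links between nodes of equal rank are ordered by comparing node identifiers as a stand-in for the GUID comparison mentioned above.

```python
from collections import deque


def rank_switches(links, roots):
    """Rank roots as 0 and every other node by BFS distance to the nearest root."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def is_up(src, dst, rank):
    """A step is 'up' when it moves toward a lower rank."""
    if rank[dst] != rank[src]:
        return rank[dst] < rank[src]
    return dst < src                # equal rank: tie-break by identifier


def obeys_updn(path, rank):
    """A path is legal if it never goes up after having gone down."""
    gone_down = False
    for src, dst in zip(path, path[1:]):
        if is_up(src, dst, rank):
            if gone_down:
                return False
        else:
            gone_down = True
    return True
```

For two leaf switches below two root switches, a leaf-root-leaf path is accepted, while a root-leaf-root path is rejected because it goes up after a down step.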

Note: Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.

UPDN Algorithm Usage

Activation through OpenSM

Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm. Use '-a <root_guid_file>' to add a UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.

Notes on the guid list file:

1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.

Fat-tree Routing Algorithm

The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees: it also handles non-constant K, cases where not all leaves (CAs) are present, and any CBB ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks.

If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be pure fat-tree that complies with the following rules:
  - Tree rank should be between two and eight (inclusively)
  - Switches of the same rank should have the same number
    of UP-going port groups*, unless they are root switches,
    in which case they shouldn't have UP-going ports at all.
  - Switches of the same rank should have the same number
    of DOWN-going port groups, unless they are leaf switches.
  - Switches of the same rank should have the same number
    of ports in each UP-going port group.
  - Switches of the same rank should have the same number
    of ports in each DOWN-going port group.
  - All the CAs have to be at the same tree level (rank).

If the root guid file is provided, the topology doesn't have to be pure fat-tree, and it should only comply with the following rules:
  - Tree rank should be between two and eight (inclusively)
  - All the Compute Nodes** have to be at the same tree level (rank).
    Note that non-compute node CAs are allowed here to be at different
    tree ranks.

* ports that are connected to the same remote switch are referenced as 'port group'.

** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file' OpenSM options.

Topologies that do not comply cause a fallback to min hop routing. Note that this can also occur on link failures which cause the topology to no longer be "pure" fat-tree.

Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.

The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that matches the routing tables.

Routing between non-CN nodes

The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way round a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree.

Please note that using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them! It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can in theory cause credit loops.

Use these options with extreme care!

Activation through OpenSM

Use the '-R ftree' option to activate the fat-tree algorithm. Use '-a <root_guid_file>' to provide root nodes for ranking. If the '-a' option is not used, the routing algorithm will detect roots automatically. Use '-u <root_cn_file>' to provide the list of compute nodes. If the '-u' option is not used, all the CAs are considered as compute nodes.

Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.

LASH Routing Algorithm

LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology agnostic deadlock-free routing within communication networks.

When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources / destinations and groups these paths into virtual layers in such a way as to avoid deadlock.

Note that LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between switch and HCA.

In more detail, the algorithm works as follows:

1) LASH determines the shortest-path between all pairs of source / destination switches. Note, LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.

2) LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.

3) Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.
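
The layer assignment of step 2 can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions and is not the LASH code in opensm: routes are given as lists of switches, a channel is a directed link between two switches, each route is placed in the first existing layer whose channel dependency graph stays acyclic, and the pairing of SRC/DST with DST/SRC (step 1) and the balancing of step 3 are omitted. All function names are hypothetical.

```python
def route_dependencies(path):
    """Return the channel dependencies (channel -> next channel) of a route."""
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))


def has_cycle(edges):
    """Depth-first search for a cycle in a channel dependency graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {}                      # 1 = on the current DFS path, 2 = finished

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1 or (nxt not in state and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in graph)


def assign_layers(paths):
    """Return, for each route, the index of the layer (SL) it was assigned to."""
    layers = []                     # one set of dependency edges per layer
    assignment = []
    for path in paths:
        deps = set(route_dependencies(path))
        for index, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps       # the route fits without causing deadlock
                assignment.append(index)
                break
        else:
            layers.append(deps)     # open a new layer
            assignment.append(len(layers) - 1)
    return assignment
```

On a ring of four switches with four clockwise two-hop routes, the first three routes share one layer and the fourth, which would close the dependency cycle, is placed in a second layer.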

Note, the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available.

In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies. It is also topology agnostic and fares well in the face of faults.

It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node and always routes shortest-path.

The algorithm was developed by Simula Research Laboratory.

Use the '-R lash -Q' option to activate the LASH algorithm.

Note: QoS support has to be turned on in order that SL/VL mappings are used.

Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.

For open regular cartesian meshes the DOR algorithm is the ideal routing algorithm. For toroidal meshes on the other hand there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. An option exists for LASH to do this. To invoke this use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh it reorders the ports in dimension order before the rest of the LASH algorithm runs.

DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Alternatively, the -O option can be used to assign a custom mapping between the ports on a given switch and the associated dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable, or the -O option used to accomplish the alignment. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension, or the -O option used as an override.

Use '-R dor' option to activate the DOR algorithm.

Routing References

To learn more about deadlock-free routing, see the article "Deadlock Free Message Routing in Multiprocessor Interconnection Networks" by William J Dally and Charles L Seitz (1985).

To learn more about the up/down algorithm, see the article "Effective Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad Politecnica de Valencia.

To learn more about LASH and the flexibility behind it, the requirement for layers, performance comparisons to other algorithms, see the following articles:

"Layered Routing in Irregular Networks", Lysne et al, IEEE Transactions on Parallel and Distributed Systems, VOL.16, No12, December 2005.

"Routing for the ASI Fabric Manager", Solheim et al. IEEE Communications Magazine, Vol.44, No.7, July 2006.

"Layered Shortest Path (LASH) Routing in Irregular System Area Networks", Skeie et al. IEEE Computer Society Communication Architecture for Clusters 2002.

Modular Routing Engine

The modular routing engine structure makes it easy to "plug in" new routing modules.

Currently, only unicast callbacks are supported. Multicast can be added later.

One existing routing module is up-down "updn", which may be activated with '-R updn' option (instead of old '-u').

General usage is: $ opensm -R 'module-name'

There is also a trivial routing module which is able to load LFT tables from a file.

Main features:

 - this will load switch LFTs and/or LID matrices (min hops tables)
 - this will load switch LFTs according to the path entries introduced
   in the file
 - no additional checks will be performed (such as "is port connected",
   etc.)
 - in case when fabric LIDs were changed this will try to reconstruct
   LFTs correctly if endport GUIDs are represented in the file
   (in order to disable this, GUIDs may be removed from the file
    or zeroed)

The file format is compatible with the output of the 'ibroute' utility, and a file for the whole fabric can be generated with the dump_lfts.sh script.

To activate file based routing module, use:   opensm -R file -U /path/to/lfts_file

If the lfts_file is not found or is in error, the default routing algorithm is utilized.

The ability to dump switch lid matrices (aka min hops tables) to file and later to load these is also supported.

The usage is similar to unicast forwarding tables loading from a lfts file (introduced by 'file' routing engine), but new lid matrix file name should be specified by -M or --lid_matrix_file option. For example:

  opensm -R file -M ./opensm-lid-matrix.dump

The dump file is named 'opensm-lid-matrix.dump' and will be generated in standard opensm dump directory (%TEMP% by default) when OSM_LOG_ROUTING logging flag is set.

When the routing engine 'file' is activated but the lfts file is not specified or cannot be opened, the default lid matrix algorithm will be used.

There is also a switch forwarding tables dumper which generates a file compatible with dump_lfts.sh output. This file can be used as input for forwarding tables loading by 'file' routing engine. Both or one of options -U and -M can be specified together with '-R file'.

FILES

%ProgramFiles%\OFED\OpenSM\opensm.conf
default OpenSM config file.

%ProgramFiles%\OFED\OpenSM\ib-node-name-map
default node name map file. See ibnetdiscover for more information on format.

%ProgramFiles%\OFED\OpenSM\partitions.conf
default partition config file

%ProgramFiles%\OFED\OpenSM\qos-policy.conf
default QOS policy config file

%ProgramFiles%\OFED\OpenSM\prefix-routes.conf
default prefix routes file.

AUTHORS

Hal Rosenstock
<hal.rosenstock@gmail.com>
Sasha Khapyorsky
<sashak@voltaire.com>
Eitan Zahavi
<eitan@mellanox.co.il>
Yevgeny Kliteynik
<kliteyn@mellanox.co.il>
Thomas Sodring
<tsodring@simula.no>
Ira Weiny
<weiny2@llnl.gov>
Stan Smith
<stan.smith@intel.com>
Dale Purdy
<purdy@sgi.com>

Osmtest - Subnet Management Tests

osmtest - InfiniBand subnet manager and administration (SM/SA) test program

osmtest currently cannot run on the same HCA port that OpenSM is using.

SYNOPSIS

osmtest [-f(low) <c|a|v|s|e|f|m|q|t>] [-w(ait) <trap_wait_time>] [-d(ebug) <number>] [-m(ax_lid) <LID in hex>] [-g(uid)[=]<GUID in hex>] [-p(ort)] [-i(nventory) <filename>] [-s(tress)] [-M(ulticast_Mode)] [-t(imeout) <milliseconds>] [-l | --log_file] [-v] [-vf <flags>] [-h(elp)]

DESCRIPTION

osmtest is a test program used to validate the correct operation of the InfiniBand subnet manager and administration (SM/SA).

Default is to run all flows with the exception of the QoS flow.

osmtest provides a test suite for opensm.

osmtest has the following capabilities and testing flows:

 - It creates an inventory file of all available Nodes, Ports, and PathRecords, including all their fields.
 - It verifies the existing inventory, with all the object fields, and matches it to a pre-saved one.
 - A Multicast Compliancy test.
 - An Event Forwarding test.
 - A Service Record registration test.
 - An RMPP stress test.
 - A Small SA Queries stress test.

It is recommended that after installing opensm, the user should run "osmtest -f c" to generate the inventory file, and immediately afterwards run "osmtest -f a" to test OpenSM.

Another recommendation for osmtest usage is to create the inventory when the IB fabric is stable, and occasionally run "osmtest -f v" to verify that nothing has changed.

OPTIONS

-f, --flow
This option directs osmtest to run a specific flow:
 FLOW  DESCRIPTION
 c = create an inventory file with all nodes, ports and paths
 a = run all validation tests (expecting an input inventory)
 v = only validate the given inventory file
 s = run service registration, deregistration, and lease test
 e = run event forwarding test
 f = flood the SA with queries according to the stress mode
 m = multicast flow
 q = QoS info: dump VLArb and SLtoVL tables
 t = run trap 64/65 flow (this flow requires running of external tool)
 (default is all flows except QoS)
-w, --wait
This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t, the trap 64/65 flow (defaults to 10 sec).
-d, --debug
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:

 OPT   Description
 ---    -----------------
 -d0  - Ignore other SM nodes
 -d1  - Force single threaded dispatching
 -d2  - Force log flushing after each log message
 -d3  - Disable multicast support

-m, --max_lid
This option specifies the maximal LID number to be searched for during inventory file build (default to 100)
-g, --guid
This option specifies the local port GUID value with which osmtest should bind. osmtest may be bound to 1 port at a time. If the GUID given is 0, osmtest displays a list of possible port GUIDs and waits for user input. Without -g, osmtest tries to use the default port.
-p, --port
This option displays a menu of possible local port GUID values with which osmtest could bind
-i, --inventory
This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file 'osmtest.dat'. See the '-f c' flow for related information.
-s, --stress
This option runs the specified stress test instead of the normal test suite. Stress test options are as follows:

 OPT    Description
 ---    -----------------
 -s1  - Single-MAD (RMPP) response SA queries
 -s2  - Multi-MAD (RMPP) response SA queries
 -s3  - Multi-MAD (RMPP) Path Record SA queries
 -s4  - Single-MAD (non RMPP) get Path Record SA queries

Without -s, stress testing is not performed

-M, --Multicast_Mode
This option specifies the length of the multicast test:

 OPT    Description
 ---    -----------------
 -M1  - Short Multicast Flow (default) - single mode
 -M2  - Short Multicast Flow - multiple mode
 -M3  - Long Multicast Flow - single mode
 -M4  - Long Multicast Flow - multiple mode

Single mode - osmtest is run alone, with no other applications interacting with OpenSM multicast (MC).

Multiple mode - osmtest can be run alongside other applications that use MC with OpenSM.

Without -M, default flow testing is performed.

-t, --timeout
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, osmtest defaults to a timeout value of 200 milliseconds.
-l, --log_file
This option defines the log to be the given file. By default the log goes to stdout.
-v, --verbose
This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.
-V
This option sets the maximum verbosity level and forces log flushing. The -V is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.
-vf
This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:

 BIT    LOG LEVEL ENABLED
 ----   -----------------
 0x01 - ERROR (error messages)
 0x02 - INFO (basic messages, low volume)
 0x04 - VERBOSE (interesting stuff, moderate volume)
 0x08 - DEBUG (diagnostic, high volume)
 0x10 - FUNCS (function entry/exit, very high volume)
 0x20 - FRAMES (dumps all SMP and GMP frames)
 0x40 - ROUTING (dump FDB routing information)
 0x80 - currently unused.

Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
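
Because the -vf argument is a bitwise OR of the levels in the table above, a flags value can be worked out mechanically. The following Python sketch is only an illustration: the dictionary copies the table, and the helper name vf_flags is a hypothetical choice, not part of osmtest.

```python
# Bit values copied from the -vf table above.
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
}


def vf_flags(*levels):
    """OR together the named log levels and format them for -vf."""
    flags = 0
    for level in levels:
        flags |= LOG_LEVELS[level]
    return "0x%02X" % flags


# Build a command line that logs errors, basic messages and diagnostics.
print("osmtest -f a -vf", vf_flags("ERROR", "INFO", "DEBUG"))
```

With ERROR, INFO and DEBUG selected, the helper yields 0x0B; the default of ERROR + INFO corresponds to 0x03.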

-h, --help
Display this usage info then exit.

AUTHORS

Hal Rosenstock
<hal.rosenstock@gmail.com>
Eitan Zahavi
<eitan@mellanox.co.il>
 

EXAMPLES

Note - osmtest will not run on the node where OpenSM is running.
See 'osmtest -h' for all options.

Functionality:

osmtest -f c            # creates osmtest.dat inventory file in the current directory; required by other osmtest runs.
osmtest -f v            # validate the default inventory file 'osmtest.dat'.
osmtest -f a            # run all validation tests (expecting an input inventory file 'osmtest.dat' in the current folder).

Stress tests

osmtest -f f -s1        # Single-MAD (RMPP) response SA queries
osmtest -f f -s2        # Multi-MAD (RMPP) response SA queries
osmtest -f f -s3        # Multi-MAD (RMPP) Path Record SA queries

ibtrapgen - Generate Infiniband subnet management traps

Usage: ibtrapgen -t|--trap_num <TRAP_NUM> -n|--number <NUM_TRAP_CREATIONS>
                          -r|--rate <TRAP_RATE> -l|--lid <LIDADDR>
                          -s|--src_port <SOURCE_PORT> -p|--port_num <PORT_NUM>

Options: one of the following optional flows:

-t <TRAP_NUM>
--trap_num <TRAP_NUM>
          This option specifies the number of the trap to generate. Valid values are 128-131.
-n <NUM_TRAP_CREATIONS>
--number <NUM_TRAP_CREATIONS>
          This option specifies the number of times to generate this trap.
          If not specified - default to 1.
-r <TRAP_RATE>
--rate <TRAP_RATE>
          This option specifies the rate of the trap generation,
          that is, the time period between one generation and the next.
          The value is given in milliseconds.
          If the number of trap creations is 1 - this value is ignored.
-l <LIDADDR>
--lid <LIDADDR>
          This option specifies the lid address from where the trap should be generated.
-s <SOURCE_PORT>
--src_port <SOURCE_PORT>
          This option specifies the port number from which the trap should
          be generated. If trap number is 128 - this value is ignored (since
          trap 128 is not sent with a specific port number)
-p <PORT_NUM>
--port_num <PORT_NUM>
          This is the port number used for communicating with the SA.
-h
--help
          Display this usage info then exit.
-o
--out_log_file
          This option defines the log to be the given file.
          By default the log goes to stdout.
-v
          This option increases the log verbosity level.
          The -v option may be specified multiple times to further increase the verbosity level.
          See the -x option for more information about log verbosity.
-V
          This option sets the maximum verbosity level and forces log flushing.
          The -V is equivalent to '-vf 0xFF -d 2'.
          See the -x option for more information about log verbosity.
-x <flags>
          This option sets the log verbosity level.
          A flags field must follow the -x option.
          A bit set/clear in the flags enables/disables a
          specific log level as follows:

BIT LOG LEVEL ENABLED
---- -----------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - currently unused.
0x80 - currently unused.
Without -x, ibtrapgen defaults to ERROR + INFO (0x3).
Specifying -x 0 disables all messages.
Specifying -x 0xFF enables all messages (see -V).

IPoIB - Internet Protocols over InfiniBand


IPoIB enables Internet Protocol utilities (e.g., ftp, telnet) to function correctly over an InfiniBand fabric. IPoIB is implemented as an NDIS Miniport driver with a WDM lower edge.

The IPoIB network adapters are located via 'My Computer->Manage->Device Manager->Network adapters->IPoIB'.
'My Network Places->Properties' displays the IPoIB Local Area Connection instances, one per HCA port, and should be used to configure IP addresses for the IPoIB interfaces. The IP (Internet Protocol) address bound to an IPoIB adapter instance can be assigned by DHCP or as a static IP address via
'My Network Places->Properties->Local Area Connection X->Properties->(General Tab)Internet Protocol(TCP/IP)->Properties'.

When the subnet manager (opensm) configures/sweeps the local InfiniBand HCA, the Local Area Connection becomes enabled. If the Local Area Connection is disabled, the subnet manager (opensm) is likely not running or not functioning correctly.

IPoIB Partition Management

Winsock Direct Service Provider


Winsock Direct (WSD) is Microsoft's proprietary protocol that predates SDP (Sockets Direct Protocol) for accelerating TCP/IP applications by using RDMA hardware. Microsoft had a significant role in defining the SDP protocol, hence SDP and WSD are remarkably similar, though unfortunately incompatible.

WSD is made up of two parts, the Winsock Direct switch and the Winsock Direct provider. The WSD switch is in the Winsock DLL that ships in all editions of Windows Server 2003/2008, and is responsible for either routing socket traffic over the regular TCP/IP stack or offloading it to a WSD provider. The WSD provider is a hardware-specific DLL that implements connection management and data transfers over particular RDMA hardware.

OFED WSD is not supported in the Windows XP environment.

The WSD Protocol seamlessly transports TCP data using Infiniband data packets in 'buffered' mode or Infiniband RDMA in 'direct' mode. Either way the user mode socket application sees no behavioral difference in the standard Internet Protocol socket it created other than reduced data transfer times and increased bandwidth.

The Windows OpenFabrics release includes a WSD provider library that has been extensively tested with Microsoft Windows Server 2003/2008.
During testing, bugs were found in the WSD switch that could lead to hangs, crashes, data corruption, and other unwanted behavior. Microsoft released a hotfix to address these issues, which should be installed if using WSD; the Microsoft Windows Server 2003 hotfix can be found here.
Windows Server 2003 (R2) no longer requires this patch, nor does Windows Server 2008.
 

Environment variables can be used to change the behavior of the WSD provider:

IBWSD_NO_READ - Disables RDMA Read operations when set to any value. Note that this variable must be used consistently throughout the cluster or communication will fail.

IBWSD_POLL - Sets the number of times to poll the completion queue after processing completions in response to a CQ event. Reduces latency at the cost of CPU utilization. Default is 500.

IBWSD_SA_RETRY - Sets the number of times to retry SA query requests. Default is 4, can be increased if connection establishment fails.

IBWSD_SA_TIMEOUT - Sets the number of milliseconds to wait before retrying SA query requests. Default is 4, can be increased if connection establishment fails.

IBWSD_NO_IPOIB - SA query timeouts by default allow the connection to be established over IPoIB. Setting this environment variable to any value prevents fall back to IPoIB if SA queries time out.

IBWSD_DBG - Controls debug output when using a debug version of the WSD provider. Takes a hex value, with leading '0x', default value is '0x80000000'

 
0x00000001 DLL
0x00000002 socket info
0x00000004 initialization code
0x00000008 WQ related functions
0x00000010 Endpoints related functions
0x00000020 memory registration
0x00000040 CM (Connection Manager)
0x00000080 connections
0x00000200 socket options
0x00000400 network events
0x00000800 Hardware
0x00001000 Overlapped I/O request
0x00002000 Socket Duplication
0x00004000 Performance Monitoring
0x01000000 More verbose than IBSP_DBG_LEVEL3
0x02000000 More verbose than IBSP_DBG_LEVEL2
0x04000000 More verbose than IBSP_DBG_LEVEL1
0x08000000 Verbose output
0x20000000 Function enter/exit
0x40000000 Warnings
0x80000000 Errors
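
Since IBWSD_DBG is a bit mask, it can help to decode a value back into the categories it enables. The Python sketch below copies the table above; the helper name decode_ibwsd_dbg is a hypothetical choice and is not part of the WSD provider.

```python
# Bit values copied from the IBWSD_DBG table above.
IBWSD_DBG_BITS = {
    0x00000001: "DLL",
    0x00000002: "socket info",
    0x00000004: "initialization code",
    0x00000008: "WQ related functions",
    0x00000010: "Endpoints related functions",
    0x00000020: "memory registration",
    0x00000040: "CM (Connection Manager)",
    0x00000080: "connections",
    0x00000200: "socket options",
    0x00000400: "network events",
    0x00000800: "Hardware",
    0x00001000: "Overlapped I/O request",
    0x00002000: "Socket Duplication",
    0x00004000: "Performance Monitoring",
    0x01000000: "More verbose than IBSP_DBG_LEVEL3",
    0x02000000: "More verbose than IBSP_DBG_LEVEL2",
    0x04000000: "More verbose than IBSP_DBG_LEVEL1",
    0x08000000: "Verbose output",
    0x20000000: "Function enter/exit",
    0x40000000: "Warnings",
    0x80000000: "Errors",
}


def decode_ibwsd_dbg(value):
    """List the debug categories enabled by a hex string such as '0x80000000'."""
    mask = int(value, 16)
    return [name for bit, name in IBWSD_DBG_BITS.items() if mask & bit]


# Errors, warnings and Connection Manager tracing combined.
print(decode_ibwsd_dbg("0xC0000040"))
```

The default value '0x80000000' decodes to errors only; adding 0x40000000 and 0x00000040 also enables warnings and Connection Manager output.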


See https://wiki.openfabrics.org/tiki-index.php?page=Winsock+Direct for the latest WSD status.

Winsock Direct Service Provider Installation

The WSD service is automatically installed, although not enabled, as part of the 'default' installation, except on Windows XP systems, where WSD is not supported.
Manual control is performed via the \Program Files\OFED\installsp.exe utility.

usage: installsp [-i | -r | -l]

-i    Install the Winsock Direct (WSD) service provider
-r    Remove the WSD service provider
-r <name>    Remove the specified service provider
-l    List service providers
 

NetworkDirect Service Provider


NetworkDirect Service Provider Installation

The ND service is automatically installed and started as part of the 'default' installation for Windows Server 2008, Vista or HPC systems.
Manual control is performed via the %windir%\system32\ndinstall.exe utility.

usage: ndinstall [-l] [-i | -r [ServiceProvider]]

where ServiceProvider is 'ibal' or 'winverbs' or blank [blank implies the default Service Provider 'ibal']

-i <name>    Install (enable) the NetworkDirect (ND) Service Provider 'name'
-r <name>    Remove the specified Service Provider 'name'
-l    List all service providers; same as 'ndinstall' with no args.

The Microsoft Network Direct SDK can be downloaded from here.  Once the ND SDK is installed, the ND test programs (nd*.exe) are located in
%ProgramFiles%\Microsoft HPC Pack 2008 SDK\NetworkDirect\Bin\amd64\.

Known working ND test command invocations (loopback or remote host)

svr: ndrpingpong s IPoIB_IPv4_addr 4096 p1
cli: ndrpingpong c IPoIB_IPv4_addr 4096 p1

svr: ndpingpong s IPoIB_IPv4_addr 4096 b1
cli: ndpingpong c IPoIB_IPv4_addr 4096 b1

See ndping.exe /? for details.

Usermode Direct Access Transport and Direct Access Programming Libraries


The DAT (Direct Access Transport) API is a C programming interface developed by the DAT Collaborative in order to provide a set of transport-independent, platform-independent Application Programming Interfaces that exploit the RDMA (remote direct memory access) capabilities of next-generation interconnect technologies such as InfiniBand and iWARP.

OFED uDAT and uDAPL are based on the 2.0 DAT specification. The DAPL (Direct Access Provider Library) now fully supports InfiniBand RDMA and IPoIB.

Previous OFED releases supported the uDAT/uDAPL 1.1 provider which has now been deprecated.
uDAT/uDAPL version 2.0 runtime libraries along with an optional v2.0 application build environment are the only options.
uDAT 2.0 is configured with InfiniBand extensions enabled. The IB extensions include


How DAT objects map to equivalent InfiniBand objects:

 DAT OBJECT                     INFINIBAND OBJECT
 ----------                     -----------------
 Interface Adapter (IA)         HCA (Host Channel Adapter)
 Protection Zone (PZ)           PD (Protection Domain)
 Local Memory Region (LMR)      MR (Memory Region)
 Remote Memory Region (RMR)     MW (Memory Windows)
 Event Dispatcher (EVD)         CQ (Completion Queue)
 Endpoint (EP)                  QP (Queue Pair)
 Public Service Point (PSP)     connection identifier
 Reserved Service Point (RSP)   connection identifier
 Connection Request (CR)        connection manager event


DAT ENVIRONMENT:

DAT/DAPL 2.0 (free-build) libraries are installed in %SystemRoot%\System32 as dat2.dll and dapl2.dll.  Debug versions of the v2.0 runtime libraries are located in '%ProgramFiles%\OFED'.

IA32 (aka, 32-bit) versions of DAT/DAPL 2.0 runtime libraries, found only on 64-bit systems, are identified in '%ProgramFiles%\OFED' as dat32.dll and dapl32.dll.

In order for DAT/DAPL programs to execute correctly, the runtime library files 'dat2.dll and dapl2.dll' must be present in one of the following folders: current directory, %SystemRoot%, %SystemRoot%\System32 or in the library search path.

The default OFED installation places the runtime library files dat2.dll and dapl2.dll in the '%SystemRoot%\System32' folder; symbol files (.pdb) are located in '%ProgramFiles%\OFED\'.

The default DAPL configuration file is defined as '%SystemDrive%\DAT\dat.conf'. This default specification can be overridden by use of the environment variable DAT_OVERRIDE; see the following environment variable discussion.

Within the dat.conf file, the DAPL library specification can be located as the 5th whitespace separated line argument. By default the DAPL library file is installed as '%SystemRoot%\System32\dapl2.dll'.

Should you choose to relocate the DAPL library file to a path where whitespace appears in the full library path specification, then the full library file specification must be contained within double-quotes. A side effect of the double-quotes is that the library specification is treated as a Windows string, which implies the '\' (backslash character) is treated as an 'escape' character.  Hence all backslashes in the library path must be duplicated when enclosed in double-quotes (e.g., "C:\\Program Files\\OFED\\dapl.dll").
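
The quoting rule can be expressed as a short Python sketch. The function name dat_conf_library_field is hypothetical, and the sketch covers only the rule described above (quote when the path contains whitespace, and double every backslash inside the quotes); it does not write or validate a dat.conf file.

```python
def dat_conf_library_field(path):
    """Format a DAPL library path for the 5th field of a dat.conf line."""
    if any(ch.isspace() for ch in path):
        # Quoted form: backslashes act as escapes, so each one is doubled.
        return '"' + path.replace("\\", "\\\\") + '"'
    # No whitespace: the path can be written as-is, without quotes.
    return path


print(dat_conf_library_field(r"C:\Windows\System32\dapl2.dll"))
print(dat_conf_library_field(r"C:\Program Files\OFED\dapl2.dll"))
```

The first path is returned unchanged, while the second comes back enclosed in double-quotes with every backslash doubled.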

A sample InfiniBand dat.conf file is installed as '\Program Files\OFED\dat.conf'.  If dat.conf does not exist in the DAT default configuration folder '%SystemDrive%\DAT\', dat.conf will be copied there.
 

DAPL Providers

DAT 2.0 (free-build) libraries utilize the following user application selectable DAPL providers. Each DAPL provider represents an RDMA hardware interface device type and its Connection Manager.
DAPL providers are listed in the file '%SystemDrive%\DAT\dat.conf'.
The dat.conf InfiniBand DAPL provider names are formatted 'ibnic-HCA#-DAPL_Version-CM_type'.
Example:
    ibnic0v2 - InfiniBand HCA #zero, DAPL version 2.0, (default CM is IBAL).
    ibnic1v2-scm - InfiniBand HCA #one, DAPL version 2.0, CM is 'socket-CM'
    ibnic0v2-cma - InfiniBand HCA #zero, DAPL version 2.0, CM is 'rdma-CM'
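
To show how these names decompose, here is a Python sketch that splits a provider name into its HCA number, DAPL major version and Connection Manager. The regular expression and the function name parse_provider_name are assumptions inferred only from the three examples above; names in an actual dat.conf file may follow other patterns.

```python
import re

# CM suffixes as described in the examples above; no suffix means IBAL.
CM_TYPES = {None: "IBAL", "scm": "socket-CM", "cma": "rdma-CM"}

# Assumed pattern: 'ibnic' + HCA number + 'v' + DAPL version + optional '-cm'.
_PROVIDER_NAME = re.compile(r"ibnic(\d+)v(\d+)(?:-(\w+))?")


def parse_provider_name(name):
    """Return (hca_number, dapl_major_version, cm_type) for an ibnic name."""
    match = _PROVIDER_NAME.fullmatch(name)
    if match is None:
        raise ValueError("not an ibnic provider name: %r" % name)
    hca, version, cm = match.groups()
    # Unknown suffixes are passed through unchanged rather than guessed at.
    return int(hca), int(version), CM_TYPES.get(cm, cm)


for provider in ("ibnic0v2", "ibnic1v2-scm", "ibnic0v2-cma"):
    print(provider, parse_provider_name(provider))
```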

Each non-comment line in the dat.conf file describes a DAPL provider interface.
The 2nd to the last field on the right (7th from the left) describes the ia_device_params (Interface Adapter Device Parameters) (aka, RDMA device) in accordance with the specific DAPL provider specified in the 5th field.

 

DAT application build environment:

DAT library header files are selectively installed in the DAT default configuration folder as '%SystemDrive%\DAT\v2-0'. Your C language based DAT application compilation command line should include '/I%SystemDrive%\DAT\v2-0', with C code referencing '#include <DAT\udat.h>'.

The 'default' DAT/DAPL C language calling convention is '__stdcall', not the 'normal' Visual Studio C compiler default. __stdcall was chosen because Microsoft recommended it as more efficient. An application can freely mix the default C compiler linkage '__cdecl' with '__stdcall'.

Visual Studio 2005 command window - (nmake) Makefile Fragments:

DAT_PATH=%SystemDrive%\DAT\v2-0
CC = cl
INC_FLAGS = /I $(DAT_PATH)

CC_FLAGS= /nologo /Gy /W3 /Gm- /GR- /GF /O2 /Oi /Oy- /D_CRT_SECURE_NO_WARNINGS \
            /D_WIN64 /D_X64_ /D_AMD64_ $(INC_FLAGS)

LINK = link
LIBS = ws2_32.lib advapi32.lib User32.lib bufferoverflowU.lib dat.lib

LINK_FLAGS = /nologo /subsystem:console /machine:X64 /libpath:$(DAT_PATH) $(LIBS)


When linking a DEBUG/Checked version, make sure to use dat2d.lib.

DAT library environment variables:

DAT_OVERRIDE
------------
Value used as the static registry configuration file, overriding the
default location, 'C:\DAT\dat.conf'.

Example: set DAT_OVERRIDE=%SystemDrive%\path\to\my\private.conf


DAT_DBG_LEVEL
-------------

Value specifies which parts of the registry will print debugging
information, valid values are 

DAT_OS_DBG_TYPE_ERROR        = 0x1
DAT_OS_DBG_TYPE_GENERIC      = 0x2
DAT_OS_DBG_TYPE_SR           = 0x4
DAT_OS_DBG_TYPE_DR           = 0x8
DAT_OS_DBG_TYPE_PROVIDER_API = 0x10
DAT_OS_DBG_TYPE_CONSUMER_API = 0x20
DAT_OS_DBG_TYPE_ALL          = 0xff

or any combination of these. For example you can use 0xC to get both 
static and dynamic registry output.

Example: set DAT_DBG_LEVEL=0xC

DAT_DBG_DEST
------------ 

Value sets the output destination, valid values are 

DAT_OS_DBG_DEST_STDOUT = 0x1
DAT_OS_DBG_DEST_SYSLOG = 0x2 
DAT_OS_DBG_DEST_ALL    = 0x3 

For example, 0x3 will output to both stdout and the syslog. 

DAPL Provider library environment variables

DAPL_DBG_TYPE
-------------

Value specifies which parts of the DAPL provider will print debugging information, valid values are

DAPL_DBG_TYPE_ERR          = 0x0001
DAPL_DBG_TYPE_WARN         = 0x0002
DAPL_DBG_TYPE_EVD          = 0x0004
DAPL_DBG_TYPE_CM           = 0x0008
DAPL_DBG_TYPE_EP           = 0x0010
DAPL_DBG_TYPE_UTIL         = 0x0020
DAPL_DBG_TYPE_CALLBACK     = 0x0040
DAPL_DBG_TYPE_DTO_COMP_ERR = 0x0080
DAPL_DBG_TYPE_API          = 0x0100
DAPL_DBG_TYPE_RTN          = 0x0200
DAPL_DBG_TYPE_EXCEPTION    = 0x0400

or any combination of these. For example you can use 0xC to get both
EVD and CM output.

Example: set DAPL_DBG_TYPE=0xC


DAPL_DBG_DEST
-------------

Value sets the output destination, valid values are

DAPL_DBG_DEST_STDOUT = 0x1
DAPL_DBG_DEST_SYSLOG = 0x2
DAPL_DBG_DEST_ALL    = 0x3

For example, 0x3 will output to both stdout and the syslog.


DAPLTEST


    dapltest - test for the Direct Access Provider Library (DAPL) v2.0

DESCRIPTION

    Dapltest is a set of tests developed to exercise, characterize,
    and verify the DAPL interfaces during development and porting.
    At least two instantiations of the test must be run.  One acts
    as the server, fielding requests and spawning server-side test
    threads as needed.  The other invocations act as clients, connecting
    to the Dapltest server and issuing test requests.

    The server side of the test, once invoked, listens continuously
    for client connection requests, until quit or killed.  Upon
    receipt of a connection request, the connection is established,
    the server and client sides swap version numbers to verify that
    they are able to communicate, and the client sends the test
    request to the server.  If the version numbers match, and the
    test request is well-formed, the server spawns the threads
    needed to run the test before awaiting further connections.

USAGE

    dapltest [ -f script_file_name ]
             [ -T S|Q|T|P|L ] [ -D device_name ] [ -d ] [ -R HT|LL|EC|PM|BE ]

    With no arguments, dapltest runs as a server using default values,
    and loops accepting requests from clients.  The -f option allows
    all arguments to be placed in a file, to ease test automation.
    The following arguments are common to all tests:

    [ -T S|Q|T|P|L ]    Test function to be performed:
                            S   - server loop
                            Q   - quit, client requests that server
                                  wait for any outstanding tests to
                                  complete, then clean up and exit
                            T   - transaction test, transfers data between 
                                  client and server
                            P   - performance test, times DTO operations
                            L   - limit test, exhausts various resources,
                                  runs in client w/o server interaction
                        Default: S

    [ -D device_name ]  Specifies the name of the device (interface adapter).
                        Default: host-specific, look for DT_MdepDeviceName
                                 in dapl_mdep.h

    [ -d ]              Enables extra debug verbosity, primarily tracing
                        of the various DAPL operations as they progress.
                        Repeating this parameter increases debug spew.
                        Errors encountered result in the test spewing some
                        explanatory text and stopping; this flag provides
                        more detail about what led up to the error.
                        Default: zero

    [ -R HT|LL|EC|PM|BE ]
                        Indicate the quality of service (QoS) desired.
                        Choices are:
                            HT  - high throughput
                            LL  - low latency
                            EC  - economy (neither HT nor LL)
                            PM  - premium
                            BE  - best effort
                        Default: BE

USAGE - Quit test client

    dapltest [Common_Args] [ -s server_name ]

    Quit testing (-T Q) connects to the server to ask it to clean up and
    exit (after it waits for any outstanding test runs to complete).
    In addition to being more polite than simply killing the server,
    this test exercises the DAPL object teardown code paths.
    There is only one argument other than those supported by all tests:

    -s server_name      Specifies the name of the server interface.
                        No default.


USAGE - Transaction test client

    dapltest [Common_Args] [ -s server_name ]
             [ -t threads ] [ -w endpoints ] [ -i iterations ] [ -Q ] 
             [ -V ] [ -P ] OPclient OPserver [ op3, ... ]

    Transaction testing (-T T) transfers a variable amount of data between 
    client and server.  The data transfer can be described as a sequence of 
    individual operations; that entire sequence is transferred 'iterations' 
    times by each thread over all of its endpoint(s).

    The following parameters determine the behavior of the transaction test:

    -s server_name      Specifies the hostname of the dapltest server.
                        No default.

    [ -t threads ]      Specify the number of threads to be used.
                        Default: 1

    [ -w endpoints ]    Specify the number of connected endpoints per thread.
                        Default: 1

    [ -i iterations ]   Specify the number of times the entire sequence
                        of data transfers will be made over each endpoint.
                        Default: 1000

    [ -Q ]              Funnel completion events into a CNO.
                        Default: use EVDs

    [ -V ]              Validate the data being transferred.
                        Default: ignore the data

    [ -P ]              Turn on DTO completion polling.
                        Default: off

    OP1 OP2 [ OP3, ... ]
                        A single transaction (OPx) consists of:

                        server|client   Indicates who initiates the
                                        data transfer.

                        SR|RR|RW        Indicates the type of transfer:
                                        SR  send/recv
                                        RR  RDMA read
                                        RW  RDMA write
                        Defaults: none

                        [ seg_size [ num_segs ] ]
                                        Indicates the amount and format
                                        of the data to be transferred.
                                        Default:  4096  1
                                                  (i.e., 1 4KB buffer)

                        [ -f ]          For SR transfers only, indicates
                                        that a client's send transfer
                                        completion should be reaped when
                                        the next recv completion is reaped.
                                        Sends and receives must be paired
                                        (one client, one server, and in that
                                        order) for this option to be used.

    Restrictions:  
    
    Due to the flow control algorithm used by the transaction test, there 
    must be at least one SR OP for both the client and the server.  

    Requesting data validation (-V) causes the test to automatically append 
    three OPs to those specified. These additional operations provide 
    synchronization points during each iteration, at which all user-specified 
    transaction buffers are checked. These three appended operations satisfy 
    the "one SR in each direction" requirement.

    The transaction OP list is printed out if -d is supplied.
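
    The seg_size and num_segs arithmetic described above can be sketched
    in C. This is only an illustration of how much payload one iteration
    of an OP list moves on one endpoint; the struct dt_op type and the
    dt_bytes_per_iteration() function are hypothetical names and are not
    part of dapltest. The defaults of 4096 and 1 are the ones documented
    above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one transaction OP; not a dapltest type. */
struct dt_op {
    const char *initiator;  /* "client" or "server" */
    const char *type;       /* "SR", "RR" or "RW" */
    size_t seg_size;        /* 0 means "use the documented default of 4096" */
    size_t num_segs;        /* 0 means "use the documented default of 1" */
};

/* Sum seg_size * num_segs over the OP list, applying the defaults. */
static size_t dt_bytes_per_iteration(const struct dt_op *ops, size_t count)
{
    size_t total = 0;
    size_t i;

    assert(ops != NULL || count == 0);
    for (i = 0; i < count; i++) {
        size_t seg = ops[i].seg_size ? ops[i].seg_size : 4096;
        size_t num = ops[i].num_segs ? ops[i].num_segs : 1;
        total += seg * num;
    }
    return total;
}
```

    For the OP list 'client SR 4096 2 server SR 4096 2', the function
    returns 16384 bytes per iteration per endpoint. Multiplying by the
    -i, -w and -t values gives a rough upper bound on the payload moved
    by a whole run, ignoring the three OPs that -V appends.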

USAGE - Performance test client

    dapltest [Common_Args] -s server_name [ -m p|b ]
             [ -i iterations ] [ -p pipeline ] OP

    Performance testing (-T P) times the transfer of an operation.
    The operation is posted 'iterations' times.

    The following parameters determine the behavior of the performance test:

    -s server_name      Specifies the hostname of the dapltest server.
                        No default.

    [ -m b|p ]          Choose either blocking (b) or polling (p).
                        Default: blocking (b)

    [ -i iterations ]   Specify the number of times the entire sequence
                        of data transfers will be made over each endpoint.
                        Default: 1000

    [ -p pipeline ]     Specify the pipeline length. Valid arguments are in
                        the range [0,MAX_SEND_DTOS]. If a value greater than
                        MAX_SEND_DTOS is requested, the value will be
                        adjusted down to MAX_SEND_DTOS.
                        Default: MAX_SEND_DTOS

    OP
                        An operation consists of:

                        RR|RW           Indicates the type of transfer:
                                        RR  RDMA read
                                        RW  RDMA write
                        Default: none

                        [ seg_size [ num_segs ] ]
                                        Indicates the amount and format
                                        of the data to be transferred.
                                        Default:  4096  1
                                                  (i.e., 1 4KB buffer)
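
    The -p adjustment rule can be sketched as a small C helper. This is
    a minimal illustration, not dapltest source: dt_pipeline_length() is
    a hypothetical name, the max_send_dtos argument stands in for the
    MAX_SEND_DTOS constant (whose value is not given here), and using a
    negative value to mean "-p was not supplied" is an assumption of
    this sketch.

```c
#include <assert.h>

/*
 * Resolve the effective pipeline length for the performance test.
 * requested < 0 is this sketch's way of saying "-p was not given".
 * max_send_dtos stands in for dapltest's MAX_SEND_DTOS constant.
 */
static int dt_pipeline_length(int requested, int max_send_dtos)
{
    assert(max_send_dtos >= 0);
    if (requested < 0)
        return max_send_dtos;   /* documented default */
    if (requested > max_send_dtos)
        return max_send_dtos;   /* adjusted down to the maximum */
    return requested;           /* already in [0, MAX_SEND_DTOS] */
}
```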

USAGE - Limit test client

    Limit testing (-T L) neither requires nor connects to any server
    instance.  The client runs one or more tests which attempt to
    exhaust various resources to determine DAPL limits and exercise
    DAPL error paths.  If no arguments are given, all tests are run.

    Limit testing creates the sequence of DAT objects needed to
    move data back and forth, attempting to find the limits supported
    for the DAPL object requested.  For example, if the LMR creation
    limit is being examined, the test will create a set of
    {IA, PZ, CNO, EVD, EP} before trying to run dat_lmr_create() to
    failure using that set of DAPL objects.  The 'width' parameter
    can be used to control how many of these parallel DAPL object
    sets are created before beating upon the requested constructor.
    Use of -m limits the number of dat_*_create() calls that will
    be attempted, which can be helpful if the DAPL in use supports
    essentially unlimited numbers of some objects.

    The limit test arguments are:

    [ -m maximum ]      Specify the maximum number of dat_*_create()
                        attempts.
                        Default: run to object creation failure

    [ -w width ]        Specify the number of DAPL object sets to
                        create while initializing.
                        Default: 1

    [ limit_ia ]        Attempt to exhaust dat_ia_open()

    [ limit_pz ]        Attempt to exhaust dat_pz_create()

    [ limit_cno ]       Attempt to exhaust dat_cno_create()

    [ limit_evd ]       Attempt to exhaust dat_evd_create()

    [ limit_ep ]        Attempt to exhaust dat_ep_create()

    [ limit_rsp ]       Attempt to exhaust dat_rsp_create()

    [ limit_psp ]       Attempt to exhaust dat_psp_create()

    [ limit_lmr ]       Attempt to exhaust dat_lmr_create(4KB)

    [ limit_rpost ]     Attempt to exhaust dat_ep_post_recv(4KB)

    [ limit_size_lmr ]  Probe maximum size dat_lmr_create()

                        Default: run all tests


EXAMPLES

    dapltest -T S -d -D ibnic0v2

                        Starts a local dapltest server process with debug verbosity.
                        Server loops (listen for dapltest request, process request).
    
    dapltest -T T -d -s winIB -D ibnic0v2 -i 100 client SR 4096 2 server SR 4096 2

                        Runs a transaction test, with both sides
                        sending one buffer with two 4KB segments,
                        one hundred times; dapltest server is on host winIB.

    dapltest -T P -d -s winIB -D ibnic0v2 -i 100 RW 4096 2

                        Runs a performance test, with the client 
                        RDMA writing one buffer with two 4KB segments,
                        one hundred times.

    dapltest -T Q -s winIB -D ibnic0v2

                        Asks the dapltest server at host 'winIB' to clean up and exit.

    dapltest -T L -D ibnic0v2 -d -w 16 -m 1000

                        Runs all of the limit tests, setting up
                        16 complete sets of DAPL objects, and
                        creating at most a thousand instances
                        when trying to exhaust resources.

    dapltest -T T -V -d -t 2 -w 4 -i 55555 -s winIB -D ibnic0v2 \
       client RW  4096 1    server RW  2048 4    \
       client SR  1024 4    server SR  4096 2    \
       client SR  1024 3 -f server SR  2048 1 -f

                        Runs a more complicated transaction test,
                        with two threads using four EPs each,
                        sending a more complicated buffer pattern
                        for a larger number of iterations,
                        validating the data received.

dt-svr.bat - DAPLtest server script; starts a dapl2test.exe server on the local system.
	dt-svr DAPL-provider [-D [hex-debug-bitmask] ]
where: DAPL-provider can be one of [ ibal | scm | cma ]
  • ibal - Original InfiniBand Access Layer (pronounced 'eye-bal') verbs interface.
  • scm - Socket-CM (Connection Manager); exchanges QP information over an IP socket.
  • cma - rdma CM; uses the OFED RDMA Communications Manager to create the QP connection.
  • or the DAPL-provider name from %SystemDrive%\DAT\dat.conf
dt-cli.bat - DAPLtest client script; drives testing by interacting with the dt-svr.bat script.
	dt-cli DAPL-provider host-IPv4-address testname [-D [hex-debug-bitmask] ]
	examples:
		dt-svr ibnic0v2                    # IBAL dapltest server listening on port HCA0
		dt-cli ibnic0v2 10.10.2.20 trans
		dt-cli -h                          # outputs help text.
Verify that the dt-*.bat scripts are running the same dapl2test.exe (DAPL v2.0).


BUGS  (and To Do List)

    Use of CNOs (-Q) is not yet supported.

    Further limit tests could be added.


SRP (SCSI RDMA) Protocol Driver


The SCSI RDMA Protocol (SRP) is an emerging industry-standard protocol for utilizing block storage devices over an InfiniBand™ fabric. SRP is being defined by the ANSI T10 committee.

OFED SRP is a storage driver implementation that enables the SCSI RDMA protocol over an InfiniBand fabric.
The implementation conforms to the T10 Working Group draft http://www.t10.org/ftp/t10/drafts/srp/srp-r16a.pdf.

Software Dependencies

The SRP driver depends on the installation of the OFED stack with a Subnet
Manager running somewhere on the IB fabric.

- Supported Operating Systems and Service Packs:
   o Windows 7 (x86, x64)
   o Windows Server 2008 R2 (x86, x64)
   o Windows Server 2008/Vista (x86, x64)
   o Windows Server 2008 HPC (x86, x64)
   o Windows Server 2003 SP2/R2 (x86, x64, IA64)

Testing Levels

The SRP driver has undergone basic testing against Mellanox Technologies' SRP Targets MTD1000 and MTD2000.
Additionally, the Linux OFED 1.4.1 SRP target with scst 1.0.0.0 (vdisk with blockio) has been tested.  Note: ONLY the scst-1.0.0.0 release will work, because the OFED 1.4.1 ib_srpt driver has a symbol dependency on scst; later releases of scst.ko no longer export the required symbol, so ib_srpt fails to load.
Once ib_srpt is updated, later versions of scst can be used.

Testing included SRP target drive format, read, write and dismount/offline operations.
 

Installation

The OFED installer does not install the SRP driver as part of a default installation.  If the SRP feature is selected in the custom features installation view, an InfiniBand SRP Miniport driver will be installed; see the device manager view under SCSI and RAID controllers.

The 'InfiniBand I/O Unit' (IOU) system device is required for correct SRP operation.  The OFED installer will install and load the IOU driver if the SRP feature is selected.  See the device manager view System Devices --> InfiniBand I/O Unit for confirmation that the IOU driver loaded correctly.

In order for the SRP miniport driver installation to complete, an SRP target must be detected by a Subnet Manager running somewhere on the InfiniBand fabric; either a local or remote Subnet Manager works.

SRP Driver Uninstall

If the SRP (SCSI RDMA Protocol) driver has been previously installed, then in order to achieve a 'clean' uninstall, the SRP target drive(s) must be released.  Unfortunately the 'offline disk' command is only valid for diskpart (ver 6.0.6001) which is not distributed with Windows Server 2003 or XP.

The consequences of not releasing the SRP target drive(s) are that after the OFED uninstall reboot there are lingering InfiniBand driver files. These driver files remain because while the SRP target is active they have references, thus when the OFED uninstall attempts to delete the files the operation fails.

SRP supports WPP tracing tools by using the GUID: '5AF07B3C-D119-4233-9C81-C07EF481CBE6'.  The flags and level of debug can be controlled at load-time or run-time; see ib_srp.inf file for details.

Constructing a RHEL 5.1 OFED 1.4.1 SRP vdisk BLOCKIO target

Example assumptions:

Use the out-of-the-box scst defines, which include '#undef STRICT_SERIALIZING';
no kernel modifications are required for BLOCKIO access to /dev/sdb[123].

cd scst-1.0.0.0
make all
make install

cd OFED-1.4.1
Build OFED; select #3 for 'all' OFED components.
  - no SRP loads in /etc/infiniband/openib.conf, edit prior to reboot.

REBOOT.

The ./LOAD and ./UNLOAD scripts are manual versions of what scstAdmin (a separate scst package) will do, minus
loading the OFED driver ib_srpt.

The SRP targets were formatted from Windows using the default NTFS allocation size.
Partition size and numbering are derived from local test conventions; your setup will be different.

/dev/sdb1 NTFS < 1GB
/dev/sdb2 NTFS > 1GB
/dev/sdb3 NTFS > sdb2

 

Manual SRP Target LOAD script

#!/bin/sh

GRP=Default

# Report the failed command with its exit status, then exit with that status.
# (Capturing the status first avoids reporting the status of 'echo' instead.)
die() {
  rc=$1
  shift
  echo
  echo "err $rc $*"
  exit "$rc"
}

if [ ! -e /proc/scsi_tgt ] ; then
  echo -n Loading scst driver
  modprobe scst || die $? modprobe scst
  echo ...OK
fi

if [ ! -e /proc/scsi_tgt/vdisk ] ; then
  echo -n Loading scst_vdisk driver
  modprobe scst_vdisk || die $? modprobe scst_vdisk
  echo ...OK
fi

if ! fgrep -q ib_srpt /proc/modules ; then
  echo -n Loading ib_srpt driver
  modprobe ib_srpt || die $? modprobe ib_srpt
  echo ...OK
fi

echo -n "Open SRP devices srp[123]"
echo "open srp1 /dev/sdb1 512 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk || die $? open srp1 /dev/sdb1
echo "open srp2 /dev/sdb2 512 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk || die $? open srp2 /dev/sdb2
echo "open srp3 /dev/sdb3 512 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk || die $? open srp3 /dev/sdb3
echo ...OK

echo -n Set allowed hosts access...
echo "add *" > /proc/scsi_tgt/groups/$GRP/names
echo ...OK

echo -n "Adding targets srp[123] as LUNs [012] in group $GRP"
echo "add srp1 0" > /proc/scsi_tgt/groups/$GRP/devices || die $? add srp1 0
echo "add srp2 1" > /proc/scsi_tgt/groups/$GRP/devices || die $? add srp2 1
echo "add srp3 2" > /proc/scsi_tgt/groups/$GRP/devices || die $? add srp3 2
echo ...OK

Manual SRP Target UNLOAD script

#!/bin/sh

if [ -w /proc/scsi_tgt/vdisk/vdisk ] ; then
  echo -n Closing SRP Targets srp[321]...
  echo "close srp3" > /proc/scsi_tgt/vdisk/vdisk
  echo "close srp2" > /proc/scsi_tgt/vdisk/vdisk
  echo "close srp1" > /proc/scsi_tgt/vdisk/vdisk
  echo Done.
fi

fgrep -q scst_vdisk /proc/modules
if [ $? -eq 0 ] ; then
  modprobe -r scst_vdisk
fi

fgrep -q ib_srpt /proc/modules
if [ $? -eq 0 ] ; then
  modprobe -r ib_srpt
fi

fgrep -q scst /proc/modules
if [ $? -eq 0 ] ; then
  modprobe -r scst
fi


QLogic VNIC Configuration


The QLogic VNIC (Virtual Network Interface Card) driver in conjunction with the QLogic Ethernet Virtual I/O Controller (EVIC) provides virtual Ethernet interfaces and transport for Ethernet packets over Infiniband.

Users can modify NIC parameters through the adapter's icon in Network Connections
(Properties -> "Configure..." button -> "Advanced" tab).

Parameters available:

Vlan Id (802.1Q) 

  values from 0 to 4094 ( default 0, disabled )
  This specifies if VLAN ID-marked packet transmission is enabled and, if so, specifies the ID.

Priority (802.1P)

  values from 0 to 7 ( default 0, feature disabled)
  This specifies if priority-marked packet transmission is enabled.

Payload MTU size 

  values from 1500 to 9500 (default 1500)
  This specifies the maximum transfer unit size in 100-byte increments.

Recv ChkSum offload 

  (default enabled)
  This specifies whether IP protocol checksum calculations for receive are offloaded.

Send ChkSum offload

  (default enabled)
  This specifies whether IP protocol checksum calculations for send are offloaded.
 

Secondary Path 

   (default disabled)
   Enabled - If more than one IB path to the IOC exists, a secondary IB instance of the virtual port will be created and configured with the same parameters as the primary one. Failover from the primary to the secondary IB path is transparent to user applications sending data through the associated NIC.

   Disabled - Only one path at a time is allowed. If more than one path to the IOC exists, a failed path will be destroyed and the next available path will be used for a new connection. In this scenario, the new interface instance may be assigned a different MAC address when other hosts compete for EVIC resources.
 

LBFO Bundle Id
   (default disabled) Enables support for OS-provided Load Balancing and Fail Over (LBFO) functionality at the adapter level.
   If enabled, a group ID can be selected from predefined names.

 

Heartbeat interval

   Configures the interval for VNIC protocol heartbeat messages, in milliseconds.
   0 - heartbeats disabled.

Note:
   To take advantage of the features supported by these options, ensure that the Ethernet gateway is also configured appropriately.  For example, if the Payload MTU for a VNIC interface is set to 4000, the MTU at the EVIC module must also be set to at least 4000 for the setting to take effect.
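
The Payload MTU constraints above (1500 to 9500 in 100-byte increments, with the EVIC MTU at least as large) can be checked with a few lines of C. This is a minimal sketch for illustration only; vnic_mtu_is_valid() and vnic_mtu_takes_effect() are hypothetical helper names and are not part of the QLogic VNIC driver or its tools.

```c
#include <assert.h>

/* Limits taken from the Payload MTU parameter description. */
#define VNIC_MTU_MIN   1500u
#define VNIC_MTU_MAX   9500u
#define VNIC_MTU_STEP   100u

/* Returns 1 if mtu is an allowed Payload MTU value, 0 otherwise. */
static int vnic_mtu_is_valid(unsigned int mtu)
{
    return mtu >= VNIC_MTU_MIN && mtu <= VNIC_MTU_MAX &&
           mtu % VNIC_MTU_STEP == 0;
}

/*
 * Returns 1 if the VNIC setting can take effect, that is, the MTU
 * configured at the EVIC module is at least the VNIC Payload MTU.
 */
static int vnic_mtu_takes_effect(unsigned int vnic_mtu, unsigned int evic_mtu)
{
    assert(vnic_mtu_is_valid(vnic_mtu));
    return evic_mtu >= vnic_mtu;
}
```

With the values from the note, vnic_mtu_takes_effect(4000, 4000) returns 1, while an EVIC MTU left at 1500 makes it return 0.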


QLogic VNIC Child Device Management


Each I/O Controller (IOC) of QLogic's EVIC gateway device can handle 256 connections per host, so a single host can have multiple VNIC interfaces connecting to the same IOC. qlgcvnic_config can be used to create multiple VNIC interfaces, given the local channel adapter node GUID and the target IOC GUID as input.

Usage:

Create child vnic devices

qlgcvnic_config -c {caguid}  {iocguid}  {instanceid}  {interface description}

caguid -- Local HCA node GUID value in hex format (may start with "0x")
iocguid -- Target IOC's GUID value in hex format (may start with "0x")
instanceid -- Used to distinguish between the different child devices created by IBAL, so it must be a unique value. InstanceID is a 32-bit value; user input should be in decimal format.
interface description -- Description to be shown in the Device Manager device tree for the child device.
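
A script or wrapper that generates instanceid values may want to check them before calling qlgcvnic_config. The C sketch below shows one way to accept only decimal input that fits in 32 bits, as the description above requires. parse_instance_id() is a hypothetical helper and is not part of qlgcvnic_config; how the tool itself validates its input is not documented here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal, unsigned 32-bit instance ID.
 * Returns 0 and stores the value in *out on success; returns -1 for
 * empty input, signs, hex prefixes, trailing characters or overflow.
 */
static int parse_instance_id(const char *text, uint32_t *out)
{
    char *end = NULL;
    unsigned long long value;

    assert(text != NULL && out != NULL);
    if (*text < '0' || *text > '9')     /* must start with a digit */
        return -1;
    errno = 0;
    value = strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINT32_MAX)
        return -1;
    *out = (uint32_t)value;
    return 0;
}
```

Uniqueness across child devices still has to be ensured by the caller; this helper only checks the format and range.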

Listing Channel Adapter to IOC paths

Executing qlgcvnic_config without any option or with -l option will list the IOCs reachable from the host.


OFED Software Development Kit


If selected during install, the OFED Software Development Kit will be installed as '%SystemDrive%\OFED_SDK'. Underneath the OFED_SDK\ folder you will find the following folders:

Compilation:

Add the additional include path '%SystemDrive%\OFED_SDK\Inc'; resource files may also use this path.

Linking:

Add the additional library search path '%SystemDrive%\OFED_SDK\Lib'.

Include dependent libraries: ibal.lib and complib.lib, or ibal32.lib & complib32.lib for win32 applications on 64-bit platforms.

Samples:


OFED InfiniBand Verbs


NAME

    libibverbs.lib - OpenFabrics Enterprise Distribution (OFED) Infiniband verbs library

SYNOPSIS

    #include <infiniband/verbs.h>

DESCRIPTION

This library is an implementation of the verbs based on the Infiniband specification volume 1.2 chapter 11. It handles the control path of creating, modifying, querying and destroying resources such as Protection Domains (PD), Completion Queues (CQ), Queue-Pairs (QP), Shared Receive Queues (SRQ), Address Handles (AH), Memory Regions (MR). It also handles sending and receiving data posted to QPs and SRQs, getting completions from CQs using polling and completions events.

The control path is implemented through system calls to the uverbs kernel module which further calls the low level HW driver. The data path is implemented through calls made to low level HW library which in most cases interacts directly with the HW providing kernel and network stack bypass (saving context/mode switches) along with zero copy and an asynchronous I/O model.

Typically, under network and RDMA programming, there are operations which involve interaction with remote peers (such as address resolution and connection establishment) and remote entities (such as route resolution and joining a multicast group under IB), where a resource managed through IB verbs such as QP or AH would be eventually created or affected from this interaction. In such cases, applications whose addressing semantics are based on IP can use librdmacm (see rdma_cm) which works in conjunction with libibverbs.

This library is thread safe, and verbs can be called from every thread in the process (the same resource can even be handled from different threads, for example: ibv_poll_cq can be called from more than one thread).

However, it is up to the user to stop working with a resource after it has been destroyed (by the same thread or by any other thread); continuing to use a destroyed resource may result in a segmentation fault.

The following shall be declared as functions and may also be defined as macros.

Function prototypes are provided in %SystemDrive%\OFED_SDK\inc\infiniband\verbs.h.

Link to %SystemDrive%\OFED_SDK\lib\libibverbs.lib

Device functions

struct ibv_device **ibv_get_device_list(int *num_devices);

void ibv_free_device_list(struct ibv_device **list);

const char *ibv_get_device_name(struct ibv_device *device);

uint64_t ibv_get_device_guid(struct ibv_device *device);

Context functions

struct ibv_context *ibv_open_device(struct ibv_device *device);

int ibv_close_device(struct ibv_context *context);

Queries

int ibv_query_device(struct ibv_context *context, struct ibv_device_attr *device_attr);

int ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr);

int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, int index, uint16_t *pkey);

int ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid *gid);

Asynchronous events

int ibv_get_async_event(struct ibv_context *context, struct ibv_async_event *event);

void ibv_ack_async_event(struct ibv_async_event *event);

Protection Domains

struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);

int ibv_dealloc_pd(struct ibv_pd *pd);

Memory Regions

struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access);

int ibv_dereg_mr(struct ibv_mr *mr);

Address Handles

struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr);

int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, struct ibv_wc *wc, struct ibv_grh *grh, struct ibv_ah_attr *ah_attr);

struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, struct ibv_grh *grh, uint8_t port_num);

int ibv_destroy_ah(struct ibv_ah *ah);

Completion event channels

struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context);

int ibv_destroy_comp_channel(struct ibv_comp_channel *channel);

Completion Queues Control

struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, void *cq_context, struct ibv_comp_channel *channel, int comp_vector);

int ibv_destroy_cq(struct ibv_cq *cq);

int ibv_resize_cq(struct ibv_cq *cq, int cqe);

Reading Completions from CQ

int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);

Requesting / Managing CQ events

int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only);

int ibv_get_cq_event(struct ibv_comp_channel *channel, struct ibv_cq **cq, void **cq_context);

void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents);

Shared Receive Queue control

struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr);

int ibv_destroy_srq(struct ibv_srq *srq);

int ibv_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, enum ibv_srq_attr_mask srq_attr_mask);

int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr);

eXtended Reliable Connection control

struct ibv_xrc_domain *ibv_open_xrc_domain(struct ibv_context *context, int fd, int oflag);

int ibv_close_xrc_domain(struct ibv_xrc_domain *d);

struct ibv_srq *ibv_create_xrc_srq(struct ibv_pd *pd, struct ibv_xrc_domain *xrc_domain, struct ibv_cq *xrc_cq, struct ibv_srq_init_attr *srq_init_attr);

int ibv_create_xrc_rcv_qp(struct ibv_qp_init_attr *init_attr, uint32_t *xrc_rcv_qpn);

int ibv_modify_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num, struct ibv_qp_attr *attr, int attr_mask);

int ibv_query_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num, struct ibv_qp_attr *attr, int attr_mask, struct ibv_qp_init_attr *init_attr);

int ibv_reg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num);

int ibv_unreg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num);

Queue Pair control

struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr);

int ibv_destroy_qp(struct ibv_qp *qp);

int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask);

int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask, struct ibv_qp_init_attr *init_attr);

Posting Work Requests to QPs/SRQs

int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr);

int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr);

int ibv_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *recv_wr, struct ibv_recv_wr **bad_recv_wr);

Multicast group

int ibv_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);

int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);

General functions

int ibv_rate_to_mult(enum ibv_rate rate);

enum ibv_rate mult_to_ibv_rate(int mult);
 

SEE ALSO

ibv_get_device_list, ibv_free_device_list,
ibv_get_device_name, ibv_get_device_guid, ibv_open_device,
ibv_close_device, ibv_query_device, ibv_query_port,
ibv_query_pkey, ibv_query_gid, ibv_get_async_event,
ibv_ack_async_event, ibv_alloc_pd, ibv_dealloc_pd, ibv_reg_mr,
ibv_dereg_mr, ibv_create_ah, ibv_init_ah_from_wc, ibv_create_ah_from_wc,
ibv_destroy_ah, ibv_create_comp_channel,
ibv_destroy_comp_channel, ibv_create_cq, ibv_destroy_cq,
ibv_resize_cq, ibv_poll_cq, ibv_req_notify_cq,
ibv_get_cq_event, ibv_ack_cq_events, ibv_create_srq,
ibv_destroy_srq, ibv_modify_srq, ibv_query_srq,
ibv_open_xrc_domain, ibv_close_xrc_domain, ibv_create_xrc_srq,
ibv_create_xrc_rcv_qp, ibv_modify_xrc_rcv_qp,
ibv_query_xrc_rcv_qp, ibv_reg_xrc_rcv_qp, ibv_unreg_xrc_rcv_qp,
ibv_post_srq_recv, ibv_create_qp, ibv_destroy_qp, ibv_modify_qp,
ibv_query_qp, ibv_post_send, ibv_post_recv,
ibv_attach_mcast, ibv_detach_mcast, ibv_rate_to_mult, mult_to_ibv_rate


AUTHORS

Dotan Barak <dotanb@mellanox.co.il>
Or Gerlitz <ogerlitz@voltaire.com>
Stan Smith <stan.smith@intel.com>


IBV_GET_DEVICE_LIST

IBV_FREE_DEVICE_LIST


NAME

ibv_get_device_list, ibv_free_device_list - get and release list of available RDMA devices

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_device **ibv_get_device_list(int *num_devices);

void ibv_free_device_list(struct ibv_device **list);

DESCRIPTION

ibv_get_device_list() returns a NULL-terminated array of RDMA devices currently available. The argument num_devices is optional; if not NULL, it is set to the number of devices returned in the array.

ibv_free_device_list() frees the array of devices, list, that was returned by ibv_get_device_list().

RETURN VALUE

ibv_get_device_list() returns the array of available RDMA devices, or sets errno and returns NULL if the request fails. If no devices are found then num_devices is set to 0, and non-NULL is returned.

ibv_free_device_list() returns no value.

ERRORS

EPERM
Permission denied.
ENOSYS
No kernel support for RDMA.
ENOMEM
Insufficient memory to complete the operation.

NOTES

Client code should open all the devices it intends to use with ibv_open_device() before calling ibv_free_device_list(). Once it frees the array with ibv_free_device_list(), it will be able to use only the open devices; pointers to unopened devices will no longer be valid.  

SEE ALSO

ibv_get_device_name, ibv_get_device_guid, ibv_open_device

 

IBV_GET_DEVICE_GUID


NAME

ibv_get_device_guid - get an RDMA device's GUID

SYNOPSIS

#include <infiniband/verbs.h>

uint64_t ibv_get_device_guid(struct ibv_device *device); 

DESCRIPTION

ibv_get_device_guid() returns the Global Unique IDentifier (GUID) of the RDMA device device.

RETURN VALUE

ibv_get_device_guid() returns the GUID of the device in network byte order.  
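
Because the GUID is returned in network byte order, it usually needs to be converted before it is printed or compared numerically. The following C sketch shows one portable way to do that without relying on platform-specific byte-swap routines. guid_to_host() and guid_format() are hypothetical helper names and are not part of libibverbs; the input is assumed to be the value returned by ibv_get_device_guid().

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Convert a 64-bit value stored in network (big-endian) byte order
 * to host byte order, independent of the host's endianness. */
static uint64_t guid_to_host(uint64_t guid_be)
{
    unsigned char bytes[8];
    uint64_t host = 0;
    size_t i;

    memcpy(bytes, &guid_be, sizeof bytes);
    for (i = 0; i < sizeof bytes; i++)
        host = (host << 8) | bytes[i];  /* first byte is most significant */
    return host;
}

/* Format a network-order GUID as "0x" followed by 16 hex digits.
 * buf must have room for at least 19 bytes. */
static void guid_format(uint64_t guid_be, char *buf, size_t len)
{
    assert(buf != NULL && len >= 19);
    snprintf(buf, len, "0x%016" PRIx64, guid_to_host(guid_be));
}
```

A caller would typically pass the return value of ibv_get_device_guid() straight to guid_format() and print the resulting string.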

SEE ALSO

ibv_get_device_list, ibv_get_device_name, ibv_open_device

 


IBV_GET_DEVICE_NAME


NAME

ibv_get_device_name - get an RDMA device's name

SYNOPSIS

#include <infiniband/verbs.h>

const char *ibv_get_device_name(struct ibv_device *device);

DESCRIPTION

ibv_get_device_name() returns a human-readable name associated with the RDMA device device.

RETURN VALUE

ibv_get_device_name() returns a pointer to the device name, or NULL if the request fails.

SEE ALSO

ibv_get_device_list, ibv_get_device_guid, ibv_open_device



IBV_OPEN_DEVICE

IBV_CLOSE_DEVICE


NAME

ibv_open_device, ibv_close_device - open and close an RDMA device context

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_context *ibv_open_device(struct ibv_device *device);

int ibv_close_device(struct ibv_context *context);

DESCRIPTION

ibv_open_device() opens the device device and creates a context for further use.

ibv_close_device() closes the device context context.

RETURN VALUE

ibv_open_device() returns a pointer to the allocated device context, or NULL if the request fails.

ibv_close_device() returns 0 on success, -1 on failure.

NOTES

ibv_close_device() does not release all the resources allocated using context context. To avoid resource leaks, the user should release all associated resources before closing a context.

SEE ALSO

ibv_get_device_list, ibv_query_device, ibv_query_port, ibv_query_gid, ibv_query_pkey

 

 


IBV_GET_ASYNC_EVENT



IBV_ACK_ASYNC_EVENT


NAME

ibv_get_async_event, ibv_ack_async_event - get or acknowledge asynchronous events  

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_get_async_event(struct ibv_context *context,
                        struct ibv_async_event *event);

void ibv_ack_async_event(struct ibv_async_event *event);

DESCRIPTION

ibv_get_async_event() waits for the next async event of the RDMA device context context and returns it through the pointer event, which is an ibv_async_event struct, as defined in <infiniband/verbs.h>.

struct ibv_async_event {
union {
struct ibv_cq  *cq;             /* CQ that got the event */
struct ibv_qp  *qp;             /* QP that got the event */
struct ibv_srq *srq;            /* SRQ that got the event */
int             port_num;       /* port number that got the event */
} element;
enum ibv_event_type     event_type;     /* type of the event */
};

One member of the element union will be valid, depending on the event_type member of the structure. event_type will be one of the following events:

QP events:

IBV_EVENT_QP_FATAL Error occurred on a QP and it transitioned to error state
IBV_EVENT_QP_REQ_ERR Invalid Request Local Work Queue Error
IBV_EVENT_QP_ACCESS_ERR Local access violation error
IBV_EVENT_COMM_EST Communication was established on a QP
IBV_EVENT_SQ_DRAINED Send Queue was drained of outstanding messages in progress
IBV_EVENT_PATH_MIG A connection has migrated to the alternate path
IBV_EVENT_PATH_MIG_ERR A connection failed to migrate to the alternate path
IBV_EVENT_QP_LAST_WQE_REACHED Last WQE Reached on a QP associated with an SRQ

CQ events:

IBV_EVENT_CQ_ERR CQ is in error (CQ overrun)

SRQ events:

IBV_EVENT_SRQ_ERR Error occurred on an SRQ
IBV_EVENT_SRQ_LIMIT_REACHED SRQ limit was reached

Port events:

IBV_EVENT_PORT_ACTIVE Link became active on a port
IBV_EVENT_PORT_ERR Link became unavailable on a port
IBV_EVENT_LID_CHANGE LID was changed on a port
IBV_EVENT_PKEY_CHANGE P_Key table was changed on a port
IBV_EVENT_SM_CHANGE SM was changed on a port
IBV_EVENT_CLIENT_REREGISTER SM sent a CLIENT_REREGISTER request to a port

CA events:

IBV_EVENT_DEVICE_FATAL CA is in FATAL state

ibv_ack_async_event() acknowledges the async event event.
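The grouping of event types above determines which member of the element union may be read. The sketch below expresses that mapping as a lookup function. All DEMO_ names are stand-ins invented for this illustration so that it compiles without the library; real code would switch on the IBV_EVENT_* values of enum ibv_event_type. Treating the CA event as having no affiliated element is an assumption, since the text does not name a union member for it.

```c
/* Which member of the element union is meaningful for an event. */
enum demo_element {
        DEMO_ELEM_NONE, DEMO_ELEM_QP, DEMO_ELEM_CQ, DEMO_ELEM_SRQ, DEMO_ELEM_PORT
};

/* Stand-ins for the event types listed above (the values are arbitrary). */
enum demo_event {
        DEMO_QP_FATAL, DEMO_QP_REQ_ERR, DEMO_QP_ACCESS_ERR, DEMO_COMM_EST,
        DEMO_SQ_DRAINED, DEMO_PATH_MIG, DEMO_PATH_MIG_ERR,
        DEMO_QP_LAST_WQE_REACHED,
        DEMO_CQ_ERR,
        DEMO_SRQ_ERR, DEMO_SRQ_LIMIT_REACHED,
        DEMO_PORT_ACTIVE, DEMO_PORT_ERR, DEMO_LID_CHANGE, DEMO_PKEY_CHANGE,
        DEMO_SM_CHANGE, DEMO_CLIENT_REREGISTER,
        DEMO_DEVICE_FATAL
};

/* Return the union member that is valid for the given event type. */
static enum demo_element demo_valid_element(enum demo_event ev)
{
        switch (ev) {
        case DEMO_QP_FATAL:
        case DEMO_QP_REQ_ERR:
        case DEMO_QP_ACCESS_ERR:
        case DEMO_COMM_EST:
        case DEMO_SQ_DRAINED:
        case DEMO_PATH_MIG:
        case DEMO_PATH_MIG_ERR:
        case DEMO_QP_LAST_WQE_REACHED:
                return DEMO_ELEM_QP;    /* read element.qp */
        case DEMO_CQ_ERR:
                return DEMO_ELEM_CQ;    /* read element.cq */
        case DEMO_SRQ_ERR:
        case DEMO_SRQ_LIMIT_REACHED:
                return DEMO_ELEM_SRQ;   /* read element.srq */
        case DEMO_PORT_ACTIVE:
        case DEMO_PORT_ERR:
        case DEMO_LID_CHANGE:
        case DEMO_PKEY_CHANGE:
        case DEMO_SM_CHANGE:
        case DEMO_CLIENT_REREGISTER:
                return DEMO_ELEM_PORT;  /* read element.port_num */
        case DEMO_DEVICE_FATAL:
                break;                  /* assumed: no affiliated element */
        }
        return DEMO_ELEM_NONE;
}
```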

RETURN VALUE

ibv_get_async_event() returns 0 on success, and -1 on error.

ibv_ack_async_event() returns no value.

NOTES

All async events that ibv_get_async_event() returns must be acknowledged using ibv_ack_async_event(). To avoid races, destroying an object (CQ, SRQ or QP) will wait for all affiliated events for the object to be acknowledged; this avoids an application retrieving an affiliated event after the corresponding object has already been destroyed.

ibv_get_async_event() is a blocking function. If multiple threads call this function simultaneously, then when an async event occurs, only one thread will receive it, and it is not possible to predict which thread will receive it.

EXAMPLES

The following code example demonstrates one possible way to work with async events in non-blocking mode. It performs the following steps:

1. Set the async event queue to non-blocking mode
2. Poll the queue until it has an async event
3. Get the async event and ack it

/* change the blocking mode of the async event queue */
flags = fcntl(ctx->async_fd, F_GETFL);
rc = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);
if (rc < 0) {
        fprintf(stderr, "Failed to change file descriptor of async event queue\n");
        return 1;
}

/*
 * poll the queue until it has an event, waiting up to ms_timeout
 * milliseconds in each iteration
 */
my_pollfd.fd      = ctx->async_fd;
my_pollfd.events  = POLLIN;
my_pollfd.revents = 0;

do {
        rc = poll(&my_pollfd, 1, ms_timeout);
} while (rc == 0);
if (rc < 0) {
        fprintf(stderr, "poll failed\n");
        return 1;
}

/* Get the async event */
if (ibv_get_async_event(ctx, &async_event)) {
        fprintf(stderr, "Failed to get async_event\n");
        return 1;
}

/* Ack the event */
ibv_ack_async_event(&async_event);

SEE ALSO

ibv_open_device

 


IBV_QUERY_DEVICE


NAME

ibv_query_device - query an RDMA device's attributes  

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_query_device(struct ibv_context *context,
                     struct ibv_device_attr *device_attr);

DESCRIPTION

ibv_query_device() returns the attributes of the device with context context. The argument device_attr is a pointer to an ibv_device_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_device_attr {
char                    fw_ver[64];             /* FW version */
uint64_t                node_guid;              /* Node GUID (in network byte order) */
uint64_t                sys_image_guid;         /* System image GUID (in network byte order) */
uint64_t                max_mr_size;            /* Largest contiguous block that can be registered */
uint64_t                page_size_cap;          /* Supported memory shift sizes */
uint32_t                vendor_id;              /* Vendor ID, per IEEE */
uint32_t                vendor_part_id;         /* Vendor supplied part ID */
uint32_t                hw_ver;                 /* Hardware version */
int                     max_qp;                 /* Maximum number of supported QPs */
int                     max_qp_wr;              /* Maximum number of outstanding WR on any work queue */
int                     device_cap_flags;       /* HCA capabilities mask */
int                     max_sge;                /* Maximum number of s/g per WR for non-RD QPs */
int                     max_sge_rd;             /* Maximum number of s/g per WR for RD QPs */
int                     max_cq;                 /* Maximum number of supported CQs */
int                     max_cqe;                /* Maximum number of CQEs per CQ */
int                     max_mr;                 /* Maximum number of supported MRs */
int                     max_pd;                 /* Maximum number of supported PDs */
int                     max_qp_rd_atom;         /* Maximum number of RDMA Read & Atomic operations that can be outstanding per QP */
int                     max_ee_rd_atom;         /* Maximum number of RDMA Read & Atomic operations that can be outstanding per EEC */
int                     max_res_rd_atom;        /* Maximum number of resources used for RDMA Read & Atomic operations by this HCA as the Target */
int                     max_qp_init_rd_atom;    /* Maximum depth per QP for initiation of RDMA Read & Atomic operations */ 
int                     max_ee_init_rd_atom;    /* Maximum depth per EEC for initiation of RDMA Read & Atomic operations */
enum ibv_atomic_cap     atomic_cap;             /* Atomic operations support level */
int                     max_ee;                 /* Maximum number of supported EE contexts */
int                     max_rdd;                /* Maximum number of supported RD domains */
int                     max_mw;                 /* Maximum number of supported MWs */
int                     max_raw_ipv6_qp;        /* Maximum number of supported raw IPv6 datagram QPs */
int                     max_raw_ethy_qp;        /* Maximum number of supported Ethertype datagram QPs */
int                     max_mcast_grp;          /* Maximum number of supported multicast groups */
int                     max_mcast_qp_attach;    /* Maximum number of QPs per multicast group which can be attached */
int                     max_total_mcast_qp_attach;/* Maximum number of QPs which can be attached to multicast groups */
int                     max_ah;                 /* Maximum number of supported address handles */
int                     max_fmr;                /* Maximum number of supported FMRs */
int                     max_map_per_fmr;        /* Maximum number of (re)maps per FMR before an unmap operation is required */
int                     max_srq;                /* Maximum number of supported SRQs */
int                     max_srq_wr;             /* Maximum number of WRs per SRQ */
int                     max_srq_sge;            /* Maximum number of s/g per SRQ */
uint16_t                max_pkeys;              /* Maximum number of partitions */
uint8_t                 local_ca_ack_delay;     /* Local CA ack delay */
uint8_t                 phys_port_cnt;          /* Number of physical ports */
};

RETURN VALUE

ibv_query_device() returns 0 on success, or the value of errno on failure (which indicates the failure reason).  

NOTES

The maximum values returned by this function are the upper limits of the resources supported by the device. However, it may not be possible to use these maximum values, since the actual number of any resource that can be created may be limited by the machine configuration, the amount of host memory, user permissions, and the amount of resources already in use by other users/processes.
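Since a creation request can fail even when it is within the reported maximum, an application may want to fall back to a smaller size. The following is a minimal sketch of one possible back-off strategy (halving), which is an illustrative choice and not something the library prescribes. demo_try_create() is a stand-in for a creation verb such as ibv_create_cq(); its limit of 100 entries is invented for the example.

```c
/*
 * Stand-in for a creation verb: pretend the host can currently provide
 * at most 100 entries. Returns 0 on success, -1 on failure.
 */
static int demo_try_create(int entries)
{
        return entries <= 100 ? 0 : -1;
}

/*
 * Start from the smaller of the wanted size and the device maximum
 * (for example max_cqe from ibv_query_device()), then halve the
 * request until creation succeeds. Returns the size that was
 * obtained, or 0 if no size could be created.
 */
static int demo_create_with_backoff(int wanted, int device_max,
                                    int (*try_create)(int))
{
        int n = wanted < device_max ? wanted : device_max;

        while (n > 0) {
                if (try_create(n) == 0)
                        return n;
                n /= 2;
        }
        return 0;
}
```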

SEE ALSO

ibv_open_device, ibv_query_port, ibv_query_pkey, ibv_query_gid

 


IBV_QUERY_GID


NAME

ibv_query_gid - query an InfiniBand port's GID table

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_query_gid(struct ibv_context *context, uint8_t port_num,
                  int index, union ibv_gid *gid);

DESCRIPTION

ibv_query_gid() returns the GID value in entry index of port port_num for device context context through the pointer gid.

RETURN VALUE

ibv_query_gid() returns 0 on success, and -1 on error.

SEE ALSO

ibv_open_device, ibv_query_device, ibv_query_port, ibv_query_pkey

 


IBV_QUERY_PKEY


NAME

ibv_query_pkey - query an InfiniBand port's P_Key table

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_query_pkey(struct ibv_context *context, uint8_t port_num,
                   int index, uint16_t *pkey);

DESCRIPTION

ibv_query_pkey() returns the P_Key value (in network byte order) in entry index of port port_num for device context context through the pointer pkey.
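Because the P_Key is returned in network byte order, it has to be converted before its bits are examined. The sketch below does the conversion byte by byte, so it does not depend on the endianness of the host. The demo_ helpers are hypothetical and not part of libibverbs. The split into a membership bit (most significant bit, set for full membership) and a 15-bit base value follows the InfiniBand P_Key layout.

```c
#include <stdint.h>
#include <string.h>

/* Convert a 16-bit value from network byte order to host byte order. */
static uint16_t demo_be16_to_host(uint16_t v_be)
{
        unsigned char b[2];

        memcpy(b, &v_be, sizeof b);
        return (uint16_t)((b[0] << 8) | b[1]);
}

/* Non-zero if the P_Key (network byte order) grants full membership. */
static int demo_pkey_is_full_member(uint16_t pkey_be)
{
        return (demo_be16_to_host(pkey_be) & 0x8000) != 0;
}

/* The 15-bit partition key base, in host byte order. */
static uint16_t demo_pkey_base(uint16_t pkey_be)
{
        return (uint16_t)(demo_be16_to_host(pkey_be) & 0x7fff);
}
```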

RETURN VALUE

ibv_query_pkey() returns 0 on success, and -1 on error.

SEE ALSO

ibv_open_device, ibv_query_device, ibv_query_port, ibv_query_gid

 


IBV_QUERY_PORT


NAME

ibv_query_port - query an RDMA port's attributes

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_query_port(struct ibv_context *context, uint8_t port_num,
                   struct ibv_port_attr *port_attr); 

DESCRIPTION

ibv_query_port() returns the attributes of port port_num for device context context through the pointer port_attr. The argument port_attr is an ibv_port_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_port_attr {
enum ibv_port_state     state;          /* Logical port state */
enum ibv_mtu            max_mtu;        /* Max MTU supported by port */
enum ibv_mtu            active_mtu;     /* Actual MTU */
int                     gid_tbl_len;    /* Length of source GID table */
uint32_t                port_cap_flags; /* Port capabilities */
uint32_t                max_msg_sz;     /* Maximum message size */
uint32_t                bad_pkey_cntr;  /* Bad P_Key counter */
uint32_t                qkey_viol_cntr; /* Q_Key violation counter */
uint16_t                pkey_tbl_len;   /* Length of partition table */
uint16_t                lid;            /* Base port LID */
uint16_t                sm_lid;         /* SM LID */
uint8_t                 lmc;            /* LMC of LID */
uint8_t                 max_vl_num;     /* Maximum number of VLs */
uint8_t                 sm_sl;          /* SM service level */
uint8_t                 subnet_timeout; /* Subnet propagation delay */
uint8_t                 init_type_reply;/* Type of initialization performed by SM */
uint8_t                 active_width;   /* Currently active link width */
uint8_t                 active_speed;   /* Currently active link speed */
uint8_t                 phys_state;     /* Physical port state */
};

RETURN VALUE

ibv_query_port() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

SEE ALSO

ibv_create_qp, ibv_destroy_qp, ibv_query_qp, ibv_create_ah

 

 

IBV_ALLOC_PD

IBV_DEALLOC_PD


NAME

ibv_alloc_pd, ibv_dealloc_pd - allocate or deallocate a protection domain (PD)

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_pd *ibv_alloc_pd(struct ibv_context *context);

int ibv_dealloc_pd(struct ibv_pd *pd);

DESCRIPTION

ibv_alloc_pd() allocates a PD for the RDMA device context context.

ibv_dealloc_pd() deallocates the PD pd.

RETURN VALUE

ibv_alloc_pd() returns a pointer to the allocated PD, or NULL if the request fails.

ibv_dealloc_pd() returns 0 on success, or the value of errno on failure (which indicates the failure reason).  

NOTES

ibv_dealloc_pd() may fail if any other resource is still associated with the PD being freed.  

SEE ALSO

ibv_reg_mr, ibv_create_srq, ibv_create_qp, ibv_create_ah, ibv_create_ah_from_wc

 

 

IBV_REG_MR

IBV_DEREG_MR


NAME

ibv_reg_mr, ibv_dereg_mr - register or deregister a memory region (MR)  

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
                          size_t length, int access);

int ibv_dereg_mr(struct ibv_mr *mr);

DESCRIPTION

ibv_reg_mr() registers a memory region (MR) associated with the protection domain pd. The MR's starting address is addr and its size is length. The argument access describes the desired memory protection attributes; it is either 0 or the bitwise OR of one or more of the following flags:

IBV_ACCESS_LOCAL_WRITE Enable Local Write Access
IBV_ACCESS_REMOTE_WRITE Enable Remote Write Access
IBV_ACCESS_REMOTE_READ Enable Remote Read Access
IBV_ACCESS_REMOTE_ATOMIC Enable Remote Atomic Operation Access (if supported)
IBV_ACCESS_MW_BIND Enable Memory Window Binding

If IBV_ACCESS_REMOTE_WRITE or IBV_ACCESS_REMOTE_ATOMIC is set, then IBV_ACCESS_LOCAL_WRITE must be set too.

Local read access is always enabled for the MR.
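The dependency between the remote and local write flags can be checked before calling ibv_reg_mr(). The following is a minimal sketch of such a check. The DEMO_ACCESS_* constants are stand-ins defined only so that the example compiles on its own; their values are an assumption, and real code should use the IBV_ACCESS_* constants from <infiniband/verbs.h>.

```c
/* Stand-in flag values (assumed); real code uses IBV_ACCESS_*. */
enum {
        DEMO_ACCESS_LOCAL_WRITE   = 1 << 0,
        DEMO_ACCESS_REMOTE_WRITE  = 1 << 1,
        DEMO_ACCESS_REMOTE_READ   = 1 << 2,
        DEMO_ACCESS_REMOTE_ATOMIC = 1 << 3,
        DEMO_ACCESS_MW_BIND       = 1 << 4
};

/*
 * Return non-zero if the access mask obeys the rule that remote write
 * and remote atomic access both require local write access.
 */
static int demo_access_is_valid(int access)
{
        int needs_local = DEMO_ACCESS_REMOTE_WRITE | DEMO_ACCESS_REMOTE_ATOMIC;

        if ((access & needs_local) && !(access & DEMO_ACCESS_LOCAL_WRITE))
                return 0;
        return 1;
}
```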

ibv_dereg_mr() deregisters the MR mr.

RETURN VALUE

ibv_reg_mr() returns a pointer to the registered MR, or NULL if the request fails. The local key (L_Key) field lkey is used as the lkey field of struct ibv_sge when posting buffers with ibv_post_* verbs, and the remote key (R_Key) field rkey is used by remote processes to perform Atomic and RDMA operations. The remote process places this rkey as the rkey field of struct ibv_send_wr passed to the ibv_post_send function.

ibv_dereg_mr() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

ibv_dereg_mr() fails if any memory window is still bound to this MR.

SEE ALSO

ibv_alloc_pd, ibv_post_send, ibv_post_recv, ibv_post_srq_recv

 


IBV_CREATE_AH


IBV_DESTROY_AH


NAME

ibv_create_ah, ibv_destroy_ah - create or destroy an address handle (AH)

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_ah *ibv_create_ah(struct ibv_pd *pd,
                             struct ibv_ah_attr *attr);

int ibv_destroy_ah(struct ibv_ah *ah); 

DESCRIPTION

ibv_create_ah() creates an address handle (AH) associated with the protection domain pd. The argument attr is an ibv_ah_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_ah_attr {
struct ibv_global_route grh;            /* Global Routing Header (GRH) attributes */
uint16_t                dlid;           /* Destination LID */
uint8_t                 sl;             /* Service Level */
uint8_t                 src_path_bits;  /* Source path bits */
uint8_t                 static_rate;    /* Maximum static rate */
uint8_t                 is_global;      /* GRH attributes are valid */
uint8_t                 port_num;       /* Physical port number */
};

struct ibv_global_route {
union ibv_gid           dgid;           /* Destination GID or MGID */
uint32_t                flow_label;     /* Flow label */
uint8_t                 sgid_index;     /* Source GID index */
uint8_t                 hop_limit;      /* Hop limit */
uint8_t                 traffic_class;  /* Traffic class */
};

ibv_destroy_ah() destroys the AH ah.

RETURN VALUE

ibv_create_ah() returns a pointer to the created AH, or NULL if the request fails.

ibv_destroy_ah() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

SEE ALSO

ibv_alloc_pd, ibv_init_ah_from_wc, ibv_create_ah_from_wc

 


IBV_CREATE_AH_FROM_WC


IBV_INIT_AH_FROM_WC


NAME

ibv_init_ah_from_wc, ibv_create_ah_from_wc - initialize or create an address handle (AH) from a work completion  

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num,
                        struct ibv_wc *wc, struct ibv_grh *grh,
                        struct ibv_ah_attr *ah_attr);

struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd,
                                     struct ibv_wc *wc,
                                     struct ibv_grh *grh,
                                     uint8_t port_num);

DESCRIPTION

ibv_init_ah_from_wc() initializes the address handle (AH) attribute structure ah_attr for the RDMA device context context using the port number port_num, using attributes from the work completion wc and the Global Routing Header (GRH) structure grh.

ibv_create_ah_from_wc() creates an AH associated with the protection domain pd using the port number port_num, using attributes from the work completion wc and the Global Routing Header (GRH) structure grh.

RETURN VALUE

ibv_init_ah_from_wc() returns 0 on success, and -1 on error.

ibv_create_ah_from_wc() returns a pointer to the created AH, or NULL if the request fails.  

NOTES

The filled structure ah_attr returned from ibv_init_ah_from_wc() can be used to create a new AH using ibv_create_ah().

SEE ALSO

ibv_open_device, ibv_alloc_pd, ibv_create_ah, ibv_destroy_ah, ibv_poll_cq

 

IBV_CREATE_COMP_CHANNEL

IBV_DESTROY_COMP_CHANNEL


NAME

ibv_create_comp_channel, ibv_destroy_comp_channel - create or destroy a completion event channel

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context);

int ibv_destroy_comp_channel(struct ibv_comp_channel *channel);

DESCRIPTION

ibv_create_comp_channel() creates a completion event channel for the RDMA device context context.

ibv_destroy_comp_channel() destroys the completion event channel channel.

RETURN VALUE

ibv_create_comp_channel() returns a pointer to the created completion event channel, or NULL if the request fails.

ibv_destroy_comp_channel() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

A "completion channel" is an abstraction introduced by libibverbs that does not exist in the InfiniBand Architecture verbs specification or RDMA Protocol Verbs Specification. A completion channel is essentially a file descriptor that is used to deliver completion notifications to a userspace process. When a completion event is generated for a completion queue (CQ), the event is delivered via the completion channel attached to that CQ. This may be useful to steer completion events to different threads by using multiple completion channels.

ibv_destroy_comp_channel() fails if any CQs are still associated with the completion event channel being destroyed.

SEE ALSO

ibv_open_device, ibv_create_cq, ibv_get_cq_event

 

IBV_CREATE_CQ

IBV_DESTROY_CQ


NAME

ibv_create_cq, ibv_destroy_cq - create or destroy a completion queue (CQ)  

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
                             void *cq_context,
                             struct ibv_comp_channel *channel,
                             int comp_vector);

int ibv_destroy_cq(struct ibv_cq *cq);

DESCRIPTION

ibv_create_cq() creates a completion queue (CQ) with at least cqe entries for the RDMA device context context. The pointer cq_context will be used to set the user context pointer of the CQ structure. The argument channel is optional; if not NULL, the completion channel channel will be used to return completion events. The CQ will use the completion vector comp_vector for signaling completion events; it must be at least zero and less than context->num_comp_vectors.

ibv_destroy_cq() destroys the CQ cq.

RETURN VALUE

ibv_create_cq() returns a pointer to the CQ, or NULL if the request fails.

ibv_destroy_cq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

ibv_create_cq() may create a CQ with size greater than or equal to the requested size. Check the cqe attribute in the returned CQ for the actual size.

ibv_destroy_cq() fails if any queue pair is still associated with this CQ.

SEE ALSO

ibv_resize_cq, ibv_req_notify_cq, ibv_ack_cq_events, ibv_create_qp

 

IBV_POLL_CQ


NAME

ibv_poll_cq - poll a completion queue (CQ)  

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_poll_cq(struct ibv_cq *cq, int num_entries,
                struct ibv_wc *wc);

DESCRIPTION

ibv_poll_cq() polls the CQ cq for work completions and returns the first num_entries (or all available completions if the CQ contains fewer than this number) in the array wc. The argument wc is a pointer to an array of ibv_wc structs, as defined in <infiniband/verbs.h>.

struct ibv_wc {
uint64_t                wr_id;          /* ID of the completed Work Request (WR) */
enum ibv_wc_status      status;         /* Status of the operation */
enum ibv_wc_opcode      opcode;         /* Operation type specified in the completed WR */
uint32_t                vendor_err;     /* Vendor error syndrome */
uint32_t                byte_len;       /* Number of bytes transferred */
uint32_t                imm_data;       /* Immediate data (in network byte order) */
uint32_t                qp_num;         /* Local QP number of completed WR */
uint32_t                src_qp;         /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */
int                     wc_flags;       /* Flags of the completed WR */
uint16_t                pkey_index;     /* P_Key index (valid only for GSI QPs) */
uint16_t                slid;           /* Source LID */
uint8_t                 sl;             /* Service Level */
uint8_t                 dlid_path_bits; /* DLID path bits (not applicable for multicast messages) */
};

The attribute wc_flags describes the properties of the work completion. It is either 0 or the bitwise OR of one or more of the following flags:

IBV_WC_GRH GRH is present (valid only for UD QPs)
IBV_WC_WITH_IMM Immediate data value is valid

Not all wc attributes are always valid. If the completion status is other than IBV_WC_SUCCESS, only the following attributes are valid: wr_id, status, qp_num, and vendor_err.

RETURN VALUE

On success, ibv_poll_cq() returns a non-negative value equal to the number of completions found. On failure, a negative value is returned.

NOTES

Each polled completion is removed from the CQ and cannot be returned to it.

The user should consume work completions at a rate that prevents the CQ from overrunning. If a CQ overrun occurs, the async event IBV_EVENT_CQ_ERR will be triggered, and the CQ cannot be used.

SEE ALSO

ibv_post_send, ibv_post_recv

 

IBV_RESIZE_CQ


NAME

ibv_resize_cq - resize a completion queue (CQ)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_resize_cq(struct ibv_cq *cq, int cqe);

DESCRIPTION

ibv_resize_cq() resizes the completion queue (CQ) cq to have at least cqe entries. cqe must be at least the number of unpolled entries in the CQ cq. If cqe is a valid value less than the current CQ size, ibv_resize_cq() may not do anything, since this function is only guaranteed to resize the CQ to a size at least as big as the requested size.

RETURN VALUE

ibv_resize_cq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

ibv_resize_cq() may assign a CQ size greater than or equal to the requested size. The cqe member of cq will be updated to the actual size.

SEE ALSO

ibv_create_cq, ibv_destroy_cq

 


IBV_GET_CQ_EVENT


IBV_ACK_CQ_EVENTS


NAME

ibv_get_cq_event, ibv_ack_cq_events - get and acknowledge completion queue (CQ) events

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_get_cq_event(struct ibv_comp_channel *channel,
                     struct ibv_cq **cq, void **cq_context);

void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents);

DESCRIPTION

ibv_get_cq_event() waits for the next completion event in the completion event channel channel. It fills the argument cq with the CQ that got the event and cq_context with the CQ's context.

ibv_ack_cq_events() acknowledges nevents events on the CQ cq.

RETURN VALUE

ibv_get_cq_event() returns 0 on success, and -1 on error.

ibv_ack_cq_events() returns no value.  

NOTES

All completion events that ibv_get_cq_event() returns must be acknowledged using ibv_ack_cq_events(). To avoid races, destroying a CQ will wait for all completion events to be acknowledged; this guarantees a one-to-one correspondence between acks and successful gets.

Calling ibv_ack_cq_events() may be relatively expensive in the datapath, since it must take a mutex. Therefore it may be better to amortize this cost by keeping a count of the number of events needing acknowledgement and acking several completion events in one call to ibv_ack_cq_events().
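The amortization described above can be sketched with a small counter that reports when enough events have accumulated to be acknowledged in one call. The demo_ack_batch type and its functions are hypothetical names used only for this illustration; the caller would pass the returned count as nevents to ibv_ack_cq_events().

```c
/* Bookkeeping for events that were received but not yet acknowledged. */
struct demo_ack_batch {
        unsigned int pending;    /* events awaiting acknowledgement */
        unsigned int threshold;  /* acknowledge once this many are pending */
};

/*
 * Record one received completion event. Returns the number of events
 * the caller should acknowledge now, or 0 if the batch is not full yet.
 */
static unsigned int demo_note_event(struct demo_ack_batch *b)
{
        unsigned int n;

        if (++b->pending < b->threshold)
                return 0;
        n = b->pending;
        b->pending = 0;
        return n;
}

/*
 * Return all events that are still unacknowledged. Intended to be
 * called before the CQ is destroyed, since destruction waits for
 * every event to be acknowledged.
 */
static unsigned int demo_flush(struct demo_ack_batch *b)
{
        unsigned int n = b->pending;

        b->pending = 0;
        return n;
}
```

A non-zero return value from either function is the count to acknowledge in a single call.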

EXAMPLES

The following code example demonstrates one possible way to work with completion events. It performs the following steps:

Stage I: Preparation
1. Creates a CQ
2. Requests notification upon a new (first) completion event

Stage II: Completion Handling Routine
3. Waits for the completion event and acks it
4. Requests notification upon the next completion event
5. Empties the CQ

Note that an extra event may be triggered without having a corresponding completion entry in the CQ. This occurs if a completion entry is added to the CQ between Step 4 and Step 5, and the CQ is then emptied (polled) in Step 5.

cq = ibv_create_cq(ctx, 1, ev_ctx, channel, 0);
if (!cq) {
        fprintf(stderr, "Failed to create CQ\n");
        return 1;
}

/* Request notification before any completion can be created */
if (ibv_req_notify_cq(cq, 0)) {
        fprintf(stderr, "Couldn't request CQ notification\n");
        return 1;
}

.
.
.

/* Wait for the completion event */
if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx)) {
        fprintf(stderr, "Failed to get cq_event\n");
        return 1;
}

/* Ack the event */
ibv_ack_cq_events(ev_cq, 1);

/* Request notification upon the next completion event */
if (ibv_req_notify_cq(ev_cq, 0)) {
        fprintf(stderr, "Couldn't request CQ notification\n");
        return 1;
}

/* Empty the CQ: poll all of the completions from the CQ (if any exist) */
do {
        ne = ibv_poll_cq(cq, 1, &wc);
        if (ne < 0) {
                fprintf(stderr, "Failed to poll completions from the CQ\n");
                return 1;
        }

        /* there may be an extra event with no completion in the CQ */
        if (ne == 0)
                continue;

        if (wc.status != IBV_WC_SUCCESS) {
                fprintf(stderr, "Completion with status 0x%x was found\n", wc.status);
                return 1;
        }
} while (ne);

The following code example demonstrates one possible way to work with completion events in non-blocking mode. It performs the following steps:

1. Set the completion event channel to non-blocking mode
2. Poll the channel until it has a completion event
3. Get the completion event and ack it

/* change the blocking mode of the completion channel */
flags = fcntl(channel->fd, F_GETFL);
rc = fcntl(channel->fd, F_SETFL, flags | O_NONBLOCK);
if (rc < 0) {
        fprintf(stderr, "Failed to change file descriptor of completion event channel\n");
        return 1;
}


/*
 * poll the channel until it has an event, waiting up to ms_timeout
 * milliseconds in each iteration
 */
my_pollfd.fd      = channel->fd;
my_pollfd.events  = POLLIN;
my_pollfd.revents = 0;

do {
        rc = poll(&my_pollfd, 1, ms_timeout);
} while (rc == 0);
if (rc < 0) {
        fprintf(stderr, "poll failed\n");
        return 1;
}
ev_cq = cq;

/* Wait for the completion event */
if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx)) {
        fprintf(stderr, "Failed to get cq_event\n");
        return 1;
}

/* Ack the event */
ibv_ack_cq_events(ev_cq, 1);

SEE ALSO

ibv_create_comp_channel, ibv_create_cq, ibv_req_notify_cq, ibv_poll_cq

 


IBV_REQ_NOTIFY_CQ


NAME

ibv_req_notify_cq - request completion notification on a completion queue (CQ)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only);

DESCRIPTION

ibv_req_notify_cq() requests a completion notification on the completion queue (CQ) cq.

Upon the addition of a new CQ entry (CQE) to cq, a completion event will be added to the completion channel associated with the CQ. If the argument solicited_only is zero, a completion event is generated for any new CQE. If solicited_only is non-zero, an event is only generated for a new CQE that is considered "solicited." A CQE is solicited if it is a receive completion for a message with the Solicited Event header bit set, or if the status is not successful. All other successful receive completions, and all successful send completions, are unsolicited.
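The rule for when a new CQE produces a completion event can be written as a pair of predicates. This is only a model of the behavior described above, with hypothetical function and parameter names; the actual decision is made by the device, not by application code.

```c
/*
 * A CQE is solicited if its status is not successful, or if it is a
 * receive completion for a message with the Solicited Event bit set.
 */
static int demo_cqe_is_solicited(int is_receive, int solicited_bit,
                                 int status_ok)
{
        return !status_ok || (is_receive && solicited_bit);
}

/*
 * Non-zero if a new CQE generates a completion event on an armed CQ,
 * given the solicited_only argument passed to ibv_req_notify_cq().
 */
static int demo_cqe_raises_event(int solicited_only, int is_receive,
                                 int solicited_bit, int status_ok)
{
        if (!solicited_only)
                return 1;       /* any new CQE generates an event */
        return demo_cqe_is_solicited(is_receive, solicited_bit, status_ok);
}
```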

RETURN VALUE

ibv_req_notify_cq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

The request for notification is "one shot." Only one completion event will be generated for each call to ibv_req_notify_cq().  

SEE ALSO

ibv_create_comp_channel, ibv_create_cq, ibv_get_cq_event

 

 


IBV_CREATE_SRQ


IBV_CREATE_XRC_SRQ


IBV_DESTROY_SRQ


NAME

ibv_create_srq, ibv_create_xrc_srq, ibv_destroy_srq - create or destroy a shared receive queue (SRQ)

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_srq *ibv_create_srq(struct ibv_pd *pd,
                               struct ibv_srq_init_attr *srq_init_attr);

struct ibv_srq *ibv_create_xrc_srq(struct ibv_pd *pd,
                                   struct ibv_xrc_domain *xrc_domain,
                                   struct ibv_cq *xrc_cq,
                                   struct ibv_srq_init_attr *srq_init_attr);

int ibv_destroy_srq(struct ibv_srq *srq);

DESCRIPTION

ibv_create_srq() creates a shared receive queue (SRQ) associated with the protection domain pd.

ibv_create_xrc_srq() creates an XRC shared receive queue (SRQ) associated with the protection domain pd, the XRC domain xrc_domain and the CQ which will hold the XRC completion xrc_cq.

The argument srq_init_attr is an ibv_srq_init_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_srq_init_attr {
void                   *srq_context;    /* Associated context of the SRQ */
struct ibv_srq_attr     attr;           /* SRQ attributes */
};

struct ibv_srq_attr {
uint32_t                max_wr;         /* Requested max number of outstanding work requests (WRs) in the SRQ */
uint32_t                max_sge;        /* Requested max number of scatter elements per WR */
uint32_t                srq_limit;      /* The limit value of the SRQ (irrelevant for ibv_create_srq) */
};

The function ibv_create_srq() will update the srq_init_attr struct with the actual values of the SRQ that was created; the values of max_wr and max_sge will be greater than or equal to the values requested.

ibv_destroy_srq() destroys the SRQ srq.

RETURN VALUE

ibv_create_srq() returns a pointer to the created SRQ, or NULL if the request fails.

ibv_destroy_srq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

ibv_destroy_srq() fails if any queue pair is still associated with this SRQ.

SEE ALSO

ibv_alloc_pd, ibv_modify_srq, ibv_query_srq

 


IBV_MODIFY_SRQ


NAME

ibv_modify_srq - modify attributes of a shared receive queue (SRQ)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_modify_srq(struct ibv_srq *srq,
                   struct ibv_srq_attr *srq_attr,
                   int srq_attr_mask);

DESCRIPTION

ibv_modify_srq() modifies the attributes of SRQ srq with the attributes in srq_attr according to the mask srq_attr_mask. The argument srq_attr is an ibv_srq_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_srq_attr {
uint32_t                max_wr;      /* maximum number of outstanding work requests (WRs) in the SRQ */
uint32_t                max_sge;     /* number of scatter elements per WR (irrelevant for ibv_modify_srq) */
uint32_t                srq_limit;   /* the limit value of the SRQ */
};

The argument srq_attr_mask specifies the SRQ attributes to be modified. The argument is either 0 or the bitwise OR of one or more of the following flags:

IBV_SRQ_MAX_WR Resize the SRQ
IBV_SRQ_LIMIT Set the SRQ limit

RETURN VALUE

ibv_modify_srq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

If any of the modify attributes is invalid, none of the attributes will be modified.

Not all devices support resizing SRQs. To check if a device supports it, check if the IBV_DEVICE_SRQ_RESIZE bit is set in the device capabilities flags.

Modifying the srq_limit arms the SRQ to produce an IBV_EVENT_SRQ_LIMIT_REACHED "low watermark" asynchronous event once the number of WRs in the SRQ drops below srq_limit.
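
The arming rule can be pictured with a small model. The sketch below does not call libibverbs: struct demo_srq, demo_srq_arm() and demo_srq_consume() are hypothetical names that only mimic the behavior described in this page and in ibv_query_srq. The event fires once when the number of WRs drops below srq_limit, after which the limit reads back as 0 until it is set again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an SRQ; NOT a libibverbs type. */
struct demo_srq {
    uint32_t posted;     /* WRs currently waiting in the SRQ */
    uint32_t srq_limit;  /* 0 means the limit event is not armed */
};

/* Arm the limit event, as ibv_modify_srq() with IBV_SRQ_LIMIT would. */
void demo_srq_arm(struct demo_srq *q, uint32_t limit)
{
    assert(q != NULL);
    q->srq_limit = limit;
}

/*
 * Consume one WR. Returns 1 if this step would raise the
 * "limit reached" event, 0 otherwise. The event is one-shot.
 */
int demo_srq_consume(struct demo_srq *q)
{
    assert(q != NULL);
    if (q->posted == 0)
        return 0;                 /* nothing left to consume */
    q->posted--;
    if (q->srq_limit != 0 && q->posted < q->srq_limit) {
        q->srq_limit = 0;         /* disarmed until set again */
        return 1;
    }
    return 0;
}
```

In a real application the event would arrive through the asynchronous event channel, and the handler would typically post more WRs and set srq_limit again.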

SEE ALSO

ibv_query_device, ibv_create_srq, ibv_destroy_srq, ibv_query_srq

 


IBV_QUERY_SRQ


NAME

ibv_query_srq - get the attributes of a shared receive queue (SRQ)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr);

DESCRIPTION

ibv_query_srq() gets the attributes of the SRQ srq and returns them through the pointer srq_attr. The argument srq_attr is an ibv_srq_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_srq_attr {
uint32_t                max_wr;         /* maximum number of outstanding work requests (WRs) in the SRQ */
uint32_t                max_sge;        /* maximum number of scatter elements per WR */
uint32_t                srq_limit;      /* the limit value of the SRQ */
}; 

RETURN VALUE

ibv_query_srq() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

If the value returned for srq_limit is 0, then the SRQ limit reached ("low watermark") event is not (or no longer) armed, and no asynchronous events will be generated until the event is rearmed.  

SEE ALSO

ibv_create_srq, ibv_destroy_srq, ibv_modify_srq

 

 

IBV_CREATE_XRC_RCV_QP


NAME

ibv_create_xrc_rcv_qp - create an XRC queue pair (QP) for serving as a receive-side only QP

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_create_xrc_rcv_qp(struct ibv_qp_init_attr *init_attr,
                          uint32_t *xrc_rcv_qpn); 

DESCRIPTION

ibv_create_xrc_rcv_qp() creates an XRC queue pair (QP) for serving as a receive-side only QP and returns its number through the pointer xrc_rcv_qpn. This QP number should be passed to the remote node (sender). The remote node will use xrc_rcv_qpn in ibv_post_send() when sending to an XRC SRQ on this host in the same XRC domain as the XRC receive QP. This QP is created in kernel space, and persists until the last process registered for the QP calls ibv_unreg_xrc_rcv_qp() (at which time the QP is destroyed).

The process which creates this QP is automatically registered for it, and should also call ibv_unreg_xrc_rcv_qp() at some point, to unregister.

Processes which wish to receive on an XRC SRQ via this QP should call ibv_reg_xrc_rcv_qp() for this QP, to guarantee that the QP will not be destroyed while they are still using it for receiving on the XRC SRQ.

The argument init_attr is an ibv_qp_init_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_qp_init_attr {
void                   *qp_context;     /* value is being ignored */
struct ibv_cq          *send_cq;        /* value is being ignored */ 
struct ibv_cq          *recv_cq;        /* value is being ignored */
struct ibv_srq         *srq;            /* value is being ignored */
struct ibv_qp_cap       cap;            /* value is being ignored */
enum ibv_qp_type        qp_type;        /* value is being ignored */
int                     sq_sig_all;     /* value is being ignored */
struct ibv_xrc_domain  *xrc_domain;     /* XRC domain the QP will be associated with */
};

Most of the attributes in init_attr are ignored because this QP is a receive-only QP and all receive requests (RRs) are posted to an SRQ.

RETURN VALUE

ibv_create_xrc_rcv_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

SEE ALSO

ibv_open_xrc_domain, ibv_modify_xrc_rcv_qp, ibv_query_xrc_rcv_qp, ibv_reg_xrc_rcv_qp, ibv_unreg_xrc_rcv_qp, ibv_post_send

 

IBV_MODIFY_XRC_RCV_QP


NAME

ibv_modify_xrc_rcv_qp - modify the attributes of an XRC receive queue pair (QP)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_modify_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num,
                          struct ibv_qp_attr *attr, int attr_mask);

DESCRIPTION

ibv_modify_xrc_rcv_qp() modifies the attributes of the XRC receive QP with the number xrc_qp_num, which is associated with the XRC domain xrc_domain, using the attributes in attr according to the mask attr_mask. It moves the QP state through the following transitions: Reset -> Init -> RTR. attr_mask should indicate all of the attributes which will be used in this QP transition, and at least the following masks should be set:

Next state     Required attributes
----------     ----------------------------------------
Init           IBV_QP_STATE, IBV_QP_PKEY_INDEX, IBV_QP_PORT, 
               IBV_QP_ACCESS_FLAGS 
RTR            IBV_QP_STATE, IBV_QP_AV, IBV_QP_PATH_MTU, 
               IBV_QP_DEST_QPN, IBV_QP_RQ_PSN, 
               IBV_QP_MAX_DEST_RD_ATOMIC, IBV_QP_MIN_RNR_TIMER 

The user can add optional attributes as well.

The argument attr is an ibv_qp_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_qp_attr {
enum ibv_qp_state       qp_state;               /* Move the QP to this state */
enum ibv_qp_state       cur_qp_state;           /* Assume this is the current QP state */
enum ibv_mtu            path_mtu;               /* Path MTU (valid only for RC/UC QPs) */
enum ibv_mig_state      path_mig_state;         /* Path migration state (valid if HCA supports APM) */
uint32_t                qkey;                   /* Q_Key for the QP (valid only for UD QPs) */
uint32_t                rq_psn;                 /* PSN for receive queue (valid only for RC/UC QPs) */
uint32_t                sq_psn;                 /* PSN for send queue (valid only for RC/UC QPs) */
uint32_t                dest_qp_num;            /* Destination QP number (valid only for RC/UC QPs) */
int                     qp_access_flags;        /* Mask of enabled remote access operations (valid only for RC/UC QPs) */
struct ibv_qp_cap       cap;                    /* QP capabilities (valid if HCA supports QP resizing) */
struct ibv_ah_attr      ah_attr;                /* Primary path address vector (valid only for RC/UC QPs) */
struct ibv_ah_attr      alt_ah_attr;            /* Alternate path address vector (valid only for RC/UC QPs) */
uint16_t                pkey_index;             /* Primary P_Key index */
uint16_t                alt_pkey_index;         /* Alternate P_Key index */
uint8_t                 en_sqd_async_notify;    /* Enable SQD.drained async notification (Valid only if qp_state is SQD) */
uint8_t                 sq_draining;            /* Is the QP draining? Irrelevant for ibv_modify_qp() */
uint8_t                 max_rd_atomic;          /* Number of outstanding RDMA reads & atomic operations on the destination QP (valid only for RC QPs) */
uint8_t                 max_dest_rd_atomic;     /* Number of responder resources for handling incoming RDMA reads & atomic operations (valid only for RC QPs) */
uint8_t                 min_rnr_timer;          /* Minimum RNR NAK timer (valid only for RC QPs) */
uint8_t                 port_num;               /* Primary port number */
uint8_t                 timeout;                /* Local ack timeout for primary path (valid only for RC QPs) */
uint8_t                 retry_cnt;              /* Retry count (valid only for RC QPs) */
uint8_t                 rnr_retry;              /* RNR retry (valid only for RC QPs) */
uint8_t                 alt_port_num;           /* Alternate port number */
uint8_t                 alt_timeout;            /* Local ack timeout for alternate path (valid only for RC QPs) */
};

For details on struct ibv_qp_cap see the description of ibv_create_qp(). For details on struct ibv_ah_attr see the description of ibv_create_ah().

The argument attr_mask specifies the QP attributes to be modified. The argument is either 0 or the bitwise OR of one or more of the following flags:

IBV_QP_STATE Modify qp_state
IBV_QP_CUR_STATE Set cur_qp_state
IBV_QP_EN_SQD_ASYNC_NOTIFY Set en_sqd_async_notify
IBV_QP_ACCESS_FLAGS Set qp_access_flags
IBV_QP_PKEY_INDEX Set pkey_index
IBV_QP_PORT Set port_num
IBV_QP_QKEY Set qkey
IBV_QP_AV Set ah_attr
IBV_QP_PATH_MTU Set path_mtu
IBV_QP_TIMEOUT Set timeout
IBV_QP_RETRY_CNT Set retry_cnt
IBV_QP_RNR_RETRY Set rnr_retry
IBV_QP_RQ_PSN Set rq_psn
IBV_QP_MAX_QP_RD_ATOMIC Set max_rd_atomic
IBV_QP_ALT_PATH Set the alternative path via: alt_ah_attr, alt_pkey_index, alt_port_num, alt_timeout
IBV_QP_MIN_RNR_TIMER Set min_rnr_timer
IBV_QP_SQ_PSN Set sq_psn
IBV_QP_MAX_DEST_RD_ATOMIC Set max_dest_rd_atomic
IBV_QP_PATH_MIG_STATE Set path_mig_state
IBV_QP_CAP Set cap
IBV_QP_DEST_QPN Set dest_qp_num

RETURN VALUE

ibv_modify_xrc_rcv_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

If any of the modify attributes or the modify mask are invalid, none of the attributes will be modified (including the QP state).

Not all devices support alternate paths. To check if a device supports it, check if the IBV_DEVICE_AUTO_PATH_MIG bit is set in the device capabilities flags.

SEE ALSO

ibv_open_xrc_domain, ibv_create_xrc_rcv_qp, ibv_query_xrc_rcv_qp

 


IBV_OPEN_XRC_DOMAIN


IBV_CLOSE_XRC_DOMAIN


NAME

ibv_open_xrc_domain, ibv_close_xrc_domain - open or close an eXtended Reliable Connection (XRC) domain

SYNOPSIS

#include <fcntl.h>
#include <infiniband/verbs.h>

struct ibv_xrc_domain *ibv_open_xrc_domain(struct ibv_context *context,
                                           int fd, int oflag);
int ibv_close_xrc_domain(struct ibv_xrc_domain *d);

DESCRIPTION

ibv_open_xrc_domain() opens an XRC domain for the InfiniBand device context context, or returns a reference to an already opened one. fd is the file descriptor to be associated with the XRC domain. The argument oflag describes the desired file creation attributes; it is either 0 or the bitwise OR of one or more of the following flags:

O_CREAT
If a domain belonging to the device named by context is already associated with the inode, this flag has no effect, except as noted under O_EXCL below. Otherwise, a new XRC domain is created and is associated with the inode specified by fd.
O_EXCL
If O_EXCL and O_CREAT are set, open will fail if a domain associated with the inode exists. The check for the existence of the domain and creation of the domain if it does not exist is atomic with respect to other processes executing open with fd naming the same inode.

If fd equals -1, no inode is associated with the domain, and the only valid value for oflag is O_CREAT.
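
The oflag rules above can be summarized as a small decision function. This is only a sketch of the documented rules, not the library's implementation: demo_xrc_open() and enum demo_open_result are hypothetical names, and the caller supplies whether a domain already exists for the inode. The last branch, failing when no domain exists and O_CREAT is not set, is an assumption by analogy with open(); the text above does not state it.

```c
#include <assert.h>
#include <fcntl.h>   /* O_CREAT, O_EXCL */

enum demo_open_result {
    DEMO_OPEN_FAIL,      /* the call would return NULL */
    DEMO_OPEN_EXISTING,  /* a reference to the existing domain */
    DEMO_OPEN_CREATED    /* a new domain is created */
};

/* Model of the oflag handling described for ibv_open_xrc_domain(). */
enum demo_open_result demo_xrc_open(int fd, int oflag, int domain_exists)
{
    assert(domain_exists == 0 || domain_exists == 1);

    /* No inode: the only valid oflag value is O_CREAT. */
    if (fd == -1)
        return (oflag == O_CREAT) ? DEMO_OPEN_CREATED : DEMO_OPEN_FAIL;

    if (domain_exists) {
        /* O_CREAT together with O_EXCL refuses an existing domain. */
        if ((oflag & O_CREAT) && (oflag & O_EXCL))
            return DEMO_OPEN_FAIL;
        return DEMO_OPEN_EXISTING;
    }

    /* Assumption: without O_CREAT a missing domain is an error. */
    return (oflag & O_CREAT) ? DEMO_OPEN_CREATED : DEMO_OPEN_FAIL;
}
```

A process that creates a shared domain would typically pass O_CREAT | O_EXCL, while processes that join it later pass 0 or O_CREAT with an fd naming the same inode.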

ibv_close_xrc_domain() closes the XRC domain d. If this is the last reference, the XRC domain will be destroyed.

RETURN VALUE

ibv_open_xrc_domain() returns a pointer to an opened XRC domain, or NULL if the request fails.

ibv_close_xrc_domain() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

Not all devices support XRC. To check if a device supports it, check if the IBV_DEVICE_XRC bit is set in the device capabilities flags.

ibv_close_xrc_domain() may fail if any QP or SRQ are still associated with the XRC domain being closed.

SEE ALSO

ibv_create_xrc_srq, ibv_create_qp, ibv_create_xrc_rcv_qp, ibv_modify_xrc_rcv_qp, ibv_query_xrc_rcv_qp, ibv_reg_xrc_rcv_qp

 


IBV_QUERY_XRC_RCV_QP


NAME

ibv_query_xrc_rcv_qp - get the attributes of an XRC receive queue pair (QP)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_query_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num,
                         struct ibv_qp_attr *attr, int attr_mask,
                         struct ibv_qp_init_attr *init_attr);

DESCRIPTION

ibv_query_xrc_rcv_qp() gets the attributes specified in attr_mask for the XRC receive QP with the number xrc_qp_num which is associated with the XRC domain xrc_domain and returns them through the pointers attr and init_attr. The argument attr is an ibv_qp_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_qp_attr {
enum ibv_qp_state       qp_state;            /* Current QP state */
enum ibv_qp_state       cur_qp_state;        /* Current QP state - irrelevant for ibv_query_qp */
enum ibv_mtu            path_mtu;            /* Path MTU (valid only for RC/UC QPs) */
enum ibv_mig_state      path_mig_state;      /* Path migration state (valid if HCA supports APM) */
uint32_t                qkey;                /* Q_Key of the QP (valid only for UD QPs) */
uint32_t                rq_psn;              /* PSN for receive queue (valid only for RC/UC QPs) */
uint32_t                sq_psn;              /* PSN for send queue (valid only for RC/UC QPs) */
uint32_t                dest_qp_num;         /* Destination QP number (valid only for RC/UC QPs) */
int                     qp_access_flags;     /* Mask of enabled remote access operations (valid only for RC/UC QPs) */
struct ibv_qp_cap       cap;                 /* QP capabilities */
struct ibv_ah_attr      ah_attr;             /* Primary path address vector (valid only for RC/UC QPs) */
struct ibv_ah_attr      alt_ah_attr;         /* Alternate path address vector (valid only for RC/UC QPs) */
uint16_t                pkey_index;          /* Primary P_Key index */
uint16_t                alt_pkey_index;      /* Alternate P_Key index */
uint8_t                 en_sqd_async_notify; /* Enable SQD.drained async notification - irrelevant for ibv_query_qp */
uint8_t                 sq_draining;         /* Is the QP draining? (Valid only if qp_state is SQD) */
uint8_t                 max_rd_atomic;       /* Number of outstanding RDMA reads & atomic operations on the destination QP (valid only for RC QPs) */
uint8_t                 max_dest_rd_atomic;  /* Number of responder resources for handling incoming RDMA reads & atomic operations (valid only for RC QPs) */
uint8_t                 min_rnr_timer;       /* Minimum RNR NAK timer (valid only for RC QPs) */
uint8_t                 port_num;            /* Primary port number */
uint8_t                 timeout;             /* Local ack timeout for primary path (valid only for RC QPs) */
uint8_t                 retry_cnt;           /* Retry count (valid only for RC QPs) */
uint8_t                 rnr_retry;           /* RNR retry (valid only for RC QPs) */
uint8_t                 alt_port_num;        /* Alternate port number */
uint8_t                 alt_timeout;         /* Local ack timeout for alternate path (valid only for RC QPs) */
};

For details on struct ibv_qp_cap see the description of ibv_create_qp(). For details on struct ibv_ah_attr see the description of ibv_create_ah().

RETURN VALUE

ibv_query_xrc_rcv_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

The argument attr_mask is a hint that specifies the minimum list of attributes to retrieve. Some InfiniBand devices may return extra attributes not requested, for example if the value can be returned cheaply.

Attribute values are valid if they have been set using ibv_modify_xrc_rcv_qp(). The exact list of valid attributes depends on the QP state.

Multiple calls to ibv_query_xrc_rcv_qp() may yield some differences in the values returned for the following attributes: qp_state, path_mig_state, sq_draining, ah_attr (if APM is enabled).

SEE ALSO

ibv_open_xrc_domain, ibv_create_xrc_rcv_qp, ibv_modify_xrc_rcv_qp

 


IBV_REG_XRC_RCV_QP


IBV_UNREG_XRC_RCV_QP


NAME

ibv_reg_xrc_rcv_qp, ibv_unreg_xrc_rcv_qp - register and unregister a user process with an XRC receive queue pair (QP)  

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_reg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num);
int ibv_unreg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num); 

DESCRIPTION

ibv_reg_xrc_rcv_qp() registers a user process with the XRC receive QP (created via ibv_create_xrc_rcv_qp() ) whose number is xrc_qp_num, and which is associated with the XRC domain xrc_domain.

ibv_unreg_xrc_rcv_qp() unregisters a user process from the XRC receive QP number xrc_qp_num, which is associated with the XRC domain xrc_domain. When the number of user processes registered with this XRC receive QP drops to zero, the QP is destroyed.

RETURN VALUE

ibv_reg_xrc_rcv_qp() and ibv_unreg_xrc_rcv_qp() return 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

ibv_reg_xrc_rcv_qp() and ibv_unreg_xrc_rcv_qp() may fail if the number xrc_qp_num is not a number of a valid XRC receive QP (the QP is not allocated or it is the number of a non-XRC QP), or the XRC receive QP was created with an XRC domain other than xrc_domain.

If a process is still registered with any XRC RCV QPs belonging to some domain, ibv_close_xrc_domain() will return failure if called for that domain in that process.

ibv_create_xrc_rcv_qp() performs an implicit registration for the creating process; when that process is finished with the XRC RCV QP, it should call ibv_unreg_xrc_rcv_qp() for that QP. Note that if no other processes are registered with the QP at this time, its registration count will drop to zero and it will be destroyed.  
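
The lifetime rule amounts to reference counting, which the following model illustrates. It is a sketch only: struct demo_xrc_rcv_qp and the demo_* functions are hypothetical and do not call libibverbs, and the model simply counts registrations without tracking which process made them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an XRC receive QP's registration count. */
struct demo_xrc_rcv_qp {
    unsigned registered;  /* number of registered processes */
    int      alive;       /* 0 once the QP has been destroyed */
};

/* Creation implicitly registers the creating process. */
void demo_create(struct demo_xrc_rcv_qp *qp)
{
    assert(qp != NULL);
    qp->registered = 1;
    qp->alive = 1;
}

/* Register one more process. Returns 0, or -1 if the QP is gone. */
int demo_reg(struct demo_xrc_rcv_qp *qp)
{
    if (!qp->alive)
        return -1;
    qp->registered++;
    return 0;
}

/* Unregister one process. Returns 0, or -1 if the QP is gone. */
int demo_unreg(struct demo_xrc_rcv_qp *qp)
{
    if (!qp->alive)
        return -1;
    assert(qp->registered > 0);
    if (--qp->registered == 0)
        qp->alive = 0;    /* last registration gone: QP destroyed */
    return 0;
}
```

The model shows why a receiving process should register before the creating process unregisters: otherwise the count can reach zero and the QP disappears while it is still needed.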

SEE ALSO

ibv_open_xrc_domain, ibv_create_xrc_rcv_qp

 

 


IBV_CREATE_QP


IBV_DESTROY_QP


NAME

ibv_create_qp, ibv_destroy_qp - create or destroy a queue pair (QP)

SYNOPSIS

#include <infiniband/verbs.h>

struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
                             struct ibv_qp_init_attr *qp_init_attr);

int ibv_destroy_qp(struct ibv_qp *qp);

DESCRIPTION

ibv_create_qp() creates a queue pair (QP) associated with the protection domain pd. The argument qp_init_attr is an ibv_qp_init_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_qp_init_attr {
void                   *qp_context;     /* Associated context of the QP */
struct ibv_cq          *send_cq;        /* CQ to be associated with the Send Queue (SQ) */ 
struct ibv_cq          *recv_cq;        /* CQ to be associated with the Receive Queue (RQ) */
struct ibv_srq         *srq;            /* SRQ handle if QP is to be associated with an SRQ, otherwise NULL */
struct ibv_qp_cap       cap;            /* QP capabilities */
enum ibv_qp_type        qp_type;        /* QP Transport Service Type: IBV_QPT_RC, IBV_QPT_UC, IBV_QPT_UD or IBV_QPT_XRC */
int                     sq_sig_all;     /* If set, each Work Request (WR) submitted to the SQ generates a completion entry */
struct ibv_xrc_domain  *xrc_domain;     /* XRC domain the QP will be associated with (valid only for IBV_QPT_XRC QP), otherwise NULL */
};

struct ibv_qp_cap {
uint32_t                max_send_wr;    /* Requested max number of outstanding WRs in the SQ */
uint32_t                max_recv_wr;    /* Requested max number of outstanding WRs in the RQ */
uint32_t                max_send_sge;   /* Requested max number of scatter/gather (s/g) elements in a WR in the SQ */
uint32_t                max_recv_sge;   /* Requested max number of s/g elements in a WR in the RQ */
uint32_t                max_inline_data;/* Requested max number of data (bytes) that can be posted inline to the SQ, otherwise 0 */
};

The function ibv_create_qp() will update the qp_init_attr->cap struct with the actual QP values of the QP that was created; the values will be greater than or equal to the values requested.

ibv_destroy_qp() destroys the QP qp.

RETURN VALUE

ibv_create_qp() returns a pointer to the created QP, or NULL if the request fails. Check the QP number (qp_num) in the returned QP.

ibv_destroy_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

ibv_create_qp() will fail if it is asked to create a QP of a type other than IBV_QPT_RC or IBV_QPT_UD associated with an SRQ.

The attributes max_recv_wr and max_recv_sge are ignored by ibv_create_qp() if the QP is to be associated with an SRQ.

ibv_destroy_qp() fails if the QP is attached to a multicast group.

SEE ALSO

ibv_alloc_pd, ibv_modify_qp, ibv_query_qp

 


IBV_MODIFY_QP


NAME

ibv_modify_qp - modify the attributes of a queue pair (QP)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
                  int attr_mask); 

DESCRIPTION

ibv_modify_qp() modifies the attributes of QP qp with the attributes in attr according to the mask attr_mask. The argument attr is an ibv_qp_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_qp_attr {
enum ibv_qp_state       qp_state;               /* Move the QP to this state */
enum ibv_qp_state       cur_qp_state;           /* Assume this is the current QP state */
enum ibv_mtu            path_mtu;               /* Path MTU (valid only for RC/UC QPs) */
enum ibv_mig_state      path_mig_state;         /* Path migration state (valid if HCA supports APM) */
uint32_t                qkey;                   /* Q_Key for the QP (valid only for UD QPs) */
uint32_t                rq_psn;                 /* PSN for receive queue (valid only for RC/UC QPs) */
uint32_t                sq_psn;                 /* PSN for send queue (valid only for RC/UC QPs) */
uint32_t                dest_qp_num;            /* Destination QP number (valid only for RC/UC QPs) */
int                     qp_access_flags;        /* Mask of enabled remote access operations (valid only for RC/UC QPs) */
struct ibv_qp_cap       cap;                    /* QP capabilities (valid if HCA supports QP resizing) */
struct ibv_ah_attr      ah_attr;                /* Primary path address vector (valid only for RC/UC QPs) */
struct ibv_ah_attr      alt_ah_attr;            /* Alternate path address vector (valid only for RC/UC QPs) */
uint16_t                pkey_index;             /* Primary P_Key index */
uint16_t                alt_pkey_index;         /* Alternate P_Key index */
uint8_t                 en_sqd_async_notify;    /* Enable SQD.drained async notification (Valid only if qp_state is SQD) */
uint8_t                 sq_draining;            /* Is the QP draining? Irrelevant for ibv_modify_qp() */
uint8_t                 max_rd_atomic;          /* Number of outstanding RDMA reads & atomic operations on the destination QP (valid only for RC QPs) */
uint8_t                 max_dest_rd_atomic;     /* Number of responder resources for handling incoming RDMA reads & atomic operations (valid only for RC QPs) */
uint8_t                 min_rnr_timer;          /* Minimum RNR NAK timer (valid only for RC QPs) */
uint8_t                 port_num;               /* Primary port number */
uint8_t                 timeout;                /* Local ack timeout for primary path (valid only for RC QPs) */
uint8_t                 retry_cnt;              /* Retry count (valid only for RC QPs) */
uint8_t                 rnr_retry;              /* RNR retry (valid only for RC QPs) */
uint8_t                 alt_port_num;           /* Alternate port number */
uint8_t                 alt_timeout;            /* Local ack timeout for alternate path (valid only for RC QPs) */
};

For details on struct ibv_qp_cap see the description of ibv_create_qp(). For details on struct ibv_ah_attr see the description of ibv_create_ah().

The argument attr_mask specifies the QP attributes to be modified. The argument is either 0 or the bitwise OR of one or more of the following flags:

IBV_QP_STATE Modify qp_state
IBV_QP_CUR_STATE Set cur_qp_state
IBV_QP_EN_SQD_ASYNC_NOTIFY Set en_sqd_async_notify
IBV_QP_ACCESS_FLAGS Set qp_access_flags
IBV_QP_PKEY_INDEX Set pkey_index
IBV_QP_PORT Set port_num
IBV_QP_QKEY Set qkey
IBV_QP_AV Set ah_attr
IBV_QP_PATH_MTU Set path_mtu
IBV_QP_TIMEOUT Set timeout
IBV_QP_RETRY_CNT Set retry_cnt
IBV_QP_RNR_RETRY Set rnr_retry
IBV_QP_RQ_PSN Set rq_psn
IBV_QP_MAX_QP_RD_ATOMIC Set max_rd_atomic
IBV_QP_ALT_PATH Set the alternative path via: alt_ah_attr, alt_pkey_index, alt_port_num, alt_timeout
IBV_QP_MIN_RNR_TIMER Set min_rnr_timer
IBV_QP_SQ_PSN Set sq_psn
IBV_QP_MAX_DEST_RD_ATOMIC Set max_dest_rd_atomic
IBV_QP_PATH_MIG_STATE Set path_mig_state
IBV_QP_CAP Set cap
IBV_QP_DEST_QPN Set dest_qp_num

RETURN VALUE

ibv_modify_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

If any of the modify attributes or the modify mask are invalid, none of the attributes will be modified (including the QP state).

Not all devices support resizing QPs. To check if a device supports it, check if the IBV_DEVICE_RESIZE_MAX_WR bit is set in the device capabilities flags.

Not all devices support alternate paths. To check if a device supports it, check if the IBV_DEVICE_AUTO_PATH_MIG bit is set in the device capabilities flags.

The following tables indicate, for each QP Transport Service Type, the minimum list of attributes that must be changed upon transitioning QP state from: Reset --> Init --> RTR --> RTS.

For QP Transport Service Type  IBV_QPT_UD:

Next state     Required attributes
----------     ----------------------------------------
Init           IBV_QP_STATE, IBV_QP_PKEY_INDEX, IBV_QP_PORT, 
               IBV_QP_QKEY 
RTR            IBV_QP_STATE 
RTS            IBV_QP_STATE, IBV_QP_SQ_PSN 

For QP Transport Service Type  IBV_QPT_UC:

Next state     Required attributes
----------     ----------------------------------------
Init           IBV_QP_STATE, IBV_QP_PKEY_INDEX, IBV_QP_PORT, 
               IBV_QP_ACCESS_FLAGS 
RTR            IBV_QP_STATE, IBV_QP_AV, IBV_QP_PATH_MTU, 
               IBV_QP_DEST_QPN, IBV_QP_RQ_PSN 
RTS            IBV_QP_STATE, IBV_QP_SQ_PSN 

For QP Transport Service Type  IBV_QPT_RC:

Next state     Required attributes
----------     ----------------------------------------
Init           IBV_QP_STATE, IBV_QP_PKEY_INDEX, IBV_QP_PORT, 
               IBV_QP_ACCESS_FLAGS 
RTR            IBV_QP_STATE, IBV_QP_AV, IBV_QP_PATH_MTU, 
               IBV_QP_DEST_QPN, IBV_QP_RQ_PSN, 
               IBV_QP_MAX_DEST_RD_ATOMIC, IBV_QP_MIN_RNR_TIMER 
RTS            IBV_QP_STATE, IBV_QP_SQ_PSN, IBV_QP_MAX_QP_RD_ATOMIC, 
               IBV_QP_RETRY_CNT, IBV_QP_RNR_RETRY, IBV_QP_TIMEOUT
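
The three tables can be encoded as a lookup that an application might use to check a mask before calling ibv_modify_qp(). The sketch below is self-contained and does not include <infiniband/verbs.h>: the DEMO_* names and their bit values are placeholders invented for illustration, and real code would use the IBV_QP_* and IBV_QPT_* constants instead.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bits; real code uses IBV_QP_* from <infiniband/verbs.h>. */
enum {
    DEMO_QP_STATE              = 1 << 0,
    DEMO_QP_PKEY_INDEX         = 1 << 1,
    DEMO_QP_PORT               = 1 << 2,
    DEMO_QP_QKEY               = 1 << 3,
    DEMO_QP_ACCESS_FLAGS       = 1 << 4,
    DEMO_QP_AV                 = 1 << 5,
    DEMO_QP_PATH_MTU           = 1 << 6,
    DEMO_QP_DEST_QPN           = 1 << 7,
    DEMO_QP_RQ_PSN             = 1 << 8,
    DEMO_QP_MAX_DEST_RD_ATOMIC = 1 << 9,
    DEMO_QP_MIN_RNR_TIMER      = 1 << 10,
    DEMO_QP_SQ_PSN             = 1 << 11,
    DEMO_QP_MAX_QP_RD_ATOMIC   = 1 << 12,
    DEMO_QP_RETRY_CNT          = 1 << 13,
    DEMO_QP_RNR_RETRY          = 1 << 14,
    DEMO_QP_TIMEOUT            = 1 << 15
};

enum demo_qp_type  { DEMO_QPT_UD, DEMO_QPT_UC, DEMO_QPT_RC };
enum demo_qp_state { DEMO_QPS_INIT, DEMO_QPS_RTR, DEMO_QPS_RTS };

/* Minimum attribute mask for moving a QP of type t into state next. */
uint32_t demo_required_mask(enum demo_qp_type t, enum demo_qp_state next)
{
    uint32_t m = DEMO_QP_STATE;   /* required for every transition */

    assert(t <= DEMO_QPT_RC);
    switch (next) {
    case DEMO_QPS_INIT:
        m |= DEMO_QP_PKEY_INDEX | DEMO_QP_PORT;
        m |= (t == DEMO_QPT_UD) ? DEMO_QP_QKEY : DEMO_QP_ACCESS_FLAGS;
        break;
    case DEMO_QPS_RTR:
        if (t != DEMO_QPT_UD)
            m |= DEMO_QP_AV | DEMO_QP_PATH_MTU |
                 DEMO_QP_DEST_QPN | DEMO_QP_RQ_PSN;
        if (t == DEMO_QPT_RC)
            m |= DEMO_QP_MAX_DEST_RD_ATOMIC | DEMO_QP_MIN_RNR_TIMER;
        break;
    case DEMO_QPS_RTS:
        m |= DEMO_QP_SQ_PSN;
        if (t == DEMO_QPT_RC)
            m |= DEMO_QP_MAX_QP_RD_ATOMIC | DEMO_QP_RETRY_CNT |
                 DEMO_QP_RNR_RETRY | DEMO_QP_TIMEOUT;
        break;
    }
    return m;
}

/* Returns 1 if mask contains every required bit, else 0. */
int demo_mask_ok(enum demo_qp_type t, enum demo_qp_state next, uint32_t mask)
{
    uint32_t need = demo_required_mask(t, next);
    return (mask & need) == need;
}
```

Optional attributes may be added on top of the minimum, which is why demo_mask_ok() tests for a superset rather than equality.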

SEE ALSO

ibv_create_qp, ibv_destroy_qp, ibv_query_qp, ibv_create_ah

 

 


IBV_POST_RECV


NAME

ibv_post_recv - post a list of work requests (WRs) to a receive queue

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr,
                  struct ibv_recv_wr **bad_wr);

DESCRIPTION

ibv_post_recv() posts the linked list of work requests (WRs) starting with wr to the receive queue of the queue pair qp. It stops processing WRs from this list at the first failure (that can be detected immediately while requests are being posted), and returns this failing WR through bad_wr.

The argument wr is an ibv_recv_wr struct, as defined in <infiniband/verbs.h>.

struct ibv_recv_wr {
uint64_t                wr_id;     /* User defined WR ID */
struct ibv_recv_wr     *next;      /* Pointer to next WR in list, NULL if last WR */
struct ibv_sge         *sg_list;   /* Pointer to the s/g array */
int                     num_sge;   /* Size of the s/g array */
};

struct ibv_sge {
uint64_t                addr;      /* Start address of the local memory buffer */
uint32_t                length;    /* Length of the buffer */
uint32_t                lkey;      /* Key of the local Memory Region */
};
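
The bad_wr contract is easiest to see on a toy queue. The model below is a sketch that does not call libibverbs: struct demo_recv_wr mirrors only the wr_id and next fields shown above, struct demo_rq is a hypothetical queue with a fixed number of free slots, and returning ENOMEM for a full queue is an illustrative choice rather than documented behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ibv_recv_wr (sg_list and num_sge omitted). */
struct demo_recv_wr {
    uint64_t             wr_id;
    struct demo_recv_wr *next;
};

/* Hypothetical receive queue with a fixed number of free slots. */
struct demo_rq {
    unsigned free_slots;
    unsigned posted;
};

/*
 * Post a chain of WRs. Stops at the first WR that cannot be queued,
 * stores it in *bad_wr and returns an errno-style value. Returns 0
 * if the whole chain was posted.
 */
int demo_post_recv(struct demo_rq *rq, struct demo_recv_wr *wr,
                   struct demo_recv_wr **bad_wr)
{
    assert(rq != NULL && bad_wr != NULL);
    for (; wr != NULL; wr = wr->next) {
        if (rq->free_slots == 0) {
            *bad_wr = wr;   /* this WR and those after it were not posted */
            return ENOMEM;
        }
        rq->free_slots--;
        rq->posted++;
    }
    return 0;
}
```

On failure the caller can resume from *bad_wr once space is available, since the WRs ahead of it in the list were already accepted.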

RETURN VALUE

ibv_post_recv() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

The buffers used by a WR can only be safely reused after the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ).

If the QP qp is associated with a shared receive queue, you must use the function ibv_post_srq_recv(), and not ibv_post_recv(), since the QP's own receive queue will not be used.

If a WR is being posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be placed in the first 40 bytes of the buffer(s) in the scatter list. If no GRH is present in the incoming message, then the first 40 bytes will be undefined. This means that in all cases, the actual data of the incoming message will start at an offset of 40 bytes into the buffer(s) in the scatter list.
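
The 40-byte rule affects both buffer sizing and where the payload is read from. The helpers below are a minimal sketch with hypothetical names (DEMO_GRH_BYTES, demo_ud_recv_buf_size(), demo_ud_payload()); they assume the receive WR uses a single contiguous buffer, so the 40 bytes are not split across scatter entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Space reserved at the start of every UD receive buffer. */
#define DEMO_GRH_BYTES ((size_t)40)

/* Buffer size needed to receive up to max_payload bytes of data. */
size_t demo_ud_recv_buf_size(size_t max_payload)
{
    return DEMO_GRH_BYTES + max_payload;
}

/*
 * Start of the message data inside a single contiguous receive
 * buffer, or NULL if the buffer cannot even hold the GRH area.
 */
const uint8_t *demo_ud_payload(const uint8_t *buf, size_t buf_len)
{
    assert(buf != NULL);
    if (buf_len < DEMO_GRH_BYTES)
        return NULL;
    return buf + DEMO_GRH_BYTES;
}
```

For example, receiving 2048-byte messages requires a buffer of at least 2088 bytes, and the data begins at buf + 40 whether or not a GRH was present.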

SEE ALSO

ibv_create_qp, ibv_post_send, ibv_post_srq_recv, ibv_poll_cq

 

 


IBV_POST_SEND


NAME

ibv_post_send - post a list of work requests (WRs) to a send queue

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr); 

DESCRIPTION

ibv_post_send() posts the linked list of work requests (WRs) starting with wr to the send queue of the queue pair qp. It stops processing WRs from this list at the first failure (that can be detected immediately while requests are being posted), and returns this failing WR through bad_wr.

The argument wr is an ibv_send_wr struct, as defined in <infiniband/verbs.h>.

struct ibv_send_wr {
uint64_t                wr_id;                  /* User defined WR ID */
struct ibv_send_wr     *next;                   /* Pointer to next WR in list, NULL if last WR */
struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
int                     num_sge;                /* Size of the s/g array */
enum ibv_wr_opcode      opcode;                 /* Operation type */
int                     send_flags;             /* Flags of the WR properties */
uint32_t                imm_data;               /* Immediate data (in network byte order) */
union {
struct {
uint64_t        remote_addr;    /* Start address of remote memory buffer */
uint32_t        rkey;           /* Key of the remote Memory Region */
} rdma;
struct {
uint64_t        remote_addr;    /* Start address of remote memory buffer */ 
uint64_t        compare_add;    /* Compare operand */
uint64_t        swap;           /* Swap operand */
uint32_t        rkey;           /* Key of the remote Memory Region */
} atomic;
struct {
struct ibv_ah  *ah;             /* Address handle (AH) for the remote node address */
uint32_t        remote_qpn;     /* QP number of the destination QP */
uint32_t        remote_qkey;    /* Q_Key number of the destination QP */
} ud;
} wr;
uint32_t                xrc_remote_srq_num;     /* SRQ number of the destination XRC */
};

struct ibv_sge {
uint64_t                addr;                   /* Start address of the local memory buffer */
uint32_t                length;                 /* Length of the buffer */
uint32_t                lkey;                   /* Key of the local Memory Region */
};

Each QP Transport Service Type supports a specific set of opcodes, as shown in the following table:

OPCODE                      | IBV_QPT_UD | IBV_QPT_UC | IBV_QPT_RC | IBV_QPT_XRC
----------------------------+------------+------------+------------+------------
IBV_WR_SEND                 |     X      |     X      |     X      |     X
IBV_WR_SEND_WITH_IMM        |     X      |     X      |     X      |     X
IBV_WR_RDMA_WRITE           |            |     X      |     X      |     X
IBV_WR_RDMA_WRITE_WITH_IMM  |            |     X      |     X      |     X
IBV_WR_RDMA_READ            |            |            |     X      |     X
IBV_WR_ATOMIC_CMP_AND_SWP   |            |            |     X      |     X
IBV_WR_ATOMIC_FETCH_AND_ADD |            |            |     X      |     X

The attribute send_flags describes the properties of the WR. It is either 0 or the bitwise OR of one or more of the following flags:

IBV_SEND_FENCE Set the fence indicator. Valid only for QPs with Transport Service Type IBV_QPT_RC
IBV_SEND_SIGNALED Set the completion notification indicator. Relevant only if QP was created with sq_sig_all=0
IBV_SEND_SOLICITED Set the solicited event indicator. Valid only for Send and RDMA Write with immediate
IBV_SEND_INLINE Send data in given gather list as inline data in a send WQE. Valid only for Send and RDMA Write. The L_Key will not be checked.

RETURN VALUE

ibv_post_send() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

The user should not alter or destroy AHs associated with WRs until the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ); doing so may cause unexpected behavior.

The buffers used by a WR can only be safely reused after the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ). However, if the IBV_SEND_INLINE flag was set, the buffer can be reused immediately after the call returns.

SEE ALSO

ibv_create_qp, ibv_create_xrc_rcv_qp, ibv_create_ah, ibv_post_recv, ibv_post_srq_recv, ibv_poll_cq

 

 


IBV_POST_SRQ_RECV


NAME

ibv_post_srq_recv - post a list of work requests (WRs) to a shared receive queue (SRQ)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr,
                      struct ibv_recv_wr **bad_wr);

DESCRIPTION

ibv_post_srq_recv() posts the linked list of work requests (WRs) starting with wr to the shared receive queue (SRQ) srq. It stops processing WRs from this list at the first failure (that can be detected immediately while requests are being posted), and returns this failing WR through bad_wr.

The argument wr is an ibv_recv_wr struct, as defined in <infiniband/verbs.h>.

struct ibv_recv_wr {
uint64_t                wr_id;     /* User defined WR ID */
struct ibv_recv_wr     *next;      /* Pointer to next WR in list, NULL if last WR */
struct ibv_sge         *sg_list;   /* Pointer to the s/g array */
int                     num_sge;   /* Size of the s/g array */
};

struct ibv_sge {
uint64_t                addr;      /* Start address of the local memory buffer */
uint32_t                length;    /* Length of the buffer */
uint32_t                lkey;      /* Key of the local Memory Region */
};
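
The linked-list layout described above can be sketched as follows. This is an illustration under stated assumptions, not library code: struct demo_sge and struct demo_recv_wr are local copies of the field layouts shown above, and demo_chain_recv_wrs() is a hypothetical helper that gives each WR a single scatter entry. Real code would fill struct ibv_recv_wr and struct ibv_sge and pass the head of the list to ibv_post_srq_recv().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the layouts shown above (stand-ins for ibv_sge and
 * ibv_recv_wr so that the sketch does not depend on verbs.h). */
struct demo_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

struct demo_recv_wr {
    uint64_t             wr_id;
    struct demo_recv_wr *next;
    struct demo_sge     *sg_list;
    int                  num_sge;
};

/* Links n WRs into one list, one buffer (one scatter entry) per WR.
 * The last WR gets next == NULL. Returns the head of the list. */
struct demo_recv_wr *demo_chain_recv_wrs(struct demo_recv_wr *wrs,
                                         struct demo_sge *sges,
                                         void *const *bufs, size_t n,
                                         uint32_t buf_len, uint32_t lkey)
{
    assert(wrs != NULL && sges != NULL && bufs != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        sges[i].addr   = (uint64_t)(uintptr_t)bufs[i];
        sges[i].length = buf_len;
        sges[i].lkey   = lkey;

        wrs[i].wr_id   = i;   /* user-defined; here simply the index */
        wrs[i].next    = (i + 1 < n) ? &wrs[i + 1] : NULL;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
    }
    return &wrs[0];
}
```

Using the array index as wr_id is only one possible choice; any value that lets the application match a work completion back to its buffer would do.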

RETURN VALUE

ibv_post_srq_recv() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

The buffers used by a WR can only be safely reused after the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ).

If a WR is being posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be placed in the first 40 bytes of the buffer(s) in the scatter list. If no GRH is present in the incoming message, then the first 40 bytes will be undefined. This means that in all cases, the actual data of the incoming message will start at an offset of 40 bytes into the buffer(s) in the scatter list.
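
The 40-byte offset can be handled with a small helper like the one below. It is a sketch under two assumptions that the text above does not state: the scatter list consists of a single contiguous buffer, and the byte count passed in (for example, the length reported by the work completion) includes the 40 bytes reserved for the GRH. The names DEMO_GRH_BYTES and demo_ud_payload() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_GRH_BYTES 40u  /* space reserved for the GRH on UD receives */

/* Returns a pointer to the message data inside a single-buffer UD receive
 * and stores its length in *payload_len. `byte_len` is assumed to count
 * the 40 GRH bytes as well. Returns NULL if the count is too small. */
const uint8_t *demo_ud_payload(const uint8_t *recv_buf, uint32_t byte_len,
                               uint32_t *payload_len)
{
    assert(recv_buf != NULL && payload_len != NULL);
    if (byte_len < DEMO_GRH_BYTES) {
        *payload_len = 0;
        return NULL;
    }
    *payload_len = byte_len - DEMO_GRH_BYTES;
    return recv_buf + DEMO_GRH_BYTES;
}
```

Receive buffers posted for UD traffic therefore need to be at least 40 bytes larger than the largest expected message.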

SEE ALSO

ibv_create_qp, ibv_post_send, ibv_post_recv, ibv_poll_cq

 

 


IBV_QUERY_QP


NAME

ibv_query_qp - get the attributes of a queue pair (QP)

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
                 int attr_mask,
                 struct ibv_qp_init_attr *init_attr);

DESCRIPTION

ibv_query_qp() gets the attributes specified in attr_mask for the QP qp and returns them through the pointers attr and init_attr. The argument attr is an ibv_qp_attr struct, as defined in <infiniband/verbs.h>.

struct ibv_qp_attr {
enum ibv_qp_state       qp_state;            /* Current QP state */
enum ibv_qp_state       cur_qp_state;        /* Current QP state - irrelevant for ibv_query_qp */
enum ibv_mtu            path_mtu;            /* Path MTU (valid only for RC/UC QPs) */
enum ibv_mig_state      path_mig_state;      /* Path migration state (valid if HCA supports APM) */
uint32_t                qkey;                /* Q_Key of the QP (valid only for UD QPs) */
uint32_t                rq_psn;              /* PSN for receive queue (valid only for RC/UC QPs) */
uint32_t                sq_psn;              /* PSN for send queue (valid only for RC/UC QPs) */
uint32_t                dest_qp_num;         /* Destination QP number (valid only for RC/UC QPs) */
int                     qp_access_flags;     /* Mask of enabled remote access operations (valid only for RC/UC QPs) */
struct ibv_qp_cap       cap;                 /* QP capabilities */
struct ibv_ah_attr      ah_attr;             /* Primary path address vector (valid only for RC/UC QPs) */
struct ibv_ah_attr      alt_ah_attr;         /* Alternate path address vector (valid only for RC/UC QPs) */
uint16_t                pkey_index;          /* Primary P_Key index */
uint16_t                alt_pkey_index;      /* Alternate P_Key index */
uint8_t                 en_sqd_async_notify; /* Enable SQD.drained async notification - irrelevant for ibv_query_qp */
uint8_t                 sq_draining;         /* Is the QP draining? (Valid only if qp_state is SQD) */
uint8_t                 max_rd_atomic;       /* Number of outstanding RDMA reads & atomic operations on the destination QP (valid only for RC QPs) */
uint8_t                 max_dest_rd_atomic;  /* Number of responder resources for handling incoming RDMA reads & atomic operations (valid only for RC QPs) */
uint8_t                 min_rnr_timer;       /* Minimum RNR NAK timer (valid only for RC QPs) */
uint8_t                 port_num;            /* Primary port number */
uint8_t                 timeout;             /* Local ack timeout for primary path (valid only for RC QPs) */
uint8_t                 retry_cnt;           /* Retry count (valid only for RC QPs) */
uint8_t                 rnr_retry;           /* RNR retry (valid only for RC QPs) */
uint8_t                 alt_port_num;        /* Alternate port number */
uint8_t                 alt_timeout;         /* Local ack timeout for alternate path (valid only for RC QPs) */
};

For details on struct ibv_qp_cap see the description of ibv_create_qp(). For details on struct ibv_ah_attr see the description of ibv_create_ah().

RETURN VALUE

ibv_query_qp() returns 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

The argument attr_mask is a hint that specifies the minimum list of attributes to retrieve. Some RDMA devices may return extra attributes not requested, for example if the value can be returned cheaply. This has the same form as in ibv_modify_qp().

Attribute values are valid if they have been set using ibv_modify_qp(). The exact list of valid attributes depends on the QP state.

Multiple calls to ibv_query_qp() may yield some differences in the values returned for the following attributes: qp_state, path_mig_state, sq_draining, ah_attr (if APM is enabled).

SEE ALSO

ibv_create_qp, ibv_destroy_qp, ibv_modify_qp, ibv_create_ah

 

 


IBV_ATTACH_MCAST


IBV_DETACH_MCAST


NAME

ibv_attach_mcast, ibv_detach_mcast - attach and detach a queue pair (QP) to/from a multicast group

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_attach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid);

int ibv_detach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid);

DESCRIPTION

ibv_attach_mcast() attaches the QP qp to the multicast group having MGID gid and MLID lid.

ibv_detach_mcast() detaches the QP qp from the multicast group having MGID gid and MLID lid.

RETURN VALUE

ibv_attach_mcast() and ibv_detach_mcast() return 0 on success, or the value of errno on failure (which indicates the failure reason).

NOTES

Only QPs of Transport Service Type IBV_QPT_UD may be attached to multicast groups.

If a QP is attached to the same multicast group multiple times, the QP will still receive a single copy of a multicast message.

In order to receive multicast messages, a join request for the multicast group must be sent to the subnet administrator (SA), so that the fabric's multicast routing is configured to deliver messages to the local port.

SEE ALSO

ibv_create_qp

 


IBV_RATE_TO_MULT


IBV_MULT_TO_RATE


NAME

ibv_rate_to_mult - convert IB rate enumeration to multiplier of 2.5 Gbit/sec

mult_to_ibv_rate - convert multiplier of 2.5 Gbit/sec to an IB rate enumeration

SYNOPSIS

#include <infiniband/verbs.h>

int ibv_rate_to_mult(enum ibv_rate rate);

enum ibv_rate mult_to_ibv_rate(int mult);

DESCRIPTION

ibv_rate_to_mult() converts the IB transmission rate enumeration rate to a multiple of 2.5 Gbit/sec (the base rate). For example, if rate is IBV_RATE_5_GBPS, the value 2 will be returned (5 Gbit/sec = 2 * 2.5 Gbit/sec).

mult_to_ibv_rate() converts the multiplier value (of 2.5 Gbit/sec) mult to an IB transmission rate enumeration. For example, if mult is 2, the rate enumeration IBV_RATE_5_GBPS will be returned.
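
The two conversions are inverses of each other over the supported rates, which a table-driven sketch makes explicit. The code below is illustrative only: enum demo_rate, its enumerator values, and the demo_* functions are hypothetical stand-ins covering just four rates, and the values returned for unknown inputs are a choice made for this sketch, not the documented behavior of the library functions.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for a few rate enumerators. The names and numeric values
 * are hypothetical; real code would use enum ibv_rate from verbs.h. */
enum demo_rate {
    DEMO_RATE_INVALID = -1,
    DEMO_RATE_2_5_GBPS,
    DEMO_RATE_5_GBPS,
    DEMO_RATE_10_GBPS,
    DEMO_RATE_20_GBPS,
    DEMO_RATE_COUNT
};

/* Multiplier of the 2.5 Gbit/sec base rate for each enumerator above. */
static const int demo_rate_mult[] = { 1, 2, 4, 8 };

/* Enumeration -> multiplier; returns -1 for an unknown enumerator. */
int demo_rate_to_mult(enum demo_rate rate)
{
    /* Guard against the table and the enum drifting apart. */
    assert(sizeof demo_rate_mult / sizeof demo_rate_mult[0]
           == (size_t)DEMO_RATE_COUNT);
    if ((int)rate < 0 || (int)rate >= DEMO_RATE_COUNT)
        return -1;
    return demo_rate_mult[rate];
}

/* Multiplier -> enumeration; DEMO_RATE_INVALID if no rate matches. */
enum demo_rate demo_mult_to_rate(int mult)
{
    for (int i = 0; i < DEMO_RATE_COUNT; i++) {
        if (demo_rate_mult[i] == mult)
            return (enum demo_rate)i;
    }
    return DEMO_RATE_INVALID;
}
```

As in the example above, the 5 Gbit/sec enumerator maps to 2, and 2 maps back to the 5 Gbit/sec enumerator.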

RETURN VALUE

ibv_rate_to_mult() returns the multiplier of the base rate 2.5 Gbit/sec.

mult_to_ibv_rate() returns the enumeration representing the IB transmission rate.

SEE ALSO

ibv_query_port

<return-to-top>

 

RDMA CM - Communications Manager


NAME

librdmacm.lib - RDMA communication manager.

SYNOPSIS

#include <rdma/rdma_cma.h>

DESCRIPTION

Used to establish communication endpoints over RDMA transports.

NOTES

The RDMA CM is a communication manager used to set up reliable, connected and unreliable datagram data transfers. It provides an RDMA transport-neutral interface for establishing connections. The API is based on sockets, but adapted for queue pair (QP) based semantics: communication must be over a specific RDMA device, and data transfers are message based.

The RDMA CM only provides the communication management (connection setup / teardown) portion of an RDMA API. It works in conjunction with the verbs API defined by the libibverbs library. The libibverbs library provides the interfaces needed to send and receive data.

CLIENT OPERATION

       This section provides a general overview of the basic operation for the active, or client, side of communication. A general connection flow would be:

       rdma_create_event_channel

            create channel to receive events

       rdma_create_id

            allocate an rdma_cm_id, this is conceptually similar to a socket

       rdma_resolve_addr

            obtain a local RDMA device to reach the remote address

       rdma_get_cm_event

            wait for RDMA_CM_EVENT_ADDR_RESOLVED event

       rdma_ack_cm_event

            ack event

       rdma_create_qp

            allocate a QP for the communication

       rdma_resolve_route

            determine the route to the remote address

       rdma_get_cm_event

            wait for RDMA_CM_EVENT_ROUTE_RESOLVED event

       rdma_ack_cm_event

            ack event

       rdma_connect

            connect to the remote server

       rdma_get_cm_event

            wait for RDMA_CM_EVENT_ESTABLISHED event

       rdma_ack_cm_event

            ack event

       Perform data transfers over connection

       rdma_disconnect

            tear-down connection

       rdma_get_cm_event

            wait for RDMA_CM_EVENT_DISCONNECTED event

       rdma_ack_cm_event

            ack event

       rdma_destroy_qp

            destroy the QP

       rdma_destroy_id

            release the rdma_cm_id

       rdma_destroy_event_channel

            release the event channel

An almost identical process is used to set up unreliable datagram (UD) communication between nodes.
No actual connection is formed between QPs, however, so disconnection is not needed.
Although this example shows the client initiating the disconnect, either side of a connection may initiate the disconnect.
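
The client flow listed above is driven by four events arriving in a fixed order, which can be modelled as a small state machine. The sketch below is not librdmacm code: the demo_* enums and demo_client_step() are hypothetical, and every event other than the four shown in the flow is collapsed into DEMO_EV_OTHER and treated as a failure. The comments name the librdmacm calls that the flow above issues at each step.

```c
#include <assert.h>

/* Stand-ins for the events the client flow waits for. */
enum demo_cm_event {
    DEMO_EV_ADDR_RESOLVED,
    DEMO_EV_ROUTE_RESOLVED,
    DEMO_EV_ESTABLISHED,
    DEMO_EV_DISCONNECTED,
    DEMO_EV_OTHER             /* any error or unexpected event */
};

enum demo_client_state {
    DEMO_CL_WAIT_ADDR,        /* after rdma_resolve_addr */
    DEMO_CL_WAIT_ROUTE,       /* after rdma_create_qp + rdma_resolve_route */
    DEMO_CL_WAIT_ESTABLISHED, /* after rdma_connect */
    DEMO_CL_CONNECTED,        /* data transfers may proceed */
    DEMO_CL_DONE,             /* destroy QP, id and event channel */
    DEMO_CL_FAILED
};

/* Advances the client by one event (the event is assumed to have been
 * retrieved and acked by the caller). Anything out of order fails. */
enum demo_client_state demo_client_step(enum demo_client_state st,
                                        enum demo_cm_event ev)
{
    assert((int)st >= 0 && (int)st <= DEMO_CL_FAILED);
    switch (st) {
    case DEMO_CL_WAIT_ADDR:
        return ev == DEMO_EV_ADDR_RESOLVED ? DEMO_CL_WAIT_ROUTE
                                           : DEMO_CL_FAILED;
    case DEMO_CL_WAIT_ROUTE:
        return ev == DEMO_EV_ROUTE_RESOLVED ? DEMO_CL_WAIT_ESTABLISHED
                                            : DEMO_CL_FAILED;
    case DEMO_CL_WAIT_ESTABLISHED:
        return ev == DEMO_EV_ESTABLISHED ? DEMO_CL_CONNECTED
                                         : DEMO_CL_FAILED;
    case DEMO_CL_CONNECTED:
        /* Either side may have initiated the disconnect. */
        return ev == DEMO_EV_DISCONNECTED ? DEMO_CL_DONE
                                          : DEMO_CL_FAILED;
    default:
        return DEMO_CL_FAILED; /* no events expected once done or failed */
    }
}
```

A real client would call rdma_get_cm_event in a loop, feed the event type into a function of this shape, and call rdma_ack_cm_event for every event it retrieves.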

SERVER OPERATION

       This section provides a general overview of the basic operation for the passive, or server, side of communication. A general connection flow would be:

       rdma_create_event_channel

            create channel to receive events

       rdma_create_id

            allocate an rdma_cm_id, this is conceptually similar to a socket

       rdma_bind_addr

            set the local port number to listen on

       rdma_listen

            begin listening for connection requests

       rdma_get_cm_event

            wait for RDMA_CM_EVENT_CONNECT_REQUEST event with a new rdma_cm_id

       rdma_create_qp

            allocate a QP for the communication on the new rdma_cm_id

       rdma_accept

            accept the connection request

       rdma_ack_cm_event

            ack event

       rdma_get_cm_event

            wait for RDMA_CM_EVENT_ESTABLISHED event

       rdma_ack_cm_event

            ack event

       Perform data transfers over connection

       rdma_get_cm_event

            wait for RDMA_CM_EVENT_DISCONNECTED event

       rdma_ack_cm_event

            ack event

       rdma_disconnect

            tear-down connection

       rdma_destroy_qp

            destroy the QP

       rdma_destroy_id

            release the connected rdma_cm_id

       rdma_destroy_id

            release the listening rdma_cm_id

       rdma_destroy_event_channel

            release the event channel

RETURN CODES

       =  0   success

       = -1   error - see errno for more details

Librdmacm functions return 0 to indicate success, and a -1 return value to indicate failure.

If a function operates asynchronously, a return value of 0 means that the operation was successfully started.
The operation could still complete in error; users should check the status of the related event.

If the return value is -1, then errno will contain additional information regarding the reason for the failure.
Prior versions of the library would return -errno and not set errno for some cases related to ENOMEM, ENODEV, ENODATA, EINVAL, and EADDRNOTAVAIL codes.
Applications that want to check these codes and have compatibility with prior library versions must manually set errno to the negative of the return code if it is < -1.
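
The compatibility rule above can be wrapped in a small helper. This is a minimal sketch: demo_rdma_ret() is a hypothetical name, and the function is meant to be applied to the return value of librdmacm calls that report status (0 or a negative value), not to calls that return data.

```c
#include <assert.h>
#include <errno.h>

/* Normalizes a librdmacm status code to the current convention:
 * 0 on success, -1 on failure with errno set.
 * Older library versions could return -errno directly (a value < -1);
 * in that case errno is set to the negative of the return code. */
int demo_rdma_ret(int ret)
{
    assert(ret <= 0);   /* status codes are 0 or negative */
    if (ret < -1) {
        errno = -ret;
        return -1;
    }
    return ret;         /* 0, or -1 with errno already set by the library */
}
```

A call site would then read, for example, `if (demo_rdma_ret(rdma_listen(id, backlog)) == -1) { /* inspect errno */ }`, where id and backlog are the caller's own variables.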

SEE ALSO

rdma_create_event_channel, rdma_get_cm_event, rdma_create_id, rdma_resolve_addr, rdma_bind_addr, rdma_create_qp, rdma_resolve_route, rdma_connect, rdma_listen, rdma_accept, rdma_reject, rdma_join_multicast, rdma_leave_multicast, rdma_notify, rdma_ack_cm_event, rdma_disconnect, rdma_destroy_qp, rdma_destroy_id, rdma_destroy_event_channel, rdma_get_devices, rdma_free_devices, rdma_get_peer_addr, rdma_get_local_addr, rdma_get_dst_port, rdma_get_src_port, rdma_set_option

<return-to-top>

 


RDMA_CREATE_ID


NAME

rdma_create_id - Allocate a communication identifier.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_create_id (struct rdma_event_channel *channel, struct rdma_cm_id **id, void *context, enum rdma_port_space ps);

ARGUMENTS

channel
The communication channel that events associated with the allocated rdma_cm_id will be reported on.
id
A reference where the allocated communication identifier will be returned.
context
User specified context associated with the rdma_cm_id.
ps
RDMA port space.

DESCRIPTION

Creates an identifier that is used to track communication information.  

NOTES

An rdma_cm_id is conceptually equivalent to a socket for RDMA communication. The difference is that RDMA communication requires explicitly binding to a specified RDMA device before communication can occur, and most operations are asynchronous in nature. Communication events on an rdma_cm_id are reported through the associated event channel. Users must release the rdma_cm_id by calling rdma_destroy_id.

PORT SPACE

Details of the services provided by the different port spaces are outlined below.
RDMA_PS_TCP
Provides reliable, connection-oriented QP communication. Unlike TCP, the RDMA port space provides message, not stream, based communication.
RDMA_PS_UDP
Provides unreliable, connectionless QP communication. Supports both datagram and multicast communication.

SEE ALSO

rdma_cm, rdma_create_event_channel, rdma_destroy_id, rdma_get_devices, rdma_bind_addr, rdma_resolve_addr, rdma_connect, rdma_listen, rdma_set_option

 


RDMA_DESTROY_ID


NAME

rdma_destroy_id - Release a communication identifier.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_destroy_id (struct rdma_cm_id *id);

ARGUMENTS

id
The communication identifier to destroy.

DESCRIPTION

Destroys the specified rdma_cm_id and cancels any outstanding asynchronous operation.

NOTES

Users must free any QP associated with the rdma_cm_id and ack all related events before calling this routine.

SEE ALSO

rdma_create_id, rdma_destroy_qp, rdma_ack_cm_event

 


RDMA_CREATE_EVENT_CHANNEL


NAME

rdma_create_event_channel - Open a channel used to report communication events.

SYNOPSIS

#include <rdma/rdma_cma.h>

 struct rdma_event_channel * rdma_create_event_channel (void);

ARGUMENTS

void
no arguments

DESCRIPTION

Asynchronous events are reported to users through event channels.  

NOTES

Event channels are used to direct all events on an rdma_cm_id. For many clients, a single event channel may be sufficient; however, when managing a large number of connections or cm_ids, users may find it useful to direct events for different cm_ids to different channels for processing. All created event channels must be destroyed by calling rdma_destroy_event_channel. Users should call rdma_get_cm_event to retrieve events on an event channel. Each event channel is mapped to a file descriptor. The associated file descriptor can be used and manipulated like any other fd to change its behavior; for example, users may make the fd non-blocking, or poll or select on it.

SEE ALSO

rdma_cm, rdma_get_cm_event, rdma_destroy_event_channel

 


RDMA_DESTROY_EVENT_CHANNEL


NAME

rdma_destroy_event_channel - Close an event communication channel.  

SYNOPSIS

#include <rdma/rdma_cma.h>

 void rdma_destroy_event_channel (struct rdma_event_channel *channel);  

ARGUMENTS

channel
The communication channel to destroy.

DESCRIPTION

Releases all resources associated with an event channel and closes the associated file descriptor.

NOTES

All rdma_cm_id's associated with the event channel must be destroyed, and all returned events must be acked before calling this function.  

SEE ALSO

rdma_create_event_channel, rdma_get_cm_event, rdma_ack_cm_event

 


RDMA_RESOLVE_ADDR


NAME

rdma_resolve_addr - Resolve destination and optional source addresses.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_resolve_addr (struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms);

ARGUMENTS

id
RDMA identifier.
src_addr
Source address information. This parameter may be NULL.
dst_addr
Destination address information.
timeout_ms
Time to wait for resolution to complete.

DESCRIPTION

Resolve destination and optional source addresses from IP addresses to an RDMA address. If successful, the specified rdma_cm_id will be bound to a local device.  

NOTES

This call is used to map a given destination IP address to a usable RDMA address. The IP to RDMA address mapping is done using the local routing tables, or via ARP. If a source address is given, the rdma_cm_id is bound to that address, the same as if rdma_bind_addr were called. If no source address is given, and the rdma_cm_id has not yet been bound to a device, then the rdma_cm_id will be bound to a source address based on the local routing tables. After this call, the rdma_cm_id will be bound to an RDMA device. This call is typically made from the active side of a connection before calling rdma_resolve_route and rdma_connect.  

INFINIBAND SPECIFIC

This call maps the destination and, if given, source IP addresses to GIDs. In order to perform the mapping, IPoIB must be running on both the local and remote nodes.  

SEE ALSO

rdma_create_id, rdma_resolve_route, rdma_connect, rdma_create_qp, rdma_get_cm_event, rdma_bind_addr, rdma_get_src_port, rdma_get_dst_port, rdma_get_local_addr, rdma_get_peer_addr

 


RDMA_GET_CM_EVENT


NAME

rdma_get_cm_event - Retrieves the next pending communication event.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_get_cm_event (struct rdma_event_channel *channel, struct rdma_cm_event **event);

ARGUMENTS

channel
Event channel to check for events.
event
Allocated information about the next communication event.

DESCRIPTION

Retrieves a communication event. If no events are pending, by default, the call will block until an event is received.

NOTES

The default synchronous behavior of this routine can be changed by modifying the file descriptor associated with the given channel. All events that are reported must be acknowledged by calling rdma_ack_cm_event. Destruction of an rdma_cm_id will block until related events have been acknowledged.

EVENT DATA

Communication event details are returned in the rdma_cm_event structure. This structure is allocated by the rdma_cm and released by the rdma_ack_cm_event routine. Details of the rdma_cm_event structure are given below.
id
The rdma_cm identifier associated with the event. If the event type is RDMA_CM_EVENT_CONNECT_REQUEST, then this references a new id for that communication.
listen_id
For RDMA_CM_EVENT_CONNECT_REQUEST event types, this references the corresponding listening request identifier.
event
Specifies the type of communication event which occurred. See EVENT TYPES below.
status
Returns any asynchronous error information associated with an event. The status is zero unless the corresponding operation failed.
param
Provides additional details based on the type of event. Users should select the conn or ud subfields based on the rdma_port_space of the rdma_cm_id associated with the event. See UD EVENT DATA and CONN EVENT DATA below.

UD EVENT DATA

Event parameters related to unreliable datagram (UD) services: RDMA_PS_UDP and RDMA_PS_IPOIB. The UD event data is valid for RDMA_CM_EVENT_ESTABLISHED and RDMA_CM_EVENT_MULTICAST_JOIN events, unless stated otherwise.
private_data
References any user-specified data associated with RDMA_CM_EVENT_CONNECT_REQUEST or RDMA_CM_EVENT_ESTABLISHED events. The data referenced by this field matches that specified by the remote side when calling rdma_connect or rdma_accept. This field is NULL if the event does not include private data. The buffer referenced by this pointer is deallocated when calling rdma_ack_cm_event.
private_data_len
The size of the private data buffer. Users should note that the size of the private data buffer may be larger than the amount of private data sent by the remote side. Any additional space in the buffer will be zeroed out.
ah_attr
Address information needed to send data to the remote endpoint(s). Users should use this structure when allocating their address handle.
qp_num
QP number of the remote endpoint or multicast group.
qkey
QKey needed to send data to the remote endpoint(s).
 

CONN EVENT DATA

Event parameters related to connected QP services: RDMA_PS_TCP. The connection related event data is valid for RDMA_CM_EVENT_CONNECT_REQUEST and RDMA_CM_EVENT_ESTABLISHED events, unless stated otherwise.
private_data
References any user-specified data associated with the event. The data referenced by this field matches that specified by the remote side when calling rdma_connect or rdma_accept. This field is NULL if the event does not include private data. The buffer referenced by this pointer is deallocated when calling rdma_ack_cm_event.
private_data_len
The size of the private data buffer. Users should note that the size of the private data buffer may be larger than the amount of private data sent by the remote side. Any additional space in the buffer will be zeroed out.
responder_resources
The number of responder resources requested of the recipient. This field matches the initiator depth specified by the remote node when calling rdma_connect and rdma_accept.
initiator_depth
The maximum number of RDMA read/atomic operations that the recipient may have outstanding. This field matches the responder resources specified by the remote node when calling rdma_connect and rdma_accept.
flow_control
Indicates if hardware level flow control is provided by the sender.
retry_count
For RDMA_CM_EVENT_CONNECT_REQUEST events only, indicates the number of times that the recipient should retry send operations.
rnr_retry_count
The number of times that the recipient should retry receiver not ready (RNR) NACK errors.
srq
Specifies if the sender is using a shared-receive queue.
qp_num
Indicates the remote QP number for the connection.

EVENT TYPES

The following types of communication events may be reported.
RDMA_CM_EVENT_ADDR_RESOLVED
Address resolution (rdma_resolve_addr) completed successfully.
RDMA_CM_EVENT_ADDR_ERROR
Address resolution (rdma_resolve_addr) failed.
RDMA_CM_EVENT_ROUTE_RESOLVED
Route resolution (rdma_resolve_route) completed successfully.
RDMA_CM_EVENT_ROUTE_ERROR
Route resolution (rdma_resolve_route) failed.
RDMA_CM_EVENT_CONNECT_REQUEST
Generated on the passive side to notify the user of a new connection request.
RDMA_CM_EVENT_CONNECT_RESPONSE
Generated on the active side to notify the user of a successful response to a connection request. It is only generated on rdma_cm_id's that do not have a QP associated with them.
RDMA_CM_EVENT_CONNECT_ERROR
Indicates that an error has occurred trying to establish a connection. May be generated on the active or passive side of a connection.
RDMA_CM_EVENT_UNREACHABLE
Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request.
RDMA_CM_EVENT_REJECTED
Indicates that a connection request or response was rejected by the remote end point.
RDMA_CM_EVENT_ESTABLISHED
Indicates that a connection has been established with the remote end point.
RDMA_CM_EVENT_DISCONNECTED
The connection has been disconnected.
RDMA_CM_EVENT_DEVICE_REMOVAL
The local RDMA device associated with the rdma_cm_id has been removed. Upon receiving this event, the user must destroy the related rdma_cm_id.
RDMA_CM_EVENT_MULTICAST_JOIN
The multicast join operation (rdma_join_multicast) completed successfully.
RDMA_CM_EVENT_MULTICAST_ERROR
An error either occurred joining a multicast group, or, if the group had already been joined, on an existing group. The specified multicast group is no longer accessible and should be rejoined, if desired.
RDMA_CM_EVENT_ADDR_CHANGE
The network device associated with this ID through address resolution changed its HW address, e.g., following a bonding failover. This event can serve as a hint for applications that want the links used for their RDMA sessions to align with the network stack.
RDMA_CM_EVENT_TIMEWAIT_EXIT
The QP associated with a connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.
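
An event loop usually needs to decide quickly whether a reported event means that the pending operation failed. One possible grouping is sketched below, based on the descriptions above and on the status field ("zero unless the corresponding operation failed"). The enum demo_cm_event_type mirrors the names listed above but is a local stand-in whose numeric values are not claimed to match the library's, and both functions are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Local mirror of the event names listed above (values are arbitrary). */
enum demo_cm_event_type {
    DEMO_CM_EVENT_ADDR_RESOLVED,
    DEMO_CM_EVENT_ADDR_ERROR,
    DEMO_CM_EVENT_ROUTE_RESOLVED,
    DEMO_CM_EVENT_ROUTE_ERROR,
    DEMO_CM_EVENT_CONNECT_REQUEST,
    DEMO_CM_EVENT_CONNECT_RESPONSE,
    DEMO_CM_EVENT_CONNECT_ERROR,
    DEMO_CM_EVENT_UNREACHABLE,
    DEMO_CM_EVENT_REJECTED,
    DEMO_CM_EVENT_ESTABLISHED,
    DEMO_CM_EVENT_DISCONNECTED,
    DEMO_CM_EVENT_DEVICE_REMOVAL,
    DEMO_CM_EVENT_MULTICAST_JOIN,
    DEMO_CM_EVENT_MULTICAST_ERROR,
    DEMO_CM_EVENT_ADDR_CHANGE,
    DEMO_CM_EVENT_TIMEWAIT_EXIT,
    DEMO_CM_EVENT_COUNT
};

/* True if the event reports a failed operation: either the event type is
 * one of the error types, or the event carries a non-zero status. */
bool demo_event_failed(enum demo_cm_event_type type, int status)
{
    assert((int)type >= 0 && (int)type < DEMO_CM_EVENT_COUNT);
    if (status != 0)
        return true;
    switch (type) {
    case DEMO_CM_EVENT_ADDR_ERROR:
    case DEMO_CM_EVENT_ROUTE_ERROR:
    case DEMO_CM_EVENT_CONNECT_ERROR:
    case DEMO_CM_EVENT_UNREACHABLE:
    case DEMO_CM_EVENT_REJECTED:
    case DEMO_CM_EVENT_MULTICAST_ERROR:
        return true;
    default:
        return false;
    }
}

/* True if the event obliges the user to destroy the related rdma_cm_id. */
bool demo_event_requires_destroy(enum demo_cm_event_type type)
{
    assert((int)type >= 0 && (int)type < DEMO_CM_EVENT_COUNT);
    return type == DEMO_CM_EVENT_DEVICE_REMOVAL;
}
```

Treating RDMA_CM_EVENT_DISCONNECTED as a non-failure is a choice made here because a disconnect is part of the normal flow; an application may reasonably classify it differently.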

SEE ALSO

rdma_ack_cm_event, rdma_create_event_channel, rdma_resolve_addr, rdma_resolve_route, rdma_connect, rdma_listen, rdma_join_multicast, rdma_destroy_id, rdma_event_str

 


RDMA_ACK_CM_EVENT


NAME

rdma_ack_cm_event - Free a communication event.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_ack_cm_event (struct rdma_cm_event *event);

ARGUMENTS

event
Event to be released.

DESCRIPTION

All events which are allocated by rdma_get_cm_event must be released; there should be a one-to-one correspondence between successful gets and acks. This call frees the event structure and any memory that it references.

SEE ALSO

rdma_get_cm_event, rdma_destroy_id

 


RDMA_CREATE_QP


NAME

rdma_create_qp - Allocate a QP.  

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_create_qp (struct rdma_cm_id *id, struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr);

ARGUMENTS

id
RDMA identifier.
pd
Protection domain for the QP.
qp_init_attr
Initial QP attributes.

DESCRIPTION

Allocate a QP associated with the specified rdma_cm_id and transition it for sending and receiving.

NOTES

The rdma_cm_id must be bound to a local RDMA device before calling this function, and the protection domain must be for that same device. QPs allocated to an rdma_cm_id are automatically transitioned by the librdmacm through their states. After being allocated, the QP will be ready to handle posting of receives. If the QP is unconnected, it will be ready to post sends.

SEE ALSO

rdma_bind_addr, rdma_resolve_addr, rdma_destroy_qp, ibv_create_qp, ibv_modify_qp

 


RDMA_DESTROY_QP


NAME

rdma_destroy_qp - Deallocate a QP.

SYNOPSIS

#include <rdma/rdma_cma.h>

 void rdma_destroy_qp (struct rdma_cm_id *id);

ARGUMENTS

id
RDMA identifier.

DESCRIPTION

Destroy a QP allocated on the rdma_cm_id.

NOTES

Users must destroy any QP associated with an rdma_cm_id before destroying the ID.

SEE ALSO

rdma_create_qp, rdma_destroy_id, ibv_destroy_qp

 


RDMA_ACCEPT


NAME

rdma_accept - Called to accept a connection request.  

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_accept (struct rdma_cm_id *id, struct rdma_conn_param *conn_param);  

ARGUMENTS

id
Connection identifier associated with the request.
conn_param
Information needed to establish the connection. See CONNECTION PROPERTIES below for details.

DESCRIPTION

Called from the listening side to accept a connection or datagram service lookup request.

NOTES

Unlike the socket accept routine, rdma_accept is not called on a listening rdma_cm_id. Instead, after calling rdma_listen, the user waits for an RDMA_CM_EVENT_CONNECT_REQUEST event to occur. Connection request events give the user a newly created rdma_cm_id, similar to a new socket, but the rdma_cm_id is bound to a specific RDMA device. rdma_accept is called on the new rdma_cm_id.

CONNECTION PROPERTIES

The following properties are used to configure the communication and specified by the conn_param parameter when accepting a connection or datagram communication request. Users should use the rdma_conn_param values reported in the connection request event to determine appropriate values for these fields when accepting. Users may reference the rdma_conn_param structure in the connection event directly, or can reference their own structure. If the rdma_conn_param structure from an event is referenced, the event must not be acked until after this call returns.
private_data
References a user-controlled data buffer. The contents of the buffer are copied and transparently passed to the remote side as part of the communication request. May be NULL if private_data is not required.
private_data_len
Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may be larger than that requested.
responder_resources
The maximum number of outstanding RDMA read and atomic operations that the local side will accept from the remote side. Applies only to RDMA_PS_TCP. This value must be less than or equal to the local RDMA device attribute max_qp_rd_atom and the responder_resources value reported in the connect request event.
initiator_depth
The maximum number of outstanding RDMA read and atomic operations that the local side will have to the remote side. Applies only to RDMA_PS_TCP. This value must be less than or equal to the local RDMA device attribute max_qp_init_rd_atom and the initiator_depth value reported in the connect request event.
flow_control
Specifies if hardware flow control is available. This value is exchanged with the remote peer and is not used to configure the QP. Applies only to RDMA_PS_TCP.
retry_count
This value is ignored.
rnr_retry_count
The maximum number of times that a send operation from the remote peer should be retried on a connection after receiving a receiver not ready (RNR) error. RNR errors are generated when a send request arrives before a buffer has been posted to receive the incoming data. Applies only to RDMA_PS_TCP.
srq
Specifies if the QP associated with the connection is using a shared receive queue. This field is ignored by the library if a QP has been created on the rdma_cm_id. Applies only to RDMA_PS_TCP.
qp_num
Specifies the QP number associated with the connection. This field is ignored by the library if a QP has been created on the rdma_cm_id.
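The limits stated for responder_resources and initiator_depth can be applied mechanically when accepting. The following is a minimal sketch, assuming the two local device limits and the two values from the connect request event have already been read by the caller. The struct accept_limits and both function names are hypothetical and are not part of librdmacm.

```c
#include <stdint.h>

/* Hypothetical holder for the two fields discussed above. */
struct accept_limits {
    uint8_t responder_resources;
    uint8_t initiator_depth;
};

/* Returns the smaller of a local device limit and the value that was
 * reported in the connect request event. A negative limit yields 0. */
static uint8_t min_limit(int local_limit, uint8_t requested)
{
    if (local_limit < 0)
        return 0;
    return (local_limit < (int)requested) ? (uint8_t)local_limit : requested;
}

/* max_qp_rd_atom and max_qp_init_rd_atom are the local device attributes
 * named in the text; the req_* values come from the connect request event. */
static struct accept_limits pick_accept_limits(int max_qp_rd_atom,
                                               int max_qp_init_rd_atom,
                                               uint8_t req_responder_resources,
                                               uint8_t req_initiator_depth)
{
    struct accept_limits out;

    out.responder_resources = min_limit(max_qp_rd_atom, req_responder_resources);
    out.initiator_depth = min_limit(max_qp_init_rd_atom, req_initiator_depth);
    return out;
}
```

The two results would then be copied into the matching fields of the rdma_conn_param structure passed when accepting.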

INFINIBAND SPECIFIC

In addition to the connection properties defined above, InfiniBand QPs are configured with minimum RNR NAK timer and local ACK timeout values. The minimum RNR NAK timer value is set to 0, for a delay of 655 ms. The local ACK timeout is calculated based on the packet lifetime and local HCA ACK delay. The packet lifetime is determined by the InfiniBand Subnet Administrator and is part of the route (path record) information obtained by the active side of the connection. The HCA ACK delay is a property of the locally used HCA. The RNR retry count is a 3-bit value.
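To make timeout values of this kind easier to read, the sketch below converts an encoded timeout into microseconds. It assumes the usual InfiniBand encoding for the local ACK timeout and packet lifetime fields, in which a 5-bit value n stands for 4.096 microseconds times 2^n and 0 means that no timeout is applied. It does not reproduce how the rdma_cm combines packet lifetime and HCA ACK delay; the function name is hypothetical.

```c
#include <stdint.h>

/* Converts a 5-bit encoded timeout into microseconds, assuming the
 * encoding 4.096 us * 2^n. Returns 0.0 for n == 0, which in this
 * encoding means "no timeout" rather than a zero-length wait. */
static double ib_timeout_usec(uint8_t n)
{
    n &= 0x1F;                    /* keep only the 5 encoded bits */
    if (n == 0)
        return 0.0;
    return 4.096 * (double)(1ULL << n);
}
```

Under this assumption an encoded value of 14 corresponds to roughly 67 ms.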

SEE ALSO

rdma_listen, rdma_reject, rdma_get_cm_event

 


RDMA_CONNECT


NAME

rdma_connect - Initiate an active connection request.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_connect (struct rdma_cm_id *id, struct rdma_conn_param *conn_param);

ARGUMENTS

id
RDMA identifier.
conn_param
Connection parameters. See CONNECTION PROPERTIES below for details.

DESCRIPTION

For an rdma_cm_id of type RDMA_PS_TCP, this call initiates a connection request to a remote destination. For an rdma_cm_id of type RDMA_PS_UDP, it initiates a lookup of the remote QP providing the datagram service.

NOTES

Users must have resolved a route to the destination address by having called rdma_resolve_route before calling this routine.

CONNECTION PROPERTIES

The following properties configure the communication and are specified by the conn_param parameter when connecting or establishing datagram communication.
private_data
References a user-controlled data buffer. The contents of the buffer are copied and transparently passed to the remote side as part of the communication request. May be NULL if private_data is not required.
private_data_len
Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may be larger than that requested.
responder_resources
The maximum number of outstanding RDMA read and atomic operations that the local side will accept from the remote side. Applies only to RDMA_PS_TCP. This value must be less than or equal to the local RDMA device attribute max_qp_rd_atom and remote RDMA device attribute max_qp_init_rd_atom. The remote endpoint can adjust this value when accepting the connection.
initiator_depth
The maximum number of outstanding RDMA read and atomic operations that the local side will have to the remote side. Applies only to RDMA_PS_TCP. This value must be less than or equal to the local RDMA device attribute max_qp_init_rd_atom and remote RDMA device attribute max_qp_rd_atom. The remote endpoint can adjust this value when accepting the connection.
flow_control
Specifies if hardware flow control is available. This value is exchanged with the remote peer and is not used to configure the QP. Applies only to RDMA_PS_TCP.
retry_count
The maximum number of times that a data transfer operation should be retried on the connection when an error occurs. This setting controls the number of times to retry send, RDMA, and atomic operations when timeouts occur. Applies only to RDMA_PS_TCP.
rnr_retry_count
The maximum number of times that a send operation from the remote peer should be retried on a connection after receiving a receiver not ready (RNR) error. RNR errors are generated when a send request arrives before a buffer has been posted to receive the incoming data. Applies only to RDMA_PS_TCP.
srq
Specifies if the QP associated with the connection is using a shared receive queue. This field is ignored by the library if a QP has been created on the rdma_cm_id. Applies only to RDMA_PS_TCP.
qp_num
Specifies the QP number associated with the connection. This field is ignored by the library if a QP has been created on the rdma_cm_id. Applies only to RDMA_PS_TCP.
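Because the amount of private data delivered to the remote side may be larger than private_data_len, a receiver cannot rely on the delivered size alone. One possible approach, sketched below, is to carry an explicit length inside the buffer. The header layout, the APP_MAGIC marker, and the function names are hypothetical choices for this illustration and are not defined by librdmacm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define APP_MAGIC 0x4F46u   /* arbitrary marker chosen for this sketch */
#define APP_HDR_LEN 3u      /* 2 bytes of marker + 1 byte of payload length */

/* Writes the marker, the payload length, and the payload into buf.
 * Returns the byte count to use as private_data_len, or 0 if the result
 * would not fit into buf or into an 8-bit length. */
static uint8_t app_pack(uint8_t *buf, size_t cap, const void *payload, uint8_t len)
{
    size_t total = (size_t)len + APP_HDR_LEN;

    if (total > cap || total > 255u)
        return 0;
    buf[0] = (uint8_t)(APP_MAGIC >> 8);
    buf[1] = (uint8_t)(APP_MAGIC & 0xFFu);
    buf[2] = len;
    memcpy(buf + APP_HDR_LEN, payload, len);
    return (uint8_t)total;
}

/* Parses received private data. recv_len may exceed what the sender
 * requested, so the embedded length decides how much payload is valid.
 * Returns a pointer to the payload and stores its length, or NULL. */
static const uint8_t *app_unpack(const uint8_t *buf, size_t recv_len, uint8_t *len_out)
{
    if (recv_len < APP_HDR_LEN)
        return NULL;
    if (buf[0] != (uint8_t)(APP_MAGIC >> 8) || buf[1] != (uint8_t)(APP_MAGIC & 0xFFu))
        return NULL;
    if ((size_t)buf[2] + APP_HDR_LEN > recv_len)
        return NULL;
    *len_out = buf[2];
    return buf + APP_HDR_LEN;
}
```

The sender would pass the packed buffer and the returned length as private_data and private_data_len; the receiver would call app_unpack on the private data reported in the connection event.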

INFINIBAND SPECIFIC

In addition to the connection properties defined above, InfiniBand QPs are configured with minimum RNR NAK timer and local ACK timeout values. The minimum RNR NAK timer value is set to 0, for a delay of 655 ms. The local ACK timeout is calculated based on the packet lifetime and local HCA ACK delay. The packet lifetime is determined by the InfiniBand Subnet Administrator and is part of the resolved route (path record) information. The HCA ACK delay is a property of the locally used HCA. Retry count and RNR retry count values are 3-bit values.
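Since retry count and RNR retry count are 3-bit values, only 0 through 7 can be represented. A small helper such as the following sketch can keep a configured value inside that range before it is stored in retry_count or rnr_retry_count. The helper name and the macro are hypothetical.

```c
#include <stdint.h>

#define RETRY_FIELD_MAX 7u   /* largest value a 3-bit field can hold */

/* Limits a requested retry count to the 3-bit range 0..7. */
static uint8_t clamp_retry(unsigned int wanted)
{
    return (uint8_t)(wanted > RETRY_FIELD_MAX ? RETRY_FIELD_MAX : wanted);
}
```

Note that InfiniBand verbs documentation commonly describes an RNR retry value of 7 as "retry indefinitely", so clamping a large request to 7 may change its meaning for rnr_retry_count; callers may prefer to reject out-of-range values instead.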

IWARP SPECIFIC

Connections established over iWarp RDMA devices currently require that the active side of the connection send the first message.

SEE ALSO

rdma_cm, rdma_create_id, rdma_resolve_route, rdma_disconnect, rdma_listen, rdma_get_cm_event

 


RDMA_DISCONNECT


NAME

rdma_disconnect - Disconnects a connection.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_disconnect (struct rdma_cm_id *id);

ARGUMENTS

id
RDMA identifier.

DESCRIPTION

Disconnects a connection and transitions any associated QP to the error state, which will flush any posted work requests to the completion queue. This routine may be called by both the client and server side of a connection. After successfully disconnecting, an RDMA_CM_EVENT_DISCONNECTED event will be generated on both sides of the connection.

SEE ALSO

rdma_connect, rdma_listen, rdma_accept, rdma_get_cm_event

 


RDMA_RESOLVE_ROUTE


NAME

rdma_resolve_route - Resolve the route information needed to establish a connection.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_resolve_route (struct rdma_cm_id *id, int timeout_ms);

ARGUMENTS

id
RDMA identifier.
timeout_ms
Time to wait for resolution to complete.

DESCRIPTION

Resolves an RDMA route to the destination address in order to establish a connection. The destination address must have already been resolved by calling rdma_resolve_addr.

NOTES

This is called on the client side of a connection after calling rdma_resolve_addr, but before calling rdma_connect.
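The client-side ordering described here (resolve the address, then the route, then connect) is often driven by the events returned from rdma_get_cm_event. The sketch below models that ordering as a small decision function. The enum values are stand-ins defined only for this illustration; they mirror, but are not, the librdmacm events RDMA_CM_EVENT_ADDR_RESOLVED, RDMA_CM_EVENT_ROUTE_RESOLVED, and RDMA_CM_EVENT_ESTABLISHED.

```c
/* Stand-in event codes for this sketch (not the librdmacm enum). */
enum sketch_event {
    EV_ADDR_RESOLVED,    /* address resolution finished  */
    EV_ROUTE_RESOLVED,   /* route resolution finished    */
    EV_ESTABLISHED,      /* connection is established    */
    EV_OTHER             /* any error or unexpected event */
};

/* What the client would do next in response to an event. */
enum sketch_action {
    ACT_RESOLVE_ROUTE,   /* call rdma_resolve_route */
    ACT_CONNECT,         /* call rdma_connect       */
    ACT_READY,           /* start transferring data */
    ACT_FAIL             /* clean up and report     */
};

/* Maps an event to the next step of the client-side sequence. */
static enum sketch_action next_client_action(enum sketch_event ev)
{
    switch (ev) {
    case EV_ADDR_RESOLVED:
        return ACT_RESOLVE_ROUTE;
    case EV_ROUTE_RESOLVED:
        return ACT_CONNECT;
    case EV_ESTABLISHED:
        return ACT_READY;
    default:
        return ACT_FAIL;
    }
}
```

A real event loop would translate each event type it receives into one of these stand-in codes and then issue the corresponding call.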

INFINIBAND SPECIFIC

This call obtains a path record that is used by the connection.

SEE ALSO

rdma_resolve_addr, rdma_connect, rdma_get_cm_event

 


RDMA_BIND_ADDR


NAME

rdma_bind_addr - Bind an RDMA identifier to a source address.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_bind_addr (struct rdma_cm_id *id, struct sockaddr *addr);

ARGUMENTS

id
RDMA identifier.
addr
Local address information. Wildcard values are permitted.

DESCRIPTION

Associates a source address with an rdma_cm_id. The address may be wildcarded. If binding to a specific local address, the rdma_cm_id will also be bound to a local RDMA device.

NOTES

Typically, this routine is called before calling rdma_listen to bind to a specific port number, but it may also be called on the active side of a connection before calling rdma_resolve_addr to bind to a specific address. If used to bind to port 0, the rdma_cm will select an available port, which can be retrieved with rdma_get_src_port.
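As an illustration of the wildcard and port 0 cases, the sketch below fills in an IPv4 socket address that could be passed as the addr argument. It is written against POSIX socket headers; the Windows header name under _WIN32 is an assumption, and the helper name is hypothetical.

```c
#include <stdint.h>
#include <string.h>
#ifdef _WIN32
#include <winsock2.h>          /* assumed Windows equivalent */
#else
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#endif

/* Fills sin with the IPv4 wildcard address and the given port.
 * The port is supplied in host byte order; 0 leaves the choice of
 * port to the rdma_cm, as described in the text above. */
static void make_wildcard_addr(struct sockaddr_in *sin, uint16_t port)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;
    sin->sin_addr.s_addr = htonl(INADDR_ANY);
    sin->sin_port = htons(port);
}
```

The resulting structure would be cast to struct sockaddr * when calling rdma_bind_addr.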

SEE ALSO

rdma_create_id, rdma_listen, rdma_resolve_addr, rdma_create_qp, rdma_get_local_addr, rdma_get_src_port

 


RDMA_LISTEN


NAME

rdma_listen - Listen for incoming connection requests.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_listen (struct rdma_cm_id *id, int backlog);

ARGUMENTS

id
RDMA identifier.
backlog
Backlog of incoming connection requests.

DESCRIPTION

Initiates a listen for incoming connection requests or datagram service lookup. The listen will be restricted to the locally bound source address.

NOTES

Users must have bound the rdma_cm_id to a local address by calling rdma_bind_addr before calling this routine. If the rdma_cm_id is bound to a specific IP address, the listen will be restricted to that address and the associated RDMA device. If the rdma_cm_id is bound to an RDMA port number only, the listen will occur across all RDMA devices.

SEE ALSO

rdma_cm, rdma_bind_addr, rdma_connect, rdma_accept, rdma_reject, rdma_get_cm_event

 


RDMA_REJECT


NAME

rdma_reject - Called to reject a connection request.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_reject (struct rdma_cm_id *id, const void *private_data, uint8_t private_data_len);

ARGUMENTS

id
Connection identifier associated with the request.
private_data
Optional private data to send with the reject message.
private_data_len
Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may be larger than that requested.

DESCRIPTION

Called from the listening side to reject a connection or datagram service lookup request.

NOTES

After receiving a connection request event, a user may call rdma_reject to reject the request. If the underlying RDMA transport supports private data in the reject message, the specified data will be passed to the remote side.

SEE ALSO

rdma_listen, rdma_accept, rdma_get_cm_event

 


RDMA_GET_SRC_PORT


NAME

rdma_get_src_port - Returns the local port number of a bound rdma_cm_id.

SYNOPSIS

#include <rdma/rdma_cma.h>

 uint16_t rdma_get_src_port (struct rdma_cm_id *id);  

ARGUMENTS

id
RDMA identifier.

DESCRIPTION

Returns the local port number for an rdma_cm_id that has been bound to a local address.

SEE ALSO

rdma_bind_addr, rdma_resolve_addr, rdma_get_dst_port, rdma_get_local_addr, rdma_get_peer_addr

 


RDMA_GET_DST_PORT


NAME

rdma_get_dst_port - Returns the remote port number of a bound rdma_cm_id.

SYNOPSIS

#include <rdma/rdma_cma.h>

 uint16_t rdma_get_dst_port (struct rdma_cm_id *id);

ARGUMENTS

id
RDMA identifier.

DESCRIPTION

Returns the remote port number for an rdma_cm_id that has been bound to a remote address.

SEE ALSO

rdma_connect, rdma_accept, rdma_get_cm_event, rdma_get_src_port, rdma_get_local_addr, rdma_get_peer_addr

 


RDMA_GET_LOCAL_ADDR


NAME

rdma_get_local_addr - Returns the local IP address of a bound rdma_cm_id.

SYNOPSIS

#include <rdma/rdma_cma.h>

 struct sockaddr * rdma_get_local_addr (struct rdma_cm_id *id);

ARGUMENTS

id
RDMA identifier.

DESCRIPTION

Returns the local IP address for an rdma_cm_id that has been bound to a local device.

SEE ALSO

rdma_bind_addr, rdma_resolve_addr, rdma_get_src_port, rdma_get_dst_port, rdma_get_peer_addr

 


RDMA_GET_PEER_ADDR


NAME

rdma_get_peer_addr - Returns the remote IP address of a bound rdma_cm_id.

SYNOPSIS

#include <rdma/rdma_cma.h>

 struct sockaddr * rdma_get_peer_addr (struct rdma_cm_id *id);

ARGUMENTS

id
RDMA identifier.

DESCRIPTION

Returns the remote IP address associated with an rdma_cm_id.
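The struct sockaddr pointer returned by rdma_get_local_addr or rdma_get_peer_addr can be turned into text for logging. The following is a minimal sketch using the standard inet_ntop call; it is written against POSIX socket headers, the Windows header names under _WIN32 are assumptions, and the helper name format_sockaddr is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <winsock2.h>          /* assumed Windows equivalents */
#include <ws2tcpip.h>
#else
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#endif

/* Formats an IPv4 or IPv6 socket address as "address:port" (IPv6
 * addresses are wrapped in brackets). Returns 0 on success and -1 if
 * the address family is unsupported or conversion fails. */
static int format_sockaddr(const struct sockaddr *sa, char *out, size_t cap)
{
    char ip[INET6_ADDRSTRLEN];

    if (sa->sa_family == AF_INET) {
        const struct sockaddr_in *in4 = (const struct sockaddr_in *)sa;

        if (inet_ntop(AF_INET, (void *)&in4->sin_addr, ip, sizeof(ip)) == NULL)
            return -1;
        snprintf(out, cap, "%s:%u", ip, (unsigned int)ntohs(in4->sin_port));
        return 0;
    }
    if (sa->sa_family == AF_INET6) {
        const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *)sa;

        if (inet_ntop(AF_INET6, (void *)&in6->sin6_addr, ip, sizeof(ip)) == NULL)
            return -1;
        snprintf(out, cap, "[%s]:%u", ip, (unsigned int)ntohs(in6->sin6_port));
        return 0;
    }
    return -1;
}
```

A caller would pass the returned pointer directly, for example the result of rdma_get_peer_addr(id), together with a buffer of at least 64 bytes.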

SEE ALSO

rdma_resolve_addr, rdma_get_src_port, rdma_get_dst_port, rdma_get_local_addr

 


RDMA_EVENT_STR


NAME

rdma_event_str - Returns a string representation of an rdma cm event.

SYNOPSIS

#include <rdma/rdma_cma.h>

 char * rdma_event_str (enum rdma_cm_event_type event);

ARGUMENTS

event
Asynchronous event.

DESCRIPTION

Returns a string representation of an asynchronous event.

SEE ALSO

rdma_get_cm_event

 


RDMA_JOIN_MULTICAST


NAME

rdma_join_multicast - Joins a multicast group.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_join_multicast (struct rdma_cm_id *id, struct sockaddr *addr, void *context);

ARGUMENTS

id
Communication identifier associated with the request.
addr
Multicast address identifying the group to join.
context
User-defined context associated with the join request.

DESCRIPTION

Joins a multicast group and attaches an associated QP to the group.

NOTES

Before joining a multicast group, the rdma_cm_id must be bound to an RDMA device by calling rdma_bind_addr or rdma_resolve_addr. Use of rdma_resolve_addr requires the local routing tables to resolve the multicast address to an RDMA device, unless a specific source address is provided. The user must call rdma_leave_multicast to leave the multicast group and release any multicast resources. After the join operation completes, any associated QP is automatically attached to the multicast group, and the join context is returned to the user through the private_data field in the rdma_cm_event.
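Before passing an address to rdma_join_multicast, an application may want to confirm that it really is a multicast address. The sketch below checks the IPv4 case only (the 224.0.0.0/4 range); it is written against POSIX socket headers, the Windows header name under _WIN32 is an assumption, and the helper name is hypothetical.

```c
#include <stdint.h>
#ifdef _WIN32
#include <winsock2.h>          /* assumed Windows equivalent */
#else
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#endif

/* Returns 1 if sin holds an IPv4 multicast address (224.0.0.0/4),
 * otherwise 0. */
static int is_ipv4_multicast(const struct sockaddr_in *sin)
{
    uint32_t host = ntohl(sin->sin_addr.s_addr);

    if (sin->sin_family != AF_INET)
        return 0;
    return (host & 0xF0000000u) == 0xE0000000u;
}
```

An address that passes this check would then be cast to struct sockaddr * and supplied as the addr argument.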

SEE ALSO

rdma_leave_multicast, rdma_bind_addr, rdma_resolve_addr, rdma_create_qp, rdma_get_cm_event

 


RDMA_LEAVE_MULTICAST


NAME

rdma_leave_multicast - Leaves a multicast group.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_leave_multicast (struct rdma_cm_id *id, struct sockaddr *addr);

ARGUMENTS

id
Communication identifier associated with the request.
addr
Multicast address identifying the group to leave.

DESCRIPTION

Leaves a multicast group and detaches an associated QP from the group.

NOTES

Calling this function before a group has been fully joined results in canceling the join operation. Users should be aware that messages received from the multicast group may still be queued for completion processing immediately after leaving a multicast group. Destroying an rdma_cm_id will automatically leave all multicast groups.

SEE ALSO

rdma_join_multicast, rdma_destroy_qp

 


RDMA_SET_OPTION


NAME

rdma_set_option - Set communication options for an rdma_cm_id.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_set_option (struct rdma_cm_id *id, int level, int optname, void *optval, size_t optlen);

ARGUMENTS

id
RDMA identifier.
level
Protocol level of the option to set.
optname
Name of the option, relative to the level, to set.
optval
Reference to the option data. The data is dependent on the level and optname.
optlen
The size of the optval buffer.

DESCRIPTION

Sets communication options for an rdma_cm_id. This call is used to override the default system settings.

NOTES

Option details may be found in the relevant header files.

SEE ALSO

rdma_create_id

 


RDMA_GET_DEVICES


NAME

rdma_get_devices - Get a list of RDMA devices currently available.

SYNOPSIS

#include <rdma/rdma_cma.h>

 struct ibv_context ** rdma_get_devices (int *num_devices);

ARGUMENTS

num_devices
If non-NULL, set to the number of devices returned.

DESCRIPTION

Return a NULL-terminated array of opened RDMA devices. Callers can use this routine to allocate resources on specific RDMA devices that will be shared across multiple rdma_cm_id's.

NOTES

The returned array must be released by calling rdma_free_devices. Devices remain opened while the librdmacm is loaded.
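Because the returned array is NULL-terminated, it can be walked without using num_devices at all. The sketch below counts the entries of such an array. struct ibv_context is only forward-declared here so that the example stands alone; in a real program the definition comes from the verbs headers. The helper name count_devices is hypothetical.

```c
#include <stddef.h>

struct ibv_context;   /* opaque in this sketch */

/* Counts the entries of a NULL-terminated device array such as the one
 * returned by rdma_get_devices. A NULL list counts as empty. */
static int count_devices(struct ibv_context **list)
{
    int n = 0;

    if (list == NULL)
        return 0;
    while (list[n] != NULL)
        n++;
    return n;
}
```

After walking the array, the caller would still release it with rdma_free_devices.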

SEE ALSO

rdma_free_devices

 


RDMA_FREE_DEVICES


NAME

rdma_free_devices - Frees the list of devices returned by rdma_get_devices.

SYNOPSIS

#include <rdma/rdma_cma.h>

 void rdma_free_devices (struct ibv_context **list);

ARGUMENTS

list
List of devices returned from rdma_get_devices.

DESCRIPTION

Frees the device array returned by rdma_get_devices.

SEE ALSO

rdma_get_devices

 


RDMA_NOTIFY


NAME

rdma_notify - Notifies the librdmacm of an asynchronous event.

SYNOPSIS

#include <rdma/rdma_cma.h>

 int rdma_notify (struct rdma_cm_id *id, enum ibv_event_type event);

ARGUMENTS

id
RDMA identifier.
event
Asynchronous event.

DESCRIPTION

Used to notify the librdmacm of asynchronous events that have occurred on a QP associated with the rdma_cm_id.

NOTES

Asynchronous events that occur on a QP are reported through the user's device event handler. This routine is used to notify the librdmacm of communication events. In most cases, use of this routine is not necessary; however, if connection establishment is done out of band (such as done through InfiniBand), it is possible to receive data on a QP that is not yet considered connected. In that case, this routine forces the connection into an established state in order to handle the rare situation where the connection never forms on its own. Events that should be reported to the CM are: IB_EVENT_COMM_EST.

SEE ALSO

rdma_connect, rdma_accept, rdma_listen


 

 

WinVerbs


WinVerbs is a userspace verbs and communication management interface optimized
for the Windows operating system. Its lower interface is designed to support
any RDMA based device, including InfiniBand and future RDMA devices. Its upper interface
provides a low latency verbs interface and also supports Microsoft's NetworkDirect
interface, DAPL, and the OFED components libibverbs, libibmad, and rdma_cm, as well as numerous OFED IB diagnostic tools.

The WinVerbs driver loads as an upper filter driver for Infiniband HCA devices.
(Open source iWarp drivers for Windows are not yet available.) A corresponding
userspace library installs as part of the Winverbs driver installation package.
Additionally, a Windows port of the OFED libibverbs library and several test
programs are also included.

As of the WinOF 2.1 release, Winverbs and Winmad are fully integrated into the HCA driver stack load.
That is to say, Winverbs and Winmad are now integral components of the OFED stack.

Available libibverbs test programs and their usage are listed
below. Note that not all listed options apply to all applications.

ibv_rc_pingpong, ibv_uc_pingpong, ibv_ud_pingpong
(no args)     start a server and wait for connection
-h <host>     connect to server at <host>
-p <port>     listen on/connect to port <port> (default 18515)
-d <dev>     use IB device <dev> (default first device found)
-i <port>      use port <port> of IB device (default 1)
-s <size>      size of message to exchange (default 4096)
-m <size>     path MTU (default 1024)
-r <dep>      number of receives to post at a time (default 500)
-n <iters>     number of exchanges (default 1000)
-l <sl>          service level value
-e                 sleep on CQ events (default poll)
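The path MTU given with -m has to be one of the sizes an InfiniBand link supports. The sketch below maps a byte count to a small numeric code and rejects anything else. The enum is defined locally for this illustration; its numbering (1 for 256 bytes through 5 for 4096 bytes) is assumed to mirror enum ibv_mtu in libibverbs, and the function name is hypothetical.

```c
/* Local stand-in for the MTU codes; assumed to mirror enum ibv_mtu. */
enum sketch_mtu {
    SK_MTU_256 = 1,
    SK_MTU_512 = 2,
    SK_MTU_1024 = 3,
    SK_MTU_2048 = 4,
    SK_MTU_4096 = 5
};

/* Maps an MTU in bytes to its code, or returns -1 if the size is not
 * one of 256, 512, 1024, 2048, or 4096. */
static int mtu_to_code(int bytes)
{
    switch (bytes) {
    case 256:
        return SK_MTU_256;
    case 512:
        return SK_MTU_512;
    case 1024:
        return SK_MTU_1024;
    case 2048:
        return SK_MTU_2048;
    case 4096:
        return SK_MTU_4096;
    default:
        return -1;
    }
}
```

With this mapping, the default of 1024 listed above corresponds to code 3, while a value such as 1000 is rejected.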

ibv_send_bw, ibv_send_lat
ibv_read_bw, ibv_read_lat
ibv_write_bw, ibv_write_lat
(no args)              start a server and wait for connection
-h <host>              connect to server at <host>
-p <port>              listen on/connect to port <port> (default 18515)
-d <dev>               use IB device <dev> (default first device found)
-i <port>               use port <port> of IB device (default 1)
-c <RC/UC/UD>  connection type RC/UC/UD (default RC)
-m <mtu>              MTU size (256 - 4096; default for hermon is 2048)
-s <size>               size of message to exchange (default 65536)
-a                          run sizes from 2 to 2^23
-t <dep>                size of tx queue (default 300)
-g                          send messages to multicast group (UD only)
-r <dep>                make rx queue bigger than tx (default 600)
-n <iters>               number of exchanges (at least 2, default 1000)
-I <size>                max size of message to be sent in inline mode (default 400)
-b                          measure bidirectional bandwidth (default unidirectional)
-V                         display version number
-e                          sleep on CQ events (default poll)
-N                         cancel peak-bw calculation (default with peak-bw)

To verify correct WinVerbs and libibverbs installation, run ibstat or ibv_devinfo. Either program
should report all RDMA devices in the system, along with limited port
attributes. Because of limitations in the OFED for Windows stack in comparison to the Linux OFED stack, it is normal for the programs to
list several values as unknown.
