



### **Open Fabrics Interfaces Architecture Introduction**

Sean Hefty, Intel Corporation

### **Current State of Affairs**

#### OFED software

- Widely adopted low-level RDMA API
- Ships with upstream Linux

but...

- OFED SW was not designed around HPC
- Hardware and fabric features are changing
  - Divergence is driving competing APIs
- Interfaces are being extended, and new APIs introduced
  - Long delay in adoption
- Size of clusters and single node core counts greatly increasing
- More applications want to take advantage of high-performance fabrics





#### **Evolve OpenFabrics**



Design software interfaces that are aligned with application requirements

Target needs of HPC

Support multiple interface semantics

Fabric and vendor agnostic

Supportable in upstream Linux





- Leveraging existing open source community
- Broad ecosystem
  - Application developers and vendors
  - Active community engagement
- Drive high-performance software APIs
  - Take advantage of available hardware features
  - Support multiple product generations

Open Fabrics Interfaces Working Group

### **OFIWG Charter**



- Develop an *extensible*, open source framework and *interfaces aligned with ULP and application* needs for *high-performance* fabric services
- Software leading hardware
  - Enable future hardware features
  - Minimal impact to applications
- Minimize impedance mismatch between ULPs and network APIs
- Craft optimal APIs
  - Detailed analysis on MPI, SHMEM, and other PGAS languages
  - Focus on other applications: storage, databases, ...



- OFI WG is open participation
  - Contact the ofiwg mail list for meeting details
  - ofiwg@lists.openfabrics.org
- Source code available through github
  - github.com/ofiwg
- Presentations / meeting minutes available from OFA download directory

Help OFI WG understand workload requirements and drive software design

#### Enable..



#### **Scalability**

- Reduced cache and memory footprint
- Scalable address resolution and storage
  - Tight data structures

#### High performance

- Optimized software path to hardware
  - Independent of hardware interface, version, features

#### Extensible

- More agile development
  - Time-boxed, iterative development
  - Application focused APIs
- Adaptable

#### **App-centric**

- Analyze application needs
  - Implement them in a coherent, concise, high-performance manner

### Verbs Semantic Mismatch



| Current RDMA APIs | Improvement | <b>Evolved Fabric Interfaces</b> |
|-------------------|-------------|----------------------------------|
| 50-60 lines of C-code | | 25-30 lines of C-code |
| Allocate WR<br>Allocate SGE<br>Format SGE - 3 writes<br>Format WR - 6 writes | Reduce setup cost<br>- Tighter data | Direct call - 3 writes |
| generic send call | Selective optimization paths to HW<br>- Manual function expansion | optimized send call |
| Loop 1<br>Checks - 9 branches<br>Loop 2<br>Check<br>Loop 3<br>Checks - 3 branches<br>Checks - 3 branches | Eliminate loops and branches<br>- Remaining branches predictable | Checks - 2 branches |
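
The setup cost in the comparison above can be made concrete with a small sketch. The structures and functions below are hypothetical stand-ins, not the verbs or fabric interfaces themselves; they only illustrate how a single-buffer send looks when it must be expressed as a work request (WR) with a scatter-gather entry (SGE), versus a direct call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a scatter-gather entry and a work request. */
struct sge {
    const void *addr;
    uint32_t    length;
    uint32_t    key;
};

struct send_wr {
    uint64_t        id;
    struct send_wr *next;
    struct sge     *sg_list;
    int             num_sge;
    int             opcode;
    unsigned        flags;
};

enum { OP_SEND = 0, MAX_SGE = 4 };

/* Generic path: walks a WR list and each SGE list, checking as it goes.
 * Returns the number of bytes queued, or -1 on a failed check. */
long post_send_generic(struct send_wr *wr)
{
    long bytes = 0;

    for (; wr != NULL; wr = wr->next) {            /* loop over requests */
        if (wr->opcode != OP_SEND)
            return -1;
        if (wr->num_sge < 0 || wr->num_sge > MAX_SGE)
            return -1;
        assert(wr->num_sge == 0 || wr->sg_list != NULL);
        for (int i = 0; i < wr->num_sge; i++) {    /* loop over SGEs */
            if (wr->sg_list[i].addr == NULL)
                return -1;
            bytes += wr->sg_list[i].length;
        }
    }
    return bytes;
}

/* Caller-side setup the generic path needs for one buffer. */
long send_via_generic(const void *buf, uint32_t len)
{
    struct sge sge;             /* allocate SGE */
    struct send_wr wr;          /* allocate WR */

    sge.addr = buf;             /* format SGE: 3 writes */
    sge.length = len;
    sge.key = 0;

    wr.id = 1;                  /* format WR: 6 writes */
    wr.next = NULL;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = OP_SEND;
    wr.flags = 0;

    return post_send_generic(&wr);
}

/* Direct path: one buffer, no lists to format or walk. */
long send_direct(const void *buf, uint32_t len)
{
    if (buf == NULL)
        return -1;
    return (long)len;
}
```

Both paths queue the same bytes for a single buffer; the difference is the nine setup writes and the two loops that the generic form carries even when the application never uses more than one request or one SGE.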

## **Application-Centric Interfaces**



Reducing instruction count *requires* a better application impedance match

- Collect application requirements
- Identify common, fast path usage models
  - Too many use cases to optimize them all
- Build primitives around *fabric services* 
  - Not a device-specific interface

### **OFA Software Evolution**









Focus on longer-lived interfaces – software leading hardware

- Take growth into consideration
- Reduce effort to incorporate new application features
  - Addition of new interfaces, structures, or fields
  - Modification of existing functions
- Allow time to design new interfaces correctly
  - Support prototyping interfaces prior to integration



### Fabric Interfaces



- Defines philosophy for interfaces and extensions
  - Focus interfaces on the semantics and services offered by the hardware and not the hardware implementation
- Exports a minimal API
  - Control interface
- Defines fabric interfaces
  - API sets for specific functionality
- Defines core object model
  - Object-oriented design, but C-interfaces

### Fabric Interfaces Architecture





- Based on object-oriented programming concepts
- Derived objects define interfaces
  - New interfaces exposed
  - Define behavior of inherited interfaces
  - Optimize implementation
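
One common way to get an object-oriented design behind C interfaces is to embed a base object as the first member of each derived object and to dispatch through tables of function pointers. The sketch below assumes that approach; every name in it (`struct obj`, `struct endpoint`, `obj_close`, and so on) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical base object shared by every derived object. */
struct obj;

struct obj_ops {
    int (*close)(struct obj *obj);
};

struct obj {
    int                   oclass;   /* identifies the derived type */
    const struct obj_ops *ops;      /* inherited (base) interface */
};

/* Interface added by the derived endpoint object. */
struct endpoint;

struct endpoint_ops {
    long (*send)(struct endpoint *ep, const void *buf, size_t len);
};

struct endpoint {
    struct obj                 base;    /* base object embedded first */
    const struct endpoint_ops *ops;     /* new interface exposed */
    size_t                     bytes_sent;
    int                        open;
};

enum { OCLASS_ENDPOINT = 1 };

/* The derived type defines the behavior of the inherited close(). */
int endpoint_close(struct obj *obj)
{
    assert(obj != NULL && obj->oclass == OCLASS_ENDPOINT);
    struct endpoint *ep = (struct endpoint *)obj;  /* base is first member */
    ep->open = 0;
    return 0;
}

long endpoint_send(struct endpoint *ep, const void *buf, size_t len)
{
    if (!ep->open || buf == NULL)
        return -1;
    ep->bytes_sent += len;
    return (long)len;
}

static const struct obj_ops      endpoint_base_ops = { .close = endpoint_close };
static const struct endpoint_ops endpoint_data_ops = { .send = endpoint_send };

void endpoint_init(struct endpoint *ep)
{
    ep->base.oclass = OCLASS_ENDPOINT;
    ep->base.ops = &endpoint_base_ops;
    ep->ops = &endpoint_data_ops;
    ep->bytes_sent = 0;
    ep->open = 1;
}

/* Generic helper: works on any object through the base interface. */
int obj_close(struct obj *obj)
{
    return obj->ops->close(obj);
}
```

A caller would initialize an endpoint, send through `ep.ops->send`, and close it with `obj_close(&ep.base)`. Because each derived type installs its own function tables, a provider is free to supply an implementation tuned to that type.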

### **Control Interfaces**





#### fi\_getinfo

- Application specifies desired functionality
- Discover fabric providers and services
- Identify resources and addressing

#### fi\_fabric

Open a set of fabric interfaces and resources

#### fi\_register

• Dynamic providers publish control interfaces

### **Application Semantics**



Get / set using control interfaces

- Progress
  - Application or hardware driven
  - Data versus control interfaces
- Ordering
  - Message ordering
  - Data delivery order
- Multi-threading and locking model
  - Compile and run-time options

#### Fabric Object Model





### **Endpoint Interfaces**





# Application Configured Interfaces



#### **Event Queues**






#### **Address Vectors**

# Fabric specific addressing requirements



- Store addresses/host names
  - Insert range of addresses with single call
- Share between processes
- Reference entries by handle or index
  - Handle may be encoded fabric address
- Reference vector for group communication
- Enable provider optimization techniques
  - Greatly reduce storage requirements

Example only:

| Start Range | End Range | Base LID | SL |
|-------------|-----------|----------|----|
| host10      | host1000  | 50       | 1  |
| host1001    | host4999  | 2000     | 2  |
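
The example ranges above hint at how range insertion can reduce storage: two entries describe 4990 hosts. The sketch below is one possible encoding, assuming that LIDs are assigned consecutively within a range; that assumption, and all of the type and function names, are illustrative rather than part of the interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous host range mapped onto fabric addresses. */
struct av_range {
    uint32_t start;     /* first host number in the range */
    uint32_t end;       /* last host number in the range (inclusive) */
    uint16_t base_lid;  /* LID of the first host */
    uint8_t  sl;        /* service level for the whole range */
};

struct av_addr {
    uint16_t lid;
    uint8_t  sl;
};

/* Resolve a host number to a fabric address.
 * Assumes consecutive LIDs within a range (illustrative choice).
 * Returns 0 on success, -1 if no range covers the host. */
int av_lookup(const struct av_range *av, size_t count,
              uint32_t host, struct av_addr *out)
{
    assert(av != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        if (host >= av[i].start && host <= av[i].end) {
            out->lid = (uint16_t)(av[i].base_lid + (host - av[i].start));
            out->sl = av[i].sl;
            return 0;
        }
    }
    return -1;
}

/* The two example ranges: host10-host1000 and host1001-host4999. */
static const struct av_range example_av[] = {
    { 10,   1000, 50,   1 },
    { 1001, 4999, 2000, 2 },
};
```

Under these assumptions, looking up host 1000 yields LID 1040 with SL 1, and host 1001 yields LID 2000 with SL 2. A per-host table would need one entry per address instead of one per range.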

#### Summary





- These concepts are necessary, not revolutionary
  - Communication addressing, optimized data transfers, app-centric interfaces, future-looking
  - Want a solution where the pieces fit tightly together





- Co-chair (sean.hefty@intel.com)
  - Meets Tuesdays from 9-10 PST / 12-1 EST
- Links
  - Mailing list subscription
    - http://lists.openfabrics.org/mailman/listinfo/ofiwg
  - Document downloads
    - https://www.openfabrics.org/downloads/OFIWG/
  - libfabric source tree
    - www.github.com/ofiwg/libfabric
  - libfabric sample programs
    - www.github.com/ofiwg/fabtests





### Verbs API Mismatch



#### Verbs Provider Mismatch



- For each work request: most often 1 (overlap operations)
  - Check for available queue space
  - Check SGL size: often 1 or 2 (fixed in source)
  - Check valid opcode
  - Check flags x 2: artifact of API
  - Check specific opcode
  - Switch on QP type: QP type usually fixed in source
  - Switch on opcode
  - Check flags: flags may be fixed, or app may have taken branches
  - For each SGE
    - Check size
    - Loop over length
  - Check flags
  - Check
  - Check for last request
  - Other checks x 3

Totals:

- 19+ branches including loops
- 100+ lines of C code
- 50-60 lines of code to HW

#### Verbs Completions Mismatch



#### Application accessed fields

    struct ibv_wc {
        uint64_t            wr_id;
        enum ibv_wc_status  status;
        enum ibv_wc_opcode  opcode;
        uint32_t            vendor_err;
        uint32_t            byte_len;
        uint32_t            imm_data;
        uint32_t            qp_num;
        uint32_t            src_qp;
        int                 wc_flags;
        uint16_t            pkey_index;
        uint16_t            slid;
        uint8_t             sl;
        uint8_t             dlid_path_bits;
    };

App must check both return code and status to determine if a request completed successfully

Provider must fill out all fields, even those ignored by the app

Provider must handle all types of completions from any QP

Developer must determine if fields apply to their QP

A single structure is 48 bytes and is likely to cross a cacheline boundary
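
The cacheline remark can be checked with a little arithmetic. The sketch below uses a local mirror of the completion layout so that it builds without the verbs headers; the mirror type, the placeholder enums, and the 64-byte cacheline size are assumptions, and the 48-byte size holds on common 64-bit ABIs where enums are 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder enums standing in for the real status and opcode types. */
enum wc_status { WC_STATUS_PLACEHOLDER };
enum wc_opcode { WC_OPCODE_PLACEHOLDER };

/* Local mirror of the completion layout, field for field. */
struct wc_mirror {
    uint64_t       wr_id;
    enum wc_status status;
    enum wc_opcode opcode;
    uint32_t       vendor_err;
    uint32_t       byte_len;
    uint32_t       imm_data;
    uint32_t       qp_num;
    uint32_t       src_qp;
    int            wc_flags;
    uint16_t       pkey_index;
    uint16_t       slid;
    uint8_t        sl;
    uint8_t        dlid_path_bits;
};

#define CACHELINE 64u   /* assumed cacheline size in bytes */

/* Count the entries that span two cachelines in a cacheline-aligned
 * array of n elements, each elem_size bytes long. */
size_t count_straddling(size_t elem_size, size_t n)
{
    size_t crossing = 0;

    assert(elem_size > 0);
    for (size_t i = 0; i < n; i++) {
        size_t first = (i * elem_size) / CACHELINE;
        size_t last  = (i * elem_size + elem_size - 1) / CACHELINE;
        if (first != last)
            crossing++;
    }
    return crossing;
}
```

With 48-byte entries in an aligned array, the pattern repeats every four entries and two of each four straddle a boundary, so roughly half of all completions touch two cachelines.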







- Ability of the underlying implementation to complete processing of an asynchronous request
- Need to consider ALL asynchronous requests
  - Connections, address resolution, data transfers, event processing, completions, etc.
- HW/SW mix

All(?) current solutions require significant software components





- Support two progress models
  - Automatic and implicit
- Separate operations as belonging to one of two progress domains
  - Data or control
  - Report progress model for each domain

| SAMPLE  | Implicit | Automatic        |
|---------|----------|------------------|
| Data    | Software | Hardware offload |
| Control | Software | Kernel services  |

### **Automatic Progress**



- Implies hardware offload model
  - Or standard kernel services / threads for control operations
- Once an operation is initiated, it will complete without further user intervention or calls into the API
- Automatic progress meets implicit model by definition

### **Implicit Progress**



- Implies significant software component
- Occurs when reading or waiting on EQ(s)
- Application can use separate EQs for control and data
- Progress limited to objects associated with selected EQ(s)
- App can request automatic progress
  - E.g. app wants to wait on native wait object
  - Implies provider allocated threading





- Applies to a single initiator endpoint performing data transfers to one target endpoint over the same data flow
  - Data flow may be a conceptual QoS level or path through the network
- Separate ordering domains
  - Completions, message, data
- Fenced ordering may be obtained using fi\_sync operation

### **Completion Ordering**



- Order in which operation completions are reported relative to their submission
- Unordered or ordered
  - No defined requirement for ordered completions
- Default: unordered

### Message Ordering



- Order in which message (transport) headers are processed
  - I.e. whether transport messages are received in or out of order
- Determined by selection of ordering bits
  - [Read | Write | Send] After [Read | Write | Send]
  - RAR, RAW, RAS, WAR, WAW, WAS, SAR, SAW, SAS
- Example:
  - fi\_order = 0 // unordered
  - fi\_order = RAR | RAW | RAS | WAW | WAS | SAW | SAS // IB/iWARP-like ordering
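
The ordering bits lend themselves to a plain bitmask. The sketch below assigns one bit per rule and encodes the IB/iWARP-like combination from the example; the constant names and bit positions are hypothetical, not values defined by the interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments: one bit per "X after Y" ordering rule. */
enum {
    ORDER_NONE = 0,
    ORDER_RAR  = 1 << 0,   /* read after read */
    ORDER_RAW  = 1 << 1,   /* read after write */
    ORDER_RAS  = 1 << 2,   /* read after send */
    ORDER_WAR  = 1 << 3,   /* write after read */
    ORDER_WAW  = 1 << 4,   /* write after write */
    ORDER_WAS  = 1 << 5,   /* write after send */
    ORDER_SAR  = 1 << 6,   /* send after read */
    ORDER_SAW  = 1 << 7,   /* send after write */
    ORDER_SAS  = 1 << 8    /* send after send */
};

/* The IB/iWARP-like combination from the example above. */
#define ORDER_IB_LIKE (ORDER_RAR | ORDER_RAW | ORDER_RAS | \
                       ORDER_WAW | ORDER_WAS | ORDER_SAW | ORDER_SAS)

/* Nonzero if every rule in 'needed' is guaranteed by 'provided'. */
int order_satisfies(uint32_t provided, uint32_t needed)
{
    return (provided & needed) == needed;
}
```

Note that the IB/iWARP-like mask leaves out WAR and SAR, so an application that needs a write or send ordered after an earlier read would not be satisfied by it.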

### Data Ordering



- Delivery order of transport data into target memory
  - Ordering per byte-addressable location
  - I.e. access to the same byte in memory
- Ordering constrained by message ordering rules
  - Must at least have message ordering first

### Data Ordering



- Ordering limited to message order size
  - E.g. MTU
  - In-order data delivery if transfer <= message order size
  - WAW, RAW, WAR sizes?
- Message order size = 0
  - No data ordering
- Message order size = -1
  - All data ordered
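
The message order size rule reduces to a three-way test. A minimal sketch, with a hypothetical function name and the special values 0 and -1 taken from the list above:

```c
#include <assert.h>
#include <stddef.h>

#define ORDER_SIZE_NONE 0L     /* no data ordering */
#define ORDER_SIZE_ALL  (-1L)  /* all data ordered */

/* Returns 1 if a transfer of transfer_len bytes is delivered in order
 * for the given message order size, 0 otherwise. */
int data_is_ordered(long order_size, size_t transfer_len)
{
    assert(order_size >= -1);
    if (order_size == ORDER_SIZE_ALL)
        return 1;
    if (order_size == ORDER_SIZE_NONE)
        return 0;
    return transfer_len <= (size_t)order_size;
}
```

For example, with an order size of 4096 (say, one MTU), a 4096-byte transfer is ordered and a 4097-byte transfer is not.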



- Ordering to different target endpoints not defined
- Per message ordering semantics implemented using different data flows
  - Data flows may be less flexible, but easier to optimize for
  - Endpoint aliases may be configured to use different data flows



- Support both thread safe and lockless models
  - Compile time and run time support
  - Run-time limited to compiled support
- Lockless (based on MPI model)
  - Single: single-threaded app
  - Funneled: only 1 thread calls into interfaces
  - Serialized: only 1 thread at a time calls into interfaces
- Thread safe
  - Multiple: multi-threaded app, with no restrictions
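
The split between compile-time and run-time support could be sketched as follows. The enum, the `COMPILED_THREAD_MODEL` macro, and both functions are hypothetical; the point is only that a run-time request is limited by what the build supports, and that the lockless levels let a provider skip locking.

```c
#include <assert.h>

/* Hypothetical threading levels, from most to least restrictive. */
enum thread_model {
    THREAD_SINGLE,      /* single-threaded app */
    THREAD_FUNNELED,    /* only one thread calls into interfaces */
    THREAD_SERIALIZED,  /* only one thread at a time calls in */
    THREAD_MULTIPLE     /* multi-threaded app, no restrictions */
};

/* Highest level compiled into this build. A lockless build stops at
 * THREAD_SERIALIZED; a thread-safe build would use THREAD_MULTIPLE. */
#ifndef COMPILED_THREAD_MODEL
#define COMPILED_THREAD_MODEL THREAD_SERIALIZED
#endif

/* Run-time selection is limited to compiled support.
 * Returns the granted level, or -1 if the request cannot be met. */
int select_thread_model(enum thread_model requested)
{
    if (requested > COMPILED_THREAD_MODEL)
        return -1;
    return (int)requested;
}

/* Only the unrestricted level requires the provider to take locks. */
int needs_locking(enum thread_model model)
{
    return model == THREAD_MULTIPLE;
}
```

With the default lockless build above, a request for `THREAD_FUNNELED` is granted, while a request for `THREAD_MULTIPLE` fails rather than silently running without locks.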





- Support both application and network buffering
  - Zero-copy for high-performance
  - Network buffering for ease of use
    - Buffering in local memory or NIC
  - In some cases, buffered transfers may be higher-performing (e.g. "inline")
- Registration option for local NIC access
  - Migration to fabric managed registration
- Required registration for remote access
  - Specify permissions





- Application optimized code paths based on usage model
- Optimize call(s) for single work request
  - Single data buffer
  - Still support more complex WR lists/SGL
- Per endpoint send/receive operations
  - Separate RMA function calls
- Pre-configure data transfer flags
  - Known before post request
  - Select software path through provider
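
Pre-configuring flags means that the choice of software path can be made once, outside the fast path. The sketch below assumes a hypothetical endpoint whose send routine is selected when flags are set; `EP_INLINE` and the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

struct ep;
typedef long (*send_fn)(struct ep *ep, const void *buf, size_t len);

/* Hypothetical endpoint: flags are known before any request is posted,
 * so the matching send routine is chosen once at configuration time. */
struct ep {
    send_fn  send;
    unsigned flags;
    size_t   inline_sends;
    size_t   zero_copy_sends;
};

enum { EP_INLINE = 1u << 0 };   /* illustrative flag */

/* Buffered ("inline") path: data would be copied toward the NIC. */
long send_inline(struct ep *ep, const void *buf, size_t len)
{
    (void)buf;
    ep->inline_sends++;
    return (long)len;
}

/* Zero-copy path: the NIC would read directly from the app buffer. */
long send_zero_copy(struct ep *ep, const void *buf, size_t len)
{
    (void)buf;
    ep->zero_copy_sends++;
    return (long)len;
}

/* Control path: interpret the flags once and select the routine. */
void ep_configure(struct ep *ep, unsigned flags)
{
    assert(ep != NULL);
    ep->flags = flags;
    ep->send = (flags & EP_INLINE) ? send_inline : send_zero_copy;
}

/* Fast path: a single indirect call, with no flag checks per request. */
static inline long ep_send(struct ep *ep, const void *buf, size_t len)
{
    return ep->send(ep, buf, len);
}
```

After `ep_configure(&ep, EP_INLINE)`, every `ep_send` goes straight to `send_inline`; reconfiguring with no flags switches to `send_zero_copy`. The per-request flag branches seen in the generic path are gone from the data path.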