

15th ANNUAL WORKSHOP 2019

# NVME OVER FABRICS OFFLOAD

Tzahi Oved

**Mellanox Technologies** 

[ March, 2019 ]







### **NVME INTRODUCTION**

#### Standard PCIe host controller interface for solid-state storage

- Driven by industry consortium of 80+ members
- Standardize feature, command, and register sets
- Enhance PCIe capabilities: low latency, scalable BW, power efficiency etc.

#### Focus on efficiency, scalability and performance

- All parameters for 4KB command in single 64B DMA fetch
- Simple command set (13 required commands)
- Supports MSI-X and interrupt steering
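
As a rough illustration of the single 64B fetch, the sketch below lays out a submission queue entry following the common command format of the NVMe base specification. The struct and field names are this sketch's own, not taken from the slides, and the read helper assumes a 4 KB LBA format and a page-aligned buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout of a 64-byte NVMe submission queue entry (SQE).
 * Field names are hypothetical; sizes follow the common command format. */
struct nvme_sqe {
    uint8_t  opcode;       /* command opcode */
    uint8_t  flags;        /* fused operation / PRP-or-SGL selector */
    uint16_t command_id;   /* echoed back in the completion entry */
    uint32_t nsid;         /* namespace identifier */
    uint64_t reserved;     /* command dwords 2-3 */
    uint64_t metadata_ptr; /* metadata pointer */
    uint64_t prp1;         /* first data pointer */
    uint64_t prp2;         /* second data pointer or PRP list */
    uint32_t cdw10_15[6];  /* command-specific dwords 10-15 */
};

static_assert(sizeof(struct nvme_sqe) == 64, "an SQE is exactly 64 bytes");

/* Build a read of one logical block (assumed 4 KB): every parameter
 * the device needs fits in this one entry. */
static struct nvme_sqe nvme_build_read_4k(uint16_t cid, uint32_t nsid,
                                          uint64_t lba, uint64_t buf_addr)
{
    struct nvme_sqe sqe;

    memset(&sqe, 0, sizeof(sqe));
    sqe.opcode = 0x02;                        /* NVM command set: Read */
    sqe.command_id = cid;
    sqe.nsid = nsid;
    sqe.prp1 = buf_addr;                      /* one page needs one PRP entry */
    sqe.cdw10_15[0] = (uint32_t)lba;          /* starting LBA, low 32 bits */
    sqe.cdw10_15[1] = (uint32_t)(lba >> 32);  /* starting LBA, high 32 bits */
    sqe.cdw10_15[2] = 0;                      /* block count is zero-based */
    return sqe;
}
```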





### THE NEED FOR NVME OVER FABRICS

- Regular NVMe Devices "Captive" In the Server
  - Only supports drives in or near server/storage box
  - Limits number of PCIe-attached devices
- PCIe, SAS and SATA Don't Support Scale-Out
  - Distance limitations; Difficult to share
  - Limited robustness and error handling

#### **NVM Express, Inc. solution**

- NVMe Over Fabrics Announced September 2014
  - Preserves NVMe command set
  - Simplifies storage virtualization, migration, & failover
  - Allows scale-out without SCSI protocol translation
  - Goal: <10µs added latency compared to local NVMe
- Standard V1.0 done, published June 2016

### **EXAMPLE USE CASE**



## **NVME MAPPING TO FABRICS**

| NVMe                                | NVMe-oF                                  |
|-------------------------------------|------------------------------------------|
| Submission Queue (SQ)               | QP SQ                                    |
| Completion Queue (CQ)               | QP RQ (+CQ)                              |
| Host write SQE, ring doorbell       | Host SEND SQE capsule, ring doorbell     |
| Device write CQE, interrupt or poll | Target SEND CQE capsule, RX CQE int/poll |
| PCIe data exchange                  | RDMA RD/WR, immediate up to 8K           |
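
The capsule rows of the table can be sketched in terms of sizes. This assumes the NVMe-oF convention that capsule sizes and the in-capsule data offset are expressed in 16-byte units, and it reads the "8K" figure above as in-capsule data. All names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NVME_SQE_SIZE = 64, NVME_CQE_SIZE = 16, NVMF_CAPSULE_UNIT = 16 };

/* 16-byte completion queue entry; the target SENDs it as the response capsule. */
struct nvme_cqe {
    uint32_t result;      /* command-specific dword 0 */
    uint32_t reserved;
    uint16_t sq_head;     /* current SQ head pointer */
    uint16_t sq_id;       /* SQ the command was submitted on */
    uint16_t command_id;  /* matches the SQE's command identifier */
    uint16_t status;      /* phase tag in bit 0, status field above it */
};

static_assert(sizeof(struct nvme_cqe) == NVME_CQE_SIZE, "a CQE is 16 bytes");

/* Command capsule size in bytes for an ioccsz given in 16-byte units. */
static size_t nvmf_cmd_capsule_bytes(uint32_t ioccsz)
{
    return (size_t)ioccsz * NVMF_CAPSULE_UNIT;
}

/* Bytes available for in-capsule data after the SQE and the icdoff gap. */
static size_t nvmf_in_capsule_data_bytes(uint32_t ioccsz, uint16_t icdoff)
{
    size_t total = nvmf_cmd_capsule_bytes(ioccsz);
    size_t data_start = NVME_SQE_SIZE + (size_t)icdoff * NVMF_CAPSULE_UNIT;

    return total > data_start ? total - data_start : 0;
}
```

Under these assumptions an ioccsz of 516 (64 B SQE plus 8192 B of data) yields 8 KB of in-capsule data, while the minimum of 4 leaves room for the SQE only.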

### COMMUNITY NVMF TARGET KERNEL



## **NVME-OF TARGET OFFLOAD FLOW**



## **NVME-OF TARGET OFFLOAD WITH CMB**







## STATUS

### Submitted RFC

https://www.spinics.net/lists/linux-rdma/msg58512.html

### Added/Updated files

| File                                      | Lines changed |
|-------------------------------------------|---------------|
| Documentation/nvmf_offload.md             | 172           |
| libibverbs/man/ibv_create_srq_ex.3        | 48            |
| libibverbs/man/ibv_get_async_event.3      | 15            |
| libibverbs/man/ibv_map_nvmf_nsid.3        | 89            |
| libibverbs/man/ibv_qp_set_nvmf.3          | 53            |
| libibverbs/man/ibv_query_device_ex.3      | 26            |
| libibverbs/man/ibv_srq_create_nvme_ctrl.3 | 89            |
| libibverbs/verbs.h                        | 107           |

## **VERBS FLOW**



## IDENTIFY - IBV\_QUERY\_DEVICE\_EX()

- New NVMf caps
- Offload types
- Supported min/max values

```
struct ibv_device_attr_ex {
    ...
    struct ibv_nvmf_caps nvmf_caps;
};

struct ibv_nvmf_caps {
    enum nvmf_offload_type offload_type;
    uint32_t max_backend_ctrls_total;
    uint32_t max_namespaces;
    uint32_t max_staging_buf_pages;
    uint32_t max_io_sz;
    uint16_t max_nvme_queue_sz;
    uint16_t min_nvme_queue_sz;
    uint32_t min_ioccsz;
    uint32_t max_icdoff;
};
```
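
A minimal sketch of how software might check a planned configuration against the reported min/max values before creating the SRQ. The two structs are local stand-ins carrying a subset of the fields shown above, and the function name is hypothetical; the real definitions would come from the proposed verbs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a subset of the capability fields shown above. */
struct caps_subset {
    uint32_t max_namespaces;
    uint32_t max_io_sz;
    uint16_t max_nvme_queue_sz;
    uint16_t min_nvme_queue_sz;
    uint32_t min_ioccsz;
    uint32_t max_icdoff;
};

/* Stand-in for the values software intends to request. */
struct request_subset {
    uint32_t max_namespaces;
    uint32_t max_io_sz;
    uint16_t nvme_queue_sz;
    uint32_t ioccsz;
    uint16_t icdoff;
};

/* True when every requested value lies inside the advertised range. */
static bool nvmf_request_fits_caps(const struct caps_subset *caps,
                                   const struct request_subset *req)
{
    assert(caps != NULL && req != NULL);
    return req->max_namespaces <= caps->max_namespaces &&
           req->max_io_sz <= caps->max_io_sz &&
           req->nvme_queue_sz >= caps->min_nvme_queue_sz &&
           req->nvme_queue_sz <= caps->max_nvme_queue_sz &&
           req->ioccsz >= caps->min_ioccsz &&
           req->icdoff <= caps->max_icdoff;
}
```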

## CREATE SRQ - IBV\_CREATE\_SRQ\_EX()

- Represents a fabric-facing NVMf target
- Set params according to caps
- Add staging buffer
  - Use MR!

```
struct ibv_srq_init_attr_ex {
        ...
        struct ibv_nvmf_attrs nvmf_attr;
};

struct ibv_nvmf_attrs {
        enum nvmf_offload_type offload_type;
        uint32_t               max_namespaces;
        uint8_t                nvme_log_page_sz;
        uint32_t               ioccsz;
        uint16_t               icdoff;
        uint32_t               max_io_sz;
        uint16_t               nvme_queue_sz;
        struct ibv_mr         *staging_buf_mr;
        uint64_t               staging_buf_addr;
        uint64_t               staging_buf_len;
};
```
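
Since the staging buffer must live inside a registered MR, a sanity check along these lines may be useful before filling in the attributes. This is a sketch only: `struct mr_range` stands in for the `addr` and `length` fields of `struct ibv_mr`, and treating `max_staging_buf_pages` as a count of `page_sz`-byte pages is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the two struct ibv_mr fields this sketch needs. */
struct mr_range {
    void  *addr;    /* start of the registered region */
    size_t length;  /* length of the registered region */
};

/* True when [buf_addr, buf_addr + buf_len) lies inside the MR and
 * spans at most max_pages pages of page_sz bytes. */
static bool staging_buf_ok(const struct mr_range *mr, uint64_t buf_addr,
                           uint64_t buf_len, uint64_t page_sz,
                           uint32_t max_pages)
{
    assert(mr != NULL && page_sz != 0);

    uint64_t mr_start = (uint64_t)(uintptr_t)mr->addr;
    uint64_t mr_len = mr->length;

    if (buf_len == 0 || buf_addr < mr_start)
        return false;

    uint64_t off = buf_addr - mr_start;
    if (off > mr_len || buf_len > mr_len - off)  /* overflow-safe bounds */
        return false;

    /* Round the length up to whole pages without risking overflow. */
    uint64_t pages = buf_len / page_sz + (buf_len % page_sz != 0);
    return pages <= max_pages;
}
```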

## CREATE NVME BACKEND - IBV\_SRQ\_CREATE\_NVME\_CTRL()

- ibv\_nvme\_ctrl belongs to specific SRQ
- Attributes are
  - NVMe SQ
  - NVMe CQ
  - NVMe SQ-DB
  - NVMe CQ-DB
  - NVMe SQ-DB init val
  - NVMe CQ-DB init val
- SQ, CQ, and DBs are described by
  - {MR, VA, LEN}
  - Need to ibv\_reg\_mr()

```
struct ibv_nvme_ctrl *ibv_srq_create_nvme_ctrl(
        struct ibv_srq *srq,
        struct nvme_ctrl_attrs *nvme_attrs);

int ibv_srq_remove_nvme_ctrl(
        struct ibv_srq *srq,
        struct ibv_nvme_ctrl *nvme_ctrl);

struct ibv_mr_sg {
        struct ibv_mr *mr;
        union {
                void     *addr;
                uint64_t  offset;
        };
        uint64_t len;
};

struct nvme_ctrl_attrs {
        struct ibv_mr_sg sq_buf;
        struct ibv_mr_sg cq_buf;
        struct ibv_mr_sg sqdb;
        struct ibv_mr_sg cqdb;
        uint16_t         sqdb_ini;
        uint16_t         cqdb_ini;
        uint16_t         timeout_ms;
        uint32_t         comp_mask;
};
```
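
The lengths and doorbell locations that go into the {MR, VA, LEN} descriptors follow from the NVMe register layout. The helpers below sketch that arithmetic, assuming the doorbell MR covers the device's BAR0 so the `offset` member of the union can be used; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { NVME_DB_BASE = 0x1000, NVME_SQE_BYTES = 64, NVME_CQE_BYTES = 16 };

/* BAR0 offset of the tail doorbell of submission queue qid.
 * dstrd is the doorbell stride field from the controller's CAP register. */
static uint64_t nvme_sq_db_offset(uint16_t qid, uint8_t dstrd)
{
    assert(dstrd < 16);  /* CAP.DSTRD is a 4-bit field */
    return NVME_DB_BASE + (uint64_t)(2u * qid) * (4u << dstrd);
}

/* BAR0 offset of the head doorbell of completion queue qid. */
static uint64_t nvme_cq_db_offset(uint16_t qid, uint8_t dstrd)
{
    assert(dstrd < 16);
    return NVME_DB_BASE + (uint64_t)(2u * qid + 1u) * (4u << dstrd);
}

/* Buffer lengths for queues with the given number of entries. */
static uint64_t nvme_sq_bytes(uint16_t entries)
{
    return (uint64_t)entries * NVME_SQE_BYTES;
}

static uint64_t nvme_cq_bytes(uint16_t entries)
{
    return (uint64_t)entries * NVME_CQE_BYTES;
}
```

For example, with a stride of 0, queue pair 1 would use `sqdb.offset = 0x1008` and `cqdb.offset = 0x100c`, each with a length of 4 bytes.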

## MAP NAMESPACES - IBV\_MAP\_NVMF\_NSID()

### New map within the subsystem between

```
{ fe_nsid } -> { nvme_ctrl, nvme_nsid, params }
```

### SRQ is available from the nvme\_ctrl

It was created for a specific SRQ

### To map the same NVMe to different SRQ

- Meaning to different NVMe-oF subsystems
- Create different nvme\_ctrl with each SRQ
- Each nvme\_ctrl represents exclusive NVMe queues on the same NVMe device
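
The mapping for one SRQ (one subsystem) can be pictured as a small lookup table. The sketch below models only the relation; in the proposed API the map would be programmed into the device through the new verb. The types and function names are hypothetical, and `ctrl_index` stands in for a `struct ibv_nvme_ctrl` pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPINGS 8  /* arbitrary limit for the sketch */

struct ns_mapping {
    uint32_t fe_nsid;     /* namespace ID seen by the fabric host */
    int      ctrl_index;  /* stand-in for a struct ibv_nvme_ctrl pointer */
    uint32_t nvme_nsid;   /* namespace ID on the backend NVMe device */
};

struct ns_map {
    struct ns_mapping entries[MAX_MAPPINGS];
    size_t count;
};

/* Find the backend for a front-end namespace ID, or NULL if unmapped. */
static const struct ns_mapping *ns_map_lookup(const struct ns_map *map,
                                              uint32_t fe_nsid)
{
    assert(map != NULL);
    for (size_t i = 0; i < map->count; i++)
        if (map->entries[i].fe_nsid == fe_nsid)
            return &map->entries[i];
    return NULL;
}

/* Add a mapping; returns 0 on success, -1 if the table is full or
 * fe_nsid is already mapped within this subsystem. */
static int ns_map_add(struct ns_map *map, uint32_t fe_nsid,
                      int ctrl_index, uint32_t nvme_nsid)
{
    assert(map != NULL);
    if (map->count == MAX_MAPPINGS || ns_map_lookup(map, fe_nsid) != NULL)
        return -1;
    map->entries[map->count++] =
        (struct ns_mapping){ fe_nsid, ctrl_index, nvme_nsid };
    return 0;
}
```

Two front-end namespaces may point at the same backend `nvme_nsid` through different controllers, but each `fe_nsid` resolves to exactly one backend.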

## ENABLE QP - IBV\_QP\_SET\_NVMF()

- Create QP as normal
  - RDMA-CM in case of NVMf
- Enable QP for NVMf offload
  - New verb specifically for NVMf attrs
- First message is CONNECT
  - No offload should be done before it
  - Software will enable offload after seeing CONNECT
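
The CONNECT-first rule can be sketched as a check on the first received capsule. The opcode (0x7f) and Connect command type (0x01) follow the NVMe-oF Fabrics command format; `struct qp_state` and the function names are hypothetical, and the call to the new enable verb is only indicated by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    NVME_OPC_FABRICS    = 0x7f,  /* opcode shared by all Fabrics commands */
    NVMF_FCTYPE_CONNECT = 0x01,  /* Fabrics command type: Connect */
    NVME_SQE_LEN        = 64,
};

/* True when the received capsule starts with a Fabrics CONNECT command. */
static bool nvmf_is_connect(const uint8_t *sqe, size_t len)
{
    assert(sqe != NULL);
    return len >= NVME_SQE_LEN &&
           sqe[0] == NVME_OPC_FABRICS &&    /* byte 0: opcode */
           sqe[4] == NVMF_FCTYPE_CONNECT;   /* byte 4: fctype */
}

struct qp_state {
    bool offload_enabled;
};

/* Called for each capsule software receives on the QP.
 * Returns true at the moment offload gets switched on. */
static bool qp_on_capsule(struct qp_state *qp, const uint8_t *sqe, size_t len)
{
    assert(qp != NULL);
    if (!qp->offload_enabled && nvmf_is_connect(sqe, len)) {
        /* Software would validate CONNECT, post the response, and only
         * then call the new enable verb for this QP. */
        qp->offload_enabled = true;
        return true;
    }
    return false;
}
```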

### **EXCEPTIONS**

#### Non-offloaded operations

- Go as normal to SRQ and CQ
- Software can post responses on QP

#### QP goes to error

- Async event IBV\_EVENT\_QP\_FATAL
- Software may not see CQE with errors...

#### NVMe errors

- New async event, nvme\_ctrl scope
- PCI error (when reading CQ)
- Command timeout

#### Must listen to async events!
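
A sketch of the dispatch logic such a listener might apply. In a real application the events would come from `ibv_get_async_event()` and be released with `ibv_ack_async_event()`; here local enums stand in so the logic is self-contained. `SKETCH_EV_QP_FATAL` mirrors `IBV_EVENT_QP_FATAL`, the two NVMe event names are hypothetical, and the recovery actions are assumptions rather than behaviour stated in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for async event types. */
enum sketch_event {
    SKETCH_EV_QP_FATAL,      /* mirrors IBV_EVENT_QP_FATAL */
    SKETCH_EV_NVME_PCI_ERR,  /* hypothetical: PCI error when reading CQ */
    SKETCH_EV_NVME_TIMEOUT,  /* hypothetical: backend command timeout */
    SKETCH_EV_OTHER,
};

enum sketch_action {
    SKETCH_ACT_IGNORE,
    SKETCH_ACT_TEARDOWN_QP,        /* QP scope: disconnect the host */
    SKETCH_ACT_RECOVER_NVME_CTRL,  /* nvme_ctrl scope: recover the backend */
};

/* Map an event to the recovery work software would start. */
static enum sketch_action action_for_event(enum sketch_event ev)
{
    switch (ev) {
    case SKETCH_EV_QP_FATAL:
        /* No error CQE may ever arrive, so the event itself is the trigger. */
        return SKETCH_ACT_TEARDOWN_QP;
    case SKETCH_EV_NVME_PCI_ERR:
    case SKETCH_EV_NVME_TIMEOUT:
        return SKETCH_ACT_RECOVER_NVME_CTRL;
    default:
        return SKETCH_ACT_IGNORE;
    }
}

/* Count how many events in a batch require recovery work. */
static size_t count_recovery_events(const enum sketch_event *evs, size_t n)
{
    size_t needed = 0;

    assert(evs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (action_for_event(evs[i]) != SKETCH_ACT_IGNORE)
            needed++;
    return needed;
}
```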





### **NVME EMULATION - MANAGEMENT**

### Implemented out of band

- Connect to cloud provider management/orchestration
- Implement any proprietary protocol
- Direct from network, bypassing host
- Utilizing NVMe Emulation SDK

### Vendor-specific Admin commands

- Using the NVMe driver
- Limited / controlled capabilities
- Example: allow user to choose QoS policy



### NVME EMULATION DATAPATH

#### SDK

- Handle NVMe registers and admin queue
- Efficient memory management based on SPDK
- Zero-copy all the way
- Full polling
- Multi queues, multi threads, lockless
- Well defined APIs: vBdev, Bdev drivers...






# THANK YOU

Tzahi Oved

**Mellanox Technologies** 
