

**2023 OFA Virtual Workshop** 

UNIVERSAL CHIPLET INTERCONNECT EXPRESS™ (UCIE™): BUILDING AN OPEN ECOSYSTEM OF CHIPLETS FOR ON-PACKAGE INNOVATIONS

> Keynote by: Dr. Debendra Das Sharma, Chair, UCIe Consortium

Intel Senior Fellow and co-GM Memory and I/O Technologies, Intel Corporation

### AGENDA

- Mega-Trends in compute landscape
- Interconnects and Fabrics: an important pillar of compute
- On-Package Interconnects: Opportunities and Challenges
- Universal Chiplet Interconnect Express (UCIe): An Open Standard for Chiplets
- Future Directions

### **MEGA-TRENDS IN THE COMPUTE LANDSCAPE**





- Insatiable demand for compute, storage, and data movement
- Innovative applications lead to more demand, which in turn leads to more innovation
- Interconnect is an important pillar of compute
  - Compute, storage/ memory, interconnect, software, process technology, security

### **EXPLOSION OF DATA ENABLING DATA-CENTRIC REVOLUTION**



© OpenFabrics Alliance








# **INTERCONNECTS: TAXONOMY, CHARACTERISTICS AND TRENDS**



Load-Store I/O: from die/ package/node to rack/pod

# LOAD-STORE INTERCONNECTS : PCIE AND CXL

#### With PCIe: (900+ member companies)

- Memory connected to CPU is cacheable
- Memory Connected to PCIe device is Uncacheable
- Different Ordering rules across I/O vs coherency domains
- Ubiquitous I/O for compute continuum

### With CXL: (~200 member companies)

- Caching and memory protocols on top of PCIe
- Device can cache memory
- Memory attached to device is cacheable
- Leverages PCIe infrastructure

### PCIe and CXL very successful industry standards:

- Multi-generational, backward compatible, IP/ tools
- Compliance program with plug-and-play

On-Package Interconnects should leverage PCIe/CXL infrastructure for standardization and Load-Store usages. Need to seamlessly move functionality from node to package to die level.



### DESIGN CHOICE: SEAMLESS INTEGRATION FROM NODE $\rightarrow$ PACKAGE $\rightarrow$ ON-DIE ENABLES REUSE, BETTER USER EXPERIENCE




### **MOORE PREDICTED "DAY OF RECKONING"**



 "It may prove to be more economical to build large systems out of smaller
 functions, which are separately packaged and interconnected<sup>1</sup>."

### Gordon E. Moore

<sup>1</sup>: "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, April 19, 1965

### **DRIVERS FOR ON-PACKAGE CHIPLETS**

- Reticle Limit, yield optimization, scalable performance
- Same dies on package (Scale-up)
- Increasing design costs at leading edge process nodes
  - Die-disaggregated dies across different nodes
- Use new process node for advanced functionality
- Time to Market (Late binding)
- Custom silicon for different customers leveraging a base product
- Different acceleration functions with common compute
- Different process nodes optimized for different functions
- Memory, logic, analog, co-packaged optics
- High power-efficient bandwidth with low-latency access (e.g., HBM memory)



Source: IBS (as cited in IEEE Heterogeneous Integration Roadmap)

# **COMPONENTS OF CHIPLET INTEROPERABILITY**



Source: UCIe<sup>™</sup> Consortium


# **MOTIVATION FOR UCIE**

#### **OPEN CHIPLET: PLATFORM ON A PACKAGE**

(Figure labels: High-Speed Standardized Chip-to-Chip Interface (UCIe); Customer IP & Customized Chiplets; Sea of Cores (heterogeneous); Memory; Advanced 2D/2.5D/3D Packaging)

- 20X I/O performance at 1/20th power vs off-package SerDes at launch
- Gap more prominent with better on-package technologies in future

Heterogeneous Integration Fueled by an Open Chiplet Ecosystem (mix-and-match chiplets from different process nodes / fabs / companies / assembly)

Source: UCIe<sup>™</sup> Consortium

- Enables SoC construction overcoming reticle limits
  - Package is new System-on-a-Chip (SoC): Scale Up
- Reduces time-to-solution (e.g., enables die reuse)
- Lowers portfolio cost (product & project)
  - Enables optimal process technologies
  - Smaller (better yield)
  - Reduces IP porting costs
  - Lowers product SKU cost
- Enables a customizable, standard-based product for specific use cases (bespoke solutions)
- Scales innovation (Mfg process locked IPs)

UCIe Goal: Align Industry around an open platform to enable chiplet based solutions

# **UCIE: KEY METRICS AND ADOPTION CRITERIA**

#### **Key Performance Indicators**

- Bandwidth density (linear & area)
  - Data Rate & Bump Pitch
- Energy Efficiency (pJ/b)
  - Scalable energy consumption
  - Low idle power (entry/exit time)
- Latency (end-to-end: Tx+Rx)
- Channel Reach
  - Technology, frequency, & BER
- Reliability & Availability
- Cost: Standard vs advanced packaging

#### **Factors Affecting Wide Adoption**

- Interoperability
  - Full-stack, plug-and-play with existing s/w
  - Different usages/segments ubiquity
- Technology
  - Across process nodes & packaging options
  - Power delivery & cooling
  - Repair strategy (failure/yield improvement)
  - Debug controllability & observability
- Broad industry support / Open ecosystem
  - Learnings from other standards efforts

UCIe is architected and specified from the ground-up to deliver the best KPIs while meeting wide adoption criteria

### **UCIE 1.0 SPECIFICATION**

#### Layered Approach with industry-leading KPIs

Physical Layer: Die-to-Die I/O

#### Die to Die Adapter: Reliable delivery

- Support for multiple protocols: bypassed in raw mode
- Protocol: CXL/PCIe and Streaming
  - CXL<sup>™</sup>/PCIe<sup>®</sup> for volume attach and plug-and-play
    - SoC construction issues are addressed w/ CXL/PCIe
    - CXL/PCIe addresses common use cases: I/O attach, Memory, Accelerator
  - Streaming for other protocols
    - Scale-up (e.g., CPU/ GP-GPU/Switch from smaller dies)
    - Protocol can be anything (e.g., AXI/CHI/SFI/CPI/ etc)

# Well defined Spec: interoperability and future evolution

- Configuration register for discovery and run-time
  - control and status reporting in each layer
  - transparent to existing drivers
- Form-factor and Management
- <u>Compliance</u> for interoperability
- Plug-and-play IPs with RDI/ FDI interface



# UCIE 1.0: SUPPORTS STANDARD AND ADVANCED PACKAGES



(Standard Package)

Standard Package: 2D – cost effective, longer distance

Advanced Package: 2.5D – power-efficient, high bandwidth density

Dies can be manufactured anywhere and assembled anywhere – can mix 2D and 2.5D in same package – Flexibility for SoC designer







#### (Multiple Advanced Package Options)

Source: UCIe<sup>™</sup> Consortium

One UCIe 1.0 Spec covers both type of packaging options

# **UCIE PHY: BUMP-OUT FOR INTEROPERABILITY**

### UCIe architected with process portability in mind

- Circuit components can be built with common digital/ analog structures
- Bump-out specified in the specification for interoperability even with future bump-pitch reductions
- Die rotation and mirroring supported



(UCIe-S Unstacked Bump-out)





# **PHYSICAL LAYER**

#### Unit is One Module: uni-directional: 1, 2, or 4 modules form a Link

- 1 differential pair of forwarded clock
- Rest are single-ended
  - Data (16/64), Valid, Track
- Valid for effective power management
- Lane reversal on Transmit side
- Reliability: Spare Lanes in Adv; degradation in Std
- Data Rates: 4, 8, 12, 16, 24, 32 GT/s
- Sideband: always on, 800 MHz
  - 1 data and 1 clock each direction
  - Used for training, debug, mgmt, etc.
  - Depopulated GND bumps to ensure no extra shore-line
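
The module, link, and data-rate figures above combine into a raw per-direction bandwidth. The following is a minimal sketch of that arithmetic, not anything defined by UCIe: the `LinkConfig` class and its method name are hypothetical, the rates are treated as GT/s with one bit per transfer per data lane, and flit header and CRC overhead are ignored.

```python
from dataclasses import dataclass

# Figures from the slide: data lanes per module and the allowed rates.
SUPPORTED_RATES_GTPS = (4, 8, 12, 16, 24, 32)
DATA_LANES_PER_MODULE = {"standard": 16, "advanced": 64}
MODULES_PER_LINK = (1, 2, 4)


@dataclass(frozen=True)
class LinkConfig:
    """Hypothetical helper for illustration; not a UCIe-defined structure."""

    package: str    # "standard" or "advanced"
    modules: int    # 1, 2, or 4 modules form a Link
    rate_gtps: int  # per-lane data rate in GT/s

    def __post_init__(self) -> None:
        if self.package not in DATA_LANES_PER_MODULE:
            raise ValueError(f"unknown package type: {self.package}")
        if self.modules not in MODULES_PER_LINK:
            raise ValueError(f"unsupported module count: {self.modules}")
        if self.rate_gtps not in SUPPORTED_RATES_GTPS:
            raise ValueError(f"unsupported data rate: {self.rate_gtps}")

    def raw_bandwidth_gbytes(self) -> float:
        """Raw data bandwidth in one direction, in GB/s (no protocol overhead)."""
        lanes = DATA_LANES_PER_MODULE[self.package] * self.modules
        return lanes * self.rate_gtps / 8  # 8 bits per byte


print(LinkConfig("standard", 1, 32).raw_bandwidth_gbytes())  # 64.0
print(LinkConfig("advanced", 4, 32).raw_bandwidth_gbytes())  # 1024.0
```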



Source: UCIe<sup>™</sup> Consortium



# **D2D ADAPTER AND FLIT MAPPING THROUGH FDI**

Responsible for packetization

- Adds Flit Header (2B) and CRC (2B)

### Supported Flit Sizes: 68B and two flavors of 256B

- Decided at negotiation

Flit Hdr (2B): Protocol ID (3b), Credit (1b), Flit Ack/Nak management (2b command + 8b sequence number), Rsvd (2b)
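
The field widths listed for the Flit Header add up to 16 bits (3 + 1 + 2 + 8 + 2). As a rough sketch of how such a header could be packed and unpacked, the code below uses those widths but assumes the bit positions and byte order, which the slide does not give; the function names are hypothetical and the UCIe specification remains the authority on the actual layout.

```python
# Assumed layout (illustrative only), most significant bit first:
#   [15:13] Protocol ID, [12] Credit, [11:10] Ack/Nak command,
#   [9:2] sequence number, [1:0] reserved (written as 0)

def pack_flit_header(protocol_id: int, credit: int, ack_nak_cmd: int, seq: int) -> bytes:
    """Pack the four fields into a 2-byte header using the assumed layout."""
    if not 0 <= protocol_id < 8:
        raise ValueError("protocol_id must fit in 3 bits")
    if credit not in (0, 1):
        raise ValueError("credit must fit in 1 bit")
    if not 0 <= ack_nak_cmd < 4:
        raise ValueError("ack_nak_cmd must fit in 2 bits")
    if not 0 <= seq < 256:
        raise ValueError("seq must fit in 8 bits")
    word = (protocol_id << 13) | (credit << 12) | (ack_nak_cmd << 10) | (seq << 2)
    return word.to_bytes(2, "big")  # byte order is an assumption


def unpack_flit_header(hdr: bytes) -> dict:
    """Inverse of pack_flit_header; reserved bits are ignored."""
    if len(hdr) != 2:
        raise ValueError("flit header is exactly 2 bytes")
    word = int.from_bytes(hdr, "big")
    return {
        "protocol_id": (word >> 13) & 0x7,
        "credit": (word >> 12) & 0x1,
        "ack_nak_cmd": (word >> 10) & 0x3,
        "seq": (word >> 2) & 0xFF,
    }


hdr = pack_flit_header(protocol_id=5, credit=1, ack_nak_cmd=2, seq=0xAB)
print(hdr.hex())  # baac
print(unpack_flit_header(hdr))
```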

### CRC: Covers 128B payload (smaller payloads are zero-extended)

- Triple bit flip detection guarantee with 16 bits
- Replay if CRC fails
- Sample RTL code for CRC provided in the spec
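
To make the zero-extension step concrete, here is a minimal software sketch of a 16-bit CRC over a fixed 128-byte input. Several things in it are assumptions rather than statements from the slide: the generator polynomial value `0x8005`, the zero initial value, the MSB-first bit order, and appending the padding after the payload. The sample RTL in the specification is the reference; this only illustrates the mechanism.

```python
CRC_POLY = 0x8005        # assumed generator polynomial (x^16 + x^15 + x^2 + 1)
CRC_INPUT_BYTES = 128    # from the slide: CRC covers a 128B payload


def crc16(data: bytes, poly: int = CRC_POLY) -> int:
    """Bitwise CRC-16, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def flit_crc(payload: bytes) -> int:
    """Zero-extend the payload to 128 bytes, then compute the CRC.

    Padding position (after the payload) is an assumption for this sketch.
    """
    if len(payload) > CRC_INPUT_BYTES:
        raise ValueError("payload exceeds 128 bytes")
    padded = payload.ljust(CRC_INPUT_BYTES, b"\x00")
    return crc16(padded)


payload = bytes(range(64))
sent = flit_crc(payload)
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
# A mismatch on the receive side is what would trigger a replay.
print(flit_crc(corrupted) != sent)  # True
```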

| Byte offset | Field 1       | Field 2       | Field 3       | Field 4                                                 |
|-------------|---------------|---------------|---------------|---------------------------------------------------------|
| 0           | Flit Hdr (2B) | 62B of Flit 1 |               |                                                         |
| 64          | 2B of Flit 1  | CRC (2B)      | Flit Hdr (2B) | 58B of Flit 2, or all 0s if no Flit from Protocol Layer |

(a. 68-Byte Flit – usage CXL 2.0/ PCIe Non-Flit Mode/ Streaming)



(c. 256-Byte Latency-Optimized Flit – usage CXL 3.0/ Streaming)

Source: UCIe™ Consortium

## UCIE USAGE MODEL: SOC AT PACKAGE LEVEL

- SoC as a Package level construct
  - Standard and/ or Advanced package
  - Homogeneous and/or heterogeneous chiplets
  - Mix and match chiplets from multiple suppliers
- Across segments: Hand-held, Client, Server, Workstation, Comms, HPC, etc
  - Similar to PCIe/ CXL at board level
- Chiplet Types:
  - PCIe/CXL Based: Inference, video, networking (crypto, compression, NIC), memory expansion
  - Streaming: Scale-up, accelerators
  - SERDES: High-speed PHY for PCIe/ Networking (64/ 128G, 112/ 224G)



### **EXAMPLE SCALE-UP SOC FROM HOMOGENEOUS DIES: LARGE SWITCH WITH ON-DIE PROTOCOL AS STREAMING OVER UCIE**

- Need large radix CXL switches
  - challenges: reticle limit, cost, etc.
- UCIe based Chiplets for scale-up
  - 64G Gen6 x16b CXL links
  - UCIe as d2d interconnect: a switch vendor may prefer to transport their on-die interconnect protocol over UCIe rather than create a hierarchy of switches, which will not work for the CXL 2.0 tree-based topology
- Similar approach for other scale-up SoCs (CPU, GP-GPU, N/W Switches)







Large CXL switch (512 lanes)
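
The 512-lane figure can be related to the x16 links at 64 GT/s mentioned in this example with simple arithmetic. The sketch below assumes "x16b" means 16-lane links, counts one bit per transfer per lane, and ignores encoding and flit overhead; the function names are hypothetical.

```python
TOTAL_LANES = 512      # from the slide: large CXL switch (512 lanes)
LANES_PER_PORT = 16    # assumed reading of "x16b" as x16 links
RATE_GTPS = 64         # from the slide: 64G Gen6


def port_count(total_lanes: int, lanes_per_port: int) -> int:
    """Number of full-width ports the lane budget allows."""
    return total_lanes // lanes_per_port


def raw_port_bandwidth_gbytes(lanes: int, rate_gtps: int) -> float:
    """Raw bandwidth of one port in one direction, in GB/s."""
    return lanes * rate_gtps / 8


ports = port_count(TOTAL_LANES, LANES_PER_PORT)
per_port = raw_port_bandwidth_gbytes(LANES_PER_PORT, RATE_GTPS)
print(ports)             # 32
print(per_port)          # 128.0
print(ports * per_port)  # 4096.0
```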

Ack: Nathan Kalyanasundaram

Source: UCIe™ Consortium

### EXAMPLE SCALE-UP PACKAGE USING STREAMING AND OPEN-PLUG-IN USING PCIE/ CXL



 Any device type in this open plug-in slot with CXL (or CHI if both support it)

Ack: Marvin Denman, Bruce Mathewson, Francisco Socal, Durgesh Srivastava, Dong Wei

Source: UCIe<sup>™</sup> Consortium

defined data-link CRC and retry

### UCIE BASED SYSTEM TOPOLOGY: SOC AS WELL AS OFF-PACKAGE



Source: UCIe<sup>™</sup> Consortium


# SLIDE FROM MY 2021 OFA KEYNOTE ON FUTURE DIRECTIONS

#### Composable Disaggregated Infrastructure at Rack level

- Heterogeneous compute/ memory, storage, networking fabric resources
- connected through high bandwidth, low-latency Load-Store Interconnect
- delivering almost-identical performance per watt as independent servers
- w/ multiple domains w/ shared memory, message passing, atomics

#### Synergy between Networking and Load-store

- Expect boundaries to be fungible
- Fabric Manager, Multi-head, multi-domain, Atomics support, Persistence flows, Smart NIC with optimized flows to access system memory without involving host, VM migration

#### Challenges:

- Latency: NUMA optimization, low-latency switch
- Bandwidth demand: higher rate helps
- Power Efficiency
- Blast Radius containment and QoS
- Scaling: Moore's law and Dennard-scaling
- Copper-Optical transition point
- Software!

Key Takeaway (2023):

- CXL Protocol for Rack/Pod level dis-aggregation/ scale-up
- UCIe PHY for on-package and co-packaged optics for Rack/Pod
- PCIe/ CXL PHY for board/ Rack level



# **UCIE USAGE: OFF-PACKAGE CONNECTIVITY WITH UCIE RETIMERS**



(Vision: Load-Store I/O (CXL) as the fabric across the Pod providing low-latency and high bandwidth resource pooling/ sharing as well as message passing)



CXL/ PCIe

(Interconnects at drawer level)

Source: UCIe<sup>™</sup> Consortium


# **UCIE 1.0: CHARACTERISTICS AND KEY METRICS**

| CHARACTERISTICS      | STANDARD<br>PACKAGE  | ADVANCED<br>PACKAGE | COMMENTS                                                                |
|----------------------|----------------------|---------------------|-------------------------------------------------------------------------|
| Data Rate (GT/s)     | 4, 8, 12, 16, 24, 32 |                     | Lower speeds must be supported -interop (e.g., 4, 8, 12 for 12G device) |
| Width (each cluster) | 16                   | 64                  | Width degradation in Standard, spare lanes in Advanced                  |
| Bump Pitch (um)      | 100 - 130            | 25 - 55             | Interoperate across bump pitches in each package type across nodes      |
| Channel Reach (mm)   | <= 25                | <=2                 |                                                                         |
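
The comment on the Data Rate row implies a simple rule: a device supports every listed rate up to its maximum, so two devices can always meet at the highest rate both support. The sketch below illustrates only that rule; the function names are hypothetical, and the actual rate selection happens during link training as defined in the specification.

```python
UCIE_RATES_GTPS = (4, 8, 12, 16, 24, 32)  # from the table


def supported_rates(max_rate_gtps: int) -> tuple:
    """All rates a device must support, given its maximum rate."""
    if max_rate_gtps not in UCIE_RATES_GTPS:
        raise ValueError(f"not a listed data rate: {max_rate_gtps}")
    return tuple(r for r in UCIE_RATES_GTPS if r <= max_rate_gtps)


def common_rate(max_a_gtps: int, max_b_gtps: int) -> int:
    """Highest rate supported by both devices."""
    shared = set(supported_rates(max_a_gtps)) & set(supported_rates(max_b_gtps))
    return max(shared)


print(supported_rates(12))   # (4, 8, 12)
print(common_rate(32, 12))   # 12
```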

| KPIs / TARGET FOR<br>KEY METRICS                        | STANDARD<br>PACKAGE            | ADVANCED<br>PACKAGE | COMMENTS                                                                                 |
|---------------------------------------------------------|--------------------------------|---------------------|------------------------------------------------------------------------------------------|
| B/W Shoreline (GB/s/mm)                                 | 28 – 224                       | 165 - 1317          | Conservatively estimated: AP: 45u; Standard: 110u; Proportionate to data rate (4G – 32G) |
| B/W Density (GB/s/mm <sup>2</sup> )                     | 22-125                         | 188-1350            |                                                                                          |
| Power Efficiency target<br>(pJ/b)                       | 0.5                            | 0.25                |                                                                                          |
| Low-power entry/exit latency                            | 0.5ns <=16G, 0.5-1ns >=24G     |                     | Power savings estimated at $>= 85\%$                                                     |
| Latency (Tx + Rx)                                       | < 2ns                          |                     | Includes D2D Adapter and PHY (FDI to bump and back)                                      |
| Reliability (FIT)                                       | 0 < FIT (Failure In Time) << 1 |                     | FIT: #failures in a billion hours (expecting $\sim$ 1E-10) w/ UCIe Flit Mode             |

UCIe 1.0 delivers the best KPIs while meeting the projected needs for the next 5-6 years. Wide industry leader adoption spanning semiconductor, manufacturing, assembly, & cloud segments.
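
The power-efficiency and FIT targets in the table translate into concrete numbers with a little arithmetic. The sketch below is illustrative only: it applies the pJ/b targets to the raw data rate of a single module (16 or 64 lanes at 32 GT/s), treats the 85% figure as the saving in the low-power state, and uses hypothetical function names.

```python
def active_power_mw(pj_per_bit: float, lanes: int, rate_gtps: int) -> float:
    """Power in mW: (pJ/b) x (Gb/s) gives mW, since 1e-12 J x 1e9 /s = 1e-3 W."""
    return pj_per_bit * lanes * rate_gtps


def low_power_mw(active_mw: float, savings: float = 0.85) -> float:
    """Upper bound on low-power-state draw, given the estimated savings."""
    return active_mw * (1.0 - savings)


def expected_failures(fit: float, device_hours: float) -> float:
    """FIT is failures per billion device-hours."""
    return fit * device_hours / 1e9


standard = active_power_mw(0.5, lanes=16, rate_gtps=32)
advanced = active_power_mw(0.25, lanes=64, rate_gtps=32)
print(standard)  # 256.0
print(advanced)  # 512.0
print(f"{low_power_mw(standard):.1f} mW")  # 38.4 mW
```

On these assumptions, the advanced-package module moves four times the data of the standard-package module for twice the power.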

### **INGREDIENTS OF BROAD INTER-OPERABLE CHIPLET ECOSYSTEM**







# **FUTURE DIRECTIONS: COMPLIANCE TESTING**

- Golden Die is self tested with another golden die
  - calibrate the channels on the reference package
- UCIe devices tested against golden die for the following:
  - Physical layer: training, channel spec compliance, eye height/ width, BER, Tx swing, Rx voltage
  - Adapter layer: CRC, error detection/ replay, etc.
  - Protocol layer: PCIe and CXL leverage the respective protocol compliance suite
  - Assumption: DUT passes the sort and HVM prior to compliance
- Sideband plays a critical role to gather information on training and subsequent progression
- Formal compliance program needed (similar to PCI-SIG, CXL, USB)



# **FUTURE DIRECTIONS**

#### Challenges to solve:

- Set of foot-prints for chiplets
- Power delivery, Cooling
- Debug and test
  - Not all chiplets are accessible and UCIe cannot be probed on a package
  - Need a mechanism for controllability and observability from external package pins

#### Protocol Enhancements: Native on-package memory

#### Reduced bump pitches with advanced packaging:

- B/W density will increase as a square of the pitch reduction (e.g., pitch goes down from 45u -> 25u => B/W density will go up by 3.24X)
- Higher B/W => Reduced frequency (say ½) so that we still get higher b/w but can get power savings with simpler circuit (reduces capacitance) and lower voltage levels (CV<sup>2</sup>f) – get benefit of C and V reduction
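
The scaling argument above can be checked numerically. The sketch below reproduces the 3.24X density figure and then applies the CV²f relation under purely hypothetical scale factors for capacitance and voltage (0.8 and 0.9 are placeholders chosen for illustration, not values from the talk).

```python
def density_gain(old_pitch_um: float, new_pitch_um: float) -> float:
    """Bump density scales with the square of the pitch reduction."""
    return (old_pitch_um / new_pitch_um) ** 2


def relative_dynamic_power(c_scale: float, v_scale: float, f_scale: float) -> float:
    """Per-lane dynamic power relative to baseline, from P ~ C * V^2 * f."""
    return c_scale * v_scale ** 2 * f_scale


gain = density_gain(45, 25)
print(f"{gain:.2f}")  # 3.24

# Halve the frequency: bandwidth density still improves over the baseline.
f_scale = 0.5
print(f"{gain * f_scale:.2f}")  # 1.62

# Hypothetical C and V reductions enabled by the simpler, slower circuit.
c_scale, v_scale = 0.8, 0.9
per_lane_power = relative_dynamic_power(c_scale, v_scale, f_scale)
energy_per_bit = per_lane_power / f_scale  # proportional to C * V^2
print(f"{energy_per_bit:.3f}")  # 0.648
```

With these placeholder factors, bandwidth density rises by about 1.6X while energy per bit falls to roughly 65% of the baseline, which is the trade the slide describes.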

#### 3D chiplets

- 3D power-efficiency and latency will approach on-die levels
- Challenges: Reliability, debug, power delivery, cooling
- Combination of 3D, 2.5D, and 2D integration on package
- Interconnect b/w and power will be a challenge: skyway interconnects across 3D towers on package?



### CONCLUSIONS

- Chiplets and D2D interface are essential to the compute continuum
- Power-efficient performance, yield optimization, different functions, custom solutions, cost-effective
- UCIe standardization will propel the development of an open ecosystem
- Open plug-and-play "slot" at package level will unleash innovations
- Evolution needs to track the underlying packaging technology to deliver compelling metrics
- Form-factor, New Protocols, and manageability are some other areas for innovation
- The open chiplet journey with UCIe just started! Join us in what will be an exciting journey for decades!



2023 OFA Virtual Workshop

**THANK YOU** 

Dr. Debendra Das Sharma, Chair, UCIe Consortium

Intel Senior Fellow and co-GM Memory and I/O Technologies, Intel Corporation