Remote Persistent Memory: It Takes a Village, (or Perhaps a City)

/ / Consortium News, From the Chairs, OpenFabrics News

By Paul Grun, Chair, OpenFabrics Alliance and Senior Technologist, Cray, Inc.

Remote Persistent Memory, (RPM), is rapidly emerging as an important new technology. But understanding a new technology, and grasping its significance, requires engagement across a wide range of industry organizations, companies, and individuals. It takes a village, as they say.

Technologies that are capable of bending the arc of server architecture come along only rarely. It’s sometimes hard to see one coming because it can be tough to discern between a shiny new thing, an insignificant evolution in a minor technology, and a serious contender for the Technical Disrupter of the Year award. Remote Persistent Memory is one such technology, the ultimate impact of which is only now coming into view. Two relatively recent technologies serve to illustrate the point: The emergence of dedicated, high performance networks beginning in the early 2000s and more recently the arrival of non-volatile memory technologies, both of which are leaving a significant mark on the evolution of computer systems. But what happens when those two technologies are combined to deliver access to persistent memory over a fabric? It seems likely that such a development will positively impact the well-understood memory hierarchies that are the basis of all computer systems today. And that, in turn, could cause system architects and application programmers to re-think the way that information is accessed, shared, and stored. To help us bring the subject of RPM into sharp focus, there is currently a concerted effort underway to put some clear definition around what is shaping up to be a significant disrupter.

For those who aren’t familiar, Remote Persistent Memory refers to a persistent memory service that is accessed over a fabric or network. It may be a service shared among multiple users, or dedicated to one user or application. It’s distinguished from local Persistent Memory, which refers to a memory device attached locally to the processor via a memory or I/O bus, in that RPM is accessed via a high performance switched fabric. For our purposes, we’ll further refine our discussion to local fabrics, neglecting any discussion of accessing memory over the wide area.

Most important of all, Persistent Memory, including RPM, is definitely distinct from storage, whether that is file, object or block storage. That’s why we label this as a ‘memory’ service – to distinguish it from storage.  The key distinction is that the consumer of the service recognizes and uses it as it would any other level in the memory hierarchy. Even though the service could be implemented using block or file-oriented non-volatile memory devices, the key is in the way that an application accesses and uses the service. This isn’t faster or better storage, it’s a whole different kettle of fish.

So how do we go about discovering the ultimate value of a new technology like RPM? So far, a lively discussion has been taking place across multiple venues and industry events. These aren’t ad hoc discussions nor are they tightly scripted events; they are taking place in a loosely organized fashion designed to encourage lots of participation and keep the ball moving forward. Key discussions on the topic have hopscotched from the SNIA’s Storage Developers Conference, to SNIA/SSSI’s Persistent Memory Summit, to the OpenFabrics Alliance (OFA) Workshopand others. Each of these industry events has given us an opportunity for the community at large to discuss and develop the essential ideas surrounding RPM. The next installment will occur at the upcoming Flash Memory Summit in August where there will be four sessions all devoted to discussing Remote Persistent Memory.

Having frequent industry gatherings is a good thing, naturally, but that by itself doesn’t answer the question of how we go about progressing a discussion of Remote Persistent Memory in an orderly way.  A pretty clear consensus has emerged that RPM represents a new layer in the memory hierarchy and therefore the best way to approach it is to take a top-down perspective. That means starting with an examination of the various ways that an application could leverage this new player in the memory hierarchy. The idea is to identify and explore several key use cases. Of course, the technology is in its early infancy, so we’re relying on the best instincts of the industry at large to guide the discussion.

Once there is a clear idea of the ways that RPM could be applied to improve application performance, efficiency or resiliency, it’ll be time to describe how the features of an RPM service are exposed to an application. That means taking a hard look at network APIs to be sure they export the functions and features that applications will need to access the service. The API is key, because it defines the ways that an application actually accesses a new network service. Keep in mind that such a service may or may not be a natural fit to existing applications; in some cases, it will fit naturally meaning that an existing application can easily begin to utilize the service to improve performance or efficiency. For other applications, more work will be needed to fully exploit the new service.

Notice that the development of the API is being driven from the top down by application requirements. This is a clear break from traditional network design, where the underlying network and its associated API are defined roughly in tandem. Contrast that to the approach being taken with RPM, where the set of desired network characteristics is described in terms of how an application will actually use the network. Interesting!

Armed with a clear sense of how an application might use Remote Persistent Memory and the APIs needed to access it, now’s the time for network architects and protocol designers to deliver enhanced network protocols and semantics that are best able to deliver the features defined by the new network APIs. And it’s time for hardware and software designers to get to work implementing the service and integrating it into server systems.

With all that in mind, here’s the current state of affairs for those who may be interested in participating. SNIA, through its NVM Programming Technical Working Group, has published a public document describing one very important use case for RPM – High Availability. The document describes the requirements that the SNIA NVM Programming Model – first released in December 2013 — might place on a high-speed network.  That document is available online (https://www.snia.org/sites/default/files/technical_work/final/NVM_PM_Remote_Access_for_High_Availability_v1.0.pdf). In keeping with the ‘top-down’ theme, SNIA’s work begins with an examination of the programming models that might leverage a Remote Persistent Memory service, and then explores the resulting impacts on network design. It is being used today to describe enhancements to existing APIs including both the Verbs API and the libfabric API.

In addition, SNIA and the OFA have established a collaboration to explore other use cases, with the idea that those use cases will drive additional API enhancements. That collaboration is just now getting underway and is taking place during open, bi-weekly meetings of the OFA’s OpenFabrics Interfaces Working Group (OFIWG). There is also a mailing list dedicated to the topic to which you can subscribe by going to www.lists.openfabrics.org and subscribing to the Ofa_remotepm mailing list.

And finally, we’ll be discussing the topic at the upcoming Flash Memory Summit, August 7-9, 2018.  Just go to the program section (https://flashmemorysummit.com/English/Conference/Program_at_a_Glance.html), and click on the Persistent Memory major topic, and you’ll find a link to PMEM-202-1: Remote Persistent Memory.

See you in Santa Clara in August.