Storage Developer Conference - #165: Enabling Heterogeneous Memory in Python

Episode Date: March 30, 2022

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 165. Welcome to the talk on enabling heterogeneous memory in Python. The talk is centered around three main learning objectives. First, we will look at what is happening in the memory technology space,
Starting point is 00:00:53 and specifically we will review the emerging Compute Express Link standard, also known as CXL, which provides a coherent serial memory channel to off-chip memory. Second, we will discuss adoption challenges of these new technologies from a software perspective, particularly with respect to legacy code and libraries. Finally, we will examine PyMM, which is a new approach to software integration of current persistent memory and future CXL-attached memories that allows adoption
Starting point is 00:01:26 with minimal code change. A word of warning: this is not a product talk, but more of a philosophical talk and a presentation of some early work that IBM is doing in the area of preparing for the adoption of CXL. CXL defines a standard for a high-performance, cache-coherent interconnect that allows CPUs and accelerators to share memory in a cache-coherent manner. It also allows expansion beyond the locally attached DIMM slots and potentially more memory bandwidth to the CPU. There are two key advantages that this technology brings.
Starting point is 00:02:15 First, it allows cost reduction and performance improvement by sharing memory resources across devices and thus reducing data copies. For example, a CXL-attached GPU or accelerator card could either share its local memory, which could be in the form of high-bandwidth memory, with the host CPU, or use the host CPU memory directly. Accesses in CXL are at cache-line granularity, that is, 64 bytes, and all caches are kept coherent. The second generation of the standard, CXL 2.0, allows connections outside of the box to a switch. This revision, among other things, supports shared memory pooling and CPU memory disaggregation, which allow flexible, software-controlled association of memory to processing. This type of capability could really impact the way we build and deploy systems in data centers and the cloud.
Starting point is 00:03:10 While the CXL specification is close to finalization, we do not expect to see CXL-capable platforms until 2022. We are beginning to see design and CXL controller logic IP become available today, and we expect to see CXL 1.1 support in the Intel Sapphire Rapids platform, which is expected to be generally available in 2022. We're not going to go into any great detail about CXL, but I think it's worth sharing a high-level picture of what CXL looks like. It's based on three protocols: CXL.io, CXL.cache, and CXL.memory. These are all layered on a PCI Express 5.0-based transport. CXL.io basically provides PCI Express-like services for large DMA-based transfers. It is actually non-coherent with the caches and largely follows the PCI Express definition. It is used for block I/O as well as discovery and configuration of CXL devices in the system. CXL.cache is used to keep CPUs and devices cache coherent. Its primary use case is a Type 1 device, such as an accelerator or network interface card, that
Starting point is 00:04:34 wishes to perform atomic operations on host memory. Finally, the CXL.memory protocol is a transactional interface between the host CPU and memory. The coherency interface of the CPU uses CXL.mem to interface with the memory provided by either an accelerator with its own memory, known as a Type 2 device, or a device that provides memory itself, sometimes termed a memory expander or Type 3 device. These three protocols together define the transaction layer for CXL. Okay, let's look more closely at the memory expander, or Type 3 device, support. The expectation is that a host processor will continue to have locally attached DDR, because CXL-attached memory will undoubtedly be slower, probably adding around 100 nanoseconds to memory access. However, what is neat about CXL-attached memory is that it can support different types of memory
Starting point is 00:05:39 that have different performance and functional characteristics. As we move to CXL 2.0, we can externalize the memory devices from a node via a single-layer switch. Hierarchical switching is a focus of CXL 3.0. Anyway, the CXL switch includes a fabric manager that can dynamically control the visibility of memory devices to host processors. From a cloud or data centre perspective, this means that both memory scale-up and relocation of compute without data copying are possible. Astute viewers may have also noticed that memory devices can provide
Starting point is 00:06:19 additional functionality. These intelligent memory functions are not included as part of the standard, although reliability and serviceability concerns and security encryption are considered. The point is that the specification allows a provider to create memory devices that include additional intelligence or compute. These could be horizontal services, such as support for persistent memory transactions, or tiering to storage, snapshots, and copy-on-write. They could also be vertical or application-specific services, such as near-data compute or processing
Starting point is 00:07:04 in memory. Again, this is a visionary slide of what is to come, rather than what is likely to be available next year. The takeaway is that we should expect CXL to enable support for heterogeneous memory types in a single system or cluster, as well as expansion of the traditional memory controller function to include advanced data management capabilities. OK, so now we understand that with CXL, the future looks like we'll have systems with many types of memory. We're going to centre our discussion around general-purpose programming, such as building a data science application in Python.
Starting point is 00:07:49 Locally attached DRAM, CXL-attached DRAM, and PMEM or storage-class memory could all be present in the same single node. And as CXL evolves, we should expect memory sharing across nodes through CXL switching. The problem is that for the last 50 years or so, locally attached DRAM has been the only player in the memory arena, and software programming languages, compilers and frameworks have all been defined around this premise. Today, programming language abstractions and compilers see and expect volatile behaviour for data access. While today's compilers
Starting point is 00:08:27 are very aware of the cache hierarchy, they typically expect main memory to behave uniformly and just store bits. Today's software typically uses storage to make data persistent. The system layers services such as file systems and databases on top to provide durability. Here, the programming domain and the persistent domain are separate. Before copying to the persistent storage domain, data or variables in the programming domain must first be converted so as to remove dependencies on pointers or machine architecture specifics. The data is then explicitly copied to the persistent domain, whether that be through
Starting point is 00:09:09 a file system, database or key-value store. For emerging persistent memories, such as Intel Optane persistent memory modules, we must effectively rewrite the application. Toolkits such as the Persistent Memory Development Kit, or PMDK, require that applications use a well-understood interface to the persistent memory, such as a file or a key-value store, or alternatively that applications be rewritten to incorporate transactional semantics and awareness of persistence. For legacy applications to get the full potential from persistent memory, there is no easy path.
Starting point is 00:09:54 With CXL, the software is suddenly faced with a very different concept of what memory is. It can be volatile or non-volatile, local or remote, fast or cheap, or integrated with any number of possible value-add services. The software ecosystem is not ready for this shift. Even shifting to persistent memory is a hurdle in itself. My legacy Python program does not include abstractions or an understanding of what persistence and transactions mean. It doesn't even have a notion that variables can exist in different types of memory. Of course, we expect low-hanging fruit for adoption whereby the operating system or hypervisor can do something useful with knowledge of different memory
Starting point is 00:10:38 behaviours. For example, a hypervisor might deploy virtual machines onto different types of memory, depending on the cost-performance ratios required. So the general problem is: how do I reuse my invaluable Python ecosystem with this new memory technology? In a perfect world, new technology can be quickly and easily adopted. At a minimum, this means that legacy applications and libraries can still be used. I don't need to go off and rewrite all my software; that's simply not feasible. However, we believe that small changes are more palatable, such as tweaking API calls or adding additional parameters. In the long run, the introduction of new hardware technology is typically followed by a gradual
Starting point is 00:11:30 alignment of the software ecosystem. Of course, this assumes that the hardware brings new value to the table that cannot be obtained elsewhere. Unfortunately, we don't live in a perfect world. Software catch-up for CXL is not going to happen overnight, but we can make steps in the right direction. We're now going to shift to talking about PyMM.
Starting point is 00:11:58 PyMM is an early prototype from IBM Research that takes a different approach to integrating external and persistent memory with the Python programming language. The target for this technology is primarily Python-based data analytics and machine learning. The approach of PyMM is very different from PMDK. PyMM, which stands for Python Memory Management, is a library extension for Python 3. The principal objective of PyMM is to make the integration of different memory types very, very easy.
Starting point is 00:12:34 Because CXL is not readily available today, PyMM is prototyped with DRAM and locally attached Intel Optane 3D XPoint DIMMs. We are also keen to address the problem of persistent memory adoption, so this is part of this work. The current prototype is focused on supporting data science applications, whose developer community does not really want to worry about the details of the underlying physical memory. With respect to persistent memory, the data science developer recognises the potential for holding more data in memory at the same time, ultimately improving performance. They also recognise the advantage of avoiding the need to continuously reload data from storage whenever an application crashes
Starting point is 00:13:24 or is terminated. PyMM is available as part of the broader MCAS open-source project, which provides capabilities around network-attached, memory-centric storage with near-data compute. PyMM provides an abstraction known as a shelf for a logical collection of variables that are stored in a specific set of memory resources. Variables that reside on a shelf are readily available to be used in Python; there is no need to load from storage and convert into memory form. Only certain types are permitted to be placed
Starting point is 00:14:06 on the shelf. This is because, for persistent memory, the type must retain metadata that describes the type and format of the data. Shelf types look and feel exactly like their conventional Python counterpart types. In theory, a shelf type can be created for any existing data type. The key to doing so is Python's polymorphism and subclassing capabilities. For example, the current PyMM prototype supports a shelved_ndarray, which is polymorphic with a volatile NumPy ndarray. So let's see PyMM in action. For the moment we won't worry about creating a shelf; we'll just open an existing one. Once we have the open shelf, the variables are immediately available.
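In rough code, that demo looks something like the sketch below. The pymm.shelf constructor, the shelf name 'myShelf', and the variable m are assumptions based on the PyMM prototype, not a definitive API:

    import pymm

    # Open an existing shelf ('myShelf' is a hypothetical name; the
    # constructor and its parameters follow the PyMM prototype)
    shelf = pymm.shelf('myShelf')

    # Variables already on the shelf are immediately available through
    # the dot accessor -- no pickling, no loading from storage
    print(shelf.m)

    # Shelved variables can also be modified in place
    shelf.m += 1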
Starting point is 00:15:07 To access a variable, we simply use the dot accessor on the shelf instance itself. Variables can be added to and removed from the shelf, and they can also be modified in place with ease. No pickling or loading from storage. Variables that belong to a given shelf are stored in memory resources that are associated with the shelf. Under the hood, the shelf implementation requires a key-value namespace that maps the variable name, that is, the key, to the memory used to store the variable. The memory used to store the variable consists of one or more regions that are allocated through the underlying heap allocator. For persistent memory, the shelf must store both the metadata and the
Starting point is 00:16:02 value data, so that type information can be recovered when the shelf is reopened. The current prototype supports regions of Intel Optane AppDirect memory, but we have designed PyMM to allow other memory types, such as future CXL-attached memory, to be easily integrated. Shelf types are those that can be instantiated in shelf memory. They are implemented to look like their volatile counterparts, but the memory for the type instance is captured and specifically allocated from shelf memory. Shelf variables are instantiated in one of two ways. First, they can be instantiated directly on the shelf using what is known as a shadow type. The shadow type is simply an evaluatable expression that retains constructor parameters
Starting point is 00:16:58 which can later be used for instantiation in persistent memory. Alternatively, the shelf variable can be instantiated using a copy constructor that takes an instance, for example the result of an expression evaluation, and uses that to copy-construct on the shelf. This latter approach is useful when you want to use an existing library to construct your shelf data. For example, you could use numpy.ones to copy-construct the array, or alternatively create an ndarray shadow type and then fill the matrix with ones using the fill operation after the fact. So here we're creating the array using the shadow type, and then we can use the in-place fill operation to populate the matrix.
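As a sketch, the two instantiation routes might look like this, assuming the prototype's pymm.ndarray shadow type and a previously created shelf:

    import numpy as np
    import pymm

    shelf = pymm.shelf('myShelf')  # hypothetical existing shelf

    # Route 1: shadow-type instantiation. pymm.ndarray records the
    # constructor parameters and the array is built directly in shelf
    # memory, so only shelf memory is required.
    shelf.w = pymm.ndarray((1000, 1000), dtype=np.float64)
    shelf.w.fill(1.0)              # in-place fill after the fact

    # Route 2: copy-constructor instantiation. The right-hand side is
    # evaluated in main memory first, then copied onto the shelf.
    shelf.x = np.ones((1000, 1000))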
Starting point is 00:18:17 Shadow-type instantiation only requires shelf memory. However, copy-constructor instantiation requires that there is sufficient main memory to evaluate the right-hand-side expression. We'll come back to addressing this problem later. Now let's quickly talk about references, which are important in Python, since everything is done by pass-by-reference. Volatile references can be made to variables on the shelf. However, in the current prototype, references themselves cannot be explicitly put on the shelf. If you delete something from the shelf with an outstanding
Starting point is 00:18:52 volatile reference, PyMM will give a warning. Okay, so let's see references in action. We've created a shelf, and then on the shelf we're going to instantiate a variable, in this case a 3x3 matrix. Now that we have the variable on the shelf, we can actually look at the address of the variable using the special addr attribute. And then we can create a reference to that shelved variable. Here the reference is called ref, and it's pointing to shelf.m.
Starting point is 00:19:50 And we can verify that the reference is pointing to the same object by again looking at the address. However, if we do an assignment between shelf variables, what actually happens is that we get a copy. So here, shelf.n = shelf.m makes a copy of the matrix as shelf.n. And here you can see we're doing an operation on shelf.n, and then we can check that, (a), they have different addresses and, (b), the actual contents have changed. We can also delete variables from the shelf.
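Pulled together, the reference demo looks roughly like this. The addr attribute is the one mentioned above, while erase is our reading of the prototype's deletion API and should be treated as an assumption:

    import pymm

    shelf = pymm.shelf('myShelf')  # hypothetical shelf

    # Instantiate a 3x3 matrix directly on the shelf
    shelf.m = pymm.ndarray((3, 3))

    # The special addr attribute exposes the variable's address
    print(shelf.m.addr)

    # A volatile reference points at the same shelved object
    ref = shelf.m
    print(ref.addr)                # same address, same object

    # Shelf-to-shelf assignment, by contrast, makes a copy
    shelf.n = shelf.m
    shelf.n.fill(5)                # different address, contents now differ

    # Delete variables from the shelf (erase is an assumed method name)
    shelf.erase('m')
    shelf.erase('n')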
Starting point is 00:20:34 Here we can see we're deleting the variable m from the shelf with shelf.erase('m'), and then we're doing the same for n. Shelf types used as expressions are coerced to their volatile counterpart type. So, for instance, if I have a PyMM shelved ndarray, this will be coerced to the numpy.ndarray type. Let's take a look at an example. I've opened a shelf here, and we have a variable, shelf.m, which is a shelved ndarray. We'll create another variable, x, using the basic NumPy ndarray type. If we look at the type of shelf.m and the type of x, they're different. If we look at the type of the expression shelf.m + x, we can see that it's coerced to numpy.ndarray. Likewise, if we take two shelved objects, shelf.m + shelf.m, we can see that the result is also coerced to an ndarray. However, when you apply an assignment operator between shelf objects, what actually happens is that a copy is made. So here we can say shelf.p = shelf.m + shelf.m, and that will create a shelf type, shelf.p.
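In sketch form, the coercion behaviour looks like this, again assuming the hypothetical shelf from before:

    import numpy as np
    import pymm

    shelf = pymm.shelf('myShelf')          # shelf.m is a shelved ndarray
    x = np.arange(9).reshape(3, 3)         # ordinary volatile NumPy array

    print(type(shelf.m))                   # shelved ndarray type
    print(type(x))                         # <class 'numpy.ndarray'>

    # Shelf types used in expressions are coerced to the volatile type
    print(type(shelf.m + x))               # <class 'numpy.ndarray'>
    print(type(shelf.m + shelf.m))         # <class 'numpy.ndarray'>

    # Assignment back onto the shelf copy-constructs a shelf type again
    shelf.p = shelf.m + shelf.m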
Starting point is 00:21:36 Transient memory mode is used by PyMM to provide sufficient memory for right-hand-side evaluation. The user allocates a portion of persistent memory, or alternatively a backing file path. The persistent memory needs to be an fsdax file, and the backing file needs to be mappable with mmap. The transient memory, which is normally only used temporarily, is applied when large allocations, over 64 kilobytes, are made. This threshold can be easily modified. The way it works is
Starting point is 00:22:37 by using the Python PyMem_SetAllocator function to override the default Python raw memory allocator. For NumPy, we've had to make some small modifications so that the Python allocator is used instead of the system malloc function. Okay, let's take a look at an example. So in this system I have 6.8 gigabytes of DRAM, or main memory. And I've also got two persistent memory devices, one configured as devdax (/dev/dax1.0) and one configured as fsdax (mounted at /mnt/pmem0). First, I'll import the PyMM and NumPy modules. Then I'm going to create a shelf so we can perform the test.
Starting point is 00:23:31 The shelf, you can see, is configured to be 64 gigabytes in capacity, using the /dev/dax1.0 device. Now we're going to try and create a variable x on the shelf that is 16 gigabytes in size. We're going to use the copy constructor to do this, using the np.ones function that's provided by NumPy. This is obviously too big to fit in memory, and you can see what happens: it fails to instantiate the shelf variable x because of memory exhaustion while evaluating the right-hand side. So PyMM has a feature called transient memory, which allows you to use persistent memory
Starting point is 00:24:18 and/or a backing store to increase the available memory for right-hand-side evaluation. So in this example, we're going to use both persistent memory and a backing file. Here you can see I'm using /mnt/pmem0/swap; remember, this is fsdax, so that's the name of a file on the persistent memory. We're going to give that a size of 32 gigabytes, and then we're also going to give a backing directory of /tmp. Now I can execute exactly the same instruction. It takes a little while, but you can see that it's now able to get the memory to evaluate the right-hand-side expression. And so now you can see shelf.x is successfully instantiated on the shelf, and the transient memory has in fact been released.
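A sketch of the transient memory demo follows. The enable_transient_memory and disable_transient_memory calls, and the shelf constructor parameters, are assumptions about the prototype API based on the demo, not a definitive interface:

    import numpy as np
    import pymm

    # A 64-gigabyte shelf on the devdax device; the machine itself has
    # only 6.8 gigabytes of DRAM (size_mb and pmem_path are assumed
    # parameter names)
    shelf = pymm.shelf('myShelf', size_mb=64 * 1024,
                       pmem_path='/dev/dax1.0')

    # Without transient memory, this 16-gigabyte right-hand side cannot
    # be evaluated in DRAM and the assignment fails:
    #   shelf.x = np.ones(2 * 1024**3)

    # Enable transient memory: a 32-gigabyte swap file on the fsdax
    # device plus a backing directory (assumed function name/signature)
    pymm.enable_transient_memory(pmem_file='/mnt/pmem0/swap',
                                 pmem_file_size_gb=32,
                                 backing_directory='/tmp')

    # Now the same expression succeeds: 2 * 1024**3 float64 values
    # occupy 16 gigabytes
    shelf.x = np.ones(2 * 1024**3)

    # Once instantiation completes, the transient memory is released
    pymm.disable_transient_memory()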
Starting point is 00:25:08 One of the main attractions of PyMM is its ability to support persistent memory, such as Intel Optane Persistent Memory modules in AppDirect mode. The idea is that the programmer does not have to worry about crash consistency; it will all be dealt with under the hood. PyMM's default implementation is to use software-based undo logging by implicitly making a copy or out-of-place instance. However, achieving high-performance crash-consistent transactions in general
Starting point is 00:25:47 is a difficult problem. PyMM does not profess to solve this problem, but instead provides hooks so that the developer can put in place their own approach to transactions and crash consistency. So whether you want to do undo or redo logging, in software or in hardware, PyMM has been designed to allow those mechanisms to be easily plugged in. Okay, let's take a look at some example code.
Starting point is 00:26:17 We're going to first look at the explicit persistence or persist calls. So here I've imported PyMM and opened an existing shelf. And on the shelf I have a matrix that's currently populated with 9s. I'm going to perform an addition to each element of the matrix of the plus equals 1. And then I'm explicitly calling shelf.matrix.persist to make sure that the changes are flushed out of cache to persistent memory. So here there's no crash consistency. The matrix could in fact crash in some inconsistent state. I can also start PyMM with this special environment variable PyMM underscore use underscore SW underscore TX.
Starting point is 00:27:08 I can also start PyMM with the special environment variable PYMM_USE_SW_TX. This will turn on undo logging for the system. So if I repeat the exercise, I can open the shelf and then perform some operation on the matrix. And you can see the operation fill(8) implicitly executed a begin-transaction and a commit-transaction point. This is what provides the hooks to basically do the undo logging. So I can effectively implement whatever type of crash consistency I want using these hooks.
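A sketch of the software-transaction variant; the environment variable spelling follows the talk, and everything else is assumed:

    # Run the program with software transactions enabled, e.g.:
    #   PYMM_USE_SW_TX=1 python3 app.py
    import pymm

    shelf = pymm.shelf('myShelf')

    # With undo logging turned on, an in-place operation such as fill()
    # implicitly opens a transaction (snapshotting the old value) and
    # commits it when the operation completes
    shelf.matrix.fill(8)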
Starting point is 00:27:58 Okay, that's the end of the talk, so let's summarize. We believe that CXL, amongst other things, will fundamentally change the way memory is integrated into the system. It will allow both new types of persistent memory and higher-capacity volatile memory to be integrated by the system integrator, not just the processor vendor. Furthermore, CXL will ultimately allow a shift to compute-memory disaggregation, creating additional resource flexibility in cloud and data center deployments. However, these new heterogeneous memory types will present a challenge to the software. Working out how to integrate this emerging technology with existing software, programming languages, compilers, libraries and frameworks will be fundamental to driving adoption. In this talk we have presented PyMM, which is a Python 3 framework designed to ease the adoption of persistent memory and, ultimately, CXL-attached heterogeneous memories. Remember, CXL-attached memory of the future may do more than just store bits. PyMM is a proof-of-concept prototype. It's not a product, but it is there in the open-source community,
Starting point is 00:28:56 so you can pick it up and have a play. We welcome contributors, and we are already trialing the idea with data science students at Boston University and the Hasso Plattner Institute. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further
Starting point is 00:29:30 with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
