@HPC Podcast Archives - OrionX.net - HPC News Bytes – 20260511

Episode Date: May 11, 2026

Topics: Nvidia, Corning, big optical fiber deal - Optical computing - New MRC RDMA on Ethernet fabrics - MRC vs RoCEv2 and InfiniBand - AMD strengths in GPUs and CPUs - Ocean wave energy for data centers

Audio: https://orionx.net/wp-content/uploads/2026/05/HPCNB_20260511.mp3

Transcript
Starting point is 00:00:04 Welcome to HPC Newsbytes, a weekly show about important news in the world of supercomputing, AI, quantum computing, and other advanced technologies. Hi, everyone. Welcome to HPC Newsbytes. I'm Doug Black and with me is Shaheen Khan. Light-based optical computing technology has made big news again in the form of Nvidia investing more than $3.2 billion in glassmaker and optical networking company Corning as part of a massive optical fiber deal that includes three new factories in North Carolina and Texas, everything focused on AI. This is about optical interconnects and their incorporation within Nvidia's systems and upcoming generations of AI chips.
Starting point is 00:00:48 Intel and AMD are, as listeners of this podcast have heard many times, also on this technology bandwagon, which promises to deliver next-generation chip performance and power efficiency gains. But there's a broader story here, which is the emerging impact of optical on the future of computing in general. In addition to interconnects, optical computing, that is, optical computation, may play a major role in some aspects of HPC, handling workloads beyond the capacity of today's classical supercomputers. There is a cadre of startups conducting R&D in this space, and while some analysts doubt optical computation will come to fruition, it's a sector attracting attention and investment.
Starting point is 00:01:34 Optical interconnects started in long-distance telecommunications use cases and have been making their way towards the chip ever since, kind of a reverse Big Bang, if you will. As you mentioned, co-packaged optics is now expected to arrive in upcoming chips from the major chip makers. Of course, the large and growing demand for AI is what's driving a lot of the developments out there, and since photons run cooler than electrons, they are being pulled into the picture anywhere they can play.
Starting point is 00:02:04 Obviously in interconnects, where they have been increasingly proven now at any scale, but also in computation, which is the next challenge and where the jury is out. But the area is intriguing and could work for specific algorithms, and that may be enough to get it going. Professor Keren Bergman of Columbia University, a leading authority in optical technologies and a guest of this podcast (I highly encourage you to go look up those episodes), expressed some doubt about optical computation the last time we spoke with her.
Starting point is 00:02:36 The topic also came up with Olivier Ezratty, our recent guest focused on quantum computing, as part of what he calls unconventional computing, which may be able to solve specific big problems and thus establish a valuable role for itself. Meanwhile, serious R&D in optical computation is underway, and it's the usual scene: promising progress, but not quite there yet. Last week, OpenAI, together with Nvidia and other industry partners,
Starting point is 00:03:06 including AMD, Broadcom, Intel, and Microsoft, announced a new protocol for a more efficient distributed RDMA that runs on Ethernet fabrics. RDMA, or Remote Direct Memory Access, allows one processor to directly access the memory of another system across a network at low latency and minimal CPU overhead. The companies have submitted the 1.0 specification of what they call multipath-reliable connection MRC as an open standard to the OpenC compute project, OCP. OpenAI said the technology is already validated in production on its largest NVIDIAGB200 supercomputers, including on the Oracle and Microsoft clouds.
Starting point is 00:03:56 Yes, MRC is a pretty significant announcement and replaces the current protocol, RDMA over Converged Ethernet version 2, or RoCEv2. You could say its essence is along three dimensions. First is a technical shift. There have been RDMA protocols that run on Ethernet fabrics and are packet switched. Those parts are not new. What is new is that MRC can handle out-of-order delivery. Until now, if a packet failed or if it arrived out of order, it was a bit of a showstopper.
Starting point is 00:04:31 Traditional rocky fabrics degrade badly on their packet loss and out-of-order conditions. Those problems are solved now. Still under the technical shift, because it can handle out-of-order traffic, the protocol enables what they called packet spray, meaning the server packetizes the data and sends packets through every available network path simultaneously rather than using a single lane, so to say. What is new here is doing it efficiently with RDMA semantics at massive scale. The network is described as a programmable host-managed fabric because the routing and recovery intelligence is pushed to the host, the servers,
Starting point is 00:05:13 instead of relying on super lossless switches. The servers have the full map of the interconnect and specify the data path for each packet as it is being sent. If a packet fails, the recipient has to send a message to the sender asking for a retry. So the flow is not fully dynamic. However, the switches in the middle can detect the packet failure and send on just a header without the associated data to more quickly let the recipient know that the packet has failed. That way, the recipient can ask for a retry without having to wait for a timeout.
Starting point is 00:05:49 And all of this would be a selective retransmit, as they call it, just for the failing packet, not for the whole network transaction. Under RoCEv2, the whole network transaction would have to be retried, a process they call go-back-N (spelled go, dash, back, dash, capital N). The protocol dramatically reduces so-called tail latency too, where one slow packet can cause the whole cluster to wait. It can deliver higher performance across 100,000 or more GPUs, so it's really designed for massive scale.
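The retransmission difference described here comes down to a simple count, sketched below under stated assumptions (the function names and packet counts are illustrative, not from either specification). With go-back-N, one lost packet forces a replay of that packet and everything sent after it; with selective retransmit, only the lost packet is resent.

```python
# Hedged sketch of retransmission cost: go-back-N vs selective retransmit.

def go_back_n_resends(total_packets: int, lost_seq: int) -> int:
    # Everything from the lost packet to the end of the transaction
    # must be retransmitted, even packets that arrived intact.
    return total_packets - lost_seq

def selective_resends(total_packets: int, lost_seq: int) -> int:
    # Only the single failing packet is retried.
    return 1

# One early loss in a 1,000-packet transaction:
total, lost = 1000, 3
print(go_back_n_resends(total, lost))   # 997
print(selective_resends(total, lost))   # 1
```

At cluster scale, where some packet is almost always late or lost somewhere, that gap is what turns into the tail-latency difference the hosts describe.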
Starting point is 00:06:24 And the reported benchmarks that I saw made MRC look really very good compared to Rocky V2. They reported zero measurable weights for MRC compared to something like 30% slowdown for Rocky V2, 92% efficiency and bandwidth for MRC compared to 85% for Rocky V2, and over 40% improvement in storage reads and rights for MRC, and then recovery times were reported to be close to hardware-level rerouting in microseconds because it does the selective retransmit. The second thing is a competitive shift.
Starting point is 00:06:59 The protocol is an open standard through OCP, and there is some associated open-source software, but each vendor has to do their own hardware implementation and optimization, where a bunch of the magic is. Nevertheless, the protocol provides a multi-vendor possibility for customers. We should also mention that the Ultra Ethernet Consortium, UEC, is already pursuing a lot of optimizations like this, but it is doing it with a clean slate. MRC runs on top of existing fabrics and protocols, and you could consider it an evolution.
Starting point is 00:07:35 Finally, it's a strategic shift. It narrows the gap between Ethernet and InfiniBand, if not closes it. Ethernet and InfiniBand have been using the same hardware for years but with different protocols. By bringing InfiniBand-like efficiency to Ethernet fabrics, MRC makes Ethernet a primary choice for big AI clusters. That's a pretty big deal, too. Speaking of Nvidia and the tech supply chain, we've recently talked a lot about Intel's turnaround, Micron's rapid ascent, Marvell, and others, and, of course, AMD. Wall Street looks like it has discovered the importance of all these technologies
Starting point is 00:08:12 and the structurally higher demand that they have to meet. And this exuberance is reflected in stock prices. Since AMD is viewed as the closest GPU rival to NVIDIA, the question remains, can AMD make a serious dent in NVIDIA's AI market dominance, or is the market growing so fast that the best they can do is grow faster? AMD's strengths in CPUs and GPUs is quite an advantage too. In fact, AMD CEO Lisa Sue has pointed to that,
Starting point is 00:08:44 saying that the surge in demand for AI inference has generated tremendous demand for their CPUs. Which raises another question: is Nvidia, with its own Arm CPUs, vulnerable on the CPU front, now that AI inference is gaining? Or is it actually a comparative strength, since they can nudge their GPUs to connect to their own CPUs? So in terms of relative strength, in GPUs, AMD needs to grow faster, and in CPUs, Nvidia needs to grow faster. Cooling what some have called glow-in-the-dark GPUs has surfaced some surprising proposed solutions. Case in point is a company in Oregon that brings new meaning to the term liquid cooling.
Starting point is 00:09:26 A company called Panthalassa has raised $140 million to develop floating data centers at sea that rely on ocean wave-based power generation. The startup said they have successfully tested prototypes and plan to pilot their idea in the Pacific Ocean by August of this year. We've heard talk about wave-based power for a long time, but this is the first I've heard of it moving beyond the science experiment phase. According to a story in GeekWire, funding for the startup from Peter Thiel and others will allow them to finish building their pilot manufacturing facility near Portland.
Starting point is 00:10:05 The floating panthalassa orbs look something like massive navigational buoys, and the company says they will transmit data via low. Earth orbit satellites. The idea makes sense she not only in tapping waves for sustainable energy, but also because the data center itself is located in a cool environment. As anyone who was tried to swim in it can tell you, the Pacific Ocean in the northwest coast of the U.S. is cold. Well, it's even cold in San Francisco.
Starting point is 00:10:34 But yeah, it's a compelling story to start with you don't need land and can avoid complicated land use and permit issues. waves are supposed to have five to ten times the energy density of wind or solar, and they are 24-7. So you have a lot of energy and no interruptions. The energy is right there, so transmission issues are simplified, as are cooling challenges, since cold water is right there too. And if you let the data center float on a platform instead of being submerged, you can avoid underwater cables and a lot of the corrosion issues. And maintenance is doable too. Data can be sent by a satellite, or if you have it close enough to the shore through line-of-sight microwave.
Starting point is 00:11:17 If you remember that Microsoft experiment with a submerged capsule with servers in it a couple of years ago, Microsoft Project Natick, and we covered it on this podcast, they filled the capsule with nitrogen, submerged it, and reported a significant increase in reliability, actually. It failed one-eighth as often as those on land. So it can probably extend the hardware lifecycle, too. So whether or not it ends up looking like one, it's like an offshore oil rig: high initial costs, logistic challenges, salty water and air that cause corrosion, occasional storms. But hey, land, electricity, cooling, and maintenance could all be a little or a lot cheaper. By the way, the name Panthalassa
Starting point is 00:12:01 comes from the vast super ocean that encompassed planet Earth and surrounded the subcontinent pangia before the Earth's original landmass broke up into the continents and islands scattered around the Earth's surface that we have today. Now you know. All right, that's it for this episode. Thank you all for being with us. HPC Newsbytes is a production of OrionX. Shaheen Khan and Doug Black host the show.
Starting point is 00:12:29 Every episode is posted on OrionX.net. If you like the show, please rate and review it. Thank you for listening.
