HPC Podcast Archives - OrionX.net - HPC News Bytes – 20260330

Episode Date: March 30, 2026

- Arm enters the chip business with Arm AGI CPU - Is AI inference increasing the CPU:GPU ratio? - Google’s TurboQuant algorithm - Data center energy efficiency - Bringing back a 2,000-year-old cement battery - India’s chip industry - TSMC's 2nm capacity crunch - $20B for a TeraFab?

Transcript
Starting point is 00:00:04 Welcome to HPC News Bytes, a weekly show about important news in the world of supercomputing, AI, quantum computing, and other advanced technologies. Hello, everyone. Welcome to HPC News Bytes. I'm Doug Black, and with me is Shaheen Khan. In a move that has been anticipated for some time, Arm, the chipless chip company that has thrived by licensing its technology, launched its first real chip, a CPU that, it said, has been designed for agentic AI. To leave no doubt about that, they call it the Arm AGI CPU. It has been developed with Meta as a key partner and also featured YouTube, OpenAI, SAP, Cerebras, and Cloudflare as collaborating partners. The chip has 136 cores, is designed for air-cooled or liquid-cooled racks with a 1U
Starting point is 00:00:58 server chassis, and promises more than 2x the performance per rack compared to traditional x86 alternatives, something we're sure AMD and Intel would not agree with. Full production is planned for the second half of 2026 in TSMC's 3-nanometer fab in Taiwan, and original design manufacturers (ODMs) such as ASRock, Lenovo, Quanta, and Supermicro were mentioned as building complete systems using the chip. But there was no mention of expected volumes, which is a big question in light of tight capacity at TSMC and parts shortages across the industry. They said limited volumes in 2026, ramping up in 2027, with a move to 2 nanometer in 2028. Meta is building its own accelerator, as are Google, Microsoft, and Amazon.
Starting point is 00:01:54 Microsoft, Amazon, and Nvidia also built their own Arm-based CPUs. So naturally, Arm fielded questions about competition with its own customers, but said it's all about providing more choices for all customers. I see these chips as having more or less similar micro-architectures, with their performance a function of fab technology and of architectural choices to optimize for one workload or another. For a while, there was a lot of debate about Arm and RISC-V being cleaner instruction sets than x86, with little historical baggage for backwards compatibility, implying that x86 was inherently slower than Arm.
Starting point is 00:02:37 That all faded after AMD started leading in performance with the same x86 instruction set, using TSMC's advanced fabs. What we see as Arm today started out in 1978 as Cambridge Processor Unit Limited, whose initials spell CPU, so they can say they've been about CPUs from the beginning. A year later, they started Acorn Computer Limited to separate the computer business from the consulting they had started doing, and 11 years after that, in 1990, Acorn spun out its captive design team as Advanced RISC Machines Limited, or ARM, as a joint venture with Apple and VLSI Technology to build a CPU for the Apple Newton, which coined the
Starting point is 00:03:25 term personal digital assistant. In 1991, Arm pivoted to an IP licensing business and thrived with that model. As the architecture proliferated, customers could use more services, but participating in more of what customers do by moving up the stack is financially complicated, since profit margins as a percentage would drop even though profits in magnitude would be higher. So Arm added a new business model and started offering more design services with their integrated design blocks, i.e. bigger technology chunks ready to tape out, which they call a compute subsystem, or CSS. And now they have added another business model by selling actual chips. For IP licensing, analysts see gross margins of about 97% for what Arm would provide.
Starting point is 00:04:19 That is just what Wall Street wants to hear, those kinds of high margins, but it translates to only a $50 share of a $1,000 chip, because they're not participating in the remaining $948 of that chip. With the compute subsystem business, the margin percent drops slightly to 95 percent, but the profit magnitude doubles to $100 out of a $1,000 chip. And now with the direct silicon business, where they sell the whole chip, analysts estimate a margin percent of about 50 percent initially, which translates to $500 of profit out of a $1,000 chip.
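To make those business-model numbers concrete, here is a minimal back-of-the-envelope sketch in Python. The per-chip revenue figures are back-calculated assumptions from the margins and profit shares quoted above, not figures from Arm or the analysts.

```python
# Rough sketch of the three Arm business models described above.
# The revenue-per-chip numbers are illustrative assumptions back-calculated
# from the quoted gross margins and profit shares.

CHIP_PRICE = 1_000  # assumed selling price of the finished chip, in dollars

models = {
    # name: (gross margin, Arm's assumed revenue per $1,000 chip)
    "IP licensing":        (0.97,    52),   # ~97% margin -> ~$50 profit
    "Compute subsystem":   (0.95,   105),   # ~95% margin -> ~$100 profit
    "Direct silicon sale": (0.50, 1_000),   # ~50% margin -> ~$500 profit
}

for name, (margin, revenue) in models.items():
    profit = margin * revenue
    untouched = CHIP_PRICE - revenue
    print(f"{name:>20}: revenue ${revenue:>5,}, profit ~${profit:,.0f}, "
          f"not participating in ${untouched:,.0f} of the chip")
```

The pattern is the point: each step down the stack trades margin percentage for a larger absolute profit per chip.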
Starting point is 00:04:47 Sticking with CPUs, there is a new blog from Intel that talks about the rising status of CPUs in AI workloads as the industry moves from AI training to AI inference. The blog is called "The Rising CPU:GPU Ratio in AI Infrastructure" and asserts that agentic AI and complex inference pipelines require significant serial processing, data orchestration, and housekeeping tasks that GPUs cannot perform as efficiently. As you've heard me say here, I consider AI inference to have deployment dynamics similar to the old mainframe-era online transaction processing, OLTP.
Starting point is 00:05:38 Inference brings HPC into OLTP. The article's central focus is on that architectural and economic shift. The article does not focus on vector extensions within CPUs, which have been around for over 30 years, 32 years in fact, and have been standard on x86 chips for about as long, nor on matrix extensions, which were introduced three to six years ago depending on the chip. These capabilities are already there in systems and can accelerate code within the chip, albeit not anywhere near what a real GPU can do. The article mentions that certain AI tasks, like smaller inference runs based on smaller models, can be moved back to the CPU to improve latency, energy efficiency, and total costs, but it does not push that point, which is probably smart. Instead, it stays in the more defensible territory of CPU-friendly tasks like logic-heavy orchestration, baseline security, networking, and pre- and post-processing. Those tasks, and not matrix math on the GPU, are the cause of some 90% of the latency, it says, and more CPUs will be the solution.
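As a minimal illustration of the point about vector extensions, the sketch below runs the same arithmetic once as a scalar Python loop and once as a vectorized call that typically gets dispatched to SIMD-optimized library code on the CPU. This is a generic demonstration, not something from the Intel blog.

```python
# The same dot product, computed element by element versus through NumPy,
# whose BLAS backend typically uses the CPU's vector (SIMD) units.
import time
import numpy as np

n = 2_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar path: one multiply-add at a time in the Python interpreter.
t0 = time.perf_counter()
acc = 0.0
for x, y in zip(a.tolist(), b.tolist()):
    acc += x * y
t_scalar = time.perf_counter() - t0

# Vectorized path: a single library call over the whole array.
t0 = time.perf_counter()
acc_vec = float(np.dot(a, b))
t_vector = time.perf_counter() - t0

print(f"scalar: {t_scalar:.3f}s  vectorized: {t_vector:.4f}s  "
      f"speedup ~{t_scalar / t_vector:.0f}x")
```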
Starting point is 00:06:24 Nature hates a vacuum, and in the IT industry, innovators hate an unmet need. And right now, there is an enormous need for greater power efficiency in AI workloads. Actually, innovators love the business opportunity of an unmet need, and in desperate times like these, innovations can come from unexpected sources. We're going to look at three efforts to address power needs, one of which is from an Israel-based company called NIV AI that has announced $12 million in seed funding. The company says it analyzes boosts in general
Starting point is 00:07:07 these, innovations can come from unexpected sources. We're going to look at three efforts to address power needs, one of which is from an Israel-based company called NIV AI that has announced $12 million in seed funding. The company says it analyzes boosts in general. GPU-powered usage, allowing NIV AI to, quote, actively synchronize energy and compute and safely unlock stranded capacity by notifying data center developers who can make more cost-effective choices about energy use. NIV AI plans to have systems in American data centers this year, according to TechCrunch. And a 10-person startup called Cache Energy in Champaign, Illinois,
Starting point is 00:07:54 has developed a cement battery, yes, a cement battery, in the form of tiny cement pellets. They've added an unnamed but readily available binding agent that enables the pellets to hold and then discharge heat repeatedly, generating temperatures of up to 1,000 degrees Fahrenheit. So, assuming these batteries are produced at scale, they could be used to ease the overall burden on electric grids caused by data center power demand. Well, it's all high-tech, but let's now get back for a moment, Doug, to actual high technology. There was an interesting blog from Google last week announcing what they call TurboQuant, a compression algorithm that takes on the memory challenge in AI inference using vector quantization. So let's unpack that for a bit.
Starting point is 00:08:43 Key value, or KV, is a data structure in AI inference that holds the context of an interaction with AI. In AI, unlike traditional databases, the key is not just a label used to look up a value. It is a string of hundreds of numbers representing the mathematical relationship between a token and other potential tokens. And the value that corresponds to it is itself a vector, a series of numbers. They are used in matrix multiplications between a new query, which itself has been vectorized into sequences of numbers, and all the stored key vectors and their values. It's all math.
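Here is a minimal sketch of the matrix math being described, with made-up dimensions. It shows the generic query/key/value pattern only, not any particular model's or vendor's implementation.

```python
# Each stored token contributes a key vector and a value vector; a new
# query is compared against all stored keys, and the resulting weights
# blend the stored values. Sizes below are illustrative.
import numpy as np

d = 128          # numbers per vector (illustrative)
context_len = 6  # tokens already held in the KV structure (illustrative)

K = np.random.randn(context_len, d)  # stored key vectors (the "K")
V = np.random.randn(context_len, d)  # stored value vectors (the "V")
q = np.random.randn(d)               # the new query, also just numbers

scores = K @ q / np.sqrt(d)                      # similarity of query to each key
weights = np.exp(scores) / np.exp(scores).sum()  # normalize into attention weights
output = weights @ V                             # weighted blend of stored values

print(output.shape)  # (128,) -- one new vector, passed along to the next layer
```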
Starting point is 00:09:28 The key value structure is referenced and accessed a lot, so it makes sense to keep it in the fastest memory you can: either HBM, high-bandwidth memory, or SRAM, static random access memory. SSDs, and even traditional memory, would be too slow. So that is what turns it into a cache. Now, KV structures get bigger as the AI model gets bigger, or as you ask the AI to remember more of your exchange with it, or as you throw a lot of different requests at it in a multi-user environment. And if you want to keep all of that in fast memory,
Starting point is 00:09:59 you'll need more memory and faster memory. Today, it has become the main bottleneck in AI inference. So you could say the KV cache is a big reason why there is a memory shortage in the industry, and a big reason why memory prices have been going up. Large models, long contexts, or high concurrency all cause the KV cache to get bigger. In those cases, the KV cache dominates. You might need several gigabytes per request, or tens of gigabytes in aggregate, and data movement, not computation, becomes the bottleneck.
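A rough sizing sketch shows how those per-request gigabytes come about. The model dimensions below are illustrative assumptions in the ballpark of a mid-size model, not any specific product.

```python
# Back-of-the-envelope KV cache sizing. The 2x accounts for storing both
# a key vector and a value vector per token, per layer, per KV head.

def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value):
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value

gb = kv_cache_bytes(
    layers=60,               # transformer layers (assumed)
    kv_heads=8,              # KV attention heads (assumed)
    head_dim=128,            # dimension per head (assumed)
    context_tokens=32_000,   # a long conversation
    bytes_per_value=2,       # 16-bit precision
) / 1e9

print(f"~{gb:.1f} GB of KV cache for one long-context request")
# A handful of concurrent requests like this quickly reaches tens of
# gigabytes in aggregate, which is where the bandwidth pressure comes from.
```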
Starting point is 00:10:35 Anything you can do to reduce the size of the KV cache without loss of accuracy will make AI inference faster and allow for bigger models and more simultaneous users. This has been done by going to lower-precision arithmetic, from 32 bits to 16 to 8, and with new data formats that match the requirements of the specific math. They call this reduction in size quantization, which we can say is a synonym for compression. But in recent years, reducing precision has hit a floor,
Starting point is 00:11:07 with the overhead of managing lower precision causing more work than it saves. Four bits have been tried and generally showed too much degradation, and research projects have looked at 2 bits or even 1 bit, but the industry standard today is 8 bits. The Google TurboQuant team has found a way to reduce precision to 3.5 bits without loss of accuracy or much overhead. It's 3 bits for all the data and one correction bit for half of the data, which is where the 3.5 bits comes from.
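As a toy illustration of that bit budget, the sketch below quantizes a block of values to 3 bits and then spends one extra bit on the half of the values with the largest error. To be clear, this is not the TurboQuant algorithm; it is only a way to see how 3 bits for everything, plus a correction bit for half the data, averages out to 3.5 bits per value.

```python
# Toy 3.5-bit quantizer: uniform 3-bit codes for every value, plus one
# refinement bit for the worst half. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)  # stand-in for a KV block

# Uniform 3-bit quantization over the block's range (8 levels).
lo, hi = x.min(), x.max()
step = (hi - lo) / (2**3 - 1)
codes = np.round((x - lo) / step).astype(np.int8)   # the 3-bit codes
dequant = lo + codes * step

# Spend one extra bit on the half with the largest error: nudge by a quarter step.
err = x - dequant
worst = np.argsort(-np.abs(err))[: len(x) // 2]
dequant_refined = dequant.copy()
dequant_refined[worst] += np.sign(err[worst]) * (step / 4)

print("avg bits per value:", 3 + 0.5)  # 3 bits everywhere + 1 bit for half
print("RMS error, 3-bit only:", float(np.sqrt(np.mean((x - dequant) ** 2))))
print("RMS error, 3.5-bit   :", float(np.sqrt(np.mean((x - dequant_refined) ** 2))))
```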
Starting point is 00:11:38 TurboQuant cuts KV cache size by 4 to 6x, reducing bandwidth pressure and increasing tokens per second and users per GPU. The added compute for quantization, dequantization, projections, etc., is pretty small, often single-digit to low-double-digit percent, and the overhead is usually amortized via batching. So its effectiveness scales with the total KV cache footprint across all users and models. It is not a per-request thing. However, if you're compute-bound,
Starting point is 00:12:16 i.e., short contexts, small models, low concurrency, then the KV cache is already small and fast, and the extra compute provides no relief and may slightly increase latency. But people who buy a lot of GPUs are usually dealing with large KV caches, and this algorithm promises to become standard in short order. This is why its announcement impacted stock prices. But whether this means lower demand for memory is not so clear. Demand seems insatiable, so a reduction in costs can actually unlock additional demand rather than reduce it. India is among the list of countries that want to redouble their chip manufacturing capacity across the board, not just at the high end, a list that includes all major economies and even Taiwan and the United States,
Starting point is 00:13:07 which already have leading positions. India is coming from behind and has a way to go, but it's important to track its progress and expect it to move fast. The Dataquest publication in India has a story about India's semiconductor ecosystem entering a period of transformation, quote-unquote, with the India Electronics and Semiconductor Association reporting that the country's chip production capacity could reach 75 to 80 million units per day by late 2026 or early 2027 as new facilities come online. This is part of a long-term strategy to build the ecosystem and supply chains in India and to go up the technology curve. Yes, the numbers are very small, but that's not quite the point yet, as you mentioned. For now, India's semiconductor effort targets chip assembly and testing at the low end.
Starting point is 00:14:02 For example, Micron recently launched a facility for ATMP, which stands for assembly, testing, marking, and packaging. Other projects on the way are led by Tata Electronics, Kaynes Technology, and CG Power and Industrial Solutions. India right now does not produce its own wafers. The industry metric is, of course, wafers rather than chips, because a wafer produces a lot more small chips than big ones. So, to put it in perspective, 80 million chips per day, assuming smallish chips at the low end, could be something like 2,000 to 4,000 wafers per day, and maybe even less. By contrast, rough ballpark estimates have China processing 300,000 wafers per day, Taiwan and Korea 200,000 each, Japan 150,000, and the U.S. and Europe 100,000 wafers per day each.
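For a rough sense of that chips-to-wafers conversion, here is a back-of-the-envelope sketch. The wafer size, die sizes, and yield are illustrative assumptions, not figures from the story.

```python
# Crude chips-to-wafers estimate: usable wafer area divided by die area,
# times yield, ignoring edge losses and scribe lines.
import math

WAFER_DIAMETER_MM = 300        # standard wafer size (assumed)
CHIPS_PER_DAY = 80_000_000     # the reported 80 million units per day

def dies_per_wafer(die_area_mm2, yield_fraction):
    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    return wafer_area / die_area_mm2 * yield_fraction

for die_area in (1, 2, 5):  # mm^2 -- very small, low-end chips
    per_wafer = dies_per_wafer(die_area, yield_fraction=0.9)
    wafers_per_day = CHIPS_PER_DAY / per_wafer
    print(f"{die_area} mm^2 die: ~{per_wafer:,.0f} chips/wafer, "
          f"~{wafers_per_day:,.0f} wafers/day")
```

With tiny dies of a few square millimeters, the 80-million-chips-per-day figure works out to a few thousand wafers per day, in line with the estimate above.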
Starting point is 00:14:57 Again, those are very broad-brush estimates to show order of magnitude. Taiwan, of course, leads at the high end, the high-value end, and South Korea has the advantage in memory production. There's a report from the Hong Kong-based Longbridge Securities analyst firm that TSMC is facing a demand crunch for its advanced 2-nanometer chip production, prompting NVIDIA to reconsider its Feynman AI platform design. Feynman is the company's next-generation GPU microarchitecture and AI platform, announced for 2028 and the successor to the Rubin platform. And speaking of shortages, we took note of a story in the
Starting point is 00:15:38 Wall Street Journal that Elon Musk has unveiled a $20 billion, quote, TeraFab chip manufacturing project, in which he says Tesla will produce processors, produce memory, and package processors under one roof. One thing that struck us was a quote from Musk, who said that relying on current fab companies would provide Tesla with a mere 2% of what he said his companies need for Tesla's robot cars and humanoid robots, and for SpaceX's AI data centers to fuel xAI's plans. On the TSMC story, their 2-nanometer shortage affects NVIDIA, but also competitors like Meta, and their usual big customers Apple, Qualcomm, AMD, and Broadcom, which has been ascending. TSMC has also raised prices in the past and is presumably signaling additional price hikes.
Starting point is 00:16:30 Nvidia has secured significant A16 allocations, 1.6-nanometer class in other words, but faces 2-nanometer competition from Apple and others. As for new chip factories at the high end, there were news stories about Sam Altman aiming to raise $7 trillion for new fabs a couple of years ago. That effort was abandoned in favor of in-house design and a partnership with Broadcom. Tesla mentioned its own GPU in 2019, announced it in 2021 as the D1 GPU and the Dojo supercomputer, and said production of the Dojo system had started in 2023, only to shut it all down in 2025.
Starting point is 00:17:15 Building a fab from scratch is a monumentally more difficult task, so we shall see. All right, that's it for this episode. Thank you all for being with us. HPC News Bytes is a production of OrionX. Shaheen Khan and Doug Black host the show. Every episode is posted on OrionX.net. If you like the show, please rate and review it. Thank you for listening.
