In The Arena by TechArena - How Hedgehog Brings Hyperscaler Networking to Enterprise AI

Episode Date: January 28, 2026

Hedgehog CEO Marc Austin joins Data Insights to break down open-source, automated networking for AI clusters: cutting cost, avoiding lock-in, and keeping GPUs fed from training to inference.

Transcript
Starting point is 00:00:00 Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allison Klein. Now, let's step into the arena. Welcome to the arena. My name's Allison Klein, and today is the Data Insights episode, which means I'm here with Janice Naurowski of Solidigm. Hey, Janice, how's it going? Hi, Allison. It's going great. How are you? I'm doing great. I think this is our last episode for 2025, so it's been a year of telling stories about data insights, and we're really leaving the year with a fantastic guest. Tell me who's with us today. Yes, we are leaving the year with a fantastic guest
Starting point is 00:00:45 in all things, the year of AI, right? But not a year of AI without talking about how all of this hangs together. So today, we're actually sitting down with an organization called Hedgehog. So if you haven't heard of Hedgehog, you should really check them out. And we are joined by Marc Austin, who is the CEO and co-founder of Hedgehog. Marc, welcome to the show. Thanks for having me, Janice and Allison. Thank you. I'm so excited to talk to you. Why don't we just start, since Hedgehog has never been on the program before,
Starting point is 00:01:14 why don't you just start with a quick intro to yourself and the company and how you came to form Hedgehog? Yeah, so I'm Marc Austin. I'm co-founder and CEO of Hedgehog, and we founded the company to enable enterprises and government agencies to network like hyperscalers. We didn't really even know about neoclouds; neocloud wasn't a term when we founded the company. And what that means is enabling everybody to run their own infrastructure with the same kind of convenience that you have with public cloud,
Starting point is 00:01:52 but hopefully more cost effectively and in a fully automated fashion. So that's really our mission. And, you know, I came up with that concept after I was at Cisco for six years working on mass-scale network automation, just observing how Cisco's largest customers operate really big networks at scale, and just kind of took those lessons learned. And so let's go do this for everybody else, but let's do it in a hyperscaler way, which means doing it the same way that the big three do, Amazon, Microsoft, Google, and design AI networks the same way that they do. Got it. And so with that, Marc, I mean, Hedgehog is really tackling an interesting challenge where you guys are really helping organizations build AI capabilities with the greater efficiency and scale of the cloud. How do you describe the company's mission, and how does this set your approach apart from others? Yeah, so it's that network-like-a-hyperscaler approach I was talking about.
Starting point is 00:02:52 And what that means is that rather than taking a reference architecture from one vendor, which is typically proprietary and vertically integrated, locks you into that vendor for however long you run this infrastructure, and requires you to hire specially trained resources, people trained on that vendor, and it's all very expensive. Rather than doing all that, do what Amazon, Microsoft, and Google do. So start with open source software that enables you to diversify your hardware supply chain, which is super important in the global economy today, right? So if you have a supply shock, say a global pandemic, or say a trade war, that can limit your ability to scale, because supply becomes constrained and you can't add capacity to your network when you need to. It also means that prices go up. So if you want to mitigate that risk, you diversify,
Starting point is 00:03:47 and you're able to acquire equipment from the vendor that you want to acquire it from, when you need it, at a competitive price. So the open source software approach makes that happen. And then the other thing you need to do is just fully automate the installation, configuration, and operation of your network. And you need to make that AI network self-service. So most people are familiar with AWS. If you go to AWS to rent a GPU, when you set up, you get a thing called a virtual private cloud by default. And that gives you privacy, or it gives you a segment in this data center that Amazon operates. So we provide the same kind of virtual private cloud service that you can give to your different development teams if you're an enterprise, or to your
Starting point is 00:04:32 different tenants if you're a neocloud. And what it means is that your users operate on your network the same way they would on AWS or Azure or Google Cloud. We've been talking about repatriation of workloads, or delivery of private cloud infrastructure, for a long time. And it seems like the AI moment is really driving that home. Do you think that that's because people want to control their data? Does it mean that this workload is becoming much more critical? What do you think the drivers are that are moving your customers towards the operation of their own on-prem infrastructure in this manner? Yeah, people have been talking about cloud repatriation for a long time. And I'd say for, like, legacy workloads, yeah, some of it is repatriated, but a lot of times it's don't fix
Starting point is 00:05:20 what isn't broken, just let it be. But when you start thinking about AI and you understand the infrastructure cost of AI, which is significantly higher than the cost of running an internet website or a SaaS product or mobile application, like, way more expensive. And you consider that if you're building an AI business, maybe it's a software company, or you're automating a process as a large enterprise, you'll come to realize that the gold in your gold mine is your data, not source code and software. It's the data. So you've really got to take care of that data. You've got to defend it, because that data is your business moat. And if you give that data away to a public model, or maybe a cloud service provider who wants to compete with you in a segment,
Starting point is 00:06:07 they're bridging your moat. A good example of this is our customer Zipline. They're a very well capitalized startup now. I think they've raised over $1.3 billion, but they do automated drone delivery. That's cool. Yeah. So if you live in Dallas-Fort Worth, or Arkansas, you can order from Walmart or Buffalo Wild Wings or Chipotle, a number of different restaurants, the same way you would through DoorDash or Uber Eats. But the difference is your delivery will reach your home in probably about 10 minutes, because a drone flies to that store, picks up your order, flies back to your house, and then drops it off on your front porch. So all that requires AI to navigate the drone and to do the precision drop, and also to manage the airspace so that
Starting point is 00:06:53 drones aren't flying into aircraft, crashing, and that sort of thing. They were running those workloads on AWS, and they kind of came to that same realization. They said, hey, look, we want to have control of all of our infrastructure, all of our data. It's our gold. And people aren't going to pay for drone delivery if it's expensive, right? It's got to be the same price or less than DoorDash or Uber Eats, right? So cost matters to scale. So they decided to build their own private cloud.
Starting point is 00:07:19 And the cool thing about Zipline is, you know, they're able to do that with a very small team. The same people, same process, same tools, same skills they already had for running workloads on AWS, those same people now run a private cloud with Hedgehog software. They didn't need to go hire a bunch of network engineers. They were able to do it with the same DevOps team they already had. And when they did that, they cut their infrastructure cost by 70%. Wow. That's really cool. I want to keep talking about these use cases because that's awesome. It's just a vivid visual in my mind. But from a technology standpoint, what problem does Hedgehog solve in terms of having an open source network fabric? Yeah.
Starting point is 00:08:02 So we're open source software. In a nutshell, it's all the software you need to run an AI network for private cloud, or even for multi-tenant public cloud. It's all open source. We deliver it as an appliance. You just go to hedgehog.cloud, click the download button, download it, and run it. And it will automatically configure the network for your use case. So the typical use cases in AI are training, which is what a lot of people have been doing, so that's training a new model; fine-tuning an existing model, maybe with your own private enterprise data; RAG, or retrieval-augmented generation; and inference.
Starting point is 00:08:38 And all of those use cases have different network topologies because they have different compute and storage requirements. If you choose the alternative, you're going to hire a network architect who's going to design all that stuff for you, then get on a command line interface, a terminal, and do a bunch of commands with a network operating system and configure this whole thing. And it will take several weeks. With Hedgehog, you don't need that. You just run this appliance, and it automatically installs all the software you need. It configures all the network devices that are in your data center. And then it gives you continuous observability of all the network devices in the data center, and it gives you a cloud-native API
Starting point is 00:09:24 that you can either put in your own equivalent of AWS console or just give to your customers and then they're able to start provisioning resources on your cloud, same way they would on public cloud. That's amazing. I looked at your site and I learned a little bit about performance and you said that you can deliver speeds up to 200 gigabytes
Starting point is 00:09:46 per second of data rates. You're delivering a very efficient network that is not bottlenecky, which is a huge concern when you're talking about AI clusters. How do you see this operating within an AI training and then an inference workload scenario for a client? Yeah. I think the 200 gigabits per second that you saw is in reference to our gateway. So I'll tell another customer story to illustrate the point. So one of our tenants, or one of our customers rather, is FarmGPU. They operate the Solidigm AI Central Lab.
Starting point is 00:10:19 So that is an 800-gig fabric. It's a fabric for training. And what that means is that there are actually two network fabrics in there. There's a back-end network that connects B200 and H200 GPU servers together and enables those GPUs to share memory with each other, and do it in a high-performance way where you actually are utilizing that 800-gigabit Ethernet fabric
Starting point is 00:10:46 to its full capacity. We do that by managing congestion. What happens in these training workloads is the network gets congested, which means that GPUs get underutilized, which means that your training workload takes longer to run. And in some cases, you may even run out of memory, in which case the whole training run crashes and you've got to go back to your last checkpoint. And it's super costly, because you're paying for all that time on the GPUs. So we solve that by delivering congestion management that maximizes the bandwidth, or throughput,
Starting point is 00:11:16 of that network, and gives you better GPU utilization and shorter training time. And in that network in particular, SemiAnalysis tested it for their ClusterMAX rating system, and it got really good results. So these are NCCL tests, the NVIDIA Collective Communications Library tests. In that independent testing scenario, that Hedgehog network outperformed the NVIDIA Spectrum-X reference cloud at Israel-1, as well as an InfiniBand cloud at Crusoe. That's the high performance in the back-end network. Then there's a separate front-end network, which gets data in and out of the cluster,
Starting point is 00:11:55 and that data typically then gets stored in a storage array. In this case, it's a Solidigm storage cluster, and there's a data platform, typically, that people use to then load that data into their training run. So that also is an 800-gigabit-per-second network, but it has a connection to the Internet. And for people to run the kind of training workloads that they want to do, a lot of those workloads involve a lot of images or video, multimedia content. And FarmGPU was getting users, tenants, trying to ingest terabytes of data per day.
Starting point is 00:12:29 And they had an internet connection with a firewall that topped out at, I think it was, 10 gigabits per second. Oh, crazy. Yeah. So that's a huge bottleneck. You're trying to get terabytes of data in and out of everything. And that's a big problem. You know, you're trying to sell more storage services and more storage product.
Starting point is 00:12:47 The legacy alternative to that is you go buy a high-performance router from somebody like Juniper or Nokia, and then you go buy a firewall from somebody like Palo Alto, so that internet connection is secure, and you spend a lot of money on it. The alternative to that is you take that same open-source AI network fabric that I've described for Hedgehog, and you extend it to a gateway. And that gateway, it's just an x86 server. It has an NVIDIA ConnectX-7 network interface card in it. And we turn that commodity server into a high-performance router that gets packets in and out of that cluster at up to 200 gigabits per second.
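To make that bottleneck concrete, here is a quick back-of-the-envelope sketch. The dataset size and the 80% usable-line-rate assumption are illustrative, not figures from the episode; only the 10 Gb/s and 200 Gb/s link speeds come from the conversation.

```python
# Back-of-the-envelope: time to ingest a dataset through a network link.
# Dataset size and efficiency factor are illustrative assumptions.

def ingest_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours to move dataset_tb terabytes over a link_gbps link,
    assuming `efficiency` of line rate is usable for payload."""
    bits = dataset_tb * 1e12 * 8                    # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# 5 TB of training data through a 10 Gb/s firewall vs a 200 Gb/s gateway:
print(f"10 Gb/s:  {ingest_hours(5, 10):.1f} h")     # 10 Gb/s:  1.4 h
print(f"200 Gb/s: {ingest_hours(5, 200):.2f} h")    # 200 Gb/s: 0.07 h
```

At terabytes per day of tenant ingest, the 10 Gb/s path spends hours per dataset just getting data in the door, which is the bottleneck Marc describes.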
Starting point is 00:13:26 And we do that with minimum viable security features, so that you don't have to go buy an expensive firewall on top of that. Because most people only use maybe 10, 20% of the features that come with an enterprise firewall for this particular use case. So what that ends up looking like, if you're a neocloud, is the equivalent of offering AWS Transit Gateway to your tenants, which is how people peer VPCs in multiple availability zones on public cloud, and AWS Network Firewall, which is a cloud-based, software-based security service. That's awesome. Yeah.
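As a purely illustrative sketch of the self-service, declarative style Marc describes: a tenant declares the VPC or peering it wants, and the fabric's API makes it so. The resource kinds, field names, and `submit` stand-in below are invented for this sketch; they are not Hedgehog's real schema, so consult the actual documentation for that.

```python
# Purely illustrative: declarative tenant requests in the self-service
# style described above. Resource kinds and fields are invented here.

vpc_request = {
    "kind": "VPC",                        # hypothetical resource kind
    "metadata": {"name": "team-a"},
    "spec": {
        "subnets": [{"name": "gpu", "cidr": "10.0.1.0/24"}],
        "isolation": True,                # tenant gets its own private segment
    },
}

peering_request = {
    "kind": "VPCPeering",                 # hypothetical: connect two tenant VPCs
    "metadata": {"name": "team-a--team-b"},
    "spec": {"vpcs": ["team-a", "team-b"]},
}

def submit(obj: dict) -> str:
    """Stand-in for POSTing the object to the fabric's API server."""
    return f"accepted {obj['kind']}/{obj['metadata']['name']}"

print(submit(vpc_request))      # accepted VPC/team-a
print(submit(peering_request))  # accepted VPCPeering/team-a--team-b
```

The point of the pattern is that tenants never touch a switch CLI: they describe the end state, and the controller reconciles the devices to match, the same way a VPC request works on public cloud.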
Starting point is 00:14:02 So you're giving us all kinds of good customer examples. So we're going to keep on this train and ask you about the hardware ecosystem now. You know, you're working with companies also like Celestica, Dell, Supermicro. How do these partners come together to simplify AI data center deployments with you guys? Yeah, great question. So I talked earlier about networking like a hyperscaler and how hyperscalers build out their infrastructure. Celestica, Dell, Edgecore is another partner of ours, Supermicro. Some people will refer to them as white box vendors.
Starting point is 00:14:35 But what they typically do is they build hardware for hyperscalers. Amazon or Microsoft or Google reaches out to a contract manufacturer and says, here's our spec for the next switch that we want to have. And then those vendors collaborate with the hyperscaler on the design. And typically it will also be in collaboration with the Open Compute Project, which is a community that was started by Microsoft and Meta to create an ecosystem of hardware vendors to feed their hyperscale growth. And you can think of it as the equivalent of the Linux Foundation, but for hardware.
Starting point is 00:15:14 So they're open-spec devices, which means that there's some standardization. And so that's sort of the first step in networking like a hyperscaler: choosing OCP, Open Compute Project, hardware. Those are typically the companies that we partner with. So our open source software runs on their open-compute-spec hardware. And combined, we deliver that high-performance networking at half the cost of vertically integrated, proprietary alternatives. Now, when I think about this, one of the things that I'm thinking about is that you're not just networking what's in the cluster, but you may be fetching data from remote sources.
Starting point is 00:15:50 When you look at the evolution of this, and it's happening really quickly, why are storage architectures in particular becoming critical when considering delivery of data as models and applications require it? Well, I mean, if you look at the extreme case, a foundational model like OpenAI's, right? I mean, the data set, the training set, is all of the data that's publicly available on the internet. That's a lot of data. So you've got to put that somewhere, which means a really big storage array, which means a data platform that can scale out horizontally to a large number of drives. So the same thing goes in an enterprise use case. If you're fine-tuning one of those
Starting point is 00:16:30 foundational models, hopefully you're choosing an open source model, and then you're taking your very sensitive, unique, private enterprise data, say the flight paths of all of your drones, for example, in the Zipline case, and you're putting all that data into that storage cluster. And then you're using that data as you're running training or fine-tuning or RAG jobs. So yeah, there's a lot of data. You need a lot of storage. And just like the network can be a bottleneck on your training duration, so can storage performance. So everything needs to be high performance in this environment.
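Why storage performance bottlenecks training duration can be sketched with simple arithmetic: the storage cluster has to feed every GPU's data loader at once, and it also has to absorb checkpoint writes. All the numbers below (GPU count, per-GPU read rate, model size, write bandwidth) are invented for illustration, not figures from the episode.

```python
# Back-of-the-envelope: storage as a training bottleneck.
# All numbers are illustrative assumptions, not from the episode.

def required_storage_gbps(num_gpus: int, gb_per_gpu_per_s: float) -> float:
    """Aggregate read throughput (GB/s) needed so data loading never stalls GPUs."""
    return num_gpus * gb_per_gpu_per_s

def checkpoint_seconds(checkpoint_tb: float, write_gb_per_s: float) -> float:
    """Seconds the run spends flushing a checkpoint of checkpoint_tb terabytes."""
    return checkpoint_tb * 1000 / write_gb_per_s

# 64 GPUs each consuming ~0.5 GB/s of training samples:
print(f"{required_storage_gbps(64, 0.5):.0f} GB/s aggregate read")   # 32 GB/s aggregate read

# A 1 TB checkpoint written at 50 GB/s aggregate:
print(f"{checkpoint_seconds(1.0, 50):.0f} s per checkpoint")         # 20 s per checkpoint
```

If the array delivers less than the aggregate read rate, GPUs idle waiting on data; if checkpoint writes are slow, they eat directly into paid GPU time, which is the same cost argument Marc made about network congestion.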
Starting point is 00:17:05 So Hedgehog is really passionate about data privacy and control, right? That's like one of your guys's core philosophies. How do you see that shaping the next generation of AI infrastructure? Yeah, so far we've been talking a lot about training and fine-tuning and RAG, which are techniques that people are going to use to ultimately end up with models that people use. And when people are using a model, that's inference, right? So that's where the model is doing its job, its artificial intelligence operating on your data as the user. So that data privacy for the user is super important.
Starting point is 00:17:41 But that user data also sort of builds this data flywheel for whoever's operating this AI business. Data privacy and control of that is, again, super important. But for that inference to deliver the user experience you want, the inference is going to happen at what we call the data edge, which is really just close to the source of data. And the reason for that is you want really low-latency inference. So as we're talking to each other, you know, we respond to each other pretty quickly, unless maybe we have a connection issue and Janice couldn't hear the end of the last question and we're waiting for her to ask the next one. That'd be a good example of network latency that created a gap in the conversation. So the way that you eliminate those gaps and you make it
Starting point is 00:18:24 more human-like in its experience is the same way you get people together: you put them in the same room, and then you're able to talk. Well, if the application, the inference model, is close to the source of data, then it means it's a short network hop, which means that there's low network latency, which means that it's lightning fast and it works well. And for something like autonomous driving, which a lot of us have in our cars now, or autonomous flying, like a drone that may be operating near the flight path at your local airport, you want low-latency inference. So the network just extends outward, I guess, to a whole bunch of, like, small data centers. So in the case of a drone,
Starting point is 00:19:01 It's a very small data center. It's like one GPU that's in the drone. Same thing with your car; it's a mobile data center now. There's going to be lots of AI compute all over the place, and it's all collecting data, about you, about us, about the world around it. And so, yeah, you want that data to remain private and secure. Makes a lot of sense. Marc, I really enjoyed this conversation.
Starting point is 00:19:25 You and the Hedgehog team have really taken the AI world by storm. Congratulations on your success. Thank you. And one thing that I think about is that you're doing it the right way. You're investing in open source and open infrastructure. I love that. As we look forward to 2026, I mean, I think that we're just on the cusp of the broad proliferation of this in the enterprise. So your timing for solution delivery is awesome.
Starting point is 00:19:50 What's next for Hedgehog? And are there new technologies or new partnerships on the horizon that you want to give us a sneak peek about? Yeah, well, we just did our first release of our gateway a few weeks ago. So there's a lot more features that we're going to bring to that. Over time, there will be more security features. I don't want to call it a firewall, because we're not going to ever try to compete in the firewall market, but there will be more security features that eliminate the need for you to buy a separate firewall. There will be load balancing, which is sort of a whole other networking segment that F5 has traditionally dominated,
Starting point is 00:20:26 that you'll need to have for AI clusters. And then we were just talking about storage and data platforms. So we're starting to work with a lot more data companies now. And we were just talking to one this week where they deliver their product with an NVMe fabric. So NVMe is a particular protocol that's important in the storage domain. And they ship their product with a pair of data center switches. And they're running into supply constraints from the one vendor that they've been working with. So they need to diversify their supply chain.
Starting point is 00:20:57 So they need open networking in their product. And then that NVMe fabric that they own has to integrate with our front-end network software. So that's one player in the data platform space. There are others. They're all in play in some shape or fashion in these different AI architectures. And collaborating with them is a super big piece. The other big piece is on compute. There are a number of players who are building.
Starting point is 00:21:24 I'll just generally categorize them as AI development platforms. They need really the same thing that you need in a network, right? In the network, you need switches and routers provisioned with a network operating system and configured for their use in the overall data center. So the same thing goes with GPU servers. For almost every neocloud, they're going to need to offer managed Kubernetes and Slurm.
Starting point is 00:21:49 And for any of these AI development platforms that then run on that neocloud infrastructure, or private infrastructure for that matter, they need basics on the server: the operating system and a Kubernetes distribution. So it turns out we're already in the business of installing, configuring, and operating operating systems and Kubernetes. I didn't really mention this, but the way that we configure the network is we model the network as a Kubernetes cluster. So when you run Hedgehog, we install Kubernetes on a controller server, and that's the thing that enforces state in the network configuration. So we've got really good domain experience there.
Starting point is 00:22:27 People need help getting their GPU servers provisioned with Linux and with Kubernetes, and maybe even Slurm on Kubernetes; we're game. Awesome. Wow. This is probably the best way to end the Data Insights podcast this year, with this topic. Marc, this has been really awesome. Your examples have blown me away.
Starting point is 00:22:47 I know our audience is going to be eager to learn more. Where can you send folks to get in contact with you and get more information on Hedgehog? Sure. So our URL is hedgehog.cloud. And if you go there, you should see a button on the homepage that says download. Nice. And when you click that button, you can download our software. And we have, like, a learning portal that we set up. And we've got a whole team of engineers who are happy to help you work through those exercises. You can run the software. The first is a virtual lab. It's a virtual data center. You can run it on public cloud if you want, and you can play with a variety of different fabrics.
Starting point is 00:23:28 So you can experiment with a training fabric or an inference fabric. And what we do is we run virtual switch images. So they're virtual switches, they're all connected to each other, and virtual servers. And then you can exercise the API to partition that virtual data center into virtual private clouds, and then experiment with peering those clouds and updating software,
Starting point is 00:23:57 and observing network performance, all the other things that the product does. So that's the fastest way to get going. And then if you're ready to actually pull the trigger on building an AI data center, we have a number of partners who can help you with all of the hardware components that you need, the installation services that you need, and getting our software running as well. But the software, honestly, is the easy part. We've made it super easy. The hard part is figuring out all the stuff you've got to buy, because there are a lot of moving parts when you're standing up these really high-performance networks. I think once people know about it, they're going to want to do it.
Starting point is 00:24:25 And I can't wait to hear more customer stories from you next year. Marc, we'd love to have you back on the show sometime. Thank you so much for the time today. It was so informative. I learned a ton. And Janice, thanks so much for the collaboration on Data Insights. It was another great episode. Thanks. Thanks, guys.
Starting point is 00:24:43 Take care. Thanks for joining Tech Arena. Subscribe and engage at our website, TechArena.com. All content is copyright TechArena.
