In The Arena by TechArena - How Hedgehog Brings Hyperscaler Networking to Enterprise AI
Episode Date: January 28, 2026
Hedgehog CEO Marc Austin joins Data Insights to break down open-source, automated networking for AI clusters: cutting cost, avoiding lock-in, and keeping GPUs fed from training to inference.
Transcript
Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome in the arena. My name's Allyson Klein, and today is the Data Insights episode, which means I'm here with Jeniece Wnorowski of Solidigm.
Hey, Jeniece, how's it going?
Hi, Allyson. It's going great. How are you?
I'm doing great. I think this is our last episode for 2025, so it's
been a year of telling stories about data insights, and we're really leaving the year with a
fantastic guest. Tell me who's with us today. Yes, we are leaving the year with a fantastic guest
in all things, the year of AI, right? But not a year of AI without talking about how all of this
hangs together. So today, we're actually sitting down with an organization called Hedgehog.
So if you haven't heard of Hedgehog, you should really check them out. And we are joined by Marc Austin,
who is the CEO and co-founder of Hedgehog.
Marc, welcome to the show.
Thanks for having me, Jeniece and Allyson, thank you.
I'm so excited to talk to you.
Why don't we just start, since Hedgehog has never been on the program before,
why don't you just start with a quick intro to yourself and the company
and how you came to form Hedgehog?
Yeah, so I'm Marc Austin.
I'm co-founder and CEO of Hedgehog,
and we founded the company to enable enterprises and government agencies to network like hyperscalers.
We didn't really even know about neoclouds; neocloud wasn't a term when we founded the company.
And what that means is enabling everybody to run their own infrastructure with the same kind of convenience that you have with public cloud,
but hopefully more cost effectively and in a fully automated fashion.
So that's really our mission.
And, you know, I came up with that concept after I was at Cisco for six years working on mass scale network automation, just observing how Cisco's largest customers operate really big networks at scale and just kind of took those lessons learned.
And so let's go do this for everybody else, but let's do it in a hyperscaler way, which means doing it the same way that the big three, Amazon, Microsoft, and Google, do, and designing AI networks the same way that they do.
Got it.
And so with that, Marc, I mean, Hedgehog is really tackling an interesting challenge where you guys are really helping organizations build AI capabilities with the greater efficiency and scale of the cloud.
How do you describe the company's mission and how does this set your approach apart from others?
Yeah, so it's that network like a hyperscaler approach I was talking about.
And what that means is that rather than taking a reference architecture from one vendor,
that typically is proprietary, it's vertically integrated, it locks you into that vendor for however long you run this infrastructure, and it requires you to hire specially trained resources, people certified by that vendor.
And it's all very expensive.
Rather than doing all that, do what Amazon, Microsoft, and Google do.
So start with open source software that enables you to diversify your hardware supply chain, which is super important in the global economy today.
Right? So if you have a supply shock, say a global pandemic or a trade war, that can limit your
ability to scale, because supply becomes constrained and you can't add capacity to your network
when you need to. It also means that prices go up. So if you want to mitigate that risk, you diversify,
and you're able to acquire equipment from the vendor that you want to acquire it from, when you
need it, at a competitive price. So the open source software approach makes that
happen. And then the other thing you need to do is just fully automate the installation,
configuration, and operation of your network. And you need to make that AI network self-service.
So most people are familiar with AWS. If you go to AWS to rent a GPU, when you set up,
you get a thing called a virtual private cloud by default. And that gives you privacy, or it gives you
a segment in this data center that Amazon operates. So we provide the same kind of virtual private cloud
service that you can give to your different development teams if you're an enterprise, or to your
different tenants if you're a neocloud. And what it means is that your users operate on your network
the same way they would on AWS or Azure or Google Cloud. We've been talking about
repatriation of workloads or delivery of private cloud infrastructure for a long time. And it seems
like the AI moment is really driving that home. Do you think that that's because people want to
control their data? Does it mean that this workload is becoming much more critical? What do you think
are the drivers moving your customers towards operating their own on-prem infrastructure
in this manner? Yeah, people have been talking about cloud repatriation for a long time. And I'd say
for like legacy workloads, yeah, some of it has been repatriated, but a lot of times it's don't fix
what isn't broken, just let it be. But when you start thinking about AI and you understand the infrastructure
cost of AI, which is significantly higher than the cost of running an internet website or a SaaS
product or mobile application, like way more expensive. And you consider that if you're building
an AI business, maybe it's a software company, or you're automating a process as a large
enterprise, you'll come to realize that the gold in your gold mine is your data, not source
code and software. It's the data. So you've really got to take care of that data.
You've got to defend it, because that data is your business moat. And if you give that
data away to a public model, or maybe a cloud service provider who wants to compete with you in a segment,
they're bridging your moat. A good example of this is our customer Zipline. They're a very well
capitalized startup now. I think they've raised over $1.3 billion, but they do automated drone delivery.
That's cool. Yeah. So if you live in Dallas-Fort Worth or in Arkansas, you can order from
Walmart or Buffalo Wild Wings or Chipotle, a number of different restaurants, the same way you
would through DoorDash or Uber Eats. But the difference is your delivery will reach your home in
probably about 10 minutes because a drone flies to that store, picks up your order,
flies back to your house, and then drops it off on your front porch. So all that requires
AI to navigate the drone and to do the precision drop and also to manage the airspace so that
drones aren't flying into aircraft, crashing, and that sort of thing.
They were running those workloads on AWS, and they kind of came to that same realization.
They said, hey, look, we want to have control of all of our infrastructure, all of our data.
It's our gold.
And people aren't going to pay for drone delivery if it's expensive, right?
It's got to be the same price or less than what DoorDash or Uber Eats charges, right?
So cost matters to scale.
So they decided to build their own private cloud.
And the cool thing about Zipline is, you know, they're able to do that with a very
small team. The same people, same process, same tools, same skills they already had for running
workloads on AWS, those same people now run a private cloud with Hedgehog software. They didn't need
to go hire a bunch of network engineers. They were able to do it with the same DevOps team they already
had. And when they did that, they cut their infrastructure cost by 70%. Wow. That's really cool.
I want to keep talking about these use cases because that's awesome. It's just a vivid visual in my mind.
But from a technology standpoint, what problem does Hedgehog solve in terms of having an open source network fabric?
Yeah.
So we're open source software.
In a nutshell, it's all the software you need to run an AI network for private cloud or even for multi-tenant public cloud.
It's all open source.
We deliver it as an appliance.
You just go to hedgehog.cloud, click the download button, download it, and run it.
And it will automatically configure the network for your use case.
So the typical use cases in AI: training is what a lot of people have been doing.
So that's training a new model; fine-tuning an existing model, maybe with your own private enterprise data; RAG, or retrieval-augmented generation; and inference.
And all of those use cases have different network topologies because they have different compute and storage requirements.
If you choose the alternative, you're going to hire a network architect who's going to design all that stuff for you,
then get on a command line interface, a terminal, and do a bunch of commands with a network
operating system and configure this whole thing. And it will take several weeks. With Hedgehog, you don't
need that. You just run this appliance and it automatically installs all the software you need. It
configures all the network devices that are in your data center. And then it gives you continuous
observability of all the network devices in the data center,
and it gives you a cloud native API
that you can either put in your own
equivalent of AWS console
or just give to your customers
and then they're able to start provisioning resources
on your cloud, same way they would on public cloud.
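To make that self-service piece concrete, here is a minimal sketch of what provisioning a VPC through a Kubernetes-style API could look like. The API group, version, and field names are illustrative assumptions for this sketch, not Hedgehog's actual schema.

```python
# Hypothetical sketch: a tenant provisions an isolated VPC by creating
# a custom resource in the control cluster. All names are assumptions.
from kubernetes import client, config

config.load_kube_config()  # connect to the controller cluster
api = client.CustomObjectsApi()

vpc = {
    "apiVersion": "example.fabric.io/v1alpha1",  # assumed API group/version
    "kind": "VPC",
    "metadata": {"name": "team-a-training"},
    "spec": {
        # One isolated subnet for this team, analogous to an AWS VPC.
        "subnets": {"default": {"subnet": "10.10.1.0/24", "vlan": 1001}},
    },
}

api.create_namespaced_custom_object(
    group="example.fabric.io",
    version="v1alpha1",
    namespace="default",
    plural="vpcs",
    body=vpc,
)
```

The point is the shape of the workflow: the tenant declares what they want, and the fabric configures the switches to match, the same way a public cloud would.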
That's amazing.
I looked at your site and I learned a little bit about performance
and you said that you can deliver speeds up to 200 gigabits
per second of data rates.
You're delivering a very efficient network
that is not bottlenecky, which is a huge concern when you're talking about AI clusters.
How do you see this operating within an AI training and then an inference workload scenario for a client?
Yeah. I think the 200 gigabits per second that you saw is in reference to our gateway.
So I'll tell another customer story to illustrate the point.
So one of our tenants is FarmGPU, or one of our customers, rather, is FarmGPU.
They operate the Solidigm AI Central Lab.
So that is an 800 gig fabric.
It's a fabric for training.
So what that means is that there are actually two network fabrics in there.
There's a back-end network that connects B200 and H200 GPU servers together
and enables those GPUs to share memory with each other
and do it in a high-performance way,
where you actually are utilizing that 800 gigabit Ethernet fabric
to its full capacity.
We do that by managing congestion.
What happens in these training workloads is the network gets congested, which means that
GPUs get underutilized, which means that your training workload takes longer to run.
And in some cases, you may even run out of memory, and in which case the whole training
run crashes and you've got to go back to your last checkpoint.
And it's super costly because you're paying for all that time on the GPUs.
So we solved that by delivering congestion management that maximizes the bandwidth or throughput
of that network and gives you better GPU
utilization and shorter training time. And in that network in particular, SemiAnalysis tested it for
their ClusterMAX rating system, and it got really good results. So these are NCCL tests,
the NVIDIA Collective Communication Library tests. In that independent testing scenario, that Hedgehog
network outperformed the NVIDIA Spectrum-X reference cloud at Israel-1, as well as an InfiniBand
cloud at Crusoe in Iceland.
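For readers who want a feel for what those NCCL tests measure, here is a minimal PyTorch sketch of the collective they exercise, an all_reduce across every GPU in a job. It assumes a CUDA cluster and a torchrun launcher, and it is an illustration of the technique, not SemiAnalysis's actual benchmark harness.

```python
# Minimal all_reduce bandwidth probe in the spirit of nccl-tests.
# Launch with e.g.: torchrun --nproc_per_node=8 allreduce_probe.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # NCCL drives the GPU fabric
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

tensor = torch.ones(256 * 1024 * 1024, device="cuda")  # 1 GiB of float32

for _ in range(5):                           # warm-up iterations
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

if dist.get_rank() == 0:
    n = dist.get_world_size()
    size_gib = tensor.numel() * 4 / 2**30
    # nccl-tests "bus bandwidth" formula for all_reduce: 2(n-1)/n * size/time.
    busbw = (2 * (n - 1) / n) * size_gib / elapsed
    print(f"all_reduce avg {elapsed * 1e3:.1f} ms, ~{busbw:.1f} GiB/s bus bandwidth")

dist.destroy_process_group()
```

On a congested fabric, the reported bus bandwidth drops well below the link rate, which is exactly the GPU-starving effect the congestion management is meant to prevent.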
That's the high performance in the back end network.
Then there's a separate front end network, which gets data in and out of the cluster,
and that data typically then gets stored in a storage array.
In this case, it's a Solidigm storage cluster,
and there's a data platform typically that people use to then load that data into their training run.
So that also is an 800 gigabit per second network,
but it has a connection to the Internet.
And for people to run the kind of training workloads that they want to do,
a lot of those workloads involve a lot of images or video, multimedia content.
And FarmGPU was getting users, tenants, trying to ingest terabytes of data per day.
And they had an internet connection with a firewall that topped out
at, I think, 10 gigabits per second.
Oh, crazy.
Yeah.
So that's a huge bottleneck.
You're trying to get terabytes of data in and out of everything.
And that's a big problem.
You know, you're trying to sell more storage services and more storage product.
The legacy alternative to that is you go buy a high-performance router from somebody like Juniper or Nokia,
and then you go buy a firewall from somebody like Palo Alto Networks,
so that internet connection is secure, and you spend a lot of money on it.
The alternative to that is you take that same open-source AI network fabric that I've described for Hedgehog,
and you extend it to a gateway.
And that gateway, it's just an x86 server.
It has an NVIDIA ConnectX-7 network interface card in it.
And we turn that commodity server into a high-performance router that gets packets in and out of that cluster at up to 200 gigabits per second.
And we do that with minimum viable security features so that you don't have to go buy an expensive firewall on top of that.
Because most people only use maybe 10, 20% of the features that come with an enterprise firewall for this particular use case.
So what that ends up looking like, if you're a neocloud, is the equivalent of offering
AWS Transit Gateway to your tenants, which is how people peer VPCs across multiple
availability zones on public cloud, and AWS Network Firewall, which is a cloud-based,
software-based set of security services.
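To put the 10 gigabit firewall and the 200 gigabit gateway side by side, here is a quick back-of-envelope calculation. The dataset sizes are illustrative, and real-world throughput always lands somewhat below line rate.

```python
# Back-of-envelope: time to ingest a dataset through a 10 Gb/s firewall
# versus a 200 Gb/s gateway. Illustrative arithmetic only.
def ingest_hours(dataset_tb: float, link_gbps: float) -> float:
    bits = dataset_tb * 1e12 * 8              # dataset size in bits
    return bits / (link_gbps * 1e9) / 3600    # seconds -> hours

for tb in (1, 10, 50):
    print(f"{tb:>3} TB: {ingest_hours(tb, 10):6.2f} h at 10 Gb/s, "
          f"{ingest_hours(tb, 200):5.2f} h at 200 Gb/s")
# 10 TB takes roughly 2.2 hours at 10 Gb/s but about 7 minutes at
# 200 Gb/s, and the slower link is shared by every tenant at once.
```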
That's awesome.
Yeah.
So you're giving us all kinds of good customer examples.
So we're going to keep on this train, asking you about the hardware ecosystem now.
You know, you're working with companies also like Celestica, Dell, Supermicro.
How do these partners come together to simplify AI data center deployments with you guys?
Yeah, great question.
So I talked earlier about networking like a hyperscaler and how hyperscalers build out their infrastructure.
Celestica, Dell, Supermicro, and Edgecore, which is another partner of ours.
Some people will refer to them as white box vendors.
But what they typically do is they build hardware for hyperscalers.
Amazon or Microsoft or Google reaches out to a contract manufacturer and says,
here's our spec for the next switch that we want to have.
And then those vendors collaborate with the hyperscaler on the design.
And typically it will also be in collaboration with the Open Compute Project,
which is a community that was started by Microsoft and Meta
to create an ecosystem of hardware vendors to feed their hyperscale growth,
and you can think of it as the equivalent of the Linux Foundation, but for hardware.
So they're open spec devices, which means that there's some standardization.
And so that's sort of the first step in networking like a hyperscaler: choosing OCP, Open Compute Project, hardware.
Those are typically the companies that we partner with.
So our open source software runs on their open compute spec hardware.
And combined, we deliver that high performance networking at half the cost of vertically integrated,
proprietary alternatives.
Now, when I think about this, one of the things that I'm thinking about is that this requires
not just networking what's in the cluster, but you may be fetching data from remote sources.
When you look at the evolution of this, and it's happening really quickly, why are storage
architectures in particular becoming critical when considering delivery of data as models and
applications require it?
Well, I mean, if you look at the extreme case, a foundational
model like OpenAI's, right? I mean, the data set, the training set is all of the data that's publicly
available on the internet. That's a lot of data. So you've got to put that somewhere, which means a
really big storage array, which means a data platform that can scale out horizontally to a large
number of drives. So the same thing goes in an enterprise use case. If you're fine-tuning one of those
foundational models, hopefully you're choosing an open source model and then you're taking your very
sensitive, unique, private enterprise data, say, flight paths of all of your drones, for example,
in the Zipline case, and you're putting all that data into that storage cluster.
And then you're using that data as you're running training or fine-tuning or RAG jobs.
So yeah, there's a lot of data.
You need a lot of storage.
And just like the network can be a bottleneck on your training duration, so can storage performance.
So everything needs to be high performance in this environment.
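As a rough illustration of that point, here is a paper-napkin feasibility check. All the numbers below are assumptions for the sketch, not measurements of any particular cluster.

```python
# Can the storage tier keep the GPUs fed? Illustrative numbers only.
epoch_data_tb = 20          # data read per training epoch, in TB
storage_read_gb_s = 40      # aggregate storage read bandwidth, GB/s
gpu_epoch_hours = 1.0       # compute time per epoch if never starved

read_hours = epoch_data_tb * 1e12 / (storage_read_gb_s * 1e9) / 3600
print(f"storage needs {read_hours:.2f} h/epoch vs {gpu_epoch_hours:.2f} h of compute")

# If read_hours exceeds the compute time, storage (not the GPUs)
# sets your training duration and the accelerators sit idle.
```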
So Hedgehog is really passionate about data privacy and control, right?
That's like one of your guys's core philosophies.
How do you see that shaping the next generation of AI infrastructure?
Yeah, so far we've been talking a lot about training and fine-tuning and RAG,
which are techniques that people are going to use to ultimately end up with models that people use.
And when people are using a model, that's inference, right?
So that's where the model is doing its job. It's intelligence that operates
artificially on your data as the user. So that data privacy for the user is super important.
But that user data also sort of builds this data flywheel for whoever's operating this AI
business. Data privacy and control of that is again super important. But that inference to deliver
the user experience you want, the inference is going to happen at what we call the data edge,
which is really just close to the source of data. And the reason for that is you want really low
latency inference. So as we're talking to each other, you know, we respond to each other pretty
quickly, unless maybe we have a connection issue and Jeniece couldn't hear the end of the last question
and we're waiting for her to ask the next one. That'd be a good example of network latency
that created a gap in the conversation. So the way that you eliminate those gaps and make it
more human-like in its experience, it's the same way with people: you put them in the same
room, and then you're able to talk. Well, if the application, the inference model, is close to the source
of data, then it means it's a short network hop, which means that there's low network latency,
which means that it's lightning fast and it works well. And for something like autonomous driving,
which a lot of us have in our cars now, or autonomous flying, like a drone that may be operating
near the flight path at your local airport, you want low latency inference. So the network just extends
outward, I guess, to a whole bunch of small data centers. So in the case of a drone,
it's a very small data center.
It's like one GPU that's in the drone.
Same thing.
Your car, it's a mobile data center now.
There's going to be lots of AI compute all over the place, and it's all collecting data,
about you, about us, about the world around it.
And so, yeah, you want that data to remain private and secure.
Makes a lot of sense.
Marc, I really enjoyed this conversation.
You and the Hedgehog team have really taken the AI world by storm.
Congratulations on your success.
Thank you.
And one thing that I think about is that you're doing it the right way.
You're investing in an open source and an open infrastructure.
I love that.
As we look forward to 2026, I mean, I think that we're just on the cusp of the broad proliferation of this in enterprise.
So your timing for solution delivery is awesome.
What's next for Hedgehog?
And are there new technologies or new partnerships on the horizon that you want to give us a sneak peek about?
Yeah, well, we just did our first release of our Gateway a few weeks ago.
So there's a lot more features that we're going to bring to that.
Over time, there will be more security features.
I don't want to call it a firewall, because we're not ever going to try to compete in the firewall market,
but there will be more security features that eliminate the need for you to buy a separate firewall.
There will be load balancing, which is sort of a whole other networking segment that F5 has traditionally dominated,
that you'll need to have for AI clusters.
And then we were just talking about storage and data platforms.
So we're starting to work with a lot more data companies now.
And we were just talking to one this week where they deliver their product with an NVMe fabric.
So NVMe is a particular protocol that's important in the storage domain.
And they ship their product with a pair of data center switches.
And they're running into supply constraints from the one vendor that they've been working with.
So they need to diversify their supply chain.
So for their product, they need open networking in their product.
And then that NVMe fabric that they own has to integrate with our front-end network software.
So that's one player in the data platform space.
There are others.
They're all in play in some shape or fashion in these different AI architectures.
And collaborating with them is a super big piece.
The other big piece is on compute.
There are a number of players who are building what
I'll just generally categorize as AI development platforms,
and they need really the same thing that you need in a network,
right? In the network, you need switches and routers
provisioned with a network operating system and configured for their use
in the overall data center.
So the same thing goes with GPU servers.
For almost every neocloud, they're going to need to offer managed Kubernetes
and Slurm.
And for any of these AI development platforms that then run on that neocloud
infrastructure or private infrastructure for that matter, they need basics on the server,
the operating system, and a Kubernetes distribution. So it turns out we're already in the business
of installing, configuring, and operating operating systems and Kubernetes. I didn't really mention this,
but the way that we configure the network is we model the network as a Kubernetes cluster. So
when you run Hedgehog, we install Kubernetes on a controller server, and that's the thing that
enforces state in the network configuration.
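As a toy illustration of that declarative pattern, here is a minimal reconcile loop. The switch names and device calls are hypothetical stubs for the sketch, not Hedgehog's actual controller code.

```python
# Toy sketch of declarative network state: desired config lives in the
# control cluster, and a loop drives each device toward it.
import time

desired = {"leaf-01": {"mtu": 9100, "vlans": [1001, 1002]}}  # assumed state

def observe(switch: str) -> dict:
    """Read the running config from the device (stubbed here)."""
    return {"mtu": 1500, "vlans": [1001]}

def apply(switch: str, want: dict) -> None:
    """Push the desired config to the device (stubbed here)."""
    print(f"{switch}: applying {want}")

while True:
    for switch, want in desired.items():
        have = observe(switch)
        if have != want:        # drift detected: converge, don't page a human
            apply(switch, want)
    time.sleep(30)              # reconcile on an interval, like a controller
```

This is the same enforcement idea Kubernetes uses for pods, applied to switch configuration: you declare the end state and the controller keeps reality matched to it.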
So we've got really good domain experience there.
If people need help getting their GPU servers provisioned
with Linux and with Kubernetes, and maybe even Slurm on Kubernetes,
we're game.
Awesome.
Wow.
This is probably the best way to end the Data Insights podcast this year with this topic.
Mark, this has been really awesome.
Your examples have blown me away.
I know our audience is going to be eager to learn more.
where can you send folks to get in contact with you and get more information on Hedgehog?
Sure. So our URL is Hedgehog.cloud. And if you go there, you should see a button on the
homepage. It says download. Nice. And when you click that button, you can download our software.
And we have like a learning portal that we set up. And we've got a whole team of engineers who are
happy to help you work through those exercises. You can run the software. The first way is a virtual lab.
It's a virtual data center.
You can run it on public cloud if you want, and you can play with a variety of different fabrics.
So you can experiment with a training fabric or an inference fabric.
And what we do is we run virtual switch images.
So they're virtual switches.
They're all connected to each other and to virtual servers.
And then you can exercise the API to partition that virtual data center into virtual private clouds,
and then experiment with peering those clouds and updating software
and observing network performance, all the other things that the product does.
So that's the fastest way to get going.
And then if you're ready to actually pull the trigger on building an AI data center,
we have a number of partners who can help you with all of the hardware components that you need,
the installation services that you need, and getting our software running as well.
But the software, honestly, is the easy part.
We've made it super easy.
The hard part is figuring out all the stuff you've got to buy because there are a lot of moving parts
when you're standing up these really high performance networks.
I think what we know is that people want to do it.
And I can't wait to hear more customer stories from you next year.
Marc, we'd love to have you back on the show sometime.
Thank you so much for the time today.
It was so informative.
I learned a ton.
And Jeniece, thanks so much for the collaboration on Data Insights.
It was another great episode.
Thanks. Thanks, guys.
Take care.
Thanks for joining Tech Arena.
Subscribe and engage at our website, TechArena.com.
All content is copyright by TechArena.
