The Data Stack Show - 264: Infrastructure as Code Meets AI: Simplifying Complexity in the Cloud with Alexander Patrushev of Nebius
Episode Date: October 1, 2025

This week on The Data Stack Show, Alexander Patrushev joins John to share his journey from working on mainframes at IBM to leading AI infrastructure innovation at Nebius, with stops at VMware and AWS along the way. The discussion explores the evolution of AI and cloud infrastructure, the five pillars of successful machine learning projects, and the unique challenges of building and operating modern AI data centers, including energy consumption, cooling, and networking. Alexander also delves into the practicalities of infrastructure as code and the importance of data quality, and offers actionable advice for those looking to break into the AI field. Key takeaways include the need for strong data foundations, thoughtful project selection, and the value of leveraging existing skills and tools to succeed in the rapidly evolving AI landscape. Don't miss this great conversation.

Highlights from this week's conversation include:

Alexander's Background and Early Career at IBM (1:06)
Moving From Mainframes to Virtualization at VMware (4:09)
Transitioning to AWS and Machine Learning Projects (8:22)
What Was Missed From Mainframes and the Rise of Public Cloud (9:03)
Security, Performance, and Economics in Cloud Infrastructure (12:40)
The Five Pillars of Successful Machine Learning Projects (15:02)
Choosing the Right ML Project: Data, Impact, and Existing Solutions (18:01)
Real-World AI and ML Use Cases Across Industries (19:42)
Building Specialized AI Clouds Versus Hyperscalers (22:08)
Performance, Scalability, and Reliability in AI Infrastructure (25:18)
Data Center Energy Consumption and Power Challenges (28:41)
Cooling, Networking, and Supporting Systems in AI Data Centers (30:06)
Infrastructure as Code and Tooling in AI (31:50)
Lowering Complexity for AI Developers and the Role of Abstraction (34:08)
Startup Opportunities in the AI Stack (38:53)
When to Fine-Tune or Post-Train Foundation Models (43:41)
Comparing and Testing Models With Tool Use (47:49)
Skills and Advice for Entering the AI Field (49:18)
Final Thoughts and Encouragement for AI Newcomers (52:31)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
Transcript
Discussion (0)
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to The Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies.
Before we dig into today's episode, we want to give a huge thanks to our presenting sponsor, RudderStack. They give us the equipment and time to do this show week in, week out, and provide you the valuable content. RudderStack provides customer data infrastructure and is used by the world's most innovative companies to collect, transform, and deliver their event data wherever it's needed, all in real time. You can learn more at rudderstack.com.

Welcome back to The Data Stack Show.
We're here with our guest Alex from Nebius.
Alex, really excited to have you on the show today.
Tell us a little bit about your background, and then we'll jump in.
Hi, John.
Thank you.
Thank you for having me here today. Today I'm leading part of the product team at Nebius, mostly focused on finding ways to make AI infrastructure and AI tuning accessible to a wide range of people, from startups to huge enterprises and frontier labs. But how I got there is quite interesting to me, because I'm always looking for new challenges for myself. And that's why I keep completely changing my role, direction, and technology stack.
I started at IBM, where I was working with the mainframes, big supercomputers, which could actually go to space and keep running there without any errors, even with radiation. That was quite interesting. And then I switched to VMware, quite the opposite of that, you know: virtualization on x86 after mainframes. And after that, I switched to AWS, from private data center virtualization, private cloud, to the public cloud. It was amazing; I really liked my time at AWS. I spent almost six years working on machine learning with different customers, different sizes of customers, different types of projects. And then I met Nebius; they had started to build a new AI-specialized cloud. And it was super interesting to join them as a product manager and help them actually redefine how the infrastructure looks, and create so many of the interesting products that we have today and that we are working on for the future.
Awesome. Yeah. So excited to learn more about that.
And then the show today, we've got a number of topics to dive into,
but what's one that you're excited about chatting through?
I would say that there are a couple of topics that I want to touch on, and I'll be happy to share a lot of insights and answer your questions about them. So I think that we can talk about how AI changed in the last couple of years. We can also talk about how the infrastructure and the software related to AI changed. And I think, because we're on the data show, it's supposed to be about the data.
Yeah.
So I also really want to talk about the data, because, you know, there is a golden rule: garbage in, garbage out. Without the data, there is not much learning at all.
So it's all super important.
Awesome, Alex.
All right.
Well, I'm excited about this.
Alex, so excited to dig into your background. It's really interesting to me, and unique for us as data practitioners, to get to talk to somebody that's spent a lot of time at a way lower level than a lot of us have, with your time at VMware, at IBM, and some of these other companies. So I'm curious, like, how did you get started, you know, in your first job? And then, like, take us through the progression a little bit.
Okay.
I've got you thinking way back.
Yeah, I'm like, where to start.
Yeah.
So, honestly, you know, after university, I joined a small company that was distributing IBM hardware. Then I found out that there was a position at IBM in a really interesting area: the mainframes, the Power Systems. And I just applied. That way, I got past all the interviews, and I was on board at IBM. The mainframe is actually, you know, like a big supercomputer; at that time, they were the most performant, the most reliable. And you know why they call them mainframes rather than, you know, supercomputers? Because it's not only about the performance, it's more about the reliability of those systems.
You were running, like, the business-critical databases. There was a lot of Oracle at that time. You were running business-critical applications which could not be stopped even for a minute, because that would cost you a lot of money; your business process is stopping. So that's why there are so many features built in at the hardware level to make sure that you can always correct problems, mistakes in the memory, and you can retry the instructions on the processor. Because right now, if you have a failed instruction on the processor, or a failed, you know, uncorrectable error in the memory, you'll actually just stop. You'll get a blue screen or something else: if it's Windows, the blue screen; if it's Linux, a kernel panic. On those computers, you had a lot of things, you know, controlling how the memory works, correcting everything, rechecking everything. You could actually retry the instructions on the processor.
And that actually was the place where virtualization, I could say, was created. The first virtualization was actually created there, on the IBM System/360. That's a really long time ago.
So, and then one day I got a call from VMware: we want to have you on the team, will you join us? And I was thinking, okay. You know, the first thing that I remembered was that back in childhood, when I was really young, maybe like seven years old, my father bought me a magazine about computers. And there was an article about VMware, the company who created the virtualization for x86. I was like, okay, that's interesting. That's a company where I probably want to be, to learn. They probably know a lot about virtualization and how to build it.
And I joined them, and it was also really fun. We were working on how to take one x86 server and make hundreds out of it, how to put hundreds of virtual machines on the same server. And they were almost the only one in the beginning. Then you had KVM, then you had other hypervisors, and so on. But they were the leader. They were really good. There were a lot of features.
And back in that time, I was living in Russia. And somehow we had an email from AWS: we want to build a team, we want to have you on the team. I said, wow, AWS. That's the space to be. And I also passed all the interviews, and we moved with the family to Luxembourg, where I'm still living, by the way. Quite a unique country, I would say.
So I started to work at AWS, and honestly, at that time in AWS you could work on almost anything; you could do whatever you wanted. That was a time when they were growing extremely fast, you know; so many people joined after that. I was starting to work a lot on data science projects, on machine learning. There was a time when, you know, it was data science: people were actually playing with Random Cut Forest, with XGBoost. Then there was a lot of time series; a lot of companies started to do forecasting with different algorithms. Then computer vision came. Then LLMs started to grow, from the really small models like BERT to where we are right now.
And so all those years I was mostly working on machine learning, and that really took me. I really like it. I'm really passionate about it. And then...
Yeah.
Let me stop you because something came to mind.
So you've got this really neat evolution from the mainframe to virtualization and into the AWS public cloud. I always like to ask people this. So when you first started at VMware, what did you miss about the mainframe? Because there's lots of neat new things, but, and maybe the same from AWS, the public cloud: were there things that you missed, from your time working on mainframes to your time working at VMware versus public cloud? As far as, like, there's this edge case or this one thing where I actually liked how it was solved when we had mainframes, for example.

I would say I was missing, let's say, first of all, the complexity. It became so easy. Like, a mainframe: if you look at the UI, there was literally no normal UI. There was nothing. At a time when there was the iPhone, they were still living in the terminals. Everything was in terminals. So you needed to run a lot of commands, a lot of attributes. It was really hard. And in VMware it was so easy; you just literally click on everything instead of commands.
And another thing: I was actually missing the performance tuning. Because when you buy a mainframe for a huge amount of dollars, you are going to get the maximum from it. When you have a critical database, even a latency of seconds actually influences your business. So you're really tuning everything: you're really tuning the storage, you're tuning the memory, what page size you use for the database, how you do the read-write operations against the database. And when I came to VMware, honestly, people were not really doing that. It was just, okay, we launched the virtual machines. And one of the reasons why that was possible: if you look at the average server CPU utilization at that time, it was barely like 30, 40%. People were not really using the servers. So that's why you could take one server and put like 20 virtual machines on it, and all of them would be fine, because that was the pattern of the workloads on x86.

Yeah. Well, but I think that's part of what gave rise to public cloud, the fact that, like, you get into virtualization, nobody's actually tuning these things, you typically have 40% capacity, maybe 60% capacity, at any given time, because that's how people are already working. I mean, that's what makes public cloud work, is like, oh, we can scale this up and share the capacity across all these different companies. Whereas in the mainframe world, that concept didn't make sense, because, of course, everything was tuned, and this was a big capital investment, and they wanted to maximize it. Not that, I mean, there's still CapEx with hosts and SANs and stuff. Yeah, it is interesting how that virtualization really gave rise to, you know, people moving to public cloud.
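The consolidation math behind that 20-VMs-per-server point can be sketched in a few lines of Python. The core counts, utilization, and headroom below are illustrative, not figures from the episode:

```python
# Back-of-envelope VM consolidation: if guests only keep their vCPUs
# ~30% busy on average, one host can be heavily oversubscribed.

def max_vms(host_cores: int, vm_cores: int, avg_utilization: float,
            target_utilization: float = 0.8) -> int:
    """How many VMs fit on one host, given each VM's average vCPU
    utilization and a cap on total host utilization."""
    effective_demand_per_vm = vm_cores * avg_utilization
    budget = host_cores * target_utilization
    return int(budget // effective_demand_per_vm)

# A 32-core host with 4-vCPU guests idling at ~30%, capped at 80%:
print(max_vms(host_cores=32, vm_cores=4, avg_utilization=0.3))  # → 21
```

The same function shows why the mainframe mindset differed: with `avg_utilization` near 1.0 there is no oversubscription headroom, so packing more workloads onto shared hardware buys you nothing.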
Yeah, you're right about the public cloud. And, you know, it depends on how you see the public cloud. If you look at it from the customer perspective, yeah, it's about getting a nice service; you pay, you know, for what you're using. It's really effective, efficient. It's nice. If you look at it from the other side, the provider perspective: now you have the CapEx. Now it's your CapEx, not, you know, the end user's. And you need to work on reliability, which means you probably need to put in more servers; you have some reserve to fix things. You're working on the operations, because now it's yours. You work a lot on security, because you put many customers in the same place and guarantee that no one will get access to someone else's data. So the security was extremely important. And then you want to earn money, which means you take all the servers and you actually want to maximize the amount of workloads that you can put on them. So you want to use the same server, the same storage, to provide it to many users and make money from that.
Yeah, I mean, it's a great point, because, yeah, from the consumer perspective it's like, oh, this makes a ton of sense: I can essentially share these physical servers with all these other companies, but the virtualization layer provides the security and flexibility and all the things. But it's easy to gloss over, from the provider perspective, how each of those things that I just, you know, threw out, glossed over, is amplified, right? Because if I'm a small company and I have a security problem that's like a three out of a hundred, if I scale this up, then my security problem becomes an 80 out of 100 or 90 out of 100, as far as effort, time, money, etc., because the risk, you know, is so magnified. So yeah, that's really interesting.

Yeah, that's how it looks. And now, I would say, it's an interesting combination, the specialized AI clouds. They're public clouds, so there might be many users using them; it's always about security. And now it's not just CPUs and RAM. It's actually GPUs, which are much more expensive. And the customers are looking for maximum performance. If you look at how many different benchmarks exist to verify the performance of a GPU provider, you'll understand how important it is.
And if you also think about where we are right now in the market, you will see that new models are coming continuously. The frontier labs want to be the first to release a new model, because if you release your model and your model is second, that's almost useless. There will be no hype. There will be nothing. You will be one of many. You always want to be the first, which means you want to have the most performant infrastructure, and you want to have the most reliable infrastructure, to be able to train faster than your competitors. So that's why it's become even more amplified nowadays.
Yeah, yeah, that makes a lot of sense.
Yeah, yeah, that makes a lot of sense. Okay, so we've got to talk about this. I believe our producer, Brooks, came across this: the five pillars of a successful machine learning project. So I'd love for you to share the five pillars, and then let's dig in a little bit on that.
Yeah. So let's say all projects, all machine learning projects, regardless of the size (it could be just a small pet project, or it could be a huge project for a huge enterprise, for millions of dollars): if you look at all of them, you'll find absolutely similar patterns, regardless of the size, regardless of the use case. They're always the same patterns. And if you can be successful in them, if you can do them right, you probably will have a successful project. Otherwise, there is a huge chance that the project won't be successful.

If we briefly look at what they are: the first, and we talked about it a little bit in the beginning, is that without the data, there is no machine learning. There is no way around it. So the first one is always the data. You need to have the data, you need to be the owner of your data. You need to know what's inside, how you process it, how you use it, how you provide access to the data to your teams. Because it depends on the size: if you're just one data scientist, that's one story. If you have 10 teams of 15 data scientists, that's an absolutely different story. It's also about the security and many different topics inside of it.

The second (and let's say I'm not putting them in order of importance; they're all important), the second will actually be what project you select. Inside a company, and even in your pet project, you always need to weigh the complexity against how important it is for you. So sometimes you might find that there are tens of examples on GitHub for doing something. You look at it and you ask yourself: why do I need it at all? I don't need it. So why would I want to do it? And the opposite: you might say, I have a super important task for myself or for the company, but it looks like no one in the world has done it; it's completely the opposite situation. And you probably don't want to take that either, because you will put in a lot of time, you will probably put in a lot of effort, and there is no guarantee that you will achieve something.
Yeah, that's such good advice, you know, especially for a lot of our listeners, and I'd include myself. Everybody wants to be on high-impact projects, ones that make a difference. And from a career perspective, especially in larger companies, what projects you work on drastically impacts your career. And even those last two, I think, are great criteria. Sometimes you don't have a choice, but when you do have a choice, think about whether this is, on one extreme, a complete new frontier nobody's done before, or, on the other extreme, something people have done a lot but that seems to be useless. So those are good.
Yeah, I would say that in general, I could suggest using three dimensions. First of all: is the data available? Because you might just not have the data for that project. So what are you going to do? Nothing. So: the availability of the data. The second will be the impact; we just talked about it. And the last will be the ML existence, let's say whether other solutions exist or not. So ideally, you probably want a project where you have some data, it will be impactful somehow, and some solutions already exist. It's like a chatbot: there's so much data, there are so many existing solutions. And probably for a majority of businesses, especially with a front end, you know, businesses that are interacting with users, with the final customer, a chatbot might be useful.
Right.
So probably that's why everyone is starting from the chatbot.
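Alexander's three-dimension filter (data availability, impact, existing solutions) can be sketched as a simple scoring function. The candidate projects, weights, and scores below are made up for illustration; the only idea taken from the conversation is that missing data rules a project out entirely:

```python
# A hedged sketch of the three-dimension project filter. Scores are 0..1
# self-assessments, and the aggregation formula is one arbitrary choice.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    data_available: float  # 0..1: do we own usable data?
    impact: float          # 0..1: how much does the business care?
    prior_art: float       # 0..1: do working solutions already exist?

    def score(self) -> float:
        # No data means no machine learning, so data gates everything.
        if self.data_available == 0:
            return 0.0
        return self.data_available * (self.impact + self.prior_art) / 2

candidates = [
    Candidate("support chatbot", data_available=0.9, impact=0.7, prior_art=0.9),
    Candidate("novel frontier model", data_available=0.4, impact=0.9, prior_art=0.1),
    Candidate("no-data moonshot", data_available=0.0, impact=1.0, prior_art=0.0),
]

for c in sorted(candidates, key=lambda c: c.score(), reverse=True):
    print(f"{c.name}: {c.score():.2f}")
```

Under these made-up numbers the chatbot ranks first, matching the observation that it scores well on all three dimensions, which is why so many teams start there.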
Yeah, right, right.
Yeah, yeah, yeah, that's really good.
Well, speaking of chatbots, I do want to talk use cases with you. AI use cases for sure, and ML; I think more traditional ML gets completely ignored today, and that's an easy one to gloss over, because there are still some good use cases there. But we can start with either AI or ML. I just want to talk really practically, specifically about ones that are going to be relevant for the data space. What are you seeing, and we can get more into the company you're at: what are you seeing your platform being used for, as far as business use cases that involve data?
Great question. So I would say we have so many customers, from absolutely different industries and use cases. For example, let's start from the big groups of use cases. We have customers who are doing training on us, like training all those new big models. We have customers who are doing inference; that's the other side: after you train the model, you probably want to earn money from that model somehow, and one of the ways is to provide that model to users through some service. So that's inference. And the last one includes everything else, like pipelines for the data or something else. So those are the big groups. And if you look at what exactly they're doing: for example, we have customers training a really great model for image generation, and they're serving that model in a special application that helps designers work and create new images, new visualizations. Or something opposite, like a company from healthcare and life sciences trying to help people live longer and make life better. So we have customers from retail, we have customers from financial services, we have customers from healthcare and life sciences, and all of them are doing different things. So I would say that we are not doing a platform for just one use case. What we are doing is AI infrastructure that can be accessible to any company, any person, with different levels of expertise. It could be a startup, a pet project, or it might be a huge enterprise or a frontier lab. And all of that, in our vision, is supposed to have the best user experience.
Right.
So this is what we're doing, and that's why we're not really focusing on just one use case.
Right.
Yeah, that's super interesting.
So, yeah, we talked about this before the show, and I'd love to dig in for our listeners. Since you were at AWS, but this could be true for any of the cloud providers: as you're seeing this AI space evolve, what are things that your platform is able to do uniquely well that are maybe a little more challenging for some of these public cloud platforms, for various reasons?
I would say that, you know, if you think about the clouds, the hyperscalers, the biggest ones: what do you actually love about them? I think, at least for me, I love the flexibility. You can open the console; you have so many different servers; you can just select one of them, then select another one, do a combination. You can build whatever you want, and it's really good. But there is a trade-off. Is it effective? Is it effective to use a service designed somewhere around your task, instead of creating something that really fits?

I remember one really funny situation: we were launching the same workload continuously, and we were getting the worst performance on roughly every fifth launch. Who knows what's going on? Why? And in the end we found out, and we found it in the documentation as well, that the hardware is not guaranteed. You get a different CPU, a different generation of the server, and you don't get the same performance continuously. So that's the other side of the coin of the flexibility, because they need to utilize the resources that they have. And if there is no direct contract (sometimes you do get a specific virtual machine with specific hardware), if you go to other services where the hardware is hidden from you, you cannot control that. You will use what they give you under the hood.
Right, right.
And if you think about the opposite situation, like the bare-metal GPU neoclouds: they give you the hardware, they give you a bare-metal server. Here's the server, here's an IP address; connect it to whatever you want. It's super performant. You get the best performance, because it's bare metal; there is nothing on top of it. But is it scalable? Is it reliable? Can you just get a thousand of them in the next five seconds, as you can in a hyperscaler? Can you guarantee that they will not fail? No, you cannot. Those are like two opposite things. And that's probably the reason why we made multiple decisions at Nebius in the way we built it. What we've done differently is that we designed a special-purpose AI cloud, where we bring the performance of bare metal and the flexibility of hyperscalers together. That's what we're doing, and I believe that's what people need nowadays. It means that you can just open a console and get the virtual machines that you need to run your task, and those virtual machines will guarantee you the same performance as a physical server. And you control it. You see it. If you don't need it, you stop it; if you need it, you launch it. You're controlling that. And on top of that, there are different services that can actually simplify what you're doing and how you can achieve your target.

Yeah, I think that makes a lot of sense. In the hyperscalers you mentioned, their customers and the current use cases dictate the underlying hardware, right? And obviously there are special contracts and ways that you can specify the hardware or whatever, but in general, the customers and the use cases dictate the hardware. So what you guys are doing is a subset: really building for this use case, this AI use case. You can do a lot more to refine the underlying infrastructure and hardware to be best suited for that, and therefore get the best performance. So I think that makes a ton of sense. Speaking of which,
I would imagine that's pretty complicated. So I want to talk a little bit about some of the complexities that you all are facing. And then I think this leads us into a really popular topic on the show recently, which is the doubling down, I think, of the as-code movement. We talked last show about BI as code, infrastructure as code. So I'd love to dig into some of the complexity, and then maybe how the as-code movement plays into that complexity for you guys.
Yeah, that's a great point. So, you know, when you do infrastructure as code, there are multiple things. First of all, there is the infrastructure. And infrastructure for AI nowadays is an absolutely different type of infrastructure. There are so many different things, starting from the data center: how you build the data center and how you make it efficient. You're actually facing a lot of challenges. You've probably heard that nowadays companies are building, like, 100-megawatt data centers, or they're building gigawatt data centers. But imagine what's going on; imagine a simple thing, one of the challenges, to give you the magnitude. You get a data center which is consuming 200 megawatts, or, let's say, in the extreme case, a gigawatt.
I mean, think about how an electricity network usually works. You have many consumers; some of them turn on, some of them turn off the kettle or something else. So you have this fluctuation of the consumption. And because all of them are really small (the houses, individual people, families, flats, they're quite small), the fluctuation is quite small. You have night and day, true, but it's a really predictable curve. And then imagine a data center which is consuming like a small city, or even like a big city. In the data center, you launch a training; all those thousands of GPUs start to run and consume a gigawatt of energy. And the next second, the next moment, you finish the training. You turn it all off. So your consumption goes immediately to zero. Not to zero, but close to that. But the power plant next to you is still producing a gigawatt. You cannot stop the power plant the same way. What are you going to do with that gigawatt of electricity, which is just available in the grid? You need to move it somewhere, or you will break everything.

Well, yeah, I mean, that's such an interesting problem, because it's not like what we usually think about, like traffic with cars and trucks. This is more similar to launching spaceships: you need a tremendous amount of energy to launch a rocket for X amount of time, and then you don't need the energy until the next rocket launches. And if you don't have the rocket launches sequenced back to back, then there are these big gaps, these lulls of energy. I think that's really interesting, and I want to ask, because I've never heard anybody break this down. We always talk about energy consumption and AI, etc., and I'd love to break down some of the details. So, GPUs consume a lot of energy. But what are some of the other supporting characters that consume a lot of energy in an AI-optimized data center? What's your checklist? What are the big items that consume a lot of energy?
I would say that the GPU is probably one of the biggest. And then you have a lot of supporting systems around it. You have storage, which is also servers; they also consume energy. You have cooling. It depends: if you have air cooling, you have the coolers, and you need to move the air; that also consumes a lot.
I would imagine. That was one on my list; I thought cooling would be up there, yeah.

Yeah, it also depends on how you do it, what kind of technology you're using. You could put data centers in the north. For example, we have data centers in Finland; we have data centers even further north. In Finland, for example, we almost don't use all of that: we can actually take the air from the outside, and since the air is already cold, we can just use it. It's even funny that you need to heat up this air: you actually take the heat from the data center and heat the cold air before you can put it into the data center. And then you still have a lot of heat left over, which in Finland we use to heat up the village next to the data center.
Yeah.
Yeah, and then you also have networking. Networking is super big, and you have not just Internet-standard networking. You have InfiniBand, which is extremely performant, low latency, because you need to synchronize all those GPUs together, so you need InfiniBand. That networking consumes a lot. And probably in a modern data center, you're replacing the air cooling with liquid cooling. So now you have pumps that need to push the liquid through the servers, through all the system.
So there are a lot of different things that consume energy. But I want to go back to your original question. You asked about infrastructure as code. So when you do infrastructure as code, honestly, in AI it doesn't really change. It's still the same. People still use Terraform. In AWS, people still use CloudFormation to control AWS. In our case, we use Terraform to control it as well. Yeah, I was going to ask if there's specialty tooling yet in the infrastructure-as-code world, but it sounds like there isn't, at least for now. That's quite interesting. You know, a couple of months
back, I was sitting with my friends over a beer and we were talking, and they were not in the AI space. And they asked me, what's going on in the AI space? What tools are you using? What do I need to join the AI space? And I said, come on, man. Kubernetes is still here.
Yeah. A lot of people use Kubernetes. Slurm, yeah, now a lot of people use Slurm; it's even older than Kubernetes. You know, then you still use Grafana, you still use Loki, you still use Prometheus, you still have CI/CD, you still use Terraform. In reality, you could even use Elasticsearch as the database for vectors. So,
if you want to be in the AI space, and you look at the people in AI: definitely there are people who build all those algorithms, who write the code of the algorithms, who do the data engineering. But there are also many people who are actually doing the infrastructure. And from this side, I wouldn't say that there are special tools. I mean, there are tools which help you to do specific tasks, like, for example, monitoring. Nowadays, if you do monitoring, let's say in the standard world, you just monitor the application and the infrastructure. Now you also need to monitor the dataset: you need to detect data drift, you know, detect semantic drift. You also need to do slicing of the new data. You also need to do versioning of the data, not just versioning of the code. So it becomes more complex, and for those new tasks you have more specialized tools. But in general, for the infrastructure, it's more or less the same as it was before.
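Alexander's point about monitoring the dataset, not just the app, can be made concrete. Below is a minimal sketch of a drift check: compare a statistic of an incoming data slice against the training baseline and flag it when it shifts too far. The feature values, the mean-shift statistic, and the 3-sigma threshold are all illustrative assumptions; real systems use proper two-sample tests (e.g. Kolmogorov-Smirnov) over many features.

```python
# Toy data-drift detector (illustrative; real monitoring stacks use
# two-sample statistical tests over many features, not just the mean).
import statistics

def drift_score(baseline: list[float], current: list[float]) -> float:
    """How far the current slice's mean sits from the baseline mean,
    measured in baseline standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(current) - mu) / sigma

def has_drifted(baseline: list[float], current: list[float],
                threshold: float = 3.0) -> bool:
    # threshold is an assumed alerting level, not a standard value
    return drift_score(baseline, current) > threshold

# Hypothetical feature values: training baseline vs. two incoming slices
baseline = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
ok_slice = [10.1, 9.9, 10.4]        # close to the baseline
shifted_slice = [25.0, 26.5, 24.8]  # clearly drifted
```

Here `has_drifted(baseline, ok_slice)` stays quiet while `has_drifted(baseline, shifted_slice)` fires; versioning each data slice alongside the code, as Alexander mentions, is what lets you trace an alert back to the data that caused it.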
Yeah, that's good to hear, because, you know, for those outside of the space, I think it's easy to assume, oh, they have their own tooling for every one of these solutions. And, well, maybe that would be nice; there's probably some tooling you use that you wish could be optimized for your use case. But there's a practicality of: we need real people with existing skill sets to help build the future here. And we can't, overnight, completely reorient around an entirely new set of tooling at every layer of the stack. That's just practically not possible.
Yeah.
So that's why, obviously, you're controlling a different type of backend. When you're doing infrastructure as code, you also need to start to control networking. But you're also controlling InfiniBand, which is just another type of hub, another type of control plane to use. But still, you do more or less the same as it was before.
Yeah.
From the, you know, Terraform, from the infrastructure-as-code perspective.
Right.
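The "more or less the same as before" point is that infrastructure as code is still: declare the state you want, diff it against what exists, apply the difference. A toy reconciler makes the idea concrete (hypothetical node names; this is not any real provider's API):

```python
# Toy infrastructure-as-code planner: diff desired state against actual
# state, the way `terraform plan` does conceptually. Purely illustrative.
desired = {"gpu-node-1": "H100", "gpu-node-2": "H100"}  # declared in config
actual = {"gpu-node-1": "H100", "old-node": "V100"}     # what exists now

def plan(desired: dict, actual: dict) -> dict:
    """Return the create/destroy actions needed to reach the desired state."""
    return {
        "create": sorted(set(desired) - set(actual)),
        "destroy": sorted(set(actual) - set(desired)),
    }
```

Running `plan(desired, actual)` here yields one node to create and one to destroy; whether the backend is a VPC, an InfiniBand fabric, or a GPU cluster changes the provider plugin, not the workflow.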
But I want to bring up one more thing. You mentioned that people who have those skills, who are not in the industry, could join the industry. And now I just want to amplify it a little bit more.
Actually, those people, you know, if you look at AI developers who are writing code, they don't really want to learn Terraform. They don't really want to learn Kubernetes. They're really doing great stuff. They're wrangling the data. They're creating algorithms. Why are they supposed to know Kubernetes? I would say the majority of them want to stay somewhere at the Docker container level. So that means those people who have knowledge about Kubernetes, Terraform, and all of these things, what could they do? They could actually create a new tool set which will hide the complexity of the infrastructure, hide the complexity of Kubernetes.
You know, once, I remember, I was talking to one of the really big labs, who are doing an amazing job, amazing models. And I was asking, can we do the load balancing? And they told me, I have no idea what load balancing is. What is it? Why do I need it? And anyway, at that point it didn't make sense to ask them about sticky sessions, even though they needed them. So that's also part of it: if you have knowledge about Kubernetes, you can actually think about how you can remove that complexity. Give AI people a really simple way to operate the infrastructure they rely on.
Yeah, yeah, that makes a lot of sense. I mean, we've covered this a lot before too: finding the right levels of abstraction and being able to contribute to that abstraction. So, like you said, the people that are working on, you know, post-training, or working on the foundational models even, can focus as much of their energy and effort as possible on the hardest problems and abstract away some of the other, more solved problems. I think that's a really good point.
We're going to take a quick break from the episode to talk about our sponsor, RudderStack. Now, I could say a bunch of nice things as if I found a fancy new tool, but John has been implementing RudderStack for over half a decade. John, you work with customer event data every day, and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go.
Yeah, Eric, as you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So RudderStack has really been one of my team's secret weapons. We can collect and standardize data from anywhere, web, mobile, even server side, and then send it to our downstream tools.
Now, rumor has it that you have implemented the longest-running production instance of RudderStack, at six years and going.
Yes, I can confirm that. And one of the reasons we picked RudderStack was that it does not store the data, and we can live-stream data to our downstream tools.
One of the things about the implementation that has been so common over all the years, and with so many RudderStack customers, is that it wasn't a wholesale replacement of your stack. It fit right into your existing tool set.
Yeah, and it even works with technical tools, Eric, things like Kafka or PubSub, but you don't have to have all that complicated customer data infrastructure.
Well, if you need to stream clean customer data to your entire stack, including your data infrastructure tools, head over to rudderstack.com to learn more.
One thing I wanted to bring up, because this comes up a lot when I've talked to people, and there's kind of an obvious answer to this, I realize: say you wanted to do a startup today, and you want to be in the AI space. You probably want to be in the application layer, which is kind of the obvious answer. But what are some other spots that maybe are less obvious, from your perspective? So you've got foundational models over here, and there are probably only going to be a few successful, really large foundational models, just from a funding and spend perspective. Then you've got various levels, I would almost say, between a foundational model and the application layer: post-training stuff, or various other things you can do. I'm curious, what in your mind are some opportunities for people that aren't just the obvious, like, well, I can go work on a foundational model, versus, I'm going to work somewhere in the application layer?
That's a great question. So I would say that when you open a startup, or not even a startup, when you're just running a new project, it's always a question of choices, of tradeoffs. You need to look at what kind of team you have, what this team knows, what kind of tooling it knows. What's your budget for this? What is your time to market? Do you need to be on the market tomorrow, or in one year? And what kind of competitive advantage do you want to bring? And actually, all of that brings you to the question: what do I want to use? What's the complexity? You mentioned that there are frontier models, and then you can build your own model by post-training one of the open-source ones, by fine-tuning.
So that actually brings us to the question: I could take a frontier model, or I could take one of the open-source models from a provider. For example, at Nebius we have Nebius AI Studio, where you can get models and pay per token, and you get the best models from the open-source world. And you can fine-tune there as well; it's also pay per token. But, you know, that's the first option: you can go to a pay-per-token model service. Opposite to that would be to go to the cloud, get the GPUs, and build your own model or fine-tune the model. In the middle of them, I would say, we can see that there is a market of more abstracted GPUs. Back in the day we were calling it serverless, when you remove the complexity, so it's serverless GPUs. You get GPUs, but you don't control the infrastructure behind them.
You even still use them. But one of the things you also need, when you're making the decision about which of them you want to use: I would say there are four dimensions which you need to explore. The first is the economics. Like, if you're a small startup, or you're doing an MVP in a company, you probably don't want to rent GPUs and pay for them 24/7. Yeah, right. It's much easier to take a Llama model or an Anthropic model and just make an MVP, verify the idea, pay per token, and then you can re-evaluate whether you need to switch to something else. It's also about the operational dimension. What's your team? Is your team actually capable of using Kubernetes? Or do they know LangChain, and that's the level; they want to operate an API.
Another dimension is the technical requirements. What's the performance you need? Do you want to customize? If you take, for example, one of the models from a provider, and then you decide, oh, I need to fine-tune the model, because I have my own data, and here is a really good dataset. Now I want my model to be fine-tuned on that dataset. Can you do that or not? Is it technically available? And if it's not, does that mean you need to throw everything away and start to rebuild the infrastructure?
And the last dimension is actually the legal dimension, or you could say the strategic dimension. Because not a lot of people think about, you know, what's the time to market: if you take GPUs, will you be faster than if you take a model from a provider? And if you take the model from the provider, do you have special data? Like, once I was talking to one startup, and they told me that they do medicine, you know, an agent for the doctor, like a supporting agent for the doctor. And I asked them, okay, where are your patients? And he told me, the whole market is in the U.S., because it's the biggest market, we want to make money. And then, where is your model? Well, we use a provider from the European Union. Okay, so you take the data of United States people, especially medical data, which has so much compliance around it, and you put it across the ocean. Do you really think that you could do that? So you need to think about the legal side, your data privacy, and all that stuff. And that will help you to select the tooling. So there are so many decisions and trade-offs that you need to make when you select between a model from a provider and GPUs from a provider.
Right. Yeah, I think this would be interesting. So I think a lot of people, you know, just skill-set-wise, are going to typically start with a foundational model, especially if they want to be in the application layer. What are leading indicators, when working with a foundational model, that maybe you should explore some post-training or tuning rather than just using the foundational model as-is? What are some indicators people should look for to consider that?
I would say that, first of all, you need to verify the quality. I mean, is this model actually answering your question? Is it answering your question correctly? Ideally you check that with a harness. The thing is, the model was trained on some big dataset, but that big dataset doesn't necessarily have a good portion of knowledge about your specific use case, and it has a huge portion that is, I don't want to say garbage, but useless for your use case. Right. So you need to do a kind of testing, ideally not manual testing. It's not about you sitting and asking five questions in the chat and saying, oh yeah, that's a good model. Yeah. So you probably need to, you know, create a dataset of questions and answers. Then there's also another problem: do you have open questions or closed questions? Like, is it yes or no, so you can just go with, you know, precision and have one function to verify the quality? Or is it open questions, so you actually need another model as a judge to verify the answers and say how close they are? So you need to gather a dataset of really curated, really verified questions and answers. And then you need to take a couple of open-source models, or not just open source, it's fine to include commercial models, a couple of those foundation models, and do automatic testing: run all those questions through them and get the metrics for all of them. And that will help you to understand. Okay, there is a model that's really good on my questions, so I can use it. Or you can see: okay, all those models are bad.
What can you do if it's not good enough? Then you need to take a bigger dataset. So now it's not a dataset for the evaluation; now it's a dataset for the training. You take more data, more examples of your, you know, text corpus, images; it depends on which model you have, which modality you use. And then you need to fine-tune. So you also need to figure out which of these models you could fine-tune. If I take this commercial model, can I fine-tune it, and at what cost? If I fine-tune this open-source model, what will be the cost of the fine-tuning and of the usage? So then you do another round: let's say you do a quick fine-tuning, do another round of evaluation, check the new metrics. Okay, it looks like this commercial model is now becoming excellent; you can fine-tune it further. Or it looks like this open-source model is now really good. Maybe I should take the open-source one, because the commercial models, you know, you can only get from one place, but for the open-source model there are so many providers, and they're all fighting on pricing, so you can even find it really cheap. So I would say this is the reason why the majority of people might start to move on from foundation models: it's mostly because of the quality not fitting the use case, not enough knowledge about the specific domain. And that's when you want to fine-tune.
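The evaluation loop Alexander describes, a curated Q&A set run through several candidate models with automatic scoring, fits in a few lines for the closed-question case. The models below are stubs standing in for real API calls, and exact-match accuracy only works for closed questions; open questions need a judge model, as he notes.

```python
# Sketch of the automatic evaluation loop: run a curated question/answer
# set through candidate models and score them. The "models" are stubs
# standing in for real API calls; exact-match accuracy is only valid for
# closed (yes/no) questions.
eval_set = [
    {"q": "Is 17 prime?", "a": "yes"},
    {"q": "Is 21 prime?", "a": "no"},
    {"q": "Is 2 prime?", "a": "yes"},
]

def generic_model(question: str) -> str:
    # stand-in for a foundation model with little domain knowledge
    return "yes"

def tuned_model(question: str) -> str:
    # stand-in for a model fine-tuned on the domain: actually checks primality
    n = int(question.split()[1])
    return "yes" if n > 1 and all(n % d for d in range(2, n)) else "no"

def accuracy(model, dataset) -> float:
    hits = sum(model(ex["q"]).strip().lower() == ex["a"] for ex in dataset)
    return hits / len(dataset)

scores = {m.__name__: accuracy(m, eval_set)
          for m in (generic_model, tuned_model)}
```

On this toy set the tuned stub scores 1.0 and the generic one doesn't, which is exactly the signal Alexander says should push you toward fine-tuning; for open-ended answers you would replace the exact-match comparison with a judge-model call.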
Yeah, that makes sense. Yeah. And then I guess one of the things, and I don't have a lot of experience with this, but one of the things that I'm always thinking about is: you've got a foundational model, then you've got tools available to the model, right, MCP or however you're doing that. And then say I'm going to switch out the model. It ends up being a really complicated thing, where you're like, okay, here's a model, here's all the tools available to it, I'm going to switch out the model, give that new model the same tools, but maybe it's worse at tool calling, or there are just all these complexities. So, any helpful frameworks? I mean, like you mentioned, there are a million of them out there for trying to test models and compare models, but maybe just more of a heuristic framework that is helpful when people think about, I want to compare this to this, especially in a more complex scenario where you've got tool calls and other things going on, and it's not just, you know, raw models.
So I would say that if you want to be really excellent, if you want to have the best of all, you need to do the verification, you need to do the evaluation as well. Okay, so you want to replace that model with something else: you need to verify it, do programmatic testing of, you know, several models, and see how many of them will call the tool correctly or not. Another approach: you can actually separate it. You know, to actually select a tool, you don't need a big model; the model that selects a tool can be really small. And that means you can use one model to select the tool, to work with the tools that you have, and pass the information to the bigger model, which will do the thinking, you know, the chain of thought, and actually use the data. So you can use one model as a tool for another model. Yeah. Okay. Yeah, that's right.
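The split Alexander describes, a small cheap model that only selects the tool and a big model that does the reasoning, can be sketched like this. Everything here is a stub: the "router" is keyword matching and the "big model" is a format string, standing in for two LLM calls of very different cost.

```python
# Two-model pattern: a small model routes to a tool, a big model reasons
# over the tool's output. Both models and all tools are illustrative stubs.
def small_router_model(question: str) -> str:
    """Stand-in for a small, cheap model that only picks a tool."""
    q = question.lower()
    if "weather" in q:
        return "get_weather"
    if "price" in q:
        return "get_price"
    return "no_tool"

def big_reasoning_model(question: str, tool_output: str) -> str:
    """Stand-in for the large model that does the actual thinking."""
    return f"Using the tool result ({tool_output}): answer to '{question}'"

TOOLS = {
    "get_weather": lambda: "12 C, cloudy",  # hypothetical tool backends
    "get_price": lambda: "$42.00",
    "no_tool": lambda: "no external data",
}

def answer(question: str) -> str:
    tool_name = small_router_model(question)           # cheap call
    tool_output = TOOLS[tool_name]()                   # run the selected tool
    return big_reasoning_model(question, tool_output)  # expensive call
```

When you swap out the big model, the router and its tool-selection behavior stay fixed, which narrows down what you have to re-test.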
An agent, yeah. Okay, yeah, that's the marketing term, right? No, that's a great point. Okay, we're coming up on the end of the show here. You've mentioned this previously in the show, but I want to ask:
for people that do want to get more into AI, what would you say? We talked about how actually some of the tooling is the same, but what is that skill, what is that thing, other than maybe being really good at some of this tooling that we've talked about? Like, what's important? What would you tell somebody? I would say that,
first of all, you need to understand where you want to be in AI. Nowadays, AI is not anymore, you know, only for the people who think about algebra. Nowadays, AI is actually a big business where a lot of different people are needed. Right. So if you love the infrastructure, you don't have to go build a new algorithm. I mean, you can try to build a new algorithm, you can try to go into PyTorch, TensorFlow. But if you really like the infrastructure, if you have knowledge in that and you want to be in it, start working there, start to learn, do that. Yes, be the person who will actually help data scientists be effective and stop wasting their time on understanding what Kubernetes is, and what virtualization is, and all these things.
If you really like to write algorithms, then just, you know, join a team that is doing that. There are so many good ideas still in this space. Like, for example, recently there was the release of TorchFT, which is Torch fault-tolerant. That's, let's say, an algorithmic approach for how you can continue training even if you have a failed GPU. And that's a pure algorithm, and it's not even a machine learning algorithm; it's an algorithm for how the code runs, how you synchronize the data. But if you want to write the code of the AI model itself, you can do that as well. I mean, it's up to you.
So, first of all, decide what you want to do. You cannot do all of them. Right. Each of these pieces is a huge component already. So you can be successful in one, and that will already be more than enough. For sure, you can explore something else. It's like, you know, start where you're really good, and if you want, explore new options. And the second one, the second piece of advice, probably: do not be afraid. I remember, and it still exists, when you ask people, what do you think about AI? And they tell you, oh, that's for, you know, super smart people who know math and probably studied physics. No, I mean, for sure there are such people who, you know, create breakthroughs, who create amazing things. But to start working on AI nowadays, you can just use a web UI, or, you know, with vibe coding becoming widespread, you can do something with AI without literally understanding how to write the code and what the idea behind it is. So do not be afraid. AI is not as hard as you might think. There are things like Nebius AI Studio where you can just get a model and play in a UI. There are open-source projects like Flowise, which help you to just, you know, take blocks and connect them in a visual way; you don't even write code, you just drag and drop. And there are clouds like Nebius AI Cloud where you can get GPUs, where you can launch containers, where you can launch huge Kubernetes clusters. So it depends on your skills, but do not be afraid. You can start small and grow your skills as soon as you need, as soon as you feel you need it.
Yeah, yeah. Alex, such great advice. Thanks for being on the show today. Really enjoyed this. We'd love to have you back. And yeah, thanks for being here.
Yeah, thank you, John. This was a great talk, nice meeting you.
The Data Stack Show is brought to you by RudderStack. Learn more at rudderstack.com.