The Data Stack Show - 264: Infrastructure as Code Meets AI: Simplifying Complexity in the Cloud with Alexander Patrushev of Nebius
Episode Date: October 1, 2025

This week on The Data Stack Show, Alexander Patrushev joins John to share his journey from working on mainframes at IBM to leading AI infrastructure innovation at Nebius, with stops at VMware and AWS along the way. The discussion explores the evolution of AI and cloud infrastructure, the five pillars of successful machine learning projects, and the unique challenges of building and operating modern AI data centers, including energy consumption, cooling, and networking. Alexander also delves into the practicalities of infrastructure as code and the importance of data quality, and offers actionable advice for those looking to break into the AI field. Key takeaways include the need for strong data foundations, thoughtful project selection, and the value of leveraging existing skills and tools to succeed in the rapidly evolving AI landscape. Don't miss this great conversation.

Highlights from this week's conversation include:

Alexander's Background and Early Career at IBM (1:06)
Moving From Mainframes to Virtualization at VMware (4:09)
Transitioning to AWS and Machine Learning Projects (8:22)
What Was Missed From Mainframes and the Rise of Public Cloud (9:03)
Security, Performance, and Economics in Cloud Infrastructure (12:40)
The Five Pillars of Successful Machine Learning Projects (15:02)
Choosing the Right ML Project: Data, Impact, and Existing Solutions (18:01)
Real-World AI and ML Use Cases Across Industries (19:42)
Building Specialized AI Clouds Versus Hyperscalers (22:08)
Performance, Scalability, and Reliability in AI Infrastructure (25:18)
Data Center Energy Consumption and Power Challenges (28:41)
Cooling, Networking, and Supporting Systems in AI Data Centers (30:06)
Infrastructure as Code and Tooling in AI (31:50)
Lowering Complexity for AI Developers and the Role of Abstraction (34:08)
Startup Opportunities in the AI Stack (38:53)
When to Fine-Tune or Post-Train Foundation Models (43:41)
Comparing and Testing Models With Tool Use (47:49)
Skills and Advice for Entering the AI Field (49:18)
Final Thoughts and Encouragement for AI Newcomers (52:31)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
Transcript
Discussion (0)
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to The Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies.
Before we dig into today's episode, we want to give a huge thanks to our presenting sponsor, RudderStack. They give us the equipment and time to do this show week in, week out, and provide you the valuable content. RudderStack provides customer data infrastructure and is used by the world's most innovative companies to collect, transform, and deliver their event data wherever it's needed, all in real time. You can learn more at rudderstack.com.

Welcome back to The Data Stack Show.
We're here with our guest Alex from Nebius.
Alex, really excited to have you on the show today.
Tell us a little bit about your background, and then we'll jump in.
Hi, John.
Thank you.
Thank you for having me here today. Today I'm leading part of the product team at Nebius, mostly focused on finding ways to make AI infrastructure and AI tuning accessible to a wide range of people, from startups to huge enterprises and frontier labs. But how I got there is quite interesting to me, because I'm always looking for new challenges for myself. And that's why I keep completely changing my role, direction, and technology stack.
I started at IBM, where I was working with the mainframes, big supercomputers, which could actually go to space and keep running there without any errors, even with radiation. That was quite interesting. And then I switched to VMware, quite the opposite of that, you know: virtualization on x86 after mainframes. And after that, I switched to AWS, from private data center virtualization, private cloud, to the public cloud. It was amazing; I really liked my time at AWS. I spent almost six years working on machine learning with different customers, different sizes of customers, different types of projects. And then I met Nebius; they had started to build a new AI-specialized cloud. And it was super interesting to join them as a product manager and help them actually redefine how the infrastructure looks, and create so many of the interesting products that we have today and that we are working on for the future.
Awesome. Yeah. So excited to learn more about that.
And then the show today, we've got a number of topics to dive into,
but what's one that you're excited about chatting through?
I would say that there are a couple of topics that I want to touch on, and I'll be happy to share a lot of insights and answer your questions about them. So I think that we can talk about how AI changed in the last couple of years. We can also talk about how the infrastructure and the software related to AI changed. And I think, because we're on the data show, it's supposed to be about the data.
Yeah.
So I also really want to talk about the data, because, you know, there is a golden rule: garbage in, garbage out. Without the data, there is not much learning at all.
So it's all super important.
Awesome, Alex.
All right.
Well, I'm excited about this.
Alex, so excited to dig into your background. It's really interesting to me, and unique for us as data practitioners, to get to talk to somebody that's spent a lot of time at a way lower level than a lot of us have, with your time at VMware, at IBM, and some of these other companies. So I'm curious, like, how did you get started, you know, in your first job? And then, like, take us through the progression a little bit.
Okay.
I've got you thinking way back.
Yeah, I'm like, where to start.
Yeah.
So, honestly, you know, after university, I joined a small company that was distributing IBM hardware. Then I found out that there was a position at IBM in a really interesting area: the mainframes, the Power Systems. And I just applied. That way, I got past all the interviews, and I was on board at IBM. The mainframe is actually, you know, like a big supercomputer; at that time, they were the most performant, the most reliable. And you know why they call them mainframes rather than, you know, supercomputers? Because it's not only about the performance, it's more about the reliability of those systems.
You were running, like, the business-critical databases. There was a lot of Oracle at that time. You were running business-critical applications which could not be stopped even for a minute, because that would cost you a lot of money; your business process is stopping. So that's why there are so many features built in at the hardware level to make sure that you can always correct problems, mistakes in the memory, and you can retry the instructions on the processor. Because right now, if you have a failed instruction on the processor, or a failed, you know, uncorrectable error in the memory, you'll actually just stop. You'll get a blue screen or something else: if it's Windows, the blue screen; if it's Linux, a kernel panic. On those computers, you had a lot of things, you know, controlling how the memory works, correcting everything, rechecking everything. You could actually retry the instructions on the processor.
And that actually was the place where virtualization, I could say, was created. The first virtualization was actually created there, on the IBM System/360. That's a really long time ago.
So, and then one day I got a call from VMware: we want to have you on the team, will you join us? And I was thinking, okay. You know, the first thing that I remembered was that back in childhood, when I was really young, maybe like seven years old, my father bought me a magazine about computers. And there was an article about VMware, the company who created the virtualization for x86. I was like, okay, that's interesting. That's a company where I probably want to be, to learn. They probably know a lot about virtualization and how to build it.
And I joined them, and it was also really fun. We were working on how to take one x86 server and make hundreds out of it, how to put hundreds of virtual machines on the same server. And they were almost the only one in the beginning. Then you had KVM, then you had other hypervisors, and so on. But they were the leader. They were really good. There were a lot of features.
And back in that time, I was living in Russia. And somehow we had an email from AWS: we want to build a team, we want to have you on the team. I said, wow, AWS. That's the space to be. And I also passed all the interviews, and we moved with the family to Luxembourg, where I'm still living, by the way. Quite a unique country, I would say.
So I started to work at AWS, and honestly, at that time in AWS you could work on almost anything; you could do whatever you wanted. That was a time when they were growing extremely fast, you know; so many people joined after that. I was starting to work a lot on data science projects, on machine learning. There was a time when, you know, it was data science: people were actually playing with Random Cut Forest, with XGBoost. Then there was a lot of time series; a lot of companies started to do forecasting with different algorithms. Then computer vision came. Then LLMs started to grow, from the really small models like BERT to where we are right now.
And so all those years I was mostly working on machine learning, and that really took me. I really like it. I'm really passionate about it. And then...
Yeah.
Let me stop you because something came to mind.
So you've got this really neat evolution from the mainframe to virtualization and into the AWS public cloud. I always like to ask people this. So when you first started at VMware, what did you miss about the mainframe? Because there's lots of neat new things, but, and maybe the same from AWS, the public cloud: were there things that you missed, from your time working on mainframes to your time working at VMware versus public cloud? As far as, like, there's this edge case or this one thing where I actually liked how it was solved when we had mainframes, for example.

I would say I was missing, let's say, first of all, the complexity. It became so easy. Like, a mainframe: if you look at the UI, there was literally no normal UI. There was nothing. At a time when there was the iPhone, they were still living in the terminals. Everything was in terminals. So you needed to run a lot of commands, a lot of attributes. It was really hard. And in VMware it was so easy; you just literally click on everything instead of commands.
And another thing: I was actually missing the performance tuning. Because when you buy a mainframe for a huge amount of dollars, you are going to get the maximum from it. When you have a critical database, even a latency of seconds actually influences your business. So you're really tuning everything: you're really tuning the storage, you're tuning the memory, what page size you use for the database, how you do the read-write operations against the database. And when I came to VMware, honestly, people were not really doing that. It was just, okay, we launched the virtual machines. And one of the reasons why that was possible: if you look at the average server CPU utilization at that time, it was barely like 30, 40%. People were not really using the servers. So that's why you could take one server and put like 20 virtual machines on it, and all of them would be fine, because that was the pattern of the workloads on x86.

Yeah. Well, but I think that's part of what gave rise to public cloud, the fact that, like, you get into virtualization, nobody's actually tuning these things, you typically have 40% capacity, maybe 60% capacity, at any given time, because that's how people are already working. I mean, that's what makes public cloud work, is like, oh, we can scale this up and share the capacity across all these different companies. Whereas in the mainframe world, that concept didn't make sense, because, of course, everything was tuned, and this was a big capital investment, and they wanted to maximize it. Not that, I mean, there's still CapEx with hosts and SANs and stuff. Yeah, it is interesting how that virtualization really gave rise to, you know, people moving to public cloud.
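The consolidation math behind that 20-VMs-per-server point can be sketched in a few lines of Python. The core counts, utilization, and headroom below are illustrative, not figures from the episode:

```python
# Back-of-envelope VM consolidation: if guests only keep their vCPUs
# ~30% busy on average, one host can be heavily oversubscribed.

def max_vms(host_cores: int, vm_cores: int, avg_utilization: float,
            target_utilization: float = 0.8) -> int:
    """How many VMs fit on one host, given each VM's average vCPU
    utilization and a cap on total host utilization."""
    effective_demand_per_vm = vm_cores * avg_utilization
    budget = host_cores * target_utilization
    return int(budget // effective_demand_per_vm)

# A 32-core host with 4-vCPU guests idling at ~30%, capped at 80%:
print(max_vms(host_cores=32, vm_cores=4, avg_utilization=0.3))  # → 21
```

The same function shows why the mainframe mindset differed: with `avg_utilization` near 1.0 there is no oversubscription headroom, so packing more workloads onto shared hardware buys you nothing.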
Yeah, you're right about the public cloud. And, you know, it depends on how you see the public cloud. If you look at it from the customer perspective, yeah, it's about getting a nice service; you pay, you know, for what you're using. It's really effective, efficient. It's nice. If you look at it from the other side, the provider perspective: now you have the CapEx. Now it's your CapEx, not, you know, the end user's. And you need to work on reliability, which means you probably need to put in more servers; you have some reserve to fix things. You're working on the operations, because now it's yours. You work a lot on security, because you put many customers in the same place and guarantee that no one will get access to someone else's data. So the security was extremely important. And then you want to earn money, which means you take all the servers and you actually want to maximize the amount of workloads that you can put on them. So you want to use the same server, the same storage, to provide it to many users and make money from that.
Yeah, I mean, it's a great point, because, yeah, from the consumer perspective it's like, oh, this makes a ton of sense: I can essentially share these physical servers with all these other companies, but the virtualization layer provides the security and flexibility and all the things. But it's easy to gloss over, from the provider perspective, how each of those things that I just, you know, threw out, glossed over, is amplified, right? Because if I'm a small company and I have a security problem that's like a three out of a hundred, if I scale this up, then my security problem becomes an 80 out of 100 or 90 out of 100, as far as effort, time, money, etc., because the risk, you know, is so magnified. So yeah, that's really interesting.

Yeah, that's how it looks. And now, I would say, it's an interesting combination, the specialized AI clouds. They're public clouds, so there might be many users using them; it's always about security. And now it's not just CPUs and RAM. It's actually GPUs, which are much more expensive. And the customers are looking for maximum performance. If you look at how many different benchmarks exist to verify the performance of a GPU provider, you'll understand how important it is.
And if you also think about where we are right now in the market, you will see that new models are coming continuously. The frontier labs want to be the first to release a new model, because if you release your model and your model is second, that's almost useless. There will be no hype. There will be nothing. You will be one of many. You always want to be the first, which means you want to have the most performant infrastructure, and you want to have the most reliable infrastructure, to be able to train faster than your competitors. So that's why it's become even more amplified nowadays.
Yeah, yeah, that makes a lot of sense.
Yeah, yeah, that makes a lot of sense. Okay, so we've got to talk about this. I believe our producer, Brooks, came across this: the five pillars of a successful machine learning project. So I'd love for you to share the five pillars, and then let's dig in a little bit on that.
Yeah. So let's say all projects, all machine learning projects, regardless of the size (it could be just a small pet project, or it could be a huge project for a huge enterprise, for millions of dollars): if you look at all of them, you'll find absolutely similar patterns, regardless of the size, regardless of the use case. They're always the same patterns. And if you can be successful in them, if you can do them right, you probably will have a successful project. Otherwise, there is a huge chance that the project won't be successful.

If we briefly look at what they are: the first, and we talked about it a little bit in the beginning, is that without the data, there is no machine learning. There is no way around it. So the first one is always the data. You need to have the data, you need to be the owner of your data. You need to know what's inside, how you process it, how you use it, how you provide access to the data to your teams. Because it depends on the size: if you're just one data scientist, that's one story. If you have 10 teams of 15 data scientists, that's an absolutely different story. It's also about the security and many different topics inside of it.

The second (and let's say I'm not putting them in order of importance; they're all important), the second will actually be what project you select. Inside a company, and even in your pet project, you always need to weigh the complexity against how important it is for you. So sometimes you might find that there are tens of examples on GitHub for doing something. You look at it and you ask yourself: why do I need it at all? I don't need it. So why would I want to do it? And the opposite: you might say, I have a super important task for myself or for the company, but it looks like no one in the world has done it; it's completely the opposite situation. And you probably don't want to take that either, because you will put in a lot of time, you will probably put in a lot of effort, and there is no guarantee that you will achieve something.
Yeah, that's such good advice, you know, especially for a lot of our listeners, and I'd include myself. Everybody wants to be on high-impact projects, ones that make a difference. And from a career perspective, especially in larger companies, what projects you work on drastically impacts your career. And even those last two, I think, are great criteria. Sometimes you don't have a choice, but when you do have a choice, think about whether this is, on one extreme, a complete new frontier nobody's done before, or, on the other extreme, something people have done a lot but that seems to be useless. So those are good.
Yeah, I would say that in general, I could suggest using three dimensions. First of all: is the data available? Because you might just not have the data for that project. So what are you going to do? Nothing. So: the availability of the data. The second will be the impact; we just talked about it. And the last will be the ML existence, let's say whether other solutions exist or not. So ideally, you probably want a project where you have some data, it will be impactful somehow, and some solutions already exist. It's like a chatbot: there's so much data, there are so many existing solutions. And probably for a majority of businesses, especially with a front end, you know, businesses that are interacting with users, with the final customer, a chatbot might be useful.
Right.
So probably that's why everyone is starting from the chatbot.
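Alexander's three-dimension filter (data availability, impact, existing solutions) can be sketched as a simple scoring function. The candidate projects, weights, and scores below are made up for illustration; the only idea taken from the conversation is that missing data rules a project out entirely:

```python
# A hedged sketch of the three-dimension project filter. Scores are 0..1
# self-assessments, and the aggregation formula is one arbitrary choice.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    data_available: float  # 0..1: do we own usable data?
    impact: float          # 0..1: how much does the business care?
    prior_art: float       # 0..1: do working solutions already exist?

    def score(self) -> float:
        # No data means no machine learning, so data gates everything.
        if self.data_available == 0:
            return 0.0
        return self.data_available * (self.impact + self.prior_art) / 2

candidates = [
    Candidate("support chatbot", data_available=0.9, impact=0.7, prior_art=0.9),
    Candidate("novel frontier model", data_available=0.4, impact=0.9, prior_art=0.1),
    Candidate("no-data moonshot", data_available=0.0, impact=1.0, prior_art=0.0),
]

for c in sorted(candidates, key=lambda c: c.score(), reverse=True):
    print(f"{c.name}: {c.score():.2f}")
```

Under these made-up numbers the chatbot ranks first, matching the observation that it scores well on all three dimensions, which is why so many teams start there.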
Yeah, right, right.
Yeah, yeah, yeah, that's really good.
Well, speaking of chatbots, I do want to talk use cases with you. AI use cases for sure, and ML; I think more traditional ML gets completely ignored today, and that's an easy one to gloss over, because there are still some good use cases there. But we can start with either AI or ML. I just want to talk really practically, specifically about ones that are going to be relevant for the data space. What are you seeing, and we can get more into the company you're at: what are you seeing your platform being used for, as far as business use cases that involve data?
Great question. So I would say we have so many customers, from absolutely different industries and use cases. For example, let's start from the big groups of use cases. We have customers who are doing training on us, like training all those new big models. We have customers who are doing inference; that's the other side: after you train the model, you probably want to earn money from that model somehow, and one of the ways is to provide that model to users through some service. So that's inference. And the last one includes everything else, like pipelines for the data or something else. So those are the big groups. And if you look at what exactly they're doing: for example, we have customers training a really great model for image generation, and they're serving that model in a special application that helps designers work and create new images, new visualizations. Or something opposite, like a company from healthcare and life sciences trying to help people live longer and make life better. So we have customers from retail, we have customers from financial services, we have customers from healthcare and life sciences, and all of them are doing different things. So I would say that we are not doing a platform for just one use case. What we are doing is AI infrastructure that can be accessible to any company, any person, with different levels of expertise. It could be a startup, a pet project, or it might be a huge enterprise or a frontier lab. And all of that, in our vision, is supposed to have the best user experience.
Right.
So this is what we're doing, and that's why we're not really focusing on just one use case.
Right.
Yeah, that's super interesting.
So, yeah, we talked about this before the show, and I'd love to dig in for our listeners. Since you were at AWS, but this could be true for any of the cloud providers: as you're seeing this AI space evolve, what are things that your platform is able to do uniquely well that are maybe a little more challenging for some of these public cloud platforms, for various reasons?
I would say that, you know, if you think about the clouds, the hyperscalers, the biggest ones: what do you actually love about them? I think, at least for me, I love the flexibility. You can open the console; you have so many different servers; you can just select one of them, then select another one, do a combination. You can build whatever you want, and it's really good. But there is a trade-off. Is it effective? Is it effective to use a service designed somewhere around your task, instead of creating something that really fits?

I remember one really funny situation: we were launching the same workload continuously, and we were getting the worst performance on roughly every fifth launch. Who knows what's going on? Why? And in the end we found out, and we found it in the documentation as well, that the hardware is not guaranteed. You get a different CPU, a different generation of the server, and you don't get the same performance continuously. So that's the other side of the coin of the flexibility, because they need to utilize the resources that they have. And if there is no direct contract (sometimes you do get a specific virtual machine with specific hardware), if you go to other services where the hardware is hidden from you, you cannot control that. You will use what they give you under the hood.
Right, right.
And if you think about the opposite situation, like the bare-metal GPU neoclouds: they give you the hardware, they give you a bare-metal server. Here's the server, here's an IP address; connect it to whatever you want. It's super performant. You get the best performance, because it's bare metal; there is nothing on top of it. But is it scalable? Is it reliable? Can you just get a thousand of them in the next five seconds, as you can in a hyperscaler? Can you guarantee that they will not fail? No, you cannot. Those are like two opposite things. And that's probably the reason why we made multiple decisions at Nebius in the way we built it. What we've done differently is that we designed a special-purpose AI cloud, where we bring the performance of bare metal and the flexibility of hyperscalers together. That's what we're doing, and I believe that's what people need nowadays. It means that you can just open a console and get the virtual machines that you need to run your task, and those virtual machines will guarantee you the same performance as a physical server. And you control it. You see it. If you don't need it, you stop it; if you need it, you launch it. You're controlling that. And on top of that, there are different services that can actually simplify what you're doing and how you can achieve your target.

Yeah, I think that makes a lot of sense. In the hyperscalers you mentioned, their customers and the current use cases dictate the underlying hardware, right? And obviously there are special contracts and ways that you can specify the hardware or whatever, but in general, the customers and the use cases dictate the hardware. So what you guys are doing is a subset: really building for this use case, this AI use case. You can do a lot more to refine the underlying infrastructure and hardware to be best suited for that, and therefore get the best performance. So I think that makes a ton of sense. Speaking of which,
I would imagine that's pretty complicated. So I want to talk a little bit about some of the complexities that you all are facing. And then I think this leads us into a really popular topic on the show recently, which is the doubling down, I think, of the as-code movement. We talked last show about BI as code, infrastructure as code. So I'd love to dig into some of the complexity, and then maybe how the as-code movement plays into that complexity for you guys.
Yeah, that's a great point. So, you know, when you do infrastructure as code, there are multiple things. First of all, there is the infrastructure. And infrastructure for AI nowadays is an absolutely different type of infrastructure. There are so many different things, starting from the data center: how you build the data center and how you make it efficient. You're actually facing a lot of challenges. You've probably heard that nowadays companies are building, like, 100-megawatt data centers, or they're building gigawatt data centers. But imagine what's going on; imagine a simple thing, one of the challenges, to give you the magnitude. You get a data center which is consuming 200 megawatts, or, let's say, in the extreme case, a gigawatt.
I mean, think about how an electricity network usually works. You have many consumers; some of them turn on, some of them turn off the kettle or something else. So you have this fluctuation of the consumption. And because all of them are really small (the houses, individual people, families, flats, they're quite small), the fluctuation is quite small. You have night and day, true, but it's a really predictable curve. And then imagine a data center which is consuming like a small city, or even like a big city. In the data center, you launch a training; all those thousands of GPUs start to run and consume a gigawatt of energy. And the next second, the next moment, you finish the training. You turn it all off. So your consumption goes immediately to zero. Not to zero, but close to that. But the power plant next to you is still producing a gigawatt. You cannot stop the power plant the same way. What are you going to do with that gigawatt of electricity, which is just available in the grid? You need to move it somewhere, or you will break everything.

Well, yeah, I mean, that's such an interesting problem, because it's not like what we usually think about, like traffic with cars and trucks. This is more similar to launching spaceships: you need a tremendous amount of energy to launch a rocket for X amount of time, and then you don't need the energy until the next rocket launches. And if you don't have the rocket launches sequenced back to back, then there are these big gaps, these lulls of energy. I think that's really interesting, and I want to ask, because I've never heard anybody break this down. We always talk about energy consumption and AI, etc., and I'd love to break down some of the details. So, GPUs consume a lot of energy. But what are some of the other supporting characters that consume a lot of energy in an AI-optimized data center? What's your checklist? What are the big items that consume a lot of energy?
I would say that the GPU is probably one of the biggest. And then you have a lot of supporting systems around it. You have storage, which is also servers; they also consume energy. You have cooling. It depends: if you have air cooling, you have the coolers, and you need to move the air; that also consumes a lot.
I would imagine. That was one on my list; I thought cooling would be up there, yeah.

Yeah, it also depends on how you do it, what kind of technology you're using. You could put data centers in the north. For example, we have data centers in Finland; we have data centers even further north. In Finland, for example, we almost don't use all of that: we can actually take the air from the outside, and since the air is already cold, we can just use it. It's even funny that you need to heat up this air: you actually take the heat from the data center and heat the cold air before you can put it into the data center. And then you still have a lot of heat left over, which in Finland we use to heat up the village next to the data center.
Yeah.
Yeah, and then you also have networking. Networking is super big, and you have not just Internet-standard networking. You have InfiniBand, which is extremely performant, low latency, because you need to synchronize all those GPUs together, so you need InfiniBand. That networking consumes a lot. And probably in a modern data center, you're replacing the air cooling with liquid cooling. So now you have pumps that need to push the liquid through the servers, through all the system.
So there are a lot of different things that consume energy. But I want to go back to your original question. You asked about infrastructure as code. So when you do infrastructure as code, honestly, in AI it doesn't really change. It's still the same. People still use Terraform. In AWS, people still use CloudFormation to control AWS. In our case, we use Terraform to control it as well. Yeah, I was going to ask if there's specialty tooling yet in the infrastructure-as-code world, but it sounds like there isn't, at least for now. That's quite interesting. You know, a couple of months
back, I was sitting with my friends over a beer and we were talking, and they were not in the AI space. And they asked me, what's going on in the AI space? What tools are you using? What do I need to join the AI space? And I said, come on, man. Kubernetes is still here.
Yeah. A lot of people use Kubernetes. Slurm, yeah, now a lot of people use Slurm; it's even older than Kubernetes. You know, then you still use Grafana, you still use Loki, you still use Prometheus, you still have CI/CD, you still use Terraform. In reality, you could even use Elasticsearch as the database for vectors. So,
if you want to be in the AI space, and you look at the people in AI: definitely there are people who build all those algorithms, who write the code of the algorithms, who do the data engineering. But there are also many people who are actually doing the infrastructure. And from this side, I wouldn't say that there are special tools. I mean, there are tools which help you to do specific tasks, like, for example, monitoring. Nowadays, if you do monitoring, let's say in the standard world, you just monitor the application and the infrastructure. Now you also need to monitor the dataset: you need to detect data drift, you know, detect semantic drift. You also need to do slicing of the new data. You also need to do versioning of the data, not just versioning of the code. So it becomes more complex, and for those new tasks you have more specialized tools. But in general, for the infrastructure, it's more or less the same as it was before.
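Alexander's point about monitoring the dataset, not just the app, can be made concrete. Below is a minimal sketch of a drift check: compare a statistic of an incoming data slice against the training baseline and flag it when it shifts too far. The feature values, the mean-shift statistic, and the 3-sigma threshold are all illustrative assumptions; real systems use proper two-sample tests (e.g. Kolmogorov-Smirnov) over many features.

```python
# Toy data-drift detector (illustrative; real monitoring stacks use
# two-sample statistical tests over many features, not just the mean).
import statistics

def drift_score(baseline: list[float], current: list[float]) -> float:
    """How far the current slice's mean sits from the baseline mean,
    measured in baseline standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(current) - mu) / sigma

def has_drifted(baseline: list[float], current: list[float],
                threshold: float = 3.0) -> bool:
    # threshold is an assumed alerting level, not a standard value
    return drift_score(baseline, current) > threshold

# Hypothetical feature values: training baseline vs. two incoming slices
baseline = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
ok_slice = [10.1, 9.9, 10.4]        # close to the baseline
shifted_slice = [25.0, 26.5, 24.8]  # clearly drifted
```

Here `has_drifted(baseline, ok_slice)` stays quiet while `has_drifted(baseline, shifted_slice)` fires; versioning each data slice alongside the code, as Alexander mentions, is what lets you trace an alert back to the data that caused it.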
Yeah, that's good to hear, because, you know, for those outside of the space, I think it's easy to assume, oh, they have their own tooling for every one of these solutions. And, well, maybe that would be nice; there's probably some tooling you use that you wish could be optimized for your use case. But there's a practicality of: we need real people with existing skill sets to help build the future here. And we can't, overnight, completely reorient around an entirely new set of tooling at every layer of the stack. That's just practically not possible.
Yeah.
So that's why, obviously, you're controlling a different type of backend. When you're doing infrastructure as code, you also need to start to control networking. But you're also controlling InfiniBand, which is just another type of hub, another type of control plane to use. But still, you do more or less the same as it was before.
Yeah.
From the, you know, Terraform, from the infrastructure-as-code perspective.
Right.
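The "more or less the same as before" point is that infrastructure as code is still: declare the state you want, diff it against what exists, apply the difference. A toy reconciler makes the idea concrete (hypothetical node names; this is not any real provider's API):

```python
# Toy infrastructure-as-code planner: diff desired state against actual
# state, the way `terraform plan` does conceptually. Purely illustrative.
desired = {"gpu-node-1": "H100", "gpu-node-2": "H100"}  # declared in config
actual = {"gpu-node-1": "H100", "old-node": "V100"}     # what exists now

def plan(desired: dict, actual: dict) -> dict:
    """Return the create/destroy actions needed to reach the desired state."""
    return {
        "create": sorted(set(desired) - set(actual)),
        "destroy": sorted(set(actual) - set(desired)),
    }
```

Running `plan(desired, actual)` here yields one node to create and one to destroy; whether the backend is a VPC, an InfiniBand fabric, or a GPU cluster changes the provider plugin, not the workflow.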
But I want to bring up one more thing. You mentioned that people who have those skills, who are not in the industry, could join the industry. And now I just want to amplify it a little bit more.
Actually, those people, you know, if you look at AI developers who are writing code, they don't really want to learn Terraform. They don't really want to learn Kubernetes. They're really doing great stuff. They're wrangling the data. They're creating algorithms. Why are they supposed to know Kubernetes? I would say the majority of them want to stay somewhere at the Docker container level. So that means those people who have knowledge about Kubernetes, Terraform, and all of these things, what could they do? They could actually create a new tool set which will hide the complexity of the infrastructure, hide the complexity of Kubernetes.
You know, once, I remember, I was talking to one of the really big labs, who are doing an amazing job, amazing models. And I was asking, can we do the load balancing? And they told me, I have no idea what load balancing is. What is it? Why do I need it? And anyway, at that point it didn't make sense to ask them about sticky sessions, even though they needed them. So that's also part of it: if you have knowledge about Kubernetes, you can actually think about how you can remove that complexity. Give AI people a really simple way to operate the infrastructure they rely on.
Yeah, yeah, that makes a lot of sense. I mean, we've covered this a lot before too: finding the right levels of abstraction and being able to contribute to that abstraction. So, like you said, the people that are working on, you know, post-training, or working on the foundational models even, can focus as much of their energy and effort as possible on the hardest problems and abstract away some of the other, more solved problems. I think that's a really good point.
We're going to take a quick break from the episode to talk about our sponsor, RudderStack. Now, I could say a bunch of nice things as if I found a fancy new tool, but John has been implementing RudderStack for over half a decade. John, you work with customer event data every day, and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go.
Yeah, Eric, as you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So RudderStack has really been one of my team's secret weapons. We can collect and standardize data from anywhere, web, mobile, even server side, and then send it to our downstream tools.
Now, rumor has it that you have implemented the longest-running production instance of RudderStack, at six years and going.
Yes, I can confirm that. And one of the reasons we picked RudderStack was that it does not store the data, and we can live-stream data to our downstream tools.
One of the things about the implementation that has been so common over all the years, and with so many RudderStack customers, is that it wasn't a wholesale replacement of your stack. It fit right into your existing tool set.
Yeah, and it even works with technical tools, Eric, things like Kafka or PubSub, but you don't have to have all that complicated customer data infrastructure.
Well, if you need to stream clean customer data to your entire stack, including your data infrastructure tools, head over to rudderstack.com to learn more.
One thing I wanted to bring up, because this comes up a lot when I've talked to people, and there's kind of an obvious answer to this, I realize: say you wanted to do a startup today, and you want to be in the AI space. You probably want to be in the application layer, which is kind of the obvious answer. But what are some other spots that maybe are less obvious, from your perspective? So you've got foundational models over here, and there are probably only going to be a few successful, really large foundational models, just from a funding and spend perspective. Then you've got various levels, I would almost say, between a foundational model and the application layer: post-training stuff, or various other things you can do. I'm curious, what in your mind are some opportunities for people that aren't just the obvious, like, well, I can go work on a foundational model, versus, I'm going to work somewhere in the application layer?
That's a great question. So I would say that when you open a startup, or not even a startup, when you're just running a new project, it's always a question of choices, of tradeoffs. You need to look at what kind of team you have, what this team knows, what kind of tooling it knows. What's your budget for this? What is your time to market? Do you need to be on the market tomorrow, or in one year? And what kind of competitive advantage do you want to bring? And actually, all of that brings you to the question: what do I want to use? What's the complexity? You mentioned that there are frontier models, and then you can build your own model by post-training one of the open-source ones, by fine-tuning.
So that actually brings us to the question: I could take a frontier model, or I could take one of the open-source models from a provider. For example, at Nebius we have Nebius AI Studio, where you can get models and pay per token, and you get the best models from the open-source world. And you can fine-tune there as well; it's also pay per token. But, you know, that's the first option: you can go to a pay-per-token model service. Opposite to that would be to go to the cloud, get the GPUs, and build your own model or fine-tune the model. In the middle of them, I would say, we can see that there is a market of more abstracted GPUs. Back in the day we were calling it serverless, when you remove the complexity, so it's serverless GPUs. You get GPUs, but you don't control the infrastructure behind them.
You even still use them. But one of the things you also need, when you're making the decision about which of them you want to use: I would say there are four dimensions which you need to explore. The first is the economics. Like, if you're a small startup, or you're doing an MVP in a company, you probably don't want to rent GPUs and pay for them 24/7. Yeah, right. It's much easier to take a Llama model or an Anthropic model and just make an MVP, verify the idea, pay per token, and then you can re-evaluate whether you need to switch to something else. It's also about the operational dimension. What's your team? Is your team actually capable of using Kubernetes? Or do they know LangChain, and that's the level; they want to operate an API.
Another dimension is the technical requirements. What's the performance you need? Do you want to customize? If you take, for example, one of the models from a provider, and then you decide, oh, I need to fine-tune the model, because I have my own data, and here is a really good dataset. Now I want my model to be fine-tuned on that dataset. Can you do that or not? Is it technically available? And if it's not, does that mean you need to throw everything away and start to rebuild the infrastructure?
And the last dimension is actually the legal dimension, or you could say the strategic dimension. Because not a lot of people think about, you know, what's the time to market: if you take GPUs, will you be faster than if you take a model from a provider? And if you take the model from the provider, do you have special data? Like, once I was talking to one startup, and they told me that they do medicine, you know, an agent for the doctor, like a supporting agent for the doctor. And I asked them, okay, where are your patients? And he told me, the whole market is in the U.S., because it's the biggest market, we want to make money. And then, where is your model? Well, we use a provider from the European Union. Okay, so you take the data of United States people, especially medical data, which has so much compliance around it, and you put it across the ocean. Do you really think that you could do that? So you need to think about the legal side, your data privacy, and all that stuff. And that will help you to select the tooling. So there are so many decisions and trade-offs that you need to make when you select between a model from a provider and GPUs from a provider.
Right. Yeah, I think this would be interesting. So I think a lot of people, you know, just skill-set-wise, are going to typically start with a foundational model, especially if they want to be in the application layer. What are leading indicators, when working with a foundational model, that maybe you should explore some post-training or tuning rather than just using the foundational model as-is? What are some indicators people should look for to consider that?
I would say that, first of all, you need to verify the quality. I mean, is this model actually answering your question? Is it answering your question correctly? Ideally you check that with a harness. The thing is, the model was trained on some big dataset, but that big dataset doesn't necessarily have a good portion of knowledge about your specific use case, and it has a huge portion that is, I don't want to say garbage, but useless for your use case. Right. So you need to do a kind of testing, ideally not manual testing. It's not about you sitting and asking five questions in the chat and saying, oh yeah, that's a good model. Yeah. So you probably need to, you know, create a dataset of questions and answers. Then there's also another problem: do you have open questions or closed questions? Like, is it yes or no, so you can just go with, you know, precision and have one function to verify the quality? Or is it open questions, so you actually need another model as a judge to verify the answers and say how close they are? So you need to gather a dataset of really curated, really verified questions and answers. And then you need to take a couple of open-source models, or not just open source, it's fine to include commercial models, a couple of those foundation models, and do automatic testing: run all those questions through them and get the metrics for all of them. And that will help you to understand. Okay, there is a model that's really good on my questions, so I can use it. Or you can see: okay, all those models are bad.
What can you do if it's not good enough? Then you need to take a bigger dataset. So now it's not a dataset for the evaluation; now it's a dataset for the training. You take more data, more examples of your, you know, text corpus, images; it depends on which model you have, which modality you use. And then you need to fine-tune. So you also need to figure out which of these models you could fine-tune. If I take this commercial model, can I fine-tune it, and at what cost? If I fine-tune this open-source model, what will be the cost of the fine-tuning and of the usage? So then you do another round: let's say you do a quick fine-tuning, do another round of evaluation, check the new metrics. Okay, it looks like this commercial model is now becoming excellent; you can fine-tune it further. Or it looks like this open-source model is now really good. Maybe I should take the open-source one, because the commercial models, you know, you can only get from one place, but for the open-source model there are so many providers, and they're all fighting on pricing, so you can even find it really cheap. So I would say this is the reason why the majority of people might start to move on from foundation models: it's mostly because of the quality not fitting the use case, not enough knowledge about the specific domain. And that's when you want to fine-tune.
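The evaluation loop Alexander describes, a curated Q&A set run through several candidate models with automatic scoring, fits in a few lines for the closed-question case. The models below are stubs standing in for real API calls, and exact-match accuracy only works for closed questions; open questions need a judge model, as he notes.

```python
# Sketch of the automatic evaluation loop: run a curated question/answer
# set through candidate models and score them. The "models" are stubs
# standing in for real API calls; exact-match accuracy is only valid for
# closed (yes/no) questions.
eval_set = [
    {"q": "Is 17 prime?", "a": "yes"},
    {"q": "Is 21 prime?", "a": "no"},
    {"q": "Is 2 prime?", "a": "yes"},
]

def generic_model(question: str) -> str:
    # stand-in for a foundation model with little domain knowledge
    return "yes"

def tuned_model(question: str) -> str:
    # stand-in for a model fine-tuned on the domain: actually checks primality
    n = int(question.split()[1])
    return "yes" if n > 1 and all(n % d for d in range(2, n)) else "no"

def accuracy(model, dataset) -> float:
    hits = sum(model(ex["q"]).strip().lower() == ex["a"] for ex in dataset)
    return hits / len(dataset)

scores = {m.__name__: accuracy(m, eval_set)
          for m in (generic_model, tuned_model)}
```

On this toy set the tuned stub scores 1.0 and the generic one doesn't, which is exactly the signal Alexander says should push you toward fine-tuning; for open-ended answers you would replace the exact-match comparison with a judge-model call.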
Yeah, that makes sense. Yeah. And then I guess one of the things, and I don't have a lot of experience with this, but one of the things that I'm always thinking about is: you've got a foundational model, then you've got tools available to the model, right, MCP or however you're doing that. And then say I'm going to switch out the model. It ends up being a really complicated thing, where you're like, okay, here's a model, here's all the tools available to it, I'm going to switch out the model, give that new model the same tools, but maybe it's worse at tool calling, or there are just all these complexities. So, any helpful frameworks? I mean, like you mentioned, there are a million of them out there for trying to test models and compare models, but maybe just more of a heuristic framework that is helpful when people think about, I want to compare this to this, especially in a more complex scenario where you've got tool calls and other things going on, and it's not just, you know, raw models.
So I would say that if you want to be really excellent, if you want to have the best of all, you need to do the verification, you need to do the evaluation as well. Okay, so you want to replace that model with something else: you need to verify it, do programmatic testing of, you know, several models, and see how many of them will call the tool correctly or not. Another approach: you can actually separate it. You know, to actually select a tool, you don't need a big model; the model that selects a tool can be really small. And that means you can use one model to select the tool, to work with the tools that you have, and pass the information to the bigger model, which will do the thinking, you know, the chain of thought, and actually use the data. So you can use one model as a tool for another model. Yeah. Okay. Yeah, that's right.
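The split Alexander describes, a small cheap model that only selects the tool and a big model that does the reasoning, can be sketched like this. Everything here is a stub: the "router" is keyword matching and the "big model" is a format string, standing in for two LLM calls of very different cost.

```python
# Two-model pattern: a small model routes to a tool, a big model reasons
# over the tool's output. Both models and all tools are illustrative stubs.
def small_router_model(question: str) -> str:
    """Stand-in for a small, cheap model that only picks a tool."""
    q = question.lower()
    if "weather" in q:
        return "get_weather"
    if "price" in q:
        return "get_price"
    return "no_tool"

def big_reasoning_model(question: str, tool_output: str) -> str:
    """Stand-in for the large model that does the actual thinking."""
    return f"Using the tool result ({tool_output}): answer to '{question}'"

TOOLS = {
    "get_weather": lambda: "12 C, cloudy",  # hypothetical tool backends
    "get_price": lambda: "$42.00",
    "no_tool": lambda: "no external data",
}

def answer(question: str) -> str:
    tool_name = small_router_model(question)           # cheap call
    tool_output = TOOLS[tool_name]()                   # run the selected tool
    return big_reasoning_model(question, tool_output)  # expensive call
```

When you swap out the big model, the router and its tool-selection behavior stay fixed, which narrows down what you have to re-test.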
An agent, yeah. Okay, yeah, that's the marketing term, right? No, that's a great point. Okay, we're coming up on the end of the show here. You've mentioned this previously in the show, but I want to ask:
for people that do want to get more into AI, what would you say? We talked about how actually some of the tooling is the same, but what is that skill, what is that thing, other than maybe being really good at some of this tooling that we've talked about? Like, what's important? What would you tell somebody? I would say that,
first of all, you need to understand where you want to be in AI. Nowadays, AI is not anymore, you know, only for the people who think about algebra. Nowadays, AI is actually a big business where a lot of different people are needed. Right. So if you love the infrastructure, you don't have to go build a new algorithm. I mean, you can try to build a new algorithm, you can try to go into PyTorch, TensorFlow. But if you really like the infrastructure, if you have knowledge in that and you want to be in it, start working there, start to learn, do that. Yes, be the person who will actually help data scientists be effective and stop wasting their time on understanding what Kubernetes is, and what virtualization is, and all these things.
If you really like to write algorithms, then just, you know, join a team that is doing that. There are so many good ideas still in this space. Like, for example, recently there was the release of TorchFT, which is Torch fault-tolerant. That's, let's say, an algorithmic approach for how you can continue training even if you have a failed GPU. And that's a pure algorithm, and it's not even a machine learning algorithm; it's an algorithm for how the code runs, how you synchronize the data. But if you want to write the code of the AI model itself, you can do that as well. I mean, it's up to you.
So, first of all, decide what you want to do. You cannot do all of them. Right. Each of these pieces is a huge component already. So you can be successful in one, and that will already be more than enough. For sure, you can explore something else. It's like, you know, start where you're really good, and if you want, explore new options. And the second one, the second piece of advice, probably: do not be afraid. I remember, and it still exists, when you ask people, what do you think about AI? And they tell you, oh, that's for, you know, super smart people who know math and probably studied physics. No, I mean, for sure there are such people who, you know, create breakthroughs, who create amazing things. But to start working on AI nowadays, you can just use a web UI, or, you know, with vibe coding becoming widespread, you can do something with AI without literally understanding how to write the code and what the idea behind it is. So do not be afraid. AI is not as hard as you might think. There are things like Nebius AI Studio where you can just get a model and play in a UI. There are open-source projects like Flowise, which help you to just, you know, take blocks and connect them in a visual way; you don't even write code, you just drag and drop. And there are clouds like Nebius AI Cloud where you can get GPUs, where you can launch containers, where you can launch huge Kubernetes clusters. So it depends on your skills, but do not be afraid. You can start small and grow your skills as soon as you need, as soon as you feel you need it.
Yeah, yeah. Alex, such great advice. Thanks for being on the show today. Really enjoyed this. We'd love to have you back. And yeah, thanks for being here.
Yeah, thank you, John. This was a great talk, nice meeting you.
The Data Stack Show is brought to you by RudderStack. Learn more at rudderstack.com.