The AI Daily Brief: Artificial Intelligence News and Analysis - Why Local AI Matters and How to Use It

Starting point is 00:00:00 Today on the AI Daily Brief, how and why to use local AI. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, robots and pencils, section, mission cloud, and out systems. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. To learn more about sponsoring the show, visit AIDailydlybreath.A.S. or you can email us at sponsors at AIDilybrief.aI.

Starting point is 00:00:36 To learn more about the new executive agent training program that Newfar mentioned at the end of this episode, go to training.bysuper.a.i. And yes, we are back with another Newfar operators cut. Specifically, this week, the conversation has been so much about the changing composition of enterprise AI strategy or just business AI strategy in general, as people deal with, one, rising costs from agentic workloads, and two, the new reality that our AI can be turned off. off on a whim at any moment. And yet there is a chasm between the idea of using alternatives to the major models and actually being able to do so. And so what Newfar is presented today is a primer that's going to give you a background understanding of a lot of the key concepts, terms,

Starting point is 00:01:18 and steps you would need to take to even explore thinking in this new way. All right, Newfar, welcome back to the Daily Brief. How's it going? Good. How are you? Good. So we are at the transition point moment. I hope fingers crossed by the time this airs, although I'm not super optimistic, but fingers crossed we might be playing with Fable 5 again. But I think that this week, as

Starting point is 00:01:45 I've been discussing all week, has shown why investing only in the biggest model or the best model is not necessarily the best strategy. On Thursday's episode from last week, I talked about what the alternative models and model approaches that companies are

Starting point is 00:02:00 starting to think through. But there is a huge gap between just shifting thinking from Fable 5 to some other type of model to understanding what that actually takes. And that's the gap that you are going to fill in, at least on a basic or high level for us today. I'll do my best. All right. So I do think that there is a big gap between saying open source and fully understanding the implications and deciding whether you should go and buy a hardware for your company. There is a lot of understanding that needs to be done of what it all means. So today I'll try to give a very practical overview of why you should care about open source and why you should care about running models locally. In practice, it will also

Starting point is 00:02:40 include how can you do it and for whom it might be relevant. So just a quick recap of the perfect storm that makes open source so important nowadays in my own words. So the first force that I see is the cost axis. Everybody is talking about tokens, the cost and how to maximize value while minimizing the cost or optimization of the cost. Anyway, a lot of conversation around tokens. That's for a good reason, but for the most part, it's becoming more and more expensive.

Starting point is 00:03:09 Just a few examples that I'm sure many of you have encountered, the price growth of GPT, the release of Opus 4.7 that changed the tokenizer, and all of a sudden, companies who hasn't made any change to their

Starting point is 00:03:24 prompts, has a bill that increased in sometimes 35%. and we're seeing more and more companies leaning towards agentic workflows, and as such, these harnesses are also a huge cost multipliers. So the thing is that many individuals and companies are more than willing to pay the price if the return is justified, and we are all very excited and pleased to know that there is a new player in the block, namely Fable 5, and then, as we all know, it was shut down, and we are still kind of waiting to understand the events, and maybe by the time,

Starting point is 00:03:59 it else your optimism is going to be in motion and we're going to have Fable 5 back. But I think that the eye is becoming increasingly more volatile because of the geopolitical forces. So theoretically, we said it before, but I think that seeing that in action and realizing that all of a sudden you might have a high dependency on a single vendor that can be shut down by a government, that create a new category of dependency risk that we all need to start thinking about how to alleviate that. So beyond these two, there's also a third force, and we should definitely keep paying attention to the fact that there is a capacity issue. And data centers are being built at a very fierce space. However, the usage is going even faster, and we may be heading

Starting point is 00:04:44 toward the world where it's not just about the cost. It's about whether you can even get access to sufficient compute when you need it. So many companies and individuals have hardware that is just sitting idle, that could serve their AI tasks. So that's an untapped resource, while the resource that they do tap into might become even more skills. For reference here, every estimate that I've seen, if you watch just sort of leaders from TSMC or from Nvidia or anyone,

Starting point is 00:05:13 they're all predicting capacity shortages at least through the end of the decade. No one is looking at earlier than 2030. And that might be optimistic just based on the difference in the speed that demand is growing versus capacity is growing. So this is why cost is not just a current issue. It is a leading indicator of a much bigger cost issue in my estimation. I agree. And one more twist to add on that is that even if you are contemplating buying hardware for home or for your company, the hardware itself is increasingly becoming more and more expensive. A lot of it is because of memory shortage. That means that there is a supply chain issue that

Starting point is 00:05:50 will not become even better anytime soon. So even if you are contemplating buying, there is something to say for buying sooner rather than later because the costs keep going up and up even for the purchase option. So if I put together all of these sources, I think that what we should all start thinking about is that local AI deployment of open source models

Starting point is 00:06:13 on a hardware that you own is very much like building shelter for your AI capability or the equivalent of the AI bomb shelter that you should consider. Obviously, on the one hand, it keeps you safe from all of the forces that we just name. We also are the owner of your data. You have availability during outages if you have a fully local deployment. On the other hand, it comes with an overhead and you might save on tokens, but you will spend on maintenance, updates, hardware, and the people who keep it running. So we'll talk more about it towards the end. But if you are paying a cloud vendor, often all of these costs and implications are hidden.

Starting point is 00:06:50 across a very well operationalized company. So if you are contemplating, bring it home. It's very important that you understand all of the implications. And that's what we're trying to do today. And namely, I want to meet you where you are, because I think that everybody should care, whether you are an executive that is steering your company's AI strategy and vendor decisions,

Starting point is 00:07:08 whether you are a practitioner that will drive the actual productization and deployment of local AI, or just an enthusiast that wants to experiment and then consider running at least some of your workloads on local models to save costs or just to be more self-sufficient. So bottom line, it's everybody, but I wonder what you think. Yeah, so I think that one of the biggest ways that AI differs from previous technology that I've seen is it's always a priority when there's a new technology movement for companies to come in

Starting point is 00:07:40 and reduce complexity as fast as possible. And what's been interesting is that with AI, the market of people who want to actually understand the guts of these systems and really get in there and figure them out, I think is much bigger. It's not just the sort of traditional addressable market of people who are any of these category or, you know, the practitioners or executives or IT people. And I think OpenClaw is a great example of this. OpenClaw became a phenomenon, not because there were so many people already in the IT or so many developers who were using it. It's because there was 8,000 people who ended up doing claw camp within the first month. And the vast majority of them,

Starting point is 00:08:18 weren't even technical to start. So this is kind of the same spirit where I don't anticipate, I think 99% of people who listen to this episode will not race out to go build something, but it's a blueprint. It'll help you understand the systems you're working with. And I guarantee it'll help you understand even the systems where all of these parts of things are obviated and behind the scenes. So it's why I wanted to put it on the show, especially right now, as everyone's paying attention, is that I think the market of people for whom it's applicable is much wider than it might seem. And what it used to be. I agree. All right, so let's dive right in.

Starting point is 00:08:49 But before, a very quick and important distinction just to make sure that we are all on the same page. With AI, we have two phases that require very different hardware. In the training phase, we're building the model from scratch. This is what the labs do. It's why they need billion-dollar data center and tens of thousands of specialized chips. It's not what we're talking about today.

Starting point is 00:09:10 You shouldn't care as an AI enthusiast about what Open AI or Anthropical others are doing with their massive data centers. You should care about inference. This is where you use the model that was already built by the various AI labs, asking it questions, getting answers, and empowering the brains of your agents. So everything in this episode is all around the inference, running a pre-built model on your own hardware.

Starting point is 00:09:34 And the hardware requirements for inference are dramatically lower than for training. That's why we are all able to now consider doing that on our own laptop or the hardware that we have lying around. A quick note, I'm going to simplify in places throughout the episode. There are many technical nuances that matter for engineers, but would just be noise for many of the other parts of the audience. So if you are an infrastructure professional, I'm sure that you will identify all the areas where I'm kind of cutting corners. And you'll also know why it's okay that I'm doing that and will forgive me. That's my disclaimer.

Starting point is 00:10:07 All right. So if I'm going back to the bomb shelter analogy, you don't have to go and build a full bunker on day one. because there are four levels from takes 10 minutes still cloud all the way to fully on your hardware, no internet needed. And I want to walk you to each one and maybe you will find what's the right place for you to be in. So at level one, that's the simplest first step. You can use a routing service like open router that sits between you and all the major AI provider. You have one account, you have one interface and it connects you to 400 or more models across more than 60. providers. And this gives you, first of all, a mix and match by task. You can route complex

Starting point is 00:10:49 reasoning to one provider. You can kind of very quickly do another routing to a simpler model. You can optimize cost versus quality per workflow. So you don't have to have a contract per vendor. And you also don't have a vendor lock-in and you can switch models or providers whenever there is something new or maybe something happened that caused you to, I want to consider moving between vendors. You also have a very good cost transparency so you can compare side by side and then select the model that works best for your own workflows. And of course, if there is some kind of an outage or a problem with one vendor, you can enable an automatic failover to another one, which makes it more robust. And lastly, of course, you can experiment with models to decide

Starting point is 00:11:34 whether there is a new kid on the block that is catching your attention and you want to swap to that. The trade-off for working with something like that is that the data still leaves your network, you are still cloud-dependent, and you're still paying quite a lot to a third party, but you're not dependent on a single vendor. And obviously, open router is not the only alternative. It's just the most popular one. There are other alternatives like a light LLM, if you want more of a router that is self-hosted on your own machine, you have port key for enterprise governance and others, just to name a few. Same concept, one interface to many providers with an automatic failover very quickly to set up. The level two is if your organization is already on some kind

Starting point is 00:12:18 of a cloud, whether it's AWS, Google, Azure, and so on, this level uses what you have. So we all heard or maybe are using services like AWS Bedrock, Google Vertex, Azure AI Foundry, and so on. They all let you run several vendors on your own cloud, and they all all let you run also open source models in a way that is secure, compliant, and in a place that you're already most likely operate anyway. That means that your data stays within your own virtual private cloud, and for the most part, it's going to be easier to approve that with your own security. You have two ways. You can use the commercial models, or you can use them for open source, as noted. And I think that this is the path where most large enterprises

Starting point is 00:13:03 are already taking or will be taking first whether they're starting to contemplate experimenting with more open source models just to see the option. And then we have an option that is not for the faint of heart, which is to self-host a cloud. It takes everything that we just discuss one step further. So instead of using a managed server

Starting point is 00:13:24 like the ones that we mentioned, you rent a GPU and you install your own model, your own serving. We'll explain what it means in a minute. That means that you don't have any platform. no restrictions and you get to do everything, good, bad, ugly. So for most organizations, this is not very practical because it requires a lot of infrastructure engineering, ones that know how to work with GPU drivers, work well with containers, and many other engineering

Starting point is 00:13:51 words. But for teams that have that capability, it gives maximum flexibility. And often, it's probably the lowest per query cost at high volume. Again, given that you know how to manage your own bare cloud without any help from the cloud providers. And lastly, this is where you go fully local. That means that everything is on a hardware that you physically control. No internet is needed after the initial model download. No model in the loop at all. And this is where we'll spend most of the rest of the episode

Starting point is 00:14:23 to walk you through what it means to deploy AI fully locally because I think that's where most of the learning lives. And that's the level that will truly survive any internet. internet outage, export control or vendor going dark, or I don't know what's going to be the future, but you have full control with that level. By the way, that's not where I think that everybody should start here. I think that if you are an enterprise, you should probably start at level one immediately, evaluate level two for a sensitive workload, and you can build toward level four if you have capabilities that must survive all of these disruptions. Of course, that's also

Starting point is 00:14:59 the level where many of ours, the individual practitioners, can live and build for ourselves and many people are already doing that and we'll focus on that level from here on after. So it's a stack of five layers to go fully local and they all matter. At the bottom, we have the hardware, where physically do we run our AI? Then we have the model. What is the intelligence that is being loaded? Then we have the serving layer, what software make it available. Then we have the agent harness or the user interface, what orchestrate the action. And at the very top, we have the fully user facing what you actually see and what you actually interact with. So I wanted to go from bottom to the top to make sure that you understand how to do it for

Starting point is 00:15:41 yourself, or at least as mentioned talk to talk. So layer one is the hardware. And the question is, where does it physically run? Just going very quickly to the basics, because this matters, your computer has two types of brains. You have the CPU and you have the GPU. The CPU is the general purpose chip that runs your operating system, the browser, the email, every computer has one. It can run AI models, but typically more slowly because it wasn't designed for this kind of mathematical operations. Then we have the GPU, which stands for graphic processing unit, originally built for gaming and video, but it turns out that the same architecture is perfect for AI. So GPUs do thousands of simple calculations simultaneously, which is exactly what's

Starting point is 00:16:29 running an AI model requires. So GPU is typically what you need. And the key number that truly makes a difference is the memory, specifically how much memory or GPU has called VRA. And the entire model needs to fit in this memory if you want to have a usable speed. If it doesn't fit, the system typically falls back to using regular memory through the CPU, and everything slows down dramatically. Just very quick hardware simplification.

Starting point is 00:16:57 All right. So what does it mean for different machines? If I have a regular laptop like the PC that I have, I don't have any gaming graphic cards, I don't have so much memory, I can run on my own laptop, small models, through the CPU. It's going to be quite slow, but still functional for simple things,

Starting point is 00:17:15 primarily to learn an experiment. So that's going to be like the small stuff. If, however, you have a Mac with an Apple Silicon, then you have a CPU and GPU that share the same memory pool. so your Mac can probably run even large models, and that's why Macs have become so popular for local AI, and as a result, most of them are very hard to come across nowadays. Another great option is if you have a desktop with gaming GPU,

Starting point is 00:17:45 that's going to have a dedicated graphic card with sufficient memory. It's not going to come cheap, it's going to come around $2,000. That's probably the sweet spot, because that can run between medium to large, models at a very good speed. We also have some interesting offering from Nvidia around this category, but you also have the option to run stuff on a phone or a tablet. So very small models can run even on your old Android machine. So don't be very haste to throw away old hardware. Lastly, a server with enterprise GPUs can run any

Starting point is 00:18:21 model well, but the cost structure is very simple. I'm going to explain what when you see these numbers of parameters and so on in a minute. But for now, think about the t-shirt sizes, meaning that your hardware determines the largest size that you can wear or the largest model that you can run. And typically, how smart or how sophisticated the use cases that you have in place. Okay, so prices are quite diverse. They spend $700 if you want to buy at the low end, some kind of a used high memory graphic card for an existing desktop. That gets you to medium-sized model, and that's going to cost you less than $1,000. At the mid-range, you will have $3,000 to $5,000 that will buy purposefully build AI

Starting point is 00:19:10 appliance from Nvidia or AMD, and the cost keeps going up and up. If you're contemplating, that's a category that becomes expensive as we go along. And at the high end, you have these numbers, and if we're talking about purchasing a server for a company where like it's a completely different degree of orders. A few things to know before you go and pull a credit card. First of all, as I mentioned, the Apple products have a massive wait times right now because of the memory shortage, so it can be even month. Second, you may not need to buy anything. You can just start with the hardware that you already have lying around, answer the ROI question. Like, do you have a

Starting point is 00:19:51 justification to go and buy a hardware? Do you have a use case that you are able to run locally to satisfaction and you will not default back to paying the cloud vendors sooner rather than later, only to have this very expensive or fairly expensive hardware lying at home or at your office not being used? And of course, if you are working in a regulated industry where compliance prohibits sending data to a third-party API, local may be a requirement and not a choice. But if this is the case, you have to be honest, because a machine, on your own network is not necessarily more secure

Starting point is 00:20:24 than a well-configured cloud API. So the security argument is strongest if you truly are not connected to the internet and no one can infiltrate your network. But if you are connected to the internet, it's not necessarily the stuff that you have within your walls of your company are more secure than what those cloud providers

Starting point is 00:20:43 are doing for you in order to secure you from cyber attacks. How different the enterprise costs, if you want to buy, a server for your data center, it starts with a quarter of a million dollars. So completely different bolge. I cover the capability gap between AI potential and AI reality every day on the show. Most companies are still figuring out how to start. Robots and pencils is already launching and scaling. Agentic and generative AI in production, at large enterprises in weeks. AWS advanced tier pattern partner more than doubled in a year. And they're hiring. 50 open roles. If you're someone who

Starting point is 00:21:24 knows this moment is different. Who wants to be inside it, not watching it, this is worth a look. At Robots and Penciles, the best ideas win, and the team is purposefully kept super high quality. This is the kind of place you look back on as the best decision you ever made. Take a look at robots and pencils.com slash careers. Here's a harsh truth. Your company is probably spending thousands or millions of dollars on AI tools that are being massively underutilized. Half of companies have AI tools, but only 12% use them for business value. Most employees are still using AI to summarize meeting notes. If you're the one responsible for AI adoption at your company, you need Section. Section is a platform that helps you

Starting point is 00:22:01 manage AI transformation across your entire organization. It coaches employees on real use cases, tracks who's using AI for business impact, and shows you exactly where AI is and isn't creating value. The result, you go from rolling out tools to driving measurable AI value. Your employees move from meeting summaries to solving actual business problems, and you can prove the ROI. Stop guessing if your AI investment is working. Check out section at sectionaI.com. That's SECT-I-O-N-A-I.com. The average enterprise is spending $11.5 million on AI this year,

Starting point is 00:22:35 and most of them can't prove a single dollar came back. What does AI actually look like when it produces ROI? Ask the healthcare company that just made their payment processing 320 times faster, or the law firm whose document research went from three months to 10 minutes, or the contact center who reduced wait times by 99. These are real Mission Cloud customers with real results. Mission Cloud is a CDW company and an AWS premiere to your partner.

Starting point is 00:22:59 They're the AI-first outcomes-obsessed AWS experts who build AI solutions that drive your business forward. Whether you're flooded with AI ambitions but no idea where to start or six months into a deployment that's going sideways, they've seen it and they've fixed it. Stop burning your budgets on AI that doesn't produce results. Start at missioncloud.com. This episode of the AI Daily Brief is brought to you by OutSystems. a leading Agendic Systems platform built for the enterprise. Organizations all over the world are building, orchestrating, and governing agentic systems on the OutSystems platform and with good reason.

Starting point is 00:23:32 OutSystems open and unified platform allows teams to architect, deliver, and scale governed agentic systems with agility. Teams of any size and technical depth can use OutSystems to build, deploy, and manage AI apps and agents quickly and cost-effectively without compromising reliability and security. Without systems, you can rapidly launch ideas from concept to completion. It's the leading agendic systems platform that is unified, agile, and enterprise proven, allowing you to accelerate growth, reduce operational friction, and deliver real enterprise impact with AI. OutSystems. Build your agentic future.

Starting point is 00:24:08 Let's talk about layer two of the model, and the question that we're trying to answer here is, what's the intelligence that we want our hardware to run? And I think that most of us never needed to think about what models we're running, or more importantly what their size, because if you use the... HGPT, Klaude, Gemina, and so on, you were using a model and all you had to decide is between fast and thinking, basically, because someone else chose it, hosted it, and did everything to maintain it. However, if you are contemplating, deploying your own model, you need to understand that model comes in different sizes, and a model size is measured in parameters, billions of

Starting point is 00:24:48 learned values that encode the patterns from the training data. You can think of parameters like vocabulary and experience all combined into one. Typically, more parameters means that the model can hold more nuance and can handle more complex reasoning and produce even more sophisticated output, again, at a high level. The question is, if that's the case, why don't companies just make every model enormous or more and more big over time? And that's because bigger models need more compute power to train to the point of billions of dollars at the frontier and much more memory to run. So the size spectrum is what you should understand very quickly. At a high level, we have the tiny models. Those will be one to four billion parameters. They are very fast. They

Starting point is 00:25:34 can run on anything literally, including even your Android machine, can typically hold basic chat, simple summarization, or a very like a pointed task. We then have the seven to 14 billion parameters. Those are the small, quite capable for everyday tasks. They can do writing. They can do some boilerplate code, they can do Q&A, they can run very well on a laptop or a basic GPU. And I believe that most of you, if you are contemplating doing a local deployment, will first deploy models from this family. And the medium size, we have near frontier. And as time goes by, we see more and more models at this size that are providing results

Starting point is 00:26:11 that are almost as good as the huge ones. They need quite a good GPU or a high-end Mac in order to run. And that's kind of the sweet spot. if you want to be serious about local deployment for yourself or for an immediate team. The large ones and of course the major ones, those are good for powerful reasoning. They will typically need more expensive hardware or even a setup that involves multiple GPUs. And I think that one pattern that is worth watching for is that especially for well-defined tasks around coding and math for the most part, we're starting to see tiny specialized models.

Starting point is 00:26:49 that match frontier performance. Just this week, we've seen a 3 billion parameter model called ViveTinker that match the cloud opus and Gemini Pro on coding benchmarks. So 3 billion is extremely small, such that you can run it as noted even on your phone. But the catch is that it only works this well on very structured, very verifiable tasks, and not necessarily on the knowledge work tasks that many of us are doing. So still, if we need general purpose, general knowledge, for things that we do as part of the knowledge work, size still matters,

Starting point is 00:27:21 but seeming like a future where you might run Frontier Class Specialized Mosul on very modest hardware. Bottom line, we don't need the Frontier-level intelligence for every task. A huge amount of what we do with AI can be done on the smaller ones or staying at the like 7 to 14 billion or 7 to 27 billion range. Those are open, free to downloads. They can run on hardware that all of us has. and bigger will be when you need either a more able or a more general purpose type of things. What you should care beyond the size,

Starting point is 00:27:55 because as we say, there are other parameters that make the models different. What many people are being caught off guard with how the models behave is that you can download the model that benchmarks beautifully. You try to use it for agentic tasks, calling the tools, following multi-step instructions, and they fail spectacularly

Starting point is 00:28:13 because it was trained for chat, not for tool use. So when you are evaluating a model, also check, does it support tool calling? How large is the context window? Will it hold the amount of input and output that I plan to run on a single session? Does it handle images?

Starting point is 00:28:29 Is the license commercial friendly? These are on the model card that I will explain shortly, but you need to read it like a product spec and don't just look at the size as the deciding parameter. And if I need to call out some of the most prominent models

Starting point is 00:28:45 in the open source ecosystem and obviously there are a ton more, but just to name a few, five names that keep coming up. Gemma from Google, great model to mention, comes in different sizes. Quinn from Alibaba,

Starting point is 00:28:58 and the number here is even not up to date. We have a more updated model. It's a coding champion and it fits well on a one good GPU. We have the deep seek that we all heard about. It has a very strong reasoning and it's quite good and capable. We have the family of

Starting point is 00:29:15 models from META, the Lama, Scout and others, and many models that were based on Lama that are quite good. And another one that I wanted to mention is the Hermes. It's a fine-tuned model from Nus Research, and specifically it was built for agentic work and tool-calling and some of the things that I mentioned are something that you need to look into. So if you are running an agent harness locally, it might be an interesting one to look into.

Starting point is 00:29:40 Just maybe one more point on fine-tuning, because I mentioned that a couple of times. this means that you take a general model and train it further for a specific purpose or with a specific data that you have. Hermes is exactly that. It took another model and improved it further in order to be good at a workflow. This list is going to be changing all the time too. Obviously, on AI Daily Brief, I'm trying to keep track of the ones that sort of transcend from developers are playing around with too. It's maybe more broadly worth knowing. GLM 5.2 is the one that came up this week that more and more people are talking about.

Starting point is 00:30:15 although we're still only a couple days into it. And a lot of the latest Chinese open weight model tends to have this pattern of people get super excited about it in the first few days, and then a few weeks later, no one's talking about it. So who knows if it'll stick around, but there's that. And we're also seeing, even from American companies,

Starting point is 00:30:32 a lot more experimentation with different model approaches. Cursor's composer is one that I bring up a lot on the show. So there's always changes on the model front, again, which is kind of why this is less a conversation about, the exact models and more the principles of running and being able to run and switch in and out these different types of models for different types of goals. Yeah. We have many others, which is exactly why, in general, one more place that I want you to pay attention to occasionally is a hugging face. Hanging face is like the App Store for all the AI models and the open source

Starting point is 00:31:05 models out there. And if you haven't been there, I strongly recommend that you go and check it out because everything is there open source or for free, and every major release will go there. Currently, they have almost more than 500,000 models hosted. And when you go into a specific model page, because you heard about it on the podcast or on X or wherever you're trying to stay up to date, you want to understand what's under the hood. You will first encounter what is called the model card,

Starting point is 00:31:34 which is basically like a spec that tells you what it is good at, what it was trained on, limitations, and just ready to make sure that if you are contemplating using a certain model that it fits what you need to do with that. You will also be able to see the license of the model. Typically, we're looking to get a model that is either Apache 2 or MIT. That means that you can use it even for commercial stuff however you want. Some have other restrictions, so pay attention to that if you're planning to use it for a product. And lastly, you will see a file called GGUF.

Starting point is 00:32:06 That's the compressed ready to run versions that you can download to your own hardware to start deploying and running the model. There are different files for different compression levels on more on that in a minute. And you need to pick the one that fits your own hardware. Another thing that I want you to use Hugging Face 4 is that it's a great place to see the vibes. Okay, because you will see how many downloads, what the community is saying about stuff. And while I know that we're all sometimes falling trapped to the benchmarks, which is maybe a good start, but I know Nathaniel that you repeatedly say that you don't believe in benchmarks, but what you can look into is the wisdom of the crowds, and that's exactly what you get in Hanging Face.

Starting point is 00:32:44 Because if you see that something has been downloaded a ton of times, that means that real people are finding real value, and that's why they're downloading that. They also mentioned trusted publishers, so you should use the ones that are official and approved. Be more wary of third-party unknown publisher before you download anything to avoid any incidents. One more thing to say about HangingFace, it's not just for models. There are applications and data set and spaces like live demos that people upload that you can explore, and there is tons of inspiration to draw from HangingFace. So even that, I'm not affiliated by anyway, but I just think that it's a great source

Starting point is 00:33:22 for anybody who wants to understand the art of possible to go and traverse. And whenever you are considering a specific model, I want you to go beyond the model card and ask your AI tool to do fresh with research on the community signals, it can be X Reddit, other places, developer forms and so on, just to see what actual practitioners are saying, because often what's written in the model card and the vibes from the community are completely different and you need to be aware of them. So that's Hanging Face.

Starting point is 00:33:51 I promise that I will say, what do I mean by quantization? Basically, the concept is how you fit the large model on a more practical hardware. that's something that unlocks basically the entire picture because when a model is published by the creators, it stores typically at maximum quality. That means that it uses a ton of memory in order to preserve the full accuracy. So a 27 billion parameter model

Starting point is 00:34:15 at this original quality needs 54 gigabyte of memory and nobody has that in a consumer-grade machine. So what the companies are doing and the model labs are doing in order to make it more accessible is they do quantization. and that basically compresses the model into lower precision. And if you need the analogy, it's like an image compression. The raw photo has a very high quality,

Starting point is 00:34:39 but a JPEG look nearly identical to the human eye, but it's a fraction of the file size. So that's the simplification of the concept. You can see Q4, Q8, Q5, or Q6, but Q4 means that that's the standard default, and it cuts the model to about 30% of the full size. For most tasks, if you see a model with the letter Q4, it's more than enough and it will run well on your hardware.

Starting point is 00:35:03 If for some reason you need higher quality, you can go to the Q8 or in between stuff like that. And you will see that on the file name. So maybe you will see like a quent, 3.7, 27 billion. That's the number of parameters. Q4, that's mean the quantization and the name of the files. So that's how you read all of these cues and files and so on. Enough about the models, let's talk about the serving layer.

Starting point is 00:35:25 That's the layer that loads the model, because we already cover the hardware and the model file, but you need software that loads the model and makes it available. It's a little bit like a waiter standing between the kitchen and the customers. That's the purpose of these software. It sits in the background. It's ready to serve when it's being asked. Two dominant offerings here, we have Olamma.

Starting point is 00:35:47 That's basically the engine. It's free. It's open source. It's the most popular way for you to serve models on your machine. It's very simple to install just one command, and one command to run the actual model. And what nice about it, it will automatically detects your own hardware

Starting point is 00:36:02 and configure itself. Critically, it exposes a standard interface that other tools can talk to, which that makes it that anything designed for cloud AI can point to your local ALAMA instead. So it makes it even very easy to transfer between tools you're currently running AI on versus the cloud to run all of a sudden locally.

Starting point is 00:36:23 And it has a ton, a ton, a ton of models in the library. so it supports almost anything that matters. The other thing that you might want to consider installing is the LM Studio. It's like the showroom. This is a desktop application with visual interfaces where you can browse models, see the hardware usage in real time. You can test two models side by side, and it's very good for understanding what different models can do before you commit.

Starting point is 00:36:47 So these do work very well together, the LM studio to explore and evaluate an olama to serve in production. And if you need to serve multiple users at scales, There are additional tools for that, but that will typically also require more technical team. So I'm not going down the route of more sophisticated serving as a software. Moving to the layer of the agent harness or what orchestrates the AI, we have a chat interface. That's one thing. An agent is another thing.

Starting point is 00:37:15 And the difference is that a chat interface lets you talk to the model, but an agent harness lets the model take actions. It can read the files or search the web. It can call the various APIs or MCPs, send messages, run schedule tasks, and all the fun stuff that we love about our various agency capabilities. If you want to go down a chat interface for your local AI, one very simple path and very useful way to do that is to use open web UI. Again, very popular, self-hosted web application that look and feels very much like a chat GPT. You can point it at your local alama and then your team has a very private chat GPT that runs. entirely on your hardware. It enables multi-user, it can document the upload, it has a search built-in,

Starting point is 00:38:00 and so it's a very good alternative if you want to create local ideas primarily for chatting. However, if you do want to go all the way to an agentic harness that is hosted locally, obviously there are a ton of options and the list is getting longer and longer and longer over time, but two things to note. One is obviously open claw and the other is Hermis agent. Those are the most dominant in the open source of agentic harnesses. Both of them will run on your own hardware. Both of them support local models to Olamma. They do tool calling, persistent memory, as well as integrating with various messaging platforms.

Starting point is 00:38:36 The difference is the philosophy. OpenClaw gives tighter manual control because you can create the skills and define the rules and create the context. Hermes leans into autonomy. It is writing its own skills from experience. it does a lot of self-evaluation to improve for you, and it has a compound capability over time. Because both of them are fully open source and you can install in minutes,

Starting point is 00:38:59 they are both great things to explore if you haven't already. But I will say on Hermes that it's, at least in my opinion, becoming more and more predominant option that if you haven't looked into is something to look into this June over the summer. And I think that if you do go and install one of them or one of the alternative agentic harnesses, locally and you do all the other layers that we just talked about, all of a sudden you are in

Starting point is 00:39:24 full control and running everything locally without paying anything beyond electricity. One more thing to say with regards to coding specifically is that even if you are used working with a different coding tool, and most of them can be pointed also to local models, and not too many people are doing that, but all of the major players are now integrating very well into OLAMA in order to run stuff locally. There is one caveat that some of the features within these tools stay cloud only regardless. So, for example, auto-complete in some cases,

Starting point is 00:39:56 will not work if you're running on a local model and some other cases, automations are run on the cloud and so on. But even if you are primarily using these tools and you don't want to go to Hermis or OpenCla, you can work also with local models and reduce the costs and the dependency on the cloud model providers. Last layer, what you actually interact with, this is the top of the stack. The one thing that you will touch day to day, it can be an open web UI chat window that you give your team.

Starting point is 00:40:25 It could be the Hermis desktop that was just released a few days ago or a few weeks ago. It can be something that you interact with through Slack or Discord or wherever you're conversing with your agent. And the point is that once the lower layers are all working, this layer is completely flexible. You can build anything on top of a locally served a model that you could build on top of any cloud API. So that's not where you should spend a lot of your energy. So I want to bring us home. I know it was a lot. There is an honest trade-off here.

Starting point is 00:40:59 Let's start with what you gain. If you are going local with AI, you get a lot of data independency. Nothing leaves your network. You have availability. You cannot be shut off by export control, vendor decisions or internet outages. You have cost predictability because after the hardware investment, the marginal cost per query is almost zero except electricity. You have learning because running model locally teaches your organization how AI actually works under the hood. And many people who interact with the models directly all of a sudden have a ton of a ha moments from the process.

Starting point is 00:41:33 However, you do take on a lot of responsibility and effort. Hardware, if you haven't had it lying around, is something that you need to buy. maintenance. When something breaks, it's on you. No one will fix your Olama that is not working or your open cloud is not working or whatever you decided to install locally. These tools have a ton of updates. So every time there is a better model or a better software, it's on you to update and make sure that it's still running smoothly. The security integration, if there are new things that are happening, it's on you to orchestrate or install them. And lastly, you might realize that, you went all in on local AI in order to save a ton of tokens, but you are having a few people that are working around the clock to maintain your local AI. And all in all, the cost of tokens versus the cost of humans are not comparable. So that's something to pay attention to. And also the fact that security is not guaranteed if you don't know what you're doing with local AI,

Starting point is 00:42:35 especially if you are connected to the internet, that's that. I think that if you need to start somewhere, one good machine, one useful workflow, prove the quality, secure it, and then decide whether to scale. So if I'm trying to be even more concrete, I think that I gave you a ton of vocabulary, a mental model, and the landscape. You understand hopefully the five layers and what decisions live in each layer and so on. But what you can do immediately after depends on who you are. So if you are an executive, maybe you have enough food for thought to ask informed questions

Starting point is 00:43:09 of your technical team, what's our position on local models, have we evaluated our vendor dependency? What would we do if our primary I provider becomes unavailable or overly expensive? So these are some of the questions that you should be able to ask. If you are a practitioner, you can definitely install Olama this week if you haven't had it or experiment with yet another latest and greatest open source model to see how well it serves your own workflows, to see how it feels and so on. And also, I believe that the hands-on experience is worth more than any amount of reading

Starting point is 00:43:40 that you can do. And of course, if you are in a regulated instrument, industry, that's something to definitely contemplate more and more with your compliance and infrastructure team to see what's the right stance for you. And the core message is, from my perspective, is not that everyone must run AI locally. It's that the landscape has shifted enough on cost, on control, on access, that every organization making serious AI decisions need an informed position at the very minimum in a very deep conversation on that. And even if the position is not for us right now.

Starting point is 00:44:12 It should be a deliberate choice and not an assumption that you never go back and re-examine. So it's not for everyone, but understanding is for everyone. And one last thing before I go, I wanted to mention that we just launched the executive agent leadership program.

Starting point is 00:44:30 It is the evolution of the beloved enterprise club program. It was rebuilt for everything that's changed in the last few weeks and months. The token economy, the local deployments, the security, the vendor independence, all of that. It's a six-week cohort for leaders who want to build AI agents hands-on and then design how the organization operates in the agent era. The first revised cohort will start June 29. And if this resonated with you or you want to spend some time with others going through the same process and have fun with us, I'll be more than

Starting point is 00:45:03 happy to have you there.

The AI Daily Brief: Artificial Intelligence News and Analysis - Why Local AI Matters and How to Use It

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.