Screaming in the Cloud - Generative AI, Tech Innovations, & Evolving Perspectives with Randall Hunt
Episode Date: May 21, 2024

In this episode, we chat with Randall Hunt, the VP of Technology at Caylent, about the world of generative AI and how it's changing industries. Randall talks about his journey from being an AWS critic to leading tech projects at Caylent. He shares cool insights into the latest tech innovations, the challenges and opportunities in AI, and his vision for the future. Randall also explains how AI is used in healthcare, finance, and more, and gives advice for those interested in tech.

Show Highlights:
(00:00) - Introduction
(00:28) - Randall talks about his job at Caylent and the projects he's working on
(01:35) - Randall explains his honest and evolving perspective on Amazon Bedrock after working with it hands-on
(03:35) - Randall breaks down the components and improvements of AWS Bedrock
(06:08) - Improvements in AWS Bedrock's preview announcements and API functionality
(08:05) - Randall's predictions on the future of generative AI models and their cost efficiency
(10:00) - Randall shares practical use cases using distilled models and older GPUs
(12:12) - Corey shares his experience with GPT-4 and the importance of prompt engineering
(17:21) - Bedrock console features for comparing and contrasting AI models
(21:02) - Enterprise applications of generative AI and building reliable AI infrastructures
(28:13) - Randall and Corey delve into the costs of training large AI models
(36:37) - Randall talks about real-world applications of Bedrock in industries like HVAC management
(39:40) - Closing thoughts and where to connect with Randall

About Randall Hunt:
Randall Hunt, VP of Cloud Strategy and Solutions at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that JavaScript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski. Randall is the coder in the boardroom.

Links referenced:
Randall Hunt on LinkedIn: https://www.linkedin.com/in/ranman/
Caylent: https://caylent.com/
Caylent on LinkedIn: https://www.linkedin.com/company/caylent/

Sponsor:
Prowler: https://prowler.com
Transcript
hyper-focus on per-token costs is kind of like missing the forest for the trees,
because that is only one part.
Welcome to Screaming in the Cloud. I'm Corey Quinn. It's been a hot second since I got to
catch up with Randall Hunt, now VP of Technology at Caylent. Randall, what have you been doing? I haven't
seen you in a month of Sundays. Well, I'm still working at Caylent and we are still building cool
stuff. That's my new motto: we build cool stuff. And yeah, a lot of Gen AI is coming out from a lot of different customers. People are getting really interested in applying it. So that's what I'm doing these days.

...engineers and loved by developers. Prowler lets you start securing your cloud with just a few
clicks with beautiful, customizable dashboards and visualizations. No unnecessary forms,
no fuss, because honestly, who has time for that? Visit prowler.com to get your first security scan
in minutes. Some of the stuff that you have been saying on Twitter, yes, it's still called Twitter, has raised an
eyebrow. Because back when we first met, you were about as critical of AWS as I am. And what made
this a little strange was at the time that you worked there, you're one of those people that
I could best be described as unflinchingly honest. Sometimes this works to your detriment,
but it's one of the things I admire the most about you. And then you started saying nice
things about Amazon Bedrock in public recently. So my default conclusion is, oh, clearly you've
been bought and paid for and have thus become a complete and total shill, which sure that might
fly in the face of everything I thought I believed about you, but simple solutions are probably the best.
Before I just start making that my default assessment,
is that accurate or is there perhaps something else going on?
No.
So I think if you look at the way I was talking about Bedrock back in April of 23,
you can see I was still as unflinchingly honest as ever.
Although I guess I've grown up a little bit over
the years and I try to be a little less, I don't know, inflammatory in my opinion. So I'm like,
hey, this isn't real. This is vaporware. This doesn't work. So since then, we've had the
opportunity to work with... And me personally, I've had the opportunity, like hands-on keyboard, to work with over 50 customers in deploying real-world, non-experiment, non-proof-of-concept production solutions
built on Bedrock.
And I have to say, the service continues to evolve.
It continues to get better.
There are things that I still think need to be fixed in it, but it is a reliable, good
AWS service that I can recommend now.
I see your head exploding.
Yeah, I hear you.
The problem is, let me back up here
with my experience of Bedrock.
Well, before we get into Bedrock,
let's talk about, in a general sense,
I am not a doomer when it comes to AI.
I think it is legitimately useful.
I think it can do great things.
I think people have lost their minds in
some respects when it comes to the unmitigated boosterism. But there's value there. This is not
blockchain. We are talking about something that legitimately adds value. Now, in my experience,
Bedrock is a framework for a bunch of different models. A few of them great, some of them not.
Some pre-announced, like Amazon's Titan,
the actual Titan, not embeddings.
And I believe that has never been released to see the light of day for good reason.
But it always seems to me that Bedrock starts and stops
with being an interface to other models.
And I believe you can now host your own models
unless I'm misremembering a release
or another cloud company that was doing that.
There's a lot of different components of Bedrock.
You know how you think of SageMaker as like the family for traditional machine learning services and you've got SageMaker JumpStart.
Well, I used to until they wound up robbing me eight months after I used it for $260 in SageMaker Canvas session charges.
And now I think of it as basically a service run by thieves.
And I look forward to getting back into it
just as soon as I'm made whole.
Two and a half years later,
I'm starting to doubt that that'll happen.
But yes, I used to think of SageMaker that way.
I agree with you on the Canvas side.
I'm pretty equally frustrated.
Like I have to administer it for all of Caylent.
So I am our AWS administrator
and I have to manage all of our costs.
So I very much empathize with that component of it.
I do think the evolution of SageMaker, you have to think that it went all the way back
to what was it, 17 that it launched or was it 16?
I think I wrote the launch blog post, so I should really remember this, but I've forgotten.
Pretty sure it was 17.
It came out and my first question was, what are we going to do with all this Sage?
You know, it's got like three different generations of stuff on top of it, and figuring out the cost inside of SageMaker, which generation it belongs to. Is it
part of Studio? Is it part of Canvas? It is not fun. So I totally empathize with that part.
However, SageMaker has many good components to it. So that part of things, you think of SageMaker
as a family of services. Think of Bedrock as a similar family of services. You have things like guardrails. And these started out as very sort of rudimentary.
You would do a regular expression search for certain words. You would, in natural language,
define certain questions that you didn't want the models to respond to. It's much better now.
So you can tune in things like toxicity. You can tune in things like, oh, you know, you're an HR bot, but don't answer any questions
about finance or tax, things like that.
This works now, whereas previously it was more of like a preview feature that didn't really work.
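For listeners who want to see roughly what that looks like, here is a minimal sketch of creating a guardrail with a denied topic and a couple of content filters. The boto3 call and field names reflect my understanding of the Guardrails API at the time of writing, so treat the exact parameters as assumptions and check the current SDK docs.

```python
# Hedged sketch: a Bedrock guardrail that blocks finance/tax questions for an
# HR bot and applies content filters. Parameter names are my best understanding
# of the create_guardrail API; verify against current boto3 documentation.
import boto3

bedrock = boto3.client("bedrock")  # control plane, not bedrock-runtime

guardrail = bedrock.create_guardrail(
    name="hr-bot-guardrail",
    description="HR assistant: no finance or tax advice, filter toxic content",
    topicPolicyConfig={
        "topicsConfig": [{
            "name": "finance-and-tax",
            "definition": "Questions about personal finance, investments, or tax advice.",
            "examples": ["How should I file my taxes?"],
            "type": "DENY",
        }]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "INSULTS", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
    blockedInputMessaging="Sorry, I can only help with HR questions.",
    blockedOutputsMessaging="Sorry, I can only help with HR questions.",
)
print(guardrail["guardrailId"])
```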
Then the number of models that are available is going to continue to grow.
You can't actually bring your own model yet.
So what you can do is you can bring weights.
That's what it was.
Sorry, it all starts
to run together on some level. And it's, to be frank, it's hard to figure out what's real and
what's not, given how much AWS has been running its corporate mouth about AI for the last year
and a half. They've been pre-announcing things by sarcastic amounts. And it's always tricky.
Is that a pre-announcement or is that a thing customers can use today? I do know that I know nothing about what's up and coming
because that team doesn't talk to me
because in the early days,
I was apparently a little too honest.
So I think one of the things that they have improved on,
and I lost my mind over this back in 23,
is they've stopped saying it's in preview
when it's really coming soon.
So that was the biggest thing that drove me crazy
is they would say something was in preview.
All of my customers would come to me
and they'd be like, Randall, I want to get into this preview.
And then I would go to AWS and I would say,
AWS, we want to get into this preview
so that we can advise our customers and all this.
And then it turns out it was really a coming soon.
Yeah, it's in private preview
for a limited subset of customers whose corporate name must rhyme
with schmanthropic.
It's like that.
That is not the same thing.
But they've gotten better about that.
So they say coming soon now.
I don't know if you've seen some of the more recent announcements where it doesn't say
preview or anything like that.
It says coming soon, which is so much more helpful in helping our customers understand
what's real, what's on the way, what can you
use in your account today, that sort of thing. But getting back to Bedrock, it is a very solid API.
I think it's well-designed. The ability to return... If you do InvokeModelWithResponseStream, it's going to return at the end of everything a little JSON payload that has, you know, Bedrock metrics.
And you can also get these in CloudWatch, but it's very useful to have them per inference
instead of having to go and like average them out and get them from the aggregate results.
Like you can get per inference things like time to first token, which is very useful
when you're building something that's customer facing where you want streaming responses. You can get inter token latency, which is also important for
that. And you can get total inference time. Now total inference time is less useful in a streaming
use case. I mean, it's still a valuable metric. But all of that stuff is returned through the API.
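As a rough illustration of what he's describing, here is a minimal sketch of pulling those per-inference metrics out of a streaming Bedrock call. The metrics key and field names match what I've seen the API return, but treat them as assumptions and confirm against current documentation.

```python
# Hedged sketch: reading per-inference metrics from a streaming Bedrock call.
# The final chunk carries an "amazon-bedrock-invocationMetrics" object with
# token counts and latency figures (field names are my recollection).
import json
import boto3

brt = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Summarize Bedrock in one sentence."}]}],
}

resp = brt.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)

for event in resp["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk.get("type") == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="")  # stream tokens as they arrive
    metrics = chunk.get("amazon-bedrock-invocationMetrics")
    if metrics:
        # e.g. inputTokenCount, outputTokenCount, firstByteLatency (time to
        # first token), invocationLatency (total inference time)
        print("\n", metrics)
```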
There are some models that don't return that. And I think that's just because they're kind of
older legacy models. Titan is a real model. I've used it, but to your point, not the best, but that's fine. I think
AWS is probably working on some stuff and I hope they are. I'd love to see them release a model,
but I also think we're going to get into this area. Here's my prediction, right? And I know
we're getting a little bit off the topic of Bedrock with this, but my prediction regarding generative AI is that we will have a few foundation models that are
very powerful, large models. And then we're going to have many frequently changing distilled models,
so smaller models. We did this for a customer recently where the cost, the per token cost in
production of using a large language model like Claude 3 Sonnet or Claude 3 Opus was going to be way too high. It just wasn't going to work, given the threshold that they were operating at. What we did is we made Claude 3 Opus the decision maker deciding which tool to use for which job. And then we used something called DistilBERT, which is just a distilled version of BERT that you can fine-tune, which we did in SageMaker, on their particular data set for that particular thing. We used DistilBERT as the secondary tool. So then we could process this massive amount of data, like I think it was 50,000 requests per minute or something, with these DistilBERT models running on some G4s and G5s. We tried Inferentia as well, and we've had really good success with Inferentia on some of the Llama models and with some other customers.
But the G4s and G5s, people...
I don't know if I want to say this
because the spot market for them is really good right now.
But maybe we can cut that out.
It's not the latest and greatest NVIDIA GPU,
at which point everyone just loses their collective mind.
We've been saving customers a lot of money
by staying on some of the older GPUs,
and they perform really well with these distilled models.
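To make the "big model routes, small distilled model does the volume" idea concrete, here is a minimal sketch of the cheap path: a hypothetical fine-tuned DistilBERT checkpoint (the fine-tuning itself would happen in SageMaker) served from an older GPU via the Hugging Face transformers pipeline. The checkpoint path and labels are placeholders, not the customer's actual setup.

```python
# Hedged sketch: high-throughput classification with a fine-tuned DistilBERT on
# an older GPU (e.g. a g4dn/g5 instance). The checkpoint path and labels are
# hypothetical; the real model would come out of a SageMaker fine-tuning job.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="./distilbert-finetuned",  # hypothetical local checkpoint
    device=0,                        # first CUDA device on the GPU instance
    batch_size=64,                   # batch for throughput rather than latency
)

requests = [
    "reset the thermostat schedule for building 7",
    "what part number does this condenser fan use?",
]
print(classifier(requests))  # e.g. [{'label': 'SCHEDULING', 'score': 0.98}, ...]
```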
Oh, yes.
I've been doing some embedded work on Raspberry Pi,
doing inference running two models simultaneously
for a project that will be released in the fullness of time. And yeah, there are some challenges in resource-constrained
environments. GPUs help. The other trick, I guess, maybe I'll give away all my tricks here: Local Zones. So you can get very decent prices on GPUs in Local Zones. So if end-to-end user
latency is important to you, check out the prices there.
But the initial understanding
of Bedrock back when it first came out
was that it was a wrapper
around other models that you had.
And you say now it is a great API.
The problem I recall
was that every model required
a payload specified in a subtly
or grossly different way.
So it almost felt like,
what is this thing?
Are they trying to just spackle
over everything into a trench coat? What is the deal here?
It's a wrapper. So you still have to customize the payload a little bit. But
the good thing about the payloads is that they're basically all trending towards this message API.
So Anthropic Claude 2.1 and 2 had this API where you would say Human, Assistant, and you would just
kind of go back and forth in turns,
and that was the entire prompt. There's this new one, which is the messages payload.
And that structure is much more amenable to moving between different models. Now, that brings us to
the topic of prompt engineering. And prompt engineering is still quite different depending
on which model you use. So you can't take a prompt
that you've designed for Llama 3 and move it into Claude 3, for instance. They're not compatible.
That said, there's a ton of tooling that's out there these days. And there's LangChain,
there's Griptape. And I think all of those are good things to sort of learn. But if anyone
is listening to this and wanting to get into it,
the best way to learn is
to actually just remove
all of the SDKs that make this a lot
easier and just write
the prompts raw yourself
until you understand it.
We did that for our re:Invent session navigator that was powered by Claude 2.1.
It's just like no SDKs except for Boto3.
And then I think I used whatever SDK to talk to Postgres.
Those were the only SDKs we used.
And you can see what all of these tools are doing under the hood.
Tools like LlamaIndex and LangChain.
And once you understand it, you realize
a lot of this is just syntactic sugar.
It doesn't add a ton of overhead.
It's just niceties on top of things.
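For anyone who wants to try the "no SDKs except Boto3" exercise he's describing, here is a minimal sketch of a raw messages-style payload sent straight to Bedrock. The model ID and field names reflect the Claude 3 format as of this writing, so double-check them against current documentation.

```python
# Hedged sketch: a raw Anthropic messages-style payload sent directly through
# Boto3, with no prompt framework in between. Field names follow the Claude 3
# format on Bedrock as I understand it.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "system": "You are a concise assistant for conference attendees.",  # system prompt, kept separate from turns
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Which sessions cover Bedrock?"}]},
    ],
}

resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)
print(json.loads(resp["body"].read())["content"][0]["text"])
```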
I've been using GPT-4 for a while now as part of my newsletter production workflow.
Now, instead of seeing an empty field, which has the word empty in it historically, because
back when I built my newsletter ingest pipeline, you could not have empty fields in DynamoDB.
So I just, easy enough, cool, I'll put
the word empty in place and then just have a linter that validates that the word empty, by itself, does not appear in anything, which means I haven't forgotten anything, and we're good to go.
Now it replaces it with auto-generated text and sets a flag so that I still have something for
a linter to fire off of. And I very rarely will
use anything it says directly, but it gets me thinking. It's better than staring at an empty
screen, but it took me two months to get that prompt dialed in. I am curious as to how well
it would work on other models, but that has been terrific just from an unblocking me perspective.
Probably it's time for me to look at it a bit more. But in this particular case,
I viewed, at least until this conversation, GPT-4 as being best of breed. And I don't care
about cost because it costs less than $7 a month for this component. And as far as any latency
issues go, well, I'd like it done by Thursday night every week, which means that, okay,
for everything except the last day of it, I could theoretically, if cost did become an issue, wind up using the batch API that OpenAI has and pay half price and
get the response within 24 hours. And that is, so my use case is strictly best case. It took me
months to get the prompt right so it would come out with the right tone of voice rather than what
it thought that I did. So you have a couple options. If cost really doesn't matter, you should really compare GPT-4 to Claude 3 Opus. And if you're looking to use it
without any other sort of AWS tooling, you can just access the Anthropic SDK directly.
And that is my big question as far as, is there differentiated value in using Bedrock for
something like this, as opposed to
just paying Anthropic directly? There is. Because I did debate strongly, do I go directly with OpenAI or do I go with Azure's API? The pricing is identical. And despite all the grief I give Azure,
rightly so, for its lack of attention to security, this is all stuff that's designed to see the
public anyway. If they wind up training on its own responses in AWS blog posts, I assume they
already are. So it doesn't really matter to me from that perspective. So the pricing is identical,
the per token pricing and everything. The advantage of running within Bedrock is you can get all those
metrics that I was talking about. You can log everything in CloudWatch. You can get all the
traditional AWS infrastructure you're used to. And that's one benefit. The other benefit is,
and this is less useful for your use case, by the way.
So this is more useful for industry use cases.
You get stable, predictable performance.
Have you ever hit the OpenAI API
and it's been like,
LOL, you're out of tokens, 429 back off.
Like, I'm not going to give you anything more.
No, for a lot of stuff I use, it's ad hoc.
I use ChatGipity,
which just sort of sits there and spins and acts like it's doing something before giving a weird
error. And sometimes it's in-flight Wi-Fi not behaving, but other times it's great.
There's instability in a number of the consumer-facing APIs that you can get around
in the AWS APIs. So if you want to go and purchase provisioned throughput, which, that was another huge gripe I had with Bedrock, is that the provisioned throughput was antithetical to the way that AWS did usage-based pricing. Where you used to have to commit to one month or six months, you can commit to an hour now, which is much more reasonable from a, I want to build something and I know it's going to run in batch. So I'm going to purchase two model units of provisioned throughput for one hour.
That works super well.
And we've had customers do that for batch workloads.
And it's very dependable.
You get precise performance.
It's dialed in.
You know exactly what you need.
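As a rough sketch of that batch pattern, the control-plane call below buys a couple of model units without a term commitment, points the runtime at the resulting ARN, and tears it down afterward. The API and parameter names are from memory, and which base models support the no-commitment option varies, so verify before relying on this.

```python
# Hedged sketch: hourly (no-commitment) provisioned throughput for a batch run.
# API and parameter names are my recollection of the Bedrock control plane;
# confirm against current boto3 docs, and note that model support varies.
import boto3

bedrock = boto3.client("bedrock")

resp = bedrock.create_provisioned_model_throughput(
    provisionedModelName="nightly-batch",
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # hypothetical model choice
    modelUnits=2,  # two model units; omitting commitmentDuration keeps it hourly
)
provisioned_arn = resp["provisionedModelArn"]

# ...run the batch against bedrock-runtime, passing modelId=provisioned_arn...

bedrock.delete_provisioned_model_throughput(provisionedModelId=provisioned_arn)
```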
Whereas, you know, if you're using the on-demand APIs, you can get 429 backoffs all the time.
And originally when Bedrock first came out, the SDKs, this is funny, the SDKs, particularly the Python SDK did not correctly parse the 429
backoff because the throttled exception came back with a lowercase t and it was trying to do a case-sensitive match on the throttled exception, but that's fixed now. So it'll properly do the 429 backoffs and everything.
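On the client side, one commonly used way to soften those on-demand 429s is botocore's built-in retry configuration; this is a generic SDK feature rather than anything Bedrock-specific.

```python
# Hedged sketch: let botocore's adaptive retry mode handle throttling backoff
# instead of hand-rolling it around every Bedrock call.
import boto3
from botocore.config import Config

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)
```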
But those are the advantages really,
is that you can get the predictable performance that you're looking for.
It's much more suitable for kind of production workloads.
Something I saw recently reminded me
of a polished version of Bedrock.
And I wish I could remember what it was.
It was some website.
I forget if it was a locally hosted thing or some service
that you would bring your own API keys for a variety of different AI services. And then it
was effectively a drop-in replacement for ChatGipity. And you could swap between models,
race them against each other, and also just have a great user experience. The initial sell was
instead of $20, pay for your actual API usage, which, except for the very chattiest of you, is probably not twenty dollars. And okay, great, I'm not that cheap. I haven't... I didn't go down that path. But if I can swap out models and do side by side, that starts to change. You can actually do that in the Bedrock console now. So if you go to the Bedrock console, you can compare and contrast models, you can see what the cost is. We built something like that before it existed in the Bedrock console, so we call it
the Bedrock Battleground, and you can
pull in any of these models. I think the one you're
thinking of is probably the Vercel AI
SDK, which is also very,
very nice. We actually have submitted
some code and pull requests to make
Bedrock work better in that
SDK, and
adding in models like Mistral and
streaming support. But yeah, I mean, I'm
totally fine with that approach. But if you need to do it within AWS, it's right in the console now.
The reason I would avoid using Bedrock directly for something like this, perfect example of AWS's
long-tail challenges catching up with them. Very often, I will use the iOS app for ChatGipity and
can pick up where I left off or look at historical things. I'm also not sure if any of these systems,
please correct me at all if I'm wrong,
but the magic part to me about ChatGipity
is I can be asking it about anything that I want
and then it's, oh yeah, generate an image
of whatever thing I want.
Like one of the recent ones I did for a conference talk
was a picture of a data center aisle
with a giraffe standing in it.
And it was great,
because there's never going to be
stock photography of a zookeeper doing that.
But the fact that it's multimodal,
I don't have to wind up
constructing a separate prompt for DALL-E.
I mean, the magic thing that it does for that
is it constructs the prompt
on the backend for you
and you get much better results
as a direct result.
I don't have to think about
what model I wind up telling it to use.
As an added benefit,
because it has persistent settings
that become part of its system prompt
that it should know about you,
what I like is that I can say,
oh yeah, unless otherwise specified,
all images should be in 16 by 9 formats,
or aspect ratio,
because then it just becomes a slide
if it works out well.
I think you're still thinking about it
from the consumer perspective,
which is valid. You know, GPT, ChatGPT is a very polished product and it's simple.
You know, it's a simple interface with an incredible amount of complexity underneath.
And I think what Bedrock is providing, among other things, and it does have image generation models, by the way, Titan Image Generator and Stability,
is the same thing that AWS has always been particularly good at,
building blocks.
So it's letting people build capabilities like ChatGPT
into their own products.
And even going beyond that,
there's a ton of use cases beyond the chat interface that I think we're going to
see Bedrock applied for. One of the things that we did for a customer is we built a resumable
kind of data science environment. So think about Pandas data frames that exist within Jupyter notebooks. Now imagine you have a ChatGPT or something that can go and talk to this data frame
and it can send plots.
And those are all kept on Fargate containers there.
You know, we save the notebook,
we persist it to S3.
And then if a user wants to bring that session up again,
we restore it.
We bring that session back to life
and we go and we resume, you know,
the Python execution and we say,
hey, this plot that you made,
and Claude 3, by the way, supports multimodal,
so you can put images in and you can say,
hey, look at this plot and then change the x-axis
so that it gets rid of these outliers that I don't care about.
And it'll redo that.
And it'll actually write the Matplotlib code or the Plotly code in this case, but whatever. And it'll go and redo it. And that is something that I think is genuinely valuable and not just a typical chat use case.
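A rough sketch of the multimodal round trip he's describing, sending a rendered plot back to Claude 3 and asking for revised plotting code, might look like this. The image content-block format follows the Anthropic messages schema as exposed through Bedrock, and the file name is a placeholder.

```python
# Hedged sketch: send a rendered plot image to Claude 3 via Bedrock and ask it
# to rewrite the plotting code. Content-block format per the Anthropic messages
# schema; "plot.png" is a placeholder for the plot saved from the notebook.
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

with open("plot.png", "rb") as f:
    png_b64 = base64.b64encode(f.read()).decode()

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": png_b64}},
            {"type": "text", "text": "Rewrite the Plotly code so the x-axis excludes the outliers above the 99th percentile."},
        ],
    }],
}

resp = bedrock.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=json.dumps(body))
print(json.loads(resp["body"].read())["content"][0]["text"])
```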
Tired of big black boxes when it comes to cloud security? I mean, I used to work at a big black rock and decided I was tired of having a traditional job, so now I do this instead.
But with Prowler, you're not just using a tool, you're joining a movement. A movement that stands for open, flexible, and transparent cloud security across AWS, Azure, GCP, and Kubernetes.
Prowler is your go-to for everything from compliance frameworks like CIS and NIST to real-time incident response and hardening.
It's security that scales with your needs.
So if you're tired of opaque, complicated security solutions,
it's time to try Prowler. No gatekeepers, just open security. Dive deeper at prowler.com.
I want to push back on one of the first things you said in that response,
specifically that separating out the consumer from the professional use case.
The way that I've been able to dial these things in, and it has worked super well for me, is I start with treating ChatGPT as a spectacular user experience for prototyping what I will ultimately build, if I need it to be repeatable.
Like I don't need infinite varieties of giraffes and data center photographs because neither giraffes nor cloud repatriation are real, which was sort of the point of the slide.
But that was a one-off and it's great.
For the one-off approach though,
I did iterate on a lot of that using ChatGPT first
because is this even possible?
Because once I start getting consistent results
in the way that I want them with a prompt,
then I can start deconstructing
how to do it programmatically and systematize it.
But for the initial exploration,
the fact that there's a polished interface for it
is streets ahead,
and that's something AWS has never seemed
to quite wrap their head around.
You still can't use the bedrock console on an iPhone
because the entire AWS console does not work on a phone.
The next piece beyond that, then,
is if it's that easy and straightforward
to build and play around with something
to see if it's possible,
then what can change down the road?
The closest they come with this so far has been PartyRock, which is too good for this
world. And I'm still surprised it came out of AWS because of how straightforward and friendly it is.
So I think we are talking about two different use cases, right? I'm talking about the enterprise
or even startup, the business application of generative AI, in which case, bedrock is
absolutely the way that I would go right now. And you're talking about the individual in consumer
usage of generative AI, which I agree. True. None of the stuff I've done yet has been with an eye
towards scaling. You're right. This is I'm not building a business around any of this. It is
in service of existing businesses. Listen, AWS builds backends really well. When it comes to interfaces and frontends,
I mean, there's a lot to be done.
I've actually been pretty pleased with some
of the changes that have happened in the console.
I know people don't like it when the console changes,
but there used to be these little bugs
like not having endings on the
table borders in the DynamoDB console.
That infuriated me.
I don't know why.
It was such a simple thing to fix.
And I worked at AWS at the time.
And it took me two years to get a commit in
to fix that console.
That was the entire reason you took the job.
Once it was done, it was time for you to leave
because you'd done what you set out to do.
This is actually a fun piece of history.
You know, the AWS console started out in Google Web Toolkit. So it was GWT. Does anyone remember that? I don't think so. Like, you wrote your entire front end in Java and it would be translated into like AJAX and HTML on the back end.
That's how
all of the original consoles
were written.
2009, 2010
was my first exposure
to AWS as a whole.
Was that replaced by then?
No.
I think many of the back ends
were still,
or sorry,
many of the consoles were still GWT back then.
Come to find out, surprise, surprise,
a few still are today.
Kidding, I hope, I hope that's a joke.
I don't think any are.
I mean, I don't use SimpleDB, so it could still be,
but I think almost all the consoles moved to Angular
after that because there was a pretty easy upgrade path
between GWT and Angular.
And then a lot of people started experimenting
with React. And then there was this kind of really polished internal UI toolkit that let you
basically pick the right toolkit for your service, the right front end framework for your service.
And I think they've continued to iterate on that. And I do think that's the right approach. I wish there was a little more consistency in the consoles,
and I wish there was a little bit more of an eye
towards power user experience.
So a lot of times console teams that are new,
like new services that launch,
they don't think about what it means.
Oh, I have 2,000 SAML users here,
and that's not going to auto-populate
when I do a console-side search of users.
It needs to be a backend search. All these little things. But I think that's because
AWS works in this semi-siloed fashion where the service team does a lot of the work on their own
console. The only truly centralized team that I'm aware of at AWS is their security team.
And then everything else is sort of, okay,
I've got to get one front-end developer, I've got to get one
product manager, I need four back-end developers, and I'm
going to need one streaming person.
So I think that's just an artifact of how they work.
Yeah, that is their superpower
and also the thing that they struggle with the most.
Because individual teams going in different directions
will get you to solve an awful lot
of problems, but also means that there are certain
entire classes of problems that you won't that there are certain entire classes of problem
that you won't be able to address in a meaningful way.
User experience is very often one of those.
Billing, I would argue, might be another,
but that's a personal pet peeve.
On the topic of billing, you've also been, the polite way is "talking about," the impolite way is "banging on about," unit economics when it comes to generative AI, a lot.
As you might imagine,
this is of intense interest for me.
Yes.
What's your take?
So everyone wants the highest quality tokens
as quickly as possible,
as cheaply as possible.
Like if you are an enterprise user
or a large scale user of generative AI,
the unit economics of this go beyond tokens.
And I think if people just keep designing
to lower the per token cost,
there are models, there are architectures
that may not require tokenization
that we might want to use one day.
And this hyper focus on per token cost
is kind of like missing the forest for the trees
because that is only one part of the scale
and the cost that you have to deal with.
You have to think about embedding models.
So that's actually one place
where I've been pleasantly surprised and impressed
is AWS released the Titan V2 embeddings,
which support normalization.
And they're fairly new,
so we don't have hard, hard numbers on these yet. But we've had really good initial experiments. And I have all
the data on it if you want. A dramatic reading of an Excel spreadsheet. Those are the riveting
podcast episodes. But if you want me to send you a graph afterwards, I can show you where we saw
the good capabilities. And I can show you the trade-off between the 256 vector size, which
brings us back to the unit economics, right? Like the original Titan embeddings,
I think they had like a 4k vector output. Now, if you put that into PG vector, which is the
vector extension within Postgres, and you try to query it, well, guess what? You just blew up your
RAM. Like keeping that whole index in memory is very expensive. And these cosine similarity
searches are very expensive. Now, back then, PG Vector only supported what was called IVFFlat, which is just an inverted file index.
Next, what they did is they, Supabase, AWS, and this one individual open source contributor,
all worked together to get what's called HNSW, or Hierarchical Navigable Small World indexes, into Postgres.
And all of a sudden,
Postgres is beating Pinecone and everyone else on price performance and vectors.
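To make the embeddings-plus-pgvector side of that concrete, here is a minimal sketch: 256-dimension, normalized Titan V2 embeddings stored in Postgres behind an HNSW index. The model ID, request fields, and SQL reflect my understanding at the time of writing; the connection string, table, and query text are placeholders.

```python
# Hedged sketch: 256-dim, normalized Titan V2 embeddings + a pgvector HNSW index.
# Model ID and request fields are my understanding of the Titan V2 API; the
# Postgres connection string, table, and query text are placeholders.
import json
import boto3
import psycopg2

bedrock = boto3.client("bedrock-runtime")

def embed(text):
    body = {"inputText": text, "dimensions": 256, "normalize": True}
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=json.dumps(body))
    return json.loads(resp["body"].read())["embedding"]

def to_pgvector(vec):
    return "[" + ",".join(str(x) for x in vec) + "]"  # pgvector's text input format

conn = psycopg2.connect("dbname=app")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body text, embedding vector(256))")
    cur.execute("CREATE INDEX IF NOT EXISTS docs_hnsw ON docs USING hnsw (embedding vector_cosine_ops)")
    cur.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",  # cosine-distance nearest neighbors
        (to_pgvector(embed("condenser fault codes")),),
    )
    print(cur.fetchall())
```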
Now, the downside is that Postgres doesn't scale to like 100 million vectors. Because as soon as
you get into sharding and things like vectors don't shard well, you have to pick a different
shard key, all this other good stuff. That is a whole other side of the unit economics. It's like,
what is your vector storage medium or your document storage medium? And what is your cost of retrieval? And then what
is your cost of context? Because the Claude 3 models, for example, they have 200k of context
and they have darn good recall within that entire context. But that's a lot of tokens that you have
to spend in order to put all that context in. So part of the unit economics of this are,
hey, how good is my retrieval at giving me the thing that I'm looking for so that I can enrich
the context of the inference that I'm trying to make? And measuring that is three levels of
abstraction away from tokens. You have to have a human in the loop say, this is what we thought
was a quality answer. And the context was quality too.
And it was able to correctly infer what I needed it to infer.
I think people have lost sight of just how horrifyingly expensive it is to get these
models up and running.
There was a James Hamilton talk at the start of the year at CIDR where he mentioned that
an internal Amazon LLM training run had recently cost $65 million in real cost.
And that, like, honestly,
the biggest surprise was that Amazon spent anything like that on anything without having
a massive fight over frugality. So that just shows how hard they're hustling around these things.
But it's, I think, why we're seeing every company, even when the product isn't fully baked yet,
they're rushing to monetize up front, which I appreciate. I don't like everything subsidized by VCs
until suddenly one day there's a horrifying discovery.
I love GitHub Copilot.
It normally would cost 20 bucks a month
and I'd pay it in a heartbeat,
except for the fact as an open source maintainer,
I get it for free.
It's worth every penny I don't pay
just because of how effective it is
at weird random languages with which I'm not familiar
or things like it.
That is just a, it is a game changer for me in a number of different ways. It's great stuff. And
I like the fact that we're monetizing, but they have to because of how expensive this stuff is.
The other thing to think about there is there are power users when you price something,
right? At like $20 per user per month or $19 per user per month. There are power users who are definitely going to go above what that costs.
So that's, I think, part of the economic balancing act there: how do I structure
this offering in my product, whether it's a SaaS product, whether it's a B2B product,
or even a consumer-facing product, such that I am going to provide more value and impact
than it will cost me to deliver
and I will make this margin.
And those are the most interesting conversations
that I get to have with customers is moving...
First, I love the implementation.
I love getting hands-on keyboard
and building cool things for people.
But then we move one level up from that
and we're like,
hey, this is a technical deliverable.
But did it solve our stated business goal?
Did we actually accomplish the thing that we set out to do?
And building in the mechanisms for that and making sure we can measure the margin and
know that it's genuinely impacting things and moving the needle, that takes time.
That's more than a quarter over quarter view because it takes time for people to learn
about the product and to adapt it.
And people have to be willing to make some bets in this space. And that's scary for some enterprises that
are not used to making bets. But there was one other thing that I wanted to mention there about
the cost of training, which is the transformer architecture, the generative pre-trained
transformer architecture has quadratic, essentially, or exponential even, training costs. So as you
grow the size of the transformer network, as you increase the number of parameters,
as you change the depth of these encoders and decoders, you are increasing the cost to train.
Then you have the reward modeling, which is the human in the loop part. You have all of this other
stuff that you have to do, which again, increases the cost of training. There are alternative architectures out there. And I think the future
is not necessarily purely transformer based. I think the future is going to be some combination
of like state space models and transformers. And, you know, we're going to go back to the RNNs
that we used to use. And I, you know, what kind of ticks me off is,
I don't know if you remember back in 2017,
Sunil and I did this video
walking through the transformer architecture on SageMaker.
And even we didn't get it back then
that it was going to be this, you know,
massive thing unlocking emergent behavior.
And I think it was only people like Ilya and Andrej Karpathy who realized
actually, if we just keep
making this thing bigger, we get emergent behavior.
And it is suddenly not just a stochastic parrot. The act of predicting the next token has suddenly given us this emergent behavior and this access to this massive latent space of knowledge that has been encoded in the model, and it can, in real time, in exchange for compute, be turned into a valuable output.
Very far from AGI still in my opinion,
but I think you could brute force it
if you use the transformer architecture
and you just threw trillions of parameters at it
and trillions and trillions of tokens,
you could probably brute force AGI.
I think it is much more likely we will have an architectural shift away from transformers
or transformers will become one part and we will use something akin to SSMs or another
architecture alongside that. And you've already seen promising results there from different models.
These are early days. And I also suspect on some
level, the existing pattern has started to hit a point of diminishing returns. When you look at the
cost to train these models from generation to generation, at some point, it's like, okay,
when Sam Altman was traveling around trying to raise $7 trillion, it's okay. I appreciate that
that is the logical next step on this. I am predicting
some challenges with raising that kind of scratch. So there have to be different approaches to it. I think inference has got to be both on device and less expensive
for sustainability reasons, for economic reasons. And for example, something I think would be a
terrific evolution of this would be a personalized assistant that sits there and watches what someone
does throughout the day, like the conversations they have with their families, the things they
look for on the internet as they do their banking, as they do their job, as they have briefings or
client meetings, and so on and so forth. And there's zero chance in the universe I'm going
to trust that level of always-on recording data to anything that is not on-device, under my control, that I can hit with a hammer if I have to.
So to do that, you need an awful lot of advancements. That feels far future, but
future's never here until suddenly it is. I don't think it's that far away. We've already
gotten Llama 3 running on an iPhone 15, the 8 billion parameter model. And I think we got 24 tokens per second or something.
I mean, and admittedly,
that was a quantized version of the model,
but I mean, that's what everyone does
for this sort of hardware.
I think we're not as far from that future
as you might think.
And I love the idea of agents
and that's another good feature of Bedrock.
And I know we've gotten far
from the topic of Bedrock at this point,
but I know we're coming to an end here, and really, my goal with this call is to convince you to give Bedrock a shot. Like, go in, try it again, explore it, and update your opinion. Because I agree with you, when it first was announced, there was a lot of hullabaloo about something that wasn't really there yet. But we have these real things
that we are building on it now.
And it is really exciting.
Like, I just love seeing these products come to life.
There's one customer we have, Brainbox.
They built this HVAC management system
that is powered by generative AI.
And it can do a lot of the data science stuff that I was
talking about before where it's like, oh, resume a Jupyter notebook in a Fargate container and
show me this plot. Also, look up the manual for this HVAC thing and tell me what parts I'm going
to need before I drive out there. And it's all a natural language interface. And it's really helping
the built environment be better at decarbonization.
And those are the sorts of impacts that I'm excited about making. And I think building that on top of OpenAI, it would have been possible. We could have done it on top of OpenAI,
but getting it to integrate with Fargate and all of these other services would have been
more challenging. It would have introduced more vendors.
It would have been this overall
very weird kind of complex architecture
where we're balancing different models
against each other.
With Bedrock, it's one API call.
We're still able to use multiple models,
but it's all within this Boto3 ecosystem
or this TypeScript ecosystem.
And we're able to kind immediately, when a new model is
added to Bedrock, we started out in Cloud 2.1, or maybe it was Cloud 2, I don't remember.
Immediately, we were able to switch to Cloud 3 Sonnet when it came out and get better results.
So that's the other advantage of Bedrock is because this stuff moves so quickly,
I can go as soon as it's available in Bedrock without having to change or introduce
new SDKs or anything. Start using that model. I got way off whatever point I was originally
trying to make. I got excited about it. The point you started off with is that you were
urging me to give Bedrock another try. The entire AWS apparatus, sales and marketing,
has not convinced me to do that. But I strongly suspect you just may have done that. For someone
who's
not a salesperson, you are disturbingly effective at selling ideas. Listen, there are other SDKs out
there. There are other offerings out there and many of them are good, but Bedrock is one that
I'm bullish on. I think if they continue to move at this pace and they continue to improve,
it's going to be a very powerful force to be reckoned with.
There's a lot more they need to do though.
I email the product manager all the time.
And I'm very sorry.
Sorry if you're listening to this.
I'm sorry for blowing up your inbox.
There are all these little things that I want fixed in it.
But the fact is they fix them and they fix them within days.
So getting that sort of responsiveness and excitement from a team is just really powerful.
And you don't always get that with AWS products.
Sometimes teams are disengaged, unfortunately.
Sometimes teams are surprised to discover that's one of their products, but that's a
separate problem.
Okay, fair enough.
I will give it a try.
And then we will talk again about this on this show about what I have learned and how
it has come across. I have no
doubt that you're right. I'll be surprised. You are many things, but not
a liar. What do you think about CDK these days? I haven't done a lot with it lately just because I
have honestly been going down a Terraform well. Honestly, the big reason behind that: sometimes I want to build infrastructure where I'm not the sole person on the planet
capable of understanding how the hell it works. And my code is not exactly indexed for reliability and readability.
So there's a question around finding people who are conversant with a thing and Terraform's
everywhere. I'm a little worried about the IBM acquisition, to be honest. I don't know how all
of that is going to play out. Suddenly someone who's not a direct HashiCorp competitor is going
to care about OpenTofu. So that has the potential to be interesting.
But I don't know if you remember, you used to not be the biggest fan of CDK.
And then you and I had a Twitter DM conversation.
And then I think you started liking it.
Oh, then I became a cultist and gave a talk about it at an event dressed in a cultist robe.
Yes.
Costuming is critically important.
I'm hoping I can convince you on the bedrock side too.
I don't think it's cult worthy yet,
but it could get there.
We'll find out.
Thank you so much once again for your time.
I appreciate your willingness to explain complex concepts to me using simple words.
Something I should probably ask,
if people want to learn more,
where's the best place for them to find you?
Oh, you should go to caylent.com.
We post a bunch of blog posts.
We've got all kinds of stuff
about LLM ops and performance. And we post all our results. Back when Bedrock went GA, I wrote
a whole post on everything you need to know about Bedrock. Some of that stuff's out of date now,
but we keep a lot of things up to date too. And if you need any help, if you find all of this
daunting, all of this knowledge, all of this kind of content around generative AI
really difficult to stay apace of, feel free to just set up a meeting with me through Twitter
or with Caylent. And, you know, we do this every day. Like, this is our jam. We want to build
more cool stuff. And we will, of course, put a link to that in the show notes for you. Thanks
again for taking the time. I appreciate it. Always good to chat with you, buddy. Randall Hunt, VP of Technology
and accidental Bedrock-convincing go-to-market marketer.
I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud.
If you enjoyed this podcast,
please leave a five-star review
on your podcast platform of choice.
Whereas if you hated this podcast,
please leave a five-star review
on your podcast platform of choice,
along with an insulting comment, which is going to be hard because it's in the AWS console and it won't work on a phone.