Screaming in the Cloud - Generative AI, Tech Innovations, & Evolving Perspectives with Randall Hunt
Episode Date: May 21, 2024

In this episode, we chat with Randall Hunt, the VP of Technology at Caylent, about the world of generative AI and how it's changing industries. Randall talks about his journey from being an AWS critic to leading tech projects at Caylent. He shares cool insights into the latest tech innovations, the challenges and opportunities in AI, and his vision for the future. Randall also explains how AI is used in healthcare, finance, and more, and gives advice for those interested in tech.

Show Highlights:
(00:00) - Introduction
(00:28) - Randall talks about his job at Caylent and the projects he's working on
(01:35) - Randall explains his honest and evolving perspective on Amazon Bedrock after working with it hands-on
(03:35) - Randall breaks down the components and improvements of AWS Bedrock
(06:08) - Improvements in AWS Bedrock's preview announcements and API functionality
(08:05) - Randall's predictions on the future of generative AI models and their cost efficiency
(10:00) - Randall shares practical use cases using distilled models and older GPUs
(12:12) - Corey shares his experience with GPT-4 and the importance of prompt engineering
(17:21) - Bedrock console features for comparing and contrasting AI models
(21:02) - Enterprise applications of generative AI and building reliable AI infrastructures
(28:13) - Randall and Corey delve into the costs of training large AI models
(36:37) - Randall talks about real-world applications of Bedrock in industries like HVAC management
(39:40) - Closing thoughts and where to connect with Randall

About Randall Hunt:
Randall Hunt, VP of Cloud Strategy and Solutions at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that JavaScript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski. Randall is the coder in the boardroom.

Links referenced:
Randall Hunt on LinkedIn: https://www.linkedin.com/in/ranman/
Caylent: https://caylent.com/
Caylent on LinkedIn: https://www.linkedin.com/company/caylent/

Sponsor:
Prowler: https://prowler.com
Transcript
hyper-focus on per-token costs is kind of like missing the forest for the trees,
because that is only one part.
Welcome to Screaming in the Cloud. I'm Corey Quinn. It's been a hot second since I got to
catch up with Randall Hunt, now VP of Technology at Caylent. Randall, what have you been doing? I haven't
seen you in a month of Sundays. Well, I'm still working at Caylent and we are still building cool
stuff. That's my new motto: we build cool stuff. And yeah, a lot of Gen AI is coming out from a lot of different customers. People are getting really interested in applying it. So that's what I'm doing these days.

...engineers and loved by developers. Prowler lets you start securing your cloud with just a few
clicks with beautiful, customizable dashboards and visualizations. No unnecessary forms,
no fuss, because honestly, who has time for that? Visit prowler.com to get your first security scan
in minutes. Some of the stuff that you have been saying on Twitter, yes, it's still called Twitter, has raised an
eyebrow. Because back when we first met, you were about as critical of AWS as I am. And what made
this a little strange was at the time that you worked there, you're one of those people that
I could best be described as unflinchingly honest. Sometimes this works to your detriment,
but it's one of the things I admire the most about you. And then you started saying nice
things about Amazon Bedrock in public recently. So my default conclusion is, oh, clearly you've
been bought and paid for and have thus become a complete and total shill, which sure that might
fly in the face of everything I thought I believed about you, but simple solutions are probably the best.
Before I just start making that my default assessment,
is that accurate or is there perhaps something else going on?
No.
So I think if you look at the way I was talking about Bedrock back in April of 23,
you can see I was still as unflinchingly honest as ever.
Although I guess I've grown up a little bit over
the years and I try to be a little less, I don't know, inflammatory in my opinion. So I'm like,
hey, this isn't real. This is vaporware. This doesn't work. So since then, we've had the
opportunity to work with... And me personally, I've had the opportunity, like hands-on keyboard, to work with over 50 customers in deploying real-world, non-experiment, non-proof-of-concept production solutions
built on Bedrock.
And I have to say, the service continues to evolve.
It continues to get better.
There are things that I still think need to be fixed in it, but it is a reliable, good
AWS service that I can recommend now.
I see your head exploding.
Yeah, I hear you.
The problem is, let me back up here
with my experience of Bedrock.
Well, before we get into Bedrock,
let's talk about, in a general sense,
I am not a doomer when it comes to AI.
I think it is legitimately useful.
I think it can do great things.
I think people have lost their minds in
some respects when it comes to the unmitigated boosterism. But there's value there. This is not
blockchain. We are talking about something that legitimately adds value. Now, in my experience,
Bedrock is a framework for a bunch of different models. A few of them great, some of them not.
Some pre-announced, like Amazon's Titan,
the actual Titan, not embeddings.
And I believe that has never been released to see the light of day for good reason.
But it always seems to me that Bedrock starts and stops
with being an interface to other models.
And I believe you can now host your own models
unless I'm misremembering a release
or another cloud company that was doing that.
There's a lot of different components of Bedrock.
You know how you think of SageMaker as like the family for traditional machine learning services and you've got SageMaker JumpStart.
Well, I used to until they wound up robbing me eight months after I used it for $260 in SageMaker Canvas session charges.
And now I think of it as basically a service run by thieves.
And I look forward to getting back into it
just as soon as I'm made whole.
Two and a half years later,
I'm starting to doubt that that'll happen.
But yes, I used to think of SageMaker that way.
I agree with you on the Canvas side.
I'm pretty equally frustrated.
Like I have to administer it for all of Caylent.
So I am our AWS administrator
and I have to manage all of our costs.
So I very much empathize with that component of it.
I do think the evolution of SageMaker, you have to think that it went all the way back
to what was it, 17 that it launched or was it 16?
I think I wrote the launch blog post, so I should really remember this, but I've forgotten.
Pretty sure it was 17.
It came out and my first question was, what are we going to do with all this Sage?
You know, it's got like three different generations of stuff on top of it, and figuring out the cost inside of SageMaker, which generation it belongs to. Is it
part of Studio? Is it part of Canvas? It is not fun. So I totally empathize with that part.
However, SageMaker has many good components to it. So that part of things, you think of SageMaker
as a family of services. Think of Bedrock as a similar family of services. You have things like guardrails. And these started out as very sort of rudimentary.
You would do a regular expression search for certain words. You would, in natural language,
define certain questions that you didn't want the models to respond to. It's much better now.
So you can tune in things like toxicity. You can tune in things like, oh, you know, you're an HR bot, but don't answer any questions
about finance or tax, things like that.
This works now, whereas previously it was more of like a preview feature that didn't really work.
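For listeners who want to see roughly what that looks like, here is a minimal sketch of creating a guardrail with a denied topic and a couple of content filters. The boto3 call and field names reflect my understanding of the Guardrails API at the time of writing, so treat the exact parameters as assumptions and check the current SDK docs.

```python
# Hedged sketch: a Bedrock guardrail that blocks finance/tax questions for an
# HR bot and applies content filters. Parameter names are my best understanding
# of the create_guardrail API; verify against current boto3 documentation.
import boto3

bedrock = boto3.client("bedrock")  # control plane, not bedrock-runtime

guardrail = bedrock.create_guardrail(
    name="hr-bot-guardrail",
    description="HR assistant: no finance or tax advice, filter toxic content",
    topicPolicyConfig={
        "topicsConfig": [{
            "name": "finance-and-tax",
            "definition": "Questions about personal finance, investments, or tax advice.",
            "examples": ["How should I file my taxes?"],
            "type": "DENY",
        }]
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "INSULTS", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
    blockedInputMessaging="Sorry, I can only help with HR questions.",
    blockedOutputsMessaging="Sorry, I can only help with HR questions.",
)
print(guardrail["guardrailId"])
```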
Then the number of models that are available is going to continue to grow.
You can't actually bring your own model yet.
So what you can do is you can bring weights.
That's what it was.
Sorry, it all starts
to run together on some level. And it's, to be frank, it's hard to figure out what's real and
what's not, given how much AWS has been running its corporate mouth about AI for the last year
and a half. They've been pre-announcing things by sarcastic amounts. And it's always tricky.
Is that a pre-announcement or is that a thing customers can use today? I do know that I know nothing about what's up and coming
because that team doesn't talk to me
because in the early days,
I was apparently a little too honest.
So I think one of the things that they have improved on,
and I lost my mind over this back in 23,
is they've stopped saying it's in preview
when it's really coming soon.
So that was the biggest thing that drove me crazy
is they would say something was in preview.
All of my customers would come to me
and they'd be like, Randall, I want to get into this preview.
And then I would go to AWS and I would say,
AWS, we want to get into this preview
so that we can advise our customers and all this.
And then it turns out it was really a coming soon.
Yeah, it's in private preview
for a limited subset of customers whose corporate name must rhyme
with schmanthropic.
It's like that.
That is not the same thing.
But they've gotten better about that.
So they say coming soon now.
I don't know if you've seen some of the more recent announcements where it doesn't say
preview or anything like that.
It says coming soon, which is so much more helpful in helping our customers understand
what's real, what's on the way, what can you
use in your account today, that sort of thing. But getting back to Bedrock, it is a very solid API.
I think it's well-designed. The ability to return... If you do InvokeModelWithResponseStream, it's going to return at the end of everything a little JSON payload that has, you know, Bedrock metrics.
And you can also get these in CloudWatch, but it's very useful to have them per inference
instead of having to go and like average them out and get them from the aggregate results.
Like you can get per inference things like time to first token, which is very useful
when you're building something that's customer facing where you want streaming responses. You can get inter token latency, which is also important for
that. And you can get total inference time. Now total inference time is less useful in a streaming
use case. I mean, it's still a valuable metric. But all of that stuff is returned through the API.
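As a rough illustration of what he's describing, here is a minimal sketch of pulling those per-inference metrics out of a streaming Bedrock call. The metrics key and field names match what I've seen the API return, but treat them as assumptions and confirm against current documentation.

```python
# Hedged sketch: reading per-inference metrics from a streaming Bedrock call.
# The final chunk carries an "amazon-bedrock-invocationMetrics" object with
# token counts and latency figures (field names are my recollection).
import json
import boto3

brt = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Summarize Bedrock in one sentence."}]}],
}

resp = brt.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)

for event in resp["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk.get("type") == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="")  # stream tokens as they arrive
    metrics = chunk.get("amazon-bedrock-invocationMetrics")
    if metrics:
        # e.g. inputTokenCount, outputTokenCount, firstByteLatency (time to
        # first token), invocationLatency (total inference time)
        print("\n", metrics)
```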
There are some models that don't return that. And I think that's just because they're kind of
older legacy models. Titan is a real model. I've used it, but to your point, not the best, but that's fine. I think
AWS is probably working on some stuff and I hope they are. I'd love to see them release a model,
but I also think we're going to get into this area. Here's my prediction, right? And I know
we're getting a little bit off the topic of Bedrock with this, but my prediction regarding generative AI is that we will have a few foundation models that are
very powerful, large models. And then we're going to have many frequently changing distilled models,
so smaller models. We did this for a customer recently where the cost, the per token cost in
production of using a large language model like Claude 3 Sonnet or Claude 3 Opus was going to be way too high. It just wasn't going to work, given the threshold that they were operating at. What we did is we made Claude 3 Opus the decision maker deciding which tool to use for which job. And then we used something called DistilBERT, which is just a distilled version of BERT that you can fine-tune, which we did in SageMaker, on their particular data set for that particular thing. We used DistilBERT as the secondary tool. So then we could process this massive amount of data, like I think it was 50,000 requests per minute or something, with these DistilBERT models running on some G4s and G5s. We tried Inferentia as well, and we've had really good success with Inferentia on some of the Llama models and with some other customers.
But the G4s and G5s, people...
I don't know if I want to say this
because the spot market for them is really good right now.
But maybe we can cut that out.
It's not the latest and greatest NVIDIA GPU,
at which point everyone just loses their collective mind.
We've been saving customers a lot of money
by staying on some of the older GPUs,
and they perform really well with these distilled models.
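To make the "big model routes, small distilled model does the volume" idea concrete, here is a minimal sketch of the cheap path: a hypothetical fine-tuned DistilBERT checkpoint (the fine-tuning itself would happen in SageMaker) served from an older GPU via the Hugging Face transformers pipeline. The checkpoint path and labels are placeholders, not the customer's actual setup.

```python
# Hedged sketch: high-throughput classification with a fine-tuned DistilBERT on
# an older GPU (e.g. a g4dn/g5 instance). The checkpoint path and labels are
# hypothetical; the real model would come out of a SageMaker fine-tuning job.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="./distilbert-finetuned",  # hypothetical local checkpoint
    device=0,                        # first CUDA device on the GPU instance
    batch_size=64,                   # batch for throughput rather than latency
)

requests = [
    "reset the thermostat schedule for building 7",
    "what part number does this condenser fan use?",
]
print(classifier(requests))  # e.g. [{'label': 'SCHEDULING', 'score': 0.98}, ...]
```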
Oh, yes.
I've been doing some embedded work on Raspberry Pi,
doing inference running two models simultaneously
for a project that will be released in the fullness of time. And yeah, there are some challenges in resource-constrained
environments. GPUs help. The other trick, I guess, maybe I'll give away all my tricks here: Local Zones. So you can get very decent prices on GPUs in Local Zones. So if end-to-end user
latency is important to you, check out the prices there.
But the initial understanding
of Bedrock back when it first came out
was that it was a wrapper
around other models that you had.
And you say now it is a great API.
The problem I recall
was that every model required
a payload specified in a subtly
or grossly different way.
So it almost felt like,
what is this thing?
Are they trying to just spackle
over everything into a trench coat? What is the deal here?
It's a wrapper. So you still have to customize the payload a little bit. But
the good thing about the payloads is that they're basically all trending towards this message API.
So Anthropic Claude 2.1 and 2 had this API where you would say Human, Assistant, and you would just
kind of go back and forth in turns,
and that was the entire prompt. There's this new one, which is the messages payload.
And that structure is much more amenable to moving between different models. Now, that brings us to
the topic of prompt engineering. And prompt engineering is still quite different depending
on which model you use. So you can't take a prompt
that you've designed for Llama 3 and move it into Claude 3, for instance. They're not compatible.
That said, there's a ton of tooling that's out there these days. And there's LangChain,
there's Griptape. And I think all of those are good things to sort of learn. But if anyone
is listening to this and wanting to get into it,
the best way to learn is
to actually just remove
all of the SDKs that make this a lot
easier and just write
the prompts raw yourself
until you understand it.
We did that for our re:Invent session navigator that was powered by Claude 2.1.
It's just like no SDKs except for Boto3.
And then I think I used whatever SDK to talk to Postgres.
Those were the only SDKs we used.
And you can see what all of these tools are doing under the hood.
Tools like LlamaIndex and LangChain.
And once you understand it, you realize
a lot of this is just syntactic sugar.
It doesn't add a ton of overhead.
It's just niceties on top of things.
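For anyone who wants to try the "no SDKs except Boto3" exercise he's describing, here is a minimal sketch of a raw messages-style payload sent straight to Bedrock. The model ID and field names reflect the Claude 3 format as of this writing, so double-check them against current documentation.

```python
# Hedged sketch: a raw Anthropic messages-style payload sent directly through
# Boto3, with no prompt framework in between. Field names follow the Claude 3
# format on Bedrock as I understand it.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "system": "You are a concise assistant for conference attendees.",  # system prompt, kept separate from turns
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Which sessions cover Bedrock?"}]},
    ],
}

resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps(body),
)
print(json.loads(resp["body"].read())["content"][0]["text"])
```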
I've been using GPT-4 for a while now as part of my newsletter production workflow.
Now, instead of seeing an empty field, which has the word empty in it historically, because
back when I built my newsletter ingest pipeline, you could not have empty fields in DynamoDB.
So I just, easy enough, cool, I'll put
the word empty in place and then just have a linter that validates that the word empty, by itself, does not appear in anything, which means I haven't forgotten anything, and we're good to go.
Now it replaces it with auto-generated text and sets a flag so that I still have something for
a linter to fire off of. And I very rarely will
use anything it says directly, but it gets me thinking. It's better than staring at an empty
screen, but it took me two months to get that prompt dialed in. I am curious as to how well
it would work on other models, but that has been terrific just from an unblocking me perspective.
Probably it's time for me to look at it a bit more. But in this particular case,
I viewed, at least until this conversation, GPT-4 as being best of breed. And I don't care
about cost because it costs less than $7 a month for this component. And as far as any latency
issues go, well, I'd like it done by Thursday night every week, which means that, okay,
for everything except the last day of it, I could theoretically, if cost did become an issue, wind up using the batch API that OpenAI has and pay half price and
get the response within 24 hours. And that is, so my use case is strictly best case. It took me
months to get the prompt right so it would come out with the right tone of voice rather than what
it thought that I did. So you have a couple options. If cost really doesn't matter, you should really compare GPT-4 to Claude 3 Opus. And if you're looking to use it
without any other sort of AWS tooling, you can just access the Anthropic SDK directly.
And that is my big question as far as, is there differentiated value in using Bedrock for
something like this, as opposed to
just paying Anthropic directly? There is. Because I did debate strongly, do I go directly with OpenAI or do I go with Azure's API? The pricing is identical. And despite all the grief I give Azure,
rightly so, for its lack of attention to security, this is all stuff that's designed to see the
public anyway. If they wind up training on its own responses in AWS blog posts, I assume they
already are. So it doesn't really matter to me from that perspective. So the pricing is identical,
the per token pricing and everything. The advantage of running within Bedrock is you can get all those
metrics that I was talking about. You can log everything in CloudWatch. You can get all the
traditional AWS infrastructure you're used to. And that's one benefit. The other benefit is,
and this is less useful for your use case, by the way.
So this is more useful for industry use cases.
You get stable, predictable performance.
Have you ever hit the OpenAI API
and it's been like,
LOL, you're out of tokens, 429 back off.
Like, I'm not going to give you anything more.
No, for a lot of stuff I use, it's ad hoc.
I use ChatGipity,
which just sort of sits there and spins and acts like it's doing something before giving a weird
error. And sometimes it's in-flight Wi-Fi not behaving, but other times it's great.
There's instability in a number of the consumer-facing APIs that you can get around
in the AWS APIs. So if you want to go and purchase provisioned throughput, which, that was another huge gripe I had with Bedrock, is that the provisioned throughput was antithetical to the way that AWS did usage-based pricing. Where you used to have to commit to one month or six months, you can commit to an hour now, which is much more reasonable from a, I want to build something and I know it's going to run in batch. So I'm going to purchase two model units of provisioned throughput for one hour.
That works super well.
And we've had customers do that for batch workloads.
And it's very dependable.
You get precise performance.
It's dialed in.
You know exactly what you need.
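As a rough sketch of that batch pattern, the control-plane call below buys a couple of model units without a term commitment, points the runtime at the resulting ARN, and tears it down afterward. The API and parameter names are from memory, and which base models support the no-commitment option varies, so verify before relying on this.

```python
# Hedged sketch: hourly (no-commitment) provisioned throughput for a batch run.
# API and parameter names are my recollection of the Bedrock control plane;
# confirm against current boto3 docs, and note that model support varies.
import boto3

bedrock = boto3.client("bedrock")

resp = bedrock.create_provisioned_model_throughput(
    provisionedModelName="nightly-batch",
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # hypothetical model choice
    modelUnits=2,  # two model units; omitting commitmentDuration keeps it hourly
)
provisioned_arn = resp["provisionedModelArn"]

# ...run the batch against bedrock-runtime, passing modelId=provisioned_arn...

bedrock.delete_provisioned_model_throughput(provisionedModelId=provisioned_arn)
```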
Whereas, you know, if you're using the on-demand APIs, you can get 429 backoffs all the time.
And originally when Bedrock first came out, the SDKs, this is funny, the SDKs, particularly the Python SDK did not correctly parse the 429
backoff because the throttled exception came back with a lowercase t and it was trying to do a case-sensitive match on the throttled exception, but that's fixed now. So it'll properly do the 429 backoffs and everything.
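On the client side, one commonly used way to soften those on-demand 429s is botocore's built-in retry configuration; this is a generic SDK feature rather than anything Bedrock-specific.

```python
# Hedged sketch: let botocore's adaptive retry mode handle throttling backoff
# instead of hand-rolling it around every Bedrock call.
import boto3
from botocore.config import Config

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)
```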
But those are the advantages really,
is that you can get the predictable performance that you're looking for.
It's much more suitable for kind of production workloads.
Something I saw recently reminded me
of a polished version of Bedrock.
And I wish I could remember what it was.
It was some website.
I forget if it was a locally hosted thing or some service
that you would bring your own API keys for a variety of different AI services. And then it
was effectively a drop-in replacement for ChatGipity. And you could swap between models,
race them against each other, and also just have a great user experience. The initial sell was
instead of $20, pay for your actual API usage, which, except for the very chattiest of you, is probably not twenty dollars. And okay, great, I'm not that cheap. I haven't... I didn't go down that path. But if I can swap out models and do side by side, that starts to change. You can actually do that in the Bedrock console now. So if you go to the Bedrock console, you can compare and contrast models, you can see what the cost is. We built something like that before it existed in the Bedrock console, so we call it
the Bedrock Battleground, and you can
pull in any of these models. I think the one you're
thinking of is probably the Vercel AI
SDK, which is also very,
very nice. We actually have submitted
some code and pull requests to make
Bedrock work better in that
SDK, and
adding in models like Mistral and
streaming support. But yeah, I mean, I'm
totally fine with that approach. But if you need to do it within AWS, it's right in the console now.
The reason I would avoid using Bedrock directly for something like this, perfect example of AWS's
long-tail challenges catching up with them. Very often, I will use the iOS app for ChatGipity and
can pick up where I left off or look at historical things. I'm also not sure if any of these systems,
please correct me at all if I'm wrong,
but the magic part to me about ChatGipity
is I can be asking it about anything that I want
and then it's, oh yeah, generate an image
of whatever thing I want.
Like one of the recent ones I did for a conference talk
was a picture of a data center aisle
with a giraffe standing in it.
And it was great,
because there's never going to be
stock photography of a zookeeper doing that.
But the fact that it's multimodal,
I don't have to wind up
constructing a separate prompt for DALL-E.
I mean, the magic thing that it does for that
is it constructs the prompt
on the backend for you
and you get much better results
as a direct result.
I don't have to think about
what model I wind up telling it to use.
As an added benefit,
because it has persistent settings
that become part of its system prompt
that it should know about you,
what I like is that I can say,
oh yeah, unless otherwise specified,
all images should be in 16 by 9 formats,
or aspect ratio,
because then it just becomes a slide
if it works out well.
I think you're still thinking about it
from the consumer perspective,
which is valid. You know, GPT, ChatGPT is a very polished product and it's simple.
You know, it's a simple interface with an incredible amount of complexity underneath.
And I think what Bedrock is providing, among other things, and it does have image generation models, by the way, Titan Image Generator and Stability,
is the same thing that AWS has always been particularly good at,
building blocks.
So it's letting people build capabilities like ChatGPT
into their own products.
And even going beyond that,
there's a ton of use cases beyond the chat interface that I think we're going to
see Bedrock applied for. One of the things that we did for a customer is we built a resumable
kind of data science environment. So think about Pandas data frames that exist within Jupyter notebooks. Now imagine you have a ChatGPT or something that can go and talk to this data frame
and it can send plots.
And those are all kept on Fargate containers there.
You know, we save the notebook,
we persist it to S3.
And then if a user wants to bring that session up again,
we restore it.
We bring that session back to life
and we go and we resume, you know,
the Python execution and we say,
hey, this plot that you made,
and Claude 3, by the way, supports multimodal,
so you can put images in and you can say,
hey, look at this plot and then change the x-axis
so that it gets rid of these outliers that I don't care about.
And it'll redo that.
And it'll actually write the Matplotlib code or the Plotly code in this case, but whatever. And it'll go and redo it. And that is something that I think is genuinely valuable and not just a typical chat use case.
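A rough sketch of the multimodal round trip he's describing, sending a rendered plot back to Claude 3 and asking for revised plotting code, might look like this. The image content-block format follows the Anthropic messages schema as exposed through Bedrock, and the file name is a placeholder.

```python
# Hedged sketch: send a rendered plot image to Claude 3 via Bedrock and ask it
# to rewrite the plotting code. Content-block format per the Anthropic messages
# schema; "plot.png" is a placeholder for the plot saved from the notebook.
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

with open("plot.png", "rb") as f:
    png_b64 = base64.b64encode(f.read()).decode()

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": png_b64}},
            {"type": "text", "text": "Rewrite the Plotly code so the x-axis excludes the outliers above the 99th percentile."},
        ],
    }],
}

resp = bedrock.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=json.dumps(body))
print(json.loads(resp["body"].read())["content"][0]["text"])
```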
Tired of big black boxes when it comes to cloud security? I mean, I used to work at a big black rock and decided I was tired of having a traditional job, so now I do this instead.
But with Prowler, you're not just using a tool, you're joining a movement. A movement that stands for open, flexible, and transparent cloud security across AWS, Azure, GCP, and Kubernetes.
Prowler is your go-to for everything from compliance frameworks like CIS and NIST to real-time incident response and hardening.
It's security that scales with your needs.
So if you're tired of opaque, complicated security solutions,
it's time to try Prowler. No gatekeepers, just open security. Dive deeper at prowler.com.
I want to push back on one of the first things you said in that response,
specifically that separating out the consumer from the professional use case.
The way that I've been able to dial these things in, and it has worked super well for me, is I start with treating ChatGPT as a spectacular user experience for prototyping what I will ultimately build, if I need it to be repeatable.
Like I don't need infinite varieties of giraffes and data center photographs because neither giraffes nor cloud repatriation are real, which was sort of the point of the slide.
But that was a one-off and it's great.
For the one-off approach though,
I did iterate on a lot of that using ChatGPT first
because is this even possible?
Because once I start getting consistent results
in the way that I want them with a prompt,
then I can start deconstructing
how to do it programmatically and systematize it.
But for the initial exploration,
the fact that there's a polished interface for it
is streets ahead,
and that's something AWS has never seemed
to quite wrap their head around.
You still can't use the bedrock console on an iPhone
because the entire AWS console does not work on a phone.
The next piece beyond that, then,
is if it's that easy and straightforward
to build and play around with something
to see if it's possible,
then what can change down the road?
The closest they come with this so far has been PartyRock, which is too good for this
world. And I'm still surprised it came out of AWS because of how straightforward and friendly it is.
So I think we are talking about two different use cases, right? I'm talking about the enterprise
or even startup, the business application of generative AI, in which case, bedrock is
absolutely the way that I would go right now. And you're talking about the individual in consumer
usage of generative AI, which I agree. True. None of the stuff I've done yet has been with an eye
towards scaling. You're right. This is I'm not building a business around any of this. It is
in service of existing businesses. Listen, AWS builds backends really well. When it comes to interfaces and frontends,
I mean, there's a lot to be done.
I've actually been pretty pleased with some
of the changes that have happened in the console.
I know people don't like it when the console changes,
but there used to be these little bugs
like not having endings on the
table borders in the DynamoDB console.
That infuriated me.
I don't know why.
It was such a simple thing to fix.
And I worked at AWS at the time.
And it took me two years to get a commit in
to fix that console.
That was the entire reason you took the job.
Once it was done, it was time for you to leave
because you'd done what you set out to do.
This is actually a fun piece of history.
You know, the AWS console started out in Google Web Toolkit. So it was GWT. Does anyone remember that? I don't think so. Like, you wrote your entire front end in Java and it would be translated into like AJAX and HTML on the back end.
That's how
all of the original consoles
were written.
2009, 2010
was my first exposure
to AWS as a whole.
Was that replaced by then?
No.
I think many of the back ends
were still,
or sorry,
many of the consoles were still GWT back then.
Come to find out, surprise, surprise,
a few still are today.
Kidding, I hope, I hope that's a joke.
I don't think any are.
I mean, I don't use SimpleDB, so it could still be,
but I think almost all the consoles moved to Angular
after that because there was a pretty easy upgrade path
between GWT and Angular.
And then a lot of people started experimenting
with React. And then there was this kind of really polished internal UI toolkit that let you
basically pick the right toolkit for your service, the right front end framework for your service.
And I think they've continued to iterate on that. And I do think that's the right approach. I wish there was a little more consistency in the consoles,
and I wish there was a little bit more of an eye
towards power user experience.
So a lot of times console teams that are new,
like new services that launch,
they don't think about what it means.
Oh, I have 2,000 SAML users here,
and that's not going to auto-populate
when I do a console-side search of users.
It needs to be a backend search. All these little things. But I think that's because
AWS works in this semi-siloed fashion where the service team does a lot of the work on their own
console. The only truly centralized team that I'm aware of at AWS is their security team.
And then everything else is sort of, okay,
I've got to get one front-end developer, I've got to get one
product manager, I need four back-end developers, and I'm
going to need one streaming person.
So I think that's just an artifact of how they work.
Yeah, that is their superpower
and also the thing that they struggle with the most.
Because individual teams going in different directions
will get you to solve an awful lot
of problems, but also means that there are certain
entire classes of problems that you won't that there are certain entire classes of problem
that you won't be able to address in a meaningful way.
User experience is very often one of those.
Billing, I would argue, might be another,
but that's a personal pet peeve.
On the topic of billing, you've also been, the polite way is "talking about," the impolite way is "banging on about," unit economics when it comes to generative AI, a lot.
As you might imagine,
this is of intense interest for me.
Yes.
What's your take?
So everyone wants the highest quality tokens
as quickly as possible,
as cheaply as possible.
Like if you are an enterprise user
or a large scale user of generative AI,
the unit economics of this go beyond tokens.
And I think if people just keep designing
to lower the per token cost,
there are models, there are architectures
that may not require tokenization
that we might want to use one day.
And this hyper focus on per token cost
is kind of like missing the forest for the trees
because that is only one part of the scale
and the cost that you have to deal with.
You have to think about embedding models.
So that's actually one place
where I've been pleasantly surprised and impressed
is AWS released the Titan V2 embeddings,
which support normalization.
And they're fairly new,
so we don't have hard, hard numbers on these yet. But we've had really good initial experiments. And I have all
the data on it if you want. A dramatic reading of an Excel spreadsheet. Those are the riveting
podcast episodes. But if you want me to send you a graph afterwards, I can show you where we saw
the good capabilities. And I can show you the trade-off between the 256 vector size, which
brings us back to the unit economics, right? Like the original Titan embeddings,
I think they had like a 4k vector output. Now, if you put that into PG vector, which is the
vector extension within Postgres, and you try to query it, well, guess what? You just blew up your
RAM. Like keeping that whole index in memory is very expensive. And these cosine similarity
searches are very expensive. Now, back then, PG Vector only supported what was called IVFFlat, which is just an inverted file index.
Next, what they did is they, Supabase, AWS, and this one individual open source contributor,
all worked together to get what's called HNSW, or Hierarchical Navigable Small World indexes, into Postgres.
And all of a sudden,
Postgres is beating Pinecone and everyone else on price performance and vectors.
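To make the embeddings-plus-pgvector side of that concrete, here is a minimal sketch: 256-dimension, normalized Titan V2 embeddings stored in Postgres behind an HNSW index. The model ID, request fields, and SQL reflect my understanding at the time of writing; the connection string, table, and query text are placeholders.

```python
# Hedged sketch: 256-dim, normalized Titan V2 embeddings + a pgvector HNSW index.
# Model ID and request fields are my understanding of the Titan V2 API; the
# Postgres connection string, table, and query text are placeholders.
import json
import boto3
import psycopg2

bedrock = boto3.client("bedrock-runtime")

def embed(text):
    body = {"inputText": text, "dimensions": 256, "normalize": True}
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=json.dumps(body))
    return json.loads(resp["body"].read())["embedding"]

def to_pgvector(vec):
    return "[" + ",".join(str(x) for x in vec) + "]"  # pgvector's text input format

conn = psycopg2.connect("dbname=app")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body text, embedding vector(256))")
    cur.execute("CREATE INDEX IF NOT EXISTS docs_hnsw ON docs USING hnsw (embedding vector_cosine_ops)")
    cur.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",  # cosine-distance nearest neighbors
        (to_pgvector(embed("condenser fault codes")),),
    )
    print(cur.fetchall())
```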
Now, the downside is that Postgres doesn't scale to like 100 million vectors. Because as soon as
you get into sharding and things like vectors don't shard well, you have to pick a different
shard key, all this other good stuff. That is a whole other side of the unit economics. It's like,
what is your vector storage medium or your document storage medium? And what is your cost of retrieval? And then what
is your cost of context? Because the Claude 3 models, for example, they have 200k of context
and they have darn good recall within that entire context. But that's a lot of tokens that you have
to spend in order to put all that context in. So part of the unit economics of this are,
hey, how good is my retrieval at giving me the thing that I'm looking for so that I can enrich
the context of the inference that I'm trying to make? And measuring that is three levels of
abstraction away from tokens. You have to have a human in the loop say, this is what we thought
was a quality answer. And the context was quality too.
And it was able to correctly infer what I needed it to infer.
I think people have lost sight of just how horrifyingly expensive it is to get these
models up and running.
There was a James Hamilton talk at the start of the year at CIDR where he mentioned that
an internal Amazon LLM training run had recently cost $65 million in real cost.
And that, like, honestly,
the biggest surprise was that Amazon spent anything like that on anything without having
a massive fight over frugality. So that just shows how hard they're hustling around these things.
But it's, I think, why we're seeing every company, even when the product isn't fully baked yet,
they're rushing to monetize up front, which I appreciate. I don't like everything subsidized by VCs
until suddenly one day there's a horrifying discovery.
I love GitHub Copilot.
It normally would cost 20 bucks a month
and I'd pay it in a heartbeat,
except for the fact as an open source maintainer,
I get it for free.
It's worth every penny I don't pay
just because of how effective it is
at weird random languages with which I'm not familiar
or things like it.
That is just a, it is a game changer for me in a number of different ways. It's great stuff. And
I like the fact that we're monetizing, but they have to because of how expensive this stuff is.
The other thing to think about there is there are power users when you price something,
right? At like $20 per user per month or $19 per user per month. There are power users who are definitely going to go above what that costs.
So that's, I think, part of the economic balancing act there: how do I structure
this offering in my product, whether it's a SaaS product, whether it's a B2B product,
or even a consumer-facing product, such that I am going to provide more value and impact
than it will cost me to deliver
and I will make this margin.
And those are the most interesting conversations
that I get to have with customers is moving...
First, I love the implementation.
I love getting hands-on keyboard
and building cool things for people.
But then we move one level up from that
and we're like,
hey, this is a technical deliverable.
But did it solve our stated business goal?
Did we actually accomplish the thing that we set out to do?
And building in the mechanisms for that and making sure we can measure the margin and
know that it's genuinely impacting things and moving the needle, that takes time.
That's more than a quarter over quarter view because it takes time for people to learn
about the product and to adapt it.
And people have to be willing to make some bets in this space. And that's scary for some enterprises that
are not used to making bets. But there was one other thing that I wanted to mention there about
the cost of training, which is the transformer architecture, the generative pre-trained
transformer architecture has quadratic, essentially, or exponential even, training costs. So as you
grow the size of the transformer network, as you increase the number of parameters,
as you change the depth of these encoders and decoders, you are increasing the cost to train.
Then you have the reward modeling, which is the human in the loop part. You have all of this other
stuff that you have to do, which again, increases the cost of training. There are alternative architectures out there. And I think the future
is not necessarily purely transformer based. I think the future is going to be some combination
of like state space models and transformers. And, you know, we're going to go back to the RNNs
that we used to use. And I, you know, what kind of ticks me off is,
I don't know if you remember back in 2017,
Sunil and I did this video
walking through the transformer architecture on SageMaker.
And even we didn't get it back then
that it was going to be this, you know,
massive thing unlocking emergent behavior.
And I think it was only people like Ilya and Andrej Karpathy who realized
actually, if we just keep
making this thing bigger, we get emergent behavior.
And it is suddenly not just a stochastic parrot. The act of predicting the next token has suddenly given us this emergent behavior and this access to this massive latent space of knowledge that has been encoded in the model, and it can, in real time, in exchange for compute, be turned into a valuable output.
Very far from AGI still in my opinion,
but I think you could brute force it
if you use the transformer architecture
and you just threw trillions of parameters at it
and trillions and trillions of tokens,
you could probably brute force AGI.
I think it is much more likely we will have an architectural shift away from transformers
or transformers will become one part and we will use something akin to SSMs or another
architecture alongside that. And you've already seen promising results there from different models.
These are early days. And I also suspect on some
level, the existing pattern has started to hit a point of diminishing returns. When you look at the
cost to train these models from generation to generation, at some point, it's like, okay,
when Sam Altman was traveling around trying to raise $7 trillion, it's okay. I appreciate that
that is the logical next step on this. I am predicting
some challenges with raising that kind of scratch. So there have to be different approaches to it. I think inference has got to be both on device and less expensive
for sustainability reasons, for economic reasons. And for example, something I think would be a
terrific evolution of this would be a personalized assistant that sits there and watches what someone
does throughout the day, like the conversations they have with their families, the things they
look for on the internet as they do their banking, as they do their job, as they have briefings or
client meetings, and so on and so forth. And there's zero chance in the universe I'm going
to trust that level of always-on recording data to anything that is not on-device, under my control, that I can hit with a hammer if I have to.
So to do that, you need an awful lot of advancements. That feels far future, but
future's never here until suddenly it is. I don't think it's that far away. We've already
gotten Llama 3 running on an iPhone 15, the 8 billion parameter model. And I think we got 24 tokens per second or something.
I mean, and admittedly,
that was a quantized version of the model,
but I mean, that's what everyone does
for this sort of hardware.
I think we're not as far from that future
as you might think.
And I love the idea of agents
and that's another good feature of Bedrock.
And I know we've gotten far
from the topic of Bedrock at this point,
but I know we're coming to an end here, and really, my goal with this call is to convince you to give Bedrock a shot. Like, go in, try it again, explore it, and update your opinion. Because I agree with you, when it first was announced, there was a lot of hullabaloo about something that wasn't really there yet. But we have these real things
that we are building on it now.
And it is really exciting.
Like, I just love seeing these products come to life.
There's one customer we have, Brainbox.
They built this HVAC management system
that is powered by generative AI.
And it can do a lot of the data science stuff that I was
talking about before where it's like, oh, resume a Jupyter notebook in a Fargate container and
show me this plot. Also, look up the manual for this HVAC thing and tell me what parts I'm going
to need before I drive out there. And it's all a natural language interface. And it's really helping
the built environment be better at decarbonization.
And those are the sorts of impacts that I'm excited about making. And I think building that on top of OpenAI, it would have been possible. We could have done it on top of OpenAI,
but getting it to integrate with Fargate and all of these other services would have been
more challenging. It would have introduced more vendors.
It would have been this overall
very weird kind of complex architecture
where we're balancing different models
against each other.
With Bedrock, it's one API call.
We're still able to use multiple models,
but it's all within this Boto3 ecosystem
or this TypeScript ecosystem.
And we're able to kind immediately, when a new model is
added to Bedrock, we started out in Cloud 2.1, or maybe it was Cloud 2, I don't remember.
Immediately, we were able to switch to Cloud 3 Sonnet when it came out and get better results.
So that's the other advantage of Bedrock is because this stuff moves so quickly,
I can go as soon as it's available in Bedrock without having to change or introduce
new SDKs or anything. Start using that model. I got way off whatever point I was originally
trying to make. I got excited about it. The point you started off with is that you were
urging me to give Bedrock another try. The entire AWS apparatus, sales and marketing,
has not convinced me to do that. But I strongly suspect you just may have done that. For someone
who's
not a salesperson, you are disturbingly effective at selling ideas. Listen, there are other SDKs out
there. There are other offerings out there and many of them are good, but Bedrock is one that
I'm bullish on. I think if they continue to move at this pace and they continue to improve,
it's going to be a very powerful force to be reckoned with.
There's a lot more they need to do though.
I email the product manager all the time.
And I'm very sorry.
Sorry if you're listening to this.
I'm sorry for blowing up your inbox.
There are all these little things that I want fixed in it.
But the fact is they fix them and they fix them within days.
So getting that sort of responsiveness and excitement from a team is just really powerful.
And you don't always get that with AWS products.
Sometimes teams are disengaged, unfortunately.
Sometimes teams are surprised to discover that's one of their products, but that's a
separate problem.
Okay, fair enough.
I will give it a try.
And then we will talk again about this on this show about what I have learned and how
it has come across. I have no
doubt that you're right. I'll be surprised. You are many things, but not
a liar. What do you think about CDK these days? I haven't done a lot with it lately just because I
have honestly been going down a Terraform well. Honestly, the big reason behind that: sometimes I want to build infrastructure where I'm not the sole person on the planet
capable of understanding how the hell it works. And my code is not exactly indexed for reliability and readability.
So there's a question around finding people who are conversant with a thing and Terraform's
everywhere. I'm a little worried about the IBM acquisition, to be honest. I don't know how all
of that is going to play out. Suddenly someone who's not a direct HashiCorp competitor is going
to care about OpenTofu. So that has the potential to be interesting.
But I don't know if you remember, you used to not be the biggest fan of CDK.
And then you and I had a Twitter DM conversation.
And then I think you started liking it.
Oh, then I became a cultist and gave a talk about it at an event dressed in a cultist robe.
Yes.
Costuming is critically important.
I'm hoping I can convince you on the bedrock side too.
I don't think it's cult worthy yet,
but it could get there.
We'll find out.
Thank you so much once again for your time.
I appreciate your willingness to explain complex concepts to me using simple words.
Something I should probably ask,
if people want to learn more,
where's the best place for them to find you?
Oh, you should go to caylent.com.
We post a bunch of blog posts.
We've got all kinds of stuff
about LLM ops and performance. And we post all our results. Back when Bedrock went GA, I wrote
a whole post on everything you need to know about Bedrock. Some of that stuff's out of date now,
but we keep a lot of things up to date too. And if you need any help, if you find all of this
daunting, all of this knowledge, all of this kind of content around generative AI
really difficult to stay apace of, feel free to just set up a meeting with me through Twitter
or with Caylent. And, you know, we do this every day. Like, this is our jam. We want to build
more cool stuff. And we will, of course, put a link to that in the show notes for you. Thanks
again for taking the time. I appreciate it. Always good to chat with you, buddy. Randall Hunt, VP of Technology
and accidental Bedrock-convincing go-to-market marketer.
I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud.
If you enjoyed this podcast,
please leave a five-star review
on your podcast platform of choice.
Whereas if you hated this podcast,
please leave a five-star review
on your podcast platform of choice,
along with an insulting comment, which is going to be hard because it's in the AWS console and it won't work on a phone.