Software Huddle - Lessons from Building AI Agents with Rafal Wilinski
Episode Date: August 12, 2025
Today we're talking with one of our favorite engineers, Rafal Wilinski. Rafal has been on the cutting edge of AI development in the last few years as he has led AI teams at Zapier and Vendr. Rafal walks us through the hard-won lessons about actually integrating AI tools into the applications you're building. One of the hardest things in integrating these AI tools is how to ensure you're getting better and not regressing as you improve your prompts and upgrade your models. He shows how using evals is one part of the story along with deeply investigating customer signals to see how they are or aren't succeeding with AI. Along the way, we also talk about RAG, his favorite models, his AI development toolset, and why Poland has been killing it lately. Check it out and be sure to follow Rafal if you want to learn more on building with AI.
Transcript
So Zapier agents is, it's like a platform for non-technical people to build their own agents.
Do you see stuff like that where people are just using it as like a generalized ChatGPT, basically?
Yeah, yeah.
The problem we are having right now is that people are using Zapier Agents as a free ChatGPT alternative.
But some people just like paste code and errors and they debug their code using Zapier agents.
or write articles or proofread them or edit stuff.
On evals, are you using Braintrust?
Is that right?
That's correct, yeah.
Okay.
Is that like the clear runaway winner out there?
That's like the main one I see out there.
Are there other options or is that what most people are using?
Are you all TypeScript or are you Python?
Like, what language do you run stuff in?
Fortunately, we are all TypeScript, both on the back end and the front end and the whole, you know,
magical AI loop. So, yeah, I'm really happy about that. You don't have to mess with that
Python ecosystem. Yeah. Oh, my God. Don't get me started. What's up everybody? This is Alex.
Super excited to have Rafal Wilinski on the show. Rafal, someone I've known for a long time,
a really good engineer, really good product sense and technical sense. And he's been doing a lot of
AI tech lead type work, both at Zapier and at Vendr before that. So I just wanted to get in touch
with him about just like, what are the best practices around building with AI?
Like actually incorporating AI into your projects, not just using Cursor and things
like that, but actually, like, you know, working with these APIs yourself.
So he walks me through like tools, best practices, all sorts of questions that I had there,
which was really fun.
It was really good to talk to him about.
As always, if you have any questions, comments, guests for the show, feel free to reach out.
And with that, let's get to the show.
Rafal, welcome to the show.
Thanks for having me.
Yeah, absolutely. Well, you know, you and I have been like sort of in overlapping communities for, I don't know, like a decade now and always like respected you. I think you're a really great engineer. And I'm super excited to have you on because now you're doing like some really cool AI stuff, especially like the last two jobs, the last three or four years. You've been like legit like building with AI, not just using it to code like a lot of people are like actually building it into products and integrating and doing some deep stuff there. So I want to talk about all that stuff today.
but I guess maybe for people that don't know you,
can you just introduce yourself,
give a little background on yourself?
Yeah, sure.
I'm Rafal.
I used to work a lot on AWS infrastructure.
Just like Alex,
I was super obsessed about DynamoDB for a period of time.
I created Dynobase, which is a UI for DynamoDB,
because for some time in the past,
we had this great technology,
but there was no great bridge to use that technology.
so I decided to create a UI.
And yeah, that opened a lot of doors for me.
We worked briefly together at Stedi.
Then I continued working on other startups using AWS.
And then ChatGPT happened.
And I was like, okay, this is a really good moment to pivot to something else.
So I decided to be obsessed about something else.
And it's generating text.
Yeah, so for the past two roles,
I was leading AI initiatives at Vendr.
Now I'm leading AI agents at Zapier.
And yeah, this is a really rapidly evolving field,
and there is a ton of exciting stuff happening.
Yeah, that's funny how you're like, it's generating text,
which makes it sound like so, I don't know, it's kind of boring,
but it's amazing, like, how powerful it is and just like all the amazing stuff.
Like, did you know, like, how quickly did you know, like,
oh, I need to get into this new field after ChatGPT came out.
Yeah, yeah.
So I feel like the pivotal moment for me was when I stumbled upon this library
called LangChain, and it had like less than 1,000 stars back in that moment.
I mean, I issued a bunch of API calls, and I asked a bunch of silly questions just like everyone does.
And, okay, it generated some text.
It told the story.
and maybe hallucinated a few answers.
But when I realized that you can actually make it generate the JSON,
and the JSON can be an argument to a function that you can run,
so you can plug it to an environment,
make it do a bunch of API calls, get some data back,
and maybe synthesize a report,
that was like a mind-blowing thing to me.
I remember that I hooked up a bunch of LangChain primitives,
asked the question, like, hey, which of my S3 buckets are insecure?
And it did like this, it essentially did a bunch of AWS S3 API calls, and it figured,
okay, this bucket is public, this bucket is public, this bucket is not.
It, like, put all that data back into the context and it synthesized the report for me.
It wasn't, like, ideal, but it was, it was mind-blowing, like, oh, my God, this
could be so powerful.
So, yeah, I think the pivotal moment for me was tool calling,
and it was back in the day when tool calling wasn't possible.
You were just, like, forcing the LLM to, hey, reply with JSON.
I'm really asking you, reply with JSON, multiple exclamation marks.
And, yeah, that was really sick.
Yep, yep.
I remember that, man, what's his name, Riley something, like,
he has a good Twitter and all that stuff,
where he's showing all these different tricks
to reply with JSON.
And finally, like, the only way he gets it to happen
is, like, when he tells Gemini,
like, someone's going to die
if you reply with anything other than JSON.
And it finally does.
So it's nice to have, you know, like structured outputs
and, you know, the niceties we have now for sure.
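To make that concrete, here is a minimal sketch of what that same "give me JSON" request looks like today with structured outputs, assuming the Vercel AI SDK and Zod as the stack; the model ID and schema are illustrative, not the setup Rafal used back then.

```typescript
// Hedged sketch: ask for validated, typed JSON instead of begging for it in the prompt.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const { object } = await generateObject({
  model: openai("gpt-4.1"), // illustrative model choice
  schema: z.object({
    buckets: z.array(
      z.object({
        name: z.string(),
        isPublic: z.boolean(),
        reason: z.string(),
      })
    ),
  }),
  prompt: "Given these S3 bucket policies, which buckets are public and why? ...",
});

console.log(object.buckets); // typed and schema-validated, no exclamation marks required
```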
Man, I forgot that you had that, like,
was it chat with Cloud?
Or what was that, like, initial AWS thing that you built?
Yeah.
Yeah, yeah, yeah.
It was Chat with Cloud.
I feel like that was the code name, or Chat with AWS was the code name.
I don't remember.
But essentially, yeah.
My initial, my usual path is like, okay, I figure out there's something interesting.
Let's make it a product because that's the ultimate test, ultimate validation of your idea.
Can I sell it?
Because the value prop was clearly there, right?
I can ask multiple questions about my AWS environment, and AWS
is notorious for being super complicated.
The dashboard is super complicated.
In fact, I've made so much money on making that simpler, right?
So I can do it again, probably.
But it turns out, so I tried building that.
And it looks like it was super hard to convince other people
who are really security obsessed about giving access to their cloud
to the LLM because it can go crazy,
it can delete your resources,
and of course you can have IAM policies and whatnot,
but there's always some kind of element of risk
or maybe training models on the data
that is going to be fetched from your cloud environment.
So, yeah, effectively, I realized that it's too hard
to sell it to people that care about security
so much. And the value proposition, like the moat, was so thin because everyone else could replicate
it. So yeah. Yeah. Yeah. AWS should have bought it and used it to, like, bootstrap Q or
something like that or whatever they had in the console. But anyway, okay, let's talk about
what you're up to these days because I want to talk about just like, you know, all the stuff
you've learned about building with AI over the last couple of years. So maybe just give us a background.
You're at Zapier now doing like the AI tech lead, working on agents.
What does that mean specifically?
What's the product you all are offering?
Okay.
So Zapier Agents is, it's like a platform for non-technical people to build their own agents.
And essentially what we are trying to achieve is that you're going to tell us what would
you like to achieve in like very non-technical verbiage, like you completely don't know how
you're going to build it and like what kind of tools you need.
you're just like, hey, I want to have an email assistant
that marks my messages as important
and maybe create some kind of summary every 9 a.m.
or something like that.
And we take the prompt and we are thinking,
okay, this user wants to probably do this.
In order to achieve that, we need to pull this API
and pull this trigger and we need this tool
and we are going to refine the prompt
and we are going to run it on a schedule.
So yeah, we are taking really, really vague prompts
from users. We are asking follow-up questions, and we are trying to figure out the best
workflow or automation or agent for you that can simplify the boring parts of your job.
So you can have more time for the creative things. Yeah, I think it's such a cliche.
Yep. Yep, for sure. Yep. And so we've heard a lot about, you know,
2025 being the year of agents, and you all are sort of like building that out. I guess, like,
How do you distinguish this agentic workflow from like a more discrete, like, targeted workflow?
The stuff that's happened, you know, in the last year or two compared to what's happening now
with like this agentic stuff.
Yeah.
I feel like what differs us from the majority of the other agent builders is that if you go to
n8n or Dify or any other tool for creating agents, it's mostly, or even the original Zapier.
It's oriented on a canvas.
You have nodes and arrows, and you can link them.
You can probably inject an LLM node or agent node
and connect it with some memory tools.
Our approach is totally different.
We are approaching it from the purely text-based point of view
because we believe that it's just so much easier.
We believe that for people that are non-technical,
the text medium is probably easier for them to reason about,
and no matter how messy they are going to express their thoughts, we are going to
distill that into something that we can work with. Going back to the canvas, I feel
like for the past, even back prior to the LLM times, we've been focused on
creating workflows that were deterministic or mostly deterministic.
You had a set of paths that you could follow.
There were like some if statements, so you could like go this path or that path.
And right now we are entering this phase of total agency, of randomness, of chaos.
Because you cannot probably map all the possible paths inside an agent.
It's agentic.
By definition, it's going to receive some payload and it's going to decide how to act,
but there are not just two paths to choose from.
It can decide to do multiple, multiple things.
It has its pros because it can come up with really novel ways to solve a problem,
or when it sees a completely new payload that's not something that we've seen before in a pipeline.
It can probably adapt to it.
But it also has a risk, because the same property, the agency property, means that when you invoke an agent 100 times with the same payload, it will probably yield the same result, but for the 101st time, it's going to decide something radically different, because it's a probabilistic machine, it's probabilistic programming, a probabilistic runtime.
So instead of having edge cases, you have this whole long tail of unexpected behaviors.
And we are trying to balance or figure out how we can make the agent predictable for the majority of cases where you want it.
And agentic enough to be smart, but not to go crazy and to do stupid stuff.
Yeah.
Yep. Okay. So, like, in a demo I watched of Zapier Agents, like someone says, hey, you know, check my calendar, go reach out to or research all the people on it and send me like a summary about all the people, stuff like that. So when that sort of gets defined, like, in the definition stage, it gets defined. Once it actually starts running, is there still like a lot of agency that can happen there? Or is it more fitting within like this defined flow of like, hey, check calendar, Google people, write, you know?
Or could it, like, if something weird happens, like, you know, go off on the long tail
as it's actually executing? Or is it more defined once it gets into like more execution mode?
Does that make sense?
Yeah.
I think what you're referring to is like, are we turning the text into some kind of discrete workflow?
Are we baking it into a workflow, right?
We are not doing that.
We are not doing that.
We are researching that because we believe for many, many cases, that's actually what users want.
They won't rock-solid predeterministic workflow.
But we decided to go with the totally agentic path.
What you express as text goes into the prompt without transformations.
And each time you're going to run it, actually with each step of an agent, we are going to evaluate that.
Yeah.
Okay, cool.
So, you know, you're saying like Zapier's got this enormous, like, library of actions and triggers and Zaps, all that stuff.
Is that like a cheat code in terms of like, hey, you have a bazillion tools that are basically built out and defined for free?
Are you using like those sorts of things or did you build your own tools for this sort of agentic stuff?
No, we're totally using that.
I think it's our big leverage that we have access to over 8,000 apps that you can just plug and play both on the trigger side and on the action side.
You can invoke other Zaps.
you can connect to other primitives we have on Zapier platform.
I believe that many people associate Zapier just with ZAPs,
but we have so much more.
We have tables.
We have interfaces.
We have chatbots.
We have agents.
Some of the parts of the platform are overlapping.
For instance, I mentioned chatbots.
I mentioned agents and Zaps.
And many of those things can probably overlap and do the same thing.
But it's the same like with every other platform.
If you take AWS, you can have Lambda, you can have beanstalk, you can have EC2.
They can all probably do the same, but in a kind of different way.
Yeah.
Yep.
Okay, so you say, like, you know, Zapier has like 8,000 different apps.
I guess, like, is it hard for your AI?
Like, I know what people have talked about with agents.
Like, if you have an MCP server that has too many sort of actions, they sort of get confused.
Or, like, how do you solve that problem of
letting it know about this whole world of apps that you have integrated on Zapier,
but also not getting it confused or overloading the context or something like that.
Yeah, that part is happening.
Actually, that's the same question I had when I was onboarded to the team.
I was like, hey, am I actually getting access to all of these tools at runtime?
And the answer is no.
When you're defining an agent, either you, or we
for you, pick the correct actions that an agent should probably be equipped with during the runtime.
So instead of like giving it the wildcard access to everything, it's like, okay, here is a prompt,
here is a list of, we can call it, equipped tools. And for each of the actions that you're
going to equip the agent with, for each of the parameters in an action, you can set a fixed
value, or you can let it guess from the list of fields,
or you can just let it guess something totally random or different,
or let it use other actions to figure out what's the correct value.
For instance, when you'd like to send a Slack message,
you probably need an ID of a user, but no one knows what's the ID of the user.
You probably only know the nickname or the display name.
So we also figure that out.
We have a whole pipeline dedicated to guessing the parameters for,
let's call it, higher order actions.
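A rough sketch of that parameter-guessing idea, resolving a Slack display name to an ID with a cheap, fast model; the function name, model ID, and schema are hypothetical, not Zapier's actual pipeline.

```typescript
// Illustrative only: pick the best-matching Slack user ID from known candidates.
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

async function guessSlackUserId(
  displayName: string,
  candidates: { id: string; name: string }[]
): Promise<string> {
  const { object } = await generateObject({
    model: google("gemini-1.5-flash"), // cheap and fast is enough for an easy guess
    schema: z.object({ userId: z.string() }),
    prompt:
      `Pick the Slack user ID that best matches "${displayName}".\n` +
      candidates.map((c) => `${c.id}: ${c.name}`).join("\n"),
  });
  return object.userId;
}
```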
Yep. Yep.
OK.
I know like a bunch of, like we're seeing agents, you know, like OpenAI has their new agent
and different agents out there.
I know a bunch of them are trying to solve like the browser problem,
you know, like just being able to like navigate a browser.
Do you do much browser stuff?
Or are you just like, hey, we have so many different,
again, APIs and actions and stuff.
We can use these and get 95% of what we want.
We don't have to mess with all the difficulties of a browser.
Yeah, great question.
We've been working on that.
I think we started working on that in March.
It was our hackathon project, one of our engineers
essentially implemented the same thing
that you can see in ChatGPT agent.
hey, I've plugged in a browser,
and you can see it clicking around and doing all that stuff.
And it was great in a demo, just like it's great in other demos,
like, hey, oh, I can book this Airbnb,
or I can go to DoorDash and order a pizza.
But the reality is no one wants that.
Or when you actually want to do something,
you want to, let's say, buy an item from an Amazon,
you run into CAPTCHA, or you need to log in,
or there's some kind of other prevention blocking you from doing this thing,
or there's so much data that you have to deal with that the agent gets confused.
Essentially, the success ratio was so low that we decided to not make it live
because there are other people doing that.
We have a ton of integrations,
and probably the gain from,
you know, adding the ability to use a live browser wouldn't be that great.
And most users will probably just get frustrated that, hey, you know, it's clicking around,
it's doing some stuff, but it's a lot of smoke and not so much fire.
Yeah.
Okay.
But I still believe it can have great utility.
For instance, if you'd like to integrate with some legacy systems.
Think about all those on-premise systems with weird accounting software running on Windows XP or whatever.
There's probably a whole world of software that we are not aware of,
that we are not thinking consciously about, that you could probably automate, like, I don't know,
adding invoices to the systems, like creating some kind of, sending some kind of messages.
I bet there is whole world doing that.
In fact, we've been working at Stedi that's, you know, dealing with EDI. Prior to Stedi,
I never heard about EDI.
And it's also like a really ancient protocol for, you know, exchanging messages in weird software.
So probably there is something there which could be automated with LLMs too.
Yep.
Yeah.
I know.
It's amazing how much just like amazingly old software there is out there.
I remember like back when I was a lawyer, we had a client who sent me a screenshot of
their, like, accounting software, and it was literally in the
terminal. It was like MS-DOS. He was, like, browsing his stuff, and I'm like, come on, how is this still
a thing? You know, can you get like QuickBooks or something? But, um, yeah, it's wild what's out
there. Um, okay, this is good. So maybe just take me through what your stack is like
as you're actually building this, and I would say let's start with models first.
Like, are you changing models frequently, or is that pretty rare? Like, how do you sort of go about choosing a model? Are there different models for different tasks? I guess a lot of different questions to think about there, but how do you think about models for this stuff? Yeah, in a full end-to-end agentic pipeline, there is a lot of things that you need to figure out. There is a lot of guesses that you need to make.
Some of the things that you have to guess are really hard and require reasoning, and some of them are easy.
For instance, let's say parameter guessing.
Let's say you want to pick a correct Slack ID from the list of 10 IDs that should be associated with your user.
You can use an LLM that's extremely cheap, extremely fast, but doesn't have to be smart.
Like Gemini is great for that, and we are leveraging Google Gemini a lot to do this kind of thing.
For instance, we've been also experimenting with Gemini for guessing actions.
So when you give us a prompt, for instance, every day at 9 a.m., send me a Slack message with a joke, we take that prompt and we try to extract the bits from that prompt that could be mapped to an action.
So there's like every day on 9 a.m.
That's a trigger.
Schedule.
Yep.
Okay.
Send a Slack message.
That's also like some kind of action.
So yeah, we use those extremely cheap, extremely fast models to do these easy things.
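For readers who want to picture that extraction step, here is a hedged sketch of mapping a vague prompt onto a trigger and actions with a small model; the schema, enum values, and model ID are assumptions for illustration.

```typescript
// Illustrative sketch: pull the trigger and candidate actions out of a vague user prompt.
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

const { object: plan } = await generateObject({
  model: google("gemini-1.5-flash"),
  schema: z.object({
    trigger: z.object({
      type: z.enum(["schedule", "new_email", "webhook"]), // hypothetical trigger types
      detail: z.string(), // e.g. "every day at 9 a.m."
    }),
    actions: z.array(z.string()), // e.g. ["slack: send channel message"]
  }),
  prompt: 'User request: "Every day at 9 a.m., send me a Slack message with a joke."',
});

// plan.trigger.type -> "schedule"; plan.actions -> something like ["slack: send channel message"]
```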
When it comes to the most important part, we call it a core loop.
We use reasoning models for that or we recently started using reasoning models for that.
But when I joined, we've been using Sonnet 3.5.
And we used that model because it was the best at agentic capabilities.
And it was like a very strict executor.
It was like just not asking any questions, always following the protocol,
always trying to just persevere till it's done.
Then the other models started to emerge that had
really good capabilities. For instance, we've been super excited about Gemini 2.5 Pro.
But we started testing that model using our evals. And I'm going to speak about
evals in a moment. But so, yeah, we ran it through our testing software. And to our surprise,
despite Gemini being, you know, so much better on all the public benchmarks, on our benchmarks, it
was just like terrible.
It was like 10% worse.
And we started trying to figure out, hey, like what's happening?
It should be better in every metric.
We started investigating the answers one by one.
And we realized that it's, how do you call it?
It was just asking a ton of follow-up questions.
It was asking, like, user permission to use tools.
It was yapping a lot because it was generating a ton of tokens.
It sometimes failed even to generate the correct JSON.
So that gave us this realization that all those models have this really distinct characteristics
that it's really hard to convey with just metrics and benchmarks.
They all have their own personalities.
And that's putting the prompt totally aside, because we've been experimenting with that a lot.
So, yeah, we started with Sonnet 3.5.
And sorry, just to expand on that.
So does that mean you need to even, like, cater your prompt a little bit to the different models?
Oh, yeah.
Like, okay, so you can't even just run the same prompt.
Like, you can't just run evals on three different models with the same exact prompt.
Like, the prompt itself might need to be more tailored to get best results for a model.
Yeah, you totally can, but there are distinct best practices and things that you should do.
with each model.
For instance, with GPT-4.1,
I think there is a very specific mention in the OpenAI guide
to follow the protocol or something like that.
I don't remember exactly,
but there is a very specific prompting guide for GPT-4.1.
And there's different prompting guide for Sonnet or for Gemini.
So, yeah, when you're changing the model,
you should always try to change the prompt
because those things are totally different for each model.
And how do you get a sense for that?
Is that mostly just like experimentation and more like qualitative type stuff to be like,
okay, I got to play with this model a little bit to see what's better.
And then once I've done that, maybe I'll move into something more quantitative with evals
or something like that.
Or is it like looking at the prompting guide very closely?
Like how do you figure out how to get the best out of a single model?
So we had our set of evals, and we mentioned that,
but I feel like not everyone on your podcast might know what is that.
So in traditional software, we have tests.
Like you have a piece of functionality that you'd like to run in a sandbox environment
and assert that the output is something that you expect.
In AI engineering, we call it evals.
And it's subtly different because you run it, first of all,
you run it multiple times, for instance, 10 times on the same input,
because each time the output might be a little bit different.
Then, because the output is different, we use a different style of assertions.
We use those scorers.
And the scorer, a scorer can be any type of thing, it can be like a very dumb assertion.
So you can check whether the text is containing a substring,
or you can do a semantic similarity between the output and the value that you're expecting.
Or you can use LLM as a judge to use another LLM to, hey, look at this input, look at this output.
Does it make sense?
Does it meet the criteria or does it meet the rubric that you are defining?
So, yeah, I feel like evals are very similar to tests, but they are more scientifically oriented.
And in tests, it's like always, okay, it's either passing or not, and we are always aiming to be always green.
With evals, we are not necessarily trying to maximize the metric.
Because if you are getting very high scores, like 90% plus, it probably means, at least in agentic workflows, that your data set is not interesting enough.
If you're scoring really high scores,
it means like, oh, we almost achieved AGI, right?
In reality, it seems like your test cases are just not interesting.
Just too easy or something.
Yeah.
So you're trying to float around like 60% or 70%
to make sure that, hey, okay, we are doing quite well
on the really big base of the most basic
and like the most common use cases.
but there is still a set of really hard, really complicated problems
where we can still improve our software,
and we are still waiting for the models to get better,
and we are still testing with our harness.
So, yeah, so there is still room for improvement.
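A bare-bones sketch of the eval loop he's describing, with no particular framework assumed: run each case several times because the output is nondeterministic, score with a dumb scorer, and look at an average rather than a hard pass/fail.

```typescript
// Framework-agnostic eval harness sketch; all names are illustrative.
type EvalCase = { input: string; expectedSubstring: string };

// Dumb scorer: does the output contain the expected substring?
const containsScorer = (output: string, c: EvalCase): number =>
  output.toLowerCase().includes(c.expectedSubstring.toLowerCase()) ? 1 : 0;

async function runEvals(
  cases: EvalCase[],
  task: (input: string) => Promise<string>, // your agent or prompt under test
  runsPerCase = 10 // sample repeatedly, since each run can differ
): Promise<number> {
  let total = 0;
  for (const c of cases) {
    for (let i = 0; i < runsPerCase; i++) {
      total += containsScorer(await task(c.input), c);
    }
  }
  // Aim for roughly 60-70%; a saturated score usually means the data set is too easy.
  return total / (cases.length * runsPerCase);
}
```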
Yeah.
And when you say, when you say, hey, roughly aiming for 60%,
is it the case that, hey, maybe you have 100 tests
and they're all pass-fail-ish so that and you want 60% of that?
Or do you have 100 tests and, you know, your LLM as judge can sort of give a score that's between 0 and 100% and you want that to average out to 60%, or is it some of both?
Does that make sense?
There are multiple ways to approach that.
Yeah, we use the scale from 0 to 1 with 0.2 increments: 0, 0.2, 0.4, 0.6, 0.8, and 1.
There are multiple ways to approach it,
and there are other people who
claim that you should always go with binary evaluations
because they are so much easier to reason about.
And many AI teams have this problem
where they define so many metrics
and so many North Stars,
and the scoring algorithms
are so complicated that you lose track
of what you're optimizing for.
And when you're optimizing for five metrics,
you change the model or a prompt and you get an improvement on three of them, but a decrease on two of them, and there is not a clear signal there whether that's an improvement overall or not.
So in general, you should make your evals as dumb as possible, but I think it's really tempting to add all those different metrics like faithfulness or factuality or, you know, formatting and tone.
And there are all the metrics like recall and precision.
So, yeah, you have to be really smart about choosing them.
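And a sketch of an LLM-as-judge scorer on the 0-to-1 scale in 0.2 increments mentioned above; the model choice, schema, and rubric wiring are assumptions, only the scale comes from the conversation.

```typescript
// Illustrative LLM-as-judge scorer returning a discrete 0-1 score.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

async function judgeScore(input: string, output: string, rubric: string): Promise<number> {
  const { object } = await generateObject({
    model: openai("gpt-4.1"), // illustrative judge model
    schema: z.object({
      reasoning: z.string(), // make the judge explain itself before scoring
      score: z.enum(["0", "0.2", "0.4", "0.6", "0.8", "1"]),
    }),
    prompt:
      `Rubric: ${rubric}\n\nInput:\n${input}\n\nOutput:\n${output}\n\n` +
      `Score how well the output meets the rubric, using only 0, 0.2, 0.4, 0.6, 0.8, or 1.`,
  });
  return Number(object.score);
}
```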
Yeah, yeah.
Is it hard to know, like, when you are releasing changes to a prompt
or changing to a model or just changing an approach in some case,
and it runs these evals, is it hard to know, oh, did that really improve?
Did it not improve?
Like, is it, is there some art there in not just science?
Or, like, is it pretty clear, oh, that actually did improve it?
We can release that.
oh, it didn't improve it, we should make some more changes.
Yeah, so for a majority of cases, unfortunately, it's not that simple because you can get an
improvement of, let's say, half a percent compared to the previous run, but then you rerun it
again and you realize, all the improvement is gone.
Yeah, yeah.
Because you run an eval on a different day of the week, or you run an eval in August, which
is a month of summer when Europeans take a vacation, so the LLM is lazier. There are hypotheses
like that. So yeah, it's really hard to say with a really great confidence that, okay,
this configuration with this model, with this prompt, with this temperature is better than the
previous one. What we are doing instead is we, sometimes we focus on just one particular
failing use case. So for instance, can our model nail, let's say, a hundred tool calls in a sequence?
Let's say I have a spreadsheet with 100 emails and I want to send some kind of cold outbound email to
each one of them. And my rubric is, okay, an email was sent to each one of them. And that's
something that we've been focusing on some time ago. And then we are just,
like iterating on that one specific entry inside an eval data set. And when we solve that,
and we don't see a decrease on the rest of the cases, then we call it success, even if the metric
didn't jump noticeably on a larger scale.
Yeah, yeah. And then do you have something, I mean, I think of evals
as almost like, hey, I am making this change, does it sort of have improvements in our test world that we're
thinking of? You think it does. You release it. Do you have some sort of like broader, now like
usage-based user metrics that you're, like, looking at over time to say like, oh, is this
actually getting better? Are you able to get any sense like that? Or how do you sort of feed that
back to, you know, like, hey, these evals say they're getting better? Is it getting better in the product, or just different things like that? Yeah. We have a whole so-called data flywheel set up to make
sure that we are constantly improving.
So we have our evals in place.
We have the data set in place.
And whenever a user marks a message with a thumbs down, the user is not
satisfied, and we are looking at almost every negative feedback.
And when the case is particularly interesting, when it's something that we
think that we should do well on, the trace that is associated
with that negative feedback is pushed to the data set,
and we change the expected value to something that, you know,
the agent should do, and then we are iterating towards that.
But that's only when we have a very good case and the feedback was very explicit,
when user presses, you know, thumbs down.
But that's not happening very often.
People are, so first of all, users are not super eager to press thumbs up because when they're happy, they're just using the product.
When they are not happy, they are not always eager to press thumbs down and give you a very elaborate explanation of what they wanted, what's the acceptance criteria and all that.
They are usually swearing.
So there is a lot of implicit signals about whether your product is good or not.
A positive implicit feedback is when, well, the user is logging into the platform every day, or when the user has multiple workflows that are just running in the background and the user is not churning, so they are probably happy.
Detecting negative feedback is something kind of tricky that we also do.
We are trying to infer the emotions from the chat.
So when the user is swearing, repeating words,
using a lot of exclamation marks, yada, yada, yada,
we try to say, okay, that was definitely a negative interaction.
And then we are trying, then we are using a reasoning model
to just look at the trace, the whole conversation,
and try to assess, hey, what actually went wrong?
In each of those agentic workflows,
there's like sometimes even up to 20, 30, 40 components happening
before the final message is synthesized.
So you just give all that data to o3 or o3-pro,
and we ask it, hey, what was the problem here?
Was it a problem with like resolving parameters
or was the prompt not clear
or maybe some kind of API call failed
or maybe bad SQL?
There's like the whole universe of problems
that could have happened.
And yeah, we tried to use that.
But it's also really hard to tune
the reasoning model with the prompt to be good at doing the investigation.
So tuning your LLM as a judge is even a whole different category.
It's like a chicken-and-egg problem, because you want to use an LLM to do the investigation,
but you first need to eval your LLM as a judge,
which is also, like, supposed to do evals.
Yeah.
Yeah.
So we are, yeah.
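A hedged sketch of that trace-investigation step: hand the whole trace to a reasoning model and ask it to classify what went wrong. The failure categories come from his examples; everything else (names, model ID, schema) is illustrative.

```typescript
// Illustrative sketch: classify why a flagged agent run went wrong.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const FailureCategory = z.enum([
  "parameter_resolution",
  "unclear_prompt",
  "api_call_failed",
  "bad_sql",
  "other",
]);

async function investigateTrace(trace: string) {
  const { object } = await generateObject({
    model: openai("o3"), // a reasoning model, per the conversation
    schema: z.object({
      category: FailureCategory,
      explanation: z.string(),
    }),
    prompt:
      `Here is the full trace of an agent run that a user flagged (explicitly or implicitly) ` +
      `as a bad interaction:\n\n${trace}\n\nWhat actually went wrong? Pick one category and explain.`,
  });
  return object; // e.g. { category: "api_call_failed", explanation: "..." }
}
```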
Yeah. How much time do you spend doing like truly like qualitative examination, like, you know, stuff that doesn't scale, but like you're looking at individual examples, whether it's like a chat log of someone cursing or just anything like that, where you're just like what went wrong here? Is that like a big, a decent part of your work week? Are you spending, you know, a couple hours of that? Yeah. Yeah. Okay. Yeah. It really depends. Sometimes we get those like high priority pings because there is some like, for instance,
internal use case. So we have the access to the end customer and we can like pair with them,
talk to them like, hey, what was the expected thing? How would you, you know, we can just talk to
the user. And we have that when you have the access to the user, it's just so much easier to do
the investigation versus just like, okay, I'm going to like stare at this trace for 15 minutes
and try to figure out what went wrong. So yeah, it varies. Recently, we are more
focused on feature development because there is ZapConnect coming, our annual
conference, so there's a lot of things going on and there's a lot of things that we are
going to release. But yeah, we always try to look at the negative feedback and always
have someone on a, I would say, soft on-call, or like, yeah, best-effort on-call for
that, just reviewing that stuff. Yeah, yep. Yeah, okay. Okay, so let's get practical on evals.
Are you using Braintrust?
Is that right?
That's correct, yeah.
Okay.
Is that like the clear runaway winner out there?
That's like the main one I see out there.
Are there other options or is that what most people are using?
I don't know about what are most people using.
In my bubble, everyone's using Braintrust.
I've heard about LangSmith or LangTrace.
I don't remember.
There are so many Lang-related products.
Lang, is there Langfuse too?
Is that one?
Oh, yeah, Langfuse.
Yeah, I think there's also Arize.
I also, I'm getting a lot of positive messages from my friend.
Oh, he's using observability from Datadog.
And he has integrated, like, every interface metrics and traces and also LLM traces.
So even Datadog is in the game apparently.
Yeah.
Yeah. Okay. And so are you using Braintrust both for evals and then also for like tracing production calls?
Correct. You can go investigate that stuff. Okay. Correct. Because it, yeah, it allows us to close the data loop, which means that whenever we see an interesting trace that's annotated as bad feedback, we can add it to the data set, which is also living in Braintrust. So when we run the evals, we see whether we are still failing on that or maybe we solved that case.
So in brain trust, we keep the data set, we run the evals, we see all the traces.
So that's the place where all the evals and observability take place.
Yeah, okay.
Okay.
Okay, so practical stuff, brain trust is what you're working at.
Okay, going back to models, do you have like, out of the big model producers, you know,
OpenAI, Anthropic, Google, xAI, do you have like, hey, these are like the clear one or two we always use?
Is that changing a lot or where are you at with those?
Yeah, so as I mentioned before, I was super excited to use Gemini Pro 2.5,
but that didn't work well for us.
For our use cases, we see that sonnet is always a clear winner.
Whether it was 3.5 or 3.7 or 4, that's just the best model that we are observing.
because, yeah, as I mentioned, it's just so consistent
and it's not giving up.
It's not forgetting about its initial purpose.
It's just doing stuff until it finishes.
And premature terminations were the big problem for us
when we were using different models.
And the sonnet is the only one that's really not giving up.
Okay, okay.
What about just other tools?
Like, are you, are you all TypeScript?
Are you Python?
Like, what language do you run stuff in?
Fortunately, we are all TypeScript, both on the back end and the front end and the whole, you know, magical AI loop.
So, yeah, I'm really happy about that.
You don't have to mess with that Python ecosystem.
Yeah.
Oh, my God.
Don't get me started.
It's, like, truly the worst.
Oh, man, someone tried to get me to do that a few months ago.
And I was like, man, I remember why I left this behind.
Like, Python is how I learned to code, and, like, I had Pipenv set up and all that stuff.
But then if you're not doing it all the time, that ecosystem is just a disaster, I feel like.
Yeah, because you need to, oh, you need a virtualenv with, oh, no, with Poetry, or with pip3.
No, pip, not this Python.
No, you should actually use UV.
And it's like, oh, my God, it's terrible.
It's really a disaster.
Like, you know, the JavaScript ecosystem has its challenges, but at least it's, like, easy to just, like, get a project going generally.
Okay, so within that world, are you using like the AI SDK or something? Like, what other tooling are you using in the TypeScript world?
Yeah, I feel like just like many other people, I was super excited about LangChain and it allowed me to start this journey.
And I feel like it also allowed so many other people to start this journey.
But after using it for some time, you can feel that there are so many abstractions
and, like, templates and classes and, like,
hey, I just want to send a string.
Let me see the prompt.
Let me just send the messages and that.
That's really it.
So for that reason, we are using the AI SDK exclusively
because it's so good.
You can switch providers dynamically.
It's working really well.
It abstracts away all the nasty pieces
giving you a type-safe, elegant interface.
Streaming is there,
object generation is there, changing providers.
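For a sense of what that interface looks like, here is a tiny AI SDK sketch showing the provider swap and built-in streaming; the model IDs and message are illustrative.

```typescript
// Illustrative AI SDK usage: one-line provider swap plus streaming.
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";

const useAnthropic = true;
const model = useAnthropic
  ? anthropic("claude-sonnet-4-20250514") // assumed model IDs
  : openai("gpt-4.1");

const result = streamText({
  model,
  messages: [{ role: "user", content: "Summarize today's unread emails." }],
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk); // stream tokens as they arrive
}
```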
We are also using LiteLLM to do some kind of fallbacks,
retries.
What's it called, LiteLLM?
L-I-T-E-L-L-M.
OK, I love this one.
Yeah, it's like a gateway or router infrastructure piece,
let's call it.
So we are having the same model served by multiple providers.
And when one is down, you're routing to a different one.
So, yeah, kind of like a load balancer or something.
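The fallback idea itself fits in a few lines of TypeScript. LiteLLM is a separate gateway, so this is not its API, just a sketch of "same model, multiple providers, try the next one when one is down."

```typescript
// Illustrative fallback wrapper, not LiteLLM's actual interface.
import { generateText, type LanguageModel } from "ai";

async function generateWithFallback(providers: LanguageModel[], prompt: string) {
  let lastError: unknown;
  for (const model of providers) {
    try {
      return await generateText({ model, prompt });
    } catch (err) {
      lastError = err; // provider down or rate-limited, try the next one
    }
  }
  throw lastError; // every provider failed
}
```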
Yeah.
Okay, cool.
I just spoke with Arvid Kahl and he's got kind of a unique use case where he's like transcribing all the podcasts across the world.
And because of that like cost was a big factor for him, so he's like running his own models, is cost a factor for you all?
Or is it like, you know, he's working with like, you know, hour long transcripts.
He's got a lot of output, which costs more than input.
Like, do you ever think about running your own models?
Are you like, hey, that's not something we even ever want to mess with,
we're just going to use the APIs and go from there?
We are starting to look into area of optimizing costs
because our usage is growing exponentially.
And, yeah, the bill is getting really, really high.
So previously, we were, I feel like, just like every other startup,
because we, at Agents, believe that we are like a startup,
but embedded in a bigger company.
We've been looking desperately for the product market fit, and once we have it, now we can think about optimizing costs and stuff like that.
But we didn't really bother for a long time, making sure that we delivered the best possible experience was the only priority we had.
Yeah, yeah.
One last thing I was thinking about is, you know, again, we got to know each other through like serverless and dynamo and all that stuff.
And I just see like so much less talk about databases or infrastructure or infrastructure as code lately.
Is that just because there's like so much potential around AI that's just like, why would you even talk about the other stuff?
Like what happened to those vibes that were around like five years ago, you know?
Yeah.
I feel like I'm just no longer interested in IAM policies and the latency of the replication
of my DynamoDB between regions.
I just want to persist the data, run some code on the server,
and whether it's Vercel, AWS, Azure,
it's not a huge concern to me.
On the social media side, you can see that it's just so much more sexy
to talk about AI, and that's where also the money from VCs is.
And even if you compare the spend,
how much you spend on LLM and how much you spend on the server,
if you spend like $200 or $1,000 on tokens,
you don't really care whether your compute is optimized.
You can just run something that's not super optimized.
You can run some serverless functions that,
okay, maybe they will be cheaper on Hetzner, but I don't care.
I'm mostly focused on looking for my product market fit.
I'm mostly trying to maximize my revenue by making the product as good as possible,
not by optimizing the margins, because there are finite possibilities to optimize the margins
and probably an infinite amount of ways I can make the product better, right?
Yeah, yeah, for sure.
Okay, switching gears a little bit.
I know you talk about stuff.
You had a great talk at AI engineer.
You've written about some stuff and you do some AI consulting.
I guess, like, where are you sort of seeing other people struggle outside of Zapier and Vendr where you've worked?
Like, I guess, are regular devs picking up the AI, like building with AI pretty well?
Are you seeing some struggles?
Like, what's that look like?
I feel like the biggest struggle is that whenever I ask people, okay, so how do you measure it?
Or how do you know if it's working?
They are like, well, I wrote this code.
I deployed it, I sent a request or I just tested the product and it works.
So I feel like many people just lack, and I see this all the time, they lack this mindset or
methodology or framework to test this thing, whether that's working.
Because, you know, they have no method of verification what happens after they change the
prompt or change the model or change
the temperature. They are just
like running blind and I feel
like there is still
not much knowledge about
this topic. There are great people like
Hamel Husain or Jason Liu
doing a really, really
great job around evals
and data flywheels. But
that's just not really
sexy. People just
want to deploy this and it's
so much, it's so easy to deploy the
initial version, but
Just as I said during our talk, we believe that shipping the initial POC is just the beginning of a journey.
After you deploy this, you realize that users have totally different ideas on how to use your product than you imagined.
You probably, for instance, you came up with this insurance claims chatbot, and they are going to ask it for a recipe for how to cook a salmon or something like that.
That's happening all the time, and it's not going to stop
happening, because the spectrum of inputs you can send to the system is infinite. There is no,
like, okay, there's a button to continue or go back. There's like a text field, so yeah, infinite things
can happen.
Yeah, interesting. And so is that just like, engineers that need, you need engineers
with like a little bit of product sense even more than before? Because it's not just like, hey, did
it pass, like, the deterministic test, but it's like a broader sense of just like,
hey, are you solving the user's goals, you know? Which is, like, is that sort of what people need? Or, like, what do you, yeah, I guess, what makes people succeed?
There's definitely a component of the product-oriented mindset, as you said.
Which is, I feel like for the past decade, engineers could feel comfy just staying in
their repo and in their IDE, maybe visiting GitHub from time to time to merge a pull request.
And that was the whole territory that they were operating in.
I got a ticket, I'm going to ship it, I'm going to merge it onto the next one.
And that was pretty much it.
Maybe look at the metrics in AWS from time to time, maybe write an RFC or comment on an
RFC.
Now it's more like, okay, if I want to own a feature from start to finish, I need to discover
what's the use case for the product or what's the use case for this feature?
I probably need to talk to users. I need to code it. I need to deploy it. After I
deploy it, I need to watch the metrics. I need to collect a data set of failing cases.
I need to then start improving the prompts or the feature or change the model to make sure that it's actually working.
So yeah, I feel like it's growing just from
being code oriented to being product oriented,
but also being metrics oriented,
or have some kind of small,
I don't know how to say, like scientific component,
because there's like this whole metrics thing
and maybe calculating recall or precision
or whatever you are going to choose,
it's needed for the thing that you're working on.
But yeah, I feel like that's a discourse
that is really popular recently,
which is we see product engineers
or full-stack engineers being more than just code.
It's from the idea through the code,
through the product, doing everything.
One man, Army.
Yeah, yeah, exactly.
You mentioned, like, a user might, you know,
it's a chat bot for something else
and then they ask for, like, a salmon recipe or whatever.
Do you see that at Zapier in your chatbots?
And, like, do you protect against that?
Is it worth protecting?
against that? Is it like a small enough thing that's just like whatever?
I guess like, do you see stuff like that where people are just using it as like a generalized
ChatGPT, basically? Yeah, yeah. The problem we are having right now is that people are using
Zapier Agents as a free ChatGPT alternative because we are charging based on the amount of
actions that you're going to use. But some people just like paste code and errors and they debug
their code using Zapier agents or write articles or proofread them or edit stuff.
This is, like, we didn't imagine Zapier Agents doing that.
But as I said, users are really unpredictable, and they sometimes have even
more agency than agents.
Yeah.
Yep.
Is that easy to protect against, or is that just like a lossy thing, which is like, hey,
you're just like always playing whack-a-mole with that?
Yeah, I think it's always a whack-a-mole.
There is a long tail of the behaviors.
You cannot, just like with security, you can try to secure yourself to 100%,
but there's always going to be a new prompt injection technique or new way of tricking us.
But I don't feel, if you have sensible rate limits in place,
then you'll be probably good.
Things that you can also use is, for instance,
OpenAI has a free moderation API.
So if you don't want to see potentially harmful content
on your platform, you can use that and flag those messages.
But yeah.
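A small sketch of that moderation check, using OpenAI's moderation endpoint; the surrounding handling is an assumption, only the endpoint itself comes from the conversation.

```typescript
// Illustrative pre-filter: flag potentially harmful messages before the agent sees them.
import OpenAI from "openai";

const client = new OpenAI();

async function isFlagged(userMessage: string): Promise<boolean> {
  const res = await client.moderations.create({
    model: "omni-moderation-latest",
    input: userMessage,
  });
  return res.results[0].flagged;
}

// e.g. if (await isFlagged(msg)) { return "Sorry, I can't help with that."; }
```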
Yeah.
OK.
I know you've written a little bit about RAG and stuff too.
And I know for a while that was all the rage.
Are you still seeing like a lot of
RAG out there, I guess maybe you have less of it at this particular one.
I guess, like, what's the future of RAG?
Is it still a big deal?
Oh, yeah, definitely.
I mean, like, some people say that RAG is dead.
I don't think that's true.
Maybe the vector databases are not as relevant as they were
because we see the agentic retrieval.
For instance, when it comes to code, you can,
Instead of using traditional semantic similarity-based retrieval,
which is using vectors and mathematical operations,
you can maybe traverse the code just like a user does
by using imports and requires and by looking up into the classes and whatnot.
I think the retrieval in itself is never going away.
After all, that's the way how you supply the relevant context to the model.
So it's always going to be there.
in many other forms.
And I don't believe that we are going to see
in the near-term future
models that are going to have
really huge context windows.
We see
that even
GPT-4.1, I think
it has 1 million context window.
But after you pass
some threshold, I feel like
50K tokens or even less,
the quality of the model
decreases dramatically.
So it's
still, so context engineering, I feel like this term is really hot recently, is really
important: making sure that the thing that you're keeping in the context of the model
is really succinct and only contains the necessary pieces is really,
really important. So RAG is still important, and it always will be. Yep. And then what about just
like general AI progress over the last, let's say, six months. Do you think it's like still
progressing as quickly as you expected? Do you think it's slowing down? Like, where do you think,
how do you think that's doing? Come on. Like, we had so many interesting moments. Like, we had
deep seek moment. We had trillions of value being vanished from the stock market. And then we had
this, like, constant race of, oh, GPT-4.1, and then o3 and Sonnet 4 and Opus 4. And then some
Kimi K2, open source models are progressing super, super quickly.
I remember at the very beginning, my line of thinking, like two years ago
was something like, why do those people even really bother with those open source models?
Open AI is probably the best and no one's going to win with them.
And I was so foolish because you can see right now the open source models can be really, really amazing.
They're getting really good.
Yeah, it is wild.
Yeah.
And yeah, it's going to continue to happen.
Yeah.
What's the biggest thing sort of holding back progress for you at Zapier?
Is it like, hey, we need better models or it's just going to take some time or there's some
tooling or just like best practices or things like that?
Like, what do you think you need to make this even better?
That's a really good question.
I feel like what's problematic for us is that we are a
general-use tool, or we are trying to cover as many use cases as possible.
If you want to use Zapier Agents as a content creation machine, yeah, you should be able to do it.
If you want to use it as a legal advisor, you should be able to do it.
If you want to use it as a pull request labeler, yeah, we also allow you to do that.
So by making the tool as broad as possible,
it's probably impossible to optimize it for a specific thing.
Yeah.
And so there is a really wide spectrum of use cases.
There is this 8,000 apps on top.
From time to time, something breaks.
There's a really wide spectrum of things.
that can break.
So, yeah, yeah.
Maybe, maybe, maybe it's the, it's the maintenance.
Or sometimes I joke that our users slow us down.
Going back to the topic of are we switching models that often.
The point is we already have a ship with like thousands of users.
And we cannot maneuver with that ship as fast as we could because we have thousands of
workflows, like, running and people
relying on them. So we cannot just like turn everything upside down, go from one model to another
because it's going to break so many assumptions. So in that sense, being blessed with product
market fit, being blessed with users makes you slower and not able to maneuver and pivot as fast
as you'd like to. Yeah. Interesting. Do you think you'd ever like take on the complexity of almost
like models per workflow where like if someone has an existing workflow set up and when that
came out, you know, it was using this model and this prompt and all that stuff. And now you
want to iterate, but you don't want to break that. Would you like almost like version those and
say, hey, that person can still use that? It won't, it won't change for them. It won't break anything,
but we're going to keep iterating on new stuff. Yeah, yeah, yeah. We are getting there,
definitely. I think the ideal dream state would be to use reinforcement learning on
per agent basis.
So imagine you can dry run your agents,
let's say 10 times.
You label those runs as thumbs up, thumbs down.
And then we create for you a personalized,
let's say, fine-tuned model that's
specifically run for you, hosted for you,
and it's just optimized for your data.
Yeah, that would be amazing.
But I feel like we are not there yet.
Always away.
Yeah, interesting.
Okay, what about just you personally developing with AI?
What's your tool set workflow, like cursor or WinSurf or what are you using?
I use Cursor for those surgical edits.
I have this intuition when Claude Code is going to do a good job and when it's not going to do a good job.
When I know there's a bit of code that I have a really precise change in my mind.
I just use Cursor, I highlight this code, I chat with it.
And yeah, I use Cursor a lot.
Recently, I started shifting to Claude Code a lot.
It's really, really good.
And the workflow that I'm really enjoying is Claude Code installed on GitHub Actions CI.
So you can, on your phone, like, whenever I'm on a walk,
I'm just opening GitHub Mobile on my phone.
I create an issue, and then I add in a comment,
@claude, please fix, or something.
And then it's starting Claude Code inside GitHub Actions CI.
It's running the process there.
It's looking at my comment, and it's doing the fix there.
And then it's creating a pull request after it has finished.
So it gave me a great joy because I don't have to sit here
between the chair and the display.
I can just go on a walk.
And whenever I have an interesting thought, I'm like, OK,
Let's do some coding.
I'm just going to express my thought vaguely or even use dictation to just, you know, say something to my phone.
And when I'm back home, the change is probably here already for me to review.
That's something that I really enjoy, especially during summer.
Yep, yep, exactly.
Do you feel like you, like, how quickly are you, or how closely are you reviewing that? And related, I guess, do you feel like you're losing
understanding of your repo?
Like, I feel like there are repos I know really well.
Totally.
Totally.
Yeah.
It's like, there are repos I know really well.
And like, cloud code, I'm just like, oh, change this, change this, change this.
It'll be like super quick.
And then, like, newer repos that have been like mostly cloud coded, I'm just like, oh, my gosh.
Like, I don't know my way around this nearly as well.
Yeah, yeah, yeah.
Totally, totally.
I remember the pre-AI age when I really knew the structure and the functions and maybe like even the line number of something, because that was a really peculiar thing.
But right now, I'm like sometimes looking at it, because Claude Code is not even showing you the code, right?
You're just typing your prompts.
It's showing you diffs.
But sometimes you're just like, okay, dangerously apply everything, whatever.
Yeah, I'm sometimes, I recently had a situation, I feel like, a month ago when I opened a file and I started reading that, reading the code.
And I was like, why this thing is defined two times?
And I realized that the streaming piece, the most important piece I had in one of my site projects, was like defined twice.
It was so weird.
So, yeah, I feel like the problem with
vibe coding is real.
You can be caught off guard pretty easily.
Oftentimes, just adding one last message to, like, review the code thoroughly,
make sure that everything is DRY or whatever, solves the problem.
It's not always solving the problem, but yeah.
Yeah, yeah.
What about like how much better do you think
this will get, right?
Because like you see people right now trying to do stuff on Lovable and Bolt and stuff,
and they can like, especially not software devs and they can get like decently far,
but then it sort of gets into spaghetti.
And it definitely helps to have a software background.
I guess like, how much better do you think that'll be in a couple of years?
Do you think that software background is still going to be helpful or will it get good
enough to where it's like, ah, you need less of that?
Hmm, good question.
I feel like it's still going to be useful.
because sometimes you just have a ballpark idea of the load that you're going to have,
or maybe that async operation should be in some kind of processor that's behind a queue or something
that maybe this photo rescaling after the upload should not be blocking the user, stuff like that.
Maybe it's just, maybe it's a set of skills that you acquire that is tangent to being a software engineer,
just being a product person.
I think we all learned that just through the work
and it's going to be really, really beneficial.
I recently read a really nice sentence
which said that 90% of my skills are almost obsolete,
but the 10% of skills that I have
are now like 100% or 100x more valuable.
Something like that.
I don't remember exactly.
Yeah, I think I saw that.
Was it Kent Beck, I think, who said that one?
Yeah, that was a good one.
Yeah, it's interesting.
Yeah, cool.
Okay, two last, like almost random questions.
Number one, you talked about GitHub actions.
I've had two people on that are basically providing compute for GitHub actions.
Are you using just like the standard GitHub actions compute?
Are you using a different provider?
Like, how big of a problem is that for, do you know what I'm saying there?
Yeah, I didn't see it as a problem.
I'm like, all that happens is that you mention Claude.
There's a GitHub hook or something that's just starting a CI job.
It's starting a Claude process that pulls all the comments and converts that into a prompt.
Claude does its job.
It's usually less than 10 minutes.
So, like, it's perfectly reasonable for the CI limits.
And then it's doing a bunch of GitHub API operations,
or Git operations like commit, push, and whatever.
So compute is not a concern for me, to be honest.
I really want to give a try to opencode from Dax and Adam.
Oh, yeah, I've used opencode.
I actually haven't even used Claude Code.
I've just used opencode.
And yeah, I like it pretty well.
I've been surprised at how much I like the CLI agent workflow as compared to
Cursor.
I love Cursor.
I love Cursor, but now I really like opencode too.
Yeah.
Cool, all right.
Last other question,
just like, why is
Poland killing it?
Like, you are, you're from Poland.
Every, like, I know like three or four
other Polish engineers I've worked with,
and they're all awesome.
Obviously, like, their economy is killing it.
They're, like, richer than England here pretty soon.
Like, what's going on in Poland?
I don't know.
I don't know, honestly.
I feel like
we had a period of like really bad time
after the Second World War, and after that really hard time, I feel like we really wanted to,
you know, just work hard, and it's paying off. I think that's pretty much it. Yeah. Yeah, that's awesome.
Yeah, I love all the Polish people I know. And my wife and I, we actually went there last year, and like,
it was awesome. Great food, great town. It was like, yeah, it was a really fun time. So, um, but yeah,
thanks for coming on. It's been great. I've learned a ton. If people want to find out more about you
or hit you up for consulting or stuff like that, where should they go? I think Twitter or X.com
is the best place, slash RafalWilinski. Yeah. Yep. All right. And definitely check out
Rafal's talk on evals at AI Engineer. That was a great talk. So yeah, Rafal, thanks for coming on.
Yeah. Thank you. Thanks for having me. Cool.