PurePerformance - Serverless Observability needs a paradigm shift with Toli Apostolidis
Episode Date: August 28, 2023
Only a few can claim they have successfully created a pure-serverless architecture, and only those really understand the challenges of observing real event-driven architectures. Apostolis Apostolidis (also known as Toli) is one of those people, and it's why we invited him back to discuss all the lessons learned from his time as Head of Engineering Practices at cinch. Tune in and learn about the evolution of serverless observability and the challenges of observing API Gateways, queues, and Step Functions. Listen to Toli's advice on picking one observability vendor, doing your own custom instrumentation, and making yourself familiar with the observability data from your managed service provider. Also go back to our previous episode to hear more of his engineering practices for success, and remember that the time to ask about cold starts is over 🙂
Additional links we discussed today:
Previous Podcast with Toli: https://www.spreaker.com/user/pureperformance/unlocking-the-power-of-observability-eng
OpenTelemetry: https://opentelemetry.io/
AWS Step Functions: https://aws.amazon.com/step-functions/
Dynatrace Business Flow: https://www.youtube.com/watch?v=W0bSzvQrUzA
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have the mimicker-in-chief Andy Grabner making fun of me during the opening, which someday we'll record visually
to let everybody see because then people can sleep at night.
Speaking of sleeping at night, I don't have any dreams again, Andy.
But I did just wake up recently, so if I sound, you know, bed-weary, that's because I am. Hi Andy, how are you doing?
I'm very good, but I am in a different setting today. But what did you say? What did you call me earlier?
Oh, my, something in chief.
Oh, Mimicker-in-Chief. Mimicker-in-Chief, M-I-C.
Oh, and I forgot to call you, I'm supposed to call you Salsa Boy. I forgot about that.
That's right.
Yeah, yeah, yeah. See, it's morning. So Andy got a nickname, everybody.
Well, we did the recording yesterday, but in the previous episode, Salsa Boy, because we had a guest, Salaboy, and Andy mentioned Salsa Boy, so we now have a new name for Andy, which I'm going to try to promote all throughout the company. I won't do that to you, Andy. But anybody listening, I'm sure you deserve more respect than you get, from the things I hear people talking about you, Andy. I think you deserve a lot more respect.
I'm actually also good with salsa, boy, because that's
my passion, right? Dancing salsa.
But I also have one quick story to tell, and this is why
I'm in an unusual setting today
because I made the stupid mistake
of trusting the Windows
update message. It says there's an update
pending. It only takes six minutes.
45 minutes later,
it was still going on, and so I
switched to a different room here.
It finally finished, but I'm in a different setting.
In case I sound different today, then it's because I'm in a different setting.
But anyway, let's not talk about Windows updates.
Let's not talk about Salsa or something like that.
We actually have a guest back who we just recently had. Actually, two days ago, as of the recording today, we just aired the episode, which was called Unlocking
the Power of Observability, Engineering Practices for Success with Toli. Toli, welcome back
to the show. And the reason why we have you back on the show last time when we talked,
you said there's a whole lot of set of topics around serverless.
And you wanted to talk about serverless observability.
Serverless is a really big topic for our communities out there, from a
performance engineering perspective,
from a site reliability engineering perspective, from a
platform engineering perspective.
And that's why we're really happy to have you back
because it's a topic that was on
your mind and you also said you want to have
a little conversation here
so we can kind of bounce back ideas on how to, you know,
bring serverless observability to, well, discuss the challenges with it
and come up with some conclusions.
But, Toli, welcome back. How are you?
I'm great. Thanks for having me back.
Thanks for publishing the podcast.
I had really good fun talking about observability last time.
I think we danced around the topic of serverless
in the last topic, in the last podcast.
I think the history is that my experience was
I joined a company called cinch four years ago
and the company was entirely serverless.
So everything we were talking about, or most of the stuff we were talking about last time,
was relating to serverless observability, but we kind of masked it.
So I'm really looking forward to hear what you found in the field
about how people are using observability, how they're enabling observable systems.
Last time we talked a bit more about practices, I guess,
but this time
it'd be interesting to see the
technical difficulties
as well.
I mean, from my perspective, and that's actually
let's start right away. The technical
challenge that I see, and Brian and I, we've
both been in the field of observability
for a long, long time where we built
agents that you installed
on the server, on a VM, right? And as we said, even though there are servers in serverless,
these servers are out of reach for the classical agent. And so the question is, what are good
approaches to actually bring observability into your serverless functions, right? Because you
don't control the runtime underneath.
Now, I know a lot of things have changed over the years
because, you know, the way, like when you take,
for instance, AWS or Microsoft or Google, right?
I think they also learned that they need to bring
a little bit more flexibility
into allowing observability frameworks or vendors
into their runtimes.
Obviously not too deep, but just getting it in.
I also think that open telemetry is playing obviously a big role in serverless observability.
But yeah, I think there's definitely a difference, because we can no longer just install an agent on a machine that automatically detects all the processes and all the containers that run on it and instruments them. We need to kind of instrument from within the serverless function, or whatever we are allowed to do.
So I think that's the big difference in approach.
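As a rough illustration of what instrumenting from within the function can look like, here is a minimal sketch using the OpenTelemetry JavaScript API in a Node.js Lambda handler; the tracer name, span name, and attributes are illustrative assumptions, and the exporter setup (an OpenTelemetry SDK or a vendor layer) is assumed to be configured separately.

```typescript
// Minimal sketch: manual OpenTelemetry instrumentation inside a Lambda handler.
// Assumes an OpenTelemetry SDK or vendor extension is configured elsewhere to
// export the spans; names and attributes below are illustrative only.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service'); // hypothetical service name

export const handler = async (event: { orderId: string }) => {
  // startActiveSpan makes the span current so downstream calls can attach children
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', event.orderId); // business-relevant attribute
      // ... actual business logic goes here ...
      return { statusCode: 200 };
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always end the span so it gets exported
    }
  });
};
```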
Now, Tole, I know with your experience at your previous organization, how did you approach
it?
How did you instrument your serverless functions?
What type of mechanisms did you use? Did you instrument with open telemetry?
Did you use what was provided by the vendor?
Any other things that we may not be talking about that you found out?
That was a great summary, I think, of the estate, really.
It's very interesting to see how the vendors have evolved over time. Our experience was we did start with an agent hosted in an app service, I think.
And that was when we moved to serverless.
It sounded a bit over the top to have a server to host the agent,
but then everything else is serverless.
At the time, I think this is around 2020, a lot of the vendors were learning how to
grow and how to migrate to this new paradigm.
Serverless, having worked for a company that was entirely serverless, didn't own a server, didn't own a VM at all,
it is really apparent how much of a paradigm shift serverless is.
And this comes with instrumentation as well.
So we quickly figured out that the best option is to instrument
using the vendor SDK,
the vendor tracer.
Over time, the vendor learned and got better.
And as you said, the cloud providers were actually allowing more support.
So typically, just for the listeners, what happens,
and I guess this is maybe AWS specific,
so I'd be interested to know what the other two big vendors do.
They have a native tracing client, for example, and they have a native product that you can use. But for a long time, that made it hard for other vendors to instrument as well as that, because it doesn't have that native integration with everything.
But AWS in particular opened up using this form of extensions on Lambdas
to actively work with these vendors and allow them to build extensions
on, as you said, the Lambda runtime.
And that really, we could definitely experience it.
Things got a lot better, a lot quicker.
So typically in that context, you'd use an extension as your agent, really like a mini agent, and your Lambda runtime would push telemetry asynchronously, I believe, after the end of the Lambda invocation. And that seems like a nice, clean architecture. And that works nicely with instrumentation. That's for traces, but you could probably say the same for other things.
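To make that a bit more concrete, here is a hedged sketch of what wiring such an extension in can look like with the AWS CDK; the layer ARN and environment variables are placeholders that depend entirely on the observability vendor you use, so treat this as an illustration rather than a recipe.

```typescript
// Sketch: attaching an observability vendor's Lambda extension as a layer (AWS CDK).
// The layer ARN and env vars are placeholders; each vendor documents its own values.
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'ServerlessObservabilityStack');

new lambda.Function(stack, 'OrdersFn', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('dist/orders'),
  layers: [
    // The extension runs alongside the function and ships telemetry asynchronously
    lambda.LayerVersion.fromLayerVersionArn(
      stack,
      'ObservabilityExtension',
      'arn:aws:lambda:eu-west-1:123456789012:layer:vendor-extension:1' // placeholder
    ),
  ],
  environment: {
    // Vendor-specific configuration usually goes here (API key reference, endpoint, ...)
    OBSERVABILITY_ENDPOINT: 'https://example-vendor-endpoint', // placeholder
  },
});
```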
What do you think were the main drivers why somebody like a vendor,
whichever one of the big ones you name,
what do you think were the real big drivers
for them and the motivation to open up?
Why would they, if they already have a solution, right?
I mean, I'm providing serverless
and I also want to provide the whole package, right?
Obviously, all these cloud vendors try to, I think at least, keep you in their environment, right? Because then everything feels like it's from one vendor. So what do you think pressured them to actually open up?
That's a great question. I don't have any inside information, but I believe the reason is they recognize that the observability vendors do observability (although you can't say that) a lot better than they would.
That's not the differentiator.
I think it's really smart that they've done that. They offload observability capabilities to other
companies and in a way
they make their users' life easier as well. I tend to, whenever I meet
someone at AWS or I meet someone at Datadog, whenever
there's a bit of a, or Dynatrace, whenever there's a
difficult problem, I just say, can you just speak to each other?
Just speak to each other, make it work.
So the signal I've got is that I've not had any friction from either side.
They seem to collaborate quite nicely.
I think it's very critical to have serverless experts
in the observability vendors as employees.
Early on, we had a lot of struggle actually getting our observability vendor to really understand our problems, because they didn't really understand serverless. So that helped a lot.
But yeah, I guess, TL;DR, I think it's to their benefit to offload this non-differentiator.
They can definitely offer basic observability functionality, in AWS's case with things like CloudWatch and X-Ray and their metrics. And that works well for a while, but it quickly breaks if you want to do more advanced and complex things.
Yeah, I mean, from my perspective, what I see,
and Brian, I'm not sure how you see this as well,
when working with our customers
and there are typically large enterprises,
you are typically not just using one kind of stack, right?
That means if you're using AWS Lambda,
then chances are that those apps are calling
into other services that may run on another cloud,
that may run on your on-premise,
but you still want to get the end-to-end observability.
And this is why you don't want to end up
with observability silos where all the Lambda data
is in AWS, and then you have all of your other data
in another observability platform.
Really in the end,
you want to get the end-to-end view.
And I think also that was one of the reasons
why they opened up.
I think it was the pressure from their customers saying,
hey, we need this type of visibility, right?
Because otherwise,
it's going to be harder for me to troubleshoot.
Also, if I'm your customer
and I don't know if it's my mistake,
it might be yours, then I need to open up a support ticket with you.
And then your guys need to look into this. So why not just give me the data
that we need? And then I think it was really good that OpenTelemetry came
along so that we had an open standard that everybody could agree on, that we don't
have anything proprietary, but we really built something that everybody could
easily consume. I think that was a really important piece in the last couple of years with OpenTelemetry.
Yeah, and I think what you're speaking to there is the market pressure, right? You know, if they're not doing that, the customers are going to be unhappy. And if Amazon had it closed and then Azure came in with their serverless functions and said, hey, it's open, they might see an exodus.
But I also think part of it could be the general IT community.
What we see over and over and over is there's one cloud vendor that's not one of the big three that I could imagine would lock it down.
I'm not going to say their name, but I imagine they'd be like, no, you're going to do it all through us, right?
I think there are so many people at these, you know, the big three and so many other places where their nature is just to be like, oh, yeah, we should do that.
People want it and it makes sense.
There's that spirit within our community that really lends to that.
So I'd hope to think that that's part of it, right?
Yes, market pressure is going to be a big part of it,
but I think it made it easy
just because of the way we all operate as an IT community.
Yeah, that's an excellent point.
I really like that.
I hope that's a pressure that we apply as a community
because in a way we want to collaborate
and we want things to not be locked down overall. But I also like the market pressure thing. You can imagine a Wardley Map where the user is anchored as a developer or a company, and they want to monitor their systems. That's their need. And then, what do you need to achieve that? And you can imagine the native client, the custom-built thing from the cloud provider, to be something that they really don't want; it'll sit in custom-built, so they want to move that into product. And the inertia to build it into a product is that it's not the differentiator, maybe, so why not farm it out to other companies?
Hey, Toli, coming back to your experience with building a system that was purely serverless-based,
one of the challenges that I hear from our users or questions that come up is like, hey,
serverless monitoring, yes, we can monitor the individual serverless functions, right? We can
instrument them with open telemetry. We can use the extensions and AWS Lambdas and so on and so
forth. But how can we get visibility into our end-to-end system?
Because there's many services in between, like the end user and the serverless function
and all the other things that a serverless app is consuming.
What's your experience on this?
First of all, was it critical for you to get end-to-end visibility also through these connected services, whether it's API gateways, event buses, you know, down to the database? Was this, A, important? And B, how did you solve it, if it was important for you?
That's a very interesting question. I guess the answer is maybe, or yes and no. It is important, but it isn't.
I think, first of all, it's interesting to say that I was part of a company that had maybe 10, 15, 20 teams at any point in time building serverless systems.
So they were building the systems.
They had to understand the visibility of their systems.
So they had to understand how observability works in serverless.
So the first thing they, if you take it almost like a timeline,
the first thing they did was, okay, I can emit telemetry data
from the thing that I can control,
which is your Lambda and AWS, in this case,
or your serverless function.
But then I think what you're saying is,
the question is, if you want to use tracing,
how do you trace across managed services
like API Gateway and AWS or EventBridge
and things like that?
And the answer is that we tried.
Initially, we used a functionality of X-Ray.
So X-Ray traces a lot of these things natively, and we merged those traces with the observability vendor's SDK, their version of a tracer. And that sort of works nicely, but it has a lot of limitations when it comes to filtering, and X-Ray seems to lose traces. And so we couldn't really use span data for, you know, a monitor, because we couldn't rely on it being there.
So that was one of the strongest limitations for that approach.
Then as the years went past, interestingly, as we waited for things to evolve, and that's often something that happens when you adopt a new paradigm that's new to everyone is that you have to wait for things to evolve and improve. So what happened was the tracers, the
observability vendor tracer got better, that improved over time. So it natively traced
things like API gateway quite easily. And the API Gateway, so if we classify things,
I don't know how you think of them in your head,
but there's three things that I'm thinking in terms of serverless.
And if I define, can I take a little small parenthesis
to define serverless?
Sure, yeah.
So in my mind, serverless is elastic compute
and elastic billing.
Your usage, whether it's a managed service or compute you control, to an extent, should go to zero if you've got no traffic.
And it should scale automatically and you shouldn't need to put any rules in.
And when we talked about server earlier, the server should not be visible to you.
So when we talk about Lambda runtime,
that runtime is not a VM or a server.
It's just the runtime of the Lambda.
It's a bit of a higher level abstraction.
So given that that's what serverless is,
you might then have to integrate your bits of code,
which is your serverless functions,
with other managed services.
And in this case, you have three classes of managed services.
I'd say things like API Gateway, which is fairly straightforward,
and it's synchronous.
And then you've got more asynchronous things,
which is your queuing systems, your SQS, or your event bus, which is your EventBridge, or your notification system, SNS in AWS.
And then the third class is step functions.
And that's a big topic in itself.
But I'd say the API gateway is really useful because you can,
in the last episode, we talked about having real user monitoring.
So then you can get a trace that has your real user monitoring top span.
Then you have an API gateway and then you have your Lambda function.
And that's useful because you can trace things across.
You can see how long the API gateway takes
when you're looking at support cases
and just in general what overhead it adds
and what hacks you can do to improve that.
But I'd say that it's not the biggest value.
We talked about it last time.
The flame graph is not the biggest value.
So it's useful, but
not amazingly useful.
Then the second thing is the asynchronous stuff. And so the things like SQS and SNS and EventBridge, those are quite hard to trace.
And I think the biggest...
I'm curious to see what you're seeing in your
focus group.
I don't know what the best visualization is for loads of things going into SQS and loads of things going out of SQS and all this kind of many-to-many relationship. We tried it a few times to see what's the best approach, but it's not easy to visualize. It's not easy to understand. So if the experts in the company can't really figure out what's the best way and the tool is not helping you, then I don't find any use in giving this to all the teams to use.
And a similar story is with step functions. So step functions
are a very different paradigm as well. So for those of you
I think it's Logic Apps in Azure,
but for those of you who don't know what step functions are,
it's basically a state machine.
So you have a well-defined set of steps,
and once it starts, then it goes through these steps,
and then there's an end game.
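For readers who have not seen one, a state machine of that kind might be sketched roughly as below with the AWS CDK; the state names are purely illustrative, and a real workflow would use task states (Lambda invocations, waits, error handling) rather than pass-through steps.

```typescript
// Sketch: a Step Functions state machine as a well-defined set of steps (AWS CDK).
// Pass states are used here only to show the shape; real workflows would use
// task states (e.g. Lambda invocations) and error handling.
import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'OrderWorkflowStack');

const orderReceived = new sfn.Pass(stack, 'OrderReceived');
const paymentChecked = new sfn.Pass(stack, 'PaymentChecked');
const orderFulfilled = new sfn.Pass(stack, 'OrderFulfilled');

// Once an execution starts, it walks these steps until it reaches the end state.
const definition = orderReceived.next(paymentChecked).next(orderFulfilled);

new sfn.StateMachine(stack, 'OrderStateMachine', {
  // Older aws-cdk-lib versions take `definition` directly instead of `definitionBody`.
  definitionBody: sfn.DefinitionBody.fromChainable(definition),
});
```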
And it's a bit different to use. You can use a Lambda within the step function; you can kind of compose things. It's orchestration versus choreography, as they call it with the event-driven stuff. So that is very difficult to use effectively, because it may take two weeks, so you don't want a flame graph with a massive span. So the visualizations in the asynchronous managed services and the step functions are not really compatible, I think.
But have you seen anything different?
Yeah, I think, with what we are seeing these days, I took some notes and I'll try to at least give you my thoughts on this. First of all, when you are sending a message to a queue, right, I think one of the things that you want to measure and monitor, and also then highlight, is how many subscribers you have on the other side. And I think there are patterns that you can figure out and say, why do we have, you know, 50 subscribers to this particular message?
And it grew from 50 to 60 to 100.
I think that's a good indication.
This is something you can actually answer
when you actually do distributed tracing.
So, you know, one message comes in and then it spans out into 50, 60.
And you want to have an eye on this number
because one of the things that I've been talking about almost since the beginning,
since I started working in observability, is what I call the architectural validation
and architectural regressions.
So with an architectural regression, in the early days, Brian,
we talked about the N+1 query problem.
And coming back to this, because in the end, it's the same thing.
In the classical...
So is that.
In the classical Java apps that we saw in the very beginning of our work in Dynatrace,
we had one call, somebody makes a request, and then 100 database calls go off.
And then tomorrow, instead of 100 database calls, 200 database calls go off for the same transaction. Now, why is this? Because somebody was iterating through the result list of 100 or 200 and then making an additional database call, the classical N+1 query problem. And I think we answered these, we highlighted these architectural patterns through distributed traces, because we said this request on that URL with these parameters was producing five database calls yesterday, and today the same one is 50, and tomorrow it's 100, so you clearly have a data-driven problem here. And I think that's one of the things that can also be applied when you're sending a message and you want to see how this message fans out, so just keeping track of how many invocations fan out. So that's one thing.
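As a rough illustration of that kind of architectural check, assuming you can export span data with trace IDs and operation names, a small script could count downstream calls per trace and flag traces that grew well beyond a baseline; the span shape and the numbers here are entirely made up.

```typescript
// Sketch: detecting a fan-out / N+1-style regression from exported span data.
// The Span shape and baseline values are hypothetical; real span exports differ
// per observability vendor.
interface Span {
  traceId: string;
  name: string; // e.g. 'db.query' or 'sqs.consume'
}

// Count how many downstream calls of a given kind each trace produced.
function fanOutPerTrace(spans: Span[], downstreamName: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const span of spans) {
    if (span.name !== downstreamName) continue;
    counts.set(span.traceId, (counts.get(span.traceId) ?? 0) + 1);
  }
  return counts;
}

// Flag traces whose fan-out grew well beyond an expected baseline (e.g. yesterday's median).
function findRegressions(counts: Map<string, number>, baseline: number, factor = 2): string[] {
  return [...counts.entries()]
    .filter(([, count]) => count > baseline * factor)
    .map(([traceId]) => traceId);
}

// Example with made-up data: a baseline of 5 calls, one trace suddenly makes 50.
const spans: Span[] = [
  ...Array.from({ length: 5 }, () => ({ traceId: 'a', name: 'db.query' })),
  ...Array.from({ length: 50 }, () => ({ traceId: 'b', name: 'db.query' })),
];
console.log(findRegressions(fanOutPerTrace(spans, 'db.query'), 5)); // -> ['b']
```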
The other thing: looking at the other example, there were step functions. I think with step functions you're talking a lot about almost like a business process; you're basically modeling a process. And what we do here now, in our terminology, is we talk about business events, but it might not be the perfect term.
But what we basically try to do, we try to model an end-to-end business process that can take a minute, it can take an hour, it can take a day, it can take a week.
And we're basically trying to fetch the phase where this process is in as an event.
So it's the same thing, right? I mean, you have an event, you know, an order comes in, the order gets sent to, I don't know, checking the payment, when the payment is completed then it goes into order fulfillment, it gets shipped, and then at the end the customer says, I received this package. And so what we are basically looking into is how many instances of these processes are executed,
how long does it take to go from step to step?
Does something change over time, right?
Do certain times between steps all of a sudden become longer?
Does this correlate with the workload or with the volume that comes in? Does it change maybe with some attributes on these transactions, so maybe on an order? Because we recently had a podcast with Mark Forrester from Mitchells & Butlers, they are a restaurant chain in the UK, and they basically monitor their whole food delivery end to end with this, right? And it's their systems, but it's also the Uber Eats of the world that are doing the delivery, and then also including the payment and the delivery. And they basically, you know, monitor every step along the way and then figure out where things may go wrong. And maybe Uber Eats had a bad day today, and therefore all of a sudden, whenever Uber Eats was used to deliver food, there was a bad experience and people ended up not paying for it in the end because they gave a bad rating, and things like that.
So to sum up, I think when we talk about step functions and you have longer processes that can take much longer than just a second, like a traditional transaction, you want to think
about what are the individual phases of your business process and how can you monitor every
phase, how many transactions actually go through, and then also detect critical parameters that
tell you, well, in this case, this transaction is just buying a burger.
In this case, this transaction is buying a much more complicated item that may also take longer to produce. And so I think this is where you also need to have some domain knowledge to figure out how you detect certain patterns and not just put every transaction into the same bucket. That's kind of what I'm trying to say.
That sounds really exciting to me, to be honest.
I wanted to interrupt you a few times there,
but I didn't.
I think it was really good.
By the way, business events is an ideal name,
I think, in my mind.
I think it's a really, really good term.
I have loads of questions for what you said,
but if I understood correctly, what you're saying is that at the center of what you do are the business events, and you pay attention to those; you visualize and monitor them, and you get insights for things like time between things, events that have failed.
And I think absolutely this is what is missing
from most observability vendors, to be honest.
And I think that, especially with serverless,
you have the opportunity to stop worrying about servers and VMs
and you start worrying about business transactions.
And what's interesting with business transactions
is in a distributed system that we typically build nowadays,
like a transactional distributed system, I guess,
you don't care only about the server request response.
But you want to know about multiple hops.
And I think what you were saying there about the particular example
is that there's a lifetime of an order.
And it's interesting
because I've just moved
into the restaurant tech industry.
So I'm very interested in that example.
But there's a lifetime of an order
and you want to have visibility of that.
That was the number one thing
back when I was working at Cinch
that executives and directors cared about
and the teams cared about ultimately.
How do I visualize?
Okay, I've got all these Lambda functions.
I've got all this, what they call choreography. With choreography, they mean that because you're using an event-driven architecture, and systems are integrating with each other through events rather than synchronous API calls, you're hoping that the whole process falls together beautifully. But you have no way of, well, you have no easy way of verifying, as you were saying with the architectural validation maybe, that that process has worked. So, I guess I've got two questions. Is your answer to that that you have this concept of business events, and you build your practices and your instrumentation and your dashboards and everything from that perspective?
And if yes, what are the telemetry data types that you use?
So the way this works is, right, you want to, first of all, kind of almost sketch out your business process end to end.
Like, where does it start?
I guess typically with an end user that is trying to do something.
And then how does this business process evolve until that process is completely fulfilled?
And coming back to that example with Mark Forrester, only a small piece of the whole food delivery chain is stuff that they actually built.
Most of it is third party components.
And so they were basically sketching out the whole end to end business process.
And then we were trying to figure out how we can get the individual phases.
How can we monitor this?
So some of this can be done by looking at traces or at logs. People can also, especially SaaS vendors, push data to our observability platform. So, for instance, this was one of the things that I liked so much about that conversation, because I asked him how he convinced the SaaS vendors that he's working with, right, because most of the services that they consume are other APIs.
And then I was asking him,
how did you convince vendor A, B, and C
that they send you their logs and traces?
And he said it was very easy
because when we have a problem
and we don't have the insights,
then we open up a support ticket with them,
which means we are binding people on their end
and on our end to try
to figure out whose fault is it, why the food was not delivered.
Was it our mistake because we couldn't make the API call or it didn't reach their end?
Or it was a problem on their end because they couldn't call back?
And so he said, once we established that it's a benefit for both sides to get insights,
it was very easy, right? And in the end, they now have all the data
and can solve technical problems on their end
because now they know that there are problems.
And if the problem is on the other side,
they have the proof and can say,
you know, we have done the root cause analysis.
We know that when you are making a call back to us
to figure out whether we still have burgers in stock, do we still have this and this food item, sometimes our API fails. This is
one of the scenarios that he talked about. Sometimes these food platforms, they were
showing an outdated stock supply, right? They showed that this restaurant still has it, you can still order this, even though the restaurant didn't have it anymore.
So when the customer then went through the process, in the end, at the end of the order process,
the system returned an error because when they made the final call, they said that we're out of
stock. And so this is an interesting piece, right? And so how do we, to answer the question,
how do we get this data? Right now, the way we do it, and I'm sure other observability vendors are maybe doing this as well, is we can ingest it, A, through our agent, B, through logs. We can also ingest this from the front end, so we also have an agent, or a JavaScript library, on the front end if you're building web applications, so again we can also capture it there.
Brian, did I forget anything?
You can just send events to API.
Yeah, we have an API where you can just send events.
So this is where SaaS vendors can just send it.
Now, the critical piece is you need to have some unique identification that kind of tells us that all of these events somehow belong together, right?
And that's where typically you have something like an order ID or something.
Customer ID, order ID.
Yeah, some ID. But we can take all of these individual events and we categorize them phase
one, phase two, phase three, phase four, and then we can stitch them together if you want
to know the whole process for one order. But then we can run analysis and say, hey, that many orders came in,
that many made it successfully
to the end, and then we can fan
it out and say, hey, at the food delivery,
we have five food delivery people
or companies, and we have a higher
error rate when we use delivery
one versus two versus three. And so these
are the things we can do.
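As a rough sketch of what that could look like, the snippet below sends one such business event to a hypothetical ingest API; the endpoint, token handling, and payload fields (order ID, phase, timestamp) are assumptions rather than any specific vendor's API, and the key idea is simply that every event carries the same order ID so the phases can be stitched together.

```typescript
// Sketch: emitting a business event for one phase of an order's lifecycle.
// Endpoint URL, auth, and payload shape are hypothetical. Requires Node 18+ for
// the global fetch.
interface BusinessEvent {
  orderId: string;   // the identifier that ties all phases of one order together
  phase: string;     // e.g. 'order.created', 'payment.completed', 'order.shipped'
  timestamp: string; // ISO 8601
  attributes?: Record<string, string>;
}

async function emitBusinessEvent(event: BusinessEvent): Promise<void> {
  const response = await fetch('https://observability.example.com/api/v2/events', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OBSERVABILITY_TOKEN ?? ''}`, // placeholder
    },
    body: JSON.stringify(event),
  });
  if (!response.ok) {
    // Don't fail the business transaction just because telemetry could not be sent.
    console.warn(`business event not accepted: ${response.status}`);
  }
}

// Example: the payment phase for one order.
emitBusinessEvent({
  orderId: 'order-12345',
  phase: 'payment.completed',
  timestamp: new Date().toISOString(),
  attributes: { deliveryPartner: 'partner-one' },
}).catch(console.error);
```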
And you know, it doesn't get more serverless than
tracking shipping.
Yeah.
So I'm very interested in this approach, because I'm happy you said order ID, because that's the one thing that I was going to say: you need something to stitch it together. So it's the difference between tracing a request, which is a request from kind of the boundary of an API, or even the front end if you link it with the real user monitoring, and tracing a particular order.
And the trouble you have as a customer
with these types of things is that
it's good practice to have some top-level tags,
if you like, call them tags for now,
that are used across all the teams.
And in this case, as you say, across companies.
So you have to have a naming convention
and you have to stick to that naming convention.
And if you don't, then all of that kind of doesn't work that well.
But ultimately, as a user perspective,
what I'd really, really like is if the vendors,
the observability vendors, helped with that process
of basically having that convention but adhering to it,
elevating the top-level metrics, top-level tags,
making them a really, really important aspect
of the observability telemetry data.
I think that would be really useful.
And also the events themselves
and that whole choreography,
if that could be visualized
in something different than flame graphs
or a list of spans,
is also really useful.
And just the last thing, very, very business-related.
What's interesting with what you said
about the two companies realizing
that it's more efficient and effective to collaborate
is if you look at it from the other perspective,
the restaurants or the hungry person ordering food
doesn't care whose fault it is.
No, of course not, yeah.
They say, I don't care whose fault it is, just figure it out.
Just fix it. I want my order.
Or you've lost my order.
Where is it? Things like that.
So it's super interesting.
It's 2 a.m. Where are my wings?
Exactly.
So the other, you know,
and totally you brought up another interesting point
that I didn't think,
I think I read it into your comment,
but like with the idea of the order ID again, right?
If we go back to the very beginning of our conversation where we had this idea of the cloud vendors opening up the API because people needed this information, right?
So it was like a market demand, if you will, and also cooperation. Not only is it cool that these companies are
working together to do this, but when you take something like an order ID, we might find
ourselves in situations where either the originating vendor, the money generating vendor,
in this case, it would be the restaurants generating money for the Uber Eats and the
stock suppliers and all that kind of stuff. Or it could even be something like the observability vendors who are getting them to change the process.
Because I can imagine, and again, I'm just using Uber Eats as an example, so I'm not picking on them.
Uber Eats, someone goes in and puts an order in through Uber Eats.
Uber Eats has their own order ID.
Maybe that gets sent to the restaurant.
Maybe in the restaurant, they're tracking it with
their own order ID that they put back to that. Now that's going to be very inefficient and hard
to go through. So if the restaurants then make a modification so that they can observe end to end
and have all this information by saying, we're going to take whatever order ID our delivery
service is sending us and use that in our system and unify that piece of data throughout this entire process.
Now you open it up so that all these parties can work together and then leverage the observability
platforms or any other thing that they want to do to track this.
It's interesting that this might drive inter-company cooperation for the purposes of improving
their services and making more money, which I just find fascinating. And yeah, it's crazy.
It's crazy to me.
Yeah. Just to add to that, by the way, you can use Flipdish if you want, which is the company I've just started at. You can pick on them.
So I find it very interesting because one of the things that we do at Flipdish
is we, we surface the events as they happen to the customer in a UI.
And that's something that's custom implemented separately as part of the software.
I've always thought that there's a very close connection between what we do with observability data, which is almost like it's metadata.
It's data about the software
system.
But we put so much work in it and we put so much methodical and systematic work into instrumenting
our code.
That's not very far from then surfacing these type of things.
If we had that list of events in your observability vendor, then it shouldn't be very hard to then
display that in a different context for a more operational aspect. I know that's mixing a bit
responsibilities, but you're more likely to be more accurate, maybe, or you're looking at the
same data. So yeah, it's just an idea that I thought of. I always feel it's really quite difficult that we have to build our own thing, but we also have things in our observability vendor to understand our system. It feels like an overlap.
Yeah, one quick thing, Brian, on your idea, because what you were saying is that you should kind of encourage different organizations that work in a similar field to come up with data standards, right? That's in the end a data standard. But data standards have existed for many years, right? I mean, I remember when I was in high school, and it was in the 90s, we had our main teacher in class, he came from an industrial background, the steel industry, and he told us about systems that were exchanging messages in a data format that had been standardized for I don't know how many years even before that.
So I guess now it's more and more non-traditional.
Industries go into IT and need to communicate with each other, right?
Maybe this is another kind of point in time
where we see an explosion of new data standards,
but business data standards.
How can we track an order?
How do we track delivery?
So that everybody that participates in the end-to-end workflow can easily participate.
And that also then allows you to easily switch to one provider versus the other
because you know that the API is still the same.
In the end, it's about APIs and data standards.
I don't know what it is, but there is something for food menus, for example, for restaurants.
I found out, basically.
Yeah.
I'm sure people are, the businesses that are going to do best have already started tackling this.
I mean, that's the example Andy was doing.
The fact that they were able to use that, that order ID means that a lot of them are tracking that the proper way. So it doesn't, and at that point it's nice too, because it doesn't even become a stretch
to do it. It's not like, oh, now we have to go through and say, it's like, if there is that
standard established, it's, you know, I don't want to say simple, but it's a heck of a lot simpler
to be like, yeah, we can expose that as opposed to, you know, standard queue monitoring is how many are going in and how many are coming out.
And there's no tie.
It's just looking at the volumes in and out.
Whereas if you have that ID going through,
you're now seeing when specific ones are going in and out,
which is the big jump, right?
That's what we need.
Hey, quickly back to some of the observability things that I would like to ask you. One of the questions that came up in our community, because I do run a working group within our customer base, and questions that came up: how do you correctly do, is there any standard on when you're rolling out a new version of a function
and you're rolling out a new version
that is providing some additional capability?
Is there any standards already
on kind of progressive delivery?
Is there something on, you know, deploying the serverless function but not yet releasing it, maybe through canary deployments, feature flagging?
Is there any, are there any best practices out there
on how you deal with when you switch over
from one version to another, and also how do you retire it?
Because you want to eventually also get rid of functions
that maybe no longer need it.
Have you experienced with this?
Unfortunately, I've got a bit of a non-answer to this.
We tried a couple of times, at least that I know of,
to implement a progressive delivery with Lambda functions.
I don't know if there's a way to do it.
I don't know that if we actually found a way to do it,
but from recollection, we mostly didn't find a good way of doing it.
And I think one of the problems,
I can't remember the specifics of the problem,
but it was something to do with versioning.
But what we did do was deploy Lambda functions with feature flags, an implementation of feature flags that we created with things like Parameter Store and stuff like that. So you would read from Parameter Store and decide whether you open up the functionality or not.
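As a rough sketch of that pattern, assuming the AWS SDK v3 and a hypothetical parameter name, a Lambda might gate new functionality like this:

```typescript
// Sketch: a hand-rolled feature flag read from SSM Parameter Store inside a Lambda.
// The parameter name and flag semantics are hypothetical; caching the value keeps
// the lookup from adding latency to every warm invocation.
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});
let cachedFlag: boolean | undefined;

async function isNewCheckoutEnabled(): Promise<boolean> {
  if (cachedFlag !== undefined) return cachedFlag; // reuse across warm invocations
  const result = await ssm.send(
    new GetParameterCommand({ Name: '/flags/new-checkout-enabled' }) // placeholder name
  );
  cachedFlag = result.Parameter?.Value === 'true';
  return cachedFlag;
}

export const handler = async () => {
  if (await isNewCheckoutEnabled()) {
    // new code path, deployed but only active once the flag is flipped
    return { statusCode: 200, body: 'new checkout' };
  }
  return { statusCode: 200, body: 'old checkout' };
};
```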
And we considered more managed feature flagging services as well, but
we never actually implemented them. So we spent at least three or four years evolving a serverless
architecture without any of these progressive delivery of functions. I guess my comment on
that is probably that that's a signal in itself. We never really had a problem.
And when you do have a problem, it's easy to roll back.
Again, it's a paradigm shift.
It's different.
You sort of want a progressive delivery if you have a big thing that you're deploying
that takes a long time to deploy and a long time to roll back.
Or it's hard to figure out what's wrong.
Obviously, you don't want to break things in the first place.
So, yeah, so in a way, we don't have a way
to progressively deploy serverless function.
But then you have, I guess, the resilience of,
in our case, we have the resilience of queues
and event-driven architecture.
So, again, it's a paradigm shift in a way.
So in practice, progressive delivery
became more of a nice-to-have.
Yeah. I mean, I've been talking recently a lot about
feature flagging for multiple reasons. One, we've launched open feature as a standard last year,
as a CNCF open source project. And therefore, I'm just a little bit more familiar now with all the use cases.
And for me, feature flagging was, for me, most of these use cases that I had in mind
was like maybe some A-B testing.
You turn on a feature for an end user, and then you figure out if it works, and then
you give it to somebody else.
But what I also recently learned is, and this was in a conversation that I had with a customer,
and he said, for us, the biggest adoption for feature flagging,
this could also work well in the serverless world, is we're building new functionality
that assumes a certain backend system to be available, or to have been upgraded, a backend system that maybe we are not controlling, a third-party vendor, right? Let's say a third-party vendor promises a new API, so we are implementing new features based on that new API, but we don't know if it's really going to be ready next week, next month, or next year. But we don't want to just keep the code lingering around and never deploy it, because there's a lot of cognitive load on our developers, because they're always fearing, hey, a month from now, will the code that I wrote a month ago finally be released
or not? And I still have to remember it. So what they do is they're using feature flags to really
be able to deploy new code, but not activating it until that point when the backend system is ready.
And then they can flip the switch and let's say the backend system all of a sudden breaks
and fails and doesn't do what it's supposed to, turn off the feature flag, go to the old
system, go to the old API.
Another example that I also learned: if there are particular times when you are allowed to do certain things as an organization, I don't know, maybe you have a marketing event and you're selling things from one to two for a special discount, you can also use feature flags to maybe show a special banner, but you don't have to deploy at one and then roll back an hour later. And I think these are interesting use cases for feature flags
as well. And the first one that I mentioned could be
very interesting also for serverless functions. So you're deploying a new version that has new code,
but you're still kind of holding it locked behind the feature flag, and then you just turn it on
when the time is ready. Yeah, I think the use case that I was very interested in, if you use a managed service that evolved these things
and made them work really well,
then you get this idea of releasing things to a group of people,
maybe in your company, and that's in production,
and that's exercising production code.
So releasing things to segments of people is really, really powerful.
But unfortunately, I don't have experience with that,
so I can't really comment.
No worries.
Another question that I have for you
is the topic of cold starts.
I'm not sure if that's still as big of a problem
as it used to be in the early days of serverless.
So I'm just throwing it out to you.
Cold starts, was this ever a challenge?
Any considerations on it? Any best practices?
I get this question a lot from people that haven't got serverless experience.
Bam, Andy, you just got smacked.
I'm kidding. I'm kidding too. I'm just kidding. Yeah.
So don't ask that question again.
No, I mean, so I can see the problem because my background is in .NET. So as a .NET developer, you don't want to put the .NET runtime
on a serverless function, or you do,
but then it's typically more for more async jobs,
more back-office jobs, integrations, for example, that are
not that time sensitive.
But then people think, well, if you're running a website with a backend being functions,
then you will basically hit cold starts and you'll have a very slow system. Our experience was that our system got faster
when we got a lot more traffic,
which is what you expect from cold starts,
and it got more efficient.
But our base response time was never a problem.
We never really had any situations
where we had problems with cold starts.
I think, obviously, our configuration was more Node.js.
And I think with Node.js, you don't have that many problems.
Interestingly, I'm just going to reference someone who I collaborated with at Datadog, AJ.
He's a serverless product manager.
He wrote a great blog post about SDKs,
like what SDK you use, what AWS SDK you use,
and how you use it might affect the duration of your Lambda function,
for example, things like that.
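Without speaking for that post, one commonly cited example of this effect is how the AWS SDK is imported: pulling in only the modular v3 client the function needs and creating it outside the handler, so warm invocations reuse it, tends to keep bundle size and initialization work down. A minimal sketch, with a placeholder table name:

```typescript
// Sketch: importing only the modular AWS SDK v3 client the function needs, and
// creating it once per execution environment so warm invocations reuse it.
import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb';

const dynamo = new DynamoDBClient({}); // initialized once, outside the handler

export const handler = async (event: { orderId: string }) => {
  const result = await dynamo.send(
    new GetItemCommand({
      TableName: process.env.ORDERS_TABLE ?? 'orders', // placeholder table name
      Key: { pk: { S: event.orderId } },
    })
  );
  return { statusCode: 200, body: JSON.stringify(result.Item ?? {}) };
};
```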
But that's less about cold start.
So you start thinking about these type of things.
So maybe the telemetry libraries actually, interestingly,
might be a bit more of a problem than the runtimes, has been my experience.
But overall, I'd say that we didn't spend three, three and a half years talking about cold starts.
So it was a bit of a non-problem for us.
Okay, cool. And then, because last time we talked about kind of, you know,
engineering best practices, how do we become an observability-driven organization?
Specifically on serverless, if people are listening in and they also think,
you know, serverless is the right choice for us, what do you need?
Any other things besides obviously having them to listen to the previous podcast,
but any additional things that you want to tell people that really makes them successful
with serverless, especially around observability?
What do you need to do in order to make sure that developers put in the right level of
observability from the start?
What do you need to provide as an organization to make this easier for them?
I've just had a talk at KCD Munich,
Kubernetes Community Days.
I talked about platform engineering.
What can platform engineers do
to provide guidance templates
to just make the developer's life easier
and more efficient?
That's a really good question.
I think it all depends on your context, obviously,
but overall, you need the tooling, right? And I think you mentioned it earlier that you want
to strive to have one single pane. So in this case, you want to have one observability vendor
rather than have multiple ones and have fracture planes across your understanding of your systems,
across your organization. So the first thing to think about is what observability vendor you'll go for.
And for serverless, there are options.
There's loads and loads of options that are coming up,
and they're a bit more traditional ones.
Vendors that come to mind are Lumigo, Epsagon, Thundra, I think, as well. Newer ones: Baselime is one that I've noticed has kind of come up a lot.
So you can choose one of those,
which is very, very serverless.
They're all serverless-centric,
and they'll give you a very good view; all the UI will be focused on serverless concepts and the serverless paradigm. However, it depends on what you have, because if you have other things that you also support that are not entirely serverless, then you might want to look at the more traditional vendors who also do serverless.
So that's the number one I'd say,
but you have to choose,
you have to think about it a bit carefully.
The other thing is, for observability, specifically for serverless, how you can help engineers to understand. I think you need to do two things. You need to do custom instrumentation, which is the same as in the non-serverless world: whether you're going to use tracing, or logs, or metrics. Well, I prefer traces, but ultimately, as long as you are consistent across the teams, or with what you do within your team, then it doesn't really matter. If metrics work for you, that's fine. The thing that's a bit different with serverless is the metrics from the managed services: make sure you familiarize yourself with the metrics that they give you, and potentially the logs that they give you, because they're really, really important.
You rely on them as much as you rely on your code.
So although they mostly work, you want to have visibility of those.
And I don't know, I think it's a bit funny to say,
but it's not observability related,
but consider your architecture a bit.
Are you building a serverless system that's mimicking your serverful system, or are you shifting your mindset to a more event-driven approach, to an approach where you are building small building blocks and you can observe those, but ultimately they're part of a bigger choreography or part of a bigger orchestration? And then you start having questions like, where do you do orchestration?
Where do you do choreography?
I think those questions are really important
before you think about observability.
And obviously when you get to that,
the biggest thing is, I think,
going back to what you said,
try and find a way, use whatever telemetry data you're using, but try and visualize the steps between order created, for example, and order fulfilled, visualize it in a dashboard, and start understanding how that operates. Find the gaps, find the problems, go as quickly as you can to the health of your business transactions. Because serverless enables you to not care about, okay, you care about memory, but you don't care about CPU, you don't care about the traditional things that we traditionally look at, but you start caring about your business, and that's the most powerful thing. So serverless allows you to think about your business transactions in a non-traditional way, because you're not bound and constrained by memory and CPU in the same way. So we think differently about it.
Thank you so much.
No, so that's why I'm very happy that sometimes I get comments like you shouldn't ask this
question because this tells me that you are not an expert.
Like I should not talk about cold starts anymore,
but I'm actually happy that I do because that's why I learned so much.
I know it's amazing how time flies.
Do we have any other things that we need to discuss?
Anything we missed that is important?
Anything that made you successful in your previous job
and will make you successful now with Flipdish?
I think the most important thing,
if people have not listened to the previous episode,
is that observability needs to become a core practice
within your teams, within your software engineers.
It's not enough to just buy a tooling.
It's not enough to also instrument and create monitors
and automate all that.
You need to be understanding what
you want to instrument and what you want to observe on the other side.
With serverless, I've got a bit of a plea to all the observability vendors. Think about the serverless paradigm and how it's different. Suddenly you care about invocations rather than uptime.
You care about events.
If your serverless has naturally driven you down the road
of event-driven architectures, you care about events.
Those are your protagonists, and you mentioned them,
the business events.
Make those a protagonist in your UI,
in your UX of using observability vendors.
That will really help the serverless architectures. And look at the language and the concepts and the notation that's used in serverless and adopt some of it, so that reserved, I can't remember what it's called now, reserved concurrency or things like that are more important than scaling out servers and scaling up servers and things like that in the serverless world.
So those are the two things.
Care about observability if you're a software engineer, and if you're an observability vendor, try and make these concepts a protagonist in your UI.
I will bring this to our engineering team. This feedback is really good.
Amazing.
No, of course.
We are here to educate our global community that may or may not use our products, but I think it's great to learn from folks like you that have worked in this field and know what challenges are out there and how tool vendors can do a better job in helping people like you: engineers, architects. And so I'm definitely taking this as great feedback back to the engineering team.
Perfect.
The question now is, right, what's the next topic we're going to discuss on the next episode? And Toli has already been on twice.
The next topic is when you start paying me to be a guest.
We'll look at our budget, which is zero, and we'll give you 50% of our operating budget.
Yeah.
Amazing.
And I'll just start shooting down all your questions then.
No, it's been great fun. I've learned so much. It's really, really good to connect with other observability nerds, I guess. Sorry, I didn't mean to call you nerds.
No, no, no.
That's what you are.
And yeah, I really, really hope that serverless observability just becomes better over time, because it is really, especially the step functions bit, it's just mind-blowing really. So yeah.
Yeah.
Awesome.
Yes, thank you so much.
I've got nothing to add. This was just such a jam-packed episode that, yeah, the only thing I can add is thank you so much for being on again. It's been a great pleasure to have you back, and we hope our listeners feel the same. And, you know, I do have one thing I take away, and it's again that community that we have here in the IT world, the modern IT world. Obviously it built from the past, I mean, we can see all the evidence of where this came from. And as long as we keep driving towards sharing information and working with each other, I think this will continue to be a great community to work in. So thanks to everyone for maintaining that. And yeah, that's it. Thanks everyone for listening. Until next time.
Thank you.
Bye-bye.
Thank you.
Bye-bye.