In The Arena by TechArena - Solving the Internet’s Single Point of Failure: CDN Resilience
Episode Date: April 9, 2026
Cloud expert Venkata Gopi Kolla joins Allyson Klein to discuss the CDN "single point of failure" and a new IETF protocol for sub-second edge recovery and AI correctness. A must-listen for infrastructure leads.
Transcript
Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to In the Arena. My name is Allyson Klein, and I am really excited for today's episode.
We've got Venkata Gopi Kolla with us. He is the lead software engineer at Salesforce.
And first time on the program. Welcome, Venkata. How's it going?
Going good. Thanks for having me.
So, Venkata, why don't we just start?
Salesforce is a large company, and there are a lot of roles there.
So can you briefly describe your role at the company and what you're working on today?
So, yeah, I work on a team where we onboard the domains that customers bring into Salesforce on top of CDN systems.
So we work with several CDN vendors.
And especially, I work on large-scale SaaS platforms that rely heavily on CDNs and their infrastructure.
And more recently, AI-driven workflows layered on top of that.
Over time, I started out working on traditional distributed systems platforms that interact with
databases and asynchronous processing with message queues, and then slowly transitioned into edge systems.
Now that edge systems are getting more of an AI flavor, lately that thinking has expanded into
how AI workloads change the stakes around availability, correctness, and how inference happens.
That's awesome.
You know, I was so interested in this episode because inference at the edge
is such an exciting topic for me,
and focusing on solving across CDNs
and edge infrastructure selection
for a particular area is really interesting.
One question to maybe just get started
is what are some of the challenges
you focused on solving across CDNs
and edge infrastructure in your work?
So there are several challenges
teams face with CDNs and edge infrastructure today.
Things like cost optimization, which we are dealing with,
and also configuration complexity, especially as teams use SSL certificates and cache rules
and all those things, and also performance variation.
And observability is another key concern.
But among all the challenges, the one I'm focused on solving personally, and the one
I believe poses the greatest risk to internet infrastructure,
is the CDN single point of failure problem.
So what makes this a big deal is the scale of the problem.
Today, approximately 90% of internet traffic flows
through just four to five major CDN providers.
It is unbelievable, but it is a fact.
For example, Cloudflare takes the biggest piece, about 22 to 25%,
and then it is followed by Amazon CloudFront, Akamai, Fastly, and Google.
These are the CDN vendors that absorb all of that CDN traffic.
It's like 90% of internet traffic goes through these CDN systems.
So this extreme concentration creates a massive single point of failure that affects billions of users.
Imagine if 90% of global shipping went through just five ports.
When one port closes, thousands of ships have nowhere to go,
even though their cargo is fine and their warehouses are operational; there is still nowhere to go.
That is the same problem we are dealing with today in the internet world.
We saw this play out dramatically a couple of months back, in November,
when Cloudflare experienced an outage for almost a couple of hours;
thousands of websites and services became completely unreachable.
For example, ChatGPT
and Discord went offline, and Shopify stores were not reachable.
And the critical part here is that none of these companies' servers actually failed.
Their infrastructure was running perfectly fine,
but because the CDN layer sitting between the users and those servers
stopped working, the services became totally unavailable.
And the estimated economic impact was in the hundreds of millions globally.
You might simply ask why companies can't just avoid CDNs.
First of all, performance: as you all know, CDNs sit close to users, moving computation toward them, and that reduces latency.
And second is cost.
You may not believe it, but interestingly, it costs less to serve each byte through a CDN compared to the origin servers.
And the third aspect is more of an unavoidable thing, and that is security,
like DDoS attack mitigation, bot protection so origin servers aren't overwhelmed, WAF protection, rate limiting, SSL termination.
So CDNs provide these things as out-of-box solutions, and building these things at origin
server would cost millions.
So companies must use CDNs.
There is no practical alternative.
So, going back to your question, the one I am developing is the CDN Resilience Protocol.
It is a fundamentally different approach that pushes failure detection and recovery to the client level.
Yes, it sits at the client level.
So here is how it works.
So instead of relying on centralized monitoring and DNS updates, we
embed intelligence directly into the client, like browsers, mobile apps, API clients. These
clients are configured with both a CDN endpoint and an origin fallback. When a request to the
CDN fails, the client immediately retries against the origin server, the secondary one.
No DNS propagation needed, no human intervention. Recovery happens in milliseconds.
So the protocol includes fast failure detection at the client level, instant failover that
bypasses DNS entirely, automatic recovery when the CDN comes back online, and zero infra
changes required.
And it's universally compatible across any CDN provider.
So I'm working on building an open source implementation with a JavaScript client library
and in the process of formalizing this as an RFC for IETF standardization.
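To make that concrete, here is a minimal sketch of what such a client library could look like. This is not the actual open-source implementation or the RFC wording; the endpoint URLs, timeout values, and function names are illustrative assumptions.

```typescript
// Minimal sketch of client-side CDN failover (hypothetical endpoints and timings).
// It illustrates the idea only: try the CDN first, and retry against the origin
// when the CDN request fails or times out, with no DNS change involved.

interface ResilienceConfig {
  primary: string;   // URL served through the CDN
  fallback: string;  // direct-to-origin URL
  timeoutMs: number; // aggressive timeout used to detect CDN failure
}

async function fetchWithTimeout(url: string, timeoutMs: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

async function resilientFetch(path: string, cfg: ResilienceConfig): Promise<Response> {
  try {
    // Attempt the CDN endpoint first with an aggressive timeout.
    const res = await fetchWithTimeout(cfg.primary + path, cfg.timeoutMs);
    if (res.status < 500) return res; // treat 5xx as a CDN-side failure
  } catch {
    // Timeout or network error: fall through to the origin.
  }
  // Immediate retry against the origin: no DNS update, no human intervention.
  return fetchWithTimeout(cfg.fallback + path, cfg.timeoutMs * 3);
}

// Example usage with made-up URLs:
// resilientFetch("/api/catalog", {
//   primary: "https://cdn.example.com",
//   fallback: "https://origin.example.com",
//   timeoutMs: 500,
// });
```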
So why this matters now?
The goal of this work is to make CDN resilience practical and affordable for companies of
all sizes.
Because a couple of weeks ago, when the Cloudflare outage happened,
I'm pretty sure most companies, 90% of companies,
had this question pop up:
hey, what are we doing about the CDN?
What is our plan B when the CDN goes down?
And this is the solution which I'm proposing as a standard.
It is not just for enterprises.
This is to establish an open standard that the entire industry can adopt
because this is not just a technical problem.
It is a critical infrastructure vulnerability that affects everyone on the internet.
That's amazing.
And I want to unpack that with you.
I want to back up a little bit.
To put it in perspective, you've been working in the CDN space for a long time,
and what you just said is phenomenal in terms of solving that single point of failure.
Yeah.
How do you see CDNs evolving to address the AI era?
And what do you see changing?
I mean, I think everybody lived through that Cloudflare outage.
When you were talking about it, I was remembering what was happening to me.
And it was frustrating.
But you can imagine, for mission-critical applications, it becomes, you know, really threatening to some people if they go down.
Obviously, we can live without ChatGPT for a few hours, but that's an annoyance versus something that's risky.
Can you talk a little bit about how the CDN community has evolved CDNs to address this moment in time?
So I'm glad you asked.
CDNs actually have a very interesting evolution trajectory.
Right when they started, CDNs were just pure
performance optimizers; it was mostly about caching, right?
They started as, hey, we are here to cache your resources. For example, your website is in California
and users might be in Tokyo; they experience high latency because the request has to travel
across the globe. The solution is caching static content like images, CSS, JavaScript,
on servers geographically closer to the users. And at this stage, CDNs were purely optional.
If a CDN failed, your site would just be slower.
Users would still reach the origin servers with a degraded experience.
But here comes the next phase, where security evolved; CDNs evolved to handle more dynamic content later.
This happened in the 2000s, with TCP optimization, compression, smart routing for requests that could not be cached.
And more importantly, this is when CDNs became essential for security.
For example, DDoS attacks got larger and more sophisticated,
and most companies could not afford the infrastructure to absorb those attacks.
CDNs became a security perimeter for DDoS protection, WAF, bot detection, embargo blocking, and rate limiting.
And this is the point where they pivoted from optional to mandatory.
And the next phase is even more interesting.
And this is my favorite phase.
This is where the edge became an architectural driver.
So think of Google Maps, right?
Before it existed, people planned their entire route before leaving;
they would look at the map, write down the instructions, and remember the landmarks.
But after Google Maps, it not only made directions simple,
so they don't worry about all the things they used to worry about,
but people also started virtually visiting places.
People who don't have time use Google Maps to see how a place looks.
Before going physically, they just check out a restaurant's exterior,
or whether the place has parking, all those kinds of things.
They also see how crowded a beach looks.
So Google Maps did not just make navigation easier;
it also created an entirely new behavioral pattern for people.
That's what CDNs did for our internet traffic and for how we build app servers.
If you take the CDN caching advantage, that advantage drove an industry-wide architectural shift.
Companies using client-side rendering, loading an empty HTML shell and rendering everything in the browser, switched to SSR
and SSG specifically to take advantage of edge caching.
The same pattern played out with image optimization,
bot detection, CAPTCHAs.
These services moved to CDNs,
not because origin servers could not do them,
but because doing them at the edge was 10 to 20 times more efficient.
It is less expensive, and it is also faster.
What's critical here is that
CDN capabilities started driving architectural decisions;
teams were not asking what is the best architecture
for your app. They were asking what architecture takes maximum advantage of the CDN cache.
So the infra layer was now shaping the application layer. And then the current phase,
which is fundamentally different from everything before. There are two major shifts here.
The first shift: AI inference workloads are moving to the edge, and companies want to run
models close to the users for lower latency, better privacy, and also reduced bandwidth costs.
Spam detection, content moderation,
image classification, and personalization engines,
these are all examples of things I predict will move to the edge
pretty soon.
And the second shift is more profound.
AI is changing how applications depend on the infrastructure.
Traditional applications make discrete requests: a user clicks a
button and the request goes to the server.
You get a response.
If infra fails, the request fails and the user sees an error.
The blast radius is limited to that specific interaction.
But AI systems don't work that way.
A single AI-driven interaction triggers multiple backend operations.
It's like talking to multiple agents or talking to different MCPs,
in parallel or in sequence; they all happen.
Data retrieval from multiple sources, inference, tool calls, external services, validation, synthesis.
And these operations depend on each other.
So here is what really changes with AI, right?
Failures now affect correctness, not just availability. Because so far we were only thinking,
hey, the CDN is not there, availability is not there; at least things were consistent when they
were available. But now it's a matter of correctness also. Consistency also comes into the picture.
With AI systems making autonomous decisions, partial failures create a more dangerous problem.
An AI agent pulls data from five sources, runs inference, takes action. If some of those data
sources are unreachable because edge infra failed, what happens? Does the agent proceed
with incomplete data and hallucinate, or retry aggressively and amplify the infra problem? There is no
good answer, and each choice has consequences. So overall, the stakes have changed. A CDN
outage used to mean slow websites, but slowly it has transitioned into broken
automation, incorrect decisions, exposed security. So the fundamental shift here is in
how we think about edge infra, about edge infra resilience.
That is the most important thing now, to avoid all of this.
Now, I think you brought up a couple of key points about some of the assumptions
teams are making about CDN deployment.
Can you talk a little bit about your view on the most dangerous assumptions teams can make
about reliability and availability?
Sure.
So actually, I'll start with my own assumptions, the ones I had when I started working
with CDNs early on, right?
The first assumption was that the CDN won't fail, because if you look at the SLA CDNs give, they give 99.99% uptime, which is four nines. Is that really good, four nines?
So yeah, on paper, 99.99 is really good. But let's do the math, right? 0.01% downtime means about 52 minutes of downtime per year. And here is the critical part. Those 52 minutes do not
arrive evenly distributed. It's not four minutes per month. You get zero downtime for 11
months, and on one busy day, a mission-critical day for you, you get all 52 minutes of
outage during the busiest shopping day. So think of it like earthquake insurance in California,
right, which we all deal with. A 99% chance of no major quake this year sounds safe. But
earthquakes don't arrive evenly. You get nothing for 30 years and then a big one. CDN outages
work the same way.
So that's why the first assumption is a bad one: CDNs have 99.99% availability, but
don't assume they never fail.
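As a quick back-of-the-envelope check (my own arithmetic, not from the episode), the downtime budget for a given uptime SLA is just the complement of the uptime fraction applied to a year:

```typescript
// Illustrative arithmetic: yearly downtime budget implied by an uptime SLA.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeMinutesPerYear(uptimePercent: number): number {
  return MINUTES_PER_YEAR * (1 - uptimePercent / 100);
}

console.log(downtimeMinutesPerYear(99.99)); // ~52.6 minutes ("four nines")
console.log(downtimeMinutesPerYear(99.9));  // ~525.6 minutes (almost 9 hours)
```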
And the second assumption
is a little more advanced,
one that people with a moderate understanding of CDNs would make:
origins can handle full traffic.
That is what everyone would think.
When the CDN goes down, technically you can route all traffic to the origin via DNS failover and everything will be fine.
But there is a catch:
origins are sized to handle only 5 to 10% of the traffic.
This is the fun fact: in your application server's web traffic,
only the portion that cannot be cached hits your origin today,
which means around 80% of the traffic is being handled by the CDNs.
So origin servers are not dimensioned to handle it.
They have not evolved to handle 100%.
Yes, before CDNs were in the picture,
you design your origin to handle 100% of the load.
But imagine the internet traffic at that time.
Internet traffic has increased 10-fold or 100-fold since then.
So nowadays, if you are designing an application,
it's very likely you keep the CDN in mind,
and 80% of the traffic will be handled by the CDN.
Only 20% of the traffic will be handled by your origin.
So now, with traditional DNS-based failover,
you make a binary decision saying, hey, my CDN is down,
now route everything to my origin directly for the next
45 or 60 minutes, or however long the CDN is down.
During this time, you are asking your origin to handle almost 10 to 20 times more than its
designed capacity.
And it is also being exposed to the attacks the CDN was blocking.
This is how a CDN outage cascades into an origin outage.
Now you are actually bringing the CDN's problem into your own infra.
Because at least if the problem lies with the CDN, you know that company will resolve it quickly;
otherwise, everybody will complain.
But by letting all this traffic hit your own origin, you are breaking your origin, and now you are the one who has to fix everything.
So that second assumption is one it is very critical not to make.
And the other one is, they say, hey, we have monitoring.
But where does your monitoring run?
So that is the first question I would ask.
Because many teams' monitoring infrastructure itself depends on the CDN.
Your dashboards might be behind the CDN,
and your alerting webhooks might be routing through it.
So I have seen some operations teams, some companies, that did not even
realize the CDN was down for 10 to 15 minutes because their monitoring service was also
impacted. Who will tell you, when the person who is supposed to tell you is also down?
That is why they only found out when customers started calling.
And maybe you say, hey, I'll have an independent monitoring system.
It gives me a centralized view
of whether the CDN is down, meaning I'll put my monitoring service somewhere outside that
does not go through the CDN.
But there is a catch.
This is the interesting part with CDNs:
CDN failures are often regional.
Your monitoring in Virginia shows everything is fine while users in Tokyo cannot reach your service.
And you can't realistically put monitoring services everywhere.
Another interesting assumption is that a CDN failure is just downtime.
Everyone thinks, hey, our company can afford that one hour of downtime.
But there are two mission-critical aspects to it.
One is customer trust.
Users don't distinguish between "our CDN failed" and "the site is down."
The trust erosion is the same.
They just see, hey, your site is down,
and multiple outages will lead to reputation impact with customers.
And the other is security exposure.
During a failure, when you're scrambling to fail over or waiting for recovery, your system is vulnerable.
If you have routed your traffic to the origin, you are exposing the attack surface;
your origin is now the attack surface.
If AI-driven security decisions depend on edge infrastructure, those decisions are not happening.
Attackers know CDN outages create windows of opportunity,
and they keep looking for those windows.
We have seen coordinated attacks targeting services during CDN failures,
knowing defenses are weakened.
Because attackers know for a fact, nowadays everyone knows,
that companies are moving their security services to CDNs.
CDN down means they are weak in terms of security, right?
So now let me tell you why these assumptions persist.
Because CDN outages are infrequent enough that teams can
go years without experiencing one. But as the dependency deepens and outages become more frequent,
the question is not if, but when.
Now, you've made me stressed with all those stories.
The next question is going to make it worse. On top of all of that,
you talked about inference moving closer to users at the edge. How does that change the impact
of failure compared to centralized AI systems?
Yeah. So this is the critical distinction.
It's not very well understood yet,
so I'm glad you brought it up.
We have centralized AI failures and edge AI failures.
They look similar on the surface.
Without the edge, what we have traditionally is centralized AI:
there is a big LLM running, and it answers your questions.
Edge AI failures look similar on the surface,
but the impact, the blast radius, and the recovery dynamics are fundamentally different.
For example, let's talk about how things look in centralized AI.
Failures have very well-defined boundaries in that
case, meaning your chatbot's data center goes down and the chatbot fails. But
your checkout, inventory, and email all keep working.
Each service has independent infrastructure.
You also control the entire stack, so you have visibility. You can fail over to backup data
centers and you can scale. Recovery is pretty straightforward. You fix the issue,
the traffic resumes, the business is back online. But with edge AI, it is compounding, it is
geographic, and it is unpredictable. Edge inference changes everything through three mechanisms.
One is compounding failures. When edge infrastructure fails, everything running on those edge
nodes fails simultaneously: spam detection, content moderation, personalization, fraud
detection, or whatever you keep at the edge, they all collapse together, because
they share the same infrastructure. A customer support workflow using AI for classification,
sentiment analysis, and routing: the entire pipeline stops, not just one piece.
Think of it like this. So centralized failures are like a restaurant running out of one ingredient.
So if that ingredient is out, then you'll just say, hey, we are out of this dish.
But edge failures are like losing power to the restaurant. So nothing works because everything is
shared, the electrical panel is shared. And the second is geographic asymmetry. So this is what
makes edge more interesting and also complicated.
So edge failures are rarely global.
So if you have seen the history of CDN outages,
you never see the entire CDN going down.
Even the one last November
impacted only 60 to 70% of the CDN's services;
the remaining 30 to 40% stayed intact.
So they're not global, they're regional.
So for example, your fraud detection system might be down in Tokyo,
but working in U.S.
With centralized AI, if fraud detection is down,
it is down for everybody.
You make one decision, block all transactions, or accept the risk.
At least things are in control.
So whatever you are allowing it.
But with TTII, you are in a partial failure.
And in software industry, partial failure is the one you hate most.
I would rather prefer full failure.
I'm knowing what happening rather than a partial failure because it is impossible to detect.
Like some users are protected, but others are not.
Centralized systems are often don't know.
So you need regional deficient making.
that most systems are not built for.
And the third one is cascading amplification.
When an edge node fails, retry storms trigger, and everything retries.
Ten AI services all retry against already-stressed backup infrastructure,
so you have just multiplied the load 10 times.
AI agents monitoring data don't gracefully degrade;
they aggressively retry, and they overwhelm whatever is left.
With centralized AI, you can implement centralized
circuit breakers, the traditional way we deal with this.
But with edge AI, agents are distributed across thousands of locations, making independent
retry decisions.
It is very complicated to control them in these scenarios, and the failure amplifies
geographically.
And beyond these mechanisms, you get entirely new problems as well.
For example, the security dilemma, right?
With centralized AI, when it fails, you fail closed: block everything and
recover safely.
But with edge AI, you can't fail closed, because edge infra provides basic connectivity.
You fail open, meaning you allow actions without validation, creating security
exposure.
And when bot detection fails,
do you block all traffic, or do you allow all the traffic?
And then the latency trap.
You move inference to the edge for 20-millisecond latency,
and applications are designed around that.
When the edge fails and you fall back to central inference,
suddenly that 20 milliseconds jumps up to 300 milliseconds.
And some applications functionally break as well, because we have timeouts.
We know how long LLMs take to infer an answer,
and a traditional software request does not wait for more than a certain number of milliseconds.
And content moderation also silently fails to block harmful content before it is visible.
So it's like having a car with failing brakes:
the car seems fine until you really need to stop.
Edge AI's silent degradation looks like it is working, but safety is compromised in many invisible ways.
Now, we know that a lot of teams look at multi-CDN or DNS-based failover to manage risk.
In your mind, why do those approaches so often fall short in practice?
Yeah, so those were actually the first things that came to mind when I was thinking about the CDN Resilience Protocol.
Because, hey, isn't it very intuitive to think, if Cloudflare
is problematic, then I'll onboard two CDNs, Cloudflare and Akamai as a backup?
Or DNS-based failover, meaning at the DNS layer, if it appears that Cloudflare is not
reachable, then I will map the domain to my origin server. Those are very intuitive things. But when we
think it through, here is what actually makes them unreliable options. In theory,
yes, they look good, but let me pick multi-CDN first. Multi-CDN seems a very logical thing,
I totally agree. You use multiple providers, and if one
fails, traffic routes to another. Fine. You've eliminated the single point of failure. Correct.
But what about the cost? You are paying almost two to three times for CDN services.
But honestly, if multi-CDN actually worked reliably, companies would pay the premium. But the real
killer is operational complexity. Meaning configuration drift; this is the first problem.
Every CDN has a different syntax, different caching rules, different purge mechanisms.
You start with good intentions to keep
configurations synchronized, because you have to keep things consistent: if a request goes to a
different CDN, it should not expose different behavior. So let's say
someone makes an emergency cache rule change on CDN A to fix a production issue at 2 a.m.,
but they forget to replicate it to CDN B. Now your CDNs are behaving differently: users hitting
CDN A see correct behavior, and users hitting CDN B see stale content, right? That means you have
created inconsistency in the name of resilience.
And the other thing is, not every CDN supports the same operations.
This is something I have seen in practice
when we are onboarding things across different CDNs.
One CDN says, hey, I only allow file uploads up to 100 MB,
but another CDN says, I allow files up to 500 MB.
If you have promised your customer the functionality of uploading 500 MB,
but your plan-B CDN does not allow 500 MB,
then you are stuck.
That means you are creating behaviorally different experiences;
the customer would not know, hey, you are on a different CDN.
The customer would ask, hey, I was able to upload a 500 MB file yesterday, but today I'm not able to.
What changed?
So that is a real issue, right?
You always need to be consistent in the functionality you support.
And cache coherency.
This is even worse.
When you deploy a new JavaScript file, publish an article, or update pricing, you need to purge the cache on all CDNs simultaneously.
And as we know, cache invalidation is one of the classic hard problems in software engineering,
right? So if CDN A purges but CDN B's purge fails or is delayed, users see different
site versions depending on the CDN they hit. This problem hits e-commerce
sites very badly because it can mess up prices. Seeing the wrong price, like you start with one price,
but by the time you go to checkout, you see a different price. Or you open a page from your
machine and your friend opens it from a different machine and you both see different prices. That is not
what you want to show.
So yeah, because of these kinds of problems, multi-CDN is not practical.
It is theoretical, but not practical.
Now let's get to DNS failover.
This is the other obvious thing, right?
It has even more fundamental problems, to be honest.
Think of DNS propagation like trying to retract a rumor.
You tell the first person, hey, ignore what I said.
But some people heard it five minutes ago, some 20 minutes ago, and some haven't even
heard it yet.
So you have no control
over how fast the correction spreads.
over how fast correction spreads.
So what I'm trying to correlate here is how the DNS propagation happens.
So in my metaphor for the rumor, that's how the DNS caching would work.
It happens at multiple layers beyond your control, like resolvers, operating systems,
browsers, mobile carriers, even with 60 second TPL, real world propagation takes 50 to 60 minutes.
So you have no idea at how many places the cache for DNS happens, right?
from your laptop, in your operating system, in your browser, to your ISP, there are so many layers
where BNSC things can be cast.
So that means propagating is all these caches getting invalidated to find out what is the new
origin IP for your domain.
So that takes almost like 15 to 60 minutes, which is losing the entire purpose.
And the second problem is the partial visibility problem.
Here is what makes this operationally terrible:
you have no idea what's happening
at the client level. You update DNS. Your monitoring says, hey, change complete, traffic is
routing to the new destination, directly to the origin. Everything looks good. But among actual users,
some are still hitting the failed CDN because their DNS is still cached. Some hit the new
destination, and some bounce between both as their cache expires. So you have zero visibility into
what percentage of users are experiencing failure versus recovery. Your monitoring says problem
solved, but support tickets keep coming in for another 30 to 55 minutes. Is there a secondary
issue? Is the problem worse than you think? Or is it just the DNS propagation tail? You don't
know. So yeah, that's why these two techniques, DNS failover and multi-CDN, are not the
answer, although they sound theoretically correct. So you concluded that pushing failure detection and
failover closer to the client was the right decision. Why does that matter so much for fast,
reliable recovery? Yeah. Because client-side failure detection and failover is the only
architecture that matches how CDN failures actually occur. So let me explain why this matters and how
insight. A CDN outage is not a infra event affecting everyone equally. It's millions of individual
failures, failure events happening at a client endpoints around the world. For example, a user sitting in
Singapore makes a request, times out. But a user in London makes the same request. It succeeds.
And no failure, right? And a user in Tokyo and they may again get a failure. So that means the
failure is experienced at a client level in a specific location and also at a specific moment.
No centralizer system can capture this reality because by the time you aggregate the health
checks and makes a global decision whether my server is up or not, the situation has already
changed. It's like having a fire alarm in every room versus one central alarm. So the central alarm
only knows there is a smoke somewhere in the building after average sensor readings and deciding
if it is real. But by then, people in the affected rooms have been searching for minutes.
But individual alarms in each room detect and alert instantly when there is a problem actually
in that room. So it's about when failure detection happens, the same analogy occurs, right?
When the failure detection happens at the client, it's happening at each independent room area.
So recovery happens at network speed or at a human speed.
For example, in a traditional approach timeline.
So at the 0th minute, let's say the CDN fails in Asia.
And at the second minute, centralized monitoring detects.
And at the probably 8th or 10th minute, engineer confirms, oh, it's real.
And probably at 10 or 15 minutes, because you need to run through some approvals at then, you update the DNS.
And then, because the DNA system takes anywhere,
between 5 to 25 minutes to 24 hours, which means probably on an average, it takes at least
1 to 2 hours for DNS to propagate globally.
That means we are looking at 2 hours of outage.
But look at the client-side failover timeline, right?
At minute zero, the CDN fails for a user in Tokyo.
Within half a second, the request times out when they try to open it.
Then the client immediately retries against the origin,
and almost within a second, the user receives the response.
It's pretty much within one to two seconds of disruption for the customer.
And here is what is most critical.
Different users fail over based on when they experience the failure.
The Tokyo user fails over in the first second.
The London user might not fail over at all, because the CDN is never down for them;
they continue using the CDN.
And the Singapore user might experience the failure at the third minute.
It's geographically accurate, automatic, and requires zero human intervention,
and we're able to handle all this traffic with the CDN Resilience Protocol pushing the decision to the client.
So this is what I am proposing to the IETF for standardization, right?
The protocol uses a simple configuration which every client can understand.
We define a primary URL, which goes through the CDN,
and we define a fallback that goes direct to the origin.
Client libraries, like JavaScript, mobile SDKs, or server-side HTTP clients, implement this logic.
They attempt the primary with an aggressive timeout.
If it times out, they detect the failure and immediately fail over to the origin, the secondary one.
No delay, no DNS lookup needed.
And local state management can mark, hey, the primary is down;
let's not bother the primary for maybe the next 10 requests.
Then after 10 requests, let's see if the CDN-based primary is back online.
If it is back, we'll use it on the client side.
If not, we'll continue using the secondary one.
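Here is a rough sketch of that local state idea, reusing the fetchWithTimeout helper from the earlier sketch. The class name, the N=10 skip count, and the error handling are illustrative assumptions, not normative parts of the protocol.

```typescript
// Sketch of per-client local state: after a primary (CDN) failure, skip the primary
// for the next N requests, then probe it again. N = 10 mirrors the heuristic
// described above; everything else here is illustrative.

class FailoverState {
  private skipRemaining = 0;              // how many more requests avoid the primary
  constructor(private readonly skipCount = 10) {}

  shouldTryPrimary(): boolean {
    if (this.skipRemaining > 0) {
      this.skipRemaining--;
      return false;                       // still in the "primary is down" window
    }
    return true;                          // healthy, or time to probe the primary again
  }

  recordPrimaryFailure(): void {
    this.skipRemaining = this.skipCount;  // mark the primary as down for a while
  }
}

async function requestWithState(
  path: string,
  cfg: { primary: string; fallback: string; timeoutMs: number },
  state: FailoverState,
): Promise<Response> {
  if (state.shouldTryPrimary()) {
    try {
      // fetchWithTimeout is the same helper shown in the earlier sketch.
      return await fetchWithTimeout(cfg.primary + path, cfg.timeoutMs);
    } catch {
      state.recordPrimaryFailure();       // primary failed: fail over immediately
    }
  }
  // Secondary endpoint (origin or another CDN), no DNS change involved.
  return fetchWithTimeout(cfg.fallback + path, cfg.timeoutMs * 3);
}
```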
So here, we are totally bypassing DNS, because DNS propagation
was one of the biggest problems, as I mentioned earlier.
The protocol works because it completely bypasses DNS-based routing.
Both endpoints are configured in the client code.
When a failure happens, you are not updating DNS and waiting for propagation;
you are just changing which URL the client requests.
DNS was never designed for fast failover.
It was designed for load distribution and caching.
By moving the failover decision from DNS to the individual client, we have eliminated
propagation delay entirely.
Another thing is regional decision making without coordination, right?
Traditional failover requires coordination: a centralized monitoring system detects the failure, makes a decision, and propagates it via DNS.
And how do you get all the clients to agree on the current state?
With the client-side failover we are proposing, there is no synchronization needed.
Each client observes its own reality, and it acts accordingly.
CDN down in Tokyo but up in London? Sure, Tokyo clients
fail over and London clients don't. That's it. Automatic, no global coordination. So this works in
So you might ask, because you initially said in the DNS failure, when you opposed the idea
of sending the traffic to the origin, but now you're proposing, you are suggesting to use origin
as the direct URL in the secondary origin. Doesn't they send traffic direct to origin,
which can't handle full load? So you may ask that question. So the key difference,
difference here is the traffic pattern. So when DNS failure happens, you are sending 100%
of your users, even though some of them are like up and running still, you send everybody
to your origin for almost entire time when the CDN is down. And that means your origin is slammed
all at once and stays completely overrunned. But with the client said failover, it is gradual
and temporary, meaning only users experiencing CDN failure will fail over individually, which means
the traffic automatically shifts bad when CDN recovers.
Also, not everybody slams your origin at once,
because not everybody is using your service at the same moment;
they only come to your origin server when they actually make requests.
And on top of that, you don't redirect the traffic
where the CDN is actually up and running.
So you only send the impacted traffic,
and you also go back very soon after the CDN recovers,
without any DNS changes or anything like that.
That sounds like...
Yeah, it's like a surge pricing concept.
The system can handle peak demand for two hours during a concert,
even though it can't sustain that 24/7.
Your origin can handle CDN-level traffic for the three-hour outages that happen twice a year,
especially with cloud auto-scaling.
On AWS, we have elastic options, right?
And for mission-critical systems,
the fallback can be a secondary CDN instead of the origin.
The protocol doesn't mandate the origin;
it mandates fast, automatic failover to whatever endpoint you configure.
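For illustration, a fallback chain like the one he describes might be expressed as ordered endpoints in the client configuration; the URLs and field names here are hypothetical, not from the protocol draft.

```typescript
// Hypothetical configuration: the fallback chain can put a secondary CDN ahead of
// the origin. The protocol only asks for fast, automatic failover to whatever
// endpoints you configure, in order.
const resilienceConfig = {
  primary: "https://cdn-a.example.com",   // main CDN
  fallbacks: [
    "https://cdn-b.example.com",          // secondary CDN for mission-critical paths
    "https://origin.example.com",         // origin as the last resort
  ],
  timeoutMs: 500,
};

// A client would walk the chain in order, applying the same timeout-and-retry
// logic sketched earlier to each endpoint until one responds.
```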
That's awesome.
I want to drill back down into something that you said earlier.
When you were talking about AI systems handling partial failure poorly,
can you describe what new failure modes emerge
when inference runs at the edge or the CDN layer?
Oh, yeah.
So when AI inference runs at the edge,
you get entirely new classes of failures that don't exist in centralized systems.
So what makes them dangerous is they are mostly silent.
Like the systems appear to work,
but correctness is compromised in invisible ways.
And that is the thing developers actually hit.
For example, in centralized AI, deploying a new model is atomic.
Everyone gets the same version instantly.
At the edge, you are pushing model weights to thousands of nodes globally.
Some update quickly, some slowly, some fail entirely.
In traditional centralized AI, everybody, the user in Tokyo and the user in London,
gets the same version,
because if you fail, you fail for everybody.
But in the edge model, the user in Tokyo might hit a node running model version 2.1,
while the user in London might hit version 2.0.
Same question, different answers.
Not from the AI's inherent non-determinism, because we know AI is known for non-determinism,
but here you are adding to that non-determinism by literally running different model versions on different nodes.
And the other thing is context loss, right?
AI inference often needs data from multiple sources, like customer history, support tickets, account status, or documentation.
In centralized systems, when retrieval fails, the entire inference fails and you get an error.
But at the edge, retrieval can be partial and silent.
The edge node reaches the customer database, but not the support tickets,
and the AI generates a response based on three out of five data sources.
The response looks reasonable, but it is wrong,
because it is missing mission-critical context.
Nowadays we all know how important context is;
otherwise, we are letting the AI hallucinate and run wild.
And the third thing is tool calls.
Now we have so many things like MCP and agents,
all these things in modern AI, right?
Modern AI systems use tools: API calls,
MCP calls, database queries, external actions.
When inference runs centrally, tool failures are very visible.
The model knows it failed, and it can handle
it. But at the edge, tool calls go through unreliable infrastructure, with three bad outcomes:
the model hallucinates a result instead of admitting the
failure; the model skips validation and proceeds with unverified information;
or the tool partially succeeds with truncated data and the model thinks it has complete information.
So that's what trips people up. And the other thing is geographic non-determinism, right?
The same prompt to the same model version produces wildly different results based on which edge node handles it.
Not from AI randomness, as I said, but because different nodes have different cached contexts, different available tools, and different data freshness.
I've also been working in AI for the last several months, close to a year.
Something I hate is that AI models are already non-deterministic;
I don't want to add more fuel to that non-determinism.
I want to keep them grounded as much as possible.
And these edge failures actually add more and more fuel to the non-determinism of the LLMs.
And then there is the security aspect, right?
Attackers can exploit edge infra unreliability as an attack vector.
An attacker sends a prompt with fallback instructions saying,
hey, if you can't access the safety guidelines, just answer directly.
On a healthy edge node, the AI fetches the guidelines and applies them.
On a degraded node, where retrieval fails, the prompt injection activates.
The AI follows the attacker's instructions instead of refusing the unsafe request.
They're not exploiting the model; they're exploiting the infrastructure that is not reliable.
And data residency violations, right?
Companies run edge inference for data residency, like European data under GDPR.
When the edge fails, you fail over to a centralized origin,
and you might violate those requirements.
EU users' data is now processed in a US data center, which is a compliance violation, right?
So you are forced to choose: fail over and violate compliance, or don't fail over and violate
SLAs.
And this failure mode only emerges during an outage, when you are least prepared to handle it.
And the cache invalidation dilemma: edge systems cache inference results for speed.
But when a data source fails, when do you invalidate the cached results?
You have inference results from five minutes ago,
when all the data was healthy.
But now one source has failed, and a user sends the same prompt.
Do you return the cached result based on the now-unavailable data,
or recompute with incomplete data, which gives the user a wrong answer?
The CDN Resilience Protocol addresses this by making failover fast enough, like one or two seconds,
that these partial-validity states rarely occur.
Instead of degrading for 15 to 60 minutes while DNS propagates,
you fail over before tool calls time out
and before the context becomes stale.
So yeah, for edge AI,
resilience is not optional.
It is a correctness requirement.
Well, Venkata, I feel like you've given us a masterclass on CDNs and the edge today.
And it's such an important topic because the edge is evolving so quickly.
If you were looking at infrastructure leaders designing systems today,
what would you guide them on in terms of the mindset shift required to treat resilience as a
foundational requirement of their designs?
Yes. Until now, the mindset was that resilience is something we add later.
That's the kind of mindset we had, but now it has to switch:
resilience is a day-zero architectural requirement.
So let me make it concrete with three specific shifts.
Shift one is from "if it happens" to "when it happens."
The old mindset is, hey, our CDN has 99.99% uptime,
we'll probably never see an outage,
and if we do, we will deal with it then.
That was the old mindset.
The new mindset: CDN failure is inevitable.
The question is not if, but when, and how ready we are whenever it happens.
This changes how we make decisions.
You don't ask, should we invest in resilience?
You ask, how fast can we recover when the failure happens?
Practically, this means testing failure scenarios in production-like environments.
If you have never simulated a CDN outage, and I'm pretty sure most companies have not,
you have no idea if your failover mechanisms actually work.
Most companies discover their runbooks don't work during the actual outage, when it is too late.
Infrastructure leaders need to treat CDN resilience the same way they treat database replication or load balancer redundancy.
On this point, I actually appreciate Netflix.
Netflix does something interesting:
they intentionally break their infrastructure just to see how they respond in the case of an emergency or an outage.
I really like that idea.
And they don't only break things at a regional level, they break them at a global level as well.
They want to see how prepared they are.
So yeah, that's the kind of preparedness we need to have.
And shift two is from infra-only to client-aware.
The old mindset is that resilience is an infra problem;
we will solve it with infra tools like multi-CDN, DNS failover, or health checks.
The new mindset is that resilience requires intelligence at every layer, especially at the client.
Yeah, this is uncomfortable
for many infra teams, because it means coordinating with application teams.
In every company you have that one big infra team, and you also have one big product
or application team developing things, right?
So you can't just solve this in your own Terraform configs.
You need client libraries updated, mobile apps updated, API clients updated.
But here is why the shift matters.
Infra-only solutions have 15-to-60-minute recovery times because they are fighting against
how DNS and the internet fundamentally work.
Client-aware solutions have one-to-two-second recovery times because they are working with the grain of the architecture.
The question for leaders is: are you willing to accept 45-minute outages to avoid coordinating with application teams?
When you frame it that way, the answer becomes obvious.
And there is another shift, from enterprise-only to universal best practice.
The old mindset:
CDN resilience is only for companies with massive budgets, like Netflix, Amazon, Meta.
Small companies just have to accept the risk.
The new mindset is that every company deserves affordable, practical resilience,
and open standards make that possible.
That is why we push the RFC and open standards, to give all these companies
access to all of this.
When you think like that, suddenly resilience is not something only enterprises can afford.
It is something every startup gets for free by using standard tools.
For infra leaders, this means two things.
First, if you are designing a new system today,
use tools and frameworks that support open resilience
standards. Don't build proprietary
solutions that lock you in.
Second, contribute to the standards:
the more companies participate in defining
how CDN resilience should work,
the better the ecosystem becomes for
everyone. And finally, there is the cost
calculation, right? Leaders need
to reframe how they calculate the risk.
The question is not what
resilience costs. It's
what 45 minutes of
downtime costs. For e-commerce,
that's hundreds of thousands in lost revenue.
For SaaS models, that's
blown SLAs and customer churn. And for critical infrastructure, that's reputation damage.
And for some companies, it also affects people's lives. For example, if your car breaks down
somewhere and you need to call to request service, and that website is down, then obviously
you don't know what kind of conditions those people are in, right? You need to help them the sooner
the better. So when you compare the cost of implementing client-side failover,
which for most companies is a few days of engineering work using open-source tools, against
a single major outage, the ROI is obvious; it pays for itself immediately.
The mind shift is treating resilience not as insurance
against unlikely events, but as protection against inevitable events. CDN outages
will happen. The question is whether you are ready or not.
I love it. I'm going to give you a bonus question. We just finished our predictions series
on Tech Arena for 2026. So I've got your own predictions question. What do you think changes first:
how CDNs are built, how inference is deployed at the edge, or how the industry standardizes resilience?
I'll go with... I think standardization comes first, and it's already happening.
The CDN Resilience Protocol RFC, which I'm submitting to the IETF, is part of it,
and I'm pretty sure there are many people like me also working on this.
So why does standardization come first?
Here is why:
because it comes before the architectural changes.
Standards create pressure on both CDN providers and platform
and application developers simultaneously,
without requiring either side to move fast.
Think about how HTTPS became universal.
It was not because every website decided independently to add SSL.
It was because browsers started marking HTTP sites as not secure,
which created pressure from users.
The standard came first, then browsers implemented it,
then websites had to follow.
So that's why I feel CDN resilience will follow a similar pattern,
with a similar adoption sequence.
The sequence is: standardize the protocol through the IETF,
framework and library authors implement it, developers adopt it along the way,
and CDN providers adapt their infra to support the standard.
And CDN architecture changes come later, I believe, as a secondary thing.
CDN architectural changes come later because they require the most investment and coordination,
and CDN providers won't restructure their infra until there is clear market demand.
That demand only materializes after
client-side resilience becomes standard practice. But I do think we'll see evolution:
CDNs might federate for resilience, like Cloudflare and Fastly
establishing fallback agreements where, if one fails, the other provides backup
capacity, like how you can still make an emergency call when your AT&T coverage is not there;
the call goes through another carrier.
So yeah, we will see better health
check endpoints that give fine-grained regional status, and graceful degradation
where CDNs can signal, hey, I'm degraded, but not fully down.
So clients can make informed decisions.
And these changes take two to three years minimum, because they involve core infra modifications,
legal agreements between providers, and business model evaluations.
Edge inference deployment adopts fastest.
Edge inference deployments will adopt the fastest once client resilience is standardized,
because this is greenfield territory.
Companies are just starting to deploy AI at the edge,
so they can design for
resilience from day one.
What this means for the industry, the broader impact, is that resilience becomes invisible
infrastructure.
Just like you don't think about TCP today, it just works,
you won't think about CDN failures tomorrow, because once it's standardized,
it will just work.
It will be built into every HTTP library, every framework, and every platform.
That's the power of standardization, right?
Once it's a protocol that everyone implements, it becomes ambient.
So to directly answer the question: standardization changes first, in my understanding, and client frameworks follow within the next year or so.
Then edge inference adopts within the next couple of years,
and CDN architecture evolves within the next three to four years.
But it all starts with the standard first.
That makes a lot of sense.
And we're going to hold you to it.
I'd love to have you back on the show at some point to talk about the standard's progress and see how the world of CDNs is evolving.
I'm sure that folks who are listening will want
to learn more from you and engage on the standard.
Where can they connect with you?
Oh, yeah.
So the CDN Resilience Protocol
is going to launch as open source very soon.
The GitHub repository will include
the complete client library implementation,
technical specifications, working examples,
and also the documentation,
everything teams need to start implementing
client-side failover.
And I'm also going to publish a comprehensive technical
article on Medium that goes deep into both the
problem and the solution, which
I'm hoping to put live in the next couple of weeks.
And for the standardization path, I'm preparing an RFC submission to the IETF
and planning speaking engagements with local tech chapters.
I work with the local SF Bay Area ACM chapter and am also part of the ICCI Silicon Valley chapter.
I'm going to present this idea to seek key feedback from tech people, to see
if there is anything we need to modify and how the tech community is going to take it.
Yeah.
And the best way to follow the launch is through my updates on
my LinkedIn profile, where I'll be sharing when the GitHub repository goes live, when the
Medium article publishes, and if I have a plan to talk at one of the local chapters.
Well, Venkata, thank you so much for the time today. It was awesome talking to you. I learned a ton,
and I'm sure our audience did too. Thanks so much. Thank you so much, Allyson. Thanks for having me.
Thanks for joining Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by TechArena.
