In The Arena by TechArena - Solving the Internet’s Single Point of Failure: CDN Resilience

Episode Date: April 9, 2026

Cloud expert Venkata Gopi Kolla joins Allyson Klein to discuss the CDN "single point of failure" and a new IETF protocol for sub-second edge recovery and AI correctness. A must-listen for infrastructure leads.

Transcript
Starting point is 00:00:00 Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the arena. My name is Allyson Klein, and I am really excited for today's episode. We've got Venkata Gopi Kolla with us. He is a lead software engineer at Salesforce. And first time on the program. Welcome, Venkata. How's it going? Going good. Thanks for having me. So, Venkata, why don't we just start? Salesforce is a large company, and there are a lot of roles there.
Starting point is 00:00:37 So can you briefly describe your role at the company and what you're working on today? So, yeah, I work in a team where we onboard the several domains customers bring into Salesforce on top of the CDN systems. So we work with several CDN vendors. And especially I work on large-scale SaaS platforms that rely heavily on CDNs and their infrastructure. And more recently, AI-driven workflows layered on top of that. Over time, I started working on traditional distributed systems platforms that interact with databases and asynchronous processing message queues, and then slowly transitioned into edge systems. Now that edge systems are getting more of an AI flavor, lately that thinking has expanded into
Starting point is 00:01:19 how AI workloads change the stakes around availability, correctness, and how inference happens. That's awesome. You know, I was so interested in this episode because inference at the edge is such an exciting topic for me, and focusing on solving for CDNs and edge infrastructure in this particular area is really interesting. One question to maybe just get started
Starting point is 00:01:44 is what are some of the challenges you focused on solving across CDNs and edge infrastructure in your work? So there are several challenges teams face with CDNs and edge infrastructure today. Things like cost optimization is one, which we are dealing with, and also configuration complexity, especially if teams use SSL certificates and cache rules
Starting point is 00:02:06 and all those things, and also performance variation. And observability is also another key concern. But among all the challenges, the one I'm focused on solving personally, and the one I believe poses the greatest risk to the internet infrastructure, is the CDN single point of failure problem. So what makes this a big deal is the scale of the problem. Like today, approximately 90% of the internet traffic flows through just four to five major CDN providers.
Starting point is 00:02:33 So it is unbelievable, but it is a fact. For example, Cloudflare takes the bigger piece, about 22 to 25%, and then it is followed by Amazon CloudFront, Akamai, Fastly, and Google. And these are the CDN vendors that handle all that CDN traffic. It's like 90% of the internet traffic goes through these CDN systems. So this extreme concentration creates a massive single point of failure that affects billions of users. Imagine if 90% of global shipping went through just five ports. When one port closes, thousands of ships have nowhere to go.
Starting point is 00:03:09 Even though their cargo is fine and their warehouses are operational, there is still nowhere to go. So that is the same problem we are dealing with today in the internet world. We saw this play out dramatically a couple of months back in November, when Cloudflare experienced an outage for almost a couple of hours. Thousands of websites and services became completely unreachable. For example, ChatGPT and Discord went offline and Shopify stores were not reachable. And the critical part here is none of these companies' servers actually failed.
Starting point is 00:03:40 Their infrastructure was running perfectly fine, but because the CDN layer that sits in between the users and those servers stopped working, the services became totally unavailable. And the estimated economic impact was in the hundreds of millions globally. You might simply ask why companies can't just avoid CDNs. First of all, performance, because as you all know, CDNs bring computation close to users, and that reduces latency. And second is cost. You may not believe it, but interestingly, it costs less to serve each byte when using CDNs compared to origin servers.
Starting point is 00:04:16 And the third aspect is more of an unavoidable thing, and that is security. Like DDoS attack mitigation, bot traffic overwhelming origin servers, WAF protection, rate limiting, SSL termination. CDNs provide these things as out-of-the-box solutions, and building these things at the origin server would cost millions. So companies must use CDNs. There is no practical alternative. So the one I am developing, going back to your question, is the CDN Resilience Protocol. It is a fundamentally different approach that pushes failure detection and recovery to the client level.
Starting point is 00:04:50 Yes, it is at the client, sitting at the client level. So here is how it works. Instead of relying on centralized monitoring and DNS updates, we embed intelligence directly into the client, like browsers, mobile apps, API clients. These clients are configured with both a CDN endpoint and an origin fallback. When a request to the CDN fails, the client immediately retries against the origin server. It is the secondary avenue. No DNS propagation needed, no human intervention. Recovery happens in milliseconds. So the protocol includes fast failure detection at the client level, instant failover that
Starting point is 00:05:28 bypasses DNS entirely, automatic recovery when the CDN comes back online, and zero infra changes required. And it's universally compatible across any CDN provider. So I'm working on building an open source implementation with a JavaScript client library, and I am in the process of formalizing this as an RFC for IETF standardization. So why does this matter now? The goal of this work is to make CDN resilience practical and affordable for companies of all sizes.
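To make the idea concrete, here is a minimal sketch of what such client-side failover could look like in JavaScript. This is an illustration, not the actual protocol or library: the host names, the 500 ms timeout, and the URL-rewrite helper are all assumed values for the example.

```javascript
// Illustrative sketch of client-side CDN failover as described in the episode.
// CDN_HOST, ORIGIN_HOST, and the timeout are assumptions, not protocol constants.
const CDN_HOST = "cdn.example.com";
const ORIGIN_HOST = "origin.example.com";
const CDN_TIMEOUT_MS = 500; // tight timeout = fast failure detection at the client

// Rewrite a CDN URL to its origin-fallback equivalent.
function originUrlFor(url) {
  const u = new URL(url);
  u.host = ORIGIN_HOST;
  return u.toString();
}

// Try the CDN first; on timeout or a 5xx, retry the origin immediately --
// no DNS update, no human intervention, recovery in milliseconds.
async function fetchWithFailover(url) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(CDN_TIMEOUT_MS) });
    if (res.status >= 500) throw new Error(`CDN error: ${res.status}`);
    return res;
  } catch (err) {
    // Fallback path: same resource, served directly by the origin.
    return fetch(originUrlFor(url));
  }
}
```

A real implementation would also need periodic probing so clients automatically return to the CDN once it recovers, rather than hammering the origin forever.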
Starting point is 00:05:56 Because a couple of weeks back, when the Cloudflare outage happened, I'm pretty sure most of the companies, 90% of the companies, had this question pop up: hey, the CDN is down, what are we doing? What is our plan B when the CDN goes down? And this is the solution which I'm proposing as a standard. It is not just for enterprises. This is to establish an open standard that the entire industry can adopt,
Starting point is 00:06:17 because this is not just a technical problem. It is a critical infra vulnerability that affects everyone on the internet. That's amazing. And I want to unpack that with you. I want to back up a little bit. For perspective, you've been working in the CDN space for a long time, and what you just said is phenomenal in terms of solving that single point of failure. Yeah.
Starting point is 00:06:40 How do you see CDNs evolving to address the AI era? And what do you see changing? I mean, I think everybody lived through that Cloudflare outage. When you were talking about it, I was remembering what was happening to me, and it was frustrating. But you can imagine, for mission-critical applications, it becomes really threatening to some people if they go down. Obviously, we can live without ChatGPT for a few hours, but it's an annoyance versus something that's risky. Can you talk a little bit about how the CDN community has evolved CDNs to address this moment in time?
Starting point is 00:07:16 So I'm glad you asked. CDNs have actually had a very interesting evolution trajectory. Right when they started, CDNs began as pure performance optimizers — I mean, it was mostly for caching, right? They started as, hey, we are here for caching your resources. For example, a website is in California and users might be in Tokyo; they experience high latency because the request has to travel across the globe. The solution is caching static content like images, CSS, JavaScript, on servers geographically closer to the users. And at this stage, CDNs were purely optional.
Starting point is 00:07:51 If a CDN failed, your site would just be slower. Users would still reach origin servers with a degraded experience. But here comes the next phase, where security evolved. CDNs evolved to handle more dynamic content later. This happened in the 2000s — TCP optimization, compression, smart routing for requests that could not be cached. And more importantly, this is when CDNs became essential for security. DDoS attacks got larger and more sophisticated, and most companies could not afford the infrastructure to absorb such attacks. CDNs became a security perimeter for DDoS protection, WAF, bot detection, embargo blocking, and rate limiting.
Starting point is 00:08:33 And this is the point where they pivoted from optional to mandatory. The next phase is even more interesting, and this is my favorite phase. This is where edge became an architectural driver. So think of Google Maps, right? Before it existed, people planned their entire route before leaving,
Starting point is 00:09:00 like they would look at the map and write down the instructions and remember the landmarks. After Google Maps, it not only made directions simple, so people stopped worrying about all the things they used to worry about, but people also started virtually visiting places. Like people who don't have time, they use Google Maps to see how a place looks. Before going physically, they just check out the restaurant exterior, or whether the place has parking, all these kinds of things. They even see how crowded the beaches look. So the tool, Google Maps, did not just make navigation easier. It also created an entirely new perception, a new behavioral pattern for people.
Starting point is 00:09:30 So that's what the CDN did for our internet traffic and app server building as well. The CDN caching advantage, if you take it, drove an industry-wide architectural shift. Companies using client-side rendering — loading an empty HTML shell and rendering everything in the browser — switched to SSR and SSG specifically to take advantage of edge caching. The same pattern played out with image optimization, bot detection, captchas. These services moved to CDNs, not because origin servers could not do them,
Starting point is 00:10:02 but because doing them at the edge was 10 to 20 times more efficient. It is less expensive, and it is also faster. What's critical here is that CDN capabilities started driving architectural decisions. Teams were not asking what is the best architecture for our app. They were asking what architecture takes maximum advantage of the CDN cache. So the infra layer was now shaping the application layer. And then another phase, the current phase — this is fundamentally different from everything before. There are two major shifts in this,
Starting point is 00:10:34 like the first shift. AI inference workloads are moving to the edge, and companies want to run models close to the users for lower latency, better privacy, and also reduced bandwidth costs. Spam detection, content moderation, image classification, and personalization engines — these are all some of the examples where pretty soon, I predict, they move to the edge infra. And the second shift is more profound. AI is changing how applications depend on the infrastructure.
Starting point is 00:11:03 Traditional applications, if you take them, make direct requests: a user clicks a button and the request goes to a server. You get a response. If infra fails, the request fails and the user sees the error. The blast radius is limited to the specific interaction. But AI systems don't work that way. A single AI-driven interaction triggers multiple backend operations. It's like talking to multiple agents or talking to different MCPs.
Starting point is 00:11:30 In parallel or in sequence, they all happen. Like data retrieval from multiple sources, and inference, tool calls, external services, validation, synthesis. These operations depend on each other. So here is what really changes with AI, right? Failures now affect correctness, not just availability. Because so far we were only thinking, hey, the CDN is not there, availability is not there. At least things were consistent when they were available. But now it's a matter of correctness also. Consistency also comes into the picture. So with AI systems making autonomous decisions, partial failures create a more dangerous problem.
Starting point is 00:12:04 An AI agent pulls data from five sources, runs inference, takes action. If some of those five data sources are unreachable because edge infra failed, what happens? Does the AI bot proceed with incomplete data and hallucinate, or retry aggressively and amplify the infra problem? There is no good answer, and each choice has consequences. So overall, the stakes have changed. A CDN outage used to mean slow websites, but slowly it transitioned into broken automations, incorrect decisions, exposed security. So the fundamental shift here is in how we think about edge infra — edge infra resilience. That is the most important thing now to avoid all these things.
Starting point is 00:12:47 Now, I think you brought up a couple of key points about some of the assumptions teams are making about CDN deployment. Can you talk a little bit about your view on the most dangerous assumptions teams can make about reliability and availability? Sure. So actually, I'll start with my own — the assumptions I had when I started working with CDNs early on, right? The first assumption was: the CDN won't fail. Because if you look at the SLA CDNs will give, they give 99.99% uptime, which is four nines. Which means, is that really good, four nines?
Starting point is 00:13:24 So yeah, on paper, 99.99 is really good. But let's do the math, right? So 0.01% downtime means 52 minutes of downtime per year. And here is the critical part. Those 52 minutes do not arrive evenly distributed. It's not like four minutes per month. You get zero downtime for 11 months, and then on one busy day, a mission-critical day for you, you get all 52 minutes of outage during the busiest shopping day. Think of it like earthquake insurance in California, right, which we all deal with. A 99.99% chance of no major quake this year sounds safe. But earthquakes don't arrive evenly. You get nothing for 30 years and then a big one. CDN outages work the same way. So that's why the first assumption — CDNs have 99.99% availability, so we're safe — is a bad one.
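The SLA math above is easy to sanity-check. A quick sketch, assuming a 365-day year:

```javascript
// Convert an SLA uptime percentage into allowed downtime minutes per year.
function downtimeMinutesPerYear(uptimePercent) {
  const minutesPerYear = 365 * 24 * 60; // 525,600 minutes in a 365-day year
  return (1 - uptimePercent / 100) * minutesPerYear;
}

// "Four nines" still allows roughly 52.6 minutes of downtime per year,
// and the SLA says nothing about those minutes arriving evenly spread out.
downtimeMinutesPerYear(99.99); // ~52.56
downtimeMinutesPerYear(99.9);  // ~525.6 (three nines is nearly nine hours)
```

The point of the earthquake analogy holds in the numbers: the SLA bounds the total, not the distribution, so all ~52 minutes can land on your busiest day.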
Starting point is 00:14:17 Don't assume they never fail. And the second assumption — the second assumption is a little more advanced, one that people with a moderate idea of CDNs make: origins can handle full traffic. That is what everyone would think. When the CDN goes down, technically you can route all traffic to the origin via DNS failover and everything will be fine. But there is a catch:
Starting point is 00:14:38 origins are sized to handle only 5 to 10% of the traffic. This is the fun fact, because in your web traffic, today your origin only handles the portion that cannot be cached, which means 80% of the traffic is being handled by the CDNs. So origin servers are not dimensioned to handle 100%. They have not evolved to handle 100%. Yes, before CDNs were in the picture,
Starting point is 00:15:11 you designed your origin to handle 100% of the load. But at that time, imagine the internet traffic. Internet traffic has increased tenfold or a hundredfold since then. So nowadays, if you are designing an application, it's very likely you keep the CDN in mind. And 80% of the traffic would be handled by the CDN. Only 20% of the traffic will be handled by your origin. So now, with traditional DNS-based failover, if you make a binary decision saying, hey, my CDN is down, now route everything to my origin directly for the next
Starting point is 00:15:35 45 or 60 minutes, or however long the CDN is down, during that time you are asking your origin to handle almost 10 to 20 times more than its designed capacity. And it is also being exposed to the attacks the CDN was blocking. So this is how a CDN outage cascades into an origin outage. Now you are actually bringing the CDN's problem into your own infra. Because at least if the problem lies in the CDN, you know the company will resolve it quickly.
Starting point is 00:16:05 Otherwise, everybody will complain. But by letting all this traffic come to your own origin, you are breaking your origin, the very place where you are supposed to fix everything. So it is very critical not to have that second assumption. And the other one is, they say, hey, we have monitoring. But where does your monitoring run? That is the first question I would ask. Because many teams' monitoring infrastructure itself depends on the CDN. Your dashboards might be behind the CDN.
Starting point is 00:16:30 And your alerting webhooks might be routing through it. So I have seen some operations teams, some companies, that did not even realize the CDN was down for 10 to 15 minutes, because their monitoring service was also impacted. Because who will tell you, when the one who is supposed to tell you is also down? That is why they only find out when customers start calling. And maybe you say, hey, I'll have an independent monitoring system that gives me a centralized view of whether the CDN is down, meaning I'll put my monitoring service somewhere outside which
Starting point is 00:16:59 does not rely on CDNs. But there is a catch. This is the interesting part with CDNs: CDN failures are often regional. Your monitoring in Virginia shows everything is fine while users in Tokyo cannot reach your service. And you can't simply put a monitoring service everywhere. Another interesting assumption is: a CDN failure is just downtime. Everyone thinks, hey, our company can afford that one hour of downtime.
Starting point is 00:17:24 There are two mission-critical aspects to it. One is customer trust. Users don't distinguish between "our CDN failed" and "the site is down." The trust erosion is immediate. They just see, hey, your site is down, and multiple outages will lead to customer reputation impact. The second is security exposure. During a failure, when you're scrambling to fail over or waiting for recovery, your system is vulnerable.
Starting point is 00:17:47 If you have routed your traffic to the origin, that means you are exposing the attack surface. Your origin is now the attack surface. If AI-driven security decisions depend on edge infrastructure, those decisions are not happening. Attackers know CDN outages create windows of opportunity, and they keep looking for those kinds of windows. We have seen some coordinated attacks targeting services during CDN failures, knowing defenses are weakened. Because attackers know, for a fact, what everyone knows nowadays:
Starting point is 00:18:17 all the companies are moving their security services to CDNs. CDN down means they are weak in terms of security, right? So now let me tell you why these assumptions persist. Because CDN outages are infrequent enough that teams can go years without experiencing one. But as dependency deepens, outages become more consequential. The question is not if, but when. Now, you've made me stressed with all those stories. The next question is going to make it worse. With all of that in mind, you talked about inference moving closer to users at the edge. How does that change the impact of failure compared to centralized AI systems? Yeah.
Starting point is 00:18:51 So this is the critical distinction that's not very well understood yet. So I'm glad you brought that up. We have centralized and edge AI failures. They look similar on the surface. Without edge, what we have traditionally is centralized AI:
Starting point is 00:19:14 there is a big LLM running, and it will be answering your questions. Now with edge AI failures, they look similar on the surface, but the impact, the blast radius, and the recovery dynamics are fundamentally different. For example, let's talk about how things look in centralized AI. Failures have very well-defined boundaries in that case, meaning your chatbot data center goes down and the chatbot fails. But your checkout, inventory, email all keep working. Each service has independent infrastructure. You also control the entire stack, so you have visibility. You can fail over to backup data
Starting point is 00:19:49 centers and you can scale. So recovery is pretty straightforward. You fix the issue, the traffic resumes, the business is back online. But with edge AI, it is compounding, it is geographic, and it is unpredictable. So edge inference changes everything through three mechanisms. One is compounding failures. When edge infrastructure fails, everything running on those edge nodes fails simultaneously — spam detection, content moderation, personalization, fraud detection, or whatever you keep at the edge, they all collapse together, because they share the same infrastructure. And for a customer support workflow using AI for classification or sentiment analysis and routing, the entire pipeline stops, not just one piece.
Starting point is 00:20:34 Think of it like this. Centralized failures are like a restaurant running out of one ingredient. If that ingredient is out, then you just say, hey, we are out of this dish. But edge failures are like losing power to the restaurant. Nothing works, because everything is shared — the electrical panel is shared. And the second is geographic asymmetry. This is what makes edge more interesting and also complicated. Edge failures are rarely global. If you have seen the history of CDN outages, you never see the entire CDN going down.
Starting point is 00:21:05 You always see it — even the last November one: it impacted only 60 to 70% of the CDN's services, with the remaining 30 to 40% intact. So they're not global, they're regional. For example, your fraud detection system might be down in Tokyo, but working in the U.S. With centralized AI, if fraud detection is down, it is down for everybody.
Starting point is 00:21:26 You make one decision: block all transactions, or accept the risk. At least things are in control — you know what you are allowing. But with edge AI, you are in a partial failure. And in the software industry, partial failure is the one you hate most. I would rather prefer a full failure, knowing what is happening, rather than a partial failure, because it is impossible to detect. Some users are protected, but others are not.
Starting point is 00:21:50 Centralized systems often don't know. So you need regional decision making that most systems are not built for. And the third one is cascading amplification. When an edge node fails, retry storms get triggered. Like 10 AI services all retry against already stressed backup infrastructure. So you have just multiplied the load 10 times. AI agents don't gracefully degrade.
Starting point is 00:22:18 They aggressively retry and they overwhelm whatever is left. With centralized AI, you can implement centralized circuit breakers, the traditional way we deal with this. But with edge AI, agents are distributed across thousands of locations, making independent retry decisions. It is very complicated to control them in these kinds of scenarios, and the failure amplifies geographically. And beyond these mechanisms, you get entirely new problems as well.
Starting point is 00:22:45 For example, the security dilemma, right? With centralized AI, when it fails, you fail closed — block everything and recover safely. But with edge AI, you can't fail closed, because edge infra provides basic connectivity. You fail open, meaning you allow actions without validation, creating security exposure. And when bot detection fails, do you block all traffic or allow all of it?
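One common discipline for taming the retry storms described above is to give each distributed client exponential backoff with jitter, so thousands of edge clients don't retry in lockstep against an already stressed fallback. A hedged sketch — the base delay, cap, and attempt budget here are illustrative choices, not values from the proposed protocol:

```javascript
// Full-jitter exponential backoff: pick a random delay in
// [0, min(cap, base * 2^attempt)], so retries spread out in time.
function backoffDelayMs(attempt, baseMs = 100, capMs = 10000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Retry an async operation a bounded number of times, backing off between
// attempts, then give up instead of amplifying the outage indefinitely.
async function retryWithBackoff(operation, maxAttempts = 4) {
  let lastErr;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastErr = err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastErr;
}
```

The jitter is the important part: without it, every client that failed at the same moment retries at the same moment, which is exactly the geographic amplification described above.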
Starting point is 00:23:11 And the latency trap. You move inference to the edge for 20-millisecond latency, and applications are designed around that. When the edge fails and you fall back to central inference, suddenly that 20 milliseconds jumps up to 300 milliseconds. And some applications functionally break as well, because we have timeouts. We know how long LLMs take to infer an answer, and a traditional software request does not wait for more than a certain number of milliseconds.
Starting point is 00:23:41 And content moderation also silently fails to block harmful content before it is visible. So it's like having a car with failing brakes. The car seems fine until you really need to stop. Edge AI's silent degradation looks like it is working, but safety is compromised in many invisible ways. Now, we know that a lot of teams look at multi-CDN or DNS-based failover to manage risk. In your mind, why do those approaches so often fall short in practice? Yeah, those were actually the first things that came to mind when I was thinking about the CDN resiliency protocol. Because, hey, isn't it very intuitive to think: if Cloud
Starting point is 00:24:21 flare is problematic, then I'll onboard two CDNs — Cloudflare, and Akamai as a backup. Or DNS-based failover, meaning at the DNS layer, if it appears Cloudflare is not reachable, then I will map the domain to my origin server. Those are very intuitive things. But when you think through it, here is what actually makes them unreliable. In theory, yes, they look good, but let me pick multi-CDN first. Multi-CDN seems a very logical thing, I totally agree. When you use multiple providers, if one fails, traffic routes to another. Fine. You've eliminated the single point of failure. Correct. But what about the cost? You are paying almost two to three times for CDN services.
Starting point is 00:25:02 But honestly, if multi-CDN actually worked reliably, companies would pay the premium. The real killer is operational complexity. Meaning configuration drift — this is the first problem. Every CDN has a different syntax, different caching rules, different purge mechanisms. You start with good intentions to keep configurations synchronized, because you have to keep things consistent: if requests going to different CDNs see different behavior, that's a problem. So let's say someone makes an emergency cache rule change on CDN A to fix a production issue at 2am, but forgets to replicate it to CDN B. Now your CDNs are behaving differently, and users hitting
Starting point is 00:25:43 CDN A see correct behavior while users hitting CDN B see stale content, right? That means you have created inconsistency in the name of resilience. And the other thing is, not every CDN supports the same operations. This is practically what I have seen when we are onboarding things between different CDNs. One CDN says, hey, I only allow file uploads up to 100 MB. But another CDN says, I allow files up to 500 MB. If you have promised your customers the functionality of uploading 500 MB, but the CDN you want to use as plan B does not allow 500 MB, then you are stuck. That means you are creating behaviorally different experiences;
Starting point is 00:26:12 the customer would not know, hey, you are on a different CDN. The customer would question, hey, I was able to upload a 500 MB file yesterday, but today I'm not able to. So what changed? So that is an issue, right? You always need to be consistent in supporting those functionalities.
Starting point is 00:26:34 And cache coherency — this is even worse. When you deploy a new JavaScript file, publish an article, or update the pricing, you need to purge the cache on all CDNs simultaneously. And as we know, cache invalidation is one of the classical software engineering problems, right? So if CDN A purges but CDN B's purge fails or is delayed, users see different site versions depending on the CDN they hit. This problem hits e-commerce sites very badly, because it can mess up the prices. Seeing a wrong price — like you start with one price, but by the time you go to checkout, you see a different price. Or you open a page from your
Starting point is 00:27:13 machine and your friend opens it up from a different machine, and you both see different prices. That is not what you wanted to show. So, yeah, these kinds of problems — that's why multi-CDN is not practical. It is theoretical, but not practical. Now let's get back to DNS failover. This is another obvious thing, right? It has even more fundamental problems, to be honest. Think of DNS propagation like you're trying to recall a rumor. You tell the first person, hey, ignore what I said. But some people heard it five minutes ago, and some 20 minutes ago, and some haven't even heard it yet. So you have no control
Starting point is 00:27:36 over how fast the correction spreads. What I'm trying to correlate here is how DNS propagation happens. In my metaphor of the rumor, that's how DNS caching works. It happens at multiple layers beyond your control, like resolvers, operating systems,
Starting point is 00:28:04 browsers, mobile carriers. Even with a 60-second TTL, real-world propagation takes 15 to 60 minutes. You have no idea at how many places the DNS gets cached, right? From your laptop, to your operating system, to your browser, to your ISP, there are so many layers where DNS entries can be cached. So propagation means all these caches getting invalidated to find out the new origin IP for your domain. That takes almost 15 to 60 minutes, which defeats the entire purpose. And the second problem is the partial visibility problem.
Starting point is 00:28:40 Here is what makes this operationally terrible: you have no idea what's happening at the client level. You update DNS. Your monitoring says, hey, change complete, traffic routing to the new destination — directly to the origin. Everything looks good. But among actual users, some are still hitting the failed CDN because their DNS is still cached, some hit the new destination, and some bounce between both as their cache expires. So you have zero visibility into what percentage of users are experiencing failure versus recovery. Your monitoring says problem solved. But support tickets keep coming in for another 30 to 55 minutes. So is there a secondary
Starting point is 00:29:19 issue? Is the problem worse than you think? Or are you just seeing the DNS propagation tail, right? You don't know. So yeah, that's why these two techniques, DNS failover and multi-CDN, fall short, although they sound theoretically correct. So you concluded that pushing failure detection and failover closer to the client was the right decision. Why does that matter so much for fast, reliable recovery? Yeah. Because client-side failure detection and failover is the only architecture that matches how CDN failures actually occur. So let me explain why this matters and how the CDN Resilience Protocol, which I'm proposing, would work. Here is the fundamental insight: a CDN outage is not an infra event affecting everyone equally. It's millions of individual
Starting point is 00:30:05 failures, failure events happening at a client endpoints around the world. For example, a user sitting in Singapore makes a request, times out. But a user in London makes the same request. It succeeds. And no failure, right? And a user in Tokyo and they may again get a failure. So that means the failure is experienced at a client level in a specific location and also at a specific moment. No centralizer system can capture this reality because by the time you aggregate the health checks and makes a global decision whether my server is up or not, the situation has already changed. It's like having a fire alarm in every room versus one central alarm. So the central alarm only knows there is a smoke somewhere in the building after average sensor readings and deciding
Starting point is 00:30:50 if it is real. But by then, people in the affected rooms have been searching for minutes. But individual alarms in each room detect and alert instantly when there is a problem actually in that room. So it's about when failure detection happens, the same analogy occurs, right? When the failure detection happens at the client, it's happening at each independent room area. So recovery happens at network speed or at a human speed. For example, in a traditional approach timeline. So at the 0th minute, let's say the CDN fails in Asia. And at the second minute, centralized monitoring detects.
Starting point is 00:31:22 And at probably the 8th or 10th minute, an engineer confirms, oh, it's real. And probably at 10 or 15 minutes, because you need to run through some approvals, you update the DNS. And then, because the DNS system takes anywhere between 5 minutes and 24 hours, it probably takes at least 1 to 2 hours on average for DNS to propagate globally. That means we are looking at 2 hours of outage. But look at the client-side failover timeline, right? At the 0th minute, the CDN fails for a user in Tokyo.
Starting point is 00:31:52 And at half a second, the request will time out when they try to open it. And then immediately the client retries against the origin. And almost within a second, the user receives the response. It's pretty much like within one to two seconds of disruption for the customers. And here is what is most critical. Different users fail over based on when they experience the failure. The Tokyo user fails at the first second. The London user might not fail at all because the CDN is never down for them.
Starting point is 00:32:20 They still continue to use the CDN. And the Singapore user might experience the failure at the third minute. It's geographically accurate, automatic, and zero human intervention in this case. And we're able to cater to all this traffic with the CDN resilience protocol pushing the decision to the client. So this is what I am proposing to the IETF for standardization, right? The protocol uses a simple configuration, which every client can understand. So we define a primary, which goes through the CDN, as the primary URL, and we define a fallback that goes direct to origin.
Starting point is 00:32:50 So client libraries, like JavaScript or mobile SDKs or server-side HTTP clients, implement this logic. They attempt the primary with an aggressive timeout. If it times out, they detect the failure and immediately fail over to the origin, the secondary one. There's no delay, no DNS lookup needed. And the local state management can mark, hey, the primary is down. Let's not bother the primary for maybe the next 10 requests. And then after 10 requests, let's check whether the CDN-based primary one is back. If it is back, we'll use it at the client side.
Starting point is 00:33:21 If not, then we'll continue using the secondary one. So here, we are totally bypassing DNS in this equation, because DNS propagation was one of the biggest problems, as I mentioned in the other question. So the protocol works because it completely bypasses DNS-based routing. Both endpoints are configured in the client code. When failure happens, you are not updating DNS and waiting for propagation. You are just changing which URL the client requests. And DNS was never designed
Starting point is 00:33:49 for fast failover. It was designed for load distribution and caching. By moving the failover decision from DNS to the client itself, we have eliminated propagation delay entirely. And another thing is local decision making without coordination, right? Traditional failover requires coordination, like a centralized monitoring system that detects the failure, makes a decision, and propagates it via DNS. And how do you get all the clients to agree on the current state? With the client-side failover we are proposing, there is no synchronization needed. Each client observes its own reality and it acts accordingly.
Starting point is 00:34:24 And the CDN is down in Tokyo, but up in London? Sure, Tokyo clients fail over and London clients don't. That's it. Automatic, no global coordination. So this is more like a decentralized fashion, I would say. And handling the origin capacity concern would be this. You might ask, because when I opposed DNS failover I opposed the idea of sending the traffic to the origin, but now I'm suggesting using the origin as the direct URL in the secondary. Doesn't that send traffic directly to the origin, which can't handle the full load? So you may ask that question. The key difference here is the traffic pattern. When DNS failover happens, you are sending 100%
Starting point is 00:35:04 of your users, even though some of them are still up and running, you send everybody to your origin for almost the entire time the CDN is down. And that means your origin is slammed all at once and stays completely overrun. But with client-side failover, it is gradual and temporary, meaning only users experiencing the CDN failure will fail over, individually, which means the traffic automatically shifts back when the CDN recovers. And also, not everybody is slamming your origin at once, because not everybody is using your service at the same moment; only when they actually make a request do they come to your origin server.
Starting point is 00:35:40 And on top of that, you don't send the traffic where the CDN is actually up and running. So you only send the impacted traffic, and you also go back very soon when the CDN recovers, without any DNS changes or anything like that. It's almost like a surge pricing concept. The system can handle peak demand for two hours during a concert,
Starting point is 00:36:01 even though it can't sustain it 24 by 7. Your origin can handle CDN-level traffic for three-hour outages twice a year, especially with cloud auto-scaling. On AWS, we have the elastic options, right? So yeah, for mission-critical systems, the fallback can be a secondary CDN instead of the origin. So the protocol doesn't mandate origin. It mandates fast, automatic failover to whatever endpoint you configure.
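As a concrete illustration of "whatever endpoint you configure", a client-side configuration might express the fallback chain like this. The schema and field names are hypothetical, not from the draft protocol; the point is that a mission-critical deployment can put a secondary CDN ahead of the origin:

```python
# Hypothetical configuration shape for client-side failover.
# Field names and values are illustrative, not from any published spec.
RESILIENCE_CONFIG = {
    "primary": "https://cdn-a.example.com",   # CDN-fronted URL, tried first
    "fallbacks": [
        "https://cdn-b.example.com",          # optional secondary CDN
        "https://origin.example.com",         # direct-to-origin, last resort
    ],
    "primary_timeout_ms": 500,                # aggressive timeout on the primary
    "cooldown_requests": 10,                  # skip primary after a failure
}

def endpoints_in_order(cfg):
    """Return the full endpoint chain: primary first, then fallbacks in order."""
    return [cfg["primary"], *cfg["fallbacks"]]
```

A client walks this chain top to bottom on each failure, so removing the secondary CDN entry degrades gracefully to plain primary-plus-origin behavior.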
Starting point is 00:36:23 That's awesome. I want to drill back down into something that you said earlier. When you were talking about AI systems handling partial failure poorly, can you describe what new failure modes emerge when inference runs at the edge or the CDN layer? Oh, yeah. So when AI inference runs at the edge, you get entirely new classes of failures that don't exist in centralized systems.
Starting point is 00:36:47 What makes them dangerous is that they are mostly silent. The systems appear to work, but correctness is compromised in invisible ways, and that is the thing that actually bites developers. For example, in centralized AI, deploying a new model is atomic. Everyone gets the same version instantly. At the edge, you are pushing model weights to thousands of nodes globally. Some update quickly, some slowly, some fail entirely.
Starting point is 00:37:13 That means in traditional centralized AI, everybody, the user in Tokyo and the user in London, gets the same version. Because if you fail, you fail for everybody. But in the edge model, a user in Tokyo might hit a node that is running model version 2.1, while a user in London might hit version 2.0. Same question, two different answers. Not from AI non-determinism, because we know AI is known for non-determinism, but here you are adding more to the non-determinism by literally running different models on different nodes.
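One pragmatic guard against this kind of version skew is a minimal sketch along these lines: each edge node reports the model version it served, and the client pins a minimum acceptable version. The header name and version scheme here are assumptions for illustration, not part of the discussed protocol:

```python
# Hypothetical guard against edge model-version skew: the client pins a
# minimum acceptable model version and rejects responses from nodes that
# report an older one. Header name and versioning are illustrative only.

def parse_version(v):
    # "2.1" -> (2, 1); tuples compare element-wise, so (2, 1) > (2, 0)
    return tuple(int(part) for part in v.split("."))

def accept_response(headers, min_version="2.1"):
    """Accept only if the serving node reports a new-enough model version."""
    served = headers.get("x-model-version")
    if served is None:
        return False  # node didn't report a version: treat as unknown, unsafe
    return parse_version(served) >= parse_version(min_version)
```

A rejected response could then be retried against another node or the origin, the same way the failover logic handles a timeout, so stale-model nodes are treated as just another kind of unhealthy endpoint.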
Starting point is 00:37:48 And the other thing is context loss, right? AI inference often needs data from multiple sources, like customer history or support tickets or account status or documentation. In centralized systems, when retrieval fails, the entire inference fails and you get an error. But at the edge, retrieval can be partial and silent. The edge node reaches the customer database, but not the support tickets. And the AI generates a response based on three out of five data sources. So the response looks reasonable, but it is wrong, because it is missing mission-critical context.
Starting point is 00:38:23 Because nowadays we all know how much context is important. Otherwise, we will be letting the AI hallucinate and run wild. And the third thing is the tool calls. Now we have so many things like MCP or agents in modern AI, right? Modern AI systems use tools: API calls, MCP calls, database queries, external actions. And when inference runs centrally, tool failures are very much visible. The model knows it failed and it can handle
Starting point is 00:38:48 it. But at the edge, tool calls go through unreliable infrastructure, with three bad outcomes: the model hallucinates and gives a made-up result instead of admitting the failure; the model skips validation and proceeds with unverified information; or the tool partially succeeds with truncated data and the model thinks it has complete information. That's what bites people. And the other thing is geographic non-determinism, right? The same prompt to the same model version produces wildly different results based on which edge node handles it. Not from AI randomness, as I said, but because different nodes have different cached contexts, different available tools, and different data freshness.
Starting point is 00:39:35 I've also been working in AI for the last several months, close to a year. And something I hate is that AI models are already non-deterministic. I don't want to add more fuel to that non-determinism. I want to keep them grounded as much as possible. And these edge failures actually add more and more fuel to the non-determinism aspect of the LLMs. And then there's also the security aspect, right? Attackers can exploit edge infra unreliability as an attack vector. Like, an attacker sends a prompt with fallback instructions saying,
Starting point is 00:40:09 hey, if you can't access the safety guidelines, just answer directly. On a healthy edge node, the AI fetches the guidelines and applies them. On a degraded node, where retrieval fails, the prompt injection activates. The AI follows the attacker's instructions instead of refusing the unsafe request. They're not exploiting the model, they're exploiting the infrastructure that is not reliable. And data residency violations, right? So, yeah, companies run inference per data residency requirements, like European data, where we have GDPR, right? When the edge fails, you fail over to a centralized origin.
Starting point is 00:40:41 You might violate those requirements. An EU user's data is now processed in a US data center, which is a compliance violation, right? So you are forced to choose: fail over and violate compliance, or don't fail over and violate SLAs. And this failure mode only emerges during the outage, when you are least prepared to handle it. And then there's the cache invalidation dilemma. Edge systems cache inference results for speed. But when a data source fails, when do you invalidate the cached results? You have inference results from five minutes ago,
Starting point is 00:41:11 when all the data was healthy. But now one source has failed, and a user sends the same prompt. Do you return the cached result based on the now-unavailable data? Or recompute with incomplete data, which gives a wrong answer to the user? The CDN Resilience Protocol addresses this by making failover fast enough, like one to two seconds, that these partial-failure states rarely occur. Instead of degrading for 15 to 60 minutes while DNS propagates, you fail over before tool calls time out
Starting point is 00:41:40 and before the context becomes stale. So yeah, for edge AI, resilience is not optional. It is a correctness requirement. Well, Venkata, I feel like you've given us a masterclass on CDNs and edge today. And it's such an important topic because edge is evolving so quickly. If you were advising infrastructure leaders designing systems today, what would you guide them on in terms of the mindset shift required to treat resilience as a
Starting point is 00:42:08 foundational requirement of their designs? Yes. Until now, the mindset was that resilience is something we add later. But now it has to switch: resilience is a day zero architectural requirement. So let me make it concrete with three specific shifts. Shift one: change from if it happens to when it happens. The old mindset is like, hey, our CDN has 99.99% uptime. We'll probably never see an outage.
Starting point is 00:42:34 If we do, we will deal with it then. That was the old mindset. But the new mindset is: CDN failure is inevitable. The question is not if, but when, and how ready are we whenever it happens? This changes how we make decisions. You don't ask, should we invest in resilience? You ask, how fast can we recover when the failure happens? Practically, this means testing failure scenarios in production-like environments.
Starting point is 00:42:58 If you have never simulated a CDN outage, and I'm pretty sure most companies have not, you have no idea if your failover mechanisms actually work. Most companies discover their runbooks don't work during the actual outage, when it is too late. Infrastructure leaders need to treat CDN resilience the same way they treat database replication or load balancer redundancy. In this matter, I actually appreciate Netflix. Netflix does something interesting: they intentionally break their own infrastructure just to see how they respond in case of an emergency or an outage. I really like that idea.
Starting point is 00:43:35 And they not only break at a regional level, they break at a global level as well. They want to see how prepared they are. So yeah, that kind of preparedness we need to have. And shift two is from infra-only to client-aware. The old mindset is, resilience is an infra problem; we will solve it with infra tools like multi-CDN, DNS failover, or health checks. But the new mindset is, resilience requires intelligence at every layer, especially at the client. Yeah, and this is uncomfortable
Starting point is 00:44:05 for many infra teams, because it means coordinating with application teams. In every company you have that one big infra team, and you also have one big product application team who is developing things, right? So you can't just solve this in your own Terraform configs. You need client libraries updated, mobile apps updated, API clients updated. But here is why the shift matters. Infra-only solutions have 15-to-60-minute recovery times because they are fighting against how DNS and the internet fundamentally work.
Starting point is 00:44:32 But client-aware solutions have one-to-two-second recovery times because they are working with the grain of the architecture. The question for leaders is: are you willing to accept 45-minute outages to avoid coordinating with application teams? When you frame it that way, the answer becomes obvious. And there is another shift, from enterprise-only to universal best practice. The old mindset: CDN resilience is only for companies with massive budgets, like Netflix, Amazon, Meta. Small companies just have to accept the risk. But the new mindset is, every company deserves affordable, practical resilience,
Starting point is 00:45:06 and open standards make that possible. That is why we have RFCs and open standards, to help all these companies have access to all these things. When you think like that, suddenly resilience is not something only enterprises can afford. It is something every startup gets for free by using standard tools. For infra leaders, this means two things. First, if you are designing a new system today, use tools and frameworks that support open resilience
Starting point is 00:45:32 standards. Don't build proprietary solutions that lock you in. And second, contribute to the standards. The more companies participate in defining how CDN resilience should work, the better the ecosystem becomes for everyone. And finally, there is the risk calculation, right? Leaders need
Starting point is 00:45:49 to reframe how they calculate the risk. The question is not, what does resilience cost? It's, what does 45 minutes of downtime cost? For e-commerce, that's hundreds of thousands in lost revenue. For SaaS models, that's blown SLAs and customer churn. And for critical infrastructure, that's reputation damage. And you have
Starting point is 00:46:08 no idea; for some companies, it even affects people's lives. For example, if your car broke down somewhere and you need to call to request service, and that website is down, then obviously you don't know what kind of conditions they are in, right? So you need to help them, the sooner the better. So when you compare the cost of implementing client-side failover, which for most companies is a few days of engineering work using open source tools, against the big amount of money lost in a single major outage, the ROI is obvious; it will be recovered. The mind shift is treating resiliency not as insurance against acknowledged, unlikely events, but as protection against inevitable events. CDN outages
Starting point is 00:46:46 will happen. The question is whether you are ready or not. I love it. I'm going to give you a bonus question. We just finished our predictions series on Tech Arena for 2026, and I want your own predictions. What do you think changes first: how CDNs are built, how inference is deployed at the edge, or how the industry standardizes resilience? I think standardization comes first, and it's already happening. The CDN resilience protocol RFC, which I'm submitting to the IETF, is part of it. And I'm pretty sure there are many people like me also working on it. So why does standardization lead, right?
Starting point is 00:47:25 So here is why. Standards move before architectural changes. They create pressure on CDN providers, platforms, and application developers simultaneously, without requiring either side to move first. Think about how HTTPS became universal. It was not because every website decided independently to add SSL. It was because browsers started marking HTTP sites as not secure,
Starting point is 00:47:47 which created pressure from the users. The standard came first, then browsers implemented it, then websites had to follow. So that's why I feel CDN resilience will follow a similar pattern. The adoption sequence is: standardize the protocol through the IETF, framework and library authors implement it, developers adopt it the same way, and CDN providers adapt their infra to support the standard.
Starting point is 00:48:12 And CDN architecture changes come later, I believe, as a secondary thing. CDN architectural changes come last because they require the most investment and also coordination. CDN providers won't restructure their infra until there is clear market demand, and that demand only materializes after client-side resilience becomes standard practice. But I do think we'll see evolution, like CDNs might federate for resilience. CDNs like Cloudflare and Fastly could establish fallback agreements where, if one fails, the other provides backup capacity, like how you can still make an emergency call when your AT&T coverage is not there.
Starting point is 00:48:49 You can still make a call through another network or wherever. So yeah, we will see better health check endpoints that expose fine-grained regional status. And you'll see graceful degradation, where CDNs can signal, hey, I'm degraded but not fully down, so clients can make informed decisions. And these changes take two to three years minimum because they involve core infra modifications, legal agreements between providers, and business model evaluations. And edge inference deployment adopts fastest. Edge inference deployments will adopt the fastest once client-side resilience is standardized.
Starting point is 00:49:22 Because this is like greenfield territory. Companies are just starting to deploy AI at the edge, so they can design for resilience from day one. And what this means for the industry, the broader impact, is that resiliency becomes invisible infrastructure. Just like you don't think about TCP retransmission today, it just works, you won't think about CDN failures tomorrow, because once we have standardized,
Starting point is 00:49:45 it will just work. And it will be built into every HTTP library, every framework, and every platform. That's the power of standardization, right? Once it's a protocol that everyone implements, it becomes ambient. So to directly answer the question: standardization changes first, in my understanding, and then client frameworks follow in the next one to one and a half years. Then edge inference adopts within the next couple of years. CDN architecture evolves within the next three to four years. But it all starts with the standard first.
Starting point is 00:50:14 That makes a lot of sense. And we're going to hold you to it. I'd love to have you back on the show at some point to talk about the standard's progress and see how the world of CDNs is evolving. I'm sure that folks who are listening will want
Starting point is 00:50:34 to learn more from you and engage on the standard. Where can they connect with you? Oh, yeah. So the CDN resilience protocol is going to launch as open source very soon. The GitHub repository will include the complete client library implementations, technical specifications, working examples, and also the documentation. Everything teams need to start implementing client-side failover.
Starting point is 00:50:50 And I'm also going to publish a comprehensive technical article on Medium that goes deep into both the problem and the solution, which I'm hoping to put live in the next couple of weeks. And for the standardization path, I'm preparing an RFC submission to the IETF and planning speaking engagements with the local tech chapters. I work with the local SF Bay ACM chapter and I'm also part of the ICCI Silicon Valley chapter. And I'm going to present this idea to seek key feedback from the tech people, to see
Starting point is 00:51:18 if there is anything we need to modify, and how the tech community is going to take it. Yeah. And the best way to follow the launch is through my updates on my LinkedIn profile, where I'll be sharing when the GitHub repository goes live and when the Medium article publishes. If I have a plan to talk at one of the local chapters, that will be there too. Well, Venkata, thank you so much for the time today. It was awesome talking to you. I learned a ton, and I'm sure our audience did too. Thanks so much. Thank you so much, Allyson. Thanks for having me. Thanks for joining Tech Arena. Subscribe and engage at our website,
Starting point is 00:51:50 techorina.a. All content is copyright by Techarena.
