In The Arena by TechArena - The Rising Demands on Cloud Infrastructure with Rebecca Weekly

Episode Date: December 13, 2022

TechArena host chats with Cloudflare infrastructure VP and Open Compute Board Chair Rebecca Weekly about the rising demands on cloud infrastructure across performance, design modularity, and sustainability.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allison Klein. Now, let's step into the arena. Welcome to the Tech Arena. Today we are continuing on our journey of cloud computing entering 2023. And my guest today is somebody that needs no introduction in the cloud space. She was just named one of Business Insider's Cloudverse 100 as a builder of the next generation of the internet. She's also the VP of Hardware Systems Engineering at Cloudflare. Welcome to the program, Rebecca Weekly. Thank you, Allison.
Starting point is 00:00:57 It's so good to see you. So Rebecca, you recently made a big change in your career, moving from the land of infrastructure into managing oversight of infrastructure at one of the largest cloud service providers. Tell me about that transition and what it's like to be so close to the customer at Cloudflare. Well, I think honestly, that was the learning part of the journey that motivated me. You know, I've spent my whole career in semiconductors, in EDA tooling for semiconductor development, you know, very much in the nitty-gritty and, you know, had the wonderful opportunity within Intel to work on the systems that we build for cloud service providers holistically, kind of across compute and storage and networking solutions.
Starting point is 00:01:46 And as much as I felt I could learn and grow in that domain, I felt like I could only go so far. There's just certain things that you will learn as you operate something that help you understand what needs to happen in the products underneath what you're doing, versus being underneath and trying to sort of orient up to, yes, they would want to use it in this way. Some people can do that. And maybe, you know, if I had my crystal ball a little more polished, I would have been that much better at it coming from the direction of bottoms-up silicon to
Starting point is 00:02:26 systems. But when you are able to work and call up, you know, the director of the Radar team or, you know, one of the other wonderful products and services that we have and say, okay, what experience problems are you having? Their words are going to be something like, well, it seems really slow. But slow to a software person could be a million different things to a hardware person, many of which aren't actually even hardware, right? But what is hardware could be the network design, you know, the actual bandwidth of the network, redundancy factors, factors relating obviously to the performance of the computing element and the bandwidth into it. So there's like 50,000 things that slow could mean, from one small piece of an integrated compute device, whatever it might be, that I see that somebody's working on. It's just, it's an impossibly difficult problem
Starting point is 00:03:30 to model effectively when you're actually working with customers who have millions of CPUs operating at scale. You know, in our case, over 275 different cities that we're operating in. That globally distributed network has all sorts of effects that aren't going to show up in a node-level test on a, you know, SPECint rate. Apples and oranges.
Starting point is 00:03:58 So in this, you know, you've described, you know, what is slow and trying to root cause that. Are there a few things as you've made this transition that were like the big aha moments for you of, like, oh my gosh, I never even thought about that? Or, oh my gosh, this is something that I need to feed back to the industry that we need to work on together? I probably have those every week, at least in the first six months, I'm pretty sure I have them every week, and I will do a poor job summarizing all of them. But many aspects of supply are always underestimated in terms of the impact that they have on folks. There are many times where we've evaluated something. It looks really good on paper,
Starting point is 00:04:47 in lab, and when push comes to shove, we end up not actually deploying it because we can't really get it in the kind of scale we would need. Or the timeline to actually get that device through our supply chain is nine months, three quarters, based on lead times of various components, et cetera. So supply capability, and not just the individual pieces of silicon but the systems associated with them, is such a problem. And lots of people, like vendors all across that we work with, are like, but look, my benchmark looks so good. Don't you want it tomorrow? Why aren't you buying tomorrow? And it's like, let me talk you through what it takes for us to actually acquire your CPU or your XPU, which is even harder, by the way, than a CPU to actually get availability of. And the list goes on. I mean, it may be QLC in the drive, like lead times for such things, DDR5, the lead times.
Starting point is 00:05:54 This is not an industry where everyone's left edge of readiness aligns to any one vendor. And so there's a lot of moving parts in the world of hardware that really do mean, yeah, performance matters. It really matters. But you're going to end up making choices with the systems you've got if you're not working with partners who know how to deal with compliance issues, the ability to ship to different locations, all these other factors that I lump into supply that are really impactful when operating a global service. So that's definitely one big bucket. Performance matters, but the customer matters most and you need to service the customer. And zero supply means you get a zero on performance. Sorry, I hate to say this to you, but you can have the best thing ever.
Starting point is 00:06:53 And if I can't buy it, it doesn't actually exist. So that certainly, I've had that conversation so much. And I think because there's so many new companies and newer players in the industry right now, which is so great. I mean, it is a golden age of, you know, silicon innovation that is happening, but they may have amazing architects coming up with really amazing pieces of silicon and didn't invest in the humans that are required to manage their supply chain, to manage their assets, to forecast correctly, to handle logistics and shipping. These are real things that a mature company has had to think through that some of our newer players don't know yet. That's expected, but it's a big deal. The other thing that I also am seeing with a lot of our partners is software readiness in the ecosystem, for similar reasons.
Starting point is 00:07:50 This shows up as kind of two different flavors just to roughly bucket them. One is in the domain of building their own because it's awesome and amazing and because it looks so good on this thing. And of course, you're going to want to use it. And that's insane because, you know, we're a company that's been around for 12 years. We have huge investments, many of them off of open source projects, but some of them off of our own internal development or have branched, you know, quite a long time ago.
Starting point is 00:08:22 And they're not actually fully aligned to what is currently in an open source project. And we have security concerns, we operate in terms of compliance, there's FIPS certification on certain libraries. Anytime you start diverging from either the open source community, or at least something where we have inspection capabilities for security reasons, into something that's all your own, like, you're never going to sell to me. It's never going to happen. I mean, I shouldn't say never, never say never.
Starting point is 00:09:00 But Allison, you put up a barrier to entry where you better be NVIDIA. You better be that much more performant that you're forcing people to use things like CUDA. And if there's any way in the world that I can use TensorFlow or PyTorch or anything else with a community behind it, I will, because being so locked in to one solution is illogical, won't pass my security requirements, doesn't have the benefit of a community looking at it, inspecting it, banging on it. Like, I get it. It's easier if you control it, but it really means I can't ever use you, more or less. All right. So, good challenges that you're setting this interview up with. You're also a member and deeply involved in OCP. And, you know, you talked about open source. What is the role of standard configurations in addressing the former challenge? And can standard configurations help with supply?
Starting point is 00:10:21 Absolutely. Absolutely. Something we spend a lot of time talking about in Open Compute Project is that open software is not the same as open hardware. And there's a reason for that, right? Open source software projects are on a GitHub repo. Everyone can access, everyone can contribute. And, you know, it's fully inspectable. You're able to kind of engage in that. And then people just say, hey, I certified that version of the Linux kernel versus this one. And, you know, it works in that fashion. With hardware, you own an asset at the end of this process, and somebody has to produce that asset for you, whether it's a server, whether it's, you know, a NIC card, whether it's, there's a combination
Starting point is 00:11:05 usually of several integrated chips and then several different components coming together. And whether you're using an ODM or an OEM, somebody has to develop firmware, BIOS solutions, et cetera, on top of that box to give you a finished functional part. And that's true whether it's, you know, a server or whether it's a white box. So what we do within Open Compute is drive standards to help ensure that as many components as possible can be interoperable across different vendors. It doesn't mean that the net end server is an open source thing that anybody could build and anything could happen, you know, in the same way that an open source project would be. People's IP for their integrated
Starting point is 00:11:53 chips, for their server design is still their own IP, and they have that right. But we create open standards around the interconnection points so that we can ensure that if you buy a DC-SCM or an OCP 3.0 NIC card from any number of vendors, they're going to plug into the PCIe slot the exact same way that anybody else's OCP 3.0 card plugs in, and you can get different form factors of a card compliant to that specific design so that you can accommodate, you know, a half-width board, 1U, 2U, whatever, whatever's right for your specific environment. So it is a little different. We do that through a sort of process of contributions and defining the subset that will make interoperability possible, but still enabling people to own their own IP. And honestly, I believe that is important
Starting point is 00:12:53 for three reasons. You know, one, as a consumer, I need to know, to your supply comment and question earlier, I need to know that if I choose to validate a NIC, I can get another supplier of that NIC if my current one isn't available. And ideally, I mean, I'm still going to test it before I put a new vendor's version of it out there. But in general, it's going to be a very short turnaround time to validate a different version of an OCP 3.0 NIC because they're all compliant to a standard. So that helps with sourcing capabilities as a user. It helps with supply assurance and it helps with validation timelines, because everything is faster when I know it's going to plug into my box. Now, even from an ecosystem perspective, I would argue this is better for IT ecosystem providers because they're not making big investments just to move a NIC card around, right? And every person's environment often is different. Some people like to have their management network, you know, directly on their NIC card with a different one-gig network interface versus their standard interface for their overall, you know, consumer network. Some people like to
Starting point is 00:14:12 have a separate switch itself, you know, at the ToR level. Everyone's different in how they want to do their systems design for their reliability concerns and challenges. If you can standardize subcomponents so things are more aligned, then we get a much easier process as developers and manufacturers, ODMs and OEMs, in producing widgets that people can adopt quickly and not spend a lot of time in redesign. And then the last community that we serve are obviously silicon providers as well. And for those folks, knowing that new players can come into the market, use a form factor that's been standardized, be adopted, and have a supply chain ready for them faster is a real advantage as well.
Starting point is 00:15:05 So ideally, Open Compute helps in all of those ways when it's a healthy, vibrant community. It doesn't always work. I believe OCP is an amazing organization for this. And the last thing I would add about that is it's not just about supply assurance. It's also about sustainability, because, especially as we're seeing more bifurcation of options in the market, if I have to do a new full motherboard design with every new component locked in for every CPU vendor, every DPU, IPU, XPU, it's prohibitively expensive and it's horrible for the environment. So if I can take
Starting point is 00:15:47 standard building blocks, try things in my lab, swap them out, be able to do the right thing, validate if this actually makes sense for us before going big and doing this big server design, that makes it faster to adopt new technology, more interesting for new companies to get into, and means a lot less e-waste in the world. I mean, about 70% of a server doesn't need to be redesigned gen on gen to get performance efficiencies. And that's 70% of e-waste that can, through modularity kinds of initiatives, not happen. I mean, we're going to have to build new servers for capacity needs, but there's no reason why we can't get to a mindset where we swap the CPU and the memory
Starting point is 00:16:34 and everything else keeps going for 10 years, 15 years, instead of regenerating those fans and the CPLD and the BMC, every single generation. Why? What's really changed in those devices? That's a great point. Yeah, that's a great point. And so poignant. And it kind of gets me into my next question. Sustainability has become a bigger issue. And not that it wasn't always an issue, but rising energy prices have put it, you know, at a 10x of focus. What have you seen in terms of the shift in your attention to performance efficiency? Has there been a shift or were you already there?
Starting point is 00:17:22 And are customers asking about performance efficiency as well? So I would argue that given that we've run a global network where we've been subject to the fun challenges in terms of pricing in all sorts of different global markets, we've always been very sensitive to performance efficiency. We were experimenting and testing the Amberwing solution, which was one of the very first arm-based solutions back in 2015, to see if there might be alternate options out there that could be more efficient. So Cloudflare has a strong history in trying to make sure we are being as efficient as possible in serving the internet. And that includes looking at accelerator solutions, looking at all sorts of options.
Starting point is 00:18:13 Just because we've always been exposed. You know, we care about the eyeball responsiveness. And so you're going to build in places that do not have a good PUE, you know, like high humidity, expensive, you know, expensive power sources all the time. Now everybody's on this bandwagon, which is actually in some ways very nice because it means there's so much more, you know, focus and interoperability and core capacity out there. I think what's really changed that I've seen has been actually in the software ecosystem, because the number one metric for
Starting point is 00:18:54 software developers, you know, for the vast majority of time I've been on this earth, has been efficiency of development time, the agility of the development team. And I'm starting to see in the software ecosystem people trying to figure out, is this an efficient way of doing it? Are we being logical? And that's a huge shift. I mean, that's not fair. Everyone thought that way in the sixties, but it was because they had no actual, like, memory to use and they had to optimize all of their time. But in normal programming, it's been about developer agility. And now people are starting to really look at the GHG Protocol. They'll tell you, you know, of the 60 tons of carbon embodied in a server, 90% of the carbon emissions associated with a server are from the operation of it. So we can get as good as we want to in the supply chain, in the reuse, in the recycling practices, and it's still going to only move the needle on 10%.
Starting point is 00:20:07 The rest is the operations, which, again, we can do some great things with different architectures, but if we have crappy code on there that is sitting and, you know, I don't know, interrupting the processor 24/7, like, this is not going to be efficient for serving its overall objectives. And so I do believe this is a problem that, as much as hardware is going to work on it, I am most excited to see the software teams getting excited about it and working on it, because that's where we're really going to move the needle. I'm interested in this efficient code. The last conversation I had on cloud was with Abby Kearns, former CTO of Puppet, and we were talking about the state of app stack automation and the complexity that we've created with cloud. With the number of workloads and the complexity of the stack, do you see us making progress in efficiency just from a standpoint of the cloud stack and actual allocation of workloads?
Starting point is 00:21:16 So the biggest factor I've seen in increasing the efficiency of a server is containerization, right? Virtualization, containerization, just upping the number of users on a system, given that multi-core architectures exist. I don't think we've hit some tipping point on complexity with respect to containerization or virtualization. Organizations who provide packaged services in this domain, or open source projects in this domain, are incredibly successful. And I tend to think about problems in the Pareto rule of 80-20. So if I can get 80% more usage, it's not 80% more, but the average single-tenant server is usually about 10% occupied, and one that is supporting containerization or even virtualization has the opportunity to be closer to 45 to 65% load. So that's a huge improvement. A big part of reducing that 90% number is just to adopt containerization.
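[Editor's note: to make that consolidation arithmetic concrete, here is a minimal back-of-the-envelope sketch using the utilization figures from the conversation (roughly 10% occupancy for a single-tenant box versus about 50% after containerization or virtualization). The demand and core counts below are made-up illustrative values, not Cloudflare data.]

```python
import math

def servers_needed(demand_cores: float, cores_per_server: int, utilization: float) -> int:
    """How many servers it takes to host a given amount of work at a target utilization."""
    return math.ceil(demand_cores / (cores_per_server * utilization))

demand = 10_000   # core-equivalents of work to host (illustrative)
cores = 64        # cores per server (illustrative)

single_tenant = servers_needed(demand, cores, 0.10)  # ~10% occupied, one tenant per box
consolidated = servers_needed(demand, cores, 0.50)   # containerized/virtualized, ~45-65% load

print(f"single-tenant servers: {single_tenant}")     # 1563
print(f"consolidated servers:  {consolidated}")      # 313
print(f"reduction:             {single_tenant / consolidated:.1f}x")
# Fewer machines doing the same work shrinks the ~90% operational slice of
# lifecycle emissions, and it also means less embodied carbon sitting in the fleet.
```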
Starting point is 00:22:24 It is complex, and container management solutions are too. Ask anybody who's operated a Kubernetes cluster, which I don't do personally, but I know the gentleman who helps run the team, and it's a lot of work. It's a lot of work to do distributed systems at scale and make sure that they stay up and are consistent and all of those factors. So it is a hard problem, but I don't think the easier, lower-hanging-fruit parts of it are unsolved. I think there's really good technology happening in that domain. The specifics of writing more efficient code, I think this is one of those areas where we're going to have to start with a mindset shift for developers. And there's this great book called Nudge, and it really talks about how, when you show people data, they start to make changes, whereas enforcing choices doesn't work very well. And so I think one of
Starting point is 00:23:30 the most important things we can do as engineering management, as, you know, leaders in the industry is to show people data about the consumption of their processes, to show people data about the carbon footprint of the choices they're making, both consumers, by the way, as well as developers. Like if I am using a service and it told me, oh my gosh, you're doing this in high def and you are taking 10 times the computing power and therefore 10 times the emissions footprint as if you were watching this in standard def, I might choose to go to standard. It might work better anyway, since I'm probably on my cell phone, you know, on a treadmill.
Starting point is 00:24:12 So it's actually not a bad thing to give consumers those choices. And similarly, I think if we give developers better tools, you know, there's some great tools out there. I think Arjan van de Ven wrote PowerTOP and contributed that into the open source ecosystem. There's a bunch of really fantastic tools that the community is starting to put out there. And if we build those into our development pipelines and help ensure that our developers can see that, people's own desire to do better for the world will help.
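[Editor's note: as one hedged illustration of the kind of signal a development pipeline could surface, many Intel-based Linux hosts expose cumulative package energy counters through the powercap (RAPL) sysfs interface, the same counters tools like PowerTOP read. The sketch below assumes that interface is present and readable on the machine (recent kernels often restrict it to root), and the workload being measured is just a stand-in for whatever job you actually care about.]

```python
import time
from pathlib import Path

# RAPL package-0 counters on a typical Intel Linux host (assumed paths).
ENERGY = Path("/sys/class/powercap/intel-rapl:0/energy_uj")
MAX_RANGE = Path("/sys/class/powercap/intel-rapl:0/max_energy_range_uj")

def read_uj(path: Path) -> int:
    return int(path.read_text().strip())

def measure(fn, *args, **kwargs):
    """Run fn and report wall time plus package energy used (whole socket, all processes)."""
    wrap = read_uj(MAX_RANGE)
    e0, t0 = read_uj(ENERGY), time.monotonic()
    result = fn(*args, **kwargs)
    e1, t1 = read_uj(ENERGY), time.monotonic()
    joules = ((e1 - e0) % wrap) / 1e6      # modulo handles counter wraparound
    seconds = t1 - t0
    print(f"{fn.__name__}: {joules:.1f} J in {seconds:.1f} s "
          f"(~{joules / max(seconds, 1e-9):.1f} W average package power)")
    return result

# Stand-in workload; in CI you would wrap the real build or test step instead.
measure(sum, range(50_000_000))
```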
Starting point is 00:24:54 I mean, you know, that is a positive Pollyanna, totally Rebecca statement, but it is so true in my worldview. Like, everyone is good. Most people are good. And most of us want to make the right choices for each other. So I think there's a nudge that's worth making towards helping make sure we're exposing that data. And I think this is a challenge, because a lot of cloud providers want to have a magical experience and don't want you to have to think at all about the hardware, us included. Trust me, if a customer has to talk to Rebecca, something went really wrong, really wrong. But there are options to expose the impact of choices, and I think you're starting to see that. And maybe you'd be willing to spend a little bit more on that cloud instance to know that the power source behind it is 100% green and that it's using a processor that has, you know, 100% water recycling.
Starting point is 00:25:49 Or, like, I think maybe people can be a little bit better if we start to show them, instead of just taking all those decisions away. I'm going to go to that small part of the population that is not good for a second. Cloudflare published a paper this week on DDoS attacks, and I was reading it, and it brought to my attention that this is an area which is evolving by the minute, and a race between the industry, trying to protect data and protect customers, and bad actors looking to exploit situations. Where do you think we're at? And what role does infrastructure, and a root of trust, have to play in terms of security? Oh, there are so many answers. Okay. Where do I think we are in the world? You're absolutely right. You know, we published some great work during the beginning of the conflict, you know, between Russia and Ukraine, about some of the ways in which the infrastructure itself can show you changes in not just DDoS situations, right, but even just changes in upstreaming and data content and data access and potentially what that means. The state of the internet is that it is a globally distributed system and physical
Starting point is 00:27:14 access, therefore, of where you send your data goes through lots of places you may not feel comfortable with. I think governments are taking actions to try and create regulatory environments that keep user data more protected. I think obviously companies have a responsibility to take actions either through leveraging services that control and support a secure access edge or through companies like us who work to have fully encrypted servers to make sure that we have disaggregated root of trust, that we are signing all of our certificates, that everything is, you know, there's a lot of best practices in security and nothing is 100% secure. And I think the goal is to layer so many parts of that cake that you are not the easy
Starting point is 00:28:09 pickings out there in the world. And to recognize that, you know, this is a very complicated problem because of the nature of the internet. And we should use our brains and question things and be smart consumers and think through, have we really, you know, created a situation here that's logical? Maybe that's a different podcast and a different conversation, Allison. But in general, you know, absolutely, hardware has a huge burden in this. There's a lot of solutions coming out to make that faster. But even if you don't have it in the hardware, there are options, and have been options, in software for a long time in terms of, you know, network security, whether it's VPN or some form of SASE intercept. Whether it's hardened by hardware or not, even just a software set of good practices is a hundred percent better. My favorite case of this, you know, in the news in the last, whatever, six months is, you know, two-factor authentication through a true FIDO security key.
Starting point is 00:29:33 Like, how many companies got exploited because they thought two-factor auth was good enough? And no, really, it is a lot harder to spoof a hardened security key. And I don't care how many times you have something text you a different code. Yes, that's better defense in depth, but it is not as good as having a hardened, in that case, security key.
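[Editor's note: for context on why a texted or app-generated code is the weaker factor, here is a minimal sketch of how a standard time-based one-time password (in the RFC 6238 style) is derived. This is an illustrative implementation, not anything from the conversation. The point it demonstrates is that the code is only a function of a shared secret and the clock, so a phishing page that captures it can relay it within the validity window, whereas a FIDO security key signs a challenge bound to the real site's origin with a private key that never leaves the hardware.]

```python
import hashlib, hmac, struct, time

def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password in the RFC 6238 style (HMAC-SHA1 variant)."""
    counter = int((time.time() if at is None else at) // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                   # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)

# Nothing in the six digits ties them to the legitimate website: anyone who
# tricks a user into typing the code on a look-alike page can forward it to
# the real login within the same 30-second window. A FIDO assertion, by
# contrast, is a signature over a challenge that includes the origin, so a
# relayed login from the wrong domain fails verification.
print(totp(b"secret-provisioned-at-enrollment"))
```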
Starting point is 00:30:12 And we continue to see this. It's like good, better, best. We start with software. We start with encryption. We start with a layered model. We start with a lack-of-trust model where people have to ensure that they are compliant with the user they are and the behavior patterns of users that are like them and have those access patterns. And we spend a lot of time on that. I mean, I in no way want to downplay the importance of the incredible services built 100% in software that help ensure people are actually following best security practices. But as we layer in hardened security behind that, that is harder to spoof. I won't say impossible to spoof. Everyone who's read the CVEs out there knows that nothing's impossible, but it just makes it that much harder to break the encryption, to break the security model, to break the key schema. And I think that's what we're all trying to do. Disaggregated root of trust is not because people haven't had some sort of a root of trust concept, but if that has been commingled with your BMC, you're in a situation where, if the BMC has been violated, which, go read the CVE database, unfortunately, this is not an uncommon
Starting point is 00:31:14 situation. You've been trusting an entity that is not trustable, not trustworthy. So that's really where a disaggregated root of trust in your attestation chain, as you are trying to make sure your keys are, you know, accurate, makes sense.
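[Editor's note: as a purely conceptual sketch of what an attestation chain does, not Cloudflare's design and glossing over all the hardware specifics: each boot stage is folded into a running measurement, much like a TPM PCR extend, and a verifier that holds its own expected value decides whether to trust the host. The stage names and the "golden" value below are placeholders.]

```python
import hashlib

def extend(measurement: bytes, component: bytes) -> bytes:
    """TPM-PCR-style extend: fold a component's hash into the running measurement."""
    return hashlib.sha256(measurement + hashlib.sha256(component).digest()).digest()

# Placeholder blobs standing in for the real firmware images measured at boot.
boot_chain = [b"rom-bootloader", b"bmc-firmware", b"bios", b"os-kernel"]

measurement = b"\x00" * 32
for stage in boot_chain:
    measurement = extend(measurement, stage)

# In a real system the expected value is provisioned out-of-band and the
# comparison happens in a separate, hardened root of trust. If the verifier
# lived inside the BMC and the BMC firmware were compromised, it could simply
# lie about the result, which is why keeping the root of trust disaggregated matters.
golden = measurement  # reused here only so the sketch runs end to end
print("attestation passed, release keys" if measurement == golden else "quarantine host")
```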
Starting point is 00:32:18 But there's no one panacea. I'm just, you know, seeing it every day, going through and trying to reduce the issues every single day. This is an area where we will constantly need to be innovating. We will constantly need to be working as a community to create better solutions. And I think it is an incredible time, because all the biggest players in the industry are working on this. And most of them are actually driving standards into open systems, into the Open Compute Project. So, you know, major projects were announced at our last Global Summit, Hydra being, you know, one implementation. I mean, obviously everything that has happened with OpenTitan, but all of the work that has happened in that domain for servers specifically is really exciting for the industry. And I think it really is showing, like, security is a differentiator, but with system-level security, you're only as good as your weakest link. And so if we don't bring up the whole industry, we're in trouble. And I think that's a huge amount of leadership from Google, from Microsoft, stepping up to say, hey, we're only as strong as our weakest link.
Starting point is 00:32:56 Let's make sure the industry is better. That's fantastic. One final question for you. We're heading into 2023. What are the exciting things that we can expect from the Cloudflare team? And what are you most excited to see from the industry next year? You know, there's so much data and insight in running a global network, specifically targeting, like, reducing DDoS attacks, making sure that the internet is more secure. I look forward to seeing our teams take the mic and talk a little bit more about threat intelligence and all the different ways in which, you know, we can help consumers be smarter about that. Won't be my team at all,
Starting point is 00:33:43 but I just, for the sake of the world, so that people can understand more, I thought the papers and work that we did around Ukraine were incredibly powerful. And I really look forward to seeing the team expand that work, because it's some of the stuff that inspires me most every day on building a better internet. For my team, we are building our next generation of modular server, which is super exciting. Again, for all the e-waste conversation we had earlier. So that's been a lot.
Starting point is 00:34:13 We are working actively on white box switches and solutions to both have more inspection and capability through using best-in-class server design techniques in the networking domain, as well as having the network sort of enable us to build what we want to do. So I'm really excited about that. I mean, I'm totally geeking out about hardware stuff. Awesome.
Starting point is 00:34:35 I don't know if that's all good. And the accelerator ecosystem continues to evolve, continues to be interesting. So I have at least three different ASICs, varieties of ASICs, actually, I have at least two vendors for most of them in lab right now that we're starting to experiment on to improve, you know, the accuracy of time pools for, you know, running a global network to increase our efficacy in serving machine learning and analytics. So, so many different domains. And obviously, you know, one of our newest services that launched this year was R2. And as R2 continues to scale, you know, going from being a global distributed network to being a computational network to actually deliver object storage on top
Starting point is 00:35:28 of that, there's so much transformation that is happening in our footprint, in our builds, in, you know, durability and latency requirements, in end-user services. So it's been an incredible learning journey with that team to date, and I just, you know, am ecstatic to continue to build that, to, you know, make it better, stronger, faster for our users. That's fantastic. Thank you so much for being on the program today, Rebecca. It's always a great chat. Thank you for having me. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net.
Starting point is 00:36:08 All content is copyright by the Tech Arena.
