In The Arena by TechArena - The Rising Demands on Cloud Infrastructure with Rebecca Weekly
Episode Date: December 13, 2022
TechArena host chats with Cloudflare infrastructure VP and Open Compute Board Chair Rebecca Weekly about the rising demands on cloud infrastructure across performance, design modularity, and sustainability.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allison Klein.
Now, let's step into the arena.
Welcome to the Tech Arena. Today we are continuing on our journey of cloud computing entering 2023. And my guest today is somebody who needs no introduction in the cloud space.
She was just named one of Business Insider's Cloudverse 100 as a builder of the next generation
of the internet.
She's also the VP of Hardware Systems Engineering at Cloudflare.
Welcome to the program, Rebecca Weekly.
Thank you, Allison.
It's so good to see you.
So Rebecca, you recently made a big change in your career, moving from the land of infrastructure
into managing oversight of infrastructure at one of the largest cloud service providers.
Tell me about that transition and what it's like to be so close to the customer at Cloudflare.
Well, I think honestly, that was the learning part of the journey that motivated me. You know, I've spent my whole career
in semiconductors, in EDA tooling for semiconductor development, you know, very much in the nitty
gritty and, you know, had the wonderful opportunity within Intel to work on the systems that we build
for cloud service providers holistically, kind of across compute and storage and networking solutions.
And as much as I felt I could learn and grow in that domain,
I felt like I could only go so far.
There's just certain things that you will learn as you operate something
that helps you understand what needs to happen in the products
underneath what you're doing
versus being underneath and trying to sort of orient up to, yes, they would want to use it
in this way. Some people can do that. And maybe, you know, if I had my crystal ball a little
more polished, I would have been that much better at it coming from the direction of bottom-up, silicon-to-systems.
But when you are able to work and call up, you know, the director of the Radar team or,
you know, one of the other wonderful products and services that we have and say, okay, what
experience problems are you having? And their words are going to be something like,
well, it seems really slow, but slow to a software person could be a million different
things to a hardware person, many of which aren't actually even hardware, right? But what is
hardware could be the network design, you know, the actual bandwidth of the network, redundancy factors, factors relating obviously to the performance of the computing element and the bandwidth in too.
So there's like 50,000 things that slow could mean, more than I could ever see from one small piece, an integrated compute device, whatever it might be that somebody's working on.
It's just, it's an impossibly difficult problem
to model effectively
when you're actually working with customers
who have millions of CPUs operating at scale.
You know, in our case,
over 275 different cities that we're operating in.
That global distributed network has all sorts of effects
that aren't going to show up in a node-level test on, you know, SPECint data.
Apples and oranges.
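To make concrete how many layers "slow" can hide in, here is a minimal, hypothetical sketch (not Cloudflare tooling; the target host is a placeholder) that times the phases of a single HTTPS request separately. Even this node-level view splits "slow" into DNS, TCP, TLS, and time-to-first-byte, before any of the fleet-scale effects described above:

```python
import socket
import ssl
import time

HOST = "example.com"  # placeholder target, not a real endpoint under test

def timed(label, fn, results):
    """Run fn(), record its elapsed time in milliseconds, return its result."""
    start = time.perf_counter()
    out = fn()
    results[label] = (time.perf_counter() - start) * 1000
    return out

results = {}

# 1. DNS resolution
ip = timed("dns_ms", lambda: socket.getaddrinfo(HOST, 443)[0][4][0], results)

# 2. TCP connect
sock = timed("tcp_ms", lambda: socket.create_connection((ip, 443), timeout=5), results)

# 3. TLS handshake (SNI and certificate checks still use the hostname)
ctx = ssl.create_default_context()
tls = timed("tls_ms", lambda: ctx.wrap_socket(sock, server_hostname=HOST), results)

# 4. Send a request and wait for the first response byte
tls.sendall(f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n".encode())
timed("ttfb_ms", lambda: tls.recv(1), results)
tls.close()

for phase, ms in results.items():
    print(f"{phase:8s} {ms:8.2f} ms")
```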
So in this, you know, you've described, you know, what is slow and trying to root cause that.
Are there a few things as you've made this transition that were like the big aha moments
for you of like, oh my gosh, I never even thought about that. Or, oh my gosh, this is something that
I need to feed back to the industry that we need to work on together. I probably have those every
week, at least in the first six months. I'm pretty sure I have them every week, and I will do a poor job summarizing all of them. But many
aspects of supply are always underestimated in terms of the impact they have on folks.
There are many times where we've evaluated something.
It looks really good on paper,
in lab, and when push comes to shove, we end up not actually deploying it because we can't really
get it at the kind of scale we would need. Or the timeline to actually get that device through our supply chain is nine months, three quarters, based on
lead times of various components, et cetera. So supply capability and not just the individual
pieces of silicon, but the systems associated with them is such a problem. And lots of people, like vendors all across that we work with are like,
but look, my benchmark looks so good. Don't you want it tomorrow? Why aren't you buying tomorrow?
And it's like, let me talk you through what it takes for us to actually acquire your CPU
or your XPU, which is even harder, by the way, than a CPU to actually get availability of. And the list goes on.
I mean, it may be QLC and the drive, like, lead times for such things; DDR5, the lead times.
This is not an industry where everyone's left edge of readiness aligns to any one vendor. And so there's a lot of moving parts in the world of hardware
that really do mean, yeah, performance matters. It really matters. But you're going to end up
making choices with the systems you've got if you're not working with partners who know how
to deal with compliance issues, the ability to ship to different locations,
all these other factors that I lump into supply that are really impactful operating a global
service. So that's definitely one big bucket. Performance matters, but the customer matters
most and you need to service the customer. And zero supply means you get a zero on performance.
Sorry, I hate to say this to you, but you can have the best thing ever.
And if I can't buy it, it doesn't actually exist.
So that certainly, I've had that conversation so much.
And I think because there's so many new companies and newer players in the industry right
now, which is so great. I mean, it is a golden age of, you know, silicon innovation that is
happening, but they may have amazing architects coming up with really amazing pieces of silicon
and didn't invest in the humans that are required to manage their supply chain, to manage their assets, to forecast correctly, to handle logistics
and shipping. These are real things that a mature company has had to think through that some of our
newer players don't know yet. That's expected, but it's a big deal. The other thing that I also am seeing with a lot of our partners is software readiness in the ecosystem for similar reasons.
This shows up as kind of two different flavors just to roughly bucket them.
One is in the domain of building their own because it's awesome and amazing and because it looks so good on this thing.
And of course, you're going to want to use it.
And that's insane because, you know,
we're a company that's been around for 12 years.
We have huge investments, many of them off of open source projects,
but some of them off of our own internal development or have branched,
you know, quite a long time ago.
And they're not actually fully aligned to what is currently
in an open source project. And we have security concerns; we operate in terms of compliance;
there's FIPS certification on certain libraries. Anytime you start diverging from either the open
source community, or at least from something where we have inspection capabilities for security reasons,
into something that's all your own,
like, you're never going to sell to me.
It's never going to happen.
I mean, I shouldn't say never, never say never.
But Allison, you put up a barrier to entry where you better be NVIDIA.
You better be that much more performant that you're forcing
people to use things like CUDA. And if there's any way in the world that I can use TensorFlow
or PyTorch or anything else with a community behind it, I will, because being so locked in to one solution is illogical, won't pass my security requirements, doesn't have the benefit of a community looking at it, inspecting it, banging on it. Like, I get it. It's easier if you control it, but it really
means I can't ever use you, more or less. All right. So good challenges that you're
setting this interview up with. You're also a member and deeply involved in OCP. And what is, you know,
you talked about open source. What is the role of standard configurations in addressing
the former challenge? And can standard configurations help with supply?
Absolutely. Absolutely. Something we spend a lot of time talking about
in Open Compute Project is that open software is not the same as open hardware. And there's a
reason for that, right? Open source software projects are on a GitHub repo. Everyone can
access, everyone can contribute. And, you know, it's fully inspectable. You're able to kind of
engage in that. And then
people just say, hey, I certified that version of the Linux kernel versus this one. And, you know,
it works in that fashion. With hardware, you own an asset at the end of this process, and somebody
has to produce that asset for you, whether it's a server, whether it's, you know, a NIC card. There's usually a combination
of several integrated chips and then several different components coming together.
And whether you're using an ODM or an OEM, somebody has to develop firmware, BIOS solutions,
et cetera, on top of that box to give you a finished functional part. And that's true
whether it's, you know, a server or whether it's
a white box. So what we do within Open Compute is drive standards to help ensure that as many
components as possible can be interoperable across different vendors. It doesn't mean that the net
end server is an open source thing that anybody could build and anything could happen,
you know, in the same way that an open source project would be. People's IP for their integrated
chips, for their server design is still their own IP and they have that right. But we create
open standards around the interconnection points so that we can ensure
that if you buy a DC-SCM or an OCP 3.0 NIC card from any number of vendors, they're going to
plug into the PCIe slot the exact same way that anybody else's OCP 3.0 card plugs in. And you can get different form factors of a card compliant to that specific
design so that you can accommodate, you know, a half-width board, 1U, 2U, whatever,
whatever's right for your specific environment. So it is a little different. We do that through
sort of process of contributions and defining the subset that will make interoperability
possible, but still enabling people to own their own IP. And honestly, I believe that is important
for three reasons. You know, one, as a consumer, I need to know, to your supply comment and question
earlier, I need to know if I choose to validate a NIC, I can get another supplier of that NIC if my current one isn't available.
So, ideally, it's going to be, I mean, I'm still going to test it before I put a new vendor's version of it out there.
But in general, it's going to be a very short turnaround time to validate a different version of an OCP 3.0 NIC because they're all compliant to a standard. So that helps with sourcing capabilities as a user. It helps with supply assurance and it
helps with validation timelines because everything is faster when I know it's going to plug into my
box. Now, even from an ecosystem perspective, I would argue this is better for IT ecosystem providers because they're not making big investments just to move a NIC card around, right?
And every person's environment often is different. Some people run their management network, you know, directly on their NIC card with a different one-gig network interface
versus their standard interface for their overall, you know, consumer network. Some people like to
have a separate switch itself, you know, at the ToR level. Everyone's different in how they want
to do their systems design for their reliability concerns and challenges. If you can standardize subcomponents so things are
more aligned, then we get a much easier process as developers and manufacturers, ODMs and OEMs,
in producing widgets that people can adopt quickly and not spending a lot of time in redesign.
And then the last community that we
serve are obviously silicon providers as well. And those folks, knowing that new players can
come into the market, use a form factor that's been standardized and be adopted and have a
supply chain ready for them faster is a real advantage as well.
So ideally, open compute helps in all of those ways when it's a healthy, vibrant community.
It doesn't always work.
I believe OCP is an amazing organization for this.
And the last thing I would add about that is it's not just about supply assurance.
It's also about sustainability because there's a lot of,
especially as we're seeing more bifurcation of options in the market, if I have to do a new
full motherboard design with every new component locked in for every CPU vendor, every DPU,
IPU, XPU, it's prohibitively expensive and it's horrible for the environment. So if I can take
standard building blocks, try things in my lab, swap them out, be able to do the right thing, validate if
this actually makes sense for us before going big and doing this big server design, that makes it
faster to adopt new technology, more interesting for new companies to get into, and a lot less e-waste in the world.
I mean, about 70% of a server doesn't need to be redesigned gen on gen to get performance efficiencies.
And that's 70% of the e-waste that can, through modularity kinds of initiatives, simply not happen.
I mean, we're going to have to build new servers for capacity needs,
but there's no reason why we can't get to a mindset
where we swap the CPU and the memory
and everything else keeps going for 10 years, 15 years,
instead of regenerating those fans and the CPLD and the BMC,
every single generation.
Why? What's really changed in those devices?
Yeah, that's a great point. And so poignant. And it kind of gets me into my next question.
Sustainability has become a bigger issue. And not that it wasn't always an issue, but
rising energy prices have put it, you know, at 10x the focus. What have you seen in terms of the
shift in your attention to performance efficiency? Has there been a shift or were you already there?
And are customers asking about
performance efficiency as well? So I would argue that given that we've run a global network
where we've been subject to the fun challenges in terms of pricing in all sorts of different
global markets, we've always been very sensitive to performance efficiency.
We were experimenting and testing the Amberwing solution, which was one of the very first
Arm-based solutions back in 2015, to see if there might be alternate options out there
that could be more efficient. So Cloudflare has a strong history in trying to make sure we are being as efficient as possible in serving the internet.
And that includes looking at accelerator solutions, looking at all sorts of options.
Just because we've always been exposed.
You know, we care about the eyeball responsiveness.
And so you're going to build in places that do not have a good PUE, you know, like high
humidity, expensive power sources, all the time.
Now everybody's on this bandwagon, which is actually in some ways very nice because it
means there's so much more, you know, focus and interoperability and core capacity out
there. I think what's really changed
that I've seen has been actually in the software ecosystem, because the number one metric for
software developers, you know, for the vast majority of time I've been on this earth,
has been efficiency of development time, the development agility of the team. And I'm starting to see in
the software ecosystem, people trying to figure out, is this an efficient way of doing it?
Are we being logical? And that's a huge shift. I mean, to be fair, everyone thought that way
in the sixties, but it was because they had no actual, like, memory to use and they had to optimize all of their time.
But in normal programming, it's been about developer agility.
And now people are starting to really look at it. The GHG Protocol will tell you, you know, of the 60 tons of carbon associated with a server, 90% of the carbon emissions are from the operation of it.
So we can get as good as we want to in the supply chain, in the reuse, in the recycling practices, and it's still going to only move the needle on 10%.
The operations, which again, we can do some great things with different architectures,
but if we have crappy code on there that is sitting and, you know, I don't know, interrupting the processor 24-7, like,
this is not going to be efficient for serving its overall objectives. And so I do believe this is a
problem that as much as hardware is going to work on it, I am most excited to see the software teams
getting excited about it and working on it because that's where we're really going to move the needle.
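As a rough, back-of-the-envelope illustration of that 90/10 split (the 60-ton lifetime figure and 90% operational share come from the conversation above; everything else is derived from them, not measured data):

```python
# Hypothetical arithmetic: why operational emissions dominate the lever.
lifetime_tons = 60.0        # lifetime CO2e per server, as cited above
operational_share = 0.90    # share emitted during operation, as cited above

operational_tons = lifetime_tons * operational_share  # 54 tons
embodied_tons = lifetime_tons - operational_tons      # 6 tons

# Even perfect supply-chain, reuse, and recycling work caps out at ~10%:
print(f"operational: {operational_tons:.0f} t, embodied: {embodied_tons:.0f} t")

# A modest 20% cut in operational emissions already beats erasing
# the embodied footprint entirely:
print(f"20% operational cut saves {0.20 * operational_tons:.1f} t "
      f"vs {embodied_tons:.1f} t embodied in total")
```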
I'm interested in this efficient code. And the last conversation I had on cloud was with Abby
Kearns, former CTO of Puppet. And we were talking about the state of app stack automation and the
complexity that we've created with cloud. With the number of workloads and the complexity of the stack, do you see us making progress in efficiency just from the standpoint of the cloud stack and the actual allocation of workloads?
So the biggest factor I've seen in increasing the efficiency of a server is containerization, right? Virtualization, containerization,
just upping the number of users on a system, given that multi-core architectures exist.
I don't think we've hit some tipping point on complexity with respect to containerization
or virtualization. Organizations who provide packaged services in this domain, or open source projects in this domain, are incredibly successful.
And I tend to think about problems in the Pareto rule of 80-20.
So if I can get 80% more usage... it's not exactly 80% more, but the average single-tenant server is usually about 10% occupied, and one that is supporting containerization or even
virtualization has the opportunity to be closer to 45 to 65% load. So a huge increase, a huge
improvement in reducing that 90% number, is just to adopt containerization.
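As a quick sketch of what those utilization numbers imply (assumed round figures from the discussion, not a real capacity model):

```python
# Hypothetical consolidation arithmetic using the occupancy figures cited:
# ~10% for a single-tenant server vs. 45-65% with containers/virtualization.
work_cores = 1_000              # cores of actual demand to place
util_single = 0.10              # typical single-tenant occupancy
util_containerized = 0.50       # midpoint of the 45-65% range

slots_single = work_cores / util_single                 # 10,000 provisioned cores
slots_containerized = work_cores / util_containerized   # 2,000 provisioned cores

print(f"provisioned capacity shrinks {slots_single / slots_containerized:.0f}x")
```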
It is complex, and container management solutions are a lot of work. Anybody who's
operated a Kubernetes cluster, which I don't do personally, but I know the gentleman who helps
run the team, knows it's a lot of work to do distributed systems at scale
and make sure that they stay up and are consistent and all of those factors. So it is a hard problem, but I don't think the easier,
lower hanging fruit parts of it are unsolved. I think there's really good technology happening
in that domain. The specifics of writing more efficient code, I think this is one of those areas where we're going to have to start
with a mindset shift for developers. And there's this great book called Nudge
that talks about how, when you show people data, they start to make changes, whereas enforcing choices doesn't work very well. And so I think one of
the most important things we can do as engineering management, as, you know, leaders in the industry
is to show people data about the consumption of their processes, to show people data about the
carbon footprint of the choices they're making,
both consumers, by the way, as well as developers. Like if I am using a service and it told me,
oh my gosh, you're doing this in high def and you are taking 10 times the computing power and
therefore 10 times the emissions footprint as if you were watching this in standard def,
I might choose to go to standard.
It might work better anyway, since I'm probably on my cell phone, you know, on a treadmill.
So it's actually not a bad thing to give consumers those choices.
And similarly, I think if we give developers better tools, you know, there's some great
tools out there.
I think Arjan van de Ven wrote PowerTOP and contributed that into the open source ecosystem. There's a bunch
of really fantastic tools that the community is starting to put out there. And if we build those
into our development pipelines and help ensure that our developers can see that,
people's own desire to do better for the world will help. I mean, you know, that is a
positive Pollyanna, totally Rebecca statement, but it is so true in my world view. Like everyone is
good. Most people are good. And most of us want to make the right choices for each other.
So I think there's a nudge that's worth making towards helping make sure we're exposing that data.
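One concrete way to show developers that data, sketched under assumptions (a Linux host exposing the powercap/RAPL interface, usually root-only; the counter also wraps periodically, which this toy ignores), would be to sample package energy around a workload in a development pipeline:

```python
import time

# Package-0 energy counter from the Linux powercap/RAPL interface.
RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_uj():
    with open(RAPL) as f:
        return int(f.read())

before = read_uj()
t0 = time.perf_counter()
sum(i * i for i in range(10_000_000))  # stand-in for the real workload
elapsed = time.perf_counter() - t0
joules = (read_uj() - before) / 1e6    # microjoules -> joules

print(f"{joules:.1f} J over {elapsed:.2f} s (~{joules / elapsed:.1f} W average)")
```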
And I think this is a challenge because a lot of cloud providers want to have a magical experience and don't want you to have to think at all about the hardware, us included.
Trust me, if a customer has to talk to Rebecca, something went really wrong, really wrong. But there are options to expose the impact
of choices, and I think you're starting to see that.
And maybe you'd be willing to spend a little bit more on that cloud instance
to know that the power source behind it is 100% green
and that it's using a processor that has, you know, 100% water recycling or
like, I think maybe people can be a little bit better if we start to show them instead of just
taking all those decisions away. I'm going to go to that small part of the population that is not good for a second. Cloudflare published a paper this week on DDoS attacks, and I was reading it, and it
brought to my attention that this is an area which is evolving by the minute, and a race
between the industry, working to protect data and protect customers, and bad actors looking to exploit
situations. Where do you think we're at? And what role does infrastructure, a
root of trust, have to play in terms of security? Oh, there are so many answers. Okay. Where do I
think we are in the world? You're absolutely right. You know, we published some great work during the beginning of the conflict, you know, between Russia and Ukraine, about some of the ways in which the infrastructure itself can show you changes in not just DDoS situations, right, but even just changes in upstreaming and data content and data access and potentially what
that means. The state of the internet is that it is a globally distributed system and physical
access, therefore, of where you send your data goes through lots of places you may not feel
comfortable with. I think governments are taking actions to try and
create regulatory environments that keep user data more protected. I think obviously companies
have a responsibility to take actions either through leveraging services that control and
support a secure access edge or through companies like us who work to have fully encrypted servers
to make sure that we have disaggregated root of trust, that we are signing all of our certificates,
that everything is, you know, there's a lot of best practices in security and nothing is 100%
secure. And I think the goal is to layer so many parts of that cake that you are not the easy
pickings out there in the world. And to recognize that, you know, this is a very complicated problem
because of the nature of the internet. And we should use our brains and question things and be smart consumers and think through,
have we really, you know, created a situation here that's logical? Maybe that's a different
podcast and a different conversation, Allison. But in general, you know, absolutely
hardware has a huge burden in this. There's a lot of solutions coming out to make that faster.
But even if you don't have it in the hardware, there are options and have been options in software for a long time in terms of, you know, network security, you know, whether it's VPN or some form of SASE intercept, whether it's hardened by hardware or not,
even just a software set of good practices is a hundred percent better. My favorite case of this,
you know, in the news in the last, whatever, six months is, you know, two-factor authentication through a true FIDO security key.
Like how many companies got exploited because they thought two-factor auth was good enough and no, really, it is a lot harder to spoof a hardened security key.
And I don't care how many times you have something text you a different code. Yes, that's better defense in depth, but it is not as good as having a hardened, in that case, security key.
And we continue to see this.
It's like good, better, best.
We start with software.
We start with encryption.
We start with a layered model. We start with a zero-trust model where people have to prove that they are
the user they claim to be and match the behavior patterns of users that are like them and have
those access patterns. And we spend a lot of time on that. I mean, I in no way want to downplay the
importance of the incredible services built 100% in software that help ensure people are actually having best security practices.
But as we layer in hardened security behind that, that is harder to spoof. I won't say
impossible to spoof. Everyone who's read the CVEs out there knows that nothing's impossible, but it just makes it that much harder to
break the encryption, to break the security model, to break the key schema. And I think that's what
we're all trying to do. Disaggregated root of trust is not because people haven't had some
sort of a root of trust concept, but if that has been commingled with your BMC, you're in a situation where, if the BMC
is violated, which, go read the CVE database, unfortunately, this is not an uncommon
situation, you've been trusting an entity that is not trustworthy. So that's really
where a disaggregated root of trust in your attestation
chain as you are trying to make sure your keys are, you know, accurate makes sense. But there's
no one panacea. I just, you know, see it every day, going through and trying to reduce the issues every single day. This is an area where we will
constantly need to be innovating. We will constantly need to be working as a community
to create better solutions. And I think it is an incredible time because all the biggest players
in the industry are working on this. And most of them are actually driving standards into open systems, open compute project.
So, you know, major projects were announced at our last global summit, Hydra being, you
know, one implementation.
I mean, obviously everything that has happened with OpenTitan, but all of the work that has happened in that domain for servers specifically are really exciting for the industry.
And I think it really is showing like security is a differentiator, but system level security, you're only as good as your weakest link.
And so if we don't bring up the whole industry,
we're in trouble.
And I think that's a huge amount of leadership from Google, from Microsoft,
stepping up to say,
hey, we're only as strong as our weakest link.
Let's make sure the industry is better.
That's fantastic.
One final question for you.
We're heading into 2023.
What are the exciting things that we can expect from the Cloudflare team? And what are you most excited about to see from the industry next year?
In the security world, you know, there's so much data and insights from running a global network, specifically targeting,
like reducing DDoS attacks, making sure that the internet is more secure. I look forward to seeing
our teams take the mic and talk a little bit more about threat intelligence and all the different
ways in which, you know, we can help consumers be smarter about that. Won't be my team at all,
but I just, for the sake of the world,
so that people can understand more, I thought the papers and work that we did around Ukraine
were incredibly powerful. And I really look forward to seeing the team expand that work,
because it's some of the stuff that inspires me most every day on building a better internet.
For my team, we are building our next generation of modular server,
which is super exciting.
Again, for all the e-waste conversation we had earlier.
So that's been a lot.
We are working actively on white box switches
and solutions to both have more inspection
and capability through using best-in-class
server design techniques in the networking domain, as well
as having the network sort of enable us to build what we want to do.
So I'm really excited about that.
I mean, I'm totally geeking out about hardware stuff.
Awesome.
I don't know if that's all good.
And the accelerator ecosystem continues to evolve, continues to be interesting. So I have at least
three different ASICs, varieties of ASICs, actually, and I have at least two vendors for most of them
in lab right now that we're starting to experiment on, to improve, you know, the accuracy of time
pools for, you know, running a global network, and to increase our efficacy in serving
machine learning and analytics. So, so many different domains. And obviously, you know,
one of our newest services that launched this year was R2. And as R2 continues to scale,
you know, going from being a globally distributed network, to being a computational network, to actually delivering object storage on top
of that, there's so much transformation that is happening in our footprint, in our builds, in, you
know, durability and latency requirements, in end-user services. So it's been an incredible
learning journey with that team to date, and I just, you know, am ecstatic to continue to build that to, you know, make it better, stronger, faster for our users.
That's fantastic. Thank you so much for being on the program today, Rebecca.
It's always a great chat.
Thank you for having me.
Thanks for joining the Tech Arena.
Subscribe and engage at our website, thetecharena.net.
All content is copyright by the Tech Arena.