In The Arena by TechArena - The Future of Open Solutions with OCP Chair Rebecca Weekly
Episode Date: May 11, 2023
TechArena host Allyson Klein chats with OCP Chair and Cloudflare VP Rebecca Weekly about the future of open computing solutions, how regional demands drive the OCP mission, and the importance of sustainability.
Transcript
Welcome to the Tech Arena. My name is Allyson Klein, and I am so delighted to invite back Rebecca Weekly. Welcome back to the program, Rebecca.
Thank you for having me, Allyson.
Rebecca, why don't you just remind our audience what you do for your day job and what you do for the topic that we're talking about today?
So my day job, I am the Vice President of Hardware Systems Engineering at Cloudflare.
That is what I do most of the time.
But my part-time job is also to be the chairperson of the Open Compute Project Foundation.
And that is an organization that exists to make sure that we have open, collaborative standards within the ecosystem so that we can produce sustainable, secure systems that work around the world.
So they're very related.
You know, I need one to be able to do my other job conceptually, but I don't necessarily
have to have a leadership role in it.
So Open Compute had an amazing regional summit last week in Prague.
I was there podcasting and it was such a breath of fresh air to see Open Compute back in Europe.
How long has it been since Open Compute was in Europe?
Four years.
Our last regional summit in Europe was in 2019.
And then because of COVID, we didn't do 2020.
Obviously, it wasn't even an option.
And then we kind of didn't think people were quite
ready yet for 2021 or 2022. And this year we were even, you know, a little bit on the fence, but
I'm so glad we pushed forward with the organization because to your point the European community has
different needs and a different sense of what's important. And it's so important for those
community members to get together, have the debates and the discussions to be able to connect
on what's changing in the ecosystem.
It's so interesting because Open Compute is such a vast
organization. It's evolved from being all about cloud service providers to also including Edge,
also including Telco. Tell me why you think the regional engagement
is so important. And you said it's different. How is it different as you move from geography
to geography?
Well, I think, so we'll start with Europe. In general, the tenets of the
organization are the same, right? Sustainability, reliability, security, these things persist.
But who cares about what?
And the specific environment for compliance, regulation, and the key players who will help
in creating those standards change by region, right?
In Asia, there's Scorpio.
In Europe, we have all sorts of different open
source organizations, many of whom were present, that are partners that, you know, we'll see Linux
Foundation as an important software partner, no matter where we are in the world. But we might
see a very specific European organization who is incredibly important for chiplet standards for fabrication facilities
and manufacturing facilities in Europe that is very different than what we would see,
you know, in America or in Asia. So I think those three markets specifically have a lot of different
regulatory constraints and are incredibly important for us to have the right
local partnerships to make sure that we can be successful. Because ultimately, Open Compute
Project is about scaling innovation to the entire industry. And so if we don't have the partnerships
in the different regions, then we can't actually create accessible solutions for the ecosystem.
I think that the event was fantastic and it didn't disappoint in terms of the announcements. I think that the doors almost flew off the building when Amazon's first contribution to OCP was announced.
We've been waiting for a long time as an industry to see them engage. Tell me about what that was about and why it's a big deal to get Amazon as a seat at the table for Open Compute.
You know, we have always had a history of helping to lead the IT ecosystem providers towards good opportunities for them.
You know, it's not open source, it's open hardware.
And there's a very big difference between the two.
And it's a feature, not a bug.
Open hardware, as a consortium, we've come together to establish standards of how we work. And usually the way that is done within
OCP is to create a contribution license agreement between a small set of parties. And that set of
parties is usually the minimum number of people it requires to create a server specification,
a profile for Redfish, whatever the outcome is that we're seeking.
And so that's usually an IT ecosystem provider or two, a silicon set of providers, and somebody
who brings market share and a brand and an ecosystem with them so that those different
providers know the work that they're doing.
It's going to be monetized at some point.
Exactly. Unlike software where it's, you know, just one person's time or a subset of maintainers'
time to put software into the ecosystem and let it go. When it comes to hardware, you have to have
a product at the end that somebody warranties, that somebody stands behind. And if nobody knows that there's going to be a dollar value to that, we won't see an output
that is actually able to be purchased. So what happened 11 and a half years ago when OCP was
started, was that a subset of folks, Intel, Meta (at the time, Facebook), Rackspace, Goldman Sachs,
they all sat down and said, hey, we represent, you know, some dollars, we represent some silicon,
we represent systems providers, we can work together to create standards that allow us to
buy hardware, better, faster, stronger, that will end up being
more reliable. Because a lot of the things that we deal with in this ecosystem are
unique by accident. Things won't work out because the firmware version or the BMC version
is different just because you didn't know the methodology of getting things standardized so that we can be effective.
So there's places where everybody in the ecosystem wants to see differentiation in their silicon providers and their IT ecosystem providers.
But there's places where by differentiating there, you're creating more pain for your end users.
Because now we have to create all sorts of wrappers for this version of the server with this part looks like this.
This version of the server with this part looks like this.
And we already have so much complexity, at least in my world, my real day job, managing 297 regions with different rack power, height, you know, thermal design, fiber connectivity,
to be able to take that and then have to go to your matrix of gobbledygook of things that are
different just because nobody said, hey, here's basically what we would like to see. Like, just
give us this and then play over here. So why does Amazon matter?
Why do we need to go and see and like to see the largest market makers in the ecosystem
working with us in open compute? It's exactly that. If Amazon is bringing their business
and creating standards in OCP, it's not like we make money off of Amazon. That's not how OCP works,
but it allows the IT ecosystem who is in OCP to make systems that will work for Amazon and all
the people who will follow suit. So it generally helps us make sure that those who contribute the most engineering effort to systems that are OCP ready, OCP compliant, are getting the best return on their investments. is the importance of sustainability. It was so just poignant how much sustainability was coming through in every presentation,
in every vendor discussion, in every customer discussion.
I felt like there had been a sea change.
And I understand that we went through a weird time of not gathering for a number of years.
But there had been a sea change in attitudes about the importance,
the relative importance of sustainability. And it just smacked me in the face at the summit in Prague.
Tell me about what the discussions have been inside of OCP on that topic,
and why you think that's the driving force.
Yeah. Well, last year is when we added sustainability as one of our central tenets. So it has been a trajectory that has been sort of gaining momentum. I'd say over the crises of 2020, '21, you know, we started
to realize, wait a second, we're starting to really reset the big engine here. Are we doing
it with intelligence? Can we take this opportunity to think through how to be as efficient as possible and, you know, build together. So I think it's been a
central part of our, again, our market makers, right? The hyperscalers have been very loud and
proud about how much they're trying to do to increase sustainability, to increase, you know,
green energy sources, better recycling of their water, better recycling, et cetera,
et cetera. But they are one pillar. The whole ecosystem has to come together. It's how we
build buildings. You know, you can have green concrete or not. I mean, it comes down to every
aspect of material science, every aspect of the methodologies we use. And there's so many people
involved in the train that is building a data center, in the train that is building a server.
And there's not enough standards. There's not enough reporting that is happening. A lot of
times I feel like what I read, I have to, you know, I read it the way I read academic papers,
like what are all the holes here and what's still worth learning. It's the same with a lot of these
sustainability reports because we haven't had great standards that have been set in terms of
how we report scope one, scope two, scope three. You know, there's a lot of work that these
companies are bringing together to standardize how we report and then what our methodologies will be to be able to have a circular economy, reduce, you know, reuse, recycle the whole cycle.
So I think there's a lot that's been there.
We've had a circular economy initiative, I want to say, since 2012, which was a great partnership with, you know, the team at ITRenew, now Iron Mountain, within OCP.
But I see the not just e-waste reduction and making sure that we're doing the right things
on the backend with servers. Now it's the, how do we design them better? Active power analysis
versus peak power analysis. Most servers have always been designed for peak
power. Now you're starting to see a lot more leeway for active power. It saves components.
It saves, it, you know, makes sure we're getting the most out of every watt invested. And it turns out we can burst on our power profiles for a shorter period
of time and ultimately be in a better situation for sustainability. So there's all these different
initiatives that are happening throughout the ecosystem for modularity so that we can reuse
larger portions of servers and not throw away as much for, you know, how we're operating them, which
I'm talking about server stuff, but there's some great partnerships that are happening in the
software side because 90% of the problem is what code we run. So we got to do that. And then
obviously on the backend from an e-waste reduction and reuse capability. So I think,
you know, the whole industry has really been thinking
about this. Probably it did start with the hyperscalers. I think it also starts with this
generation, right? I think these consumers care as much about what they're purchasing
and what it does for the world, maybe more than we did just, you know, getting the fastest response time on something.
When I was thinking about that, it really made me think about we're moving sustainability at an equal level as performance, right?
We've had rigor and structure in how we measure workload performance for many, many years.
The entire industry understands both from the standpoint of benchmarking as well as what you would do to
actually test a workload. Seems like we need the same type of rigor and structure and industry
norms in sustainability so that we're speaking the same language, right?
Yeah, no, it's a great point. And I still remember, gosh, what event was it? Don't quote me. I'll get you the data for the show notes, but there was a presentation
that was done by John Miranda, who was showing, you know, the amount of energy in watching a stream in high def versus
watching a stream in standard def. And that there are actually tools now where we're starting to see
these services show you options of, do you want to use this much energy or this much energy?
And not just at the infrastructure level, when you're renting an instance from Google and they show you like,
hey, this one's served out of this region,
which is 92% green power, whatever.
But literally an active web portal,
when you're purchasing something, you can see now,
like this was done using this much water,
or this was done using this much water,
or I can watch this stream and I can watch it in standard def, which I'm looking at my little
phone, right? Who cares if it's high def, I'm at the gym. I just want something to entertain me
while I'm running on my treadmill versus, you know, I'm sitting down in my home theater and I want perfect. Exactly. And I think by default, we
assume as providers, we've got to give the highest quality experience. Quality, to your point,
performance has always been top bin, highest performance. It's not necessarily been
performance per watt or performance at an ideal quality to allow the user to optimize for their current connectivity, the tolerance they have for creating waste in the ecosystem.
And if we give people more data, it's amazing what they're willing to do to determine that, you know, yes, I want this,
but I don't want this at the expense of the amount of emissions.
Water. Earth. Exactly. Yeah, exactly.
I have one final question for you. One of the things that
was really notable is the amount of breadth of contributions that are flowing through OCP and how that breadth
represents different segments of the market. OCP is a very fundamentally different organization
than it was five years ago. How do you see that shaping and why is this OCP model
becoming more and more relevant to different segments of the market. And how do you think
that's going to shape what we experience in the fall at the OCP summit in the US?
Okay. Well, I'll start with what is going to change in the next, you know, we've changed a
lot in the last five years. Do I think we're going to change a lot in the next five years?
Probably no.
I think we went from being a nascent organization that was sort of focused on just server design and efficacy.
And now we really are a more mature organization. We have representation from nearly all of [...] see that hyperscale innovation driving different
standards and market segments continue. And I think that is what we've done really well as an
organization. And again, I think it comes down to that we do open hardware versus doing open source,
true open source for reasons of IP and protection of our members. For the what is going to continue to change and
how we're going to continue to scale, we don't add tenets lightly, but sustainability is still
in its nascent days. And there's a lot more that needs to come throughout each and every scope.
And a lot of, I think, key partnerships, again, on that software side, that will be really important
for us to keep going.
Because again, it's great if we reduce on the front end, but if we don't reduce the usage over the lifespan and cycle, then that will be a problem. And I think the other key domain that
really is going to change is AI. We've just barely scratched the surface of how to do that.
There's great standards around OAM, OAI that are really around, you know,
how do we build servers for big accelerator systems? Great. But backend fabrics, debug,
and then aspects of human, you know, review, I'll say friction for the right case. You know,
you want most things to scale without a lot of
friction. But in the case of AI, when something, especially these LLMs can give you definitive data
and speak it as if it is the truth, and it's actually just a reinforcement of bias,
that creates a situation where we have some concerns, right? So where do we apply human friction logically
and how to create systems
where the telemetry and observability
can allow you to roll back
what was done to get to that conclusion.
As these things are in super custom frameworks, fabrics,
no telemetry, no visibility
across different ecosystems or vendors,
everyone's solving those problems themselves. But this is a worldwide problem.
When and where to be able to determine why the LLM said this and the truth is this
and turning that into an error report where a human can understand and then rationalize,
businesses are going to have to care about that because they will lose the trust of their
consumers if they can't do that correctly.
And if we don't build systems, specifically backend fabrics, that's where I worry because
very few of these systems will singularly be able to solve a problem, right?
I mean, there's so many parameters in these models.
We need scale. We're going to be at 4,096 nodes. Exactly. So what we're going to deliver
has to be stitched together. And that's going to take somebody building the fabric,
someone building the server, someone building the silicon, various forms of silicon. So this is where
being able to work together. Again, we've just scratched the surface, but we got to go.
And so what does that mean for the fall, for the OCP Global Summit, to try and take it all the way around?
I hope you'll see a lot of interesting partnerships. Certainly we're
seeing a lot of interesting projects, and I hope we'll see interesting partnerships,
not just in AI from a fabrics point of view, but also telemetry, observability, and security. This is where,
you know, who's responsible for what? How do we have that? And in fact, you know, I think you saw
a little bit in the European summit, the chiplet standards and the link layer work. That's the
start of, okay, who's producing the chip or the chiplets? Now, how does it all come together?
Who owns security, reliability, and telemetry? So that no matter who in the ecosystem is producing
the widget, the user has visibility and manageability. And consistency across platforms,
right? Amazing. Amazing. I can't wait. I'm really excited for the fall.
I'll see you there.
In the meantime, if folks want to catch up with you and connect about what you're doing at Cloudflare, what you're doing at OCP, where should they reach out to you, Rebecca?
Absolutely. rweekly at cloudflare.com. It'll always be rweekly, or Rebecca Lipon, my maiden name, on Twitter
or on LinkedIn, you know, Rebecca Lipon.
Fantastic. Thank you so much for the time
today, Rebecca. It's always a pleasure. Always good to see you.