In The Arena by TechArena - GEICO’s Rebecca Weekly on IT Transformation and OCP Innovation
Episode Date: October 16, 2024
In this episode, Rebecca Weekly shares how GEICO is rethinking cloud strategy and embracing OCP for improved efficiency, security, and cost savings in its infrastructure journey.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators
and our host, Alison Klein.
Now, let's step into the arena.
Welcome to the Arena. My name is Alison Klein, and we're coming to you from OCP Summit in San Jose, California this week. And this was, I hate to say this about the other interviews, but this was my favorite interview coming up this week. And it's Rebecca Weekly, now of Geico. Welcome to the program, Rebecca.
Thank you so much for having me, Alison.
So Rebecca, you've been on the show before many times.
We've talked about OCP a lot,
but this is the first time that you've been on the show when you've been at Geico.
So why don't you tell everyone about your new role
and a little bit about your background?
Sure.
So earlier this year, I joined Geico to lead infrastructure engineering, and that is the platform from cluster management down within our public and private cloud footprint.
So Geico was a fully on-prem infrastructure build. In 2013, it started doing what a lot of different companies did in the enterprise space and moved to the public cloud. It's going to help you with agility. It's going to help you with developer capabilities. You're going to have all these great tools. And fast forward 10 years into that journey, they had only gotten 80% of their workloads out of the on-prem footprint. The last 20% could not move. And for that 80%, it had driven up the cost over 3x.
Wow.
And so as responsible fiduciaries of our company, we had to really look at whether what we were doing was achieving the objectives, and if not, reevaluate how we do this better, faster, stronger.
And so, funny enough, I actually met with the Geico team in my OCP capacity, helping support their concepts and their early investigations into this space, which is in some ways how I ended up in this role, because I could not have been more excited about the infrastructure that I was building at Cloudflare. But being able to help a company that is serving everyone, keeping them in their cars, keeping them in their homes, in driving tangible decisions around their infrastructure to better serve the business, is just a fascinating opportunity. It's not a skills problem. It's a legacy and migration problem, and it's making sure you're creating the right separation of duties between players so that application teams can innovate and infrastructure can continue to optimize the cost footprint to serve them.
This is a 4% margin business.
We cannot afford to be wasting money.
Now, you know, I think the first thing that anybody would say to you is OCP is about hyperscale, OCP is about the public cloud. And here you are at Geico, and you gave a keynote presentation today saying not so fast, my friends, we can actually use OCP as well. Tell me about why you first chose to use OCP configurations and how that journey has gone.
So I think whether we used ODMs or OEMs was actually only a 10% to 30% kind of cost difference. So it's not necessarily true that one has to go towards OCP and open hardware for cost. The reason why we chose to go towards open hardware was to control our own destiny. We are heavily regulated. We need to make sure we have a path to attest to our firmware, a path to ensure that we are doing secure boot, a path to ensure at any time we can do signed firmware images. When you work with an OEM, the vendors of your silicon or the vendors of your components are going to drop a firmware update. And then your OEM has to integrate that into their closed-source environment for you. And then at some point you get that new package, you validate that new package, and you roll it out across your fleet. And this is how you end up with people who haven't patched their fleet or rolled out firmware updates in eight years, because they don't have any real understanding and they aren't really close to the problem. They're relying on somebody else to do it. And if it causes any business outage or continuity challenge, they don't prioritize it. And that's a problem. When you come into an ODM ecosystem, they're only going to sell you what you need. They work for you.
And so what OCP gives you is a common language to work with ODMs on building what you need.
Now, there's challenges to that and there's opportunities.
So when you say something like DC-MHS or DC-SCM, there's a 1.0 spec, a 2.0 spec, a 2.1 spec, and lots of opportunity for people to have created deltas that will make it hard to build your firmware, to validate that process, and to make sure your BIOS is actually working across the different systems.
So buyer beware.
There's a good reason to use OEM systems if you don't want to invest in understanding how you're going to keep your fleet secure and manageable. But if you need to for your business, if that is a challenge that you found in the public cloud, a challenge you found on-prem, a challenge you've seen in different domains, you're going to think about what OCP can bring you from a security perspective: being able to have a secure supply chain, a root-of-trust methodology for signing firmware and BIOS, and being able to report on it.
I actually think the value for us was, sure, savings, but truly a path to a much more compliant, much more manageable fleet from a control-your-own-destiny perspective. You know, you have a CVE, you're going to address that. You don't have to wait 18 months for some vendor to give you a blob that is not inspectable, that is not understood, where you're just at their mercy. And if you don't roll it out and something goes wrong in the deployments, now you're 18 months from another one.
Now, obviously, with your history, you're extremely deep in OCP, so that experience I'm sure has been advantageous on this journey. And while I'm sure that you have an amazing IT staff, you don't have the endless engineers that some of the hyperscalers have to run their data centers. How do you manage and mitigate when you have reasonable-size resourcing pools, reasonable-size budgets, and you're looking at implementing technology that was potentially designed at first for IT departments that are much larger?
I think it's the unexpected things that catch you up, right? I'm deep in OCP and in what the specs mean and that they are more descriptive than prescriptive. So I know what to ask for. But one of the big ones that caught us was top-of-rack switches. So we went with ORV3 racks, which really future-proof your data center to go up massively in power delivery. Everything's DC.
Guess what? You're not going to find a top-of-rack switch from a standard vendor that is ORV3 compliant. You're not going to find one that's DC compatible. You're going to have to work around that, whether you change inverters or do all sorts of other things. And we came up with, I would argue, a very clever way of solving that problem to be able to work in our environment. But it was one of those, oh, this isn't a hyperscale problem, because they're building their own, and so they can do whatever they want. But for the broad ecosystem that's selling into 19-inch cabs in brownfield installations across all the data centers everywhere in the world, they haven't seen the market opportunity yet to solve this problem. It's not the rack itself, but it was the power that had to be converted correctly to be able to use off-the-shelf networking gear, versus what we see on the server side that's ready to go. And that's this interesting moment of you don't know what you don't know until you get there.
That makes a lot of sense. Now, if you were talking directly to someone in your equivalent position at another enterprise who is considering starting this journey, what are your top suggestions about how to get going and how to engage this wonderful, vibrant community?
I think the number one thing is to know what you're looking for. Understand your current footprint from a compute perspective. Understand your storage footprint. Are you predominantly running legacy ISVs, and therefore you're going to need certified systems doing X, Y, or Z? If that's the case, you're probably not in the right spot to come to open hardware. Open hardware loves open stuff. Open source is the name of the game. If you are looking at OpenStack, if you're looking at Kubernetes, if you're running KVM, if it's KubeVirt, if you're in this ecosystem where people are running containerized or even virtualized, but they're doing it in a modern, modernized stack, you're going to have a lot more options from a supplier perspective than if you're locked into these legacy ISVs. So that's kind of the first thing, as someone's looking at their IT spend and their systems design. The next is to look at your mix of storage and compute. So storage in and of itself can be incredibly expensive. Those are much more expensive servers than your compute servers. Whether you're buying more compute servers probably depends on your mix and your workloads. But you're going to have a lot more cost caught up in all the drives, in everything you're trying to accomplish. So understand: do you have data lakes? Do you have data warehouses? Are you in this analytics domain? Are you actually understanding where the data is, what the data processing mechanisms are going to be, and what those SLOs are for your application teams? That also is going to be the anchor spend that you need to think through in your IT strategy. And then I think the last part is where do people want to go? Where do they want to spend money?
When we're looking at it from an infrastructure perspective, most of us don't want to spend money on our management stacks or our database management when there are so many good open source solutions. So you want to spend where it's freeing value back to the company.
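As a rough illustration of the spend-mix analysis she describes, here is a back-of-the-envelope breakdown. Every count and unit cost below is invented for illustration; none are GEICO figures.

```python
# Illustrative only: which server class anchors total fleet spend?
FLEET = {
    # server class: (unit count, unit cost in USD)
    "compute":       (2000,  12_000),
    "storage_heavy": (400,   45_000),  # drives dominate the bill of materials
    "gpu":           (50,   250_000),
}

def spend_breakdown(fleet):
    """Return each server class's share of total fleet spend, in percent."""
    total = sum(count * cost for count, cost in fleet.values())
    return {name: round(100 * count * cost / total, 1)
            for name, (count, cost) in fleet.items()}

shares = spend_breakdown(FLEET)
# Even a numerically small storage or GPU footprint can carry a large share
# of total spend, which is why it becomes the "anchor spend" to reason
# about first in an IT strategy.
```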
You also want to build where it's IP that's potentially differentiating. And so you're trying to find the sweet spot with open hardware, where it's not necessarily that you want to build it yourself. I don't need a custom server. I want to use as vanilla an off-the-shelf server as possible, but I want to get it from somebody who's not going to put 15 different layers of warranty, maintenance, and manageability software overhead on it that I don't want to use, because that's going to make you less secure and less responsive. That's when you actually look at OCP, because the way the hyperscalers move the market means they can get good supply in quantity and customize it for what they need for their niche, what they need for their needs. So if you can draft off that investment for the core motherboard, for everything that you're going to do, that's going to help you amortize your spend. That's where I would really help people understand: is this where you want to spend all your money? Would you like some french fries with that hamburger? And then let's go focus on getting complicated and fancy where we have to, because it's accruing value.
That makes a lot of sense. Now, your keynote talked about a year's journey.
What have you accomplished in that year?
Yeah, so lots of things.
Obviously, we designed and selected our new servers: compute and storage, light and heavy, if you want to call them that.
And then GPU servers as well to be able to run our on-prem footprint of what is core to our business.
We're not trying to take everything out of the public cloud.
We're trying to rationalize our use of the public cloud where it's best for our business.
So experimentation, ephemeral compute, things where cloud is great, things that are super
low latency to end users, areas where you want to be protected against DDoS.
These are domains cloud is great for.
When we're talking about our on-prem footprint, we want our billing, our backend, our payments, and things that are core to our business and not as latency sensitive.
Right.
And that are really critical for compliance, audit, and security.
But that's really been our focus.
So we designed our servers.
We purchased our new data center spaces.
We are in the process of shutting down legacy data
centers, building up some of our legacy data centers so that they are modern and capable.
Also, we built out a hybrid cloud stack that is all open source based, to be able to run across our public and private cloud footprint with a consistent placement engine for vending VMs, containers, and clusters.
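A minimal sketch of such a placement engine, with policy rules invented for illustration (the routing logic is hypothetical, not GEICO's actual policy):

```python
# Hypothetical sketch: one engine decides, per workload, whether a VM or
# container lands in the private (on-prem) or public cloud footprint.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    predictable: bool        # steady demand suits on-prem CapEx
    latency_sensitive: bool  # close-to-user workloads suit the cloud edge
    regulated: bool          # compliance-critical workloads stay on-prem

def place(w: Workload) -> str:
    if w.regulated:
        return "private"     # audit/compliance trumps everything else
    if w.latency_sensitive:
        return "public"      # e.g. DDoS-protected, close to end users
    return "private" if w.predictable else "public"

jobs = [
    Workload("billing", predictable=True, latency_sensitive=False, regulated=True),
    Workload("quote-api", predictable=False, latency_sensitive=True, regulated=False),
    Workload("batch-analytics", predictable=True, latency_sensitive=False, regulated=False),
]
placements = {w.name: place(w) for w in jobs}
```

In a real hybrid stack the same intent API would vend a VM, container, or cluster in whichever footprint the policy selects.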
And that has been the goal of the one year. There's a lot more to do to really make sure we shut down all the legacy data centers. They were, you know, pretty inconsistent, so there wasn't a singular power footprint or design in every site. So there's lots of opportunity to improve that, so that you have a true active-passive configuration and the ability to deliver a highly reliable experience to your end users.
Now, you started this process with 20% of your workloads on-prem.
Yep.
Where are you going to end?
That is a great question.
One of my favorite quotations is: every model is informative, but none is accurate. So we've modeled lots of scenarios, and if the work were to stay consistent, it would be illogical to run less than 80% on-prem.
But nothing is going to stay consistent, because while I'm working on this part of the infrastructure rebuild, all sorts of teams within Geico are also changing how we do data, how we process and interact from a digital perspective, how we run our call center. So all of that modernization effort is going to change the workload and what is needed.
I will always bet on on-prem being cheaper than the cloud if the work is predictable. But if things are not predictable, the cloud has great elasticity of compute and types,
right?
You have so many options.
So where we end up, will it be 80-20?
Will it be 60-40?
Will it be, I don't know.
But looking at what we're currently serving, it would be much more logical to serve all of that on-prem.
And then it frees up CapEx dollars to spend on interesting, new, innovative projects that may very well change that program and change that configuration.
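The predictability argument above can be sketched with a toy cost model. The unit rates are invented placeholders, not real pricing.

```python
# Toy model: on-prem capacity is paid for whether or not it is used, while
# cloud is pay-per-use at a higher unit rate. Steady demand favors on-prem;
# spiky demand favors cloud. Rates below are hypothetical.
ONPREM_COST_PER_CORE_HOUR = 0.02   # amortized CapEx + power, illustrative
CLOUD_COST_PER_CORE_HOUR  = 0.06   # on-demand premium, illustrative

def onprem_cost(demand):
    # You must provision for the peak and pay for it every hour.
    peak = max(demand)
    return peak * len(demand) * ONPREM_COST_PER_CORE_HOUR

def cloud_cost(demand):
    # You pay only for what you use, but at a higher rate.
    return sum(demand) * CLOUD_COST_PER_CORE_HOUR

steady = [1000] * 24              # predictable 24-hour load
spiky  = [100] * 23 + [2000]      # idle most of the day, one big burst

# steady demand: on-prem ~480 vs cloud ~1440 -> on-prem wins
# spiky demand:  on-prem ~960 vs cloud ~258  -> cloud wins
```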
And you did all this while still running the business. And you did it all in a year. That's incredible.
I wasn't here the full year. And massive props to my previous boss, Harry Govind, who started this journey and brought me here to run it. He had a lot of passion, coming from Meta and having gone through a lot of this at Target. And I think he was a true believer in the value of stripping out non-value-add features, focusing on the core business, focusing on open source, focusing on open hardware, controlling our own destiny, and building what we need.
That's fantastic. Where do you think OCP is going? I mean, we're sitting in a conference that has 7,000 people.
7,000 plus.
I remember when OCP was drawing like 150, because I'm that old.
Where do you think OCP is going in terms of broad market? And do you
think this is just the beginning of enterprises taking advantage of this? And we're going to see
a lot more of that?
Certainly, I want to say yes. I think OCP has a lot of work to do to help enterprises be more seen. And it starts by listening, right? It starts by engaging enterprise communities and helping ensure that they are actually getting a voice at the table. OCP's perception has changed since it first started. When it started, everyone thought it was all about Facebook. You know, Goldman Sachs was one of the founding members, and Rackspace was one of the founding members, and Rackspace was very much serving enterprises. It started as scaling innovation through an ecosystem, and it should stay true to those roots. But to do that in these domains, you can't be 100% talking about immersion cooling, as much as I love immersion cooling. You can't be 100% talking about chiplets. You can't be 100% talking about CXL. Stranded memory capacity is a real challenge, but it's not going to be solved by disaggregating all memory.
So we have smaller problems to solve, and they're real problems: problems of interoperability, problems of usability, problems of manageability.
They're the core of what the previous 10 plus years of OCP has been. We don't need to stop
integrating. We don't need to stop innovating.
We don't need to stop facilitating those discussions.
But we also need to recognize that generative AI is not necessarily the core business of
the vast majority of enterprises.
They still need traditional compute, traditional storage, making really good, reliable solutions.
And if we just layer additional costs into every space because AI is the buzzword du jour, that's where we're going to lose it. I've been fascinated walking through the expo at all the ways in which people are building LLMs into management software and DCIM solutions. And that's probably not where people are going to spend the extra dollars of their infrastructure budget.
Not because fleet management doesn't matter. Not because we don't care about assets and inventory being easy to manage.
But having a full augmented reality overlay for your data center is something that you would invest in if you're running 300 physical sites.
Probably not true if you're running three.
So you're not going to be buying a nuclear power plant anytime soon is what you're saying?
I think it's a great clean power source for us to be investing in from a distribution mechanism. But that is something that I'm going to purchase via a co-location facility for my data centers. And so I want to make sure that they have good battery backup systems and building monitoring solutions, so that I can do remote management directly. And each and every one is going to have some different sensor array. So whatever solution you try to sell me, and whatever model you think we're going to run, is probably not going to fit all the different sets of constraints at the same time.
So let's solve the scale problems and let's solve the small problems, with an eye towards total cost of ownership, with an eye towards helping businesses run effectively. Then OCP is really scaling innovation in terms of open hardware and open software, open source design patterns that can solve real problems.
Now, you're at OCP this week.
I know that it's been a flurry of a day for you. I'm going to ask you one question about beyond
Geico, which is, I know that you've talked to a lot of different companies here. You've walked
the show floor. What was the most exciting announcement that we've heard about at the show that doesn't have to do with Geico?
There's a lot.
I'll just speak specifically as Rebecca, nobody else.
I really enjoyed seeing the announcement of x86 as an ecosystem, actually investing in sustainability and infrastructure.
Right.
And those two go together. There's a lot of x86 software written in the world, and I want to see continuous investment in the software and the hardware to ensure that it continues to evolve. I'm very sure that five years ago, even three years ago, you wouldn't have seen those two on stage together. And that was a pretty impressive ecosystem moment, a break with the past, that I thought was fantastic, interesting, all the words that come to mind. I'm just glad I lived to see it. I thought there were also some interesting conversations around scaled fabrics,
how we're seeing Ultra Ethernet come together and where we're starting to see that make progress. I am a huge fan of it. Whether it's InfiniBand, whether it's NVLink, whether we're talking about proprietary solutions in the interconnect space, I feel that stops us from solving problems collectively. And I think there's just so much worth engaging the ecosystem on, and we truly are smarter together. I get excited when I see the open ecosystem approaches towards connectivity and towards scaling actually growing legs.
Nice.
And running well together.
So those are probably the two that jumped out at me. And yes, we're really starting to see that even in the accelerator space, in what's being offered in terms of open models and the way in which you're seeing different accelerators come and join and show that those models can run fast within their domain-specific acceleration spaces. I think the journey is not done. Problems are not solved in this domain.
Models are changing so fast.
The ecosystem is changing so fast.
And so starting to bring collaboration together with open minds is a place where we're
going to see a lot of integration. And it's all of them. Everybody has an AI chip, an AI accelerator,
a fabric that is working on a backend compliant with X, Y, or Z to help people move forward.
That's awesome. One final question for you. Where can folks engage with you and continue the conversation? I am sure they're going to want to.
I'm on LinkedIn. That's probably the best place to reach out. And I'm happy to chat especially about the enterprise journey and what it takes to be effective.
Always fun to have you on the show, Rebecca. Thank you so much.
Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by the Tech Arena.