In The Arena by TechArena - GEICO’s Rebecca Weekly on IT Transformation and OCP Innovation

Episode Date: October 16, 2024

In this episode, Rebecca Weekly shares how GEICO is rethinking cloud strategy and embracing OCP for improved efficiency, security, and cost savings in its infrastructure journey....

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Alison Klein. Now, let's step into the arena. Welcome in the arena. My name is Alison Klein, and we're coming to you from OCP Summit in San Jose, California this week. And I hate to say this about the other interviews, but this was my favorite interview coming up this week. It's Rebecca Weekly, now of Geico. Welcome to the program, Rebecca. Thank you so much for having me.
Starting point is 00:00:42 So Rebecca, you've been on the show many times before. We've talked about OCP a lot, but this is the first time that you've been on the show since you've been at Geico. So why don't you tell everyone about your new role and a little bit about your background? Sure. So earlier this year, I joined Geico to lead infrastructure engineering,
Starting point is 00:00:59 and that is the platform from cluster management down, within our public and private cloud footprint. Geico was a fully on-prem infrastructure build in 2013 and started doing what a lot of different companies in the enterprise space did: it moved to the public cloud. It's going to help you with agility. It's going to help you with developer capabilities. You're going to have all these great tools. And fast forward 10 years into that journey, they had only gotten 80%
Starting point is 00:01:25 of their workloads out of the on-prem footprint. The last 20% could not move. And for that 80%, it had driven up the cost more than 3X. Wow. And so as responsible fiduciaries of our company, we had to really look at what we were trying to accomplish, whether we were achieving the objectives, and if not, reevaluate how we do this better, faster, stronger. And funny enough, I actually met with the Geico team when I was in my OCP capacity, helping support their concepts, their early investigations into this journey. That's in some ways how I ended up in this role. I could not have been more excited about the infrastructure that I was building at Cloudflare.
Starting point is 00:02:05 But being able to help a company that is serving everyone, keeping them in their cars, keeping them in their homes, and driving tangible decisions around their infrastructure to better serve the business is just a fascinating opportunity. It's not a scale problem. It's a legacy and migration problem, and it's about making sure you're creating the right separation of duties between players so that application teams can innovate and infrastructure can continue to optimize the cost footprint to serve. This is a 4% margin business. We cannot afford to be wasting money. Now, you know, I think the first thing that anybody would say to you is OCP is about hyperscale,
Starting point is 00:02:52 OCP is about the public cloud. And here you are at Geico, and you guys gave a keynote presentation today about the fact that, not so fast, my friends, we can actually use OCP as well. Tell me about why you first chose to use OCP configurations and how that journey has gone. So I think the difference between using ODMs or OEMs was actually a 10% to 30% kind of cost difference. So it's not necessarily true that one has to go towards OCP and open source for cost. The reason why we chose to go towards open source was to control our own destiny. We are heavily regulated. We need to make sure we have a path to attest to our firmware, a path to ensure that we are doing secure boot, a path to ensure that at any time we can do signed firmware images. When you work with an OEM, the vendors of your silicon or the vendors of your components are going to drop a firmware update. And then your OEM has to integrate that into their closed source
Starting point is 00:03:54 environment for you. And then at some point you get that new package, you validate that new package, and you roll it out across your fleet. And this is how you end up with people who haven't patched their fleet or rolled out firmware updates in eight years, because they don't have any real understanding and they aren't really close to the problem. They're relying on somebody else to do it. And if it could cause any business outage or continuity challenge, they don't prioritize it. And that's a problem. When you come into an ODM ecosystem, they're only going to sell you what you need. They work for you. And so what OCP gives you is a common language to work with ODMs on building what you need. Now, there are challenges to that and there are opportunities.
Starting point is 00:04:34 So when you say something like DC-MHS or DC-SCM, there's a 1.0 spec, a 2.0 spec, a 2.1 spec, and lots of opportunity for people to have created deltas that make it hard to build your firmware, validate that process, and make sure your BIOS is actually working across the different systems. So buyer beware. There's a good reason to use OEM systems if you don't want to invest in understanding how you're going to keep your fleet secure and manageable. But if you need to for your business, if that is a challenge that you found in the public cloud,
Starting point is 00:05:11 a challenge you found on-prem, a challenge you've seen in different domains, you're going to want to think about what OCP can bring you from a security perspective: being able to have a secure supply chain, a root-of-trust methodology for signing firmware and BIOS, or being able to report.
Starting point is 00:05:27 I actually think the value for us was, sure, savings, but truly a path to a much more compliant, much more manageable fleet, from a control-your-own-destiny standpoint. You know, you have a CVE, you're going to address that. You don't have to wait 18 months for some vendor to give you a blob that is not inspectable, that is not understood, where you're just at their mercy. And if you don't roll it out and something goes wrong in the deployments, now you're 18 months from another one. Now, obviously, with your history, you're extremely deep in OCP. So I'm sure that depth has been an advantage on this journey.
Starting point is 00:06:06 And while I'm sure that you have an amazing IT staff, you don't have the endless engineers that some of the hyperscalers have to run their data centers. How do you manage and mitigate when you have reasonably sized resource pools and reasonably sized budgets, and you're looking at implementing technology that was designed, potentially at first, for IT departments that are much larger? I think it's the unexpected things that catch you up, right? I'm deep in OCP, in what the specs mean, and in the fact that they are more descriptive than prescriptive. So I know what to ask for. But one of the big ones that caught us was top-of-rack switches. We went with ORV3 racks, which really future-proof your data center to go up massively in power delivery. Everything's DC.
Starting point is 00:06:53 Guess what? You're not going to find a top-of-rack switch from a standard vendor that is ORV3 compliant. You're not going to find one that's DC compatible. You're going to have to work around it: you can put inverters on there, and you can do all sorts of things. And we came up with what I would argue is a very clever way of solving that problem so it works in our environment. But it was one of those moments of, oh, this isn't a hyperscale problem, because they're building their own. And so they can do whatever they want.
Starting point is 00:07:20 But the broad ecosystem that's selling into 19-inch cabs in brownfield installations across all the data centers everywhere in the world hasn't seen the market opportunity yet to solve this problem. It wasn't the rack itself; it was the power that had to be converted correctly to be able to use off-the-shelf networking gear, versus what we see on the server side, which is ready to go. And that's this interesting moment of you don't know what you don't know until you get there. That makes a lot of sense. Now, if you were talking directly to someone in your equivalent position at another enterprise who is considering starting this journey. Yeah. What are your top suggestions about how to get going
Starting point is 00:08:02 and how to engage this wonderful, vibrant community? I think the number one thing is to know what you're looking for. Understand your current footprint from a compute perspective. And fuel. Understand your storage footprint. Are you predominantly running legacy ISV software, and therefore you're going to need certified systems doing X, Y, or Z?
Starting point is 00:08:24 If that's the case, you're probably not in the right spot to come to open hardware. Open hardware loves open stuff. Open source is the name of the game. If you are looking at OpenStack, if you're looking at Kubernetes, if you're running KVM, if it's KubeVirt, if you're in this ecosystem where people are running containerized or even virtualized workloads, but they're doing it on a modern, modernized stack, you're going to have a lot more options
Starting point is 00:08:49 from a supplier perspective than if you're locked into these legacy ISVs. So that's kind of the first consideration as someone's looking at their IT spend and looking at their systems design. The next is to look at your mix of storage and compute. Storage in and of itself can be incredibly expensive. These are much more expensive servers than your compute servers. Well, you're probably buying more compute servers, depending on your mix and your workloads. But you're going to have a lot more cost tied up in all the drives, in everything you're trying to accomplish. So understand: do you have data lakes? Do you
Starting point is 00:09:25 have data warehouses? Are you in this analytics domain? Do you actually understand where the data is, what the data processing mechanisms are going to be, what the SLOs are for your application teams? That is also going to be an anchor spend that you need to think through in your IT strategy. And then I think the last part is: where do people want to go? Where do they want to spend money? When we're looking at it from an infrastructure perspective, most of us don't want to spend money on our management stacks, on our database management
Starting point is 00:09:56 when there are so many good open source solutions. So you want to spend where it's returning value back to the company. You also want to build where it's IP that's potentially differentiating. And so you're trying to find the sweet spot with open hardware, where it's not necessarily that you want to build it yourself. I don't need a custom server. I want to use as vanilla, off-the-shelf servers as possible,
Starting point is 00:10:20 but I want to get them from somebody who's not going to put 15 different layers of warranty, maintenance, and manageability software overhead on them that I don't want to use, because it's going to make me less secure and less responsive. So you look at OCP for that, because the way the hyperscalers move the market means they can get good supply in quantity and customize it for what they need in their niche. So if you can draft off that investment for the core motherboard, for everything that you're going to do for the PCBA, that's going to help you amortize your spend. That's where I would really help people understand: is this where you want to spend all your money? Would you like some french fries with that
Starting point is 00:11:15 hamburger? And then let's go focus on getting complicated and fancy where we have to, because it's accruing value. That makes a lot of sense. Now, your presentation talked about a year's journey. What have you accomplished in that year? Yeah, so lots of things. Obviously, we designed and selected our new servers, which are compute, storage light and storage heavy, if you want to call it that, and then GPU servers as well, to be able to run our on-prem footprint of what is core to our business. We're not trying to take everything out of the public cloud. We're trying to rationalize our use of the public cloud where it's best for our business.
Starting point is 00:11:52 So experimentation, ephemeral compute, things where cloud is great, things that are super low latency to end users, areas where you want to be protected against DDoS. These are domains cloud is great for. When we're talking about our on-prem footprint, we want our billing, our backend, our payments, and things that are core to our business and latency sensitive.
Starting point is 00:12:13 Right. And that are really critical for compliance, audit and security. That's really been our focus. So we designed our servers. We purchased our new data center spaces. We are in the process of shutting down legacy data centers and building up some of our legacy data centers so that they are modern and capable. We also built out a hybrid cloud stack that is all open source based, to be able to run across our public and private cloud footprint with a consistent micro system from a placement engine
Starting point is 00:12:42 perspective for vended VMs and containers and clusters. And that has been the goal of this first year. There's a lot more to do to really make sure we shut down all the legacy data centers, some of which were, you know, pretty inconsistent. There wasn't a single power footprint design across every site. So there's lots of opportunity to improve that, so that you have a true active-passive configuration and, you know, the ability to deliver a high-reliability experience to your end users. Now, you started this process with 20% of your workloads on-prem. Yep. Where are you going to end? That is a great question. One of my favorite quotations is,
Starting point is 00:13:28 every model is informative, but none is definitely accurate. So we've modeled lots of scenarios, and if workloads were to stay consistent, it would be illogical to run less than 80% on-prem. But nothing is going to stay consistent, because while I'm working on this part of the infrastructure rebuild,
Starting point is 00:13:56 all sorts of teams within Geico are also changing how we do data, how we process and interact from a digital perspective, how we run our call center. So all of that modernization effort is going to change the workload and what is needed. I will always bet on on-prem being cheaper than the cloud if it's predictable. But if things are not predictable, the cloud has great elasticity in compute and compute type, right? You have so many options. So where will we end up? Will it be 80-20? Will it be 60-40? I don't know.
Starting point is 00:14:20 But looking at what we're currently serving, it would be much more logical to serve all of that on-prem. And then it frees up CapEx dollars to spend on interesting, new, innovative projects that may very well change that program and change that configuration. And you do all this and you still run and you still sing. And you did all this in a year. That's incredible. I wasn't here the full year. And massive props to my boss, my previous boss, Harry Govind, who started this journey and brought me here to run it. He had a lot of passion coming from Meta, having gone through a lot of this at Target. And I think he was a true believer in the value: stripping out non-value-add features, focusing on the core business, focusing on open source,
Starting point is 00:15:05 focusing on open hardware, controlling our own destiny and building what we need. That's fantastic. Where do you think OCP is going? I mean, we're sitting in a conference that has 7,000 people. 7,000 plus. I remember when OCP was drawing like 150, because I'm that old. Where do you think OCP is going in terms of the broad market? And do you think this is just the beginning of enterprises taking advantage of this, and we're going to see a lot more of that? Certainly, I want to say yes. I think OCP has a lot of work to do to help enterprises feel more seen. And it starts by listening, right? It starts by engaging enterprise communities and helping ensure that they are actually getting a voice at the table. OCP's perception
Starting point is 00:15:53 has changed since it first started, when everyone thought it was all about Facebook. You know, Goldman Sachs was one of the founding members, and Rackspace, which was very much serving enterprises, was one of the founding members too. It started out being about scaling innovation through an ecosystem, and it should stay true to those roots. But to do that in these domains, you can't be 100% talking about immersion cooling, as much as I love immersion cooling. You can't be 100% talking about chiplets. You can't be 100% talking about CXL. Stranded memory capacity is a real challenge, but it's not going to be solved by disaggregating all memory. So we have smaller problems to solve, and they're real problems: problems of interoperability, problems of usability, problems of manageability.
Starting point is 00:16:37 They're the core of what the previous 10-plus years of OCP have been. We don't need to stop integrating. We don't need to stop innovating. We don't need to stop facilitating those discussions. But we also need to recognize that generative AI is not necessarily the core business of the vast majority of enterprises. They still need traditional compute and traditional storage, and really good, reliable solutions. And if we just layer additional costs in every space because AI is the buzzword du jour, that's where we're going to be at with it. I've been fascinated
Starting point is 00:17:10 walking through the expo by all the ways in which people are building LLMs into management software and DCIM solutions. And that's probably not where people are going to spend extra dollars of their infrastructure budget. Not because fleet management doesn't matter. Not because we don't care about assets and inventory being easy to facilitate. But having a full augmented reality overlay for your data center is something that you would invest in if you're running 300 physical sites. Probably not if you're running three. So you're not going to be buying a nuclear power plant anytime soon, is what you're saying?
Starting point is 00:17:49 I think it's a great clean power source for us to be investing in as a distribution mechanism. But that is something that I'm going to purchase via a colocation facility for my data centers. And so I want to make sure that they have good battery backup systems and building monitoring solutions, so that I can do remote management and direct troubleshooting. And each and every one is going to have a different sensor array. So whatever solution you try to sell me, and whatever model you think we're going to run, is probably not going to fit all the different sets of constraints at the same time. So let's solve the scale problems
Starting point is 00:18:25 and let's solve the small problems with an eye towards total cost of ownership, with an eye towards helping businesses run effectively. Then OCP is really scaling innovation in terms of open hardware and open software, open source design patterns that can solve real problems. Now, you're at OCP this week. I know that it's been a flurry of a day for you. I'm going to ask you one question beyond Geico, which is: I know that you've talked to a lot of different companies here. You've walked the show floor. What was the most exciting announcement you've heard at the show that doesn't have to do with Geico? There's a lot.
Starting point is 00:19:07 Speaking specifically as Rebecca and nobody else, I really enjoyed seeing the announcement of x86 as an ecosystem actually investing in sustainability and infrastructure. Right. And those two go together. There's a lot of x86 software written in the world. I want to see continuous investment in the software and the hardware to ensure that it continues to evolve. I'm very sure five years ago, even three years ago, you wouldn't have seen those two on stage together. And that was a pretty impressive ecosystem moment that I thought was fantastic, interesting.
Starting point is 00:19:42 All the words that can come to mind; I'm just glad I lived to see it. I thought there were some interesting conversations around scaled fabrics, how we're seeing Ultra Ethernet come together and where we're starting to see that make progress. I am a huge fan of it. Whether it's InfiniBand, whether it's NVLink, whether we're talking about proprietary solutions in the interconnect space, I feel that stops us from solving problems collectively. And I think there's just so much worth engaging the ecosystem on, and we truly are smarter together. I get excited when I see open ecosystem approaches towards connectivity, towards scaling, actually growing legs.
Starting point is 00:20:28 Nice. Running well together. So those are probably two that jumped out at me of, yes, we're really starting to see that even in the ASPs, which will be what OpenML has offered in terms of open models and the way in which you're seeing different accelerators come and join and show that those models can run fast within their domain-specific acceleration spaces, all the funnys that are out there. I think the journey is not solved.
Starting point is 00:20:54 Problems are not solved in this domain. Models are changing so fast. The ecosystem is changing so fast. And so starting to bring collaboration together with open minds is a place where we're going to see a lot of integration. And it's all of them. Everybody has an AI chip, an AI accelerator, a fabric that is working on a backend compliant with X, Y, or Z to help people move forward. That's awesome. One final question for you. Where can folks engage with you and continue the conversation? I am sure they're going to want to. I'm on LinkedIn. That's probably the best place to reach out. And I'm happy to chat especially about the enterprise journey and what it takes to be effective.
Starting point is 00:21:37 Always fun to have you on the show, Rebecca. Thank you so much. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.
