Software Huddle - Valkey After the Fork: A Conversation with Madelyn Olson
Episode Date: July 16, 2025
Today, we're talking Valkey, Redis, and all things caching. Our guest is Madelyn Olson, who is a principal engineer at AWS working on ElastiCache and is one of the most well-known people in the caching community. She was a core maintainer of Redis prior to the fork and was one of the creators of Valkey, an open-source fork of Redis. In this episode, we talk about Madelyn's road to becoming a Redis maintainer and how she found out about the March 2024 license change. Then, Madelyn shares the story of Valkey being created, philosophical differences between the projects, and her reaction to the relicensing of Redis in May 2025. Next, we dive into the performance improvements of recent Valkey releases, including the I/O threads improvements and the new hash table layout. Along the way, Madelyn dispels the notion that the single-threaded nature of Redis / Valkey is that big of a hindrance for most workloads. Finally, she compares some of the Valkey improvements to some of the other recent cache competitors in the space.
Transcript
March 20th, 2024, Redis publishes this blog post announcing the shift from BSD to SSPL.
Did you know about this before I did or was the blog post the first thing you had heard
about it or how did that come down for you?
They had told us about 24 hours in advance, just as a heads up, by the way.
But no, we were basically learning about it at the same time that everyone else
learned about it. Okay, so then how quickly did the fork come together and you decided, hey, we're
going to do some sort of fork? Yeah, what was that timeline like? So I think
I have these dates right. So they happened on March 20th. We had basically decided we were going to fork on March 22nd.
When was the first Valkey release?
So April 16th was when we launched the fork.
And now, of course, we have some new developments from just last week, exactly a week ago.
Redis announces they're sort of back to an open source license, AGPL.
I guess, like, does this change anything for you and Valkey
or how does that, like how do you think about that?
What's up everybody, this is Alex
and I'm super excited to have Madelyn Olson
on the show today.
She is a principal engineer at AWS
and she's like one of the big people
in the caching community.
She was core maintainer for Redis for a long time.
I think she's the person a lot of people thought of for that.
She also was one of the co-founders of Valkey
after the Redis license change.
So we talked about her background.
We talked about the Redis license change
and sort of how she found out, what happened,
and how that led to Valkey.
And then we spend a lot of time on just, like,
performance improvement stuff in Valkey
and what it's like working on that.
And it's just really interesting, I think,
to go deep with someone that technical,
able to explain like these low level things
that I have no experience with.
So it was really cool there.
She also had a really good story about using AI at the end
that was interesting,
and I hadn't heard that one before.
So check that out.
If you have any questions,
if you have any people you want on the show
or comments, anything like that,
feel free to reach out to me or Sean.
And with that, let's get to the show.
Madelyn, welcome to the show.
Thanks for having me.
Yeah, I am very excited to have you on.
You are a principal engineer at AWS.
And I think more famously,
you are like a former core maintainer of Redis.
You're one of the creators, founders,
I don't know how we say that, of Valkey.
And just like, as I've gotten more into the caching world the last couple of years, your
name is the one that always comes up.
I think you've been mentioned on this show by Roman from Dragonfly and by Khawaja at Momento.
And just like, yeah, a lot of people speak really highly of you.
So I'm excited to have you on.
I guess for people that don't know as much about you, can you give us a little bit of
your background?
Yeah, so yeah, my name is Madelyn.
I am a principal engineer at Amazon within AWS.
I work primarily on the Amazon ElastiCache
and Amazon MemoryDB teams.
I've actually only worked there.
I'm one of the few Amazonians who joined a team
and have worked there for over a decade now.
I can actually say that.
I've worked on the same team for 10 years.
So early on, I was always a big open source proponent.
So within a couple of years, I started contributing a lot to Redis open source.
So we were the managed in-memory database team.
So we managed Memcache and Redis for a long time.
So I advocated a lot to, you know, take all those things
that we were sort of putting into our internal forks of these projects and pushing them upstream.
The thing I was most well known for was pushing a lot for TLS within Redis.
Redis was very resistant to accepting it for a long time, but after a bunch of iterations, we figured out how to make that happen.
And so for that work and the other work, when Redis moved to an open governance model in 2020,
I got asked to
become one of the maintainers for the project.
So that was sort of my big involvement.
So I was an open source Redis maintainer for four years.
Redis very notably moved away
from an open source license in 2024.
And so that led me and some of my other contributors
from the open source Redis project to make Valkey.
And that's what I've been working on for the last year or so.
So yes, co-creator, co-founder, co-founder sounds a little pretentious.
Co-creator sounds better.
Yeah.
How do you talk about something that's like an open source project that's within a foundation?
I don't know how to talk about that, but whatever.
You are one of the impetus, impetai behind the project.
Yeah, I'm very friendly. So I tend to be the face of a lot of things, but like,
I always feel bad because I don't write the most code. Like there's other really smart people in the project and I want to highlight them more, but they're engineers. But also that, you know,
explanation piece and just, like, sharing and marketing type pieces are super important,
not marketing,
but just making other people aware of it.
I remember watching your re:Invent talk,
I think it was 2020, is that when you did
the ElastiCache performance one maybe?
Yeah, that was the first time I heard it.
I was like, wow, this is a great talk.
And then I just see your name popping up
all over the place since then.
I think that's just a really important role
to make people aware of all these things.
So, okay, you mentioned you've been at ElastiCache for 10 years. Was that like
your first job out of like, did you have other jobs too? Or did you like, was that your first one?
No, that was my first job out of college. I was actually an electrical engineer.
I had started getting involved in databases. There's some like research I was doing in undergrad.
And I got very kind of involved in
hacking with Postgres because it wasn't meeting my analytic needs at the time. We were trying to do
analysis of, we were trying to build, like, passive radar systems, and we had this huge
amount of data. We were trying to hack things like transforms and stuff
into Postgres. And then basically, I got an offer from Amazon to work at DynamoDB at the time.
There you go.
And then I joined and it's like, oh no,
actually you're working on this small other NoSQL service called ElastiCache.
I was like, oh, okay.
And so yeah, I've just been doing that for 10 years.
It was kind of not what I was expecting.
Yeah. Yeah. And like, were you like a low level performance expert? Like, were you like interested
in that sort of stuff? I mean, like, I guess how did that come about?
Sort of? Yeah. My actual pathway, so again, I was an electrical engineer. So all my coding
came from those, like, programming competitions, like the GPC one, the Google one. And then, like,
there's this thing called Project Euler, which is more of, like, a math heavy
programming competition thing. And so that was my entrance.
So like I was actually very good at low level optimizations.
Like I had written assembly code quite often to optimize stuff.
I actually was really good at x86,
like actually knowing the instruction set when I joined Amazon.
And I was basically asked to write a bunch of Java code
at the time, which was fine.
It was okay.
But no, yeah, it's kind of like one of those things
where I think the traditional CS
pathway doesn't do a lot of that very low level
software engineering stuff.
There's also computer architecture classes,
which, like, most CS students take an introduction to computer architecture, but I had taken
a higher level, like a 500-level, master's level computer engineering
class by the time I joined Amazon. So that also helped a lot with this type of
programming specifically, because like if you're running mostly in JavaScript, you really
just care about algorithms, you really don't care about like, you know, the memory layout
of the data formats. That's totally true. Like I think this caching
stuff is so interesting. I try to read the Valkey release notes or some of the papers,
and it's way harder for me even than database type stuff,
because it's dealing directly with the memory, it's so close to the metal, where I'm
like, man, I just have no context for understanding some of this stuff, which makes it super interesting, but also hard to understand
how it works.
Okay.
So you mentioned one of your early things was convincing them to add TLS to Redis Core.
I guess why did they resist it for a while?
What's the story there?
Yeah. So Amazon ElastiCache wanted to build TLS because a lot of our customers wanted it for compliance reasons. It's big for FedRAMP and HIPAA compliance. But if you think about what
Redis is, it's a hash map attached to a TCP server. So adding TLS is actually extremely heavy overhead, right? In the early days,
before the Graviton instances had kind of done a good job of optimizing that a bit, and modern Intel hardware is pretty good at optimizing the crypto involved, it was somewhere around a
30 to 40% CPU hit. So if you were bottlenecked on basically like the requests per second that
the cache could do, you would be paying twice as much to enable TLS.
So it's a very big impactful problem.
And basically, you know, the goal at the time of Redis was,
let's keep things simple.
So changing that paradigm, so that's, you know,
it was actually very intrusive
to the networking layer.
So what everyone was using at the
time was basically a TLS proxy in front of Redis. So they terminated the TLS at
the proxy layer and then sent TCP to Redis. But as we just talked
about, all Redis is is a TCP server plus a hash map. And so you're basically
adding another TCP server. And so, you know, that was the dogma at the
time: use TLS proxies. And it was just really expensive. So we were like,
this needs to be in the engine. It needs to be, you know, natively there. You don't want to
pay this double cost. So we had built this inside ElastiCache and it was sort of like the first big
change inside our internal fork within AWS. And we were finally starting to pay the cost.
We wanted to stay compatible with Open Source Redis,
but we also wanted to have this code.
And so it was very difficult.
So we were very motivated to get this code out of our internal fork into open source.
And so we were prototyping a lot.
So basically there were three main stages of us pushing this code to open source.
So the first was, hey, let's just dump what we have. And we got sort of the pushback that it was kind of too complicated. So we tried to refactor it a bit into like sort of like a pseudo
proxy where we're able to like reduce some of the overhead and like kind of build the proxy into one
process. But then we saw like, you know, it was using 160% more CPU than it was if we just did it
normally.
And so we finally were able to take some of the learnings of that prototype and build
a new implementation that actually is what got finalized as the abstraction layer.
So we have this connection abstraction, which kind of abstracts away the TLS part.
So it still kind of looks like TCP, but can also support TLS.
So, you know, it's that sort of like iterative approach that like finally
got us to get it accepted in open source.
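To make that connection abstraction idea a little more concrete, here is a minimal sketch in C of the kind of layer being described: command processing reads and writes through one interface, and the backing transport can be plain TCP or TLS. The struct and function names here are illustrative, not Valkey's actual internal API.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#include <openssl/ssl.h>

typedef struct connection connection;

/* "vtable" of transport operations; command processing only ever calls
 * through this, so it doesn't care whether the bytes are encrypted. */
typedef struct connection_type {
    ssize_t (*read)(connection *c, void *buf, size_t len);
    ssize_t (*write)(connection *c, const void *buf, size_t len);
    void (*close)(connection *c);
} connection_type;

struct connection {
    const connection_type *type; /* TCP or TLS backend */
    int fd;                      /* underlying socket */
    SSL *ssl;                    /* NULL for plain TCP */
};

/* Plain TCP backend. */
static ssize_t tcp_read(connection *c, void *buf, size_t len)  { return read(c->fd, buf, len); }
static ssize_t tcp_write(connection *c, const void *buf, size_t len) { return write(c->fd, buf, len); }
static void tcp_close(connection *c) { close(c->fd); }

/* TLS backend: same interface, OpenSSL underneath. */
static ssize_t tls_read(connection *c, void *buf, size_t len)  { return SSL_read(c->ssl, buf, (int)len); }
static ssize_t tls_write(connection *c, const void *buf, size_t len) { return SSL_write(c->ssl, buf, (int)len); }
static void tls_close(connection *c) { SSL_shutdown(c->ssl); SSL_free(c->ssl); close(c->fd); }

static const connection_type CT_TCP = { tcp_read, tcp_write, tcp_close };
static const connection_type CT_TLS = { tls_read, tls_write, tls_close };

/* The rest of the server calls helpers like this and never touches SSL. */
static ssize_t conn_read(connection *c, void *buf, size_t len) {
    return c->type->read(c, buf, len);
}
```

The design point is the one she describes: to the command layer it still "looks like TCP", and TLS support is swapped in behind the same interface instead of bolting a proxy in front of the server.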
And I really like that story.
Like I can go into a little more detail, but also, it wasn't just AWS,
it was AWS and Redis sort of working together to sort of figure out,
what's the right thing for the project?
And that's always what I think is the most important thing,
kind of, for building the right thing for open source projects.
You see a lot of single vendor projects
where they're like, we're doing it the way we want to do it.
And they don't really take into account what the community really wants.
Yeah, interesting.
And so timeline-wise, was that like 2018, 2019?
Is that when? That was 2018 through 2020. It took two years.
Two years to get that in. Okay, wow.
It was a little hard because, like, when I started contributing it was really just Salvatore. Like he was the main contributor to the project.
There were some other folks as well,
but there was no, like...
Well, one of the big contributions I made to the project was I created a Slack channel for us all to talk more online. We had online syncs. They weren't
video. You know, it was like once a month on a Wednesday, we all got on the Slack
together and asked each other our questions. And so we talked a lot about TLS there.
So was that replacing, like, email lists or GitHub issues,
or what was going on before then?
I think it was mostly replacing that.
Yeah, I was trying to get like, you know,
cause GitHub issues are very low density.
You have to type a lot.
And so I was trying to replace that
with a more high density communication.
People didn't want to use video at the time,
but actually, maybe,
Slack might not have supported it at the time.
I don't remember.
This is like pre-Zoom.
This is like pre-pandemic.
So I was thinking it was a little weird.
Yeah, for sure.
I remember Slack bought that one company
and integrated their video for a while.
And then I think, I mean, it's definitely not their current,
their video implementation, but for a while
they had some other video provider in there.
So maybe it was that.
Yeah.
So.
Yep. Okay. And then so at
some point you became a core maintainer. When did when did
that happen?
So that was in 2020. So Salvatore stepped down in July of 2020.
Right? Because it's like, this is three months after the
pandemic started. So it was like kind of weird time wise. It's
like sort of surreal. So he stepped down. So Redis went from a BDFL,
a single benevolent dictator for life to a core team.
So the core team was still facilitated
by the company Redis.
And so they had three folks from Redis, the company,
and then me and an engineer from Alibaba named Zhao.
So we were like the core team.
So we made decisions on what we called major decisions,
for like any API changes. That went
through this little core team. So we met up and we decided what was the direction of the project.
Yeah, gotcha. So is your work primarily like Redis, now Valkey, core data plane type stuff?
Did you work on any control plane type stuff for ElastiCache or any of that sort of,
you work on all that too?
Well, so not anymore.
So now I just work on open source Valkey.
So I've kind of ebbed and flowed
about what I'm working on.
Like at the end of the day,
Amazon wants me to produce value for the company.
So like, what's the best?
Well, and that's in today.
I should say it wants me to be working what's best for
our customers, right? That's. So most of the time, the last
couple years, the best use of my time for our customers is to
deliver software open source. But early on, like we didn't
really do much in open source. One of the first very big
projects I did, I was part of Rolling Stone, the migration
off of Oracle.
So there's a big initiative within Amazon to move from Oracle to a lot of it to Dynamo,
some of it to Redshift, some of it to other databases.
So I was the one that migrated our Oracle database for our service to Postgres, which
was a fun thing to do.
Yeah.
Yep.
Yeah.
I've heard a lot about Rolling Stone over the years.
So that's fun.
So I've done that stuff.
I also built some other features as well.
But the big thing I have done was I own version currency.
So I was the one responsible for making sure our internal version was compatible with open
source.
How hard is that, like, maintaining this fork?
Yeah, I guess like, is that like a total headache
or is that like pretty manageable to do something like that?
I mean, that just seems like a, yeah, big job.
It is a big job.
It takes us a good bit of time.
You can kind of see how much time it takes
for Amazon ElastiCache to launch a version
compared to when it was launched in open source.
You can kind of see that little lag there.
And it's something that I would like to see us do better at over time.
Because a lot of it's because we've built stuff that really should just be open source. There's some very big things we've built.
Like in the managed service, we have a feature called data tiering.
So we can spill data from in-memory onto SSDs and pull it back in all transparently to the end user.
So that's like one feature we have that,
maybe it should, maybe we'd like to open source it,
but how are we gonna open source it?
How does it work?
It's very finely configured for the ElastiCache service.
But there's other stuff we kind of talked about
that was like, yeah, we should just open source it, right?
Like the performance stuff that we pushed into Valkey 8 was all stuff
we had built internally, and we were like, we don't need this to just be internal.
This should just be in open source, and we don't want to have to deal with
merge conflicts.
So we just chopped it all into open source.
Yep, gotcha.
Gotcha.
So has that gotten easier with, I guess y'all still offer Redis somewhat,
and then I guess like, has that gotten easier with Valkey, of just like, hey, there's
a smaller divergence between the fork and the open source one?
I would really like to say yes, but at the same time, you know, we're delivering features
for our customers at a high rate.
And so sometimes there's this big internal conversation we have, which is like, do we
upstream something first or do we upstream it later?
So do we build something primarily in the open source, then merge it back, or do we
build it in our internal fork and then figure out ways to push it to upstream later?
And the main difference is just time to market, right?
We can build something internally much faster
and push it upstream later, but that's not always the case.
And sometimes that meets requirements
and there's a lot of internal politics
that just happens at all companies.
And so we were constantly having this conversation.
So we're, you know, and that was the status quo, right?
Status quo, 10 years ago,
we always built something internal first,
and then we upstreamed it.
So it's been this 10-year kind of slow
shift more towards upstream first.
Yeah, that's nice. Yeah.
So one day I think we'll get there.
But yeah, cool, cool.
OK, I want to talk a little bit about the license change,
just for some background before we get into the Valkey performance stuff.
So I guess first of all, just walk me through this. March 20th, 2024,
Redis publishes this blog post announcing the shift from BSD to SSPL. Did you know about this
before I did, or was the blog post the first thing you had heard? Like, how did that come down for you? Yeah, so they had told us about 24 hours in advance, just sort of as a heads up, by the way.
But no, yeah, we were basically learning about it at the same time that everyone else learned about it.
And so yeah, they moved from the BSD license to a dual license, SSPL and RSAL, which are two proprietary licenses.
And at the same time, you know,
but like I think that's like the thing
that most people notice, but at the same time,
they also dissolved the open governance
that we were talking about before,
which you kind of have to do
if you're just gonna unilaterally change the license.
And then they also added a CLA at the same time.
So that's when you contribute,
you also have to
sign basically allowing them to take your contribution and relicense it.
Gotcha.
Gotcha.
Okay.
So when you say they removed the open governance, does that mean like you and Zhao were no longer
maintainers?
Is that what that means?
Yes.
Okay.
Right.
They moved from...
You know, they're actually a little vague now.
I'm not entirely clear what the system is,
but they did dissolve the open governance
and they said so in their FAQ that,
hey, this wasn't working, so we're gonna do something else.
Yeah, gotcha, gotcha.
Okay, so then how quickly did the fork come together,
and you decided, hey, we're gonna do some sort of fork?
Yeah, what was that timeline like?
So I think I have these dates, right?
So that happened on March 20th.
We had basically decided we were going to fork on March 22.
So then we had AWS, Google, Ericsson, I believe,
Alibaba on board at that time, as well as some other companies.
But those are the four who eventually would have maintainer spots. Like Oracle was also kind
of interested, Snap was interested, Verizon was interested. A lot of other companies who
weren't directly contributing were also sort of like, hey, we'd like to help with this, right?
We all have a vested interest in keeping the Redis project open. So at that point, we had also reached out
to the Linux Foundation.
So we had reached out.
I'm a very big believer in having a foundation
own the software to sort of prevent it
from ever getting re-licensed.
There's some other strategies,
like we could have used a more restrictive license,
or more of a copyleft license like LGPL,
which is another conversation that
was going on at the time, to sort of make sure this couldn't happen in the future,
this rug pull. But I'm a little bit less convinced of that. So we had engaged the Linux
Foundation and they were like, yes, we're interested in that. So, I think it
was a week after that. So I think it was March 28th, the Thursday or Friday,
we announced it as being under the Linux Foundation.
So there's about a little over a week.
I believe it was eight days from the announcement
to when the actual creation of Valkey happened,
which it's funny.
I think a lot of that time was actually coming up
with the name.
I think we had gotten all the stakeholders together, like, I think we had the stakeholders
by like Monday all kind of signed off.
But we had this name called placeholder KV.
And so we were still trying to.
I wish that was more popular because I have, like, a bunch of stickers that say placeholder
KV and nobody knows what I'm talking about.
Man, that would have been such a funny one.
Yeah, yeah.
That would have been a great long-term name for it.
Interesting.
And then, so, and then when was,
oh man, I should have written this down,
but when was like the 7.2,
when was the first Valkey release?
I think, I have a lot of dates in my head
and I think they're all right.
So April 16th was when we launched the fork. So that's when we were done with all the rebranding,
had the containers all set up. And you know, obviously, it's a fork, right? We
could have kept it a little bit leaner of a fork, but we also wanted to set ourselves
up well for the future. To make sure we were able to, like, there's some stuff within
Lua scripts like redis.call, which is built in. We didn't want that to be kind of the de facto thing. So
we also added server.call, so you could use a more generic term as well. So there
was some stuff like that. And then just a lot of rebranding, just to make sure everything
said Valkey instead of Redis.
Yep. Yep. And okay, so this is something that's happened a few times now. And once I think of like Mongo and like Elasticsearch being the one
that seems like most directly on point where there's already hosted services
from a lot of different cloud providers, and I guess like,
did it help to have that sort of prior art?
Did you consult with these other teams, like OpenSearch, around how to
approach this fork and deal with that?
What I think is interesting is actually, like, we'd actually applied a lot of the prior art
already, right? So a lot of what we learned from Elastic was that it's important to be involved in
these communities, right? Like I had been deeply involved with Redis for almost six years at this
point. And like, that was one of the things where, hey, one of the common things, you know,
people push at AWS is, like, hey, you guys aren't contributing back.
And so we're like, okay, we'll contribute back. We'll have people deeply invested in
these projects. And so that when the actual relicensing happened, I was in the community,
I knew all the key stakeholders, I had like spent a lot of time so that we were able to
get everyone together really quickly to sort of move fast in creating the fork.
I think that was the big thing that we learned.
I think that's the thing that Amazon is a big company, but it's one of the few things
that I think Amazon's getting better at over time.
We're trying to be more involved in open source.
Yeah.
Actually.
I see that around.
Yeah, that's absolutely true.
So yeah, good stuff happening there. And now of course, we have some new developments just last week, exactly a week ago. Redis announces they're sort of back to an open source license, AGPL. I guess, does this change anything for you and Valkey, or how does that,
like, how do you think about that? Yeah, so it doesn't change too much
of the direction of the project.
So as I kind of mentioned earlier,
there's like three things that Redis changed back in 2024.
They dissolved the open governance.
The open governance was not reinstated
sort of after this relicensing.
They still have a CLA, so they're still able to,
like any project with a CLA, even if it's under an open source license,
they're still able to do a rug pull
and go closed source again later.
But yeah, but AGPL, it's a good license.
It's an open source license.
I'm happy. It's good for the open source community
to have Redis back under an open source license.
So that's all great.
But from Valkey's perspective,
it's already been a year, we're so well
established, the code has diverged as far as we know quite a bit. So, you know, it's like trust
has sort of been lost. So like, nobody in the community really wants to go back and like,
re-merge. Like, a lot of people are like, are you going to merge back with Redis? I'm like,
no, a lot's diverged since then. And then also, we'd have to have the conversation about moving
from BSD to AGPL, which I don't think anybody really wants to do.
Yep.
Yep.
Is that the only way it could happen, if you re-merged?
I don't know enough about licenses
on how that stuff works.
Would it have to be that license?
So there would be basically two things.
One is Redis could use its power in the fact
that it has the CLA.
It can re-license everything back to BSD,
which I think they've said they're not going to do.
Yeah.
The other thing was, yes, Valkey could.
So Valkey's code is all under BSD.
So BSD can be relicensed to be AGPL.
So Redis can take any code that we've written,
relicense it, and merge it.
We can't do the inverse.
We can't take AGPL code, relicense it to BSD, and merge it.
So it's kind of like a one way, right?
But Redis the company could do it.
Yeah, yeah, gotcha.
On the, so you said the code base
is starting to diverge and I want to get into improvements
and stuff next, but like, as of right now,
is it still like a hundred percent drop in
from Redis to Valkey?
Where like, hey, some under the hood stuff has changed,
but like the API is still pretty compatible.
Or like, are there certain areas where it's like, hey, do you use this?
Well, then you need to think about this change.
Or like, where's compatibility between the two right now?
So the way Valkey's been framing our compatibility story
is if you're using the set of APIs from Redis
open source 7.2, right?
That's where the fork point happened.
We're still fully compatible at that point in time, right?
Cause that's sort of what we think is like, you know,
the real, you know, core set of APIs.
Redis has introduced some APIs since then.
They've introduced a lot in 8.0, for example,
which Valkey does not have compatibility with.
So there's sort of like a superset of APIs that Redis has.
Valkey also, it's like they're not exactly overlapping supersets.
Valkey also has some APIs we've built, and we have implemented some of Redis' functionality.
So there is some amount of, you know, if you're using this functionality, it might work and behave differently in Valky.
But if you're using these core APIs,
they should all behave basically the same way.
Got you. Got you. So the new,
whatever, vector sets I mentioned,
yeah, not going to be in Valkey.
But yeah, a lot of core stuff will.
Okay. Okay, cool. Okay.
So let's talk about improvements that have been coming out.
You had an 8.0 release,
you recently had the 8.1 release. If I sort
of look at it, like the big, the things that stick out to me, and you can correct me if
I'm wrong here, would be like the IO threads improvement, the new hash table stuff, which
was 8.1, some replication stuff, a lot of like core sort of like performance and quality
of life type stuff around that. I guess like, let's start with the IO threads improvement,
because, like, Redis is famously single threaded,
has some benefits, some trade-offs.
We've seen some, like, sort of Redis compatible caches
that are multi-threaded.
Like, is this similar to that, like, where it's like,
hey, now Redis is fully multi-threaded?
Or, like, what are these IO thread improvements?
Yeah, what do they add, what are they about?
Yeah, so I wanted to take a step back and talk about why we care about multi-threaded performance generally,
and then how we sort of try to take that and apply it to what we built with IO threads.
So if we're on most of these kind of caching that run in cloud virtualized environments,
if you have an Excel, let's talk about AWS.
So if you're on like an R7G.ATXL, you have, you know, about like 100 gigabytes of RAM and a lot
of extra cores. So if you're running in a caching environment, you're generally not using a lot of
cores. Caching is a very efficient CPU type of workload, but it tends to be very
much bottlenecked on the amount of memory you need. So you need a lot of memory and not a lot of CPU
cores. So in that world, it was fine that Redis was single-threaded because you would just have
a lot of memory and it would work out fine. We've started to see somewhat of a change because we're
seeing more CPU intensive operations start to take over workloads. Like vector similarity search, search, these are all much more CPU
intensive operations. And so you might have kind of the same amount of data, but now
you're running on these boxes with a lot of extra cores, so it's good to utilize them.
So like that's sort of the mindset, like we still think command processing, like simple gets and sets are still really fast.
And Valkey and Redis both have horizontal scaling,
so you can have multiple processes.
So just doubling the number of cores
and doubling the throughput,
you can just do that by scaling out.
So we wanna make sure that we're benefiting these more
like search intensive workloads.
And the other benefit is if you have a single hot key,
having multiple threads be able to serve that one key
reduces tail latency.
Because most latency comes from contention:
Commands are getting queued behind each other.
So if you're able to drain that queue more quickly,
you'll see lower tail latency. So, okay, so then the last point about that is,
with Valkey and Redis specifically,
they have a replication process that does like a full sync.
So usually you basically have idle cores
that are sitting around most of the time,
so that when you need to,
you could do that fork operation.
Right?
And so we have the, like,
you often have cores kind of sitting around.
So if you are a CPU bound workload,
we want to be able to kind of burst into those cores.
So to try to like piece all of that together,
when you have systems like,
like Garmin and Dragonfly are the two,
and I can say cache maybe to some extent,
are these big other caching systems,
which are very thread scalable.
They're targeting more the workloads of,
I just have one box, I just want to run one box,
and I just want to keep increasing the number of threads
like to not have to deal with any
of the horizontal scaling.
And so that's fine.
Like I think they're kind of trying
to solve a different use case.
The way we're trying to think about IO threading is really about optimizing the price performance,
or the cost-optimized performance, of a single box.
Right.
So we want to best utilize the cores we have available.
So we don't want to degrade single threaded performance because most caching workloads
kind of still run with that single thread performance, but we want to be able to kind
of burst into these extra threads sort of as
needed.
And so I think it kind of helps to also add in why ElastiCache specifically
built this IO threading work.
So IO threads had been available for a while, uh, within open source.
I think you mentioned it was in 20, no, I'm getting things messed up.
Someone else told me this morning that it was released
in 2020. Okay.
In that blog we both read. Within Redis.
Yeah, within Redis. There's IO thread. Because I know ElastiCache
has this history of what like Enhanced IO and 2019. Like a lot of stuff that ElastiCache
has done. Yeah, go ahead. Yeah. So one of the things we launched in ElastiCache has done. Yeah, anyway, go ahead.
Yeah, yeah.
So one of the things we launched in ElastiCache recently
was the serverless feature.
And one of the things the serverless needs to do
is to dynamically scale the amount of
resources very quickly.
So one thing that being able to quickly add cores does is
allow you to burst into spare capacity.
So if you are over-provisioned,
you're able to kind of dynamically add cores, increase throughput,
while at the same time, horizontally scaling, adding more shards to a cluster.
So we added that so we could get that dynamic range, not because we care about how big of a number
a single process can do, but so we can dynamically adapt to performance.
So that's a lot of context,
which is sort of to say that I don't think
just having a single process do a lot of requests per second
is intrinsically beneficial, right?
It's really, you know,
it needs to be benefiting the end users we have.
So that sort of tries to add a lot of context
about why we specifically didn't go with, like, the main
alternative implementations. Why is Valkey not truly multi-threaded? Like why is everything
not multi-threaded? Because that's a lot of complexity and it's not solving these
core issues.
Interesting. Okay. Interesting. And that's, again, sort of beyond my expertise.
So tell me about, I was going to ask you about Dragonfly later.
Let's talk about it now. So, I guess there's a certain set of workloads where
their approach is sort of better. Is that what I'm understanding?
Yeah, like if you have a box with like 32 cores and like you've paid for this box, right? And you don't want to manage a cluster
distribution. You could just start Dragonfly on it and it will scale pretty efficiently. It's a very
thread scalable architecture. So when you add more threads, the throughput goes up.
And they have lots of great documentation about why it works well.
And they have a fully concurrent, well, I shouldn't say this too authoritatively. I believe, I have not actually looked at the code because it's BSL,
but I believe and Roman can argue with me if I'm wrong.
Sure. Yeah.
But it's like a fully concurrent HashMap, right?
You can do concurrent operations.
I believe they have a blog post.
They are able to lock specific segments of the hash table.
And so they're able to do parallel work
on multiple different commands at the same time.
And this also helps with latency as well.
It brings down tail latency because commands
don't get blocked.
So yeah, it works well and scales pretty well.
OK, interesting.
Especially with CPU-intensive operations,
if they're doing multiple very expensive CPU operations,
like if they're doing the command in
Valkey and Redis called SUNIONSTORE,
which does merging of two sets,
which is very CPU-intensive.
The fact that they can lock
different parts of the hash map and then do that work in parallel,
is good to help increase throughput.
Yeah. Gotcha.
Then tell me about in 8.1,
we have these hash table improvements,
and there's just really good blog posts on how that works.
It talks about Swiss tables and also how you had to optimize that
for Valkey's unique needs.
I guess tell me about,
first of all, before we even get into those improvements,
how do these types of improvements
even get on your radar?
Do they start in industry first?
Do they start in academia?
Is it just like somebody's like, hey, I
think there's some better way we can do this sort of thing?
What is the flow for these sorts of ideas?
So for the Swiss table, the hash table improvement stuff,
it definitely came from this general
knowledge about the hash table that existed before. Well, so, you know, for everyone,
the way hash tables work is you have a key, you run a hash function on it, and that points to a
bucket. And so when you have a hash collision, so two different keys point to the same bucket,
you need a way to like resolve which of those two you want to do.
So in the original implementation,
we used a linked list.
So if we had two,
We would just have them point one after the other.
And that uses a lot of extra memory,
because there's a lot of overhead in having a dedicated allocation.
So the common wisdom in the 2010s
was you should do what's called an open addressing type of
hash table.
So instead of having...
Basically, you put the item in the wrong bucket, but you know that if you checked
an item in the bucket and it's not the item, you basically keep checking until you find
an empty bucket.
So these are open addressing type of tables.
And so it's well known that these
are more performant and more memory efficient
on modern hardware.
On older hardware, it didn't matter as much.
But modern hardware is very good at prefetching and doing
a lot of instructions in parallel.
So although you're doing more work
and it's counterintuitive to stick an item in the wrong bucket,
modern hardware handles this really well.
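As a rough illustration of the open addressing idea she's describing, here's a minimal linear-probing lookup in C. It's a toy sketch, not Valkey's new table: a Swiss-table-style design adds per-bucket metadata bytes and SIMD comparisons on top of this, plus the SCAN cursor guarantees mentioned a bit later, which this leaves out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy open-addressing table: illustrative only, not Valkey's actual layout. */
typedef struct {
    const char *key;   /* NULL means the slot is empty */
    void *value;
} slot;

typedef struct {
    slot *slots;
    size_t mask;       /* capacity - 1, where capacity is a power of two */
} table;

/* Linear probing: hash to a bucket, and if the key there doesn't match,
 * keep walking until we find the key or hit an empty slot. Assumes the
 * load factor is capped so the table is never completely full. */
static void *table_find(const table *t, const char *key, uint64_t hash) {
    for (size_t i = hash & t->mask; ; i = (i + 1) & t->mask) {
        if (t->slots[i].key == NULL)
            return NULL;                                  /* empty slot: not present */
        if (strcmp(t->slots[i].key, key) == 0)
            return t->slots[i].value;                     /* found it */
    }
}
```

The "more work but faster" point is that these probes walk adjacent memory, which modern CPUs prefetch well, instead of chasing a separately allocated linked-list node per collision.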
So there was a,
like a paper and I guess like a video that was produced.
So Google built this thing in 2018
that they talked about a lot, called the Swiss table.
And so the Swiss table is sort of a very finely tuned version
of this table.
And so we had like known about it, we had seen it.
But there were some very specific requirements that
Valkey needed and Redis needed, I guess this was in the Redis time frame, that made it kind of
difficult to adapt this implementation to our implementation. Specifically, we have a way to
have like a cursor and you can like scan through all the items.
And we provide very specific guarantees
on how we scan over all the items.
And we weren't able to preserve those in the Swiss table.
So like, you can go, there's an issue.
I don't remember when it was opened.
It was something like 2021 or something.
And we just started like just noodling on these ideas.
And I think it wasn't until a conversation I had with another maintainer
of Valkey named Viktor, we were at an open source conference, and we'd kind of
finally figured out kind of how to solve all these problems. And so we had like a good
idea. And so, you know, it just took a lot of time to figure out how to adapt sort of
what was from academia into industry.
I guess in this case it was more from industry to industry.
But yeah, we were finally able to adapt it.
And we were able to kind of,
I was kind of surprised.
Like we knew it all in theory,
we put it all together, and it was,
you know, faster,
it saved like 16 bytes per object,
and it all sort of worked well, which is nice.
We launched in 8.1, which is about a month ago.
And I was worried there would be weird crashes or something.
But weird performance, more importantly.
But it's worked pretty well.
It's worked kind of as expected.
Yep.
How do you get confidence on such a big change like that,
that it's not going to have
weird performance stuff like that? Yeah.
Performance is hard. We were fairly sure about the functionality of it. We ran a lot of fuzzing.
We have a lot of built-in integration testing. So we were pretty sure it wasn't going to
crash. Performance is really hard to measure, especially because it's just very use case dependent.
We have no good prototypical examples of use cases.
We have some people who have large objects inside Valkey,
like 16 kilobytes.
We have other people with lots of very small objects.
And so we have a couple of workloads we ran
and from what we could tell,
the hash table is faster in all cases.
So that's a good signal that everything seems faster.
And the theory is it should always be faster, right?
Cause we're doing fewer memory lookups.
I guess I didn't talk about this
when I was talking about the IO threading work,
but one of the things that sort of changed recently is if you actually do profiling on the IO threading that was in Redis before,
a lot of the bottleneck is actually the CPU waiting for the memory access to DRAM, right? So when you go and try to fetch that item, the CPU stalls. It's like, I need this from main memory.
And it'll just sit there and wait for it to get sent over.
So one of the things that we spent a lot of time doing
is before actually executing the command,
we do a bunch of stages to prefetch memory
into the CPU caches so that when we actually execute the commands,
we're not stalling on all this stuff.
So it increases like actual command execution
by like two or three X, which only ends up being like a,
you know, two X performance improvement overall.
So that's, you know, all that stuff.
But one of the big things about this hash table
is we remove one of the random memory accesses
when doing a command's memory access, right?
If you have some piece of memory that you're hitting a lot,
it'll stay in the CPU cache.
If it's kind of random, then it won't be in the CPU cache.
You have to do the full miss.
So we moved from two full misses to one full miss,
which is a really big improvement.
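Here's a rough sketch of that batching-and-prefetching idea in C, using the GCC/Clang __builtin_prefetch builtin. It's illustrative only, not Valkey's actual pipeline: the point is just that you issue the memory loads for a whole batch of parsed commands up front, so that by the time each command executes, its hash table entry is already in cache instead of costing a full DRAM miss.

```c
#include <stddef.h>

/* Hypothetical parsed-command handle; in a real server this would point at
 * the hash table entry the command is going to touch. */
typedef struct {
    const void *key_entry;
} pending_cmd;

static void execute_batch(pending_cmd *cmds, size_t n,
                          void (*execute)(pending_cmd *)) {
    /* Stage 1: kick off the memory loads for every command in the batch. */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(cmds[i].key_entry, 0 /* read */, 3 /* keep in cache */);

    /* Stage 2: execute; by now the entries are (hopefully) already in the
     * CPU cache, so command execution doesn't stall on main memory. */
    for (size_t i = 0; i < n; i++)
        execute(&cmds[i]);
}
```

Combined with the new table layout going from two random misses per lookup to one, this is where most of the quoted command-execution speedup comes from.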
Yep, yep, interesting.
You mentioned you sort of tested it
on some different workloads.
Do you have like, like, are those like internal AWS workloads
that you know of that are like good test candidates
or other people, you know, Alibaba and Google
and different things like that,
or are they certain customers that are a little more,
I don't know, I don't wanna say like adventurous
on some of this stuff or like how do you,
before you actually release that out,
get some of those tests, like real world tests?
Yeah, that's a good question.
So AWS has some benchmarks that we use,
and they're kind of based on data we've collected from the ElastiCache fleet.
And so we've made those available to the community as testing grounds.
We also have a release candidate process,
and we had a couple of folks actually go and run their benchmarks against our system
sort of before we did the GA with the hopes that
they would be able to tease out some of these performance
issues if they were there.
Though it's not perfect, but those are sort of the two
strategies we have as of right now.
Yeah, for sure.
Do you have a bunch of,
like, if you're making a change like this,
even theoretically sort of early on,
do you have a bunch of micro benchmarks to figure out,
hey, is this thing faster?
Or is it more like theoretical where it's like, hey,
we know we're going from two random misses to one
random miss, and we know it's going to be faster,
we're pretty sure.
And so we can go pretty far with implementation
before we can test some of that stuff.
Or how are you testing even early on these different performance ideas?
Yeah.
So ideally there's probably like a three-step process.
The first is you should profile and actually see where the CPU is spending time.
So stuff like perf can kind of tell you where in the code base you're spending time.
Right?
So perf periodically probes and unwinds the stack and like, this is where you are at this given point in time.
So if you do that thousands of times,
you know roughly where the application
is spending a lot of its time.
We also use, like,
Intel has performance counters that you can use.
You can basically say, hey, every time you hit this point,
you can run a counter.
So you can sort of build up answers at the actual instruction level. You can answer questions like,
how much time is this taking?
How many DRAM misses or L1 cache misses are you seeing?
So we try to build up this like intuition first, like, is this actually the bottleneck?
Right?
Because like, we can make something a hundred times faster, but if it's not the bottleneck,
doesn't matter.
So once we have that, then we'll sort of isolate that code,
and then we do micro benchmarks on that code, right?
Say like, hey, how do we make this faster?
But it's important to remember that just because it's faster in isolation
doesn't mean it's faster in the whole system, right?
So once we have that intuition, we still need to do the final benchmark
of everything stitched together and make sure it's actually faster.
And then we kind of can cycle back to the first part and like re-profile it to make sure
that, you know, part of the code base is consuming less time than it did before.
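For the middle step, a micro-benchmark can be as simple as timing the isolated code path in a tight loop. Here's a bare-bones sketch in C (not Valkey's actual benchmark tooling), with the caveat she gives: a win here still has to be confirmed in the whole-system benchmark and re-profiled afterwards.

```c
#include <stdio.h>
#include <time.h>

/* Stand-in for the isolated code path under test. */
static unsigned long work(unsigned long x) { return x * 2654435761ul; }

int main(void) {
    const unsigned long iters = 100000000ul;   /* 100 million iterations */
    struct timespec start, end;
    unsigned long acc = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < iters; i++)
        acc += work(i);                        /* keep a dependency so the loop isn't optimized away */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%.1f million ops/sec (checksum %lu)\n", iters / secs / 1e6, acc);
    return 0;
}
```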
Cause, like, there's this thing that's still sort of vexing me, which is, if you actually
go and check, so Valkey has this clustered mode, right?
And one of the things it needs to do in this cluster mode is make sure like,
is the request sent to me, can I serve it?
Right? Am I the right node to be able to handle this?
And it's like a pretty big chunk of time.
It's like 5% of the CPU of command processing.
And we go and profile it and we saw this function.
It's like taking a hundred percent of the time.
Like, perfect.
Let's just fix this one function.
We spent a bunch of time, we fixed it.
And we benchmarked it, like micro benchmarked it, and it was faster. But in the whole system, it was the same
performance, it didn't do anything. And just something else was taking up 5%. And we're like, what's
going on? And it's because, basically, this 5% was shielding the fact that this other function
still was going to, like, this one function was pulling a bunch of memory in from main memory. And now
it wasn't doing that. But now this other part of the codebase
still had to wait on main memory to pull this memory in. So it
actually didn't make the system faster. And we're like, well,
this other part was much harder to solve. And so we're
like, I guess we're not fixing this today. So that's where we're at.
Yeah, yeah. Like, how long does that whole process take, of like, hey, identifying this thing, micro-benchmarking
it, and then figuring out, oh, man, it didn't work?
Is that like, oh, man, we just lost two months of work?
Or is that something you can iterate through pretty quickly?
It's one of those things where if I'm not being randomized by meetings all day, you
can do a full loop in a day.
You can have an idea, test it, loop
it like in a day. But it's something that you need like a lot of focus on. So I feel
like, you know, we definitely don't lose months of effort. But it's definitely one of those
things that you can do in a day if you want. And it's kind of fun, especially
if you have a good idea, a good theory. The worst thing is, though, is like,
you know, you do it and it doesn't,
there's no performance delta. You're like dang it.
Yep. Okay. So if I look at these like 8.0 and 8.1 releases, it seems like a bunch of
again like good performance, reliability improvements in like a pretty quick amount of time, right?
You've only been around for a year. Was there a lot of low-hanging fruit there?
Was it an explicit decision to focus on,
hey, let's really nail the performance stuff that can
distinguish us in certain ways or yeah,
I guess what accounts for it?
Or is this about the pace that's happened
every year with Redis in the past?
I think we're able to innovate a little bit faster.
I think a lot of these ideas existed.
Like not a lot of what we built in Valkey was like super novel.
It's not like we're coming out of the blue and we're like,
this is a brand new idea.
We talked about a lot of these ideas with Redis.
I think the Valkey project as a whole is set up in a way
that it can move a little bit faster.
I think for better or worse, like, so when we were in Redis,
we had this open source team, this like five member team.
And like, there was a time where, you know,
we had a weekly meeting, right?
And it would help, sort of, for things that require high bandwidth, right?
When you're talking about IO threading improvements, you know,
it's good to just be in a meeting and talk it through.
And so there were times we would go, like, months without having a meeting,
just cause, like, the issue wasn't making progress.
And one of the big things I wanted to change when moving to Valkey is,
I want to make sure there were no bottlenecks, right?
Like we want to make sure everything was still making progress,
even if, like, someone was, like...
Process bottlenecks, not like bottlenecks within the process, but bottlenecks around the project? Okay.
Yeah, process bottlenecks. Like, we'd always just get hung up on stuff stalling because we weren't talking, and we had six different individuals. So stuff gets hung up a lot less, I think, just because of the way the project
sort of operates.
So I think it's just that.
I think it's just a lot more people.
There's also a lot of excitement in a new project.
A lot of people are excited to write code, make progress.
So I think it's kind of those two things.
Yeah.
Cool.
I also saw the release of the Valkey Bloom module, and there's this Valkey Modules API.
Are modules going to be a big focus going forward?
How are you balancing modules versus,
again, that core performance type stuff?
Or again, is it multi-threaded enough?
You have different people working on different things that you
can make progress on both pretty well.
Yeah. This gets into this,
what I just said about the process bottlenecks,
is we really wanted to be able to say,
I took a lot of inspiration from how Postgres operates.
PG Vector, I think is a good example.
PG Vector started off as a separate project,
and I believe the plan is to slowly move it into the core of Postgres.
I like that idea of having this very rich extension API so that someone
can go start a project with a new data type.
They can build it out.
They can write all the code.
And then it can sort of like, you know,
it becomes super well adopted.
It can then move into the main project.
So JSON was one of those things that was actually donated
by AWS.
It was something we built a while ago,
and it's nice that we built it within this module API framework
so we could just take the code and didn't require a lot of modification.
So there's three modules, which are JSON, Bloom, and vector similarity search.
So the goal is those are all available in the container.
It's called Valkey Extension; it's kind of in preview.
That's one of the things from the launch that got kind of shoved down the most,
like people were interested in VSS,
but didn't really care about how to get VSS, I guess. Um,
so like it's interesting that you've heard about bloom,
but like we haven't seen a lot of people actually using the container for it.
Like a lot of people are interested, but I guess there's not a lot of adoption yet.
Yep, interesting.
What about in terms of, so I know Redis has their own modules and those were sort of licensed
separately and I think they had a Bloom filter one, or like a probabilistic one.
I imagine there's only so many ways to write a Bloom filter.
Is that hard to, I guess, write a Bloom filter without encroaching on their license?
I don't know how licensing works in that particular area.
Well, first, yeah, yeah, like this is...
You understand the question I'm asking there?
I mean, yeah, I'll first definitely say I'm not a lawyer.
I don't.
Yeah, for sure.
But like, the way we think about it is that the API, we can kind of emulate as needed,
like decide what's useful, what's not.
And then the underlying implementation,
we just, you know, don't look at their code.
We just build our own thing.
In this case, I do know that theirs is written,
actually, I don't know this.
I think it's in C++.
It might be in C, I don't know that specific,
but ours is written in Rust.
So at least that level, we know the code looks different.
But yeah, so just like from a project hygiene perspective,
we don't look at their code.
They can, we look at, we might look at their docs,
but we won't look at their actual code that they've written.
Yeah. Yeah. Interesting.
Okay. You said that's written in Rust.
What, I guess,
what is Valkey written in?
C?
Written in C.
OK.
But then modules can be written in Rust,
or parts of Valkey can be written in Rust,
or how does that interop work?
So the way we've structured it right now
is that the core project, Valkey itself, will be in C.
And I guess technically, we allow C++ for certain types of surrounding libraries,
but the main project needs to be in C.
So one of the benefits of using modules is these modules can be built with a different compiler.
They could be in Rust, they could be in C++.
Theoretically, they could be in Go.
And, you know, if it makes more sense to write in Rust, we also have this Rust
SDK to make it very easy to write safe Rust code.
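For a sense of what a module looks like at this boundary (the shape is similar whether the body is C, C++, or Rust behind a binding), here's a minimal skeleton in C, the language the core itself uses. It assumes the ValkeyModule_* entry points mirror the long-standing RedisModule_* module API; treat the header name and exact symbols as assumptions rather than a verified listing.

```c
#include "valkeymodule.h"   /* assumed header name, mirroring redismodule.h */

/* HELLO.WORLD: a do-nothing command that just replies with a string. */
static int HelloWorld_Command(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    (void)argv; (void)argc;                       /* command takes no arguments */
    return ValkeyModule_ReplyWithSimpleString(ctx, "world");
}

/* Entry point the server calls once when the module is loaded. */
int ValkeyModule_OnLoad(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    (void)argv; (void)argc;

    if (ValkeyModule_Init(ctx, "helloworld", 1, VALKEYMODULE_APIVER_1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;

    /* Register the command; "readonly" since it touches no keys. */
    if (ValkeyModule_CreateCommand(ctx, "hello.world", HelloWorld_Command,
                                   "readonly", 0, 0, 0) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;

    return VALKEYMODULE_OK;
}
```

A Rust module built with the SDK she mentions would ultimately expose the same kind of load-time registration, just with the unsafe C boundary wrapped for you.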
I don't know how familiar you are with Rust.
I'm not.
No, I just had another person on talking about Rust.
And I was like, man, this is good stuff,
but I can't ask a question here.
So yeah, you should learn Rust.
I love Rust so much.
I should.
Okay, that's what I hear from so many people.
So, yeah, well, my slightly more nuanced take,
even though this wasn't your question: for systems level programming,
if you're working in a large team where there's varying levels of skill,
Rust is very good at forcing everyone to write high quality
code. Right? So I think a lot of organizations are like swooning over it because it's very good.
Like you can take a college grad. It's very easy to learn Rust because there's so much documentation
about it. And you don't have to worry about them, like making mistakes about like memory safety or
like leaking memory too much because it constrains big projects really well.
So that's why I think a lot of people kind of like it. It does some syntax stuff I like,
it does some stuff I don't like. And yeah, I think it's a good language.
Gotcha. Gotcha. Do you think most modules will be written in Rust going forward or
it hasn't been in the practice so far?
So right now we have three modules, right?
So we have JSON, which is in C++.
We have Bloom, which is in Rust.
I don't think it matters too much which one of those,
between those two, because to be blunt,
those two implementations are basically,
we took a third party
implementation of the data type, and then we implemented the APIs on top of it.
There's not a lot of logic there. The one that's more interesting is vector similarity search, which is like a searching engine. And a lot of that code is custom. And so that's in C++.
I would have liked to see that in Rust for a couple of reasons, especially around
the multi-threading memory safety stuff. But Google built it, they were on C++, we're not
going to say no, rewrite your code in Rust. And if you're very methodical, again, I'll
go back to if you're very methodical in writing C++ code, I don't think there's a lot of value in Rust. But from an open source project perspective,
I think it has a lot of value to write in Rust.
Yep. Yep. How close are you to that, the vector search stuff? Like, have you stayed close to it?
And I guess, like, how different are the requirements in, you know, something like
Valkey, a memory-based system, as compared to pgvector in Postgres or Mongo's implementation and things like that.
How different are those?
I will talk a little bit out of my depth,
but I have mostly ignored it until recently.
My understanding is in-memory vector similarity search is good for very high recall.
So when you're basically descending
a lot of the tree to look for the best possible matches,
because then you're doing a lot of random memory lookups.
So you want to avoid as many disk-based lookups
as possible.
And so that's where Valkey can do better throughput than
something like OpenSearch or pgvector. It sort of remains to be seen which use
cases that's most useful for. Historically, Valkey is mostly used for very online applications,
very real time applications, right? You're serving, caching traffic, you know,
someone's like on a website,
someone's like looking for real-time, you know,
ticker values from the stock market.
So when you're doing batch operations,
it's a little bit less important
to have this real-time stuff.
So I think there's an open question
of like where this makes sense.
We've also seen an increasing push away from,
so recall's only relevant if you're trying to do
approximate nearest neighbor.
If you're just doing, you want the exact closest object,
you could just flat search everything in the index.
And so we're seeing a lot of people just stick indexes
in Valkey as a key value string.
They extract the string and then they do a flat search on it.
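As a rough sketch of that flat-search pattern (not Valkey's VSS module; in practice the vectors would be the bytes of a string value fetched from the cache, and the dimensions and data here are made up):

```c
#include <float.h>
#include <stddef.h>
#include <stdio.h>

#define DIM 4  /* toy dimensionality; real embeddings are much wider */

/* Squared Euclidean distance between two DIM-dimensional vectors. */
static float dist2(const float *a, const float *b) {
    float d = 0.0f;
    for (size_t i = 0; i < DIM; i++) { float t = a[i] - b[i]; d += t * t; }
    return d;
}

/* Exact nearest neighbor: scan every vector, no index, no approximation. */
static size_t flat_search(const float *vecs, size_t n, const float *query) {
    size_t best = 0;
    float bestd = FLT_MAX;
    for (size_t i = 0; i < n; i++) {
        float d = dist2(&vecs[i * DIM], query);
        if (d < bestd) { bestd = d; best = i; }
    }
    return best;
}

int main(void) {
    /* In the pattern described, these floats come from a plain GET of one key. */
    float vecs[] = { 0, 0, 0, 0,   1, 1, 1, 1,   0.9f, 1, 1, 1 };
    float query[DIM] = { 1, 1, 1, 1 };
    printf("closest vector index: %zu\n", flat_search(vecs, 3, query));
    return 0;
}
```

Exact search like this gives perfect recall by definition; the trade-off is that the cost grows linearly with the number of vectors, which is why it only makes sense while the index stays small.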
So I think the space is rapidly evolving. And so it's very interesting to watch it evolve.
And so I think what we're really just focusing on is providing good fundamentals, which is why, you know, performance, efficiency, reliability, all that good stuff.
And then, you know, this is where it's nice to also work there.
Like I said, I work primarily on open source Valkey,
but it's really nice to also pay attention to what's going on
in AWS because I can at least be a little bit more tapped in
to like see what people are talking about,
what use cases they're missing,
where do they need high performance stuff.
And then we can try to use that to help inform
where Valkey should be going. Yeah, yeah, cool. Do you still, like, talk... do you stay pretty
close to customers? I imagine just like talking to customers all the time, like sort of standard
AWS? Yeah, literally right before this call, I was talking with the customer.
I love talking with customers. It's my favorite thing. Yeah. Oh, that's cool.
Seriously, the best AWS engineers I know are just like that. They're just like learning info machines,
just want to suck it in all the time
and figure out what problems people are having.
For some of these performance improvements,
are there trade-offs or does it ever happen
where it's like, hey, this is better for 90% of customers,
but this workload is slower?
Does that ever happen,
or is it mostly like,
no, this is usually pretty good across the board?
It's easy to build the stuff
that's good across the board for sure.
It's, I think it's a really good question.
Cause you know, I have this conversation
with our product managers where, like,
someone was like, oh, we saw a performance degradation. And we're like, yeah, that's sort of expected. And they're like, what do you mean? You told
us it was faster. And I'm like, well, it's complicated, right? Not for every single person. Yeah,
not for every single person. Like the most common thing. So the way our IO threading works,
if you are not throughput bound, you are not going to see better performance.
And under certain circumstances, if you're doing these operations we talked about earlier,
like the SUNIONSTORE, you can actually see a little bit of degradation depending on your
workload.
So, those do exist.
And we try hard to fix them and mitigate them.
Our goal is to make, if we can make everyone better by 10%, that's better than making 1%
of users better by 100% and everyone else worse by 1%.
So there's a lot of trade-offs there.
Yeah.
Yeah, like we had a-
Would you say, oh, go ahead.
Yeah, I was thinking, there's another example.
There's this data structure in Valkey called HyperLogLog, which does approximate size
of sets and
Someone was like, I want to make this change, it will make it,
you know, I think something like 14% faster, but it'll use about 2% more memory.
And we're like, I
don't know how to quantify that distinction,
and so we ended up not accepting it, because we're
like, we don't know which of these trade-offs
people prefer. And it's better to sort of keep it the same than to try to change it.
Yeah, yeah, very interesting. Would you say the Valkey project, I guess, like has a North Star,
like performance is like sort of the North Star, or is that balanced across adding more features,
adding ease of use or different things?
It is more of a balance of a few different things,
or is performance, hey, that's the thing we care about most?
So I don't think we have one thing we care about the most.
There's been five pillars we've talked about,
which is performance, memory efficiency, reliability, observability, and ease of use, basically. We want to build features that make
it easy to build applications on top of. And that's sort of what JSON is for, right? And the
vector similarity search stuff. So I wouldn't say any of those is more important than the other ones. We've sort of tried to not regress on one
to build a feature in another,
unless it's like, observability is the big one.
Observability is one of those things that like
almost by definition has to impact performance
because you're doing, you're introspecting in the commands.
So we've made all the observability stuff all opt-in,
for the most part, because we don't want to regress on performance to give you observability,
but people still also want observability. When your cache is having issues,
you really want to know why it's having issues.
Yep, yep, sure. Okay, I want to talk a little bit about these sort of other caching
options we talked about. We talked about Dragonfly a little bit, multi-threading. I think
you mentioned another multi-threading one.
Garnet.
Garnet, okay. Is that like a pretty similar approach there around?
No, Garnet's doing something completely different. I actually think Garnet's pretty cool. Garnet
is a... I'm sometimes a little mean, I just say, you know, they
think the problem with Redis is it wasn't written in C#. So it's a C#-based
implementation. No, it's actually much smarter than that; I'm being a little tongue in cheek. So
it's, some people also say this is reductive, but it's like a concurrent hash map.
So every operation is lockless and concurrent, right?
So it's not taking, like so Dragonfly takes locks
on like pieces of the data set.
What Garnet does is every single,
it's able to like update stuff atomically
using lockless algorithms.
It's built on some previous research done by Microsoft,
and it talks over the RESP protocol.
It's really pretty cool.
They have some really impressive high performance numbers
because they're not doing basically any locking,
which can take a lot of time.
If you have very well-formed or uniform access
patterns with a lot of keys,
you could do a lot of requests per second on Garnet.
Interesting, okay.
And so that one is not like Redis compatible,
but sort of serves similar workloads as Redis,
or Valkey, is that what you're saying?
No, it actually speaks the same RESP protocol.
It is, okay.
It doesn't, it has some weird performance under edge cases,
but if you're doing simple gets and sets, it's another,
it's also fully open source, right?
So I guess it's permissively open source.
I think it's MIT.
Okay, okay.
What about KeyDB, which you mentioned,
you mentioned Snap earlier,
Snap picked up KeyDB.
What was their sort of big insight there?
Yeah, so KeyDB was originally created
as a multi-threaded fork of Redis.
So as I said, so, actually, I don't think I talked about
our IO threading architecture.
So our IO threading architecture is we have a single
command execution thread that delegates work
to other threads.
So the main thread will say like, hey, I need to go read
from these 12 clients, you IO thread,
go read from these 12 clients.
So there's still a main thread doing almost all
of the work and coordination.
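A highly simplified sketch of that delegation pattern, just to make it concrete; this is not Valkey's actual implementation, a pipe stands in for a client socket and there is only one I/O thread:

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* One "job": an fd the main thread wants read, and the bytes that came back. */
typedef struct { int fd; char buf[128]; ssize_t n; int done; } io_job;

static io_job job;
static int have_job = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* I/O thread: only performs the read() it was handed, never runs commands. */
static void *io_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!have_job) pthread_cond_wait(&cond, &lock);
        job.n = read(job.fd, job.buf, sizeof(job.buf) - 1);
        job.done = 1;
        have_job = 0;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    int p[2];
    if (pipe(p) != 0) return 1;      /* the pipe stands in for a client socket */
    (void)write(p[1], "PING\r\n", 6); /* the "client" sends a command */

    pthread_t tid;
    pthread_create(&tid, NULL, io_thread, NULL);

    /* Main thread: delegate the read instead of doing it itself. */
    pthread_mutex_lock(&lock);
    job.fd = p[0]; job.done = 0; have_job = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    /* Main thread could coordinate other clients here, then it picks up the
     * completed read and does the command execution itself. */
    for (;;) {
        pthread_mutex_lock(&lock);
        int done = job.done;
        pthread_mutex_unlock(&lock);
        if (done) break;
        usleep(1000);
    }
    if (job.n > 0) {
        job.buf[job.n] = '\0';
        printf("main thread executes: %s", job.buf);
    }
    return 0;
}
```

The command itself is still parsed and executed on the main thread, which is why, as she notes, workloads that aren't throughput bound don't get faster from the extra threads.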
So KeyDB did something slightly different.
They had like a single global level lock.
So they had threads that were context switching,
some of them were doing IO,
and then they'd wait for the global lock
to do command processing,
and then they'd go back to doing IO.
And so they have a slightly different
multi-threading architecture,
and ours is a little bit better, a little higher performance.
But again, KeyDB built this implementation a long time ago,
like three or four years ago.
And so they iterated a bunch.
So KeyDB was originally started as a separate project.
It built a bunch of functionality, like active-active replication,
spilling data to disk.
It was then later acquired by Snap.
And Snap was sort of looking to,
my understanding is Snap didn't really want to be
in the business of maintaining an open source cache.
And so they were like,
how do we merge this functionality together?
So we've been working close with KeyDB engineers
to get the functionality they want merged into Valkey.
Gotcha.
And did you say Snap is, like, one of the sort of
companies that's part of Valkey?
Yeah.
Yeah, it was one of the original companies
that sort of backed Valkey.
OK.
OK, nice.
Tell me also about MemoryDB, which I think you said you worked on as well.
Yeah.
So MemoryDB is, so it's basically ElastiCache.
So ElastiCache is the managed Valkey
and Redis and Memcached service,
but MemoryDB has a slightly different replication system
that makes all the data durably logged.
So ElastiCache just replicates to other replicas,
keeps everything in memory.
MemoryDB first pushes the data to a durable log.
So when a MemoryDB node fails, you don't lose data.
It just comes back up, has all the data.
So it has nice properties, like it's durable,
it's consistent.
You can really use it for primary database workloads.
So that's sort of the big thing about MemoryDB.
And so that's like, hey, I love the Redis API
and I'm willing to trade off a little bit of write latency,
and now I can use it as my primary database,
is the thought there.
Yeah.
That's the pitch.
Nice.
OK.
Yep, very cool.
Was that a fun one to work on?
I'll say yes.
Yes.
Nice, nice.
OK.
One thing I meant to ask during the performance work section is Glide.
So Glide is like this... Yeah. Okay. Is Glide within Valkey or is that an AWS project?
I can't remember. So it was started as an AWS project. So the key insight for Glide was that
within the Redis ecosystem, a bunch of clients got built up incrementally over time. Right?
So there was like the Ruby space, they built a Ruby client, Python space, built a client.
And they all sort of had different standards, different ways of doing like cluster topology
discovery and, you know, TLS and all this stuff. And so, you know, we
see a lot of issues at scale within ElastiCache. So we had customers who were like, hey, I have a 400-node cluster and every time it
does failovers, we see, like, it takes a really long time for our client to
rebuild the topology.
And so, you know, we looked into all these clients and like, well, we could fix them
all individually, but instead, why don't we try to go build a single client that sort of has a Rust core.
So we have the core implementation in Rust
and it's what talks to the servers.
And so we can solve this logic once here.
So that's what we call the Glide core.
And then we build high level bindings that talk to this core.
So instead of building all the complex logic in Python,
we can build it just in Rust. And this is actually great for certain types of interpreted languages, which are pretty slow
inherently. So all of the expensive, you know, I-O work is done in Rust, as opposed to in a higher
level language. So we built that originally for Redis; it was, you know, an AWS-native
project. And then when the licensing changed, we were like,
oh, we can just donate this and make it a Valkey project.
So now it's officially a Valkey driver.
It still works with Redis because it's, again,
we're compatible with Redis open source 7.2.
Yep, nice.
What, like how many language bindings
do you have for Glide?
So we have four which are practically done.
I think they're not quite GA, but it's close.
So we have Go, which is almost done.
Java, which is done.
Python, which is done.
And Node. Java, Python...
I don't know which one I missed.
So those are done. We also have someone working
on a C++ binding. We have someone working on a C sharp binding and we have someone working on a
Ruby binding. So a lot of those are kind of in progress.
Yeah, gotcha. When you talk to customers and they're having performance issues, are there
a lot of things they can do? Like, is it like, oh,
man, your, your client is misbehaving in certain ways? Is it like, oh, you need to tune a few
parameters on your server and that will work? Is it data modeling? Is it all of the above?
Like how do you like, what do you see out there a lot?
I mean, it's definitely all of the above. We've had issues in clients, especially.
You know, there's a lot of very simple tuning you can do on a client to make it much more performant.
There's things like, you know,
connection counts, pipelining, all that kind of stuff,
which we sometimes see.
We generally see that the performance problem
is not on the server side, though.
And so sometimes it's there, you know,
using like a bad cluster sizing
or topology. Like, if they could change it to scale in and scale
out, or sorry, scale down, like change the instance type,
and scale out, that can solve their problems.
Oftentimes, it's data modeling issues, especially when you're
doing cross-key operations, you're
using very inefficient data types.
Like a lot of the operations in Valkey are built in such a way that they should scale well,
but there are some operations which just don't scale at all, right?
So we see a lot of people, and I'm going to keep going back to this SUNION stuff
because I talked about it. There was a customer that was doing this SUNION,
or I guess they're doing SDIFF. They're basically saying, okay, this is the set of
objects a user had at a point in time, t1. And then this is the one they had at time t2.
Like, I want you to do a difference and tell me what the difference is.
And that works unless the sets get very large, because it actually scales,
like, n log n in the time it takes to do these differences.
So that was one of the things that was like, well, you know.
And what's very large in this case? Is that like a thousand? Is that like?
Usually on the order of millions, right? Because all this is done in memory. It's
going to be fast.
Yeah, true.
And so we just sort of worked with them to be like, you know, instead of having one, have
like 25 of these individual sets.
Because, like, if you do a simple hashing,
you can split the objects up, right?
And that also basically brings down the n log n part
as well.
So they're actually just overall,
like overall reducing a lot of CPU usage.
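A hypothetical sketch of that fix with the hiredis C client; the key names, the shard count of 25, and the toy hash are all invented for illustration, and the {braces} are the usual hash-tag trick so each pair of per-shard sets lands on the same cluster slot:

```c
#include <hiredis/hiredis.h>
#include <stdio.h>

#define NSHARDS 25

/* Toy string hash used to pick which sub-set a member belongs to. */
static unsigned shard_for(const char *member) {
    unsigned h = 5381;
    for (; *member; member++) h = h * 33 + (unsigned char)*member;
    return h % NSHARDS;
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Instead of one giant set per snapshot, add each member to one of 25
     * smaller sets chosen by hashing the member. */
    const char *member = "object-42";
    char key[64];
    snprintf(key, sizeof(key), "items:t1:{%u}", shard_for(member));
    redisReply *r = redisCommand(c, "SADD %s %s", key, member);
    if (r) freeReplyObject(r);

    /* The snapshot diff is then done shard by shard, so each SDIFF only
     * touches a fraction of the members. */
    for (unsigned i = 0; i < NSHARDS; i++) {
        char k1[64], k2[64];
        snprintf(k1, sizeof(k1), "items:t1:{%u}", i);
        snprintf(k2, sizeof(k2), "items:t2:{%u}", i);
        r = redisCommand(c, "SDIFF %s %s", k1, k2);
        if (r) {
            printf("shard %u: %zu changed members\n", i, r->elements);
            freeReplyObject(r);
        }
    }
    redisFree(c);
    return 0;
}
```

Each SDIFF now only touches roughly a twenty-fifth of the members, which is where the CPU savings she describes come from.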
So yeah, sort of across the board.
It's very rarely tuning parameters.
There's not a lot of tuning parameters in Valkey today.
There are some, like we have different representations
of data types.
So the set I kind of have been talking a bunch about,
you can either have just basically an array of objects.
Because at some point, for checking whether an object is in a set,
it's actually faster to just have it all in a block of memory
than it is to actually have efficient data structures.
And it's way more memory efficient.
So when we transition from these dense encodings
to these sparse encodings, as we typically call them,
we have hard-coded values.
So sometimes people tune those, but generally the defaults work.
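You can watch that transition from the outside with OBJECT ENCODING; a small hiredis sketch, assuming the Redis 7.2-era config name set-max-listpack-entries (default 128) is still what governs the threshold in Valkey:

```c
#include <hiredis/hiredis.h>
#include <stdio.h>

/* Print which encoding the server is using for a key right now. */
static void show_encoding(redisContext *c, const char *key) {
    redisReply *r = redisCommand(c, "OBJECT ENCODING %s", key);
    if (r && r->str) printf("%s -> %s\n", key, r->str);
    if (r) freeReplyObject(r);
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    redisReply *r = redisCommand(c, "DEL demo:set");
    if (r) freeReplyObject(r);

    /* A handful of short string members stays in the dense, flat encoding. */
    for (int i = 0; i < 10; i++) {
        r = redisCommand(c, "SADD demo:set member-%d", i);
        if (r) freeReplyObject(r);
    }
    show_encoding(c, "demo:set");   /* expected: a compact encoding (listpack) */

    /* Push it past the hard-coded threshold (set-max-listpack-entries,
     * typically 128 by default) and it converts to a real hash table. */
    for (int i = 10; i < 200; i++) {
        r = redisCommand(c, "SADD demo:set member-%d", i);
        if (r) freeReplyObject(r);
    }
    show_encoding(c, "demo:set");   /* expected: hashtable */

    redisFree(c);
    return 0;
}
```

Below the threshold the members sit in one compact block of memory; above it the server converts the key to a real hash table.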
It's kind of unique. It's not a big thing. Yeah.
Yeah.
Okay. Yep. Yep.
Sort of going forward on Valkey, if I'm interested in Valkey and want to find out,
hey, what's getting focused on in 8.2 and sort of those things, is that pretty easy for me to just
see or is it like, hey, wait for the release candidate and you'll see what you get or like, what's that look like?
It's funny, because like a lot of people come from the world,
like they want a roadmap.
And so a lot of single vendor open source projects
can have a roadmap because there's a product manager
who's like, here's the roadmap.
The problem with like vendor neutral open source projects
is it's like, here's the things we'd like to build.
Here are the things we think someone's going to build,
but also someone could show up tomorrow
and be like, I want to build this.
And we're like, wow, that's so much better to build
than what we had in our roadmap.
Let's go work on that instead.
So there's some stuff we know is going to come in Valkey 9.
There's stuff like hash field expiration,
being able to set a TTL and like a specific field
within a hash.
We're trying to fix, this might be too detailed,
but like, so when we're doing like resharding of data,
so in cluster, if you have multiple shards
and you want to move data between them,
right now it has to be driven entirely
by a third party observer.
Like someone has to move all the data themselves
and we're trying to make it built in.
So it like moves it between nodes directly.
So it's much faster.
So that will probably be there.
We call that
atomic slot migration. Because when there's a third actor, that third actor can die, and then the server
is just like, what's going on? Whereas in this new world, if the node dies, it just reverts and
doesn't leave it in a broken state. So there's stuff like that. Full text search is also something
we're building. Would that be a module or would that be built in? That's built as a module, yeah.
Okay.
Our goal is like very experimental features,
module first, and then if it becomes super standard,
we'll pull them in core.
That's the thinking.
Yep, you mentioned Valkey 9.
When do you, I guess like when do you do
major bumps versus minor bumps?
Is that like time-based or is that?
Right now we've just been releasing every six months.
The thinking is we'll do alternating major, minor, major, minor.
We're in the awkward position.
Like we don't really do breaking changes.
So we don't really need to do major bumps.
But one of the things we learned about 8.1 is people were way more excited about
8 because it was a major version.
Yeah, it's true.
No one's excited about, like, yeah,
7.14 and things like that. Yeah. So yeah. So, like, maybe we should just do major versions.
But I don't know. Yeah. Yeah, interesting. I know it's hard. It's like,
is it a marketing thing? Is it, you know, encoding some information about breaking changes,
like you're saying? Probably not, in that sense. Yeah.
You know, you keep saying marketing, but like the way I feel is, you want to have the most impact, we want people
to be using it, and if that involves getting people to do marketing, then fine.
I don't mean it in like a
derisive way, right? I'm just saying it's, yeah, getting that information
out in the way that people consume it. Yeah, yeah, for sure.
I think especially in the engineering world people are derisive about marketing, but that's true. Yeah.
You've got to realize it's like part of the thing, right? Like, you know, users of this have limited time and are sort
of paying attention to all these different projects. So it's like, how can you most efficiently
just communicate to them what's new, this is worth upgrading
for and those sorts of things like that.
Yeah.
So is it hard to find good people to work on Valkey, like, it being so sort of low level
and things like that?
Or do you feel like, hey, we have like a really good team, we have enough people going, like,
what's that look like?
You know, we definitely have a lot of good people, but we definitely need more people.
There's a split between like,
the big thing that we're missing is like,
maintainers on Valkey.
We need more people like helping review this deep code,
which is hard.
Like, I think that's, you know,
a long-term problem with open source projects
that are very, they require a lot of technical depth
to understand is like people come in, they're like, I have my one feature, I would like
to build my one feature. And then they got their feature merged. And they're like, I'm
good. And they leave. Which is fine. It's great. Like, that's part of the process. But
getting people to stick around for the long term, especially with something that's so
it's not super sexy. It's not, you know, like, it's so much easier to get maintainers on
like, you know, projects that are rapidly evolving and rapidly
changing.
This is sort of like a lot's going on in the project, but it's all very incremental.
So it doesn't feel as grand.
We're not tripling the performance.
10% a year, 5% there.
Things like that, for sure.
10% a year.
Well, that's funny.
I was like, that's super exciting,
it's like this big number, but like,
yeah, it's hard to have these big 3x improvements year over year.
Yeah, yeah, for sure.
Yep. Usually I ask people like how they're using AI in their day to day work.
Like, are you able to use like any of the AI stuff for your day to day work?
Or is it so low level and specialized that it's hard to pop it in there and have it do stuff for you?
I'll give a concrete example. Yesterday, I was debugging an issue in Valkey. There's a data
type called geospatial indexes. We had a customer who basically reported the fact that if I have a point and
I want to find all of the points that are within a zero meter radius of that point.
And they had a valid use case. I was like, this is weird, but whatever. The thing is,
it should return that point.
What do they mean, a zero meter radius? Do they mean, like, less than one? Or what?
No, like literally zero, right? So technically, by the
math, it's like the point is the point. Okay. Right. So there's,
there's a reason they're doing this. And so it should
return itself, right? Because it's technically within the
bounding area. On Intel, it was returning that point; on ARM, it
was not returning that point.
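For concreteness, the shape of the reported query looks roughly like this with the standard geo commands; the key, member, and coordinates are made up, and whether the zero-meter search returns the member itself is exactly the edge case being described:

```c
#include <hiredis/hiredis.h>
#include <stdio.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Store one point, then ask for everything within a 0 meter radius of
     * that same member. Mathematically the point is inside its own radius,
     * so it should come back in the reply. */
    redisReply *r = redisCommand(c, "GEOADD demo:geo 13.361389 38.115556 pt1");
    if (r) freeReplyObject(r);

    r = redisCommand(c, "GEOSEARCH demo:geo FROMMEMBER pt1 BYRADIUS 0 m ASC");
    if (r) {
        printf("matches: %zu\n", r->elements);  /* expected: 1 (pt1 itself) */
        freeReplyObject(r);
    }
    redisFree(c);
    return 0;
}
```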
And so I was just like, well, this makes no sense.
And so I went to my normal debugging, like using GDB, stepping through this, like the
C code, and ARM was just behaving weird.
I was like, what's going on?
And so I had to go and actually look at the assembly instructions.
So as I said at the beginning, I know x86, so it's very easy for me to read x86. But ARM, I'm like, oh, these instructions. But it's very nice,
because I would just give it to ChatGPT, and ChatGPT would just, like, translate everything for
me in real time, to be like, hey, this is what all these instructions do. And it made it
very fast for me to read and figure out what's going on. And I mean, the reason is not that
interesting. Like, there was an instruction on ARM that just was having a rounding error,
which it's allowed to do. It's not violating the spec.
It's like Intel has this super fancy instruction that does like two multiplications and then like a subtract,
like it takes two numbers, multiplies them by something and then subtracts them,
which is exactly the operation that we were doing in the code,
but ARM doesn't have that.
So it was doing it in two steps,
and those two steps were introducing
a little bit of rounding error.
And because of that, it was putting it outside the zone
and so it was not returning it.
So.
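The class of difference she describes is easy to reproduce in isolation: a fused multiply-add rounds once, while a separate multiply then subtract rounds in between, so the two forms can land on different sides of zero. This is a generic illustration, not the actual Valkey geo code; compile with floating-point contraction disabled (e.g. -ffp-contract=off) so the compiler doesn't fuse the second form anyway.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Values chosen so the exact product a*b is not representable. */
    double a = 1.0 + DBL_EPSILON;        /* 1 plus one ulp */
    double b = 1.0 - DBL_EPSILON / 2.0;  /* 1 minus half an ulp */
    double c = 1.0;

    /* One rounding: the multiply and subtract happen as a single fused op. */
    double fused = fma(a, b, -c);

    /* Two roundings: a*b is rounded first, then the subtraction happens.
     * The intermediate rounding can swallow the tiny difference entirely. */
    double split = a * b - c;

    printf("fused: %.17g\n", fused);  /* a small nonzero value */
    printf("split: %.17g\n", split);  /* typically exactly 0 */
    return 0;
}
```

If one platform's code path ends up using the fused form and the other doesn't, a distance that should be exactly zero comes out as a tiny nonzero value on one of them, which is enough to flip a boundary check like the radius test here.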
Oh my goodness.
Interesting.
So when you change that, do you have to,
like, how do you even fix something like that?
Do you have it run differently on ARM versus Intel
or do you just like abstract it a little higher level
to where it works for both of them?
It's not doing that.
In this case, I was actually very worried,
while debugging this, that there would be, like, something
where there was nothing we could do.
But in reality, you know,
a best practice when dealing with floating points
is you should never compare something to exactly zero.
You should compare it to some extremely small value,
and we weren't doing that. So in this case, we just changed it to compare against some very small value,
and that fixed it.
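The fix she describes amounts to something like the following; the epsilon value and the names are illustrative, not Valkey's actual constants:

```c
#include <stdbool.h>
#include <stdio.h>

#define RADIUS_EPSILON 1e-9   /* illustrative tolerance, not Valkey's constant */

/* Instead of asking "is distance <= radius" with an exact floating-point
 * comparison, allow a tiny tolerance, so a distance that should be zero but
 * comes out slightly nonzero due to rounding still counts as inside. */
static bool within_radius(double distance, double radius) {
    return distance <= radius + RADIUS_EPSILON;
}

int main(void) {
    double distance = 1.1102230246251565e-16;  /* "zero" plus rounding error */
    printf("strict:  %d\n", distance <= 0.0);              /* 0: excluded */
    printf("epsilon: %d\n", within_radius(distance, 0.0)); /* 1: included */
    return 0;
}
```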
Yep. Yep. Okay. When you were having chat GPT translate for you, did you have it translate
to English or to x86?
To English in this case, I basically said, can you rewrite this to explain in English
what each of these instructions are doing?
Yeah, okay.
That is definitely the most fun use of AI that I've heard.
So well done.
Yeah, I like it for that.
I'm not a big fan.
I've been trying to get into like agentic stuff.
Like I've been using the built in stuff.
For some reason, I get free GitHub Copilot.
Apparently, it's because I work on an open source project,
but I'm the only Valkey maintainer that gets it.
So I don't know how I'm in this secret group.
So I do try to use it.
But the agent stuff, I tried to, I asked it to build reverse search into the Valkey CLI.
So the Valkey CLI is all in C.
It just kept writing the wrong thing.
It kept just assuming...
It seems to work well very tactically.
It's like, hey, I want this very tactical function
to do actions.
It's pretty good at that.
But when you're like, hey, taking everything together,
build this, it struggles a lot.
Yep.
Yeah, for sure.
Yep.
That's interesting.
It'll get there, I think.
Yeah.
Yeah, we'll see.
Well, Madelyn, thanks for coming on.
This is super interesting.
I learned a ton.
And it was great to talk to you.
Will you be at re:Invent this year?
Hopefully not, but I probably will be.
Hopefully not.
Yeah, nice.
Yeah, I hope you're there.
And yeah, it's been fun to see Valkey's progress and all that stuff.
So best of luck going forward.
If people want to find out more about you, about Valky, where should we send them?
Yeah, so you can go to valkey.io. It's the best place. We have lots of blog posts on there.
You can also, I'm pretty active on LinkedIn and Blue Sky these days.
So presumably that handle will be posted somewhere.
Yep. Cool. We can do that. Awesome. Well, thank you for coming on, Madelyn.
All right. Sounds good. Yeah. Thanks a lot for having me. It was a great chat.