Software Huddle - Valkey After the Fork: A Conversation with Madelyn Olson
Episode Date: July 16, 2025
Today, we're talking Valkey, Redis, and all things caching. Our guest is Madelyn Olson, who is a principal engineer at AWS working on ElastiCache and is one of the most well-known people in the caching community. She was a core maintainer of Redis prior to the fork and was one of the creators of Valkey, an open-source fork of Redis. In this episode, we talk about Madelyn's road to becoming a Redis maintainer and how she found out about the March 2024 license change. Then, Madelyn shares the story of Valkey being created, philosophical differences between the projects, and her reaction to the relicensing of Redis in May 2025. Next, we dive into the performance improvements of recent Valkey releases, including the I/O threads improvements and the new hash table layout. Along the way, Madelyn dispels the notion that the single-threaded nature of Redis / Valkey is that big of a hindrance for most workloads. Finally, she compares some of the Valkey improvements to some of the other recent cache competitors in the space.
Transcript
March 20th, 2024, Redis publishes this blog post announcing the shift from BSD to SSPL.
Did you know about this before I did or was the blog post the first thing you had heard
about it or how did that come down for you?
They had told us about 24 hours in advance, just as a heads up, by the way.
But no, we were basically learning about it at the same time that everyone else
learned about it. Okay, so then how quickly did the fork come together and you decided, hey, we're
going to do some sort of fork? Yeah, what was that timeline like? So I think
I have these dates right. So they happened on March 20th. We had basically decided we were going to fork on March 22nd.
When was the first Valkey release?
So April 16th was when we launched the fork.
And now, of course, we have some new developments from just last week, exactly a week ago.
Redis announces they're sort of back to an open source license, AGPL.
I guess, like, does this change anything for you and Valkey
or how does that, like how do you think about that?
What's up everybody, this is Alex
and I'm super excited to have Madelyn Olson
on the show today.
She is a principal engineer at AWS
and she's like one of the big people
in the caching community.
She was core maintainer for Redis for a long time.
I think she's the person a lot of people thought of for that.
She also was one of the co-founders of Valkey
after the Redis license change.
So we talked about her background.
We talked about the Redis license change
and sort of how she found out, what happened,
and how that led to Valkey.
And then we spend a lot of time on just, like,
performance improvement stuff in Valkey
and what it's like working on that.
And it's just really interesting, I think,
to go deep with someone that technical,
able to explain like these low level things
that I have no experience with.
So it was really cool there.
She also had a really good story about using AI at the end
that was interesting,
and I hadn't heard that one before.
So check that out.
If you have any questions,
if you have any people you want on the show
or comments, anything like that,
feel free to reach out to me or Sean.
And with that, let's get to the show.
Madelyn, welcome to the show.
Thanks for having me.
Yeah, I am very excited to have you on.
You are a principal engineer at AWS.
And I think more famously,
you are like a former core maintainer of Redis.
You're one of the creators, founders,
I don't know how we say that, of Valkey.
And just like, as I've gotten more into the caching world the last couple of years, your
name is the one that always comes up.
I think you've been mentioned on this show by Roman from Dragonfly and by Khawaja at Momento.
And just like, yeah, a lot of people speak really highly of you.
So I'm excited to have you on.
I guess for people that don't know as much about you, can you give us a little bit of
your background?
Yeah, so yeah, my name is Madelyn.
I am a principal engineer at Amazon within AWS.
I work primarily on the Amazon ElastiCache
and Amazon MemoryDB teams.
I've actually only worked there.
I'm one of the few Amazonians who joined a team
and have worked there for over a decade now.
I can actually say that.
I've worked on the same team for 10 years.
So early on, I was always a big open source proponent.
So within a couple of years, I started contributing a lot to Redis open source.
So we were the managed in-memory database team.
So we managed Memcache and Redis for a long time.
So I advocated a lot to, you know, take all those things
that we were sort of putting into our internal forks of these projects and pushing them upstream.
The thing I was most well known for was pushing a lot for TLS within Redis.
Redis was very resistant to accepting it for a long time, but after a bunch of iterations, we figured out how to make that happen.
And so for that work and the other work, when Redis moved to an open governance model in 2020,
I got asked to
become one of the maintainers for the project.
So that was sort of my big involvement.
So I was an open source Redis maintainer for four years.
Redis very notably moved away
from an open source license in 2024.
And so that led me and some of my other contributors
from the open source Redis project to make Valkey.
And that's what I've been working on for the last year or so.
So yes, co-creator, co-founder, co-founder sounds a little pretentious.
Co-creator sounds better.
Yeah.
How do you talk about something that's like an open source project that's within a foundation?
I don't know how to talk about that, but whatever.
You are one of the impetus, impetai behind the project.
Yeah, I'm very friendly. So I tend to be the face of a lot of things, but like,
I always feel bad because I don't write the most code. Like there's other really smart people in the project and I want to highlight them more, but they're engineers. But also that, you know,
explanation piece and just, like, sharing and marketing type pieces are super important,
not marketing,
but just making other people aware of it.
I remember watching your re:Invent talk,
I think it was 2020, is that when you did
the ElastiCache performance one maybe?
Yeah, that was the first time I heard it.
I was like, wow, this is a great talk.
And then I just see your name popping up
all over the place since then.
I think that's just a really important role
to make people aware of all these things.
So, okay, you mentioned you've been at ElastiCache for 10 years. Was that like
your first job out of like, did you have other jobs too? Or did you like, was that your first one?
No, that was my first job out of college. I was actually an electrical engineer.
I had started getting involved in databases. There's some like research I was doing in undergrad.
And I got very kind of involved in
hacking with Postgres because it wasn't meeting my analytic needs at the time. We were trying to do
analysis of, we were trying to build, like, passive radar systems, and we had this huge
amount of data. We were trying to hack things like transforms and stuff
into Postgres. And then basically, I got an offer from Amazon to work at DynamoDB at the time.
There you go.
And then I joined and it's like, oh no,
actually you're working on this small other NoSQL service called ElastiCache.
I was like, oh, okay.
And so yeah, I've just been doing that for 10 years.
It was kind of not what I was expecting.
Yeah. Yeah. And like, were you like a low level performance expert? Like, were you like interested
in that sort of stuff? I mean, like, I guess how did that come about?
Sort of? Yeah. My actual pathway, so again, I was an electrical engineer. So all my coding
came from those, like, programming competitions, like the GPC one, the Google one. And then, like,
there's this thing called Project Euler, which is more of, like, a math heavy
programming competition thing. And so that was my entrance.
So like I was actually very good at low level optimizations.
Like I had written assembly code quite often to optimize stuff.
I actually was really good at x86,
like actually knowing the instruction set when I joined Amazon.
And I was basically asked to write a bunch of Java code
at the time, which was fine.
It was okay.
But no, yeah, it's kind of like one of those things
where I think the traditional CS
pathway doesn't do a lot of that very low level
software engineering stuff.
There's also computer architecture classes,
which, like, most CS students take an introduction to computer architecture, but I had taken
a higher level, like a 500-level, master's level computer engineering
class by the time I joined Amazon. So that also helped a lot with this type of
programming specifically, because like if you're running mostly in JavaScript, you really
just care about algorithms, you really don't care about like, you know, the memory layout
of the data formats. That's totally true. Like I think this caching
stuff is so interesting. I try to read the Valkey release notes or some of the papers,
and it's way harder for me even than database type stuff,
because it's dealing directly with the memory, it's so close to the metal, where I'm
like, man, I just have no context for understanding some of this stuff, which makes it super interesting, but also hard to understand
how it works.
Okay.
So you mentioned one of your early things was convincing them to add TLS to Redis Core.
I guess why did they resist it for a while?
What's the story there?
Yeah. So Amazon ElastiCache wanted to build TLS because a lot of our customers wanted it for compliance reasons. It's big for FedRAMP and HIPAA compliance. But if you think about what
Redis is, it's a hash map attached to a TCP server. So adding TLS is actually extremely heavy overhead, right? In the early days,
before the Graviton instances had kind of done a good job of optimizing that a bit, and modern Intel hardware is pretty good at optimizing the crypto involved, it was somewhere around a
30 to 40% CPU hit. So if you were bottlenecked on basically like the requests per second that
the cache could do, you would be paying twice as much to enable TLS.
So it's a very big impactful problem.
And basically, you know, the goal at the time of Redis was,
let's keep things simple.
So changing that paradigm, so that's, you know,
it was actually very intrusive
to the networking layer.
So what everyone was using at the
time was basically a TLS proxy in front of Redis. So they terminated the TLS at
the proxy layer and then sent TCP to Redis. But as we just talked
about, all Redis is is a TCP server plus a hash map. And so you're basically
adding another TCP server. And so, you know, that was the dogma at the
time: use TLS proxies. And it was just really expensive. So we were like,
this needs to be in the engine. It needs to be, you know, natively there. You don't want to
pay this double cost. So we had built this inside ElastiCache and it was sort of like the first big
change inside our internal fork within AWS. And we were finally starting to pay the cost.
We wanted to stay compatible with Open Source Redis,
but we also wanted to have this code.
And so it was very difficult.
So we were very motivated to get this code out of our internal fork into open source.
And so we were prototyping a lot.
So basically there were three main stages of us pushing this code to open source.
So the first was, hey, let's just dump what we have. And we got sort of the pushback that it was kind of too complicated. So we tried to refactor it a bit into like sort of like a pseudo
proxy where we're able to like reduce some of the overhead and like kind of build the proxy into one
process. But then we saw like, you know, it was using 160% more CPU than it was if we just did it
normally.
And so we finally were able to take some of the learnings of that prototype and build
a new implementation that actually is what got finalized as the abstraction layer.
So we have this connection abstraction, which kind of abstracts away the TLS part.
So it still kind of looks like TCP, but can also support TLS.
So, you know, it's that sort of like iterative approach that like finally
got us to get it accepted in open source.
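To make that connection abstraction idea a little more concrete, here is a minimal sketch in C of the kind of layer being described: command processing reads and writes through one interface, and the backing transport can be plain TCP or TLS. The struct and function names here are illustrative, not Valkey's actual internal API.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#include <openssl/ssl.h>

typedef struct connection connection;

/* "vtable" of transport operations; command processing only ever calls
 * through this, so it doesn't care whether the bytes are encrypted. */
typedef struct connection_type {
    ssize_t (*read)(connection *c, void *buf, size_t len);
    ssize_t (*write)(connection *c, const void *buf, size_t len);
    void (*close)(connection *c);
} connection_type;

struct connection {
    const connection_type *type; /* TCP or TLS backend */
    int fd;                      /* underlying socket */
    SSL *ssl;                    /* NULL for plain TCP */
};

/* Plain TCP backend. */
static ssize_t tcp_read(connection *c, void *buf, size_t len)  { return read(c->fd, buf, len); }
static ssize_t tcp_write(connection *c, const void *buf, size_t len) { return write(c->fd, buf, len); }
static void tcp_close(connection *c) { close(c->fd); }

/* TLS backend: same interface, OpenSSL underneath. */
static ssize_t tls_read(connection *c, void *buf, size_t len)  { return SSL_read(c->ssl, buf, (int)len); }
static ssize_t tls_write(connection *c, const void *buf, size_t len) { return SSL_write(c->ssl, buf, (int)len); }
static void tls_close(connection *c) { SSL_shutdown(c->ssl); SSL_free(c->ssl); close(c->fd); }

static const connection_type CT_TCP = { tcp_read, tcp_write, tcp_close };
static const connection_type CT_TLS = { tls_read, tls_write, tls_close };

/* The rest of the server calls helpers like this and never touches SSL. */
static ssize_t conn_read(connection *c, void *buf, size_t len) {
    return c->type->read(c, buf, len);
}
```

The design point is the one she describes: to the command layer it still "looks like TCP", and TLS support is swapped in behind the same interface instead of bolting a proxy in front of the server.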
And I really like that story.
Like I can go into a little more detail, but also, it wasn't just AWS,
it was AWS and Redis sort of working together to sort of figure out,
what's the right thing for the project?
And that's always what I think is the most important thing,
kind of, for building the right thing for open source projects.
You see a lot of single vendor projects
where they're like, we're doing it the way we want to do it.
And they don't really take into account what the community really wants.
Yeah, interesting.
And so timeline-wise, was that like 2018, 2019?
Is that when? That was 2018 through 2020. It took two years.
Two years to get that in. Okay, wow.
It was a little hard because, like, when I started contributing it was really just Salvatore. Like he was the main contributor to the project.
There were some other folks as well,
but there was no, like...
Well, one of the big contributions I made to the project was I created a Slack channel for us all to talk more online. We had online syncs. They weren't
video. You know, it was like once a month on a Wednesday, we all got on the Slack
together and asked each other our questions. And so we talked a lot about TLS there.
So was that replacing, like, email lists or GitHub issues,
or what was going on before then?
I think it was mostly replacing that.
Yeah, I was trying to get like, you know,
cause GitHub issues are very low density.
You have to type a lot.
And so I was trying to replace that
with a more high density communication.
People didn't want to use video at the time,
but actually, maybe,
Slack might not have supported it at the time.
I don't remember.
This is like pre-Zoom.
This is like pre-pandemic.
So I was thinking it was a little weird.
Yeah, for sure.
I remember Slack bought that one company
and integrated their video for a while.
And then I think, I mean, it's definitely not their current,
their video implementation, but for a while
they had some other video provider in there.
So maybe it was that.
Yeah.
So.
Yep. Okay. And then so at
some point you became a core maintainer. When did when did
that happen?
So that was in 2020. So Salvatore stepped down in July of 2020.
Right? Because it's like, this is three months after the
pandemic started. So it was like kind of weird time wise. It's
like sort of surreal. So he stepped down. So Redis went from a BDFL,
a single benevolent dictator for life to a core team.
So the core team was still facilitated
by the company Redis.
And so they had three folks from Redis, the company,
and then me and an engineer from Alibaba named Zhao.
So we were like the core team.
So we made decisions on what we called major decisions,
for like any API changes. That went
through this little core team. So we met up and we decided what was the direction of the project.
Yeah, gotcha. So is your work primarily like Redis, now Valkey, core data plane type stuff?
Did you work on any control plane type stuff for ElastiCache or any of that sort of,
you work on all that too?
Well, so not anymore.
So now I just work on open source Valkey.
So I've kind of ebbed and flowed
about what I'm working on.
Like at the end of the day,
Amazon wants me to produce value for the company.
So like, what's the best?
Well, and that's in today.
I should say it wants me to be working what's best for
our customers, right? That's. So most of the time, the last
couple years, the best use of my time for our customers is to
deliver software open source. But early on, like we didn't
really do much in open source. One of the first very big
projects I did, I was part of Rolling Stone, the migration
off of Oracle.
So there's a big initiative within Amazon to move from Oracle to a lot of it to Dynamo,
some of it to Redshift, some of it to other databases.
So I was the one that migrated our Oracle database for our service to Postgres, which
was a fun thing to do.
Yeah.
Yep.
Yeah.
I've heard a lot about Rolling Stone over the years.
So that's fun.
So I've done that stuff.
I also built some other features as well.
But the big thing I have done was I own version currency.
So I was the one responsible for making sure our internal version was compatible with open
source.
How hard is that, like, maintaining this fork?
Yeah, I guess like, is that like a total headache
or is that like pretty manageable to do something like that?
I mean, that just seems like a, yeah, big job.
It is a big job.
It takes us a good bit of time.
You can kind of see how much time it takes
for Amazon ElastiCache to launch a version
compared to when it was launched in open source.
You can kind of see that little lag there.
And it's something that I would like to see us do better at over time.
Because a lot of it's because we've built stuff that really should just be open source. There's some very big things we've built.
Like in the managed service, we have a feature called data tiering.
So we can spill data from in-memory onto SSDs and pull it back in all transparently to the end user.
So that's like one feature we have that,
maybe it should, maybe we'd like to open source it,
but how are we gonna open source it?
How does it work?
It's very finely configured for the ElastiCache service.
But there's other stuff we kind of talked about
that was like, yeah, we should just open source it, right?
Like the performance stuff that we pushed into Valkey 8 was all stuff
we had built internally, and we were like, we don't need this to just be internal.
This should just be in open source, and we don't want to have to deal with
merge conflicts.
So we just chopped it all into open source.
Yep, gotcha.
Gotcha.
So has that gotten easier with, I guess y'all still offer Redis somewhat,
and then I guess like, has that gotten easier with Valkey, of just like, hey, there's
a smaller divergence between the fork and the open source one?
I would really like to say yes, but at the same time, you know, we're delivering features
for our customers at a high rate.
And so sometimes there's this big internal conversation we have, which is like, do we
upstream something first or do we upstream it later?
So do we build something primarily in the open source, then merge it back, or do we
build it in our internal fork and then figure out ways to push it to upstream later?
And the main difference is just time to market, right?
We can build something internally much faster
and push it upstream later, but that's not always the case.
And sometimes that meets requirements
and there's a lot of internal politics
that just happens at all companies.
And so we were constantly having this conversation.
So we're, you know, and that was the status quo, right?
Status quo, 10 years ago,
we always built something internal first,
and then we upstreamed it.
So it's been this 10-year kind of slow
shift more towards upstream first.
Yeah, that's nice. Yeah.
So one day I think we'll get there.
But yeah, cool, cool.
OK, I want to talk a little bit about the license change,
just for some background before we get into the Valkey performance stuff.
So I guess first of all, just walk me through this. March 20th, 2024,
Redis publishes this blog post announcing the shift from BSD to SSPL. Did you know about this
before I did, or was the blog post the first thing you had heard? Like, how did that come down for you? Yeah, so they had told us about 24 hours in advance, just sort of as a heads up, by the way.
But no, yeah, we were basically learning about it at the same time that everyone else learned about it.
And so yeah, they moved from the BSD license to a dual license, SSPL and RSAL, which are two proprietary licenses.
And at the same time, you know,
but like I think that's like the thing
that most people notice, but at the same time,
they also dissolved the open governance
that we were talking about before,
which you kind of have to do
if you're just gonna unilaterally change the license.
And then they also added a CLA at the same time.
So that's when you contribute,
you also have to
sign basically allowing them to take your contribution and relicense it.
Gotcha.
Gotcha.
Okay.
So when you say they removed the open governance, does that mean like you and Zhao were no longer
maintainers?
Is that what that means?
Yes.
Okay.
Right.
They moved from...
You know, they're actually a little vague now.
I'm not entirely clear what the system is,
but they did dissolve the open governance
and they said so in their FAQ that,
hey, this wasn't working, so we're gonna do something else.
Yeah, gotcha, gotcha.
Okay, so then how quickly did the fork come together,
and you decided, hey, we're gonna do some sort of fork?
Yeah, what was that timeline like?
So I think I have these dates, right?
So that happened on March 20th.
We had basically decided we were going to fork on March 22.
So then we had AWS, Google, Ericsson, I believe,
Alibaba on board at that time, as well as some other companies.
But those are the four who eventually would have maintainer spots. Like Oracle was also kind
of interested, Snap was interested, Verizon was interested. A lot of other companies who
weren't directly contributing were also sort of like, hey, we'd like to help with this, right?
We all have a vested interest in keeping the Redis project open. So at that point, we had also reached out
to the Linux Foundation.
So we had reached out.
I'm a very big believer in having a foundation
own the software to sort of prevent it
from ever getting re-licensed.
There's some other strategies,
like we could have used a more restrictive license,
or more of a copyleft license like LGPL,
which is another conversation that
was going on at the time, to sort of make sure this couldn't happen in the future,
this rug pull. But I'm a little bit less convinced of that. So we had engaged the Linux
Foundation and they were like, yes, we're interested in that. So, I think it
was a week after that. So I think it was March 28th, the Thursday or Friday,
we announced it as being under the Linux Foundation.
So there's about a little over a week.
I believe it was eight days from the announcement
to when the actual creation of Valkey happened,
which it's funny.
I think a lot of that time was actually coming up
with the name.
I think we had gotten all the stakeholders together, like, I think we had the stakeholders
by like Monday all kind of signed off.
But we had this name called placeholder KV.
And so we were still trying to.
I wish that was more popular because I have, like, a bunch of stickers that say placeholder
KV and nobody knows what I'm talking about.
Man, that would have been such a funny one.
Yeah, yeah.
That would have been a great long-term name for it.
Interesting.
And then, so, and then when was,
oh man, I should have written this down,
but when was like the 7.2,
when was the first Valkey release?
I think, I have a lot of dates in my head
and I think they're all right.
So April 16th was when we launched the fork. So that's when we were done with all the rebranding,
had the containers all set up. And you know, obviously, it's a fork, right? We
could have kept it a little bit leaner of a fork, but we also wanted to set ourselves
up well for the future. To make sure we were able to, like, there's some stuff within
Lua scripts like redis.call, which is built in. We didn't want that to be kind of the de facto thing. So
we also added server.call, so you could use a more generic term as well. So there
was some stuff like that. And then just a lot of rebranding, just to make sure everything
said Valkey instead of Redis.
Yep. Yep. And okay, so this is something that's happened a few times now. And once I think of like Mongo and like Elasticsearch being the one
that seems like most directly on point where there's already hosted services
from a lot of different cloud providers, and I guess like,
did it help to have that sort of prior art?
Did you consult with these other teams, like OpenSearch, around how to
approach this fork and deal with that?
What I think is interesting is actually, like, we'd actually applied a lot of the prior art
already, right? So a lot of what we learned from Elastic was that it's important to be involved in
these communities, right? Like I had been deeply involved with Redis for almost six years at this
point. And like, that was one of the things where, hey, one of the common things, you know,
people push at AWS is, like, hey, you guys aren't contributing back.
And so we're like, okay, we'll contribute back. We'll have people deeply invested in
these projects. And so that when the actual relicensing happened, I was in the community,
I knew all the key stakeholders, I had like spent a lot of time so that we were able to
get everyone together really quickly to sort of move fast in creating the fork.
I think that was the big thing that we learned.
I think that's the thing that Amazon is a big company, but it's one of the few things
that I think Amazon's getting better at over time.
We're trying to be more involved in open source.
Yeah.
Actually.
I see that around.
Yeah, that's absolutely true.
So yeah, good stuff happening there. And now of course, we have some new developments just last week, exactly a week ago. Redis announces they're sort of back to an open source license, AGPL. I guess, does this change anything for you and Valkey, or how does that,
like, how do you think about that? Yeah, so it doesn't change too much
of the direction of the project.
So as I kind of mentioned earlier,
there's like three things that Redis changed back in 2024.
They dissolved the open governance.
The open governance was not reinstated
sort of after this relicensing.
They still have a CLA, so they're still able to,
like any project with a CLA, even if it's under an open source license,
they're still able to do a rug pull
and go closed source again later.
But yeah, but AGPL, it's a good license.
It's an open source license.
I'm happy. It's good for the open source community
to have Redis back under an open source license.
So that's all great.
But from Valkey's perspective,
it's already been a year, we're so well
established, the code has diverged as far as we know quite a bit. So, you know, it's like trust
has sort of been lost. So like, nobody in the community really wants to go back and like,
re-merge. Like, a lot of people are like, are you going to merge back with Redis? I'm like,
no, a lot's diverged since then. And then also, we'd have to have the conversation about moving
from BSD to AGPL, which I don't think anybody really wants to do.
Yep.
Yep.
Is that the only way it could happen, if you re-merged?
I don't know enough about licenses
on how that stuff works.
Would it have to be that license?
So there would be basically two things.
One is Redis could use its power in the fact
that it has the CLA.
It can re-license everything back to BSD,
which I think they've said they're not going to do.
Yeah.
The other thing was, yes, Valkey could.
So Valkey's code is all under BSD.
So BSD can be relicensed to be AGPL.
So Redis can take any code that we've written,
relicense it, and merge it.
We can't do the inverse.
We can't take AGPL code, relicense it to BSD, and merge it.
So it's kind of like a one way, right?
But Redis the company could do it.
Yeah, yeah, gotcha.
On the, so you said the code base
is starting to diverge and I want to get into improvements
and stuff next, but like, as of right now,
is it still like a hundred percent drop in
from Redis to Valkey?
Where like, hey, some under the hood stuff has changed,
but like the API is still pretty compatible.
Or like, are there certain areas where it's like, hey, do you use this?
Well, then you need to think about this change.
Or like, where's compatibility between the two right now?
So the way Valkey's been framing our compatibility story
is if you're using the set of APIs from Redis
open source 7.2, right?
That's where the fork point happened.
We're still fully compatible at that point in time, right?
Cause that's sort of what we think is like, you know,
the real, you know, core set of APIs.
Redis has introduced some APIs since then.
They've introduced a lot in 8.0, for example,
which Valkey does not have compatibility with.
So there's sort of like a superset of APIs that Redis has.
Valkey also, it's like they're not exactly overlapping supersets.
Valkey also has some APIs we've built, and we have implemented some of Redis' functionality.
So there is some amount of, you know, if you're using this functionality, it might work and behave differently in Valky.
But if you're using these core APIs,
they should all behave basically the same way.
Got you. Got you. So the new,
whatever, vector sets I mentioned,
yeah, not going to be in Valkey.
But yeah, a lot of core stuff will.
Okay. Okay, cool. Okay.
So let's talk about improvements that have been coming out.
You had an 8.0 release,
you recently had the 8.1 release. If I sort
of look at it, like the big, the things that stick out to me, and you can correct me if
I'm wrong here, would be like the IO threads improvement, the new hash table stuff, which
was 8.1, some replication stuff, a lot of like core sort of like performance and quality
of life type stuff around that. I guess like, let's start with the IO threads improvement,
because, like, Redis is famously single threaded,
has some benefits, some trade-offs.
We've seen some, like, sort of Redis compatible caches
that are multi-threaded.
Like, is this similar to that, like, where it's like,
hey, now Redis is fully multi-threaded?
Or, like, what are these IO thread improvements?
Yeah, what do they add, what are they about?
Yeah, so I wanted to take a step back and talk about why we care about multi-threaded performance generally,
and then how we sort of try to take that and apply it to what we built with IO threads.
So if we're on most of these kind of caching that run in cloud virtualized environments,
if you have an Excel, let's talk about AWS.
So if you're on like an R7G.ATXL, you have, you know, about like 100 gigabytes of RAM and a lot
of extra cores. So if you're running in a caching environment, you're generally not using a lot of
cores. Caching is a very efficient CPU type of workload, but it tends to be very
much bottlenecked on the amount of memory you need. So you need a lot of memory and not a lot of CPU
cores. So in that world, it was fine that Redis was single-threaded because you would just have
a lot of memory and it would work out fine. We've started to see somewhat of a change because we're
seeing more CPU intensive operations start to take over workloads. Like vector similarity search, search, these are all much more CPU
intensive operations. And so you might have kind of the same amount of data, but now
you're running on these boxes with a lot of extra cores, so it's good to utilize them.
So like that's sort of the mindset, like we still think command processing, like simple gets and sets are still really fast.
And Valkey and Redis both have horizontal scaling,
so you can have multiple processes.
So just doubling the number of cores
and doubling the throughput,
you can just do that by scaling out.
So we wanna make sure that we're benefiting these more
like search intensive workloads.
And the other benefit is if you have a single hot key,
having multiple threads be able to serve that one key
reduces tail latency.
Because most latency comes from contention:
Commands are getting queued behind each other.
So if you're able to drain that queue more quickly,
you'll see lower tail latency. So, okay, so then the last point about that is,
with Valkey and Redis specifically,
they have a replication process that does like a full sync.
So usually you basically have idle cores
that are sitting around most of the time,
so that when you need to,
you could do that fork operation.
Right?
And so we have the, like,
you often have cores kind of sitting around.
So if you are a CPU bound workload,
we want to be able to kind of burst into those cores.
So to try to like piece all of that together,
when you have systems like,
like Garmin and Dragonfly are the two,
and I can say cache maybe to some extent,
are these big other caching systems,
which are very thread scalable.
They're targeting more the workloads of,
I just have one box, I just want to run one box,
and I just want to keep increasing the number of threads
like to not have to deal with any
of the horizontal scaling.
And so that's fine.
Like I think they're kind of trying
to solve a different use case.
The way we're trying to think about IO threading is really about optimizing the price performance,
or the cost-optimized performance, of a single box.
Right.
So we want to best utilize the cores we have available.
So we don't want to degrade single threaded performance because most caching workloads
kind of still run with that single thread performance, but we want to be able to kind
of burst into these extra threads sort of as
needed.
And so I think it kind of helps to also add in why ElastiCache specifically
built this IO threading work.
So IO threads had been available for a while, uh, within open source.
I think you mentioned it was in 20, no, I'm getting things messed up.
Someone else told me this morning that it was released
in 2020. Okay.
In that blog we both read. Within Redis.
Yeah, within Redis. There's IO thread. Because I know ElastiCache
has this history of what like Enhanced IO and 2019. Like a lot of stuff that ElastiCache
has done. Yeah, go ahead. Yeah. So one of the things we launched in ElastiCache has done. Yeah, anyway, go ahead.
Yeah, yeah.
So one of the things we launched in ElastiCache recently
was the serverless feature.
And one of the things the serverless needs to do
is to dynamically scale the amount of
resources very quickly.
So one thing that being able to quickly add cores does is
allow you to burst into spare capacity.
So if you are over-provisioned,
you're able to kind of dynamically add cores, increase throughput,
while at the same time, horizontally scaling, adding more shards to a cluster.
So we added that so we could get that dynamic range, not because we care about how big of a number
a single process can do, but so we can dynamically adapt to performance.
So that's a lot of context,
which is sort of to say that I don't think
just having a single process do a lot of requests per second
is intrinsically beneficial, right?
It's really, you know,
it needs to be benefiting the end users we have.
So that sort of tries to add a lot of context
about why we specifically didn't go with, like, the main
alternative implementations. Why is Valkey not truly multi-threaded? Like why is everything
not multi-threaded? Because that's a lot of complexity and it's not solving these
core issues.
Interesting. Okay. Interesting. And that's, again, sort of beyond my expertise.
So tell me about, I was going to ask you about Dragonfly later.
Let's talk about it now. So, I guess there's a certain set of workloads where
their approach is sort of better. Is that what I'm understanding?
Yeah, like if you have a box with like 32 cores and like you've paid for this box, right? And you don't want to manage a cluster
distribution. You could just start Dragonfly on it and it will scale pretty efficiently. It's a very
thread scalable architecture. So when you add more threads, the throughput goes up.
And they have lots of great documentation about why it works well.
And they have a fully concurrent, well, I shouldn't say this too authoritatively. I believe, I have not actually looked at the code because it's BSL,
but I believe and Roman can argue with me if I'm wrong.
Sure. Yeah.
But it's like a fully concurrent HashMap, right?
You can do concurrent operations.
I believe they have a blog post.
They are able to lock specific segments of the hash table.
And so they're able to do parallel work
on multiple different commands at the same time.
And this also helps with latency as well.
It brings down tail latency because commands
don't get blocked.
So yeah, it works well and scales pretty well.
OK, interesting.
Especially with CPU-intensive operations,
if they're doing multiple very expensive CPU operations,
like if they're doing the command in
Valkey and Redis called SUNIONSTORE,
which does merging of two sets,
which is very CPU-intensive.
The fact that they can lock
different parts of the hash map and then do that work in parallel,
is good to help increase throughput.
Yeah. Gotcha.
Then tell me about in 8.1,
we have these hash table improvements,
and there's just really good blog posts on how that works.
It talks about Swiss tables and also how you had to optimize that
for Valkey's unique needs.
I guess tell me about,
first of all, before we even get into those improvements,
how do these types of improvements
even get on your radar?
Do they start in industry first?
Do they start in academia?
Is it just like somebody's like, hey, I
think there's some better way we can do this sort of thing?
What is the flow for these sorts of ideas?
So for the Swiss table, the hash table improvement stuff,
it definitely came from this general
knowledge about the hash table that existed before. Well, so, you know, for everyone,
the way hash tables work is you have a key, you run a hash function on it, and that points to a
bucket. And so when you have a hash collision, so two different keys point to the same bucket,
you need a way to like resolve which of those two you want to do.
So in the original implementation,
we used a linked list.
So if we had two,
We would just have them point one after the other.
And that uses a lot of extra memory,
because there's a lot of overhead in having a dedicated allocation.
So the common wisdom in the 2010s
was you should do what's called an open addressing type of
hash table.
So instead of having...
Basically, you put the item in the wrong bucket, but you know that if you checked
an item in the bucket and it's not the item, you basically keep checking until you find
an empty bucket.
So these are open addressing type of tables.
And so it's well known that these
are more performant and more memory efficient
on modern hardware.
On older hardware, it didn't matter as much.
But modern hardware is very good at prefetching and doing
a lot of instructions in parallel.
So although you're doing more work
and it's counterintuitive to stick an item in the wrong bucket,
modern hardware handles this really well.
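As a rough illustration of the open addressing idea she's describing, here's a minimal linear-probing lookup in C. It's a toy sketch, not Valkey's new table: a Swiss-table-style design adds per-bucket metadata bytes and SIMD comparisons on top of this, plus the SCAN cursor guarantees mentioned a bit later, which this leaves out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy open-addressing table: illustrative only, not Valkey's actual layout. */
typedef struct {
    const char *key;   /* NULL means the slot is empty */
    void *value;
} slot;

typedef struct {
    slot *slots;
    size_t mask;       /* capacity - 1, where capacity is a power of two */
} table;

/* Linear probing: hash to a bucket, and if the key there doesn't match,
 * keep walking until we find the key or hit an empty slot. Assumes the
 * load factor is capped so the table is never completely full. */
static void *table_find(const table *t, const char *key, uint64_t hash) {
    for (size_t i = hash & t->mask; ; i = (i + 1) & t->mask) {
        if (t->slots[i].key == NULL)
            return NULL;                                  /* empty slot: not present */
        if (strcmp(t->slots[i].key, key) == 0)
            return t->slots[i].value;                     /* found it */
    }
}
```

The "more work but faster" point is that these probes walk adjacent memory, which modern CPUs prefetch well, instead of chasing a separately allocated linked-list node per collision.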
So there was a,
like a paper and I guess like a video that was produced.
So Google built this thing in 2018
that they talked about a lot, called the Swiss table.
And so the Swiss table is sort of a very finely tuned version
of this table.
And so we had like known about it, we had seen it.
But there were some very specific requirements that
Valkey needed and Redis needed, I guess this was in the Redis time frame, that made it kind of
difficult to adapt this implementation to our implementation. Specifically, we have a way to
have like a cursor and you can like scan through all the items.
And we provide very specific guarantees
on how we scan over all the items.
And we weren't able to preserve those in the Swiss table.
So like, you can go, there's an issue.
I don't remember when it was opened.
It was something like 2021 or something.
And we just started like just noodling on these ideas.
And I think it wasn't until a conversation I had with another maintainer
of Valkey named Viktor, we were at an open source conference, and we'd kind of
finally figured out kind of how to solve all these problems. And so we had like a good
idea. And so, you know, it just took a lot of time to figure out how to adapt sort of
what was from academia into industry.
I guess in this case it was more from industry to industry.
But yeah, we were finally able to adapt it.
And we were able to kind of,
I was kind of surprised.
Like we knew it all in theory,
we put it all together, and it was,
you know, faster,
it saved like 16 bytes per object,
and it all sort of worked well, which is nice.
We launched in 8.1, which is about a month ago.
And I was worried there would be weird crashes or something.
But weird performance, more importantly.
But it's worked pretty well.
It's worked kind of as expected.
Yep.
How do you get confidence on such a big change like that,
that it's not going to have
weird performance stuff like that? Yeah.
Performance is hard. We were fairly sure about the functionality of it. We ran a lot of fuzzing.
We have a lot of built-in integration testing. So we were pretty sure it wasn't going to
crash. Performance is really hard to measure, especially because it's just very use case dependent.
We have no good prototypical examples of use cases.
We have some people who have large objects inside Valkey,
like 16 kilobytes.
We have other people with lots of very small objects.
And so we have a couple of workloads we ran
and from what we could tell,
the hash table is faster in all cases.
So that's a good signal that everything seems faster.
And the theory is it should always be faster, right?
Cause we're doing fewer memory lookups.
I guess I didn't talk about this
when I was talking about the IO threading work,
but one of the things that sort of changed recently is if you actually do profiling on the IO threading that was in Redis before,
a lot of the bottleneck is actually the CPU waiting for the memory access to DRAM, right? So when you go and try to fetch that item, the CPU stalls. It's like, I need this from main memory.
And it'll just sit there and wait for it to get sent over.
So one of the things that we spent a lot of time doing
is before actually executing the command,
we do a bunch of stages to prefetch memory
into the CPU caches so that when we actually execute the commands,
we're not stalling on all this stuff.
So it increases like actual command execution
by like two or three X, which only ends up being like a,
you know, two X performance improvement overall.
So that's, you know, all that stuff.
But one of the big things about this hash table
is we remove one of the random memory accesses
when doing a command's memory access, right?
If you have some piece of memory that you're hitting a lot,
it'll stay in the CPU cache.
If it's kind of random, then it won't be in the CPU cache.
You have to do the full miss.
So we moved from two full misses to one full miss,
which is a really big improvement.
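Here's a rough sketch of that batching-and-prefetching idea in C, using the GCC/Clang __builtin_prefetch builtin. It's illustrative only, not Valkey's actual pipeline: the point is just that you issue the memory loads for a whole batch of parsed commands up front, so that by the time each command executes, its hash table entry is already in cache instead of costing a full DRAM miss.

```c
#include <stddef.h>

/* Hypothetical parsed-command handle; in a real server this would point at
 * the hash table entry the command is going to touch. */
typedef struct {
    const void *key_entry;
} pending_cmd;

static void execute_batch(pending_cmd *cmds, size_t n,
                          void (*execute)(pending_cmd *)) {
    /* Stage 1: kick off the memory loads for every command in the batch. */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(cmds[i].key_entry, 0 /* read */, 3 /* keep in cache */);

    /* Stage 2: execute; by now the entries are (hopefully) already in the
     * CPU cache, so command execution doesn't stall on main memory. */
    for (size_t i = 0; i < n; i++)
        execute(&cmds[i]);
}
```

Combined with the new table layout going from two random misses per lookup to one, this is where most of the quoted command-execution speedup comes from.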
Yep, yep, interesting.
You mentioned you sort of tested it
on some different workloads.
Do you have like, like, are those like internal AWS workloads
that you know of that are like good test candidates
or other people, you know, Alibaba and Google
and different things like that,
or are they certain customers that are a little more,
I don't know, I don't wanna say like adventurous
on some of this stuff or like how do you,
before you actually release that out,
get some of those tests, like real world tests?
Yeah, that's a good question.
So AWS has some benchmarks that we use,
and they're kind of based on data we've collected from the ElastiCache fleet.
And so we've made those available to the community as testing grounds.
We also have a release candidate process,
and we had a couple of folks actually go and run their benchmarks against our system
sort of before we did the GA with the hopes that
they would be able to tease out some of these performance
issues if they were there.
Though it's not perfect, but those are sort of the two
strategies we have as of right now.
Yeah, for sure.
Do you have a bunch of,
like, if you're making a change like this,
even theoretically sort of early on,
do you have a bunch of micro benchmarks to figure out,
hey, is this thing faster?
Or is it more like theoretical where it's like, hey,
we know we're going from two random misses to one
random miss, and we know it's going to be faster,
we're pretty sure.
And so we can go pretty far with implementation
before we can test some of that stuff.
Or how are you testing even early on these different performance ideas?
Yeah.
So ideally there's probably like a three-step process.
The first is you should profile and actually see where the CPU is spending time.
So stuff like perf can kind of tell you where in the code base you're spending time.
Right?
So perf periodically probes and unwinds the stack and like, this is where you are at this given point in time.
So if you do that thousands of times,
you know roughly where the application
is spending a lot of its time.
We also use, like,
Intel has performance counters that you can use.
You can basically say, hey, every time you hit this point,
you can run a counter.
So you can sort of build up answers at the actual instruction level. You can answer questions like,
how much time is this taking?
How many DRAM misses or L1 cache misses are you seeing?
So we try to build up this like intuition first, like, is this actually the bottleneck?
Right?
Because like, we can make something a hundred times faster, but if it's not the bottleneck,
doesn't matter.
So once we have that, then we'll sort of isolate that code,
and then we do micro benchmarks on that code, right?
Say like, hey, how do we make this faster?
But it's important to remember that just because it's faster in isolation
doesn't mean it's faster in the whole system, right?
So once we have that intuition, we still need to do the final benchmark
of everything stitched together and make sure it's actually faster.
And then we kind of can cycle back to the first part and like re-profile it to make sure
that, you know, part of the code base is consuming less time than it did before.
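For the middle step, a micro-benchmark can be as simple as timing the isolated code path in a tight loop. Here's a bare-bones sketch in C (not Valkey's actual benchmark tooling), with the caveat she gives: a win here still has to be confirmed in the whole-system benchmark and re-profiled afterwards.

```c
#include <stdio.h>
#include <time.h>

/* Stand-in for the isolated code path under test. */
static unsigned long work(unsigned long x) { return x * 2654435761ul; }

int main(void) {
    const unsigned long iters = 100000000ul;   /* 100 million iterations */
    struct timespec start, end;
    unsigned long acc = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < iters; i++)
        acc += work(i);                        /* keep a dependency so the loop isn't optimized away */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%.1f million ops/sec (checksum %lu)\n", iters / secs / 1e6, acc);
    return 0;
}
```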
Cause, like, there's this thing that's still sort of vexing me, which is, if you actually
go and check, so Valkey has this clustered mode, right?
And one of the things it needs to do in this cluster mode is make sure like,
is the request sent to me, can I serve it?
Right? Am I the right node to be able to handle this?
And it's like a pretty big chunk of time.
It's like 5% of the CPU of command processing.
And we go and profile it and we saw this function.
It's like taking a hundred percent of the time.
Like, perfect.
Let's just fix this one function.
We spent a bunch of time, we fixed it.
And we benchmarked it, like micro benchmarked it, and it was faster. But in the whole system, it was the same
performance, it didn't do anything. And just something else was taking up 5%. And we're like, what's
going on? And it's because, basically, this 5% was shielding the fact that this other function
still was going to, like, this one function was pulling a bunch of memory in from main memory. And now
it wasn't doing that. But now this other part of the codebase
still had to wait on main memory to pull this memory in. So it
actually didn't make the system faster. And we're like, well,
this other part was much harder to solve. And so we're
like, I guess we're not fixing this today. So that's where we're at.
Yeah, yeah. Like, how long does that whole process take, of like, hey, identifying this thing, micro-benchmarking
it, and then figuring out, oh, man, it didn't work?
Is that like, oh, man, we just lost two months of work?
Or is that something you can iterate through pretty quickly?
It's one of those things where if I'm not being randomized by meetings all day, you
can do a full loop in a day.
You can have an idea, test it, loop
it like in a day. But it's something that you need like a lot of focus on. So I feel
like, you know, we definitely don't lose months of effort. But it's definitely one of those
things that you can do in a day if you want. And it's kind of fun, especially
if you have a good idea, a good theory. The worst thing is, though, is like,
you know, you do it and it doesn't,
there's no performance delta. You're like dang it.
Yep. Okay. So if I look at these like 8.0 and 8.1 releases, it seems like a bunch of
again like good performance, reliability improvements in like a pretty quick amount of time, right?
You've only been around for a year. Was there a lot of low-hanging fruit there?
Was it an explicit decision to focus on,
hey, let's really nail the performance stuff that can
distinguish us in certain ways or yeah,
I guess what accounts for it?
Or is this about the pace that's happened
every year with Redis in the past?
I think we're able to innovate a little bit faster.
I think a lot of these ideas existed.
Like not a lot of what we built in Valkey was like super novel.
It's not like we're coming out of the blue and we're like,
this is a brand new idea.
We talked about a lot of these ideas with Redis.
I think the Valkey project as a whole is set up in a way
that it can move a little bit faster.
I think for better or worse, like, so when we were in Redis,
we had this open source team, this like five member team.
And like, there was a time where, you know,
we had a weekly meeting, right?
And it would help, sort of, for things that require high bandwidth, right?
When you're talking about IO threading improvements, you know,
it's good to just be in a meeting and talk it through.
And so there were times we would go, like, months without having a meeting,
just cause, like, the issue wasn't making progress.
And one of the big things I wanted to change when moving to Valkey is,
I want to make sure there were no bottlenecks, right?
Like we want to make sure everything was still making progress,
even if, like, someone was, like...
Process bottlenecks, not like bottlenecks within the process, but bottlenecks around the project? Okay.
Yeah, process bottlenecks. Like, we'd always just get hung up on stuff stalling because we weren't talking, and we had six different individuals. So stuff gets hung up a lot less, I think, just because of the way the project
sort of operates.
So I think it's just that.
I think it's just a lot more people.
There's also a lot of excitement in a new project.
A lot of people are excited to write code, make progress.
So I think it's kind of those two things.
Yeah.
Cool.
I also saw the release of the Valkey Bloom module, and there's this Valkey Modules API.
Are modules going to be a big focus going forward?
How are you balancing modules versus,
again, that core performance type stuff?
Or again, is it multi-threaded enough?
You have different people working on different things that you
can make progress on both pretty well.
Yeah. This gets into this,
what I just said about the process bottlenecks,
is we really wanted to be able to say,
I took a lot of inspiration from how Postgres operates.
PG Vector, I think is a good example.
PG Vector started off as a separate project,
and I believe the plan is to slowly move it into the core of Postgres.
I like that idea of having this very rich extension API so that someone
can go start a project with a new data type.
They can build it out.
They can write all the code.
And then it can sort of like, you know,
it becomes super well adopted.
It can then move into the main project.
So JSON was one of those things that was actually donated
by AWS.
It was something we built a while ago,
and it's nice that we built it within this module API framework
so we could just take the code and didn't require a lot of modification.
So there's three modules, which are JSON, Bloom, and vector similarity search.
So the goal is those are all available in the container.
It's called Valkey Extension; it's kind of in preview.
That's one of the things from the launch that got kind of shoved down the most,
like people were interested in VSS,
but didn't really care about how to get VSS, I guess. Um,
so like it's interesting that you've heard about bloom,
but like we haven't seen a lot of people actually using the container for it.
Like a lot of people are interested, but I guess there's not a lot of adoption yet.
Yep, interesting.
What about in terms of, so I know Redis has their own modules and those were sort of licensed
separately and I think they had a Bloom filter one, or like a probabilistic one.
I imagine there's only so many ways to write a Bloom filter.
Is that hard to, I guess, write a Bloom filter without encroaching on their license?
I don't know how licensing works in that particular area.
Well, first, yeah, yeah, like this is...
You understand the question I'm asking there?
I mean, yeah, I'll first definitely say I'm not a lawyer.
I don't.
Yeah, for sure.
But like, the way we think about it is that the API, we can kind of emulate as needed,
like decide what's useful, what's not.
And then the underlying implementation,
we just, you know, don't look at their code.
We just build our own thing.
In this case, I do know that theirs is written,
actually, I don't know this.
I think it's in C++.
It might be in C, I don't know that specific,
but ours is written in Rust.
So at least that level, we know the code looks different.
But yeah, so just like from a project hygiene perspective,
we don't look at their code.
They can, we look at, we might look at their docs,
but we won't look at their actual code that they've written.
Yeah. Yeah. Interesting.
Okay. You said that's written in Rust.
What, I guess,
what is Valkey written in?
C?
Written in C.
OK.
But then modules can be written in Rust,
or parts of Valkey can be written in Rust,
or how does that interop work?
So the way we've structured it right now
is that the core project, Valkey itself, will be in C.
And I guess technically, we allow C++ for certain types of surrounding libraries,
but the main project needs to be in C.
So one of the benefits of using modules is these modules can be built with a different compiler.
They could be in Rust, they could be in C++.
Theoretically, they could be in Go.
And, you know, if it makes more sense to write in Rust, we also have this Rust
SDK to make it very easy to write safe Rust code.
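For a sense of what a module looks like at this boundary (the shape is similar whether the body is C, C++, or Rust behind a binding), here's a minimal skeleton in C, the language the core itself uses. It assumes the ValkeyModule_* entry points mirror the long-standing RedisModule_* module API; treat the header name and exact symbols as assumptions rather than a verified listing.

```c
#include "valkeymodule.h"   /* assumed header name, mirroring redismodule.h */

/* HELLO.WORLD: a do-nothing command that just replies with a string. */
static int HelloWorld_Command(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    (void)argv; (void)argc;                       /* command takes no arguments */
    return ValkeyModule_ReplyWithSimpleString(ctx, "world");
}

/* Entry point the server calls once when the module is loaded. */
int ValkeyModule_OnLoad(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
    (void)argv; (void)argc;

    if (ValkeyModule_Init(ctx, "helloworld", 1, VALKEYMODULE_APIVER_1) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;

    /* Register the command; "readonly" since it touches no keys. */
    if (ValkeyModule_CreateCommand(ctx, "hello.world", HelloWorld_Command,
                                   "readonly", 0, 0, 0) == VALKEYMODULE_ERR)
        return VALKEYMODULE_ERR;

    return VALKEYMODULE_OK;
}
```

A Rust module built with the SDK she mentions would ultimately expose the same kind of load-time registration, just with the unsafe C boundary wrapped for you.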
I don't know how familiar you are with Rust.
I'm not.
No, I just had another person on talking about Rust.
And I was like, man, this is good stuff,
but I can't ask a question here.
So yeah, you should learn Rust.
I love Rust so much.
I should.
Okay, that's what I hear from so many people.
So, yeah, well, my slightly more nuanced take,
even though this wasn't your question: for systems level programming,
if you're working in a large team where there's varying levels of skill,
Rust is very good at forcing everyone to write high quality
code. Right? So I think a lot of organizations are like swooning over it because it's very good.
Like you can take a college grad. It's very easy to learn Rust because there's so much documentation
about it. And you don't have to worry about them, like making mistakes about like memory safety or
like leaking memory too much because it constrains big projects really well.
So that's why I think a lot of people kind of like it. It does some syntax stuff I like,
it does some stuff I don't like. And yeah, I think it's a good language.
Gotcha. Gotcha. Do you think most modules will be written in Rust going forward or
it hasn't been in the practice so far?
So right now we have three modules, right?
So we have JSON, which is in C++.
We have Bloom, which is in Rust.
I don't think it matters too much which one of those,
between those two, because to be blunt,
those two implementations are basically,
we took a third party
implementation of the data type, and then we implemented the APIs on top of it.
There's not a lot of logic there. The one that's more interesting is vector similarity search, which is like a searching engine. And a lot of that code is custom. And so that's in C++.
I would have liked to see that in Rust for a couple of reasons, especially around
the multi-threading memory safety stuff. But Google built it, they were on C++, we're not
going to say no, rewrite your code in Rust. And if you're very methodical, again, I'll
go back to if you're very methodical in writing C++ code, I don't think there's a lot of value in Rust. But from an open source project perspective,
I think it has a lot of value to write in Rust.
Yep. Yep. How close are you to that, the vector search stuff? Like, have you stayed close to it?
And I guess, like, how different are the requirements in, you know, something like
Valkey, a memory-based system, as compared to pgvector in Postgres or Mongo's implementation and things like that.
How different are those?
I will talk a little bit out of my depth,
but I have mostly ignored it until recently.
My understanding is in-memory vector similarity search is good for very high recall.
So when you're basically descending
a lot of the tree to look for the best possible matches,
because then you're doing a lot of random memory lookups.
So you want to avoid as many disk-based lookups
as possible.
And so that's where Valkey can do better throughput than
something like OpenSearch or pgvector. It sort of remains to be seen which use
cases that's most useful for. Historically, Valkey is mostly used for very online applications,
very real time applications, right? You're serving, caching traffic, you know,
someone's like on a website,
someone's like looking for real-time, you know,
ticker values from the stock market.
So when you're doing batch operations,
it's a little bit less important
to have this real-time stuff.
So I think there's an open question
of like where this makes sense.
We've also seen an increasing push away from,
so recall's only relevant if you're trying to do
approximate nearest neighbor.
If you're just doing, you want the exact closest object,
you could just flat search everything in the index.
And so we're seeing a lot of people just stick indexes
in Valkey as a key value string.
They extract the string and then they do a flat search on it.
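As a rough sketch of that flat-search pattern (not Valkey's VSS module; in practice the vectors would be the bytes of a string value fetched from the cache, and the dimensions and data here are made up):

```c
#include <float.h>
#include <stddef.h>
#include <stdio.h>

#define DIM 4  /* toy dimensionality; real embeddings are much wider */

/* Squared Euclidean distance between two DIM-dimensional vectors. */
static float dist2(const float *a, const float *b) {
    float d = 0.0f;
    for (size_t i = 0; i < DIM; i++) { float t = a[i] - b[i]; d += t * t; }
    return d;
}

/* Exact nearest neighbor: scan every vector, no index, no approximation. */
static size_t flat_search(const float *vecs, size_t n, const float *query) {
    size_t best = 0;
    float bestd = FLT_MAX;
    for (size_t i = 0; i < n; i++) {
        float d = dist2(&vecs[i * DIM], query);
        if (d < bestd) { bestd = d; best = i; }
    }
    return best;
}

int main(void) {
    /* In the pattern described, these floats come from a plain GET of one key. */
    float vecs[] = { 0, 0, 0, 0,   1, 1, 1, 1,   0.9f, 1, 1, 1 };
    float query[DIM] = { 1, 1, 1, 1 };
    printf("closest vector index: %zu\n", flat_search(vecs, 3, query));
    return 0;
}
```

Exact search like this gives perfect recall by definition; the trade-off is that the cost grows linearly with the number of vectors, which is why it only makes sense while the index stays small.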
So I think the space is rapidly evolving. And so it's very interesting to watch it evolve.
And so I think what we're really just focusing on is providing good fundamentals, which is why, you know, performance, efficiency, reliability, all that good stuff.
And then, you know, this is where it's nice to also work there.
Like I said, I work primarily on open source Valkey,
but it's really nice to also pay attention to what's going on
in AWS because I can at least be a little bit more tapped in
to like see what people are talking about,
what use cases they're missing,
where do they need high performance stuff.
And then we can try to use that to help inform
where Valkey should be going. Yeah, yeah, cool. Do you still, like, talk... do you stay pretty
close to customers? I imagine just like talking to customers all the time, like sort of standard
AWS? Yeah, literally right before this call, I was talking with the customer.
I love talking with customers. It's my favorite thing. Yeah. Oh, that's cool.
Seriously, the best AWS engineers I know are just like that. They're just like learning info machines,
just want to suck it in all the time
and figure out what problems people are having.
For some of these performance improvements,
are there trade-offs or does it ever happen
where it's like, hey, this is better for 90% of customers,
but this workload is slower?
Does that ever happen,
or is it mostly like,
no, this is usually pretty good across the board?
It's easy to build the stuff
that's good across the board for sure.
It's, I think it's a really good question.
Cause you know, I have this conversation
with our product managers where, like,
someone was like, oh, we saw a performance degradation. And we're like, yeah, that's sort of expected. And they're like, what do you mean? You told
us it was faster. And I'm like, well, it's complicated, right? Not for every single person. Yeah,
not for every single person. Like the most common thing. So the way our IO threading works,
if you are not throughput bound, you are not going to see better performance.
And under certain circumstances, if you're doing these operations we talked about earlier,
like the SUNIONSTORE, you can actually see a little bit of degradation depending on your
workload.
So, those do exist.
And we try hard to fix them and mitigate them.
Our goal is to make, if we can make everyone better by 10%, that's better than making 1%
of users better by 100% and everyone else worse by 1%.
So there's a lot of trade-offs there.
Yeah.
Yeah, like we had a-
Would you say, oh, go ahead.
Yeah, I was thinking, there's another example.
There's this data structure in Valkey called HyperLogLog, which does approximate size
of sets and
Someone was like, I want to make this change, it will make it,
you know, I think something like 14% faster, but it'll use about 2% more memory.
And we're like, I
don't know how to quantify that distinction,
and so we ended up not accepting it, because we're
like, we don't know which of these trade-offs
people prefer. And it's better to sort of keep it the same than to try to change it.
Yeah, yeah, very interesting. Would you say the Valkey project, I guess, like has a North Star,
like performance is like sort of the North Star, or is that balanced across adding more features,
adding ease of use or different things?
It is more of a balance of a few different things,
or is performance, hey, that's the thing we care about most?
So I don't think we have one thing we care about the most.
There's been five pillars we've talked about,
which is performance, memory efficiency, reliability, observability, and ease of use, basically. We want to build features that make
it easy to build applications on top of. And that's sort of what JSON is for, right? And the
vector similarity search stuff. So I wouldn't say any of those is more important than the other ones. We've sort of tried to not regress on one
to build a feature in another,
unless it's like, observability is the big one.
Observability is one of those things that like
almost by definition has to impact performance
because you're doing, you're introspecting in the commands.
So we've made all the observability stuff all opt-in,
for the most part, because we don't want to regress on performance to give you observability,
but people still also want observability. When your cache is having issues,
you really want to know why it's having issues.
Yep, yep, sure. Okay, I want to talk a little bit about these sort of other caching
options we talked about. We talked about Dragonfly a little bit, multi-threading. I think
you mentioned another multi-threading one.
Garnet.
Garnet, okay. Is that like a pretty similar approach there around?
No, Garnet's doing something completely different. I actually think Garnet's pretty cool. Garnet
is a... I'm sometimes a little mean, I just say, you know, they
think the problem with Redis is it wasn't written in C#. So it's a C#-based
implementation. No, it's actually much smarter than that; I'm being a little tongue in cheek. So
it's, some people also say this is reductive, but it's like a concurrent hash map.
So every operation is lockless and concurrent, right?
So it's not taking, like so Dragonfly takes locks
on like pieces of the data set.
What Garnet does is every single,
it's able to like update stuff atomically
using lockless algorithms.
It's built on some previous research done by Microsoft,
and it talks over the RESP protocol.
It's really pretty cool.
They have some really impressive high performance numbers
because they're not doing basically any locking,
which can take a lot of time.
If you have very well-formed or uniform access
patterns with a lot of keys,
you could do a lot of requests per second on Garnet.
Interesting, okay.
And so that one is not like Redis compatible,
but sort of serves similar workloads as Redis,
or Valkey, is that what you're saying?
No, it actually speaks the same RESP protocol.
It is, okay.
It doesn't, it has some weird performance under edge cases,
but if you're doing simple gets and sets, it's another,
it's also fully open source, right?
So I guess it's permissively open source.
I think it's MIT.
Okay, okay.
What about KeyDB, which you mentioned,
you mentioned Snap earlier,
Snap picked up KeyDB.
What was their sort of big insight there?
Yeah, so KeyDB was originally created
as a multi-threaded fork of Redis.
So as I said, so, actually, I don't think I talked about
our IO threading architecture.
So our IO threading architecture is we have a single
command execution thread that delegates work
to other threads.
So the main thread will say like, hey, I need to go read
from these 12 clients, you IO thread,
go read from these 12 clients.
So there's still a main thread doing almost all
of the work and coordination.
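A highly simplified sketch of that delegation pattern, just to make it concrete; this is not Valkey's actual implementation, a pipe stands in for a client socket and there is only one I/O thread:

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* One "job": an fd the main thread wants read, and the bytes that came back. */
typedef struct { int fd; char buf[128]; ssize_t n; int done; } io_job;

static io_job job;
static int have_job = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* I/O thread: only performs the read() it was handed, never runs commands. */
static void *io_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!have_job) pthread_cond_wait(&cond, &lock);
        job.n = read(job.fd, job.buf, sizeof(job.buf) - 1);
        job.done = 1;
        have_job = 0;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    int p[2];
    if (pipe(p) != 0) return 1;      /* the pipe stands in for a client socket */
    (void)write(p[1], "PING\r\n", 6); /* the "client" sends a command */

    pthread_t tid;
    pthread_create(&tid, NULL, io_thread, NULL);

    /* Main thread: delegate the read instead of doing it itself. */
    pthread_mutex_lock(&lock);
    job.fd = p[0]; job.done = 0; have_job = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    /* Main thread could coordinate other clients here, then it picks up the
     * completed read and does the command execution itself. */
    for (;;) {
        pthread_mutex_lock(&lock);
        int done = job.done;
        pthread_mutex_unlock(&lock);
        if (done) break;
        usleep(1000);
    }
    if (job.n > 0) {
        job.buf[job.n] = '\0';
        printf("main thread executes: %s", job.buf);
    }
    return 0;
}
```

The command itself is still parsed and executed on the main thread, which is why, as she notes, workloads that aren't throughput bound don't get faster from the extra threads.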
So KeyDB did something slightly different.
They had like a single global level lock.
So they had threads that were context switching,
some of them were doing IO,
and then they'd wait for the global lock
to do command processing,
and then they'd go back to doing IO.
And so they have a slightly different
multi-threading architecture,
and ours is a little bit better, a little higher performance.
But again, KeyDB built this implementation a long time ago,
like three or four years ago.
And so they iterated a bunch.
So KeyDB was originally started as a separate project.
It built a bunch of functionality, like active-active replication,
spilling data to disk.
It was then later acquired by Snap.
And Snap was sort of looking to,
my understanding is Snap didn't really want to be
in the business of maintaining an open source cache.
And so they were like,
how do we merge this functionality together?
So we've been working close with KeyDB engineers
to get the functionality they want merged into Valkey.
Gotcha.
And did you say Snap is, like, one of the sort of
companies that's part of Valkey?
Yeah.
Yeah, it was one of the original companies
that sort of backed Valkey.
OK.
OK, nice.
Tell me also about MemoryDB, which I think you said you worked on as well.
Yeah.
So MemoryDB is, so it's basically ElastiCache.
So ElastiCache is the managed Valkey
and Redis and Memcached service,
but MemoryDB has a slightly different replication system
that makes all the data durably logged.
So ElastiCache just replicates to other replicas,
keeps everything in memory.
MemoryDB first pushes the data to a durable log.
So when a MemoryDB node fails, you don't lose data.
It just comes back up, has all the data.
So it has nice properties, like it's durable,
it's consistent.
You can really use it for primary database workloads.
So that's sort of the big thing about MemoryDB.
And so that's like, hey, I love the Redis API
and I'm willing to trade off a little bit of write latency,
and now I can use it as my primary database,
is the thought there.
Yeah.
That's the pitch.
Nice.
OK.
Yep, very cool.
Was that a fun one to work on?
I'll say yes.
Yes.
Nice, nice.
OK.
One thing I meant to ask during the performance work section is Glide.
So Glide is like this... Yeah. Okay. Is Glide within Valkey or is that an AWS project?
I can't remember. So it was started as an AWS project. So the key insight for Glide was that
within the Redis ecosystem, a bunch of clients got built up incrementally over time. Right?
So there was like the Ruby space, they built a Ruby client, Python space, built a client.
And they all sort of had different standards, different ways of doing like cluster topology
discovery and, you know, TLS and all this stuff. And so, you know, we
see a lot of issues at scale within ElastiCache. So we had customers who were like, hey, I have a 400-node cluster and every time it
does failovers, we see, like, it takes a really long time for our client to
rebuild the topology.
And so, you know, we looked into all these clients and like, well, we could fix them
all individually, but instead, why don't we try to go build a single client that sort of has a Rust core.
So we have the core implementation in Rust
and it's what talks to the servers.
And so we can solve this logic once here.
So that's what we call the Glide core.
And then we build high level bindings that talk to this core.
So instead of building all the complex logic in Python,
we can build it just in Rust. And this is actually great for certain types of interpreted languages, which are pretty slow
inherently. So all of the expensive, you know, I-O work is done in Rust, as opposed to in a higher
level language. So we built that originally for Redis; it was, you know, an AWS-native
project. And then when the licensing changed, we were like,
oh, we can just donate this and make it a Valkey project.
So now it's officially a Valkey driver.
It still works with Redis because it's, again,
we're compatible with Redis open source 7.2.
Yep, nice.
What, like how many language bindings
do you have for Glide?
So we have four which are practically done.
I think they're not quite GA, but it's close.
So we have Go, which is almost done.
Java, which is done.
Python, which is done.
And Node. Java, Python...
I don't know which one I missed.
So those are done. We also have someone working
on a C++ binding. We have someone working on a C sharp binding and we have someone working on a
Ruby binding. So a lot of those are kind of in progress.
Yeah, gotcha. When you talk to customers and they're having performance issues, are there
a lot of things they can do? Like, is it like, oh,
man, your, your client is misbehaving in certain ways? Is it like, oh, you need to tune a few
parameters on your server and that will work? Is it data modeling? Is it all of the above?
Like how do you like, what do you see out there a lot?
I mean, it's definitely all of the above. We've had issues in clients, especially.
You know, there's a lot of very simple tuning you can do on a client to make it much more performant.
There's things like, you know,
connection counts, pipelining, all that kind of stuff,
which we sometimes see.
We generally see that the performance problem
is not on the server side, though.
And so sometimes it's there, you know,
using like a bad cluster sizing
or topology. Like, if they could change it to scale in and scale
out, or sorry, scale down, like change the instance type,
and scale out, that can solve their problems.
Oftentimes, it's data modeling issues, especially when you're
doing cross-key operations, you're
using very inefficient data types.
Like a lot of the operations in Valkey are built in such a way that they should scale well,
but there are some operations which just don't scale at all, right?
So we see a lot of people, and I'm going to keep going back to this SUNION stuff
because I talked about it. There was a customer that was doing this SUNION,
or I guess they're doing SDIFF. They're basically saying, okay, this is the set of
objects a user had at a point in time, t1. And then this is the one they had at time t2.
Like, I want you to do a difference and tell me what the difference is.
And that works unless the sets get very large, because it actually scales,
like, n log n in the time it takes to do these differences.
So that was one of the things that was like, well, you know.
And what's very large in this case? Is that like a thousand? Is that like?
Usually on the order of millions, right? Because all this is done in memory. It's
going to be fast.
Yeah, true.
And so we just sort of worked with them to be like, you know, instead of having one, have
like 25 of these individual sets.
Because, like, if you do a simple hashing,
you can split the objects up, right?
And that also basically brings down the n log n part
as well.
So they're actually just overall,
like overall reducing a lot of CPU usage.
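A hypothetical sketch of that fix with the hiredis C client; the key names, the shard count of 25, and the toy hash are all invented for illustration, and the {braces} are the usual hash-tag trick so each pair of per-shard sets lands on the same cluster slot:

```c
#include <hiredis/hiredis.h>
#include <stdio.h>

#define NSHARDS 25

/* Toy string hash used to pick which sub-set a member belongs to. */
static unsigned shard_for(const char *member) {
    unsigned h = 5381;
    for (; *member; member++) h = h * 33 + (unsigned char)*member;
    return h % NSHARDS;
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Instead of one giant set per snapshot, add each member to one of 25
     * smaller sets chosen by hashing the member. */
    const char *member = "object-42";
    char key[64];
    snprintf(key, sizeof(key), "items:t1:{%u}", shard_for(member));
    redisReply *r = redisCommand(c, "SADD %s %s", key, member);
    if (r) freeReplyObject(r);

    /* The snapshot diff is then done shard by shard, so each SDIFF only
     * touches a fraction of the members. */
    for (unsigned i = 0; i < NSHARDS; i++) {
        char k1[64], k2[64];
        snprintf(k1, sizeof(k1), "items:t1:{%u}", i);
        snprintf(k2, sizeof(k2), "items:t2:{%u}", i);
        r = redisCommand(c, "SDIFF %s %s", k1, k2);
        if (r) {
            printf("shard %u: %zu changed members\n", i, r->elements);
            freeReplyObject(r);
        }
    }
    redisFree(c);
    return 0;
}
```

Each SDIFF now only touches roughly a twenty-fifth of the members, which is where the CPU savings she describes come from.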
So yeah, sort of across the board.
It's very rarely tuning parameters.
There's not a lot of tuning parameters in Valkey today.
There are some, like we have different representations
of data types.
So the set I kind of have been talking a bunch about,
you can either have just basically an array of objects.
Because at some point, for checking whether an object is in a set,
it's actually faster to just have it all in a block of memory
than it is to actually have efficient data structures.
And it's way more memory efficient.
So when we transition from these dense encodings
to these sparse encodings, as we typically call them,
we have hard-coded values.
So sometimes people tune those, but generally the defaults work.
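You can watch that transition from the outside with OBJECT ENCODING; a small hiredis sketch, assuming the Redis 7.2-era config name set-max-listpack-entries (default 128) is still what governs the threshold in Valkey:

```c
#include <hiredis/hiredis.h>
#include <stdio.h>

/* Print which encoding the server is using for a key right now. */
static void show_encoding(redisContext *c, const char *key) {
    redisReply *r = redisCommand(c, "OBJECT ENCODING %s", key);
    if (r && r->str) printf("%s -> %s\n", key, r->str);
    if (r) freeReplyObject(r);
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    redisReply *r = redisCommand(c, "DEL demo:set");
    if (r) freeReplyObject(r);

    /* A handful of short string members stays in the dense, flat encoding. */
    for (int i = 0; i < 10; i++) {
        r = redisCommand(c, "SADD demo:set member-%d", i);
        if (r) freeReplyObject(r);
    }
    show_encoding(c, "demo:set");   /* expected: a compact encoding (listpack) */

    /* Push it past the hard-coded threshold (set-max-listpack-entries,
     * typically 128 by default) and it converts to a real hash table. */
    for (int i = 10; i < 200; i++) {
        r = redisCommand(c, "SADD demo:set member-%d", i);
        if (r) freeReplyObject(r);
    }
    show_encoding(c, "demo:set");   /* expected: hashtable */

    redisFree(c);
    return 0;
}
```

Below the threshold the members sit in one compact block of memory; above it the server converts the key to a real hash table.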
It's kind of unique. It's not a big thing. Yeah.
Yeah.
Okay. Yep. Yep.
Sort of going forward on Valkey, if I'm interested in Valkey and want to find out,
hey, what's getting focused on in 8.2 and sort of those things, is that pretty easy for me to just
see or is it like, hey, wait for the release candidate and you'll see what you get or like, what's that look like?
It's funny, because like a lot of people come from the world,
like they want a roadmap.
And so a lot of single vendor open source projects
can have a roadmap because there's a product manager
who's like, here's the roadmap.
The problem with like vendor neutral open source projects
is it's like, here's the things we'd like to build.
Here are the things we think someone's going to build,
but also someone could show up tomorrow
and be like, I want to build this.
And we're like, wow, that's so much better to build
than what we had in our roadmap.
Let's go work on that instead.
So there's some stuff we know is going to come in Valkey 9.
There's stuff like hash field expiration,
being able to set a TTL and like a specific field
within a hash.
We're trying to fix, this might be too detailed,
but like, so when we're doing like resharding of data,
so in cluster, if you have multiple shards
and you want to move data between them,
right now it has to be driven entirely
by a third party observer.
Like someone has to move all the data themselves
and we're trying to make it built in.
So it like moves it between nodes directly.
So it's much faster.
So that will probably be there.
We call that
atomic slot migration. Because when there's a third actor, that third actor can die, and then the server
is just like, what's going on? Whereas in this new world, if the node dies, it just reverts and
doesn't leave it in a broken state. So there's stuff like that. Full text search is also something
we're building. Would that be a module or would that be built in? That's built as a module, yeah.
Okay.
Our goal is like very experimental features,
module first, and then if it becomes super standard,
we'll pull them in core.
That's the thinking.
Yep, you mentioned Valkey 9.
When do you, I guess like when do you do
major bumps versus minor bumps?
Is that like time-based or is that?
Right now we've just been releasing every six months.
The thinking is we'll do alternating major, minor, major, minor.
We're in the awkward position.
Like we don't really do breaking changes.
So we don't really need to do major bumps.
But one of the things we learned about 8.1 is people were way more excited about
8 because it was a major version.
Yeah, it's true.
No one's excited about, like, yeah,
7.14 and things like that. Yeah. So yeah. So, like, maybe we should just do major versions.
But I don't know. Yeah. Yeah, interesting. I know it's hard. It's like,
is it a marketing thing? Is it, you know, encoding some information about breaking changes,
like you're saying? Probably not, in that sense. Yeah.
You know, you keep saying marketing, but like the way I feel is, you want to have the most impact, we want people
to be using it, and if that involves getting people to do marketing, then fine.
I don't mean it in like a
derisive way, right? I'm just saying it's, yeah, getting that information
out in the way that people consume it. Yeah, yeah, for sure.
I think especially in the engineering world people are derisive about marketing, but that's true. Yeah.
You've got to realize it's like part of the thing, right? Like, you know, users of this have limited time and are sort
of paying attention to all these different projects. So it's like, how can you most efficiently
just communicate to them what's new, this is worth upgrading
for and those sorts of things like that.
Yeah.
So is it hard to find good people to work on Valkey, like, it being so sort of low level
and things like that?
Or do you feel like, hey, we have like a really good team, we have enough people going, like,
what's that look like?
You know, we definitely have a lot of good people, but we definitely need more people.
There's a split between like,
the big thing that we're missing is like,
maintainers on Valkey.
We need more people like helping review this deep code,
which is hard.
Like, I think that's, you know,
a long-term problem with open source projects
that are very, they require a lot of technical depth
to understand is like people come in, they're like, I have my one feature, I would like
to build my one feature. And then they got their feature merged. And they're like, I'm
good. And they leave. Which is fine. It's great. Like, that's part of the process. But
getting people to stick around for the long term, especially with something that's so
it's not super sexy. It's not, you know, like, it's so much easier to get maintainers on
like, you know, projects that are rapidly evolving and rapidly
changing.
This is sort of like a lot's going on in the project, but it's all very incremental.
So it doesn't feel as grand.
We're not tripling the performance.
10% a year, 5% there.
Things like that, for sure.
10% a year.
Well, that's funny.
I was like, that's super exciting,
it's like this big number, but like,
yeah, it's hard to have these big 3x improvements year over year.
Yeah, yeah, for sure.
Yep. Usually I ask people like how they're using AI in their day to day work.
Like, are you able to use like any of the AI stuff for your day to day work?
Or is it so low level and specialized that it's hard to pop it in there and have it do stuff for you?
I'll give a concrete example. Yesterday, I was debugging an issue in Valkey. There's a data
type called geospatial indexes. We had a customer who basically reported the fact that if I have a point and
I want to find all of the points that are within a zero meter radius of that point.
And they had a valid use case. I was like, this is weird, but whatever. The thing is,
it should return that point.
What do they mean, a zero meter radius? Do they mean, like, less than one? Or what?
No, like literally zero, right? So technically, by the
math, it's like the point is the point. Okay. Right. So there's,
there's a reason they're doing this. And so it should
return itself, right? Because it's technically within the
bounding area. On Intel, it was returning that point; on ARM, it
was not returning that point.
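For concreteness, the shape of the reported query looks roughly like this with the standard geo commands; the key, member, and coordinates are made up, and whether the zero-meter search returns the member itself is exactly the edge case being described:

```c
#include <hiredis/hiredis.h>
#include <stdio.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Store one point, then ask for everything within a 0 meter radius of
     * that same member. Mathematically the point is inside its own radius,
     * so it should come back in the reply. */
    redisReply *r = redisCommand(c, "GEOADD demo:geo 13.361389 38.115556 pt1");
    if (r) freeReplyObject(r);

    r = redisCommand(c, "GEOSEARCH demo:geo FROMMEMBER pt1 BYRADIUS 0 m ASC");
    if (r) {
        printf("matches: %zu\n", r->elements);  /* expected: 1 (pt1 itself) */
        freeReplyObject(r);
    }
    redisFree(c);
    return 0;
}
```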
And so I was just like, well, this makes no sense.
And so I went to my normal debugging, like using GDB, stepping through this, like the
C code, and ARM was just behaving weird.
I was like, what's going on?
And so I had to go and actually look at the assembly instructions.
So as I said at the beginning, I know x86, so it's very easy for me to read x86. But ARM, I'm like, oh, these instructions. But it's very nice,
because I would just give it to ChatGPT, and ChatGPT would just, like, translate everything for
me in real time, to be like, hey, this is what all these instructions do. And it made it
very fast for me to read and figure out what's going on. And I mean, the reason is not that
interesting. Like, there was an instruction on ARM that just was having a rounding error,
which it's allowed to do. It's not violating the spec.
It's like Intel has this super fancy instruction that does like two multiplications and then like a subtract,
like it takes two numbers, multiplies them by something and then subtracts them,
which is exactly the operation that we were doing in the code,
but ARM doesn't have that.
So it was doing it in two steps,
and those two steps were introducing
a little bit of rounding error.
And because of that, it was putting it outside the zone
and so it was not returning it.
So.
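The class of difference she describes is easy to reproduce in isolation: a fused multiply-add rounds once, while a separate multiply then subtract rounds in between, so the two forms can land on different sides of zero. This is a generic illustration, not the actual Valkey geo code; compile with floating-point contraction disabled (e.g. -ffp-contract=off) so the compiler doesn't fuse the second form anyway.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Values chosen so the exact product a*b is not representable. */
    double a = 1.0 + DBL_EPSILON;        /* 1 plus one ulp */
    double b = 1.0 - DBL_EPSILON / 2.0;  /* 1 minus half an ulp */
    double c = 1.0;

    /* One rounding: the multiply and subtract happen as a single fused op. */
    double fused = fma(a, b, -c);

    /* Two roundings: a*b is rounded first, then the subtraction happens.
     * The intermediate rounding can swallow the tiny difference entirely. */
    double split = a * b - c;

    printf("fused: %.17g\n", fused);  /* a small nonzero value */
    printf("split: %.17g\n", split);  /* typically exactly 0 */
    return 0;
}
```

If one platform's code path ends up using the fused form and the other doesn't, a distance that should be exactly zero comes out as a tiny nonzero value on one of them, which is enough to flip a boundary check like the radius test here.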
Oh my goodness.
Interesting.
So when you change that, do you have to,
like, how do you even fix something like that?
Do you have it run differently on ARM versus Intel
or do you just like abstract it a little higher level
to where it works for both of them?
It's not doing that.
In this case, I was actually very worried,
while debugging this, that there would be, like, something
where there was nothing we could do.
But in reality, you know,
a best practice when dealing with floating points
is you should never compare something to exactly zero.
You should compare it to some extremely small value,
and we weren't doing that. So in this case, we just changed it to compare against some very small value,
and that fixed it.
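The fix she describes amounts to something like the following; the epsilon value and the names are illustrative, not Valkey's actual constants:

```c
#include <stdbool.h>
#include <stdio.h>

#define RADIUS_EPSILON 1e-9   /* illustrative tolerance, not Valkey's constant */

/* Instead of asking "is distance <= radius" with an exact floating-point
 * comparison, allow a tiny tolerance, so a distance that should be zero but
 * comes out slightly nonzero due to rounding still counts as inside. */
static bool within_radius(double distance, double radius) {
    return distance <= radius + RADIUS_EPSILON;
}

int main(void) {
    double distance = 1.1102230246251565e-16;  /* "zero" plus rounding error */
    printf("strict:  %d\n", distance <= 0.0);              /* 0: excluded */
    printf("epsilon: %d\n", within_radius(distance, 0.0)); /* 1: included */
    return 0;
}
```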
Yep. Yep. Okay. When you were having chat GPT translate for you, did you have it translate
to English or to x86?
To English in this case, I basically said, can you rewrite this to explain in English
what each of these instructions are doing?
Yeah, okay.
That is definitely the most fun use of AI that I've heard.
So well done.
Yeah, I like it for that.
I'm not a big fan.
I've been trying to get into like agentic stuff.
Like I've been using the built in stuff.
For some reason, I get free GitHub Copilot.
Apparently, it's because I work on an open source project,
but I'm the only Valkey maintainer that gets it.
So I don't know how I'm in this secret group.
So I do try to use it.
But the agent stuff, I tried to, I asked it to build reverse search into the Valkey CLI.
So the Valkey CLI is all in C.
It just kept writing the wrong thing.
It kept just assuming...
It seems to work well very tactically.
It's like, hey, I want this very tactical function
to do actions.
It's pretty good at that.
But when you're like, hey, taking everything together,
build this, it struggles a lot.
Yep.
Yeah, for sure.
Yep.
That's interesting.
It'll get there, I think.
Yeah.
Yeah, we'll see.
Well, Madelyn, thanks for coming on.
This is super interesting.
I learned a ton.
And it was great to talk to you.
Will you be at re:Invent this year?
Hopefully not, but I probably will be.
Hopefully not.
Yeah, nice.
Yeah, I hope you're there.
And yeah, it's been fun to see Valkey's progress and all that stuff.
So best of luck going forward.
If people want to find out more about you, about Valky, where should we send them?
Yeah, so you can go to valkey.io. It's the best place. We have lots of blog posts on there.
You can also, I'm pretty active on LinkedIn and Blue Sky these days.
So presumably that handle will be posted somewhere.
Yep. Cool. We can do that. Awesome. Well, thank you for coming on, Madelyn.
All right. Sounds good. Yeah. Thanks a lot for having me. It was a great chat.