Software Huddle - Elasticsearch Fundamentals with Philipp Krenn
Episode Date: March 14, 2024

Today, we have Philipp Krenn on the show. He's the head of DevRel for Elastic, and we took a deep dive on all the Elasticsearch stuff, like indexes, mappings, shards, and replicas, and how to think about performance and all that. We also discussed the use cases and applications where Elasticsearch is not suitable. This episode is packed with fundamentals, and we think you'll love it.

Timestamps
02:00 Introduction
04:13 What is Elasticsearch
05:33 Use Cases
11:25 Where not to use Elasticsearch
13:51 Index
16:44 Shards
23:29 Routing
33:57 Replicas
41:08 Bottlenecks
01:02:30 Upgrading an Elasticsearch Cluster
01:06:12 Rapid Fire
Transcript
Elasticsearch started off as a distributed document store, but it's much more than a regular database, because databases are, I would say, very much black and white. You store something and you want to retrieve that back. Search is much more about those shades of gray: you are looking for a concept, and you don't know exactly what you want, but you want to have the concept or some context around it.
If you could master one skill that you don't have right now, what would it be? Since I go to a lot of meetups and conferences, I want to pick up sketchnoting. That's a great idea, and it's hard to find someone who can do that, because generally you need at least a little bit of domain understanding and knowledge, because these are deep technical talks. And then finding someone who has the art, has the words, because they're also doing notes... just being able to do all that stuff is a tough skill to find. But those are really nice. Which person influenced you the most in your career? I want to name Shay. He started Elasticsearch, I think, from his bedroom, more or less. And then he was the CTO and the CEO, and now he's the CTO again. And, A, because he found an interesting problem. By the way, do you know the backstory of why he created Elasticsearch?
Hey folks, this is Alex. I always say that Elasticsearch is super powerful, but it also scares me, because I just don't understand how it works, and how its scaling and performance compare to other databases. Great episode today. I have Philipp Krenn. He's the head of DevRel for Elastic. And we just deep dive on all that Elasticsearch stuff, like indexes and mappings and shards and replicas, and how to think about performance and all that stuff. I learned a ton. I wish I would have had this eight years ago when I was using Elasticsearch, but I think it'll be just useful for, you know, getting a good mental model of how Elasticsearch works. So I hope you enjoy it. If you have any questions, comments, guests, anything like that, feel free to reach out to Sean or to me. And with that, let's get to the show.
Philipp, welcome to the show.
Thanks a lot for having me.
Yeah, so I'm excited to have you because Elasticsearch is in some ways my white whale.
I feel like it's this amazingly powerful thing, and I also don't know enough about it, so it scares me a little bit.
And you are the head of DevRel at Elastic, so I'm guessing there are not that many people
on earth that would be better to sort of explain that to me.
So with that introduction, maybe tell us a little bit more about you and your background.
Right.
So I've been at Elastic for almost eight years at this point.
I've done a ton of events and just tried to help out our users.
And we're still on that mission to make it work for the developers and help them along. So
hopefully we can avoid a bit of that white whale thing. I mean, any complex system has its learning curve and Elasticsearch is definitely no
exception, but it's not magic and it shouldn't be scary. Most of the stuff, once you know it,
is almost easy. I feel like that's always the problem for me. It's like, once you know,
then you know and it's easy, like for any other thing that you have learned. And once you know,
it's like, oh yeah, of course. Yep, of course. Yep, yep. Well, one thing I learned is, it just surprised me how little developers know about databases, just basic stuff like indexes. And I'm talking about a relational database, just an index and how it works and how you should use it and stuff like that. And then Elasticsearch is just another level of power, with all the different things being distributed, but also having the inverted index and text searching. Also now having vectors and, you know, having had semantic search for a while, geo stuff, all sorts of things. And it's amazing in what it can do. But again, you see people do bad things with it, because people do bad things with relational databases, which have been around forever. And now it's another level of complexity, but also power and all that stuff. So I'm excited to learn a lot of this stuff. And I think there's
probably no one better here. Cool. Yeah. I mean, it's a bit like the hammer and nail thing, right?
If you only know one tool and you try to use it or abuse it for everything, then you get
interesting results.
And yeah, Elasticsearch can do a lot of stuff, but not everything.
Yep. Yep. Cool. Okay. So let's jump in, and maybe just very high level first: what is Elasticsearch?
Elasticsearch started off as a distributed document store, or search engine. I think search engine is what we commonly refer to ourselves as, even though it is JSON documents that you put in. But it's much more than a regular database, because databases are, I always say, very much black and white: you store something and you want to retrieve that back. Search is much more about those shades of gray: you are looking for a concept, and you don't know exactly what you want, but you want to have the concept or some context around it.
And that's where the search power is coming from.
And Elasticsearch has developed amazingly well in full text search over many years.
And then we also branched out to other areas because for us, logs, but also observability
is very much a search problem.
It's like, the storing stuff is the boring part. What you want to do is find the most recent or relevant errors, or why something is happening, with root cause analysis. These are, for us, basically a search problem. And then we have branched out to observability, and nowadays also to security, which for us is also a search problem: you want to find the bad stuff in your system before something really bad happens. So search is kind of at the core of all of that.
Yeah, very cool. So in terms of, I would say, broad use cases that you see, how would you break down how people are using it, maybe in rough percentages? And the categories I think of are: I have a search box somewhere on my website
where I'm searching through customer names or big text documents or something like that.
That one.
More like internal operational search, like I'm dumping all my logs or metrics in there
and trying to figure out what's going on.
Maybe that one.
And then I think also as an analytics engine. It's pretty good at doing aggregations; very good at distributed compute, basically. So maybe those three buckets. If there are other buckets you see a lot of, sure, but how much do you see in these different buckets?
I mean, A, we see a lot, and B, the beauty of this is, since everybody can just download it, we often don't know, or we almost surprise ourselves when we see it somewhere, and it's like, oh, cool use case, or, oh, we didn't even know that. But yeah, the starting point was, for many, the search platform, like the search box. So if you use Stack Overflow and you search on Stack Overflow, that search is going through Elasticsearch in the background, as one of many, many examples. The classic ELK stack for logs was extremely widely used, so that has reached wide, and it has also evolved a lot, to OpenTelemetry and being great with Kubernetes and all the things around that. Security is often a bit of a different beast, but that's also in
there. And then, yes, you have these things
where people use it more like a data store or an analytics engine. Kibana is a very powerful tool.
So, for example, for our own team, I pull together a lot of stats and we have our own OKRs and
metrics. We have Kibana dashboards for everything. So we do use it for that a lot as well, just to
aggregate what we do or how things are moving along in other parts of the company.
So there is a wide use case and range of things you can do.
It's sometimes just a question of, what is the biggest one? I don't know. If you want to come back to the question, you can also turn it around and say, well, is it in terms of the users that touch it? Or is it the installation size? Because many logging clusters are very large, whereas search is often a smaller use case. Search is the core one, and it's also, thanks to vector search and LLMs, a very hot topic. In terms of installation base, I guess logs and the classic ELK, or evolved ELK, sometimes I call it ELK++ nowadays, is very widely used. And security has also gained a lot of foothold, especially in critical infrastructure.
Yeah.
You say evolved ELK, or ELK++. Tell me more about that.
I always say your old self is often your worst enemy.
And that is true for ELK. People started with that and it was great, but that was potentially 10 years ago, or maybe eight years ago. In terms of products, it's 10 years or so. The problem is, what was a best practice or worked back then is probably not a best practice, or has changed a lot, nowadays. And we often see that people are still doing what they used to do, or what was a best practice eight years ago. And then we see them do that, and it might not be optimal anymore. Or there might be a new competitor coming in and saying, oh, this is all wrong, you should do this differently. And then I'm always like, yes, we wanted to, or we have been changing that for years, and we try to tell everybody, but it's very hard to make people actually go through that. And ELK++ is my personal thing; this is not an official brand or anything. It's more like, I think the brand ELK is very strong and worked very well.
And I have a prop here, or actually I have two. Yeah, perfect. We have these in different sizes. These are the elks. And the elk is very hard to kill; those are still well known and very popular. So ELK is kind of what we have been doing for a long time, but we have moved it along, because it used to be Elasticsearch, Logstash, and Kibana. But for ingestion, there are Beats, like Filebeat or Metricbeat, as a lightweight agent or shipper to collect the data. Nowadays, we also have something called Agent, or Fleet, because at some point many people were like, there are too many Beats, we don't want to install five different Beats on each system, we just want to have one single agent. And Fleet is basically centralized management for that, which you have in Kibana. You can just say, roll out this policy across 100 nodes or 1,000 nodes and collect this type of data.
So there has been a lot of development.
Also, one of the problems, for example, for logs was that Elasticsearch was actually quite rigid. If you changed the type of a column or a field, then it would just drop your entire log message on the floor and say, well, this field is wrong, I don't know what to do. We are still in the process of making that a lot more lenient. Even if one field is not right, we still keep everything else, and we just store that one in a raw format and don't index it the right way, but we don't drop the entire message. So there is a lot of evolution going on there, also in terms of how to manage the data, and how to manage time series better, and automate a lot of the pain points away. That's why the classic ELK is still there, but it has gone through a lot of iterations. And that's why I'm keen on pulling people along whenever I see them use something that was the best practice five years ago. And I'm like, you're totally right; nowadays, maybe I have these other things that you should be using that will make your life a lot easier.
Yeah.
Yeah.
Okay.
Stay on use cases really quickly.
Are there any use cases where you advocate, like, hey, don't really use Elasticsearch for that, if people are thinking about Elasticsearch?
For background, what is actually storing the data in Elasticsearch is Apache Lucene, the library, which is more than 20 years old at this point.
And the point of Lucene is really that it's an immutable data structure.
So if you have something that updates extremely frequently, that is not a great use case.
So if you have, for example, a counter for each page on your website, and every time a user comes to your website you would need to write, or replace, that and basically increment it by one, so you have, this page has been visited, and then you have in there 500, 501, 502. Every time you do that, we would need to take the entire document, and not just replace that one field, but replace the entire document. That is a very expensive thing. The immutable nature has other advantages, and you could potentially structure the data differently: you just log every event, and then you aggregate it at the end and say, this page had 510 users, rather than updating one document 500 times. You just have individual events. Maybe at some point you roll them up and say, we had X users for this domain or this URL per day. And then you drop the individual documents, because you don't need that granularity anymore. So there are totally ways to work around that. But there are other ways to just use it and do it yourself.
And Elasticsearch does not have multi-document transactions.
So if you have that requirement, it will also not help you out.
Okay, so: very fast-moving, especially counter-type data. And because of no multi-document transactions, probably not your straight OLTP store.
Yeah, I mean, it's not a relational database. There is no transaction semantic, where you start a transaction, you do multiple things, you close the transaction. People have built all kinds of things to compensate for failures or do that. Well, I'm not sure; if you really need multi-document transactions, maybe you need a different system.
Yeah.
Yeah.
Okay.
Sounds good.
So let's dive into terminology and concepts to make sure, because I feel like I just don't understand it as well as I want. So, starting with indexes and mappings. First of all, index. I guess, describe an index, because it's different than an index in a relational database, right?
Yeah. Index, even within Elasticsearch, is an overloaded term, because to index, the verb, is basically what we call storing a document: since search is doing more work up front, that's what we call the index operation. And the index, the noun, is something that holds related documents. Normally they have a similar structure, and those can then be broken up, because your next question will probably be about shards. So a shard is one piece.
Hold on. So an index, the noun, is basically like a table, or a collection in MongoDB. It's a grouping of data that has likely the same shape; not necessarily, but roughly.
We tried that comparison with tables early on.
I don't think it worked so well.
So I'm always flinching slightly when somebody says a table, and it's like, yeah, more or less. I mean, let's say more or less. What you normally do is you put related documents together, in terms of structure, because even though it's JSON, a field can only have one data type, because there is what we call a mapping; it would be a schema in the relational world. So you would say the field name is of the type text, for example. It cannot be a number afterwards, or it would need to be cast to a string type, basically. So there is a schema-like structure, so you shouldn't have totally different documents, because you might have conflicts in that mapping. You might also have problems if the data is extremely sparse; that got better over time, but it used to be not very efficient in disk space. So if you have totally varied data that has very little overlap, you're probably not getting a lot of benefit.
The other thing why I'm slightly mad about the table concept is if you have anything that is more like a time series, like anything with a timestamp, like log events or metrics or whatever, you would have normally multiple indices,
like maybe an index per whatever time span or amount of data.
So you would have an index pattern, basically,
and then you would have some qualifier, like which iteration this is.
So maybe like a partitioned table in a relational database,
in that time series pattern.
Yeah.
Yeah, that's what you're saying.
Yeah, okay.
And in terms of if you're doing that time series pattern,
is that because we want to keep indices relatively small?
Is it because it's easier to then drop an entire index when stuff ages out? What's the reason for, I guess, splitting those up?
Let's add that shard concept before diving into that,
because that is kind of like the relevant unit here.
So one index can have one or more so-called shards.
It's basically you split up an index into multiple pieces.
And then each piece can be on any node.
So if you have one index, let's say we have three primary shards, it has three parts.
Those could be on three different nodes.
This is kind of like how you do the distribution of the data.
On top of the so-called primary shard, there is the concept of a replica shard, which is like the secondary copy, for high availability and also to increase search throughput. Nowadays, by default, we only have
one primary shard per index.
It used to be five.
Many people didn't have enough data.
It's kind of like coming back to your question:
what is the size of an index or a shard?
So the rough idea for a shard is that it should be between 10 and 50 gigs of data. If you have a lot of shards with just kilobytes, you're not doing yourself a service, because that is just extra overhead, and you're not gaining a lot from it. If you make them huge, and a node goes down, it might take quite a while to re-replicate a huge shard somewhere else. And there are also some limits, very hard to reach, on how large a shard can be, in terms of how many documents it contains and things like that. But 50 gigs, or maybe a little larger, is what normally works. For some search use cases, smaller might have some benefits as well. But it should be in the, I would always say, double-digit gigabyte range. If you're much smaller, there's probably too much overhead. If you're much larger, you might have some hotspots, or the distribution might just not be as even as you want anymore. So you should keep that in mind.
I was going to say, can I reshard an existing index, or do I have to...
So, that's what I wanted to say. The resharding concept that we have nowadays: you can split or shrink shards, though it can only be by a factor. That's why the initial number of five primary shards in Elasticsearch was a problem: prime numbers are not great for that, because the only place you could shrink down to from five was one. Nowadays we only have one by default, so you can then split into almost whatever number you want. But if you think, to parallelize stuff today, you want to have, let's say, four primary shards, and then you want to reduce it to two primary shards after a week or so, that might make a lot more sense. Whereas five only leaves you one place to go. So _split and _shrink are the endpoints that you can use.
So when you're resharding, do you have to essentially reshard the whole cluster? It's basically like you have to take every partition and cut it in half to double it, or you can take two partitions and merge them?
You take one index, and you basically change the number of primary shards for that index.
But I can't go from, like, 8 to 3. I could go 8 to 4, or 8 to 2.
No, it needs to be a factor. And if you, for example, shrink, all the data needs to be on the same node. And the nice thing about that is, the operation itself is then very fast, because it will just symlink the data together. So the operation itself is pretty fast.
Okay, wait, sorry. If you do what? If you split it, it has to be on the same node?
If you combine it.
Oh, combine it. Okay. Okay.
Because then it's on the same file system, and then you can just symlink them together. It doesn't need to move a ton of data around, and it doesn't block anything; the operation, or the call itself, is relatively fast then. So there are some trade-offs around what is possible with sharding today. Maybe we'll actually change it in the future; we'll see.
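The _split and _shrink endpoints look roughly like this. A sketch: index and node names are invented, and both operations require making the source index read-only first:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Splitting multiplies the primary shard count by an integer factor.
requests.put(f"{ES}/my-index/_settings",
             json={"index.blocks.write": True})        # source must be read-only
requests.post(f"{ES}/my-index/_split/my-index-4", json={
    "settings": {"index.number_of_shards": 4}          # e.g. 1 -> 4, or 2 -> 4
})

# Shrinking goes the other way (the target must be a factor of the source),
# and first needs a copy of every shard on a single node.
requests.put(f"{ES}/my-index/_settings", json={
    "index.routing.allocation.require._name": "node-1",  # made-up node name
    "index.blocks.write": True,
})
requests.post(f"{ES}/my-index/_shrink/my-index-1", json={
    "settings": {"index.number_of_shards": 1}
})
```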
But for most cases, that primary shard number is actually less of an issue nowadays. Because either you have a relatively static data set, and then you don't need to change the number of primary shards very frequently. Or you have something that is time series data. Then we have what we call Index Lifecycle Management, ILM, which basically looks at it; you configure a rule and say, every 50 gigs, roll over to a new index. So you will create very evenly sized shards, and you don't need to ever split or shrink them, because once it reaches the threshold that you have configured, it will roll over to the next one, and then the next one. So you don't need to touch those anymore. So there's a lot more plumbing nowadays. That's why the size of shards, I mean, people can still get those wrong, and then it hurts. But from an operational perspective, the tooling is so much better that it should not be as much of a burden anymore.
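An ILM policy along those lines might look like this. A sketch: the policy name and retention numbers are made up, and the max_primary_shard_size condition assumes a reasonably recent 7.x/8.x cluster:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Roll over to a fresh index whenever a shard reaches ~50 GB (or the index
# turns a week old), then delete the data after 30 days.
requests.put(f"{ES}/_ilm/policy/logs-policy", json={
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_primary_shard_size": "50gb",
                        "max_age": "7d",
                    }
                }
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
})
```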
Gotcha, gotcha.
And if I recall, when you do a query, you can do a wildcard on the indexes you want to hit. Is that right? So if you have time series, you don't have to know all your different indices when they've been rolled over. You can just say, hey, hit metrics-star, and it's going to hit all the different indices.
Exactly. You can provide a wildcard. You can also provide a list, a comma-separated list. So you can say, search through these five indices, but just these five. Or you just say this pattern. And then, of course, you can add a filter on top and only say, whatever the time frame is, the last month, just give me the data in that. So there is a lot of flexibility built in there. So ideally, once you set it up the right way, you don't have to think so much about the indices and shards anymore; we have hopefully abstracted that pretty well away. And that is one of the big differences to the early ELK stack, where you had to do a lot more manual work around that, or you needed extra tooling that might run outside the cluster. Whereas ILM is part of the cluster, so it's just configured and running within the Elasticsearch cluster to take care of that.
Yeah. Okay. Okay, very cool. Let's talk a little bit about...
So, we talked about sharding. And I do a lot with DynamoDB, so that's what I'm thinking about here: routing. I think of routing, right? When I make a request to Dynamo, it's going to get routed to a specific partition to handle that request.
Yeah.
With Elastic, how often are people choosing a routing key, versus just saying, basically, distribute this document wherever? I guess, talk about routing within Elasticsearch. When a document comes in, how does it get assigned to a shard?
This is almost like a trick question. So, there are multiple answers to this. By default, what happens is, a document comes in, and we have a field that's called _id; either you provide that yourself, or the cluster will generate one for you. And that will basically be hashed, so it's evenly distributed, and then calculated modulo the number of primary shards. And then you get out, say, shard two, and the cluster knows that primary shard two for this index lives here, and then it will route the data there. So that is what happens by default. The fastest way is normally that Elasticsearch generates the IDs for you automatically, because the problem is, if you provide your own ID, then it might need to check: does this document exist, do I create a new one, or do I overwrite an existing one? So it's often faster when we just generate it, and then we don't need to make any lookups to see if it exists.
And you can change this routing information: you can provide an explicit routing key; _routing is the field that you can use. Why I'm slightly cautious is because I feel like people are often hurting themselves more than they are helping themselves. The idea is that you co-locate related data, and then your query can be answered from one shard. For example, I know there is a very big Austrian online banking system, and I think they have a mainframe in the background, but that's expensive and slow to query. So they have had all of their data, for 10 years or more, in Elasticsearch, and I think the customer ID, or whatever, is the routing key. So my data is always on the same shard, and when they run a query, they only need to query a single shard, and not the 50 or whatever other shards for all the other customers. There it makes a lot of sense, and it is reasonably evenly distributed. I've also seen that go totally wrong. I know an Austrian logistics company that used the country for the routing key. It happened that, I think, 80% of their customers were Austrian, or 80% of the shipments were Austrian. So 80% of the data went to the same shard. And then they added more nodes, because they wanted to scale out, but that didn't help, because everything was going to the same shard. So there they hurt themselves. So co-location can be a powerful feature, but if you do it wrong, it can be very painful. So that was one, what happens automatically. Two, what you can do yourself, roughly like the sketch below.
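A minimal sketch of that explicit routing option, assuming a hypothetical local cluster; the index name and customer values are made up:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Index a document with an explicit routing key, so one customer's
# documents all land on the same shard (the routing value is hashed
# instead of the _id).
requests.put(f"{ES}/accounts/_doc/1", params={"routing": "customer-42"},
             json={"customer_id": "customer-42", "balance": 100})

# A search that passes the same routing value only touches that shard,
# instead of fanning out to all of them.
resp = requests.post(f"{ES}/accounts/_search",
                     params={"routing": "customer-42"},
                     json={"query": {"term": {"customer_id": "customer-42"}}})
```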
And then there's a third way now.
So we have something that is called time series data stream, which is more like a time series database.
And what time series databases normally exploit very well is the locality of data, so that it compresses better. And in this TSDS system that we have, you basically have a couple of fields that you configure as routing keys, so similar data will always be co-located. So there it's kind of built in, but in a slightly guided way; it's like a tool that is under the covers, hidden there, and it's active and working for you. But it's a bit more guided, so you don't hurt yourself so much. Data locality is a big feature, as you know from other systems. It just depends on how you use it, and how you really make use of it, because you want to avoid the hot spots. But if you have some locality, you can potentially make good use of that. And that's kind of the story. But you had a question?
No, no. So I guess, yeah, the first question I was going to ask you is: let's put aside the TSDS one and just look at the first two, either auto or specifying a routing key. How often do you guess people specify the routing key? Or would you even recommend people specify the routing key? Or is it heavily, heavily weighted toward cases where you say, hey, let it assign that automatically to a shard?
I want to say that 90% or 95% use the automated hashing of the ID, and then it just works. Also, again, I think it depends a bit on the use case, because for time series data, you might want to have similar data grouped together. For a full text search use case, there is often maybe not even a lot of locality that you can exploit, and then you just want to evenly distribute it. So there are these trade-offs. And also, in terms of search accuracy, multiple shards can add minor variations that normally should even out, but it's another way you can screw up your data weirdly. So unless you know what you're doing and you have a very good reason, I would not reach for custom routing.
What if, okay, so what if I'm doing a SaaS and everything is within an organization or a tenant, so all my queries are within an org or a tenant, all of them? In that case, would it make sense to use the routing key and get that locality? Or should I just take advantage of the system and say, no, let's fan that out and throw compute at it from these different nodes, rather than putting it on one node?
You know what our favorite saying is at Elastic?
And it's the same as the favorite saying of every consultant.
But what's that?
It depends.
Yeah, yeah, sure.
Yep.
Okay.
So I think for any reasonably complex question, the first answer would always be, it depends.
And then you need to give a better answer.
So let me try to give a better answer as a follow-up.
I think the main danger is that you might have smaller or larger customers and you might create hotspots.
So I'm slightly cautious.
I mean, if you know what you're doing really well, then maybe. But there are a lot of other optimizations in the system, so maybe you don't need the locality to make best use of something. Sometimes, for example, if you have timeframes, it can figure out, this is the timeframe covered by one shard, and then you don't actually need to look at the others. So I feel like it's slightly tricky to answer this in a way that will work for everything. I'm just slightly cautious; if you know really well what you're doing and you want to squeeze out some optimization, maybe custom routing helps. I'm not sure this is the first thing to reach for, though.
Yeah, okay.
And I think that's where I get tripped up, because, again, I do a lot with Dynamo, which is so heavy on the partition key and making sure you use that, so you're going to one specific partition. Or even thinking about PlanetScale, you know, if you're talking about sharded SQL, or if you're talking about Mongo, the shard key being important in your query: you want it to be something that's used in all your queries, so you're going to one. Whereas Elastic is sort of on the other end of that spectrum, just saying, hey, the documents are going to be indexed well on each shard within the nodes there, and it's fine to do a scatter-gather and just sort of lean on that, rather than trying to shard it well and now dealing with hotspotting and different-sized shards, different things like that.
If you push it, and you have a very large use case, or a very specific scenario, or you want to have specific data compression or whatever else, maybe. But I think probably 90% of the users are served better by not thinking about that and letting the defaults work. And if you have that problem later on, I would return to it. But it would not be one of the first things to reach for.
Yeah, okay.
What about, is the query engine, I don't want to say smart enough, but let's say I have an index that has a routing key. Let's say it's tenant ID, and I provide a tenant ID in my query. Is it going to know, okay, I only need to hit this one shard now? Or is it going to hit those other shards, but they're immediately going to come back empty for that tenant and return quickly? Do you know what I'm saying?
Yeah.
Or, same thing with time series: is it smart enough to know, oh, that time range is only on this shard, or something like that?
Yes. So for the time ranges, I think it's simpler, because there you have some metadata on the shards, and then you can early-terminate pretty quickly. For a routing key, I don't know the implementation details, whether it will just figure out very quickly that you don't have any hits and then return, or not. It will probably also depend on the type of query and what you do there. Also, in terms of queries, it's getting extra complicated, because just a few months ago we added a new query language and query engine. There are multiple options now. We had the historic query language; the query DSL in Elasticsearch is JSON. Maybe you start appreciating SQL more once you start typing JSON as a human. It's great, especially for systems, but I think it can be complicated for humans to type. So now we have relatively recently added a piped query language. If you know CloudWatch or Splunk or Microsoft Kusto, those are all piped query languages, and we have a similar piped query language. And it's not just a language; it's also a query engine under the hood, with various optimizations, that tries to push more and more concepts down into Lucene. But that's also more block-oriented rather than row-oriented. So there are quite a few subtle differences at this point. That's why, when we say query languages, it's already getting tricky. But again, once you start pushing the system, then you can look into those. I think for the average user, those are not the first things I would look for or reach for.
Yep. Okay. Okay. All right, last thing in terms of terminology and things like this: replicas, which you talked about a little bit. So basically, I have an index.
I'm going to shard it into multiple shards. Each one will have a primary and zero to N replicas. I probably want at least one. How many do you recommend for high availability purposes? Do you recommend having two replicas, so you have three copies? Or what do you recommend there?
The replica terminology, by the way, is, I think, very confusing
because depending on the system, the replica includes the primary or doesn't include it.
So for us, we have the primary and then we have zero to N replicas.
We generally recommend one for high availability.
But maybe let me take one step back. So, how the write actually works is that you send your request, the JSON or whatever you want to write, to the Elasticsearch cluster. And then we'll pick a random node, normally. That is the so-called coordinating node. That coordinating node will figure out what the primary shard for that document is, based on the ID or the routing.
And hold on just one sec. So the coordinating node can be a storage node, potentially, but it just happens to be the coordinator for this request. It's not like a request router outside of the storage?
Yeah. Let's assume we have the simplest case where, for high availability, we need three nodes, so we can have a quorum. So we have three nodes that are data nodes and also so-called master nodes, to keep the cluster state running. They're all data nodes, so your requests will round-robin between them. And then one of them gets the write request. And that one, the coordinating node in that case, figures out what the primary shard is, and then it will send, or forward, the data to the primary shard. The primary shard will then apply that and forward it to the replica shard. You get the acknowledgement back from the replica, then the primary acknowledges to the coordinating node, and then you acknowledge to the client. That is how a write works.
And sorry, on that write path, it synchronously waits until all replicas have acked it before it returns?
It will only acknowledge back to the client once it has been written.
To all replicas?
Yeah, it gets tricky if you have more replicas and there is the chance of timeouts. But it will try to write to the majority of the replicas. But we don't want to make it too complicated now. The happy path is that it gets forwarded to the primary, the primary writes to the replica, the replica acknowledges back to the primary, the primary acknowledges back to the coordinating node, and that acknowledges back to the outside system. And then there are, of course, timeouts, and what happens if you have five replicas: do you need to write to all of them? But we don't want to make it too complicated, because those are getting into outliers then.
Yeah, yeah.
We don't mind complicated here, though.
I like doing deep dives, so I want to know.
But if it's just such an edge case, then you don't need to go into it.
I mean, so maybe let me do the search first to explain why you might want to have more replicas.
Because if you have a search or aggregation request, that comes into a coordinating node again, and that one figures out all the shards that you need to query. And then it will pick either the primary or a replica shard for reading or searching the data. So it can go to either one. It doesn't always go to the primary.
And hold on just one sec. You said all the shards it needs to query, but in most cases, that's going to be every shard in the index, right?
Yes, sorry. But my thought was, you have an index pattern, so it might be multiple indices, and then multiple indices might be many shards. So that's why I'm... Okay, yeah. It could be a single shard, if you query one index with one primary shard, or it could be N, so it can be many. And then you query those, and it can be either a primary or a replica shard that you query. And there is actually an algorithm built in nowadays that will keep track of which nodes are more or less busy, and if it has multiple options, it will go to the least busy one. That helps you route around busy or hotspotting nodes, for example nodes that are busy because they're doing a garbage collection or whatever else.
And are those statistics based on that coordinating node and the requests it's made recently? Or are the nodes sort of gossiping around, saying, oh, this is a slower one, whatever?
So the thing is generally called adaptive replica selection. I thought the statistics come from the previous queries that you have been running.
From that specific node's previous queries, or across the cluster?
Don't nail me down on that one, but I thought it was that the coordinating node is basically piggybacking on previous queries. I don't think we have extra things that we send around, though I have not looked at the implementation details for that in a while. I thought it was part of the responses that you get.
And then the coordinating node will collect all the sub-results. And by the way, if you do a full text search, basically what you tell each shard is, let's say you want the top 10 documents, it will tell each single shard: give me your top 10 documents, but only the ID and the so-called score, like how good the result is. It will then aggregate, from all the shards, the top 10 documents overall. It will actually, in a second step, fetch those actual documents, and then it will send that back. We call it query, then fetch. So first we query for each shard's best results, and then we actually fetch the documents that are the global top results. So we don't need to move a lot of data around, because if, let's say, you have a one-kilobyte-large document and you query 100 shards, there's a lot of data that you would need to move around otherwise. And the coordinating node would also need to deserialize and serialize that again. So we skip all of that; we only actually fetch the documents that we need. And that is the case where having more primary shards helps you spread out the write load, and having more replica shards helps you spread out the read load, because you have more copies from which you can read.
Yep. Okay. But let's say you don't have any replicas
at all. Adding, like doubling, the number of shards that you have probably is not going to help you on the read side, because they're all taking the same load, since they're doing scatter-gather every time anyway, roughly?
Again, it depends, it depends, it depends. But generally, yeah, I want to say it depends. We could probably set up a very weird cluster where you have five nodes but only two primary shards, and then only two nodes are doing all the work. And if you add more shards, and they distribute evenly, then it would be better. But yeah, generally, normally you have quite a few more primary shards, or shards in total, than nodes. So, yeah.
Yeah. Okay. Okay.
This is awesome.
Now let's go into, where do people hit bottlenecks? And I want to think of it both in terms of what resource: is it CPU- or memory-constrained, is it disk-constrained? And then also, is it different on the write path versus the read path? Are people hitting more on the read path, or more on the write path? So where do you see people hitting bottlenecks with their Elasticsearch clusters? Again, there's a lot; maybe just walk me through. Or maybe, is it 50-50 between reads and writes being the problem, or does one of them tend to be more of an issue?
I mean, there are potentially so many things that people can do. For example, one thing that you could do is, we have something called an ingest pipeline, to basically change documents on the way in. That's normally more CPU-heavy.
And is that on a separate node, or is that happening on the coordinator, each data node when it's acting as coordinator?
It can happen on the data node.
You could also have dedicated coordinating nodes that don't do anything other than being this kind of smart router.
Or you could even have a dedicated ingestion node.
So you could break those out in larger clusters; if you have, like, 30-plus nodes or so, it might make sense. Also because you might have different hardware profiles: that ingestion node might need more CPU, but might need less of other things, because it's just processing the data. Whereas the data node probably needs fast disks, and then as much memory as possible to keep stuff in memory in terms of caching. If you run large aggregations, you will also need memory, and maybe you need some CPU to crunch some data. You will need sufficient network. So I think you can pretty much run out of any resource that you want, depending on how you hit it and what your read/write ratio is. Also, for example, vector search: in general, the data structures, or HNSW, the small worlds, which is the data structure that we use for vectors right now, those should be in memory. There, normally the biggest constraint is the memory size that you have available, even though we have quantization now to reduce the memory and things like that. But memory, for example, is a big constraint there. If you have more like a logging use case, where you store terabytes of data, probably the disk is the bigger thing, because there is no point, or no way, to keep all of that in memory; you'll just need to fetch the right data pretty quickly from disk.
There is also, especially for the time series case, what we normally have as data tiers, where we say: the data for today, this is where all the writes happen, and probably most of the reads. And from three days ago, it's what we call a warm node rather than a hot node: it gets no writes anymore, and also fewer reads. And then there might be cold or frozen, which is mostly for when, once a week, somebody runs a query that spans more than a month or so, or a full month or whatever. And there you might just want to be more patient: maybe you add spinning disks somewhere, or you move it to a blob store, just to reduce the cost. So there are a lot of different things that you can do there. And then, for example, if you have a lot of data on a blob store and you need to pull it in, maybe the network becomes a bigger bottleneck. So I have a very hard time saying there is this one thing that you will run out of. Normally, I think memory and disk are the first things. But then you will need to keep an eye on what you're doing and the problems that arise from that.
Are there certain things during the ingest pipeline that drive a lot of CPU? Like maybe if they're doing the built-in semantic search and that's generating vectors, is that pretty heavy?
So we have another node type, a machine learning node, that's doing the inference and stuff like that. So that would normally be separate, or you would often do it separately, also because, again, more CPU is probably the hardware profile that you want on that one.
Coming back to what the bottleneck for ingestion is: one thing that we do a lot in Elastic is Grok, the named regular expressions. And regular expressions, especially if you write bad regular expressions, can be very CPU-heavy or inefficient. So I always point to Grok first. I'm not 100% sure it is always accurate, but I like pointing to Grok, because you can do horrible things with Grok. And I always say that the plural of regex is regret: A, nobody can read it, and B, it's also often very obscure to debug. But it's a great tool to parse apart what you have. So this is another thing: in a classic ELK stack, you would always use Grok to parse, and I call this kind of the Stockholm syndrome, that people got so used to doing that that they think this is the right way. Ideally, you can have structured data already when you write it, in your logs, for example. So if you write JSON, you don't need to do all the parsing anymore. It's easier for you, for not having to write Grok patterns, and it's easier for your CPU, because it doesn't need to do all of that parsing anymore. That's one of the classic ways where, I say, the evolution means it doesn't have to be so heavy or painful if you move it along in the right way. Just think about where you're doing that sort of stuff.
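For reference, an ingest pipeline with a Grok processor looks roughly like this. A sketch: the pipeline name and the log format are invented; TIMESTAMP_ISO8601, LOGLEVEL, and GREEDYDATA are built-in named patterns:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# A pipeline that groks a raw log line apart on the way in.
requests.put(f"{ES}/_ingest/pipeline/parse-logs", json={
    "processors": [{
        "grok": {
            "field": "message",
            "patterns": ["%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}"],
        }
    }]
})

# Writes opt into the pipeline per request (or via an index default):
requests.post(f"{ES}/logs/_doc", params={"pipeline": "parse-logs"},
              json={"message": "2024-03-14T12:00:00Z ERROR something broke"})
```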
Oh man. I remember working at a place where we had Elasticsearch, and we were using... So Grok is... is that the scripting language? Is that what you're saying?
Yeah. So Grok is these named regular expressions. It's basically a regular expression, but it has named patterns, for email or timestamp or log level, for example. You can find them on GitHub or in the documentation; these are basically the patterns that everybody runs into. So you can write your own regular expression, but for 90% of the things you can just use patterns that already exist.
Yeah. Or what about Painless? What's Painless? Is that also scripting?
Yeah. So Painless is, first, a very unfortunate name. The backstory is that most data stores want to have some kind of scripting language. And in Elasticsearch, when it started, Groovy, as a dynamic Java or JVM-based language, was the tool of choice. The problem is that Groovy was a general-purpose programming language, and people found ways to escape from that sandbox and take over clusters, and we had a couple of bad security issues. That was, like, 2010 to '13, '14, whatever. So, old times. And then we made a strategic decision that we want to have a more specific scripting language, just for Elasticsearch, that runs on the JVM. It has some performance improvements, because it doesn't need to re-evaluate everything again. And it is built so that there is no way to escape the sandbox, or some concepts just don't exist. If you have written Java in the past, it kind of makes sense, or it will feel more familiar. If you have not written Java, it will feel potentially very foreign. The unfortunate thing is that we named it Painless. It's not because we thought that the language is so painless; it's because the guy who created it has chronic back pain, and his dream is to be pain-free, or painless. And so he called it Painless. And I get, once a month or so, a complaint that Painless is everything but painless. But it's a scripting language, a purpose-built scripting language for Elasticsearch, that you can use. But it is something that you need to learn. And while you might need it, unless you have to, I wouldn't. Also because it can get a bit tricky to debug; it's not a language that you can easily run outside, in your IDE. So just in terms of tooling and everything, it can be a bit tricky. We have made various improvements over time, and I think it's better now. But if you write 500 lines of Painless, you will probably, again, regret your decisions. And maybe that's also not what you should be doing in a data store.
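For a sense of what Painless looks like, here is the common scripted-update shape (a sketch with made-up index and field names; note this is exactly the frequently-updated-counter pattern from earlier, which Lucene's immutability makes expensive):

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# A scripted update: increment a counter field in place. Under the hood
# this still rewrites the whole document into a new immutable segment.
requests.post(f"{ES}/pages/_update/home", json={
    "script": {
        "lang": "painless",
        "source": "ctx._source.visits += params.count",
        "params": {"count": 1},
    }
})
```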
I remember, the last time I used Elasticsearch, someone had written, I believe it was Painless, some sort of scripting thing within a query, on a match. Which just means, now your index, it was a keyword index, but they were using, I think, Painless to match on it, which just blows your index, right? Because now you have to scan the whole index and run that function over it, which is true in a relational database or anything else.
I see that every now and then. It is not a lot of fun. Maybe it's great for your hardware vendor or cloud provider, but otherwise it is not recommended.
Yeah. Yeah. Okay. Okay, you mentioned Lucene, and I know Lucene is just amazingly powerful. Is Lucene doing just the actual text search indexing, or is it used for keyword indexing and other indexing types as well, or is it just the text part?
Everything. Lucene is the thing that stores any data on disk.
Sometimes I get asked, what is the database behind Elasticsearch? Is it MySQL? No, there's no MySQL in there. It's just Lucene. Lucene is the thing that analyzes your text, it creates the structures, it writes them to disk, it runs the actual searches, basically. Elasticsearch around it is basically the distribution part. It provides the query DSL, or the piped query language, ES|QL, and it does a lot of the work on top. But it all builds on Lucene, and all the features in Elasticsearch are basically built in Lucene. That's kind of the beauty. That's also why it sometimes takes a bit longer to implement something, because it needs to first land in Lucene, and only then can it be used in Elasticsearch. That's why the vector search, or kNN, the approximation search that uses the small worlds, took a while until it landed in Lucene. And then, normally, when there's a new major Lucene version, there's a new major Elasticsearch version. And once that came out, that was 8, but that was only, like, two years ago at this point. So that sometimes can take a while. On the other hand, Lucene is an established powerhouse, and the full-text search engine, or library, out there that we heavily build on to do all of that.
Is Elastic, and the Elastic team, are they, I assume, one of the main contributors to Lucene at this point? Do they dominate that?
I think just two hours ago it was announced that somebody on the team got onto the PMC. We have a lot of contributors, and yeah, Lucene is very strategic for us, and we drive a lot of the initiatives. But there are also many others involved in Lucene. That's kind of the power of Lucene, that it's not just one company; it's very broad, and then it's the foundation for many tools. But we do invest a lot. I don't have any statistics off the top of my head. There's also, we can, again, spin this any possible way: the number of pull requests, the lines of code, the complexity of something. I don't want to give any percentages here, or say, like, we're this big or not that big. But you will run into a lot of the Elastic colleagues if you do anything around Lucene.
Yep, yep. Okay.
Tell me a little bit about vectors,
and you talked about HNSW,
and I guess, I think of search and an inverted index as a much more compute-intensive version of regular database indexing, like normal B-tree lookup type stuff. An inverted index is more storage, more compute, to make that match. And then I think HNSW is another level above that. Am I thinking about that correctly?
I mean, so far, for classic keyword-based search, the big trick, I would say, is that you do a lot of work up front. So, for example, my go-to example is always from Star Wars: these are not the droids you're looking for. What would end up in Elasticsearch, or a search engine?
So when you run that through Elasticsearch, you will get: droid, you, look. And what you can see is, first we removed the stop words, we tokenized, we broke it out into the individual words, and we did stemming. So rather than looking, it's only look, the word root, because you normally don't care if it's singular or plural, or what inflection of a verb it is; you care about the concept. So you do that up front, and then you store all of those terms, alphabetically sorted. And then you basically point to all the documents where they appear. And then you know it's at that position in the document, so you could do highlighting. And you have the position, so you could do a phrase search. All of that is extracted basically at ingestion, or index, time. So your query is then pretty fast, because when your search query comes in, you also analyze the search query, normally with the same analyzer, and then you can just look for direct matches. You look for droid, you go down alphabetically to where droid is, then you look at which documents contain droid, and these are the documents that are being returned. And that's why it's so fast. But it does more work up front, and it creates those index structures to enable that. So that is extra work on top of that.
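You can watch this analysis happen with the _analyze API. A sketch, assuming a hypothetical local cluster; the exact tokens depend on the analyzer chosen, but per the discussion the line comes out roughly as droid / you / look:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Run the Star Wars line through an analyzer to see the stored terms:
# stop words removed, everything lowercased, words stemmed to their roots.
resp = requests.post(f"{ES}/_analyze", json={
    "analyzer": "english",
    "text": "These are not the droids you're looking for",
})
print([t["token"] for t in resp.json()["tokens"]])
```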
And then what you do with vector search is, you basically have these arrays of floating point values that have a hundred or a thousand dimensions, and then afterwards you try to find something that has a similar vector representation. The trick, what the small worlds, or HNSW, gives you, is that you can do an approximation, or an approximate search, the approximate k-nearest neighbors; kNN is generally the keyword for that. Because if you have a million documents and you need to compare a million vectors, like, how similar are they, that is computationally very expensive and probably not going to work out. And HNSW provides you with a smarter data structure that is layered and basically puts you closer to where you want to be with each layer. And then you can find, in a relatively, or actually very, efficient way, the nearest neighbors in a vector space.
Yeah, okay. And so if I'm using an HNSW index, I'm naturally doing an approximation; like, I can't do an exact k-nearest neighbor search with an HNSW index. Is that right?
Yeah, that is an approximation. And by the way, you can do it without HNSW. So in previous versions, there was already a data structure that we call dense vector, which is this array of floats, but then you couldn't do an approximation, and that was expensive. So you could do vectors before, but the approximation, the approximate k-nearest neighbors, has only been there since 8, which has been two years now. And that is what you want if you have a large amount of data. If you have something like 10,000 documents or so, HNSW is not going to help you; you can just brute-force that. That will be faster, and you can skip creating the data structure. But for large collections of documents, you want to have this approximation, because otherwise you just spend a lot of time on doing vector comparisons.
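An approximate kNN search in the 8.x style looks roughly like this. A sketch: the field name, dimensions, and query vector are made up, and real embeddings would have hundreds of dimensions:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# A dense_vector field with "index": true gets the HNSW structure built.
requests.put(f"{ES}/docs", json={
    "mappings": {"properties": {
        "embedding": {"type": "dense_vector", "dims": 3,
                      "index": True, "similarity": "cosine"}
    }}
})

# Approximate k-nearest neighbors: examine num_candidates entries per shard
# instead of comparing against every vector in the index.
resp = requests.post(f"{ES}/docs/_search", json={
    "knn": {
        "field": "embedding",
        "query_vector": [0.1, 0.2, 0.3],
        "k": 10,
        "num_candidates": 100,
    }
})
```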
Does HNSW update well? Like, if I have an update-heavy workload where my vectors are actually changing?
Now we're coming back to one tricky aspect of Lucene: that it's immutable. By the way, when I say immutable, how that works is, and you might have stumbled over the refresh rate at some point: we basically batch all writes together. By default it's every second, but you can change it if you want to have higher write throughput. So we batch all the operations in the last second together and create the so-called Lucene segment. That's at first only in memory, but it will eventually be written to disk. This is the data structure that is searchable. That's also why it takes up to a second to find documents: if you retrieve a document by ID, you can get it right away, but any kind of search, any multi-document operation, needs these segments, so it might take up to a second until that is created. You can block the write operation until the refresh happens, or you could force a refresh; don't do that too frequently, because it will create many tiny segments, and they need to be merged away later on, which is expensive. So there are ways around that.
But you have all of these segments. And eventually they get merged because each segment basically
contains a data structure on its own. And you don't want to search through thousands of them.
So you want to merge them into larger ones over time.
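A small sketch of that refresh behavior with the Python client (index name is made up; the one-second default is from the discussion above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

es.index(index="docs", id="1", document={"quote": "droids"})
es.get(index="docs", id="1")  # get-by-ID works immediately
es.search(index="docs", query={"match": {"quote": "droids"}})  # may miss it for up to ~1s

# Block this write until a refresh has made it searchable:
es.index(index="docs", id="2", document={"quote": "look"}, refresh="wait_for")

# Or widen the batching window, trading freshness for write throughput:
es.indices.put_settings(index="docs",
                        settings={"index": {"refresh_interval": "30s"}})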
One of the downsides, though we have made some improvements there, is that you cannot easily merge HNSW structures. You basically need to recreate the HNSW. There are optimizations for how we have sped that up, and we have also made concurrent searches across segments faster now, but it is a bit of a pain point that we're actively working on. One of the tricks we're doing now is we look at the different segments and figure out which is the largest one that doesn't have a lot of deletes, or doesn't have deletes in it at all, because then we can basically use that as a foundation, plop all the others on top of it, and we don't need to restart everything from scratch. So that's kind of the downside of HNSW, that it's not easy to merge existing structures together, but there are ways to make that better. So anybody who is saying, oh, this will always be slow in Lucene and this is a horrible approach: there are improvements coming. Just wait.
Yeah, we're still early on that stuff. So these segments that get written, are they like a combination of indexes, like the inverted indexes, but then also the full documents?
They have multiple data structures in them, depending on what you have. Lucene in general, even before vectors, had multiple data structures: there is the inverted index, but there is also the un-inverted index, because that's what you need for sorting, for example, or for aggregations. So there are different data structures and types of how the data is stored, and text is stored differently than a number or a geo-point. There are different data structures to help you out in terms of query efficiency, but also storage efficiency, and to allow you to do specific things and features with them.
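A sketch of how different field types imply different structures; the mapping and index name are hypothetical, and the comments paraphrase the discussion above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Each field type is backed by its own on-disk structures.
es.indices.create(index="products", mappings={"properties": {
    "description": {"type": "text"},       # inverted index for full-text search
    "category":    {"type": "keyword"},    # "un-inverted" doc values for sorting/aggregations
    "price":       {"type": "float"},      # numeric structure for ranges and aggregations
    "location":    {"type": "geo_point"},  # geo structure for distance queries
}})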
Yeah, gotcha. Okay. Last question I have before I move into
just sort of the random questions thing.
When Elastic is doing a query... When I think of a relational database query, maybe I have multiple filters in it, and based on statistics the database can say, okay, this one is going to cull it down the most, we'll use that one and go from there. Does Elastic have to perform all the searches, like all the inverted index lookups? Let's say I have a term match on, again, maybe a tenant ID. I only want to search for users within this given tenant, so I filter on this tenant, but I have thousands of tenants. Is it going to be easy to filter out all those other tenants, or is it just going to have to do the search and then intersect it with the tenants that match?
Luckily, Lucene has been around for so long that there are a lot of optimizations in it by now. Depending on what kind of filter you have, the filter can also be cached; that is one of the tricks that will make it very fast, but it can also generally limit the search space.
Also, there are some smart algorithms. One of them is called Block-Max WAND, which basically allows for early termination. For example, one trick there: if you search for three terms, and it figures out that one is very rare and normally gives you a very high score, and the other two are less relevant or no longer competitive, it can terminate some steps early. It will just do the most relevant work to get you the right results, and it knows when it can skip specific things.
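For the tenant example, a sketch of how that filter is typically expressed (index and field names are made up); clauses in filter context don't contribute to scoring and can be cached, so they cheaply narrow the search:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

es.search(index="users", query={
    "bool": {
        # Filter context: cacheable, non-scoring; narrows to one tenant
        # before the full-text part of the query does its work.
        "filter": [{"term": {"tenant_id": "tenant-123"}}],
        "must":   [{"match": {"name": "droid"}}],
    },
})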
Yeah, okay.
Interesting.
I guess one last question on Elasticsearch.
You mentioned 8 is the latest version, it has HNSW, all that. What does upgrading an Elasticsearch cluster look like? Is that a pretty straightforward operation?
Is it difficult?
How do those usually go?
It depends.
So there are multiple more or less short answers to that. Elasticsearch has for a long time operated on the principle that a major version means breaking changes, and there have always been some more or less, often unfortunately more, painful changes. I think as we have matured over time, things have improved a lot; there are way fewer breaking changes and it's not as complicated anymore. But there can always be breaking changes.
The one thing that is a bit of a complication is that Lucene in general can write its current version and read one version back, but it cannot read further back than that. So if you have data written a long time ago and you have upgraded more than one major version, we have an upgrade assistant built in, and it will shout and tell you: you need to delete the data, or re-index or rewrite it in the current version, to keep it accessible. And by the way, these are the five settings that you need to change. So there is an upgrade assistant.
Rolling upgrades, I think they came in version 6 or so; before that, upgrades meant a full cluster restart. That was 2016 or '17, whenever that feature came; it's so long ago I cannot remember the version anymore. But rolling upgrades have been a thing for a long time.
Most of the stuff you can just fix, either through a re-index or by changing some settings. If you, for example, had a cluster that was not using security and TLS, that probably will require a full restart: you put all the nodes on TLS and then restart the cluster.
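Coming back to the rolling-upgrade path mentioned above, a simplified sketch of the per-node dance (condensed from the documented procedure; exact steps vary by version):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Repeat for each node in the cluster, one at a time.
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.enable": "primaries"})  # pause shard reallocation
# ... shut down the node, upgrade the Elasticsearch package, start it again ...
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.enable": None})  # null resets to the default
es.cluster.health(wait_for_status="green")  # wait before touching the next node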
Historically, it wasn't always super easy.
I hope we've made it easier.
There are also, I would say, ways around that to make it easier.
So we have a cloud service.
We have a Kubernetes operator. So there is a lot more tooling nowadays around all of that to help you. But I don't want to deny that, especially early on, Elasticsearch upgrades were not always as fun as we would wish they were.
I think it's true that databases in general have really improved there, where, you know, 10 years ago that was not the case, and people were major versions behind a lot just because it was such a pain to do that sort of stuff.
So yeah, I think it's just gotten a lot better
in the last five, 10 years.
Hopefully. I mean, we still see a lot of old clusters. And I can understand: everybody has n tools they need to manage. It's sometimes also a testament that it just works, so you don't worry about it anymore, and then you forget about it. But then the upgrade at some point will be a major operation. So yeah, I do see old or even ancient clusters.
We of course always recommend the latest version; it's the most secure and by far the fastest and most stable. Or maybe, if it's a fresh .0 release with a lot of new features, the .1 release then. But that's generally the recommended version that you want to have. Otherwise you just miss out on a lot of performance improvements, security fixes, and bug fixes.
New features.
Yeah.
Yeah.
Very cool.
Okay, this has been great.
I feel like I know a lot more. Hopefully this will be helpful to others. We always end with some common questions I ask everyone, a little bit of rapid-fire type stuff. So first question: if you could master one skill that you don't have right now, what would it be?
So the one thing that I want to do, and this is very random maybe: since I go to a lot of meetups and conferences, I want to pick up sketchnoting.
Oh, yes.
Yes.
When someone's doing a talk and it's like, yes.
Yeah, exactly.
And then you put together this great overview because I think it's like very helpful.
And it's like, I love it when somebody does it for my talks.
And I don't see it much, but I think it's actually great if somebody really gets the
gist out of that into one picture, basically, and makes a nice picture.
And since I'm at so many events, I'm always like, I wish I could do that because it would be very cool.
Also, I guess, a takeaway for everybody from the talk.
So maybe this is an uncommon one, but that is the thing.
No, that's a great idea.
And it's hard to find someone that can do that, because generally you need at least a little bit of domain understanding and knowledge, since these are deep technical talks. And then finding someone that has the art and has the words, because they're also taking notes... Just being able to do all that stuff is a tough skill to find. But those are really nice.
Yeah, I really want to do it. I'm really not much of a drawer, so I'll see how bad it will look.
On the other hand, maybe it just doesn't have to be pretty. But as long as it's more or less
functional, it's still much better than nothing. So we'll see.
You should try it at a meetup or a conference this year. Give it a shot.
Yeah, that's a great answer. Number two: what wastes the most time in your day?
Yeah, I will always jokingly say that at this point I'm doing 80% management and, uh, 80% DevRel. Or sometimes I say that DevRel is almost a hobby at this point. No, but it's a good thing. I really love our products and what we do.
And the only way to scale it out is through others.
That's why I'm doing my best to build an amazing team
and move us forward through that
because otherwise I'm always limited to myself.
But it is taking a lot of my time, of course.
But that is an important step
to move us forward
and get us to all the developers out there
because a podcast like this is great
in terms of scaling,
but we need to be on more shows,
at more events,
answer more questions on Discuss and Slack,
write more blog posts, do more demos.
We just need to scale this out
and I'll try to lead by example through that.
Yeah, yeah.
Well, for those listening, Philipp's talking about doing this in his hobby time. It's like almost midnight. Is it 11:45 at your place?
Yeah, it's almost midnight.
I just came off a meetup.
Yeah.
Right after a meetup and he said he would be willing to do this podcast.
So it truly is.
That's amazing.
All right. Next question: if you could invest in one company that's not the company you work for, who would it be?
I think the mandatory choice right now is NVIDIA.
NVIDIA, yeah.
Because everybody needs those GPUs right now. I guess right now you're too late already anyway, but it feels like this is the top of where everybody wants to be.
And I think it's great to see all the excitement.
I think we'll see how long lasting all of that is and where it goes next.
But even though I feel like we're kind of like in a bit of a dry spell in IT to some
degree, it's great to see that there is movement and things are working and
there's a lot of innovation and excitement. And then I think NVIDIA is the perfect player in that
thing right now. Yep. Yep. Well, I made a mistake last week of saying I thought NVIDIA was over
valued, you know, because they like passed Google and Amazon in their total value. And then earnings
came out this week and they jumped like another 15 20 so
yeah and probably now it's too late so don't don't take financial advice uh from me um or
none of the listeners please um uh that that is not uh what i'm good at uh but i think it's a very
interesting story because i feel like at some point people were a bit like oh the bitcoin or
a blockchain hype is over like nobody needs those gpus anymore and
and suddenly we need even more gpus than before yeah i know it's amazing yeah they really set
themselves up well for that so uh okay what tour technology could you not live without
besides elastic search does it include our stack because i i use our stack really every day um yeah i mean
Oh, that's a very tricky one. I'm very torn. So I have a shared background between ops and dev, and for ops I feel like it's Linux in general, and maybe automation like Terraform. And on the development side, I just did a lot of Java in the past. I think it's another surprisingly vibrant and resilient ecosystem that's doing a lot of things and has also gotten past a lot of its problems. So I'm not one of the Python people; I'm still one of the Java people. And we have LangChain4j and Spring AI and other things, so there's also movement there.
It's also fascinating because it's a very different take than the Jupyter-notebook world, for example. When you look at a LangChain4j example, it often includes tests, integration tests with Testcontainers and Elasticsearch, for example; I saw something like that yesterday. I feel like the Jupyter-notebook world is very different: it's much more about experiments, and it often doesn't have automated tests. And I'm just coming more from production. I really like production, because that's where the pressure is, where things happen. So I like that world.
Yeah, cool. Okay, which person influenced you the most in your career?
I feel like there were a lot of people that influenced us. I don't want to make this too much of a sob story or anything. No, I want to name Shay. He started Elasticsearch, I think from his bedroom, more or less. And then he was the CTO, and the CEO, and now he's the CTO again. And, A, because he found an interesting problem. By the way, do you know the backstory of why he created Elasticsearch? Maybe we can quickly bring this in here.
I do, but tell it.
Yeah, I thought this is a good one. Okay, so his
wife was becoming a chef, and she had a lot of recipes, and he wanted to write a recipe search. And he didn't find a good search system.
So he started building a search system.
Legend has it that she is still waiting for that recipe search because he got slightly distracted on the way.
And Elasticsearch is kind of like the third implementation because there was Compass 1, Compass 2.
And it wasn't Compass 3, but Elasticsearch.
And 3 kind of like is the magic number, maybe.
I don't know.
And that one stuck around.
But that's where he got on the tech side.
And at some point, he took over more and more of the company, and he didn't have time, or I don't want to say he wasn't allowed to write code anymore.
That would maybe give the wrong impression.
And I really liked his answer when he was asked, like, do you miss writing code?
And he's like, well, I'm mostly about solving problems.
And at first the problem was to write the code, but now the problem is something else.
And he has been there from the start and he's still there and pushing us.
And I think that has been influential.
And I remember back when I joined the company, I had an interview with him. I remember it was very interesting because, I mean, he's Jewish, and I think he was in Israel at the time, and he scheduled the interview on the 25th of December at 9 a.m. or something like that. And I was like, well, he has kids, and if he can do that, I guess I can do this as well. So I agreed to that. And then, I think the evening before, he sent me an email and said, like, oh crap, it's Christmas, should we reschedule this? And then we rescheduled, and then we did the interview, and I was like, man, it's so hard to understand him, and it's so noisy. And at the end he's like, well, this was great, because my wife is driving and the three kids are in the back and I'm in the car. And that's where we did the interview. And since I survived that, I was like, okay, well, this all seems very manageable.
So, yeah, I think that's the person I pick for that.
That's great.
Yep.
I've seen him a little bit around and, yeah, just seems like a great guy.
And then, obviously, Elastic: great product, great company.
And he sometimes has, at least in the old days he had, a picture where he looked very mean, like he had a shaved head, and he looks very mean or threatening. He's not that mean; it might just look like that, don't get the wrong impression. But at my previous company, where I worked before, we were using Compass 2 already, so everybody knew about him, and we were always saying, like, oh yeah, the search guy, the mean-looking search guy. And I got to know the mean-looking search guy a bit better.
That's great.
Yeah, very cool.
All right, last question.
What is your probability
that AI equals doom for the human race?
I'm always slightly pained here. I think there are so many steps on the path to
humanity's doom at this point. I'm not sure AI is in the top three right now.
Is it the last one? I'm not sure. Maybe it works its way up.
On the other hand, I think where AI is doom, the doom is kind of for the information on the internet, because of the randomly generated noise in poor quality all over the place. I think that it's not doom in itself, but it is a big pain point.
It also does, I think, to some degree,
destroy some established communities.
Like I mentioned Stack Overflow before,
and I think that traffic took a big hit,
but you can also see it in other places.
And I think it makes sense, because why write the question on Stack Overflow and then either, A, wait half a day for somebody to answer, or, B, be told, oh, this is a bad question, or you shouldn't do this, when you can just ask any random thing to a chatbot and get an answer right away? And oftentimes it's right, and ideally you try it out and figure out if it is really true or not. So I can see the appeal of that, but it does change a lot of the ecosystem, and it has a big impact on how people learn and also where people exchange ideas and get to know each other. And I guess every win... AI is clearly adding value and working for a lot of things. We have products like AI assistants that are doing a great job and are really helping make things easier and faster. But they do come at a cost.
Yeah, yeah, yeah, absolutely.
Yeah, I agree.
It's a weird world we're going into.
But like with anything else, I think it's not replacing the pain or bringing new pain; it's just exchanging the pain for something else. It's solving something and adding something else, and it's always a combination. It's like "it depends": I think very few things have a clear-cut, 100% answer, like it's this or that, it's good or it's bad. The reality is more complex. On the other hand, I always say that complexity is great because it keeps all of us employed, for now, and entertained and busy, and we can learn things. So the complexity is to some degree a feature.
Yeah, absolutely. I love it. Well, hey, I appreciate you staying up this late and doing this and explaining Elasticsearch.
Thanks for having me. I hope we could reduce the complexity or confusion around Elasticsearch.
Definitely. Yeah. Yep. This is great. This is truly great. I love it. I think it'll be helpful for a lot of people.
If people want to find out more about you or Elastic, I guess, where should they go to find you?
I mean, Elastic is elastic.co, not .com, but .co. And then all the social media and whatever. Me personally, my Twitter handle is, and maybe we can put it in the show notes, xeraa. If you wonder why that is my Twitter handle, it's the ROT13 of my last name, if people still remember what ROT13 was. It's when you rotate the letters by 13. The nice thing about ROT13 is that encryption and decryption are the same: if you rotate by 13 again, you're back at the original. So if you take my last name and rotate the letters by 13, this is the handle that I try to use everywhere. I learned about that, I think, when I was in school, and I was on a tram and I basically did the ROT13 calculation on a piece of paper. And since then, I have been sticking with it.
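A quick sketch of that round trip in Python; the surname and handle are from the conversation:

import codecs

# ROT13 rotates each letter by 13; applying it twice is a no-op,
# so "encryption" and "decryption" are the same operation.
print(codecs.encode("krenn", "rot13"))  # -> xeraa
print(codecs.encode("xeraa", "rot13"))  # -> krenn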
How do you pronounce it?
Yeah, I say Xeagah, but there is no official pronunciation.
I didn't put it in any urban dictionary or anything yet. I have an explainer on my website.
I saw that. Yeah. Yeah. I haven't added the pronunciation there.
Okay. There you go. Yeah. So awesome. Well, Philipp, it was great to have you. I really appreciate it.
And thanks for coming on. Thanks for having me.