Software Huddle - Elasticsearch Fundamentals with Philipp Krenn
Episode Date: March 14, 2024

Today, we have Philipp Krenn on the show. He's the head of DevRel for Elastic, and we took a deep dive on all the Elasticsearch stuff, like indexes, mappings, shards, and replicas, and how to think about performance and all that. We also discussed the use cases and applications where Elasticsearch is not suitable. This episode is packed with fundamentals, and we think you'll love it.

Timestamps
02:00 Introduction
04:13 What is Elasticsearch
05:33 Use Cases
11:25 Where not to use Elasticsearch
13:51 Index
16:44 Shards
23:29 Routing
33:57 Replicas
41:08 Bottlenecks
01:02:30 Upgrading an Elasticsearch Cluster
01:06:12 Rapid Fire
Transcript
Elasticsearch started off as a distributed document store, but it's much more than a regular database, because databases are, I would say, very much black and white. You store something and you want to retrieve that back. Search is much more about those shades of gray: you are looking for a concept, and you don't know exactly what you want, but you want to have the concept or some context around it.
If you could master one skill that you don't have right now, what would it be? Since I go to a lot of meetups and conferences, I want to pick up sketchnoting. That's a great idea, and it's hard to find someone who can do that, because generally you need at least a little bit of domain understanding and knowledge, because these are deep technical talks. And then finding someone who has the art, has the words, because they're also doing notes... just being able to do all that stuff is a tough skill to find. But those are really nice. Which person influenced you the most in your career? I want to name Shay. He started Elasticsearch, I think, from his bedroom, more or less. And then he was the CTO and the CEO, and now he's the CTO again. And, A, because he found an interesting problem. By the way, do you know the backstory of why he created Elasticsearch?
Hey folks, this is Alex. I always say that Elasticsearch is super powerful, but it also scares me, because I just don't understand how it works, and how its scaling and performance compare to other databases. Great episode today. I have Philipp Krenn. He's the head of DevRel for Elastic. And we just deep dive on all that Elasticsearch stuff, like indexes and mappings and shards and replicas, and how to think about performance and all that stuff. I learned a ton. I wish I would have had this eight years ago when I was using Elasticsearch, but I think it'll be just useful for, you know, getting a good mental model of how Elasticsearch works. So I hope you enjoy it. If you have any questions, comments, guests, anything like that, feel free to reach out to Sean or to me. And with that, let's get to the show.
Philipp, welcome to the show.
Thanks a lot for having me.
Yeah, so I'm excited to have you because Elasticsearch is in some ways my white whale.
I feel like it's this amazingly powerful thing, and I also don't know enough about it, so it scares me a little bit.
And you are the head of DevRel at Elastic, so I'm guessing there are not that many people
on earth that would be better to sort of explain that to me.
So with that introduction, maybe tell us a little bit more about you and your background.
Right.
So I've been at Elastic for almost eight years at this point.
I've done a ton of events and just tried to help out our users.
And we're still on that mission to make it work for the developers and help them along. So
hopefully we can avoid a bit of that white whale thing. I mean, any complex system has its learning curve and Elasticsearch is definitely no
exception, but it's not magic and it shouldn't be scary. Most of the stuff, once you know it,
is almost easy. I feel like that's always the problem for me. It's like, once you know,
then you know and it's easy, like for any other thing that you have learned. And once you know,
it's like, oh yeah, of course. Yep, of course. Yep, yep. Well, one thing I learned is, it just surprised me how little developers know about databases, just basic stuff like indexes. And I'm talking about a relational database, just an index and how it works and how you should use it and stuff like that. And then Elasticsearch is just another level of power, with all the different things being distributed, but also having the inverted index and text searching. Also now having vectors and, you know, having had semantic search for a while, geo stuff, all sorts of things. And it's amazing in what it can do. But again, you see people do bad things with it, because people do bad things with relational databases, which have been around forever. And now it's another level of complexity, but also power and all that stuff. So I'm excited to learn a lot of this stuff. And I think there's
probably no one better here. Cool. Yeah. I mean, it's a bit like the hammer and nail thing, right?
If you only know one tool and you try to use it or abuse it for everything, then you get
interesting results.
And yeah, Elasticsearch can do a lot of stuff, but not everything.
Yep. Yep. Cool. Okay. So let's jump in, and maybe just very high level first: what is Elasticsearch?
Elasticsearch started off as a distributed document store, or search engine. I think search engine is what we commonly refer to ourselves as, even though it is JSON documents that you put in. But it's much more than a regular database, because databases are, I always say, very much black and white: you store something and you want to retrieve that back. Search is much more about those shades of gray: you are looking for a concept, and you don't know exactly what you want, but you want to have the concept or some context around it.
And that's where the search power is coming from.
And Elasticsearch has developed amazingly well in full text search over many years.
And then we also branched out to other areas because for us, logs, but also observability
is very much a search problem.
It's like, the storing stuff is the boring part. What you want to do is find the most recent or relevant errors, or why something is happening, with root cause analysis. These are, for us, basically a search problem. And then we have branched out to observability, and nowadays also to security, which for us is also a search problem: you want to find the bad stuff in your system before something really bad happens. So search is kind of at the core of all of that.
Yeah, very cool. So in terms of, I would say, broad use cases that you see, how would you break down how people are using it, maybe in rough percentages? And the categories I think of are: I have a search box somewhere on my website
where I'm searching through customer names or big text documents or something like that.
That one.
More like internal operational search, like I'm dumping all my logs or metrics in there
and trying to figure out what's going on.
Maybe that one.
And then I think also as an analytics engine. It's pretty good at doing aggregations; very good at distributed compute, basically. So maybe those three buckets. If there are other buckets you see a lot of, sure, but how much do you see in these different buckets?
I mean, A, we see a lot, and B, the beauty of this is, since everybody can just download it, we often don't know, or we almost surprise ourselves when we see it somewhere, and it's like, oh, cool use case, or, oh, we didn't even know that. But yeah, the starting point was, for many, the search platform, like the search box. So if you use Stack Overflow and you search on Stack Overflow, that search is going through Elasticsearch in the background, as one of many, many examples. The classic ELK stack for logs was extremely widely used, so that has reached wide, and it has also evolved a lot, to OpenTelemetry and being great with Kubernetes and all the things around that. Security is often a bit of a different beast, but that's also in
there. And then, yes, you have these things
where people use it more like a data store or an analytics engine. Kibana is a very powerful tool.
So, for example, for our own team, I pull together a lot of stats and we have our own OKRs and
metrics. We have Kibana dashboards for everything. So we do use it for that a lot as well, just to
aggregate what we do or how things are moving along in other parts of the company.
So there is a wide use case and range of things you can do.
It's sometimes just a question of, what is the biggest one? I don't know. If you want to come back to the question, you can also turn it around and say, well, is it in terms of the users that touch it? Or is it the installation size? Because many logging clusters are very large, whereas search is often a smaller use case. Search is the core one, and it's also, thanks to vector search and LLMs, a very hot topic. In terms of installation base, I guess logs and the classic ELK, or evolved ELK, sometimes I call it ELK++ nowadays, is very widely used. And security has also gained a lot of foothold, especially in critical infrastructure.
Yeah.
You say evolved ELK, or ELK++. Tell me more about that.
I always say your old self is often your worst enemy.
And that is true for ELK. People started with that and it was great, but that was potentially 10 years ago, or maybe eight years ago. In terms of products, it's 10 years or so. The problem is, what was a best practice or worked back then is probably not a best practice, or has changed a lot, nowadays. And we often see that people are still doing what they used to do, or what was a best practice eight years ago. And then we see them do that, and it might not be optimal anymore. Or there might be a new competitor coming in and saying, oh, this is all wrong, you should do this differently. And then I'm always like, yes, we wanted to, or we have been changing that for years, and we try to tell everybody, but it's very hard to make people actually go through that. And ELK++ is my personal thing; this is not an official brand or anything. It's more like, I think the brand ELK is very strong and worked very well.
And I have a prop here, or actually I have two. Yeah, perfect. We have these in different sizes. These are the elks. And the elk is very hard to kill; those are still well known and very popular. So ELK is kind of what we have been doing for a long time, but we have moved it along, because it used to be Elasticsearch, Logstash, and Kibana. But for ingestion, there are Beats, like Filebeat or Metricbeat, as a lightweight agent or shipper to collect the data. Nowadays, we also have something called Agent, or Fleet, because at some point many people were like, there are too many Beats, we don't want to install five different Beats on each system, we just want to have one single agent. And Fleet is basically centralized management for that, which you have in Kibana. You can just say, roll out this policy across 100 nodes or 1,000 nodes and collect this type of data.
So there has been a lot of development.
Also, one of the problems, for example, for logs was that Elasticsearch was actually quite rigid. If you changed the type of a column or a field, then it would just drop your entire log message on the floor and say, well, this field is wrong, I don't know what to do. We are still in the process of making that a lot more lenient. Even if one field is not right, we still keep everything else, and we just store that one in a raw format and don't index it the right way, but we don't drop the entire message. So there is a lot of evolution going on there, also in terms of how to manage the data, and how to manage time series better, and automate a lot of the pain points away. That's why the classic ELK is still there, but it has gone through a lot of iterations. And that's why I'm keen on pulling people along whenever I see them use something that was the best practice five years ago. And I'm like, you're totally right; nowadays, maybe I have these other things that you should be using that will make your life a lot easier.
Yeah.
Yeah.
Okay.
Stay on use cases really quickly.
Are there any use cases where you advocate, like, hey, don't really use Elasticsearch for that, if people are thinking about Elasticsearch?
For background, what is actually storing the data in Elasticsearch is Apache Lucene, the library, which is more than 20 years old at this point.
And the point of Lucene is really that it's an immutable data structure.
So if you have something that updates extremely frequently, that is not a great use case.
So if you have, for example, a counter for each page on your website, and every time a user comes to your website you would need to write, or replace, that and basically increment it by one, so you have, this page has been visited, and then you have in there 500, 501, 502. Every time you do that, we would need to take the entire document, and not just replace that one field, but replace the entire document. That is a very expensive thing. The immutable nature has other advantages, and you could potentially structure the data differently: you just log every event, and then you aggregate it at the end and say, this page had 510 users, rather than updating one document 500 times. You just have individual events. Maybe at some point you roll them up and say, we had X users for this domain or this URL per day. And then you drop the individual documents, because you don't need that granularity anymore. So there are totally ways to work around that. But there are other ways to just use it and do it yourself.
And Elasticsearch does not have multi-document transactions.
So if you have that requirement, it will also not help you out.
Okay, so: very fast-moving, especially counter-type data. And because of no multi-document transactions, probably not your straight OLTP store.
Yeah, I mean, it's not a relational database. There is no transaction semantic, where you start a transaction, you do multiple things, you close the transaction. People have built all kinds of things to compensate for failures or do that. Well, I'm not sure; if you really need multi-document transactions, maybe you need a different system.
Yeah.
Yeah.
Okay.
Sounds good.
So let's dive into terminology and concepts to make sure, because I feel like I just don't understand it as well as I want. So, starting with indexes and mappings. First of all, index. I guess, describe an index, because it's different than an index in a relational database, right?
Yeah. Index, even within Elasticsearch, is an overloaded term, because to index, the verb, is basically what we call storing a document: since search is doing more work up front, that's what we call the index operation. And the index, the noun, is something that holds related documents. Normally they have a similar structure, and those can then be broken up, because your next question will probably be about shards. So a shard is one piece.
Hold on. So an index, the noun, is basically like a table, or a collection in MongoDB. It's a grouping of data that has likely the same shape; not necessarily, but roughly.
We tried that comparison with tables early on.
I don't think it worked so well.
So I'm always flinching slightly when somebody says a table, and it's like, yeah, more or less. I mean, let's say more or less. What you normally do is you put related documents together, in terms of structure, because even though it's JSON, a field can only have one data type, because there is what we call a mapping; it would be a schema in the relational world. So you would say the field name is of the type text, for example. It cannot be a number afterwards, or it would need to be cast to a string type, basically. So there is a schema-like structure, so you shouldn't have totally different documents, because you might have conflicts in that mapping. You might also have problems if the data is extremely sparse; that got better over time, but it used to be not very efficient in disk space. So if you have totally varied data that has very little overlap, you're probably not getting a lot of benefit.
The other thing why I'm slightly mad about the table concept is if you have anything that is more like a time series, like anything with a timestamp, like log events or metrics or whatever, you would have normally multiple indices,
like maybe an index per whatever time span or amount of data.
So you would have an index pattern, basically,
and then you would have some qualifier, like which iteration this is.
So maybe like a partitioned table in a relational database,
in that time series pattern.
Yeah.
Yeah, that's what you're saying.
Yeah, okay.
And in terms of if you're doing that time series pattern,
is that because we want to keep indices relatively small?
Is it because it's easier to then drop an entire index when stuff ages out? What's the reason for, I guess, splitting those up?
Let's add that shard concept before diving into that,
because that is kind of like the relevant unit here.
So one index can have one or more so-called shards.
It's basically you split up an index into multiple pieces.
And then each piece can be on any node.
So if you have one index, let's say we have three primary shards, it has three parts.
Those could be on three different nodes.
This is kind of like how you do the distribution of the data.
On top of the so-called primary shard, there is the concept of a replica shard, which is like the secondary copy, for high availability and also to increase search throughput. Nowadays, by default, we only have
one primary shard per index.
It used to be five.
Many people didn't have enough data.
It's kind of like coming back to your question:
what is the size of an index or a shard?
So the rough idea for a shard is that it should be between 10 and 50 gigs of data. If you have a lot of shards with just kilobytes, you're not doing yourself a service, because that is just extra overhead, and you're not gaining a lot from it. If you make them huge, and a node goes down, it might take quite a while to re-replicate a huge shard somewhere else. And there are also some limits, very hard to reach, on how large a shard can be, in terms of how many documents it contains and things like that. But 50 gigs, or maybe a little larger, is what normally works. For some search use cases, smaller might have some benefits as well. But it should be in the, I would always say, double-digit gigabyte range. If you're much smaller, there's probably too much overhead. If you're much larger, you might have some hotspots, or the distribution might just not be as even as you want anymore. So you should keep that in mind.
I was going to say, can I reshard an existing index, or do I have to...
So, that's what I wanted to say. The resharding concept that we have nowadays: you can split or shrink shards, though it can only be by a factor. That's why the initial number of five primary shards in Elasticsearch was a problem: prime numbers are not great for that, because the only place you could shrink down to from five was one. Nowadays we only have one by default, so you can then split into almost whatever number you want. But if you think, to parallelize stuff today, you want to have, let's say, four primary shards, and then you want to reduce it to two primary shards after a week or so, that might make a lot more sense. Whereas five only leaves you one place to go. So _split and _shrink are the endpoints that you can use.
So when you're resharding, do you have to essentially reshard the whole cluster? It's basically like you have to take every partition and cut it in half to double it, or you can take two partitions and merge them?
You take one index, and you basically change the number of primary shards for that index.
But I can't go from, like, 8 to 3. I could go 8 to 4, or 8 to 2.
No, it needs to be a factor. And if you, for example, shrink, all the data needs to be on the same node. And the nice thing about that is, the operation itself is then very fast, because it will just symlink the data together. So the operation itself is pretty fast.
Okay, wait, sorry. If you do what? If you split it, it has to be on the same node?
If you combine it.
Oh, combine it. Okay. Okay.
Because then it's on the same file system, and then you can just symlink them together. It doesn't need to move a ton of data around, and it doesn't block anything; the operation, or the call itself, is relatively fast then. So there are some trade-offs around what is possible with sharding today. Maybe we'll actually change it in the future; we'll see.
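The _split and _shrink endpoints look roughly like this. A sketch: index and node names are invented, and both operations require making the source index read-only first:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Splitting multiplies the primary shard count by an integer factor.
requests.put(f"{ES}/my-index/_settings",
             json={"index.blocks.write": True})        # source must be read-only
requests.post(f"{ES}/my-index/_split/my-index-4", json={
    "settings": {"index.number_of_shards": 4}          # e.g. 1 -> 4, or 2 -> 4
})

# Shrinking goes the other way (the target must be a factor of the source),
# and first needs a copy of every shard on a single node.
requests.put(f"{ES}/my-index/_settings", json={
    "index.routing.allocation.require._name": "node-1",  # made-up node name
    "index.blocks.write": True,
})
requests.post(f"{ES}/my-index/_shrink/my-index-1", json={
    "settings": {"index.number_of_shards": 1}
})
```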
But for most cases, that primary shard number is actually less of an issue nowadays. Because either you have a relatively static data set, and then you don't need to change the number of primary shards very frequently. Or you have something that is time series data. Then we have what we call Index Lifecycle Management, ILM, which basically looks at it; you configure a rule and say, every 50 gigs, roll over to a new index. So you will create very evenly sized shards, and you don't need to ever split or shrink them, because once it reaches the threshold that you have configured, it will roll over to the next one, and then the next one. So you don't need to touch those anymore. So there's a lot more plumbing nowadays. That's why the size of shards, I mean, people can still get those wrong, and then it hurts. But from an operational perspective, the tooling is so much better that it should not be as much of a burden anymore.
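An ILM policy along those lines might look like this. A sketch: the policy name and retention numbers are made up, and the max_primary_shard_size condition assumes a reasonably recent 7.x/8.x cluster:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Roll over to a fresh index whenever a shard reaches ~50 GB (or the index
# turns a week old), then delete the data after 30 days.
requests.put(f"{ES}/_ilm/policy/logs-policy", json={
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_primary_shard_size": "50gb",
                        "max_age": "7d",
                    }
                }
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
})
```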
Gotcha, gotcha.
And if I recall, when you do a query, you can do a wildcard on the indexes you want to hit. Is that right? So if you have time series, you don't have to know all your different indices when they've been rolled over. You can just say, hey, hit metrics-star, and it's going to hit all the different indices.
Exactly. You can provide a wildcard. You can also provide a list, a comma-separated list. So you can say, search through these five indices, but just these five. Or you just say this pattern. And then, of course, you can add a filter on top and only say, whatever the time frame is, the last month, just give me the data in that. So there is a lot of flexibility built in there. So ideally, once you set it up the right way, you don't have to think so much about the indices and shards anymore; we have hopefully abstracted that pretty well away. And that is one of the big differences to the early ELK stack, where you had to do a lot more manual work around that, or you needed extra tooling that might run outside the cluster. Whereas ILM is part of the cluster, so it's just configured and running within the Elasticsearch cluster to take care of that.
Yeah. Okay. Okay, very cool. Let's talk a little bit about...
So, we talked about sharding. And I do a lot with DynamoDB, so that's what I'm thinking about here: routing. I think of routing, right? When I make a request to Dynamo, it's going to get routed to a specific partition to handle that request.
Yeah.
With Elastic, how often are people choosing a routing key, versus just saying, basically, distribute this document wherever? I guess, talk about routing within Elasticsearch. When a document comes in, how does it get assigned to a shard?
This is almost like a trick question. So, there are multiple answers to this. By default, what happens is, a document comes in, and we have a field that's called _id; either you provide that yourself, or the cluster will generate one for you. And that will basically be hashed, so it's evenly distributed, and then calculated modulo the number of primary shards. And then you get out, say, shard two, and the cluster knows that primary shard two for this index lives here, and then it will route the data there. So that is what happens by default. The fastest way is normally that Elasticsearch generates the IDs for you automatically, because the problem is, if you provide your own ID, then it might need to check: does this document exist, do I create a new one, or do I overwrite an existing one? So it's often faster when we just generate it, and then we don't need to make any lookups to see if it exists.
And you can change this routing information: you can provide an explicit routing key; _routing is the field that you can use. Why I'm slightly cautious is because I feel like people are often hurting themselves more than they are helping themselves. The idea is that you co-locate related data, and then your query can be answered from one shard. For example, I know there is a very big Austrian online banking system, and I think they have a mainframe in the background, but that's expensive and slow to query. So they have had all of their data, for 10 years or more, in Elasticsearch, and I think the customer ID, or whatever, is the routing key. So my data is always on the same shard, and when they run a query, they only need to query a single shard, and not the 50 or whatever other shards for all the other customers. There it makes a lot of sense, and it is reasonably evenly distributed. I've also seen that go totally wrong. I know an Austrian logistics company that used the country for the routing key. It happened that, I think, 80% of their customers were Austrian, or 80% of the shipments were Austrian. So 80% of the data went to the same shard. And then they added more nodes, because they wanted to scale out, but that didn't help, because everything was going to the same shard. So there they hurt themselves. So co-location can be a powerful feature, but if you do it wrong, it can be very painful. So that was one, what happens automatically. Two, what you can do yourself, roughly like the sketch below.
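A minimal sketch of that explicit routing option, assuming a hypothetical local cluster; the index name and customer values are made up:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Index a document with an explicit routing key, so one customer's
# documents all land on the same shard (the routing value is hashed
# instead of the _id).
requests.put(f"{ES}/accounts/_doc/1", params={"routing": "customer-42"},
             json={"customer_id": "customer-42", "balance": 100})

# A search that passes the same routing value only touches that shard,
# instead of fanning out to all of them.
resp = requests.post(f"{ES}/accounts/_search",
                     params={"routing": "customer-42"},
                     json={"query": {"term": {"customer_id": "customer-42"}}})
```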
And then there's a third way now.
So we have something that is called time series data stream, which is more like a time series database.
And what time series databases normally exploit very well is the locality of data, so that it compresses better. And in this TSDS system that we have, you basically have a couple of fields that you configure as routing keys, so similar data will always be co-located. So there it's kind of built in, but in a slightly guided way; it's like a tool that is under the covers, hidden there, and it's active and working for you. But it's a bit more guided, so you don't hurt yourself so much. Data locality is a big feature, as you know from other systems. It just depends on how you use it, and how you really make use of it, because you want to avoid the hot spots. But if you have some locality, you can potentially make good use of that. And that's kind of the story. But you had a question?
No, no. So I guess, yeah, the first question I was going to ask you is: let's put aside the TSDS one and just look at the first two, either auto or specifying a routing key. How often do you guess people specify the routing key? Or would you even recommend people specify the routing key? Or is it heavily, heavily weighted toward cases where you say, hey, let it assign that automatically to a shard?
I want to say that 90% or 95% use the automated hashing of the ID, and then it just works. Also, again, I think it depends a bit on the use case, because for time series data, you might want to have similar data grouped together. For a full text search use case, there is often maybe not even a lot of locality that you can exploit, and then you just want to evenly distribute it. So there are these trade-offs. And also, in terms of search accuracy, multiple shards can add minor variations that normally should even out, but it's another way you can screw up your data weirdly. So unless you know what you're doing and you have a very good reason, I would not reach for custom routing.
What if, okay, so what if I'm doing a SaaS and everything is within an organization or a tenant, so all my queries are within an org or a tenant, all of them? In that case, would it make sense to use the routing key and get that locality? Or should I just take advantage of the system and say, no, let's fan that out and throw compute at it from these different nodes, rather than putting it on one node?
You know what our favorite saying is at Elastic?
And it's the same as the favorite saying of every consultant.
But what's that?
It depends.
Yeah, yeah, sure.
Yep.
Okay.
So I think for any reasonably complex question, the first answer would always be, it depends.
And then you need to give a better answer.
So let me try to give a better answer as a follow-up.
I think the main danger is that you might have smaller or larger customers and you might create hotspots.
So I'm slightly cautious.
I mean, if you know what you're doing really well, then maybe. But there are a lot of other optimizations in the system, so maybe you don't need the locality to make best use of something. Sometimes, for example, if you have timeframes, it can figure out, this is the timeframe covered by one shard, and then you don't actually need to look at the others. So I feel like it's slightly tricky to answer this in a way that will work for everything. I'm just slightly cautious; if you know really well what you're doing and you want to squeeze out some optimization, maybe custom routing helps. I'm not sure this is the first thing to reach for, though.
Yeah, okay.
And I think that's where I get tripped up, because, again, I do a lot with Dynamo, which is so heavy on the partition key and making sure you use that, so you're going to one specific partition. Or even thinking about PlanetScale, you know, if you're talking about sharded SQL, or if you're talking about Mongo, the shard key being important in your query: you want it to be something that's used in all your queries, so you're going to one. Whereas Elastic is sort of on the other end of that spectrum, just saying, hey, the documents are going to be indexed well on each shard within the nodes there, and it's fine to do a scatter-gather and just sort of lean on that, rather than trying to shard it well and now dealing with hotspotting and different-sized shards, different things like that.
If you push it, and you have a very large use case, or a very specific scenario, or you want to have specific data compression or whatever else, maybe. But I think probably 90% of the users are served better by not thinking about that and letting the defaults work. And if you have that problem later on, I would return to it. But it would not be one of the first things to reach for.
Yeah, okay.
What about, is the query engine, I don't want to say smart enough, but let's say I have an index that has a routing key. Let's say it's tenant ID, and I provide a tenant ID in my query. Is it going to know, okay, I only need to hit this one shard now? Or is it going to hit those other shards, but they're immediately going to come back empty for that tenant and return quickly? Do you know what I'm saying?
Yeah.
Or, same thing with time series: is it smart enough to know, oh, that time range is only on this shard, or something like that?
Yes. So for the time ranges, I think it's simpler, because there you have some metadata on the shards, and then you can early-terminate pretty quickly. For a routing key, I don't know the implementation details, whether it will just figure out very quickly that you don't have any hits and then return, or not. It will probably also depend on the type of query and what you do there. Also, in terms of queries, it's getting extra complicated, because just a few months ago we added a new query language and query engine. There are multiple options now. We had the historic query language; the query DSL in Elasticsearch is JSON. Maybe you start appreciating SQL more once you start typing JSON as a human. It's great, especially for systems, but I think it can be complicated for humans to type. So now we have relatively recently added a piped query language. If you know CloudWatch or Splunk or Microsoft Kusto, those are all piped query languages, and we have a similar piped query language. And it's not just a language; it's also a query engine under the hood, with various optimizations, that tries to push more and more concepts down into Lucene. But that's also more block-oriented rather than row-oriented. So there are quite a few subtle differences at this point. That's why, when we say query languages, it's already getting tricky. But again, once you start pushing the system, then you can look into those. I think for the average user, those are not the first things I would look for or reach for.
Yep. Okay. Okay. All right, last thing in terms of terminology and things like this: replicas, which you talked about a little bit. So basically, I have an index.
I'm going to shard it into multiple shards. Each one will have a primary and zero to N replicas. I probably want at least one. How many do you recommend for high availability purposes? Do you recommend having two replicas, so you have three copies? Or what do you recommend there?
The replica terminology, by the way, is, I think, very confusing
because depending on the system, the replica includes the primary or doesn't include it.
So for us, we have the primary and then we have zero to N replicas.
We generally recommend one for high availability.
But maybe let me take one step back. So, how the write actually works is that you send your request, the JSON or whatever you want to write, to the Elasticsearch cluster. And then we'll pick a random node, normally. That is the so-called coordinating node. That coordinating node will figure out what the primary shard for that document is, based on the ID or the routing.
And hold on just one sec. So the coordinating node can be a storage node, potentially, but it just happens to be the coordinator for this request. It's not like a request router outside of the storage?
Yeah. Let's assume we have the simplest case where, for high availability, we need three nodes, so we can have a quorum. So we have three nodes that are data nodes and also so-called master nodes, to keep the cluster state running. They're all data nodes, so your requests will round-robin between them. And then one of them gets the write request. And that one, the coordinating node in that case, figures out what the primary shard is, and then it will send, or forward, the data to the primary shard. The primary shard will then apply that and forward it to the replica shard. You get the acknowledgement back from the replica, then the primary acknowledges to the coordinating node, and then you acknowledge to the client. That is how a write works.
And sorry, on that write path, it synchronously waits until all replicas have acked it before it returns?
It will only acknowledge back to the client once it has been written.
To all replicas?
Yeah, it gets tricky if you have more replicas and there is the chance of timeouts. But it will try to write to the majority of the replicas. But we don't want to make it too complicated now. The happy path is that it gets forwarded to the primary, the primary writes to the replica, the replica acknowledges back to the primary, the primary acknowledges back to the coordinating node, and that acknowledges back to the outside system. And then there are, of course, timeouts, and what happens if you have five replicas: do you need to write to all of them? But we don't want to make it too complicated, because those are getting into outliers then.
Yeah, yeah.
We don't mind complicated here, though.
I like doing deep dives, so I want to know.
But if it's just such an edge case, then you don't need to go into it.
I mean, so maybe let me do the search first to explain why you might want to have more replicas.
Because if you have a search or aggregation request, that comes into a coordinating node again, and that one figures out all the shards that you need to query. And then it will pick either the primary or a replica shard for reading or searching the data. So it can go to either one. It doesn't always go to the primary.
And hold on just one sec. You said all the shards it needs to query, but in most cases, that's going to be every shard in the index, right?
Yes, sorry. But my thought was, you have an index pattern, so it might be multiple indices, and then multiple indices might be many shards. So that's why I'm... Okay, yeah. It could be a single shard, if you query one index with one primary shard, or it could be N, so it can be many. And then you query those, and it can be either a primary or a replica shard that you query. And there is actually an algorithm built in nowadays that will keep track of which nodes are more or less busy, and if it has multiple options, it will go to the least busy one. That helps you route around busy or hotspotting nodes, for example nodes that are busy because they're doing a garbage collection or whatever else.
And are those statistics based on that coordinating node and the requests it's made recently? Or are the nodes sort of gossiping around, saying, oh, this is a slower one, whatever?
So the thing is generally called adaptive replica selection. I thought the statistics come from the previous queries that you have been running.
From that specific node's previous queries, or across the cluster?
Don't nail me down on that one, but I thought it was that the coordinating node is basically piggybacking on previous queries. I don't think we have extra things that we send around, though I have not looked at the implementation details for that in a while. I thought it was part of the responses that you get.
And then the coordinating node will collect all the sub-results. And by the way, if you do a full text search, basically what you tell each shard is, let's say you want the top 10 documents, it will tell each single shard: give me your top 10 documents, but only the ID and the so-called score, like how good the result is. It will then aggregate, from all the shards, the top 10 documents overall. It will actually, in a second step, fetch those actual documents, and then it will send that back. We call it query, then fetch. So first we query for each shard's best results, and then we actually fetch the documents that are the global top results. So we don't need to move a lot of data around, because if, let's say, you have a one-kilobyte-large document and you query 100 shards, there's a lot of data that you would need to move around otherwise. And the coordinating node would also need to deserialize and serialize that again. So we skip all of that; we only actually fetch the documents that we need. And that is the case where having more primary shards helps you spread out the write load, and having more replica shards helps you spread out the read load, because you have more copies from which you can read.
Yep. Okay. But let's say you don't have any replicas
at all. Adding, like doubling, the number of shards that you have probably is not going to help you on the read side, because they're all taking the same load, since they're doing scatter-gather every time anyway, roughly?
Again, it depends, it depends, it depends. But generally, yeah, I want to say it depends. We could probably set up a very weird cluster where you have five nodes but only two primary shards, and then only two nodes are doing all the work. And if you add more shards, and they distribute evenly, then it would be better. But yeah, generally, normally you have quite a few more primary shards, or shards in total, than nodes. So, yeah.
Yeah. Okay. Okay.
This is awesome.
Now let's go into, where do people hit bottlenecks? And I want to think of it both in terms of what resource: is it CPU- or memory-constrained, is it disk-constrained? And then also, is it different on the write path versus the read path? Are people hitting more on the read path, or more on the write path? So where do you see people hitting bottlenecks with their Elasticsearch clusters? Again, there's a lot; maybe just walk me through. Or maybe, is it 50-50 between reads and writes being the problem, or does one of them tend to be more of an issue?
I mean, there are potentially so many things that people can do. For example, one thing that you could do is, we have something called an ingest pipeline, to basically change documents on the way in. That's normally more CPU-heavy.
And is that on a separate node, or is that happening on the coordinator, each data node when it's acting as coordinator?
It can happen on the data node.
You could also have dedicated coordinating nodes that don't do anything other than being this kind of smart router.
Or you could even have a dedicated ingestion node.
So you could break those out in larger clusters; if you have, like, 30-plus nodes or so, it might make sense. Also because you might have different hardware profiles: that ingestion node might need more CPU, but might need less of other things, because it's just processing the data. Whereas the data node probably needs fast disks, and then as much memory as possible to keep stuff in memory in terms of caching. If you run large aggregations, you will also need memory, and maybe you need some CPU to crunch some data. You will need sufficient network. So I think you can pretty much run out of any resource that you want, depending on how you hit it and what your read/write ratio is. Also, for example, vector search: in general, the data structures, or HNSW, the small worlds, which is the data structure that we use for vectors right now, those should be in memory. There, normally the biggest constraint is the memory size that you have available, even though we have quantization now to reduce the memory and things like that. But memory, for example, is a big constraint there. If you have more like a logging use case, where you store terabytes of data, probably the disk is the bigger thing, because there is no point, or no way, to keep all of that in memory; you'll just need to fetch the right data pretty quickly from disk.
There is also, especially for the time series case, what we normally have as data tiers, where we say: the data for today, this is where all the writes happen, and probably most of the reads. And from three days ago, it's what we call a warm node rather than a hot node: it gets no writes anymore, and also fewer reads. And then there might be cold or frozen, which is mostly for when, once a week, somebody runs a query that spans more than a month or so, or a full month or whatever. And there you might just want to be more patient: maybe you add spinning disks somewhere, or you move it to a blob store, just to reduce the cost. So there are a lot of different things that you can do there. And then, for example, if you have a lot of data on a blob store and you need to pull it in, maybe the network becomes a bigger bottleneck. So I have a very hard time saying there is this one thing that you will run out of. Normally, I think memory and disk are the first things. But then you will need to keep an eye on what you're doing and the problems that arise from that.
Are there certain things during the ingest pipeline that drive a lot of CPU? Like maybe if they're doing the built-in semantic search and that's generating vectors, is that pretty heavy?
So we have another node type, a machine learning node, that's doing the inference and stuff like that. So that would normally be separate, or you would often do it separately, also because, again, more CPU is probably the hardware profile that you want on that one.
Coming back to what the bottleneck for ingestion is: one thing that we do a lot in Elastic is Grok, the named regular expressions. And regular expressions, especially if you write bad regular expressions, can be very CPU-heavy or inefficient. So I always point to Grok first. I'm not 100% sure it is always accurate, but I like pointing to Grok, because you can do horrible things with Grok. And I always say that the plural of regex is regret: A, nobody can read it, and B, it's also often very obscure to debug. But it's a great tool to parse apart what you have. So this is another thing: in a classic ELK stack, you would always use Grok to parse, and I call this kind of the Stockholm syndrome, that people got so used to doing that that they think this is the right way. Ideally, you can have structured data already when you write it, in your logs, for example. So if you write JSON, you don't need to do all the parsing anymore. It's easier for you, for not having to write Grok patterns, and it's easier for your CPU, because it doesn't need to do all of that parsing anymore. That's one of the classic ways where, I say, the evolution means it doesn't have to be so heavy or painful if you move it along in the right way. Just think about where you're doing that sort of stuff.
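For reference, an ingest pipeline with a Grok processor looks roughly like this. A sketch: the pipeline name and the log format are invented; TIMESTAMP_ISO8601, LOGLEVEL, and GREEDYDATA are built-in named patterns:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# A pipeline that groks a raw log line apart on the way in.
requests.put(f"{ES}/_ingest/pipeline/parse-logs", json={
    "processors": [{
        "grok": {
            "field": "message",
            "patterns": ["%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}"],
        }
    }]
})

# Writes opt into the pipeline per request (or via an index default):
requests.post(f"{ES}/logs/_doc", params={"pipeline": "parse-logs"},
              json={"message": "2024-03-14T12:00:00Z ERROR something broke"})
```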
Oh man. I remember working at a place where we had Elasticsearch, and we were using... So Grok is... is that the scripting language? Is that what you're saying?
Yeah. So Grok is these named regular expressions. It's basically a regular expression, but it has named patterns, for email or timestamp or log level, for example. You can find them on GitHub or in the documentation; these are basically the patterns that everybody runs into. So you can write your own regular expression, but for 90% of the things you can just use patterns that already exist.
Yeah. Or what about Painless? What's Painless? Is that also scripting?
Yeah. So Painless is, first, a very unfortunate name. The backstory is that most data stores want to have some kind of scripting language. And in Elasticsearch, when it started, Groovy, as a dynamic Java or JVM-based language, was the tool of choice. The problem is that Groovy was a general-purpose programming language, and people found ways to escape from that sandbox and take over clusters, and we had a couple of bad security issues. That was, like, 2010 to '13, '14, whatever. So, old times. And then we made a strategic decision that we want to have a more specific scripting language, just for Elasticsearch, that runs on the JVM. It has some performance improvements, because it doesn't need to re-evaluate everything again. And it is built so that there is no way to escape the sandbox, or some concepts just don't exist. If you have written Java in the past, it kind of makes sense, or it will feel more familiar. If you have not written Java, it will feel potentially very foreign. The unfortunate thing is that we named it Painless. It's not because we thought that the language is so painless; it's because the guy who created it has chronic back pain, and his dream is to be pain-free, or painless. And so he called it Painless. And I get, once a month or so, a complaint that Painless is everything but painless. But it's a scripting language, a purpose-built scripting language for Elasticsearch, that you can use. But it is something that you need to learn. And while you might need it, unless you have to, I wouldn't. Also because it can get a bit tricky to debug; it's not a language that you can easily run outside, in your IDE. So just in terms of tooling and everything, it can be a bit tricky. We have made various improvements over time, and I think it's better now. But if you write 500 lines of Painless, you will probably, again, regret your decisions. And maybe that's also not what you should be doing in a data store.
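For a sense of what Painless looks like, here is the common scripted-update shape (a sketch with made-up index and field names; note this is exactly the frequently-updated-counter pattern from earlier, which Lucene's immutability makes expensive):

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# A scripted update: increment a counter field in place. Under the hood
# this still rewrites the whole document into a new immutable segment.
requests.post(f"{ES}/pages/_update/home", json={
    "script": {
        "lang": "painless",
        "source": "ctx._source.visits += params.count",
        "params": {"count": 1},
    }
})
```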
I remember, the last time I used Elasticsearch, someone had written, I believe it was Painless, some sort of scripting thing within a query, on a match. Which just means, now your index, it was a keyword index, but they were using, I think, Painless to match on it, which just blows your index, right? Because now you have to scan the whole index and run that function over it, which is true in a relational database or anything else.
I see that every now and then. It is not a lot of fun. Maybe it's great for your hardware vendor or cloud provider, but otherwise it is not recommended.
Yeah. Yeah. Okay. Okay, you mentioned Lucene, and I know Lucene is just amazingly powerful. Is Lucene doing just the actual text search indexing, or is it used for keyword indexing and other indexing types as well, or is it just the text part?
Everything. Lucene is the thing that stores any data on disk.
Sometimes I get asked, what is the database behind Elasticsearch? Is it MySQL? No, there's no MySQL in there. It's just Lucene. Lucene is the thing that analyzes your text, it creates the structures, it writes them to disk, it runs the actual searches, basically. Elasticsearch around it is basically the distribution part. It provides the query DSL, or the piped query language, ES|QL, and it does a lot of the work on top. But it all builds on Lucene, and all the features in Elasticsearch are basically built in Lucene. That's kind of the beauty. That's also why it sometimes takes a bit longer to implement something, because it needs to first land in Lucene, and only then can it be used in Elasticsearch. That's why the vector search, or kNN, the approximation search that uses the small worlds, took a while until it landed in Lucene. And then, normally, when there's a new major Lucene version, there's a new major Elasticsearch version. And once that came out, that was 8, but that was only, like, two years ago at this point. So that sometimes can take a while. On the other hand, Lucene is an established powerhouse, and the full-text search engine, or library, out there that we heavily build on to do all of that.
Is Elastic, and the Elastic team, are they, I assume, one of the main contributors to Lucene at this point? Do they dominate that?
I think just two hours ago it was announced that somebody on the team got onto the PMC. We have a lot of contributors, and yeah, Lucene is very strategic for us, and we drive a lot of the initiatives. But there are also many others involved in Lucene. That's kind of the power of Lucene, that it's not just one company; it's very broad, and then it's the foundation for many tools. But we do invest a lot. I don't have any statistics off the top of my head. There's also, we can, again, spin this any possible way: the number of pull requests, the lines of code, the complexity of something. I don't want to give any percentages here, or say, like, we're this big or not that big. But you will run into a lot of the Elastic colleagues if you do anything around Lucene.
Yep, yep. Okay.
Tell me a little bit about vectors,
and you talked about HNSW,
and I guess, I think of search and an inverted index as a much more compute-intensive version of regular database indexing, like normal B-tree lookup type stuff. An inverted index is more storage, more compute, to make that match. And then I think HNSW is another level above that. Am I thinking about that correctly?
I mean, so far, for classic keyword-based search, the big trick, I would say, is that you do a lot of work up front. So, for example, my go-to example is always from Star Wars: these are not the droids you're looking for. What would end up in Elasticsearch, or a search engine?
So when you run that through Elasticsearch, you will get: droid, you, look. And what you can see is, first we removed the stop words, we tokenized, we broke it out into the individual words, and we did stemming. So rather than looking, it's only look, the word root, because you normally don't care if it's singular or plural, or what inflection of a verb it is; you care about the concept. So you do that up front, and then you store all of those terms, alphabetically sorted. And then you basically point to all the documents where they appear. And then you know it's at that position in the document, so you could do highlighting. And you have the position, so you could do a phrase search. All of that is extracted basically at ingestion, or index, time. So your query is then pretty fast, because when your search query comes in, you also analyze the search query, normally with the same analyzer, and then you can just look for direct matches. You look for droid, you go down alphabetically to where droid is, then you look at which documents contain droid, and these are the documents that are being returned. And that's why it's so fast. But it does more work up front, and it creates those index structures to enable that. So that is extra work on top of that.
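You can watch this analysis happen with the _analyze API. A sketch, assuming a hypothetical local cluster; the exact tokens depend on the analyzer chosen, but per the discussion the line comes out roughly as droid / you / look:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# Run the Star Wars line through an analyzer to see the stored terms:
# stop words removed, everything lowercased, words stemmed to their roots.
resp = requests.post(f"{ES}/_analyze", json={
    "analyzer": "english",
    "text": "These are not the droids you're looking for",
})
print([t["token"] for t in resp.json()["tokens"]])
```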
And then what you do with vector search is, you basically have these arrays of floating point values that have a hundred or a thousand dimensions, and then afterwards you try to find something that has a similar vector representation. The trick, what the small worlds, or HNSW, gives you, is that you can do an approximation, or an approximate search, the approximate k-nearest neighbors; kNN is generally the keyword for that. Because if you have a million documents and you need to compare a million vectors, like, how similar are they, that is computationally very expensive and probably not going to work out. And HNSW provides you with a smarter data structure that is layered and basically puts you closer to where you want to be with each layer. And then you can find, in a relatively, or actually very, efficient way, the nearest neighbors in a vector space.
Yeah, okay. And so if I'm using an HNSW index, I'm naturally doing an approximation; like, I can't do an exact k-nearest neighbor search with an HNSW index. Is that right?
Yeah, that is an approximation. And by the way, you can do it without HNSW. So in previous versions, there was already a data structure that we call dense vector, which is this array of floats, but then you couldn't do an approximation, and that was expensive. So you could do vectors before, but the approximation, the approximate k-nearest neighbors, has only been there since 8, which has been two years now. And that is what you want if you have a large amount of data. If you have something like 10,000 documents or so, HNSW is not going to help you; you can just brute-force that. That will be faster, and you can skip creating the data structure. But for large collections of documents, you want to have this approximation, because otherwise you just spend a lot of time on doing vector comparisons.
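An approximate kNN search in the 8.x style looks roughly like this. A sketch: the field name, dimensions, and query vector are made up, and real embeddings would have hundreds of dimensions:

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster

# A dense_vector field with "index": true gets the HNSW structure built.
requests.put(f"{ES}/docs", json={
    "mappings": {"properties": {
        "embedding": {"type": "dense_vector", "dims": 3,
                      "index": True, "similarity": "cosine"}
    }}
})

# Approximate k-nearest neighbors: examine num_candidates entries per shard
# instead of comparing against every vector in the index.
resp = requests.post(f"{ES}/docs/_search", json={
    "knn": {
        "field": "embedding",
        "query_vector": [0.1, 0.2, 0.3],
        "k": 10,
        "num_candidates": 100,
    }
})
```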
Does HNSW update well? Like, if I have an update-heavy workload where my vectors are actually changing?
Now we're coming back to one tricky aspect of Lucene: that it's immutable. By the way, when I say immutable, how that works is, and you might have stumbled over the refresh rate at some point: we basically batch all writes together. By default it's every second, but you can change it if you want to have higher write throughput. So we batch all the operations in the last second together and create the so-called Lucene segment. That's at first only in memory, but it will eventually be written to disk. This is the data structure that is searchable. That's also why it takes up to a second to find documents: if you retrieve a document by ID, you can get it right away, but any kind of search, any multi-document operation, needs these segments, so it might take up to a second until that is created. You can block the write operation until the refresh happens, or you could force a refresh; don't do that too frequently, because it will create many tiny segments, and they need to be merged away later on, which is expensive. So there are ways around that.
But you have all of these segments. And eventually they get merged because each segment basically
contains a data structure on its own. And you don't want to search through thousands of them.
So you want to merge them into larger ones over time.
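A small sketch of that refresh behavior with the Python client (index name is made up; the one-second default is from the discussion above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

es.index(index="docs", id="1", document={"quote": "droids"})
es.get(index="docs", id="1")  # get-by-ID works immediately
es.search(index="docs", query={"match": {"quote": "droids"}})  # may miss it for up to ~1s

# Block this write until a refresh has made it searchable:
es.index(index="docs", id="2", document={"quote": "look"}, refresh="wait_for")

# Or widen the batching window, trading freshness for write throughput:
es.indices.put_settings(index="docs",
                        settings={"index": {"refresh_interval": "30s"}})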
One of the downsides, though we have made some improvements there, is that you cannot easily merge HNSW structures. You basically need to recreate the HNSW. There are optimizations for how we have sped that up, and we have also made concurrent searches across segments faster now, but it is a bit of a pain point that we're actively working on. One of the tricks we're doing now is we look at the different segments and figure out which is the largest one that doesn't have a lot of deletes, or doesn't have deletes in it at all, because then we can basically use that as a foundation, plop all the others on top of it, and we don't need to restart everything from scratch. So that's kind of the downside of HNSW, that it's not easy to merge existing structures together, but there are ways to make that better. So anybody who is saying, oh, this will always be slow in Lucene and this is a horrible approach: there are improvements coming. Just wait.
Yeah, we're still early on that stuff. So these segments that get written, are they like a combination of indexes, like the inverted indexes, but then also the full documents?
They have multiple data structures in them, depending on what you have. Lucene in general, even before vectors, had multiple data structures: there is the inverted index, but there is also the un-inverted index, because that's what you need for sorting, for example, or for aggregations. So there are different data structures and types of how the data is stored, and text is stored differently than a number or a geo-point. There are different data structures to help you out in terms of query efficiency, but also storage efficiency, and to allow you to do specific things and features with them.
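A sketch of how different field types imply different structures; the mapping and index name are hypothetical, and the comments paraphrase the discussion above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Each field type is backed by its own on-disk structures.
es.indices.create(index="products", mappings={"properties": {
    "description": {"type": "text"},       # inverted index for full-text search
    "category":    {"type": "keyword"},    # "un-inverted" doc values for sorting/aggregations
    "price":       {"type": "float"},      # numeric structure for ranges and aggregations
    "location":    {"type": "geo_point"},  # geo structure for distance queries
}})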
Yeah, gotcha. Okay. Last question I have before I move into
just sort of the random questions thing.
When Elastic is doing a query... When I think of a relational database query, maybe I have multiple filters in it, and based on statistics the database can say, okay, this one is going to cull it down the most, we'll use that one and go from there. Does Elastic have to perform all the searches, like all the inverted index lookups? Let's say I have a term match on, again, maybe a tenant ID. I only want to search for users within this given tenant, so I filter on this tenant, but I have thousands of tenants. Is it going to be easy to filter out all those other tenants, or is it just going to have to do the search and then intersect it with the tenants that match?
Luckily, Lucene has been around for so long that there are a lot of optimizations in it by now. Depending on what kind of filter you have, the filter can also be cached; that is one of the tricks that will make it very fast, but it can also generally limit the search space.
Also, there are some smart algorithms. One of them is called Block-Max WAND, which basically allows for early termination. For example, one trick there: if you search for three terms, and it figures out that one is very rare and normally gives you a very high score, and the other two are less relevant or no longer competitive, it can terminate some steps early. It will just do the most relevant work to get you the right results, and it knows when it can skip specific things.
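For the tenant example, a sketch of how that filter is typically expressed (index and field names are made up); clauses in filter context don't contribute to scoring and can be cached, so they cheaply narrow the search:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

es.search(index="users", query={
    "bool": {
        # Filter context: cacheable, non-scoring; narrows to one tenant
        # before the full-text part of the query does its work.
        "filter": [{"term": {"tenant_id": "tenant-123"}}],
        "must":   [{"match": {"name": "droid"}}],
    },
})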
Yeah, okay.
Interesting.
I guess one last question on Elasticsearch.
You mentioned 8 is the latest version, it has HNSW, all that. What does upgrading an Elasticsearch cluster look like? Is that a pretty straightforward operation?
Is it difficult?
How do those usually go?
It depends.
So there are multiple more or less short answers to that. Elasticsearch has for a long time operated on the principle that a major version means breaking changes, and there have always been some more or less, often unfortunately more, painful changes. I think as we have matured over time, things have improved a lot; there are way fewer breaking changes and it's not as complicated anymore. But there can always be breaking changes.
The one thing that is a bit of a complication is that Lucene in general can write its current version and read one version back, but it cannot read further back than that. So if you have data written a long time ago and you have upgraded more than one major version, we have an upgrade assistant built in, and it will shout and tell you: you need to delete the data, or re-index or rewrite it in the current version, to keep it accessible. And by the way, these are the five settings that you need to change. So there is an upgrade assistant.
Rolling upgrades, I think they came in version 6 or so; before that, upgrades meant a full cluster restart. That was 2016 or '17, whenever that feature came; it's so long ago I cannot remember the version anymore. But rolling upgrades have been a thing for a long time.
Most of the stuff you can just fix, either through a re-index or by changing some settings. If you, for example, had a cluster that was not using security and TLS, that probably will require a full restart: you put all the nodes on TLS and then restart the cluster.
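Coming back to the rolling-upgrade path mentioned above, a simplified sketch of the per-node dance (condensed from the documented procedure; exact steps vary by version):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Repeat for each node in the cluster, one at a time.
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.enable": "primaries"})  # pause shard reallocation
# ... shut down the node, upgrade the Elasticsearch package, start it again ...
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.enable": None})  # null resets to the default
es.cluster.health(wait_for_status="green")  # wait before touching the next node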
Historically, it wasn't always super easy.
I hope we've made it easier.
There are also, I would say, ways around that to make it easier.
So we have a cloud service.
We have a Kubernetes operator. So there is a lot more tooling nowadays around all of that to help you. But I don't want to deny that, especially early on, Elasticsearch upgrades were not always as fun as we would wish they were.
I think it's true that databases in general have really improved there, where, you know, 10 years ago that was not the case, and people were major versions behind a lot just because it was such a pain to do that sort of stuff.
So yeah, I think it's just gotten a lot better
in the last five, 10 years.
Hopefully. I mean, we still see a lot of old clusters. And I can understand: everybody has n tools they need to manage. It's sometimes also a testament that it just works, so you don't worry about it anymore, and then you forget about it. But then the upgrade at some point will be a major operation. So yeah, I do see old or even ancient clusters.
We of course always recommend the latest version; it's the most secure and by far the fastest and most stable. Or maybe, if it's a fresh .0 release with a lot of new features, the .1 release then. But that's generally the recommended version that you want to have. Otherwise you just miss out on a lot of performance improvements, security fixes, and bug fixes.
New features.
Yeah.
Yeah.
Very cool.
Okay, this has been great.
I feel like I know a lot more. Hopefully this will be helpful to others. We always end with some common questions I ask everyone, a little bit of rapid-fire type stuff. So first question: if you could master one skill that you don't have right now, what would it be?
So the one thing that I want to do, and this is very random maybe: since I go to a lot of meetups and conferences, I want to pick up sketchnoting.
Oh, yes.
Yes.
When someone's doing a talk and it's like, yes.
Yeah, exactly.
And then you put together this great overview because I think it's like very helpful.
And it's like, I love it when somebody does it for my talks.
And I don't see it much, but I think it's actually great if somebody really gets the
gist out of that into one picture, basically, and makes a nice picture.
And since I'm at so many events, I'm always like, I wish I could do that because it would be very cool.
Also, I guess, a takeaway for everybody from the talk.
So maybe this is an uncommon one, but that is the thing.
No, that's a great idea.
And it's hard to find someone that can do that, because generally you need at least a little bit of domain understanding and knowledge, since these are deep technical talks. And then finding someone that has the art and has the words, because they're also taking notes... Just being able to do all that stuff is a tough skill to find. But those are really nice.
Yeah, I really want to do it. I'm really not much of a drawer, so I'll see how bad it will look.
On the other hand, maybe it just doesn't have to be pretty. But as long as it's more or less
functional, it's still much better than nothing. So we'll see.
You should try it at a meetup or a conference this year. Give it a shot.
Yeah, that's a great answer. Number two: what wastes the most time in your day?
Yeah, I will always jokingly say that at this point I'm doing 80% management and, uh, 80% DevRel. Or sometimes I say that DevRel is almost a hobby at this point. No, but it's a good thing. I really love our products and what we do.
And the only way to scale it out is through others.
That's why I'm doing my best to build an amazing team
and move us forward through that
because otherwise I'm always limited to myself.
But it is taking a lot of my time, of course.
But that is an important step
to move us forward
and get us to all the developers out there
because a podcast like this is great
in terms of scaling,
but we need to be on more shows,
at more events,
answer more questions on Discuss and Slack,
write more blog posts, do more demos.
We just need to scale this out
and I'll try to lead by example through that.
Yeah, yeah.
Well, for those listening, Philipp's talking about doing this in his hobby time. It's like almost midnight. Is it 11:45 at your place?
Yeah, it's almost midnight.
I just came off a meetup.
Yeah.
Right after a meetup and he said he would be willing to do this podcast.
So it truly is.
That's amazing.
All right. Next question: if you could invest in one company that's not the company you work for, who would it be?
I think the mandatory choice right now is NVIDIA.
NVIDIA, yeah.
Because everybody needs those GPUs right now. I guess right now you're too late already anyway, but it feels like this is the top of where everybody wants to be.
And I think it's great to see all the excitement.
I think we'll see how long lasting all of that is and where it goes next.
But even though I feel like we're kind of like in a bit of a dry spell in IT to some
degree, it's great to see that there is movement and things are working and
there's a lot of innovation and excitement. And then I think NVIDIA is the perfect player in that
thing right now. Yep. Yep. Well, I made a mistake last week of saying I thought NVIDIA was over
valued, you know, because they like passed Google and Amazon in their total value. And then earnings
came out this week and they jumped like another 15 20 so
yeah and probably now it's too late so don't don't take financial advice uh from me um or
none of the listeners please um uh that that is not uh what i'm good at uh but i think it's a very
interesting story because i feel like at some point people were a bit like oh the bitcoin or
a blockchain hype is over like nobody needs those gpus anymore and
and suddenly we need even more gpus than before yeah i know it's amazing yeah they really set
themselves up well for that so uh okay what tour technology could you not live without
besides elastic search does it include our stack because i i use our stack really every day um yeah i mean
Oh, that's a very tricky one. I'm very torn. So I have a shared background between ops and dev, and for ops I feel like it's Linux in general, and maybe automation like Terraform. And on the development side, I just did a lot of Java in the past. I think it's another surprisingly vibrant and resilient ecosystem that's doing a lot of things and has also gotten past a lot of its problems. So I'm not one of the Python people; I'm still one of the Java people. And we have LangChain4j and Spring AI and other things, so there's also movement there.
It's also fascinating because it's a very different take than the Jupyter-notebook world, for example. When you look at a LangChain4j example, it often includes tests, integration tests with Testcontainers and Elasticsearch, for example; I saw something like that yesterday. I feel like the Jupyter-notebook world is very different: it's much more about experiments, and it often doesn't have automated tests. And I'm just coming more from production. I really like production, because that's where the pressure is, where things happen. So I like that world.
Yeah, cool. Okay, which person influenced you the most in your career?
I feel like there were a lot of people that influenced us. I don't want to make this too much of a sob story or anything. No, I want to name Shay. He started Elasticsearch, I think from his bedroom, more or less. And then he was the CTO, and the CEO, and now he's the CTO again. And, A, because he found an interesting problem. By the way, do you know the backstory of why he created Elasticsearch? Maybe we can quickly bring this in here.
I do, but tell it.
Yeah, I thought this is a good one. Okay, so his
wife was becoming a chef, and she had a lot of recipes, and he wanted to write a recipe search. And he didn't find a good search system.
So he started building a search system.
Legend has it that she is still waiting for that recipe search because he got slightly distracted on the way.
And Elasticsearch is kind of like the third implementation because there was Compass 1, Compass 2.
And it wasn't Compass 3, but Elasticsearch.
And 3 kind of like is the magic number, maybe.
I don't know.
And that one stuck around.
But that's where he got on the tech side.
And at some point, he took over more and more of the company, and he didn't have time, or I don't want to say he wasn't allowed to write code anymore.
That would maybe give the wrong impression.
And I really liked his answer when he was asked, like, do you miss writing code?
And he's like, well, I'm mostly about solving problems.
And at first the problem was to write the code, but now the problem is something else.
And he has been there from the start and he's still there and pushing us.
And I think that has been influential.
And I remember back when I joined the company, I had an interview with him. I remember it was very interesting because, I mean, he's Jewish, and I think he was in Israel at the time, and he scheduled the interview on the 25th of December at 9 a.m. or something like that. And I was like, well, he has kids, and if he can do that, I guess I can do this as well. So I agreed to that. And then, I think the evening before, he sent me an email and said, like, oh crap, it's Christmas, should we reschedule this? And then we rescheduled, and then we did the interview, and I was like, man, it's so hard to understand him, and it's so noisy. And at the end he's like, well, this was great, because my wife is driving and the three kids are in the back and I'm in the car. And that's where we did the interview. And since I survived that, I was like, okay, well, this all seems very manageable.
So, yeah, I think that's the person I pick for that.
That's great.
Yep.
I've seen him a little bit around and, yeah, just seems like a great guy.
And then, obviously, Elastic: great product, great company.
And he sometimes has, at least in the old days he had, a picture where he looked very mean, like he had a shaved head, and he looks very mean or threatening. He's not that mean; it might just look like that, don't get the wrong impression. But at my previous company, where I worked before, we were using Compass 2 already, so everybody knew about him, and we were always saying, like, oh yeah, the search guy, the mean-looking search guy. And I got to know the mean-looking search guy a bit better.
That's great.
Yeah, very cool.
All right, last question.
What is your probability
that AI equals doom for the human race?
I'm always slightly pained here. I think there are so many steps on the path to
humanity's doom at this point. I'm not sure AI is in the top three right now.
Is it the last one? I'm not sure. Maybe it works its way up.
On the other hand, I think where AI is doom, the doom is kind of for the information on the internet, because of the randomly generated noise in poor quality all over the place. I think that it's not doom in itself, but it is a big pain point.
It also does, I think, to some degree,
destroy some established communities.
Like I mentioned Stack Overflow before,
and I think that traffic took a big hit,
but you can also see it in other places.
And I think it makes sense, because why write the question on Stack Overflow and then either, A, wait half a day for somebody to answer, or, B, be told, oh, this is a bad question, or you shouldn't do this, when you can just ask any random thing to a chatbot and get an answer right away? And oftentimes it's right, and ideally you try it out and figure out if it is really true or not. So I can see the appeal of that, but it does change a lot of the ecosystem, and it has a big impact on how people learn and also where people exchange ideas and get to know each other. And I guess every win... AI is clearly adding value and working for a lot of things. We have products like AI assistants that are doing a great job and are really helping make things easier and faster. But they do come at a cost.
Yeah, yeah, yeah, absolutely.
Yeah, I agree.
It's a weird world we're going into.
But like with anything else, I think it's not replacing the pain or bringing new pain; it's just exchanging the pain for something else. It's solving something and adding something else, and it's always a combination. It's like "it depends": I think very few things have a clear-cut, 100% answer, like it's this or that, it's good or it's bad. The reality is more complex. On the other hand, I always say that complexity is great because it keeps all of us employed, for now, and entertained and busy, and we can learn things. So the complexity is to some degree a feature.
Yeah, absolutely. I love it. Well, hey, I appreciate you staying up this late and doing this and explaining Elasticsearch.
Thanks for having me. I hope we could reduce the complexity or confusion around Elasticsearch.
Definitely. Yeah. Yep. This is great. This is truly great. I love it. I think it'll be helpful for a lot of people.
If people want to find out more about you or Elastic, I guess, where should they go to find you?
I mean, Elastic is elastic.co, not .com, but .co. And then all the social media and whatever. Me personally, my Twitter handle is, and maybe we can put it in the show notes, xeraa. If you wonder why that is my Twitter handle, it's the ROT13 of my last name, if people still remember what ROT13 was. It's when you rotate the letters by 13. The nice thing about ROT13 is that encryption and decryption are the same: if you rotate by 13 again, you're back at the original. So if you take my last name and rotate the letters by 13, this is the handle that I try to use everywhere. I learned about that, I think, when I was in school, and I was on a tram and I basically did the ROT13 calculation on a piece of paper. And since then, I have been sticking with it.
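A quick sketch of that round trip in Python; the surname and handle are from the conversation:

import codecs

# ROT13 rotates each letter by 13; applying it twice is a no-op,
# so "encryption" and "decryption" are the same operation.
print(codecs.encode("krenn", "rot13"))  # -> xeraa
print(codecs.encode("xeraa", "rot13"))  # -> krenn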
How do you pronounce it?
Yeah, I say Xeagah, but there is no official pronunciation.
I didn't put it in any urban dictionary or anything yet. I have an explainer on my website.
I saw that. Yeah. Yeah. I haven't added the pronunciation there.
Okay. There you go. Yeah. So awesome. Well, Philipp, it was great to have you. I really appreciate it.
And thanks for coming on. Thanks for having me.