The Changelog: Software Development, Open Source - Elasticsearch and doubling down on "open" (Interview)

Starting point is 00:00:00 Bandwidth for Changelog is provided by Fastly. Learn more at fastly.com. We move fast and fix things here at Changelog because of Rollbar. Check them out at rollbar.com. And we're hosted on Linode servers. Head to linode.com slash changelog. This episode is brought to you by Rollbar. Move fast and fix things. Resolve errors in minutes and deploy with confidence. Head to rollbar.com slash changelog.

Starting point is 00:00:26 Request a demo. Get started today. It's loved by developers, trusted by enterprises, and most of all, we use it here at Changelog. Move fast and fix things with Rollbar. Once again, rollbar.com slash changelog. Thank you. Welcome to Philipp Krenn. We're talking about Elasticsearch and the problem it solves, where it came from, and where it's at today. We discussed the query language, what it can be compared to, and whether it's a database replacement or a database complement. Elasticsearch versus Elastic, the company, which is a very big company, by the way. We also talked about the details behind Elastic's plan for doubling down on open to open up X-Pack, which is their open code paid add-on features to Elasticsearch,

Starting point is 00:01:27 the implications of this on their business model, and what changes will take place at the code and the license level on GitHub. So we're here to talk about Elasticsearch. And I don't know about you, Adam, but I've got to claim a little bit of ignorance on Elasticsearch. And I'm guessing you as well, because I've never touched the thing. I've heard some hand waving on the Internet. I'm very conservative in my data stores and my search engines. So I haven't actually played with it, but I'm excited to learn about it. We have Philipp Krenn here to tell us all about it.

Starting point is 00:02:11 So Philipp, let's start with Elasticsearch, what it is, where it came from, what problems it's solving, and then we'll get into where it's at today and where it's going. Right, so the base library we're using is Apache Lucene, but that's not really the story we normally try to tell. There is a kind of cute or interesting story around it. Our current CEO, Shay, started Elasticsearch back in the day, and the first iteration wasn't even called Elasticsearch. It was called Compass, and Compass was kind of like the tool for his wife to search her recipes, because she wanted to be a chef and she had a ton of recipes she needed to search. And he started building a system to make that possible. She is, by the way, still waiting for that recipe search solution because he kind of over-engineered that.

Starting point is 00:02:59 And so he built Compass 1, and then he found out, well, that's kind of like a dead end. And then he redid the entire thing, and it was Compass 2. And then a third iteration, which is kind of the lucky number, obviously. It wasn't called Compass 3; that one he called Elasticsearch. And that was back in 2010 when he first released that. That's kind of how it all got started. And his idea was that search should be kind of a ubiquitous solution, that it needs to scale, that it should be simple to use.

Starting point is 00:03:31 And that's kind of like where Elasticsearch started from. It was scalable right from the beginning, and it had an easy-to-use REST API, and it should just work. And that was kind of like the promise or the start where it all began. Gotcha. So Apache Lucene, I think it's a Java project. Started all the way back in like the 90s, right? Like late 90s, early 2000s.

Starting point is 00:03:54 How does that fit into the Elasticsearch story? Lucene is kind of an incredible piece of work. A lot of work has gone into it already and it's very mature. Calling it the de facto search solution that everybody is using, or the standard, is maybe a bit of an overstatement, but it is the most commonly used base library for full-text search. The problem is it's really just a library. So yes, it's written in Java, and you could include that

Starting point is 00:04:23 in your own Java application, but it's really a library, and you just have to call it very explicitly. And the API is not the most user-friendly or nice to get started with. So that's not really what you want to do. It's a bit bare-bones, but it has all the necessary pieces. And what Elasticsearch then did around that: it does the distribution and the replication of your data, and it provides a query DSL and a nice REST API to the outside. Yeah, so as somebody who's not a Java developer, with Elasticsearch, it's also Java, but you don't have to care about that because it's a REST-based API that any client library can speak to without having to embed Java into your application. Totally. So yes, since Lucene is based on Java, Elasticsearch is Java as well. But Shay already saw

Starting point is 00:05:12 that initially he had the entire system, Compass, very tightly coupled to the Java ecosystem, but he saw that that is not really what people want, and if you just bind yourself to one ecosystem, it's very limiting in the long run. So with a nice REST API, and then we have drivers or clients for all the major programming languages, it's much easier to get started and have that base system that everybody can use. And then everybody can just build whatever they want, and we really don't care what your programming language is. Whatever makes sense for your product or project, that is fine by us. We're just trying to provide the right client, and then you build awesome stuff with it. So he set out to build a recipe search and he ended up building quite a large company called

Starting point is 00:05:57 Elastic, which is where you work. Tell us about Elastic versus, or and, Elasticsearch. Give us the lines between the open source project and the company, and how all that shakes out. Right. So initially it was Shay, and I always imagined him sitting in his bedroom coding day and night. And at my job before Elastic, we were already using Elasticsearch and we were always curious how that one guy

Starting point is 00:06:25 could produce so much code and he was like answering all the issues and writing the documentation and still coding so much every day. And at some point later on in 2012, he joined forces with three other guys to start a company. And back then, since the product was Elasticsearch, the company was also called Elasticsearch. Since we have then added a few more products along the way, we had to rename the company at some point since, well, it was not only about Elasticsearch anymore. Even though Elasticsearch is still kind of the core of everything and everything else is built around that and around search kind of. But the company is now, yeah, I think we were about 820 or something like that, though it's changing pretty much every day by now. And we've kind of built the various other tools around it.

Starting point is 00:07:15 So people might be familiar with the ELK stack: Elasticsearch, Logstash, Kibana. Logstash is the thing to get data and transform it and then put it either into Elasticsearch or some other system. And then Kibana for the visualization part. We always say we want to democratize data. Basically, you have a nice browser-based tool where you can just explore your data, build dashboards, and just see what you have there. Later on we even added Beats, which are lightweight agents, forwarders, shippers, whatever you want to call them, written in Go, to collect log files or system metrics or

Starting point is 00:07:52 ping systems. And that's when we renamed the entire thing again, away from the ELK stack, and we're now trying to call it the Elastic Stack, since, well, our products are always about being scalable. First we tried to call it BELK or ELKB, because, well, you know, Elasticsearch, Logstash, Kibana plus Beats. So out of those four letters, the only thing we could make up was BELK or ELKB, and we even had a logo for that. There was the B with the elk horns, which was a cute idea. But since we're always about scaling, we figured out this is not really that scalable because if we add any other open source products,

Starting point is 00:08:33 we would need to redo the entire branding again and making up new animals, which whatever letter we would get afterwards, it would not get any easier. Typical naming. So yeah, now we're trying to do or call it the Elastic Stack. And internally, every time we see when somebody is doing a meetup or some other event and calls it Elk, we raise the internal Elk alert. And somebody will reach out and say, hey, this is super cool,

Starting point is 00:08:57 but we try to call this thing now Elastic Stack. But ELK alert is pretty interesting, because we always get called Change Log or Change capital-L Log, all sorts of formations of it, and we need a changelog thing, Jerod. We need to do this. Like an actual log? Yeah, something that logs the fact that people are saying it incorrectly. I love that. But naming, geez. So for Logstash, the original logo was actually a wooden log, and people found it super cute, though now everything is just letters. And yeah, at some point as you grow as a company, the cuteness has to take a step back,

Starting point is 00:09:39 I guess, and you need to grow up a little and try to be more professional. Elastic, the company, supports Elasticsearch and these other services as well. Is the model basically you're hosting around the infrastructure, or is there also like an open core thing? How does it break out in terms of the open source projects, I realize they're plural now, versus the proprietary stuff? Building a sizable company is kind of a challenge if you're an open source company. We're actually trying to do a bit of everything. So we provide Elasticsearch and Kibana as a service, which we call Elastic Cloud.

Starting point is 00:10:22 But we also have this open core model where you get the core features as open source and you can just do whatever you want. It's Apache 2 licensed, go crazy, do whatever you want. But we do have some commercial plugins around that. We don't have a special commercial version like some of our competitors or other vendors in the database space have, where they have a community version and an enterprise version. We don't really believe in that model. It's really plugins that you plug into that core system. So even the paying customers are using the same open source base, but you just add some functionality on top of that. One thing that I said: I've read some hand waving. Most

Starting point is 00:11:05 people, when I see Elasticsearch come up, it'll be somewhere along the lines of, hey, try Elasticsearch. And then the person will say, well, I don't really need advanced search, or I don't need that much for my search, which maybe Adam's heard me say to him sometimes. And then they'll say, Elasticsearch is not just for search. And then they go into it, and that's why I say the hand starts waving. And I'm sure they provide ample evidence for that, but I usually close the tab. Does that ring true for you? Is Elasticsearch supposed to? They begin evangelizing and I duck out.

Starting point is 00:11:40 Is Elasticsearch more than just for search? Is it a full-on database? What's the core use case that it really slays at? Yeah, I'm very careful about the term database because people have a very specific expectation of what a database does. And I'm not sure we're 100% that since we're first and foremost, we're a search platform. But we kind of want to be the data platform for lots of different use cases. So we started off with the full text search use case, but then we found these other use cases. And we always think about it that everything else that we add around it is also a search problem.

Starting point is 00:12:15 So for example, logs, which is kind of one of the most common use cases for us, storing the logs itself is not that helpful. What you actually want to do is you want to search them in the end again and find what is going on. And we are extending that further, and we are doing metrics by now, and we are doing more and more in the security space, and we are also adding, or we always say we add to the family. We are adding more companies and features and products to the family. So we have a machine

Starting point is 00:12:46 learning component now, and we're trying to do application performance monitoring, the APM space, as well, and adding that to the platform. So we're trying to broaden out. We're also doing search as a service now. So we have been adding more and more companies around that and trying to get from these core functionalities more into the solution space. Because some people are a bit overwhelmed when you just give them the options and say, here you have this building block, and then you can build pretty much anything you want with that. But some are more like, okay, I need a solution for this exact problem. And we're going more and more in the direction of adding these solutions. So if you just need search for

Starting point is 00:13:31 your website, for example, we want to provide you a solution to do that. You can totally build it yourself with the open source tools, but we also try to give you more of a solution just to get to the result quicker. Or you want to build a logging platform, and you can totally build that yourself, but we're trying to get you started in a quicker way. So we always have these building blocks, and Elasticsearch is, I would still say, the centerpiece that everything else is built around. But we're trying to give you more solutions now; we try to help you with the heavy lifting. It actually reminds me of something, and I'm not sure if you remember this conversation, but back on Go Time 48, Alexander Neumann (I can't recall how his name is pronounced) was

Starting point is 00:14:17 talking about restic which is his backup solution and he said something really poignant during that episode he said nobody nobody wants backups. Everybody wants restore. And he got some pushback on that, but I thought it was so insightful because backups are actually a pain in the butt, and they don't do you. They're not the end game, right? They're just like an artifact that you have to deal with. And if they can't restore, they're worthless. So what you really are after is the restore.

Starting point is 00:14:43 And you said something there, Philipp, which made me think of that with regard to logging, and collecting the logs and having them and storing them. It's like nobody really wants logs, right? Nobody wants this stuff. What we want is answers, right? Even with search, search is a means to an end. We're looking for insights. We're looking for that piece of data that we remember. And so it seems like what you're trying to do is build around that, like you said, these solutions, right?

Starting point is 00:15:14 Like give us the solutions, not necessarily the tools. We're happy to cater for both because we have people in the open source space who say like, oh, it's awesome. I want just this building block and then I can take it wherever I want. And then there are others who say like, oh, it's awesome. I want just this building block and then I can take it wherever I want. And then there are others who are like, oh, I have this business need and I just want to get to the solution quickly. And we're happy to help both of them.

Starting point is 00:15:34 Because, well, we are an open source company and we will try to always or we are doing our open source work and you can just build anything you want around that. But then again, we try to broaden that out into the solution space. It makes sense to going back to what you said with the fact that you're growing, which we haven't really talked much about the company size. Not that we have to go too deep on it, but from what I understand, you've got a pretty large company and your model is build open source tools, or at least it seems you can tell me if this is true or not build open source tools that you can give freely out there but at the same time

Starting point is 00:16:10 you're about solutions. So you take these open source tools that Jerod or I or anybody else can freely grab, contribute to, and use to build our own solutions, but you've gone ahead and, as a company, as a mission, as a business model, built solutions around your open source as paid-for services to sustain yourselves and grow? Well, not only paid services. Some of these solutions are also in the open source space. Really? So you can run them yourself.

Starting point is 00:16:36 So for example, the APM company that we acquired, the base components for that are all in the open source space. Also because we kind of saw an opportunity there that they're like in the APM space, there is not that much open source. There are not that many open source solutions that you can use today. But we think for us as a data platform,

Starting point is 00:16:56 it makes a lot of sense to not only have logs and metrics, but also cover more things like the tracing or APM functionality there. So we're trying to extend that. But of course, if you don't want to host it yourself, we're happy to host it for you and provide it as a service. Or we have some more features around the entire thing that you might be interested in as an enterprise

Starting point is 00:17:20 and you want to get our open core features or you also want support. But we're always packaging support and the plugins that we have together. This episode is brought to you by DigitalOcean. DigitalOcean is a cloud computing platform built with simplicity at the forefront. So managing infrastructure is easy. Whether you're a business running one single virtual machine or 10,000, DigitalOcean gets out of your way so teams can build, deploy, and scale cloud apps faster and more efficiently. Join the ranks of Docker, GitLab, Slack, HashiCorp, WeWork, Fastly, and more.

Starting point is 00:18:11 Enjoy simple, predictable pricing. Sign up, deploy your app in seconds. Head to do.co/changelog. And our listeners get a free $100 credit to spend in your first 60 days. Try it free. Once again, head to do.co/changelog. So, Philipp, when I said as a database, you were very careful around that word, and you said that it's very much a search platform. Perhaps you could say it's a better complement to a data store

Starting point is 00:19:00 or an additional data store that you have in your application. I'd like to take a small look at Elasticsearch from a micro perspective of an application, perhaps similar to changelog.com, which is a relational database on Postgres that has some search functionality that's just using Postgres' full-text search, and how Elasticsearch would fit into that equation and really be a good complement, and how it would do better

Starting point is 00:19:31 at the search side of Postgres, but then do worse at maybe the ACID side or the relational side of Postgres. So with Postgres and the full-text search features in Postgres, it's kind of an interesting approach, because Postgres is first and foremost a relational database. And then they have added more and more full-text features around that, just because they saw that, well, people need to search at some point. And that's fine. It's just that at the core of Postgres, there is still the relational database,

Starting point is 00:20:05 whereas Elasticsearch for the search use case is really built on having as many features and being as scalable around search as possible. And it's not just an afterthought as with other products where they have like some full-text search capabilities, which is often like, I'm not saying this is Postgres in specific, but like on some products,

Starting point is 00:20:24 we have the feeling that it's kind of like this checkbox where you say, oh, we do full-text search as well. And then when you press further, it's like, ah, yeah, we're doing this one or two things. But if you really want to take advantage of it, then it's not going to help you that much. What Elasticsearch does is, whenever you store some text, we have this analysis pipeline. So, for example, we know something is an English text. And for an English text, you have some rules for what makes sense to search and what doesn't. For example, you do something like stemming. Stemming basically means you cut off the ending of a lot of words. English

Starting point is 00:21:00 is a very simple language in that regard. You cut off the ending because you don't really care if something is a singular or plural. You're just interested in the concept, and you're not concerned with the specific form of a verb. Then you're normally kicking out stuff like stop words, which are very common words that appear in nearly every sentence or text but add very little meaning, because "and", "or", or an article would be in nearly every sentence and doesn't add any value. So that is what full-text search does. And Elasticsearch is kind of elaborate in that area. So we support a lot of languages. We support a lot of features to

Starting point is 00:21:42 refine your search. And that is where kind of the benefit of full-text search would come in normally. Yeah, I think that's where I'm driving at. Can you enumerate those additional features that you're going to get by complementing your relational database with an Elasticsearch platform? Like what additional things is it going to give you in terms of search relevance? What search is generally giving you, I'm always comparing it that databases are very much black or white. You're searching for something and then you get a hit or you don't get a hit. Whereas search is much more shades of gray. It's more like how relevant is that to what

Starting point is 00:22:23 I have entered? And it is normally a number that is being calculated in the background. I'm not sure how deep you want to dive into that, but there are multiple factors that play into calculating that relevancy. For example, so the one sentence I'm always using is from Star Wars, these are not the droids you're looking for. Let me see your identification. You don't need to see his identification. We don't need to see his identification. These aren't the droids you're looking for. These aren't the droids we're looking for. Move along. Move along.

Starting point is 00:22:59 So if you store in Elasticsearch the sentence "These are not the droids you're looking for", after removing the stop words and stemming, what remains is droid, you, look, because these are the three main concepts that might stick out or that people might be searching for. So "these are not" are all irrelevant, even the "not". Full-text search doesn't generally understand what you're saying, like if this is positive or negative or what this is. It's just matching on these terms. And droid, you, look are the three terms that would remain when you do the search. Depending on the sentence, you will have more or fewer stop words.
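A minimal sketch of the analysis steps described here (lowercasing, stop word removal, stemming), written in plain Python rather than with Elasticsearch itself. The stop word list and the suffix-stripping rules are simplified assumptions for illustration; real analyzers use much richer rules.

```python
import re

# Illustrative stop word list (an assumption); real analyzers ship their own lists.
STOP_WORDS = {"a", "an", "and", "are", "for", "is", "not", "or", "the", "these", "this", "to"}

def stem(token: str) -> str:
    """Very naive suffix stripping; real stemmers apply many more rules."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text: str) -> list[str]:
    """Lowercase, tokenize, drop contraction endings and stop words, then stem."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    tokens = [t.split("'")[0] for t in tokens]  # "you're" -> "you"
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("These are not the droids you're looking for"))
# ['droid', 'you', 'look']
```

Only the three terms mentioned in the conversation survive, and the "not" is dropped like any other stop word.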

Starting point is 00:23:41 And we will extract these base concepts. And then, since we're just storing this stemmed version of the concepts that you have, the lookup afterwards is very fast, because whatever you're searching for, if you search for droid or droids, it doesn't really matter. The term you're searching for runs through the same pipeline: the stop words are removed, we're doing the stemming, and then we can just go on the direct matches. And then you can see, oh, we are searching for droid, and this sentence contains droid.
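The point that the query runs through the same pipeline as the stored text can be sketched with a toy inverted index. This is a plain-Python illustration under simplifying assumptions (a tiny stop word list, plural stripping only, hypothetical document ids), not how Lucene stores its index.

```python
from collections import defaultdict

# Tiny illustrative stop word list (an assumption).
STOP_WORDS = {"a", "are", "for", "is", "not", "the", "these"}

def analyze(text):
    """Toy pipeline: lowercase, split on whitespace, drop stop words, strip a plural 's'."""
    terms = []
    for raw in text.lower().split():
        token = raw.strip(".,!?")
        if not token or token in STOP_WORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms

def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "These are not the droids you're looking for",
    2: "Move along",
}
index = build_index(docs)
print(search(index, "droid"), search(index, "Droids"))
# {1} {1}
```

Because both sides are reduced to the same stemmed term, the lookup is a direct dictionary hit instead of a scan over every document.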

Starting point is 00:24:11 Then we're doing the calculation of how relevant the specific text is. For example, if a text contains droid multiple times, that is probably more relevant for your droid search than if the term droid only appears once in the sentence. And then we're assuming, okay, droid is a relevant concept. We give a specific weight to that. And then we will also take into consideration how long a specific element is.
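One common way to combine these two signals, how often a term appears and how long the field is, is the BM25 formula. The sketch below uses the textbook form with the usual default parameters `k1 = 1.2` and `b = 0.75`; the function name and the example numbers are assumptions for illustration, and Lucene's actual implementation differs in details.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Score one term in one field using the textbook BM25 formulation.

    tf             -- how often the term occurs in this field
    doc_len        -- length of this field in terms
    avg_doc_len    -- average field length across the index
    n_docs         -- number of documents in the index
    docs_with_term -- number of documents containing the term
    """
    # Rare terms get a higher weight than terms found in almost every document.
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average fields are penalized, shorter ones are boosted.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Hypothetical index statistics, made up for this example.
stats = dict(avg_doc_len=100, n_docs=1000, docs_with_term=10)

once = bm25_term_score(tf=1, doc_len=100, **stats)
thrice = bm25_term_score(tf=3, doc_len=100, **stats)
in_title = bm25_term_score(tf=1, doc_len=5, **stats)
in_body = bm25_term_score(tf=1, doc_len=500, **stats)

print(thrice > once)       # more occurrences rank higher
print(in_title > in_body)  # a hit in a short field ranks higher
```

`k1` controls how quickly repeated occurrences stop adding to the score, and `b` controls how strongly field length is taken into account.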

Starting point is 00:24:37 So, for example, if your search term is appearing in a title, and titles are normally very short, that is much more relevant than if it's just appearing in the text body, because that is much longer. And the base concept that is being applied there in the background, which I've tried to describe here, is called TF-IDF, term frequency-inverse document frequency, which calculates this relevancy. The algorithm has been slightly refined by now. It's called

Starting point is 00:25:05 Best Match 25, BM25. So it's the 25th iteration of a best match algorithm. And this one is slightly better now. And this is what is doing the heavy lifting behind the scenes for your search. And if you compare that to the classical LIKE search that a lot of people are probably still doing in the relational database: A, you will have a hard time because this doesn't support anything like stemming. This also doesn't support anything like fuzzy search. This doesn't support synonyms and lots of other concepts. And if you have the wildcard in the beginning, so if you're doing LIKE with a percent sign, whatever term you have, and another percent sign,

Starting point is 00:25:48 you cannot even use an index, so your search will always be very slow because you're basically going through all the entries. Since you have the wildcard at the beginning, you cannot use the index because you don't even know where to start. You need to basically go through all the entries. Whereas full-text search just extracts the right terms,

Starting point is 00:26:05 and then you basically check where these terms are, in which documents I have appearances of these terms that I'm trying to find. And these different facets, I'm just thinking of it like an equation, like this factor plus that factor equals relevance, rank, or some sort of scoring. Is all that stuff tweakable, customizable, either at Elasticsearch configuration time or maybe even at query time, with regard to how

Starting point is 00:26:32 you get your results back? There are a lot of tweaks that you can apply. One, you can tweak some parameters in the search, but a lot of the functionality is also in the way you store the data. For example, if you resolve synonyms at index time, that is an index-time feature, or you could also do that at query time, where you say these five terms are equal, and if the user is using any one of them, I want to find all the other four as well, or all the places where these synonyms are appearing. And you can build quite complex queries. We have a proper query DSL that is giving you lots of power, where you can say this must appear, this must not appear,

Starting point is 00:27:14 this term should appear, or at least two of these three "should" terms should appear. Or you can say, I'm looking for either one of these terms, or if you have them as a phrase or in combination, like first one of them followed by the other, then it should be ranked higher. So you have a lot of ways to actually tweak that. I suppose the underlying BM25 algorithm itself is not tweakable because, you know, after 25 tries, they are probably doing better than I could if I went in there. You can still slightly tweak it.
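The must, must not, and should combinations described here map onto a `bool` query in the Elasticsearch query DSL. The sketch below only builds the JSON request body in Python and does not contact a cluster; the field names (`title`, `body`) and the search terms are hypothetical.

```python
import json

# Request body for a search; field names and terms are made up for illustration.
query = {
    "query": {
        "bool": {
            # This must appear.
            "must": [{"match": {"title": "droid"}}],
            # This must not appear.
            "must_not": [{"match": {"title": "clone"}}],
            # These should appear, and at least two of the three have to match.
            "should": [
                {"match": {"body": "tatooine"}},
                {"match": {"body": "stormtrooper"}},
                {"match": {"body": "jedi"}},
            ],
            "minimum_should_match": 2,
        }
    }
}

print(json.dumps(query, indent=2))
```

A body like this would typically be sent to an index's `_search` endpoint through one of the language clients or plain HTTP.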

Starting point is 00:27:48 Whether that is improving your search a lot is very much up to you or up to your use case. We always like to say it depends. Whatever you're doing there, it depends on what exactly you want to achieve. I would just start with the basics and try to expand from there and not overthink it from the start. Otherwise, it can get a bit complicated. How good is full-text search in Postgres, Jerod? Since we're asking him on the Elastic side how it compares, what are some of the things that you know about Postgres and its full-text search that we like or dislike in terms of indexing or

Starting point is 00:28:26 being able to, you know, query at index time or different things, or being able to create indexes and all that stuff? Yeah, so you can do full-text-search-specific indexes in Postgres that allow it to not do full scans on, you know, specific, and does fuzzy search and stuff like that. But you can't, I don't know. Maybe you can do more than just that, but you can't do all of these different relevance facets and stuff that he's talking about, as far as I know. It's a specialized thing. Yeah. And, you know, Postgres' full-text search is better than other RDBMSs, reputationally, as being slightly better than a LIKE query. And so it gets you a little further.

Starting point is 00:29:12 And in many cases, you know, for small data sets and small uses, like if you're not searching very often, it's fine. But in many cases, like you said, you kind of know when you outgrow it, I think. And probably we're at a point now, Adam, where we're just getting to the edge. I know we have a user story in our Trello board about search and some different ways that it should be matching, which it's not. And maybe I could stretch our current implementation to work that way. But at a certain point, especially as our data set grows, it's just going to become less relevant over time. And we'll probably end up reaching for something like Elasticsearch when that makes sense. Yeah, because it seems like things like plurals, which, Philipp, it sounded like that's something that's just baked right into Elasticsearch,

Starting point is 00:29:59 where pluralization of nouns or different things, different terms, that comes for free. You don't have to be an exact match. I find that a lot of times I don't find something because I haven't searched precisely enough, where it should be a little bit more forgiving to the user. Yeah, and once you start growing, you probably need to scale past what Postgres can give you. So for example, if you're searching on Wikipedia, Stack Overflow, or GitHub, behind that search box there is always Elasticsearch doing the hard work for you, well hidden behind the scenes. I was just

Starting point is 00:30:31 trying to quickly Google the actual feature list on Postgres, and we're just picking on it because it's what we use. Postgres is actually pretty feature-rich. Yeah, being pretty good for an RDBMS. It does do stemming. It does do ranking. Supports multiple languages. Has fuzzy search. So, I mean, it can take you a ways. And like I said, I've never used Elasticsearch. I've never used a search engine, like a thing that's built for search, for any of my client work or for changelog.com, because my data sets are small and my search needs are usually very trivial. And so

Starting point is 00:31:06 that's why I said I'm kind of claiming ignorance on this, because this is an area that I've never had to move into. I'm examining it. Yeah, I very much feel like you know it when you need it. And once you hit the wall, you will feel it. You're kind of like, okay, we need something, because, you know, these results are getting less and less relevant all the time. And the other thing is that once you have Elasticsearch for one use case, there are all these other use cases where it comes in handy. So we are trying to give you a broader tool to cover a lot of bases. Can you give some examples of, like, once you're using it, it can also do X, Y, or Z? Well, once you're using it for search, then probably some analytics use cases come along. Like you have whatever kind of data

Starting point is 00:31:51 your company is having or what you're trying to do. Um, especially in combination with Kibana, um, you can then just store all of the data and build fancy dashboards by just clicking a few buttons, basically. Um, or you have logs for who is visiting your website. I don't know what your architecture is in the background, but if you have like an Apache or NGINX or something, you might want to collect those log files and just see, like, who is visiting our site, which IP addresses, which we can then translate to a region

Starting point is 00:32:21 and do the GeoIP lookup. Or what errors do we have? How many 404s, how many 500s? If we change anything on the website, like, who has changed what, and why are we suddenly getting more 404s? What is up with our system? And you could add metrics, for example, either business metrics, like how many people are coming to our website and how much time are they spending, but it could also be metrics like, okay, CPU and memory usage, or if you're using Docker or Kubernetes or whatever system basically you have. We're very good at collecting a lot of metrics for that.
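Editor's sketch, not part of the conversation: the log use case described here (who is visiting, how many 404s, how many 500s) boils down to parsing access-log lines and counting fields. A minimal illustration in plain Python follows, assuming Apache/NGINX "combined"-style log lines; the regular expression, the `summarize` function name, and the sample lines are all hypothetical and are not taken from Elastic's tooling, which would normally ship and aggregate such data for you.

```python
import re
from collections import Counter

# Assumption: lines roughly follow the Apache/NGINX "combined" access-log layout:
# client IP, identity, user, [timestamp], "request", status, size, ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) '
)

def summarize(lines):
    """Count requests per client IP and per HTTP status code."""
    by_ip, by_status = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        by_ip[match["ip"]] += 1
        by_status[match["status"]] += 1
    return by_ip, by_status

# Hypothetical sample data using documentation-reserved IP ranges.
SAMPLE = [
    '203.0.113.7 - - [27/Feb/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.7 - - [27/Feb/2018:10:00:01 +0000] "GET /missing HTTP/1.1" 404 128',
    '198.51.100.4 - - [27/Feb/2018:10:00:02 +0000] "POST /api HTTP/1.1" 500 64',
]

by_ip, by_status = summarize(SAMPLE)
print("404s:", by_status["404"], "500s:", by_status["500"])
print("busiest client:", by_ip.most_common(1))
```

The same per-status and per-IP counts are the kind of aggregation a dashboard would chart over time; the sketch only shows the counting step on a handful of in-memory lines.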

Starting point is 00:32:54 And then you can bring all of that together in some dashboards, and then you get the overall view, both of your business data, but also like on the IT system side, what is my infrastructure doing? I was just thinking about the logging aspect. And so, you know, you said you don't know what our infrastructure is like. Well, we just push everything off to Papertrail, which is a service that we use. And they probably have Elasticsearch on the back end or some sort of search tool allowing us to then, you know, run our searches through them. And so that got me thinking about Algolia and some of these other searches as a service.

Starting point is 00:33:36 And I'm just curious how either Elasticsearch self-hosted infrastructure or even Elasticsearch offerings, how they differ and measure up to other search options that are out there for developers to pick and choose from. So we're getting into two different areas here. Algolia is for the search use case. We have recently acquired a company called Swiftype, which is basically in exactly that area. And while their product was already based on Elasticsearch, they were just doing the crawling for you and just automating that search process, basically. And that is one of the solutions. I've talked about solutions before. This is one of the solutions we want to add.

Starting point is 00:34:10 It's still built on the open source search platform that we have, but it's more of a solution that you probably don't want to build that yourself because you totally could. And if you want to jump into that for a weekend project, you can totally do that. But maybe you just say, oh, I just want to have a site that is easily searchable. And I just want a solution. And I want my page to be crawled automatically. And maybe I want to fine tune some searches.

Starting point is 00:34:36 For example, if I enter this term, this should be the order that I want to have, or I want to have some features where you need some fine tuning, you can totally do that. But generally, it's just a solution that you can get started with. Swiftype. I think I actually run that on my blog because it's a static site. Nice. And to add search. Haven't they provided it free for personal use for a long time?

Starting point is 00:35:02 So I think maybe I've got Elasticsearch, uh, powering my blog searching. And you didn't even know it. Well hidden behind the scenes. This is awesome. I love it. Well, you said we're getting into different territories when you talked about logs versus, like, search for a database or content. Can you go into that more? Does it end with Swiftype? For the log use case, uh, you can totally use, uh, one of these kind of like smaller solution providers, uh, but then again it's one more island, because, well, your search results basically sit on their solution or their site, and if you want to access anything, well, you're going there. And then for any other data, like business analytics, you might

Starting point is 00:35:42 have another island um but it's just like lots of different islands which you then need to go each individually to get the bigger picture. Our vision is more to have like one dashboard where you can show different things. Where you can have both like, okay, my website did that much revenue today. But also how did the latency of my website or how did the number of errors affect that? And it's just like one tool where you have the overall and bigger picture for that. Maybe you can go deeper into it because I see the user types caring about those interfaces as one team but different cares. Meaning I care about search and I maybe as a marketer I care about

Starting point is 00:36:26 terms, or I care about relevancy, or I care about people actually finding certain things, or caring about content that's getting searched. But if I'm a developer, I care about logs, or if I care about performance, maybe I'm in a different sector. And it seems like those customer types, or the user types of those three different things, in one dashboard... why one dashboard? Well, obviously you don't have to. Like, probably everybody will have the one big TV screen in their office with the custom metrics that they are mostly interested in. But maybe you want to have, like, the bigger picture: how did one influence the others? Which right now, well, if you have different solutions for that, might not be all that easy. And maybe also this kind of like siloed

Starting point is 00:37:11 approach is a bit partly because you had the different tools and everybody was kind of like using their own view and there was no easy way to to bridge those different views. And I think that is kind of part of our vision to get the bigger picture and have a better integration between all of these different departments. I hated the term DevOps, but I think this is kind of partly that idea that you break down those silos and that everybody's doing the thing that they have been doing for the past or in the past. But you want to kind of like get beyond that and get to the kind of like the inherent value. Where is the value in your company? It's not like doing one of these things, but it's kind of like getting the bigger picture

Starting point is 00:37:56 and see how you can strive and what you... This episode is brought to you by our friends at GoCD. GoCD is an open source continuous delivery server built by ThoughtWorks. Check them out at gocd.org or on GitHub at github.com slash gocd. GoCD provides continuous delivery out of the box with its built-in pipelines, advanced traceability, and value stream visualization. With GoCD, you can easily model, orchestrate, and visualize complex workflows from end to end with no problem. They support Kubernetes and modern infrastructure with elastic on-demand agents and cloud deployments. To learn more about GoCD, visit gocd.org slash changelog. It's free to use, and they have professional support

Starting point is 00:38:50 and enterprise add-ons available from ThoughtWorks. Once again, gocd.org slash changelog. So, Philip, Elastic recently published an article called Doubling Down on Open. In fact, Shay wrote this February 27th of 2018, and I misread it. I thought it said doubling down on open source, and so we were going to talk about that, but it stopped short. It says doubling down on open, and it kicks off with him saying he's excited to announce that he, y'all, will be opening the code for your X-Pack features: security, monitoring, alerting, graph, reporting, so on and so forth. But this is not open source. This is opening the code. Can you give us the distinction and tell us what's going on here? This is very much a definition problem, but I think the OSI has

Starting point is 00:39:57 a definition of open source which says something like: you can see the code, you can modify it, and it's freely available. And the freely available is kind of what we're not doing there. And since, well, we're a large company, our salaries need to be paid somehow. So what we're doing with these features is, and you can get the source code on GitHub, we will have a special directory, or there will be a directory or a folder, with these non open source parts. So what is Apache 2 licensed right now, that will stay Apache 2 licensed, but we will add the code for the commercial features to GitHub, so you will be able to see everything that is going on there. But to use it in a

Starting point is 00:40:45 production environment, you will still need a commercial license. So it's not open source, but I always say it's open code, because you can see the code, you can totally open issues for that, you can even contribute patches back, we don't really expect anybody to contribute major features to our features that we will sell afterwards. But you can totally see what is going on. And that has multiple reasons. Firstly, especially around security, people always want to see what they're getting. And with bigger customers, sometimes they wanted to have an audit of the source code behind that. Well, it's much easier to tell them, well, the code is open. You just have a look there and you can really see what you're getting.

Starting point is 00:41:34 Secondly, for us internally, it was kind of a problem because we always had the open source GitHub projects. And then we had the X-Pack ones, where the commercial code was living. And then you always had the problem of how do you work efficiently with that? You cannot do atomic commits, because part of the functionality might be in the open source part, and part of the fix that you're contributing is on the commercial side. How do you communicate the issues to the outside world? Because, well,

Starting point is 00:42:05 the issue for the commercial part is in the private repository. So nobody can really see what is going on. And that will also make either communication, but also the process for us internally much easier. And we just think it's the right thing to do. And everybody can see what they're getting, you will still need to pay for some or most of the features. You can see that in the feature matrix, what is commercial and what is actually free to use but not available under an open source license. So there might be some minor restrictions, like you cannot provide it as a service for customers,

Starting point is 00:42:41 but you can totally run it for your own projects on premise. So this is what we're trying to achieve there, to kind of find a way to have or be an open company and build on open source but still survive as a company and not end up like, I don't know, for example, RethinkDB. I think that was one of the products that was really widely loved, but it was just not enough commercial in there that the company made the cut in the end. And I don't think that is benefiting anybody. So it is a fine line to walk, but we are doing our best to kind of be open and make users happy, but also have a sustainable business model and be around for a long time and build good products for a long time. Are you guys following in somebody else's footsteps on this? Or is this paving a new path with regards to this particular layout that you come up with, with the X-Pack features in a separate folder

Starting point is 00:43:43 and the license being in the way of it being completely open source? It's definitely not very common. I think one or two other companies have looked into similar things. I think CockroachDB is one of them, though they're much smaller and much younger as a company. I'm not aware of any other more established or larger company doing that. And also from the legal perspective, it is very interesting. And on the one hand side, we really want to kind of like keep the legal text there to a minimum and not scare anybody away. On the other side, it needs to be waterproof so that nobody can kind of legally or find a loophole to legally use our intellectual property or commercial intellectual property to make money themselves or just use it for free and work around that. Some people have had the concern that, well, you can just take the code modify it and kind of like uh comment out all the

Starting point is 00:44:46 licensing restrictions um though we don't assume that this is kind of an issue for any established company like anybody who is capable uh of paying or at least in the western world and i'm not sure how it's like in the rest of the world, especially with the legal system there. But we don't see that as a major risk, that somebody could just easily modify the source code now and run everything because it's open. We have thought about that. We're not afraid of that. We're still in the process of drafting that legal document or that license that we will add. And we're also kind of right now cleaning up the code for the opening because, well, you need to make sure what was closed source code.

Starting point is 00:45:30 There were absolutely no credentials. There cannot be any references to customers. You don't want to have anything else that might be embarrassing. So there is kind of a cleanup process right now that the colleagues are going through. The legal document may be in process, but what I can say for sure is that between this blog post, Doubling Down on Open, and then also We're

Starting point is 00:45:50 Opening X-Pack, it's well documented, so you're definitely doing a good job of, like, communicating your intentions, which I think is probably the hardest hurdle to get over when making this kind of shift, especially something that can be this controversial or be mistaken or feel misleading if not described carefully. You're saying why you're doing it, what's changing, when it's going to change, how things will be affected. These two documents, which will be in the show notes, greatly communicate your intentions here. We are really trying because even internally, people were confused at first. And after the announcement, somebody accidentally, like from within the company, even on the private account, wrote like, oh, we're open sourcing X-Pack. And it's like, no, that's not what we're doing. And we have, it's an ongoing fight.

Starting point is 00:46:38 And obviously, once it's being posted on Hacker News, everybody goes crazy and posts whatever they think it means or doesn't mean. And everybody has great fears. And we understand that people kind of like are first a bit surprised because it's not a common model. But we are really trying to do the right thing here. And we think this is a model that might have a lot of benefits for companies as well. So we kind of hope that this will be more common in the future, or at least we're risking it and seeing where we can take this. Curious what you mean by doubling down, it could be the risk portion of it, or just the fact that

Starting point is 00:47:13 something indicated that you should have such a belief in this direction that you're doing it. I think it's kind of both. We really see open source as kind of the driving force and how to get software out there and also what is making us successful. We always see it like that: every paying customer has been an open source user in the beginning. That is really where everything is starting. And even the sales people understand that, even though, of course, the sales people never want anything in the open source space. They would love to have everything closed source and commercial. But they're kind of understanding that model.

Starting point is 00:47:53 Like, how do you get where you are right now, and how can you take it further? Got to get paid, you know? And they have like 50% of their salary being based on what they're selling. Yeah, you want, you want, you know, as a salesperson, you want no ceiling on your revenue opportunity, you know, how much money you can make, because when you're in sales, usually you risk, you know, what is often a salary, you usually get some sort of stipend or a base or a draw is what the common term is used for. And it's usually very small,

Starting point is 00:48:26 nothing you can actually rely upon. So in that position, you're like, I don't want any restrictions. If I can sell a lot, don't restrict me. I'll sell a lot. If I can sell very little, well, then you fire me or I will starve, one of the two. Totally. Believe me, we commonly have these discussions

Starting point is 00:48:40 and engineering would, of course, want to make everything open source because, well, who doesn't? And sales obviously doesn't, or wouldn't want to make anything open source. And we try to, or we need to, strike the right balance, and it's of course an ongoing discussion, but I think we're doing the right thing here. Um, we see, um, how that develops over time, of course. When it comes to security, I think that's, uh, you mentioned it earlier when you first started to share the details here, but I think it's so crucial. You hear so often tooling or something being in the security space and you can't get access to the source code. which is totally opposite of this, but third-party CSS not being safe,

Starting point is 00:49:30 where Jake Archibald said the real problem is thinking that third-party content is safe. In this case, it's third-party code or dependencies. And so many issues stem from a dependency that becomes... what's the term for it? Not safe anymore. Unsafe. Unsafe. That's not what I was looking for, but that works in this case here.

Starting point is 00:49:49 But you can't trust it anymore. It becomes compromised. That's the word. And you've got that in your code base. You don't even know it. But the point is that you can see these because you have opened them up. And it sounds like you also have issues open, but you're not looking for people to contribute,

Starting point is 00:50:04 but you want people to be able to see the code, scrutinize the code, maybe even file bug issues and or patches that may be security related. Is that correct? Oh, totally. And especially if you're a more advanced user and you run into an issue, the first thing you might want to do is just check out the source code and see like, okay, this is what it's doing and this is what it's supposed to do. And then you can say, oh, either I'm using this wrong, or no, there is a bug and I can report that bug, and then I can see the progress and I can be part of that discussion.

Starting point is 00:50:32 And it's all on GitHub, where it's like much more, uh, kind of inclusive in the regular process you have around everything you do in the open source space. And we want to give people the opportunity to participate in that as well and just be able to show like, hey, this is what we are doing and this is when this release is coming out. Otherwise, that communication was kind of very complicated

Starting point is 00:50:54 because then you would have like had somebody to always communicate that like, oh, we have fixed the bug and it will be in that patch level release. And then you shouldn't forget anybody. Otherwise, people are surprised like, oh, is my issue now fixed in that release or not? And it's just like creating an unnecessary barrier

Starting point is 00:51:12 that we're trying to get rid of. Well, for the developers out there that are like thinking, okay, so how big is Elastic? You know, great, you got to make money, but how much? Why don't we share with them how many people you got in your company so they can kind of, you know, quantify that, so to speak. It's changing every day. We're, I think, like 820, or maybe we're already 830 today. Um, right now we are growing by 50 a month, um, which is an insane number. Um, if anybody is looking for a job, by the way, just shoot me a message. I'm happy

Starting point is 00:51:46 to connect you. We have for pretty much any technology that you can imagine. We're not just Java. We have lots of other stuff as well and lots of open positions. What's driving that growth? Obviously, we have more and more products and we're getting more into that solution space. So that is the engineering side. But of course, since we have all these solutions, you always also need to sell them. So we have also a lot of sales and marketing people there. How has your community responded to this new direction? You have your customers, you have lots of users of the open source project, even just on the

Starting point is 00:52:25 Elasticsearch repo on GitHub, there's 983 contributors over time. Now, maybe with 820, you know, you've got a lot of those being your employees. But surely there's other companies using this, other individuals. And now this change for this direction of open, but not open source, proprietary open code things that are going to be in the repos and this vision that's been laid out. I know there's been some confusion, but has there been a backlash? Have people received it pretty well? What's the response been? Partially, uh, confusion, and partially people are waiting, since the final license is not out there and they don't really know what it means. Um, so yeah, I guess we will get the final vote once that is being done. On the other hand, if you're an

Starting point is 00:53:20 existing user nothing is changing like what has been out in the open source space is staying out in the open source space. We're just adding more or viewable source code. So if you want to take a look behind the scenes for those features, that is totally possible in the future. So we're not taking anything away. We're just adding more features. And I think a lot of people care more about the free part

Starting point is 00:53:44 than the open source part, to be honest. For those, not too much will change in that area. That's an interesting question, Jared, to consider the response, obviously. I didn't think to ask that. That seems like the obvious thing to ask, which is like, okay, you've got this many employees. You must have a large customer base. What's the response? And it looks like this announcement was made

Starting point is 00:54:05 at Elastic On. Is that right? Is that how you say that? Elastic On or Elastic On? We normally say Elastic On. Right, it's our annual conference. Okay. And maybe it was just timing, but, uh, have you asked, or are you aware of why you would announce this change prior to the, uh, the end user license agreement being available? You know, because, I mean, you said confusion. It seems to me that maybe some of the confusion can be guarded, I guess, or just, you know, not there at all if the whole deal is clear and that's the missing piece. A, you want to announce something at your annual conference, and we really wanted to put that out there and show our commitment to openness.

Starting point is 00:54:51 On the other hand, since there is not that much, uh, prior art there, um, kind of just finding the right legal text is a lot of work, and we just were not there on the legal side for having the text. We were aware that, well, it might have been better or probably would have been better if we had the final text there. But on the other hand, speed in that regard could really kill if you just put out something that is not foolproof or does have some loopholes that would totally impact the company. So we really want to draft something that is substantial there

Starting point is 00:55:30 and is doing the right thing. And it's kind of like the engineering discussion is very interesting. It's like, oh, so since you have the part of the source code that is Apache 2 license, maybe you could just modify the Apache 2 license code to circumvent the license check for the commercial part. Maybe you could do stuff like that. And this needs a lot of kind of discussion, both between engineering and the legal side. On the other hand, we don't

Starting point is 00:55:58 want to make this too restrictive to scare anybody off. So we are really trying to walk a fine line of doing the right thing. And unfortunately, that takes some time. And it's really a back and forth. I think it's important maybe to put in perspective the reasons why. You know, there's a lot of confusion on the details, but the why usually helps everyone understand the direction and maybe even gain some trust, right? The why is because you need to be a profitable company and survive and continue to have the necessary employees to innovate and to deliver services, right?

Starting point is 00:56:32 I mean, that's the why, right? I mean, that's not the why for opening it, yes, but that's the why we need to have commercial features that you can continue to get cool features and we can innovate other products. But we're also, like we said,

Starting point is 00:56:50 committed to this openness and it's just like finding the right balance. And we would love to see that we are not the last ones to do something like that, where you have a commercial offering because, well, once you have a company, you need that. But also having this open part and not be like, I don't know, Oracle,

Starting point is 00:57:09 where you just cannot see anything in the source code, and then something doesn't work out, you write to support, and then you wait for some answer from support. And maybe it's not giving you the right answer of how something is supposed to work. Whereas once you have the open code approach, if you're knowledgeable enough, you just look up how is this working behind the scenes. I can just figure it out like in 10 minutes myself and see what is going on. And I think there is kind of a tremendous value in that as well.

Starting point is 00:57:34 If you don't go this route, it sounds like you referenced RethinkDB earlier. So it sounds like you're familiar with that story that if not sustainable, that Elastic could see a downturn in employment. That means lost jobs. That means – heck, that could potentially mean we see you on Patreon at some point rather than finding ways to sustain yourselves in ways that meet your own business model. Not that that's going to happen, but that's an extreme case. What I'm trying to say is that we see open source projects and or products attempting to, and in a lot of cases, succeeding and sustaining through open collective, Patreon, direct support. Obviously you're a company, so that may be slightly different. If you don't find a way to deliver these things you want to in a commercially viable way, then it means lack of success. It means, you know, company failure, potentially.

Starting point is 00:58:31 Yeah, and it's in nobody's interest to shut down a project like it happened to RethinkDB. I mean, yeah, the code is available on GitHub, but I checked just a month ago or so, and I think, like, pretty much nothing is happening there. So this is, I guess, pretty much the end of it, and nobody is benefiting from that, because it was a great product and it was also widely loved, from what I understood. So that's not what you want to do. We'll put in the show notes, we did the, uh, was it The Future of RethinkDB, Jared? Was that the last episode we did? It's a great show. I mean, it kind of end capped the chronicling of this podcast covering RethinkDB, which was two episodes of Slava.

Starting point is 00:59:10 And I can't recall the person we spoke with right now. That's right. Mike Glukhovsky, episode 266, The Future of RethinkDB. I got the title, but the person I forgot. Mike was great to have on. He greatly shared the backstory, the founding portion of this, and then ultimately how the IP was, you know, bought by the Linux Foundation and what that meant.

Starting point is 00:59:33 And yeah, so we'll put that in the show notes. Philip, anything else we can cover here? What, maybe what's, what's next? So this is probably the hottest topic in your company and in your projects.

Starting point is 00:59:47 Where can we go from here? What's best to cover to close out the show? So continuing kind of like the open theme, uh, we're doing Google Summer of Code this year for the first time. It's sponsored by Google and organized by Google. It's basically: open source organizations can apply, uh, to run student projects, and a student will then implement some feature for the project in three months, and Google is paying the student for that. And that has been going on for, I don't know, I don't even know which year we're in, about 10 plus years from what I remember, because I think I was a Google Summer of Code student like nine or ten years ago and participated in the project. And now we are trying to be there as,

Starting point is 01:00:30 or we are there as an organization as well. And we are currently selecting a student, so unfortunately it's over for this year. But if you're a student and you want to work in open source during the summer and don't, I don't know, serve drinks or anything like that, then it's a great opportunity. And keep your eyes open in February for the call for that, and then you can see more than 100 open source projects where you can apply for either ideas they are

Starting point is 01:00:57 putting out, or you can come with your own project ideas, and if you're being selected, you can work on that code for three months during summer while being paid by Google. So that's kind of a very nice thing for students to do. I can highly recommend that. And we also see that as being part of that open source ecosystem and the openness, that we're participating in initiatives like that and try to bring on students into the projects and, like, just the new generation into open source and help them getting started. You were a student in that, uh, in this Google Summer of Code? I was a student in, um, in Google Summer of Code. Uh, I worked on a PHP-based CMS system,

Starting point is 01:01:38 uh, called SilverStripe, which nobody knows because it's from New Zealand. That was kind of like my start into the open source world, uh, where I worked on a project, and then I kind of kept ties with the project, and then two years later, three years later, that organization, uh, was a mentor organization, and then I was a mentor with them as well. And that's kind of like a common topic, that you bring on people or students, uh, on the student side, and then they continue as the mentors or, as we now do, on the organizational level, driving that to kind of help the next generation

Starting point is 01:02:15 of striving the open source ecosystem. I'm looking at their homepage. 13,000 plus students, 108 countries, 13 years, 608 open source organizations, and 33 million plus lines of code over Google Summer of Code's history. Pretty impressive statistics and what an impact it's had over time. Well, Philip, thank you so much for schooling us on the use cases of Elasticsearch, how a relational database like Postgres can leverage it, potentially even how you can bridge the gaps across various different vectors.

Starting point is 01:02:50 But yeah, thank you so much for sharing us that backstory because that certainly educated me quite a bit. And the fact that this is open source, it began as open source, and the direction of your company is so great. So thank you for sharing that. And thank you for being a fan of the show. Thank you for coming on.

Starting point is 01:03:04 We appreciate it. Thanks for having me. And I hope you can fix all your search problems. Let me know if you need a hand. We need a hand. All right. Thank you for tuning into this week's episode of The Changelog. If you enjoyed this show, do us a favor.

Starting point is 01:03:19 Share it with a friend. Hit that favorite button. Add it to a list. Tell somebody about this show. And of course, thank you to our sponsors, Rollbar, DigitalOcean, and GoCD. Also thanks to Fastly, our bandwidth partner. Head to fastly.com to learn more. And we move fast and fix things around here at Changelog because of Rollbar. Check them out at rollbar.com. And we're hosted on Linode cloud servers. Head to linode.com slash changelog. Check them out and support this show.

Starting point is 01:03:46 The changelog is hosted by myself, Adam Stachowiak, and Jared Santo. Editing is done by Tim Smith. Our music is produced by Breakmaster Cylinder. And you can find more shows just like this at changelog.com or on Overcast or Apple Podcasts or wherever you subscribe to podcasts. Search for us. You'll find us. That's it. It's done. We'll see you next week.
