The Data Stack Show - 102: Building Pinot for Real-Time, Interactive User Analytics with Kishore Gopalakrishna of StarTree

Episode Date: August 31, 2022

Highlights from this week's conversation include:

- Kishore's background and career journey (2:30)
- Internal analytics versus user-facing analytics (3:49)
- New ways of thinking about analytics (8:06)
- What makes Pinot different (13:45)
- How Pinot transforms systems (21:53)
- Understanding the data landscape (32:40)
- The Pinot user experience (36:27)
- Something exciting about StarTree (40:05)
- When you should adopt this technology (43:15)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. Kostas, today, I'm so interested to talk to our guest because it's a subject that I actually don't think we've covered before, which is user-facing analytics. Our guest is one of the creators of Apache Pinot.
Starting point is 00:00:40 It was a technology that came out of LinkedIn that really drove a lot of their user-facing analytics. And I'm just excited to chat about that because we have new subjects. But one of the questions I have is, we talk so much about analytics in terms of reporting inside of a business, KPIs, et cetera, developing analytics for users, executive team marketing, et cetera. And I want to know if there's a big difference in the way that you think about user-facing analytics and developing that. And then, of course, like the technology around it, because obviously they developed Pinot to sort of, you know,
Starting point is 00:01:14 deliver those in the context of LinkedIn. So that's what I want to ask about. How about you? Yeah, I'm very excited about our show today because we are touching a subcategory of analytical databases that we haven't touched in the past, these real-time OLAP databases. And Pinot is one of them, there are a couple of others, but it's the first time that we are going to discuss them.
Starting point is 00:01:39 And I'd love to learn more about what makes them unique, both from the use case perspective, because what you mentioned is one of the use cases that these systems uniquely serve, but also from the technology perspective, like what makes Pinot different than BigQuery or Snowflake, for example. So yeah, I'm really looking forward to chatting with our guest today. All right, well, let's dive in. Kishore, welcome to the Data Stack Show.
Starting point is 00:02:08 We're so excited to chat with you. Thank you. Thank you for having me. I'm so excited to be here. All right, well, could you just give us a background, you know, sort of your work history and how you got into data and then what you're doing today at StarTree? Yeah, absolutely. So maybe I can go backwards and start from what I'm doing today.
Starting point is 00:02:28 Sure, that's great. Yeah. So I'm Kishore, the CEO and co-founder of StarTree. Prior to that, I co-authored Apache Pinot at LinkedIn. And before that I built a lot of distributed systems. That's kind of what got me excited. The world of distributed systems, where it's fascinating to see how one node going down kind of renders other nodes, or the entire system, completely useless.
Starting point is 00:02:53 That kind of caught my curiosity, and I built a lot of distributed systems in my career. Espresso was one of them, which is similar to MongoDB, a document store. And then I've also built Apache Helix, which is the cluster management framework used to build other distributed systems. And then Apache ThirdEye, which is anomaly detection and root cause analysis.
Starting point is 00:03:13 And of course, Apache Pinot, which is built on top of Helix. Yeah, amazing. What a resume of work. I almost don't know where to start. But I think one thing that would be helpful is, why don't we start by talking about sort of internal analytics versus user-facing analytics? Because that's a really important paradigm for Pinot and, of course, StarTree. Could you explain sort of the main differences between internal analytics and user-facing analytics?
Starting point is 00:03:48 Yeah, absolutely. I think I'll probably start off with, before getting to the low-level details, just the concept of user-facing analytics. I mean, I was really not familiar with this term back at LinkedIn. I was building this other system, Espresso, which was user facing, but it was the OLTP workload, right? Which is solving all the transactional workloads. But then LinkedIn really embarked on this new thing, which is, hey, we have all this data that we are collecting from the users, right, and how can we actually provide insights back to them? Like all the LinkedIn members who are visiting the LinkedIn website. And that's kind of where user
Starting point is 00:04:27 facing analytics really originated. Most of you are familiar with Who Viewed My Profile, right? Where you go to LinkedIn, you see your page, and then you see, hey, these are the people who actually viewed it, they're from XYZ companies, they have XYZ skill sets, here is the geo. So it was a very interesting app that we actually started off with. And this was built not on Pinot, it was actually a prior version, which was built on a search engine.
Starting point is 00:04:55 That's kind of where everything originated. So that's really the concept of user-facing analytics. From a use case point of view, it's very similar. You have the data that you're collecting already. Internal analytics is about surfacing those insights to your internal employees within the company, be it analysts, operators, engineers, or CEOs and execs, right? Whereas user facing is really taking that outside the organization, to your customers, to your partners, and then providing
Starting point is 00:05:25 them with all the insights so that they can actually make better decisions. Right. Take another classic example, outside of LinkedIn, which is Uber Eats. Now you can think of providing analytics to a restaurant owner as user facing, right? All the orders happening on Uber Eats are coming into Pinot, and that data is being surfaced in real time to the restaurant owner.
Starting point is 00:05:53 Now he knows, hey, what's my revenue? What's my wait time? How long am I taking to process this order? So this is this whole thing of providing insights directly to the end user, who is actually the one making these micro decisions: hey, should I actually bring in another person to help me so that the order wait time reduces, right? Because that's directly impacting my business. So what used to happen via support before, via reports that were sent once a month or once a day, is now being directly surfaced to the end users via interactive apps.
Starting point is 00:06:28 So that's kind of really the change that we saw with user-facing analytics. Oh, interesting. So, yeah, that makes total sense in terms of the aggregate of like, hey, here's your weekly report on who viewed your profile, right? As opposed to logging into the app and instantly seeing, here's who's viewed your profile, you know, maybe since the last time you logged in. One thing I'd love to get your perspective on is, are there
Starting point is 00:06:59 helpful ways of thinking about internal analytics versus user-facing analytics? And I'll just give you one example. A lot of times internal analytics are aggregated, right? You want to know monthly active users. You want to know trending on revenue or margin or usage or activity, whatever it is, right? What's interesting about the example you gave for LinkedIn around, okay, well, I log in and you sort of instantly deliver this: here's who viewed your profile.
Starting point is 00:07:35 Here's their geo, whatever, is that it's almost like each user gets their own filtered pivot of data, right? From what internally would be a large data set, right? So it's like, okay, how many people on average are viewing these sorts of profiles? That's more of like an internal KPI for a product manager or something like that, right? But the user-facing analytics, at least in that regard,
Starting point is 00:08:01 with that example are like pretty specific to that user. So as you embarked on this project at LinkedIn, were there sort of new ways of thinking about analytics that you had to master in order to build Pinot? Yeah, absolutely. I think you brought up a very important point, right, which is really about providing analytics in a personalized fashion. It's really geared towards that one person who is actually looking at it. What's the view that he or she will actually be excited about?
Starting point is 00:08:35 So it is, to a certain extent, personalized analytics, right? But there are other use cases that can be aggregated at a different level, at an operator level on your customer side, or at a partner level. Talent Insights is another example that I can actually think of. Think about yourself as a company
Starting point is 00:08:56 like Amazon or Google. You are now seeing, hey, where are my people living? Where are they going to? Are they going to Facebook? Which area of Facebook are they going to? Or which area of this XYZ company are they going to? And in which skill sets am I actually
Starting point is 00:09:11 losing people? Or where am I gaining more people? So this kind of inflow and outflow of talent is an aggregate metric. But now it is only per customer, right? So LinkedIn provides Talent Insights as a product to Amazon, Facebook, Google and all these.
Starting point is 00:09:27 Now, if you kind of go back before, you know, this world was basically reports based. So every month, end of the month, someone internally in LinkedIn would run a Hadoop job
Starting point is 00:09:37 or a Spark job and then send out this report to them, saying, okay, here is the report that you asked for. Here's the number of people. But now it has become interactive, right? So now that report has turned into a data product.
Starting point is 00:09:48 So that's the mindset shift that I generally see with user facing: hey, we were generating all these reports and sending them to the user. Can we actually flip this report generation into a data product, so that we directly put these data and insights in the hands of the customers, right? So they can actually make use of it the way they want to. They can slice and dice however they want to. And they are directly generating insights instead of them telling,
Starting point is 00:10:16 hey, this is the report I want, can you go run this report, and then coming back. That's the old way of solving this problem. So to a large extent, the problem is the same. It's just that the way you think about solving the problem is what is changing. And that's where Pinot came in. Right. So LinkedIn kind of changed the way they were trying to solve this.
Starting point is 00:10:36 And then Pinot was born as a result of that, because once we tried to solve it that way, we hit all sorts of challenges. Right. Something that I didn't mention in the previous topic you brought up, internal versus external, is these three dimensions. The first is the freshness of the data. How fresh is the data that I'm seeing?
Starting point is 00:10:58 The second one is latency. Can I ask questions at the speed of my thought? As the user, I don't want to ask a question, go for a coffee break, and then come back. Those are all things that happen in the internal world, where you run a query and it takes a minute or two to actually run. Internal analytics is batchy. But user-facing, you want it in real time and at very, very low latency.
Starting point is 00:11:23 It should be interactive. And the third dimension that I want to bring up is concurrency, right? Internally, it's typically very few people who are actually accessing a dashboard, because it's always bounded by the number of employees in your company. But when you go external, it's limitless, right? It depends on how many users you have. LinkedIn has hundreds of millions of users and continues to keep growing. Uber Eats has hundreds of thousands of restaurants, and that keeps growing. So now you are seeing orders of magnitude changes in latency, freshness, and concurrency challenges. That's the main reason why we built Pinot, because we just couldn't scale with the old system that we had. Super interesting. Yeah. And that makes total sense. When you think about internal users, right? Like maybe someone's looking at a report a couple times a day or once a day, right? When you think about, you know, 50 million people logging in and needing to provide them with a piece of data in, I don't know how long it takes, a couple of milliseconds or something.
Starting point is 00:12:29 That's wild. Okay. My next question, I'm actually getting handed off to Costas because I could keep going here, but I want to know how you actually did that and learn more about Pinot, and Costas is way better to ask those questions. And my guess is that that's what he's going to ask about. So Costas, please take it away. Yeah. So you might have to repeat yourself a little bit. Sure. Because I think the question that I'm going to ask you is probably going to overlap a little bit with the stuff that you discussed with Eric.
Starting point is 00:12:59 But it is good to put some things under the right perspective. I was looking at the website of Pinot, for example, and if you try to figure out what Pinot is, the first thing that you will see is that it's a distributed, real-time OLAP database system. And I'd like to ask you, as an OLAP system, how is it different compared to something like Snowflake or Redshift or BigQuery, right? Because these are also very typical examples of OLAP systems. So what makes Pinot different and, most importantly, why do we need to architect OLAP systems differently and create something like Pinot? Yeah, it's a great question.
Starting point is 00:13:45 I think this is one of the things where, if you take a 10,000-foot view, it's very hard to say anything beyond: hey, these are all analytical systems, at the end of the day you're asking SQL queries and getting answers, right? You can bubble it up even more and say, hey, OLTP systems also support analytics, right? Like why can't I use Postgres? Why can't I use MySQL? Why can't I use any of these NoSQL document stores, right?
Starting point is 00:14:11 Because at the end of the day, they all have data and they all support SQL queries, right? I think where the differences start coming up is the workload, the kind of workload and the kind of use cases that they power. You definitely can't power very, very high throughput analytical use cases on an OLTP system, if only because of the way the data is organized internally: it's row oriented. And it's very well known that you need to shift from the row-oriented structure to column-oriented, because an analytical query is not a key lookup, right? Unlike in the OLTP systems, you need to scan a lot of different data.
Starting point is 00:14:50 Now you come to the analytics side. Now you have this variety of systems, like, as you mentioned, the data warehouses of the world, the BigQuerys, the Snowflakes and Redshifts and others. Right. And then you have these OLAP systems. In the original legacy systems, people used to think of OLAP as this cube generation, right? OLAP and OLAP cube were very synonymous, right?
Starting point is 00:15:15 And the concept of data marts and data warehouses existed before. So it's kind of very similar in that sense, but some things have changed in the last decade, right? OLAP was at some point very, very popular, if you all remember. You had the things from SAP and others like IBM Cognos. There were a lot of these OLAP cube systems. So it's important to know the way that things have changed
Starting point is 00:15:43 and how they were solved before. Right. And before, what used to happen was all these cubes were pre-computed. Like every question that you want to ask, say, how many people from US and Chrome and Windows actually looked at my ad, right? So that was the question. And this answer would be pre-computed in an ETL job, and then it would be pushed into an OLAP system, right?
Starting point is 00:16:07 And then the answer would be very, very fast. The reason why it really broke down is two things. One, there was a huge explosion in storage: to pre-compute all these things was very, very expensive. And two, it was inflexible. The moment you add one more dimension, everything is gone; all your previously computed answers are wrong. So OLAP cubing technology never really took off, because the technology was not really made for this.
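To make that storage explosion concrete, here is a small illustrative sketch. The dimension names and cardinalities are hypothetical, not from the episode; the point is just that a fully pre-computed cube materializes every group-by combination, so one extra dimension multiplies the dominant term by its cardinality and invalidates everything already computed.

```python
from itertools import combinations
from math import prod

# Hypothetical dimension cardinalities for an ad-events table.
dims = {"country": 200, "browser": 50, "os": 20, "campaign": 100_000}

def full_cube_cells(cardinalities):
    """Cells needed to pre-compute every group-by combination (the full cube)."""
    values = list(cardinalities.values())
    return sum(prod(combo)
               for r in range(len(values) + 1)
               for combo in combinations(values, r))

print(f"{full_cube_cells(dims):,}")                      # ~21.5 billion cells
print(f"{full_cube_cells({**dims, 'city': 10_000}):,}")  # one more dimension: ~10,001x the cells
```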
Starting point is 00:16:37 Whereas the data warehouse was really good for internal analytics. The analyst could ask a very complex join, like a hundred-table join and things like that. But the query would take a long time, right? You couldn't really take a data warehouse and put it in front of your end users. That's not what they're really made for. They're generally about throughput, not about latency. Like how much data can I actually scan to answer this question,
Starting point is 00:17:04 and how quickly can I actually scan it. So their goals were actually slightly different, the OLAP versus the data warehouse. So that's really the three systems: the OLTP system, the data warehouse, and the OLAP system. And Pinot and Druid and ClickHouse, I think these are the three systems which kind of revolutionized the way OLAP is solved, right? Especially with the rise of SSDs, which helped us not pre-compute all these things upfront. So we could just keep the raw data as it is. You don't have to aggregate it, and you can actually do the aggregation on the fly.
Starting point is 00:17:40 So really the challenge is, do you do the processing beforehand or during the query? It's on-the-fly computation versus pre-computation. That's the challenge that OLAP always had. And with the rise of SSDs and other techniques like SIMD, columnar storage and indexing, you could actually say, hey, I will compute all these things on the fly and still be able to give you really, really good results. And that's kind of how this world has evolved. So now you can think of OLAP as an acceleration layer, in some cases on top of the data warehouses. But because of the real-time nature, it's no longer only sitting on top of the data warehouses as a batch source; it's directly sitting on top of streaming sources like Kafka, right?
Starting point is 00:18:28 So that's where the freshness attribute comes in, which is: hey, I can't wait for someone to take the data in Kafka and other messaging systems, do an ETL, put it into a data warehouse, and then get it into the OLAP system. That layer is getting short-circuited right now. So you point at your streaming source, and the moment a data event is created, we can ingest it
Starting point is 00:18:49 and then we can directly serve it to the end user. So I think there are a lot of these changes, a lot of variables that changed in the world, that kind of resulted in the creation of these systems.
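As a sketch of what that short-circuit looks like in practice, here is roughly how a Pinot REALTIME table gets pointed at a Kafka topic. This is abbreviated and hedged: the table, topic, and column names are made up, exact config keys vary across Pinot versions, and it assumes a controller running locally on its default port.

```python
import requests  # assumes a Pinot controller running locally on its default port (9000)

# Abbreviated REALTIME table config: Pinot servers tail the Kafka topic and
# make events queryable within seconds, with no ETL-into-warehouse hop.
table_config = {
    "tableName": "orders",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "order_ts",
        "schemaName": "orders",
        "replication": "1",
    },
    "tableIndexConfig": {
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "orders",
            "stream.kafka.broker.list": "localhost:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        }
    },
}

# Register the table with the controller; ingestion starts immediately.
resp = requests.post("http://localhost:9000/tables", json=table_config)
print(resp.status_code)
```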
Starting point is 00:19:00 And the nice thing is it empowered the product guys and everyone else in the company to think of things that they had not thought about before. Right. Like if you look at LinkedIn, it's serving 200,000 queries per second on Pinot today, right? It's huge. It's almost like taking an OLAP system
Starting point is 00:19:21 and actually serving an OLTP kind of workload. Yeah. And if you look at internal analytics, it's hardly single digit queries per second across the entire company. So it's a day and night difference in terms of what you can accomplish with something like Pinot versus something like the internal data warehouses. Sorry, that was a long answer, but I think these dimensions actually matter a lot when you deep dive into what the use cases are and what it empowers and enables a company to achieve. A hundred percent. Because even for data engineers out there, it might be hard sometimes to understand
Starting point is 00:20:05 what the differences are and why they might need not just a data warehouse, but also a real-time OLAP system and vice versa, right? People that have been working with this stuff for a long time take all these things for granted, but they are not, right? So that's why I asked, and I was expecting that there was going to be
Starting point is 00:20:30 some kind of overlap, because you talked about freshness, latency, concurrency, all these things. I mean, I remember when I first used Redshift, you could see that Redshift back then was supporting, I don't know, like a hundred concurrent queries maximum or something like that. Right. And then you had to start using queues and things like that to do your analytics, which made total sense, because it's a system that is built, as you said,
Starting point is 00:20:58 to go through a lot of data, with queries that take a long time to compute. But okay, how many users are you going to have, right? Right. So it's all these trade-offs that, at the end, give a different system as an output. And that's what I'm trying to emphasize here, because these trade-offs are very important for us engineers to understand. Yeah.
Starting point is 00:21:21 Okay. So what's the secret sauce? How do you take a system like, I don't know, a data warehouse system as we knew it, and turn it into something that is OLAP but also has the performance, let's say, of OLTP, right? So how does Pinot do this? No, that's a great segue into getting into the internals, right? I think, but I'll try to start from the high level and then just
Starting point is 00:21:50 deep dive into what happens. Right. Let's take a query, for example. What are the phases in a query, when someone runs a query on a database? So typically a query has a filter step, where you have a predicate in the query, and it does the filtering part. And the second one is the scanning part after that: once I have
Starting point is 00:22:10 filtered and narrowed it down, okay, now I need to scan a bunch of data, right? And then the third part is really the aggregation or the group by, right? What kind of operation are you doing on top of it? So you can break down most queries into these three phases. Again, there are joins and other things that come on top, but I'm going to keep it simple here. It's really filter, project, and aggregate/group by.
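Mapped onto a concrete query, the three phases look roughly like this (the table and column names are hypothetical):

```python
# 1. filter: the WHERE predicates narrow down which rows matter
# 2. project/scan: read only the columns the query needs from those rows
# 3. aggregate/group by: compute the final answer per group
QUERY = """
    SELECT country, COUNT(*) AS views          -- phase 3: aggregate per group
    FROM profile_views                         -- phase 2: scan/project surviving rows
    WHERE viewed_profile_id = 12345            -- phase 1: filter on the predicates
      AND view_ts > 1693000000000
    GROUP BY country
"""
```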
Starting point is 00:22:34 So those are the three phases. Now, if you look at how data warehouses did this, the filter phase would always be brute force: I'm going to scan as fast as I can and I'm going to filter it, right? And yes, they would have some additional metadata that they would try to keep, like min-max and bloom filters and things like that. But at a high level those were all sparse indexes, as I would call them.
Starting point is 00:23:00 That is, can I eliminate some chunks of data, right? But what we found was that was not scalable, because it is very hard for someone to predictably say, okay, my filter is going to take X number of seconds or X number of milliseconds. And for us, it was very, very important to have that predictable latency. If you are looking at the who viewed my profile page, we have a 99th percentile latency target, which is less than a hundred milliseconds. So we would get alerted if something goes beyond that.
Starting point is 00:23:33 Right. So we had to make sure that we put in all these things to address that latency requirement. So that's where indexing comes into the picture. You go from this scan-based filtering to actually index-based filtering. Right. And that's where Pinot shines a lot. So it does two things.
Starting point is 00:23:51 One is it rearranges the data automatically so that you get memory locality first. In terms of personalized analytics, it can automatically rearrange the data internally to make sure that all your profile views are stored together in the same place, so that in one seek I can actually get all your profile views, instead of them being staggered and having to do a lot of different things. So that's the first optimization that we do: we automatically reorganize the data as it is coming in, right? Which is a very, very important thing for us, to minimize the number of seeks that we need to do.
Starting point is 00:24:28 The second one is the indexing, right? We go from sparse indexing, block-level and chunk-level indexing, to row-level indexing. Very, very fine grained. So I can tell exactly these are the rows where your profile views are placed, these are the segments, in Pinot's case, that we actually need to get into. So it's really about that filtering phase, right? Like how efficiently can you prune down a lot of data so that you
Starting point is 00:24:56 don't have to do the work. The philosophy, if you look at it, is that the data warehouses and all these other databases try to do the same work, but faster, by using SIMD and all these other techniques. Right. But the philosophy of Pinot is: can we not do the work at all? Right. So that's where the indexing comes in. It's about eliminating the work rather than doing the same work faster.
Starting point is 00:25:22 That's the most important one in the filtering phase. And that's where indexes come in. Pinot has tons of indexes, for every use case, every kind of thing that you can think of. If you have range indexes, you can say, hey, tell me all the queries that took more than three seconds, right? A classic inverted index will not work for that, but you have something called a range index, which quickly knows, hey, these are the rows that actually have more than three seconds of latency. You have the inverted index, which is classic, probably something
Starting point is 00:25:56 common to all the other systems. We have a text index, so you can index text. We have a JSON index. We have a geo index. So there are tons of indexes like this, and that's kind of how we architected it: the indexing scheme itself is pluggable, so you can keep adding more and more indexes. And we have added a lot of indexes over a period of time, and we continue adding more of them.
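A sketch of what those per-column knobs look like in practice. The column names are hypothetical; the keys follow the shape of Pinot's documented tableIndexConfig, though exact options vary by version (text indexes, for instance, are configured separately):

```python
# Fragment of a table config: each column independently picks its encoding
# and index types, and these can be changed later without re-ingesting data.
table_index_config = {
    "sortedColumn": ["viewed_profile_id"],  # physically co-locates one member's rows
    "invertedIndexColumns": ["country"],    # row-level inverted index for equality filters
    "rangeIndexColumns": ["latency_ms"],    # e.g. "latency_ms > 3000" predicates
    "jsonIndexColumns": ["event_payload"],  # filter inside semi-structured JSON
    "noDictionaryColumns": ["raw_blob"],    # raw encoding instead of dictionary encoding
}
```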
Starting point is 00:26:19 So that's kind of the innovation that we did on both the storage layer and the indexing layer. And then the next part is the aggregation: after the filtering, how can we make sure that the aggregation is faster? There, again, there are classic techniques of making the scan very fast, but we introduced another technique called the star-tree index, which is how we named our company as well, based on that indexing technique.
Starting point is 00:26:54 Because it makes things much faster, multiple orders of magnitude. So just to give you an example, right? A classic case is, how many ads did we show in the US? Let's say someone asked that. And typically what happens in any company is that 50% of your ads are probably coming from the US, compared to the rest of the world. So even if you have the indexing technique, you will end up scanning 50% of your data to answer "what is the revenue for the US". So that's where the star-tree comes into the picture. The star-tree analyzes this data upfront, and then it automatically figures out
Starting point is 00:27:27 that, hey, if a query comes for the US, it's going to be very expensive. So I'm going to pre-compute only for the US. But if it comes for something like Kenya, you don't really have to pre-compute; it can actually do the on-the-fly computation. So it's going to do on-the-fly computation for Kenya. It has this smartness built in to figure out what needs to be pre-aggregated and what can be aggregated on the fly, and you don't have to explicitly say aggregate for the US, don't aggregate for Kenya.
Starting point is 00:27:56 It profiles the data as it comes in and creates these smart indexes on top of it. So it's this enhancement and optimization at every layer, from storage to indexing to aggregation. And there is even a lot of pruning that happens in the broker: how can we minimize, how can we eliminate the work? So it's across all these layers of the stack that we have actually done the optimizations that give us the speed.
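For a sense of what configuring that looks like, here is a hedged sketch of a star-tree index definition in the style of Pinot's table config (the dimension and metric names are hypothetical). The maxLeafRecords threshold is what produces the behavior described above: heavy slices like US get pre-aggregated, light slices like Kenya stay on the fly.

```python
# Sketch of a star-tree index config fragment (names are illustrative).
star_tree_index_configs = [{
    "dimensionsSplitOrder": ["country", "browser", "os"],    # tree split order
    "functionColumnPairs": ["SUM__ad_revenue", "COUNT__*"],  # what to pre-aggregate
    # Any node covering more than this many rows gets pre-aggregated; smaller
    # slices (e.g. Kenya) are left to on-the-fly aggregation at query time.
    "maxLeafRecords": 10_000,
}]
```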
Starting point is 00:28:22 So there is actually a nice blog on what makes Pinot fast, and we have listed out all these different techniques there. OK, that's awesome. OK, follow-up question to that, because it sounds almost too good to be true in a way, right? Like, why do we need Snowflake if we can have something that's as fast as OLTP and at the same time is an OLAP system, right?
Starting point is 00:28:49 So what are the trade-offs? Because when we are engineering, we always make trade-offs, right? So what do we lose by adopting and using a solution like Pinot compared to a traditional OLAP system like Snowflake? I think there are two things. One is the flexibility of queries, right? Definitely Pinot is not built as a data warehouse. Pinot is not going to support like a 100-way join. I mean, we don't even have joins right now.
Starting point is 00:29:14 We have lookup joins and then we are adding the joins as well. But again, the key thing for us is to make even joins faster. So we are going to address the cases, which is where joins can also be predictable and very, very fast. So we have some innovative things that we are actually adding onto the join layer that will help us even solve some of these join use cases, but again, keeping in mind very, very low latency, right? And data warehouses are really made for the analyst who can actually
Starting point is 00:29:40 write a thousand-line SQL query, right? I cannot really parse that, but that's the purpose of a data warehouse. You have a lot of different tables that need to be joined. So Pinot is definitely not built for that. And we don't intend to go there as well. We want to make sure that we solve one use case and solve it really well. So that's one. And the second one is the cost aspect, right?
Starting point is 00:30:02 I think you need to have a certain level of usage for this to make sense, because one of the things with Pinot is, again, this is changing, but it's a tightly coupled system. So that means you pay the cost for storage and compute, because the compute is always running. It's unlike Snowflake, where you can actually bring up the compute when you want, when you're running the query, and you pay only when you're doing it. Which is good, it serves a particular purpose, but you can't do that for user-facing analytics. You cannot say, a user came to my profile page, now I'm going to spin up the compute to actually answer that.
Starting point is 00:30:39 So it's that break point. Typically, once you get to tens and hundreds of queries per second, you end up actually needing something like Pinot, because the other side becomes a lot more expensive, because you're paying per query, right? So if you compare the two systems, you really want to look at the cost per query. The cost per query in Pinot will be hundreds of times lower than Snowflake, but only after a particular scale; you need to have that level
Starting point is 00:31:11 of concurrency. You're serving your apps, you're serving your users, and then you hit that break-even point at which Snowflake and other systems become super expensive, right? (A rough back-of-the-envelope version of that math is sketched below.) And the key thing to realize there is that it's kind of a chicken-and-egg problem. Sometimes you try to build your user-facing apps on these data warehouses and the latency is bad, you won't have the freshness, you won't have the dimensionality that you need for your app. And then your app never takes off.
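The back-of-the-envelope version, with made-up prices purely to show the shape of the trade-off:

```python
# Hypothetical prices, only to illustrate the break-even shape.
WAREHOUSE_COST_PER_QUERY = 0.002    # $ per query on a pay-per-use warehouse
PINOT_CLUSTER_PER_MONTH = 3_000.0   # $ for an always-on Pinot cluster
SECONDS_PER_MONTH = 30 * 24 * 3600

for qps in (0.01, 1, 10, 100):
    warehouse = qps * SECONDS_PER_MONTH * WAREHOUSE_COST_PER_QUERY
    print(f"{qps:>6} qps: warehouse ~${warehouse:>11,.0f}/mo "
          f"vs always-on ~${PINOT_CLUSTER_PER_MONTH:,.0f}/mo")
# At internal-analytics rates (a fraction of a query per second) pay-per-query
# wins; at user-facing rates the fixed cluster's cost per query collapses.
```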
Starting point is 00:31:41 Then you say, okay, I'm okay with this. But the key thing is to make sure that the users get what they want, and then you will see the concurrency, the number of requests, actually go up a lot. And that's kind of what we see. As I said, LinkedIn is serving hundreds of thousands of queries per second on top of it. That's because you provide an app to the end user, right? You're not writing SQL queries against this.
Starting point is 00:32:04 The restaurant owner is not writing SQL queries. It's very different in terms of the way people interact with something like Pinot versus the data warehouse system. So you always want to think in terms of apps. What apps do I build on top of Pinot? How do I actually showcase this value to the end users? And that should always be through apps, and not providing them with a SQL editor and saying, oh, write your SQL query.
Starting point is 00:32:29 And that will never work. And then that's also probably the wrong way of using something as powerful as Pinot. Okay. Yeah, makes total sense. So you mentioned before two other vendors outside of like StarTree or like Pinot, also like ClickHouse and Druid, right? And I think Druid is probably the oldest one or am I wrong on that?
Starting point is 00:32:51 I think, yeah, you're right. Druid is probably the oldest one. And I would say, I don't know, Elasticsearch was probably there. I think that's also something that people use for analytics. Even though it was not purpose-built for that, it was really built as a search engine. But once you have an inverted index, you can throw in anything like that. So a lot of people ended up using Elasticsearch as well.
Starting point is 00:33:12 Yeah. Yeah. So what's the difference between these three tools, at least? Give us a little bit of help understanding the landscape, the vendors there, and what the differences are. Yeah. I think the key thing for us to look at is the evolution of analytics itself.
Starting point is 00:33:31 Right. Then it kind of becomes apparent where Pinot shines versus the other systems. So the world went from pure batch analytics to real-time analytics, right? That's where Druid came in, and I think ClickHouse added the Kafka connectors and other things a little bit later on. But it was really going from batch to real time where those two systems came in. And where Pinot comes in is going from
Starting point is 00:33:57 real-time, low concurrency to external-facing with high concurrency and predictable low latency, right? So freshness was the first factor, which drove Druid and ClickHouse in the beginning, and latency in seconds was completely fine. With Pinot, we still need the real time, that has become table stakes now, but you are adding other dimensions: it has to be millisecond
Starting point is 00:34:25 response time, and it has to be able to serve very, very high concurrent requests, and be predictable about that. So it's really when you draw the graph of latency versus concurrency that you start seeing the differences between these systems. Pinot is able to keep that low latency as the workload increases, whereas the other two systems were not really made for that in terms of the low-level system design, as I mentioned about the fine-grained indexing, right? ClickHouse still has sparse indexes. It doesn't have the fine-grained indexes that Pinot has.
Starting point is 00:34:58 Same with Druid. Druid has only one level of indexing, which is the inverted index. It doesn't have all the other indexes that we mentioned. It doesn't have the concept of star-tree indexes and the others. So we are very, very focused on providing predictable low latency at any scale, whereas for both Druid and ClickHouse, the main purpose was to provide low latency for the internal use cases. That's where they really started, and that was the key premise behind their design.
Starting point is 00:35:30 Okay. Yeah, that makes sense. And that's actually interesting, the differentiation there between the internal and external use cases, because yeah, concurrency is important there. You need to have really high concurrency together with the low latency to do that. Okay, so let's go back a little bit to the technical conversation that we had about indices and all that stuff. So what's the experience
Starting point is 00:35:57 the user has with setting up Pinot right now? Okay, you give all these options in terms of the indexes that you can use. You also have a smart indexing system with the star-tree that you mentioned there. How much is the user responsible for making Pinot as fast as it can be? And how much of this is automated? I mean, we would definitely want to take these in phases, right? We don't want to try to become too smart there in
Starting point is 00:36:33 terms of figuring out the indexes automatically. So our first option was to always have knobs. I mean, this is what databases have done throughout the world for decades, right? You have an ALTER TABLE and you can add indexes. And I think what we took on is: these are all the features that we have, here are all the knobs you control. You have the control in terms of what you are trading off, because adding an index comes with a cost. You're adding a little bit of storage overhead there.
Starting point is 00:37:00 But now you're trading that off for, hey, I'm going to get amazing performance. So we kind of have this complete flexibility. That's where we went with it: each column can actually be of any type, and it can be encoded in any way. It can be raw encoded, it can be dictionary encoded, it can be run-length encoded. So you have all these options that we give to the user, and each
Starting point is 00:37:21 column can have any type of index. It can have an inverted index, a range index, a JSON index, whatever you want. And even the type is very flexible. It can be a strict type, like long, et cetera. It can also be semi-structured JSON.
Starting point is 00:37:39 So now you have indexes on the JSON as well, or you can even go further and it can actually be a text index, right? Like completely unstructured. So you can go from structured to semi-structured to unstructured. We give that whole flexibility to the user. Now, again, that puts an onus on the user: oh my God, what indexes do I configure here?
Starting point is 00:38:03 So the way we approach that problem is really about giving them the insights when they run a query. The way we advise is: don't try to configure anything upfront, because everything can be changed dynamically while the system is running. This is how we built and operated the system at LinkedIn. All of these indexes that we talked about can be added dynamically without having to re-ingest the data. So you can just ingest the data as it is.
Starting point is 00:38:29 Don't worry about the performance in the first shot. Run a query. And then, unlike traditional databases, which provide an explain query plan that says, okay, these are the things that are actually slow, we actually embed that in every query. So when you run a query, you get the response, and you will know exactly how much time was spent in which phase of the query.
Starting point is 00:38:51 Is it the aggregation phase? Is it the filter phase? Is it the project phase? And then based on that, you can go back and say, okay, I'm spending a lot more time in filtering. Which column am I spending that time on? And then, what indexes can I actually add? So it's more reactive than trying to be smart about it.
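Concretely, those per-query stats come back inline with the result. A minimal sketch against a local broker (the default port is assumed; the metadata fields shown are the ones Pinot's broker response documents, such as numEntriesScannedInFilter):

```python
import requests  # assumes a Pinot broker running locally on its default port (8099)

resp = requests.post(
    "http://localhost:8099/query/sql",
    json={"sql": "SELECT COUNT(*) FROM profile_views WHERE country = 'US'"},
).json()

# Execution stats arrive with every response -- no separate EXPLAIN round trip.
print(resp["timeUsedMs"])                 # end-to-end latency
print(resp["numDocsScanned"])             # rows that survived the filter
print(resp["numEntriesScannedInFilter"])  # work done while filtering; if this is
# large relative to numDocsScanned, the filter column is a candidate for an
# inverted or range index, which can be added without re-ingesting the data.
```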
Starting point is 00:39:11 But I think these are some of the things that we plan to make easier with the StarTree version of Pinot, where we will constantly keep analyzing the queries, and then behind the scenes we'll say, okay, this is what needs to happen for this, and we automatically set up those indexes so that users automatically start seeing better performance. So the learning and application of those indexes happens outside the core system, because we don't want to do too many things automatically. It just confuses the user.
Starting point is 00:39:43 Yeah, of course. Okay, it's time for me to give the microphone back to Eric. One last question. Share with us something exciting about StarTree that is going to come in the next, I don't know, couple of weeks or months or whatever. What's the next thing that you are very excited about? Yeah, I think two things. One is definitely the tiered storage, which is a big
Starting point is 00:40:10 design change. It is not easy to change this design in systems that are five to six years old, right? If you look at all these systems, Pinot, Druid, or ClickHouse, the storage and compute are tightly coupled. And now we actually announced a decoupled version of that. So now, within one system, you can say, hey, these tables should be local, these tables should be remote. Because one of the things that users were asking is, hey, Pinot is so fast.
Starting point is 00:40:41 How can I keep the data much longer? And I don't care too much about the latency for the old data, but I want like very, very fast latency for the recent data. But now I don't want to take this data from one system to another system and then keep moving these things around. Like, can you actually make it fast in the same system? So that's kind of where the tiered storage comes in. So you can say for seven days, I want the data to be local.
Starting point is 00:41:05 And as soon as the data is older than seven days, decouple it so that it goes into S3 and the queries run directly on top of it. So that's something that we are super excited about. Now you can keep the data in Pinot for as long as you want. And when a query comes for the older data, it's going to be slightly slower: instead of tens of milliseconds, it's going to be hundreds of milliseconds. But that's acceptable, because you're now able to trade off between the cost and latency.
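A sketch of what such a rule could look like, in the spirit of Pinot's tierConfigs. Open-source Pinot's documented tiers move older segments between server groups; the S3-backed remote tier described here is the StarTree addition, so the storage fields below are illustrative assumptions rather than exact keys:

```python
# Illustrative tiering rule: segments older than 7 days leave the local
# SSD-backed servers. "remote" and "uri" mimic the S3-backed tier described
# in the episode and are assumptions; open-source Pinot's documented
# storageType is "pinot_server" plus a serverTag for the target server group.
tier_configs = [
    {
        "name": "cold",
        "segmentSelectorType": "time",
        "segmentAge": "7d",                    # matches segments older than 7 days
        "storageType": "remote",               # hypothetical S3-backed tier
        "uri": "s3://example-pinot-cold-tier",
    }
]
```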
Starting point is 00:41:34 So that ability to trade off between cost and latency was not available before, and now we give that level of flexibility to the user in terms of picking one versus the other. And the second one is joins. We are in very early stages, but we are coming up with some really cool ways of doing joins, unlike the other systems, in terms of being able to do joins in a streaming manner, right? And that's something that's going to help us in terms of addressing some of
Starting point is 00:42:05 the user-facing analytics use cases, where you just need a lookup join between the two, or you want a single join between two tables. So that's something that we are also super excited about. Super cool. All right. Eric, all yours. All right. One more question since we're close to the buzzer here.
Starting point is 00:42:26 I'm interested to know, I'm just thinking, Pinot, StarTree, at what point, or what are the signals in your mind that indicate, okay, you should sort of look at this? Are they related to scale? Are they related to sort of, you know, complexity? Like, when in the life cycle, say, of the growth of an organization, what are the indicators that this technology is appropriate to implement? Yeah, no, I think that's a great question. I think two years back my answer would have been very different. I would have said, hey, until you grow like LinkedIn or Uber,
Starting point is 00:43:23 you don't need this. But that's changed drastically in the last two years. I think there is a lot of value that companies actually have hidden in the data that they're collecting, but that's not unlocked for their end users, right? Look at all the small startup companies that are out there. They are collecting huge amounts of data from their customers, especially the SaaS vendors, right? And now, what are they doing with that data? They cannot just say, hey, give me all your data, and I'm not giving anything back
Starting point is 00:43:55 to you, right? So now they're forced to think about, hey, what more value can I actually provide on the data that I'm collecting? So they're building some sort of a data product. It really starts from there. It's not about the size of the data. It's not about the complexity of the queries or any of those. It's really about what more value we can extract from the data that we are collecting. So that's where it starts. What better decisions can we enable?
Starting point is 00:44:22 There is the recent case of Cisco, which is kind of big. But even if you look at the call that is happening right now: how long did you talk? How long did I talk? That's useful analytics to have. Across all the podcasts that you have done, it would be so cool to get analytics on top of that. What was the tone? How long did each one talk?
Starting point is 00:44:47 All sorts of analytics are actually very useful. So it's really about, hey, we have all this data. For decades we have been thinking about providing insights to our internal employees within the org. How can we do something better than that? How can we give it to the people who can actually make use of this data and make better decisions in their life? So that's the mindset I would start off with.
Starting point is 00:45:13 And once you start seeing the patterns, it's not that hard to find out that there is so much more value that we can extract out of this data. And that's where I would start. That's great. Yeah. I love that answer because I think it's a good example of how technology that's built for sort of super-scale enterprise, if you want to call it that, trickles down and becomes democratized, in a way, for lots of companies to use.
Starting point is 00:45:42 So super exciting. This has been such a wonderful conversation, Kishore. Thank you for joining us and giving us some of your time and teaching us about Pinot and StarTree. Well, thank you. Thank you, Eric. It was really good to be on this call
Starting point is 00:45:56 and I completely enjoyed it. Thank you for having me again. I think the thing that I really took away from the challenges related to user-facing analytics was just the concept of, let's say, 50 million people doing something at the same time and then needing to basically be delivered some sort of result from a computation. How many people viewed your profile, et cetera. And it's just wild to think about that level of scale with that little latency, you know, it's really just insane.
Starting point is 00:46:35 And so, you know, it sounds like a really challenging problem they solved. And Pinot, you know, seems like a pretty amazing technology to do that. But yeah, that's kind of not the type of problem that you face every day. Yeah. I mean, I really enjoyed the conversation that we had, because it's one of these cases where you have, let's say, a use case drive some
Starting point is 00:46:59 specialization of a system that almost leads to new design principles for database systems. So it was very interesting to discuss and hear all that stuff about how you index the data, how you store the data to make access to it faster, how you encode the data, all these optimization techniques that have to be applied in order to enable something similar to what you described. And most importantly, what the trade-offs are, right?
Starting point is 00:47:35 Because obviously there are trade-offs there. So yeah, it was a very, very interesting conversation, and I hope we'll have the opportunity to repeat it in the future, because there's a lot that we haven't covered today. I totally agree. Well, thanks for joining us again, and we will catch you on the next Data Stack Show. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
Starting point is 00:48:14 The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
