The Data Stack Show - Data Council Week (Ep 1): Discussing Firebolt’s Engine With Benjamin Hopp

Episode Date: April 25, 2022

Highlights from this week’s conversation include:

- Ben’s career journey (2:55)
- What makes Firebolt different (3:58)
- Firebolt’s data product family (7:37)
- Table engines and Firebolt (10:57)
- Ben’s favorite part of ClickHouse (12:52)
- The experience of building an optimizer (15:19)
- Where Firebolt fits into architecture (17:27)
- Working in the data space: to love and dislike (19:51)
- Coming soon in the near future (24:35)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. We are recording on site at Data Council Austin, which is super exciting. Let's talk about our first guest. We're in a little conference room here with all the mics set up, which is super fun. And we're going to talk with Ben from Firebolt. Now, Firebolt, Costas, is...
Starting point is 00:00:47 I don't know if it's come up a ton on the podcast, but you and I have talked about Firebolt, and it's a really interesting product. And all their marketing, you know, and even the name of the company is really focused on speed, of course. And so what I'm interested to know is when you talk about, you know, blazingly fast analytics, for example, that can mean tons of different things. And, and specifically what I'm interested to know is, you know, it's impossible for a new-ish tool to be the fastest across the entire, you know, the entire sort of value chain, right? Like as it relates to like data. So I want to know like where,
Starting point is 00:01:26 what specific thing are they like super duper fast at, you know, or a couple of things that they sort of, you know, stake their claim on. So that's what I want to learn. Like, and what does that look like under the hood? How about you? Yeah. I mean, for me, it's always interesting to see
Starting point is 00:01:39 companies going to market with data warehouse solution. Database systems are like notoriously difficult to build. Many teams have tried and failed. They usually take many years to get them on the market. So it's very interesting to learn more about the whole journey of how they started and how they ended up at the state right now where pretty much like competing with other cloud data warehouse solutions
Starting point is 00:02:10 out there like Snowflake and BigQuery. So I'm very curious about this journey and where the product is today, what is missing and what's next. All right, well, let's dig in and talk with Ben. Let's do it. Ben, welcome to the Data Stack Show. We're so excited to have you. You are a solution architect lead at Firebolt, and we're super excited. We've wanted to have you on
Starting point is 00:02:32 the show for a while, and we caught up with you at Data Council Austin. So we're in person in Austin in a really fun way to do a show. So thanks for giving us some time out of your conference to spend on the show. Thank you for having me. Really excited to be here and talk about data. Cool. Okay, so give us your background. How did you get into data originally? And then how did you end up at Firebolt? Yeah, so I've been in the data space my entire career.
Starting point is 00:02:57 I kind of got pulled into it and started out of college, thought I was going to be a Java developer. Worked at a company that decided I was better suited as a database administrator. So I was a Microsoft SQL Server database administrator for a few years. From there, I went to work at Hortonworks back in the days prior to Cloudera. I did consulting for Hortonworks for a number of years, specialized in streaming data with Apache NiFi. That kind of brought me into the streaming world. And then I went into work for a company called Imply with Apache Druid, some streaming data, big data projects, worked briefly at a company called Upsolver doing streaming ETL. And that brings me to Firebolt, where I've been a solution architect for a little over a year now.
Starting point is 00:03:48 Super interesting. Yeah. So you've worked at a lot of companies that sort of built on like core technology, a lot of it's open source,, but our claim to fame is really fast analytical queries. So we are targeting use cases that need sub-second performance that are powering dashboards or powering visualizations that really benefit from low latency queries, high concurrency workloads. Super interesting. Give us a couple examples there because low latency or even real time is like, those terms are like really relative. Like some companies it's like, data every hour is real time, you know? So when we say low latency, I'm talking specifically on the query
Starting point is 00:04:35 latency, not necessarily the data ingest latency. So queries that, you know, when you load a page, it may send out 10, 15 queries. You want all of those queries back sub-second. So that's the query latency. As far as the data load latency, we're not a real-time data warehouse. Got it. Do batch loads. So, you know, 5 to 15 minutes is usually the highest frequency that you're going to see. Obviously, we want to move towards real-time.
Starting point is 00:05:02 We're building out Kafka integration and things like that. That's going to be coming soon. But right it's it's micro batch yeah ingestion okay so on the query latency what are some of the use cases that require that sub-second query latency yeah so we often see companies that have user-facing analytics where their business depends on their users being able to log in and actually see their analytics. We also see a lot of internal use cases like dashboarding, Looker, Tableau, those sorts of things where you want to be able to slice and dice your data and explore the data without waiting 15, 20, 30 seconds every time you issue a new query. Yep. Makes total sense. Okay. Costas, I've been monopolizing the conversation as I often do, but.
Starting point is 00:05:49 Costas Pintas- Wow, Nerman, that's an amazing introduction. So let's talk a little bit about like how, what you did before. Nerman V.: Sure. Costas Pintas- You did Firewalls because you mentioned before Firewalls, you were working with Druid. Nerman V.: Yep. Costas Pintas- As a technology. So there was like a lot of real time kind of like use cases that you were working on.
Starting point is 00:06:08 So how is Firebolt different on the same level? Yeah, great question. So the biggest difference is a separation of storage and compute. So Druid requires the data to be loaded to the actual processing servers prior to being able to run a query. Whereas Firebolt, we aggressively cache data, but your first query can actually go fetch the data from your deep storage in S3. So you don't have to wait for a cluster to start up and fetch all the data before you
Starting point is 00:06:40 can query. And it allows you to spin up multiple, what we call engines, but it's really just clusters of compute resources independently of your data. So you can have, you know, if, if you're just doing a small amount of compute, but you may have lots of historic data, you can have all of that data stored in S3 and a fairly small amount of compute that's actually being utilized because you're only querying a small sliver of the data at any given time. Whereas Druid, if you want to have all of the data available for querying, you
Starting point is 00:07:11 need to have all of the data loaded to the servers, so it's more efficient in that sense. Druid does have some advantages, no doubt, especially as it pertains to streaming data. Being able to query simultaneously your batch data and streaming data is really useful for those use cases that really require that sub minute ingestion latency as well as direct integration with, with Kafka is a real nice feature. That's interesting.
Starting point is 00:07:37 And there is like a family for like technology over there where you probably like, you must be know, but you know, can it's your top gift cards, right? Yep. It's three very, like they belong to the same category of like solutions. Yep. There were the weavies like for similar use cases. Would you say the light firebolt is like part of this body of products? Very intimately part of that.
Starting point is 00:08:01 Yeah. Under the hood firebolt actually is using some ClickHouse code for the compute engine. Oh, interesting. We're forked from ClickHouse. Now, we use a completely different storage handler. So that's what allows us to separate storage and compute, because otherwise ClickHouse does require the storage to be local. We also use a completely different query parser. So our query optimizations are all built in-house.
Starting point is 00:08:23 And then there's some other tweaks and things like that, but the actual kind of engine, the computing, the bits behind the scenes, that's all based on ClickHouse code originally. That's, that's a brilliant thing. And then why did ClickHouse look like one of the other two? Like, yeah. Our reason, though. A loaded question.
Starting point is 00:08:44 You thought about Dru loaded question. Yeah, so obviously our goal, at least the goal that was told to me, I started it after the company was founded, so I can't be sure of these things. But from the stories I've been told, the goal has always been to make a true full-featured cloud data warehouse. That means being able to handle all data warehousing use cases. And ClickHouse is kind of the best position to do that. Whether using Pino or Druid, they don't have very good join support. And I guess both, well, I'm not sure about Pino. Druid is Java-based. I think it is also, and there's some overhead there.
Starting point is 00:09:26 So being a C++ native application kind of gave ClickHouse the edge and having the flexibility to extend it and kind of build it into a full feature cloud data warehouse rather than kind of a specialty streaming solution. And how is that positioned Firewalls in the markets when you have like a couple of different companies out there that they offer ClickHouse? Yeah, seriously. How is that? So I think that, I don't think that we have a negative relationship.
Starting point is 00:09:54 I think that those kind of ClickHouse managed services, people that are very familiar with ClickHouse very well, but ClickHouse is not a simple product to get in and use. manage services, people that are very familiar with ClickHouse very well, but ClickHouse is not a simple product to get in and use, uh, Firebolt is built for simplicity, we, we aren't just a wrapper for ClickHouse, we have less features than ClickHouse, uh, because we want to make it user friendly and, and stable and all of that. So we're, we're not just a ClickHouse kind of fork. We are our own thing, although we're using the ClickHouse engine, but our SQL dialect is completely different.
Starting point is 00:10:35 If you try to use a ClickHouse function in Firebolt, you're not going to have any luck doing that. So a detail, for example, has this concept of like the table engines and the VPLT engines. Yep. Is this something that like, um, can be configured has this concept of like the table engines and the representation engines. Is this something that like can be configured by the user of Firebolt or this is like set up by you and it's like part of how you optimize the engine to deliver the experience?
Starting point is 00:10:55 Yeah. Uh, great question. So yeah, there's no concept of those different table engines in Firebolt. We do have a concept of a couple of different table types. So we have a fact of a couple different table types. So we have a fact table and a dimension table. Behind the scenes, what that means is a fact table is sharded across all of the nodes in the cluster, whereas a dimension table is replicated to all the nodes. On top of that, we have a couple different indexes. So we have, it's called an aggregating
Starting point is 00:11:21 index, which is really just a materialized view that is always updated as you ingest more data. You can set your aggregations and your dimensions, very similar to like a Druid rollup in Fireball. And then the join indexes, which is a in-memory join to really optimize performance. And our goal is to provide an out-of-the-box experience that everybody gets good performance. But if you have specialty use cases, you know ahead of time exactly what aggregations are going to be done and you're going to be running those, you know, potentially hundreds of times per minute, you can optimize for those specific use cases. That's very interesting. It's very, very, very smart. Like the way that, let's say these features are productized in a way.
Starting point is 00:12:09 Yep. How you create like a product experience on top of like something that is, you know, like things like a materialized view or like how you distribute like re-leads like a table is going to be distributed or not and all that stuff. Like that's very, I find this very interesting exactly because like, that's exactly how a product works, right? Like you get what the customer needs and you map the technology stuff and you just like abstract into it.
Starting point is 00:12:36 Yep. Nobody needs to know behind the scenes what is happening there. So that's, that's great. And okay. So having worked with ClickHouse, what's your favorite part of it? Favorite part of ClickHouse? Yeah. Oh, okay.
Starting point is 00:12:51 So I am a big fan of the aggregating indexes because I come from an old school world of databasing where you created summary tables. You had, I've used SQL Server analytic services to build data cubes and summarize data and being able to get that same effect of pre-computing all of your aggregations, but not having to wait for a nightly refresh and being able to build those on the fly, I think is really cool.
Starting point is 00:13:24 And then the automatic query rewriting. So as your users are writing queries or your BI tools writing queries, it's going to automatically use the aggregating indexes that are available. And you can have multiple aggregating indexes or query plan or automatically choose the best one for the query. So, you know, that, again, going back to my history in Druid, Druid had a concept of roll-ups. So as you ingest data, it'll aggregate it to a certain granularity. The aggregating indexes allow you to do that same thing, except you can aggregate to multiple
Starting point is 00:14:00 different granularities. You can aggregate on a field that isn't time-based like you need in Druid. So it's a lot more flexible and, but at the same time, remaining user-friendly as opposed to rolling your own materialized view in another system. Yeah. And what's your favorite Firebolt application, ClickHouse? The query planner. FireFull has its own query planner to optimize queries. ClickHouse has no real query planner, does exactly what you put in.
Starting point is 00:14:31 So when you actually release your product to the world and people write queries and some of them are not optimized, sometimes they're doing massive joins and there's no pushdown or anything like that. So having the query planner automatically do those optimizations, use materialized views, use the join indexes, all of that, that's a huge benefit over using just raw ClickHouse. How was the experience of like building this optimizer? I mean, the reason I'm asking is because I know that like it's one of the toughest problems in like database instance, right? Yep. And one of the hardest and probably like most something that can
Starting point is 00:15:11 be solved like at the end, right? Like it's very discussed topic, like in computer science isn't it? Not so. How, how was the experience of like building an optimizer? Well, you might have to talk to people that are slightly smarter than me, cause I didn't build the optimizer. But I think that it's a ongoing process. We're always encountering new problems and finding new ways to optimize code.
Starting point is 00:15:36 You know, frequently we get data from Tableau or Looker and it's generating queries and we have to kind of understand what it's trying to do and then see if there's a way to do it better. And our solution architecture team, one of their core responsibilities is to take SQL code that customers are generating and find ways to optimize it. And then we provide that information back to our product and engineering teams
Starting point is 00:16:02 so that they can build those optimizations back into the product and ultimately kind of that they can build those optimizations back into the product and ultimately kind of make it more user-friendly yeah it makes sense yeah that's why they're very interesting both problems like um i'll tell you about like kind of feedback link between how the customer experience drives something so deeply technical as an optimizer at the end. I think that's one of the most interesting things of both engineering and product teams have to experience in working in product tech Firebolt, which I find very fascinating. So that's super interesting.
Starting point is 00:16:36 So I'm interested to know, where does Firebolt fit into architecture? So you mentioned that you want it to be sort of a fully featured cloud data warehouse, right? Or that's what it is. So, you know, which actually sounds different than maybe some of the language that we hear from Snowflake a la like a data platform, right? That sort of includes a cloud data warehouse,
Starting point is 00:17:01 but also has this constellation of other tooling around it. So when companies implement Firebolt, includes a cloud data warehouse, but also has this constellation of other tooling around it. So when companies implement Fireball, what I'd love to know is sort of what are the types of companies that are adopting it? And then how do they fit it into their architecture? Is it a replacement, you know, for sort of a Snowflake or a Redshift or whatever? So yeah, just tell us how companies are fitting it into their data stack. Yeah. So I guess, as I mentioned, like we want to be a full featured cloud data warehouse. I'll be the first to admit we're not even there yet. We're a data warehouse with some very specific use cases that frequently, you know, we have customers that are coming from Redshift and Snowflake and different data warehouses, that they continue using those products in a debt into Firebolt.
Starting point is 00:17:47 Okay. Firebolt lacks a lot of the kind of ancillary functionality, a lot of the large-scale data processing capabilities of something like Snowflake. Whereas, you know, we're built for a write once, read a whole bunch of times architecture. Got it. We don't have right now row level updates and deletes. So if you need to make an update to a record, you need to drop a partition of data, which isn't that unusual in the traditional OLAP world. But people have gotten so used to Snowflake allowing things like that, that for some use
Starting point is 00:18:25 cases, it's just required. But for those other use cases where they are doing the analytics, where it's immutable data or not frequently changing data, they can kind of peel off use cases and use Firebolt with those. We built kind of our business model to make that very easy. It's all pay-as-you-go, consumption-based. You don't have to sign a contract or anything. So as Firebolt grows and encompasses more and more features, then you can grow the use cases and move more and more off.
Starting point is 00:18:56 So we want to be very cost-effective for the use cases that we're really good at and then grow into the rest. Sure. Makes total sense. And do you have sort of a particular type of company or even industry that tends to adopt Firebolt because of the use cases? So we oftentimes see more cloud-native organizations, you know, smaller companies that are comfortable with a SaaS data warehouse, that they're comfortable with the data
Starting point is 00:19:22 leaving their walled garden, their VPC. And we also see companies that usually have large data sets. So ad tech data, gaming data, clickstream data, marketing data, all of these sorts of things that have huge volumes of immutable data are really, you know, a natural fit for Firebolt. Yeah, super interesting. Okay, personal question, what you've worked in and seen sort of firsthand, like a lot of data technologies, you're still working in data. What do you love most about it? You know, just from a personal level, or I mean, do you? Maybe like, you know, some of us get like really deep into a career. It's like, you don't know going back. that always provides value. I mean, knowing different programming languages and being able to work with data is immensely valuable, but data itself is something that is always going to
Starting point is 00:20:32 be growing. It's always going to be around. So I think that there's unlimited opportunities for working in data. Yeah, for sure. Okay. And on the flip side, what do you like least about working with data or working in the data space? The thing, I think the thing I like least is anybody that kind of positions themselves as a answer to every question. If you are a system that is really good at doing massive data processing tasks like Spark, chances are very good that you're not going to be great at doing very fast key value lookups, for instance. So there's oftentimes a use case that is a good fit for a tool and use cases that are not a good fit for the tool. And understanding where those kind of good fit is, is very important.
Starting point is 00:21:32 But having one product that says it serves every use case, I think is, is just unreasonable. And, you know, I don't want to. I was literally going to say marketing is the worst part about working in data. And I'm a marketer working in data, but I couldn't agree with you more. I'm sure Firebolt marketing is probably going to be listening to this podcast. So I wanted to dance around it a little bit. Yeah. Other than Firebolt's amazing marketing team that i could not love more
Starting point is 00:22:06 yeah no but i mean i actually really appreciate the sort of transparency or honesty around saying like this is what we want to be and this is what we do excellently now i think that's really helpful and i mean i appreciate that and i think our listeners appreciate that too, where you kind of know what you're getting as opposed to, because you're totally right. It's the like, the disenfranchisement of you sort of look at the site, you look at the product page, you're like, this is awesome. You know, the docs are a little sparse, like, let me try this. And you're like, oh, right. I know why the docs are sparse. Yeah. To be honest, I mean, the industry is like at stage right now is it's like quite early and there's a lot of innovation happening.
Starting point is 00:22:48 So it's not like things change. Sure. From day to day. So it's not just, it's not the market. The fault that's not this is the cool. It's just that they know that the market is still trying to figure them out. Yep. I guess the, another thing that always kind of
Starting point is 00:23:06 rubs me the wrong way is people that make statements based on like outdated information, you know, saying that whatever technology it is, is what it was five years ago. I still hear people saying that like they don't want to use Apache Druid because
Starting point is 00:23:22 they write SQL. I'm like, you've been able to use SQL with Apache Druid for, you know, almost five years, if not more. So I guess people should always be kind of reevaluating their preconceived notion about any technology or any company as time goes on. Yeah, for sure. I think we talked about that with a term like, you know, CDC, change data capture, right? And it's sort of, you know, there are companies doing really interesting things with it, right? But it's not new, right? It's really old technology, even though there are some new companies that there's some excitement around. But it's not like, you know, it's...
Starting point is 00:24:00 I don't want to do CDC because then I have to put triggers in my database and it's going to put an additional overhead. I've been doing this for far too long, but yeah. Yeah. Yeah. No, that's great. That's great. Well, this has been so wonderful to have you on the show. Learned a ton about Firebolt.
Starting point is 00:24:17 Kostas, any last questions before we sign off? I mean, it has been great. Like, it's great that we learn more about the core technology. And before we, we end the show, tell us like something exciting that is coming in the near future. Oh, great question. I think streaming data coming to Firebolt is going to be huge. We are working on building in mutations so you can do those
Starting point is 00:24:42 row level updates and deletes. I think that's going to open up a lot of new use cases for Firebolt customers. Our ecosystem team is booming. We're always adding new partners and new integrations into the system. So anytime we can get another partner and learn more about their product
Starting point is 00:25:02 and cross-sell and all of that, that always gets me very excited as well. You have also, I know, learn more about their product and cross-sell and all of that, that always gets me very excited as well. You also, I guess, I know a lot of great team. Yeah. You never know what's coming out of our marketing department. So that's always exciting. You got to watch the marketers, especially when it comes to data. All right, Ben, thank you so much for taking some time with us on the show.
Starting point is 00:25:22 Yeah, thank you for having me. Here's one of my takeaways. You know, I'm trying to, we could probably count, we could definitely count on one hand the number of times that Hortonworks has come up on the show. You know, I mean, even the name Hortonworks sounds a little bit, you know, enterprise-y. I mean, I guess it is enterprise-y actually. But it was just interesting to hear about that. And my guess is that Hortonworks probably has played a bigger role in the data world than I think a lot of the content on our show necessarily gives it credit for. That's my takeaway. Also, the Hortonworks guy was actually from the East Coast side of Atlanta, which is interesting as well. So yeah, I don't know. It's just interesting to hear him talk a little bit about Hortonworks and kind of
Starting point is 00:26:02 the work that he did there. And then of course, like the Druid stuff is interesting. But what was your takeaway? I think one of the most interesting parts of the conversation was a click house and how actually open source can fuel, let's say, the innovation in this space. like considering that a database is just like a too risky thing to get to market, to actually get to a point where we can start building companies and products and iterate fast without the risk of the past for that. Something that happened a lot with SaaS in the past decade, for example. But if you want to replicate this in, let's say, the data-related infrastructure, we need something similar, right? And it seems that open source, and it's not just ClickHouse,
Starting point is 00:26:49 but I think that in this case, it's a very good example of how they took the core part of this, they cloned the prod, built their own query optimizer on top of it, changed the query parser. I mean, they've done a ton of work, but this work was done like a ton of work, but this work was done like on a very solid core that would help them like accelerate the whole process of taking this product to the market, right?
Starting point is 00:27:13 And this is needed. So I'm very excited about that. And I'm waiting to see like what other like products do something similar like this out there. There are examples like this database space, like we have Vitesse, for example, and PlanetScale, which is like this out there. There are examples like this database space. We have Vitesse, for example, and PlanetScale, which is based on that. Anyway, that was probably one of the most interesting parts
Starting point is 00:27:33 of the conversation, what really made me excited. All right. Great conversation. Several more shows for you from Data Council, so subscribe if you haven't, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com.
Starting point is 00:27:59 That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rutterstack.com.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.