The Data Stack Show - Data Council Week (Ep 1): Discussing Firebolt’s Engine With Benjamin Hopp
Episode Date: April 25, 2022

Highlights from this week’s conversation include:
- Ben’s career journey (2:55)
- What makes Firebolt different (3:58)
- Firebolt’s data product family (7:37)
- Table engines and Firebolt (10:57)
- Ben’s favorite part of ClickHouse (12:52)
- The experience of building an optimizer (15:19)
- Where Firebolt fits into architecture (17:27)
- Working in the data space: to love and dislike (19:51)
- Coming soon in the near future (24:35)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com. Welcome to the Data Stack Show. We are recording
on site at Data Council Austin, which is super exciting. Let's talk about our first guest. We're
in a little conference room here with all the mics set up, which is super fun. And we're going to
talk with Ben from Firebolt. Now, Firebolt, Kostas, is...
I don't know if it's come up a ton on the podcast,
but you and I have talked about Firebolt,
and it's a really interesting product.
And all their marketing, you know,
and even the name of the company is really focused on speed, of course.
And so what I'm interested to know is
when you talk about, you know,
blazingly fast analytics, for example, that can mean tons of different things. And specifically what I'm interested to know is, you know, it's impossible for a new-ish tool to be the fastest across the entire, you know, the entire sort of value chain, right? Like as it relates to data. So I want to know,
what specific thing are they like super duper fast at,
you know, or a couple of things
that they sort of, you know, stake their claim on.
So that's what I want to learn.
Like, and what does that look like under the hood?
How about you?
Yeah.
I mean, for me, it's always interesting to see
companies going to market with a data warehouse solution.
Database systems are like notoriously difficult to build.
Many teams have tried and failed.
They usually take many years to get them on the market.
So it's very interesting to learn more about the whole journey
of how they started and how they ended up at the state they're in right now,
where they're pretty much competing
with other cloud data warehouse solutions
out there like Snowflake and BigQuery.
So I'm very curious about this journey
and where the product is today,
what is missing and what's next.
All right, well, let's dig in and talk with Ben.
Let's do it.
Ben, welcome to the Data Stack Show. We're so excited to have you.
You are a solution architect lead at Firebolt, and we're super excited. We've wanted to have you on
the show for a while, and we caught up with you at Data Council Austin. So we're in person in Austin,
which is a really fun way to do a show. So thanks for giving us some time out of your conference to
spend on the show. Thank you for having me. Really excited to be here and talk about data.
Cool.
Okay, so give us your background.
How did you get into data originally?
And then how did you end up at Firebolt?
Yeah, so I've been in the data space my entire career.
I kind of got pulled into it. Starting out of college, I thought I was going to be a Java developer.
Worked at a company that decided I was
better suited as a database administrator. So I was a Microsoft SQL Server database administrator
for a few years. From there, I went to work at Hortonworks, back in the days prior to the Cloudera merger.
I did consulting for Hortonworks for a number of years, specialized in streaming data with Apache NiFi. That kind of brought me
into the streaming world. And then I went to work for a company called Imply, working with Apache Druid,
some streaming data, big data projects, worked briefly at a company called Upsolver doing
streaming ETL. And that brings me to Firebolt, where I've been a solution architect for a little over a year now.
Super interesting. Yeah. So you've worked at a lot of companies that sort of built on core technology, a lot of it open source. So what makes Firebolt different?
Yeah, so our claim to fame is really fast analytical
queries. So we are targeting use cases that need sub-second performance that are powering dashboards
or powering visualizations that really benefit from low latency queries, high concurrency workloads.
Super interesting. Give us a couple examples there because
low latency or even real time is like,
those terms are like really relative.
For some companies, data every hour is real time, you know?
So when we say low latency, I'm talking specifically on the query
latency, not necessarily the data ingest latency.
So queries that, you know, when you load a page, it may send out 10, 15 queries.
You want all of those queries back sub-second.
So that's the query latency.
As far as the data load latency, we're not a real-time data warehouse. Got it.
We do batch loads.
So, you know, 5 to 15 minutes is usually the highest frequency that you're going to see.
Obviously, we want to move towards real-time.
We're building out Kafka integration and things like that.
That's going to be coming soon. But right now, it's micro-batch ingestion.
Okay, so on the query latency, what are some of the use cases that require that sub-second query latency?
Yeah, so
we often see companies that have user-facing analytics where their business depends on their users
being able to log in and actually see their analytics. We also see a lot of internal use
cases like dashboarding, Looker, Tableau, those sorts of things where you want to be able to
slice and dice your data and explore the data without waiting 15, 20, 30 seconds every time
you issue a new query. Yep. Makes total sense. Okay. Kostas, I've been monopolizing the conversation, as I often do, but.
Wow, Eric, that's an amazing introduction.
So let's talk a little bit about what you did before Firebolt, because you mentioned that before Firebolt you were working with Druid.
Yep.
As a technology. So there were a lot of real-time use cases that you were working on. So how is Firebolt different on that level?
Yeah, great question.
So the biggest difference is a separation of storage and compute.
So Druid requires the data to be loaded to the actual processing servers prior
to being able to run a query.
Whereas Firebolt, we aggressively cache data, but your first query can actually go fetch
the data from your deep storage in S3.
So you don't have to wait for a cluster to start up and fetch all the data before you
can query.
And it allows you to spin up multiple, what we call engines,
but it's really just clusters of compute resources independently of your data.
So, you know, if you're just doing a small amount of compute, but
you may have lots of historic data, you can have all of that data stored in S3
and a fairly small amount of compute that's actually being utilized because
you're only querying a small sliver of the data at any given time.
Whereas with Druid, if you want to have all of the data available for querying, you
need to have all of the data loaded to the servers. So Firebolt is more efficient in that
sense.
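To make the engine model Ben describes concrete, here is a minimal Python sketch of the cache-aside pattern: data lives in deep storage, each compute engine caches what it reads, and only the first query pays the fetch cost. All class and key names are hypothetical; this is a conceptual illustration, not Firebolt's actual implementation.

```python
# Conceptual sketch of decoupled storage and compute with aggressive caching.

class ObjectStore:
    """Stand-in for deep storage such as S3."""
    def __init__(self, segments: dict[str, bytes]):
        self._segments = segments

    def get(self, key: str) -> bytes:
        print(f"fetching {key} from deep storage")
        return self._segments[key]

class Engine:
    """Stand-in for an independent compute cluster (an 'engine')."""
    def __init__(self, store: ObjectStore):
        self._store = store
        self._cache: dict[str, bytes] = {}

    def read_segment(self, key: str) -> bytes:
        # Serve from the local cache when possible; otherwise pull the
        # segment down once and keep it for subsequent queries.
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]

store = ObjectStore({"events/2022/04/part-0": b"..."})
engine_a = Engine(store)  # engines scale independently of the data
engine_b = Engine(store)  # and share the same deep storage

engine_a.read_segment("events/2022/04/part-0")  # cold: hits deep storage
engine_a.read_segment("events/2022/04/part-0")  # warm: served from cache
```

The point of the pattern is that compute can be sized for the slice of data actually being queried, while the full history stays cheap at rest in object storage.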
Druid does have some advantages, no doubt, especially as it pertains to streaming
data.
Being able to query your batch data and streaming data simultaneously is really useful for those use cases that require
that sub-minute ingestion latency. And the direct integration
with Kafka is a really nice feature.
That's interesting.
And there is like a family of technologies over there that you probably know: ClickHouse, Druid, Pinot, right?
Yep.
The three of them belong to the same category of solutions.
Yep.
They were built for similar use cases.
Would you say that Firebolt is part of this family of products?
Very intimately part of that.
Yeah.
Under the hood, Firebolt actually is using some ClickHouse code for the compute engine.
Oh, interesting.
We're forked from ClickHouse.
Now, we use a completely different storage handler.
So that's what allows us to separate storage and compute, because otherwise ClickHouse does require the storage to be local.
We also use a completely different query parser.
So our query optimizations are all built in-house.
And then there's some other tweaks and things like that, but the actual
kind of engine, the computing, the bits behind the scenes, that's all
based on ClickHouse code originally.
That's, that's a brilliant thing.
And why ClickHouse and not one of the other two? What was the reasoning?
A loaded question. Yeah, so obviously our goal, at least the goal that was told to me (I started after the company was founded, so I can't be sure of these things).
But from the stories I've been told, the goal has always been to make a true full-featured cloud data warehouse.
That means being able to handle all data warehousing use cases.
And ClickHouse is kind of the best position to do that.
Whether using Pinot or Druid, they don't have very good join support.
And I guess both, well, I'm not sure about Pinot.
Druid is Java-based.
I think it is also, and there's some overhead there.
So being a C++ native application kind of gave ClickHouse the edge and having
the flexibility to extend it and kind of build it into a full-featured cloud data
warehouse rather than kind of a specialty streaming solution.
And how does that position Firebolt in the market, when you have like a couple
of different companies out there that offer ClickHouse?
Yeah, seriously.
How is that?
So I don't think that we have a negative relationship.
I think those ClickHouse managed services serve people that know ClickHouse very well, but ClickHouse is not a simple product to get in and use. Firebolt is built for simplicity. We aren't just a wrapper for ClickHouse; we actually have fewer features than ClickHouse, because we want to make it user-friendly and stable and all of that. So we're not just a ClickHouse kind of fork.
We are our own thing, although we're using the ClickHouse engine, but our
SQL dialect is completely different.
If you try to use a ClickHouse function in Firebolt, you're not
going to have any luck doing that.
So ClickHouse, for example, has this concept of the table engines.
Yep.
Is this something that can be configured by the user of Firebolt, or is this set up by you, as part of how you optimize the engine to deliver the experience?
Yeah.
Uh, great question.
So yeah, there's no concept of those different table engines in Firebolt.
We do have a concept of a couple of different table types. So we have a fact
table and a dimension table. Behind the scenes, what that means is a fact table is sharded across
all of the nodes in the cluster, whereas a dimension table is replicated to all the nodes.
On top of that, we have a couple different indexes. So we have, it's called an aggregating
index, which is really just a materialized view that is always updated as you ingest more data.
You can set your aggregations and your dimensions in Firebolt, very similar to a Druid rollup.
And then there are the join indexes, which are in-memory joins to really optimize performance.
And our goal is to provide an out-of-the-box experience that everybody gets
good performance. But if you have specialty use cases, you know ahead of time exactly what
aggregations are going to be done and you're going to be running those, you know, potentially
hundreds of times per minute, you can optimize for those specific use cases.
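As a rough illustration of the fact/dimension distinction Ben describes, here is a toy Python sketch: rows of a fact table are sharded across nodes (each node holds a slice), while a dimension table is replicated in full to every node so joins against it stay local. Node and column names are hypothetical, and this is a simplification of how a real distributed engine would work.

```python
# Toy model: fact tables are sharded, dimension tables are replicated.

NUM_NODES = 3
nodes = [{"fact_rows": [], "dim_rows": []} for _ in range(NUM_NODES)]

def load_fact_row(row: dict) -> None:
    # Shard by hashing a key: each fact row lands on exactly one node.
    shard = hash(row["user_id"]) % NUM_NODES
    nodes[shard]["fact_rows"].append(row)

def load_dim_row(row: dict) -> None:
    # Replicate: every node gets a full copy of the dimension row.
    for node in nodes:
        node["dim_rows"].append(row)

for i in range(10):
    load_fact_row({"user_id": i, "event": "click"})
load_dim_row({"user_id": 1, "country": "US"})

print([len(n["fact_rows"]) for n in nodes])  # fact data is spread out
print([len(n["dim_rows"]) for n in nodes])   # dimension data is everywhere
```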
That's very interesting. It's very smart, the way that, let's say, these features are productized.
Yep.
How you create a product experience on top of things like a materialized view, or how a table is going to be distributed or not, and all that stuff. I find this very interesting exactly because that's exactly how a product works, right? You get what the customer needs, you map it to the technology, and you just abstract it away.
Yep.
Nobody needs to know behind the scenes what is happening there.
So that's, that's great.
And okay.
So having worked with ClickHouse, what's your favorite part of it?
Favorite part of ClickHouse?
Yeah.
Oh, okay.
So I am a big fan of the aggregating indexes, because I come from an old-school world of
databasing where you created summary tables. I've used SQL Server Analysis Services
to build data cubes and summarize data
and being able to get that same effect
of pre-computing all of your aggregations,
but not having to wait for a nightly refresh
and being able to build those on the fly,
I think is really cool.
And then the automatic query rewriting.
So as your users are writing queries, or your BI tools are writing queries,
it's going to automatically use the aggregating indexes that are available.
And you can have multiple aggregating indexes,
and the query planner will automatically choose the best one for the query.
So, you know, that, again, going back to my history in Druid,
Druid had a concept of roll-ups. So as you ingest data, it'll aggregate it to a certain granularity.
The aggregating indexes allow you to do that same thing, except you can aggregate to multiple
different granularities. You can aggregate on a field that isn't time-based like you need in Druid.
So it's a lot more flexible, but at the same time it remains user-friendly,
as opposed to rolling your own materialized view in another system.
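Here is a toy Python model of the aggregating-index idea as described above: running aggregates are maintained incrementally at ingest time (no nightly refresh), and a query whose grouping matches the index is served from the precomputed totals instead of a raw scan. This is purely illustrative, with hypothetical column names; Firebolt's real mechanism is certainly more sophisticated.

```python
# Toy aggregating index: maintained on ingest, used via query rewriting.

from collections import defaultdict

raw_rows: list[dict] = []
# The "index": (country, device) -> running revenue sum.
agg_index: dict[tuple, float] = defaultdict(float)

def ingest(row: dict) -> None:
    raw_rows.append(row)
    # The index is updated incrementally as data arrives.
    agg_index[(row["country"], row["device"])] += row["revenue"]

def query_revenue_by(dims: tuple) -> dict:
    # "Query rewriting": if the requested grouping matches the index,
    # answer from the precomputed aggregates, not the raw table.
    if dims == ("country", "device"):
        return dict(agg_index)
    # Fallback: full scan of raw rows (the slow path).
    out: dict[tuple, float] = defaultdict(float)
    for r in raw_rows:
        out[tuple(r[d] for d in dims)] += r["revenue"]
    return dict(out)

ingest({"country": "US", "device": "mobile", "revenue": 3.5})
ingest({"country": "US", "device": "mobile", "revenue": 1.5})
print(query_revenue_by(("country", "device")))  # {('US', 'mobile'): 5.0}
```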
Yeah.
And what's your favorite Firebolt addition to ClickHouse?
The query planner.
Firebolt has its own query planner to optimize queries.
ClickHouse has no real query planner; it does exactly what you put in.
So when you actually release your product to the world and people write queries, some of them are not optimized. Sometimes they're doing massive joins and there's no pushdown or anything like that. So having the query planner automatically do those optimizations, use
materialized views, use the join indexes, all of that, that's a huge
benefit over using just raw ClickHouse.
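To show the kind of rewrite a planner performs, here is a small Python sketch of the pushdown Ben mentions: both functions return the same result, but the second filters before joining and so touches far fewer rows. In Ben's description, a naive engine executes the query as written, while a planner rewrites the first form into the second automatically. The data and column names are hypothetical.

```python
# Predicate pushdown: filter before the join instead of after it.

def join_then_filter(facts, dims, country):
    joined = [
        {**f, **d}
        for f in facts
        for d in dims
        if f["user_id"] == d["user_id"]
    ]
    return [row for row in joined if row["country"] == country]

def filter_then_join(facts, dims, country):
    # Push the predicate down to the dimension table first, so the
    # join only ever sees rows that can possibly match.
    small = [d for d in dims if d["country"] == country]
    return [
        {**f, **d}
        for f in facts
        for d in small
        if f["user_id"] == d["user_id"]
    ]

facts = [{"user_id": 1, "clicks": 3}, {"user_id": 2, "clicks": 5}]
dims = [{"user_id": 1, "country": "US"}, {"user_id": 2, "country": "DE"}]
assert join_then_filter(facts, dims, "US") == filter_then_join(facts, dims, "US")
```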
How was the experience of like building this optimizer?
I mean, the reason I'm asking is because I know that it's one of the
toughest problems in database systems, right?
Yep.
One of the hardest, and probably something that can never be completely solved, right? It's a much-discussed topic in computer science, isn't it?
So how was the experience of building an optimizer?
Well, you might have to talk to people that are slightly smarter than me,
cause I didn't build the optimizer.
But I think that it's an ongoing process.
We're always encountering new problems and finding new ways to optimize code.
You know, frequently we get queries generated by Tableau or Looker, and we have to kind of understand what a query is trying to do and then see if there's a way to do it better.
And our solution architecture team,
one of their core responsibilities
is to take SQL code that customers are generating
and find ways to optimize it.
And then we provide that information
back to our product and engineering teams
so that they can build those optimizations
back into the product and ultimately kind of make it more user-friendly.
Yeah, it makes sense. And that's a very interesting problem, this feedback loop between how
the customer experience drives something so deeply technical as an optimizer at the end.
I think that's one of the most interesting things that engineering and product teams
get to experience working on a product like Firebolt, which I find very fascinating.
So that's super interesting.
So I'm interested to know, where does Firebolt fit into architecture?
So you mentioned that you want it to be
sort of a fully featured cloud data warehouse, right?
Or that's what it is.
So, you know, which actually sounds different
than maybe some of the language that we hear
from Snowflake a la like a data platform, right?
That sort of includes a cloud data warehouse,
but also has this constellation of other tooling around it.
So when companies implement Firebolt, what I'd love to know is sort of what are the types of companies that are adopting it? And then how do they fit it into their architecture? Is it a
replacement, you know, for sort of a Snowflake or a Redshift or whatever? So yeah, just tell us how
companies are fitting it into their data stack. Yeah. So I guess, as I mentioned, like we want to be a full featured cloud data warehouse.
I'll be the first to admit we're not even there yet. We're a data warehouse that serves some very specific
use cases. Frequently, you know, we have customers that are coming from Redshift and
Snowflake and different data warehouses, and they continue using those products in addition to Firebolt.
Okay.
Firebolt lacks a lot of the kind of ancillary functionality, a lot of the large-scale data processing capabilities of something like Snowflake.
Whereas, you know, we're built for a write once, read a whole bunch of times architecture.
Got it.
Right now, we don't have row-level updates and deletes.
So if you need to make an update to a record, you need to drop a partition of data, which
isn't that unusual in the traditional OLAP world.
But people have gotten so used to Snowflake allowing things like that, that for some use
cases, it's just required.
But for those other use cases where they are doing the analytics, where it's immutable
data or not frequently changing data, they can kind of peel off use cases and use Firebolt
with those.
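Here is a brief Python sketch of the partition-drop pattern Ben describes: with write-once, immutable storage, a correction to one record means rebuilding the partition that holds it and swapping it in, rather than issuing a row-level UPDATE. Table, partition, and column names are hypothetical; this just illustrates the pattern.

```python
# Updating immutable data by replacing a whole partition.

# Partitioned table: partition key (a date string) -> list of rows.
events = {
    "2022-04-01": [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 7.0}],
    "2022-04-02": [{"id": 3, "amount": 4.0}],
}

def update_record(partition_key: str, record_id: int, new_amount: float) -> None:
    # No row-level UPDATE: re-derive the whole partition with the
    # correction applied, then replace the old partition in one step.
    rebuilt = [
        {**row, "amount": new_amount} if row["id"] == record_id else row
        for row in events[partition_key]
    ]
    events[partition_key] = rebuilt  # drop old partition, load the new one

update_record("2022-04-01", 2, 9.0)
print(events["2022-04-01"])  # record 2 now shows amount 9.0
```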
We built kind of our business model to make that very easy.
It's all pay-as-you-go, consumption-based.
You don't have to sign a contract or anything.
So as Firebolt grows and encompasses more and more features, then you can grow the use cases and move more and more off.
So we want to be very cost-effective for the use cases that we're really good at and then grow into the rest.
Sure. Makes total sense. And do you have sort of a particular type of company
or even industry that tends to adopt Firebolt
because of the use cases?
So we oftentimes see more cloud-native organizations,
you know, smaller companies
that are comfortable with a SaaS data warehouse,
comfortable with the data
leaving their walled garden, their VPC. And we also see companies that usually have large data sets.
So ad tech data, gaming data, clickstream data, marketing data, all of these sorts of things that
have huge volumes of immutable data are really, you know, a natural fit for Firebolt.
Yeah, super interesting. Okay, personal question. You've worked in and seen sort of firsthand
a lot of data technologies, and you're still working in data. What do you love most about it?
You know, just from a personal level. Or, I mean, do you? Maybe, you know, some of us get
really deep into a career and it's like, there's no going back.
I think data is a skill set that always provides value. I mean, knowing different programming languages and being able
to work with data is immensely valuable, but data itself is something that is always going to
be growing. It's always going to be around. So I think that there's unlimited opportunities
for working in data. Yeah, for sure. Okay. And on the flip side, what do you like least about working with data
or working in the data space? I think the thing I like least is
anybody that kind of positions themselves as the answer to every question. If you are a system that is really good at doing massive data processing
tasks like Spark, chances are very good that you're not going to be great at doing very fast
key value lookups, for instance. So there's oftentimes a use case that is a good fit for
a tool and use cases that are not a good fit for the tool.
And understanding where those good fits are is very important.
But having one product that says it serves every use case, I think, is just unreasonable.
And, you know, I don't want to.
I was literally going to say marketing is the worst part about working in data.
And I'm a marketer working in data, but I couldn't agree with you more.
I'm sure Firebolt marketing is probably going to be listening to this podcast.
So I wanted to dance around it a little bit.
Yeah.
Other than Firebolt's amazing marketing team, that I could not love more.
Yeah, no, but I mean, I actually really appreciate the sort of transparency or honesty around saying, this is what we want to be and this is what we do excellently now. I think that's really helpful, and I appreciate that, and I think our listeners appreciate that too, where you kind of know what you're getting. Because you're totally right, it's the disenchantment of: you look at the site, you look at the product page, you're like, this is awesome. You know, the docs are a little sparse, but let me try this. And then you're like, oh, right, I know why the docs are sparse.
Yeah. To be honest, I mean,
the industry is at a stage right now where it's quite early and there's a lot
of innovation happening. So things change from day to day.
Sure.
So it's not the vendors' fault that this is how it is. It's just that the market is still trying to figure them out.
Yep.
I guess another thing that always kind of
rubs me the wrong way is people that
make statements based on like outdated
information, you know, saying
that whatever technology
it is, is
what it was five years ago.
I still hear people saying that
they don't want to use Apache Druid because
they can't write SQL.
I'm like, you've been able to use SQL with Apache Druid for, you know, almost five years, if not more.
So I guess people should always be kind of reevaluating their preconceived notion about any technology or any company as time goes on.
Yeah, for sure.
I think we talked about that with a term like, you know, CDC, change data capture, right?
And it's sort of, you know, there are companies doing really interesting things with it, right?
But it's not new, right? It's really old technology, even though there are some new companies that there's some excitement around.
But it's not like, you know, it's...
I don't want to do CDC because then I have to put triggers in my database and it's going to add additional overhead.
I've been doing this for far too long, but yeah.
Yeah.
Yeah.
No, that's great.
That's great.
Well, this has been so wonderful to have you on the show.
Learned a ton about Firebolt.
Kostas, any last questions before we sign off?
I mean, it has been great.
Like, it's great that we learned more about the core technology.
And before we end the show, tell us something exciting that
is coming in the near future.
Oh, great question.
I think streaming data coming to Firebolt is going to be huge.
We are working on building in mutations so you can do those
row level updates and deletes.
I think that's going to open up a lot of new use cases
for Firebolt customers.
Our ecosystem team is booming.
We're always adding new partners
and new integrations into the system.
So anytime we can get another partner
and learn more about their product
and cross-sell and all of that,
that always gets me very excited as well.
And you also, I guess, have a great marketing team.
Yeah.
You never know what's coming out of our marketing department.
So that's always exciting.
You got to watch the marketers, especially when it comes to data.
All right, Ben, thank you so much for taking some time with us on the show.
Yeah, thank you for having me.
Here's one of my takeaways. You know, I'm trying to, we could probably count,
we could definitely count on one hand the number of times that Hortonworks has come up on the show.
You know, I mean, even the name Hortonworks sounds a little bit, you know, enterprise-y. I mean,
I guess it is enterprise-y actually. But it was just interesting to hear about that. And my guess is that Hortonworks probably has played a bigger role in the data world than I think a lot of the
content on our show necessarily gives it credit for. That's my takeaway. Also, the Hortonworks
guy was actually from the East Coast side of Atlanta, which is interesting as well. So yeah,
I don't know. It's just interesting to hear him talk a little bit about Hortonworks and kind of
the work that he did there. And then of course, like the Druid stuff is interesting.
But what was your takeaway?
I think one of the most interesting parts of the conversation was about ClickHouse, and how open source can actually fuel, let's say, the innovation in this space. Considering that a database is just too risky a thing to get to market, open source lets us get to a point where we can start building companies and products and iterate fast without taking on all of that risk.
Something that happened a lot with SaaS in the past decade, for example.
But if you want to replicate this in, let's say, the data-related infrastructure,
we need something similar, right?
And it seems that open source, and it's not just ClickHouse,
but I think that in this case, it's a very good example
of how they took the core part of this,
they cloned the project, built their own query optimizer
on top of it, changed the query parser.
I mean, they've done a ton of work, but this work
was done on a very solid core that would help them accelerate the
whole process of taking this product to the market, right?
And this is needed.
So I'm very excited about that.
And I'm waiting to see what other products do something
similar out there.
There are examples like this in the database space.
We have Vitess, for example, and PlanetScale,
which is based on that.
Anyway, that was probably one of the most interesting parts
of the conversation, what really made me excited.
All right. Great conversation.
Several more shows for you from Data Council,
so subscribe if you haven't, and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.