The Data Stack Show - 71: ETL at the Edges with Jimmy Chan of Dropbase
Episode Date: January 19, 2022

Highlights from this week's conversation include:
- Jimmy's career background (3:01)
- How to use data cubes (5:52)
- What Dropbase is and who it is built for (11:01)
- Getting sales and marketing data in usable formats (16:46)
- Ensuring data remains flexible and transferable (28:36)
- Defining what "offline data" is and how to use it (34:09)
- How Dropbase can work with the rest of the data stack (43:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. If you're watching this on video, you can see that
it's evening on the East Coast and midday on the West Coast, which is what we get in the winter.
We are going to talk with Jimmy from Dropbase. He's the CEO and co-founder and really interesting
company. I'm just going to give you a little preview here.
You can load data into DropBase and it's database included.
So first company like this we've talked to,
which is fascinating.
I have a lot of questions
as sort of the user of this type of product
because I think it would have helped me a lot in the past.
I think my main question is going to be
about the architectural decision, right?
I mean, we talked so much about cloud data
warehouse, data lake, et cetera, and how that's the modern architecture. And they chose to do what
Jimmy calls batteries included, which is database included. So I just want to know where that came
from. And I think his past working with data will inform us on that. But Kostas, tell us what you're thinking about.
Yeah, I want to learn who is the user. We live in a time where it's all about the data engineer,
data engineering teams, infrastructure for data. And we just assume that everywhere there is a
data engineer ready to do, I don't know, like anything you want with your data.
Obviously, it's not true.
So, yeah, I mean, it's obviously like from a business perspective, you think about like
a very underserved segment of the market right now.
So I really want to see like who the users are and how do they feel about this and how
they use it.
And the other question has to do a little bit with the data sources,
because I think that we are going to hear some more, let's say,
unique data sources that they are encountering,
like maybe things like FTP or email and stuff that usually a data engineer
does not consider as a data source, right?
So I think it's going to be very interesting.
I am amazed that you did not say you wanted to ask him
what database they're running under the hood.
I mean, isn't that like...
Well, you started and you were saying
that you want to talk about the architecture.
So I'll leave that to you.
Okay, all right.
Well, I know the questions are going to come up.
We'll see who gets to that question first. All right. Let's jump in and talk with Jimmy.
Let's do it.
Jimmy, welcome to the Data Stack Show. We're extremely excited to chat with you.
Thanks for having me.
Well, give us a little bit about your background and what led you to DropBase.
Sounds good. Yeah. So I have been directly in data for the last five years through my startup.
And I've been more indirectly with data for almost 10 years now.
I was an early adopter of Tableau, the BI tool.
And that was one of my first jobs.
And this was at the time when the company was still using data cubes.
So it's been quite a while.
Data management was a little bit harder than it is today.
So that was sort of like where it was coming from. And since then, I've always been excited about
just the data space and how useful it is to have data that's accessible,
but also extract insights from it that you can use to make business decisions.
Sure. Okay. So I'd love to talk a little bit. We have so many things to talk about
as it relates to DropBase.
But really quickly, so being an early adopter of Tableau.
So when I hear Tableau, and I think a lot of our audience may feel this way, my first question is: was it fast back then?
Because today, if you have a big Tableau implementation, it's kind of slow, a little bit
cumbersome.
I mean, super powerful.
But what was it like back then?
I mean, because that was kind of a pretty cutting edge BI solution that allowed you to do analytics that were even harder before.
I think it's been just as fast as before, but also just as slow at the same time.
And now I'll explain what I mean by that.
So the BI tools have always been just as fast as your database can produce and transform data for you. And back then, because the company was using data cubes,
if I needed to perform a different analysis, it was much, much faster to actually have the
cube generated by a data engineer first. So you'd have to submit a ticket, get the cube set up, and then you'd connect your Tableau
to it. And then you'd slice that. Tableau was a lot less featureful back then. It did the basics
pretty well. It was still very impressive at the time because it was very visual. You could drag
and drop things. And it was just really, really cool. And I think on that basis, they got a lot of customers to sign up.
So it was faster and slower.
It was slower because of data infrastructure that we had back then, but faster because
it was less featureful than it is today.
And then, well, today people just have way more data.
And Tableau itself, it's just a lot more complicated product.
It's super powerful, but it's very complicated.
It does a lot of things, right?
And so it slows it down a little bit too.
One thing before we continue, because I don't think that we have explained before what a data cube is.
So Jimmy, would you like to give like...
Yeah, sure.
Yeah, absolutely.
Yeah.
So I should have mentioned that before.
So the simplest way to think about data cubes is as an array, sort of like an N-dimensional array, where you just have that array pre-computed for the purpose of being able to pull from it faster downstream, right? The difference with how we do it today
is that if you use say a data warehouse or something like ClickHouse, right? They're column-based analytical
databases, and they're massively parallel. You can compute columns really quickly, right? But before
it wasn't like that. We didn't have Snowflake either. I think maybe they were around back then,
but it was nowhere what it is today, right? And so people were still in these old systems,
and you just have to pre-compute a lot of these cubes, which were arrays, and you kind of have to settle on them, right?
So like you have like product and then price and then time, and then that's your cube.
That's a three-dimensional cube with those three dimensions, right?
But if you needed something else, it's like, well, you need to just build a new cube and
then store that cube and then you can pull data from it.
So that was just how it worked before.
Not too different from the concept today of kind of like transforming your data
and then pre-computing some tables,
but your tables can still have more flexibility
than the pre-computed cubes that you used to have before.
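The cube idea Jimmy describes can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular product's implementation; the dimensions (product, region, month) and numbers are hypothetical:

```python
# Minimal sketch of a pre-computed "data cube": a measure aggregated
# over fixed dimensions, so downstream reads are cheap lookups.
from collections import defaultdict

raw_rows = [
    ("widget", "us", "2021-01", 100),
    ("widget", "eu", "2021-01", 80),
    ("gadget", "us", "2021-02", 50),
    ("widget", "us", "2021-01", 25),
]

# Pre-compute the cube once (the expensive step a data engineer
# would have scheduled ahead of time).
cube = defaultdict(int)
for product, region, month, amount in raw_rows:
    cube[(product, region, month)] += amount

# Reads against the cube are then fast lookups...
print(cube[("widget", "us", "2021-01")])  # 125

# ...but a question outside these three dimensions (say, by sales rep)
# means building and storing a whole new cube from the raw data.
```

That last comment is the rigidity being discussed: the dimensions are baked in at pre-computation time, unlike a warehouse table you can re-aggregate on demand.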
Okay. And how was it mainly implemented?
Like how you would implement like a data cube?
Was it something like a materialized view
on a Postgres database, or something else?
Yeah, no, they were using these,
I think it was just like Oracle databases.
I don't really recall the exact technology they used,
but it was like,
it's not the kind of tool that you would pick
as your first choice today
if you were to set a data infrastructure.
Let's just put it that way.
And then within those database tools,
they have the concept of cubes sort of embedded in there.
So you could like write some queries, build a cube.
You would schedule the computation of the cubes beforehand
because you had to process on the actual database,
which were also slower back then.
Okay, that's super interesting.
And what happened to data cubes today?
Are they still a thing?
Yeah, I don't think they're as popular.
People don't talk about them too much.
People don't, like companies are not signing up for a data warehouse to be like,
yeah, I'm so excited about building cubes today.
Today it's more about flexibility, right?
It's about being able to quickly get the data you need
to do the analysis that you want.
So people sign up for a highly scalable data warehouse
where they can just store all the data.
And then, you know, they can transform data as they need them.
And so you create maybe like tables
that you can then use to perform other transformations,
but you don't have to run these cubes
that are so rigid in structure
that if you needed to do something else,
you have to recompute everything again.
One quick question there.
And it's funny, I find myself looking back on technology
that in the world of tech in general,
like actually isn't that old,
but relative to the tools we have today,
just seems so antiquated, but
I'm just interested to know, did it feel rigid back then? Or was it just kind of, this is
convenient. Like you can build a cube and that makes Tableau more efficient. Like what did it
actually feel like using data cubes back then? Yeah, I think at the time, it still felt slow, though. Like I think as humans, we know when something feels slow. It just feels slow.
No, because, like, anytime you wanted a slice of data, you had to wait for it.
It wasn't like, boom, you get a table and that's it.
So it felt slow, but at the same time it felt like, well, I mean, how else are you going
to do this?
Right.
Yeah.
So that's generally how I remember it feeling back then.
Yeah.
I just keep thinking of, like, what are the differences
and what's common between, like, the concept of a data cube
and what we try to do today using something like dbt, right?
Because at the end, like dbt defines a table at the end.
Like that's what we get at the end.
That could be like a data cube, right?
It is like a multidimensional data set that we are going to use to do whatever we want to do there.
Do you see like anything in common there?
There are parallels.
I'd say there are parallels.
Just the concept of pre-computing
data for the purpose of accessing it faster, it's just a very common thing to do, right? It's like when you think about computer systems too, right? You have storage and memory. You always want to have something quickly available for use. And so the principles of it are the same. But in practice, they're a little bit different
in the sense that even when you use dbt
to pre-compute your data,
like they still remain in database tables
on that same warehouse, right?
And so, whereas before it's like,
you could have like a 10 dimensional cube
and you had to pre-define it so explicitly
and you have to run it on your entire data
to be able to use them.
So that was really painful. Okay. Okay. I think we had enough with the data cubes, Eric. So you
were asking something, Eric, and I stopped you to ask about the data cubes.
No, it's great. I think one thing we've learned in the show is that we can learn from the past,
which is great. And that we hadn't talked about data cubes yet. So I'm so glad you brought it up. Enough of the past. Tell us about DropBase
and what the product does and who it's built for. Sure. Let's jump from the past to the future.
So DropBase is just an end-to-end platform to automatically import, clean, and centralize your data in a database.
And this is database included.
So we're basically a batteries-included solution. We do this in a very simple and intuitive way, very quickly, so that even if you don't have a full-fledged
data engineering or technical team, that you can still access some of these tools that are
beneficial for companies such as databases and quick data pipelines. And so that's what we do
today with DropBase. Very cool. And I'm so interested to know, so lots of ink spilled
about modern data warehouse, data lakes, that infrastructure sort of standing on its own.
What were the variables or observations or needs that you saw to build a product that's database included?
I'd just love to know the thought process behind them.
Sure. I think you can always answer that question if we go back to the kinds of problems that we solve, the kinds of problems that we observe from our user base.
And so at a high level, there are three problems.
There's the how do we democratize data and how do we automate data operations?
And then how do we help them set up infrastructure?
And along democratizing data, it's people are outgrowing
their spreadsheets, right? People start out their analysis with Google Sheets, Excel, maybe they
export some CSVs from some incompatible system. And then as their needs grow, they sort of,
you know, they can't use these tools anymore. And so that's kind of like a basis from where we're starting from, right?
Users who are in that world, right?
And then when these people deal with data imports, right?
People are sending them data through emails or through batch exports.
Maybe they're connecting like an SFTP server.
They end up doing a lot of repetitive data cleaning work.
So that is downloading an
attachment from an email and then uploading to some new system, cleaning it, and then maybe
moving it to a database eventually. And what happens is when an analyst, for example,
is building some data cleaning steps or is building a data pipeline on their spreadsheet,
it doesn't really carry over to scalable data pipelines that you can then use over and over again. And because of that, so a lot of
non-technical teams end up being paralyzed, right? Like they just can't really do things without
needing an engineer or maybe somebody else to help them set up a database. And even if they've
set up a database, well, how do they move this data to it easily
without writing their own scripts, right?
And with all the data cleaning steps that they've added.
And so if you look at these core problems,
people are outgrowing their spreadsheets,
people having to do repetitive data cleaning work
every time they deal with a spreadsheet.
And then the fact that they can't themselves
spin up data pipelines or data infrastructure, those are sort of the core problems we see. And we say, well, if we were to
give them a solution like this, it's going to have to be batteries included. It's going to have to be
something where they can create an account, create a workspace, upload a CSV or an Excel file,
and immediately have it in a database that they can then connect to a BI tool, for example,
or any other tool that connects to a database, right? And so that's sort of how we look at,
you know, the evolution of like observing the problems and then saying, okay, we must give
them these tools. So that's a really core part. The other part is that I think this group of
users are kind of like the forgotten users, because a lot of tooling and products today focus on
people who we assume already have a database. And I think with big milestone events, like Snowflake going public, it's going to drive this sort of move where we kind of
like leapfrog from like spreadsheets almost straight up to data warehouses, similar to what
we see with like
mobile phones, right? It's like the technology just makes sense. Why are we still going in such
small steps? We can just sort of leapfrog it. And so we see a big portion of the market with these
larger and larger spreadsheets and CSV exports who are just going to need a database and we want to
be there for them. Yeah. One question, and I know that
Costas, I can see in your mind, Costas, you have technical questions about the database,
but when you talk about outgrowing spreadsheets, I think about two vectors there, and maybe there
are more, but I'm just interested in this. So, and these are the two vectors that I've,
I've experienced in my past going through the exact sort of life cycle of spreadsheets that
you're talking about. One is complexity, right? So I'm exporting marketing data, sales data,
transactional data, et cetera. And I'm getting really good at VLOOKUPs. And it's like, okay, this is unwieldy, right?
I mean, kind of the way it played out in my past is like, okay, well, Monday morning,
the first four hours are running all the VLOOKUPs and everything to get the numbers from last
week or whatever.
The other is size.
And I know that these are related, right?
But you have to have a pretty powerful machine to like run hundreds of thousands of rows
in Excel.
Google Sheets is getting better, but these things choke when you get to a certain amount
of data.
And actually now with all the data that companies are collecting, like it's not that much data.
So how do you see those two vectors interacting?
Are there more that sort of force people past the point of like, this isn't working anymore?
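The "complexity" vector here, chains of VLOOKUPs, maps directly onto a database left join, a parallel drawn in the answer that follows. A minimal plain-Python sketch, with hypothetical tables:

```python
# A spreadsheet VLOOKUP enriches each row from a reference table;
# in a database, that's a left join. Data here is made up.
reference = {"Apple": "tech", "Citi": "finance"}  # lookup / reference table

transactions = [("Apple", 300), ("Citi", 120), ("Acme", 45)]

# Rough SQL equivalent:
#   SELECT t.company, t.amount, r.sector
#   FROM transactions t LEFT JOIN reference r ON t.company = r.company;
enriched = [
    (company, amount, reference.get(company))  # None where unmatched, like #N/A
    for company, amount in transactions
]
print(enriched)
# [('Apple', 300, 'tech'), ('Citi', 120, 'finance'), ('Acme', 45, None)]
```

The unmatched row coming back as `None` mirrors the `#N/A` a VLOOKUP would produce, which is part of why the two feel so analogous in practice.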
Yeah, absolutely. So I think those two vectors are quite accurate ones. There might be a couple
more. So when you think about the VLOOKUP and the complexity of it, and also just the requirement
to have a big machine or a lot of
memory in your local computer to run this. And then when you do a parallel to that with how you
do it in the database, it's just like a left join, right? And today, if you had like two tables in
Snowflake, where you did a left join, you could do it on a million rows in seconds, right? And so
then the gap could be closed through user experience, right? So if we could just build a function or a UI component that could help a user who is familiar with the concept of VLOOKUP perform a left join in a database, then maybe we could bridge the gap,
right? So those two vectors are really important. I'd say the other one is a bit more about
scalability and repeatability,
because a lot of the times you end up just doing this over and over again, right? With the VLOOKUP,
let's say that you have either the reference table is updated or sort of the core table is updated.
In either case, every time you get an update, especially to the reference table, the one that
tells you, okay, Apple is in tech and then Citi is in finance. Well, you have to rerun the whole thing again. And that tends to be quite manual. You'd open your spreadsheets, you do the VLOOKUP, and then it's there, right? With databases and with tools like DropBase or other tools, you can just automate that process and make it more scalable and repeatable.
Yeah, I think the other major thing
when you think about spreadsheets is human error, right? I mean, that's data's messy in general.
And when you think about combining all these different spreadsheets and then trying to use
VLOOKUPs and macros, if you're getting really fancy, to try to normalize all this stuff, it's like, I mean, someone's going to fat-finger part of the equation, part of the formula, at some point. And that's obviously very painful, especially when it takes like 10 minutes for everything to run.
Yeah, but Eric and Jimmy, like, I can't stop thinking, while you are talking about spreadsheets, like, can you
think what is going to happen to our civilization
if suddenly like tomorrow Excel disappears?
Oh, no, it would be disastrous.
So, you know, so spreadsheets still hold a very important part
of the economy together in some way, if you think about it that way.
There are things that spreadsheets are just very good at what they do.
And if it wasn't because they don't
perform at scale, you know, like people would just still use spreadsheets. They're very powerful.
But yeah, the whole world would fall apart. Literally, if spreadsheets stopped working
today. You've heard about the horror stories of big financial models just built on Excel and maintained on Excel.
Well, I think, you know, one thing, Kostas, to that point,
I remember a couple of years ago,
we were trying to solve this, you know, sort of in marketing,
you want to tag all your links, you know?
And so I was doing this huge project for this massive company
and they needed all these permissions and everything.
And so we'd like built a Google Sheet to do all this.
We had custom scripts running in the Google Sheet.
We were hashing values with MD5 and a custom Google script to make the string shorter and
all that sort of stuff.
And I remember showing it to my friend who is a software engineer.
He's like, this is software.
He's like, this is so brittle.
That's true, Eric.
But on the other hand, like, if you think about it,
it's amazing how approachable software development
has made, right?
Like, you have all these people out there
who are actually, like, not developers,
but they can still, like, develop automations
for their needs, right? Which is,
it's amazing. I don't know how much of it is just like, that it's there out there like forever. And
you know, like there's a lot of training and all that stuff, but as an interface with a machine
and like as a way to program the machine, I think it's probably like the most successful interface so far.
Now, does it scale? That's a different conversation, right?
And that particular project did not scale. It got really slow, really quickly.
I mean, yeah, think about it. Let's say you had a bunch of different VLOOKUPs or other
sort of transformations in a collection of spreadsheets
that collectively are like gigabytes of data. And then now you're told to scale that thing.
So the first thing you do is probably you'll contact one of your fellow engineers, maybe you
contact them and you say, hey, can you scale this up? And then they'll look at you and be like,
I have no idea what you did here. Ideally, it's just like something where the person who is building that spreadsheet somehow can record those steps, and then we can take those steps and deploy them at scale, and then maybe it could work. But today, they're not built that way. They're not built to just transfer easily to code.
Yeah.
So, Jimmy, can you give us like a small tour on like how is the experience with DropBase like for a person who gets a file through an email and they want to use the product?
Yeah, absolutely.
So we have this new feature coming that's called DropMail.
And it's probably going to be the simplest way to get data from an email straight to
your database, right?
So all you do is you open up your email, right?
You type in a special email address, a special DropMail address, and then you add a CSV
attachment.
And, you know, that's it.
And you just send it.
And on the other side, using the DropBase dashboard,
you can set up a sort of a pipeline
where you grab that data, you apply some cleaning steps.
We can even automatically map it
so that it fits the database schema.
And then we just automatically run it from there.
So the experience, it's quite magical, right?
You just send an email and then if auto run is enabled, it's straight in your database.
So if you have all your downstream tools connected to that, you can imagine that you
do automate the whole process of even downloading that file and then doing what you need to
do with it.
And then somehow writing a script to inject that data into your database, right?
And the use cases where this becomes helpful are when we have users that are, let's say, in the e-commerce space.
Right. And they get shipment updates from their manufacturers sometimes every day.
Right. And, you know, guess what? Those shipment updates are going to come as Excel attachments to emails.
And so now what you can do is you can set up a rule on your email and say, every time it comes from my manufacturer A, I want you to take this data to this pipeline A, which goes to table A in your database straight in.
Now, of course, your data has to be formatted in a particular way, so like a proper CSV file.
And you can pre-build some cleaning steps to it.
And then that's it.
It just goes straight in.
Sometimes data doesn't come from an email.
Sometimes you just export it from a system.
Typically, these systems tend to be more incompatible.
Like there's no API to connect to them.
Or sometimes it's just privacy security issues.
They have to take snapshots of it as CSVs.
And so the same thing is like, you want to upload that data, you want to clean it up, and you want it in your database.
Okay, this is very interesting. And outside of the email, which I guess is like a very common, let's say, channel where data is coming in, what other channel have you seen out there, or communication method or whatever, that we wouldn't expect, let's say, to see as a method of exchanging data?
Well, I mean, I think there's this thing called EDI that big companies still use.
I'm not sure if you're familiar with the concept.
I'm kind of new to it myself.
It's some sort of electronic data interchange protocol that was used before APIs were like a thing, and it was used by big companies to exchange data with each other in a way that was, you know, more standard, more compatible. So those channels, they actually are pretty big today, like surprisingly big, because the big companies operate those. But we don't hear too much about them. I only heard about this because we had a user reach out and they're like,
yeah, like, we're in the insurance industry.
And then EDI is a big thing in there apparently. Right.
And so they have their own set of protocols and ways to make it compatible.
They're very different from APIs, but the underlying concept is the same.
It's like, okay,
how do I connect to a data source that's come from an EDI product so I can pull data in?
So that would be an unusual channel where data comes in.
And then you have your usual suspects, SFTPs, cloud storages.
You have your API, like pulling data directly from Shopify or QuickBooks or something like that.
And then the offline sources, CSVs, Excel files, emails.
And then, you know, you can build a whole universe of sources like that.
Yeah, that's pretty cool.
And what kind of file types or serialization types do you support?
I mean, you mentioned CSV.
So I guess that's like a very common one.
Is there something else that you see like being used out there outside of CSVs and Excel files?
CSVs predominantly, they're a pretty standard way to exchange data.
Excel files as well.
And then Excel derivatives, you have like Excel workbooks.
And then there's other ones that we don't do today, but that we could do.
It could be JSON exports, XML sometimes.
And then some of the open documents formats.
But there's still people using them,
but they're not as common as CSVs and Excel files.
I'd say for offline, like for flat files,
CSVs and Excels would be probably 80-90% of the offline data.
Yeah.
You mentioned two very interesting terms that usually conflict with
each other in reality. One is automation. So you said, for example, you can forward the email and
like if you have automation on like everything, you know, like magically you will see the data
on your database. And the other is data quality. And I'm wondering, because especially like when
you are dealing with CSV,
which, for example, you don't have that much information about data types, right?
Actually, you don't have information about data types.
Everything is a string.
So on the other hand, of course, you have a database,
which is a strongly typed system.
So how does this work?
What's the magic there?
What do you do there?
Sure.
Yeah.
There's a few things we do to ensure that we can still ingest the data.
And you're right.
Given that a database is strongly typed, we must ensure that the data that comes in fits
in that schema, but without having to explain to the user all these things about types,
right?
And so what we do is we do a first pass in automatic inference of data types.
So we try to cast things.
If we see strings that look like they could be dates,
we attempt to cast them as dates, right?
And so we will help users doing some of this stuff, right?
And then with integers and floating points or decimals, you know, that's
a little bit easier. And then, so that's one level of assistance that we provide our users.
The second level of assistance is an explicit transformation at the moment of ingestion,
right? So they can say, look, I want this to be a date. And then we can, you can click and add a
step that says, okay, turn this into a date type. And then we will sort of force it as a date type.
And then we'll attempt to load that to the database.
So if it's a new table that you're creating from your CSV file, then that first set of types establishes the table in your database.
And then the next time you're trying to append more data to the same table in the database
through a new CSV file, we just automatically do all the mapping for you.
And then you just click load to database and that's done.
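The first-pass inference Jimmy describes, trying successively stricter casts and falling back to text, might look roughly like this. This is a sketch, not Dropbase's actual implementation:

```python
# First-pass type inference on CSV strings: try int, then float,
# then ISO date, otherwise leave the value as text.
from datetime import date

def infer(value: str):
    for cast in (int, float, date.fromisoformat):
        try:
            return cast(value)
        except ValueError:
            continue  # this cast didn't fit; try the next one
    return value  # nothing fit: keep it as a plain string

print(infer("42"), infer("35.54"), infer("2022-01-19"), infer("hello"))
# 42 35.54 2022-01-19 hello
```

Note the ordering matters: `int` must be tried before `float` or every whole number would come back as a float, losing the narrower type.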
Now, there are cases where the data, let's say you have like a thousand rows and all the thousand rows are properly cast as dates, for example.
But there's one row that is just an ambiguous date or it's just
a messed up date.
Those cause problems.
And so the way to address those, the way we're thinking to address those is to provide a
summary of all the rows of data that weren't compatible with the database, and then provide the user ways to transform that data so that they can successfully ingest it in the database. But this is
one of our key challenges we help solve, both from a technical side but also from a user experience side. Because if you're coming from Excel or from CSVs, there's no types, right? And the user might not know that the database expects a type. Yeah, we have to abstract some of this stuff away for them, for the user experience.
Yeah, 100%.
That's one of the, I mean, it's a very interesting and very hard problem at the end because you can do some stuff.
You can infer some stuff, but you cannot infer everything, especially when we are talking about what I think is one of the most annoying types: Boolean.
Because people can represent Boolean values in so many different ways.
You have true and false.
You have yes and no.
You have zero and one.
There are times that they just merge all of this together.
And of course, when a human reads an Excel file, it looks fine.
You can interpret the semantics around the values, right?
But that's not exactly true when it comes to a database system.
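The usual way to handle the boolean spellings Kostas lists is an explicit normalization map, and to refuse to guess on anything outside it. A sketch; the accepted spellings here are an assumption:

```python
# Normalize the many spreadsheet spellings of a boolean.
# Ambiguous values return None so they can be surfaced to the user
# instead of being silently cast to text.
TRUTHY = {"true", "yes", "y", "1"}
FALSY = {"false", "no", "n", "0"}

def to_bool(value: str):
    v = value.strip().lower()
    if v in TRUTHY:
        return True
    if v in FALSY:
        return False
    return None  # ambiguous: report this value rather than guess

print([to_bool(v) for v in ("Yes", "0", "TRUE", "maybe")])
# [True, False, True, None]
```

Returning `None` for the ambiguous case is the same design choice discussed below: keep the unresolvable values somewhere a human can look at them.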
And yeah, it's very interesting because it's also like, there are two things there.
One is, yeah, you have the data that you cannot infer and you need to keep them like somewhere
so someone can go and transform them.
But also like what I have seen is that you can be so aggressive
with trying to adapt everything and auto-cast, let's say,
where it's very easy to end up in a situation
with your dataset being just a string at the end, right?
Which doesn't work.
Yeah, absolutely.
So that is the challenge that we'll have to solve:
how to, over time, get better and better at accurately inferring types in a way that aligns with the user's intention of what they want.
What we don't want to do absolutely is we don't want to lose precision in some of their data.
So if they have data that comes as decimals, like floating points, like, you know, 35.54,
like you definitely don't want to mess with that.
You don't want to say, oh, just 35 and forget about the decimal part.
So there's things that we can be very careful about.
But then for the other problem, it is just about, over time, building a way
to understand that user's intention and then maybe provide them a choice, something more
explicit, something informed, something they can take an explicit action on to make sure it fits.
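The quarantine-and-review approach described here, preserving decimal precision and surfacing incompatible rows rather than silently coercing them, could be sketched like this (illustrative only, not Dropbase's actual code):

```python
# Sketch of "quarantine, don't discard": try to cast each row to the
# expected type, and collect failures into a summary the user can review
# and fix, instead of silently coercing everything to text.
from datetime import datetime
from decimal import Decimal, InvalidOperation

def cast_cell(value, target_type):
    text = str(value).strip()
    if target_type == "date":
        return datetime.strptime(text, "%Y-%m-%d").date()
    if target_type == "decimal":
        return Decimal(text)  # Decimal("35.54") keeps precision; float might not
    return text

def ingest(rows, target_type):
    good, quarantined = [], []
    for i, value in enumerate(rows):
        try:
            good.append(cast_cell(value, target_type))
        except (ValueError, InvalidOperation):
            quarantined.append((i, value))  # row index + raw value for the user
    return good, quarantined

good, bad = ingest(["2022-01-19", "2022-01-20", "not a date"], "date")
# good holds the two parsed dates; bad == [(2, "not a date")]
```

Using `Decimal` rather than `float` is one way to honor the "don't lose 35.54's decimal part" constraint mentioned below.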
Yeah.
What's your opinion on the most annoying data types to work with?
Yeah.
You know, I'd say Booleans.
Yeah.
They could be difficult. I think because they're difficult, they just end up as text, and then you're going to have to figure something out, or you can transform it later from, like, true to one down the road.
I'd say it's when you get more advanced with your data, like if you're thinking about location data, and then the different ways to store times and time zones. Those become a little bit challenging. So today we deal with date types pretty well, but for location data, yeah, it's just text today. We don't really have a way to do it yet, but also we think it adds more complexity for users to have location data represented as location data explicitly. Yeah.
Time is just a pain in the ass.
Time is a pain, absolutely.
That's a deep statement because it applies not only to databases,
but just to life in general.
Yeah, yeah, of course.
I think, I mean, a database is a projection of reality in the end. But I think the most interesting thing time has to tell us
is about human nature and human communication,
because in the end, we just cannot agree on how we want to represent this stupid thing,
which governs our lives in the end.
It's crazy.
I mean, if you look at time manipulation libraries, the amount of work that people have put toward building these libraries is just crazy. And I'm pretty sure that someone outside of software engineering would think, okay, what is time? I mean, we've had clocks since forever, right? How hard can it be to build?
I know, yeah.
No, totally true.
This is something where like people more on the technical side, they're like, you know,
this is important and this is difficult.
And then everybody else is kind of like, I mean, it's just time.
Can you just please add my time to the spreadsheet or to the database?
We're like, yeah, I wish it was that easy.
Jimmy, question on one thing I'd just like to drill in on briefly is you use the term
offline data.
And we haven't used that term a lot on the show.
Could you give us a quick definition for our listeners?
Because I think sometimes it can just refer to data that's in a CSV in an email, but
it can also refer to types of data that, you know, sort of don't emanate from the cloud originally, which can be some of the most valuable data.
We really mean a combination of those. It's just any data. So offline, for us, is just any data that is not online.
But it's funny, because in theory you extract a CSV from an online system, presumably, right? But we really just mean
files: CSV files, Excel files, and also data that's sitting locally on your machine.
Yeah, for sure. And so can you give us just a couple of the main use cases, right? So we think about offline data. What are your customers using DropBase for in terms of types of data? Who are the users of DropBase, and how are they using it?
Yeah. E-commerce is always a really good example. It's an industry that's really growing a lot, and people use a variety of tools to do this, right? So one of the things is, for example, similar to the example before: you have shipping companies sending you shipment updates, and they almost always come in offline files or flat files, you know, Excel or CSV files. But then you also have companies whose customers, when they want to use the product, first need to onboard a lot of this data into the database.
And so they'll export data from another system.
They'll convert it to a CSV.
And now they want to be able to ingest that quickly and repeatedly. And so those use
cases tend to be the ones where, let's say, you're a company that builds software for
insurance brokers, right? Every broker has their list of customers. And guess
what? They're usually in an Excel sheet or in a CSV file, or maybe in some
pretty old-school system, right?
And so now if this company wants to serve those customers, they need a scalable way
to help all their customers quickly get that data into the system, right?
And so the idea would be that they can bring it, they can clean it, and then they can have
it in a database so that their product, which is on top of that database, can then query it and maybe show some
dashboards or visualizations or some analytics for them. So these use cases, yeah. You know,
it's interesting. As you talk about DropBase, it's maybe one of the first times I've considered a product where when I think
about the users, it's, you know, sort of SMB or enterprise, you know, like the non-technical SMB,
maybe I'm running e-commerce, like, of course, it doesn't make sense for me to have data engineers
on staff or an enterprise where I'm dealing with lots of offline data.
And that's super interesting.
Would you say that's true of sort of your users or?
Yeah, that's pretty accurate.
Yeah, it is pretty accurate.
So the SMB angle certainly is, you know, they're starting with spreadsheets, and then those
spreadsheets get bigger, but they still know that they want to save time.
They still know they want to get this data connected, because there are now a lot of
nice, high-tech SaaS B2B tools that are marketed to
a lot of these smaller companies that want to be more tech-oriented.
But then the first step to use that tool, it's like, hey, step number one, connect to
database.
And they're like, okay, you lost me there, right?
I'm not sure how that's going to work for me.
Yeah, for sure. Yeah. It's super interesting because, I mean, with e-commerce we're just, you know, sort of in the early innings in terms of the growth there.
But if you think about someone who's running, you know, a successful e-commerce store on Shopify,
like the machines are all talking together, right? Like, I mean, most of this stuff is sort of connected
or it's completely disaggregated
and you get it in a CSV.
There's no in between, which is fascinating.
So yeah.
So that's sort of the concept of the,
sort of the ETL at the edges
in the sense that like a big company,
you know, today you can sign up
for data integration tools, right?
And you can connect to different sources, right?
So let's say you do ETL or you do ELT, right?
But then there's another set of customers.
Typically, I think the smaller companies
who are not super tech oriented,
they may not even have databases.
And so like for them,
they still need a tool to get on that path
of having a data stack to begin with, right?
And so that's where you can sort of extend the idea
of like ETL,
but extend it at the edges because there's still a lot of data
that's just trapped in offline files or systems.
And so how do we make use of that data?
How do we turn that data online so it can be useful
for that company or for that business?
Jimmy, I have a question that has to do a little bit
with the architecture of the product itself.
You mentioned that in the product experience, like the database is included there, right?
How do you do that?
Like what kind of technologies are you using?
And what was the reason that you decided to do it like this instead of connecting to all the different available databases?
Yeah.
So our initial version of DropBase, because we were kind of just proving out the concept, we built it on a Postgres database. Again, you know, open source, fairly powerful, fairly flexible piece of technology.
So we started that way: okay, let's move that data into Postgres.
But then you realize that at scale, when you hit millions of rows, a transactional database isn't as good.
You definitely either need to add extensions to that Postgres, or you just need to use a column-based database or a data warehouse.
So our new version of DropBase, we built it straight on Snowflake.
And so with that, we have the benefits of security, of near infinite scale, and of hyper-fast querying, right?
Like millions of rows in a couple of seconds. And so given that we expect our users to continuously import more and
more data over time, data that is generated in these offline files or systems, that data set
would become bigger and bigger over time. And so having something like a data warehouse is super
neat. We're also exploring other kinds of data storage systems, other databases or data warehouses that we could use to do this in a way that's super scalable, super accessible, but also affordable for our users.
Yeah, that's super interesting.
So, okay, dealing with Snowflake, I mean, there's still, like, some configuration that you need to do, right?
Like, as a user, you have to select your data warehouse.
You have to select the size.
Like, you have these kinds of parameters there.
Do you handle that for your customers, or do they have to figure this out?
No, we handle that for the user.
So, a user signs up for DropBase and they get their
own private database instance. And then we make certain opinionated choices about how we configure that
database for them. So from their side, all they're doing is creating a DropBase
account and they immediately get access to that database. So they get credentials to the database.
And we mirror our permissions in DropBase to those permissions in Snowflake.
So that if you are an owner of the workspace, you also are the owner of that Snowflake database from a credentials perspective, obviously.
And this allows us to do some pretty neat stuff, right?
Like we can help users manage access to the database.
That's the first thing.
And the second thing, which is sort of some principles that we really like,
which is that users should be able to access and control their own data,
is even though we're managing the database for them,
they can at any point access their data because they have the credentials to do it.
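The permission-mirroring idea Jimmy describes could be sketched roughly like this. The role names and the exact set of GRANT statements below are assumptions for illustration, not Dropbase's actual implementation; Snowflake's real grant model is considerably richer:

```python
# Sketch: translate a workspace role into Snowflake GRANT statements so
# the app's permissions mirror the database's. Role/database names here
# are hypothetical placeholders.
def grants_for(workspace_role, database, role_name):
    # Baseline access every workspace member needs
    stmts = [
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role_name};",
        f"GRANT USAGE ON ALL SCHEMAS IN DATABASE {database} TO ROLE {role_name};",
    ]
    if workspace_role == "owner":
        # Workspace owners own the data: full privileges on all tables
        stmts.append(
            f"GRANT ALL PRIVILEGES ON ALL TABLES IN DATABASE {database} TO ROLE {role_name};"
        )
    elif workspace_role == "viewer":
        # Viewers get read-only access
        stmts.append(
            f"GRANT SELECT ON ALL TABLES IN DATABASE {database} TO ROLE {role_name};"
        )
    return stmts
```

The payoff of mirroring like this is exactly what's described above: users always hold real database credentials, so they can bypass the app and query their own data directly at any point.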
Yeah, that's very interesting. And I guess the question I wanted to ask you is: have you seen your system installed together with traditional data warehouses out there, so a company might be using both, for example?
Yeah. So not currently, but we expect that to happen over time. There are two ways this can evolve.
A user starts with DropBase, and if they're sort of on the smaller side, they could grow with DropBase.
And eventually they can say, oh, you know what, this is our data warehouse.
Now we're going to buy other tools to maybe do data integration, maybe we do BI, but then I can just use DropBase to do all of this.
So that's certainly one way it can go, right?
And the other way it can go is you just start with DropBase, and then that is just one of
your points of data integration.
So that becomes one of your sources for a larger stack that you would have built or
that you already have at your
company.
That's pretty neat. And what's your experience so far of using Snowflake as, let's say, a component of your product? I think we had another company that we talked to in the past, in another episode, that did something similar. They actually built a lot of logic on top of Snowflake; they pretty much implemented algorithms on top of it to do their stuff. So it was very interesting, and it seems that it's becoming some kind of pattern there, and it makes sense for Snowflake too, right? What they're trying to do is move away from being a data warehouse and become a data platform where you can build products on top of it. So how has your experience been so far?
There are many trends that lead to that.
Like the first one is, you're right,
it's just Snowflake's desire to be a database
or a data warehouse for more things
and more companies out there.
So Snowflake has actually started a sort of startup program, basically, where they work
with startups, providing them with some credits, some expertise, some time with
their engineers to discuss how to build products on Snowflake.
And by that, they really mean use Snowflake as a data layer,
as the data store,
with the startup's product sitting on top of it
and having a pretty tight integration with it.
So there's certainly that is one of the appeals to this,
is that, well, Snowflake wants to do this,
and so they're investing in this.
And the second one is that Snowflake as a data warehouse
is fairly mature,
fairly sophisticated, and it has a really good, let's call it API, where you can
interact with the underlying database programmatically. You can do a lot
of stuff: you can generate permissions on the fly. You can generate tables and connect permissions to those tables in different ways.
And so it's fairly feature-rich, fairly programmatic, right?
And so as someone building a product, you have a lot of flexibility.
We've seen other databases.
We constantly do research about other databases we can integrate with.
And their APIs, sort of their developer ecosystems,
just aren't quite there yet.
But we're always keeping an eye on this
just so we can offer more choice.
I think, and also to address a previous point
about the database-included portion:
certainly for the smaller companies,
we want to provide them the database included.
But for the other extreme, companies that are maybe a bit bigger and maybe already have data infrastructure set up,
we're going to have a sort of BYODB, bring-your-own-database, bring-your-own-Snowflake approach, where if they want our user experience, they can provide the Snowflake instance or that database instance.
And then we can sit on top of it. So we built our product, our architecture, to be flexible enough to swap databases out and still work.
One last question from my side, and then I'll give the stage back to Eric. Is the cost of Snowflake a concern so far? Snowflake is a bit on the pricier side.
So Snowflake has some natural, pretty good characteristics, right?
The separation of compute and storage is really nice,
because we have people importing a lot of data
but not processing it.
We don't have to have a huge machine
that's constantly hosting that data in that database just so we can store the data.
So that's a pretty nice property of Snowflake's architecture for us.
Pricing is more expensive, but we are actually able to make it cheaper for our customers, because we are the Snowflake customer.
And so we basically have the accounts with Snowflake, right? And I don't know if you knew this, but Snowflake has, I think, by-the-second billing, but with a minimum of a minute for their compute. And so the devil is in the details of that, right? In theory it's really nice if you can charge by the second, but if you have a small SMB, right, that's running a tiny little query, that's going to take what, a second or two at most, right? And so for the other 58 seconds, that's just cost that I guess we have to cover, unless we have enough scale that at any point in time there are always queries running, and then we can get some economies of scale with Snowflake.
So for the purposes of building a product on Snowflake, I think with no scale, it becomes
a lot more expensive for you and potentially for your end customer.
But at scale, I think it's going to be pretty neat.
I think at scale, the math does work out pretty nicely.
And with the other properties of being able to program that database through their APIs
and having storage and compute separated, those are really, really nice properties.
Obviously, the security aspects of it, a lot of it comes built in.
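The per-second-billing-with-a-minimum arithmetic Jimmy walks through can be sketched with placeholder numbers. The credit rate below is a made-up assumption, not Snowflake's actual pricing:

```python
# Back-of-envelope math for a per-second billing model with a 60-second
# minimum, as described above. CREDITS_PER_HOUR is a placeholder rate.
CREDITS_PER_HOUR = 1          # e.g. assume a small warehouse burns 1 credit/hour
MIN_BILLED_SECONDS = 60       # compute is billed for at least a minute

def billed_credits(query_seconds):
    """Credits charged for one warehouse run of the given duration."""
    billed = max(query_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR * billed / 3600

# A 2-second query is billed as 60 seconds -- 30x its actual compute.
lone_query = billed_credits(2)
# But if queries arrive back-to-back so the warehouse is always busy,
# the minimum stops dominating and cost tracks actual usage -- the
# economies of scale mentioned above.
```

This is why the math "works out pretty nicely" only at scale: the fixed minimum is amortized once there is always a query running.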
Well, we have time for just two more questions.
One may be more complicated.
The other is not.
And I'm going to put my business hat on here because Kostas and I, both as former entrepreneurs, can't help it, especially when it
comes to data. But building on Snowflake to me, especially with the SMB and enterprise approach
is brilliant because SMBs are going to move towards a database like Snowflake, right? As they sort of grow and expand
and, you know, want to do more stuff. And enterprises that rely on a lot of offline data
are also going to move towards Snowflake, you know, as they want to modernize their
warehousing solution. And so the transferability to me, especially when we think about Snowflake's ability to sort of syndicate data is really interesting.
Did you think about that as you were sort of building DropBase?
I mean, it's amazing to think about like, okay, I'm going to adopt Snowflake and like, great, like it's just going to work.
And then, you know, as an SMB and then the same thing from the enterprise side.
Yeah, I'd like to say that we had all that foresight, but the main thing is we are making a bet, basically. And our bet is that the smaller companies, the not-so-techie companies today, will leapfrog a basic database and go straight for the warehouse.
And I think that data warehouse companies like Snowflake and others
are also going to want to tackle that market.
So they'll be building new features.
I think they'll be adapting their pricing structure, their business model.
Eventually, I think it makes sense for every business
to have some sort of database, right?
And I think for us, it's more like, okay, well, that end user might not know how to make that choice of a database. So why not give them something that is scalable from day one, that is highly secure, and that we have a very high level of customizability with, in terms of how we can program it and develop features around it? So that as they grow, they can still stay within it; there's no need to migrate from these databases to something else. But that sort of thinking, I think, came afterwards. It wasn't like we had this from the beginning, but we were making a bet that people will leapfrog and they'll want databases that work for them. And so that was our starting point.
Yeah. Super interesting. And I mean, in many ways it's kind of, you know, you think about the database as a core piece of infrastructure and then the question kind of becomes interface,
right? Like how do you interface with it? And different teams are going to interface in
different ways, which is fascinating. Okay. Final question, Dropbase, if listeners want to check it
out or try it out, where do they go?
Yeah, thank you.
Yeah, so they just go to dropbase.io and they just sign up for a free trial and they can start using it right away.
By creating a workspace, they have a database from day one, a database that they can connect
to, you know, Tableau, Looker, Mode, Retool, really anything they want to because they
have credentials to access that database.
Awesome.
Well, Jimmy, this has been a fascinating conversation.
Thank you, I mean, we've talked about things that we haven't talked about in over 70 episodes on the show.
So thank you for that.
I had a lot of fun.
Thank you guys.
Yeah, it's great.
So thanks for joining us, and best of luck with DropBase.
Thanks, Eric.
And thanks, Costas.
Okay.
My takeaway, which may be obvious, we just don't talk about offline data that much.
And my guess would be, and I actually wish I could go back and ask Jimmy what his gut
sense of this is, but the amount of data that still gets shared via spreadsheets, via email
has to be enormous. I mean, that has to be an enormous part of the market. And obviously,
there are dynamics that are changing that, but it really is crazy to think about. I mean,
you know, we just don't talk about that a ton on the show. But there are people whose jobs revolve around sharing data in flat files or
Excel files over email. And that's probably bigger than sort of the slice of the world that we see
who's trying to do like advanced BI with open source visualization or whatever.
100%. And I think it's probably something very common in commerce, especially now that pretty much every company also has some kind of digital presence.
But we have to remember that it's not all digital; there is an employee there at the end of the day who is probably, you know, completing an Excel document and sending that back to the headquarters or whatever.
And that's a lot of data, and important data, actually.
So there is Shopify, but not everything is on Shopify, right?
Yeah, it was very, very interesting to chat with Jimmy.
I'll keep two things from the conversation.
One has to do with how people working at, let's say, the edge of technology keep forgetting that technology has existed for, I don't know, 50 years now. So there are a lot of legacy systems out there that we need to keep supporting, and we just forget about that. The other thing is how creative humans can be, right? Suddenly you see that email becomes a transportation layer for data.
And there is a third, sorry, I said two, but there is also a third. And that's like the, I think we got a glimpse today of the rise of the data platform, right?
And something that I have a feeling that we will see more and more in the future.
And we will see more products that are being built on top of not just AWS, but on top of Snowflake or on top of Databricks.
So these are like the three things that I keep from this conversation.
I agree.
I think that was a small part of the conversation,
but I think that was probably one of the most interesting points.
That's the first company we've talked to who is building on the Snowflake infrastructure.
I think, I think it was the second.
Oh, that's right.
That's right.
Yeah.
And that's fascinating, especially to hear about sort of the cost arbitrage, if you want
to call it that, and the way the economics work.
Yeah.
And there seems to be also a lot of effort from Snowflake itself to push that.
So anyone out there who is like
considering building a product like on top of data, maybe they should reach out and see like
what this startup program is. Yeah. Yeah. It's super interesting. I mean, I'm going to look it
up just because I'm interested. All right. Well, thank you for joining us. Of course, subscribe
if you haven't. A great episode is coming up and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.