The Data Stack Show - 71: ETL at the Edges with Jimmy Chan of Dropbase

Episode Date: January 19, 2022

Highlights from this week’s conversation include:

- Jimmy’s career background (3:01)
- How to use data cubes (5:52)
- What Dropbase is and who it is built for (11:01)
- Getting sales and marketing data in usable formats (16:46)
- Ensuring data remains flexible and transferable (28:36)
- Defining what “offline data” is and how to use it (34:09)
- How Dropbase can work with the rest of the data stack (43:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. If you're watching this on video, you can see that it's evening on the East Coast and midday on the West Coast, which is what we get in the winter. We are going to talk with Jimmy from Dropbase. He's the CEO and co-founder and really interesting
Starting point is 00:00:41 company. I'm just going to give you a little preview here. You can load data into DropBase and it's database included. So first company like this we've talked to, which is fascinating. I have a lot of questions as sort of the user of this type of product because I think it would have helped me a lot in the past. I think my main question is going to be
Starting point is 00:01:00 about the architectural decision, right? I mean, we talked so much about cloud data warehouse, data lake, et cetera, and how that's the modern architecture. And they chose to do what Jimmy calls batteries included, which is database included. So I just want to know where that came from. And I think his past working with data will inform us on that. But Kostas, tell us what you're thinking about. Yeah, I want to learn who is the user. We live in a time where it's all about the data engineer, data engineering teams, infrastructure for data. And we just assume that everywhere there is a data engineer ready to do, I don't know, like anything you want with your data.
Starting point is 00:01:46 Obviously, it's not true. So, yeah, I mean, it's obviously like from a business perspective, you think about like a very underserved segment of the market right now. So I really want to see like who the users are and how do they feel about this and how they use it. And the other question has to do a little bit with the data sources, because I think that we are going to hear some more, let's say, unique data sources that they are encountering there,
Starting point is 00:02:15 like maybe things like FTP or email and stuff that usually a data engineer does not consider as a data source, right? So I think it's going to be very interesting. I am amazed that you did not say you wanted to ask him what database they're running under the hood. I mean, isn't that like... Well, you started and you were saying that you want to talk about the architecture.
Starting point is 00:02:37 So I'll leave that to you. Okay, all right. Well, I know the questions are going to come up. We'll see who gets to that question first. All right. Let's jump in and talk with Jimmy. Let's do it. Jimmy, welcome to the Data Stack Show. We're extremely excited to chat with you. Thanks for having me. Well, give us a little bit about your background and what led you to DropBase.
Starting point is 00:02:59 Sounds good. Yeah. So I have been directly in data for the last five years through my startup. And I've been more indirectly involved with data for almost 10 years now. I was an early adopter of Tableau, the BI tool. And that was one of my first jobs. And this was at the time when the company was still using data cubes. So it's been quite a while. Data management was a little bit harder than it is today. So that was sort of where I was coming from. And since then, I've always been excited about
Starting point is 00:03:28 just the data space and how useful it is to have data that's accessible, but also to extract insights from it that you can use to make business decisions. Sure. Okay. So I'd love to talk a little bit. We have so many things to talk about as it relates to DropBase. But really quickly, so being an early adopter of Tableau. So when I hear Tableau, and I think a lot of our audience may feel this way, my first question is, was it fast back then? Because today, if you have a big Tableau implementation, it's kind of slow, a little bit cumbersome.
Starting point is 00:04:01 I mean, super powerful. But what was it like back then? I mean, because that was kind of a pretty cutting edge BI solution that allowed you to do analytics that were even harder before. I think it's been just as fast as before, but also just as slow at the same time. And now I'll explain what I mean by that. So the BI tools have always been just as fast as your database can produce and transform data for you. And back then, because the company was using data cubes, if I needed to perform a different analysis, it was much, much faster to actually have the cube generated by a data engineer first. So you'd have to submit a ticket, get the cube set up, and then you'd connect your Tableau
Starting point is 00:04:47 to it. And then you'd slice that. Tableau was a lot less featureful back then. It did the basics pretty well. It was still very impressive at the time because it was very visual. You could drag and drop things. And it was just really, really cool. And I think on that basis, they got a lot of customers to sign up. So it was faster and slower. It was slower because of the data infrastructure that we had back then, but faster because it was less featureful than it is today. And then, well, today people just have way more data. And Tableau itself is just a lot more complicated product.
Starting point is 00:05:22 It's super powerful, but it's very complicated. It does a lot of things, right? And so it slows it down a little bit too. One thing before we continue, because I don't think that we have thought before what a data cube is. So Jimmy, would you like to give like... Yeah, sure. Yeah, absolutely.
Starting point is 00:05:39 Yeah. So I should have mentioned that before. So data cubes, the simplest way to think about them is as an array, sort of an N-dimensional array, where you just have that array pre-computed for the purpose of being able to pull it faster with a downstream tool, right? The difference with how we do it today is that if you use, say, a data warehouse or something like ClickHouse, right? They're column-based analytical databases, and they're massively parallel. You can compute columns really quickly, right? But before it wasn't like that. We didn't have Snowflake as well. I think maybe they were alive back then,
Starting point is 00:06:17 but it was nowhere near what it is today, right? And so people were still on these old systems, and you just had to pre-compute a lot of these cubes, which were arrays, and you kind of had to settle on them, right? So you have product and then price and then time, and that's your cube. That's a three-dimensional cube with those three dimensions, right? But if you needed something else, it's like, well, you need to just build a new cube and then store that cube and then you can pull data from it. So that was just how it worked before.
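The cube Jimmy describes, an N-dimensional array of pre-computed aggregates, can be sketched in a few lines of Python; the dimensions and figures below are hypothetical, just to show why slices are cheap and new questions force a rebuild:

```python
from collections import defaultdict

# Raw fact rows: (product, region, month, revenue) -- hypothetical data.
facts = [
    ("widget", "us", "2021-01", 100.0),
    ("widget", "us", "2021-01", 50.0),
    ("widget", "eu", "2021-02", 75.0),
    ("gadget", "us", "2021-01", 200.0),
]

# "Build the cube": pre-aggregate revenue over three fixed dimensions.
cube = defaultdict(float)
for product, region, month, revenue in facts:
    cube[(product, region, month)] += revenue

# Slicing along the pre-chosen dimensions is now a cheap lookup:
print(cube[("widget", "us", "2021-01")])  # -> 150.0

# But a question outside those dimensions (say, revenue by customer
# segment) can't be answered from this cube -- you'd have to define and
# build a new one from the raw facts, which is the rigidity described.
```

A question along a dimension you didn't pre-compute means going back to the raw data, which is exactly the submit-a-ticket workflow from the episode.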
Starting point is 00:06:49 Not too different from the concept today of kind of like transforming your data and then pre-computing some tables, but your tables can still have more flexibility than the pre-computed cubes that you used to have before. Okay. And how was it mainly implemented? Like, how would you implement a data cube? Was it something like a materialized view on a Postgres database, or something else?
Starting point is 00:07:14 Yeah, no, they were using these, I think it was just like Oracle databases. I don't really recall the exact technology they used, but it's not the kind of tool that you would pick as your first choice today if you were to set up data infrastructure. Let's just put it that way.
Starting point is 00:07:28 And then within those database tools, they have the concept of cubes sort of embedded in there. So you could like write some queries, build a cube. You would schedule the computation of the cubes beforehand because you had to process on the actual database, which were also slower back then. Okay, that's super interesting.
Starting point is 00:07:49 And what happened to data cubes today? Are they still a thing? Yeah, I don't think they're as popular. People don't talk about them too much. Companies are not signing up for a data warehouse to be like, yeah, I'm so excited about building cubes today. Today it's more about flexibility, right? It's about being able to quickly get the data you need
Starting point is 00:08:06 to do the analysis that you want. So people sign up for a highly scalable data warehouse where they can just store all the data. And then, you know, they can transform data as they need them. And so you create maybe like tables that you can then use to perform other transformations, but you don't have to run these cubes that are so rigid in structure
Starting point is 00:08:26 that if you needed to do something else, you have to recompute everything again. One quick question there. And it's funny, I find myself looking back on technology that, in the world of tech in general, actually isn't that old, but relative to the tools we have today, just seems so antiquated, but
Starting point is 00:08:46 I'm just interested to know, did it feel rigid back then? Or was it just kind of, this is convenient? Like, you can build a cube and that makes Tableau more efficient. What did it actually feel like using data cubes back then? Yeah. I think at the time it still felt slow. I think as humans, we know when something feels slow, it just feels slow. Because anytime you wanted a slice of data, you had to wait for it. It wasn't like, boom, you get a table and that's it. So it felt slow, but at the same time it felt like, well, how else are you going to do this?
Starting point is 00:09:22 Right. Yeah. So that's generally how I remember it feeling back then. Yeah. I just keep thinking of what's different and what's common between the concept of a data cube and what we try to do today using something like dbt, right? Because at the end, dbt defines a table.
Starting point is 00:09:45 Like that's what we get at the end. That could be like a data cube, right? It is a multidimensional data set that we are going to use to do whatever we want to do there. Do you see anything in common there? There are parallels, I'd say. Just the concept of pre-computing data for the purpose of accessing it faster is a very common thing to do,
Starting point is 00:10:12 right? It's like when you think about computer systems, too. You have storage and memory, and you always want to have something quickly available for use. So the principle is the same, but in practice, they're a little bit different in the sense that even when you use dbt to pre-compute your data, the results still remain in database tables on that same warehouse, right? And so, whereas before it's like,
Starting point is 00:10:37 you could have like a 10 dimensional cube and you had to pre-define it so explicitly and you have to run it on your entire data to be able to use them. So that was really painful. Okay. Okay. I think we had enough with the data cubes, Eric. So you were asking something, Eric, and I stopped you to ask about the data cubes. No, it's great. I think one thing we've learned in the show is that we can learn from the past, which is great. And that we hadn't talked about data cubes yet. So I'm so glad you brought it up. Enough of the past. Tell us about DropBase
Starting point is 00:11:09 and what the product does and who it's built for. Sure. Let's jump from the past to the future. So DropBase is just an end-to-end platform to automatically import, clean, and centralize your data in a database. And this is database included. So we're basically a batteries-included solution that lets you do all of this in a very simple and intuitive way, very quickly, so that even if you don't have a full-fledged data engineering or technical team, you can still access some of these tools that are beneficial for companies, such as databases and quick data pipelines. And so that's what we do today with DropBase. Very cool. And I'm so interested to know, so lots of ink has been spilled about the modern data warehouse, data lakes, that infrastructure sort of standing on its own.
Starting point is 00:12:08 What were the variables or observations or needs that you saw to build a product that's database included? I'd just love to know the thought process behind them. Sure. I think you can always answer that question if we go back to the kinds of problems that we solve, the kinds of problems that we observe from our user base. And so at a high level, there are three problems. There's the how do we democratize data and how do we automate data operations? And then how do we help them set up infrastructure? And along democratizing data, it's people are outgrowing their spreadsheets, right? People start out their analysis with Google Sheets, Excel, maybe they
Starting point is 00:12:52 export some CSVs from some incompatible system. And then as their needs grow, they sort of, you know, they can't use these tools anymore. And so that's kind of the basis we're starting from, right? Users who are in that world, right? And then when these people deal with data imports, right? People are sending them data through emails or through batch exports. Maybe they're connecting to an SFTP server. They end up doing a lot of repetitive data cleaning work. So that is downloading an
Starting point is 00:13:25 attachment from an email and then uploading it to some new system, cleaning it, and then maybe moving it to a database eventually. And what happens is, when an analyst, for example, is building some data cleaning steps or building a data pipeline on their spreadsheet, it doesn't really carry over to scalable data pipelines that you can then use over and over again. And because of that, a lot of non-technical teams end up being paralyzed, right? Like they just can't really do things without needing an engineer or maybe somebody else to help them set up a database. And even if they've set up a database, well, how do they move this data to it easily without writing their own scripts, right?
Starting point is 00:14:10 And with all the data cleaning steps that they've added. And so if you look at these core problems, people are outgrowing their spreadsheets, people are having to do repetitive data cleaning work every time they deal with a spreadsheet, and then there's the fact that they can't themselves spin up data pipelines or data infrastructure. Those are sort of the core problems we see. And we say, well, if we were to give them a solution like this, it's going to have to be batteries included. It's going to have to be
Starting point is 00:14:33 something where they can create an account, create a workspace, upload a CSV or an Excel file, and immediately have it in a database that they can then connect to a BI tool, for example, or any other tool that connects to a database, right? And so that's sort of how we look at, you know, the evolution of observing the problems and then saying, okay, we must give them these tools. So that's a really core part. The other part is that I think this group of users is kind of like the forgotten users, because a lot of tooling and products today focus on people who we assume already have a database. And I think with big milestone events like Snowflake going public, it's going to drive this sort of move where we kind of
Starting point is 00:15:19 like leapfrog from like spreadsheets almost straight up to data warehouses, similar to what we see with like mobile phones, right? It's like the technology just makes sense. Why are we still going in such small steps? We can just sort of leapfrog it. And so we see a big portion of the market with these larger and larger spreadsheets and CSV exports who are just going to need a database and we want to be there for them. Yeah. One question, and I know that Costas, I can see in your mind, Costas, you have technical questions about the database, but when you talk about outgrowing spreadsheets, I think about two vectors there, and maybe there
Starting point is 00:16:00 are more, but I'm just interested in this. And these are the two vectors that I've experienced in my past going through the exact sort of life cycle of spreadsheets that you're talking about. One is complexity, right? So I'm exporting marketing data, sales data, transactional data, et cetera. And I'm getting really good at VLOOKUPs. And it's like, okay, this is unwieldy, right? I mean, kind of the way it played out in my past is like, okay, well, Monday morning, the first four hours are running all the VLOOKUPs and everything to get the numbers from last week or whatever. The other is size.
Starting point is 00:16:41 And I know that these are related, right? But you have to have a pretty powerful machine to like run hundreds of thousands of rows in Excel. Google Sheets is getting better, but these things choke when you get to a certain amount of data. And actually now with all the data that companies are collecting, like it's not that much data. So how do you see those two vectors interacting? Are there more that sort of force people past the point of like, this isn't working anymore?
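The VLOOKUP workflow Eric describes maps directly onto a database left join. A minimal sketch using Python's built-in sqlite3, with hypothetical table and column names:

```python
import sqlite3

# A VLOOKUP against a reference table is essentially a LEFT JOIN.
# Hypothetical data: a fact table of companies plus a reference table
# mapping each company to its sector.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE deals (company TEXT, amount REAL)")
con.execute("CREATE TABLE sectors (company TEXT, sector TEXT)")
con.executemany("INSERT INTO deals VALUES (?, ?)",
                [("Apple", 100.0), ("Citi", 250.0), ("Acme", 50.0)])
con.executemany("INSERT INTO sectors VALUES (?, ?)",
                [("Apple", "tech"), ("Citi", "finance")])

# The entire "rerun the VLOOKUP" step becomes one query; when either
# table changes, you just run it again.
rows = con.execute("""
    SELECT d.company, d.amount, s.sector
    FROM deals d LEFT JOIN sectors s ON d.company = s.company
""").fetchall()
print(rows)  # Acme has no match, so its sector comes back as NULL (None)
```

The left join keeps every fact row and fills in the lookup value where it exists, which is exactly the VLOOKUP semantics, minus the manual rerun.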
Starting point is 00:17:09 Yeah, absolutely. So I think those two vectors are quite accurate ones. There might be a couple more. So when you think about the VLOOKUP and the complexity of it, there's also just the requirement to have a big machine or a lot of memory in your local computer to run it. And when you draw a parallel to how you'd do it in a database, it's just a left join, right? And today, if you had two tables in Snowflake and you did a left join, you could do it on a million rows in seconds, right? And so then the gap could be closed through user experience, right? So if we could just build a function or a UI component that helps a user who is familiar with the concept of
Starting point is 00:17:53 VLOOKUP perform a left join in a database, then maybe we could bridge the gap, right? So those two vectors are really important. I'd say the other one is a bit more about scalability and repeatability, because a lot of the time you end up doing this over and over again, right? With the VLOOKUP, let's say either the reference table is updated or the core table is updated. In either case, every time you get an update, especially to the reference table, the one that tells you, okay, Apple is in tech and Citi is in finance, well, you have to rerun the whole thing again. And that tends to be quite manual. You'd open your spreadsheet, you do the VLOOKUP, and then it's there, right?
Starting point is 00:18:35 With databases and with tools like DropBase or other tools, you can just automate that process and make it more scalable and repeatable. Yeah. I think the other major thing when you think about spreadsheets is human error, right? I mean, data's messy in general. And when you think about combining all these different spreadsheets and then trying to use VLOOKUPs and macros, if you're getting really fancy, to try to normalize all this stuff, someone's going to fat-finger part of the formula at some point. And that's obviously very painful, especially when it takes like 10 minutes for everything to run. Yeah. But Eric and Jimmy, I can't stop thinking, while you are talking about spreadsheets, can you imagine what is going to happen to our civilization
Starting point is 00:19:25 if suddenly like tomorrow Excel disappears? Oh, no, it would be disastrous. So, you know, spreadsheets still hold a very important part of the economy together in some way, if you think about it that way. There are things that spreadsheets are just very good at. And if it weren't for the fact that they don't perform at scale, people would just still use spreadsheets. They're very powerful. But yeah, the whole world would fall apart. Literally, if spreadsheets stopped working
Starting point is 00:19:56 today, you've heard about the horror stories of big financial models, like just build on Excel and maintain on Excel. Well, I think, you know, one thing, Kostas, to that point, I remember a couple of years ago, we were trying to solve this, you know, sort of in marketing, you want to tag all your links, you know? And so I was doing this huge project for this massive company and they needed all these permissions and everything. And so we'd like built a Google Sheet to do all this.
Starting point is 00:20:23 We had custom scripts running in the Google Sheet. We were hashing values with MD5 and a custom Google script to make the string shorter and all that sort of stuff. And I remember showing it to my friend who is a software engineer. He's like, this is software. He's like, this is so brittle. That's true, Eric. But on the other hand, like, if you think about it,
Starting point is 00:20:48 it's amazing how approachable spreadsheets have made software development, right? Like, you have all these people out there who are actually not developers, but they can still develop automations for their needs, right? Which is amazing. I don't know how much of it is just that it's been out there forever, and you know, there's a lot of training and all that stuff. But as an interface with a machine
Starting point is 00:21:19 and like as a way to program the machine, I think it's probably like the most successful interface so far. Now, does it scale? That's a different conversation, right? And that particular project did not scale. It got really slow, really quickly. I mean, yeah, think about it. Let's say you had a bunch of different VLOOKUPs or other sort of transformations in a collection of spreadsheets that collectively are like gigabytes of data. And then now you're told to scale that thing. So the first thing you do is probably you'll contact one of your fellow engineers, maybe you contact them and you say, hey, can you scale this up? And then they'll look at you and be like,
Starting point is 00:22:02 I have no idea what you did here. Ideally, it's just something where the person who is building that spreadsheet can somehow record those steps, and then we can take those steps and deploy them at scale, and then maybe it could work. But today, they're not built that way. Yeah, they're not built to just transfer easily to code. Yeah. So, Jimmy, can you give us a small tour of what the experience with DropBase is like for a person who gets a file through an email and wants to use the product? Yeah, absolutely. So we have this new feature coming that's called DropMail. And it's probably going to be the simplest way to get data from an email straight to your database, right?
Starting point is 00:22:49 So all you do is you open up your email, right? You type in a special email address, a special DropMail address, and then you add a CSV attachment. And, you know, that's it. And you just send it. And on the other side, using the DropBase dashboard, you can set up a sort of a pipeline where you grab that data, you apply some cleaning steps.
Starting point is 00:23:13 We can even automatically map it so that it fits the database schema. And then we just automatically run it from there. So the experience, it's quite magical, right? You just send an email and then if auto run is enabled, it's straight in your database. So if you have all your downstream tools connected to that, you can imagine that you do automate the whole process of even downloading that file and then doing what you need to do with it.
Starting point is 00:23:36 And then somehow writing a script to inject that data into your database, right? And so the use cases where this becomes helpful are when we have users that are, let's say, in the e-commerce space. Right. And they get shipment updates from their manufacturers, sometimes every day. Right. And, you know, guess what? Those shipment updates are going to come as Excel attachments to emails. And so now what you can do is you can set up a rule on your email and say, every time it comes from my manufacturer A, I want you to take this data to this pipeline A, which goes to table A in your database straight in. Now, of course, your data has to be formatted in a particular way, so like a proper CSV file. And you can pre-build some cleaning steps to it. And then that's it.
Starting point is 00:24:23 It just goes straight in. Sometimes data doesn't come from an email. Sometimes you just export it from a system. Typically, these systems tend to be more incompatible. Like there's no API to connect to them. Or sometimes it's just privacy security issues. They have to take snapshots of it as CSVs. And so the same thing is like,
Starting point is 00:24:42 you want to upload that data, you want to clean it up, and you want it in your database. Okay, this is very interesting. And outside of email, which I guess is a very common, let's say, channel where data is coming in, what other channel or communication method have you seen out there that we wouldn't expect, let's say, as a method of exchanging data? Well, I mean, I think there's this thing called EDI that big companies still use. I'm not sure if you're familiar with the concept. I'm kind of new to it myself.
Starting point is 00:25:17 It's a sort of electronic data interchange protocol that was used before APIs were a thing, and it was used by big companies to exchange data with each other in a way that was, you know, more standard, more compatible. So those channels are actually pretty big today, surprisingly big, because the big companies operate those, but we don't hear too much about them. I only heard about this because we had a user reach out and they're like, yeah, we're in the insurance industry, and EDI is a big thing there, apparently. Right. And so they have their own set of protocols and ways to make it compatible.
Starting point is 00:25:56 They're very different from APIs, but the underlying concept is the same. It's like, okay, how do I connect to a data source that's come from an EDI product so I can pull data in? So that would be an unusual channel where data comes in. And then you have your usual suspects, SFTPs, cloud storages. You have your API, like pulling data directly from Shopify or QuickBooks or something like that. And then the offline sources, CSVs, Excel files, emails. And then you can build a whole universe of something like that. And then the offline sources, CSVs, Excel files, emails. And then, you know, you can build a whole universe of sources like that.
Starting point is 00:26:29 Yeah, that's pretty cool. And what kind of file types or serialization types do you support? I mean, you mentioned CSV. So I guess that's like a very common one. Is there something else that you see like being used out there outside of CSVs and Excel files? CSVs predominantly, they're a pretty standard way to exchange data. Excel files as well. And then Excel derivatives, you have like Excel workbooks.
Starting point is 00:26:54 And then there's other ones that we don't do today, but that we could do. It could be JSON exports, XML sometimes. And then some of the open documents formats. But there's still people using them, but they're not as common as CSVs and Excel files. I'd say for offline, like for flat files, CSVs and Excels would be probably 80-90% of the offline data. Yeah.
Starting point is 00:27:22 You mentioned two very interesting terms that usually conflict with each other in reality. One is automation. So you said, for example, you can forward the email and like if you have automation on like everything, you know, like magically you will see the data on your database. And the other is data quality. And I'm wondering, because especially like when you are dealing with CSV, which, for example, you don't have that much information about data types, right? Actually, you don't have information about data types. Everything is a string.
Starting point is 00:27:53 So on the other hand, of course, you have a database, which is a strongly typed system. So how does this work? What's the magic there? What do you do there? Sure. Yeah. There's a few things we do to ensure that we can still ingest the data.
Starting point is 00:28:09 And you're right. Given that a database is strongly typed, we must ensure that the data that comes in fits that schema, but without having to explain to the user all these things about types, right? And so what we do is a first pass of automatic inference of data types. So we try to cast things. If we see strings that look like dates, we attempt to parse them as dates, right?
Starting point is 00:28:37 And so we will help users doing some of this stuff, right? And then with integers and floating points or decimals, you know, that's a little bit easier. And then, so that's one level of assistance that we provide our users. The second level of assistance is an explicit transformation at the moment of ingestion, right? So they can say, look, I want this to be a date. And then we can, you can click and add a step that says, okay, turn this into a date type. And then we will sort of force it as a date type. And then we'll attempt to load that to the database. So if it's a new table that you're creating from your CSV file, then that first set of
Starting point is 00:29:17 types establishes the table in your database. And then the next time you're trying to append more data to the same table in the database through a new CSV file, we just automatically do all the mapping for you. And then you just click load to database and that's done. Now, there are cases where, let's say you have a thousand rows and almost all of them are properly cast as dates, for example, but there's one row that is just an ambiguous date or a messed-up date. Those cause problems.
Starting point is 00:29:48 And so the way we're thinking to address those is to provide a summary of all the rows of data that weren't compatible with the database, and then provide the user ways to transform that data so that they can successfully ingest it in the database. But this is one of the key challenges we help solve, both from a technical side and from a user experience side, because if you're coming from Excel or from CSVs, there are no types, right? And the user might not know that the database expects a type. We have to abstract some of this stuff away for them in the user experience. Yeah, 100%. That's, I mean, a very interesting and very hard problem in the end, because you can do some stuff.
Starting point is 00:30:34 You can infer some stuff, but you cannot infer everything, especially when we are talking about, I think, one of the most annoying types: Boolean. Because people can represent Boolean values in so many different ways. You have true and false. You have yes and no. You have zero and one. There are times that they just merge all of this together. And of course, when a human reads an Excel file, it looks fine. You can interpret the semantics around the values, right?
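Normalizing all those Boolean spellings is simple to sketch; the hard part, as the conversation notes, is deciding what to do with anything ambiguous. The accepted spellings below are assumptions for illustration:

```python
# Hypothetical sets of accepted Boolean spellings.
TRUE_VALUES = {"true", "t", "yes", "y", "1"}
FALSE_VALUES = {"false", "f", "no", "n", "0"}

def parse_bool(value):
    """Map common Boolean spellings to True/False; refuse anything ambiguous."""
    v = str(value).strip().lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    # Raising here is what pushes the value into the "needs user attention" pile
    # instead of silently degrading the whole column to text.
    raise ValueError(f"ambiguous boolean: {value!r}")
```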
Starting point is 00:31:03 But that's not exactly true when it comes to a database system. And yeah, it's very interesting because it's also like, there are two things there. One is, yeah, you have the data that you cannot infer and you need to keep them like somewhere so someone can go and transform them. But also like what I have seen is that you can be so aggressive with trying to adapt everything and auto-cast, let's say, where it's very easy to end up in a situation with your dataset being just a string at the end, right?
Starting point is 00:31:36 Which doesn't work. Yeah, absolutely. So that is the challenge that we'll have to solve, is how to, over time, get better and better at accurately inferring types that is aligned to the user's intention of what they want. What we don't want to do absolutely is we don't want to lose precision in some of their data. So if they have data that comes as decimals, like floating points, like, you know, 35.54, like you definitely don't want to mess with that. You don't want to say, oh, just 35 and forget about the decimal part.
Starting point is 00:32:11 So there's things that we can be very careful about. But then for, yeah, for the other problem, it is just about over time building a way to understand that user's intention and then maybe provide them a choice, something more explicit, something informed and something that they can take an explicit action to make sure it fits. Yeah. Your opinion of what's the most annoying data types to work with? Yeah. You know, I, I, Booleans.
Starting point is 00:32:35 Yeah. They could be difficult. I think because they're difficult, they just end up as text, and then you're going to have to figure out something, or you can transform it later from, like, true to one down the road. I'd say it's when you get more advanced with your data, like if you're thinking about location data, and then the different ways to store times and time zones. Those become a little bit challenging. So today we deal with date types pretty well, but for location data, yeah, it's just text today. We don't really have a way to do it yet, but we think it would add more
Starting point is 00:33:13 complexity for users to have location data be represented as location data explicitly. Yeah. Time is just a pain in the ass. Time is a pain, absolutely. That's a deep statement because it applies not only to databases, but to life in general. Yeah, yeah, of course. I think, I mean, a database is a projection of reality in the end. But I think the most interesting thing time has to say is about human nature and human communication, because in the end, we just cannot agree on how we want to represent this stupid thing,
Starting point is 00:33:52 which governs our life at the end. It's crazy. I mean, if you see time manipulation libraries, the amount of work that people have put towards building these libraries is just crazy. Like, and I'm pretty sure that someone outside, like in software engineering,
Starting point is 00:34:14 they would consider like, okay, what is time? Like, I mean, it's like, we have a clock for like forever, since forever, right? Like how hard it is like to build. I know, yeah. No, totally true.
Starting point is 00:34:24 This is something where like people more on the technical side, they're like, you know, this is important and this is difficult. And then everybody else is kind of like, I mean, it's just time. Can you just please add my time to the spreadsheet or to the database? We're like, yeah, I wish it was that easy. Jimmy, question on one thing I'd just like to drill in on briefly is you use the term offline data. And we haven't used that term a lot on the show.
Starting point is 00:34:50 Could you give us a quick definition for our listeners? Because I think sometimes it can just refer to data that's in a CSV in an email, but it can also refer to types of data that, you know, sort of don't emanate from the cloud originally, which can be some of the most valuable data. We really mean a combination of those. It's just any data. So offline, for us, is just any data that is not online. But it's funny, because in theory you extract a CSV from an online system, presumably, right? But we really just mean files, CSV files, Excel files, and also data that's sitting locally on your machine. Yeah, for sure. And so can you give us just a couple of the main use cases, right? So we think about offline data. What are your customers using DropBase for in terms of types of data? Who are the users of DropBase and how
Starting point is 00:35:54 are they using it? Yeah. E-commerce is always a really good example. It's an industry that's really growing a lot, and people use a variety of tools to do this, right? So one of the things is, for example, similar to the example before: you have, you know, shipping companies sending you shipment updates, and they almost always come in offline files or flat files, you know, Excel or CSV files. But then you also have companies where, when their customers want to use the product, they first need to onboard a lot of this data to the database. And so they'll export data from another system. They'll convert it to a CSV. And now they want to be able to ingest that quickly, repeatedly, but fast as well. And so those use
Starting point is 00:36:45 cases tend to be the ones that, you know, let's say if you're a company that builds software for managing, for insurance brokers, right? Every broker has their list of customers. And guess what? Like they're usually in an Excel sheet or in a CSV file, or maybe it's in some system that it's like a pretty old school system, right? And so now if this company wants to serve those customers, they need a scalable way to help all their customers quickly get that data into the system, right? And so the idea would be that they can bring it, they can clean it, and then they can have it in a database so that their product, which is on top of that database, can then query it and maybe show some
Starting point is 00:37:29 dashboards or visualizations or some analytics for them. So these use cases, yeah. You know, it's interesting. As you talk about DropBase, it's maybe one of the first times I've considered a product where when I think about the users, it's, you know, sort of SMB or enterprise, you know, like the non-technical SMB, maybe I'm running e-commerce, like, of course, it doesn't make sense for me to have data engineers on staff or an enterprise where I'm dealing with lots of offline data. And that's super interesting. Would you say that's true of sort of your users or? Yeah, that's pretty accurate.
Starting point is 00:38:13 Yeah, it is pretty accurate. So the SMB angle certainly is, you know, they're starting with spreadsheets and then those spreadsheets get bigger, but they still know that they want to save time. They still know they want to get this data connected because there's now a lot of high tech, like, you know, nice SaaS B2B tools that, you know, that would be marketed to a lot of these smaller companies that want to be more tech oriented. But then the first step to use that tool, it's like, hey, step number one, connect to database.
Starting point is 00:38:41 And they're like, okay, you lost me there, right? I'm not sure how that's going to work for me. Yeah, for sure. Yeah. It's super interesting because, in that, I mean, in e-commerce we're just, you know, in the early innings in terms of the growth there. But if you think about someone who's running, you know, a successful e-commerce store on Shopify, like the machines are all talking together, right? Like, I mean, most of this stuff is sort of connected or it's completely disaggregated and you get it in a CSV. There's no in between, which is fascinating.
Starting point is 00:39:12 So yeah. So that's sort of the concept of the, sort of the ETL at the edges in the sense that like a big company, you know, today you can sign up for data integration tools, right? And you can connect to different sources, right? So let's say you do ETL or you do ELT, right?
Starting point is 00:39:26 But then there's another set of customers. Typically, I think the smaller companies who are not super tech oriented, they may not even have databases. And so like for them, they still need a tool to get on that path of having a data stack to begin with, right? And so that's where you can sort of extend the idea
Starting point is 00:39:44 of like ETL, but extend it at the edges because there's still a lot of data that's just trapped in offline files or systems. And so how do we make use of that data? How do we turn that data online so it can be useful for that company or for that business? Jimmy, I have a question that has to do a little bit with the architecture of the product itself.
Starting point is 00:40:05 You mentioned that in the product experience, like the database is included there, right? How do you do that? Like what kind of technologies you are using? And what was the reason that you decided to do it like this instead of connecting to all the different available databases? Yeah. So our initial version of DropBase, because we were kind of like just proving out the concept, we built it on a Postgres database. Again, you know, open source, fairly powerful, fairly flexible, you know, piece of technology, Postgres databases.
Starting point is 00:40:37 So we started that way and it was, okay, let's move that data to Postgres databases. But then you realize that at scale, like when you hit millions of rows, you know, a transactional database isn't as good. You definitely either need to add extensions to that Postgres or you just need to use a column-based database or a data warehouse. So our new version of DropBase, we built it straight into Snowflake. And so with that, we have the benefits of security, of near infinite scale, and of hyper-fast querying, right? Like millions of rows in a couple of seconds. And given that we expect our users to continuously import more and more data over time, data that is generated in these offline files or systems, that data set would become bigger and bigger over time. And so having something like a data warehouse, it's super
Starting point is 00:41:39 neat. We're also exploring other kinds of data storage systems, other databases or data warehouses, that we could use to do this in a way that's super scalable, super accessible, but also affordable for our users. Yeah, that's super interesting. So, okay, dealing with Snowflake, I mean, there's still some configuration that you need to do, right? Like, as a user, you have to select your data warehouse. You have to select the size. You have these kinds of parameters there. Do you handle that for your customers, or do they have to figure this out? No, we handle that for the user.
Starting point is 00:42:23 So, a user signs up for DropBase and they get their own private database instance. And then we make certain opinions about how we configure that database for them. So that from their side, all they're doing is they're just creating a DropBase account and they immediately get access to that database. So they get credentials to the database. And we mirror our permissions in DropBase to those permissions in Snowflake. So that if you are an owner of the workspace, you also are the owner of that Snowflake database from a credentials perspective, obviously. And this allows us to do some pretty neat stuff, right? Like we can help users manage access to the database.
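Mirroring workspace permissions into database grants, as Jimmy describes, could in principle be done by generating statements like the ones below. This is a simplified sketch with made-up role names and privilege mappings; it is not DropBase's actual scheme.

```python
def grants_for(role, database):
    """Build Snowflake-style GRANT statements mirroring a workspace role."""
    if role == "owner":
        # Workspace owner maps to ownership of the whole database.
        return [f"GRANT OWNERSHIP ON DATABASE {database} TO ROLE {role}_role;"]
    # Other roles get database usage plus table-level privileges.
    table_privs = {"editor": "SELECT, INSERT, UPDATE", "viewer": "SELECT"}[role]
    return [
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role}_role;",
        f"GRANT {table_privs} ON ALL TABLES IN DATABASE {database} "
        f"TO ROLE {role}_role;",
    ]
```

The point of the design is the one Jimmy makes next: because the grants live in the database itself, users keep real credentials and can reach their data without going through the product.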
Starting point is 00:43:08 That's the first thing. And the second thing, which is sort of a principle that we really like, which is that users should be able to access and control their own data, is that even though we're managing the database for them, they can at any point access their data because they have the credentials to do it. Yeah, that's very interesting. And I guess the question I wanted to ask you is whether you have seen your system installed together with traditional data warehouses out there, so a company might be using both, for example.
Starting point is 00:43:40 Yeah. So not currently, but we expect that to happen over time. We expect, there's two ways that this can evolve. A user starts with DropBase, and if they're sort of on the smaller side, they could grow with DropBase. And eventually they can say, oh, you know what, this is our data warehouse. Now we're going to buy other tools to maybe do data integration, maybe we do BI, but then I can just use DropBase to do all of this. So that's certainly one way it can go, right? And the other way it can go is just you start with DropBase, and then that is just one of your data points of data integration. So that becomes one of your sources for a larger stack that you would have built or
Starting point is 00:44:24 that you already have at your company. That's pretty neat. And what's your experience so far of using Snowflake as, let's say, a component of your product? I think we had another company that we talked to in the past, in another episode, where they did something similar. They had actually built a lot of logic on top of Snowflake; they pretty much implemented algorithms on top of it to do their stuff. So it was very interesting, and it seems that it's becoming some kind of pattern there. It makes sense for Snowflake too, right? Like, what they're trying to do is move away
Starting point is 00:45:05 from being a data warehouse and becoming a data platform where you can build products on top of it. So how has your experience so far? There are many trends that lead to that. Like the first one is, you're right, it's just Snowflake's desire to be a database or a data warehouse for more things and more companies out there.
Starting point is 00:45:25 So Snowflake has actually started a sort of startup program, basically, where they work with startups to provide them with some credits, some expertise, some time with their engineers to discuss how to build products on Snowflake. And by that, they really mean use Snowflake as a data layer, as the data store, with the startup's product sitting on top of it and having a pretty tight integration to it. So certainly that is one of the appeals to this,
Starting point is 00:45:57 is that, well, Snowflake wants to do this, and so they're investing in this. And the second one is that Snowflake as a data warehouse is fairly mature, it's fairly sophisticated, and it has a really good, let's call it API, where you can basically interact with the underlying database programmatically. You can do a lot of stuff: you can generate permissions on the fly, you can generate tables and connect permissions to those tables in different ways. And so it's fairly feature-rich, fairly programmatic, right?
Starting point is 00:46:32 And so as someone building a product, you have a lot of flexibility. We've seen other databases. We constantly do research about other databases we can integrate with. And their API, sort of their developer ecosystem isn't just quite there yet. But we're always keeping an eye on this just so we can offer more choice. I think, and also to address a previous point,
Starting point is 00:46:56 in terms of the database-included portion, certainly for the smaller companies, we want to provide them the database included. But for the other extreme, which is companies that are maybe a bit bigger and maybe already have data infrastructure set up, we're going to have a sort of BYODB, like bring-your-own-database, bring-your-own-Snowflake approach, where if they want our user experience, you know, they can provide the Snowflake instance or that database instance, and then we can sit on top of it. So we built our product, our architecture, to be flexible enough to swap databases out and still work. One last question from my side, and then I'll give the stage back to Eric. Is the cost of Snowflake a concern so far? Snowflake is a bit on the pricier side.
Starting point is 00:47:48 So Snowflake has some natural, some pretty good characteristics, right? Like in terms of like separation of compute and storage is really nice because we have people importing a lot of data, but not processing it. We don't have to have a huge machine that's constantly hosting that data in that database just so we can store the data. So that was a pretty nice architecture for Snowflake that makes it nice for us.
Starting point is 00:48:14 Pricing is more expensive, but we are actually able to make it cheaper for our customers because we are the Snowflake customer. And so we basically have the accounts with Snowflake, right? And I don't know if you knew this, but Snowflake has, I think, by-the-second billing, but with a minimum of a minute for their compute. And so the devil is in the details of that, right? In theory it's really nice if you can charge by the second, but if you have a small customer, like an SMB, that's running a tiny little query, that's going to take what, like a second or two at most, right? And so for the other 58 seconds, that's just cost that I guess we have to cover, unless we have enough scale that at any point
Starting point is 00:49:02 in time, there's always queries running and then we And then we can get some economies of scale with Snowflake. So from the purposes of building product on Snowflake, I think with no scale, it becomes a lot more expensive for you and potentially for your end customer. But at scale, I think it's going to be pretty neat. I think at scale, the math does work out pretty nicely. And with the other properties of being able to program that database through their APIs and having storage and compute separated, those are really, really nice properties. Obviously, the security aspects of it, a lot of it comes built in.
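The amortization Jimmy describes is easy to see with a toy per-second-with-a-minimum billing model (the rate below is a made-up number, not Snowflake's actual pricing):

```python
# Made-up per-second compute price for illustration.
RATE_PER_SECOND = 0.001

def billed_cost(query_seconds, minimum=60):
    """Cost of one isolated compute burst: per-second, with a minimum charge."""
    return max(query_seconds, minimum) * RATE_PER_SECOND

# A lone 2-second query still pays for the full minute...
lone = billed_cost(2)  # billed as 60 seconds

# ...but 30 queries packed into one busy minute share that same minute,
# which is the economies-of-scale effect Jimmy is pointing at.
packed_per_query = billed_cost(60) / 30
```

Under these toy numbers, the lone query costs 0.06 while the packed queries cost 0.002 each, a 30x difference from utilization alone.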
Starting point is 00:49:36 Well, we have time for just two more questions. One may be more complicated. The other is not. And I'm going to put my business hat on here because Kost and I both as former entrepreneurs can't help it, especially when it comes to data. But building on Snowflake to me, especially with the SMB and enterprise approach is brilliant because SMBs are going to move towards a database like Snowflake, right? As they sort of grow and expand and, you know, want to do more stuff. And enterprises that rely on a lot of offline data are also going to move towards Snowflake, you know, as they want to modernize their
Starting point is 00:50:17 warehousing solution. And so the transferability to me, especially when we think about Snowflake's ability to sort of syndicate data, is really interesting. Did you think about that as you were sort of building DropBase? I mean, it's amazing to think about like, okay, I'm going to adopt Snowflake and, great, it's just going to work. And then, you know, as an SMB, and then the same thing from the enterprise side. Yeah, I'd like to say that we had all that foresight. The main thing is we are making a bet, basically. And our bet is that the smaller companies, the not-so-techie companies today, will leapfrog a basic database, and then they'll go for the warehouse.
Starting point is 00:51:02 And I think that data warehouse companies like Snowflake and others, I think they're also going to want to sort of like, you know, tackle that market. So they'll be building new features. I think they'll be making their, you know, the pricing structure, the business model. Eventually, I think it makes sense for every business to have some sort of database, right?
Starting point is 00:51:23 And I think for us, it's more like, okay, well, that end user might not know how to make that choice of a database. So why not give them something that is scalable from day one, that is highly secure, and something that we have very high level of, of customizability in terms of like how we can program it and develop features around it. And so that as they grow, they can still stay within it. There's no need to migrate from like these databases to something else. But that's like that sort of, I think it came afterwards. It wasn't like we had this from the beginning, but we were making a bet on that people will leapfrog and they'll want databases that work for them. And so that's sort of like, that was our starting point.
Starting point is 00:52:09 Yeah. Super interesting. And I mean, in many ways it's kind of, you know, you think about the database as a core piece of infrastructure and then the question kind of becomes interface, right? Like how do you interface with it? And different teams are going to interface in different ways, which is fascinating. Okay. Final question, Dropbase, if listeners want to check it out or try it out, where do they go? Yeah, thank you. Yeah, so they just go to dropbase.io and they just sign up for a free trial and they can start using it right away. By creating a workspace, they have a database from day one, a database that they can connect to, you know, Tableau, Looker, Mode, Retool, really anything they want to because they
Starting point is 00:52:44 have credentials to access that database. Awesome. Well, Jimmy, this has been a fascinating conversation. Thank you for, I mean, we've talked about things that we haven't talked about, you know, over 70 episodes on the show. So thank you for that. I had a lot of fun. Thank you guys. Yeah, it's great.
Starting point is 00:53:00 So thanks for joining us, and best of luck with DropBase. Thanks, Eric. And thanks, Costas. Okay. My takeaway, which may be obvious: we just don't talk about offline data that much. And my guess would be, and I actually wish I could go back and ask Jimmy what his gut sense of this is, but the amount of data that still gets shared via spreadsheets, via email, has to be enormous. I mean, that has to be an enormous part of the market. And obviously,
Starting point is 00:53:33 there are dynamics that are changing that, but it really is crazy to think about. I mean, you know, we just don't talk about that a ton on the show. But there are people whose jobs revolve around sharing data in flat files or Excel files over email. And that's probably bigger than sort of the slice of the world that we see who's trying to do like advanced BI with open source visualization or whatever. 100%. And I think it's probably like something very common in commerce, especially now that pretty much, you know, like every company also has like some kind of digital presence. But we have to remember that it's not just the physical, there is like an employee there at the end of the day who is probably, you know, completing an Excel document and sending that back like to the headquarters or whatever. And that's like a lot of data and important data, actually.
Starting point is 00:54:35 So there is Shopify, but not everything is on Shopify, right? Yeah, it was very, very interesting to chat with Jimmy. I'll keep two things from the conversation. One has to do with how people who are working, let's say, at the edge of technology keep forgetting that technology has existed for, I don't know, 50 years now. So there are a lot of legacy systems out there that we need to keep supporting, and we just forget about that. The other thing is how creative humans can be, right? Suddenly you see that email becomes a transportation layer for data.
Starting point is 00:55:20 And there is a third, sorry, I said two, but there is also a third. And that's like the, I think we got a glimpse today of the rise of the data platform, right? And something that I have a feeling that we will see more and more in the future. And we will see more products that are being built on top of not just AWS, but on top of Snowflake or on top of Databricks. So these are like the three things that I keep from this conversation. I agree. I think that was a small part of the conversation, but I think that was probably one of the most interesting points. That's the first company we've talked to who is building on the Snowflake infrastructure.
Starting point is 00:56:05 I think, I think it was the second. Oh, that's right. That's right. Yeah. And that's fascinating, especially to hear about sort of the cost arbitrage, if you want to call it that, and the way the economics work. Yeah. And there seems to be also a lot of effort from Snowflake itself to push that.
Starting point is 00:56:24 So anyone out there who is considering building a product on top of data, maybe they should reach out and see what this startup program is. Yeah. Yeah. It's super interesting. I mean, I'm going to look it up just because I'm interested. All right. Well, thank you for joining us. Of course, subscribe if you haven't. A great episode is coming up, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack,
Starting point is 00:57:06 the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
