The Data Stack Show - 47: Taming the Four Dragons of Data with Sven Balnojan of Mercateo Gruppe

Episode Date: August 4, 2021

Highlights from this week’s episode include:

- Sven's Ph.D. in Singularity Theory (2:59)
- The Databricks vs. Snowflake conversation (8:17)
- The difficulty of not just inventing something new, but making it... accessible (18:01)
- Databricks and unstructured data (22:22)
- Organizational change responding to technological change (29:27)
- The three-dimensional evolution of a successful open source project (40:31)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Today, we have a guest who has written content for a while around data companies, the data space and major trends in data. And he wrote an article called How to Become the Next $30 Billion Data Company.
Starting point is 00:00:43 It talked a lot about open source. It talked about major players in billion data company. It talked a lot about open source. It talked about major players in the data space, and it talked about a lot of new up and coming players in the data space. We'll link to it in the show notes. But Sven has an academic background. He studied mathematics. He has a PhD actually, and then has worked in a variety of data contexts. And so we're just really excited to chat with him. Costas, I think one of the things that I'm interested in is Sven talks a lot about, he's a major players in the data space now, but the fact that they're actually not necessarily going to be the next really, really big data company, which on the heels of
Starting point is 00:01:21 Snowflake's IPO feels weird to say, because that was such a monumental event, both in the financial markets and in the tech space. But I actually agree with him. And I want to hear from him why he thinks that is. So that's what I want to know. How about you? Yeah, absolutely.
Starting point is 00:01:39 I think we are going to be discussing about this. I mean, my passion about that stuff anyway, especially when it comes to Snowflake and Databricks. So yeah, I'm looking forward to this conversation, actually. Like, that's the main questions that I have. He has like this unique trait of like being, let's say, a person with like a PhD, very technical, but at the same time, like he really enjoys communicating with people out there. He has a newsletter. He's blogging. So yeah, it would be nice to hear also what drives this passion for communication and
Starting point is 00:02:11 expressing his thoughts through these channels. So these are the things that I would like to chat with him about. Great. Well, let's dig in. Let's do it. All right, Sven, welcome to the show. You have such an interesting background. You have a PhD in a very interesting field in mathematics, but you also work in data day-to-day, and then you
Starting point is 00:02:33 also write a lot about data as an industry and the tooling around it. And so I don't even know how we're going to pack all the things we want to talk about into one episode, but thanks for joining us. Thank you as well. All right. Well, starting out, can you just give us a little bit of your background? How did you get into the field of data? And then what day-to-day work do you do in data at your job? Yes, of course. So how did I get into data?
Starting point is 00:03:02 So I think like now 12 years ago, a friend actually asked me whether I wanted to explore a problem and found a company with him. And it wasn't in the data space. It was just about actually gathering lots and lots of data and providing it to a certain customer set. And so that didn't work out
Starting point is 00:03:21 because I co-founded different ideas about where the company should go. So we decided to split up. That was my very early introduction to the world of startups and data. But then later on, I did my PhD in singularity theory, which is like this weird little subfield of mathematics. And I also kept working at a marketing agency for like, I think about five years. And then I joined, I decided to like leave academia and then join as a data scientist and as a developer, and then as a DevOps engineer for a company like the internal data team.
Starting point is 00:03:58 And after I went through all of these steps, I decided to actually head into product management and became a product owner of an internal data team. And that is where I'm at now. Very cool. And you also, you've been writing on Medium for a while, but you also have a newsletter. Do you want to just tell us briefly about your newsletter and maybe where people can sign up? Oh, yes. Just got to Google for like three data points Thursday.
Starting point is 00:04:25 That's the newsletter. I started that simply because I didn't feel there are too many good data newsletters out there that take like a holistic perspective. So I decided to
Starting point is 00:04:36 like write my own. And yeah, I just write about my kind of special and opinionated perspective on like things that happen. Not like the new stuff, but like I just simply about my kind of special and opinionated perspective on like things that happen.
Starting point is 00:04:47 Not like the new stuff, but like I just simply collect like interesting stuff I found over the years working in the industry. Very cool. And it's a great newsletter. We are huge fans of it here on our end. And actually reading some of your content for a while, I realized, well, this is great. We should just
Starting point is 00:05:06 have him on the show. And I think that's how this whole thing got started. First of all, I want to ask you a kind of a funny question based on your background. So singularity theory and mathematics, you have a PhD. Can you just two questions on that? Just because it's kind of fun to hear about people's backgrounds. Could you just briefly explain what that is? And then I'd love to know, are there any lessons from that study that you still apply in your day-to-day work, working with data on a practical level? That is a good question.
Starting point is 00:05:40 Okay, for the first one. So, singularity Theory is asking what, like, studying places where stuff changes suddenly. Like, if you imagine a function, if like a function has like a little kink. And yeah, that's basically, it just asks like, okay, so how do these things generally look? What happens? Like, what happens to all sorts of weird systems that these things could describe. So that is Singularity Theory. So I do have one thing I apply quite frequently, but other than that, I was just going to apply like a reply.
Starting point is 00:06:16 I don't use anything at all, but one thing is definitely using using examples all day long. So that's like using examples to like play around. Like when I develop products, I always go very, very iteratively and produce like small little steps and then work my way forward. And that's what I actually learned like in my PhD thesis,
Starting point is 00:06:38 like doing little examples and then like trying to find the general picture inside that. Yeah, so it's like a workflow and a process that you're applying to just a different practice or different field of study that's super interesting yes exactly i mean i think like the pretty good like a great engineering professor called hemming called it napkin calculations And that's the very same thing I try to apply every single day. Love it. Love and love the concept of napkin calculations is great. Well, let's start to dig into the world of data, which you've written about and the subject I want to tackle. And we've touched on this in a previous episode, but Costas, I know in particular, is extremely
Starting point is 00:07:25 passionate about the Databricks versus Snowflake conversation. There's a lot there, both from a business perspective, data business perspective. There are considerations around open source and all sorts of different things. But before we get into that, I really love your perspective on what do those two platforms do? Because in one sense, you could say, okay, they're kind of similar, right? They both allow you to store and do things with data, with a constellation of different tools around that, right? But they're pretty different and they're used for fairly different things at this point. But just give us, could you give us a quick rundown of what
Starting point is 00:08:11 Databricks does? What are the most common use cases? And then what Snowflake does in the most common use cases? Yes. So that is a great question. So let's take this specific perspective first, like Databricks came out of the, I think, like, the university spinner, basically, from the Apache Spark guys, and then they layered on, like, to this massively parallel computing framework. They layered on, like, Data Lake,
Starting point is 00:08:36 which allows for asset transactions. And then they slowly added stuff that kind of fits into that context, which is basically, like, notebooks, notebooks and lately added ETL transformations. And the most recent addition actually is like they acquired the company Redash, which I think you also talked about in another episode. So they actually now also feature a front end. So that is Databricks in details. And like Snowflake came out of this hugely cloud data warehouse, which had like a really cool feature, which because
Starting point is 00:09:10 it decoupled, I think like computation from storage. So you could scale both things independently, which you usually can't do like a cloud data warehouse, like everything else actually you have to scale both things mostly. And then they started layering on stuff like integrations to data integration tools. And now they also have like machine learning stuff and so on. So for me, this kind of like data rigs comes from the world of unstructured data,
Starting point is 00:09:38 whereas Snowflake comes from the world of structured data. And they will simply like converge into one space. But if you see if you tap like if you take like a step back i think they're actually going to form like a whole new sector of cool companies because i mean if you take a look at like what do data companies and your data company in that sentence as well of of course. What do data companies actually do like fundamentally? So my take on this would be they enable customers, companies to derive value from their data. I think they have to wrestle with like four forces of data. And so I like to call them the DEX forces, D-A-K-S. One is the sheer amount of data, which is growing exponentially, as far as my really bad forecast actually tells me.
Starting point is 00:10:27 It's like the exponential growth of data. They try to handle that. You're certainly doing this. And there's the kind of data and the kind of data, which will be there in like 10 years with the growth of data. I think the amount of data we will have will be like, 15, 20 X that we have today. But all of this data will be unstructured event data, real-time data, lots of imagery, lots of sensory data. That's the data we will have to deal with in the future. That's the kind of data. Then I did write about that quite a few times. There's the snowflake problem. The snowflake problem is a very simple thing. If you
Starting point is 00:11:02 go into any company and take a second company, compare the data setups, like the different sources, targets, data has to flow from and to, they're gonna match like probably like 80%, but each one will be like their own unique snowflake. And then, so it's snowflake problem. You've gotta solve that in some way. Any company has to solve that problem in some way. Any data company. And then the fourth force is kind of the decentralization.
Starting point is 00:11:30 And that's like a big thing. Decentralization means one, data is now being emitted in lots and lots and lots of decentralized places. And also has to be used by every single employee in a company and like the decentralized way. Like if you go into a bookstore, you will very likely pull out your phone and like, like could see data. They keep on just read like a book critique or like check your prices and so on.
Starting point is 00:11:55 And so that's like decentralized data emission and consumption happening in one place. Now these are four forces. What I'm trying to say is like companies, Snowflake and Databricks are actually moving into the space that use all, that try to like wrestle with all four of these forces. And I don't think there's another company
Starting point is 00:12:13 currently that is actually doing it. There's almost no company which is trying to tackle all four of these forces. I think Databricks has a slight edge actually on that because they come from the unstructured world and Snowflake is not yet there. But that's kind of my take on it. They are moving into the space that's really like this new frontier, which I think actually the big IPOs in the data space will happen like the next five to 10 years. Yeah, it is kind of crazy to think about with Snowflake's IPO. It's crazy to think
Starting point is 00:12:47 about another IPO that's bigger in the data space because that one was so high impact. But I agree with you. I think a lot of the things we're going to see that are going to have an even bigger impact are still nascent. So Sven, I have a question. I want to dive deeper into these differences between these two platforms. I hear you and you are passionate about these two tools together, right? So can you tell us a little bit more about your experience with these tools? And what was the moment that you realized that these two companies and these two products are in a collision trajectory, let's say. And the reason I'm asking is because at the beginning, when Snowflake was out there, everyone was thinking that, okay, Snowflake is a data warehouse.
Starting point is 00:13:34 And the main competition that's going to be between Snowflake is going to, for Snowflake is going to be BigQuery, it's going to be Redshift, right? But it seems that like Snowflake had a to be BigQuery. It's going to be Redshift, right? But it seems that like Snowflake had a much bigger vision around that. And suddenly today, or at least like the past, I don't know, like a year or something, it becomes more and more clear
Starting point is 00:13:53 that the actual competition is going to be between these two companies, Snowflake and Databricks. So what brought you to this conclusion? And also like before that, what's your experience with these tools? How you've used them and how do you feel about them as products so this is a great question so productively i actually haven't used any of those tools i've played around with both of them and i actually
Starting point is 00:14:18 noticed databricks because they bought redash because i was playing around with redash at i think the time they actually bought redash. It was like a weird coincidence. And then there was also the point where I realized they're kind of like their master plan, actually, to move into this huge space. And so Databricks was pretty clear in their intent to move things into this bigger space. Snowflake, on the other hand, is like a weird thing because
Starting point is 00:14:52 as you're saying, there are good alternatives for cloud data warehouses out there. I mean, Redshift took the market by storm. The weird thing is that even though I do that, Snowflake kept coming up as the default choice for lots and lots and lots of companies and teams and simply kept growing. So sometime I just had to check them out and i realized ah they're actually moving into a very different direction actually amazingly they have an amazingly well-designed product that is able to not just compete like to actually beat redshift and big korea in their markets in some way because they they're able to differentiate themselves even with these huge players and their resources and they also are adding lots and lots and lots of stuff in a very well simple and easy way to their product which by the way like the cloud the hyper clouds don't do
Starting point is 00:15:37 they for some reason tend to make stuff a little bit more complicated so that's my experience so far yeah yeah makes total sense. Actually, I find these companies very interesting, because in a way, it's like they started from the completely, completely opposite starting point, right? Like, as you said, like you have the academic background of Databricks, right? You have like a bunch of geeks in Berkeley trying to redefine Hadoop and MapReduce and come up with a new processing architecture and they come up with data. I mean, not with Databricks, but with Spark
Starting point is 00:16:15 and then on Spark, they build like this company today. And on the other hand, you have Snowflake, where you have a bunch of people coming from Oracle, which is the exact opposite. It's the definition of the corporate environment, right? Yeah. And they're like, okay, there's something wrong in the industry, which means that there is an opportunity. Let's go and build like a really nice technology. And they started, if I remember correctly, I think they started with the data warehouse and the main like concept around
Starting point is 00:16:51 what was different in their case was this separation between processing and storage, right? Which was an amazing also marketing material, let's say. I think they made a lot of noise around that and people started thinking of, oh, this sounds cool. It sounds something good. I can store my data
Starting point is 00:17:11 and I pay a different price for that and blah, blah, blah, and all these things. And I'm processing. But I have a question for you because you're also a deeply technical person. Don't you think that this separation already existed in a way, right? Like Databricks or Spark didn't store the data it assumed that the data was on distributed file system like hdfs right
Starting point is 00:17:33 that was the beginning that was like also how hadoop worked right so okay of course we are talking about not a cloud solution we are talking about like on-prem, let's say, installation. But my feeling at least, and I want your opinion on that, because I might be wrong, is that this operation actually already existed. It was just that nobody actually talked about it that much outside of the hardcore engineers that were doing the work, right? Oh, yes. So it's kind of like the debate about self-driving cars so i keep on telling people like um the technology for self-driving cars is already there like dry cars can already drive um by themselves but the the technical stuff isn't figured out that
Starting point is 00:18:19 like like yet so so almost anything is already there i mean that's like the stuff that technology already has invented is mind-blowing so the hard part i don't think is actually inventing something new it's kind of like making it easily accessible and maybe just wrapping it nicely i think that's actually the hard part not so much about the invention it's more about like letting other people um in on the invention. And I think Snowflake does an amazingly well job at exactly that part. Yeah, yeah. I agree with what you are saying.
Starting point is 00:18:51 And I think this is where we have to thank marketing. And Eric, that's a big part of what the value of marketing brings in this world, that educates and re-educates people around the products out there. So thank you so much about that. But I think that from the technical perspective, Sven, I think what was really impressive with Snowflake at the beginning is that not only we had this separation, right? But at the same time, we had a very complex infrastructure
Starting point is 00:19:25 that nobody had to actually manage, right? And that was like, at least at the beginning, I think like a huge difference between Databricks and Snowflake. Because with Databricks, I mean, okay, today they also have like a cloud offering, which is like self-serve. But back then, working and setting up and operationalizing Spark, I don't think that was the easiest thing to do, right?
Starting point is 00:19:50 I'm pretty sure it's basically a bus killer for most companies. I know from first-hand experience, running and deploying actually most open-source tools today, it's a pain in the ass. It's like, almost any tool, I would rather have a hosted solution. So that's definitely true, that point. So you just touched on that, because Snowflake came out of this structured space. Actually, I think they must have realized that they're in this structured data space and they've got to move out of that. Because My personal take, as far as my forecast for the next 10 years goes, in 10 years, I think
Starting point is 00:20:29 structured data, database for structured data will have no customer base anymore. At least just a very little one. I think all of these companies actually have to move out of the space, at least amend their offering as much as that is not going to
Starting point is 00:20:46 be the core anymore. Yeah, makes sense. Although I have to add something here about the structured data. I remember when Snowflake came out, and because back then BigQuery wasn't that big, like the main competition and the main data warehouse in the space was actually Redshift. And one of the biggest problems that you had with Redshift is that Redshift was exactly as you say, like it was completely structured. It was like a database. You had your tables, you had your relations, you had your columns, very well-defined data types, blah, blah, blah, blah, blah.
Starting point is 00:21:32 One of the selling points, early selling points of Snowflake was that they had an amazingly good support for semi-structured data. And when they said semi-structured data, they actually meant JSON. And I think also XML, but anyway, I don't know who was working with XML, but anyway. Yeah, and they still have that. But okay, after a while, also BigQuery came,
Starting point is 00:21:53 and BigQuery, they really had a very good support for nested data structures and stuff like that. So it was a very natural fit for JSON files. But the reason that I'm saying that is because I want to ask you, when you say structured versus unstructured, what do you mean? What in your mind is the unstructured data that Databricks was excelling in working with? So that is a good question. Of course, I presume it's the same. By the way, Redshift is based on Postgres, I think, as far as I know. Postgres has amazing JSON support as
Starting point is 00:22:36 well. I presume, again, like Snowflake and XML as. So I presume like snowflake and a redshift didn't actually had probably feature parity and like support for the CMY structured data. And of course I mean that, I mean, I mean data that comes in like some kind of like big blob. And yes, I also mean like objects, like images and videos. And because the reason I'm saying that is like
Starting point is 00:23:04 in the future we will have lots and lots of data from decentralized spaces, which simply means they come in different forms. It doesn't mean like they're actually unstructured. It just means like, I might not know the structure or I maybe have to deal with lots and lots and lots of different structures. And I might as well treat them as unstructured. That's actually the point. That's actually the point. The true question is more like, do I have a database that actually supports dumping all the stuff in there and maybe dump? I mean, Snowflake actually, I think,
Starting point is 00:23:32 has some kind of table extensions now, which go to S3. So maybe in the future, they will also support images. They may already do so. The question is more like, what do we do with this kind of data? How do I work with it? How do I search through stuff and all these kinds of things? Yeah, that makes a lot of sense. And I think that's like if we see also, I mean, again,
Starting point is 00:23:59 from a product perspective, like if we see the two companies, the use cases and the actual, like, let's say, workloads that they were working on the beginning were like, quite different, right? You had Snowflake on one hand, that was like a more traditional BI workload. So yeah, like people wanted like to put structured data up there and like do BI, you see, like, how is my company performing, like that kind of stuff. And on the other hand, you had Databricks, which if I remember correctly, one of the most important first use cases
Starting point is 00:24:30 about Databricks was actually ETL. You would see like many times using actually a data warehouse together with Spark. So Spark was used for ETLing data, especially like when you had very structured data, as you mentioned, Sven. And of course, there was ML. One of the main use cases after a while about Spark was, okay, let's train models. Let's do statistical analysis. Let's do things that don't naturally fit on the SQL dialect of a data warehouse, right?
Starting point is 00:25:07 And of course, this gap is closing right now. And we can discuss more about this later about the two. But yeah, and that's like a testament of the type of data that Spark can work. You can have unstructured data, you can work with text, you can work with CSV files, you can work with video files or binary files and stuff like that. Things that, yeah, okay, I mean, data warehouse or database might have a binary blob that is supported there, but, meh, okay, that's it. It's not like you are going to train a sophisticated model on top of that. That's also reflected on the tools that they were supporting, right?
Starting point is 00:25:42 How you can work with Spark through Pandas. You would see the difference between also the people that were using it, right? You would see the data scientists versus the analysts. And each one of these categories, they have different tools. And these products and these solutions and technologies were also accommodating these different solutions. But do you see this, Sven, changing lately? Do you see that Databricks, for example, being used more for analytical purposes, let's say,
Starting point is 00:26:13 and not just this kind of sophisticated ML payloads? And on the other hand, see or anticipate that Snowflake might try or do already more of the ML and data science related payloads? Because in this case, we are talking about also they need different features that they might not have and they have to introduce. Yeah, sure. So on the Snowflake side, I'm actually not sure. I don't know the recent product development that well. I can just imagine that actually the analytical background Snowflake has might actually be like a problem there because they still got to keep this customer base and not like
Starting point is 00:26:53 dilute this customer base while moving into this new space. I mean, it's a challenge that can be overcome. On the Databricks side, I'm not so sure as well. So obviously they're trying to move into the analogicals because they, let's see, so they got these new feature areas which is the Delta light, I think, basically like models that you can put into Databricks in a very easy way, by the way, also in SQL. And you got like the re now the Redash integration, which means you can actually type like SQL
Starting point is 00:27:28 right into like a query editor. And then, so you can do quite a bit of what, what like analytical work looks that other companies would run on Snowflake. You can put that into data, but they're not like, they're not quite there yet because they have to like carry on the data engineering and data science and all of these guys because they still live in this notebook world, in
Starting point is 00:27:52 this Apache cluster world. They have, for instance, they do have support for dbt, for instance, but they're slow to move into that direction. And I think paradoxically, Snowflake is slow to move into the machine learning of companies that we see. If you could give us a quick review of those, and then I'm interested to know which one you think will drive the first wave, just as far as the problems that it's going to create or the pain points that it's going to create, then this next wave of data companies will solve. Because, and the reason I'm asking, and would love your
Starting point is 00:28:46 thought on that too, Kostas, the reason I'm asking is the available technology dictates the behavior of an organization, right? So if you come from the analytical, if you're using Snowflake and you come from the analytical structured data background, you align your processes to match that, right? And then technology changes, but organizational change is pretty hard, right? Because you've done things and built your processes and data flows and pipelines and teams around an infrastructure that's more focused on structured data. So I'm just interested to know which of the forces, the four forces that you mentioned, do you think are going to create the first wave of organizational change that responds to the technological change? That's a great
Starting point is 00:29:29 question. So to recap, the four forces are like dragons that have to be tamed, the decentralization of data, the amount of data, the exponentially growing amount of data, the kind of data, so structured real-time event-based data and the snowflake problem, the AKS. These are the four dragons I'm talking about. And I have no idea. So I feel like they're actually all four equally important and all have to be solved for. I see a lot of companies just focusing on the amount and on like, like the growing amount
Starting point is 00:30:06 and like a specific kind of data. But I also see like the decentralization movement that is actually happening. Like it's like the pattern of the data mesh emerging, which is about decentralizing data usage and production. So I feel like all of them will kind of hit all of us. Yeah, that's an excellent question. Actually, I think that, I don't think that's like, in my opinion, at least, I don't think that it's the nature of the data that is going to drive this, mainly because before you start dealing with a problem, you don't really know what kind of these dimensions of the data are going to affect you more.
Starting point is 00:30:44 And I totally agree with Sven. Pretty much all of them are relevant. Probably some of them are more important in some cases than others. And I will give a very easy example here to explain what I mean. If you take a B2C and a B2B company, one of the most obvious differences there when dealing with data is volume. A B2C company has, from almost day one, many, many, many, many more data compared to what a B2B company can have even when they reach a billion dollar valuation or whatever.
Starting point is 00:31:18 On the other hand, what you have with a B2B company, you have the decentralization because data is suddenly like siloed in so many different tools and somehow like you need to access all this data and you have the complexity of the data. Like Salesforce alone, like the default setup of Salesforce comes out with, if you try like to pull the data out of there, you have probably, I don't know, like 250, 300 tables, right? And that's the bare minimum. Do you need all of them? I don't know, like 250, 300 tables, right? And that's the bare minimum. Do you need all of them? I don't know, but there are there, right? And I've seen like companies, especially when they have more applications like installed on top of like their SFDC, you can have easily
Starting point is 00:31:57 like 800, 900 tables. Okay. That's like a huge, huge complexity, even if like the volume is low. So what I would say is that in my opinion, what is going to happen and what we are going to see is that there are going to be business reasons that they are going to push the companies into trying to figure out opportunities to create value from themselves because of the data that they have. And as they will start trying to do that, either by doing like more sophisticated analytics, or more complete analytics, or start doing predictive analytics, right? They will get at the point where they'll be like, okay, my data looks like this. So I need a solution that can accommodate that.. And these things, of course, are going to change, right? Now, I think the industry is still learning.
Starting point is 00:32:49 So all these best practices around what you need if you come from this space or the other space, I think they are still under definition, just as the technologies are. I think what is important
Starting point is 00:33:00 for Snowflake and Databricks is how they can create platforms that can accommodate potentially all the different use cases. So when they land inside a company for one use case, they then have opportunities to expand. Because both of these solutions take a lot of investment from a company to start using them, right? So they are very sticky. Super interesting. Okay, follow-up question for both of you based on that.
Starting point is 00:33:24 Because, I mean, we don't know which of the four dragons are going to have industry-wide impact. And I'm sure, depending on the business and the volume and the type of data, different individual companies will probably face them at different points. But this question is, I think, interesting just based on a lot of examples I've thought about recently: companies who have a core competency more on the Databricks side and are moving towards the structured data side, and then the other way around, a core competency on the analytics side, and then you want to move more into the Databricks machine learning and data science side.
Starting point is 00:34:01 Do you think moving in one direction or the other is easier? Is going from one side to the other, or vice versa, easier? Oh, that is also a good question. If you look at both of these companies, I don't think so. I mean, they're both trying, and just based on their experience, these two companies are actually the experts at these two moves, and they both seem to have trouble. So I think these are both hard moves. What would be my take on that? Yeah, that's an excellent question.
Starting point is 00:34:35 I would say that what Databricks is trying to do is much more ambitious and harder. Now, if they manage to do it, I think it's going to be an amazing feat, and I'll explain why. But before I explain why it's harder for Databricks, I'll talk about our friends at Snowflake. Snowflake recently launched a product called Snowpark or something like that. I don't remember exactly the name, but they are playing a little bit with the Spark name. And actually,
Starting point is 00:35:11 it's a very competitive product to Spark. So they are adding this kind of functionality there. They want to support ML use cases. They want to allow users to build much more, let's say, sophisticated business logic on top of the data engine that they have. So they do that. They're moving into the space of Databricks, and they do it fast. And if there's one thing that nobody can deny about Snowflake, it's their execution.
Starting point is 00:35:40 They are extremely efficient at executing. Now, what our friends at Databricks, on the other side, are trying to do: they started with a distributed processing engine. That's very good when you want to process data the way you usually do in machine learning, where you have all your data there, you know exactly what kind of processing you want to do, and you want to scale it, either because you have a lot of data, for example, or because the computations you are doing are so complex that they have to be spread across more than one machine. So they have that. Now, what they are trying to do, and this is, let's say, the promise of Delta Lake, or the lakehouse, as they call it, is to take the concept of the
Starting point is 00:36:34 data lake and add some of the things that are very core to a data warehouse, like ACID guarantees, right? Or transactions. You don't have transactions in Spark. Initially you didn't need transactions, because when you're using a system like Spark, you don't have one user competing with another user in accessing and changing state, right? But if you want to do analytics,
Starting point is 00:37:06 if you want to run typical data warehouse workloads, you need to have that. And they are trying to build this on top of the architecture of a data lake. Now, if they manage to do that, what they will have done, in the end, is pretty much figure out a way to take a database system
Starting point is 00:37:33 and turn it inside out, actually, which is quite a feat if they pull it off technically. Keep in mind that building databases and ensuring all the guarantees that a database needs is not a simple task, right? Databases are extremely complex systems, especially when we are talking about distributed databases. Of course, they have some of the smartest people in the industry doing that. That's the good thing about a company that started from a bunch of geeks from Berkeley. So we'll see. It would be amazing if they managed to do that.
Starting point is 00:37:57 Because they are going to change, I think, completely the way that we work with and build databases. So it's going to be a huge paradigm shift if they do that. Yeah, I think it's really interesting. And when I think about this on a very practical level, without necessarily considering the technology but considering the day-to-day progression that we see in companies, one thing I've noticed over the past several years is that getting your analytics to a good place is generally a catalyst for machine learning projects. Because when you have enough data in a clean state where you start to derive insights,
Starting point is 00:38:44 that's a fertile environment for machine learning to accelerate lots of different projects. So as companies grow and get more data, they reach the scale you have to be at for machine learning to add a ton of value to your organization. So I think it'll be really interesting to see. And if you think about that from Snowflake's standpoint, maybe Databricks, from a technology standpoint, comes after Snowflake more quickly, but Snowflake may have a user base that they can pull into the machine learning world, because they have a fertile foundation for it. All right, we are actually pretty close to the end of the show, but
Starting point is 00:39:27 let's hit one more subject. This has been an awesome conversation, and Sven, you have written a lot about open source. So one topic that I'd love to hear you expound on, and the background behind this question, is that you've written a lot about open source business models, which we probably need to do a whole show on alone, because that's a fascinating topic. And you've recently written about pricing in open source business models. But I want to do the same thing we did with the Databricks versus Snowflake conversation, where we start at the very beginning. And I think you would say, and correct me if I'm wrong, that when you think about open source business models, you have to ask the question: what is a successful open source project?
Starting point is 00:40:19 And then you can talk about how that feeds into a business model. So talk us through: what is a successful open source project, and what are its characteristics? Again, excellent question, because I actually have a blog post I'm currently writing about this. I actually have around 300 blog posts which I haven't published yet, and that might be another one. So, the open source industry is actually really young. I mean, open source has been around for, like, 20 or 25 years, something like that, and I think GitHub launched in 2008, something like that. So what's a successful open source project? And just to remind you,
Starting point is 00:41:06 the usual research says that around 98% of open source projects fail. Okay, and that actually went up, from around 94% a few years ago to 98%. So I think there's some kind of three-dimensional, or three-tier, model to open source projects. And the very first dimension is when you work on your project, like your one repository
Starting point is 00:41:31 or multiple repositories, and you basically try to expand the project and get others involved. And what do you want to do? You want to get other developers to, first of all, use your cool new tool, and then to contribute to your cool new tool. Success in that dimension simply means lots of people using it and lots of people contributing to it. So that's easy.
Starting point is 00:41:54 I mean, you make it really easy to use and to deploy, and really easy to contribute, by providing whatever helps, like an SDK, making it easy to set up the test environment, and so on. But that's just the very first dimension. And so WordPress and the company Automattic actually started out that way. For two to three years they worked on just the WordPress core, which was a fork in the beginning. And then they realized, okay, so we need a second
Starting point is 00:42:21 we need a second dimension, which is that of what i call like guided extension that's the space where you start to add like modules plugins extensions like the ability for others to customize your project into in their own way so then you get like a little bit of explosion of repositories actually which you don't own anymore but actually that can be like a scary thought like the stuff you don't own anymore. But actually that can be like a scary thought, like the stuff you don't own anymore, but your success in that dimension, and you can only get to like, you have to get a level of success in dimension one
Starting point is 00:42:53 before you can start working on success in dimension two. So WordPress did that in, I think, 2004 or 2005, when they introduced plugins. Success there means making it really easy for people to extend things, to add plugins; you would need something like a plugin SDK. But then, around the second dimension, you actually also want to start to create an ecosystem. You actually want to have companies
Starting point is 00:43:18 that sell themes. You want to have a company, to take the WordPress example, that actually sells plugins, because these companies are core contributors, and they have a business incentive to contribute to your project and make it successful. That's what WordPress did, and I think at the end of 2008 or so, just working on these two dimensions, they hit 12% adoption across the CMS space. That meant that in 2008, 12% of the web was powered by WordPress, which already is an amazing feat.
Starting point is 00:43:59 And then they must have realized somewhere along the way that they actually needed a third dimension. And the third dimension is: once you hit that space of guided extension, you actually don't go into the third dimension, it just happens. It's the one of unguided extension. And for WordPress, that happened in 2011, with the appearance of e-commerce sites based on WordPress. That's the space where they thought it's a blog, and then they thought it's a static website,
Starting point is 00:44:20 and then they realized, oh, no, actually people are starting to use our thingy and build something completely new out of it. So WooCommerce, for instance, emerged in that space. And that is, I think, the scariest part for most open source projects, because a dimension-three open source project is the only truly successful project. That actually means helping others build their own thingy out of your project. For WordPress, that meant enabling others to actually build
Starting point is 00:44:52 their own WordPress, just targeted at e-commerce, for instance. And then they reconciled that with their business incentive, in the way that they simply bought the company. Okay, so that's a way to deal with that. So that is what I think a really successful open source project looks like. And by the way, it took them like 15 years to get to that space. And other projects as well,
Starting point is 00:45:15 like if you think about Linux or MySQL and so on. Yeah, it's super interesting, and I love the way that you've consolidated a lot of different components into a very concise set of characteristics. One question I have, though, and it really struck me, is that the failure rate for open source projects has increased and is now just incredibly high. And then there's the example of WordPress. One thing that struck me is that a lot of open source projects are fairly limited in scope, whereas for the examples you gave, MySQL or Linux or WordPress, the total addressable market for what the foundational technology enables is enormous, right? So if
Starting point is 00:46:08 you think about a CMS, the worldwide web is expanding at an unbelievable rate, and the number of websites is as well: exponential growth. And so you've had this unlimited potential for growth of a CMS product for web content. Do you think there's some sort of threshold, even if we can't define it exactly, for how large the total addressable problem has to be for an open source project to be successful? Not at all.
Starting point is 00:46:35 But I mean, the WordPress example is great, because they grew into that market. In 2004, they were just doing blogging, and the market they were targeting was probably like a thousand blogs worldwide. Maybe that's too low, but still, it's a super small size. And then they extended to the static site market, which was small at the time.
Starting point is 00:47:02 So, I mean, they kind of found their way and built these tangible things on top of that. So I don't think so. It depends on the project, and I think every project can find its way. Yeah, that's a great point, and a really good reminder that it really was blog-focused at the beginning. And really, in many ways, the community forged it into a more comprehensive, flexible CMS. I mean, there are even user accounts now and all
Starting point is 00:47:27 sorts of stuff that was way beyond the initial scope of blogging. Which was, I think, just a really great answer. Well, Sven, we are at time here. There are so many more questions to ask you about open source, but this has been a really fun conversation, and I hope that our listeners have enjoyed thinking through some of these big market shifts with us. We'd love to have you back on the show sometime to pick back up the open source conversation. Sure, I'd love to do a follow-up. Thank you guys for inviting me. What a fun conversation. Anytime I can ask questions that elicit strong opinions from both our guests and you, Costas, I feel like I've had a really great day,
Starting point is 00:48:12 and I think I accomplished that. So thank you for that. I think my big takeaway, I mean, there were so many things about Databricks versus Snowflake, but I really was caught off guard, I think, by the statistics around open source that Sven shared, the failure rate of open source projects, which is just really interesting to me because he's studied the open source space a lot. He's studied open source data companies. And his bet is that the next round of really big data companies will have open source foundations, yet the failure rate of open source projects has been increasing.
Starting point is 00:48:46 And so that is going to give me a lot to think about this week. Yeah, absolutely, I totally agree with you. Actually, open source as a phenomenon in general is something I think we should discuss more on this show. Outside of the failure
Starting point is 00:49:02 rate that Sven talked a lot about, there's another characteristic, and that has to do with abuse. It's very common for people out there who are maintaining repos, open source and pretty popular ones, to quit at some point because of all the abuse
Starting point is 00:49:19 that they have to go through from all the people who are just asking and demanding new features and stuff like that. Open source is actually a very, very interesting area for seeing, let's say, both the best and the worst of human nature, in a way. And yeah, it would be very interesting to discuss more about this. And I agree with him that we are going to see more and more big companies
Starting point is 00:49:45 whose foundations are on open source. And that has to do with the type of the industry, the type and complexity of the problems around data, and what the competitive advantages of these products are
Starting point is 00:49:57 compared to a SaaS solution, for example. So, yeah. And we're just at the beginning, by the way. I mean, Snowflake and Databricks are just the first players in this space, so we have a lot to see in the future. Yeah, it's going to be a really exciting decade. Well, thank you again to everyone who joined us on the show.
Starting point is 00:50:17 Lots of exciting guests coming up this fall. We'll do a season wrap-up here pretty soon for season two. And until next time, we will catch you later. You can reach me at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
