Screaming in the Cloud - Developing Storage Solutions Before the Rest with AB Periasamy

Episode Date: February 2, 2022

About AB

AB Periasamy is the co-founder and CEO of MinIO, an open source provider of high performance object storage software. In addition to this role, AB is an active investor and advisor to a wide range of technology companies, from H2O.ai and Manetu, where he serves on the board, to advisor or investor roles with Humio, Isovalent, Starburst, Yugabyte, Tetrate, Postman, Storj, Procurify, and Helpshift. Successful exits include Gitter.im (Gitlab), Treasure Data (ARM), and Fastor (SMART).

AB co-founded Gluster in 2005 to commoditize scalable storage systems. As CTO, he was the primary architect and strategist for the development of the Gluster file system, a pioneer in software defined storage. After the company was acquired by Red Hat in 2011, AB joined Red Hat's Office of the CTO. Prior to Gluster, AB was CTO of California Digital Corporation, where his work led to the scaling of commodity cluster computing to supercomputing-class performance. His work there resulted in the development of Lawrence Livermore Laboratory's "Thunder" supercomputer, which at the time was the second fastest in the world. AB holds a Computer Science Engineering degree from Annamalai University, Tamil Nadu, India.

AB is one of the leading proponents and thinkers on the subject of open source software, articulating the difference between the philosophy and the business model. An active contributor to a number of open source projects, he is a board member of India's Free Software Foundation.

Links:
MinIO: https://min.io/
Twitter: https://twitter.com/abperiasamy
MinIO Slack channel: https://minio.slack.com/join/shared_invite/zt-11qsphhj7-HpmNOaIh14LHGrmndrhocA
LinkedIn: https://www.linkedin.com/in/abperiasamy/

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps.
Starting point is 00:00:37 They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They've also gone in depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense. This episode is sponsored in part by our friends at Rising Cloud, which I hadn't heard of before, but they're doing something vaguely interesting here. They are using AI, which is usually where
Starting point is 00:01:16 my eyes glaze over and I lose attention, but they're using it to help developers be more efficient by reducing repetitive tasks. So the idea being that you can run stateless things without having to worry about scaling, placement, etc. and the rest. They claim significant cost savings, and they're able to wind up taking what you're running as it is in AWS with no changes and run it inside of their data centers that span multiple regions. I'm somewhat skeptical, but their customers seem to really like them.
Starting point is 00:01:48 So that's one of those areas where I really have a hard time being too snarky about it, because when you solve a customer's problem and they get out there in public and say, we're solving a problem, it's very hard to snark about that. Multus Medical, Construx.ai, and Stacks have seen significant results by using them, and it's worth exploring. So if you're looking for a smarter, faster, cheaper alternative to EC2, Lambda, or Batch, consider checking them out. Visit risingcloud.com slash benefits. That's risingcloud.com slash benefits. And be sure to tell them that I sent you, because watching people wince when you mention my name is one of the guilty pleasures of listening to this podcast.
Starting point is 00:02:28 Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by someone who's doing something a bit off the beaten path when we talk about cloud. I've often said that S3 is sort of a modern wonder of the world. It was the first AWS service brought into general availability. Today's promoted guest is the co-founder and CEO of MinIO, Anand Babu Periasamy, or AB as he often goes, depending upon who's talking to him. Thank you so much for taking the time to speak with me today. It's wonderful to be here, Corey. Thank you for having me. So I want to start with the obvious thing where you take a look at what is the cloud and you can talk about AWS's ridiculous high-level managed services like Amazon Chime. Great. We all see how that plays out. And those
Starting point is 00:03:24 are the higher level offerings, ideally aimed at problems customers have. But then they also have the baseline building block services. And it's hard to think of a more baseline building block than an object store. That's something every cloud provider has, regardless of how many scare quotes there are around the word cloud. Everyone offers the object store. And your solution is to look at this and say, ah, that's a market rife for disruption. We're going to build through an open source community software
Starting point is 00:03:53 that emulates an object store. I would be sitting here more or less poking fun at the idea, except for the fact that you're a billion dollar company now. Yeah. How did you get here? So when we started, right, we did not actually think about cloud that way, right? Cloud, it's a hot trend, and let's go disrupt it.
Starting point is 00:04:15 It will lead to a lot of opportunities. Certainly, it's true it will lead to the M&As, right? But that's not how we looked at it, right? It's a bad idea to build startups for M&A. When we looked at the problem, when we got back into this, my previous background, some may not know that it's actually a distributed file system background in the open source space. Yeah, you were one of the co-founders of Gluster, for which I have only begrudgingly forgiven you, but please continue. And back then, we got the idea right, but the timing was wrong.
Starting point is 00:04:44 While the data was beginning to grow at a crazy rate, end of the day, GlusterFS still had to look like an FS. It had to look like a file system, like NetApp or EMC. And it was hugely limiting what we could do with it. The biggest problem for me was legacy systems. I had to build a modern system that is compatible with the legacy architecture. You cannot innovate. And that is where Amazon introduced S3. Back then, when S3 came, cloud was not big at all, right? When I looked at it, the most important message of the cloud was
Starting point is 00:05:16 Amazon basically threw everything that is legacy. It's not iSCSI as a service. It's not even FTP as a service, right? They came up with a simple, RESTful API to store your blobs, whether it's a JavaScript, Android, iOS, or ML application, or even a Snowflake-type application. We all spent 10 years rewriting our apps to speak object store, and then they released EFS, which is NFS in the cloud. I didn't realize I could have just been stubborn and waited, and the whole problem would solve itself.
Starting point is 00:05:43 But here we are. You're quite right. Yeah. And even EFS and EBS are more for legacy, stopgap, come in, buy some time. But that's not how you should stay on AWS, right? When Amazon did that, for me, that was the opportunity. I saw that while the world is going to continue to produce lots and lots of data, if I built a brand around that, I'm not going to go wrong. The problem is data at scale.
Starting point is 00:06:06 And what do I do there? The opportunity I saw was Amazon solved one of the largest problems for a long time. All the legacy systems, legacy protocols, they convince the industry, throw them away. And then start all over from scratch with the new API. While it's not compatible, it's not standard, it is ridiculously simple compared to anything else. No FS tabs, no mount, no root, nothing, right? From any application, anywhere you can access, what's a big deal?
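The API surface being praised here really is that small. As a rough illustration, an in-memory toy and not any vendor's implementation, the entire contract of an S3-style store is roughly put, get, and list over a flat namespace:

```python
class TinyObjectStore:
    """Illustrative in-memory model of the S3-style contract:
    a flat namespace of (bucket, key) -> bytes. No fstab, no mount, no root."""

    def __init__(self):
        self._objects = {}  # (bucket, key) -> bytes

    def put_object(self, bucket, key, body):
        # Whole-object write; there is no partial mutation in the basic model.
        self._objects[(bucket, key)] = bytes(body)

    def get_object(self, bucket, key):
        return self._objects[(bucket, key)]

    def list_objects(self, bucket, prefix=""):
        # Listing is just prefix filtering over a flat keyspace.
        return sorted(k for (b, k) in self._objects
                      if b == bucket and k.startswith(prefix))

store = TinyObjectStore()
store.put_object("media", "photos/2022/cat.jpg", b"\xff\xd8...")
store.put_object("media", "logs/app.log", b"started")
print(store.list_objects("media", prefix="photos/"))  # ['photos/2022/cat.jpg']
```

Compare this three-verb surface against what a POSIX file system has to honor (open modes, seeks, permissions, directory semantics), and the "ridiculously simple" point becomes concrete.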
Starting point is 00:06:31 When I saw that, I was like, thank you, Amazon. And I also knew Amazon would convince the industry that rewriting their application is going to be better and faster and cheaper than retrofitting legacy applications. I'm wondering how much of that is retcon because talking to some of the people involved in the early days, they were not at all convinced
Starting point is 00:06:47 they would be able to convince the industry to do this. Actually, if you talk to the analysts, reporters, the IDCs, Gartners of the world, to the enterprise IT, the VMware community, they would say, hell no. But if you talk to the actual application developers, data infrastructure, data architects, the actual consumers of data, for them, it was so obvious. They actually did not know how to
Starting point is 00:07:10 write an fstab. The iSCSI and NFS, you can't even access across the internet. And the modern applications, they ran across the globe, in JavaScript and all kinds of apps on the device. From Snap to Snowflake, today they're built on object store. It was more natural for the applications team, but not for the infrastructure team. So who you ask mattered. But nevertheless, Amazon convinced the rest of the world. And our bet was that if this is going to be the future, then this is also our opportunity. S3 is going to be limited because it only runs inside AWS. Bulk of the world's data is produced everywhere, and only a tiny fraction will go to AWS.
Starting point is 00:07:47 And where will the rest of the data go? Not SAN, NAS, HDFS, or other blob store, Azure Blob, or GCS. It's not going to be fragmented. And if we build a better object store, lightweight, faster, simpler, but fully compatible with S3 API, we can sweep and consolidate the market. And that's what happened. And there is a lot of validity to that. We take a look across the industry
Starting point is 00:08:13 when we look at various standards. I mean, one of the big problems with multi-cloud in many respects is the APIs are not quite similar enough. And worse, the failure patterns are very different. I don't just need to know how the load balancer works; I need to know how it breaks, so I can detect and plan for that. And then you've got the whole identity problem as well, where you're trying to manage across different frames of reference as you go between providers, and it leads to a bit of a mess. What is it that makes MinIO not just something that has endured since it was created, but something that has clearly been thriving?
Starting point is 00:08:49 The real reason actually is not the multi-cloud compatibility, all that, right? Like, well, today it is a big deal for the users, because the deployments have grown into 10-plus petabytes, and now the infrastructure team is taking it over and consolidating across the enterprise. So now they are talking about, for storing the encryption keys, which key management server should I talk to? Look at AWS, Google, or Azure: everyone has their own proprietary API. Outside, they have Gemalto, HashiCorp Vault, and there is no standard here. There is supposed to be a KMIP standard, but in reality, it is not followed. Even different versions of Vault,
Starting point is 00:09:26 there are incompatibilities for us. That is where, from the key management server to the identity management server, everything that you speak to across the different ecosystems, MinIO provides connectors for. Having the large ecosystem support and large community, we are able to address all that.
Starting point is 00:09:44 Once you bring MinIO into your application stack, like you would bring Elasticsearch or MongoDB or anything else as a container, your application stack is just a Kubernetes YAML file and you roll it out on any cloud, it becomes easier for them. They're able to go to any cloud they want. But the real reason why it succeeded was not that.
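The "application stack is just a Kubernetes YAML file" framing can be made concrete. The following is an illustrative, single-node sketch, not an official MinIO manifest: a production deployment would pin an image tag, supply credentials via Secrets, mount a persistent volume, and more likely use the MinIO Operator or a StatefulSet.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio        # community image; pin a release tag in real use
          args: ["server", "/data"] # serve the S3-compatible API from /data
          ports:
            - containerPort: 9000   # S3 API endpoint
```

The same manifest applies unchanged on any conformant Kubernetes cluster, in any cloud or on-premises, which is the portability argument being made here.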
Starting point is 00:10:02 They actually wrote their applications as containers on Minikube. Then they will push it on a CI/CD environment. They never wrote code on EC2 or ECS, writing objects on S3. And they don't like the idea of PaaS, where someone is telling you how to run things; like you saw, Google App Engine never took off, right?
Starting point is 00:10:20 They like the idea of, here are my building blocks, and then I would stitch them together and build my application. We were part of their application development since the early days. And when the application matured, it was hard to remove. It is very much like Microsoft Windows when it grew. Even though the desktop was Microsoft Windows, the server was NetWare. NetWare lost the game, right?
Starting point is 00:10:41 We got the ecosystem, and it was actually developer productivity, convenience, the simplicity of MinIO, that really helped. Today, they are arguing that deploying MinIO inside AWS is easier through their YAML and containers than going to the AWS console and figuring out how to do it. As you take a look at how customers are adopting this, it's clear that there is some shift in this, because I could see the story for something like MinIO making an awful lot of sense in a data center environment, because otherwise, great, I need to make this app work with my SAN
Starting point is 00:11:15 as well as an object store. And that's sort of a non-starter for obvious reasons. But now you're available through cloud marketplaces directly. Yeah. How are you seeing adoption patterns and interactions from customers changing as the industry continues to evolve? Yeah, actually, that is how my thinking was when I started. If you are inside AWS, I would myself tell them that, why don't you use AWS S3, right?
Starting point is 00:11:39 And it made a lot of sense if it's on a colo or your own infrastructure; then there is an object store. It even made a lot of sense if you are deploying on Google Cloud, Azure, Alibaba Cloud, Oracle Cloud, because you wanted an S3-compatible object store. Inside AWS, why would you do it if there is AWS S3? Nowadays I hear funny arguments too. They're like, oh, I didn't know that I could use S3. Is S3 MinIO-compatible? They will be like, it came along with the GitLab or GitHub Enterprise, part of the application stack. They didn't even know that they could actually switch it over. And otherwise, most of the time, they developed it on MinIO.
Starting point is 00:12:15 Now they are too lazy to switch over. That also happens. But the real reason why it became serious for me: I had ignored the public cloud commercialization. I encouraged the community adoption, and it grew to more than a million instances across the clouds, small and large. But when they started talking about paying us serious dollars, then I took it seriously. And then when I started asking them,
Starting point is 00:12:38 why would you guys do it? Then I got to know the real reason why they wanted to do was they want to be detached from the cloud infrastructure provider. They want to look at cloud as CPU network and drive as a service. And running their own enterprise IT was more expensive than adopting public cloud. It was productivity for them. Reducing the infrastructure people cost was a lot.
Starting point is 00:13:00 It made economic sense. People always cost more than the infrastructure itself does. Exactly. Right. 70-80% goes into people.
Starting point is 00:13:14 Enterprise IT is too slow. They cannot innovate fast, and all of those problems. But what I found was, for us, we actually told the community and customers, if you're on AWS, if you're running MinIO on EBS, EBS is three times more expensive than S3. Or a single copy of it too, where if you're trying to go multi-AZ and you have the replication traffic, and not to mention you have to over-provision it, which is a bit of a different story as well. So it winds up being
Starting point is 00:13:35 something in the order of 30 times more expensive in many cases to do it right. So I'm looking at this going, the economics of running this purely by itself in AWS don't make sense to me. Long experience teaches me the next question is, what am I missing? Not, that's ridiculous and you're doing it wrong. There's clearly something I'm not getting. What am I missing? I was telling them that until we made some changes, right? Because we saw a couple of things happen. I was initially like, at least the erasure code does not make 30 copies. It makes like 1.4x, 1.6x, but still, the underlying block storage is not only three times more expensive than S3, it's also slow. It's network storage.
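For reference, the 1.4x and 1.6x figures AB mentions fall straight out of erasure-coding arithmetic; the shard counts below are illustrative examples, not MinIO's defaults (MinIO picks data/parity splits based on drive count and configuration):

```python
def ec_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw storage consumed per byte stored: an object is split into
    data_shards pieces plus parity_shards redundancy pieces."""
    return (data_shards + parity_shards) / data_shards

# Naive replication vs. erasure coding, with illustrative shard counts:
print(ec_overhead(1, 2))    # 3.0 -> three full copies (3x)
print(ec_overhead(10, 4))   # 1.4 -> survives 4 lost shards at 1.4x raw storage
print(ec_overhead(10, 6))   # 1.6 -> survives 6 lost shards at 1.6x raw storage
```

The point of the comparison: parity gives replication-class durability for a fraction of the raw capacity, which is why "30 copies" style multipliers never enter the picture.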
Starting point is 00:14:11 Trying to put an object store on top of yet another software-defined SAN like EBS made no sense to me. Small deployments, it's okay. But we should never scale that on EBS. So it did not make economic sense. I would never take it seriously, because it would never help them grow to scale. But what changed in recent times? Amazon saw that this was not only a problem for MinIO-type players.
Starting point is 00:14:33 Every database out there today, every modern database, even the message queues like Kafka, they all have gone scale-out. And they all depend on local block store. And putting scale-out distributed databases, data processing engines, on top of EBS would not scale. And Amazon introduced storage-optimized instances. Essentially, that removed the step of the data infrastructure guy, data engineer, or application developer asking IT, I want a Supermicro or a Dell server, or even virtual machines. That's too slow, too inefficient. They can provision these storage machines on demand, and then I can do it through Kubernetes. These two changes, all the
Starting point is 00:15:09 public cloud players now adopted Kubernetes as the standard, and they have to stick to the Kubernetes API standard. If they are incompatible, they won't get adopted. And storage-optimized, that is, local drives: these are machines like i3en, with like 24 drives. They are SSDs and fast network, like 25-gigabit to 100-gigabit type network. Availability of these machines, like what typically would run any database, HDFS, Gluster, MinIO, all of them,
Starting point is 00:15:37 those machines are now available just like any other EC2 instance. They are efficient. You can actually put MinIO side by side with S3 and still be price competitive. And Amazon wants to, just like their retail marketplace, they want to compete and be open. They have enabled it. In that sense, Amazon is actually helping us. And it turned out that now I can help customers build multi-petabyte infrastructure on Amazon and still stay efficient, still stay price competitive.
Starting point is 00:16:06 I would have said for a long time that if you were to ask me to build out the lingua franca of all the different cloud providers into a common API, the S3 API would be one of them. Now you are building this out multi-cloud. You're in all three of the major cloud marketplaces. And the way that you do that and do those deployments seems like it is the modern multi-cloud API of Kubernetes. When you first started building this, Kubernetes was very early on. What was the evolution of getting there? Or were you one of the first early adoption customers in a Kubernetes space? So when we started, there was no Kubernetes, but we saw the problem
Starting point is 00:16:46 was very clear. And there were containers, and then came Docker Compose and Swarm, then there was Mesos, Cloud Foundry, you name it, right? Like, there were many solutions, all the way up to even VMware trying to get into that space. And what did we do? Early on, I couldn't choose. I couldn't; it's not in our hands, right? Who is going to be the winner? So we just simply embraced everybody. It was also tiring to implement native connectors to all of them. Different orchestration, like Pivotal Cloud Foundry alone, they have their own standard open service
Starting point is 00:17:18 broker that's only popular inside their system. Go outside, elsewhere, everybody was incompatible. And outside that, even Chef, Ansible, Puppet scripts too. We just simply embraced everybody until the dust settled down. When it settled down, clearly the declarative model of Kubernetes became easier. Also, the Kubernetes developers understood the community well. And coming from Borg, I think they understood the right architecture. And also, it is written in Go, unlike Java, right? It actually matters; these minute details resonate with the infrastructure community. It took off, and then that helped us immensely. Now it's not only that Kubernetes is popular, it has become the standard, from VMware to OpenShift
Starting point is 00:17:58 to all the public cloud providers, GKE, AKS, whatever, right? All of them now are basically Kubernetes standard. It made not only our life easier, it made every other ISV, every other open source project; everybody now can finally write one code that can be operated portably. It is a big shift. It is not because we chose; we just watched all this. We were riding along the wave. And then because we resonated with the infrastructure community, modern infrastructure is dominated by open source. We were also the leading open source object store. And as the Kubernetes community adopted us, we were naturally embraced by the community. Back when AWS first launched with S3 as its first offering, there were a bunch of folks who were super excited,
Starting point is 00:18:45 but object stores didn't make a lot of sense to them intrinsically. So they looked into this and, ah, I can build a file system in user space on top of S3. And the reaction was, holy God, don't do that. And the way that AWS decided to discourage that behavior is a per-request charge, which for most workloads is fine, whatever, but there are some on which it causes a significant burden. With running something like MinIO in a self-hosted way, suddenly that costing doesn't exist in the same way. Does that open the door again to, so now I can use it as a file system again? In which case, that just seems like using the local file system, only with extra steps. Do you see patterns emerging with customers' use of MinIO that you would not see with the quote-unquote providers' quote-unquote native object storage option?
Starting point is 00:19:34 Or the patterns mostly look the same? Yeah, if you took an application that ran on file and block and brought it over to object storage, that makes sense. But something that is competing with object store or a layer below object store, that is, end of the day, the drives are block devices. You have a block interface, right? Trying to bring SAN or NAS on top of object store is actually a step backwards. They completely missed the message that Amazon told that if you brought a file system interface on top of object store, you missed the point that you are now bringing the legacy things that Amazon intentionally removed from
Starting point is 00:20:09 the infrastructure. Trying to bring them on top doesn't make it any better. If you are arguing from a compatibility standpoint, for some legacy applications, sure. But writing a file system on top of object store will never be better than NetApp, EMC, like EMC Isilon, or anything else, or even GlusterFS, right? But if you want a file system, I always tell the community, they ask us, why don't you add an FS option and do a multi-protocol system? I tell them that the whole point of S3 is to remove all those legacy APIs. If I add POSIX, then I'll be a mediocre object storage and a terrible file system. I would never do that. But why not write a FUSE file system, right? Like S3FS is there. In fact, initially for legacy compatibility, we wrote MinFS, and I had to hide it. We actually archived
Starting point is 00:20:55 the repository, because immediately people started using it. Even simple things like, end of the day, can I use the Unix coreutils, like cp, ls, like all these tools I'm familiar with? If it's not a file system but object storage, then S3cmd or the AWS CLI is too bloated, and it's not really a Unix-like feeling. Then what I told them: I'll give you a BusyBox-like single static binary, and it will give you all the Unix tools that work for the local file system as well as object store. That's where the mc tool came. It gives you all the Unix-like programmability, all the coreutils that are object storage compatible; it speaks native object store. But if I had to make object store look like a file system so Unix tools would run,
Starting point is 00:21:33 it would not only be inefficient; Unix tools never scaled for this kind of capacity. So it would be a bad idea to take a step backwards and bring legacy stuff back inside. For some very small cases, if there are simple POSIX calls, using ObjectiveFS or S3FS for legacy compatibility reasons makes sense. But in general, I would tell the community, don't bring file and block. If you want file and block, leave those on virtual machines, leave that infrastructure in a silo, and gradually phase them out. This episode is sponsored in part by our friends at Vultr, who provide high performance cloud compute at a price that, well, sure, they claim it is better than AWS's pricing. And when they say that, they mean that it's less money. Sure, I don't dispute that. But what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going
Starting point is 00:22:37 to cost. They have a bunch of advanced networking features. They have 19 global locations and scale things elastically, not to be confused with openly, which is apparently elastic and open. They can mean the same thing sometimes. They have had over a million users. Deployments take less than 60 seconds across 12 pre-selected operating systems. Or if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr Cloud Compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists
Starting point is 00:23:18 on having something of that scale on their own. But you don't have to take my word for it, with an exclusive offer for you: sign up today for free and receive $100 in credits to kick the tires and see for yourself. Get started at vultr.com slash morningbrief. That's v-u-l-t-r dot com slash morningbrief. So my big problem, when I look at what S3 has done, is its name. Because of course, naming is hard. It's Simple Storage Service. The problem I have is with the word simple. Because over time, S3 has gotten more and more complex under the hood.
Starting point is 00:23:56 It automatically tiers data the way that customers want. And integrated with things like Athena, you can now query it directly. Whenever an object appears, you can now query it directly. Whenever an object appears, you can wind up automatically firing off Lambda functions and the rest. And this is increasingly looking a lot less like a place to just dump my unstructured data, and increasingly a lot like this is sort of a database in some respects. Now, understand my favorite database is Route 53. I have a long and storied history of misusing services as databases. Is this one of those scenarios, or is there some legitimacy to the idea of turning this into a database?
Starting point is 00:24:37 Actually, there is now the S3 Select API: if you're storing unstructured data like CSV, JSON, Parquet, without downloading even a compressed CSV, you can actually send a SQL query into the system. In MinIO particularly, the S3 Select is SIMD optimized. We can load like every 64K worth of CSV lines into registers and do SIMD operations. It's the fastest SQL filter out there. Bringing these kinds of capabilities
Starting point is 00:25:03 brings it just a little bit away from a database. But should it be a database? I would definitely tell no. The very strength of the S3 API is to actually limit all the mutations, right? Particularly if you look at databases, they're dealing with metadata and querying. The biggest value they bring is indexing the metadata. But if I'm dealing with that, then I'm dealing with really small blocks, lots of mutations. The separation: object storage should be dealing with persistence, and not mutations. Mutations are a database problem.
Starting point is 00:25:34 Separation of the database work function and the persistence function is where object storage got the storage right. Otherwise, it'll make the mistake of doing POSIX-like behavior, and then, not only bringing back all those capabilities, doing IOPS-intensive workloads across HTTP wouldn't make sense, right? So, object storage got the API right.
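The S3 Select behavior AB describes can be approximated client-side to see the semantics. This toy filter is only an illustration of the idea (push the predicate to where the data lives so only matching rows cross the wire); it is not MinIO's SIMD implementation, and the real call is a SQL expression sent through the S3 SelectObjectContent API:

```python
import csv
import io

def select_csv(csv_bytes: bytes, column: str, predicate) -> list:
    """Toy stand-in for S3 Select: filter CSV rows at the storage side
    instead of downloading the whole object and filtering locally."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode()))
    return [row for row in reader if predicate(row[column])]

# Hypothetical object body; in real S3 Select this would live in a bucket.
data = b"name,size\nalpha,120\nbeta,3000\ngamma,45\n"
big = select_csv(data, "size", lambda v: int(v) > 100)
print([r["name"] for r in big])  # ['alpha', 'beta']
```

With a multi-gigabyte CSV, the difference between shipping two matching rows and shipping the whole object is the entire value of the feature.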
Starting point is 00:25:53 But now, should it be a database? So, it definitely should not be a database. In fact, I actually hate the idea of Amazon yielding to the file system developers and giving a file-tree, hierarchical namespace so they can write nice file managers. That was a terrible idea. Writing a hierarchical namespace that's also sorted now puts a tax on how the metadata is indexed and organized.
Starting point is 00:26:16 Amazon should have left the core API very simple and told them to solve these problems outside the object store. Many application developers don't need it. Amazon was trying to satisfy everybody's needs. Saying no to some of these file system type, file manager type users would have been the right way.
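For context on the hierarchical-namespace complaint: in a flat object store, "directories" are purely a listing convention. This sketch shows how an S3-style prefix-and-delimiter listing fakes a file tree over flat keys (illustrative, not any vendor's implementation):

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Emulate S3-style ListObjects: return (objects, common_prefixes).
    'Folders' are just key prefixes up to the next delimiter; the server
    never needs to maintain a real directory tree."""
    objects, common = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Roll everything past the next delimiter up into one "folder".
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(common)

keys = ["photos/2021/a.jpg", "photos/2022/b.jpg", "readme.txt"]
print(list_with_delimiter(keys))                    # (['readme.txt'], ['photos/'])
print(list_with_delimiter(keys, prefix="photos/"))  # ([], ['photos/2021/', 'photos/2022/'])
```

The tax AB mentions comes from the server having to keep keys sorted so this kind of grouping stays cheap at any scale.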
Starting point is 00:26:35 But nevertheless, adding those capabilities, eventually now you can see S3 is no longer simple. And we had to keep that compatibility. And I hate that part. I actually don't mind compatibility, but then all the wrong things that Amazon is adding, now I have to add because it's compatible. I kind of hate that, right? But now, going to a database would be pushing it to a whole new level. Here is the simple reason why that's a bad idea. The right way to do a database, in fact, the database industry is already going in that right direction: unstructured data, key value or graph, different types of data, you cannot possibly solve all that even in a single database.
Starting point is 00:27:10 They are trying to be multi-model databases; even they are struggling with it. You can never be a Redis, Cassandra, and SQL all in one. They try to say that, but in reality, you will never be better than any one of those focused database solutions out there. Trying to bring that into object store would be a mistake. Instead, let the databases focus on query language implementation and query computation, and leave the persistence to object store. So object store can still focus on storing your database segments, the table segments, but the index is still in the memory of the database.
Starting point is 00:27:44 Even the index can be snapshotted once in a while to object store, but using object store for persistence and the database for query is the right architecture. And almost all the modern databases now, from Elasticsearch to ClickHouse to even message queues like Kafka, they all have gone that route. Even Microsoft SQL Server, Teradata, Vertica, you name it, Splunk, they have all gone the object storage route too. Snowflake itself is a prime example, BigQuery and all of them. That's the right way. Databases can never be consolidated. There will be many different kinds of databases. Let them specialize on GraphQL or graph APIs or key value or SQL. Let them handle the indexing
Starting point is 00:28:22 and persistence. They cannot handle petabytes of data; leave that to the object store. That's how the industry is shaping up, and it is going in the right direction. One of the ways I learned the most about various services is by talking to customers. Every time I think I've seen something, this is amazing, this service is something I completely understand, all I have to do is talk to one more customer. And when I was doing a bill analysis project a couple of years ago, I looked into a customer's account and saw a bucket with, okay, that has 280 billion objects in it. And wait, was that billion with a B? And I asked them, so what's going on over there? And they said, well, we built our own columnar database on top of S3. This may not have been the best approach. I'm
Starting point is 00:29:07 going to stop you there. With no further context, it was not, but please continue. It's the sort of thing that would never have occurred to me to even try. Do you tend to see similar, I would say, anti-patterns, except somehow they're made to work in some of your customer environments, where they are using the service in ways that are very different from what's encouraged, or even allowed, by the native object store options? When I first
Starting point is 00:29:36 started seeing the database type workloads coming onto MinIO, I was surprised too. That was exactly my reaction. In fact, they were storing these 256K, sometimes 64K table segments because they need to index it, right? And the table segments were anywhere between 64K to 2MB.
Starting point is 00:29:53 And when they started writing table segments, it was more of an IOPS-type I/O pattern than a throughput-type pattern. Throughput is an easier problem to solve. And MinIO always saturated these 100-gigabyte NVMe drives. We were I/O intensive, throughput optimized. When I started seeing the database workloads, I had to optimize for small-object workloads too. We actually did all that because eventually I got convinced the right way to build a database was to actually leave the persistence out of the database.
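The throughput-versus-IOPS point can be made concrete with back-of-the-envelope arithmetic: at the same aggregate bandwidth, the small table segments demand dramatically more operations per second. The figures below are illustrative, not MinIO benchmarks.

```python
def ops_per_second(target_gbits: float, object_bytes: int) -> float:
    """Operations/second needed to sustain a target throughput (in Gbit/s)
    when each I/O operation moves exactly one object of the given size."""
    bytes_per_second = target_gbits * 1e9 / 8   # Gbit/s -> bytes/s
    return bytes_per_second / object_bytes

# Saturating a 100 Gbit/s pipe with large versus small table segments:
large = ops_per_second(100, 2 * 1024 * 1024)  # 2 MB segments: ~6,000 ops/s
small = ops_per_second(100, 64 * 1024)        # 64K segments: ~190,000 ops/s
```

Same pipe, roughly 32 times the operations: what was a throughput problem becomes an IOPS problem, which is why the small-object path needed its own optimization.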
Starting point is 00:30:22 They actually made a compelling argument. Historically, I thought about metadata and data: data could be very big, and moving it to the object store makes sense. Metadata should be stored in a database, and that's only the index pages. Take any book, the index pages are only a few. The database can continue to run adjacent to the object store. It's a clean architecture. But why would you put the database itself on the object store? When I saw a transactional database like MySQL changing InnoDB to RocksDB and making changes at that layer to write the SSTables to MinIO, I was like, where do you store the memory, the journal? They said, I've got to go to Kafka. I thought that was insane when it
Starting point is 00:31:03 started, but it continued to grow and grow. Nowadays, I see most of the databases have gone to object store. But their argument is the databases also saw explosive growth in data, and they couldn't scale the persistence part. That is where they realized that they were still very good at the indexing part, which object storage would never give. There is no API to do a sophisticated query of the data. You cannot peek inside the data. You can just do streaming read and write. And that is where the databases were still necessary, but databases were also growing in data.
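The MySQL-on-RocksDB setup AB describes follows the classic LSM write path: each write is journaled first for durability (Kafka, in that customer's case), applied to an in-memory memtable, and a full memtable is flushed to the object store as an immutable, sorted SSTable. A rough sketch, with a plain list standing in for Kafka and a dict for the object store; all names here are hypothetical.

```python
class LSMWriter:
    """Toy LSM write path: journal for durability, memtable for speed,
    immutable SSTable flushes to an object store for persistence."""

    def __init__(self, journal, object_store, memtable_limit=3):
        self.journal = journal            # stand-in for a Kafka topic
        self.object_store = object_store  # stand-in for an S3/MinIO bucket
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.flushed = 0

    def put(self, key, value):
        # 1. Append to the journal first, so a crash loses nothing.
        self.journal.append((key, value))
        # 2. Apply the write to the in-memory memtable.
        self.memtable[key] = value
        # 3. When the memtable fills up, flush it as an immutable SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # SSTables are sorted and never modified after being written.
        sstable = sorted(self.memtable.items())
        self.object_store[f"sstables/{self.flushed:06d}"] = sstable
        self.flushed += 1
        self.memtable.clear()
```

The only piece that does not fit neatly in an object store is the journal, which is exactly why that customer reached for Kafka.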
Starting point is 00:31:32 One thing that triggered this was the use case moved from data generated by people to data generated by machines. Machines means applications, all kinds of devices. Now it's like going from 7 billion people to a trillion devices; that's how the industry is changing. And this led to lots of machine-generated, semi-structured, structured data at giant scale coming into databases. The databases need to handle scale. There was no
Starting point is 00:32:03 other way to solve this problem other than leaving persistence to the object store. If you're looking at columnar data, most of it is machine-generated data. Where else would you store it? If they tried to build their own object storage embedded into the database, it would make the database immensely complicated. Let them focus on what they are good at, indexing and mutations. Pulling the table segments, which are immutable, mutating in memory, and then committing them back gave the right mix. What you saw was the first step that happened. We saw that consistently across the board. Now it is actually the standard.
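The pull-mutate-commit cycle AB lands on is copy-on-write: the database never edits an object in place. It reads an immutable segment, applies the change in memory, writes the result back as a brand-new object, and repoints its index. A minimal sketch under the same toy assumptions (a dict stands in for the object store; every name is illustrative):

```python
import json

def commit_mutation(store, index, table, updates, version):
    """Read the current immutable segment, mutate a copy in memory, write it
    back as a new object, then repoint the in-memory index at it."""
    rows = json.loads(store[index[table]])    # pull the immutable segment
    rows.update(updates)                      # mutate only in memory
    new_name = f"{table}/v{version}.seg"      # never overwrite in place
    store[new_name] = json.dumps(rows).encode()
    index[table] = new_name                   # old object can be GC'd later
    return new_name

# A tiny walk-through of one commit:
store = {"users/v1.seg": json.dumps({"alice": "admin"}).encode()}
index = {"users": "users/v1.seg"}
commit_mutation(store, index, "users", {"bob": "viewer"}, version=2)
```

Because the old segment is untouched until the index moves, the object store only ever needs streaming reads and writes, which is the whole point of the division of labor.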
Starting point is 00:32:35 So you started working on this in 2014, and here we are, what is it, eight years later now. And you've just announced a Series B of $100 million at a billion-dollar valuation. So it turns out this is not just one of those things people are using for test labs; there is significant momentum behind using this. How did you get there? Because everything you're saying makes an awful lot of sense, but it feels, at least from where I sit, like a little bit of a niche. It's a bit of an edge case that is not the common case. Obviously, I'm missing something, because your investors are sophisticated; they're not the type to see something ridiculous and go, yep, that's the thing we're going to go for. They're right more than they're not.
Starting point is 00:33:19 Yeah. The reason for that was they saw what we were set to do. In fact, if you see the lead investor, Intel, they watched us grow. They came into Series A and they saw every day how we operated and grew. They believed in our message. And it was actually not about object storage. Object storage was a means for us to get into the market. When we started, our idea was 10 years from now, what will be a big problem? A lot of times it's hard to see the future, but if you zoom out, it's hidden in plain sight. These are simple trends. Every major trend pointed to world producing more data. No one would argue
Starting point is 00:33:57 with that. If I solved one important problem that everybody is suffering from, I wouldn't go wrong. And when you solve the problem, it's about building a product with fine craftsmanship, attention to detail, connecting with the user, all of that standard stuff. But I picked object storage as the problem because the industry was fragmented across many different data stores, and I knew that that wouldn't be the case 10 years from now. Applications are not going to adopt different APIs across different clouds. S3 to GCS to Azure Blob to HDFS, everything is incompatible. I saw that if I built a data store for persistence,
Starting point is 00:34:35 the industry would consolidate around the S3 API. Amazon S3, when we started, it looked like they were the giant. There was only one cloud; the industry believed in monocloud. Almost everyone was talking to me like AWS would be the world's data center. I certainly see that possibility. Amazon is capable of doing it.
Starting point is 00:34:53 But my bet was the other way: that AWS S3 would be one of many solutions, but if it's all incompatible, it's not going to work; the industry will consolidate. Our bet was, if the world is producing so much data, and if you built an object store that is S3 compatible,
Starting point is 00:35:09 but ended up as the leading data store of the world and owned the application ecosystem, you cannot go wrong. We kept our heads low and focused for the first six years on massive adoption. Build the ecosystem to a scale where we can say, now our ecosystem is equal to or larger than Amazon's, then we are in business.
Starting point is 00:35:29 We didn't focus on commercialization. We focused on convincing the industry that this is the right technology for them to use. Once they are convinced, once you solve business problems, making money is not hard because they are already sold. They are in love with the product. Then convincing them to pay is not a big deal, because data is such a critical, central part of their business. We didn't worry about commercialization. We worried about adoption. And once we got the adoption, now customers are coming to us and they're like, I don't want open source license
Starting point is 00:35:57 violation. I don't want data breach or data loss. They are trying to sell to me. And it's an easy relationship game. And it's about long-term partnership with customers. So the business started growing, accelerating. That was the reason that now is the time to fill up the gas tank, and investors were quite excited about the commercial traction as well, and all the intangibles, right? How big we grew in the last few years. It really is an interesting segment that has always been something that I've mostly ignored.
Starting point is 00:36:27 Like, oh, you want to run your own. Okay, great. I get it. Some people want to cosplay as cloud providers themselves. Awesome. There's clearly a lot more to it than that. And I'm really interested to see what the future holds for you folks. Yeah, I'm excited.
Starting point is 00:36:40 I think, at the end of the day, it's about solving real problems: every organization is moving from being compute-technology centric to data centric. And they're all looking at data warehouse, data lake, whatever name they give the data infrastructure. Data is now the centerpiece. Software is a commodity. That's how they are looking at it. And it is translating to each of these large organizations. Actually, even the mid-size companies, even startups nowadays have petabytes of data. And I see a huge potential here. The timing is perfect for us. I'm really excited to see this continue to grow.
Starting point is 00:37:13 And I want to thank you for taking so much time to speak with me today. If people want to learn more, where can they find you? I'm always in the community, right? Twitter and, I think, the Slack channel; it's quite easy to reach out to me. LinkedIn. I'm always excited to talk to our users, our community. And we'll, of course, put links to this in the show notes. Thank you so much for your time. I really appreciate it. Again, wonderful to be here, Corey. Anand Babu Periasamy, CEO and co-founder of MinIO. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Starting point is 00:37:52 Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with what starts out as an angry comment but eventually turns into you, in your position on the S3 product team, writing a thank-you note to MinIO for helping validate your market. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point.
Starting point is 00:38:35 Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
