The Data Stack Show - 134: Unpacking the AI Revolution and the Technology Behind A Feature-First Future with H.O. Maycotte of FeatureBase

Episode Date: April 12, 2023

Highlights from this week's conversation include:
- The journey of H.O. into data and becoming the CEO of FeatureBase (2:37)
- Characteristics of the super evolution in technology (6:36)
- ChatGPT as the missionary of AI (9:45)
- The tension between authenticity and technology (13:12)
- What is FeatureBase? (17:53)
- Comparing FeatureBase to feature stores (25:58)
- Workload capacities and possibilities in FeatureBase (33:20)
- The importance of developer experience on a platform (38:23)
- Exciting developments for FeatureBase in the future (47:13)
- Final thoughts and takeaways (53:52)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Hey, exciting news. We're teaming up with Ananth from the Data Engineering Weekly newsletter for the State of Data Engineering Survey.
Starting point is 00:00:30 It's a five-minute survey, and we need your help. As the data world continues to rapidly evolve, we're interested in your insights around data team priorities, team dynamics, data stacks, how you approach projects like identity resolution, and even data team roles. Your answer will help us build a report that will get featured in Data Engineering Weekly, and after it launches, we'll have a special Data Stack Show episode discussing the results. Plus, we'll send you some Data Stack Show swag for participating. Visit
Starting point is 00:01:01 rudderstack.com slash survey to participate. That's rudderstack.com slash survey. Welcome back to the Data Stack Show. Kostas, today we're chatting with H.O. Maycott and what a story he has to tell. I am not even going to get into it because I'm so excited for our listeners to hear about his upbringing in a hill town without technology in rural Mexico. So there's your little teaser. I want to ask H.O., of course, there's a lot to talk about with Futurebase, his company, which is a fascinating technology, but he's also really into the future and really into AI. And so I want to hear about his vision for the future and for AI. And then did that influence or how did it influence him founding FeatureBase? And then, of course, I want to hear
Starting point is 00:01:56 the technical details and I'm sure you have technical questions. So you've used it. So what are you, you've used FeatureBase. So what are you going to ask about? There are plenty of questions. FeatureBase is very interesting because it's outside of, okay, being like, let's say, a database for features. It has some very interesting use cases. Like I primarily, to be honest, like got interested in it because I felt that it's like a great technology to use for the use cases around CDPs.
Starting point is 00:02:26 And working with event data and creating audiences and all that stuff that you know pretty well from both Routersite and also working in marketing. But it's much more than that. They do a great job in exposing the technology behind it, which is always very interesting for me. So they get into a lot of detail of why they've decided to use it and work with bitmaps and how they use them and all that stuff, which might be a little bit intimidating for someone who just wants to try the product,
Starting point is 00:03:04 but for someone who's curious and understands how it works, it's amazing. And I think we will have a lot of opportunities to talk about building a business around a very new technology, what it takes, and what the journey looks like. Because there is a journey behind FeatureBase. It's not like a product that was launched six months ago. And I think HO is the right person to talk about that stuff. I agree. Well, let's dig in. Let's do it.
Starting point is 00:03:36 HO, welcome to the Data Stack Show. Wow, we have a ton to talk about. The future AI, FeatureBase. So why don't we start where we always do. So give us your background and kind of what led you to starting feature-based. Yeah. Thank you all for having me here. Yeah. My name is H.O. Maycott. I'm the CEO of feature-based. I was born and raised in Mexico in a hill town that had very little access to media and technology. So it was a place where like your wildest imagination could run free without paradigm. And, you know, I always from
Starting point is 00:04:10 a very early age was obsessed with the idea that humans were just on the edge of being superhuman and that, you know, society and life, not necessarily humanity, but life in general was sort of on the edge of this super evolution. And so as that took me through school and my career, I continue to pull on that thread, right? I think especially what's happened in the last six to eight weeks is really fascinating. You know, I've lived my life for this moment, like some call it the fourth industrial revolution,
Starting point is 00:04:38 but without a doubt, like AI is officially here. I think ChatGPT is like the missionary for AI. It's not exactly how I thought it was going to manifest with like funny images and marketing copy, but for good or for bad, it's here. And I do think that, you know, what's happening will allow life to evolve to its next order. I'm fascinated by it. And I think that, you know, we as humans need to sort of unite to make sure that we're helping machines help humans and not, you know, helping machines and humans help machines, because if we're not careful, you know, a lot of our innovators
Starting point is 00:05:10 are focused very much on technology for technology sake. And so we got to remember, we got to keep that in mind. So, you know, so every day I love building on the future. I consider myself an ambassador of the future, but also trying to make sure that, you know, at least the role I play and that my company plays is trying to keep that balance and remember that we're doing this to help us and life and not the other way around. Yeah. So much to dig in there. Let's go into the hill town in Mexico though. You said from an early age, you were enamored by this concept of there sort of being, you know, humanity almost going through like a stepwise change. But you also said you didn't have access to a ton of media.
Starting point is 00:05:53 Where did the seeds of those ideas come from? It's a really good question. I think, you know, I just couldn't handle the idea as a little kid that like, you know, you broke your back. You were paralyzed. Like, why can't you fix those nerves? Like's just you know it's just matter it's just you know it's just it's you should be able to fix that or right like why if we have a heart condition does the rest of the body fail with it so those thoughts have always sort of obsessed me maybe because i didn't have media i didn't realize it wasn't possible like i thought it was possible
Starting point is 00:06:24 i always thought it was possible. And I think as I've gotten older, I continue to try to challenge why are these things not possible? And without a doubt, I think we're going to solve them all now. This is a whole other subject. So we can get together for another episode. I think as long as we don't have a social revolution interrupting all of our advances, I think we're going to see a wild amount of innovation coming at us.
Starting point is 00:06:49 We as humans, like I said in my purpose statement, we have to help machines help us. I think we're just going to have to learn to adapt to very rapid change. Just the impacts that social media have had on kids and on society, it's 15 years old. It just came at us so fast. All of these things are coming at us so fast. And, you know, the first two industrial revolutions were like a hundred years each. The last one was 50 years. You know, this one's going to be like 20. So
Starting point is 00:07:15 anyways, it's going to be, it's going to be fascinating. And I think maybe that, that idealistic, you know, up in the mountains, flowers on the hills background, you know, up in the mountains, flowers on the hills, background, you know, kept me from the paradigms that block my thinking. So I love it. Yeah. I love it. Yeah. The mind of a child and sort of, you know, imagining what adults think is impossible. It seems like you've carried that through, you know, your life and career, which is truly inspiring. One quick question on the super evolution. So you mentioned a couple of things that, you know, were like medical and nature, you know, so sort of biotechnology, but what are sort of the big, you know, I certainly think that's, there's going to be a huge breakthrough in biotechnology for, you know,
Starting point is 00:08:01 sort of medical things, but what are other characteristics of this super evolution that you see? Yeah, I think a lot of people are asking the questions right now, like, you know, sort of medical things, but what are other characteristics of this super evolution that you see? Yeah, I think a lot of people are asking the questions right now, like, you know, what is it going to take to sort of bridge the gap between the AGI, the, you know, the general intelligence that we feel and sort of dream is coming along, right? When does the machine wake up and all of the narrow AI and the simple AI that we're doing today? And I always try to reframe that question that I actually think it's a much broader question, right? It's like AI is on a journey and it will get there again, unless we have social unrest
Starting point is 00:08:32 that stops technology. We will be there sooner than we think. But I think the other end of the spectrum is really what's happening on the synthetic biology side, right? We're trying to make little robots and we're so impressed when it can, you know, jump from one box to the next box. But like, you know, imagine if you could reprogram a cat, like the cat is a million times more advanced than that robot. You know, I always make the joke that like, if I could have a swarm of programmable chinchillas,
Starting point is 00:08:58 right, I could create a lawn business and have these chinchillas come like mow your lawn and fertilize for free. And, you know know they'd have a blast doing it and it would be entertainment at the same time but like that is probably closer to us in capabilities than you know creating an army of 800 little robots that come out and mow our lawn or you know do our landscaping and so there's some fascinating things there's a friend of mine started a company called colossal and their big audacious goal is to bring back the woolly mammoth. He partnered with George Church, the co-founder who invented CRISPR. And they are 100%, and notice I say they, they are 100% convinced that they're bringing back the woolly mammoth in about five years. And it's fascinating to have hundreds of samples. They've been able to put together the genomes. And for us that are technologists, what blows my mind is that they're starting on one end with the Asian elephant genome. And they've sort of mapped
Starting point is 00:09:55 out by putting together all these fuzzy genomes, what the woolly mammoth genome looked like. And there's only 60 genes that are different, right? So they flip the bits per se on these 60 genes. And all of a sudden, it has like four times the mass. It has long hair. It has different hemoglobin flowing through its body. Imagine if we had gotten around and built a robotic Asian elephant, and now we wanted to build a woolly mammoth robotic elephant. It would take us millions of lines of code and years of effort and engineering.
Starting point is 00:10:25 To think that we can go from the Asian elephant to the mammoth with 60 genes is just mind-blowing. So whatever programming language nature uses, it's several orders more efficient than the ones that we're using. So if we can figure out how to harness that, that might be a faster path to some of the things that we're trying to solve. So in any event, I'm still holding out for which path is going to be faster, but I do think we need to consider both paths as we think about the future. Fascinating. I could dig into that for hours, but let's steer it a little bit more towards data and technology and dig into AI, right?
Starting point is 00:11:01 So chat GPT, huge topic of conversation. I love your description of chat GPT as the missionary for AI. I agree with that. We've actually, we've had various discussions about AI on the show, you know, throughout our entire time, which has been really interesting. And I think chat GPT is kind of the first thing that we've seen or that we've discussed on the show that has sort of like mass practical, like appeal and utility, you know, whereas a lot of the previous iterations of AI sort of lived behind a shroud, or you consumed it, you know, with some level of obfuscation as like an end user, or people just didn't understand it, right? But,
Starting point is 00:11:45 you know, anyone can go into chat GPT, ask you the question and they get an answer. It's like, wow, okay, this is real. But you didn't necessarily expect that, you know, me as a marketer, I'm going to use that to like get my new product pages, you know, for Rudderstack, you know, all fancy and then, you know, generate my blog images for my blog posts. What did you think was going to happen? If we rewind five years and you said, okay, the first missionaries are going to be like X, Y, and Z in terms of the manifestations of AI.
Starting point is 00:12:15 Yeah, I mean, I think if I had a robot right now, they could lift and stack boxes. I'd sell the heck out of them to Amazon. I think we thought AI and some of this robotic stuff was going to be taking our jobs from the bottom, right? Like, so just one insight I have with what's happening now is it looks like we're going to go after the middle, right?
Starting point is 00:12:31 Like the legal profession, for example, has always sort of feared technology because it couldn't quite do their job right. For the first time, it sounds like this evolution, revolution on search, which we call chat GPT, which we call these large language models, you know, can finally, you know, do that work really well. In fact, it's probably really well suited for it. And so, so it's so, you know, one of the interesting manifestations is it's sort of going for our jobs in the middle. But I think more than anything, like it relates to us, it relates with us, like we relate to it. And so like we're to us it relates with us like we relate to it and so like we're convinced that it's real and whether it's real or not i don't really know like you know if
Starting point is 00:13:09 we think it's real you know there's a lot of questions about content credibility and bias and all of those things that get get baked into it but like we as humans are kind of suckers right so as long as we think it's real which i think we we do think it's real, and I think it's going to have really profound implications, right? I think, A, we all believe in AI now. My little cousin, my grandmother, everybody talks about large language models. And I think it is going to go after some jobs and change the way we operate. I think what I'm kind of obsessed with is how do we democratize that? It's really like three or four really big companies that have access to all of this. But for everybody else, AI sucks, right?
Starting point is 00:13:50 Like data still sucks. So AI really sucks. Like how do we make this so that we can augment ourselves, right? Every email I have, every text message I have, every time I buy a piece of clothing, like I should be able to have my own version of it that's helping me every day. It's my own personal co-pilot. So we have a lot of opportunity over the next few years to think through the different paths that we've taken to take all of this, but certainly the wheels are in motion now and it makes me so happy. Yeah. I want to go a little sidebar here. You mentioned, is it real? Can you describe the tension there from your perspective? Because I think that's an interesting, and I'll tell you, I'll give you context for the question coming from my brain is that,
Starting point is 00:14:36 you know, I think about it, I tend to think about it from the technology side, right? And so it's, I mean, this is a real model, like taking real input and using real training to produce like an output, right? But there are a lot of people, you know, the education space is a good example, right? Like is an essay that's produced by chat GPT. That's really good. Like, is that real? You know, like, is that kind of what you're getting at? Yeah. I mean, I think maybe even a little more existentially like what is real i uh i helped start a non-profit media organization called the texas tribune a long time ago you know and one of the premises behind starting that organization is that like democracy requires an unbiased source telling you know telling the facts but like even as as factual
Starting point is 00:15:23 as we tried to be there was probably some bias in the way we wrote it and the way we reported it you know the media tends to lean left in general but you know if it was on national tv you know 30 years ago it was real right like for the most part we all believed it whether or not it was real or not like we believed it that it got harder with newspapers and print publications and you know like i think we had a inverse curve hit a spike when the internet got created and the inverse curve is like critical thinking went straight down and and and the proliferation of any version of reality that you wanted went straight up yeah right and so now what chad gp key has done is just
Starting point is 00:16:02 and these large language models have done is just like like obfuscated, turned all of that into a black box. So like, you know, fact checking was already hard and critical thinking was already hard. Like this stuff is really convincing. Like, you know, like Chad GPT, tell me how I'm going to go lift a 10,000 pound weight. Well, it'll tell me and it'll be very convincing and excited and enthusiastic about how I'm going to go about doing that. So, so, you know, that reality, I don't know, but like, it's going to be a lot harder to fact check. It's going to be a lot harder to, you know, to, to figure out what's credible and that's credible, but some degree doesn't matter. Like we've been believing, you know, what media internet now these models are feeding us and, you know, for good or for bad, reality is going to be further distorted. Yeah. How do you balance that? Sorry, I'm going to continue down the sidebar. How do you balance that as an ambassador for the future, right? Because if you think,
Starting point is 00:17:00 and thank you also for answering the question with beginning by saying even more existentially. I love that. So how do you manage, because that's somewhat of an existential crisis where it's like, I'm an ambassador for the future where these things are coming to fruition. However, you also acknowledge that the distortion of reality, serious questions, you know, for society? Yeah, it's a good question, maybe with a quick sidebar on sort of some insights I've had on myself recently. I'm definitely an optimist. Everybody tends to ask me how I'm doing on a scale one to 10. And it's always a 10. And people are how could it be a 10? Like your house is on fire. I'm like, but I'm a 10. You know, so I have this weird ability to not feel to some degree.
Starting point is 00:17:46 So, you know, don't ask my wife and children how having a non-infantic husband goes, but they like scale of one to 10. You will. Good question. But I tend to think that like, you know, it is what it is, right? This is evolution. This is life. I think, you know, Darwin will kick in there somewhere but like there are for sure going to be some tremendously negative consequences that come from this distortion of
Starting point is 00:18:11 reality and those that have the power to create the content whether they're doing it you know consciously or subconsciously have some consequences on their hands right we've seen these with the social networks like without going into the details you know these social networks have had a profound impact on my personal family you know and so I think we're just sort of at the beginning of seeing the consequences of all of this, but it's evolution, right? Like, you know, we will, we will evolve as a result, but you know, let's go read the internet in Russia right now. Let's go read the internet in Mexico right now. And they're going to have a very different version about the exact same, you know,
Starting point is 00:18:45 about the exact same current events. So, yeah. Yeah. Well, always appreciate an optimist and someone, I mean, in many ways, you're accepting the reality of the inevitable, you know, which I really appreciate. Okay, let's talk about how all of that
Starting point is 00:19:02 is packaged into, you know, or what pieces are packaged into you starting FeatureBase. You're a serial entrepreneur. FeatureBase is your latest company. Can you give us a quick overview of, you know, sort of FeatureBase, like what it is, what it's used for, and then circle back and tell us, like, why did you start it? Was it in response to some of those, you know, sort of fundamental beliefs? Yeah, I love that order. And so, so yeah, its core feature base is just a really fast analytical database. It's all in memory. And it's, it was inspired by the feature extraction and engineering process. And so we figured out that
Starting point is 00:19:42 like, most data was originally stored in records. And then people started storing it in columns to be able to analyze it. But every column they created was yet another copy. And so we sort of moved into this, like let's move and copy data in order to analyze it. And when we had this penicillin moment, which we'll get to in just a moment
Starting point is 00:20:01 and create a feature base, we realized that if you stored sort of data at the value at the feature, you know, machines could process that information much more efficiently, that it was a way of empathizing with the way machines wanted to process data and not the way that humans process data. And so feature-based is far more computationally effective and efficient than doing it the way that the human construct has led us to do it
Starting point is 00:20:25 with traditional analytical formats. And so we've invested about $30 million into the technology. As I like to say, it's kind of like the particle physics of data. I think our IO underneath the hood is probably on the Guinness World Book of Records every single time. We go after one of these bigger and bigger workloads. And so taming that has taken a lot of effort. And we're absolutely maniacally obsessed sort of on getting the developer experience so that people can use it. I make the bad joke that it's like a flux capacitor.
Starting point is 00:20:57 So unless you have a DeLorean, time travel is going to be pretty difficult. But we're making huge strides right now to make it adoptable and usable. And further, I'm starting to make some moves to perhaps build a service around it that makes it a lot more than just a database. Like infrastructure is really hard.
Starting point is 00:21:16 And we sell this amazing engine. And if I went to you and said, hey, you've got a car, what if I could give you an engine that went 10 times as fast with 10 times less fuel? would you like it? You'd probably tell me yes. But if the next day I show up on Costas's front door and I say, Hey, here's your engine. Good luck. Like it's not going to, it's not going to go very well. Right. So, so I'm trying to think through like, you know, how do we bolt on a steering wheel and some wheels and then further, like, you know, should it have a driver, right? Like, you know, I've got data in
Starting point is 00:21:47 Snowflake and I want to run this model, right? That would be really nice, right? So I'm currently trying to go from like the unbelievably efficient analytical engine, you know, to what can that power to actually deliver a full experience to the end user. So we're sort of in those throes now. Yeah, absolutely. And you mentioned the penicillin moment. You may have mentioned it, but can you reiterate it if you did? Yeah. So I'll give you two penicillin moments. It was like penicillin squared. So the last company I had was called Humble. Humble was a CDP for sports media and entertainment companies. So we had about 10% of the world's sports teams as clients. And the origin of that humble was a CDP for sports media and entertainment companies. So we had about 10% of the world's sports teams as clients. And the origin of that company was a
Starting point is 00:22:29 little bit less commercial. And it was I've always been obsessed with trying to democratize access of these things to consumers. So when we invented the technology that powers feature base, we were trying to sort of think through like, what would a digital genome look like? How could we represent a human and all of their attributes, And those could be behavioral, medical, otherwise, right? So these led to the format that's underneath the hood for feature-based, it's features, right? It's like the presence of an attribute or a behavior and that underneath
Starting point is 00:22:58 that gets sorted bitmap. So we were like, hey, let's noodle on that. And you could take, you know, my genome and your genome and find the pattern. And those aggregates could tell you how we would behave without knowing HO or Eric, right? So it was pretty powerful for analyzing audiences and consumers and behaviors. Very quickly, we realized that we really couldn't find a way to empower the consumer directly. There wasn't a business model to say like, hey, let's go help you take back your data and we'll go help you monetize it. So we just leveled up a step to sports media and entertainment companies who had these amazing fan bases. And we got quickly enamored with the idea that every customer would bring like hundreds of millions of people into the platform, right? Like a TV network,
Starting point is 00:23:40 a sports league, like it was hundreds of millions of consumers every time. And we love that, you know, because it worked really well inside of this data that we had. But before we fully got our technology ironed out, we were trying to use everything out there that we could, things like Elasticsearch. We were trying to force it to do these filters, aggregation, sorts, rank sorts. And at that scale was just very difficult. And so like our most important queries were going from like sub second to second to minutes. I think at one point our 40 node elastic search cluster was returning that query in about 20 seconds. So, you know, we were starting to cache things and I was just obsessed with the idea, like, no,
Starting point is 00:24:24 no caching. Everything has to be real time. It was just obsessed with the idea like, no, no caching. Everything has to be real time. It was probably a stupid obsession at the time because the end user didn't care, but I was obsessed with it. And so that's where this idea of the digital genome and the ultimate format that ensued came in. We had two engineers that had been doing quantitative stock market trading their entire career. And they just said, hey, H.O., we've got this wild idea. Every time we prepared data for the models that we use to trade
Starting point is 00:24:49 stocks, we'd essentially go select all the features that we wanted and we'd store the data's features, which were essentially decision-ready data. It was a one or a no or a zero. If it was there, it wasn't. And then the model would use these features as input. What happens if we just convert all of the data into features that is getting created? And what if instead of putting it in a database for sort of human-centric data, what if we create a format specifically for features and store it in that native sort of form? And so I gave them six months and they came back with a two-node cluster of the technology. And within weeks, we commissioned the 40 node elastic search cluster and it could do our segmentation and
Starting point is 00:25:29 aggregation queries in single digit milliseconds. And so that was the penicillin moment. It was like, wow, we just defined physics in a way that's so simple. But at the time, it just could do like really high cardinality workloads. Everything was Boolean, so yes or no. So over time, I eventually convinced my board to let me spin that out about four and a half, five years ago into its own company. Like I said, we put about $30 million into it. And so we started teaching it things like integer.
Starting point is 00:26:01 How would you store integers in a binary representation? So we found this white paper called bit slice indexing. And so you could store a 64-bit integer in seven, and it would still have the performance of the underlying bitmaps, right? So you could do range queries on it and all of those kinds of things. And so eventually we taught at floating point, and we use these compression techniques to be able to do dense, mixed density, ultra high cardinality data, a bitmap compression technique called roaring. We modified it and made it a 64-bit version of roaring. I think we were the first to do that. And then we stuck it
Starting point is 00:26:35 in a B plus tree so it could behave more like a regular database. And fast forward to today, we've got what mostly looks like an analytical database, but is very different underneath the hood. Amazing. All right. Well, that is actually the perfect time. Costas, I have tons more questions, but please, I know that there are so many technical questions that cropped up based on HO's description.
Starting point is 00:27:01 So take it away. Yeah, yeah. So HO, let's start by talking a little bit about feature stores, right? Yeah. Like the term feature store has been around like for a while now. Probably they're like post the peak of the hype cycle,
Starting point is 00:27:18 let's say, right? But when I was trying to understand what a feature store is, I was confused, to be honest. Yeah. It wasn't, and I guess it still isn't, in most of its implementations, it's not a single data store, right? It's pretty much a whole architecture that tries to support both online use cases, or let's say real-time use cases, and also batch use
Starting point is 00:27:46 cases in terms of processing the data. Because it makes sense, right? You have your historical data, obviously it's going to be batch processing, right? And then you also have the data that is coming and you want as soon as possible to create features and feed them to the model. So it never felt to me like we are talking about a database system. Yes, they were using various components like from Snowflake to Hive to Databricks to everything. But I think the most interesting part of the feature stores was Redis.
Starting point is 00:28:22 There was always a Redis there that was storing the features and serving the features, right? So at least that's my understanding of the feature store. What's like, how you have experience like feature stores and also how do you compare it to what a feature base is, right?
Starting point is 00:28:44 Yeah. And so I think, you know think this is a lesson for a lot of technologists and entrepreneurs. I'm not going to lie. When we invented this, it solved our problems so well at Humble that we didn't have to do a whole lot more to it. To really solve the high cardinality segmentation use case that we built it for. But I've always had this blind faith that it can solve a whole lot more than just that. And so we've been a bit of a proverbial solution chasing a problem for a while. And that's always a hard place to be. And it takes a lot of blind faith and it takes a lot of optimism
Starting point is 00:29:18 and grit and all of those things that make us crazy as founders. So we've gone through a journey of like, what are we? And trying to meet the market where it is, trying to meet the chasms where they are. And so as we were exploring a category change about four years ago, three, four years ago, we were looking at sort of the underlying process of turning data into features and what we were doing underneath the hood, right? There's one hot encoding, there's, you know, a variety of things that you're doing. And so, you know, can you call it a one hot database? You know, what do you call it?
Starting point is 00:29:50 And at the end of the day, we were storing features. So we're like, hey, it's a feature store, like a data store. But instead of data, it's features, right? So we meant like a storage system for features. We didn't mean like a model lifecycle management, right? That does versioning and lineage and all the other stuff that the modern sort of quote unquote feature stores do. And so we'd been working on this launch and we relaunched as a feature store and literally
Starting point is 00:30:12 within weeks of sort of changing our category, then the Michelangelo project sort of spun out and you had Feast and Tecton and, you know, they redefined ultimately and not even redefined that we hadn't quite defined it yet, except for ourselves. They really defined ultimately what feature stores were going to become. And their versions of feature stores, which I think still align with the current definition, are more of, in my mind, a model lifecycle management system. They're not a storage system for features.
Starting point is 00:30:42 They're really helping you manage the creation and management of features, offline and online features. And most of them have at least three databases underneath the hood, right? So you've got a variety of databases that are coming together to solve that problem, which is a very different problem than the one we set out to solve. Our problem is that data at scale is very difficult and that you have to copy and move it and that everything is batch, right? Like, and yeah, if I go process my features and batch and stick them in Redis,
Starting point is 00:31:12 yeah, I can serve them really fast. But what if I could just compute those features on the fly? What if I didn't have to pre-process those features? What if my transformations, aggregations, and joins were happening in real time? What if those were in the model instead of in my pipelines and in my batch jobs? It would be so much easier to track lineage and versioning. So that was what we were trying to solve with our feature store, but it became a difficult sales process because our top of funnel was full.
Starting point is 00:31:40 Everybody was interested in feature stores and we'd show up and we'd be like, well, we have a feature storage system. And they're like, well, we want to put this model in production. How are you going to help us? And we're like, we can't. So sadly, we had to sort of pivot out of it. I do think at some point it's going to get redefined again. I feel like the category sort of slowed in interest, but I think features are an unbelievably important part of the future. I mean, it's the way machines think, not the way humans think. Like, we want filing cabinets, let's keep our filing cabinets for the humans. But like machines love features, models love features, you know, CPUs, GPUs, they love features. So I do think features and a feature first future is going to dominate,
Starting point is 00:32:20 you know, the way that we scale data. Yeah, makes total sense. And how you would compare feature-based to vector databases? What's the difference? It's a really good question. So we have floating point support in feature-based now, but we don't have native floating point. And by that, I mean the same technique that we use for integers, we're now applying it, we're in development on it, we're now applying to floating point as well. So being able to store us a 64 bit float, and we'll see exactly where it ends up, but let's
Starting point is 00:32:58 call it 10 bits. So we're pretty excited about that, because our core engine should be able to serve full feature vectors at a fraction of the cost and at much more scale than the ones that are currently out on the market. I also believe that so much of AI today is batch. So much of AI is based on records, training a model, and developing a score, but we avoid the analytical queries like looking at an index, looking at a population when we're outputting those scores because the queries are really expensive. So I think there's an entirely new paradigm when an analytical database can serve both a last mile transactional workload
Starting point is 00:33:36 and the core analytical workload and do the feature vectors all at the same time. And so we have all of that in preview internally, but we're very cognizant that being able to store a full feature vector efficiently is a pretty killer feature. And so we're very quickly in development on it right now. That's super cool. So the only, like, I mean, the only, okay,
Starting point is 00:34:00 it's an important difference, but like the main difference is like in the data types, right? What kind of data one system or the other can handle? So with the vector systems, you have floats representations primarily, right? While right now with feature-based, you are working primarily with integers. And out of the hood,
Starting point is 00:34:21 what you have there is a bitmap, which is a series of zeros and ones, right? Exactly. So what kind of workloads someone can experience today with FeatureBase? What's like, let's say, the best scenario for someone to go and try FeatureBase and have a whoa moment by using it? Yeah, I mean, I think it's for good or for bad, and I'm happy being transparent with all of my flaws and all of our challenges. Like for good or for bad,
Starting point is 00:34:51 feature-based has been a database of last resort for very large workloads, right? So when thousands of servers have not been able to solve the job, our customers have been willing to invest the months that it takes to wrap their head around the data model. You've experienced a little bit of this. It is a distributed system. We've got high availability features. You can do replication factors. You know, all of those
Starting point is 00:35:14 things are pretty important for the type of workloads that we serve. But for the most part, we're serving very high ingest workloads that need rapid segmentation and filtering on that data. So that is really the bread and butter for FutureBase. Now, we've very quickly been able to wrap a lot of other workloads around it, but until it's easier to adopt, which it's becoming, those have been the workloads that we're serving. And I'll give you one example, a company called Tremor Video. When we first started working with them, they had about a thousand node Hadoop and Druid cluster that served up there. They were storing somewhere about a million events in that cluster. And then they would run predictions on the consumers and the devices that were feeding data into this cluster. And it would take them about 24 hours to generate
Starting point is 00:36:01 those predictions. So we came in and we were able to reduce the thousand servers to 11, and we could do those same predictions in about a third of a second. So a couple of things that are important to mention here is it was a thousand to 11 servers, so saving millions a year in compute. Now, don't believe everything HO says. The 11 servers were a lot bigger than the thousand servers that they were using previously, but they did the calculation and it's about a 70% reduction in cost. But I think what's more important is now those workloads, let's just call it a second instead of a third of a second. That query was happening in a second. So instead of taking a day, it would take a second. So you could run
Starting point is 00:36:39 84,000 queries in a day on that same compute cluster that they had. So just absolutely changed the way they ran business. They were now completely real-time in a space. This is the advertising space. And they've now scaled that up to about a trillion events a day and tracking about 20 billion devices globally. And literally, there is just nothing else that can do that. And I love those.
Starting point is 00:37:03 Those are great. Those are like trophies you can put on the wall. But like, that's not the everyday problem, right? That's not the problem that the masses have. And to build a really big company, we've got to find problems that more of the masses have. So that's why we've been maniacally obsessed on, you know, developer experience, which we realize is the key to that mass adoption. Yeah, yeah, 100%.
Starting point is 00:37:25 I have a couple of questions there. So from what I understand, let's say a CDP scenario is pretty much ideal, right? For marketeers, let's say, who won't like to be able to segment and create audiences and all the standard things that, let's say, someone is doing with CDP, you can do that scale and with extreme low latency by using feature-based, right?
Starting point is 00:37:54 Exactly, exactly, exactly. And we typically break it up into three buckets, right? Like consumer experience, which includes personalization, segmentation, recommendation, all the things that you just talked about that are natural to a CDP. Anomaly detection is highly faceted as well, right? So it's something that has to happen really quickly and the feature stores and the approaches today pre-process the data, right? If you're a credit card processor, you have to decide if it's fraudulent or not in like 50 milliseconds. And they can do that, but you know why? Because the fraud vector they're using to make that decision was pre-computed. And it was probably being served out of Redis. But it
Starting point is 00:38:28 might have been pre-computed 24 hours ahead. We see these companies pre-processing in days. So that's not okay. We've got to process that in that moment based on the totality of all of the data. So there's another really great opportunity for this real-time workload. And then lastly, I'll say a lot of the stuff happening in AI is really interesting, especially around computer vision. Things like labels, once they get tokenized and transformed out of their sort of raw formats,
Starting point is 00:38:56 end up being highly categorical, right? They look a lot like consumer behaviors and consumer insights, right? So at the end of the day, there's really unstructured search, which gets turned into structure. And then there's structured search, which is faceted search, right? So at the end of the day, it kind of all leads to the same place. So, you know, I am optimistic that this is going to serve a variety of important workloads as we keep
Starting point is 00:39:20 innovating. 100%. Yeah, I totally agree with that. Yeah, it's super, super interesting. Let's go and talk a little bit more about the developer experience now. You've been building the product for a while now. You have customers. You've seen
Starting point is 00:39:40 what it takes to take a new piece of technology and try to adopt it. It's not easy, right? And it seems that more and more people start to believe that it's not just the user experience, it's also the developer experience, which is pretty important.
Starting point is 00:39:55 Making sure that you can help developers succeed in whatever they do is an important aspect of succeeding or not with bringing a product to the market. not. It's like bringing like a product market. So, so without like a few things through this process, like what you have experienced.
Starting point is 00:40:14 Yeah. I mean, I think being in love with your own technology is a huge problem, you know, and empathizing with the end user is all that matters. Right. But others, people,
Starting point is 00:40:23 what others think of you, your product and your brand is what really matters. Obviously, that's like 101. I think it's important to explain a little bit about our journey. So we definitely were originally an open source project under the name Pelosa. So that was the original name of the project. And by all measures, we were wildly successful. Investors sort of flocked to us and saw all the stars going up. And as soon as we took on investor money, the investors said, well, this is so important that you've got to turn your sales process into an enterprise sales process. Huge mistake, number one, because if you look at the curve of innovation and the chasm
Starting point is 00:41:03 that we all have to cross to get on the other side, like analytics was in a prime spot at that point. Today, I would say analytics is way off on the other side. So at the time, product market fit was great. And we decided to start selling this from an enterprise perspective. We ended up going about as high in the organization as we could. And these sales cycles were long. They were very large contracts, like half a million, million dollar a year contracts. And we would sell before people would adopt it. So the developer experience seemingly didn't matter. And let's hang on that word seemingly, because we'd go sign a contract and then the teams were introduced to FeatureBase and they were like, hey, we just signed this contract, go figure out how to use it.
Starting point is 00:41:43 And so they kind of had no choice and it was a painful process, but we had an army of customer success people and deployment people, and we would go help them get it implemented. About a year ago, I just looked at my board and I said, this is crazy. We can go build a big business, maybe $100 million business, but I don't want to build $100 million. And I'm talking about revenue. I don't want to build $100 million business. And I'm talking about revenue. I don't want to build a hundred million dollars business. I want to build a billion, multi-billion dollar business. And there is no way we're going to get there. We have to recaptivate the hearts and minds of the developers. And so a year ago, we, we fired all of marketing, all of sales, and we went PLG. And at the same time,
Starting point is 00:42:20 we decided we were going to take a year to bring certain things to market. We're almost done with SQL. I know you've been using the product a little bit over the last couple of months. We have a lot more along those lines. We pushed out a whole new iteration of documentation yesterday. And so it's been a mad rush for the last year to remove five APIs. We had two ingest APIs. Everything's now gotten standardized around SQL and SQL was difficult for us to get our minds around before because we're like,
Starting point is 00:42:50 there's a much better language for bitwise operations or a bitmap oriented format, but like it didn't matter, right? That was great in our own heads. So we've had to have a strong dose of reality over the last year as we've worked on this developer sort of adoption. And I think we have another six to 12 months to go before we can say, hey, this is now adoptable. And in the meantime, I'm working on plans to acquire a few companies that are going to eliminate a lot of those challenges too, right? Like why should someone have to buy or install yet another database to go run models on their Snowflake data, like Stalin Snowflake or Databricks or Redshift. You should just be able to tell me where your data is and what model you want to run.
Starting point is 00:43:32 So a lot of what we're working on now on the roadmap sort of addresses that, making it even easier. So we've got cloud and cloud consumption out to market. We've got SQL almost out to market. One of my favorite areas is user-defined functions. So being able to register functions in the database and run them actually in the database, as opposed to having to move and copy data to the models.
Starting point is 00:43:54 And then serverless is another big piece that we're finishing right now, you know, to bring costs and efficiency even further down. That's a pretty busy roadmap. That sounds like fun. Sorry, didn't go again. I's a pretty busy roadmap. That sounds like fun. Sorry, I didn't go again. I didn't want to go. No, I was going to say,
Starting point is 00:44:09 it is, and it's been a year's worth of work, and we're really excited to start to see these things coming to fruition now, but they can never come fast enough. Yeah, yeah, 100%. What was one of the most, let's say, surprising learning
Starting point is 00:44:23 that you went through through this transition of being these high tops from top to down kind of sales in the enterprise and leaving that behind and trying to go after other developers. What surprised you? Yeah, I mean, I think everything is about market timing and product market fit. And just because you have it at one point doesn't mean that you have it at the other. Like we had it four years ago, but when we switched to enterprise, like, you know, we learned that motion, but it was inefficient.
Starting point is 00:44:56 And then by the time we came back, like, you know, who really cares about very fast analytics, right? Like, I mean, it matters, but it's not the problem that's on everybody's minds today, right? Like, I mean, it matters, but it's, it's not the problem that's on everybody's minds today. Right. And, you know, everybody's trying to figure out the machine learning, you know, pipeline and paradigm and, and further yet, now we have large language models. Are they going to eat everything? Like they probably could, like we should be doing our analytics and our machine learning, you know, in, in a singular way. And so, you know, I just keep coming back to the idea that AI sucks,
Starting point is 00:45:26 right? Like for the average company and the average person, it's amazing on TV and in the movies and, you know, and with all this chat GPT stuff, but like practical AI is very difficult and very distant. And so I definitely, you know, I'm going to continue to move as fast and hard as I can to to make that experience really easy. Like I said earlier, we're like an engine. I'm going to buy some wheels and a steering wheel, and I am going to offer an Uber-like service so that you can get from A to B.
Starting point is 00:45:55 And I might not be able to go serve these trillion a day workloads as efficiently, but I'm going to serve the broad masses needs more efficiently. So that's a long-winded way of saying I've worked hard to make this database more adoptable. We have more work to do, but I don't think it's enough. People don't want and need yet another database. What the market needs is solutions to real tangible problems every day. Yeah, yeah, 100%. I totally agree with you.
Starting point is 00:46:25 And by the way, I have to say that it's very impressive and I don't know what other word to use, but talking with someone who has gone through the process of building a company, reached the point where you have product market fit, you go after the enterprise and decide to leave that behind and
Starting point is 00:46:49 in a way, let's say, rebuild the company from scratch, that takes a huge amount of courage to do. I mean, that's super, super impressive. I have to say to share that with you, because I know from my personal experience
Starting point is 00:47:08 and also like by working like in startups, do that, it's like extremely hard. It's super, super hard. Like it's, if you think that's like a leap of faith, like to start a company from zero to one, taking a company from a hundred to zero to go back to one, that's wow. That says a lot about the person. So thank you for sharing that with us. Well, yeah, of course. And I think, you know, at least in my case, I have pretty blind faith in features. Like I really do believe that if we're
Starting point is 00:47:41 going to have the machines doing our work for us, like we need to think like the machines, not like humans, and we're still stuck thinking like the humans. So, you know, I have this blind faith that features will power the future, right? And that everything's going to be feature first. And so we haven't quite found the exact right approach to it, but we're going to, or we're going to die trying. Yeah. Yeah. And it sounds like you are the right person to do that. Well, thanks. I might have you call my board and tell them that. But yeah, thank you. All right.
Starting point is 00:48:13 So one last question from me, then I'll give the microphone back to Eric. So share with us something exciting about feature-based that is coming up in the next couple of weeks or months or something that we should keep in our mind and make sure that we go and check when it comes out? Yeah, I think the most exciting thing that we're working right now
Starting point is 00:48:37 is this user-defined functions, UDF. I think we're not the only ones working on it. Single store is doing a really amazing job. We'll see how it ultimately manifests. We've got all the WASM stuff happening as well. But I do very much believe, like I've got, you know, in my passion for features,
Starting point is 00:48:56 you know, my personal obsessions underneath that is to eliminate copying and moving of data, right? I believe that models and data are going to collapse. If we've learned nothing more from these large language models, like the data and the model are becoming pretty much the same thing. And so I think that the only way to really scale the future is to really think about this as like the working memory of AI. You made this point, you asked this question earlier that we didn't quite get to, but like, you know, we as humans don't go back and analyze everything we've ever done.
Starting point is 00:49:29 Rewatch the videos of everything that we've ever done. Read transcripts of everything we've done. Like you and I've had quite a few interactions. I didn't go reread all of those. We wouldn't have time to do that, but we have just enough knowledge about our prior interactions that we can bring to what's happening in this moment and make decisions, right? So I think the only way we're going to be able to scale the future is to think about it in that same way, right? Like the working memory of AI, right?
Starting point is 00:49:54 Like being able to recall just what you need from the historical context with what's happening at this moment to be able to make decisions. And to do that, we're going to have to bring models to the data, right? We need to stop copying and moving data to the models. And so one way or the other, it's going to happen. I hope we're one of the pioneers of it because models love eating features. We're a feature storage system. So like bringing models to the feature storage system
Starting point is 00:50:18 in my brain makes a lot of sense. But one way or the other, I'm excited to see that both in our own product and in the world. I think it's going to make the world a lot more secure. I think it's going to shift innovation to creating value, not to the data engineering that's involved with all of the machine learning pipelines and the lineage and the versioning. And when we can start to network the output of these models, I think that it's a wonderful future. So the very beginnings of this for us are simple. It's just models written in Python. You know, you would SQL, you go registered in the database and
Starting point is 00:50:50 you can run it as data arrives. You can run it on a cron job. You can run it as you do your query, but it runs in the same compute engine. And further, it's going to run on the same serverless compute engine that we've built, right? So you can isolate the model from the query execution piece. And anyways, we're pretty excited about that. And we hope the world agrees that it's going to be a good new capability. Yeah, yeah. And hopefully, we'll have the chance to talk more about that when it is released. So I'm inviting you already. Well, thank you. Thank you. Well, we have a principal engineer, Matt Jaffe, who's been leading those efforts. And he is brilliant, far more technical, super articulate.
Starting point is 00:51:27 I think you would love Matt Jaffe's dive into not only are features computationally far more effective and efficient than storing data, but now that we've got those serverless capabilities, I say this all the time, I think we're going to cut the cost of analytical workloads by at least 99%. So whoever's making money right now on these workloads, they should tremble because the world is about to shift quickly, right? Like we're going to move to like computing faster and more regularly. We've got to figure out how to make these models continuously trainable. Like, like that's where we need to start to shift our energy. A hundred percent. A hundred percent. All right, Eric, the microphone is back to you.
Starting point is 00:52:06 Yes. As always happens, we can keep going and keep going. But we do have to respect our producer while he's gone. Okay, H.O., this is more of a personal question because you are highly optimistic. You seem extremely high-functioning. You understand technology on a deep level, but you also think existentially, as evidenced by the earlier part of our conversation.
Starting point is 00:52:31 Is there anything on a personal level in terms of productivity or how do you operate in your day-to-day? And is there anything that you could share with our listeners that's been particularly helpful? Because it seems like you have a lot of ideas flowing through the old gray matter up there. Yeah, it's a really great question. And I wish I had a spectacular answer for you. So I'm going to try to get a good answer. I have a general on my team. His name is Cord Campbell.
Starting point is 00:53:00 He was one of the first employees at Splunk. And then he started a company called Logly, if you'll remember. He's been in search for 20 years. So as he sees all this like large language model stuff, he's like, oh, yeah, that's just like the next evolution of search. I'm like, yeah, but it's worth like 30 billion now. So it's a little more than just the next evolution of search. But Cord's brilliant.
Starting point is 00:53:18 And Cord's helping us monetize, not monetize, think about how to democratize these technologies more so that we can use them every day as a co-pilot and so as we work on this next iteration that i was telling you about like how do we create sort of the uber that goes from data to model we want consumership at the forefront of it right like this isn't like about a company it's about individuals and maybe they belong to a company maybe they belong to many companies. But we want the individual to have a free tier where they can index their email. They can index their text.
Starting point is 00:53:51 They can index all of their file. They can index all of these things. And he does this every day. So he's got a technology called Mita that he's constantly indexing everything into. So if you were having a conversation right now, all the things you were saying, he'd be feeding to Mita. It crawls URLs, it eats PDFs. And so as he's working, he's asking it questions, but the biggest piece of it is he gives it feedback. So if Mita comes back with a fact or some opinion that's not right, he just gives it that feedback loop. And so I think prompt engineering, prompt feedback, and being able to apply it to our daily
Starting point is 00:54:26 life is going to be really critical. So Cord is doing what you asked me that I should be doing. And I'm hoping Cord is going to help us productize this so that we can all, including me, be doing this every day, right? It's just, we would be so much more productive if we just had that augmentation. I love it. I love it. Well, H.O., this has been an absolutely fascinating conversation. Thank you so much for giving us some time, and we would love to have you back because we only scratched the surface. Well, thank you all so much. I love what you do. Wow, Costas, what an episode with H.O. Maycott. I mean, his story was amazing, but feature-based seems like quite a technology.
Starting point is 00:55:07 I think my biggest takeaway was actually his optimism. And I thought it was interesting. He had a lot of technology and how that influenced his view of what was possible. And he's really carried that through. And you heard multiple times throughout the episode, I was so insistent that we wouldn't cash anything. He just has this persistence about we shouldn't have to face these limitations.
Starting point is 00:55:46 I think feature-based is a really interesting manifestation of those characteristics of him because he really has overcome some amazing things with a pretty wild piece of technology. Yeah, and like, okay, I have to say something here, which I found like amazingly fascinating. It has to do with a person like the human being like hl it might sound that you have like a very stubborn person right that and that's
Starting point is 00:56:16 needed to go and like me like get something that hasn't been created before and get it to the point where it is adopted and like people use it but at the same time like it's i don't know probably like the only person that has demonstrated like an extreme level of flexibility and what i mean by that is like the story of how they started the company they went to the enterprise they had product market fit and then they decided that like we want to build something even bigger and that required to pretty much go back to zero and start again. That's wow. From a founder perspective, being able to do that and take this amount of risk requires, of course, to be very stubborn with your vision obviously also like a lot of
Starting point is 00:57:06 flexibility at the same time and i think like this whole episode in this whole conversation is like how it's like a testament of like how important like the vision and the belief of the humans behind the technologies for the success of the technology. Of course, we talked a lot about technical things, but this, I think, comes next. It's more important to understand these qualities and how important
Starting point is 00:57:36 they are, and then documentation is out there. We can just go and read it. That's what I'm keeping from these episodes and all of the reasons that I would encourage everyone to We can just go and read it. So yeah, that's what I'm keeping from this episode and some of the reasons that I would encourage everyone to go and listen again. Absolutely.
Starting point is 00:57:52 Yeah. We also talked about the future. We did. Which was pretty wild. And he has some pretty exciting predictions about a super evolution that's coming upon us quickly. Yeah, we also talked about biology and he's an ambassador of the future, right?
Starting point is 00:58:10 He's an ambassador of the future. So yes, definitely check it out if you're interested at all in the next super evolution, bitmap features and super fast database technology and just generally a really optimistic and engaging, brilliant person. We will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show.
Starting point is 00:58:31 Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. Thank you.
